Tuesday 7 February 2017

R: multiple imputation in R

R: multiple imputation in R: USE MICE PACKAGE FOR   multiple imputation >install.packages("mice") > library(mice) AND create new data set using ol...

sign function in R

sign returns a vector with the signs of the corresponding elements of x (the sign of a real number is 1, 0, or  -1  if the number is positive, zero, or negative, respectively).

sign {base}

> sign(20)
[1] 1
> sign(-10)
[1] -1
> sign(0)

[1] 0

Note that sign does not operate on complex vectors.

R: multiple imputation in R

R: multiple imputation in R

Missing values in R (Missing values treatment )

you can also go this link

https://www.blogger.com/blogger.g?blogID=3973483800164253588#editor/target=post;postID=6457933726508849760;onPublishedMenu=allposts;onClosedMenu=allposts;postNum=1;src=postname

http://rcodee.blogspot.sg/2017/02/multiple-imputation-in-r.html

>install.packages("mice")


> library(mice)
new data set
> set.seed(144)

> imputed = complete(mice(sample))

multiple imputation in R

USE MICE PACKAGE FOR  multiple imputation

>install.packages("mice")

> library(mice)
AND create new data set using
old data set "r"
new data set "sample"
sample = r[c("Rasmussen", "SurveyUSA", "PropR", "DiffCount")]
 summary(r)
         State          Year        Rasmussen          SurveyUSA          DiffCount    
 Arizona    :  3   Min.   :2004   Min.   :-41.0000   Min.   :-33.0000   Min.   :-19.000
 Arkansas   :  3   1st Qu.:2004   1st Qu.: -8.0000   1st Qu.:-11.7500   1st Qu.: -6.000
 California :  3   Median :2008   Median :  1.0000   Median : -2.0000   Median :  1.000
 Colorado   :  3   Mean   :2008   Mean   :  0.0404   Mean   : -0.8243   Mean   : -1.269
 Connecticut:  3   3rd Qu.:2012   3rd Qu.:  8.5000   3rd Qu.:  8.0000   3rd Qu.:  4.000
 Florida    :  3   Max.   :2012   Max.   : 39.0000   Max.   : 30.0000   Max.   : 11.000
 (Other)    :127                  NA's   :46         NA's   :71                        
     PropR          Republican  
 Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.0000   1st Qu.:0.0000
 Median :0.6250   Median :1.0000
 Mean   :0.5259   Mean   :0.5103
 3rd Qu.:1.0000   3rd Qu.:1.0000
 Max.   :1.0000   Max.   :1.0000
                                 
> install.packages("mice")

> library(mice)
]
> sample = r[c("Rasmussen", "SurveyUSA", "PropR", "DiffCount")]
> summary(sample)
   Rasmussen          SurveyUSA            PropR          DiffCount    
 Min.   :-41.0000   Min.   :-33.0000   Min.   :0.0000   Min.   :-19.000
 1st Qu.: -8.0000   1st Qu.:-11.7500   1st Qu.:0.0000   1st Qu.: -6.000
 Median :  1.0000   Median : -2.0000   Median :0.6250   Median :  1.000
 Mean   :  0.0404   Mean   : -0.8243   Mean   :0.5259   Mean   : -1.269
 3rd Qu.:  8.5000   3rd Qu.:  8.0000   3rd Qu.:1.0000   3rd Qu.:  4.000
 Max.   : 39.0000   Max.   : 30.0000   Max.   :1.0000   Max.   : 11.000
 NA's   :46         NA's   :71      
                                   
> set.seed(144)
> imputed = complete(mice(sample))

 iter imp variable
  1   1  Rasmussen  SurveyUSA
  1   2  Rasmussen  SurveyUSA
  1   3  Rasmussen  SurveyUSA
  1   4  Rasmussen  SurveyUSA
  1   5  Rasmussen  SurveyUSA
  2   1  Rasmussen  SurveyUSA
  2   2  Rasmussen  SurveyUSA
  2   3  Rasmussen  SurveyUSA
  2   4  Rasmussen  SurveyUSA
  2   5  Rasmussen  SurveyUSA
  3   1  Rasmussen  SurveyUSA
  3   2  Rasmussen  SurveyUSA
  3   3  Rasmussen  SurveyUSA
  3   4  Rasmussen  SurveyUSA
  3   5  Rasmussen  SurveyUSA
  4   1  Rasmussen  SurveyUSA
  4   2  Rasmussen  SurveyUSA
  4   3  Rasmussen  SurveyUSA
  4   4  Rasmussen  SurveyUSA
  4   5  Rasmussen  SurveyUSA
  5   1  Rasmussen  SurveyUSA
  5   2  Rasmussen  SurveyUSA
  5   3  Rasmussen  SurveyUSA
  5   4  Rasmussen  SurveyUSA
  5   5  Rasmussen  SurveyUSA
> summary(imputed)
   Rasmussen         SurveyUSA          
 Min.   :-41.000   Min.   :-33.000  
 1st Qu.: -8.000   1st Qu.:-11.000  
 Median :  3.000   Median :  1.000  
 Mean   :  1.731   Mean   :  1.517  
 3rd Qu.: 11.000   3rd Qu.: 18.000
 Max.   : 39.000   Max.   : 30.000  
> no NA values 
now change both  the new variable 

given below

r$Rasmussen = imputed$Rasmussen

> r$SurveyUSA = imputed$SurveyUSA
> summary(r)
         State          Year        Rasmussen         SurveyUSA         DiffCount    
 Arizona    :  3   Min.   :2004   Min.   :-41.000   Min.   :-33.000   Min.   :-19.000
 Arkansas   :  3   1st Qu.:2004   1st Qu.: -8.000   1st Qu.:-11.000   1st Qu.: -6.000
 California :  3   Median :2008   Median :  3.000   Median :  1.000   Median :  1.000
 Colorado   :  3   Mean   :2008   Mean   :  1.731   Mean   :  1.517   Mean   : -1.269
 Connecticut:  3   3rd Qu.:2012   3rd Qu.: 11.000   3rd Qu.: 18.000   3rd Qu.:  4.000
 Florida    :  3   Max.   :2012   Max.   : 39.000   Max.   : 30.000   Max.   : 11.000
 (Other)    :127                                                                      
     PropR          Republican  
 Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.0000   1st Qu.:0.0000
 Median :0.6250   Median :1.0000
 Mean   :0.5259   Mean   :0.5103
 3rd Qu.:1.0000   3rd Qu.:1.0000
 Max.   :1.0000   Max.   :1.0000  

multiple imputation

multiple imputation?

 it is a statistical technique for analyzing incomplete data sets, that is, data sets for which some entries are missing. Application of the technique requires three steps: imputation, analysis and pooling. The figure illustrates these steps.
  1. Imputation: Impute (=fill in) the missing entries of the incomplete data sets, not once, but m times (m=3 in the figure). Imputed values are drawn for a distribution (that can be different for each missing entry). This step results is m complete data sets. 
  2. Analysis: Analyze each of the m completed data sets. This step results in m analyses.
  3. Pooling: Integrate the m analysis results into a final result. Simple rules exist for combining the m analyses.

Monday 6 February 2017

partial correlation coefficient in R

correlation coefficient to calculate the partial correlation
partial.correlation = cor(residuals.air.acid, residuals.water.acid, method = 'spearman')

Sunday 5 February 2017

you can buy your effective stuff at


you can buy your effective stuff at 


https://www.instamojo.com/shikhasha


https://www.instamojo.com/shikhasha

Solved case study for R


you can buy your effective stuff at 


https://www.instamojo.com/shikhasha/?ref=offer_header

how to remove heteroscedasticity in r

how to remove heteroscedasticity in r

NCV Test
car::ncvTest(lmMod)  # Breusch-Pagan test
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 4.650233    Df = 1     p = 0.03104933 
p-value less that a significance level of 0.05, therefore we can reject the null hypothesis that the variance of the residuals is constant and infer that heteroscedasticity is indeed present, thereby confirming our graphical inference.

treatment for multicollinearity

Box-Cox transformation

Box-cox transformation is a mathematical transformation of the variable to make it approximate to a normal distribution. Often, doing a box-cox transformation of the Y variable solves the issue, which is exactly what I am going to do now.
 library("caret", lib.loc="~/R/win-library/3.2")

> distBCMod=BoxCoxTrans(r$Crime)
> distBCMod
Box-Cox Transformation

47 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  342.0   658.5   831.0   905.1  1058.0  1993.0 

Largest/Smallest: 5.83 
Sample Skewness: 1.05 

Estimated Lambda: -0.1 
With fudge factor, Lambda = 0 will be used for transformations

> r <- cbind(r, Crime_new=predict(distBCMod, r$Crime)) # append the transformed variable to r
> head(r) # view the top 6 rows
 Crime Crime_new
1   791  6.673298
2  1635  7.399398
3   578  6.359574
4  1969  7.585281
5  1234  7.118016
6   682  6.525030
> lmMod_bc <- lm(Crime_new ~ Wealth+Ineq, data=r)
> ncvTest(lmMod_bc)
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 0.003153686    Df = 1     p = 0.9552162 
> ncvTest(mod3)

With a p-value of  0.9552162, we fail to reject the null hypothesis (that variance of residuals is constant) and therefore infer that ther residuals are homoscedastic. Lets check this graphically as well.
plot(lmMod_bc)
Here it is the plot:
Transformed-model-no-heteroscedasticity

normality test OF RESIDUAL in R

normality test OF RESIDUAL  in R

 in the nortest package 

shapiro.test(mod3$residuals)

Shapiro-Wilk normality test

data:  mod3$residuals
W = 0.95036, p-value = 0.04473

NORMAL Q-Q PLOTS

plot(mod3)
press enter 4 times to get 4 different graph

The code above generates data from a normal distribution (command “rnorm”), reshapes it into a series of columns, and runs what is called a normal quantile-quantile plot (QQ Plot, for short) on the first column.
Q-Q Plot (Normal)
Q-Q Plot (Normal)
The Q-Q plot tells us what proportion of the data set (in this case, the first column of variable x), compares with the expected proportion (theoretically) of the normal distribution model based on the sample’s mean and standard deviation. We’re able to do this, because of the normal distribution’s properties. The normal distribution is thicker around the mean, and thinner as you move away from it – specifically, around 68% of the points you can expect to see in normally distributed data will only be 1 standard deviation away from the mean. There are similar metrics for normally distributed data, for 2 and 3 standard deviations (95.4% and 99.7% respectively).
However, as you see, testing a large set of data (such as the 100 columns of data we have here) can quickly become tedious, if we’re using a graphical approach. Then there’s the fact that the graphical approach may not be a rigorous enough evaluation for most statistical analysis situations, where you want to compare multiple sets of data easily. Unsurprisingly, we therefore use test statistics, and normality tests, to assess the data’s normality.

how to compare two model in r using ANOVA

how to compare two model in r using ANOVA

 mod2=lm(r$Crime~r$Ineq)
> mod3=lm(r$Crime~r$Wealth+r$Ineq)
> anova(mod2,mod3)
Analysis of Variance Table

Model 1: r$Crime ~ r$Ineq
Model 2: r$Crime ~ r$Wealth + r$Ineq
  Res.Df     RSS Df Sum of Sq      F   Pr(>F)  
1     45 6660397                                
2     44 4137694  1   2522703 26.826 5.32e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

TO SHOW F= T^2

 TO SHOW F= T^2
mod1=lm(y~x,data=   )
anova=aov(mod1)
> summary(anova)
            Df  Sum Sq Mean Sq F value Pr(>F)  
r$Wealth     1 1340152 1340152   10.88 0.0019 **
Residuals   45 5540775  123128                
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(mod1)

Call:
lm(formula = r$Crime ~ r$Wealth)

Residuals:
    Min      1Q  Median      3Q     Max
-631.40 -272.84  -46.17  197.25  825.02

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) -24.28261  286.31386  -0.085   0.9328  
r$Wealth      0.17689    0.05362   3.299   0.0019 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 350.9 on 45 degrees of freedom
Multiple R-squared:  0.1948, Adjusted R-squared:  0.1769
F-statistic: 10.88 on 1 and 45 DF,  p-value: 0.001902

> f=3.299^2
> f
[1] 10.8834

t test

 t.test(r,mu=0,alt="two.sided",conf=0.95)

One Sample t-test

data:  r
t = 8.442, df = 751, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 306.7586 492.6578
sample estimates:
mean of x
 399.7082 

anova

 anova=aov(mod1)
> summary(anova)
            Df  Sum Sq Mean Sq F value Pr(>F)  
r$Wealth     1 1340152 1340152   10.88 0.0019 **
Residuals   45 5540775  123128                
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlations pearson, spearman or kendall.

 cor(r, use="complete.obs", method="pearson" )

 cor(x, use=, method= ) where
OptionDescription
xMatrix or data frame
useSpecifies the handling of missing data. Options are all.obs (assumes no missing data - missing data will produce an error), complete.obs (listwise deletion), and pairwise.complete.obs (pairwise deletion)
methodSpecifies the type of correlation. Options are pearsonspearman or kendall.






                 M          So          Ed         Po1         Po2         LF         M.F
M       1.00000000  0.58435534 -0.53023964 -0.50573690 -0.51317336 -0.1609488 -0.02867993
So      0.58435534  1.00000000 -0.70274132 -0.37263633 -0.37616753 -0.5054695 -0.31473291
Ed     -0.53023964 -0.70274132  1.00000000  0.48295213  0.49940958  0.5611780  0.43691492
Po1    -0.50573690 -0.37263633  0.48295213  1.00000000  0.99358648  0.1214932  0.03376027
Po2    -0.51317336 -0.37616753  0.49940958  0.99358648  1.00000000  0.1063496  0.02284250
LF     -0.16094882 -0.50546948  0.56117795  0.12149320  0.10634960  1.0000000  0.51355879
M.F    -0.02867993 -0.31473291  0.43691492  0.03376027  0.02284250  0.5135588  1.00000000
Pop    -0.28063762 -0.04991832 -0.01722740  0.52628358  0.51378940 -0.1236722 -0.41062750
NW      0.59319826  0.76710262 -0.66488190 -0.21370878 -0.21876821 -0.3412144 -0.32730454
U1     -0.22438060 -0.17241931  0.01810345 -0.04369761 -0.05171199 -0.2293997  0.35189190
U2     -0.24484339  0.07169289 -0.21568155  0.18509304  0.16922422 -0.4207625 -0.01869169
Wealth -0.67005506 -0.63694543  0.73599704  0.78722528  0.79426205  0.2946323  0.17960864
Ineq    0.63921138  0.73718106 -0.76865789 -0.63050025 -0.64815183 -0.2698865 -0.16708869
Prob    0.36111641  0.53086199 -0.38992286 -0.47324704 -0.47302729 -0.2500861 -0.05085826

Saturday 4 February 2017

Logistic Regression Model

# Logistic Regression Model
mod1 = glm(Republican~PropR, data=Train, family="binomial")
summary(mod1)

Scatter Plots,Boxplots,Histogram in R


# Scatter Plots
  plot(USDA$Protein, USDA$TotalFat)
# Add xlabel, ylabel and title
  plot(USDA$Protein, USDA$TotalFat, xlab="Protein", ylab = "Fat", main = "Protein vs Fat", col = "red")
# Creating a histogram
  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C")
# Add limits to x-axis
  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100))
# Specify breaks of histogram
  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=100)
  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=2000)
# Boxplots
  boxplot(USDA$Sugar, ylab = "Sugar (g)", main = "Boxplot of Sugar")

Make test set predictions in R

# Make test set predictions
predictTest = predict(model4, newdata=wineTest)
predictTest

HOW to find R-squared in R

# Compute R-squared
SSE = sum((wineTest$Price - predictTest)^2)
SST = sum((wineTest$Price - mean(wine$Price))^2)
1 - SSE/SST

Linear Regression in R


# Linear Regression (all variables)
model3 = lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop, data=wine)
summary(model3)

# Sum of Squared Errors
SSE = sum(model3$residuals^2)
SSE



# Remove FrancePop
model4 = lm(Price ~ AGST + HarvestRain + WinterRain + Age, data=wine)
summary(model4)

R: covariance matrix plot

R: covariance matrix plot


# Plot #1: Basic scatterplot matrix of the four measurements
pairs(~x1+x2+x3+x4, data=data name)

covariance matrix plot

<
# Load the  dataset.
data(data name)
 
# Plot #1: Basic scatterplot matrix of the four measurements
pairs(~x1+x2+x3+x4, data=data name)


or

in corrplot package, named "circle""square""ellipse""number""shade""color""pie".
library(corrplot)
M <- cor(mtcars)
corrplot(M, method="circle")

corrplot(M, method="number")
data(iris)
cor(iris[,1:4])
pairs(iris[,1:4])

Friday 3 February 2017

data for R

write us for excel file and data for R to Practice - shakti.kumar@ifmr.ac.in

Thursday 2 February 2017

code-of-r-_casestudy

MULTICOLLINEARTY

MULTICOLLINEARTY
library(car)
> fit <- lm(Fertility ~ . , data = swiss)

> vif(fit)

Wednesday 1 February 2017

find all your r coding

https://www.instamojo.com/shikhasha/complete-code-of-r-_casestudy/?ref=store

intepretation of tapply and tapply function

intepret tapply 


tapply(USDA$Iron, USDA$HighProtein, mean, na.rm=TRUE)# Maximum level of Vitamin C in hfoods with high and low carbs?  tapply(USDA$VitaminC, USDA$HighCarbs, max, na.rm=TRUE)# Using summary function with tapply  tapply(USDA$VitaminC, USDA$HighCarbs, summary, na.rm=TRUE)  data------- available on   ---- http://rcodee.blogspot.sg/