sign returns a vector with the signs of the corresponding elements of x (the sign of a real number is 1, 0, or -1 if the number is positive, zero, or negative, respectively).
sign {base}
> sign(20)
[1] 1
> sign(-10)
[1] -1
> sign(0)
[1] 0
Note that sign does not operate on complex vectors.
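Since sign is vectorized, it can be applied to a whole vector at once, which is what the description above means by "the signs of the corresponding elements of x":

```r
x <- c(-10, 0, 20, -0.5)
sign(x)   # → -1  0  1 -1
```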
Next, create a new data set "sample" from the old data set "r", keeping only the polling columns:
sample = r[c("Rasmussen", "SurveyUSA", "PropR", "DiffCount")]
summary(r)
State Year Rasmussen SurveyUSA DiffCount
Arizona : 3 Min. :2004 Min. :-41.0000 Min. :-33.0000 Min. :-19.000
Arkansas : 3 1st Qu.:2004 1st Qu.: -8.0000 1st Qu.:-11.7500 1st Qu.: -6.000
California : 3 Median :2008 Median : 1.0000 Median : -2.0000 Median : 1.000
Colorado : 3 Mean :2008 Mean : 0.0404 Mean : -0.8243 Mean : -1.269
Connecticut: 3 3rd Qu.:2012 3rd Qu.: 8.5000 3rd Qu.: 8.0000 3rd Qu.: 4.000
Florida : 3 Max. :2012 Max. : 39.0000 Max. : 30.0000 Max. : 11.000
(Other) :127 NA's :46 NA's :71
PropR Republican
Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000
Median :0.6250 Median :1.0000
Mean :0.5259 Mean :0.5103
3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000
> install.packages("mice")
> library(mice)
> sample = r[c("Rasmussen", "SurveyUSA", "PropR", "DiffCount")]
> summary(sample)
Rasmussen SurveyUSA PropR DiffCount
Min. :-41.0000 Min. :-33.0000 Min. :0.0000 Min. :-19.000
1st Qu.: -8.0000 1st Qu.:-11.7500 1st Qu.:0.0000 1st Qu.: -6.000
Median : 1.0000 Median : -2.0000 Median :0.6250 Median : 1.000
Mean : 0.0404 Mean : -0.8243 Mean :0.5259 Mean : -1.269
3rd Qu.: 8.5000 3rd Qu.: 8.0000 3rd Qu.:1.0000 3rd Qu.: 4.000
Max. : 39.0000 Max. : 30.0000 Max. :1.0000 Max. : 11.000
NA's :46 NA's :71
Multiple imputation is a statistical technique for analyzing incomplete data sets, that is, data sets for which some entries are missing. Application of the technique requires three steps: imputation, analysis and pooling. The figure illustrates these steps.
Imputation: Impute (= fill in) the missing entries of the incomplete data set, not once but m times (m = 3 in the figure). Imputed values are drawn from a distribution (which can be different for each missing entry). This step results in m complete data sets.
Analysis: Analyze each of the m completed data sets. This step results in m analyses.
Pooling: Integrate the m analysis results into a final result. Simple rules exist for combining the m analyses.
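The three steps map directly onto the mice workflow. A minimal sketch, assuming the `sample` data frame built earlier; the model formula here is purely illustrative:

```r
library(mice)

# Imputation: fill in the missing entries m = 5 times (predictive mean matching)
imp <- mice(sample, m = 5, method = "pmm", seed = 144, printFlag = FALSE)

# Analysis: fit the same model on each of the m completed data sets
fit <- with(imp, lm(PropR ~ Rasmussen + SurveyUSA))

# Pooling: combine the m sets of estimates using Rubin's rules
summary(pool(fit))
```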
car::ncvTest(lmMod) # Breusch-Pagan test
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 4.650233 Df = 1 p = 0.03104933
The p-value is less than the significance level of 0.05, so we reject the null hypothesis that the variance of the residuals is constant and infer that heteroscedasticity is indeed present, confirming our graphical inference.
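For intuition, the Cook-Weisberg score statistic that ncvTest reports can be reproduced by hand: regress the scaled squared residuals on the fitted values and take half the regression sum of squares. A minimal sketch on synthetic heteroscedastic data (all names here are illustrative):

```r
set.seed(1)
x <- runif(100)
y <- 1 + 2 * x + rnorm(100, sd = 0.5 + x)   # error variance grows with x
m <- lm(y ~ x)

u2  <- residuals(m)^2
aux <- lm(I(u2 / mean(u2)) ~ fitted(m))     # auxiliary regression
chisq <- 0.5 * sum((fitted(aux) - 1)^2)     # score statistic, df = 1
pval  <- pchisq(chisq, df = 1, lower.tail = FALSE)
```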
Treatment for heteroscedasticity
Box-Cox transformation
The Box-Cox transformation is a mathematical transformation of a variable that makes its distribution approximately normal. Often, applying a Box-Cox transformation to the Y variable resolves the issue, which is exactly what I am going to do now.
> library("caret", lib.loc="~/R/win-library/3.2")
> distBCMod=BoxCoxTrans(r$Crime)
> distBCMod
Box-Cox Transformation
47 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
342.0 658.5 831.0 905.1 1058.0 1993.0
Largest/Smallest: 5.83
Sample Skewness: 1.05
Estimated Lambda: -0.1
With fudge factor, Lambda = 0 will be used for transformations
> r <- cbind(r, Crime_new=predict(distBCMod, r$Crime)) # append the transformed variable to r
> head(r) # view the top 6 rows
Crime Crime_new
1 791 6.673298
2 1635 7.399398
3 578 6.359574
4 1969 7.585281
5 1234 7.118016
6 682 6.525030
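Since the estimated lambda was rounded to 0, the transformation that predict applies is simply the natural logarithm, which we can verify against the first few rows of the output above:

```r
# lambda = 0 in a Box-Cox transformation means y -> log(y)
log(c(791, 1635, 578))   # → 6.673298 7.399398 6.359574, matching Crime_new
```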
> lmMod_bc <- lm(Crime_new ~ Wealth+Ineq, data=r)
>
> ncvTest(lmMod_bc)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.003153686 Df = 1 p = 0.9552162
With a p-value of 0.9552162, we fail to reject the null hypothesis (that the variance of the residuals is constant) and therefore infer that the residuals are homoscedastic. Let's check this graphically as well.
> shapiro.test(mod3$residuals)
	Shapiro-Wilk normality test
data:  mod3$residuals
W = 0.95036, p-value = 0.04473
NORMAL Q-Q PLOTS
plot(mod3)
Press Enter four times to step through the four diagnostic plots.
The code above generates data from a normal distribution (the rnorm command), reshapes it into a series of columns, and runs what is called a normal quantile-quantile plot (Q-Q plot, for short) on the first column.

The Q-Q plot tells us how the observed quantiles of the data set (in this case, the first column of the variable x) compare with the quantiles expected theoretically from a normal distribution based on the sample's mean and standard deviation. We are able to do this because of the normal distribution's properties. The normal distribution is thicker around the mean and thinner as you move away from it: around 68% of the points you can expect to see in normally distributed data lie within 1 standard deviation of the mean, with similar figures for 2 and 3 standard deviations (95.4% and 99.7% respectively).

However, as you can see, testing a large set of data (such as the 100 columns of data we have here) quickly becomes tedious with a graphical approach. There is also the fact that a graphical approach may not be a rigorous enough evaluation for most statistical analysis situations, where you want to compare multiple sets of data easily. Unsurprisingly, we therefore use test statistics, and normality tests, to assess the data's normality.
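A minimal sketch of the workflow the paragraph describes: generate normally distributed data in columns, Q-Q plot the first column, then back the graphical check with a formal normality test (the variable names are illustrative):

```r
set.seed(42)
x <- matrix(rnorm(100 * 50), ncol = 100)   # 100 columns of normal data

qqnorm(x[, 1]); qqline(x[, 1])             # graphical check on column 1
shapiro.test(x[, 1])                       # formal normality test
```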
Residual standard error: 350.9 on 45 degrees of freedom
Multiple R-squared: 0.1948, Adjusted R-squared: 0.1769
F-statistic: 10.88 on 1 and 45 DF, p-value: 0.001902
> f = 3.299^2
> f
[1] 10.8834
With a single predictor, the F statistic is just the square of the coefficient's t statistic: 3.299^2 = 10.88.
	One Sample t-test
data:  r
t = 8.442, df = 751, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
306.7586 492.6578
sample estimates:
mean of x
399.7082
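The output above comes from a one-sample t-test. A reproducible sketch on synthetic data (these values are illustrative, not the original r):

```r
set.seed(7)
v <- rnorm(30, mean = 400, sd = 100)
t.test(v, mu = 0)   # tests H0: the true mean equals 0
```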
use
Specifies the handling of missing data. Options are all.obs (assumes no missing data; missing data will produce an error), complete.obs (listwise deletion), and pairwise.complete.obs (pairwise deletion).
method
Specifies the type of correlation. Options are pearson, spearman, or kendall.
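These are arguments to R's cor() function. A small sketch of pairwise deletion on made-up vectors:

```r
x <- c(1, 2, NA, 4, 5)
y <- c(2, 4, 6, NA, 10)

# Only pairs where both values are present are used: (1,2), (2,4), (5,10)
cor(x, y, use = "pairwise.complete.obs", method = "pearson")   # → 1
```

Here the surviving pairs lie exactly on the line y = 2x, so the correlation is exactly 1.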
# Scatter Plots
plot(USDA$Protein, USDA$TotalFat)
# Add xlabel, ylabel and title
plot(USDA$Protein, USDA$TotalFat, xlab="Protein", ylab = "Fat", main = "Protein vs Fat", col = "red")
# Creating a histogram
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C")
# Add limits to x-axis
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100))
# Specify breaks of histogram
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=100)
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=2000)
# Boxplots
boxplot(USDA$Sugar, ylab = "Sugar (g)", main = "Boxplot of Sugar")
# Mean iron level in foods with high and low protein
tapply(USDA$Iron, USDA$HighProtein, mean, na.rm=TRUE)
# Maximum level of Vitamin C in foods with high and low carbs?
tapply(USDA$VitaminC, USDA$HighCarbs, max, na.rm=TRUE)
# Using the summary function with tapply
tapply(USDA$VitaminC, USDA$HighCarbs, summary, na.rm=TRUE)

Data available on http://rcodee.blogspot.sg/