R
Tuesday 7 February 2017
sign function in R
sign {base}
sign returns a vector with the signs of the corresponding elements of x (the sign of a real number is 1, 0, or -1 if the number is positive, zero, or negative, respectively).
> sign(20)
[1] 1
> sign(-10)
[1] -1
> sign(0)
[1] 0
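The calls above pass single numbers. As a small sketch of the "vector of signs" behaviour described in the definition, the example below applies sign to a whole vector; the vector x is made up for illustration, and a missing value simply stays missing.

```r
# Hypothetical input vector with a negative, a zero, a positive and a missing value
x <- c(-2.5, 0, 3, NA)

# sign() works element by element and returns a vector of the same length
signs <- sign(x)
print(signs)  # elements are -1, 0, 1 and NA
```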
sign does not operate on complex vectors.
Missing values in R (missing values treatment)
You can also follow this link:
http://rcodee.blogspot.sg/2017/02/multiple-imputation-in-r.html
> install.packages("mice")
> library(mice)
Create the new, imputed data set:
> set.seed(144)
> imputed = complete(mice(sample))
multiple imputation in R
Use the mice package for multiple imputation:
> install.packages("mice")
> library(mice)
Then create a new data set "sample" from the old data set "r", keeping four of its variables:
> sample = r[c("Rasmussen", "SurveyUSA", "PropR", "DiffCount")]
> summary(r)
         State          Year        Rasmussen          SurveyUSA          DiffCount
 Arizona    :  3   Min.   :2004   Min.   :-41.0000   Min.   :-33.0000   Min.   :-19.000
 Arkansas   :  3   1st Qu.:2004   1st Qu.: -8.0000   1st Qu.:-11.7500   1st Qu.: -6.000
 California :  3   Median :2008   Median :  1.0000   Median : -2.0000   Median :  1.000
 Colorado   :  3   Mean   :2008   Mean   :  0.0404   Mean   : -0.8243   Mean   : -1.269
 Connecticut:  3   3rd Qu.:2012   3rd Qu.:  8.5000   3rd Qu.:  8.0000   3rd Qu.:  4.000
 Florida    :  3   Max.   :2012   Max.   : 39.0000   Max.   : 30.0000   Max.   : 11.000
 (Other)    :127                  NA's   :46         NA's   :71
     PropR          Republican
 Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.0000   1st Qu.:0.0000
 Median :0.6250   Median :1.0000
 Mean   :0.5259   Mean   :0.5103
 3rd Qu.:1.0000   3rd Qu.:1.0000
 Max.   :1.0000   Max.   :1.0000
> sample = r[c("Rasmussen", "SurveyUSA", "PropR", "DiffCount")]
> summary(sample)
   Rasmussen          SurveyUSA            PropR          DiffCount
 Min.   :-41.0000   Min.   :-33.0000   Min.   :0.0000   Min.   :-19.000
 1st Qu.: -8.0000   1st Qu.:-11.7500   1st Qu.:0.0000   1st Qu.: -6.000
 Median :  1.0000   Median : -2.0000   Median :0.6250   Median :  1.000
 Mean   :  0.0404   Mean   : -0.8243   Mean   :0.5259   Mean   : -1.269
 3rd Qu.:  8.5000   3rd Qu.:  8.0000   3rd Qu.:1.0000   3rd Qu.:  4.000
 Max.   : 39.0000   Max.   : 30.0000   Max.   :1.0000   Max.   : 11.000
 NA's   :46         NA's   :71
> set.seed(144)
> imputed = complete(mice(sample))
iter imp variable
1 1 Rasmussen SurveyUSA
1 2 Rasmussen SurveyUSA
1 3 Rasmussen SurveyUSA
1 4 Rasmussen SurveyUSA
1 5 Rasmussen SurveyUSA
2 1 Rasmussen SurveyUSA
2 2 Rasmussen SurveyUSA
2 3 Rasmussen SurveyUSA
2 4 Rasmussen SurveyUSA
2 5 Rasmussen SurveyUSA
3 1 Rasmussen SurveyUSA
3 2 Rasmussen SurveyUSA
3 3 Rasmussen SurveyUSA
3 4 Rasmussen SurveyUSA
3 5 Rasmussen SurveyUSA
4 1 Rasmussen SurveyUSA
4 2 Rasmussen SurveyUSA
4 3 Rasmussen SurveyUSA
4 4 Rasmussen SurveyUSA
4 5 Rasmussen SurveyUSA
5 1 Rasmussen SurveyUSA
5 2 Rasmussen SurveyUSA
5 3 Rasmussen SurveyUSA
5 4 Rasmussen SurveyUSA
5 5 Rasmussen SurveyUSA
> summary(imputed)
   Rasmussen         SurveyUSA
 Min.   :-41.000   Min.   :-33.000
 1st Qu.: -8.000   1st Qu.:-11.000
 Median :  3.000   Median :  1.000
 Mean   :  1.731   Mean   :  1.517
 3rd Qu.: 11.000   3rd Qu.: 18.000
 Max.   : 39.000   Max.   : 30.000
There are no NA values any more.
Now copy both imputed variables back into the original data set, as shown below:
> r$Rasmussen = imputed$Rasmussen
> r$SurveyUSA = imputed$SurveyUSA
> summary(r)
         State          Year        Rasmussen         SurveyUSA         DiffCount
 Arizona    :  3   Min.   :2004   Min.   :-41.000   Min.   :-33.000   Min.   :-19.000
 Arkansas   :  3   1st Qu.:2004   1st Qu.: -8.000   1st Qu.:-11.000   1st Qu.: -6.000
 California :  3   Median :2008   Median :  3.000   Median :  1.000   Median :  1.000
 Colorado   :  3   Mean   :2008   Mean   :  1.731   Mean   :  1.517   Mean   : -1.269
 Connecticut:  3   3rd Qu.:2012   3rd Qu.: 11.000   3rd Qu.: 18.000   3rd Qu.:  4.000
 Florida    :  3   Max.   :2012   Max.   : 39.000   Max.   : 30.000   Max.   : 11.000
 (Other)    :127
     PropR          Republican
 Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.0000   1st Qu.:0.0000
 Median :0.6250   Median :1.0000
 Mean   :0.5259   Mean   :0.5103
 3rd Qu.:1.0000   3rd Qu.:1.0000
 Max.   :1.0000   Max.   :1.0000
What is multiple imputation?
It is a statistical technique for analyzing incomplete data sets, that is, data sets in which some entries are missing. Applying the technique requires three steps: imputation, analysis and pooling. The figure illustrates these steps.
- Imputation: Impute (= fill in) the missing entries of the incomplete data set, not once, but m times (m=3 in the figure). Imputed values are drawn from a distribution (which can be different for each missing entry). This step results in m complete data sets.
- Analysis: Analyze each of the m completed data sets. This step results in m analyses.
- Pooling: Integrate the m analysis results into a final result. Simple rules exist for combining the m analyses.
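The three steps can be sketched with the mice package itself. This is only an illustration under stated assumptions: it uses nhanes, a small example data set with missing values that ships with mice, instead of the poll data, and the model bmi ~ age + chl is an arbitrary choice. Note that complete(mice(sample)) as used in the post returns only the first of the imputed data sets, so it skips the pooling step.

```r
library(mice)  # assumes the mice package is installed

# Step 1, imputation: create m = 5 completed versions of the nhanes data.
# nhanes is an example data set from mice, used here as a stand-in.
imp <- mice(nhanes, m = 5, seed = 144, printFlag = FALSE)

# Step 2, analysis: fit the same linear model on each completed data set.
fit <- with(imp, lm(bmi ~ age + chl))

# Step 3, pooling: combine the 5 sets of estimates into one result.
pooled <- pool(fit)
summary(pooled)
```

Pooling keeps the extra uncertainty that comes from not knowing the missing values, which a single filled-in data set cannot show.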
Monday 6 February 2017
partial correlation coefficient in R
To calculate the partial correlation, take the correlation coefficient of the residuals left after regressing each variable on the control variable:
> partial.correlation = cor(residuals.air.acid, residuals.water.acid, method = 'spearman')
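The line above assumes that the two residual vectors already exist. A minimal sketch of the full calculation follows, using the built-in mtcars data as a stand-in (the variables mpg, hp and wt are my choice, not the post's). It uses the Pearson method so that the result can be checked against the closed-form formula; passing method = "spearman" to cor gives a rank-based variant like the one in the post.

```r
# Partial correlation of mpg and hp, controlling for wt (built-in mtcars data)
res_mpg <- residuals(lm(mpg ~ wt, data = mtcars))  # part of mpg not explained by wt
res_hp  <- residuals(lm(hp ~ wt, data = mtcars))   # part of hp not explained by wt
partial_cor <- cor(res_mpg, res_hp, method = "pearson")

# Cross-check with the closed-form formula for a first-order partial correlation
r_xy <- cor(mtcars$mpg, mtcars$hp)
r_xz <- cor(mtcars$mpg, mtcars$wt)
r_yz <- cor(mtcars$hp, mtcars$wt)
formula_value <- (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))

print(c(residual_method = partial_cor, formula = formula_value))
```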
Sunday 5 February 2017
how to remove heteroscedasticity in r
NCV Test
> car::ncvTest(lmMod) # Breusch-Pagan test
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 4.650233 Df = 1 p = 0.03104933
The p-value is less than the significance level of 0.05, so we reject the null hypothesis that the variance of the residuals is constant and infer that heteroscedasticity is indeed present, confirming our graphical inference.
Treatment for heteroscedasticity
Box-Cox transformation
A Box-Cox transformation is a mathematical transformation of a variable that brings its distribution closer to normal. Transforming the Y variable this way often resolves the issue, which is exactly what I am going to do now.
> library(caret)
> distBCMod=BoxCoxTrans(r$Crime)
> distBCMod
Box-Cox Transformation
47 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
342.0 658.5 831.0 905.1 1058.0 1993.0
Largest/Smallest: 5.83
Sample Skewness: 1.05
Estimated Lambda: -0.1
With fudge factor, Lambda = 0 will be used for transformations
> r <- cbind(r, Crime_new=predict(distBCMod, r$Crime)) # append the transformed variable to r
> head(r) # view the top 6 rows
Crime Crime_new
1 791 6.673298
2 1635 7.399398
3 578 6.359574
4 1969 7.585281
5 1234 7.118016
6 682 6.525030
> lmMod_bc <- lm(Crime_new ~ Wealth+Ineq, data=r)
>
> ncvTest(lmMod_bc)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.003153686 Df = 1 p = 0.9552162
With a p-value of 0.9552162, we fail to reject the null hypothesis (that the variance of the residuals is constant) and therefore infer that the residuals are homoscedastic. Let's check this graphically as well.
plot(lmMod_bc)
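The BoxCoxTrans output above reports that Lambda = 0 is used, and for lambda = 0 the Box-Cox transformation is simply the natural logarithm. The sketch below writes the transformation out by hand; box_cox is a hypothetical helper name, not a caret function, and the three input values are the first Crime values shown in head(r).

```r
# Hand-written Box-Cox transformation (hypothetical helper, not part of caret):
# (y^lambda - 1) / lambda for lambda != 0, and log(y) for lambda = 0
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

crime <- c(791, 1635, 578)          # first three Crime values from head(r)
crime_new <- box_cox(crime, lambda = 0)
print(round(crime_new, 6))          # compare with the Crime_new column above
```

With lambda = 0 this gives the same values as the Crime_new column, so in this case log(r$Crime) would be an equivalent shortcut.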
normality test OF RESIDUAL in R
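shapiro.test is part of the stats package that ships with R (further normality tests are in the nortest package):
> shapiro.test(mod3$residuals)
Shapiro-Wilk normality test
data: mod3$residuals
W = 0.95036, p-value = 0.04473
The p-value is just below 0.05, so at the 5% level we reject the null hypothesis that the residuals are normally distributed.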
NORMAL Q-Q PLOTS
> plot(mod3)
Press Enter four times to step through the four diagnostic plots; the second one is the normal Q-Q plot of the residuals.
A normal quantile-quantile plot (Q-Q plot, for short) can be tried out on simulated data: generate values from a normal distribution (command “rnorm”), reshape them into a series of columns, and draw the plot for the first column. The Q-Q plot compares the quantiles of the data (in this case, the first column of the variable x) with the quantiles expected in theory under a normal distribution model based on the sample’s mean and standard deviation. We are able to do this because of the normal distribution’s properties. The normal distribution is thicker around the mean and thinner as you move away from it; specifically, around 68% of the points in normally distributed data lie within 1 standard deviation of the mean. Similar figures hold for 2 and 3 standard deviations (95.4% and 99.7% respectively).
However, checking a large set of data (such as 100 columns) graphically quickly becomes tedious. The graphical approach may also not be rigorous enough for most statistical analyses, where you want to compare multiple sets of data easily. Unsurprisingly, we therefore use test statistics and normality tests to assess the data’s normality.
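The simulation described above is not included in the post, so here is a minimal sketch of what it might look like. The sizes (50 rows, 100 columns) and the seed are assumptions of mine. It draws a Q-Q plot for the first column and then runs the Shapiro-Wilk test on every column instead of inspecting 100 plots by eye.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# 100 columns of 50 standard normal values each (sizes are assumptions)
x <- matrix(rnorm(50 * 100), ncol = 100)

# Graphical check of the first column
qqnorm(x[, 1])
qqline(x[, 1])

# Test-based check of all columns at once
p_values <- apply(x, 2, function(column) shapiro.test(column)$p.value)
n_rejected <- sum(p_values < 0.05)
print(n_rejected)  # number of columns flagged as non-normal at the 5% level
```

Because the data really are normal here, any rejections are false alarms; with a 5% level one would expect only a handful among 100 columns.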
how to compare two model in r using ANOVA
mod2=lm(r$Crime~r$Ineq)
> mod3=lm(r$Crime~r$Wealth+r$Ineq)
> anova(mod2,mod3)
Analysis of Variance Table
Model 1: r$Crime ~ r$Ineq
Model 2: r$Crime ~ r$Wealth + r$Ineq
Res.Df RSS Df Sum of Sq F Pr(>F)
1 45 6660397
2 44 4137694 1 2522703 26.826 5.32e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
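The table above compares a smaller model with a larger one that contains it; the small p-value (5.32e-06) says that adding r$Wealth reduces the residual sum of squares by more than chance would explain. The same comparison can be sketched on the built-in mtcars data, which is a stand-in chosen for illustration. The sketch passes data = to lm, which is the more usual way to write the formulas.

```r
# Nested models on built-in data: the larger model adds hp to the smaller one
small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + hp, data = mtcars)

# F test of the null hypothesis that the extra term does not improve the fit
comparison <- anova(small, large)
print(comparison)

# Pull the p-value out of the table for use in code
p_value <- comparison[2, "Pr(>F)"]
if (p_value < 0.05) {
  message("The larger model fits significantly better at the 5% level.")
} else {
  message("No significant improvement from the extra term at the 5% level.")
}
```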
TO SHOW F= T^2
> mod1=lm(r$Crime~r$Wealth)
> anova=aov(mod1)
> summary(anova)
Df Sum Sq Mean Sq F value Pr(>F)
r$Wealth 1 1340152 1340152 10.88 0.0019 **
Residuals 45 5540775 123128
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(mod1)
Call:
lm(formula = r$Crime ~ r$Wealth)
Residuals:
Min 1Q Median 3Q Max
-631.40 -272.84 -46.17 197.25 825.02
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -24.28261 286.31386 -0.085 0.9328
r$Wealth 0.17689 0.05362 3.299 0.0019 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 350.9 on 45 degrees of freedom
Multiple R-squared: 0.1948, Adjusted R-squared: 0.1769
F-statistic: 10.88 on 1 and 45 DF, p-value: 0.001902
> f=3.299^2
> f
[1] 10.8834
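In the output above, the squared t value of r$Wealth (3.299^2 = 10.88) equals the F value of 10.88. For a regression with a single predictor this holds in general, and it can be checked without rounding by taking both statistics straight from the fitted model. The sketch uses the built-in mtcars data as a stand-in.

```r
# Simple regression with one predictor (built-in data, chosen for illustration)
fit <- lm(mpg ~ wt, data = mtcars)
fit_summary <- summary(fit)

# t statistic of the slope and overall F statistic of the model
t_value <- coef(fit_summary)["wt", "t value"]
f_value <- unname(fit_summary$fstatistic["value"])

print(c(t_squared = t_value^2, F = f_value))
print(all.equal(t_value^2, f_value))
```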
t test
> t.test(r, mu=0, alternative="two.sided", conf.level=0.95)
One Sample t-test
data: r
t = 8.442, df = 751, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
306.7586 492.6578
sample estimates:
mean of x
399.7082
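The t value in this output is the sample mean minus the hypothesised mean, divided by the standard error of the mean. As a sketch, the calculation can be done by hand and compared with t.test; the data here are the extra column of the built-in sleep data set, used as a stand-in because the post's vector r is not available.

```r
x <- sleep$extra   # built-in example data, stand-in for the post's vector
mu0 <- 0           # hypothesised mean
n <- length(x)

# One-sample t statistic and two-sided p-value computed by hand
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(n))
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

# Same test with the built-in function
result <- t.test(x, mu = mu0, alternative = "two.sided", conf.level = 0.95)

print(c(manual = t_manual, builtin = unname(result$statistic)))
```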
anova
anova=aov(mod1)
> summary(anova)
Df Sum Sq Mean Sq F value Pr(>F)
r$Wealth 1 1340152 1340152 10.88 0.0019 **
Residuals 45 5540775 123128
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlations pearson, spearman or kendall.
cor(r, use="complete.obs", method="pearson" )
M So Ed Po1 Po2 LF M.F
M 1.00000000 0.58435534 -0.53023964 -0.50573690 -0.51317336 -0.1609488 -0.02867993
So 0.58435534 1.00000000 -0.70274132 -0.37263633 -0.37616753 -0.5054695 -0.31473291
Ed -0.53023964 -0.70274132 1.00000000 0.48295213 0.49940958 0.5611780 0.43691492
Po1 -0.50573690 -0.37263633 0.48295213 1.00000000 0.99358648 0.1214932 0.03376027
Po2 -0.51317336 -0.37616753 0.49940958 0.99358648 1.00000000 0.1063496 0.02284250
LF -0.16094882 -0.50546948 0.56117795 0.12149320 0.10634960 1.0000000 0.51355879
M.F -0.02867993 -0.31473291 0.43691492 0.03376027 0.02284250 0.5135588 1.00000000
Pop -0.28063762 -0.04991832 -0.01722740 0.52628358 0.51378940 -0.1236722 -0.41062750
NW 0.59319826 0.76710262 -0.66488190 -0.21370878 -0.21876821 -0.3412144 -0.32730454
U1 -0.22438060 -0.17241931 0.01810345 -0.04369761 -0.05171199 -0.2293997 0.35189190
U2 -0.24484339 0.07169289 -0.21568155 0.18509304 0.16922422 -0.4207625 -0.01869169
Wealth -0.67005506 -0.63694543 0.73599704 0.78722528 0.79426205 0.2946323 0.17960864
Ineq 0.63921138 0.73718106 -0.76865789 -0.63050025 -0.64815183 -0.2698865 -0.16708869
Prob 0.36111641 0.53086199 -0.38992286 -0.47324704 -0.47302729 -0.2500861 -0.05085826
cor(x, use=, method= ) where
Option | Description |
x | Matrix or data frame |
use | Specifies the handling of missing data. Options are all.obs (assumes no missing data - missing data will produce an error), complete.obs (listwise deletion), and pairwise.complete.obs (pairwise deletion) |
method | Specifies the type of correlation. Options are pearson, spearman or kendall. |
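To see the use and method options in action, the sketch below uses the built-in airquality data, which contains missing values in Ozone and Solar.R. The choice of data set and columns is mine and only serves as an illustration.

```r
# Four numeric columns, two of which contain NA values
aq <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Listwise deletion: only rows that are complete in all four columns are used
cor_listwise <- cor(aq, use = "complete.obs", method = "pearson")

# Pairwise deletion with rank correlation: each pair uses all rows
# in which both of its variables are present
cor_pairwise <- cor(aq, use = "pairwise.complete.obs", method = "spearman")

print(round(cor_listwise, 3))
print(round(cor_pairwise, 3))
```

With the default use = "everything", every correlation involving Ozone or Solar.R would be NA.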
Saturday 4 February 2017
Logistic Regression Model
# Logistic Regression Model
mod1 = glm(Republican~PropR, data=Train, family="binomial")
summary(mod1)
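The Train data set is not part of this page, so the following sketch fits the same kind of model on the built-in mtcars data (binary outcome am, predictor wt; both are my choice). It also shows a typical next step: turning the fitted model into predicted probabilities and comparing them with the actual outcomes.

```r
# Logistic regression on built-in data: transmission type (0/1) explained by weight
fit <- glm(am ~ wt, data = mtcars, family = "binomial")
summary(fit)

# Predicted probabilities of am = 1 for the training rows
prob <- predict(fit, type = "response")

# Classification table at a 0.5 threshold (the threshold is an assumption)
confusion <- table(actual = mtcars$am, predicted = prob > 0.5)
print(confusion)
```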
Scatter plots, boxplots and histograms in R
# Scatter Plots
plot(USDA$Protein, USDA$TotalFat)
# Add xlabel, ylabel and title
plot(USDA$Protein, USDA$TotalFat, xlab="Protein", ylab = "Fat", main = "Protein vs Fat", col = "red")
# Creating a histogram
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C")
# Add limits to x-axis
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100))
# Specify breaks of histogram
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=100)
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=2000)
# Boxplots
boxplot(USDA$Sugar, ylab = "Sugar (g)", main = "Boxplot of Sugar")
Make test set predictions in R
# Make test set predictions
predictTest = predict(model4, newdata=wineTest)
predictTest
HOW to find R-squared in R
# Compute R-squared
SSE = sum((wineTest$Price - predictTest)^2)
SST = sum((wineTest$Price - mean(wine$Price))^2)
1 - SSE/SST
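Note that SST above uses the mean of the training data (wine) as the baseline, not the mean of the test data. The following sketch repeats the calculation end to end on the built-in mtcars data; the split into the first 22 and last 10 rows and the model mpg ~ wt + hp are assumptions for illustration only.

```r
# Simple train/test split of a built-in data set (split chosen for illustration)
train <- mtcars[1:22, ]
test  <- mtcars[23:32, ]

fit <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = test)

# Test-set R-squared: the baseline model predicts the training mean everywhere
SSE <- sum((test$mpg - pred)^2)
SST <- sum((test$mpg - mean(train$mpg))^2)
r_squared_test <- 1 - SSE / SST
print(r_squared_test)
```

Unlike the R-squared on the training data, this value can be negative when the model predicts the test set worse than the baseline does.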
Linear Regression in R
# Linear Regression (all variables)
model3 = lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop, data=wine)
summary(model3)
# Sum of Squared Errors
SSE = sum(model3$residuals^2)
SSE
# Remove FrancePop
model4 = lm(Price ~ AGST + HarvestRain + WinterRain + Age, data=wine)
summary(model4)
R: covariance matrix plot
# Load the dataset (replace "data name" with the name of your data set).
data(data name)
# Plot #1: Basic scatterplot matrix of the four measurements
pairs(~x1+x2+x3+x4, data=data name)
or
use the corrplot package, which offers seven visualization methods (parameter method), named "circle", "square", "ellipse", "number", "shade", "color", "pie".
library(corrplot)
M <- cor(mtcars)
corrplot(M, method="circle")
corrplot(M, method="number")
data(iris)
cor(iris[,1:4])
pairs(iris[,1:4])
Friday 3 February 2017
Thursday 2 February 2017
Wednesday 1 February 2017
Interpretation of the tapply function
# Average amount of iron sorted by high and low protein
tapply(USDA$Iron, USDA$HighProtein, mean, na.rm=TRUE)
# Maximum level of Vitamin C in foods with high and low carbs?
tapply(USDA$VitaminC, USDA$HighCarbs, max, na.rm=TRUE)
# Using summary function with tapply
tapply(USDA$VitaminC, USDA$HighCarbs, summary, na.rm=TRUE)
Data available on http://rcodee.blogspot.sg/
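To interpret a tapply call, read it as: split the first argument into groups given by the second argument, then apply the function to each group. The sketch below shows this on the built-in mtcars data (a stand-in, since the USDA data are not on this page) and checks one group by hand.

```r
# Mean mpg for each number of cylinders: groups are the distinct values of cyl
mean_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean, na.rm = TRUE)
print(mean_by_cyl)

# The entry for 4 cylinders is the same as subsetting and averaging by hand
mean_cyl4 <- mean(mtcars$mpg[mtcars$cyl == 4], na.rm = TRUE)
print(mean_cyl4)
```

Extra arguments such as na.rm = TRUE are passed on to the function that is applied to each group.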