sign returns a vector with the signs of the corresponding elements of x (the sign of a real number is 1, 0, or -1 if the number is positive, zero, or negative, respectively).
sign {base}
> sign(20)
[1] 1
> sign(-10)
[1] -1
> sign(0)
[1] 0
Note that sign does not operate on complex vectors.
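Since sign is vectorized, it can be applied to a whole vector at once, which is what the description above means by "the signs of the corresponding elements of x":

```r
x <- c(-10, 0, 20, -0.5)
sign(x)   # → -1  0  1 -1
```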
Next, create a new data set "sample" from the old data set "r", keeping only the polling columns:
sample = r[c("Rasmussen", "SurveyUSA", "PropR", "DiffCount")]
summary(r)
State Year Rasmussen SurveyUSA DiffCount
Arizona : 3 Min. :2004 Min. :-41.0000 Min. :-33.0000 Min. :-19.000
Arkansas : 3 1st Qu.:2004 1st Qu.: -8.0000 1st Qu.:-11.7500 1st Qu.: -6.000
California : 3 Median :2008 Median : 1.0000 Median : -2.0000 Median : 1.000
Colorado : 3 Mean :2008 Mean : 0.0404 Mean : -0.8243 Mean : -1.269
Connecticut: 3 3rd Qu.:2012 3rd Qu.: 8.5000 3rd Qu.: 8.0000 3rd Qu.: 4.000
Florida : 3 Max. :2012 Max. : 39.0000 Max. : 30.0000 Max. : 11.000
(Other) :127 NA's :46 NA's :71
PropR Republican
Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000
Median :0.6250 Median :1.0000
Mean :0.5259 Mean :0.5103
3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000
> install.packages("mice")
> library(mice)
> sample = r[c("Rasmussen", "SurveyUSA", "PropR", "DiffCount")]
> summary(sample)
Rasmussen SurveyUSA PropR DiffCount
Min. :-41.0000 Min. :-33.0000 Min. :0.0000 Min. :-19.000
1st Qu.: -8.0000 1st Qu.:-11.7500 1st Qu.:0.0000 1st Qu.: -6.000
Median : 1.0000 Median : -2.0000 Median :0.6250 Median : 1.000
Mean : 0.0404 Mean : -0.8243 Mean :0.5259 Mean : -1.269
3rd Qu.: 8.5000 3rd Qu.: 8.0000 3rd Qu.:1.0000 3rd Qu.: 4.000
Max. : 39.0000 Max. : 30.0000 Max. :1.0000 Max. : 11.000
NA's :46 NA's :71
Multiple imputation is a statistical technique for analyzing incomplete data sets, that is, data sets for which some entries are missing. Application of the technique requires three steps: imputation, analysis and pooling. The figure illustrates these steps.
Imputation: Impute (= fill in) the missing entries of the incomplete data set, not once but m times (m = 3 in the figure). Imputed values are drawn from a distribution (which can be different for each missing entry). This step results in m complete data sets.
Analysis: Analyze each of the m completed data sets. This step results in m analyses.
Pooling: Integrate the m analysis results into a final result. Simple rules exist for combining the m analyses.
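The three steps map directly onto the mice workflow. A minimal sketch, assuming the `sample` data frame built earlier; the model formula here is purely illustrative:

```r
library(mice)

# Imputation: fill in the missing entries m = 5 times (predictive mean matching)
imp <- mice(sample, m = 5, method = "pmm", seed = 144, printFlag = FALSE)

# Analysis: fit the same model on each of the m completed data sets
fit <- with(imp, lm(PropR ~ Rasmussen + SurveyUSA))

# Pooling: combine the m sets of estimates using Rubin's rules
summary(pool(fit))
```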
car::ncvTest(lmMod) # Breusch-Pagan test
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 4.650233 Df = 1 p = 0.03104933
The p-value is less than the significance level of 0.05, so we reject the null hypothesis that the variance of the residuals is constant and infer that heteroscedasticity is indeed present, confirming our graphical inference.
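For intuition, the Cook-Weisberg score statistic that ncvTest reports can be reproduced by hand: regress the scaled squared residuals on the fitted values and take half the regression sum of squares. A minimal sketch on synthetic heteroscedastic data (all names here are illustrative):

```r
set.seed(1)
x <- runif(100)
y <- 1 + 2 * x + rnorm(100, sd = 0.5 + x)   # error variance grows with x
m <- lm(y ~ x)

u2  <- residuals(m)^2
aux <- lm(I(u2 / mean(u2)) ~ fitted(m))     # auxiliary regression
chisq <- 0.5 * sum((fitted(aux) - 1)^2)     # score statistic, df = 1
pval  <- pchisq(chisq, df = 1, lower.tail = FALSE)
```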
Treatment for heteroscedasticity
Box-Cox transformation
The Box-Cox transformation is a mathematical transformation of a variable that makes its distribution approximately normal. Often, applying a Box-Cox transformation to the Y variable resolves the issue, which is exactly what I am going to do now.
> library("caret", lib.loc="~/R/win-library/3.2")
> distBCMod=BoxCoxTrans(r$Crime)
> distBCMod
Box-Cox Transformation
47 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
342.0 658.5 831.0 905.1 1058.0 1993.0
Largest/Smallest: 5.83
Sample Skewness: 1.05
Estimated Lambda: -0.1
With fudge factor, Lambda = 0 will be used for transformations
> r <- cbind(r, Crime_new=predict(distBCMod, r$Crime)) # append the transformed variable to r
> head(r) # view the top 6 rows
Crime Crime_new
1 791 6.673298
2 1635 7.399398
3 578 6.359574
4 1969 7.585281
5 1234 7.118016
6 682 6.525030
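Since the estimated lambda was rounded to 0, the transformation that predict applies is simply the natural logarithm, which we can verify against the first few rows of the output above:

```r
# lambda = 0 in a Box-Cox transformation means y -> log(y)
log(c(791, 1635, 578))   # → 6.673298 7.399398 6.359574, matching Crime_new
```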
> lmMod_bc <- lm(Crime_new ~ Wealth+Ineq, data=r)
>
> ncvTest(lmMod_bc)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.003153686 Df = 1 p = 0.9552162
With a p-value of 0.9552162, we fail to reject the null hypothesis (that the variance of the residuals is constant) and therefore infer that the residuals are homoscedastic. Let's check this graphically as well.
> shapiro.test(mod3$residuals)
	Shapiro-Wilk normality test
data:  mod3$residuals
W = 0.95036, p-value = 0.04473
NORMAL Q-Q PLOTS
plot(mod3)
Press Enter four times to step through the four diagnostic plots.
The code above generates data from a normal distribution (the rnorm command), reshapes it into a series of columns, and runs what is called a normal quantile-quantile plot (Q-Q plot, for short) on the first column.

The Q-Q plot tells us how the observed quantiles of the data set (in this case, the first column of the variable x) compare with the quantiles expected theoretically from a normal distribution based on the sample's mean and standard deviation. We are able to do this because of the normal distribution's properties. The normal distribution is thicker around the mean and thinner as you move away from it: around 68% of the points you can expect to see in normally distributed data lie within 1 standard deviation of the mean, with similar figures for 2 and 3 standard deviations (95.4% and 99.7% respectively).

However, as you can see, testing a large set of data (such as the 100 columns of data we have here) quickly becomes tedious with a graphical approach. There is also the fact that a graphical approach may not be a rigorous enough evaluation for most statistical analysis situations, where you want to compare multiple sets of data easily. Unsurprisingly, we therefore use test statistics, and normality tests, to assess the data's normality.
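A minimal sketch of the workflow the paragraph describes: generate normally distributed data in columns, Q-Q plot the first column, then back the graphical check with a formal normality test (the variable names are illustrative):

```r
set.seed(42)
x <- matrix(rnorm(100 * 50), ncol = 100)   # 100 columns of normal data

qqnorm(x[, 1]); qqline(x[, 1])             # graphical check on column 1
shapiro.test(x[, 1])                       # formal normality test
```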
Residual standard error: 350.9 on 45 degrees of freedom
Multiple R-squared: 0.1948, Adjusted R-squared: 0.1769
F-statistic: 10.88 on 1 and 45 DF, p-value: 0.001902
> f = 3.299^2
> f
[1] 10.8834
With a single predictor, the F statistic is just the square of the coefficient's t statistic: 3.299^2 = 10.88.
	One Sample t-test
data:  r
t = 8.442, df = 751, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
306.7586 492.6578
sample estimates:
mean of x
399.7082
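The output above comes from a one-sample t-test. A reproducible sketch on synthetic data (these values are illustrative, not the original r):

```r
set.seed(7)
v <- rnorm(30, mean = 400, sd = 100)
t.test(v, mu = 0)   # tests H0: the true mean equals 0
```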
use
Specifies the handling of missing data. Options are all.obs (assumes no missing data; missing data will produce an error), complete.obs (listwise deletion), and pairwise.complete.obs (pairwise deletion).
method
Specifies the type of correlation. Options are pearson, spearman, or kendall.
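These are arguments to R's cor() function. A small sketch of pairwise deletion on made-up vectors:

```r
x <- c(1, 2, NA, 4, 5)
y <- c(2, 4, 6, NA, 10)

# Only pairs where both values are present are used: (1,2), (2,4), (5,10)
cor(x, y, use = "pairwise.complete.obs", method = "pearson")   # → 1
```

Here the surviving pairs lie exactly on the line y = 2x, so the correlation is exactly 1.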
# Scatter Plots
plot(USDA$Protein, USDA$TotalFat)
# Add xlabel, ylabel and title
plot(USDA$Protein, USDA$TotalFat, xlab="Protein", ylab = "Fat", main = "Protein vs Fat", col = "red")
# Creating a histogram
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C")
# Add limits to x-axis
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100))
# Specify breaks of histogram
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=100)
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=2000)
# Boxplots
boxplot(USDA$Sugar, ylab = "Sugar (g)", main = "Boxplot of Sugar")
# Mean iron level in foods with high and low protein
tapply(USDA$Iron, USDA$HighProtein, mean, na.rm=TRUE)
# Maximum level of Vitamin C in foods with high and low carbs?
tapply(USDA$VitaminC, USDA$HighCarbs, max, na.rm=TRUE)
# Using the summary function with tapply
tapply(USDA$VitaminC, USDA$HighCarbs, summary, na.rm=TRUE)

Data available on http://rcodee.blogspot.sg/