---
title: "LR_assumptions"
author: "Kathleen Durant"
date: "March 19, 2018"
output:
  html_document:
    df_print: paged
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Assumption 1: the regression model is linear in parameters. This means you need some indication that the independent terms (X terms) can be combined linearly to determine the dependent variable (Y).

Assumption 2: the mean of the residuals is 0 or close to 0.

```{r cars}
mod <- lm(dist ~ speed, data=cars)
mean(mod$residuals)
```

## Homoscedasticity

Assumption 3: equal variance for the residuals. This means the spread of the residual values does not change for different values of the X variable. This can be reviewed by plotting the residuals against the fitted values. If the spread of the residuals increases or decreases for different predicted values, then we may have heteroscedasticity.

The Normal Q-Q plot is created by ordering the residuals by their values and graphing them against the expected order statistics of the standard normal distribution - so the residual values we have versus the residual values we would expect from a normal distribution. If the residuals are normally distributed, the Q-Q plot should resemble a straight line. If the residuals do not match what is expected for a normal distribution, then we have an issue with the residual values.

```{r homoscedasticity}
par(mfrow=c(2,2))
m <- lm(mpg ~ disp, data=mtcars)
plot(m)
```

Assumption 4: No autocorrelation of residuals. The current residual should not be dependent on the previous (historic) residuals, so we should NOT see a pattern in the residuals. However, in the plot below we do. The X axis corresponds to the lags of the residual, increasing in steps of 1. The very first line (to the left) shows the correlation of the residual with itself (lag 0); therefore, it will always be equal to 1. If the residuals were not autocorrelated, the correlation (Y axis) from the immediate next line onwards would drop to a near-zero value below the dashed blue line (significance level). Clearly, this is not the case here, so we can conclude that the residuals are autocorrelated.

```{r}
library(ggplot2) # for the economics dataset
data(economics)
lmMod <- lm(pce ~ pop, data=economics)
acf(lmMod$residuals) # autocorrelation of the residuals at increasing lags
```

Assumption 5: The X variables and residuals are uncorrelated. Verify by determining the correlation between the X values and the corresponding residuals.

```{r}
library(stats)
library(graphics)
mod.lm <- lm(dist ~ speed, data=cars)
cor.test(cars$speed, mod.lm$residuals)
```

Assumption 6: the number of observations must be greater than the number of features. This can be reviewed by looking at the data.
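One way to make that check concrete (a small added sketch, not part of the original notes) is to compare the number of rows against the number of fitted coefficients:

```{r}
# Added illustration: cars has 50 observations and the model below
# estimates 2 parameters (intercept + speed), so the assumption holds.
mod <- lm(dist ~ speed, data=cars)
nrow(cars)                     # number of observations
length(coef(mod))              # number of estimated parameters
nrow(cars) > length(coef(mod)) # should be TRUE
```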
Assumption 7: The variability in X values is positive, so we have different measured values for a variable. We can easily check this by reviewing the variance of the variable.

```{r}
var(cars$speed)
var(c(1, 1, 1, .999999, .998, .9998))
```

Assumption 8: The regression model has correctly identified the relationship between each independent variable and the dependent variable, either as a direct relationship or an inverse relationship.

Assumption 9: No perfect multicollinearity - there is no perfect linear relationship between the independent variables. If two of the independent variables are correlated, this can lead to a problem. We can use the variance inflation factor (VIF) to identify this. VIF is computed for each X variable; if its VIF is high, it means the information in that variable is already explained by the other X variables present in the given model. Lower VIF values are desired. Rule of thumb: VIF < 4, strict threshold VIF < 2. You can remove the variable with the highest VIF - one at a time - to determine the change in the model. The other solution is to create a correlation plot of all X variables.

```{r}
library(car)
mod2 <- lm(mpg ~ ., data = mtcars)
vif(mod2)
```

```{r}
library(corrplot)
corrplot(cor(mtcars[,-1])) # what variable pairs do I need to be aware of?
```

```{r}
mod <- lm(mpg ~ cyl + gear, data = mtcars)
vif(mod)
```

Assumption 10: normality of residuals. The qqnorm plot displays this. The data points should fall on the line; for the smallest and largest values, some deviation from the line is expected.

```{r}
par(mfrow=c(2,2))
mod <- lm(dist ~ speed, data=cars)
plot(mod)
```

Interpretation of the charts:

Residuals vs. Fitted - shows if the residuals have a nonlinear pattern. Equally spread residuals around a horizontal line without distinct patterns are a good indication you don't have non-linear relationships. This is not quite a straight line, and the labeled points have residual values larger than expected. Do we have a non-linear relationship?

The Q-Q plot - plots the residual values against the expected values from a normal distribution. If the residual values (dots) fall along the line (expected values from a normal distribution), then the residual values are normally distributed. Are the points on the line?

The scale-location or spread-location plot shows if the residuals are spread equally along the range of fitted values. This checks the assumption of equal variance (homoscedasticity). Homoscedasticity exists if the plot has a horizontal line with equally (randomly) spread points. Do our points look as though they are randomly placed? Is the line horizontal?

Residuals vs. Leverage - identifies influential points (points with leverage) with regard to the regression line - which points are bucking the trend of the other data points to such an extreme that they influence the placement of the regression line. Look for outlying values at the upper right corner or at the lower right corner; those spots are the places where cases can be influential against a regression line. Identify cases outside of the dashed lines, which mark Cook's distance. When cases are outside of the Cook's distance lines (meaning they have high Cook's distance scores), the cases are influential to the regression results: the regression results will be altered if we exclude those cases.

To check the assumptions of a linear model you can use the gvlma package.

```{r}
par(mfrow=c(2,2))
mod <- lm(dist ~ speed, data=cars)
summary(mod)
gvlma::gvlma(mod)
summary(gvlma::gvlma(mod))
plot(mod)
```

We can remove points that have been identified as troubling to see if that changes the results of the LR assumptions.

```{r}
mod_clean <- lm(dist ~ speed, data=cars[-c(23, 35, 49),])
gvlma::gvlma(mod_clean)
summary(mod_clean)
```

Compare the adjusted R-squared for the original model mod and the model with the samples removed mod_clean.
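The original notes leave that comparison to the reader; a minimal sketch of how to carry it out, refitting both models so the chunk is self-contained:

```{r}
# Added illustration: extract the adjusted R-squared from each model.
mod <- lm(dist ~ speed, data=cars)
mod_clean <- lm(dist ~ speed, data=cars[-c(23, 35, 49),])
summary(mod)$adj.r.squared       # all 50 observations
summary(mod_clean)$adj.r.squared # with the 3 flagged points removed
```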
```{r}
CarModel <- lm(dist ~ ., data=cars)
CarModelAssess <- gvlma::gvlma(CarModel)
CarModelAssess
summary(CarModelAssess)
```

```{r}
CarModelDel <- gvlma::deletion.gvlma(CarModelAssess)
# deletion.gvlma performs leave-one-out to assess unusual
# data samples used to fit the line
CarModelDel
#summary(CarModelDel)
```

```{r}
plot(CarModelAssess)
```

```{r}
library(gvlma)
par(mfrow=c(2,2)) # draw 4 plots in the same window
mod <- lm(dist ~ speed, data=cars)
gvlma::gvlma(mod)
plot(mod)
```

These are the same four diagnostic plots as before (Residuals vs. Fitted, Q-Q, Scale-Location, and Residuals vs. Leverage); the interpretation notes given above apply here as well.
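To make the Residuals vs. Leverage discussion concrete, here is a small added sketch (not part of the original notes) that computes Cook's distance directly; the 4/n cutoff used below is a common rule of thumb, not a hard threshold:

```{r}
# Added illustration: flag observations whose Cook's distance exceeds
# the rule-of-thumb cutoff of 4/n.
mod <- lm(dist ~ speed, data=cars)
cd <- cooks.distance(mod)
which(cd > 4/nrow(cars)) # indices of potentially influential points
```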