---
title: "Logistic"
author: "Kathleen Durant"
date: "November 26, 2018"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## R Markdown

File source https://rpubs.com/rslbliss/r_logistic_ws

```{r read_data}
ldata_all <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
ldata_all$rank <- as.factor(ldata_all$rank)
summary(ldata_all)
ldata <- ldata_all[1:300, ]
ldata_test <- ldata_all[301:400, ]
```

The binary response variable is `admit`; `gre` and `gpa` are student success measures; `rank` is the prestige of the undergraduate school (1-4, where a lower value means more prestigious).

Use `glm()` to create the logistic regression model, specifying the family of functions as `"binomial"`.

Model to fit:
$$\log\left(\frac{\Pr(\text{admit})}{\Pr(\text{not admit})}\right) = \beta_0 + \beta_1 \cdot \text{gre} + \beta_2 \cdot \text{gpa} + \beta_3 \cdot \text{rank}$$

```{r}
model <- glm(admit ~ gre + gpa + rank, data = ldata, family = "binomial")
summary(model)
```

The regression coefficients are returned in log-odds units.

```{r}
coefs <- coef(model)
coefs
```

Next, we exponentiate the coefficients with the `exp()` command to convert the log odds into odds ratios.

```{r}
exp(coefs)
```

We can take the odds ratio, subtract 1, and multiply by 100 to get the percent change in the odds for a one-unit increase in the independent variable.

```{r}
(exp(coefs) - 1) * 100
```

Interpretation: for a one-point increase in GRE score, we expect to see a 0.30% increase in the odds of being admitted to graduate school.

```{r}
anova(model, test = "Chisq")
```

The difference between the null deviance and the residual deviance shows how our model is doing against the null model (a model with only the intercept). The wider this gap, the better. The McFadden measure is a pseudo R-squared value.

```{r}
library(pscl)
pR2(model)
```

By setting the parameter `type = "response"`, R will output probabilities in the form P(y = 1 | X). Our decision boundary will be 0.5: if P(y = 1 | X) > 0.5 then y = 1, otherwise y = 0.
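To make the link between log odds, odds, and probability concrete, here is a minimal sketch. The coefficient values below are illustrative placeholders, not the values fitted above.

```r
# Convert a linear predictor (log odds) to a probability with plogis(),
# the logistic CDF: p = 1 / (1 + exp(-eta)).
# Illustrative coefficients only -- not the fitted values from the model.
b0 <- -3.5; b_gre <- 0.003; b_gpa <- 0.8
eta  <- b0 + b_gre * 600 + b_gpa * 3.5  # log odds for gre = 600, gpa = 3.5
p    <- plogis(eta)                     # P(admit = 1 | X)
odds <- exp(eta)                        # odds = p / (1 - p)
c(log_odds = eta, odds = odds, probability = p)
```

With these placeholder values the probability exceeds 0.5, so this student would be classified as admitted under the decision boundary above.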
```{r}
probs <- predict(model, newdata = ldata_test, type = "response")
classes <- ifelse(probs > 0.5, 1, 0)
miss_rate <- mean(classes != ldata_test$admit)
sprintf("Accuracy rate = %f", 100 * (1 - miss_rate))
```

Multinomial logistic regression models a response variable that has more than two categories. It reports the odds of being in each outcome category relative to a base (reference) category. Note that `glm()` with `family = binomial` does not fit a true multinomial model: for a factor response with more than two levels, it models the first level against all of the others combined. A true multinomial fit uses `nnet::multinom()`, shown further below.

```{r}
model_multi <- glm(rank ~ admit + gre + gpa, data = ldata, family = binomial)
summary(model_multi)
confint(model_multi)
```

AIC can be used to compare models fit on the same sample: compute the difference in AIC between the two models. A difference of about 2 means the models are about the same; a difference of about 5 suggests an improvement for the model with the smaller AIC; a difference of 10 or more means the lower-AIC model is clearly the better one.

```{r}
model_multi <- glm(rank ~ admit, data = ldata, family = binomial)
summary(model_multi)
confint(model_multi)
```

Let's use the caret package to build a multinomial model for the iris dataset.

```{r}
library(caret)
library(tidyverse)
# Load the data
data("iris")
# Inspect the data
# Split the data into training and test sets
set.seed(123)
train_samples <- iris$Species %>%
  createDataPartition(p = 0.8, list = FALSE)
train_set <- iris[train_samples, ]
test_set <- iris[-train_samples, ]
```

```{r}
# Fit the model
model_iris <- nnet::multinom(Species ~ ., data = train_set)
levels(train_set$Species)
# Summarize the model
summary(model_iris)
# Make predictions
predicted_classes <- model_iris %>% predict(test_set)
head(predicted_classes)
# Model accuracy
mean(predicted_classes == test_set$Species)
```

The base level is setosa.

The following code generates an error; such an error typically means the predictors are not linearly independent, or that one of them is perfectly associated with the response variable.

```{r}
# Fit the model
model_iris2 <- glm(Species ~ Sepal.Length + Sepal.Width + Petal.Width + Petal.Length,
                   data = train_set, family = binomial(link = "logit"))
# Summarize the model
summary(model_iris2)
# Should not make predictions from this model
# predicted_classes2 <- model_iris2 %>% predict(test_set, type = "response")
# head(predicted_classes2)
# Model accuracy
# mean(predicted_classes2 == test_set$Species)
```

```{r}
# Fit the model
model_iris2 <- glm(Species ~ Sepal.Length, data = train_set,
                   family = binomial(link = "logit"))
# Summarize the model
summary(model_iris2)
# predict() on a binomial glm returns probabilities, not class labels.
# With a multi-level factor response, glm models the first level (setosa)
# against all the others, so we can only classify setosa vs. not setosa.
probs2 <- predict(model_iris2, newdata = test_set, type = "response")
predicted_classes2 <- ifelse(probs2 > 0.5, "not setosa", "setosa")
head(predicted_classes2)
# Accuracy on the setosa-vs-rest split
mean((predicted_classes2 == "setosa") == (test_set$Species == "setosa"))
```
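Accuracy alone can hide which classes a model confuses. A quick per-class check is a confusion table built with base `table()`; the toy label vectors below are stand-ins for `predicted_classes` and `test_set$Species` above.

```r
# Toy predicted/actual labels -- stand-ins for the real model output.
actual    <- factor(c("setosa", "setosa", "versicolor", "virginica", "virginica"))
predicted <- factor(c("setosa", "versicolor", "versicolor", "virginica", "virginica"),
                    levels = levels(actual))
# Rows are predictions, columns are true labels; the diagonal holds the
# correctly classified counts.
conf <- table(predicted = predicted, actual = actual)
conf
accuracy <- sum(diag(conf)) / sum(conf)
accuracy
```

Here one setosa is misclassified as versicolor, so the table shows 4 of 5 on the diagonal (accuracy 0.8). The same idiom applies directly to the multinomial iris predictions.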