---
title: "Logistic"
author: "Kathleen Durant"
date: "November 26, 2018"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## R Markdown

File source https://rpubs.com/rslbliss/r_logistic_ws

```{r read_data}
ldata_all <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
ldata_all$rank <- as.factor(ldata_all$rank)
summary(ldata_all)
ldata <- ldata_all[1:300, ]
ldata_test <- ldata_all[301:400, ]
```

The binary response variable is `admit`; `gre` and `gpa` are student success measures; `rank` is the prestige of the undergraduate school (1-4, where a lower value means more prestigious).

Use `glm()` to create the logistic regression model, specifying the family of functions as `"binomial"`.

Model to fit:
$$\log\left(\frac{\Pr(\text{admit})}{\Pr(\text{not admit})}\right) = \beta_0 + \beta_1 \cdot \text{gre} + \beta_2 \cdot \text{gpa} + \beta_3 \cdot \text{rank}$$

```{r}
model <- glm(admit ~ gre + gpa + rank, data = ldata, family = "binomial")
summary(model)
```

The regression coefficients are returned in log-odds units.

```{r}
coefs <- coef(model)
coefs
```

Next, we exponentiate the coefficients with the `exp()` command to convert the log odds into odds ratios.

```{r}
exp(coefs)
```

We can take the odds ratio, subtract 1, and multiply by 100 to get the percent change in the odds for a one-unit increase in the independent variable.

```{r}
(exp(coefs) - 1) * 100
```

Interpretation: for a one-point increase in GRE score, we expect to see a 0.30% increase in the odds of being admitted to graduate school.

```{r}
anova(model, test = "Chisq")
```

The difference between the null deviance and the residual deviance shows how our model is doing against the null model (a model with only the intercept). The wider this gap, the better. The McFadden measure is a pseudo R-squared value.

```{r}
library(pscl)
pR2(model)
```

By setting the parameter `type = "response"`, R will output probabilities in the form P(y = 1 | X). Our decision boundary will be 0.5: if P(y = 1 | X) > 0.5 then y = 1, otherwise y = 0.
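To make the link between log odds, odds, and probability concrete, here is a minimal sketch. The coefficient values below are illustrative placeholders, not the values fitted above.

```r
# Convert a linear predictor (log odds) to a probability with plogis(),
# the logistic CDF: p = 1 / (1 + exp(-eta)).
# Illustrative coefficients only -- not the fitted values from the model.
b0 <- -3.5; b_gre <- 0.003; b_gpa <- 0.8
eta  <- b0 + b_gre * 600 + b_gpa * 3.5  # log odds for gre = 600, gpa = 3.5
p    <- plogis(eta)                     # P(admit = 1 | X)
odds <- exp(eta)                        # odds = p / (1 - p)
c(log_odds = eta, odds = odds, probability = p)
```

With these placeholder values the probability exceeds 0.5, so this student would be classified as admitted under the decision boundary above.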
```{r}
probs <- predict(model, newdata = ldata_test, type = "response")
classes <- ifelse(probs > 0.5, 1, 0)
miss_rate <- mean(classes != ldata_test$admit)
sprintf("Accuracy rate = %f", 100 * (1 - miss_rate))
```

Multinomial logistic regression models a response variable that has more than two categories. It reports the odds of being in each outcome category relative to a base (reference) category. Note that `glm()` with `family = binomial` does not fit a true multinomial model: for a factor response with more than two levels, it models the first level against all of the others combined. A true multinomial fit uses `nnet::multinom()`, shown further below.

```{r}
model_multi <- glm(rank ~ admit + gre + gpa, data = ldata, family = binomial)
summary(model_multi)
confint(model_multi)
```

AIC can be used to compare models fit on the same sample: compute the difference in AIC between the two models. A difference of about 2 means the models are about the same; a difference of about 5 suggests an improvement for the model with the smaller AIC; a difference of 10 or more means the lower-AIC model is clearly the better one.

```{r}
model_multi <- glm(rank ~ admit, data = ldata, family = binomial)
summary(model_multi)
confint(model_multi)
```

Let's use the caret package to build a multinomial model for the iris dataset.

```{r}
library(caret)
library(tidyverse)
# Load the data
data("iris")
# Inspect the data
# Split the data into training and test sets
set.seed(123)
train_samples <- iris$Species %>%
  createDataPartition(p = 0.8, list = FALSE)
train_set <- iris[train_samples, ]
test_set <- iris[-train_samples, ]
```

```{r}
# Fit the model
model_iris <- nnet::multinom(Species ~ ., data = train_set)
levels(train_set$Species)
# Summarize the model
summary(model_iris)
# Make predictions
predicted_classes <- model_iris %>% predict(test_set)
head(predicted_classes)
# Model accuracy
mean(predicted_classes == test_set$Species)
```

The base level is setosa.

The following code generates an error; such an error typically means the predictors are not linearly independent, or that one of them is perfectly associated with the response variable.

```{r}
# Fit the model
model_iris2 <- glm(Species ~ Sepal.Length + Sepal.Width + Petal.Width + Petal.Length,
                   data = train_set, family = binomial(link = "logit"))
# Summarize the model
summary(model_iris2)
# Should not make predictions from this model
# predicted_classes2 <- model_iris2 %>% predict(test_set, type = "response")
# head(predicted_classes2)
# Model accuracy
# mean(predicted_classes2 == test_set$Species)
```

```{r}
# Fit the model
model_iris2 <- glm(Species ~ Sepal.Length, data = train_set,
                   family = binomial(link = "logit"))
# Summarize the model
summary(model_iris2)
# predict() on a binomial glm returns probabilities, not class labels.
# With a multi-level factor response, glm models the first level (setosa)
# against all the others, so we can only classify setosa vs. not setosa.
probs2 <- predict(model_iris2, newdata = test_set, type = "response")
predicted_classes2 <- ifelse(probs2 > 0.5, "not setosa", "setosa")
head(predicted_classes2)
# Accuracy on the setosa-vs-rest split
mean((predicted_classes2 == "setosa") == (test_set$Species == "setosa"))
```
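Accuracy alone can hide which classes a model confuses. A quick per-class check is a confusion table built with base `table()`; the toy label vectors below are stand-ins for `predicted_classes` and `test_set$Species` above.

```r
# Toy predicted/actual labels -- stand-ins for the real model output.
actual    <- factor(c("setosa", "setosa", "versicolor", "virginica", "virginica"))
predicted <- factor(c("setosa", "versicolor", "versicolor", "virginica", "virginica"),
                    levels = levels(actual))
# Rows are predictions, columns are true labels; the diagonal holds the
# correctly classified counts.
conf <- table(predicted = predicted, actual = actual)
conf
accuracy <- sum(diag(conf)) / sum(conf)
accuracy
```

Here one setosa is misclassified as versicolor, so the table shows 4 of 5 on the diagonal (accuracy 0.8). The same idiom applies directly to the multinomial iris predictions.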