---
title: "Comparing Groups"
author: "Kathleen Durant"
date: "January 24, 2018"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r}
library(tidyverse)
cereal <- read.csv("cereal.csv")
summary(cereal)
```
```{r}
# Mean calories per manufacturer
Manufacturer_avgs <- cereal %>%
  select(Manufacturer, Calories) %>%
  group_by(Manufacturer) %>%
  summarise(cal_avg = mean(Calories))

# Calories for the two manufacturers being compared
cereal_Nabisco <- cereal %>% select(Manufacturer, Calories) %>% filter(Manufacturer == "Nabisco")

cereal_GM <- cereal %>% select(Manufacturer, Calories) %>% filter(Manufacturer == "General Mills")

# Two-sample t-test (Welch's version, which t.test() uses by default)
t.test(cereal_Nabisco$Calories, cereal_GM$Calories)
```
```{r}
library(BSDA)
# Given the number of observations per group, is this an appropriate test to
# perform? Note that we are using the sample standard deviations as stand-ins
# for the population standard deviations.
z.test(cereal_Nabisco$Calories, cereal_GM$Calories, sigma.x=sd(cereal_Nabisco$Calories),
       sigma.y=sd(cereal_GM$Calories) )
```
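
One way to see why the question in the comment above matters: the z-test uses normal critical values, while the t-test widens them to account for estimating the standard deviation from the sample. The sketch below uses only base R; the sample sizes in `n` are illustrative values, not counts taken from `cereal.csv`.

```{r}
# Illustrative sample sizes (assumed, not taken from cereal.csv)
n <- c(5, 10, 30, 100)

# Two-sided 95% critical values
z_crit <- qnorm(0.975)            # normal: the same for every n
t_crit <- qt(0.975, df = n - 1)   # Student t: depends on degrees of freedom

crit_table <- data.frame(n = n, z_crit = z_crit, t_crit = t_crit)
crit_table
```

The t critical value is always larger than the z value and approaches it as `n` grows, so treating a sample standard deviation as if it were known matters most for small groups.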

```{r}
comp <- aov(Calories~Manufacturer, data=cereal)
summary(comp)

```
Because the p-value is below the 0.05 significance level, we reject the null hypothesis that all manufacturers have the same mean calories: at least one manufacturer's mean differs from the others. The "*" in the model summary marks this significance level.
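
The p-value can also be read from the ANOVA table programmatically rather than by eye. The sketch below uses R's built-in `PlantGrowth` data as a stand-in for `cereal`; the names `demo_fit`, `demo_table`, and `p_value` are illustrative.

```{r}
# Built-in PlantGrowth data stands in for the cereal data (illustrative only)
demo_fit <- aov(weight ~ group, data = PlantGrowth)

# summary() of an aov fit is a list whose first element is the ANOVA table
demo_table <- summary(demo_fit)[[1]]

# The first row is the grouping factor; the Residuals row has NA in this column
p_value <- demo_table[["Pr(>F)"]][1]
p_value < 0.05
```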

```{r}
# Since a significant difference was identified, we can use Tukey multiple
# pairwise comparisons to find which groups differ. The Tukey Honest
# Significant Difference (HSD) method controls the Type I error rate across
# multiple comparisons and is generally considered an acceptable technique.
# Other corrections exist, such as the Bonferroni method, which divides the
# Type I error rate (0.05) by the number of tests performed.
TukeyHSD(comp)

```
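
The Bonferroni correction mentioned in the comment above can be sketched with base R's `p.adjust`. The p-values in `raw_p` are made-up numbers for illustration, not results from the cereal data.

```{r}
# Hypothetical raw p-values from three pairwise tests (not from cereal.csv)
raw_p <- c(0.010, 0.020, 0.040)
alpha <- 0.05

# Option 1: compare each raw p-value to alpha divided by the number of tests
reject_threshold <- raw_p < alpha / length(raw_p)

# Option 2: inflate the p-values (capped at 1) and compare to alpha itself
adj_p <- p.adjust(raw_p, method = "bonferroni")
reject_adjusted <- adj_p < alpha

# Both views lead to the same decisions
identical(reject_threshold, reject_adjusted)
```

For grouped data, `pairwise.t.test(x, g, p.adjust.method = "bonferroni")` applies the same correction to all pairwise t-tests.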

```{r}
plot(TukeyHSD(comp))
```
```{r}
# Create binary variables representing high fiber and high sugar.
# Is fiber level independent of sugar level?
cereal <- cereal %>% mutate(high_fiber = Fiber > 3)
cereal <- cereal %>% mutate(high_sugar = Sugars > 7)
sugar_fiber_counts <- table(cereal$high_fiber, cereal$high_sugar)
sugar_fiber_counts
# Test the null hypothesis that sugar and fiber levels are independent
# (for a 2x2 table, chisq.test() applies Yates' continuity correction by default)
chisq.test(sugar_fiber_counts)
# Since the p-value is 1, we do not reject the null hypothesis that the two variables are independent
```

```{r}
# Fisher's exact test is used when the cell counts are small
fisher.test(sugar_fiber_counts)
# Same conclusion as the chi-squared test above
```
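
A common rule of thumb for choosing between the two tests is to look at the expected cell counts under independence: when any are below about 5, the chi-squared approximation is doubtful and Fisher's exact test is the usual fallback. The sketch below uses a made-up 2x2 table (`toy_counts`), not the cereal counts.

```{r}
# Hypothetical 2x2 table of counts (illustrative, not from cereal.csv)
toy_counts <- matrix(c(3, 1, 1, 3), nrow = 2,
                     dimnames = list(high_fiber = c("FALSE", "TRUE"),
                                     high_sugar = c("FALSE", "TRUE")))

# Expected counts under independence: row total * column total / grand total.
# suppressWarnings() hides R's own warning about small expected counts.
expected <- suppressWarnings(chisq.test(toy_counts))$expected
expected

# If any expected count is small, fall back to the exact test
if (any(expected < 5)) {
  toy_result <- fisher.test(toy_counts)
} else {
  toy_result <- chisq.test(toy_counts)
}
toy_result$p.value
```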


```{r}
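# Fit the same model two ways; glm() defaults to the gaussian family
fit_lm <- lm(Calories ~ Carbs + Sugars, data = cereal)
fit_glm <- glm(Calories ~ Carbs + Sugars, data = cereal)
# Fitted values on the training data
predictions_lm <- predict(fit_lm)
predictions_glm <- predict(fit_glm)
# Residuals: observed minus fitted calories
cereal$Calories - predictions_lm
summary(fit_lm)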
```
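
Because `glm()` defaults to the gaussian family with an identity link, it should reproduce the `lm()` fit, and observed-minus-predicted values should match `residuals()`. The sketch below checks both on the built-in `mtcars` data, used here only as a stand-in for `cereal`; the `demo_*` names are illustrative.

```{r}
# Built-in mtcars data stands in for the cereal data (illustrative only)
demo_lm  <- lm(mpg ~ wt + hp, data = mtcars)
demo_glm <- glm(mpg ~ wt + hp, data = mtcars)  # default family = gaussian

# The two fits should give the same coefficients and fitted values
same_coef <- isTRUE(all.equal(coef(demo_lm), coef(demo_glm)))
same_pred <- isTRUE(all.equal(predict(demo_lm), predict(demo_glm)))

# Observed minus predicted should equal the model residuals
manual_resid <- mtcars$mpg - predict(demo_lm)
same_resid <- isTRUE(all.equal(unname(manual_resid),
                               unname(residuals(demo_lm))))

c(same_coef, same_pred, same_resid)
```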