---
title: "Comparing Groups"
author: "Kathleen Durant"
date: "January 24, 2018"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r}
library(tidyverse)
cereal <- read.csv("cereal.csv")
summary(cereal)
```

```{r}
# Mean calories per manufacturer
Manufacturer_avgs <- cereal %>%
  select(Manufacturer, Calories) %>%
  group_by(Manufacturer) %>%
  summarise(cal_avg = mean(Calories))

# Calories for the two manufacturers being compared
cereal_Nabisco <- cereal %>%
  select(Manufacturer, Calories) %>%
  filter(Manufacturer == "Nabisco")
cereal_GM <- cereal %>%
  select(Manufacturer, Calories) %>%
  filter(Manufacturer == "General Mills")

# Two-sample t-test on mean calories
t.test(cereal_Nabisco$Calories, cereal_GM$Calories)
```

```{r}
library(BSDA)
# Given the number of samples per group, is this an appropriate test to perform,
# since we are using the sample standard deviations in place of the population
# standard deviations?
z.test(cereal_Nabisco$Calories, cereal_GM$Calories,
       sigma.x = sd(cereal_Nabisco$Calories),
       sigma.y = sd(cereal_GM$Calories))
```

```{r}
# One-way ANOVA: do mean calories differ across manufacturers?
comp <- aov(Calories ~ Manufacturer, data = cereal)
summary(comp)
```

Because the p-value is below the 0.05 significance level (marked with "*" in the model summary), we conclude that mean calories differ significantly among the Manufacturer groups. The ANOVA alone does not say which manufacturers differ from each other.

```{r}
# Since a significant difference was identified, we can use Tukey multiple
# pairwise comparisons to identify which groups differ. The Tukey Honest
# Significant Difference (HSD) method controls the Type I error rate across
# multiple comparisons and is generally considered an acceptable technique.
# There are other correction methods, such as the Bonferroni correction, which
# divides the Type I error rate (0.05) by the number of tests performed.
TukeyHSD(comp)
```

```{r}
plot(TukeyHSD(comp))
```

```{r}
# create a binary variable representing high fiber
# is fiber level independent of sugar level?
cereal <- cereal %>% mutate(high_fiber = Fiber > 3)
cereal <- cereal %>% mutate(high_sugar = Sugars > 7)

# contingency table of high fiber (rows) by high sugar (columns)
sugar_fiber_counts <- table(cereal$high_fiber, cereal$high_sugar)
sugar_fiber_counts

# test the hypothesis that sugar level and fiber level are independent
chisq.test(sugar_fiber_counts)
# since the p-value is 1, we do not reject the null hypothesis that the two variables are independent
```

```{r}
# Fisher's exact test is used when the number of samples is small
fisher.test(sugar_fiber_counts)
# same conclusion as above
```

```{r}
# glm() defaults to the gaussian family, so it fits the same model as lm()
fit_lm <- lm(Calories ~ Carbs + Sugars, data = cereal)
fit_glm <- glm(Calories ~ Carbs + Sugars, data = cereal)
predictions_lm <- predict(fit_lm)
predictions_glm <- predict(fit_glm)

# residuals of the linear model: observed minus predicted calories
cereal$Calories - predictions_lm
summary(fit_lm)
```
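
The Bonferroni correction mentioned alongside the Tukey comparisons divides the Type I error rate by the number of tests performed. The chunk below is a minimal sketch of that arithmetic, assuming R's built-in `PlantGrowth` data as a stand-in for the cereal data so that it runs on its own. The names `n_groups`, `n_tests`, `alpha_adj`, `raw`, and `bonf` are illustrative only and are not part of the analysis above.

```{r}
# Sketch only: PlantGrowth (built into R) stands in for the cereal data.
n_groups <- nlevels(PlantGrowth$group)  # number of groups being compared
n_tests <- choose(n_groups, 2)          # number of pairwise comparisons
alpha_adj <- 0.05 / n_tests             # Bonferroni per-test significance level

# Pairwise t-tests without adjustment, then with Bonferroni-adjusted p-values
raw <- pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                       p.adjust.method = "none")
bonf <- pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                        p.adjust.method = "bonferroni")

# Two equivalent ways to apply the correction:
raw$p.value < alpha_adj   # raw p-values against alpha / n_tests
bonf$p.value < 0.05       # adjusted p-values against the original alpha
```

For the cereal data, the same idea would presumably be `pairwise.t.test(cereal$Calories, cereal$Manufacturer, p.adjust.method = "bonferroni")`, whose output can be compared with the `TukeyHSD(comp)` results.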