Introduction
Hello, and welcome to our first tutorial! In this tutorial, we will analyze data from Chicago food inspections for the years 2010-2018. We will visualize the data and try to predict whether grocery stores, restaurants, and bakeries will pass their inspections in 2019. We will use methods and tools taught in class, along with other tools we have researched.

The dataset can be found here - https://www.kaggle.com/chicago/chicago-food-inspections#food-inspections.csv

Download the full R markdown file here - https://pdonnelly3.github.io/final.Rmd

We will utilize both linear and logistic regression models to check whether there is a relationship between facility types passing inspection and year.

Here we load the data locally so we can manipulate and use it later.
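
As a sketch (the setup chunk itself is hidden in the rendered page), assuming the Kaggle CSV above was saved as food-inspections.csv in the working directory, the setup looks roughly like this; the package list covers the functions used throughout this tutorial.

#load the packages used in this tutorial
library(tidyverse)      #dplyr, ggplot2, readr, ...
library(knitr)          #kable()
library(kableExtra)     #kable_styling()
library(leaflet)        #interactive maps
library(leaflet.extras) #addFullscreenControl()
library(broom)          #tidy(), glance(), augment()
library(jtools)         #plot_summs()
library(sjPlot)         #plot_model()

#read the inspections data locally (file name is an assumption)
chicago <- read_csv("food-inspections.csv")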

Let’s take a look at a small portion of our data.

chicago[1:5, 1:22] %>%
  mutate(Violations = substr(Violations, 1, 50))%>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| Inspection ID | DBA Name | AKA Name | License # | Facility Type | Risk | Address | City | State | Zip | Inspection Date | Inspection Type | Results | Violations | Latitude | Longitude | Location | Historical Wards 2003-2015 | Zip Codes | Community Areas | Census Tracts | Wards |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2288309 | BLACK EAGLE CLUB | BLACK EAGLE CLUB | 2653260 | Restaurant | Risk 1 (High) | 1938 W IRVING PARK RD | CHICAGO | IL | 60613 | 2019-05-07 | License | Fail | 1. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLO | 41.95428 | -87.67788 | {'longitude': '-87.67788273287823', 'latitude': '41.95427796120616'} | 13 | 21186 | 46 | 622 | 18 |
| 2288284 | NIBBLES & NOSH/MIDNIGHT MAC & CHEESERIE/XO MARSHMA | NIBBLES & NOSH/MIDNIGHT MAC & CHEESERIE/XO MARSHMA | 2658934 | Restaurant | Risk 1 (High) | 6977-6981 N SHERIDAN RD | CHICAGO | IL | 60626 | 2019-05-07 | License | Pass w/ Conditions | 1. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLO | 42.00903 | -87.66189 | {'longitude': '-87.66188729620244', 'latitude': '42.00902867704194'} | 3 | 21853 | 10 | 267 | 5 |
| 2288315 | ROBERT'S PIZZA AND DOUGH COMPANY | ROBERT'S PIZZA AND DOUGH COMPANY | 2617087 | Restaurant | Risk 1 (High) | 411 E ILLINOIS ST | CHICAGO | IL | 60611 | 2019-05-07 | License | Pass w/ Conditions | 1. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLO | 41.89095 | -87.61719 | {'longitude': '-87.61718975017543', 'latitude': '41.89095012024265'} | 22 | 21182 | 37 | 534 | 36 |
| 2288303 | THE FAT SHALLOT | THE FAT SHALLOT | 2373569 | Mobile Food Preparer | Risk 2 (Medium) | 2300 S THROOP ST | CHICAGO | IL | 60608 | 2019-05-07 | License | Fail | 1. CITY OF CHICAGO FOOD SERVICE SANITATION CERTIFI | 41.85045 | -87.65880 | {'longitude': '-87.65879785567869', 'latitude': '41.85045102427'} | 8 | 14920 | 33 | 126 | 26 |
| 2288325 | ROBERT'S PIZZA AND DOUGH COMPANY | ROBERT'S PIZZA AND DOUGH COMPANY | 2617143 | Restaurant | Risk 1 (High) | 411 E ILLINOIS ST | CHICAGO | IL | 60611 | 2019-05-07 | License | Pass w/ Conditions | NA | 41.89095 | -87.61719 | {'longitude': '-87.61718975017543', 'latitude': '41.89095012024265'} | 22 | 21182 | 37 | 534 | 36 |


Okay, great! We have the names of the establishments and other relevant information about each inspection, including but not limited to the address, risk level, inspection result, and violations.

Data Cleaning

We are only interested in establishments that either pass, pass with conditions, or fail, so we will remove any business that does not fit into these categories.

Data from 2019 is removed because 2019 contains only five months of data, whereas each year from 2010 through 2018 contains a full twelve months.

New columns are created to help with later computations. For example, a year column is added where we extract the year from the Inspection Date column.

A risk value column is added to discretize risk. The Risk column has 3 categories: “Risk 1 (High)”, “Risk 2 (Medium)”, and “Risk 3 (Low)”. We create a new column that converts the 3 categories into numeric values: 3 represents Risk 1 (High), 2 represents Risk 2 (Medium), and 1 represents Risk 3 (Low).

We create a new column called Pass which converts the 3 inspection result categories into 2 numeric values. If the establishment fails inspection, the score is 0; if the establishment passes or passes with conditions, the score is 1.

chicago <- chicago %>%
  #filter out things we do not need or want
  filter(Results != "Not Ready" & Results != "Business Not Located" & Results != "No Entry" & Results 
         != "Out of Business" & Results != "NA" & Risk != "All" & `Facility Type` != "NA")%>%
  
  #make a column for if they pass/failed
  mutate(Pass = ifelse(Results == "Pass" | Results =="Pass w/ Conditions", 1, 0))%>%

  #grab the year of the inspection date using regular expressions
  mutate(year = as.numeric(gsub("(^[0-9]{4})(-[0-9]{2}-)([0-9]{2})", "\\1", `Inspection Date`))) %>%
  
  filter(year != 2019)%>%
  #adding risk value
  mutate(`Risk Value` = ifelse(Risk == "Risk 1 (High)", 3, 
                               ifelse(Risk == "Risk 2 (Medium)", 2, 1)))


#lets rename the column Facility Type to Facility
chicago <- rename(chicago, Facility=`Facility Type`)
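
As a quick sanity check (a sketch; output omitted), counting the remaining result categories should show only Pass, Pass w/ Conditions, and Fail:

#only three result categories should survive the filter
chicago %>% count(Results)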

Here is a small snippet of what our new dataset looks like.

chicago[1:5, 1:25] %>%
  mutate(Violations = substr(Violations, 1, 50))%>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Inspection ID | DBA Name | AKA Name | License # | Facility | Risk | Address | City | State | Zip | Inspection Date | Inspection Type | Results | Violations | Latitude | Longitude | Location | Historical Wards 2003-2015 | Zip Codes | Community Areas | Census Tracts | Wards | Pass | year | Risk Value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2243951 | DONGPO IMPRESSION | DONGPO IMPRESSION | 2551932 | Restaurant | Risk 1 (High) | 228 W CERMAK RD | CHICAGO | IL | 60616 | 2018-12-31 | Complaint | Fail | 1. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLO | 41.85294 | -87.63322 | {'longitude': '-87.6332244476211', 'latitude': '41.85294483663146'} | 8 | 21194 | 35 | 3 | 26 | 0 | 2018 | 3 |
| 2243947 | DOMINO'S | DOMINO'S | 2637330 | Restaurant | Risk 2 (Medium) | 509 N ORLEANS ST | CHICAGO | IL | 60654 | 2018-12-31 | License | Pass | NA | 41.89108 | -87.63687 | {'longitude': '-87.63687267718794', 'latitude': '41.8910754076385'} | 22 | 4446 | 37 | 652 | 36 | 1 | 2018 | 2 |
| 2243952 | RANALLIS | RANALLIS | 1738099 | Restaurant | Risk 1 (High) | 1508-1512 W BERWYN AVE | CHICAGO | IL | 60640 | 2018-12-31 | Canvass | Pass w/ Conditions | 1. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLO | 41.97812 | -87.66876 | {'longitude': '-87.6687558692362', 'latitude': '41.97812335776488'} | 46 | 22616 | 76 | 564 | 24 | 1 | 2018 | 3 |
| 2243942 | DUNKIN DONUTS | DUNKIN DONUTS | 2536449 | Restaurant | Risk 2 (Medium) | 309 W CHICAGO AVE | CHICAGO | IL | 60654 | 2018-12-31 | Canvass | Pass w/ Conditions | 1. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLO | 41.89646 | -87.63610 | {'longitude': '-87.63609804227903', 'latitude': '41.8964577550414'} | 22 | 4446 | 37 | 652 | 36 | 1 | 2018 | 2 |
| 2243937 | WALGREENS #2025 | WALGREENS #2025 | 18593 | Grocery Store | Risk 3 (Low) | 3000 S HALSTED ST | CHICAGO | IL | 60608 | 2018-12-31 | Canvass | Pass | 1. PHYSICAL FACILITIES INSTALLED, MAINTAINED & CL | 41.83977 | -87.64647 | {'longitude': '-87.64646751112869', 'latitude': '41.839771946583845'} | 26 | 14920 | 58 | 59 | 48 | 1 | 2018 | 1 |

Exploratory Data Analysis: Data Visualization

In this section, we are visualizing the data.
Below, we create an interactive map of Chicago. You will be able to scroll around Chicago and see which businesses pass or fail inspection, each business's risk level, where the businesses are located, and their names.

To keep the map manageable, we sample 2000 entries.

Here is a quick guide on how to interpret the map below:

Icon: indicates the inspection result
Pass - thumbs-up
Fail - thumbs-down
Pass with Conditions - check-circle

Marker color: indicates the risk value
High = 3 = red
Medium = 2 = blue
Low = 1 = white

The business name pops up when an icon is clicked.
The cluster colors bear no significance to the data.

#sampling 2000 entries
dat <- chicago %>%
  filter(!is.na(`Longitude`)) %>%
  sample_n(2000)

#if the facility passes, the icon is a thumbs-up; if it fails, a thumbs-down; otherwise, a check-circle
getResult <- function(dat) {
  ifelse(dat$Results == "Pass", 'thumbs-up', 
         ifelse(dat$Results == "Fail", 'thumbs-down', 'check-circle'))
}

#if facility risk value is high (3) the color is red, if it is low (1) the color is white, else the color is blue
getRiskValue <- function(dat) {
  ifelse(dat$`Risk Value` == 3, 'red',
         ifelse(dat$`Risk Value` == 1, 'white', 'blue'))
}

#icon functions
icons <- awesomeIcons(
  icon = getResult(dat),
  markerColor = getRiskValue(dat),
  library = 'fa'
)

leaflet(dat) %>%
  addTiles() %>%
  addAwesomeMarkers(~Longitude,
                    ~Latitude,
                    icon = icons,
                    clusterOptions = markerClusterOptions(),
                    popup = ~(`DBA Name`)) %>%
  addFullscreenControl()


The map is helpful for visualizing small clusters of businesses in Chicago. However, it's hard to get an idea of the overall results, since we have to zoom in to view the distributions of only a small portion of businesses. Just by looking at the map, it's hard to grasp how many businesses actually pass, pass with conditions, or fail. Also, since this map only samples 2000 entries across nine years, it is even more difficult to understand the distribution of results per year.

Instead, let's look at a graph of the total results from 2010 to 2018.

chicago %>%
  ggplot(aes(x=Results, fill = Results))+
  geom_bar() +
  labs(title="Total Results for 2010 - 2018")


That looks better! We can see that overall, there are more passes than fails or passes with conditions. However, this lumps together all the results from 2010 to 2018.

To get a better understanding of how businesses are performing over the years, let’s break down the results by year. Each plot will represent one year.

chicago %>%
  
  #create 9 time periods by year
  mutate(discrete_period = factor(year)) %>%
  
  ggplot(aes(x=Results, fill = Results)) +
  geom_bar() +
  
  facet_wrap(~discrete_period) +
  theme(legend.position = "top")+
  
  #make the labels
  ggtitle("Results vs Year")


We can now get a better idea of the number of passing businesses per year. As we can see from the plots above, there is a consistent trend of more passing than failing businesses. We can also see that passing with conditions increases across the years.

Now, let's look at how the risk levels (high, medium, and low) break down by year, using a graph similar to the one above.

chicago %>%
  
  #create 9 time periods by year
  mutate(discrete_period = factor(year)) %>%
  
  ggplot(aes(x=Risk, fill = Risk)) +
  geom_bar() +
  
  facet_wrap(~discrete_period) +
  theme(legend.position = "top")+
  ggtitle("Risk vs Year")


Overall, more facilities are classified as high risk than as medium or low risk.

Instead of visualizing only the total number of passing facilities per year, we will look at the mean pass rate per year against the total number of passes per year.

Adding geom_smooth() overlays a trend line on the plot, making the pattern easier to see.

chicago %>%
  group_by(year)%>%
  summarize(passMean = mean(Pass), total = sum(Pass))%>%
  ggplot(aes(x=total, y = passMean))+
  geom_point()+
  theme_minimal()+
  geom_smooth()+
  labs(x="Total Passes", y="Mean of Passes", title="Mean of Passes vs Total Passes")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'


From the above graph, it is evident that years with more total passing inspections also tend to have a higher mean pass rate.

Let’s dig deeper!

Let’s see if there is a relationship between facility type and passing rates. We will first examine the distribution of passing bakeries in Chicago.

chicago %>%
  filter(Facility == "Bakery") %>%
  group_by(year) %>%
  summarize(mean_pass = mean(Pass)) %>%
  ggplot(aes(x=year, y = mean_pass)) +
  geom_point() +
  theme_minimal() +
  geom_smooth() +
  labs(y="Mean of Passing Bakeries", x="Year", title="Mean of Passing Bakeries vs Year")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'


From the above plot, bakeries tend to pass about 75% of inspections each year. This leads us to believe that if there is a relationship between bakeries passing and year, it may not be a linear relationship.

Let’s examine another facility type: Grocery Stores

Since there does not appear to be a linear relationship between passing bakeries and year, we will plot the distribution of grocery stores passing inspections over time and see if there is a linear relationship.

chicago %>%
  filter(Facility == "Grocery Store")%>%
  group_by(year)%>%
  summarize(mean_pass = mean(Pass))%>%
  ggplot(aes(x=year, y = mean_pass))+
  geom_point()+
  theme_minimal()+
  geom_smooth()+
  labs(x="Year", y="Mean of Passing Grocery Stores", title="Mean of Passing Grocery Stores vs Year")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'


There seems to be a relationship between grocery stores passing over time, but it does not appear to be linear.

Let’s examine another facility type: Restaurants

Since there does not appear to be a linear relationship between passing grocery stores and year, we will plot the distribution of restaurants passing inspections over time and see if there is a linear relationship.

chicago %>%
  filter(Facility == "Restaurant")%>%
  group_by(year)%>%
  summarize(mean_pass = mean(Pass))%>%
  ggplot(aes(x=year, y = mean_pass))+
  geom_point()+
  theme_minimal()+
  geom_smooth()+
  labs(x="Year", y="Mean of Passing Restaurants", title="Mean of Passing Restaurants vs Year")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'


There seems to be a relationship between restaurants passing over time, but it does not appear to be linear.

We will test this by fitting a linear regression model to check whether the relationship is linear. If it turns out to be non-linear, we will explore a logistic regression model to see if we get better results!

Experiment Design and Hypothesis

Based on the facility type plots above, we believe that there is no linear relationship between facility type passing and year. A standard way to test this is hypothesis testing: we state a null hypothesis and check whether the data lets us reject it.

Our null hypothesis is that there is no linear relationship between facility type passing and year, i.e., that the slope of the regression line is zero.

To test this hypothesis, we build a linear model between facility type passing and year. We are only looking at facility types that are restaurants, bakeries, or grocery stores. Linear regression is a standard tool in data analysis; it lets us construct confidence intervals and perform hypothesis tests on the relationship between variables.

In order to test only restaurants, bakeries, and grocery stores, we will create a new dataset from the Chicago dataset, grouped by year and facility type. From this new dataset, we create a linear regression model between mean_pass and year.

stores_mean <- chicago %>%
  filter(Facility == "Grocery Store" | Facility == "Bakery" | Facility == "Restaurant") %>%
  group_by(year, Facility) %>%
  summarize(mean_pass = mean(Pass)) %>%
  select(year, mean_pass)

mean_lm <- lm(mean_pass ~ year, data=stores_mean)

summary(mean_lm)
## 
## Call:
## lm(formula = mean_pass ~ year, data = stores_mean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.07715 -0.01855  0.01004  0.02319  0.05828 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.566255   5.630666  -0.989    0.332
## year         0.003141   0.002796   1.123    0.272
## 
## Residual standard error: 0.03751 on 25 degrees of freedom
## Multiple R-squared:  0.04805,    Adjusted R-squared:  0.009977 
## F-statistic: 1.262 on 1 and 25 DF,  p-value: 0.2719


Confidence Interval

Let's construct a confidence interval so we can say how precise we think our estimate of the facility type regression line is.

Using the tidy function provided by the broom package, we will calculate a standard error estimate for the year coefficient, and we will use it to construct a 95% confidence interval.

#gather the stats from the linear_model using tidy() which is defined by the broom package.
stats <- mean_lm %>%
  tidy() %>%
  select(term, estimate, std.error)
stats
## # A tibble: 2 x 3
##   term        estimate std.error
##   <chr>          <dbl>     <dbl>
## 1 (Intercept) -5.57      5.63   
## 2 year         0.00314   0.00280

Now let’s construct the confidence interval using the stats we gathered above.

confidence_interval_offset <- 1.96 * stats$std.error[2] #1.96 is the normal 97.5% quantile
confidence_interval <- round(c(stats$estimate[2] - confidence_interval_offset,
                               stats$estimate[2],
                               stats$estimate[2] + confidence_interval_offset), 4)

confidence_interval
## [1] -0.0023  0.0031  0.0086

Based on this confidence interval, we can say that on average, the pass rate changes by about 0.0031 per year, with a 95% confidence interval of (-0.0023, 0.0086). Since this interval contains zero, the slope is not significantly different from zero.
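
For reference, R's built-in confint() computes the same interval directly, using the exact t quantile (about 2.06 on 25 degrees of freedom) rather than the normal approximation, so its bounds come out slightly wider:

#95% confidence interval for the year coefficient via confint()
confint(mean_lm, "year", level = 0.95)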

Global Statistics

Now using the glance function provided by the broom package, we will get some global statistics from our regression model.

mean_lm %>%
  glance() %>%
  select(r.squared, sigma, statistic, df, p.value)
## # A tibble: 1 x 5
##   r.squared  sigma statistic    df p.value
##       <dbl>  <dbl>     <dbl> <int>   <dbl>
## 1    0.0481 0.0375      1.26     2   0.272


R-squared is used to evaluate how well the model fits the data. It measures how much of the variance in the predicted (dependent) variable can be explained by the independent variable. Our R-squared value of 0.048 means that year explains only about 4.8% of the variation in the outcome, indicating close to no linear relationship between a facility passing and year. Combined with the p-value of 0.272, we fail to reject the null hypothesis: there is no evidence of a linear relationship between facility type passing and year.

Skew

Let's check whether the residuals are skewed; that is, let's check whether the first and third quartiles are equidistant from the median.

# extract the quantiles of the residuals from the fitted linear model
quantiles <- quantile(mean_lm$residuals)

quantiles
##          0%         25%         50%         75%        100% 
## -0.07715287 -0.01855324  0.01003876  0.02318608  0.05828431

To determine whether there is skew, we check that the First Quartile (the 25% column) and the Third Quartile (the 75% column) are equidistant from the median. To do that, subtract the median from the third quartile and check whether that value equals the median minus the first quartile.

In short we have to check:

75% column - median = median - 25% column

0.02318608 - 0.01003876 =? 0.01003876 - (-0.01855324)

0.01314732 =? 0.028592

Since the two sides are not equal, and the lower half of the residuals spreads farther from the median, the residuals are slightly left-skewed.
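
We can make the same check programmatically from the quantiles computed above:

#distance from the median to each quartile
q <- quantile(mean_lm$residuals)
unname(q["75%"] - q["50%"]) #0.01314732
unname(q["50%"] - q["25%"]) #0.028592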

Augment

Next we will use the augment function, which attaches information from the fitted model, such as fitted values and residuals, to the original data.

mean_augment <- mean_lm %>%
  augment()

mean_augment %>% 
  head()
## # A tibble: 6 x 9
##   mean_pass  year .fitted .se.fit  .resid   .hat .sigma .cooksd .std.resid
##       <dbl> <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>      <dbl>
## 1     0.722  2010   0.747  0.0133 -0.0247 0.126  0.0379 0.0356      -0.703
## 2     0.684  2010   0.747  0.0133 -0.0624 0.126  0.0358 0.228       -1.78 
## 3     0.761  2010   0.747  0.0133  0.0139 0.126  0.0382 0.0113       0.396
## 4     0.765  2011   0.750  0.0111  0.0151 0.0870 0.0381 0.00851      0.423
## 5     0.673  2011   0.750  0.0111 -0.0772 0.0870 0.0346 0.221       -2.15 
## 6     0.770  2011   0.750  0.0111  0.0207 0.0870 0.0380 0.0159       0.577


Test for Linearity

So now that we have made our model and processed some of the data, we will plot the model. Plotting the model will help us to see if there is a linear relationship between the average number of passes and year. If there is a linear relationship, we should expect to see the residuals clustered around 0.

mean_augment %>%
  ggplot(aes(x=.fitted,y=.resid)) +
    geom_point() + 
    geom_smooth() +
    labs(x="fitted", y="residual", title="Residuals vs Fitted")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'


As we can see above, the residuals do not cluster around 0. Therefore, we can say that there is no linear relationship between facility types passing and year. Now we will try using a logistic regression model to see if we get better results.

Logistic Regression

Before we fit our logistic regression model, we are going to make a dataset called facilities that only contains information on bakeries, grocery stores, and restaurants. This will help our model by removing unwanted information.

facilities <- chicago %>%
  select(year, Facility, Results, Pass, `DBA Name`, `Inspection Date`, `Zip Codes`)%>%
  filter(Facility == "Bakery" | Facility == "Grocery Store" | Facility == "Restaurant") 

head(facilities)
## # A tibble: 6 x 7
##    year Facility  Results   Pass `DBA Name` `Inspection Date`   `Zip Codes`
##   <dbl> <chr>     <chr>    <dbl> <chr>      <dttm>                    <dbl>
## 1  2018 Restaura~ Fail         0 DONGPO IM~ 2018-12-31 00:00:00       21194
## 2  2018 Restaura~ Pass         1 DOMINO'S   2018-12-31 00:00:00        4446
## 3  2018 Restaura~ Pass w/~     1 RANALLIS   2018-12-31 00:00:00       22616
## 4  2018 Restaura~ Pass w/~     1 DUNKIN DO~ 2018-12-31 00:00:00        4446
## 5  2018 Grocery ~ Pass         1 WALGREENS~ 2018-12-31 00:00:00       14920
## 6  2018 Restaura~ Pass w/~     1 DOMINO'S   2018-12-31 00:00:00       22615


Great! Now that we have grabbed those attributes, we want to know if there is a relationship between facility type, year, and passing results. We will include Facility as a factor in the model. This time we will fit a logistic regression model instead of a linear regression model, so we will use the glm function instead of lm.

#we are trying to see if there is a relationship between passing, facility type, and year
logistic_model = glm(formula = Pass ~ year+Facility, data = facilities, family="binomial")

summary(logistic_model)
## 
## Call:
## glm(formula = Pass ~ year + Facility, family = "binomial", data = facilities)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7990   0.6648   0.6817   0.7047   0.8142  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -36.51254    5.41767  -6.740 1.59e-11 ***
## year                    0.01870    0.00269   6.951 3.64e-12 ***
## FacilityGrocery Store  -0.13362    0.05093  -2.624 0.008700 ** 
## FacilityRestaurant      0.17998    0.04904   3.670 0.000243 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 136310  on 129311  degrees of freedom
## Residual deviance: 135921  on 129308  degrees of freedom
## AIC: 135929
## 
## Number of Fisher Scoring iterations: 4


The residual deviance (135921) is only slightly smaller than the null deviance (136310), which tells us that year and facility type, while statistically significant, explain only a small portion of the variation in passing.
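
As a rough effect-size check (a sketch using the deviances reported above), McFadden's pseudo-R-squared is 1 minus the ratio of residual to null deviance, which here is about 0.003, confirming the fit is weak:

#McFadden's pseudo-R-squared: 1 - 135921/136310, roughly 0.003
1 - logistic_model$deviance / logistic_model$null.deviance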

Interaction Model

Now we are going to fit an interaction model between year and the facility types. Notice that this time we use * instead of +, which includes both the main effects and their interaction.

#this time we include the interaction between facility type and year
interaction_logistic_model = glm(formula = Pass ~ Facility*year, data = facilities, family="binomial")

summary(interaction_logistic_model)
## 
## Call:
## glm(formula = Pass ~ Facility * year, family = "binomial", data = facilities)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7880   0.6722   0.6835   0.6987   0.8513  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)                 2.318e+00  3.930e+01   0.059   0.9530  
## FacilityGrocery Store      -9.903e+01  4.134e+01  -2.396   0.0166 *
## FacilityRestaurant         -2.605e+01  3.977e+01  -0.655   0.5124  
## year                       -5.836e-04  1.951e-02  -0.030   0.9761  
## FacilityGrocery Store:year  4.911e-02  2.052e-02   2.393   0.0167 *
## FacilityRestaurant:year     1.303e-02  1.975e-02   0.660   0.5095  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 136310  on 129311  degrees of freedom
## Residual deviance: 135893  on 129306  degrees of freedom
## AIC: 135905
## 
## Number of Fisher Scoring iterations: 4


Comparing the residual deviance of the interaction model to the residual deviance of our first logistic model, the interaction model's is smaller, signifying that the interaction model fits the data better.

Using the anova function, we can compare the two models more clearly.

anova(logistic_model, interaction_logistic_model)
## Analysis of Deviance Table
## 
## Model 1: Pass ~ year + Facility
## Model 2: Pass ~ Facility * year
##   Resid. Df Resid. Dev Df Deviance
## 1    129308     135921            
## 2    129306     135893  2   27.398


This further supports our conclusion that the interaction model is an improvement.
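
Note that anova() does not print a p-value for glm models by default; as a sketch, passing test = "Chisq" compares the deviance drop (27.398 on 2 degrees of freedom) against a chi-squared distribution:

#formal test of the deviance drop between the two models
anova(logistic_model, interaction_logistic_model, test = "Chisq")

#equivalently, by hand
pchisq(27.398, df = 2, lower.tail = FALSE)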

To better visualize our model, we will plot the distributions of its coefficient estimates using plot_summs() from the jtools package.

plot_summs(interaction_logistic_model, scale = TRUE, plot.distributions = TRUE, inner_ci_level = .9)

And we will also plot a comparison of distributions between the two models.

plot_summs(interaction_logistic_model, logistic_model, scale = TRUE, plot.distributions = TRUE)

Let’s see what stats are produced by this interaction model.

interaction_logistic_model_stats <- interaction_logistic_model %>% 
  tidy()
interaction_logistic_model_stats
## # A tibble: 6 x 5
##   term                         estimate std.error statistic p.value
##   <chr>                           <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)                  2.32       39.3       0.0590  0.953 
## 2 FacilityGrocery Store      -99.0        41.3      -2.40    0.0166
## 3 FacilityRestaurant         -26.1        39.8      -0.655   0.512 
## 4 year                        -0.000584    0.0195   -0.0299  0.976 
## 5 FacilityGrocery Store:year   0.0491      0.0205    2.39    0.0167
## 6 FacilityRestaurant:year      0.0130      0.0197    0.660   0.509


Notice that one interaction is missing: the interaction between Bakery and year. Bakery serves as the baseline for our estimates. This means that in the estimate column, positive numbers are higher than the Bakery estimate and negative numbers are lower.
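
Bakery is the baseline simply because R orders factor levels alphabetically. As a sketch, if we wanted a different reference level, we could re-level the factor before fitting:

#use Restaurant as the reference level instead of Bakery
facilities_releveled <- facilities %>%
  mutate(Facility = relevel(factor(Facility), ref = "Restaurant"))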

Now we are going to compute the yearly increase for each facility type from the interaction terms. First we take the year estimate as our base, and then we make a data frame out of the interaction estimates.

Adding each interaction estimate to the base year estimate gives the yearly increase for that facility. Since there are only a few estimates here, we could simply make a variable for each one, but with many estimates that would quickly become unmanageable.

#base slope: the year estimate (row 4 of the tidy output)
year <- interaction_logistic_model_stats$estimate[4]

#grab the two facility-by-year interaction estimates (rows 5 and 6)
estimate <- interaction_logistic_model_stats%>%
  mutate(Facility = term)%>%
  select(Facility, estimate)%>%
  slice(5:6)

estimate_df <- data.frame(estimate)


estimate_df <- estimate_df %>%
  mutate(increase = year + estimate)

estimate_df
##                     Facility  estimate   increase
## 1 FacilityGrocery Store:year 0.0491134 0.04852979
## 2    FacilityRestaurant:year 0.0130257 0.01244209


Taking bakeries as the baseline (whose yearly slope is roughly zero), the log-odds of passing increase by about 0.0485 per year for grocery stores and by about 0.0124 per year for restaurants. In other words, grocery stores improve fastest, followed by restaurants, while bakeries stay essentially flat.
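
Because these slopes are on the log-odds scale, exponentiating them gives yearly odds ratios, which are easier to read (a sketch using the estimates above):

#convert the yearly log-odds slopes to odds ratios
exp(estimate_df$increase)
#grocery stores: exp(0.0485) is about 1.050, so odds of passing grow ~5% per year
#restaurants:    exp(0.0124) is about 1.013, so odds grow ~1.3% per year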

Now we are going to use the augment function again to help us plot residual and fitted values to see how well the model performs.

augmented_logistic <- interaction_logistic_model %>%
  augment()

augmented_logistic %>%
  head()
## # A tibble: 6 x 10
##    Pass Facility  year .fitted .se.fit .resid    .hat .sigma .cooksd
##   <dbl> <chr>    <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
## 1     0 Restaur~  2018    1.37  0.0140 -1.79  3.15e-5   1.03 2.07e-5
## 2     1 Restaur~  2018    1.37  0.0140  0.672 3.15e-5   1.03 1.33e-6
## 3     1 Restaur~  2018    1.37  0.0140  0.672 3.15e-5   1.03 1.33e-6
## 4     1 Restaur~  2018    1.37  0.0140  0.672 3.15e-5   1.03 1.33e-6
## 5     1 Grocery~  2018    1.22  0.0326  0.720 1.88e-4   1.03 9.27e-6
## 6     1 Restaur~  2018    1.37  0.0140  0.672 3.15e-5   1.03 1.33e-6
## # ... with 1 more variable: .std.resid <dbl>


Now let's make two plots using plot_model, provided by the sjPlot package (with helpers from sjmisc).

The first plot will plot marginal effects of the interaction terms.

#The Interaction
plot_model(interaction_logistic_model, type="int")


The probability of a restaurant passing increases by roughly 2% over the 8-year span, the probability of a grocery store passing increases by roughly 18% over the same span, and the probability of a bakery passing stays about the same.

These plots show the predicted values (marginal effects) for the specified model terms.

#The Interaction
plot_model(interaction_logistic_model, type="pred", terms=c("year", "Facility"))


We can see that there is an upward trend in grocery stores and restaurants passing, while bakeries stay about the same but maintain a high passing rate.

Let’s Make a Prediction!

Now let's make a prediction of our three facility types passing in 2019! We will make a dummy data frame called test_data and call the predict() function!

test_data <- data.frame(Facility=c("Bakery", "Restaurant", "Grocery Store"), year=2019, Pass=1)
predict(interaction_logistic_model, test_data, type="response")
##         1         2         3 
## 0.7575376 0.7998017 0.7799125


Results

This prediction tells us that in 2019, a bakery has about a 76% chance of passing inspection, a restaurant about an 80% chance, and a grocery store about a 78% chance.
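
As a sanity check, we can reconstruct the bakery prediction by hand from the interaction model's coefficients (Bakery is the baseline level, so only the intercept and year terms apply):

#logistic inverse link applied to intercept + year * slope
plogis(2.318 + (-5.836e-04) * 2019) #about 0.758, matching predict()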

Summary

We began our tutorial by trying to see if there is a linear relationship between facilities passing and year. Through our linear regression model, we found there is no linear relationship. We then tested our hypothesis with a logistic regression model, which was better than the linear model but still not as good as we had hoped. We then tried an interaction model, and this proved to be a much better prediction model. From our data analysis, we conclude that restaurants will pass at a higher rate than either bakeries or grocery stores in 2019.