Introduction
Hello, and welcome to our first tutorial! In this tutorial, we will analyze Chicago Food Inspections data from 2010-2018. We will visualize the data and try to predict whether grocery stores, restaurants, and bakeries will pass their inspections in 2019. We will use methods and tools taught in class, along with other tools we have researched.
The dataset can be found here - https://www.kaggle.com/chicago/chicago-food-inspections#food-inspections.csv
Download the full R markdown file here - https://pdonnelly3.github.io/final.Rmd
We will use both linear and logistic regression models to check whether there is a relationship between facility pass rates and year.
Here we load the data locally so we can manipulate it and use it later.
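A minimal setup sketch, assuming the CSV from Kaggle has been downloaded and saved as food-inspections.csv in the working directory (the file name and path are assumptions), along with the libraries used throughout this tutorial:

#libraries used throughout this tutorial
library(tidyverse)       #data wrangling and ggplot2
library(knitr)           #kable()
library(kableExtra)      #kable_styling()
library(broom)           #tidy(), glance(), augment()
library(leaflet)         #interactive maps
library(leaflet.extras)  #addFullscreenControl()
library(jtools)          #plot_summs()
library(sjPlot)          #plot_model()

#read the inspections CSV (path is an assumption)
chicago <- read_csv("food-inspections.csv")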
Let’s take a look at a small portion of our data.
chicago[1:5, 1:22] %>%
  mutate(Violations = substr(Violations, 1, 50)) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Inspection ID | DBA Name | AKA Name | License # | Facility Type | Risk | Address | City | State | Zip | Inspection Date | Inspection Type | Results | Violations | Latitude | Longitude | Location | Historical Wards 2003-2015 | Zip Codes | Community Areas | Census Tracts | Wards
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2288309 | BLACK EAGLE CLUB | BLACK EAGLE CLUB | 2653260 | Restaurant | Risk 1 (High) | 1938 W IRVING PARK RD | CHICAGO | IL | 60613 | 2019-05-07 | License | Fail | | 41.95428 | -87.67788 | {'longitude': '-87.67788273287823', 'latitude': '41.95427796120616'} | 13 | 21186 | 46 | 622 | 18
2288284 | NIBBLES & NOSH/MIDNIGHT MAC & CHEESERIE/XO MARSHMA | NIBBLES & NOSH/MIDNIGHT MAC & CHEESERIE/XO MARSHMA | 2658934 | Restaurant | Risk 1 (High) | 6977-6981 N SHERIDAN RD | CHICAGO | IL | 60626 | 2019-05-07 | License | Pass w/ Conditions | | 42.00903 | -87.66189 | {'longitude': '-87.66188729620244', 'latitude': '42.00902867704194'} | 3 | 21853 | 10 | 267 | 5
2288315 | ROBERT'S PIZZA AND DOUGH COMPANY | ROBERT'S PIZZA AND DOUGH COMPANY | 2617087 | Restaurant | Risk 1 (High) | 411 E ILLINOIS ST | CHICAGO | IL | 60611 | 2019-05-07 | License | Pass w/ Conditions | | 41.89095 | -87.61719 | {'longitude': '-87.61718975017543', 'latitude': '41.89095012024265'} | 22 | 21182 | 37 | 534 | 36
2288303 | THE FAT SHALLOT | THE FAT SHALLOT | 2373569 | Mobile Food Preparer | Risk 2 (Medium) | 2300 S THROOP ST | CHICAGO | IL | 60608 | 2019-05-07 | License | Fail | | 41.85045 | -87.65880 | {'longitude': '-87.65879785567869', 'latitude': '41.85045102427'} | 8 | 14920 | 33 | 126 | 26
2288325 | ROBERT'S PIZZA AND DOUGH COMPANY | ROBERT'S PIZZA AND DOUGH COMPANY | 2617143 | Restaurant | Risk 1 (High) | 411 E ILLINOIS ST | CHICAGO | IL | 60611 | 2019-05-07 | License | Pass w/ Conditions | NA | 41.89095 | -87.61719 | {'longitude': '-87.61718975017543', 'latitude': '41.89095012024265'} | 22 | 21182 | 37 | 534 | 36
Okay, great. We have the names of the establishments and other relevant information about each inspection, including but not limited to the address, risk level, inspection result, and violations.
We are only interested in establishments that either pass, pass with conditions, or fail, so we will remove any business whose result does not fall into one of these categories.
We remove the data from 2019 because 2019 only has five months' worth of data, whereas each year from 2010 through 2018 has a full twelve months.
New columns are created to help with later computations. For example, a year column is added by extracting the year from the Inspection Date column.
A Risk Value column is added to discretize risk. The Risk column has 3 categories: "Risk 1 (High)", "Risk 2 (Medium)", and "Risk 3 (Low)". We create a new column that converts these 3 categories into 3 numeric values: 3 represents Risk 1 (High), 2 represents Risk 2 (Medium), and 1 represents Risk 3 (Low).
We also create a new column called Pass that converts the 3 inspection result categories into 2 numeric values: if the establishment fails inspection, the score is 0; if the establishment passes or passes with conditions, the score is 1.
chicago <- chicago %>%
  #keep only inspections that resulted in Pass, Pass w/ Conditions, or Fail,
  #and drop rows with an "All" risk level or a missing facility type
  filter(Results %in% c("Pass", "Pass w/ Conditions", "Fail"),
         Risk != "All",
         !is.na(`Facility Type`)) %>%
  #make a column for whether they passed (1) or failed (0)
  mutate(Pass = ifelse(Results %in% c("Pass", "Pass w/ Conditions"), 1, 0)) %>%
  #grab the year of the inspection date using a regular expression
  mutate(year = as.numeric(gsub("(^[0-9]{4})(-[0-9]{2}-)([0-9]{2})", "\\1", `Inspection Date`))) %>%
  #drop the partial year
  filter(year != 2019) %>%
  #add a numeric risk value: 3 = High, 2 = Medium, 1 = Low
  mutate(`Risk Value` = ifelse(Risk == "Risk 1 (High)", 3,
                        ifelse(Risk == "Risk 2 (Medium)", 2, 1)))

#rename the column Facility Type to Facility
chicago <- rename(chicago, Facility = `Facility Type`)
Here is a small snippet of what our new dataset looks like.
chicago[1:5, 1:25] %>%
  mutate(Violations = substr(Violations, 1, 50)) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Inspection ID | DBA Name | AKA Name | License # | Facility | Risk | Address | City | State | Zip | Inspection Date | Inspection Type | Results | Violations | Latitude | Longitude | Location | Historical Wards 2003-2015 | Zip Codes | Community Areas | Census Tracts | Wards | Pass | year | Risk Value
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2243951 | DONGPO IMPRESSION | DONGPO IMPRESSION | 2551932 | Restaurant | Risk 1 (High) | 228 W CERMAK RD | CHICAGO | IL | 60616 | 2018-12-31 | Complaint | Fail | | 41.85294 | -87.63322 | {'longitude': '-87.6332244476211', 'latitude': '41.85294483663146'} | 8 | 21194 | 35 | 3 | 26 | 0 | 2018 | 3
2243947 | DOMINO'S | DOMINO'S | 2637330 | Restaurant | Risk 2 (Medium) | 509 N ORLEANS ST | CHICAGO | IL | 60654 | 2018-12-31 | License | Pass | NA | 41.89108 | -87.63687 | {'longitude': '-87.63687267718794', 'latitude': '41.8910754076385'} | 22 | 4446 | 37 | 652 | 36 | 1 | 2018 | 2
2243952 | RANALLIS | RANALLIS | 1738099 | Restaurant | Risk 1 (High) | 1508-1512 W BERWYN AVE | CHICAGO | IL | 60640 | 2018-12-31 | Canvass | Pass w/ Conditions | | 41.97812 | -87.66876 | {'longitude': '-87.6687558692362', 'latitude': '41.97812335776488'} | 46 | 22616 | 76 | 564 | 24 | 1 | 2018 | 3
2243942 | DUNKIN DONUTS | DUNKIN DONUTS | 2536449 | Restaurant | Risk 2 (Medium) | 309 W CHICAGO AVE | CHICAGO | IL | 60654 | 2018-12-31 | Canvass | Pass w/ Conditions | | 41.89646 | -87.63610 | {'longitude': '-87.63609804227903', 'latitude': '41.8964577550414'} | 22 | 4446 | 37 | 652 | 36 | 1 | 2018 | 2
2243937 | WALGREENS #2025 | WALGREENS #2025 | 18593 | Grocery Store | Risk 3 (Low) | 3000 S HALSTED ST | CHICAGO | IL | 60608 | 2018-12-31 | Canvass | Pass | | 41.83977 | -87.64647 | {'longitude': '-87.64646751112869', 'latitude': '41.839771946583845'} | 26 | 14920 | 58 | 59 | 48 | 1 | 2018 | 1
Visualization
In this section, we visualize the data.
Below, we create an interactive map of Chicago. You will be able to scroll around the map and see which businesses pass or fail inspection, which have a high or low risk level, where they are located, and what they are called.
To get a better understanding of the data, we sample 2000 entries.
Here is a quick guide on how to interpret the map below:

The icon indicates the inspection result:
- Pass: thumbs-up
- Fail: thumbs-down
- Pass with Conditions: check-circle

The marker color indicates the risk value (matching the code below):
- High = 3 = red
- Low = 1 = white
- Medium = 2 = blue

The business name pops up when an icon is clicked.
The cluster colors bear no significance to the data.
#sample 2000 entries with valid coordinates
dat <- chicago %>%
  filter(!is.na(Longitude)) %>%
  sample_n(2000)

#if a facility passes, the icon is a thumbs-up; if it fails, a thumbs-down; otherwise, a check-circle
getResult <- function(dat) {
  ifelse(dat$Results == "Pass", 'thumbs-up',
         ifelse(dat$Results == "Fail", 'thumbs-down', 'check-circle'))
}

#if the facility risk value is high (3) the color is red; if it is low (1), white; otherwise, blue
getRiskValue <- function(dat) {
  ifelse(dat$`Risk Value` == 3, 'red',
         ifelse(dat$`Risk Value` == 1, 'white', 'blue'))
}

#build the marker icons from the result and risk value of each sampled inspection
icons <- awesomeIcons(
  icon = getResult(dat),
  markerColor = getRiskValue(dat),
  library = 'fa'
)
leaflet(dat) %>%
addTiles() %>%
addAwesomeMarkers(~Longitude,
~Latitude,
icon = icons,
clusterOptions = markerClusterOptions(),
popup = ~(`DBA Name`)) %>%
addFullscreenControl()
chicago %>%
  ggplot(aes(x = Results, fill = Results)) +
  geom_bar() +
  labs(title = "Total Results for 2010 - 2018")
That looks better! We can see that, overall, there are more passes than fails or passes with conditions. However, this plot aggregates all results from 2010 - 2018.
To get a better understanding of how businesses are performing over the years, let’s break down the results by year. Each plot will represent one year.
chicago %>%
  #create 9 time periods by year
  mutate(discrete_period = factor(year)) %>%
  ggplot(aes(x = Results, fill = Results)) +
  geom_bar() +
  facet_wrap(~discrete_period) +
  theme(legend.position = "top") +
  #add the title
  ggtitle("Results vs Year")
We can now get a better idea of the number of passing businesses per year. As the plots above show, passing businesses consistently outnumber failing businesses, and passes with conditions increase across the years.
Now, let's look at how the risk levels (high, medium, and low) break down by year, using a graph similar to the one above.
chicago %>%
  #create 9 time periods by year
  mutate(discrete_period = factor(year)) %>%
  #geom_bar() counts by the x aesthetic, so no explicit group_by is needed
  ggplot(aes(x = Risk, fill = Risk)) +
  geom_bar() +
  facet_wrap(~discrete_period) +
  theme(legend.position = "top") +
  ggtitle("Risk vs Year")
Overall, more facilities are high risk as opposed to medium/low risk.
Instead of visualizing only the total number of passing facilities per year, we will plot the mean pass rate per year against the total number of passes per year.
Adding geom_smooth() overlays a fitted trend line on the plot, making the pattern easier to see.
chicago %>%
  group_by(year) %>%
  summarize(passMean = mean(Pass), total = sum(Pass)) %>%
  ggplot(aes(x = total, y = passMean)) +
  geom_point() +
  theme_minimal() +
  geom_smooth() +
  labs(x = "Total Passes", y = "Mean of Passes", title = "Mean of Passes vs Total Passes")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
From the above graph, it is evident that the years with more total passes also tend to have higher mean pass rates.
Let’s dig deeper!
Let’s see if there is a relationship between facility type and passing rates. We will first examine the distribution of passing bakeries in Chicago.
chicago %>%
  filter(Facility == "Bakery") %>%
  group_by(year) %>%
  summarize(mean_pass = mean(Pass)) %>%
  ggplot(aes(x = year, y = mean_pass)) +
  geom_point() +
  theme_minimal() +
  geom_smooth() +
  labs(y = "Mean of Passing Bakeries", x = "Year", title = "Mean of Passing Bakeries vs Year")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
From the above plot, bakeries tend to pass about 75% of inspections each year. This leads us to believe that if there is a relationship between bakery pass rates and year, it may not be a linear relationship.
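As a quick sanity check of that roughly 75% figure, we can compute the overall bakery pass rate directly (a sketch using the columns defined above):

#overall bakery pass rate across 2010-2018
chicago %>%
  filter(Facility == "Bakery") %>%
  summarize(pass_rate = mean(Pass))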
Let’s examine another facility type: Grocery Stores
Since there does not appear to be a linear relationship between bakery pass rates and year, we will plot the distribution of grocery stores passing inspections over time and see whether there is a linear relationship.
chicago %>%
  filter(Facility == "Grocery Store") %>%
  group_by(year) %>%
  summarize(mean_pass = mean(Pass)) %>%
  ggplot(aes(x = year, y = mean_pass)) +
  geom_point() +
  theme_minimal() +
  geom_smooth() +
  labs(x = "Year", y = "Mean of Passing Grocery Stores", title = "Mean of Passing Grocery Stores vs Year")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
There seems to be a relationship between grocery store pass rates and time, but it does not appear to be linear.
Let’s examine another facility type: Restaurants
Since there does not appear to be a linear relationship between grocery store pass rates and year, we will plot the distribution of restaurants passing inspections over time and see whether there is a linear relationship.
chicago %>%
  filter(Facility == "Restaurant") %>%
  group_by(year) %>%
  summarize(mean_pass = mean(Pass)) %>%
  ggplot(aes(x = year, y = mean_pass)) +
  geom_point() +
  theme_minimal() +
  geom_smooth() +
  labs(x = "Year", y = "Mean of Passing Restaurants", title = "Mean of Passing Restaurants vs Year")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
There seems to be a relationship between restaurant pass rates and time, but it does not appear to be linear.
We will test this by fitting a linear regression model to confirm whether this relationship is non-linear. If it turns out to be non-linear, we will explore a logistic regression model to see if we get better results!
Based on the above facility type plots, we believe that there is no linear correlation between facility type pass rates and year. The standard way to test this is hypothesis testing: we state a null hypothesis and check whether the data allows us to reject it.
Our null hypothesis is that there is no linear relationship between facility type pass rates and year; that is, the slope on year is zero.
To test our hypothesis, we build a linear model of facility pass rates against year, looking only at facility types that are restaurants, bakeries, or grocery stores. Linear regression models are a staple of data analysis: they allow us to construct confidence intervals and perform hypothesis tests on the relationship between variables.
In order to test only restaurants, bakeries, and grocery stores, we create a new dataset from the Chicago dataset, grouped by year and facility type. From this new dataset, we fit a linear regression model of mean_pass against year.
stores_mean <- chicago %>%
  filter(Facility == "Grocery Store" | Facility == "Bakery" | Facility == "Restaurant") %>%
  group_by(year, Facility) %>%
  summarize(mean_pass = mean(Pass)) %>%
  select(year, mean_pass)

mean_lm <- lm(mean_pass ~ year, data = stores_mean)
summary(mean_lm)
##
## Call:
## lm(formula = mean_pass ~ year, data = stores_mean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.07715 -0.01855 0.01004 0.02319 0.05828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.566255 5.630666 -0.989 0.332
## year 0.003141 0.002796 1.123 0.272
##
## Residual standard error: 0.03751 on 25 degrees of freedom
## Multiple R-squared: 0.04805, Adjusted R-squared: 0.009977
## F-statistic: 1.262 on 1 and 25 DF, p-value: 0.2719
Confidence Interval
Let's construct a confidence interval so we can say how precise we think our estimate of the regression slope is.
Using the tidy function provided by the broom package, we will extract a standard error estimate for the year coefficient and use it to construct a 95% confidence interval.
#gather the stats from mean_lm using tidy(), which is provided by the broom package
stats <- mean_lm %>%
  tidy() %>%
  select(term, estimate, std.error)
stats
## # A tibble: 2 x 3
## term estimate std.error
## <chr> <dbl> <dbl>
## 1 (Intercept) -5.57 5.63
## 2 year 0.00314 0.00280
Now let’s construct the confidence interval using the stats we gathered above.
#1.96 is the 97.5th percentile of the standard normal distribution
confidence_interval_offset <- 1.96 * stats$std.error[2]
confidence_interval <- round(c(stats$estimate[2] - confidence_interval_offset,
                               stats$estimate[2],
                               stats$estimate[2] + confidence_interval_offset), 4)
confidence_interval
## [1] -0.0023 0.0031 0.0086
Based on this confidence interval, we can say that, on average, the facility pass rate changes by somewhere between -0.0023 and 0.0086 per year, with a point estimate of 0.0031. Because the interval contains zero, we cannot rule out that there is no yearly trend at all.
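As a cross-check, base R's confint() computes a t-based interval for the slope directly; with only 25 residual degrees of freedom, it will be slightly wider than the normal-approximation interval above:

#t-based 95% confidence interval for the year coefficient
confint(mean_lm, "year", level = 0.95)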
Global Statistics
Now using the glance function provided by the broom package, we will get some global statistics from our regression model.
mean_lm %>%
  glance() %>%
  select(r.squared, sigma, statistic, df, p.value)
## # A tibble: 1 x 5
## r.squared sigma statistic df p.value
## <dbl> <dbl> <dbl> <int> <dbl>
## 1 0.0481 0.0375 1.26 2 0.272
R-squared is used to evaluate how well the model fits the data. It identifies how much of the variance in the predicted (dependent) variable can be explained by the independent variable. For example, our R-squared value of 0.04805467 means that year explains only about 4.8% of the variation in the outcome. This indicates there is close to no linear relationship between a facility passing and year, so we fail to reject our null hypothesis of no linear relationship.
Skew
Let's check whether the data is skewed; that is, let's check whether the first and third quartiles of the residuals are equidistant from the median.
#extract the quartiles of the residuals from the linear model
quantiles <- quantile(mean_lm$residuals)
quantiles
## 0% 25% 50% 75% 100%
## -0.07715287 -0.01855324 0.01003876 0.02318608 0.05828431
To determine whether there is skew, we check that the first quartile (the 25% column) and the third quartile (the 75% column) are equidistant from the median. To do that, subtract the median from the third quartile and check whether that value equals the median minus the first quartile.
In short we have to check:
75% column - median = median - 25% column
0.02318608 - 0.01003876 =? 0.01003876 - (-0.01855324)
0.01314732 =? 0.028592
Since the right side is not equal to the left side, and the distance to the first quartile is the larger of the two, the residuals are slightly left-skewed.
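We can also compute both quartile-to-median distances directly from the quantiles we extracted above (a small sketch):

#distance from the median to each quartile; unequal distances suggest skew
unname(quantiles["75%"] - quantiles["50%"])  #upper distance, ~0.0131
unname(quantiles["50%"] - quantiles["25%"])  #lower distance, ~0.0286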
Augment
Next we will use the augment function, which adds per-observation information about the fitted model, such as fitted values and residuals.
mean_augment <- mean_lm %>%
  augment()

mean_augment %>%
  head()
## # A tibble: 6 x 9
## mean_pass year .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.722 2010 0.747 0.0133 -0.0247 0.126 0.0379 0.0356 -0.703
## 2 0.684 2010 0.747 0.0133 -0.0624 0.126 0.0358 0.228 -1.78
## 3 0.761 2010 0.747 0.0133 0.0139 0.126 0.0382 0.0113 0.396
## 4 0.765 2011 0.750 0.0111 0.0151 0.0870 0.0381 0.00851 0.423
## 5 0.673 2011 0.750 0.0111 -0.0772 0.0870 0.0346 0.221 -2.15
## 6 0.770 2011 0.750 0.0111 0.0207 0.0870 0.0380 0.0159 0.577
Test for Linearity
Now that we have made our model and processed some of the data, we will plot it. Plotting the residuals against the fitted values helps us see whether there is a linear relationship between the mean pass rate and year: if the relationship is linear, we should expect the residuals to cluster randomly around 0.
mean_augment %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth() +
  labs(x = "fitted", y = "residual", title = "Residuals vs Fitted")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
As we can see above, the residuals do not cluster around 0. Therefore, we can say that there is no linear relationship between facility type pass rates and year. Now we will try using a logistic regression model to see if we get better results.
Before we make our logistic regression model, we are going to build a dataset called facilities that only contains information on bakeries, grocery stores, and restaurants. This helps our model by removing unwanted information.
facilities <- chicago %>%
  select(year, Facility, Results, Pass, `DBA Name`, `Inspection Date`, `Zip Codes`) %>%
  filter(Facility == "Bakery" | Facility == "Grocery Store" | Facility == "Restaurant")

head(facilities)
## # A tibble: 6 x 7
## year Facility Results Pass `DBA Name` `Inspection Date` `Zip Codes`
## <dbl> <chr> <chr> <dbl> <chr> <dttm> <dbl>
## 1 2018 Restaura~ Fail 0 DONGPO IM~ 2018-12-31 00:00:00 21194
## 2 2018 Restaura~ Pass 1 DOMINO'S 2018-12-31 00:00:00 4446
## 3 2018 Restaura~ Pass w/~ 1 RANALLIS 2018-12-31 00:00:00 22616
## 4 2018 Restaura~ Pass w/~ 1 DUNKIN DO~ 2018-12-31 00:00:00 4446
## 5 2018 Grocery ~ Pass 1 WALGREENS~ 2018-12-31 00:00:00 14920
## 6 2018 Restaura~ Pass w/~ 1 DOMINO'S 2018-12-31 00:00:00 22615
Great! Now that we have those attributes, we want to know whether there is a relationship between year and passing, with facility type included as a factor in the model. This time we make a logistic regression model instead of a linear regression model, so we use the glm function instead of lm.
#we are checking whether year and facility type are associated with passing
logistic_model = glm(formula = Pass ~ year + Facility, data = facilities, family = "binomial")
summary(logistic_model)
##
## Call:
## glm(formula = Pass ~ year + Facility, family = "binomial", data = facilities)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7990 0.6648 0.6817 0.7047 0.8142
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -36.51254 5.41767 -6.740 1.59e-11 ***
## year 0.01870 0.00269 6.951 3.64e-12 ***
## FacilityGrocery Store -0.13362 0.05093 -2.624 0.008700 **
## FacilityRestaurant 0.17998 0.04904 3.670 0.000243 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 136310 on 129311 degrees of freedom
## Residual deviance: 135921 on 129308 degrees of freedom
## AIC: 135929
##
## Number of Fisher Scoring iterations: 4
The null deviance is very large, and our predictors reduce it only slightly (a residual deviance of 135,921 against a null deviance of 136,310), indicating that year and facility type alone do not explain the outcome very well.
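One rough way to quantify this is McFadden's pseudo-R-squared, which compares the fitted model's deviance to the null deviance (a sketch; glm objects store both values):

#McFadden's pseudo-R-squared: values near 0 mean the predictors add little
#here it works out to roughly 0.003
1 - logistic_model$deviance / logistic_model$null.deviance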
Interaction Model
Now we are going to make an interaction model between year and the facility types. Notice that this time we use * instead of +: Pass ~ Facility * year expands to the two main effects plus the Facility:year interaction terms.
#this time we include the interaction between facility type and year
interaction_logistic_model = glm(formula = Pass ~ Facility * year, data = facilities, family = "binomial")
summary(interaction_logistic_model)
##
## Call:
## glm(formula = Pass ~ Facility * year, family = "binomial", data = facilities)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7880 0.6722 0.6835 0.6987 0.8513
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.318e+00 3.930e+01 0.059 0.9530
## FacilityGrocery Store -9.903e+01 4.134e+01 -2.396 0.0166 *
## FacilityRestaurant -2.605e+01 3.977e+01 -0.655 0.5124
## year -5.836e-04 1.951e-02 -0.030 0.9761
## FacilityGrocery Store:year 4.911e-02 2.052e-02 2.393 0.0167 *
## FacilityRestaurant:year 1.303e-02 1.975e-02 0.660 0.5095
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 136310 on 129311 degrees of freedom
## Residual deviance: 135893 on 129306 degrees of freedom
## AIC: 135905
##
## Number of Fisher Scoring iterations: 4
Comparing the residual deviance of the interaction model to that of our first logistic model, the interaction model's deviance is smaller, signifying that the interaction model fits the data better.
Using the anova function, we can compare the two models more clearly.
anova(logistic_model, interaction_logistic_model)
## Analysis of Deviance Table
##
## Model 1: Pass ~ year + Facility
## Model 2: Pass ~ Facility * year
## Resid. Df Resid. Dev Df Deviance
## 1 129308 135921
## 2 129306 135893 2 27.398
This further supports the conclusion that the interaction model is an improvement.
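If we want a formal significance test for this comparison, anova() on nested glm models accepts a test argument; a likelihood-ratio (chi-squared) test on the deviance difference of 27.4 with 2 degrees of freedom yields a very small p-value:

#likelihood-ratio test between the two nested models
anova(logistic_model, interaction_logistic_model, test = "Chisq")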
To better visualize our model, we will plot the distribution of it.
plot_summs(interaction_logistic_model, scale = TRUE, plot.distributions = TRUE, inner_ci_level = .9)
And we will also plot a comparison of distributions between the two models.
plot_summs(interaction_logistic_model, logistic_model, scale = TRUE, plot.distributions = TRUE)
Let’s see what stats are produced by this interaction model.
interaction_logistic_model_stats <- interaction_logistic_model %>%
  tidy()
interaction_logistic_model_stats
## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.32 39.3 0.0590 0.953
## 2 FacilityGrocery Store -99.0 41.3 -2.40 0.0166
## 3 FacilityRestaurant -26.1 39.8 -0.655 0.512
## 4 year -0.000584 0.0195 -0.0299 0.976
## 5 FacilityGrocery Store:year 0.0491 0.0205 2.39 0.0167
## 6 FacilityRestaurant:year 0.0130 0.0197 0.660 0.509
Notice that one interaction is missing: the interaction between Bakery and year. Bakery is the baseline for our estimates. This means that, in the estimate column, a positive number is an estimate higher than the Bakery baseline and a negative number is one lower than it.
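To see why Bakery is the baseline: R orders factor levels alphabetically by default, so "Bakery" comes first and becomes the reference level the other coefficients are measured against (a quick check):

#"Bakery" sorts first alphabetically, so it is the reference level
levels(factor(facilities$Facility))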
Now we are going to compute the per-year increase for each facility type from the interaction terms. We will use the year estimate as our base, build a data frame of the interaction estimates, and add each one to the base year slope to get the facility-specific slope.
Since there are only a few estimates here, we could simply make a variable for each one, but with many estimates that approach would not scale.
#the base year slope (the 4th row of the tidy output)
year <- interaction_logistic_model_stats$estimate[4]

#pull the two interaction estimates and label them by facility
estimate <- interaction_logistic_model_stats %>%
  select(Facility = term, estimate) %>%
  slice(5:6)

estimate_df <- data.frame(estimate)
estimate_df <- estimate_df %>%
  mutate(increase = year + estimate)
estimate_df
## Facility estimate increase
## 1 FacilityGrocery Store:year 0.0491134 0.04852979
## 2 FacilityRestaurant:year 0.0130257 0.01244209
With Bakery as the baseline, the log-odds of a grocery store passing increase by about 0.0485 per year and the log-odds of a restaurant passing increase by about 0.0124 per year, compared with essentially no change per year for bakeries.
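Because these slopes are on the log-odds scale, exponentiating them gives a more interpretable quantity: the multiplicative change in the odds of passing per year (a sketch using the increase column computed above):

#exp(slope) is the multiplicative change in the odds of passing per year
exp(estimate_df$increase)  #roughly 1.050 for grocery stores, 1.013 for restaurants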
Now we are going to use the augment function again to obtain residual and fitted values and see how well the model performs.
augmented_logistic <- interaction_logistic_model %>%
  augment()

augmented_logistic %>%
  head()
## # A tibble: 6 x 10
## Pass Facility year .fitted .se.fit .resid .hat .sigma .cooksd
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 Restaur~ 2018 1.37 0.0140 -1.79 3.15e-5 1.03 2.07e-5
## 2 1 Restaur~ 2018 1.37 0.0140 0.672 3.15e-5 1.03 1.33e-6
## 3 1 Restaur~ 2018 1.37 0.0140 0.672 3.15e-5 1.03 1.33e-6
## 4 1 Restaur~ 2018 1.37 0.0140 0.672 3.15e-5 1.03 1.33e-6
## 5 1 Grocery~ 2018 1.22 0.0326 0.720 1.88e-4 1.03 9.27e-6
## 6 1 Restaur~ 2018 1.37 0.0140 0.672 3.15e-5 1.03 1.33e-6
## # ... with 1 more variable: .std.resid <dbl>
Now let's make two plots using plot_model, provided by the sjPlot and sjmisc packages.
The first shows the marginal effects of the interaction terms.
#The Interaction
plot_model(interaction_logistic_model, type="int")
The probability of a restaurant passing increases by roughly 2% over the 8-year span, the probability of a grocery store passing increases by roughly 18%, and the probability of a bakery passing stays about the same.
The next plot shows the predicted values (marginal effects) for specific model terms.
#The Interaction
plot_model(interaction_logistic_model, type="pred", terms=c("year", "Facility"))
We can see that there is an upward trend in grocery stores and restaurants passing, while bakeries stay about the same but maintain a high passing rate.
Now let's make a prediction for our three facility types passing in 2019! We will make a small data frame called test_data and call the predict() function.
#predict() only needs the predictor columns, Facility and year
test_data <- data.frame(Facility = c("Bakery", "Restaurant", "Grocery Store"), year = 2019)
predict(interaction_logistic_model, test_data, type = "response")
## 1 2 3
## 0.7575376 0.7998017 0.7799125
Results
This prediction tells us that, in 2019, a bakery has about a 76% chance of passing inspection, a restaurant about an 80% chance, and a grocery store about a 78% chance.
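As a sanity check, we can reconstruct the bakery prediction by hand from the model coefficients: Bakery is the baseline level, so only the intercept and the year slope apply, and plogis() converts log-odds to a probability:

#manual check of the bakery prediction for 2019
b <- coef(interaction_logistic_model)
plogis(b["(Intercept)"] + b["year"] * 2019)  #~0.7575, matching predict()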
We began our tutorial by asking whether there is a linear relationship between facility pass rates and year. Through our linear regression model, we found that there is no such linear relationship. We then tested our hypothesis with a logistic regression model, which performed better than the linear model but still not as well as we had hoped. We then tried an interaction model, and this proved to be a much better prediction model. From our data analysis, we conclude that restaurants will pass at a higher rate than either bakeries or grocery stores in 2019.