7 Lab VI: Goodness of Fit & Multiple Regression

## Packages
library(tidyverse)
library(tidymodels)
library(mosaic)

## Data
load("~/GOVT5001/Lab VI/sh.RData")
house <- SaratogaHouses

## Model
model1 <- lm(price~bedrooms, data=house)

7.1 \(R^2\)

\[1-\frac{SSR}{TSS} = \frac{\sum\epsilon^2}{\sum(y_{i}-\bar{y})^2}=\frac{MSS}{TSS}=\frac{\sum(\hat{y_{i}}-\bar{y})^2}{\sum(y_{i}-\bar{y})^2}\]

  • Coefficient of Determination
  • Proportion of the Variance in \(Y_{i}\) explained by \(X_{i}\)
  • “Goodness of Fit”
##1-SSR/TSS
1-sum((model1$residuals)^2)/
sum((house$price-mean(house$price))^2)
## [1] 0.1602791
## MSS/TSS
sum((model1$fitted.values-mean(house$price))^2)/
  sum((house$price-mean(house$price))^2)
## [1] 0.1602791
summary(model1)$r.squared
## [1] 0.1602791

\[R^2=Cor(Y_i,\hat{Y_i})\]

  • Squared Correlation Between Observed Y and Fitted Y
mod_aug <- augment(model1)

(cor(mod_aug$.fitted, mod_aug$price)^2)
## [1] 0.1602791
  • \(R^2\) as proportional reduction in error
## Constant Only model
null_model <- lm(house$price~1)

## The intercept is just the mean of Y
summary(null_model)$coefficient
##             Estimate Std. Error  t value Pr(>|t|)
## (Intercept) 211966.7   2368.132 89.50798        0
mean(house$price)
## [1] 211966.7
## R Squared is 0
summary(null_model)$r.squared
## [1] 0
## Squared of Model 1
summary(model1)$r.squared
## [1] 0.1602791
  • The number of bedrooms that a house has explains 16% of the variation in the price of a house

7.2 Mean Squared Error & Root Mean Squared Error

\[\hat{MSE} = \frac{1}{n}\sum^{n}_{i=1}(Y_{i}-\hat{Y_{i}})^2\]

  • Average distance from the line-of-best-fit

  • The root mean square error, or RMSE, is calculated by taking the square root of the MSE. Here is the formula: \[\hat{RMSE} = \sqrt{\frac{1}{n}\sum^{n}_{i=1}(Y_{i}-\hat{Y_{i}})^2}\]

  • The average prediction error in units of Y

fit <- lm(Mustangs$Price ~ Mustangs$Miles)

Mustangs$predicted <- predict(fit)
Mustangs$residuals <- residuals(fit)

Mustangs %>%
  ggplot(aes(Miles, Price)) +
  geom_point(col = "black", size = 3) +
  geom_point(aes(y = predicted), shape = 1, size = 3) +
  geom_segment(aes(xend = Miles, yend = predicted)) +
  theme_bw() +
  geom_smooth(method = "lm", se = F) +
  labs(x = "X", y = "Y")

  • Take the square root of the mean of the distance of the solid black lines for RMSE
## MAE of Model 1
summary(model1)$sigma
## [1] 90234.16
## Manually
sqrt(mean((mod_aug$.fitted - mod_aug$price)^2))
## [1] 90181.93

7.3 Multiple Regression

\[Price_{i}=\beta_{0} + \beta_{1}Bedrooms_{i} + \beta_{2}Bathrooms_{i} + \epsilon_{i}\]

lm(price~bedrooms + bathrooms, data=house)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms, data = house)
## 
## Coefficients:
## (Intercept)     bedrooms    bathrooms  
##        2193        19325        78316

7.4 Adjusted \(R^2\)

\[Adjusted\space R^2 = 1- \frac{(1-R^2)(N-1)}{(N-k-1)}\]

  • \(R^2\) is just \(R^2\)
  • \(N\) is the sample size
  • \(k\) is the number of independent variables
  • Penalizes us for just throwing more variables at the model
  • Proportion of variation in \(Y_{i}\) explained by the model
model2 <- lm(price~bedrooms + bathrooms, data=house)

1-(1-summary(model2)$r.squared)*(length(house$price)-1)/
((length(house$price)-2-1))
## [1] 0.3763283
summary(model2)$adj.r.squared
## [1] 0.3763283

7.5 Lab Questions: Congratulations! You’ve been hired as an analyst by the [insert your favorite baseball team name here]. Your first job is to make a few predictions about what your teams’ future will look like.

  • Download lab_vi.Rmd from Canvas
  • Download bbData.RData from Canvas
  • Try the lab questions!

7.5.1 Load in the bbDta.RData file.

load("~/GOVT5001/Lab VI/bbData.RData")

7.5.2 Run a regression model with attendance as the DV and wins as the IV. Save as model1.

## Simple Regression Model
model1 <- lm(home_attend ~ wins, data = bbData)
model1
## 
## Call:
## lm(formula = home_attend ~ wins, data = bbData)
## 
## Coefficients:
## (Intercept)         wins  
##     -378164        27345

7.5.3 Does winning increase attendance? If so, by how much?

## Pulling out coefficient
model1$coefficients[2]
##     wins 
## 27345.18

7.5.4 Create a scatterplot to illustrate your results from A and B.

## Hint: Run the line of code below to remove scientific notation.
options(scipen = 999999)

## Creating Plot
ggplot(bbData, aes(x = wins, y = home_attend/1000000)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) + 
  theme_bw() +
  labs(title = "Baseball Attendance by Wins", x = "Wins", 
       y = "Attendance At Home Games in Millions")

7.5.5 Does a higher number of wins increase attendance when you include runs_scored and runs_allowed into your model? What is wins’ new effect. Save the results as model2.

## Multiple Regression Model
model2 <- lm(home_attend~wins + runs_scored + runs_allowed, data=bbData)
model2
## 
## Call:
## lm(formula = home_attend ~ wins + runs_scored + runs_allowed, 
##     data = bbData)
## 
## Coefficients:
##  (Intercept)          wins   runs_scored  runs_allowed  
##      -270324          3261          4260         -1683

Wins increase attendance by 3,261 while holding all other factors equal.

7.5.6 Assess fit. What is the change in R^2 from model1 to model2?

## Examining R Squared
summary(model2)$adj.r.squared
## [1] 0.2933172
summary(model2)$adj.r.squared - summary(model1)$r.squared
## [1] 0.08311133

Model 2 explains 29% of the variation in home_attendance. The change is about * percentage points.

7.5.7 Which model do you think is best?

For our purposes, model 2 has the highest r squared, so model 2 is the best.

7.5.8 Use your preferred model to predict attendance based on 100 wins, 500 runs scored, and 360 runs allowed.

## Making Predictions
-270324 + (3261*100) + (4260*500) + (-1683*360)
## [1] 1579896
## With Code
model2$coefficients[1] + (model2$coefficients[2]*100) + (model2$coefficients[3]*500) + 
  (model2$coefficients[4]*360)
## (Intercept) 
##     1580066

7.5.9 Based on your models, how can teams increase their attendance?

Answer: Teams should win more games, score more runs, and allow less runs scored against them. Easy!

7.5.10 Your colleague argues that runs have been increasing over time. Create a scatterplot, with a line of best fit, to evaluate their claim.

## Plot
ggplot(bbData, aes(x = season, y = runs_scored)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) + 
  theme_bw() +
  labs(title = "Runs Scored Over Time", x = "Season Year", 
       y = "Runs Scored")

7.5.11 The analysis above is for the full MLB. Rerun your models and scatterplots for one team.

## Subset the data
phiData <- subset(bbData, team == "PHI")


## Models
model1b <- lm(home_attend~wins, data = phiData)
model1b
## 
## Call:
## lm(formula = home_attend ~ wins, data = phiData)
## 
## Coefficients:
## (Intercept)         wins  
##     -293337        29191
model1
## 
## Call:
## lm(formula = home_attend ~ wins, data = bbData)
## 
## Coefficients:
## (Intercept)         wins  
##     -378164        27345
model2b <- lm(home_attend~wins + runs_scored + runs_allowed, data = phiData)
model2b
## 
## Call:
## lm(formula = home_attend ~ wins + runs_scored + runs_allowed, 
##     data = phiData)
## 
## Coefficients:
##  (Intercept)          wins   runs_scored  runs_allowed  
##      1060314         12020          2783         -2745
model2
## 
## Call:
## lm(formula = home_attend ~ wins + runs_scored + runs_allowed, 
##     data = bbData)
## 
## Coefficients:
##  (Intercept)          wins   runs_scored  runs_allowed  
##      -270324          3261          4260         -1683
## Wins and Attendance
ggplot(bbData, aes(x = wins, y = home_attend/1000000)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) + 
  theme_bw() +
  labs(title = "Baseball Attendance by Wins", x = "Wins", 
       y = "Attendance At Home Games in Millions", subtitle = "Philadelphia Phillies")

## Runs Scored by Season
ggplot(phiData, aes(x = season, y = runs_scored)) +
  geom_line() +
  theme_bw() +
  labs(title = "Runs Scored Over Time", x = "Season Year", 
       y = "Runs Scored", subtitle = "Philadelphia Phillies")

## Line Plot
ggplot(phiData, aes(x = season, y = runs_scored)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) + 
  theme_bw() +
  labs(title = "Runs Scored Over Time", x = "Season Year", 
       y = "Runs Scored", subtitle = "Philadelphia Phillies")