4 2/6 | Lab III: Panel Data

# Load packages used in this session of R
library(knitr)
library(tidyverse)
library(lubridate)
library(fixest)
library(stringr)
library(readxl)

In this lab we will estimate standard panel data models on covid policy and cases/deaths in U.S. states. This is not a full-fledged analysis, but rather an initial exploration of the data that illustrates how fixed effects models work. We will also implement a difference-in-difference model to measure the effect of changing the time of an election on teacher salaries.

4.1 Load the data from OxCGRT_latest.csv

Oxford provides data on covid deaths/cases and policy variables by day by state. For more background, see this data archive or this story that uses the data.

We will use a variable called GovernmentResponseIndex. For details, see this. (And feel free to experiment with the other measures.)

4.2 Data Organization

  • Load the OxCGRT_latest.csv data
  • Limit it to U.S. states and DC (CountryName == "United States" & RegionName!="NA" & RegionCode!="US_VI")
  • Select the following variables: RegionName, RegionCode, Date, GovernmentResponseIndex_Average, ConfirmedCases and ConfirmedDeaths
  • Add a variable to this data frame using the following code (this variable will help us when merging below)
## Loading Covid Data
covid <- read_csv("Data/OxCGRT_USA_latest.csv")

## Cleaning & Wrangling Covid Data
df <- covid %>%
  filter(CountryName == "United States" & RegionName!="NA" & RegionCode!="US_VI") %>%
  mutate(Date=ymd(Date), GovernmentResponseIndex = GovernmentResponseIndex_Average) %>%
  dplyr::select(RegionName, RegionCode, Date, GovernmentResponseIndex, ConfirmedCases, ConfirmedDeaths) %>%
  mutate("RegionCode" = str_replace_all(string = RegionCode, pattern = "US_", replacement = "" ))
  • Show the first three variables of the first three lines of this data frame.
head(df[1:3,1:3])
## # A tibble: 3 × 3
##   RegionName RegionCode Date      
##   <chr>      <chr>      <date>    
## 1 Alaska     AK         2020-01-01
## 2 Alaska     AK         2020-01-02
## 3 Alaska     AK         2020-01-03

4.3 Use the lag function in dplyr to create lagged variables for cases and deaths. Also create “difference” (e.g., dCases) that is the change in cases for each state by date. Check to make sure it worked.

## Creating lagged and change in cases
df2 <- df %>%
  group_by(RegionCode) %>%
  mutate(laggedcase = dplyr::lag(ConfirmedCases, order_by = Date),
         laggeddeaths = dplyr::lag(ConfirmedDeaths, order_by = Date),
         dcases = (ConfirmedCases-laggedcase),
         ddeaths = (ConfirmedDeaths-laggeddeaths))

4.4 Merge the above data frame to data in USstates.xlsx

  • Merge by state abbreviation
  • Create percapita measures of change in deaths and cases (e.g. “deathsPC” = 10000*dDeaths/pop2019).
## Reading in State Data
states <- read_excel("Data/USstates.xlsx")

## Joining and creating change in deaths and cases per capita
df3 <- df2 %>%
  left_join(states, by = c("RegionCode" = "stateAbbr")) %>%
  mutate(deaths_pc = 10000*ddeaths/pop2019,
         cases_pc = 10000*dcases/pop2019)

4.5 Write in notation and estimate a pooled model of total cases as a function of state policy. Discuss possible bias.

The model is:

\[confirmed\_cases_{it} = \beta_0 + \beta_1average\_response\_index_{it} + \epsilon_{it}\]

## Pooled Model
pooled_covid_model <- lm(ConfirmedCases~GovernmentResponseIndex, data = df3)
summary(pooled_covid_model)
## 
## Call:
## lm(formula = ConfirmedCases ~ GovernmentResponseIndex, data = df3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1632719  -624780  -388606   191235 10944694 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1730790      20055   86.30   <2e-16 ***
## GovernmentResponseIndex   -18824        409  -46.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1369000 on 54823 degrees of freedom
##   (1071 observations deleted due to missingness)
## Multiple R-squared:  0.0372, Adjusted R-squared:  0.03718 
## F-statistic:  2118 on 1 and 54823 DF,  p-value: < 2.2e-16

4.6 Write in notation and estimate a one-way fixed effect model where the fixed effect is state. Are you still worried about bias?

The model is:

\[confirmed\_cases_{it} = \beta_0 + \beta_1average\_response\_index_{it} + \alpha_{i} + \epsilon_{it}\]

## One-Way Model
one_way <- feols(ConfirmedCases~GovernmentResponseIndex | RegionCode, data = df3,
                     vcov = "iid")
summary(one_way)
## OLS estimation, Dep. Var.: ConfirmedCases
## Observations: 54,825 
## Fixed-effects: RegionCode: 51
## Standard-errors: IID 
##                         Estimate Std. Error  t value  Pr(>|t|)    
## GovernmentResponseIndex -22865.8    297.783 -76.7867 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 966,135.8     Adj. R2: 0.520077
##                   Within R2: 0.097186

4.7 Write in notation and estimate a two-way fixed effect model where the fixed effects are state and date. Does this model address the source of bias identified earlier?

\[confirmed\_cases_{it} = \beta_0 + \beta_1average\_response\_index_{it} + \alpha_{i} + \tau_{t} + \epsilon_{it}\]

## Two-way Model
two_way <- feols(ConfirmedCases~GovernmentResponseIndex | RegionCode + Date, data = df3,
                     vcov = "iid")
summary(two_way)
## OLS estimation, Dep. Var.: ConfirmedCases
## Observations: 54,825 
## Fixed-effects: RegionCode: 51,  Date: 1,075
## Standard-errors: IID 
##                         Estimate Std. Error  t value   Pr(>|t|)    
## GovernmentResponseIndex -4524.85    745.754 -6.06749 1.3079e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 760,870.0     Adj. R2: 0.696389
##                   Within R2: 6.851e-4

4.8 Test a model with robust-clustered standard errors. Do the results change at all?

## Two-way FE with robust-clustered se
two_way_robust <- feols(ConfirmedCases~GovernmentResponseIndex | RegionCode + Date, data = df3,
                     vcov = "twoway")
summary(two_way_robust)
## OLS estimation, Dep. Var.: ConfirmedCases
## Observations: 54,825 
## Fixed-effects: RegionCode: 51,  Date: 1,075
## Standard-errors: Clustered (RegionCode & Date) 
##                         Estimate Std. Error   t value Pr(>|t|) 
## GovernmentResponseIndex -4524.85    13438.6 -0.336705  0.73775 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 760,870.0     Adj. R2: 0.696389
##                   Within R2: 6.851e-4

Yes, the standard errors are much larger.

4.9 Load in the Texas School Board Data and take a look at the variables. The next few questions are from exercise 5 in Bailey (2021). The basic thoery of the paper is that teachers unions will have less influence on school board members when turnout is greater since a wider swath of voters (non-teachers) will be voting. You can read the abstract below.

Paper Abstract: Many governments in the United States hold elections on days other than national Election Day. Recent studies have argued that the low voter turnout that accompanies such off-cycle elections could create an advantage for interest groups. However, the endogeneity of election timing makes it difficult to estimate its causal effect on political outcomes. In this paper, I develop a theoretical framework that explains how changes to election timing affect the electoral fortunes of organized interest groups. I test the theory examining the effects of a 2006 Texas law that forced approximately 20 percent of the state’s school by districts to move their elections to the same day as national elections.

## Loading Data
load("Data/Texas_school_board.RData")

4.10 Write in notation and estimate the pooled model of the effect of OnCycle on LnAvgSalary.

The model is:

\[LnAvgSalary_{it} = \beta_0 + \beta_1OnCycle_{it} + \epsilon_{it}\]

## Pooled model
dta %>% 
  lm(LnAvgSalary ~ OnCycle, .) %>%
  summary()
## 
## Call:
## lm(formula = LnAvgSalary ~ OnCycle, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28910 -0.05527 -0.00646  0.05196  0.65215 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 10.671363   0.001018 10478.815   <2e-16 ***
## OnCycle     -0.030621   0.003766    -8.131    5e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08284 on 7137 degrees of freedom
##   (64 observations deleted due to missingness)
## Multiple R-squared:  0.009178,   Adjusted R-squared:  0.009039 
## F-statistic: 66.11 on 1 and 7137 DF,  p-value: 4.995e-16

4.11 Write in notation and estimate a difference-in-difference model of the effect of OnCycle on LnAvgSalary. What is the key coefficient that we are interested in?

\[LnAvgSalary_{it} = \beta_0 + \beta_1CycleSwitch_{i} + \beta_2After_{t} + \beta_3(CycleSwitch_{i}\times After_{t}) + \epsilon_{it}\]

## DiD model
dta %>% 
  lm(LnAvgSalary ~ CycleSwitch*AfterSwitch, .) %>%
  summary()
## 
## Call:
## lm(formula = LnAvgSalary ~ CycleSwitch * AfterSwitch, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29811 -0.05595 -0.00734  0.05284  0.65245 
## 
## Coefficients:
##                          Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)             10.671068   0.001433 7447.598  < 2e-16 ***
## CycleSwitch             -0.023982   0.003397   -7.059 1.83e-12 ***
## AfterSwitch              0.009303   0.002188    4.251 2.16e-05 ***
## CycleSwitch:AfterSwitch -0.008590   0.005189   -1.655   0.0979 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08334 on 7198 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.0183, Adjusted R-squared:  0.01789 
## F-statistic: 44.72 on 3 and 7198 DF,  p-value: < 2.2e-16