4 2/6 | Lab III: Panel Data
# Load packages used in this session of R
library(knitr)
library(tidyverse)
library(lubridate)
library(fixest)
library(stringr)
library(readxl)
In this lab we will estimate standard panel data models on covid policy and cases/deaths in U.S. states. This is not a full-fledged analysis, but rather an initial exploration of the data that illustrates how fixed effects models work. We will also implement a difference-in-difference model to measure the effect of changing the time of an election on teacher salaries.
4.1 Load the data from OxCGRT_latest.csv
Oxford provides data on covid deaths/cases and policy variables by day by state. For more background, see this data archive or this story that uses the data.
We will use a variable called GovernmentResponseIndex. For details, see this. (And feel free to experiment with the other measures.)
4.2 Data Organization
- Load the OxCGRT_latest.csv data
- Limit it to U.S. states and DC (
CountryName == "United States" & RegionName!="NA" & RegionCode!="US_VI"
) - Select the following variables: RegionName, RegionCode, Date, GovernmentResponseIndex_Average, ConfirmedCases and ConfirmedDeaths
- Add a variable to this data frame using the following code (this variable will help us when merging below)
## Loading Covid Data
<- read_csv("Data/OxCGRT_USA_latest.csv")
covid
## Cleaning & Wrangling Covid Data
<- covid %>%
df filter(CountryName == "United States" & RegionName!="NA" & RegionCode!="US_VI") %>%
mutate(Date=ymd(Date), GovernmentResponseIndex = GovernmentResponseIndex_Average) %>%
::select(RegionName, RegionCode, Date, GovernmentResponseIndex, ConfirmedCases, ConfirmedDeaths) %>%
dplyrmutate("RegionCode" = str_replace_all(string = RegionCode, pattern = "US_", replacement = "" ))
- Show the first three variables of the first three lines of this data frame.
head(df[1:3,1:3])
## # A tibble: 3 × 3
## RegionName RegionCode Date
## <chr> <chr> <date>
## 1 Alaska AK 2020-01-01
## 2 Alaska AK 2020-01-02
## 3 Alaska AK 2020-01-03
4.3 Use the lag function in dplyr to create lagged variables for cases and deaths. Also create “difference” (e.g., dCases) that is the change in cases for each state by date. Check to make sure it worked.
## Creating lagged and change in cases
<- df %>%
df2 group_by(RegionCode) %>%
mutate(laggedcase = dplyr::lag(ConfirmedCases, order_by = Date),
laggeddeaths = dplyr::lag(ConfirmedDeaths, order_by = Date),
dcases = (ConfirmedCases-laggedcase),
ddeaths = (ConfirmedDeaths-laggeddeaths))
4.4 Merge the above data frame to data in USstates.xlsx
- Merge by state abbreviation
- Create percapita measures of change in deaths and cases (e.g. “deathsPC” = 10000*dDeaths/pop2019).
## Reading in State Data
<- read_excel("Data/USstates.xlsx")
states
## Joining and creating change in deaths and cases per capita
<- df2 %>%
df3 left_join(states, by = c("RegionCode" = "stateAbbr")) %>%
mutate(deaths_pc = 10000*ddeaths/pop2019,
cases_pc = 10000*dcases/pop2019)
4.5 Write in notation and estimate a pooled model of total cases as a function of state policy. Discuss possible bias.
The model is:
\[confirmed\_cases_{it} = \beta_0 + \beta_1average\_response\_index_{it} + \epsilon_{it}\]
## Pooled Model
<- lm(ConfirmedCases~GovernmentResponseIndex, data = df3)
pooled_covid_model summary(pooled_covid_model)
##
## Call:
## lm(formula = ConfirmedCases ~ GovernmentResponseIndex, data = df3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1632719 -624780 -388606 191235 10944694
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1730790 20055 86.30 <2e-16 ***
## GovernmentResponseIndex -18824 409 -46.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1369000 on 54823 degrees of freedom
## (1071 observations deleted due to missingness)
## Multiple R-squared: 0.0372, Adjusted R-squared: 0.03718
## F-statistic: 2118 on 1 and 54823 DF, p-value: < 2.2e-16
4.6 Write in notation and estimate a one-way fixed effect model where the fixed effect is state. Are you still worried about bias?
The model is:
\[confirmed\_cases_{it} = \beta_0 + \beta_1average\_response\_index_{it} + \alpha_{i} + \epsilon_{it}\]
## One-Way Model
<- feols(ConfirmedCases~GovernmentResponseIndex | RegionCode, data = df3,
one_way vcov = "iid")
summary(one_way)
## OLS estimation, Dep. Var.: ConfirmedCases
## Observations: 54,825
## Fixed-effects: RegionCode: 51
## Standard-errors: IID
## Estimate Std. Error t value Pr(>|t|)
## GovernmentResponseIndex -22865.8 297.783 -76.7867 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 966,135.8 Adj. R2: 0.520077
## Within R2: 0.097186
4.7 Write in notation and estimate a two-way fixed effect model where the fixed effects are state and date. Does this model address the source of bias identified earlier?
\[confirmed\_cases_{it} = \beta_0 + \beta_1average\_response\_index_{it} + \alpha_{i} + \tau_{t} + \epsilon_{it}\]
## Two-way Model
<- feols(ConfirmedCases~GovernmentResponseIndex | RegionCode + Date, data = df3,
two_way vcov = "iid")
summary(two_way)
## OLS estimation, Dep. Var.: ConfirmedCases
## Observations: 54,825
## Fixed-effects: RegionCode: 51, Date: 1,075
## Standard-errors: IID
## Estimate Std. Error t value Pr(>|t|)
## GovernmentResponseIndex -4524.85 745.754 -6.06749 1.3079e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 760,870.0 Adj. R2: 0.696389
## Within R2: 6.851e-4
4.8 Test a model with robust-clustered standard errors. Do the results change at all?
## Two-way FE with robust-clustered se
<- feols(ConfirmedCases~GovernmentResponseIndex | RegionCode + Date, data = df3,
two_way_robust vcov = "twoway")
summary(two_way_robust)
## OLS estimation, Dep. Var.: ConfirmedCases
## Observations: 54,825
## Fixed-effects: RegionCode: 51, Date: 1,075
## Standard-errors: Clustered (RegionCode & Date)
## Estimate Std. Error t value Pr(>|t|)
## GovernmentResponseIndex -4524.85 13438.6 -0.336705 0.73775
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 760,870.0 Adj. R2: 0.696389
## Within R2: 6.851e-4
Yes, the standard errors are much larger.
4.9 Load in the Texas School Board Data and take a look at the variables. The next few questions are from exercise 5 in Bailey (2021). The basic thoery of the paper is that teachers unions will have less influence on school board members when turnout is greater since a wider swath of voters (non-teachers) will be voting. You can read the abstract below.
Paper Abstract: Many governments in the United States hold elections on days other than national Election Day. Recent studies have argued that the low voter turnout that accompanies such off-cycle elections could create an advantage for interest groups. However, the endogeneity of election timing makes it difficult to estimate its causal effect on political outcomes. In this paper, I develop a theoretical framework that explains how changes to election timing affect the electoral fortunes of organized interest groups. I test the theory examining the effects of a 2006 Texas law that forced approximately 20 percent of the state’s school by districts to move their elections to the same day as national elections.
## Loading Data
load("Data/Texas_school_board.RData")