3 Lab II: Panel Data

3.0.1 Preparation

# Load packages used in this session of R
library(knitr)
library(tidyverse)
library(plm)
library(readr)
library(readxl) # Package to read Excel data
opts_chunk$set(echo = TRUE)
options(digits = 2)

In this lab we will estimate standard panel data models on covid policy and cases/deaths in U.S. states. This is not a full-fledged analysis, but rather an initial exploration of the data that illustrates how fixed effects models work.

3.1 Load the data from Oxford_Covid_data_US_latest.csv

Oxford provides data on covid deaths/cases and policy variables by day by state. For more background, see this data archive or this story that uses the data. We will use a variable called GovernmentResponseIndex. For details, see this. (And feel free to experiment with the other measures.)

3.2 Data Organization

  • Load the Oxford_Covid_data_US_latest.csv data
  • Limit it to U.S. states (CountryName == “United States)
  • Create the following variables: RegionName, RegionCode, Date, GovernmentResponseIndex, ConfirmedCases and ConfirmedDeaths
  • Add a variable to this data frame using the following code (this variable will help us when merging below)
  mutate("stAbbrev" = str_replace_all(string = RegionCode, pattern = "US_", replacement = "" ))
  • Show the first three variables of the first three lines of this data frame.
# Load and filter state stringency data
stPolicy = read_csv("Data/Oxford_Covid_data_US_latest.csv") %>%
  filter(CountryName == "United States" & 
           is.na(RegionName) == 0 & 
           RegionName != "" &
           RegionCode != "US_VI") %>% 
  dplyr::select(RegionName, RegionCode, Date, ContainmentHealthIndex,
         GovernmentResponseIndex, StringencyIndex, 
         ConfirmedCases, ConfirmedDeaths) %>%
  mutate("stAbbrev" = str_replace_all(string = RegionCode, pattern = "US_", replacement = "" ))
## Rows: 20724 Columns: 69
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (25): CountryName, CountryCode, RegionName, RegionCode, Jurisdiction, C1_Notes, C2_Notes, C3_Notes, C4_Notes, C5_N...
## dbl (43): Date, C1_School closing, C1_Flag, C2_Workplace closing, C2_Flag, C3_Cancel public events, C3_Flag, C4_Restri...
## lgl  (1): M1_Wildcard
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Show the first three variables of first three lines
stPolicy[1:3, 1:3]
## # A tibble: 3 × 3
##   RegionName RegionCode     Date
##   <chr>      <chr>         <dbl>
## 1 Alaska     US_AK      20200101
## 2 Alaska     US_AK      20200102
## 3 Alaska     US_AK      20200103

3.3 Use the lag function in dplyr to create lagged variables for cases and deaths. Also create “difference” (e.g., dCases) that is the change in cases for each state by date.

See slide “Creating differenced data in R” in Topic 3 class slides.

## Create lagged value 
stPolicy = stPolicy %>% 
  group_by(RegionName) %>% 
  mutate(lagCases  = dplyr::lag(ConfirmedCases, order_by = Date),
         dCases    = ConfirmedCases - lagCases,
         lagDeaths = dplyr::lag(ConfirmedDeaths, order_by = Date),
         dDeaths   = ConfirmedDeaths - lagDeaths) %>%
  ungroup()

3.4 Merge the above data frame to data in USstates.xlsx

  • Merge by state abbreviation
  • Create per capita measures of change in deaths and cases (e.g. “deathsPC” = 10000*dDeaths/pop2019).

Check your data by looking at level and lagged data for a given state for a few years. The lagged data should match up to the previous period observation.

# Load excel data
stFacts <- read_excel("Data/USstates.xlsx", sheet = "data")

# Merge with data frame and create per capita data
dfState <- stPolicy %>%
  left_join(stFacts, by = c("stAbbrev" = "stateAbbr")) %>% 
  mutate("deathsPC" = 10000*dDeaths/pop2019, 
         "casesPC"  = 10000*dCases/pop2019) 

# Check data
dfState %>% filter(RegionName == "California") %>% 
  dplyr::select(RegionName, Date, ConfirmedDeaths, lagDeaths, dDeaths, deathsPC) %>% 
  slice(245:248)
## # A tibble: 4 × 6
##   RegionName     Date ConfirmedDeaths lagDeaths dDeaths deathsPC
##   <chr>         <dbl>           <dbl>     <dbl>   <dbl>    <dbl>
## 1 California 20200901           13150      5044    8106     2.05
## 2 California 20200902           13317      5065    8252     2.09
## 3 California 20200903           13493      5130    8363     2.12
## 4 California 20200904           13638      5171    8467     2.14

3.5 Estimate a pooled model of total cases per capita as a function of state policy. Discuss.

ols.1 = lm(casesPC ~ GovernmentResponseIndex, data = dfState)
summary(ols.1)
## 
## Call:
## lm(formula = casesPC ~ GovernmentResponseIndex, data = dfState)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7972     20    215    342   1120 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -19.67      14.78   -1.33     0.18    
## GovernmentResponseIndex    -5.04       0.31  -16.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 815 on 18547 degrees of freedom
##   (1392 observations deleted due to missingness)
## Multiple R-squared:  0.0141, Adjusted R-squared:  0.014 
## F-statistic:  265 on 1 and 18547 DF,  p-value: <2e-16

3.6 Estimate a one-way fixed effect model where the fixed effect is state. (Note that state is indicated in a variable called RegionName.) Estimate using both LSDV and the de-meaned version in the plm package. Can you identify a source of bias?

fe.1 = lm(casesPC ~ GovernmentResponseIndex + factor(RegionName), data = dfState)
#summary(fe.1)
coefficients(summary(fe.1))[1:2,]
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                404.5      33.55      12  2.4e-33
## GovernmentResponseIndex     -5.7       0.26     -22 7.5e-107
fe.1plm = plm(casesPC ~ GovernmentResponseIndex, 
  data = dfState, 
  index = c("RegionName","Date"),
  model="within")
summary(fe.1plm)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = casesPC ~ GovernmentResponseIndex, data = dfState, 
##     model = "within", index = c("RegionName", "Date"))
## 
## Unbalanced Panel: n = 51, T = 349-369, N = 18549
## 
## Residuals:
##    Min. 1st Qu.  Median 3rd Qu.    Max. 
## -6256.0   -95.8    31.4   152.4  1915.2 
## 
## Coefficients:
##                         Estimate Std. Error t-value Pr(>|t|)    
## GovernmentResponseIndex   -5.657      0.256   -22.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    7.17e+09
## Residual Sum of Squares: 6.98e+09
## R-Squared:      0.0257
## Adj. R-Squared: 0.023
## F-statistic: 488.43 on 1 and 18497 DF, p-value: <2e-16

3.7 Estimate a two-way fixed effect model where the fixed effects are state and date. Estimate using both LSDV and the de-meaned version in the plm package. Does this model address the source of bias identified earlier?

fe.2 = lm(casesPC ~ GovernmentResponseIndex + factor(RegionName) + factor(Date), data = dfState)
#summary(fe.2)
# Show non-fixed effect coefficients
coefficients(summary(fe.2))[1:2,]
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                393.5      86.62     4.5  5.6e-06
## GovernmentResponseIndex     -6.8       0.76    -9.0  2.4e-19
fe.2plm = plm(casesPC ~ GovernmentResponseIndex, 
  data = dfState, 
  index = c("RegionName","Date"),
  model="within", 
  effect="twoways")
summary(fe.2plm)
## Twoways effects Within Model
## 
## Call:
## plm(formula = casesPC ~ GovernmentResponseIndex, data = dfState, 
##     effect = "twoways", model = "within", index = c("RegionName", 
##         "Date"))
## 
## Unbalanced Panel: n = 51, T = 349-369, N = 18549
## 
## Residuals:
##    Min. 1st Qu.  Median 3rd Qu.    Max. 
##   -5510    -239     -14     173    1736 
## 
## Coefficients:
##                         Estimate Std. Error t-value Pr(>|t|)    
## GovernmentResponseIndex   -6.822      0.758      -9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    6e+09
## Residual Sum of Squares: 5.97e+09
## R-Squared:      0.00445
## Adj. R-Squared: -0.0186
## F-statistic: 81.0794 on 1 and 18129 DF, p-value: <2e-16

3.7.1 This is an initial analysis. We would also want to

  • think through whether it is useful to control for days of the week (there is a well-known pattern in reporting across days of the week)
  • assess the data for outliers (e.g, min(dfState$casesPC, na.rm = TRUE))
  • consider a lagged dependent variable
  • For more, read this