3 Lab II: Panel Data
3.0.1 Preparation
# Load packages used in this session of R
library(knitr)
library(tidyverse)
library(plm)
library(readr)
library(readxl) # Package to read Excel data
$set(echo = TRUE)
opts_chunkoptions(digits = 2)
In this lab we will estimate standard panel data models on covid policy and cases/deaths in U.S. states. This is not a full-fledged analysis, but rather an initial exploration of the data that illustrates how fixed effects models work.
3.1 Load the data from Oxford_Covid_data_US_latest.csv
Oxford provides data on covid deaths/cases and policy variables by day by state. For more background, see this data archive or this story that uses the data. We will use a variable called GovernmentResponseIndex. For details, see this. (And feel free to experiment with the other measures.)
3.2 Data Organization
- Load the Oxford_Covid_data_US_latest.csv data
- Limit it to U.S. states (CountryName == “United States)
- Create the following variables: RegionName, RegionCode, Date, GovernmentResponseIndex, ConfirmedCases and ConfirmedDeaths
- Add a variable to this data frame using the following code (this variable will help us when merging below)
mutate("stAbbrev" = str_replace_all(string = RegionCode, pattern = "US_", replacement = "" ))
- Show the first three variables of the first three lines of this data frame.
# Load and filter state stringency data
= read_csv("Data/Oxford_Covid_data_US_latest.csv") %>%
stPolicy filter(CountryName == "United States" &
is.na(RegionName) == 0 &
!= "" &
RegionName != "US_VI") %>%
RegionCode ::select(RegionName, RegionCode, Date, ContainmentHealthIndex,
dplyr
GovernmentResponseIndex, StringencyIndex, %>%
ConfirmedCases, ConfirmedDeaths) mutate("stAbbrev" = str_replace_all(string = RegionCode, pattern = "US_", replacement = "" ))
## Rows: 20724 Columns: 69
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (25): CountryName, CountryCode, RegionName, RegionCode, Jurisdiction, C1_Notes, C2_Notes, C3_Notes, C4_Notes, C5_N...
## dbl (43): Date, C1_School closing, C1_Flag, C2_Workplace closing, C2_Flag, C3_Cancel public events, C3_Flag, C4_Restri...
## lgl (1): M1_Wildcard
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Show the first three variables of first three lines
1:3, 1:3] stPolicy[
## # A tibble: 3 × 3
## RegionName RegionCode Date
## <chr> <chr> <dbl>
## 1 Alaska US_AK 20200101
## 2 Alaska US_AK 20200102
## 3 Alaska US_AK 20200103
3.3 Use the lag function in dplyr to create lagged variables for cases and deaths. Also create “difference” (e.g., dCases) that is the change in cases for each state by date.
See slide “Creating differenced data in R” in Topic 3 class slides.
## Create lagged value
= stPolicy %>%
stPolicy group_by(RegionName) %>%
mutate(lagCases = dplyr::lag(ConfirmedCases, order_by = Date),
dCases = ConfirmedCases - lagCases,
lagDeaths = dplyr::lag(ConfirmedDeaths, order_by = Date),
dDeaths = ConfirmedDeaths - lagDeaths) %>%
ungroup()
3.4 Merge the above data frame to data in USstates.xlsx
- Merge by state abbreviation
- Create per capita measures of change in deaths and cases (e.g. “deathsPC” = 10000*dDeaths/pop2019).
Check your data by looking at level and lagged data for a given state for a few years. The lagged data should match up to the previous period observation.
# Load excel data
<- read_excel("Data/USstates.xlsx", sheet = "data")
stFacts
# Merge with data frame and create per capita data
<- stPolicy %>%
dfState left_join(stFacts, by = c("stAbbrev" = "stateAbbr")) %>%
mutate("deathsPC" = 10000*dDeaths/pop2019,
"casesPC" = 10000*dCases/pop2019)
# Check data
%>% filter(RegionName == "California") %>%
dfState ::select(RegionName, Date, ConfirmedDeaths, lagDeaths, dDeaths, deathsPC) %>%
dplyrslice(245:248)
## # A tibble: 4 × 6
## RegionName Date ConfirmedDeaths lagDeaths dDeaths deathsPC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 California 20200901 13150 5044 8106 2.05
## 2 California 20200902 13317 5065 8252 2.09
## 3 California 20200903 13493 5130 8363 2.12
## 4 California 20200904 13638 5171 8467 2.14
3.5 Estimate a pooled model of total cases per capita as a function of state policy. Discuss.
.1 = lm(casesPC ~ GovernmentResponseIndex, data = dfState)
olssummary(ols.1)
##
## Call:
## lm(formula = casesPC ~ GovernmentResponseIndex, data = dfState)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7972 20 215 342 1120
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.67 14.78 -1.33 0.18
## GovernmentResponseIndex -5.04 0.31 -16.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 815 on 18547 degrees of freedom
## (1392 observations deleted due to missingness)
## Multiple R-squared: 0.0141, Adjusted R-squared: 0.014
## F-statistic: 265 on 1 and 18547 DF, p-value: <2e-16
3.6 Estimate a one-way fixed effect model where the fixed effect is state. (Note that state is indicated in a variable called RegionName.) Estimate using both LSDV and the de-meaned version in the plm package. Can you identify a source of bias?
.1 = lm(casesPC ~ GovernmentResponseIndex + factor(RegionName), data = dfState)
fe#summary(fe.1)
coefficients(summary(fe.1))[1:2,]
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 404.5 33.55 12 2.4e-33
## GovernmentResponseIndex -5.7 0.26 -22 7.5e-107
.1plm = plm(casesPC ~ GovernmentResponseIndex,
fedata = dfState,
index = c("RegionName","Date"),
model="within")
summary(fe.1plm)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = casesPC ~ GovernmentResponseIndex, data = dfState,
## model = "within", index = c("RegionName", "Date"))
##
## Unbalanced Panel: n = 51, T = 349-369, N = 18549
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -6256.0 -95.8 31.4 152.4 1915.2
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## GovernmentResponseIndex -5.657 0.256 -22.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 7.17e+09
## Residual Sum of Squares: 6.98e+09
## R-Squared: 0.0257
## Adj. R-Squared: 0.023
## F-statistic: 488.43 on 1 and 18497 DF, p-value: <2e-16
3.7 Estimate a two-way fixed effect model where the fixed effects are state and date. Estimate using both LSDV and the de-meaned version in the plm package. Does this model address the source of bias identified earlier?
.2 = lm(casesPC ~ GovernmentResponseIndex + factor(RegionName) + factor(Date), data = dfState)
fe#summary(fe.2)
# Show non-fixed effect coefficients
coefficients(summary(fe.2))[1:2,]
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 393.5 86.62 4.5 5.6e-06
## GovernmentResponseIndex -6.8 0.76 -9.0 2.4e-19
.2plm = plm(casesPC ~ GovernmentResponseIndex,
fedata = dfState,
index = c("RegionName","Date"),
model="within",
effect="twoways")
summary(fe.2plm)
## Twoways effects Within Model
##
## Call:
## plm(formula = casesPC ~ GovernmentResponseIndex, data = dfState,
## effect = "twoways", model = "within", index = c("RegionName",
## "Date"))
##
## Unbalanced Panel: n = 51, T = 349-369, N = 18549
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -5510 -239 -14 173 1736
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## GovernmentResponseIndex -6.822 0.758 -9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 6e+09
## Residual Sum of Squares: 5.97e+09
## R-Squared: 0.00445
## Adj. R-Squared: -0.0186
## F-statistic: 81.0794 on 1 and 18129 DF, p-value: <2e-16
3.7.1 This is an initial analysis. We would also want to
- think through whether it is useful to control for days of the week (there is a well-known pattern in reporting across days of the week)
- assess the data for outliers (e.g,
min(dfState$casesPC, na.rm = TRUE)
) - consider a lagged dependent variable
- For more, read this