Traffic Death Analysis¶
In this exercise, we will be analyzing the effect of alcohol taxes on traffic deaths in the United States. The dataset used in this exercise, us_driving_fatalities.csv, is a state-year panel dataset (meaning it includes data on multiple states, with several years of data for each state). The data contains 336 observations on 34 variables. The variables used in the exercise are defined as follows:
state: factor variable indicating states
year: factor variable indicating years
beertax: numeric variable, the tax on a case of beer
In these exercises, we’ll be looking at how beer taxes (which are believed to reduce alcohol consumption, potentially reducing drunk driving deaths) impact car accident fatality rates.
More specifically, though, we’ll be approaching our estimation of the impact of beer taxes in a few different ways in an effort to give you more of an intuitive sense of what happens when you add fixed effects to a regression.
Exercise 1¶
Download and load the data from this link, or by going to www.github.com/nickeubank/MIDS_Data/ and downloading the us_driving_fatalities.csv dataset.
How many states does this dataset contain? What is the time frame of the dataset (from which year to which year)? And what constitutes a single observation (i.e., what is the unit of analysis for each row of the data)?
[1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
[2]:
df = pd.read_csv(
"https://media.githubusercontent.com/media/nickeubank/MIDS_Data/master/us_driving_fatalities.csv"
)
[3]:
df.head()
[3]:
Unnamed: 0 | state | year | spirits | unemp | income | emppop | beertax | baptist | mormon | ... | nfatal2124 | afatal | pop | pop1517 | pop1820 | pop2124 | milestot | unempus | emppopus | gsp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | al | 1982 | 1.37 | 14.4 | 10544.152344 | 50.692039 | 1.539379 | 30.355700 | 0.32829 | ... | 32 | 309.437988 | 3942002.25 | 208999.593750 | 221553.43750 | 290000.06250 | 28516.0 | 9.7 | 57.799999 | -0.022125 |
1 | 2 | al | 1983 | 1.36 | 13.7 | 10732.797852 | 52.147030 | 1.788991 | 30.333599 | 0.34341 | ... | 35 | 341.834015 | 3960008.00 | 202000.078125 | 219125.46875 | 290000.15625 | 31032.0 | 9.6 | 57.900002 | 0.046558 |
2 | 3 | al | 1984 | 1.32 | 11.1 | 11108.791016 | 54.168087 | 1.714286 | 30.311501 | 0.35924 | ... | 34 | 304.872009 | 3988991.75 | 196999.968750 | 216724.09375 | 288000.15625 | 32961.0 | 7.5 | 59.500004 | 0.062798 |
3 | 4 | al | 1985 | 1.28 | 8.9 | 11332.626953 | 55.271137 | 1.652542 | 30.289499 | 0.37579 | ... | 45 | 276.742004 | 4021007.75 | 194999.734375 | 214349.03125 | 284000.31250 | 35091.0 | 7.2 | 60.100002 | 0.027490 |
4 | 5 | al | 1986 | 1.23 | 9.8 | 11661.506836 | 56.514496 | 1.609907 | 30.267401 | 0.39311 | ... | 29 | 360.716003 | 4049993.75 | 203999.890625 | 212000.00000 | 263000.28125 | 36259.0 | 7.0 | 60.700001 | 0.032143 |
5 rows × 35 columns
[4]:
df.state.nunique()
[4]:
48
[5]:
df.year.min()
[5]:
1982
[6]:
df.year.max()
[6]:
1988
[7]:
# Make sure every state spans the same min and max years
assert (df.groupby("state").year.min() == df.year.min()).all()
assert (df.groupby("state").year.max() == df.year.max()).all()
[8]:
# So 48 states, running from 1982 to 1988
# One row is one state-year.
Exercise 2¶
We use the fatality rate per 10,000 people as the dependent variable. Construct this variable and name it fat_rate. Hint: you can compute it using total fatalities (fatal) and population (pop). Note that because pop is also the name of a DataFrame method in pandas, you may have to navigate around some issues.
[9]:
df["fat_rate"] = (df.fatal / df["pop"]) * 10000
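The bracket notation above matters. A quick sketch on a hypothetical two-row frame shows why attribute access fails for this column: `df.pop` resolves to the built-in `DataFrame.pop` method rather than the column.

```python
import pandas as pd

# Hypothetical two-row frame illustrating the "pop" naming collision
toy = pd.DataFrame({"fatal": [100, 250], "pop": [1_000_000, 2_000_000]})

# Attribute access resolves to the DataFrame.pop *method*, not the column
print(callable(toy.pop))  # True

# Bracket access always returns the column, so the rate computes correctly
toy["fat_rate"] = toy["fatal"] / toy["pop"] * 10000
print(toy["fat_rate"].tolist())  # [1.0, 1.25]
```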
Exercise 3¶
Draw a scatter plot using beertax as the x-axis and fat_rate as the y-axis. Draw a fitted line showing the correlation between these two variables.
[10]:
# fit a model
from numpy.polynomial.polynomial import polyfit
b, m = polyfit(df["beertax"], df["fat_rate"], 1)
# scatterplot with fitted line
plt.scatter("beertax", "fat_rate", data=df)
plt.plot(df["beertax"], b + m * df["beertax"], "-", c="black")
[10]:
[<matplotlib.lines.Line2D at 0x16c6a9b40>]
Exercise 4¶
Fit a simple OLS regression. This is what is called a “pooled” regression because we’re “pooling” observations from different years into a single regression. What do your results imply about the relationship between Beer Taxes and fatalities?
\begin{align*} FatalityRate_{it} = \beta_0 + \beta_1 \times BeerTax_{it} + u_{it} \quad \text{(pooled 1982–1988 data)} \end{align*}
[11]:
import statsmodels.formula.api as smf
smf.ols("fat_rate ~ beertax", df).fit().summary()
[11]:
Dep. Variable: | fat_rate | R-squared: | 0.093 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.091 |
Method: | Least Squares | F-statistic: | 34.39 |
Date: | Sat, 04 Mar 2023 | Prob (F-statistic): | 1.08e-08 |
Time: | 13:52:53 | Log-Likelihood: | -271.04 |
No. Observations: | 336 | AIC: | 546.1 |
Df Residuals: | 334 | BIC: | 553.7 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 1.8533 | 0.044 | 42.539 | 0.000 | 1.768 | 1.939 |
beertax | 0.3646 | 0.062 | 5.865 | 0.000 | 0.242 | 0.487 |
Omnibus: | 66.653 | Durbin-Watson: | 0.465 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 112.734 |
Skew: | 1.134 | Prob(JB): | 3.31e-25 |
Kurtosis: | 4.707 | Cond. No. | 2.76 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Exercise 5¶
Now estimate your model again, this time adding state fixed effects (using the C() notation and your normal linear model machinery). What does this result imply about the relationship between beer taxes and fatalities?
[12]:
smf.ols("fat_rate ~ beertax + C(state)", df).fit().summary()
[12]:
Dep. Variable: | fat_rate | R-squared: | 0.905 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.889 |
Method: | Least Squares | F-statistic: | 56.97 |
Date: | Sat, 04 Mar 2023 | Prob (F-statistic): | 1.96e-120 |
Time: | 13:52:53 | Log-Likelihood: | 107.97 |
No. Observations: | 336 | AIC: | -117.9 |
Df Residuals: | 287 | BIC: | 69.09 |
Df Model: | 48 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 3.4776 | 0.313 | 11.098 | 0.000 | 2.861 | 4.094 |
C(state)[T.ar] | -0.6550 | 0.219 | -2.990 | 0.003 | -1.086 | -0.224 |
C(state)[T.az] | -0.5677 | 0.267 | -2.129 | 0.034 | -1.093 | -0.043 |
C(state)[T.ca] | -1.5095 | 0.304 | -4.960 | 0.000 | -2.109 | -0.910 |
C(state)[T.co] | -1.4843 | 0.287 | -5.165 | 0.000 | -2.050 | -0.919 |
C(state)[T.ct] | -1.8623 | 0.281 | -6.638 | 0.000 | -2.414 | -1.310 |
C(state)[T.de] | -1.3076 | 0.294 | -4.448 | 0.000 | -1.886 | -0.729 |
C(state)[T.fl] | -0.2681 | 0.139 | -1.924 | 0.055 | -0.542 | 0.006 |
C(state)[T.ga] | 0.5246 | 0.184 | 2.852 | 0.005 | 0.163 | 0.887 |
C(state)[T.ia] | -1.5439 | 0.253 | -6.092 | 0.000 | -2.043 | -1.045 |
C(state)[T.id] | -0.6690 | 0.258 | -2.593 | 0.010 | -1.177 | -0.161 |
C(state)[T.il] | -1.9616 | 0.291 | -6.730 | 0.000 | -2.535 | -1.388 |
C(state)[T.in] | -1.4615 | 0.273 | -5.363 | 0.000 | -1.998 | -0.925 |
C(state)[T.ks] | -1.2232 | 0.245 | -4.984 | 0.000 | -1.706 | -0.740 |
C(state)[T.ky] | -1.2175 | 0.287 | -4.240 | 0.000 | -1.783 | -0.652 |
C(state)[T.la] | -0.8471 | 0.189 | -4.490 | 0.000 | -1.218 | -0.476 |
C(state)[T.ma] | -2.1097 | 0.276 | -7.641 | 0.000 | -2.653 | -1.566 |
C(state)[T.md] | -1.7064 | 0.283 | -6.025 | 0.000 | -2.264 | -1.149 |
C(state)[T.me] | -1.1079 | 0.191 | -5.797 | 0.000 | -1.484 | -0.732 |
C(state)[T.mi] | -1.4845 | 0.236 | -6.290 | 0.000 | -1.949 | -1.020 |
C(state)[T.mn] | -1.8972 | 0.265 | -7.157 | 0.000 | -2.419 | -1.375 |
C(state)[T.mo] | -1.2963 | 0.267 | -4.861 | 0.000 | -1.821 | -0.771 |
C(state)[T.ms] | -0.0291 | 0.148 | -0.196 | 0.845 | -0.321 | 0.263 |
C(state)[T.mt] | -0.3604 | 0.264 | -1.365 | 0.173 | -0.880 | 0.159 |
C(state)[T.nc] | -0.2905 | 0.120 | -2.424 | 0.016 | -0.526 | -0.055 |
C(state)[T.nd] | -1.6234 | 0.254 | -6.396 | 0.000 | -2.123 | -1.124 |
C(state)[T.ne] | -1.5222 | 0.249 | -6.106 | 0.000 | -2.013 | -1.032 |
C(state)[T.nh] | -1.2545 | 0.210 | -5.983 | 0.000 | -1.667 | -0.842 |
C(state)[T.nj] | -2.1057 | 0.307 | -6.855 | 0.000 | -2.710 | -1.501 |
C(state)[T.nm] | 0.4264 | 0.254 | 1.677 | 0.095 | -0.074 | 0.927 |
C(state)[T.nv] | -0.6008 | 0.286 | -2.101 | 0.037 | -1.164 | -0.038 |
C(state)[T.ny] | -2.1867 | 0.299 | -7.316 | 0.000 | -2.775 | -1.598 |
C(state)[T.oh] | -1.6744 | 0.254 | -6.597 | 0.000 | -2.174 | -1.175 |
C(state)[T.ok] | -0.5451 | 0.169 | -3.223 | 0.001 | -0.878 | -0.212 |
C(state)[T.or] | -1.1680 | 0.286 | -4.088 | 0.000 | -1.730 | -0.606 |
C(state)[T.pa] | -1.7675 | 0.276 | -6.402 | 0.000 | -2.311 | -1.224 |
C(state)[T.ri] | -2.2651 | 0.294 | -7.711 | 0.000 | -2.843 | -1.687 |
C(state)[T.sc] | 0.5572 | 0.110 | 5.065 | 0.000 | 0.341 | 0.774 |
C(state)[T.sd] | -1.0037 | 0.210 | -4.788 | 0.000 | -1.416 | -0.591 |
C(state)[T.tn] | -0.8757 | 0.268 | -3.267 | 0.001 | -1.403 | -0.348 |
C(state)[T.tx] | -0.9175 | 0.246 | -3.736 | 0.000 | -1.401 | -0.434 |
C(state)[T.ut] | -1.1640 | 0.196 | -5.926 | 0.000 | -1.551 | -0.777 |
C(state)[T.va] | -1.2902 | 0.204 | -6.320 | 0.000 | -1.692 | -0.888 |
C(state)[T.vt] | -0.9660 | 0.211 | -4.576 | 0.000 | -1.382 | -0.550 |
C(state)[T.wa] | -1.6595 | 0.283 | -5.854 | 0.000 | -2.217 | -1.102 |
C(state)[T.wi] | -1.7593 | 0.294 | -5.985 | 0.000 | -2.338 | -1.181 |
C(state)[T.wv] | -0.8968 | 0.247 | -3.636 | 0.000 | -1.382 | -0.411 |
C(state)[T.wy] | -0.2285 | 0.313 | -0.730 | 0.466 | -0.844 | 0.387 |
beertax | -0.6559 | 0.188 | -3.491 | 0.001 | -1.026 | -0.286 |
Omnibus: | 53.045 | Durbin-Watson: | 1.517 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 219.863 |
Skew: | 0.585 | Prob(JB): | 1.81e-48 |
Kurtosis: | 6.786 | Cond. No. | 187. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Exercise 6¶
Explain why your results in Exercise 4 (without fixed effects) and Exercise 5 (with state fixed effects) look so different. What does this imply about states with high beer taxes?
Fixed Effects by Demeaning¶
Rather than just adding indicator variables, we’ll now use a different strategy for estimating fixed effects called “entity demeaning.” This method is more computationally efficient, and it can also help you understand how fixed effects work.
Let’s begin by assuming we want to estimate the following fixed-effect model:
\begin{align} FatalityRate_{it} = \alpha + \beta BeerTax_{it} + \Psi Z_i + \epsilon_{it} \tag{1} \end{align}
Where \(FatalityRate_{it}\) is the fatality rate of state \(i\) in year \(t\), \(BeerTax_{it}\) is the beer tax of state \(i\) in year \(t\), and \(Z_i\) is a state fixed effect.
Rather than adding indicator variables, however, we’ll use entity-demean as follows:
First, we average both sides of the regression over time. Here \(n\) is the number of time periods.
\begin{align*} \frac{1}{n} \sum_{t=1}^n FatalityRate_{it} =& \,\alpha+ \beta_1 \frac{1}{n} \sum_{t=1}^n BeerTax_{it} + \Psi \frac{1}{n} \sum_{t=1}^n Z_i + \frac{1}{n} \sum_{t=1}^n \epsilon_{it} \\ \overline{FatalityRate}_i =& \,\alpha+ \beta_1 \overline{BeerTax}_i + \Psi Z_i + \overline{\epsilon}_i. \tag{2} \end{align*}
Subtracting this average from the main equation yields:
\begin{align*} FatalityRate_{it} - \overline{FatalityRate}_i =& \, \beta_1(BeerTax_{it}-\overline{BeerTax}_i) + \Psi (Z_i - Z_i) + (\epsilon_{it} - \overline{\epsilon}_i) \\ \overset{\sim}{FatalityRate}_{it} =& \, \beta_1 \overset{\sim}{BeerTax}_{it} + \overset{\sim}{\epsilon}_{it} \tag{3} \end{align*}
Where the \(\sim\) means values have been demeaned by group.
By taking the difference between the value of each observation (state-year) and the mean value of the entity (state) over n periods, we analyze how the within-state variation of beer tax affects that of the fatality rate. Moreover, by doing so we no longer need to estimate the fixed effects of \(Z_i\), saving computing power if we are working on a dataset with a large number of fixed effects.
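Equation (3) can be seen in action on a hypothetical toy panel: two made-up "states" share the true slope \(\beta = 2\) but have very different baseline levels, so the pooled slope is badly biased (it even flips sign, as beertax does between Exercises 4 and 5), while the demeaned slope recovers \(\beta\) exactly.

```python
import numpy as np

# Two "states" with three periods each: same slope (2), different levels
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 2.0 * x + np.repeat([10.0, 0.0], 3)  # entity 0 is offset by +10
entity = np.repeat([0, 1], 3)

# Pooled OLS slope is contaminated by the cross-entity level difference
pooled_slope = np.polyfit(x, y, 1)[0]

# Subtract each entity's own mean from x and y (equation 3)
x_dm = x - np.array([x[entity == e].mean() for e in entity])
y_dm = y - np.array([y[entity == e].mean() for e in entity])
within_slope = (x_dm @ y_dm) / (x_dm @ x_dm)

print(round(pooled_slope, 3), within_slope)  # -0.571 2.0
```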
Exercise 7¶
Implement the above entity-demeaning approach to estimate the fixed-effects model by hand (use basic functions, not full tools like PanelOLS or the C() notation in Python, or lfe or C() notation in R).
[13]:
# get demeaned variables
df["beertax_dm"] = df["beertax"] - df.groupby("state").beertax.transform("mean")
df["fat_rate_dm"] = df["fat_rate"] - df.groupby("state").fat_rate.transform("mean")
# perform ols
smf.ols(
"fat_rate_dm ~ beertax_dm", df
).fit().summary()  # it should be the same as the output using PanelOLS
[13]:
Dep. Variable: | fat_rate_dm | R-squared: | 0.041 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.038 |
Method: | Least Squares | F-statistic: | 14.19 |
Date: | Sat, 04 Mar 2023 | Prob (F-statistic): | 0.000196 |
Time: | 13:52:53 | Log-Likelihood: | 107.97 |
No. Observations: | 336 | AIC: | -211.9 |
Df Residuals: | 334 | BIC: | -204.3 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -2.168e-17 | 0.010 | -2.26e-15 | 1.000 | -0.019 | 0.019 |
beertax_dm | -0.6559 | 0.174 | -3.767 | 0.000 | -0.998 | -0.313 |
Omnibus: | 53.045 | Durbin-Watson: | 1.517 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 219.863 |
Skew: | 0.585 | Prob(JB): | 1.81e-48 |
Kurtosis: | 6.786 | Cond. No. | 18.1 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Exercise 8¶
Fit the model with state fixed effects using PanelOLS / lfe. Compare it to your by-hand output. Interpret the result.
(Note that your standard errors will differ slightly: PanelOLS corrects the residual degrees of freedom for the 48 absorbed state effects, which your by-hand regression does not.)
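One way to reconcile the two standard errors (0.174 by hand vs. 0.1878 from PanelOLS), as a quick arithmetic sketch using only the numbers above: the by-hand regression overstates its residual degrees of freedom because it never "paid" for the 48 state means it subtracted.

```python
import math

# The demeaned regression reports se = 0.174 with 334 residual df, but it
# ignores that 48 state means were estimated. Rescaling to the correct
# residual df (336 - 1 - 48 = 287) recovers the PanelOLS standard error.
se_naive = 0.174
df_naive = 336 - 2          # intercept + slope
df_correct = 336 - 1 - 48   # slope + 48 absorbed state effects
print(round(se_naive * math.sqrt(df_naive / df_correct), 3))  # 0.188
```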
[14]:
df = df.set_index(["state", "year"])
[15]:
from linearmodels import PanelOLS
# model with state fixed effect
mod = PanelOLS.from_formula("fat_rate ~ beertax + EntityEffects", data=df)
print(mod.fit())
PanelOLS Estimation Summary
================================================================================
Dep. Variable: fat_rate R-squared: 0.0407
Estimator: PanelOLS R-squared (Between): -0.3805
No. Observations: 336 R-squared (Within): 0.0407
Date: Sat, Mar 04 2023 R-squared (Overall): -0.3775
Time: 13:52:53 Log-likelihood 107.97
Cov. Estimator: Unadjusted
F-statistic: 12.190
Entities: 48 P-value 0.0006
Avg Obs: 7.0000 Distribution: F(1,287)
Min Obs: 7.0000
Max Obs: 7.0000 F-statistic (robust): 12.190
P-value 0.0006
Time periods: 7 Distribution: F(1,287)
Avg Obs: 48.000
Min Obs: 48.000
Max Obs: 48.000
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
beertax -0.6559 0.1878 -3.4915 0.0006 -1.0256 -0.2861
==============================================================================
F-test for Poolability: 52.179
P-value: 0.0000
Distribution: F(47,287)
Included effects: Entity
Exercise 9¶
Now (using PanelOLS or lfe) estimate a fixed effects model using the following specification. Add fixed effects for both state and year, as well as any other covariates you think are important (\(X_{it}\)).
(Note: you may want to make sure PanelOLS adds an intercept to aid in interpretation of controls if you include categorical variables. If you use PanelOLS.from_formula(), just put a 1 + after the ~, or add a column of 1s to your X matrix if you’re working with numpy arrays.)
Explain (a) the type of phenomenon we control for by adding year fixed effects, and (b) your choice of covariates. Cluster the standard errors at the state level. Interpret the result.
\begin{align} FatalityRate_{it} = \beta BeerTax_{it} + X_{it} + State_i + Year_t + \epsilon_{it} \end{align}
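For comparison, this specification can also be run with the smf.ols machinery from the earlier exercises. Below is a sketch on made-up data (all variable names here are hypothetical): two-way fixed effects via C() plus cluster-robust standard errors grouped by state.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up balanced panel: 8 "states" observed over 5 "years"
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "state": np.repeat(list("abcdefgh"), 5),
    "year": np.tile(np.arange(1982, 1987), 8),
    "x": rng.normal(size=40),
})
toy["y"] = 2 * toy["x"] + rng.normal(size=40)

# State and year dummies via C(), with errors clustered at the state level
fit = smf.ols("y ~ x + C(state) + C(year)", toy).fit(
    cov_type="cluster", cov_kwds={"groups": toy["state"]}
)
print(fit.params["x"], fit.bse["x"])
```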
[16]:
# model with state and year fixed effect effects
mod = PanelOLS.from_formula(
    "fat_rate ~ 1 + beertax + unemp + EntityEffects + TimeEffects", data=df
)
print(mod.fit(cov_type="clustered", cluster_entity=True))
PanelOLS Estimation Summary
================================================================================
Dep. Variable: fat_rate R-squared: 0.2939
Estimator: PanelOLS R-squared (Between): -0.8925
No. Observations: 336 R-squared (Within): -0.2563
Date: Sat, Mar 04 2023 R-squared (Overall): -0.8295
Time: 13:52:53 Log-likelihood 167.33
Cov. Estimator: Clustered
F-statistic: 58.269
Entities: 48 P-value 0.0000
Avg Obs: 7.0000 Distribution: F(2,280)
Min Obs: 7.0000
Max Obs: 7.0000 F-statistic (robust): 18.060
P-value 0.0000
Time periods: 7 Distribution: F(2,280)
Avg Obs: 48.000
Min Obs: 48.000
Max Obs: 48.000
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
Intercept 3.0162 0.2245 13.433 0.0000 2.5742 3.4582
beertax -0.5398 0.3521 -1.5330 0.1264 -1.2329 0.1533
unemp -0.0951 0.0160 -5.9261 0.0000 -0.1267 -0.0635
==============================================================================
F-test for Poolability: 64.530
P-value: 0.0000
Distribution: F(53,280)
Included effects: Entity, Time
[17]:
# I added the unemployment rate because it isn't absorbed by the state or
# year fixed effects: it varies across states within a year and over time
# within a state.