Traffic Death Analysis

In this exercise, we will be analyzing the effect of alcohol taxes on traffic death in the United States. The data set used in this exercise, fatalities.csv, is a state-year panel dataset (meaning it includes data on multiple states, and the data includes several years of data for each state. The data contains 336 observations on 34 variables. The variables used in the exercise are defined as follows:

state: factor variable indicating states

year: factor variable indicating years

beertax: numeric variable, Tax on the case of beer

In these exercises, we’ll be looking at how beer taxes (which are believed to reduce alcohol consumption, potentially reducing drunk driving deaths) impact car accident fatality rates.

More specifically, though, we’ll be approaching our estimation of the impact of beer taxes in a few different ways in an effort to give you more of an intuitive sense of what happens when you add fixed effects to a regression.

Exercise 1

Download and load the data from this link, or by going to www.github.com/nickeubank/MIDS_Data/ and downloading the us_driving_fatalities.csv dataset.

How many states does this dataset contain? What’s the time frame of this dataset? (From which year to which year). And what constitutes a single observation (i.e. what is the unit of analysis for each row of the data?)

[1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
[2]:
df = pd.read_csv(
    "https://media.githubusercontent.com/media/nickeubank/MIDS_Data/master/us_driving_fatalities.csv"
)
[3]:
df.head()
[3]:
Unnamed: 0 state year spirits unemp income emppop beertax baptist mormon ... nfatal2124 afatal pop pop1517 pop1820 pop2124 milestot unempus emppopus gsp
0 1 al 1982 1.37 14.4 10544.152344 50.692039 1.539379 30.355700 0.32829 ... 32 309.437988 3942002.25 208999.593750 221553.43750 290000.06250 28516.0 9.7 57.799999 -0.022125
1 2 al 1983 1.36 13.7 10732.797852 52.147030 1.788991 30.333599 0.34341 ... 35 341.834015 3960008.00 202000.078125 219125.46875 290000.15625 31032.0 9.6 57.900002 0.046558
2 3 al 1984 1.32 11.1 11108.791016 54.168087 1.714286 30.311501 0.35924 ... 34 304.872009 3988991.75 196999.968750 216724.09375 288000.15625 32961.0 7.5 59.500004 0.062798
3 4 al 1985 1.28 8.9 11332.626953 55.271137 1.652542 30.289499 0.37579 ... 45 276.742004 4021007.75 194999.734375 214349.03125 284000.31250 35091.0 7.2 60.100002 0.027490
4 5 al 1986 1.23 9.8 11661.506836 56.514496 1.609907 30.267401 0.39311 ... 29 360.716003 4049993.75 203999.890625 212000.00000 263000.28125 36259.0 7.0 60.700001 0.032143

5 rows × 35 columns

[4]:
df.state.nunique()
[4]:
48
[5]:
df.year.min()
[5]:
1982
[6]:
df.year.max()
[6]:
1988
[7]:
# Make sure all same min and max
assert (df.groupby("state").year.min() == df.year.min()).all()
assert (df.groupby("state").year.max() == df.year.max()).all()
[8]:
# So 48 states, running from 1982 to 1988
# One row is one state-year.

Exercise 2

We use the fatality rate per 10,000 as the dependent variable. Construct this variable. Name it as fat_rate. Hint: You can compute it using total fatalities(fatal) and population (pop). Note that because pop is often the name of a method in Python, you may have to navigate around some issues.

[9]:
df["fat_rate"] = (df.fatal / df["pop"]) * 10000

Exercise 3

Draw a scatter plot using beertax as the x-axis, and fat_rate as the y-axis. Draw a fitted line showing the correlation between these two variables

[10]:
# fit a model
from numpy.polynomial.polynomial import polyfit

b, m = polyfit(df["beertax"], df["fat_rate"], 1)

# scatterplot with fitted line
plt.scatter("beertax", "fat_rate", data=df)
plt.plot(df["beertax"], b + m * df["beertax"], "-", c="black")
[10]:
[<matplotlib.lines.Line2D at 0x16c6a9b40>]
../_images/exercises_solutions_panel_12_1.png

Exercise 4

Fit a simple OLS regression. This is what is called a “pooled” regression because we’re “pooling” observations from different years into a single regression. What do your results imply about the relationship between Beer Taxes and fatalities?

\begin{align*} FatalityRate_i = \beta_0 + \beta_1 \times BeerTax_i \quad (1982 or 1988 \text{ data}), \\ \end{align*}

[11]:
import statsmodels.formula.api as smf

smf.ols("fat_rate ~ beertax", df).fit().summary()
[11]:
OLS Regression Results
Dep. Variable: fat_rate R-squared: 0.093
Model: OLS Adj. R-squared: 0.091
Method: Least Squares F-statistic: 34.39
Date: Sat, 04 Mar 2023 Prob (F-statistic): 1.08e-08
Time: 13:52:53 Log-Likelihood: -271.04
No. Observations: 336 AIC: 546.1
Df Residuals: 334 BIC: 553.7
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 1.8533 0.044 42.539 0.000 1.768 1.939
beertax 0.3646 0.062 5.865 0.000 0.242 0.487
Omnibus: 66.653 Durbin-Watson: 0.465
Prob(Omnibus): 0.000 Jarque-Bera (JB): 112.734
Skew: 1.134 Prob(JB): 3.31e-25
Kurtosis: 4.707 Cond. No. 2.76


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Exercise 5

Now estimate your model again, this time adding state fixed effects (using the C() notation and your normal linear model machinery). What does this result imply about the relationship between beer taxes and fatalities?

[12]:
smf.ols("fat_rate ~ beertax + C(state)", df).fit().summary()
[12]:
OLS Regression Results
Dep. Variable: fat_rate R-squared: 0.905
Model: OLS Adj. R-squared: 0.889
Method: Least Squares F-statistic: 56.97
Date: Sat, 04 Mar 2023 Prob (F-statistic): 1.96e-120
Time: 13:52:53 Log-Likelihood: 107.97
No. Observations: 336 AIC: -117.9
Df Residuals: 287 BIC: 69.09
Df Model: 48
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3.4776 0.313 11.098 0.000 2.861 4.094
C(state)[T.ar] -0.6550 0.219 -2.990 0.003 -1.086 -0.224
C(state)[T.az] -0.5677 0.267 -2.129 0.034 -1.093 -0.043
C(state)[T.ca] -1.5095 0.304 -4.960 0.000 -2.109 -0.910
C(state)[T.co] -1.4843 0.287 -5.165 0.000 -2.050 -0.919
C(state)[T.ct] -1.8623 0.281 -6.638 0.000 -2.414 -1.310
C(state)[T.de] -1.3076 0.294 -4.448 0.000 -1.886 -0.729
C(state)[T.fl] -0.2681 0.139 -1.924 0.055 -0.542 0.006
C(state)[T.ga] 0.5246 0.184 2.852 0.005 0.163 0.887
C(state)[T.ia] -1.5439 0.253 -6.092 0.000 -2.043 -1.045
C(state)[T.id] -0.6690 0.258 -2.593 0.010 -1.177 -0.161
C(state)[T.il] -1.9616 0.291 -6.730 0.000 -2.535 -1.388
C(state)[T.in] -1.4615 0.273 -5.363 0.000 -1.998 -0.925
C(state)[T.ks] -1.2232 0.245 -4.984 0.000 -1.706 -0.740
C(state)[T.ky] -1.2175 0.287 -4.240 0.000 -1.783 -0.652
C(state)[T.la] -0.8471 0.189 -4.490 0.000 -1.218 -0.476
C(state)[T.ma] -2.1097 0.276 -7.641 0.000 -2.653 -1.566
C(state)[T.md] -1.7064 0.283 -6.025 0.000 -2.264 -1.149
C(state)[T.me] -1.1079 0.191 -5.797 0.000 -1.484 -0.732
C(state)[T.mi] -1.4845 0.236 -6.290 0.000 -1.949 -1.020
C(state)[T.mn] -1.8972 0.265 -7.157 0.000 -2.419 -1.375
C(state)[T.mo] -1.2963 0.267 -4.861 0.000 -1.821 -0.771
C(state)[T.ms] -0.0291 0.148 -0.196 0.845 -0.321 0.263
C(state)[T.mt] -0.3604 0.264 -1.365 0.173 -0.880 0.159
C(state)[T.nc] -0.2905 0.120 -2.424 0.016 -0.526 -0.055
C(state)[T.nd] -1.6234 0.254 -6.396 0.000 -2.123 -1.124
C(state)[T.ne] -1.5222 0.249 -6.106 0.000 -2.013 -1.032
C(state)[T.nh] -1.2545 0.210 -5.983 0.000 -1.667 -0.842
C(state)[T.nj] -2.1057 0.307 -6.855 0.000 -2.710 -1.501
C(state)[T.nm] 0.4264 0.254 1.677 0.095 -0.074 0.927
C(state)[T.nv] -0.6008 0.286 -2.101 0.037 -1.164 -0.038
C(state)[T.ny] -2.1867 0.299 -7.316 0.000 -2.775 -1.598
C(state)[T.oh] -1.6744 0.254 -6.597 0.000 -2.174 -1.175
C(state)[T.ok] -0.5451 0.169 -3.223 0.001 -0.878 -0.212
C(state)[T.or] -1.1680 0.286 -4.088 0.000 -1.730 -0.606
C(state)[T.pa] -1.7675 0.276 -6.402 0.000 -2.311 -1.224
C(state)[T.ri] -2.2651 0.294 -7.711 0.000 -2.843 -1.687
C(state)[T.sc] 0.5572 0.110 5.065 0.000 0.341 0.774
C(state)[T.sd] -1.0037 0.210 -4.788 0.000 -1.416 -0.591
C(state)[T.tn] -0.8757 0.268 -3.267 0.001 -1.403 -0.348
C(state)[T.tx] -0.9175 0.246 -3.736 0.000 -1.401 -0.434
C(state)[T.ut] -1.1640 0.196 -5.926 0.000 -1.551 -0.777
C(state)[T.va] -1.2902 0.204 -6.320 0.000 -1.692 -0.888
C(state)[T.vt] -0.9660 0.211 -4.576 0.000 -1.382 -0.550
C(state)[T.wa] -1.6595 0.283 -5.854 0.000 -2.217 -1.102
C(state)[T.wi] -1.7593 0.294 -5.985 0.000 -2.338 -1.181
C(state)[T.wv] -0.8968 0.247 -3.636 0.000 -1.382 -0.411
C(state)[T.wy] -0.2285 0.313 -0.730 0.466 -0.844 0.387
beertax -0.6559 0.188 -3.491 0.001 -1.026 -0.286
Omnibus: 53.045 Durbin-Watson: 1.517
Prob(Omnibus): 0.000 Jarque-Bera (JB): 219.863
Skew: 0.585 Prob(JB): 1.81e-48
Kurtosis: 6.786 Cond. No. 187.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Exercise 6

Explain whey your results in Exercises 4 (without fixed effects) and Exercise 5 (with state fixed effects) look so different. What does this imply about states with high beer taxes?

Fixed Effects by Demeaning

Rather than just add indicator variables, we’ll now use a different strategy for estimating fixed effects called an “entity-demeaning.” This method is more computationally efficient, and can also help you understand how fixed effects work.

Let’s begin by assuming we want to estimate the following fixed-effect model:

\begin{align} FatalityRate_{it} = \alpha + \beta BeerTax_{it} + \Psi Z_i + \epsilon_{it} \tag{1} \end{align}

Where \(FatalityRate_{it}\) is the fatality rate of state \(i\) in year \(t\), \(\beta BeerTax_{it}\) is the beer tax of state \(i\) in year \(t\). \(Z_i\) is a state fixed effect.

Rather than adding indicator variables, however, we’ll use entity-demean as follows:

First, we take the average on both sides of the regression. Here \(n\) is the number of periods.

\begin{align*} \frac{1}{n} \sum_{i=1}^n FatalityRate_{it} =& \,\alpha+ \beta_1 \frac{1}{n} \sum_{i=1}^n BeerTax_{it} + \Psi \frac{1}{n} \sum_{i=1}^n Z_i + \frac{1}{n} \sum_{i=1}^n \epsilon_{it} \\ \overline{FatalityRate}_i =& \,\alpha+ \beta_1 \overline{BeerTax}_i + \Psi Z_i + \overline{\epsilon}_i. \tag{2} \end{align*}

Substracting the from the main equation yields:

\begin{align*} FatalityRate_{it} - \overline{FatalityRate}_i =& \, \beta_1(BeerTax_{it}-\overline{BeerTax}_i) + \Psi (Z_i - Z_i) + (\epsilon_{it} - \overline{\epsilon}_i) \\ \overset{\sim}{FatalityRate}_{it} =& \, \beta_1 \overset{\sim}{BeerTax}_{it} + \overset{\sim}{\epsilon}_{it} \tag{3} \end{align*}

Where the \(\sim\) means values have been demeaned by group.

By taking the difference between the value of each observation (state-year) and the mean value of the entity (state) over n periods, we analyze how the within-state variation of beer tax affects that of the fatality rate. Moreover, by doing so we no longer need to estimate the fixed effects of \(Z_i\), saving computing power if we are working on a dataset with a large number of fixed effects.

Exercise 7

Implement the above entity-demeaned approach to estimate the fixed-effects model by hand (use basic functions, not full tools like PanelOLS or C() notation in python, or lfe or C() notation in R).

[13]:
# get demeaned variables
df["beertax_dm"] = df["beertax"] - df.groupby("state").beertax.transform("mean")
df["fat_rate_dm"] = df["fat_rate"] - df.groupby("state").fat_rate.transform("mean")

# perform ols
smf.ols(
    "fat_rate_dm ~ beertax_dm", df
).fit().summary()  # it should be the same as the output using PanelOls
[13]:
OLS Regression Results
Dep. Variable: fat_rate_dm R-squared: 0.041
Model: OLS Adj. R-squared: 0.038
Method: Least Squares F-statistic: 14.19
Date: Sat, 04 Mar 2023 Prob (F-statistic): 0.000196
Time: 13:52:53 Log-Likelihood: 107.97
No. Observations: 336 AIC: -211.9
Df Residuals: 334 BIC: -204.3
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -2.168e-17 0.010 -2.26e-15 1.000 -0.019 0.019
beertax_dm -0.6559 0.174 -3.767 0.000 -0.998 -0.313
Omnibus: 53.045 Durbin-Watson: 1.517
Prob(Omnibus): 0.000 Jarque-Bera (JB): 219.863
Skew: 0.585 Prob(JB): 1.81e-48
Kurtosis: 6.786 Cond. No. 18.1


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Exercise 8

Fit the model with state fixed-effect using PanelOLS / lfe. Compare it to your by-hand output. Interpret the result.

(Note that your standard errors will change a little bit because PanelOLS clusters standard errors by state, which you may not have done above above.)

[14]:
df = df.set_index(["state", "year"])
[15]:
from linearmodels import PanelOLS

# model with state fixed effect
mod = PanelOLS.from_formula("fat_rate ~ beertax + EntityEffects", data=df)
print(mod.fit())
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:               fat_rate   R-squared:                        0.0407
Estimator:                   PanelOLS   R-squared (Between):             -0.3805
No. Observations:                 336   R-squared (Within):               0.0407
Date:                Sat, Mar 04 2023   R-squared (Overall):             -0.3775
Time:                        13:52:53   Log-likelihood                    107.97
Cov. Estimator:            Unadjusted
                                        F-statistic:                      12.190
Entities:                          48   P-value                           0.0006
Avg Obs:                       7.0000   Distribution:                   F(1,287)
Min Obs:                       7.0000
Max Obs:                       7.0000   F-statistic (robust):             12.190
                                        P-value                           0.0006
Time periods:                       7   Distribution:                   F(1,287)
Avg Obs:                       48.000
Min Obs:                       48.000
Max Obs:                       48.000

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
beertax       -0.6559     0.1878    -3.4915     0.0006     -1.0256     -0.2861
==============================================================================

F-test for Poolability: 52.179
P-value: 0.0000
Distribution: F(47,287)

Included effects: Entity

Exercise 9

Now (using PanelOLS or lfe) estimate a fixed effects model using the following specification. Add fixed effects for both the state and the year, as well as the other covariates you think are important \(X_{it}\)).

(Note: you may want to make sure PanelOLS adds an intercept to aid in interpretation of controls if you include categorical variables. If you use PanelOLS.from_formula(), just put a 1+ in after the ~, or add a column of 1s to your X matrix if you’re working with numpy arrays.)

Explain (a) the type of phenomenon we control for by adding year fixed effects, and (b) your choice of covariates. Cluster the standard error at the state level. Interpret the result.

\begin{align} FatalityRate_{it} = \beta BeerTax_{it} + X_{it} + State_i + Year_t + \epsilon_{it} \end{align}

[16]:
# model with state and year fixed effect effects
mod = PanelOLS.from_formula(
    "fat_rate ~ 1 + beertax + unemp " "+ EntityEffects + TimeEffects", data=df
)
print(mod.fit(cov_type="clustered", cluster_entity=True))

                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:               fat_rate   R-squared:                        0.2939
Estimator:                   PanelOLS   R-squared (Between):             -0.8925
No. Observations:                 336   R-squared (Within):              -0.2563
Date:                Sat, Mar 04 2023   R-squared (Overall):             -0.8295
Time:                        13:52:53   Log-likelihood                    167.33
Cov. Estimator:             Clustered
                                        F-statistic:                      58.269
Entities:                          48   P-value                           0.0000
Avg Obs:                       7.0000   Distribution:                   F(2,280)
Min Obs:                       7.0000
Max Obs:                       7.0000   F-statistic (robust):             18.060
                                        P-value                           0.0000
Time periods:                       7   Distribution:                   F(2,280)
Avg Obs:                       48.000
Min Obs:                       48.000
Max Obs:                       48.000

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Intercept      3.0162     0.2245     13.433     0.0000      2.5742      3.4582
beertax       -0.5398     0.3521    -1.5330     0.1264     -1.2329      0.1533
unemp         -0.0951     0.0160    -5.9261     0.0000     -0.1267     -0.0635
==============================================================================

F-test for Poolability: 64.530
P-value: 0.0000
Distribution: F(53,280)

Included effects: Entity, Time
[17]:
# I added unemployment rate because it won't be absorbed by
# state or year fixed effects—it varies *within* year *across* states.