Groupby and Arrest Data¶
In recent years, many US states have decided to legalize the use of marijuana.
When these ideas were first proposed, there were many theories about the relationship between crime and the “War on Drugs” (the term given to US efforts to arrest drug users and dealers over the past several decades).
In this exercise, we’re going to test a few of those theories using drug arrest data from the state of California.
Though California has passed a number of laws lessening penalties for marijuana possession over the years, arguably the biggest changes came in 2010, when the state changed the penalty for possessing a small amount of marijuana from a criminal offense to a “civil” penalty (meaning those found guilty only had to pay a fine, not go to jail), though possessing, selling, or producing larger quantities remained illegal. Then in 2016, the state fully legalized marijuana for recreational use, not only making possession of small amounts legal, but also creating a regulatory system for producing marijuana for sale.
Proponents of drug legalization have long argued that the war on drugs contributes to violent crime by creating an opportunity for drug dealers and organized crime to sell and distribute drugs, a business which tends to generate violence when gangs battle over territory. According to this theory, with drug legalization, we should see violent crime decrease after legalization in places where drug arrests had previously been common.
To be clear, this is far from the only argument for drug legalization! It is simply the argument we are well positioned to analyze today.
(Students from Practical Data Science: This should sound familiar! Last semester we did this analysis in a very simple, crude manner; in this class we’ll do it rigorously with your newfound difference-in-differences skills!)
Exercise 1¶
Download and import California arrest data from https://www.github.com/nickeubank/MIDS_Data/UDS_arrest_data.csv. What is a unit of observation (a single row) in this data? What entities are being tracked, and over what time period?
Note that VIOLENT is a count of arrests for violent offenses, F_DRUGOFF is a count of felony drug arrests, and total_population is total population.
[1]:
import pandas as pd
arrests = pd.read_csv(
"https://media.githubusercontent.com/media/"
"nickeubank/MIDS_Data/master/UDS_arrest_data.csv"
)
[2]:
arrests.head()
[2]:
|   | YEAR | COUNTY         | VIOLENT | F_DRUGOFF | total_population |
|---|------|----------------|---------|-----------|------------------|
| 0 | 1980 | Alameda County | 4504    | 3569      | 1105379.0        |
| 1 | 1981 | Alameda County | 4699    | 3926      | 1122759.3        |
| 2 | 1982 | Alameda County | 4389    | 4436      | 1140139.6        |
| 3 | 1983 | Alameda County | 4500    | 5086      | 1157519.9        |
| 4 | 1984 | Alameda County | 3714    | 5878      | 1174900.2        |
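One quick way to confirm what a row represents is to check that COUNTY and YEAR uniquely identify observations. This is just a sanity-check sketch, not part of the required answer:

# Each row should be one county in one year: no county-year pair should repeat.
assert not arrests.duplicated(subset=["COUNTY", "YEAR"]).any()

# Range of years covered by the panel.
arrests.YEAR.min(), arrests.YEAR.max()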
Exercise 2¶
In this analysis, we will split our sample into “treated” and “control” counties on the basis of whether a given county had a high average drug arrest rate in the three years before California began drug legalization in 2010. Counties with high drug arrest rates, after all, will be more impacted by drug liberalization policies.
Calculate each county’s average drug arrest rate for the period from 2007-2009. Then calculate the median value across counties, and create an indicator called treated for counties whose average drug arrest rate during this period was above the median. In other words, half your counties should be in the “treated” group, and half in “control”.
Note that this indicator should be time-invariant—if a county is in the treated group, it should always be identified as being in the treated group.
[3]:
arrests["drug_rate"] = arrests.F_DRUGOFF / (arrests.total_population / 100_000)
arrests["violent_rate"] = arrests.VIOLENT / (arrests.total_population / 100_000)
[4]:
# Average drug arrest rate by county over 2007-2009.
small = arrests[arrests.YEAR.isin([2007, 2008, 2009])]
avgs = small.groupby("COUNTY").drug_rate.mean()
avgs
[4]:
COUNTY
Alameda County 394.457331
Alpine County 197.551801
Amador County 289.027292
Butte County 271.709690
Calaveras County 312.159811
Colusa County 209.450372
Contra Costa County 276.018946
Del Norte County 358.305449
El Dorado County 228.874748
Fresno County 473.876866
Glenn County 411.831533
Humboldt County 499.572904
Imperial County 460.694485
Inyo County 380.164074
Kern County 432.659812
Kings County 208.035930
Lake County 369.871380
Lassen County 147.165387
Los Angeles County 403.990696
Madera County 261.117203
Marin County 190.030996
Mariposa County 232.969880
Mendocino County 599.799767
Merced County 441.290221
Modoc County 211.204960
Mono County 272.602759
Monterey County 248.094595
Napa County 247.823080
Nevada County 235.403323
Orange County 277.267304
Placer County 290.583530
Plumas County 492.429787
Riverside County 291.458787
Sacramento County 375.134854
San Benito County 210.849718
San Bernardino County 465.333881
San Diego County 249.177208
San Francisco County 873.961738
San Joaquin County 336.824184
San Luis Obispo County 192.583757
San Mateo County 214.668466
Santa Barbara County 225.267192
Santa Clara County 261.348166
Santa Cruz County 322.577393
Shasta County 257.470436
Sierra County 221.364059
Siskiyou County 352.795109
Solano County 388.929454
Sonoma County 331.905273
Stanislaus County 457.198700
Sutter County 207.695398
Tehama County 585.300555
Trinity County 427.859162
Tulare County 447.774248
Tuolumne County 405.897681
Ventura County 264.878553
Yolo County 333.788690
Yuba County 351.561100
Name: drug_rate, dtype: float64
[5]:
avgs.median()
[5]:
301.8092992253924
[6]:
# Counties above the median pre-period drug arrest rate form the "treated" group.
high = (avgs > avgs.median()).astype("int")
high.name = "treated"
high.head()
[6]:
COUNTY
Alameda County 1
Alpine County 0
Amador County 0
Butte County 0
Calaveras County 1
Name: treated, dtype: int64
[7]:
arrests = pd.merge(
arrests,
high,
left_on="COUNTY",
right_index=True,
how="outer",
indicator=True,
validate="m:1",
)
assert (arrests._merge == "both").all()
arrests = arrests.drop("_merge", axis="columns")
arrests.head()
[7]:
|   | YEAR | COUNTY         | VIOLENT | F_DRUGOFF | total_population | drug_rate  | violent_rate | treated |
|---|------|----------------|---------|-----------|------------------|------------|--------------|---------|
| 0 | 1980 | Alameda County | 4504    | 3569      | 1105379.0        | 322.875683 | 407.462056   | 1       |
| 1 | 1981 | Alameda County | 4699    | 3926      | 1122759.3        | 349.674236 | 418.522474   | 1       |
| 2 | 1982 | Alameda County | 4389    | 4436      | 1140139.6        | 389.075162 | 384.952860   | 1       |
| 3 | 1983 | Alameda County | 4500    | 5086      | 1157519.9        | 439.387694 | 388.762215   | 1       |
| 4 | 1984 | Alameda County | 3714    | 5878      | 1174900.2        | 500.297813 | 316.111956   | 1       |
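As an aside, the same time-invariant indicator can be built without a merge by mapping county-level averages back onto the full data. The sketch below is purely illustrative; the column name treated_alt is hypothetical and exists only so it can be checked against the treated column created above.

# A sketch: construct the indicator with map() instead of merge().
pre_years = arrests["YEAR"].isin([2007, 2008, 2009])

# County-level average drug arrest rate for 2007-2009.
county_pre_avg = arrests.loc[pre_years].groupby("COUNTY")["drug_rate"].mean()

# Broadcast each county's pre-period average onto all of its rows,
# then flag counties above the cross-county median.
arrests["treated_alt"] = (
    arrests["COUNTY"].map(county_pre_avg) > county_pre_avg.median()
).astype(int)

# Should agree with the merge-based indicator built above.
assert (arrests["treated_alt"] == arrests["treated"]).all()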
Exercise 3¶
Our outcome in this analysis is the violent arrest rate – if drug liberalization reduces crime overall, we would expect to see this rate fall in counties with high drug arrest rates after liberalization; if not, we would not expect to see any changes. Create a violent_rate variable, which is violent arrests per 100,000 people.
[8]:
# See above, already did!
Exercise 4¶
Difference-in-differences gets its name from the fact that the estimator, in its most basic implementation, is just the difference between:
the average change in outcome among eventually-treated units from before treatment is applied to after, and
the average change in outcome among never-treated units over the same period.
(Obviously treatment is never actually applied to the never-treated units; when we talk about pre / post for them, we mean before and after the point in time at which treatment is applied to the treated units. So if treated units are treated in 2008, then for the never-treated units we also compare outcomes before 2008 to outcomes after 2008, even though 2008 has no special significance for the never-treated units.)
In its most basic implementation, therefore, calculating a difference-in-differences estimate requires calculating just four numbers:
\(\bar y_{T=1,Post}\) Avg for Treatment, Post-Treatment
\(\bar y_{T=0,Post}\) Avg for Control, Post-Treatment
\(\bar y_{T=1,Pre}\) Avg for Treatment, Pre-Treatment
\(\bar y_{T=0,Pre}\) Avg for Control, Pre-Treatment
The difference-in-differences estimator \(\hat \delta\) is then defined as:

\(\hat \delta = (\bar y_{T=1,Post} - \bar y_{T=1,Pre}) - (\bar y_{T=0,Post} - \bar y_{T=0,Pre})\)
Using these four values, calculate (a) the change in violent arrest rates for our treated group from before legalization to after (\(\bar y_{T=1,Post} - \bar y_{T=1, Pre}\)), and (b) our difference-in-differences estimator \(\hat\delta\). Does your difference-in-differences estimate tell you something different from what you would have learned from a simple pre-post comparison?

For the Pre period, consider the three years before liberalization began in 2010 (i.e., 2007-2009). For the Post period, consider the three years after full legalization took place (2016-2018). We will ignore the middle period, during which marijuana was decriminalized but not yet fully legal.
[9]:
arrests["post_2010"] = (arrests.YEAR >= 2010).astype("int")
arrests_sub = arrests[arrests.YEAR.isin([2007, 2008, 2009, 2016, 2017, 2018])]
y_t1_post = arrests_sub.loc[
(arrests_sub["treated"] == 1) & (arrests_sub["post_2010"] == 1), "violent_rate"
].mean()
y_t1_pre = arrests_sub.loc[
(arrests_sub["treated"] == 1) & (arrests_sub["post_2010"] == 0), "violent_rate"
].mean()
y_t0_post = arrests_sub.loc[
(arrests_sub["treated"] == 0) & (arrests_sub["post_2010"] == 1), "violent_rate"
].mean()
y_t0_pre = arrests_sub.loc[
(arrests_sub["treated"] == 0) & (arrests_sub["post_2010"] == 0), "violent_rate"
].mean()
[10]:
y_t1_post - y_t1_pre
[10]:
-26.799650070477355
[11]:
(y_t1_post - y_t1_pre) - (y_t0_post - y_t0_pre)
[11]:
-7.418061484004568
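For reference, the same four cell means can also be pulled out with a single groupby. This is just a sketch of an equivalent calculation (the name means is purely illustrative):

# A sketch: all four group means at once, then the diff-in-diff by hand.
means = (
    arrests_sub.groupby(["treated", "post_2010"])["violent_rate"]
    .mean()
    .unstack("post_2010")
)

# Rows are treated (0/1), columns are post_2010 (0/1).
(means.loc[1, 1] - means.loc[1, 0]) - (means.loc[0, 1] - means.loc[0, 0])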
Exercise 5¶
Now calculate \(\hat\delta\) using a regression with an indicator for post-2010, an indicator for treated, and an interaction of the two. Use only the same set of years you used above. How does your estimate compare to the estimate you calculated in Exercise 4?
What does this tell you about interpretation of interaction terms with two indicator variables?
Note: You need to cluster your standard errors by county, since we expect observations from the same county (over time) to be subject to common fluctuations.
[12]:
import statsmodels.formula.api as smf
m = smf.ols("violent_rate ~ post_2010 * treated", arrests_sub).fit()
m.get_robustcov_results(cov_type="cluster", groups=arrests_sub.COUNTY).summary()
[12]:
                            OLS Regression Results
==============================================================================
Dep. Variable:           violent_rate   R-squared:                       0.221
Model:                            OLS   Adj. R-squared:                  0.214
Method:                 Least Squares   F-statistic:                     11.00
Date:                Sun, 12 Mar 2023   Prob (F-statistic):           8.45e-06
Time:                        14:57:47   Log-Likelihood:                -2094.1
No. Observations:                 348   AIC:                             4196.
Df Residuals:                     344   BIC:                             4212.
Df Model:                           3
Covariance Type:              cluster
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept           319.7820     17.638     18.131      0.000     284.463     355.101
post_2010           -19.3816      9.892     -1.959      0.055     -39.189       0.426
treated             106.8289     23.385      4.568      0.000      60.001     153.657
post_2010:treated    -7.4181     18.869     -0.393      0.696     -45.203      30.367
==============================================================================
Omnibus:                       53.945   Durbin-Watson:                   0.741
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               81.621
Skew:                           0.965   Prob(JB):                     1.89e-18
Kurtosis:                       4.380   Cond. No.                         6.85
==============================================================================
Notes:
[1] Standard Errors are robust to cluster correlation (cluster)
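As an aside, the same clustered standard errors can be requested directly in .fit() by passing cov_type and cov_kwds, rather than calling get_robustcov_results() afterwards. A minimal sketch, using statsmodels' cluster-robust covariance options:

# A sketch: cluster-robust SEs by county requested directly in .fit().
m_alt = smf.ols("violent_rate ~ post_2010 * treated", arrests_sub).fit(
    cov_type="cluster", cov_kwds={"groups": arrests_sub["COUNTY"]}
)
m_alt.summary()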
Exercise 6¶
In the preceding exercise, we did a simple pre-post / treated-control comparison. But one important limitation of these designs is that they do not allow us to test for parallel trends.
Plot a difference-in-differences model using data from 2000-2009 (inclusive) and from 2016-2018 (inclusive). Note this plot will have four distinct geometric components: a time trend for treated counties pre-2010, a time trend for control counties pre-2010, a time trend for treated counties post-2016 (including 2016), and a time trend for control counties post-2016 (including 2016).

Do you see evidence of parallel trends for the treated and control counties? Does that make you feel more or less confident in your diff-in-diff estimates?
[13]:
import altair as alt
alt.data_transformers.enable("data_server")
arrests_long = arrests[
arrests.YEAR.isin(list(range(2000, 2010)) + list(range(2016, 2019)))
]
[14]:
# Average violent arrest rate by year for treated and control counties.
grouped_means = arrests_long.groupby(["treated", "YEAR"], as_index=False)[
    ["violent_rate"]
].mean()
scatter = (
alt.Chart(grouped_means)
.mark_point()
.encode(
x=alt.X("YEAR:Q", scale=alt.Scale(zero=False)),
y=alt.Y("violent_rate:Q", scale=alt.Scale(zero=False)),
color="treated:N",
)
)
scatter
[14]: (Altair output: scatter of average violent_rate by YEAR, colored by treated)
[15]:
# Fit a separate linear trend for each treated x pre/post combination.
trend_lines = []
colors = {0: "lightblue", 1: "orange"}
for t, p in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    c = (
        alt.Chart(
            arrests_long[
                (arrests_long["treated"] == t) & (arrests_long["post_2010"] == p)
            ]
        )
        .encode(
            x=alt.X("YEAR:Q", scale=alt.Scale(zero=False)),
            y=alt.Y("violent_rate:Q", scale=alt.Scale(zero=False)),
        )
        .transform_regression("YEAR", "violent_rate")
        .mark_line(color=colors[t])
    )
    trend_lines.append(c)
alt.layer(*trend_lines, scatter)
[15]: (Altair output: the four fitted trend lines layered over the scatter)
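For readers who prefer matplotlib, the sketch below draws the same four group-specific linear trends using numpy.polyfit. It is purely illustrative (the Altair chart above is the intended solution) and assumes matplotlib and numpy are available:

# A sketch: the same four group-specific linear trends, drawn with matplotlib.
import matplotlib.pyplot as plt
import numpy as np

palette = {0: "lightblue", 1: "orange"}
fig, ax = plt.subplots()

for t in [0, 1]:
    for p in [0, 1]:
        sub = arrests_long[
            (arrests_long["treated"] == t) & (arrests_long["post_2010"] == p)
        ]
        # Simple linear trend of violent_rate on YEAR within this group.
        slope, intercept = np.polyfit(sub["YEAR"], sub["violent_rate"], 1)
        years = np.array([sub["YEAR"].min(), sub["YEAR"].max()])
        ax.plot(years, slope * years + intercept, color=palette[t])

    # Group-year average points, as in the Altair scatter above.
    pts = grouped_means[grouped_means["treated"] == t]
    ax.scatter(pts["YEAR"], pts["violent_rate"], color=palette[t], label=f"treated={t}")

ax.set_xlabel("YEAR")
ax.set_ylabel("violent_rate")
ax.legend()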
Exercise 7¶
While we can estimate the model described above precisely as a regression, it’s actually much easier to estimate a more flexible model by running the regression we ran in Exercise 5 with both county and year fixed effects. Use PanelOLS (or lfe in R) to estimate this fixed effects regression.
With all these additional fixed effects, do you find evidence that marijuana legalization reduced violent crime?
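For intuition, the “precisely as a regression” version mentioned above can be written out in plain statsmodels with explicit county and year dummies. This is only a sketch (not required for the exercise); because the dummies absorb the treated and post_2010 main effects, only the interaction term is included, and its point estimate should match the PanelOLS interaction coefficient below (clustered standard errors may differ slightly due to degrees-of-freedom corrections).

# A sketch: two-way fixed effects written out with explicit dummies.
import statsmodels.formula.api as smf

twfe_dummies = smf.ols(
    "violent_rate ~ treated:post_2010 + C(COUNTY) + C(YEAR)",
    data=arrests_sub,
).fit(cov_type="cluster", cov_kwds={"groups": arrests_sub["COUNTY"]})

twfe_dummies.params["treated:post_2010"]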
[16]:
# NOTE TO GRADERS: It's not clear which dataset people will do this
# with -- the data from Exercise 5, or the longer dataset from the previous question.
# Both are fine. Answers for both are below.

# NOTE TO GRADERS 2: I _thought_ that PanelOLS clustered errors automatically.
# It does NOT. You have to tell it to cluster in the
# .fit() method, e.g., `.fit(cov_type="clustered", cluster_entity=True)`.
# I misled some students on this (in 2023), so if they don't
# do this and get smaller SEs, that's not their fault.
[17]:
from linearmodels import PanelOLS
arrests_for_panelols = arrests_sub.set_index(["COUNTY", "YEAR"])
[18]:
mod = PanelOLS.from_formula(
"violent_rate ~ treated * post_2010 + EntityEffects + TimeEffects",
data=arrests_for_panelols,
drop_absorbed=True,
).fit(cov_type="clustered", cluster_entity=True, cluster_time=True)
mod.summary
/var/folders/fs/h_8_rwsn5hvg9mhp0txgc_s9v6191b/T/ipykernel_84356/3310218310.py:1: AbsorbingEffectWarning:
Variables have been fully absorbed and have removed from the regression:
post_2010, treated
mod = PanelOLS.from_formula(
[18]:
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:           violent_rate   R-squared:                        0.0013
Estimator:                   PanelOLS   R-squared (Between):             -0.0109
No. Observations:                 348   R-squared (Within):               0.0155
Date:                Sun, Mar 12 2023   R-squared (Overall):             -0.0104
Time:                        14:57:47   Log-likelihood                   -1858.7
Cov. Estimator:             Clustered
                                        F-statistic:                      0.3829
Entities:                          58   P-value                           0.5366
Avg Obs:                       6.0000   Distribution:                   F(1,284)
Min Obs:                       6.0000
Max Obs:                       6.0000   F-statistic (robust):             0.1899
                                        P-value                           0.6633
Time periods:                       6   Distribution:                   F(1,284)
Avg Obs:                       58.000
Min Obs:                       58.000
Max Obs:                       58.000

                                 Parameter Estimates
=====================================================================================
                   Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-------------------------------------------------------------------------------------
post_2010:treated    -7.4181     17.023    -0.4358     0.6633     -40.925      26.089
=====================================================================================

F-test for Poolability: 17.282
P-value: 0.0000
Distribution: F(62,284)

Included effects: Entity, Time
[19]:
# With longer data:
arrests_for_panelols = arrests_long.set_index(["COUNTY", "YEAR"])
arrests_long.YEAR.value_counts()
[19]:
2000 58
2001 58
2002 58
2003 58
2004 58
2005 58
2006 58
2007 58
2008 58
2009 58
2016 58
2017 58
2018 58
Name: YEAR, dtype: int64
[20]:
# Model with county (entity) and year (time) fixed effects.
mod = PanelOLS.from_formula(
"violent_rate ~ treated * post_2010 + TimeEffects + EntityEffects",
data=arrests_for_panelols,
drop_absorbed=True,
)
print(mod.fit(cov_type="clustered", cluster_entity=True, cluster_time=True))
PanelOLS Estimation Summary
================================================================================
Dep. Variable: violent_rate R-squared: 0.0097
Estimator: PanelOLS R-squared (Between): -0.0168
No. Observations: 754 R-squared (Within): 0.0588
Date: Sun, Mar 12 2023 R-squared (Overall): -0.0150
Time: 14:57:48 Log-likelihood -4108.4
Cov. Estimator: Clustered
F-statistic: 6.6568
Entities: 58 P-value 0.0101
Avg Obs: 13.000 Distribution: F(1,683)
Min Obs: 13.000
Max Obs: 13.000 F-statistic (robust): 2.6634
P-value 0.1031
Time periods: 13 Distribution: F(1,683)
Avg Obs: 58.000
Min Obs: 58.000
Max Obs: 58.000
Parameter Estimates
=====================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
-------------------------------------------------------------------------------------
post_2010:treated -26.362 16.153 -1.6320 0.1031 -58.078 5.3540
=====================================================================================
F-test for Poolability: 41.619
P-value: 0.0000
Distribution: F(69,683)
Included effects: Entity, Time
/var/folders/fs/h_8_rwsn5hvg9mhp0txgc_s9v6191b/T/ipykernel_84356/3654453326.py:7: AbsorbingEffectWarning:
Variables have been fully absorbed and have removed from the regression:
post_2010, treated
print(mod.fit(cov_type="clustered", cluster_entity=True, cluster_time=True))