Groupby and Arrest Data¶
In recent years, many US states have decided to legalize the use of marijuana.
When these ideas were first proposed, there were many theories about the relationship between crime and the “War on Drugs” (the term given to US efforts to arrest drug users and dealers over the past several decades).
In this exercise, we’re going to test a few of those theories using drug arrest data from the state of California.
Though California has passed a number of laws lessening penalties for marijuana possession over the years, arguably the biggest changes came in 2010, when the state changed the penalty for possessing a small amount of marijuana from a criminal offense to a “civil” penalty (meaning those found guilty only had to pay a fine, not go to jail), though possessing, selling, or producing larger quantities remained illegal. Then in 2016, the state fully legalized marijuana for recreational use, not only making possession of small amounts legal, but also creating a regulatory system for producing marijuana for sale.
Proponents of drug legalization have long argued that the war on drugs contributes to violent crime by creating an opportunity for drug dealers and organized crime to sell and distribute drugs, a business which tends to generate violence when gangs battle over territory. According to this theory, with drug legalization, we should see violent crime decrease after legalization in places where drug arrests had previously been common.
To be clear, this is far from the only argument for drug legalization! It is simply the argument we are well positioned to analyze today.
(Students from Practical Data Science: This should sound familiar! Last semester we did this analysis in a very simple, crude manner; in this class we’ll do it rigorously with your newfound difference-in-differences skills!)
Exercise 1¶
Download and import California arrest data from https://www.github.com/nickeubank/MIDS_Data/UDS_arrest_data.csv. What is a unit of observation (a single row) in this data? What entities are being tracked, and over what time period?
Note that VIOLENT is a count of arrests for violent offenses, F_DRUGOFF is a count of felony drug arrests, and total_population is total population.
[1]:
import pandas as pd
arrests = pd.read_csv(
"https://media.githubusercontent.com/media/"
"nickeubank/MIDS_Data/master/UDS_arrest_data.csv"
)
[2]:
arrests.head()
[2]:
|   | YEAR | COUNTY         | VIOLENT | F_DRUGOFF | total_population |
|---|------|----------------|---------|-----------|------------------|
| 0 | 1980 | Alameda County | 4504    | 3569      | 1105379.0        |
| 1 | 1981 | Alameda County | 4699    | 3926      | 1122759.3        |
| 2 | 1982 | Alameda County | 4389    | 4436      | 1140139.6        |
| 3 | 1983 | Alameda County | 4500    | 5086      | 1157519.9        |
| 4 | 1984 | Alameda County | 3714    | 5878      | 1174900.2        |
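One quick way to confirm what a row represents is to check that COUNTY and YEAR uniquely identify observations. This is just a sanity-check sketch, not part of the required answer:

# Each row should be one county in one year: no county-year pair should repeat.
assert not arrests.duplicated(subset=["COUNTY", "YEAR"]).any()

# Range of years covered by the panel.
arrests.YEAR.min(), arrests.YEAR.max()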
Exercise 2¶
In this analysis, we will split our sample into “treated” and “control” counties on the basis of whether a given county had a high average drug arrest rate in the three years before California began drug legalization in 2010. Counties with high drug arrest rates, after all, will be more impacted by drug liberalization policies.
Calculate each county’s average drug arrest rate for the period from 2007-2009. Then calculate the median value across counties, and create an indicator called treated for counties whose average drug arrest rate during this period was above the median. In other words, half your counties should be in the “treated” group, and half in “control”.
Note that this indicator should be time-invariant—if a county is in the treated group, it should always be identified as being in the treated group.
[3]:
arrests["drug_rate"] = arrests.F_DRUGOFF / (arrests.total_population / 100_000)
arrests["violent_rate"] = arrests.VIOLENT / (arrests.total_population / 100_000)
[4]:
# Average drug arrest rate by county over 2007-2009.
small = arrests[arrests.YEAR.isin([2007, 2008, 2009])]
avgs = small.groupby("COUNTY").drug_rate.mean()
avgs
[4]:
COUNTY
Alameda County 394.457331
Alpine County 197.551801
Amador County 289.027292
Butte County 271.709690
Calaveras County 312.159811
Colusa County 209.450372
Contra Costa County 276.018946
Del Norte County 358.305449
El Dorado County 228.874748
Fresno County 473.876866
Glenn County 411.831533
Humboldt County 499.572904
Imperial County 460.694485
Inyo County 380.164074
Kern County 432.659812
Kings County 208.035930
Lake County 369.871380
Lassen County 147.165387
Los Angeles County 403.990696
Madera County 261.117203
Marin County 190.030996
Mariposa County 232.969880
Mendocino County 599.799767
Merced County 441.290221
Modoc County 211.204960
Mono County 272.602759
Monterey County 248.094595
Napa County 247.823080
Nevada County 235.403323
Orange County 277.267304
Placer County 290.583530
Plumas County 492.429787
Riverside County 291.458787
Sacramento County 375.134854
San Benito County 210.849718
San Bernardino County 465.333881
San Diego County 249.177208
San Francisco County 873.961738
San Joaquin County 336.824184
San Luis Obispo County 192.583757
San Mateo County 214.668466
Santa Barbara County 225.267192
Santa Clara County 261.348166
Santa Cruz County 322.577393
Shasta County 257.470436
Sierra County 221.364059
Siskiyou County 352.795109
Solano County 388.929454
Sonoma County 331.905273
Stanislaus County 457.198700
Sutter County 207.695398
Tehama County 585.300555
Trinity County 427.859162
Tulare County 447.774248
Tuolumne County 405.897681
Ventura County 264.878553
Yolo County 333.788690
Yuba County 351.561100
Name: drug_rate, dtype: float64
[5]:
avgs.median()
[5]:
301.8092992253924
[6]:
# Counties above the median pre-period drug arrest rate form the "treated" group.
high = (avgs > avgs.median()).astype("int")
high.name = "treated"
high.head()
[6]:
COUNTY
Alameda County 1
Alpine County 0
Amador County 0
Butte County 0
Calaveras County 1
Name: treated, dtype: int64
[7]:
arrests = pd.merge(
arrests,
high,
left_on="COUNTY",
right_index=True,
how="outer",
indicator=True,
validate="m:1",
)
assert (arrests._merge == "both").all()
arrests = arrests.drop("_merge", axis="columns")
arrests.head()
[7]:
|   | YEAR | COUNTY         | VIOLENT | F_DRUGOFF | total_population | drug_rate  | violent_rate | treated |
|---|------|----------------|---------|-----------|------------------|------------|--------------|---------|
| 0 | 1980 | Alameda County | 4504    | 3569      | 1105379.0        | 322.875683 | 407.462056   | 1       |
| 1 | 1981 | Alameda County | 4699    | 3926      | 1122759.3        | 349.674236 | 418.522474   | 1       |
| 2 | 1982 | Alameda County | 4389    | 4436      | 1140139.6        | 389.075162 | 384.952860   | 1       |
| 3 | 1983 | Alameda County | 4500    | 5086      | 1157519.9        | 439.387694 | 388.762215   | 1       |
| 4 | 1984 | Alameda County | 3714    | 5878      | 1174900.2        | 500.297813 | 316.111956   | 1       |
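As an aside, the same time-invariant indicator can be built without a merge by mapping county-level averages back onto the full data. The sketch below is purely illustrative; the column name treated_alt is hypothetical and exists only so it can be checked against the treated column created above.

# A sketch: construct the indicator with map() instead of merge().
pre_years = arrests["YEAR"].isin([2007, 2008, 2009])

# County-level average drug arrest rate for 2007-2009.
county_pre_avg = arrests.loc[pre_years].groupby("COUNTY")["drug_rate"].mean()

# Broadcast each county's pre-period average onto all of its rows,
# then flag counties above the cross-county median.
arrests["treated_alt"] = (
    arrests["COUNTY"].map(county_pre_avg) > county_pre_avg.median()
).astype(int)

# Should agree with the merge-based indicator built above.
assert (arrests["treated_alt"] == arrests["treated"]).all()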
Exercise 3¶
Our outcome in this analysis is the violent arrest rate – if drug liberalization reduces crime overall, we would expect to see this rate fall in counties with high drug arrest rates after liberalization; if not, we would not expect to see any changes. Create a violent_rate variable, which is violent arrests per 100,000 people.
[8]:
# See above, already did!
Exercise 4¶
Difference-in-differences gets its name from the fact that the estimator, in its most basic implementation, is just the difference between:
the average change in outcome among eventually-treated units from before treatment is applied to after, and
the average change in outcome among never-treated units over the same period.
(Obviously treatment is never actually applied to the never-treated units; when we talk about pre / post for them, we mean before and after the point in time at which treatment is applied to the treated units. So if treated units are treated in 2008, then for the never-treated units we also compare outcomes before 2008 to outcomes after 2008, even though 2008 has no special significance for the never-treated units.)
In its most basic implementation, therefore, calculating a difference-in-differences estimate requires calculating just four numbers:
\(\bar y_{T=1,Post}\) Avg for Treatment, Post-Treatment
\(\bar y_{T=0,Post}\) Avg for Control, Post-Treatment
\(\bar y_{T=1,Pre}\) Avg for Treatment, Pre-Treatment
\(\bar y_{T=0,Pre}\) Avg for Control, Pre-Treatment
The difference-in-differences estimator \(\hat \delta\) is then defined as:

\(\hat \delta = (\bar y_{T=1,Post} - \bar y_{T=1,Pre}) - (\bar y_{T=0,Post} - \bar y_{T=0,Pre})\)
Using these four values, calculate (a) the change in violent arrest rates for our treated group from before legalization to after (\(\bar y_{T=1,Post} - \bar y_{T=1, Pre}\)), and (b) our difference-in-differences estimator \(\hat\delta\). Does your difference-in-differences estimate tell you something different from what you would have learned from a simple pre-post comparison?

For the Pre period, consider the three years before liberalization began in 2010 (i.e., 2007-2009). For the Post period, consider the three years after full legalization took place (2016-2018). We will ignore the middle period, during which marijuana was decriminalized but not yet fully legal.
[9]:
arrests["post_2010"] = (arrests.YEAR >= 2010).astype("int")
arrests_sub = arrests[arrests.YEAR.isin([2007, 2008, 2009, 2016, 2017, 2018])]
y_t1_post = arrests_sub.loc[
(arrests_sub["treated"] == 1) & (arrests_sub["post_2010"] == 1), "violent_rate"
].mean()
y_t1_pre = arrests_sub.loc[
(arrests_sub["treated"] == 1) & (arrests_sub["post_2010"] == 0), "violent_rate"
].mean()
y_t0_post = arrests_sub.loc[
(arrests_sub["treated"] == 0) & (arrests_sub["post_2010"] == 1), "violent_rate"
].mean()
y_t0_pre = arrests_sub.loc[
(arrests_sub["treated"] == 0) & (arrests_sub["post_2010"] == 0), "violent_rate"
].mean()
[10]:
y_t1_post - y_t1_pre
[10]:
-26.799650070477355
[11]:
(y_t1_post - y_t1_pre) - (y_t0_post - y_t0_pre)
[11]:
-7.418061484004568
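For reference, the same four cell means can also be pulled out with a single groupby. This is just a sketch of an equivalent calculation (the name means is purely illustrative):

# A sketch: all four group means at once, then the diff-in-diff by hand.
means = (
    arrests_sub.groupby(["treated", "post_2010"])["violent_rate"]
    .mean()
    .unstack("post_2010")
)

# Rows are treated (0/1), columns are post_2010 (0/1).
(means.loc[1, 1] - means.loc[1, 0]) - (means.loc[0, 1] - means.loc[0, 0])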
Exercise 5¶
Now calculate \(\hat\delta\) using a regression with an indicator for post-2010, an indicator for treated, and an interaction of the two. Use only the same set of years you used above. How does your estimate compare to the estimate you calculated in Exercise 4?
What does this tell you about interpretation of interaction terms with two indicator variables?
Note: You need to cluster your standard errors by county, since we expect observations from the same county (over time) to be subject to common fluctuations.
[12]:
import statsmodels.formula.api as smf
m = smf.ols("violent_rate ~ post_2010 * treated", arrests_sub).fit()
m.get_robustcov_results(cov_type="cluster", groups=arrests_sub.COUNTY).summary()
[12]:
                            OLS Regression Results
==============================================================================
Dep. Variable:           violent_rate   R-squared:                       0.221
Model:                            OLS   Adj. R-squared:                  0.214
Method:                 Least Squares   F-statistic:                     11.00
Date:                Sun, 12 Mar 2023   Prob (F-statistic):           8.45e-06
Time:                        14:57:47   Log-Likelihood:                -2094.1
No. Observations:                 348   AIC:                             4196.
Df Residuals:                     344   BIC:                             4212.
Df Model:                           3
Covariance Type:              cluster
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept           319.7820     17.638     18.131      0.000     284.463     355.101
post_2010           -19.3816      9.892     -1.959      0.055     -39.189       0.426
treated             106.8289     23.385      4.568      0.000      60.001     153.657
post_2010:treated    -7.4181     18.869     -0.393      0.696     -45.203      30.367
==============================================================================
Omnibus:                       53.945   Durbin-Watson:                   0.741
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               81.621
Skew:                           0.965   Prob(JB):                     1.89e-18
Kurtosis:                       4.380   Cond. No.                         6.85
==============================================================================
Notes:
[1] Standard Errors are robust to cluster correlation (cluster)
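As an aside, the same clustered standard errors can be requested directly in .fit() by passing cov_type and cov_kwds, rather than calling get_robustcov_results() afterwards. A minimal sketch, using statsmodels' cluster-robust covariance options:

# A sketch: cluster-robust SEs by county requested directly in .fit().
m_alt = smf.ols("violent_rate ~ post_2010 * treated", arrests_sub).fit(
    cov_type="cluster", cov_kwds={"groups": arrests_sub["COUNTY"]}
)
m_alt.summary()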
Exercise 6¶
In the preceding exercise, we did a simple pre-post / treated-control comparison. But one important limitation of these designs is that they do not allow us to test for parallel trends.
Plot a difference-in-differences model using data from 2000-2009 (inclusive) and from 2016-2018 (inclusive). Note this plot will have four distinct geometric components: a time trend for treated counties pre-2010, a time trend for control counties pre-2010, a time trend for treated counties post-2016 (including 2016), and a time trend for control counties post-2016 (including 2016).

Do you see evidence of parallel trends for the treated and control counties? Does that make you feel more or less confident in your diff-in-diff estimates?
[13]:
import altair as alt
alt.data_transformers.enable("data_server")
arrests_long = arrests[
arrests.YEAR.isin(list(range(2000, 2010)) + list(range(2016, 2019)))
]
[14]:
# Average violent arrest rate by year for treated and control counties.
grouped_means = arrests_long.groupby(["treated", "YEAR"], as_index=False)[
    ["violent_rate"]
].mean()
scatter = (
alt.Chart(grouped_means)
.mark_point()
.encode(
x=alt.X("YEAR:Q", scale=alt.Scale(zero=False)),
y=alt.Y("violent_rate:Q", scale=alt.Scale(zero=False)),
color="treated:N",
)
)
scatter
[14]: (Altair output: scatter of average violent_rate by YEAR, colored by treated)
[15]:
# Fit a separate linear trend for each treated x pre/post combination.
trend_lines = []
colors = {0: "lightblue", 1: "orange"}
for t, p in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    c = (
        alt.Chart(
            arrests_long[
                (arrests_long["treated"] == t) & (arrests_long["post_2010"] == p)
            ]
        )
        .encode(
            x=alt.X("YEAR:Q", scale=alt.Scale(zero=False)),
            y=alt.Y("violent_rate:Q", scale=alt.Scale(zero=False)),
        )
        .transform_regression("YEAR", "violent_rate")
        .mark_line(color=colors[t])
    )
    trend_lines.append(c)
alt.layer(*trend_lines, scatter)
[15]: (Altair output: the four fitted trend lines layered over the scatter)
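For readers who prefer matplotlib, the sketch below draws the same four group-specific linear trends using numpy.polyfit. It is purely illustrative (the Altair chart above is the intended solution) and assumes matplotlib and numpy are available:

# A sketch: the same four group-specific linear trends, drawn with matplotlib.
import matplotlib.pyplot as plt
import numpy as np

palette = {0: "lightblue", 1: "orange"}
fig, ax = plt.subplots()

for t in [0, 1]:
    for p in [0, 1]:
        sub = arrests_long[
            (arrests_long["treated"] == t) & (arrests_long["post_2010"] == p)
        ]
        # Simple linear trend of violent_rate on YEAR within this group.
        slope, intercept = np.polyfit(sub["YEAR"], sub["violent_rate"], 1)
        years = np.array([sub["YEAR"].min(), sub["YEAR"].max()])
        ax.plot(years, slope * years + intercept, color=palette[t])

    # Group-year average points, as in the Altair scatter above.
    pts = grouped_means[grouped_means["treated"] == t]
    ax.scatter(pts["YEAR"], pts["violent_rate"], color=palette[t], label=f"treated={t}")

ax.set_xlabel("YEAR")
ax.set_ylabel("violent_rate")
ax.legend()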
Exercise 7¶
While we can estimate the model described above precisely as a regression, it’s actually much easier to estimate a more flexible model by running the regression we ran in Exercise 5 with both county and year fixed effects. Use PanelOLS (or lfe in R) to estimate this fixed effects regression.
With all these additional fixed effects, do you find evidence that marijuana legalization reduced violent crime?
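For intuition, the “precisely as a regression” version mentioned above can be written out in plain statsmodels with explicit county and year dummies. This is only a sketch (not required for the exercise); because the dummies absorb the treated and post_2010 main effects, only the interaction term is included, and its point estimate should match the PanelOLS interaction coefficient below (clustered standard errors may differ slightly due to degrees-of-freedom corrections).

# A sketch: two-way fixed effects written out with explicit dummies.
import statsmodels.formula.api as smf

twfe_dummies = smf.ols(
    "violent_rate ~ treated:post_2010 + C(COUNTY) + C(YEAR)",
    data=arrests_sub,
).fit(cov_type="cluster", cov_kwds={"groups": arrests_sub["COUNTY"]})

twfe_dummies.params["treated:post_2010"]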
[16]:
# NOTE TO GRADERS: It's not clear which dataset people will do this
# with -- the data from Exercise 5, or the longer dataset from the previous question.
# Both are fine. Answers for both are below.

# NOTE TO GRADERS 2: I _thought_ that PanelOLS clustered errors automatically.
# It does NOT. You have to tell it to cluster in the
# .fit() method, e.g., `.fit(cov_type="clustered", cluster_entity=True)`.
# I misled some students on this (in 2023), so if they don't
# do this and get smaller SEs, that's not their fault.
[17]:
from linearmodels import PanelOLS
arrests_for_panelols = arrests_sub.set_index(["COUNTY", "YEAR"])
[18]:
mod = PanelOLS.from_formula(
"violent_rate ~ treated * post_2010 + EntityEffects + TimeEffects",
data=arrests_for_panelols,
drop_absorbed=True,
).fit(cov_type="clustered", cluster_entity=True, cluster_time=True)
mod.summary
/var/folders/fs/h_8_rwsn5hvg9mhp0txgc_s9v6191b/T/ipykernel_84356/3310218310.py:1: AbsorbingEffectWarning:
Variables have been fully absorbed and have removed from the regression:
post_2010, treated
mod = PanelOLS.from_formula(
[18]:
                          PanelOLS Estimation Summary
================================================================================
Dep. Variable:           violent_rate   R-squared:                        0.0013
Estimator:                   PanelOLS   R-squared (Between):             -0.0109
No. Observations:                 348   R-squared (Within):               0.0155
Date:                Sun, Mar 12 2023   R-squared (Overall):             -0.0104
Time:                        14:57:47   Log-likelihood                   -1858.7
Cov. Estimator:             Clustered
                                        F-statistic:                      0.3829
Entities:                          58   P-value                           0.5366
Avg Obs:                       6.0000   Distribution:                   F(1,284)
Min Obs:                       6.0000
Max Obs:                       6.0000   F-statistic (robust):             0.1899
                                        P-value                           0.6633
Time periods:                       6   Distribution:                   F(1,284)
Avg Obs:                       58.000
Min Obs:                       58.000
Max Obs:                       58.000

                                 Parameter Estimates
=====================================================================================
                   Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-------------------------------------------------------------------------------------
post_2010:treated    -7.4181     17.023    -0.4358     0.6633     -40.925      26.089
=====================================================================================

F-test for Poolability: 17.282
P-value: 0.0000
Distribution: F(62,284)

Included effects: Entity, Time
[19]:
# With longer data:
arrests_for_panelols = arrests_long.set_index(["COUNTY", "YEAR"])
arrests_long.YEAR.value_counts()
[19]:
2000 58
2001 58
2002 58
2003 58
2004 58
2005 58
2006 58
2007 58
2008 58
2009 58
2016 58
2017 58
2018 58
Name: YEAR, dtype: int64
[20]:
# Model with county (entity) and year (time) fixed effects.
mod = PanelOLS.from_formula(
"violent_rate ~ treated * post_2010 + TimeEffects + EntityEffects",
data=arrests_for_panelols,
drop_absorbed=True,
)
print(mod.fit(cov_type="clustered", cluster_entity=True, cluster_time=True))
PanelOLS Estimation Summary
================================================================================
Dep. Variable: violent_rate R-squared: 0.0097
Estimator: PanelOLS R-squared (Between): -0.0168
No. Observations: 754 R-squared (Within): 0.0588
Date: Sun, Mar 12 2023 R-squared (Overall): -0.0150
Time: 14:57:48 Log-likelihood -4108.4
Cov. Estimator: Clustered
F-statistic: 6.6568
Entities: 58 P-value 0.0101
Avg Obs: 13.000 Distribution: F(1,683)
Min Obs: 13.000
Max Obs: 13.000 F-statistic (robust): 2.6634
P-value 0.1031
Time periods: 13 Distribution: F(1,683)
Avg Obs: 58.000
Min Obs: 58.000
Max Obs: 58.000
Parameter Estimates
=====================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
-------------------------------------------------------------------------------------
post_2010:treated -26.362 16.153 -1.6320 0.1031 -58.078 5.3540
=====================================================================================
F-test for Poolability: 41.619
P-value: 0.0000
Distribution: F(69,683)
Included effects: Entity, Time
/var/folders/fs/h_8_rwsn5hvg9mhp0txgc_s9v6191b/T/ipykernel_84356/3654453326.py:7: AbsorbingEffectWarning:
Variables have been fully absorbed and have removed from the regression:
post_2010, treated
print(mod.fit(cov_type="clustered", cluster_entity=True, cluster_time=True))