Matching Exercise

In this exercise, we’ll use matching to evaluate how getting a college degree affects earnings in the US.

Matching Packages: Python v. R

Just as the best tools for machine learning tend to be in Python since they’re developed by CS people (who prefer Python), most of the best tools for causal inference are implemented in R since innovation in causal inference tends to be led by social scientists using R. As a result, the most well-developed matching package is called MatchIt, and it is only available in R (though you can always call it from Python using rpy2).

In the last couple of years, though, a group of computer scientists and statisticians here at Duke have made some great advances in matching (especially on the computational side of things), and they recently released a set of matching packages in both R and Python. They have some great algorithms we’ll use today, but be aware these packages aren’t as mature and aren’t general-purpose packages yet. So if you ever get deep into matching, be aware you will probably still want to make at least partial use of the R package MatchIt, as well as some other R packages for newer techniques, like Matching Frontier estimation or Adaptive Hyper-Box Matching.

Installing dame-flame

For this lesson, begin by installing dame-flame with pip install dame-flame (it’s not on conda yet).

DAME is an algorithm that we can use for a version of coarsened exact matching. The package only accepts a list of categorical variables, and then attempts to match observations that match exactly on those variables. That means that if you want to match on, say, age, you have to break it up into categories (say, under 18, 18-29, 30-39, etc.).
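For example, here’s one way you might bin a continuous age variable into the kind of categories DAME needs. This is only a toy sketch (the dataframe and bin edges are made up for illustration); Exercise 4 below has you do this for real:

import pandas as pd

# A minimal sketch: bin a toy age column into categories for exact matching.
toy = pd.DataFrame({"age": [17, 22, 34, 41, 58]})
toy["age_bin"] = pd.cut(
    toy["age"],
    bins=[0, 18, 30, 40, 50, 65, 120],
    right=False,
    labels=["under 18", "18-29", "30-39", "40-49", "50-64", "65+"],
)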

Of course, one cannot always find exact matches on all variables, so what DAME does is:

  1. Finds all observations that match exactly on all of the matching variables.

  2. Figures out which matching variable is least useful in predicting the outcome of interest \(Y\), drops it, and then tries to match the remaining observations on the narrowed set of matching variables.

  3. Repeats until it runs out of variables, all observations are matched, or it hits a stopping rule (namely: the quality of matches falls below a threshold).
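To make that loop concrete, here’s a schematic sketch in pandas. This is not the real DAME algorithm (dame_flame searches over covariate subsets and is vastly more efficient), and the “least useful” step is just a placeholder here; how DAME actually ranks variables is discussed in the interpretation section below. All names are hypothetical.

import pandas as pd

def toy_almost_exact_match(df, covariates, treat_col):
    covs = list(covariates)
    unmatched = df.copy()
    match_groups = []
    while covs and unmatched[treat_col].sum() > 0:
        # 1. Group units that agree exactly on the current covariates and
        #    contain at least one treated AND one control unit.
        for _, group in unmatched.groupby(covs):
            if group[treat_col].nunique() == 2:
                match_groups.append(group)
                unmatched = unmatched.drop(group.index)
        # 2. Drop the "least useful" covariate and try again. (Real DAME ranks
        #    covariates by how well they predict the outcome -- see below;
        #    here we just drop the last one as a placeholder.)
        covs.pop()
    return match_groups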

In addition, the lab has also created FLAME, which does the same thing but employs some tricks to make it massively more computationally efficient, meaning it can be used on datasets with millions of observations (which most matching algorithms cannot). It’s a little less accurate, but an amazing contribution nevertheless.

Data Setup

To save you some time and let you focus on matching, I’ve pre-cleaned about one month’s worth of data from the US Current Population Survey data we used for our gender discrimination analysis. You can download the data from here, or read it directly with:

cps = pd.read_stata(
    "https://github.com/nickeubank/MIDS_Data/blob/master"
    "/Current_Population_Survey/cps_for_matching.dta?raw=true"
)

Load the data and quickly familiarize yourself with its contents.

[1]:
# Load critical packages
import pandas as pd
import numpy as np
import dame_flame
[2]:
# Load our Current Population Survey data
# a regular survey of US citizens

cps = pd.read_stata(
    "https://github.com/nickeubank/MIDS_Data/blob/master"
    "/Current_Population_Survey/cps_for_matching.dta?raw=true"
)
[3]:
# Take a look at the data
cps.head()
[3]:
index annual_earnings female simplified_race has_college age county class94
0 151404 NaN 1 3.0 1 30 0-WV Private, For Profit
1 123453 NaN 0 0.0 0 21 251-TX Private, For Profit
2 187982 NaN 0 0.0 0 40 5-MA Self-Employed, Unincorporated
3 122356 NaN 1 0.0 1 27 0-TN Private, Nonprofit
4 210750 42900.0 1 0.0 0 52 0-IA Private, For Profit

Getting To Know Your Data

Before you start matching, it is important to examine your data to ensure that matching is feasible (you have some overlap in the features of people in the treated and untreated groups), and also that there is a reason to match: either you’re unsure about some of the functional forms at play, or you have some imbalance between the two groups.

Exercise 1

Show the raw difference of annual_earnings between those with and without a college degree (has_college). Is the difference statistically significant?

[4]:
import statsmodels.formula.api as smf

smf.ols("annual_earnings ~ has_college", cps).fit().summary()
[4]:
OLS Regression Results
Dep. Variable: annual_earnings R-squared: 0.063
Model: OLS Adj. R-squared: 0.063
Method: Least Squares F-statistic: 370.2
Date: Sat, 04 Mar 2023 Prob (F-statistic): 6.56e-80
Time: 13:59:39 Log-Likelihood: -63018.
No. Observations: 5515 AIC: 1.260e+05
Df Residuals: 5513 BIC: 1.261e+05
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3.887e+04 336.007 115.669 0.000 3.82e+04 3.95e+04
has_college 1.416e+04 735.820 19.242 0.000 1.27e+04 1.56e+04
Omnibus: 2214.375 Durbin-Watson: 1.974
Prob(Omnibus): 0.000 Jarque-Bera (JB): 10578.287
Skew: 1.910 Prob(JB): 0.00
Kurtosis: 8.608 Cond. No. 2.59


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[5]:
# About 14,000 a year, and it's very significant.

Exercise 2

Next we can check for balance. Check the share of people in different racial groups who have college degrees. Are those differences statistically significant?

Race is coded as White Non-Hispanic (0), Black Non-Hispanic (1), Hispanic (2), Other (3).

Does the distribution also look different across counties (I don’t need statistical significance for this)?

Does the data seem balanced?

[6]:
# This question wording is, admittedly, a little iffy.
# Basically, while we want frequency tables to do our chi2
# test, I know _I_ can't look at a frequency table and have
# any sense of whether the groups are actually balanced.
# So I like to see shares with my eyes, then use freq table to test.
[7]:
# One easy way to get differences in shares (and bi-variate significance)
smf.ols("has_college ~ C(simplified_race)", cps).fit().summary()
[7]:
OLS Regression Results
Dep. Variable: has_college R-squared: 0.032
Model: OLS Adj. R-squared: 0.032
Method: Least Squares F-statistic: 122.1
Date: Sat, 04 Mar 2023 Prob (F-statistic): 7.74e-78
Time: 13:59:39 Log-Likelihood: -7675.0
No. Observations: 11150 AIC: 1.536e+04
Df Residuals: 11146 BIC: 1.539e+04
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.4382 0.006 79.420 0.000 0.427 0.449
C(simplified_race)[T.1.0] -0.1206 0.016 -7.507 0.000 -0.152 -0.089
C(simplified_race)[T.2.0] -0.2398 0.014 -17.682 0.000 -0.266 -0.213
C(simplified_race)[T.3.0] 0.0367 0.016 2.261 0.024 0.005 0.069
Omnibus: 46681.807 Durbin-Watson: 1.965
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1670.333
Skew: 0.377 Prob(JB): 0.00
Kurtosis: 1.261 Cond. No. 3.97


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[8]:
# Or just groupby:

cps.groupby("simplified_race")["has_college"].mean()

[8]:
simplified_race
0.0    0.438205
1.0    0.317647
2.0    0.198413
3.0    0.474900
Name: has_college, dtype: float64
[9]:
# Then for statistical significance:
ctab = pd.crosstab(cps["simplified_race"], cps["has_college"])
ctab
[9]:
has_college 0 1
simplified_race
0.0 4282 3340
1.0 696 324
2.0 1212 300
3.0 523 473
[10]:
import scipy.stats

chi2, p, dof, expected = scipy.stats.chi2_contingency(ctab.values)
p
[10]:
1.2993875943569016e-76
[11]:
# Insanely significant. :)
[12]:
# And look at counties.
cps.groupby("county")[["has_college"]].describe().sort_values(("has_college", "mean"))

[12]:
has_college
count mean std min 25% 50% 75% max
county
71-MO 4.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
700-VA 3.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
69-NY 1.0 0.000000 NaN 0.0 0.0 0.0 0.0 0.0
69-FL 4.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
17-MD 2.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ...
75-CA 17.0 0.882353 0.332106 0.0 1.0 1.0 1.0 1.0
21-NJ 4.0 1.000000 0.000000 1.0 1.0 1.0 1.0 1.0
19-NJ 2.0 1.000000 0.000000 1.0 1.0 1.0 1.0 1.0
81-IN 1.0 1.000000 NaN 1.0 1.0 1.0 1.0 1.0
171-MN 1.0 1.000000 NaN 1.0 1.0 1.0 1.0 1.0

326 rows × 8 columns

[13]:
cps.groupby("county")[["has_college"]].mean().describe()
[13]:
has_college
count 326.000000
mean 0.390058
std 0.207616
min 0.000000
25% 0.262319
50% 0.375000
75% 0.500000
max 1.000000
[14]:
# Good in the middle, but many counties have no college grads, and a few only have college grads.

Exercise 3

One of the other advantages of matching is that even when you have balanced data, you don’t have to go through the process of testing out different functional forms to see which fits the data best.

In our last exercise, we looked at the relationship between gender and earnings “controlling for age”, where we just put in age as a linear control. Plot a non-linear regression of annual_earnings on age (if you’re using plotnine, use geom_smooth(method="lowess") — if you’re using altair, use transform_loess (tutorial examples here)).

Does the relationship look linear?

Does this speak to why it’s nice to not have to think about functional forms with matching as much?

[15]:
import altair as alt

alt.data_transformers.enable("data_server")
alt.Chart(cps).encode(x="age", y="annual_earnings").transform_loess(
    on="age", loess="annual_earnings"
).mark_line()
[15]:
[16]:
# Not even remotely linear. Thank goodness we don't have to worry about that with matching!
# Though it wouldn't be *that* hard to fit a quadratic.

Matching!

Because DAME is an implementation of exact matching, we have to discretize all of our continuous variables. Thankfully, in this case we only have age, so this shouldn’t be too hard!

Exercise 4

Create a new variable that discretizes age into a single value for each decade of age.

Because CPS only has employment data on people 18 or over, though, include people who are 18 or 19 with the 20 year olds so that group isn’t too small, and if you see any other really small groups, please merge those too.

[17]:
cps["discretized_age"] = cps.age // 10
cps.loc[cps["discretized_age"] == 1, "discretized_age"] = 2
cps["discretized_age"].value_counts()
[17]:
3    2760
4    2551
5    2397
2    1990
6    1236
7     173
8      43
Name: discretized_age, dtype: int64
[18]:
# 70 and 80 year olds are tiny groups.
cps.loc[cps["discretized_age"] == 8, "discretized_age"] = 7

Exercise 5

We also have to convert our string variables into numeric variables for DAME, so convert county and class94 to numeric vectors of integers.

(Note: it’s not clear whether class94 belongs: if it reflects people choosing fields based on passion, it belongs; if people choose certain jobs because of their degrees, it’s not something we’d actually want in our regression.)

Hint: if you use pd.Categorical to convert your variable to a categorical, you can pull the underlying integer codes with .codes.
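As a toy illustration of that hint (the list here is made up):

import pandas as pd

pd.Categorical(["a", "b", "a"]).codes  # -> array([0, 1, 0])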

[19]:
cps["county"] = pd.Categorical(cps["county"]).codes
cps["county"].value_counts()
[19]:
41     576
200    275
12     230
33     225
51     223
      ...
122      1
263      1
285      1
154      1
213      1
Name: county, Length: 326, dtype: int64
[20]:
cps["class94"] = pd.Categorical(cps["class94"]).codes
cps["class94"].value_counts()
[20]:
3    7809
1     740
4     706
2     615
6     552
5     387
0     337
7       4
Name: class94, dtype: int64

Let’s Do Matching with DAME

Exercise 6

First, drop all the variables you don’t want in matching (e.g. your original age variable), and any observations for which annual_earnings is missing.

You will probably also have to drop a column named index: DAME will try and match on ANY included variables, and so because there was a column called index in the data we imported, if we leave it in DAME will try (and obviously fail) to match on index.

Also, it’s best to reset your index, as dame_flame uses index labels (i.e., the values in df.index) to identify matches, so you want to be sure those are unique.

[21]:
for_matching = cps.drop(["age", "index"], axis="columns")
for_matching = for_matching[for_matching.annual_earnings.notnull()]
for_matching = for_matching.reset_index(drop=True)
for_matching
[21]:
annual_earnings female simplified_race has_college county class94 discretized_age
0 42900.0 1 0.0 0 10 3 5
1 31200.0 0 2.0 0 31 3 3
2 20020.0 0 0.0 1 8 3 6
3 22859.2 0 0.0 0 44 1 4
4 73860.8 0 0.0 1 24 3 3
... ... ... ... ... ... ... ...
5510 33800.0 1 3.0 0 247 3 3
5511 23920.0 0 3.0 0 272 3 5
5512 31200.0 0 2.0 0 246 3 2
5513 37440.0 0 0.0 0 99 3 2
5514 26000.0 0 1.0 0 23 2 5

5515 rows × 7 columns

Exercise 7

The syntax of dame_flame is similar to the syntax of sklearn. If you start with a dataset called my_data containing a treat column with the treatment assignment and an outcome column with the outcome of interest (\(Y\)), the syntax for basic matching would be:

import dame_flame
model = dame_flame.matching.DAME(repeats=False, verbose=3, want_pe=True)
model.fit(
    my_data,
    treatment_column_name="treat",
    outcome_column_name="outcome",
)
result = model.predict(my_data)

Where the arguments:

  • repeats=False says that I only want each observation to get matched once. We’ll talk about what happens if we use repeats=True below.

  • verbose=3 tells DAME to report everything it’s doing as it goes.

  • want_pe says “please include the predictive error in your printout at each step”. This is a measure of match quality.

So run DAME on your data!

[22]:
model = dame_flame.matching.DAME(repeats=False, verbose=3, want_pe=True)
model.fit(
    for_matching,
    treatment_column_name="has_college",
    outcome_column_name="annual_earnings",
)
result = model.predict(for_matching)
Completed iteration 0 of matching
        Number of matched groups formed in total:  370
        Unmatched treated units:  644 out of a total of  1150 treated units
        Unmatched control units:  3187 out of a total of  4365 control units
        Number of matches made this iteration:  1684
        Number of matches made so far:  1684
        Covariates dropped so far:  set()
        Predictive error of covariate set used to match:  1199312680.0957854
Completed iteration 1 of matching
        Number of matched groups formed in total:  494
        Unmatched treated units:  25 out of a total of  1150 treated units
        Unmatched control units:  180 out of a total of  4365 control units
        Number of matches made this iteration:  3626
        Number of matches made so far:  5310
        Covariates dropped so far:  frozenset({'county'})
        Predictive error of covariate set used to match:  1199421883.1095908
Completed iteration 2 of matching
        Number of matched groups formed in total:  494
        Unmatched treated units:  25 out of a total of  1150 treated units
        Unmatched control units:  180 out of a total of  4365 control units
        Number of matches made this iteration:  0
        Number of matches made so far:  5310
        Covariates dropped so far:  frozenset({'simplified_race'})
        Predictive error of covariate set used to match:  1204727749.8949614
Completed iteration 3 of matching
        Number of matched groups formed in total:  505
        Unmatched treated units:  8 out of a total of  1150 treated units
        Unmatched control units:  129 out of a total of  4365 control units
        Number of matches made this iteration:  68
        Number of matches made so far:  5378
        Covariates dropped so far:  frozenset({'county', 'simplified_race'})
        Predictive error of covariate set used to match:  1204742613.479154
Completed iteration 4 of matching
        Number of matched groups formed in total:  505
        Unmatched treated units:  8 out of a total of  1150 treated units
        Unmatched control units:  129 out of a total of  4365 control units
        Number of matches made this iteration:  0
        Number of matches made so far:  5378
        Covariates dropped so far:  frozenset({'class94'})
        Predictive error of covariate set used to match:  1205072671.3262901
Completed iteration 5 of matching
        Number of matched groups formed in total:  508
        Unmatched treated units:  5 out of a total of  1150 treated units
        Unmatched control units:  120 out of a total of  4365 control units
        Number of matches made this iteration:  12
        Number of matches made so far:  5390
        Covariates dropped so far:  frozenset({'class94', 'county'})
        Predictive error of covariate set used to match:  1205171280.4727237
Completed iteration 6 of matching
        Number of matched groups formed in total:  509
        Unmatched treated units:  4 out of a total of  1150 treated units
        Unmatched control units:  119 out of a total of  4365 control units
        Number of matches made this iteration:  2
        Number of matches made so far:  5392
        Covariates dropped so far:  frozenset({'class94', 'simplified_race'})
        Predictive error of covariate set used to match:  1210524158.7436352
Completed iteration 7 of matching
        Number of matched groups formed in total:  511
        Unmatched treated units:  0 out of a total of  1150 treated units
        Unmatched control units:  110 out of a total of  4365 control units
        Number of matches made this iteration:  13
        Number of matches made so far:  5405
        Covariates dropped so far:  frozenset({'class94', 'county', 'simplified_race'})
        Predictive error of covariate set used to match:  1210539313.933855
5405 units matched. We finished with no more treated units to match

Interpreting DAME output

The output you get from doing this should be reports from about 8 iterations of matching. In each iteration, you’ll see a description of the number of matches made in the iteration, the number of treatment units still unmatched, and the number of control units unmatched.

In the first iteration, the algorithm tries to match observations that match on all the variables in your data. That’s why in the first iteration, you see the set of variables being dropped is an empty set – it hasn’t dropped any variables:

Completed iteration 0 of matching
    Number of matched groups formed in total:  370
    Unmatched treated units:  644 out of a total of  1150 treated units
    Unmatched control units:  3187 out of a total of  4365 control units
    Number of matches made this iteration:  1684
    Number of matches made so far:  1684
    Covariates dropped so far:  set()
    Predictive error of covariate set used to match:  1199312680.0957854

(Note: depending on how you binned ages, you may get slightly different results than this.)

But as we can see from this output, the algorithm found 1,684 observations with perfect matches—treated and untreated units that agree exactly on all the variables we included. But we also see we still have 644 unmatched treated units, so what do we do?

The answer is that if we want to match more of our treated observations, we have to try to match on a subset of our variables.

But what variable should we drop? This is the secret sauce of DAME. DAME picks the variables to drop by trying to predict our outcome \(Y\) using all our variables (by default using a ridge regression), then it drops the matching variable that is contributing the least to that prediction. Since our goal in matching is to eliminate baseline differences (\(E(Y_0|D=1) - E(Y_0|D=0)\)), dropping the covariates least related to \(Y\) makes sense.
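To see the intuition, here’s a rough sketch of that ranking logic using scikit-learn’s Ridge and the for_matching dataframe from above. This is only an illustration of the idea, not dame_flame’s actual internal code, and the numbers won’t match DAME’s reported prediction errors exactly:

import pandas as pd
from sklearn.linear_model import Ridge

covariates = ["female", "simplified_race", "county", "class94", "discretized_age"]

def ridge_mse(df, cols, outcome="annual_earnings"):
    # One-hot encode the (categorical) matching variables, fit a ridge
    # regression, and return the in-sample mean squared error.
    X = pd.get_dummies(df[cols].astype("category"), drop_first=True)
    fit = Ridge().fit(X, df[outcome])
    return ((df[outcome] - fit.predict(X)) ** 2).mean()

# How much does the fit degrade when each covariate is left out?
mse_without = {
    dropped: ridge_mse(for_matching, [c for c in covariates if c != dropped])
    for dropped in covariates
}

# The covariate whose removal hurts the prediction least is the one DAME
# should be willing to drop first (county, in the output above).
print(min(mse_without, key=mse_without.get))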

As a result, in the second iteration (called iteration 1, since it uses 0-based indexing), we see that the variable it drops first is county, and it’s subsequently able to make another 3,626 new matches on the remaining variables!

Completed iteration 1 of matching
    Number of matched groups formed in total:  494
    Unmatched treated units:  25 out of a total of  1150 treated units
    Unmatched control units:  180 out of a total of  4365 control units
    Number of matches made this iteration:  3626
    Number of matches made so far:  5310
    Covariates dropped so far:  frozenset({'county'})
    Predictive error of covariate set used to match:  1199421883.1095908

And so DAME continues until, after 8 iterations, it has matched all treated observations.

Exercise 8

Congratulations! You just ran your first one-to-many matching!

The next step is to think about which of the matches that DAME generated are good enough for inclusion in our analysis. As you may recall, one of the choices you have to make as a researcher when doing matching is how “good” a match has to be in order to be included in your final data set. By default, DAME will keep dropping matching variables until it has been able to match all the treated observations or runs out of variables. It will do this no matter how bad the matches start to become – if it ends up with a treated observation and a control observation that can only be matched on gender, it will match them just on gender, even though we probably don’t think that’s a “good” match.

The way to control this behavior is to tell DAME manually when to stop using the early_stop_iterations argument.

So when is a good time to stop? There’s no objective or “right” answer to that question. It fundamentally comes down to a trade-off between bias (which gets higher if you allow more low-quality matches into your data) and variance (which will go down as you increase the number of matches you keep).

But one way to start the process of picking a cut point is to examine how the quality of matches evolves over iterations. DAME keeps this information in model.pe_each_iter. This shows, for each iteration, the “prediction error” of the set of matching variables used for matching in that step. This “prediction error” is the mean-squared error of regressing \(Y\) on the subset of matching variables being used in a given iteration (by default in a ridge regression). By design, of course, this is always increasing.

To see how this evolves, plot your pe against iteration numbers. You can also see the pe values for each iteration reported in the output from when DAME ran above if you want to make sure you’re lining up the errors with iterations right.

Are there any points where the match quality seems to fall off dramatically?

[23]:
model.pe_each_iter

[23]:
[1199312680.0957854,
 1199421883.1095908,
 1204727749.8949614,
 1204742613.479154,
 1205072671.3262901,
 1205171280.4727237,
 1210524158.7436352,
 1210539313.933855]
[24]:
for_pe = pd.DataFrame(
    {"pe": model.pe_each_iter, "i": range(0, len(model.pe_each_iter))}
)
for_pe
[24]:
pe i
0 1.199313e+09 0
1 1.199422e+09 1
2 1.204728e+09 2
3 1.204743e+09 3
4 1.205073e+09 4
5 1.205171e+09 5
6 1.210524e+09 6
7 1.210539e+09 7
[25]:
alt.Chart(for_pe).encode(x="i", y=alt.Y("pe", scale=alt.Scale(zero=False))).mark_line()
[25]:
[26]:
# Yup! Iterations 2 and 6 are really the big ones...

Exercise 9

Suppose we want to ensure we have at least 5,000 observations in our matched data—where might you cut off the data to get a sample size of at least that but before a big quality falloff?

[27]:
# I'd stop after iteration 1 (the second iteration)—things fall off fast
# starting after that, but with very few added matches.

Exercise 10

Re-run your matching, stopping at the point you picked above using early_stop_iterations.

[28]:
model = dame_flame.matching.DAME(
    repeats=False, verbose=3, want_pe=True, early_stop_iterations=1
)
model.fit(
    for_matching,
    treatment_column_name="has_college",
    outcome_column_name="annual_earnings",
)
result = model.predict(for_matching)
Completed iteration 0 of matching
        Number of matched groups formed in total:  370
        Unmatched treated units:  644 out of a total of  1150 treated units
        Unmatched control units:  3187 out of a total of  4365 control units
        Number of matches made this iteration:  1684
        Number of matches made so far:  1684
        Covariates dropped so far:  set()
        Predictive error of covariate set used to match:  1199312680.0957854
Completed iteration 1 of matching
        Number of matched groups formed in total:  494
        Unmatched treated units:  25 out of a total of  1150 treated units
        Unmatched control units:  180 out of a total of  4365 control units
        Number of matches made this iteration:  3626
        Number of matches made so far:  5310
        Covariates dropped so far:  frozenset({'county'})
        Predictive error of covariate set used to match:  1199421883.1095908
5310 units matched. We stopped after iteration 1

Getting Back a Dataset

OK, my one current complaint with DAME is that it doesn’t just give you back a nice dataset of your matches for analysis. If we look at our result (the output of model.predict()), it’s almost what we want, except it’s dropped our treatment and outcome columns, and it’s put a string * in any entry where a value wasn’t used for matching:

  female simplified_race   county   class94   discretized_age
0  1.0     0.0              10.0      3.0          5.0
1  0.0     2.0              *         3.0          3.0
2  0.0     0.0              8.0        3.0         6.0
3  0.0     0.0              *         1.0          4.0
4  0.0     0.0              24.0      3.0          3.0

So for now (though I think this will get updated in the package), we’ll have to do it ourselves! Just copy-paste this:

def get_dataframe(model, result_of_fit):

    # Get original data
    better = model.input_data.loc[result_of_fit.index]
    if not better.index.is_unique:
        raise ValueError("Need index values in input data to be unique")

    # Get match groups for clustering
    better["match_group"] = np.nan
    better["match_group_size"] = np.nan
    for idx, group in enumerate(model.units_per_group):
        better.loc[group, "match_group"] = idx
        better.loc[group, "match_group_size"] = len(group)

    # Get weights. I THINK this is right?! At least with repeats=False?
    t = model.treatment_column_name
    better["t_in_group"] = better.groupby("match_group")[t].transform(np.sum)

    # Make weights
    better["weights"] = np.nan
    better.loc[better[t] == 1, "weights"] = 1  # treatments are 1

    # Controls start as proportional to num of treatments
    # each observation is matched to.
    better.loc[better[t] == 0, "weights"] = better["t_in_group"] / (
        better["match_group_size"] - better["t_in_group"]
    )

    # Then re-normalize for num unique control observations.
    control_weights = better[better[t] == 0]["weights"].sum()

    num_control_obs = len(better[better[t] == 0].index.drop_duplicates())
    renormalization = num_control_obs / control_weights
    better.loc[better[t] == 0, "weights"] = (
        better.loc[better[t] == 0, "weights"] * renormalization
    )
    assert better.weights.notnull().all()

    better = better.drop(["t_in_group"], axis="columns")

    # Make sure right length and values!
    assert len(result_of_fit) == len(better)
    assert better.loc[better[t] == 0, "weights"].sum() == num_control_obs

    return better

Exercise 11

Copy-paste that code and run it with your (fitted) model and the result you got back from model.predict(). Then we’ll work with the output of that. You should get back a single dataframe of the same length as your matched result.

[29]:
result

[29]:
female simplified_race county class94 discretized_age
0 1.0 0.0 10.0 3.0 5.0
1 0.0 2.0 * 3.0 3.0
2 0.0 0.0 8.0 3.0 6.0
3 0.0 0.0 * 1.0 4.0
4 0.0 0.0 24.0 3.0 3.0
... ... ... ... ... ...
5509 0.0 0.0 * 3.0 6.0
5510 1.0 3.0 247.0 3.0 3.0
5511 0.0 3.0 * 3.0 5.0
5512 0.0 2.0 246.0 3.0 2.0
5513 0.0 0.0 99.0 3.0 2.0

5310 rows × 5 columns

[30]:
def get_dataframe(model, result_of_fit):

    # Get original data
    better = model.input_data.loc[result_of_fit.index]
    if not better.index.is_unique:
        raise ValueError("Need index values in input data to be unique")

    # Get match groups for clustering
    better["match_group"] = np.nan
    better["match_group_size"] = np.nan
    for idx, group in enumerate(model.units_per_group):
        better.loc[group, "match_group"] = idx
        better.loc[group, "match_group_size"] = len(group)

    # Get weights. I THINK this is right?! At least with repeats=False?
    t = model.treatment_column_name
    better["t_in_group"] = better.groupby("match_group")[t].transform(np.sum)

    # Make weights
    better["weights"] = np.nan
    better.loc[better[t] == 1, "weights"] = 1  # treatments are 1

    # Controls start as proportional to num of treatments
    # each observation is matched to.
    better.loc[better[t] == 0, "weights"] = better["t_in_group"] / (
        better["match_group_size"] - better["t_in_group"]
    )

    # Then re-normalize for num unique control observations.
    control_weights = better[better[t] == 0]["weights"].sum()

    num_control_obs = len(better[better[t] == 0].index.drop_duplicates())
    renormalization = num_control_obs / control_weights
    better.loc[better[t] == 0, "weights"] = (
        better.loc[better[t] == 0, "weights"] * renormalization
    )
    assert better.weights.notnull().all()

    better = better.drop(["t_in_group"], axis="columns")

    # Make sure right length and values!
    assert len(result_of_fit) == len(better)
    assert better.loc[better[t] == 0, "weights"].sum() == num_control_obs

    return better

[31]:
matched_data = get_dataframe(model, result)
[32]:
matched_data.head()
[32]:
annual_earnings female simplified_race has_college county class94 discretized_age match_group match_group_size weights
0 42900.0 1 0.0 0 10 3 5 59.0 5.0 0.930000
1 31200.0 0 2.0 0 31 3 3 411.0 108.0 0.070189
2 20020.0 0 0.0 1 8 3 6 52.0 3.0 1.000000
3 22859.2 0 0.0 0 44 1 4 424.0 28.0 1.240000
4 73860.8 0 0.0 1 24 3 3 106.0 7.0 1.000000

Check Your Matches and Analyze

Exercise 12

We previously tested balance on simplified_race, and by county. Check those again. Are there still statistically significant differences in college education by simplified_race?

Note that when you test for this, you’ll need to take into account the weights column you got back from get_dataframe. What DAME does is not actually the 1-to-1 matching described in our readings – instead, it puts all the observations that match exactly (however many there are) into the same “group”. (These groups are identified in the dataframe you got from get_dataframe by the column match_group, and the size of each group is in match_group_size.)

So to analyze the data, you need to use the wls (weighted least squares) function in statsmodels. For example, if your data is called matched_data, you might run:

smf.wls(
    "has_college ~ C(simplified_race)", matched_data, weights=matched_data["weights"]
).fit().summary()
[33]:
import statsmodels.formula.api as smf

smf.wls(
    "has_college ~ C(simplified_race)", matched_data, weights=matched_data["weights"]
).fit().summary()
[33]:
WLS Regression Results
Dep. Variable: has_college R-squared: 0.000
Model: WLS Adj. R-squared: -0.001
Method: Least Squares F-statistic: 1.134e-12
Date: Sat, 04 Mar 2023 Prob (F-statistic): 1.00
Time: 13:59:41 Log-Likelihood: -3736.0
No. Observations: 5310 AIC: 7480.
Df Residuals: 5306 BIC: 7506.
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.2119 0.007 31.608 0.000 0.199 0.225
C(simplified_race)[T.1.0] 3.469e-17 0.018 1.92e-15 1.000 -0.036 0.036
C(simplified_race)[T.2.0] -5.378e-17 0.019 -2.86e-15 1.000 -0.037 0.037
C(simplified_race)[T.3.0] 1.18e-16 0.020 5.83e-15 1.000 -0.040 0.040
Omnibus: 860.389 Durbin-Watson: 2.000
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1353.227
Skew: 1.234 Prob(JB): 1.41e-294
Kurtosis: 2.851 Cond. No. 3.95


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Exercise 13

Now use a weighted least squares regression on your matched data to regress annual earnings on just having a college education. What is the apparent effect of a BA? How does that compare to our initial estimate using the raw CPS data (before matching)?

[34]:
smf.wls(
    "annual_earnings ~ has_college", matched_data, weights=matched_data["weights"]
).fit().summary()
[34]:
WLS Regression Results
Dep. Variable: annual_earnings R-squared: 0.058
Model: WLS Adj. R-squared: 0.057
Method: Least Squares F-statistic: 324.1
Date: Sat, 04 Mar 2023 Prob (F-statistic): 2.19e-70
Time: 13:59:41 Log-Likelihood: -61753.
No. Observations: 5310 AIC: 1.235e+05
Df Residuals: 5308 BIC: 1.235e+05
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3.909e+04 351.293 111.287 0.000 3.84e+04 3.98e+04
has_college 1.374e+04 763.203 18.003 0.000 1.22e+04 1.52e+04
Omnibus: 2934.035 Durbin-Watson: 2.006
Prob(Omnibus): 0.000 Jarque-Bera (JB): 33100.529
Skew: 2.424 Prob(JB): 0.00
Kurtosis: 14.230 Cond. No. 2.58


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[35]:
# dame_flame.utils.post_processing.ATE(matching_object=model)

Exercise 14

Now include our other matching variables as controls (i.e., all the covariates you gave DAME to match on). Does the coefficient change?

[36]:
smf.wls(
    "annual_earnings ~ has_college + C(simplified_race)"
    " + C(discretized_age) + female + C(county)",
    matched_data,
    weights=matched_data["weights"],
).fit().summary()
[36]:
WLS Regression Results
Dep. Variable: annual_earnings R-squared: 0.238
Model: WLS Adj. R-squared: 0.188
Method: Least Squares F-statistic: 4.786
Date: Sat, 04 Mar 2023 Prob (F-statistic): 1.62e-132
Time: 13:59:42 Log-Likelihood: -61189.
No. Observations: 5310 AIC: 1.230e+05
Df Residuals: 4984 BIC: 1.252e+05
Df Model: 325
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 4.761e+04 2429.505 19.595 0.000 4.28e+04 5.24e+04
C(simplified_race)[T.1.0] -8344.9150 1067.331 -7.818 0.000 -1.04e+04 -6252.476
C(simplified_race)[T.2.0] -6753.9175 1140.523 -5.922 0.000 -8989.844 -4517.991
C(simplified_race)[T.3.0] -3220.6308 1202.997 -2.677 0.007 -5579.035 -862.227
C(discretized_age)[T.3] 8584.0505 868.037 9.889 0.000 6882.316 1.03e+04
C(discretized_age)[T.4] 1.251e+04 923.078 13.558 0.000 1.07e+04 1.43e+04
C(discretized_age)[T.5] 1.266e+04 964.214 13.131 0.000 1.08e+04 1.46e+04
C(discretized_age)[T.6] 9235.0616 1189.062 7.767 0.000 6903.976 1.16e+04
C(discretized_age)[T.7] 1.347e+04 2975.342 4.528 0.000 7639.580 1.93e+04
C(county)[T.1] -1.114e+04 3231.653 -3.446 0.001 -1.75e+04 -4799.550
C(county)[T.2] -1.279e+04 3245.734 -3.942 0.000 -1.92e+04 -6430.115
C(county)[T.3] -9142.9921 1.19e+04 -0.771 0.441 -3.24e+04 1.41e+04
C(county)[T.4] -6471.4363 2990.234 -2.164 0.030 -1.23e+04 -609.261
C(county)[T.5] -6378.3577 4178.131 -1.527 0.127 -1.46e+04 1812.617
C(county)[T.6] -1.627e+04 1.04e+04 -1.566 0.118 -3.66e+04 4103.579
C(county)[T.7] -1.023e+04 3835.754 -2.666 0.008 -1.77e+04 -2706.649
C(county)[T.8] -1.153e+04 3175.916 -3.629 0.000 -1.78e+04 -5300.388
C(county)[T.9] -1.382e+04 5130.249 -2.694 0.007 -2.39e+04 -3762.096
C(county)[T.10] -1.501e+04 3427.926 -4.380 0.000 -2.17e+04 -8293.693
C(county)[T.11] -1.418e+04 3040.008 -4.664 0.000 -2.01e+04 -8218.173
C(county)[T.12] -6849.3954 3121.285 -2.194 0.028 -1.3e+04 -730.303
C(county)[T.13] -1.381e+04 3368.446 -4.101 0.000 -2.04e+04 -7209.235
C(county)[T.14] -1.369e+04 3918.748 -3.493 0.000 -2.14e+04 -6007.378
C(county)[T.15] -9190.4689 4799.618 -1.915 0.056 -1.86e+04 218.895
C(county)[T.16] -1.061e+04 3616.372 -2.935 0.003 -1.77e+04 -3523.982
C(county)[T.17] -1.601e+04 7731.556 -2.070 0.038 -3.12e+04 -848.909
C(county)[T.18] -1788.7046 4787.690 -0.374 0.709 -1.12e+04 7597.275
C(county)[T.19] -1.684e+04 6555.900 -2.569 0.010 -2.97e+04 -3989.067
C(county)[T.20] -1.4e+04 4328.098 -3.235 0.001 -2.25e+04 -5515.772
C(county)[T.21] -6184.3776 3595.547 -1.720 0.085 -1.32e+04 864.476
C(county)[T.22] -1.826e+04 3694.928 -4.941 0.000 -2.55e+04 -1.1e+04
C(county)[T.23] -1.242e+04 3244.837 -3.828 0.000 -1.88e+04 -6059.940
C(county)[T.24] -1.104e+04 3171.193 -3.482 0.001 -1.73e+04 -4824.300
C(county)[T.25] -1.506e+04 3496.105 -4.309 0.000 -2.19e+04 -8211.049
C(county)[T.26] -1.391e+04 3137.745 -4.432 0.000 -2.01e+04 -7755.819
C(county)[T.27] -1.337e+04 3345.523 -3.997 0.000 -1.99e+04 -6814.638
C(county)[T.28] -1.297e+04 5671.204 -2.287 0.022 -2.41e+04 -1853.870
C(county)[T.29] -4845.0790 5203.125 -0.931 0.352 -1.5e+04 5355.335
C(county)[T.30] -2248.8602 5067.398 -0.444 0.657 -1.22e+04 7685.469
C(county)[T.31] -1.155e+04 4710.979 -2.452 0.014 -2.08e+04 -2313.897
C(county)[T.32] -7114.5598 3632.908 -1.958 0.050 -1.42e+04 7.538
C(county)[T.33] -1.272e+04 3067.035 -4.149 0.000 -1.87e+04 -6711.354
C(county)[T.34] -1.32e+04 3443.339 -3.833 0.000 -1.99e+04 -6447.428
C(county)[T.35] -9324.2429 3384.603 -2.755 0.006 -1.6e+04 -2688.932
C(county)[T.36] -1.505e+04 3658.198 -4.115 0.000 -2.22e+04 -7881.001
C(county)[T.37] -1.023e+04 3762.247 -2.720 0.007 -1.76e+04 -2856.766
C(county)[T.38] -1.308e+04 3310.408 -3.951 0.000 -1.96e+04 -6590.153
C(county)[T.39] -1.177e+04 3252.230 -3.618 0.000 -1.81e+04 -5391.513
C(county)[T.40] -1.516e+04 3677.299 -4.123 0.000 -2.24e+04 -7954.028
C(county)[T.41] -9224.6049 2690.524 -3.429 0.001 -1.45e+04 -3949.994
C(county)[T.42] -1.365e+04 3460.095 -3.944 0.000 -2.04e+04 -6864.991
C(county)[T.43] -1.165e+04 3698.161 -3.151 0.002 -1.89e+04 -4404.402
C(county)[T.44] -1.07e+04 2961.782 -3.611 0.000 -1.65e+04 -4889.799
C(county)[T.45] -7052.4496 3127.876 -2.255 0.024 -1.32e+04 -920.437
C(county)[T.46] -9064.0230 3380.497 -2.681 0.007 -1.57e+04 -2436.761
C(county)[T.47] -1.681e+04 3013.844 -5.579 0.000 -2.27e+04 -1.09e+04
C(county)[T.48] -8897.8731 3296.249 -2.699 0.007 -1.54e+04 -2435.774
C(county)[T.49] 2.125e+04 7993.324 2.659 0.008 5580.256 3.69e+04
C(county)[T.50] -2.251e+04 1.13e+04 -1.997 0.046 -4.46e+04 -409.459
C(county)[T.51] -5880.2544 3452.043 -1.703 0.089 -1.26e+04 887.270
C(county)[T.52] -1.006e+04 8297.064 -1.213 0.225 -2.63e+04 6201.302
C(county)[T.53] -7396.8710 1.97e+04 -0.376 0.707 -4.59e+04 3.12e+04
C(county)[T.54] -1.349e+04 8853.843 -1.524 0.128 -3.08e+04 3865.030
C(county)[T.55] 6016.8132 9691.581 0.621 0.535 -1.3e+04 2.5e+04
C(county)[T.56] -7435.2407 7281.280 -1.021 0.307 -2.17e+04 6839.272
C(county)[T.57] 8434.9441 1.4e+04 0.604 0.546 -1.89e+04 3.58e+04
C(county)[T.58] -5918.5652 5278.657 -1.121 0.262 -1.63e+04 4429.925
C(county)[T.59] -1.631e+04 9894.838 -1.649 0.099 -3.57e+04 3086.114
C(county)[T.60] -1.499e+04 6719.761 -2.231 0.026 -2.82e+04 -1819.273
C(county)[T.61] -8015.9218 1.09e+04 -0.732 0.464 -2.95e+04 1.34e+04
C(county)[T.62] 1.171e+04 1.32e+04 0.884 0.377 -1.43e+04 3.77e+04
C(county)[T.63] -1.124e+04 8626.461 -1.303 0.193 -2.82e+04 5671.473
C(county)[T.64] 1.57e+04 8596.970 1.826 0.068 -1156.314 3.26e+04
C(county)[T.65] -1.964e+04 9254.939 -2.122 0.034 -3.78e+04 -1498.590
C(county)[T.66] -2.286e+04 1.47e+04 -1.550 0.121 -5.18e+04 6046.864
C(county)[T.67] -1.286e+04 2.97e+04 -0.434 0.664 -7.1e+04 4.53e+04
C(county)[T.68] 265.8046 9594.437 0.028 0.978 -1.85e+04 1.91e+04
C(county)[T.69] -1.382e+04 9042.247 -1.528 0.127 -3.15e+04 3909.886
C(county)[T.70] -2.086e+04 1.33e+04 -1.567 0.117 -4.7e+04 5231.063
C(county)[T.71] -1.914e+04 8999.435 -2.127 0.033 -3.68e+04 -1501.949
C(county)[T.72] 1.203e+04 1.91e+04 0.630 0.528 -2.54e+04 4.95e+04
C(county)[T.73] -4151.5319 5648.737 -0.735 0.462 -1.52e+04 6922.479
C(county)[T.74] -1.341e+04 7534.403 -1.780 0.075 -2.82e+04 1357.560
C(county)[T.75] -1.719e+04 3924.766 -4.381 0.000 -2.49e+04 -9500.690
C(county)[T.76] -3.746e+04 2.62e+04 -1.430 0.153 -8.88e+04 1.39e+04
C(county)[T.77] 237.8624 6897.951 0.034 0.972 -1.33e+04 1.38e+04
C(county)[T.78] 6770.8989 2.24e+04 0.302 0.763 -3.72e+04 5.07e+04
C(county)[T.79] 4.146e+04 8462.736 4.899 0.000 2.49e+04 5.8e+04
C(county)[T.80] -1.741e+04 5589.953 -3.115 0.002 -2.84e+04 -6454.909
C(county)[T.81] -1.143e+04 5721.800 -1.998 0.046 -2.27e+04 -216.304
C(county)[T.82] -1.658e+04 1.61e+04 -1.032 0.302 -4.81e+04 1.49e+04
C(county)[T.83] -2.307e+04 1.25e+04 -1.840 0.066 -4.77e+04 1514.953
C(county)[T.84] -8999.9730 1.22e+04 -0.740 0.459 -3.28e+04 1.48e+04
C(county)[T.85] 3419.8384 9046.856 0.378 0.705 -1.43e+04 2.12e+04
C(county)[T.86] 1.824e+04 1.43e+04 1.279 0.201 -9720.390 4.62e+04
C(county)[T.87] -2.403e+04 1.09e+04 -2.205 0.027 -4.54e+04 -2667.316
C(county)[T.88] -1.326e+04 1.74e+04 -0.763 0.445 -4.73e+04 2.08e+04
C(county)[T.89] -1.574e+04 8531.472 -1.845 0.065 -3.25e+04 985.816
C(county)[T.90] -8870.8793 4968.614 -1.785 0.074 -1.86e+04 869.791
C(county)[T.91] -1.04e+04 9791.880 -1.062 0.288 -2.96e+04 8799.744
C(county)[T.92] -1.687e+04 1.17e+04 -1.441 0.150 -3.98e+04 6076.622
C(county)[T.93] -4593.8182 9052.966 -0.507 0.612 -2.23e+04 1.32e+04
C(county)[T.94] -66.5470 9761.637 -0.007 0.995 -1.92e+04 1.91e+04
C(county)[T.95] -1.199e+04 5085.007 -2.357 0.018 -2.2e+04 -2017.441
C(county)[T.96] -1.27e+04 1.03e+04 -1.228 0.219 -3.3e+04 7572.284
C(county)[T.97] -2.457e+04 9432.408 -2.605 0.009 -4.31e+04 -6080.404
C(county)[T.98] 2197.3011 1.53e+04 0.144 0.886 -2.78e+04 3.22e+04
C(county)[T.99] -9289.2093 3098.282 -2.998 0.003 -1.54e+04 -3215.213
C(county)[T.100] 1.121e+04 7876.656 1.423 0.155 -4235.032 2.66e+04
C(county)[T.101] -1.684e+04 9598.189 -1.754 0.079 -3.57e+04 1979.720
C(county)[T.102] -2597.8743 1.35e+04 -0.192 0.847 -2.91e+04 2.39e+04
C(county)[T.103] -1.014e+04 9066.754 -1.118 0.264 -2.79e+04 7638.542
C(county)[T.104] -252.2631 1.16e+04 -0.022 0.983 -2.3e+04 2.25e+04
C(county)[T.105] -1.145e+04 1.3e+04 -0.880 0.379 -3.7e+04 1.41e+04
C(county)[T.106] -3.539e+04 3.35e+04 -1.055 0.291 -1.01e+05 3.03e+04
C(county)[T.107] -3.364e+04 1.71e+04 -1.962 0.050 -6.72e+04 -29.720
C(county)[T.109] -4795.8027 8034.001 -0.597 0.551 -2.05e+04 1.1e+04
C(county)[T.110] -8716.5745 1.06e+04 -0.822 0.411 -2.95e+04 1.21e+04
C(county)[T.111] -1.946e+04 2.19e+04 -0.889 0.374 -6.24e+04 2.35e+04
C(county)[T.112] -7351.2007 1.42e+04 -0.516 0.606 -3.53e+04 2.06e+04
C(county)[T.113] -2.293e+04 2.6e+04 -0.882 0.378 -7.39e+04 2.8e+04
C(county)[T.114] -1.908e+04 1.61e+04 -1.185 0.236 -5.07e+04 1.25e+04
C(county)[T.115] -2798.0105 1e+04 -0.280 0.780 -2.24e+04 1.68e+04
C(county)[T.116] -8785.2988 1.35e+04 -0.650 0.516 -3.53e+04 1.77e+04
C(county)[T.117] -2.473e+04 4.28e+04 -0.577 0.564 -1.09e+05 5.92e+04
C(county)[T.118] -1.698e+04 1.1e+04 -1.538 0.124 -3.86e+04 4669.861
C(county)[T.120] -1.704e+04 1.32e+04 -1.287 0.198 -4.3e+04 8908.447
C(county)[T.121] -2.767e+04 1.32e+04 -2.093 0.036 -5.36e+04 -1747.007
C(county)[T.123] -1.291e+04 5091.347 -2.536 0.011 -2.29e+04 -2931.559
C(county)[T.124] -3.209e+04 1e+04 -3.197 0.001 -5.18e+04 -1.24e+04
C(county)[T.125] -1.795e+04 1.25e+04 -1.440 0.150 -4.24e+04 6490.095
C(county)[T.126] -6948.1460 9793.690 -0.709 0.478 -2.61e+04 1.23e+04
C(county)[T.127] -4111.7736 9598.180 -0.428 0.668 -2.29e+04 1.47e+04
C(county)[T.128] -1.407e+04 3.35e+04 -0.420 0.675 -7.98e+04 5.17e+04
C(county)[T.129] -1.338e+04 1.22e+04 -1.094 0.274 -3.73e+04 1.06e+04
C(county)[T.130] -2.958e+04 1.24e+04 -2.379 0.017 -5.4e+04 -5205.688
C(county)[T.131] -9462.1890 2.13e+04 -0.445 0.656 -5.11e+04 3.22e+04
C(county)[T.132] -7889.2394 5638.330 -1.399 0.162 -1.89e+04 3164.369
C(county)[T.133] 2175.4884 1.51e+04 0.144 0.886 -2.75e+04 3.18e+04
C(county)[T.134] 5.467e+04 1.8e+04 3.045 0.002 1.95e+04 8.99e+04
C(county)[T.135] -2.083e+04 1.38e+04 -1.509 0.131 -4.79e+04 6228.141
C(county)[T.136] -4420.8608 5997.620 -0.737 0.461 -1.62e+04 7337.113
C(county)[T.137] -4087.2387 8100.626 -0.505 0.614 -2e+04 1.18e+04
C(county)[T.138] -7939.1789 8029.739 -0.989 0.323 -2.37e+04 7802.643
C(county)[T.139] -692.3178 1.02e+04 -0.068 0.946 -2.08e+04 1.94e+04
C(county)[T.140] -1.651e+04 1.37e+04 -1.202 0.229 -4.34e+04 1.04e+04
C(county)[T.141] -8251.4595 1.22e+04 -0.676 0.499 -3.22e+04 1.57e+04
C(county)[T.142] -1373.8309 2.13e+04 -0.065 0.948 -4.3e+04 4.03e+04
C(county)[T.143] -5545.8845 5862.828 -0.946 0.344 -1.7e+04 5947.838
C(county)[T.144] -4.741e+04 2.12e+04 -2.232 0.026 -8.91e+04 -5771.921
C(county)[T.145] -9556.0440 3.47e+04 -0.276 0.783 -7.75e+04 5.84e+04
C(county)[T.146] -1.903e+04 1.59e+04 -1.196 0.232 -5.02e+04 1.22e+04
C(county)[T.147] -2.8e+04 1.52e+04 -1.848 0.065 -5.77e+04 1707.239
C(county)[T.148] -2.012e+04 1.76e+04 -1.146 0.252 -5.45e+04 1.43e+04
C(county)[T.149] -1.012e+04 4824.670 -2.097 0.036 -1.96e+04 -657.758
C(county)[T.150] -2.052e+04 3.35e+04 -0.612 0.541 -8.63e+04 4.52e+04
C(county)[T.151] -2.235e+04 1.13e+04 -1.974 0.048 -4.45e+04 -157.867
C(county)[T.152] -1.249e+04 7332.914 -1.703 0.089 -2.69e+04 1887.273
C(county)[T.153] -1.074e+04 2.99e+04 -0.360 0.719 -6.93e+04 4.78e+04
C(county)[T.155] -1.174e+04 1.32e+04 -0.892 0.373 -3.75e+04 1.41e+04
C(county)[T.157] -2.379e+04 1.34e+04 -1.777 0.076 -5e+04 2454.505
C(county)[T.158] -3.57e+04 1.24e+04 -2.878 0.004 -6e+04 -1.14e+04
C(county)[T.159] -3961.2072 2.67e+04 -0.149 0.882 -5.62e+04 4.83e+04
C(county)[T.160] -1.811e+04 8356.451 -2.167 0.030 -3.45e+04 -1729.232
C(county)[T.161] 6091.1910 8574.294 0.710 0.477 -1.07e+04 2.29e+04
C(county)[T.162] 9087.0830 1.55e+04 0.586 0.558 -2.13e+04 3.95e+04
C(county)[T.163] -2.133e+04 2.02e+04 -1.054 0.292 -6.1e+04 1.84e+04
C(county)[T.164] -2.283e+04 1.51e+04 -1.511 0.131 -5.24e+04 6795.581
C(county)[T.165] -2.21e+04 2.13e+04 -1.037 0.300 -6.39e+04 1.97e+04
C(county)[T.166] -1.551e+04 9441.670 -1.642 0.101 -3.4e+04 3002.990
C(county)[T.167] -3.947e+04 3.24e+04 -1.218 0.223 -1.03e+05 2.41e+04
C(county)[T.168] 1.007e+04 7103.076 1.417 0.156 -3857.374 2.4e+04
C(county)[T.169] 1.283e+04 1.14e+04 1.121 0.262 -9600.934 3.53e+04
C(county)[T.170] -1.615e+04 1.4e+04 -1.150 0.250 -4.37e+04 1.14e+04
C(county)[T.171] 2.365e+04 7177.844 3.295 0.001 9576.024 3.77e+04
C(county)[T.172] -3.626e+04 3.24e+04 -1.119 0.263 -9.98e+04 2.73e+04
C(county)[T.173] -2.629e+04 2.38e+04 -1.106 0.269 -7.29e+04 2.03e+04
C(county)[T.174] -2.314e+04 1.59e+04 -1.458 0.145 -5.43e+04 7983.357
C(county)[T.176] -8331.5679 1.76e+04 -0.472 0.637 -4.29e+04 2.62e+04
C(county)[T.177] -7787.9729 6943.898 -1.122 0.262 -2.14e+04 5825.123
C(county)[T.178] -8426.9047 9527.061 -0.885 0.376 -2.71e+04 1.03e+04
C(county)[T.179] 5061.4958 9108.387 0.556 0.578 -1.28e+04 2.29e+04
C(county)[T.180] -9156.0300 7253.999 -1.262 0.207 -2.34e+04 5065.000
C(county)[T.181] -6042.8525 1.16e+04 -0.522 0.602 -2.87e+04 1.67e+04
C(county)[T.182] -1.6e+04 8180.157 -1.956 0.051 -3.2e+04 40.230
C(county)[T.183] 1688.4654 4051.315 0.417 0.677 -6253.894 9630.825
C(county)[T.184] -1.566e+04 3739.789 -4.189 0.000 -2.3e+04 -8333.281
C(county)[T.185] -1.158e+04 1.5e+04 -0.769 0.442 -4.11e+04 1.79e+04
C(county)[T.186] -1.279e+04 1.07e+04 -1.198 0.231 -3.37e+04 8134.088
C(county)[T.187] -6652.3137 8781.097 -0.758 0.449 -2.39e+04 1.06e+04
C(county)[T.188] -1.083e+04 3548.268 -3.053 0.002 -1.78e+04 -3878.254
C(county)[T.189] -2.122e+04 5407.161 -3.925 0.000 -3.18e+04 -1.06e+04
C(county)[T.190] -2.661e+04 1.48e+04 -1.803 0.071 -5.55e+04 2321.109
C(county)[T.191] -3036.3270 1.92e+04 -0.158 0.874 -4.06e+04 3.45e+04
C(county)[T.192] 6940.9690 9281.647 0.748 0.455 -1.13e+04 2.51e+04
C(county)[T.193] -9444.8116 1.05e+04 -0.902 0.367 -3e+04 1.11e+04
C(county)[T.194] 2498.3312 1.35e+04 0.185 0.853 -2.4e+04 2.9e+04
C(county)[T.195] -2.03e+04 9595.165 -2.116 0.034 -3.91e+04 -1493.798
C(county)[T.196] -2753.6229 6585.461 -0.418 0.676 -1.57e+04 1.02e+04
C(county)[T.197] 1.33e+04 5757.604 2.311 0.021 2015.983 2.46e+04
C(county)[T.198] 7328.9890 7484.437 0.979 0.328 -7343.801 2.2e+04
C(county)[T.199] -2.136e+04 8594.063 -2.485 0.013 -3.82e+04 -4510.853
C(county)[T.200] -7508.9207 3016.843 -2.489 0.013 -1.34e+04 -1594.581
C(county)[T.201] -2.737e+04 1.51e+04 -1.811 0.070 -5.7e+04 2251.151
C(county)[T.202] -1237.8204 2.25e+04 -0.055 0.956 -4.54e+04 4.29e+04
C(county)[T.203] -6666.3212 1.53e+04 -0.435 0.664 -3.67e+04 2.34e+04
C(county)[T.204] -1.464e+04 9850.437 -1.486 0.137 -3.39e+04 4674.099
C(county)[T.205] -1.706e+04 6442.944 -2.648 0.008 -2.97e+04 -4427.328
C(county)[T.206] -3.733e+04 1.51e+04 -2.465 0.014 -6.7e+04 -7640.473
C(county)[T.209] 6087.6907 1.35e+04 0.452 0.651 -2.03e+04 3.25e+04
C(county)[T.210] -2.578e+04 2.65e+04 -0.973 0.330 -7.77e+04 2.61e+04
C(county)[T.211] -1093.8388 7382.081 -0.148 0.882 -1.56e+04 1.34e+04
C(county)[T.213] -2.153e+04 2.35e+04 -0.916 0.360 -6.76e+04 2.45e+04
C(county)[T.214] -2.351e+04 1.27e+04 -1.858 0.063 -4.83e+04 1299.507
C(county)[T.215] -1.66e+04 1.12e+04 -1.488 0.137 -3.85e+04 5271.613
C(county)[T.216] -9052.8944 5337.794 -1.696 0.090 -1.95e+04 1411.531
C(county)[T.217] -2.914e+04 9643.602 -3.021 0.003 -4.8e+04 -1.02e+04
C(county)[T.218] -1.642e+04 1.66e+04 -0.987 0.324 -4.9e+04 1.62e+04
C(county)[T.219] -482.9429 8190.878 -0.059 0.953 -1.65e+04 1.56e+04
C(county)[T.220] 1.038e+04 1.11e+04 0.934 0.350 -1.14e+04 3.22e+04
C(county)[T.221] 7271.3058 1.15e+04 0.634 0.526 -1.52e+04 2.97e+04
C(county)[T.222] -1.622e+04 1.02e+04 -1.588 0.112 -3.63e+04 3804.653
C(county)[T.223] -1.878e+04 5199.813 -3.612 0.000 -2.9e+04 -8586.505
C(county)[T.224] -9192.1466 1.45e+04 -0.636 0.525 -3.75e+04 1.92e+04
C(county)[T.225] 2641.2389 1.98e+04 0.134 0.894 -3.61e+04 4.14e+04
C(county)[T.226] -8830.2891 6639.675 -1.330 0.184 -2.18e+04 4186.397
C(county)[T.227] -2.143e+04 1.09e+04 -1.970 0.049 -4.28e+04 -102.683
C(county)[T.228] 3.233e+04 1.71e+04 1.894 0.058 -1135.703 6.58e+04
C(county)[T.229] -1.428e+04 1.91e+04 -0.748 0.454 -5.17e+04 2.31e+04
C(county)[T.230] -1.28e+04 6264.355 -2.043 0.041 -2.51e+04 -517.493
C(county)[T.231] -8271.2748 7245.759 -1.142 0.254 -2.25e+04 5933.602
C(county)[T.232] -2.299e+04 8519.159 -2.699 0.007 -3.97e+04 -6289.960
C(county)[T.233] -1.285e+04 8491.101 -1.513 0.130 -2.95e+04 3796.614
C(county)[T.234] -1.735e+04 3.1e+04 -0.559 0.576 -7.82e+04 4.35e+04
C(county)[T.235] -1.766e+04 3.16e+04 -0.559 0.576 -7.96e+04 4.43e+04
C(county)[T.236] -1.217e+04 8349.230 -1.457 0.145 -2.85e+04 4200.746
C(county)[T.237] -1.141e+04 4505.618 -2.533 0.011 -2.02e+04 -2581.176
C(county)[T.238] -1.613e+04 7139.990 -2.258 0.024 -3.01e+04 -2127.974
C(county)[T.239] -2.712e+04 2.14e+04 -1.266 0.206 -6.91e+04 1.49e+04
C(county)[T.240] 6490.6376 9154.156 0.709 0.478 -1.15e+04 2.44e+04
C(county)[T.241] -3359.7031 6189.659 -0.543 0.587 -1.55e+04 8774.753
C(county)[T.242] -1.613e+04 1.27e+04 -1.272 0.203 -4.1e+04 8724.775
C(county)[T.243] -1.788e+04 1.39e+04 -1.286 0.199 -4.51e+04 9383.449
C(county)[T.244] -3.071e+04 1.43e+04 -2.150 0.032 -5.87e+04 -2703.321
C(county)[T.245] -8153.4624 1.35e+04 -0.603 0.547 -3.47e+04 1.84e+04
C(county)[T.246] 5281.9970 4036.138 1.309 0.191 -2630.610 1.32e+04
C(county)[T.247] -9765.4698 5639.622 -1.732 0.083 -2.08e+04 1290.671
C(county)[T.248] 143.0204 8692.204 0.016 0.987 -1.69e+04 1.72e+04
C(county)[T.249] -1.476e+04 3.35e+04 -0.440 0.660 -8.05e+04 5.1e+04
C(county)[T.250] 1.616e+04 1.05e+04 1.541 0.123 -4398.223 3.67e+04
C(county)[T.251] -1.651e+04 9569.822 -1.726 0.084 -3.53e+04 2247.701
C(county)[T.252] -1.57e+04 2.01e+04 -0.781 0.435 -5.51e+04 2.37e+04
C(county)[T.253] -1.801e+04 1.58e+04 -1.141 0.254 -4.89e+04 1.29e+04
C(county)[T.254] -5819.5313 1.24e+04 -0.470 0.638 -3.01e+04 1.85e+04
C(county)[T.256] -9352.6145 1.31e+04 -0.712 0.476 -3.51e+04 1.64e+04
C(county)[T.257] -1.474e+04 4650.454 -3.170 0.002 -2.39e+04 -5626.575
C(county)[T.258] -2.95e+04 1.86e+04 -1.587 0.113 -6.6e+04 6945.417
C(county)[T.259] 5.662e+04 1.78e+04 3.176 0.002 2.17e+04 9.16e+04
C(county)[T.260] -2.075e+04 1.42e+04 -1.461 0.144 -4.86e+04 7093.704
C(county)[T.261] -1.214e+04 9508.704 -1.277 0.202 -3.08e+04 6499.043
C(county)[T.262] -1.499e+04 3.16e+04 -0.474 0.635 -7.69e+04 4.7e+04
C(county)[T.263] 1.094e+04 1.25e+04 0.872 0.383 -1.36e+04 3.55e+04
C(county)[T.264] -2479.7477 1.18e+04 -0.210 0.834 -2.57e+04 2.07e+04
C(county)[T.265] -2.203e+04 1.05e+04 -2.105 0.035 -4.25e+04 -1517.160
C(county)[T.266] 6210.2882 3.35e+04 0.185 0.853 -5.95e+04 7.19e+04
C(county)[T.267] -1.138e+04 2.33e+04 -0.488 0.626 -5.71e+04 3.44e+04
C(county)[T.268] -2.097e+04 1.51e+04 -1.389 0.165 -5.06e+04 8629.225
C(county)[T.269] -1.836e+04 7924.529 -2.317 0.021 -3.39e+04 -2824.110
C(county)[T.270] -1.361e+04 1.72e+04 -0.789 0.430 -4.74e+04 2.02e+04
C(county)[T.271] -1.204e+04 1.55e+04 -0.777 0.437 -4.24e+04 1.84e+04
C(county)[T.272] -1.485e+04 8880.399 -1.672 0.095 -3.23e+04 2561.281
C(county)[T.273] -1.149e+04 2.63e+04 -0.438 0.662 -6.3e+04 4e+04
C(county)[T.274] -9891.3781 4694.196 -2.107 0.035 -1.91e+04 -688.687
C(county)[T.275] -5173.7930 1.15e+04 -0.449 0.653 -2.77e+04 1.74e+04
C(county)[T.276] -1.966e+04 1.66e+04 -1.182 0.237 -5.22e+04 1.29e+04
C(county)[T.277] -3.402e+04 2.86e+04 -1.188 0.235 -9.01e+04 2.21e+04
C(county)[T.278] 2322.2439 1.9e+04 0.122 0.903 -3.49e+04 3.95e+04
C(county)[T.279] -1.119e+04 1.84e+04 -0.609 0.543 -4.72e+04 2.48e+04
C(county)[T.280] -1.345e+04 6306.646 -2.133 0.033 -2.58e+04 -1085.930
C(county)[T.281] -8403.4354 2.99e+04 -0.281 0.778 -6.69e+04 5.01e+04
C(county)[T.282] -3681.0429 1.2e+04 -0.307 0.759 -2.72e+04 1.99e+04
C(county)[T.283] -5339.8789 9936.127 -0.537 0.591 -2.48e+04 1.41e+04
C(county)[T.284] -1.452e+04 8179.287 -1.775 0.076 -3.06e+04 1519.240
C(county)[T.285] -3.43e+04 2.13e+04 -1.613 0.107 -7.6e+04 7382.687
C(county)[T.286] 7684.0660 1.16e+04 0.664 0.507 -1.5e+04 3.04e+04
C(county)[T.287] -1.775e+04 5134.756 -3.456 0.001 -2.78e+04 -7680.041
C(county)[T.288] 2038.4699 1.76e+04 0.116 0.908 -3.26e+04 3.66e+04
C(county)[T.289] -8041.0146 1.11e+04 -0.726 0.468 -2.97e+04 1.37e+04
C(county)[T.290] -1.581e+04 7099.249 -2.227 0.026 -2.97e+04 -1889.232
C(county)[T.291] -7663.1814 1.35e+04 -0.567 0.571 -3.42e+04 1.88e+04
C(county)[T.292] -1.567e+04 7319.153 -2.141 0.032 -3e+04 -1323.652
C(county)[T.293] -5825.2555 1.34e+04 -0.433 0.665 -3.22e+04 2.05e+04
C(county)[T.294] 1.947e+04 9840.528 1.978 0.048 177.105 3.88e+04
C(county)[T.295] -1.75e+04 1.76e+04 -0.996 0.319 -5.2e+04 1.7e+04
C(county)[T.296] -1096.1249 1.8e+04 -0.061 0.951 -3.64e+04 3.42e+04
C(county)[T.297] -1.141e+04 4502.234 -2.535 0.011 -2.02e+04 -2586.102
C(county)[T.298] -8732.4700 2.87e+04 -0.304 0.761 -6.5e+04 4.75e+04
C(county)[T.299] -1.74e+04 2.99e+04 -0.583 0.560 -7.59e+04 4.11e+04
C(county)[T.300] -1.986e+04 1.04e+04 -1.907 0.057 -4.03e+04 560.482
C(county)[T.301] -4.329e+04 3.24e+04 -1.336 0.182 -1.07e+05 2.03e+04
C(county)[T.302] 4581.0131 7593.047 0.603 0.546 -1.03e+04 1.95e+04
C(county)[T.303] -1.484e+04 7473.001 -1.985 0.047 -2.95e+04 -186.892
C(county)[T.304] -2.222e+04 8772.347 -2.533 0.011 -3.94e+04 -5022.926
C(county)[T.305] -2514.2672 7217.330 -0.348 0.728 -1.67e+04 1.16e+04
C(county)[T.306] -2.336e+04 1.99e+04 -1.174 0.240 -6.24e+04 1.57e+04
C(county)[T.307] -1.116e+04 9796.527 -1.139 0.255 -3.04e+04 8045.695
C(county)[T.308] 8.973e+04 1.66e+04 5.411 0.000 5.72e+04 1.22e+05
C(county)[T.309] -1.179e+04 5120.504 -2.302 0.021 -2.18e+04 -1747.726
C(county)[T.310] -1.268e+04 3.35e+04 -0.378 0.705 -7.84e+04 5.31e+04
C(county)[T.311] -1.624e+04 1.01e+04 -1.611 0.107 -3.6e+04 3524.476
C(county)[T.312] -9191.1488 9873.286 -0.931 0.352 -2.85e+04 1.02e+04
C(county)[T.313] -1.765e+04 4.28e+04 -0.412 0.680 -1.02e+05 6.63e+04
C(county)[T.314] -1.793e+04 7170.454 -2.501 0.012 -3.2e+04 -3875.671
C(county)[T.315] 879.4379 1.04e+04 0.084 0.933 -1.96e+04 2.13e+04
C(county)[T.316] -5807.9640 6758.678 -0.859 0.390 -1.91e+04 7442.020
C(county)[T.317] 1.794e+04 9335.924 1.922 0.055 -362.354 3.62e+04
C(county)[T.318] -4980.2142 8582.095 -0.580 0.562 -2.18e+04 1.18e+04
C(county)[T.319] 1051.3237 1.42e+04 0.074 0.941 -2.68e+04 2.89e+04
C(county)[T.320] -1.431e+04 1.5e+04 -0.951 0.341 -4.38e+04 1.52e+04
C(county)[T.321] -1.684e+04 6968.208 -2.416 0.016 -3.05e+04 -3176.799
C(county)[T.322] -406.6903 6523.846 -0.062 0.950 -1.32e+04 1.24e+04
C(county)[T.323] -1.853e+04 5726.625 -3.236 0.001 -2.98e+04 -7302.079
C(county)[T.324] -2501.1256 7866.013 -0.318 0.751 -1.79e+04 1.29e+04
C(county)[T.325] -1.951e+04 2.24e+04 -0.873 0.383 -6.33e+04 2.43e+04
has_college 1.325e+04 742.630 17.843 0.000 1.18e+04 1.47e+04
female -8657.2875 609.011 -14.215 0.000 -9851.217 -7463.358
Omnibus: 2414.039 Durbin-Watson: 1.981
Prob(Omnibus): 0.000 Jarque-Bera (JB): 22245.534
Skew: 1.945 Prob(JB): 0.00
Kurtosis: 12.242 Cond. No. 198.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Exercise 15

If you stopped matching after the second iteration (Iteration 1) back in Exercise 10, you may be wondering if that was a good choice! Let’s check by restricting our attention to ONLY exact matches (iteration = 0). Run that match.

[37]:
model2 = dame_flame.matching.DAME(
    repeats=False, verbose=3, want_pe=True, early_stop_iterations=0
)
model2.fit(
    for_matching,
    treatment_column_name="has_college",
    outcome_column_name="annual_earnings",
)
result2 = model2.predict(for_matching)
Completed iteration 0 of matching
        Number of matched groups formed in total:  370
        Unmatched treated units:  644 out of a total of  1150 treated units
        Unmatched control units:  3187 out of a total of  4365 control units
        Number of matches made this iteration:  1684
        Number of matches made so far:  1684
        Covariates dropped so far:  set()
        Predictive error of covariate set used to match:  1199312680.0957854
1684 units matched. We stopped after iteration 0
[38]:
matched_data2 = get_dataframe(model2, result2)

Exercise 16

Now use a weighted linear regression on your matched data to regress annual earnings on just having a college education. Is that different from what you had when you allowed more low-quality matches?

[39]:
smf.wls(
    "annual_earnings ~ has_college", matched_data2, weights=matched_data2["weights"]
).fit().summary()
[39]:
WLS Regression Results
Dep. Variable: annual_earnings R-squared: 0.049
Model: WLS Adj. R-squared: 0.048
Method: Least Squares F-statistic: 86.65
Date: Sat, 04 Mar 2023 Prob (F-statistic): 3.92e-20
Time: 13:59:42 Log-Likelihood: -19512.
No. Observations: 1684 AIC: 3.903e+04
Df Residuals: 1682 BIC: 3.904e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3.914e+04 664.386 58.907 0.000 3.78e+04 4.04e+04
has_college 1.128e+04 1212.039 9.308 0.000 8904.805 1.37e+04
Omnibus: 855.250 Durbin-Watson: 2.037
Prob(Omnibus): 0.000 Jarque-Bera (JB): 6653.000
Skew: 2.256 Prob(JB): 0.00
Kurtosis: 11.629 Cond. No. 2.42


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Other Forms of Matching

OK, hopefully this gives you a taste of matching! There are, of course, many other permutations to be aware of though.

  • Matching with replacement. In this exercise, we set repeats=False, so each observation could only end up in our final dataset once. However, if we use repeats=True, an untreated observation that is the closest match for multiple treated observations may get put in the dataset multiple times. We can still use this dataset in almost the same way, though, except we have to make use of weights so that if an observation appears, say, twice, each appearance gets 1/2 the weight of an observation appearing only once.

  • Matching with continuous variables: DAME is used for exact matching, but if you have lots of continuous variables, you can also match on those. In fact, the Almost Exact Matching Lab also has a library called MALTS that will do matching with continuous variables. That package does something like Mahalanobis Distance matching, but unlike Mahalanobis, which calculates the distance between observations in terms of the difference in all the matching variables normalized by each matching variable’s standard deviation, MALTS does something much more clever. (Here’s the paper describing the technique if you want all the details). Basically, it figures out how well each matching variable predicts our outcome \(Y\), then weights the different variables by their predictive power instead of just normalizing by something arbitrary like their standard deviation. As a result, final matches will prioritize matching more closely on variables that are outcome-relevant. In addition, when it sees a categorical variable, it recognizes that and only pairs observations when they are an exact match on that categorical variable.

  • If your dataset is huge, use FLAME: this dataset is small, but if you have lots of observations and lots of matching variables, the computational complexity of this task explodes, so the AEML created FLAME, which works with millions of observations at only a small cost to match quality. (A minimal usage sketch follows below.)
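As a minimal sketch, dame_flame exposes FLAME with the same fit/predict interface we used for DAME above. As in the earlier syntax example, my_data, "treat", and "outcome" are hypothetical placeholder names:

import dame_flame

# FLAME in dame_flame mirrors the DAME interface used earlier in this exercise.
model = dame_flame.matching.FLAME(repeats=False, verbose=3)
model.fit(
    my_data,
    treatment_column_name="treat",
    outcome_column_name="outcome",
)
result = model.predict(my_data)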

Absolutely positively need the solutions?

Don’t use this link until you’ve really, really spent time struggling with your code! Doing so only results in you cheating yourself.

Link