Matching Exercise

In this exercise, we’ll use matching to evaluate how getting a college degree affects earnings in the US.

Matching Packages: Python v. R

Just as the best tools for machine learning tend to be in Python since they’re developed by CS people (who prefer Python), most of the best tools for causal inference are implemented in R since innovation in causal inference tends to be led by social scientists using R. As a result, the most well-developed matching package is called MatchIt, and it is only available in R (though you can always call it from Python using rpy2).

In the last couple of years, though, a group of computer scientists and statisticians here at Duke have made some great advances in matching (especially on the computational side of things), and they recently released a set of matching packages in both R and Python. They have some great algorithms we’ll use today, but be aware these packages aren’t as mature and aren’t general-purpose packages yet. So if you ever get deep into matching, be aware you will probably still want to make at least partial use of the R package MatchIt, as well as some other R packages for newer techniques, like Matching Frontier estimation or Adaptive Hyper-Box Matching.

Installing dame-flame

For this lesson, begin by installing dame-flame with pip install dame-flame (it’s not on conda yet).

DAME is an algorithm that we can use for a version of coarsened exact matching. The package only accepts a list of categorical variables, and then attempts to match observations that match exactly on those variables. That means that if you want to match on, say, age, you have to break it up into categories (say, under 18, 18-29, 30-39, etc.).
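For example, here’s one way you might bin a continuous age variable into the kind of categories DAME needs. This is only a toy sketch (the dataframe and bin edges are made up for illustration); Exercise 4 below has you do this for real:

import pandas as pd

# A minimal sketch: bin a toy age column into categories for exact matching.
toy = pd.DataFrame({"age": [17, 22, 34, 41, 58]})
toy["age_bin"] = pd.cut(
    toy["age"],
    bins=[0, 18, 30, 40, 50, 65, 120],
    right=False,
    labels=["under 18", "18-29", "30-39", "40-49", "50-64", "65+"],
)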

Of course, one cannot always find exact matches on all variables, so what DAME does is:

  1. Finds all observations that match exactly on all of the matching variables.

  2. Figures out which matching variable is least useful in predicting the outcome of interest \(Y\), drops it, and then tries to match the remaining observations on the narrowed set of matching variables.

  3. Repeats until it runs out of variables, all observations are matched, or it hits a stopping rule (namely: the quality of matches falls below a threshold).
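To make that loop concrete, here’s a schematic sketch in pandas. This is not the real DAME algorithm (dame_flame searches over covariate subsets and is vastly more efficient), and the “least useful” step is just a placeholder here; how DAME actually ranks variables is discussed in the interpretation section below. All names are hypothetical.

import pandas as pd

def toy_almost_exact_match(df, covariates, treat_col):
    covs = list(covariates)
    unmatched = df.copy()
    match_groups = []
    while covs and unmatched[treat_col].sum() > 0:
        # 1. Group units that agree exactly on the current covariates and
        #    contain at least one treated AND one control unit.
        for _, group in unmatched.groupby(covs):
            if group[treat_col].nunique() == 2:
                match_groups.append(group)
                unmatched = unmatched.drop(group.index)
        # 2. Drop the "least useful" covariate and try again. (Real DAME ranks
        #    covariates by how well they predict the outcome -- see below;
        #    here we just drop the last one as a placeholder.)
        covs.pop()
    return match_groups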

In addition, the lab has also created FLAME, which does the same thing but employs some tricks to make it massively more computationally efficient, meaning it can be used on datasets with millions of observations (which most matching algorithms cannot). It’s a little less accurate, but an amazing contribution nevertheless.

Data Setup

To save you some time and let you focus on matching, I’ve pre-cleaned about one month’s worth of data from the US Current Population Survey data we used for our gender discrimination analysis. You can download the data from here, or read it directly with:

cps = pd.read_stata(
    "https://github.com/nickeubank/MIDS_Data/blob/master"
    "/Current_Population_Survey/cps_for_matching.dta?raw=true"
)

Load the data and quickly familiarize yourself with its contents.

[1]:
# Load critical packages
import pandas as pd
import numpy as np
import dame_flame
[2]:
# Load our Current Population Survey data
# a regular survey of US citizens

cps = pd.read_stata(
    "https://github.com/nickeubank/MIDS_Data/blob/master"
    "/Current_Population_Survey/cps_for_matching.dta?raw=true"
)
[3]:
# Take a look at the data
cps.head()
[3]:
index annual_earnings female simplified_race has_college age county class94
0 151404 NaN 1 3.0 1 30 0-WV Private, For Profit
1 123453 NaN 0 0.0 0 21 251-TX Private, For Profit
2 187982 NaN 0 0.0 0 40 5-MA Self-Employed, Unincorporated
3 122356 NaN 1 0.0 1 27 0-TN Private, Nonprofit
4 210750 42900.0 1 0.0 0 52 0-IA Private, For Profit

Getting To Know Your Data

Before you start matching, it is important to examine your data to ensure that matching is feasible (you have some overlap in the features of people in the treated and untreated groups), and also that there is a reason to match: either you’re unsure about some of the functional forms at play, or you have some imbalance between the two groups.

Exercise 1

Show the raw difference of annual_earnings between those with and without a college degree (has_college). Is the difference statistically significant?

[4]:
import statsmodels.formula.api as smf

smf.ols("annual_earnings ~ has_college", cps).fit().summary()
[4]:
OLS Regression Results
Dep. Variable: annual_earnings R-squared: 0.063
Model: OLS Adj. R-squared: 0.063
Method: Least Squares F-statistic: 370.2
Date: Sat, 04 Mar 2023 Prob (F-statistic): 6.56e-80
Time: 13:59:39 Log-Likelihood: -63018.
No. Observations: 5515 AIC: 1.260e+05
Df Residuals: 5513 BIC: 1.261e+05
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3.887e+04 336.007 115.669 0.000 3.82e+04 3.95e+04
has_college 1.416e+04 735.820 19.242 0.000 1.27e+04 1.56e+04
Omnibus: 2214.375 Durbin-Watson: 1.974
Prob(Omnibus): 0.000 Jarque-Bera (JB): 10578.287
Skew: 1.910 Prob(JB): 0.00
Kurtosis: 8.608 Cond. No. 2.59


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[5]:
# About 14,000 a year, and it's very significant.

Exercise 2

Next we can check for balance. Check the share of people in different racial groups who have college degrees. Are those differences statistically significant?

Race is coded as White Non-Hispanic (0), Black Non-Hispanic (1), Hispanic (2), Other (3).

Does the distribution also look different across counties (I don’t need statistical significance for this)?

Does the data seem balanced?

[6]:
# This question wording is, admittedly, a little iffy.
# Basically, while we want frequency tables to do our chi2
# test, I know _I_ can't look at a frequency table and have
# any sense of whether the groups are actually balanced.
# So I like to see shares with my eyes, then use freq table to test.
[7]:
# One easy way to get differences in shares (and bi-variate significance)
smf.ols("has_college ~ C(simplified_race)", cps).fit().summary()
[7]:
OLS Regression Results
Dep. Variable: has_college R-squared: 0.032
Model: OLS Adj. R-squared: 0.032
Method: Least Squares F-statistic: 122.1
Date: Sat, 04 Mar 2023 Prob (F-statistic): 7.74e-78
Time: 13:59:39 Log-Likelihood: -7675.0
No. Observations: 11150 AIC: 1.536e+04
Df Residuals: 11146 BIC: 1.539e+04
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.4382 0.006 79.420 0.000 0.427 0.449
C(simplified_race)[T.1.0] -0.1206 0.016 -7.507 0.000 -0.152 -0.089
C(simplified_race)[T.2.0] -0.2398 0.014 -17.682 0.000 -0.266 -0.213
C(simplified_race)[T.3.0] 0.0367 0.016 2.261 0.024 0.005 0.069
Omnibus: 46681.807 Durbin-Watson: 1.965
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1670.333
Skew: 0.377 Prob(JB): 0.00
Kurtosis: 1.261 Cond. No. 3.97


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[8]:
# Or just groupby:

cps.groupby("simplified_race")["has_college"].mean()

[8]:
simplified_race
0.0    0.438205
1.0    0.317647
2.0    0.198413
3.0    0.474900
Name: has_college, dtype: float64
[9]:
# Then for statistical significance:
ctab = pd.crosstab(cps["simplified_race"], cps["has_college"])
ctab
[9]:
has_college 0 1
simplified_race
0.0 4282 3340
1.0 696 324
2.0 1212 300
3.0 523 473
[10]:
import scipy.stats

chi2, p, dof, expected = scipy.stats.chi2_contingency(ctab.values)
p
[10]:
1.2993875943569016e-76
[11]:
# Insanely significant. :)
[12]:
# And look at counties.
cps.groupby("county")[["has_college"]].describe().sort_values(("has_college", "mean"))

[12]:
has_college
count mean std min 25% 50% 75% max
county
71-MO 4.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
700-VA 3.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
69-NY 1.0 0.000000 NaN 0.0 0.0 0.0 0.0 0.0
69-FL 4.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
17-MD 2.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ...
75-CA 17.0 0.882353 0.332106 0.0 1.0 1.0 1.0 1.0
21-NJ 4.0 1.000000 0.000000 1.0 1.0 1.0 1.0 1.0
19-NJ 2.0 1.000000 0.000000 1.0 1.0 1.0 1.0 1.0
81-IN 1.0 1.000000 NaN 1.0 1.0 1.0 1.0 1.0
171-MN 1.0 1.000000 NaN 1.0 1.0 1.0 1.0 1.0

326 rows × 8 columns

[13]:
cps.groupby("county")[["has_college"]].mean().describe()
[13]:
has_college
count 326.000000
mean 0.390058
std 0.207616
min 0.000000
25% 0.262319
50% 0.375000
75% 0.500000
max 1.000000
[14]:
# Good in the middle, but many counties have no college grads, and a few only have college grads.

Exercise 3

One of the other advantages of matching is that even when you have balanced data, you don’t have to go through the process of testing out different functional forms to see which fits the data best.

In our last exercise, we looked at the relationship between gender and earnings “controlling for age”, where we just put in age as a linear control. Plot a non-linear regression of annual_earnings on age (if you’re using plotnine, use geom_smooth(method="lowess") — if you’re using altair, use transform_loess (tutorial examples here)).

Does the relationship look linear?

Does this speak to why it’s nice to not have to think about functional forms with matching as much?

[15]:
import altair as alt

alt.data_transformers.enable("data_server")
alt.Chart(cps).encode(x="age", y="annual_earnings").transform_loess(
    on="age", loess="annual_earnings"
).mark_line()
[15]:
[16]:
# Not even remotely linear. Thank goodness we don't have to worry about that with matching!
# Though it wouldn't be *that* hard to fit a quadratic.

Matching!

Because DAME is an implementation of exact matching, we have to discretize all of our continuous variables. Thankfully, in this case we only have age, so this shouldn’t be too hard!

Exercise 4

Create a new variable that discretizes age into a single value for each decade of age.

Because CPS only has employment data on people 18 or over, though, include people who are 18 or 19 with the 20 year olds so that group isn’t too small, and if you see any other really small groups, please merge those too.

[17]:
cps["discretized_age"] = cps.age // 10
cps.loc[cps["discretized_age"] == 1, "discretized_age"] = 2
cps["discretized_age"].value_counts()
[17]:
3    2760
4    2551
5    2397
2    1990
6    1236
7     173
8      43
Name: discretized_age, dtype: int64
[18]:
# 70 and 80 year olds are tiny groups.
cps.loc[cps["discretized_age"] == 8, "discretized_age"] = 7

Exercise 5

We also have to convert our string variables into numeric variables for DAME, so convert county and class94 to numeric vectors of integers.

(Note: it’s not clear whether class94 belongs: if it reflects people choosing fields based on passion, it belongs; if people choose certain jobs because of their degrees, it’s not something we’d actually want in our regression.)

Hint: if you use pd.Categorical to convert your variable to a categorical, you can pull the underlying integer codes with .codes.
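As a toy illustration of that hint (the list here is made up):

import pandas as pd

pd.Categorical(["a", "b", "a"]).codes  # -> array([0, 1, 0])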

[19]:
cps["county"] = pd.Categorical(cps["county"]).codes
cps["county"].value_counts()
[19]:
41     576
200    275
12     230
33     225
51     223
      ...
122      1
263      1
285      1
154      1
213      1
Name: county, Length: 326, dtype: int64
[20]:
cps["class94"] = pd.Categorical(cps["class94"]).codes
cps["class94"].value_counts()
[20]:
3    7809
1     740
4     706
2     615
6     552
5     387
0     337
7       4
Name: class94, dtype: int64

Let’s Do Matching with DAME

Exercise 6

First, drop all the variables you don’t want in matching (e.g. your original age variable), and any observations for which annual_earnings is missing.

You will probably also have to drop a column named index: DAME will try and match on ANY included variables, and so because there was a column called index in the data we imported, if we leave it in DAME will try (and obviously fail) to match on index.

Also, it’s best to reset your index, as dame_flame uses index labels (i.e., the values in df.index) to identify matches, so you want to be sure those are unique.

[21]:
for_matching = cps.drop(["age", "index"], axis="columns")
for_matching = for_matching[for_matching.annual_earnings.notnull()]
for_matching = for_matching.reset_index(drop=True)
for_matching
[21]:
annual_earnings female simplified_race has_college county class94 discretized_age
0 42900.0 1 0.0 0 10 3 5
1 31200.0 0 2.0 0 31 3 3
2 20020.0 0 0.0 1 8 3 6
3 22859.2 0 0.0 0 44 1 4
4 73860.8 0 0.0 1 24 3 3
... ... ... ... ... ... ... ...
5510 33800.0 1 3.0 0 247 3 3
5511 23920.0 0 3.0 0 272 3 5
5512 31200.0 0 2.0 0 246 3 2
5513 37440.0 0 0.0 0 99 3 2
5514 26000.0 0 1.0 0 23 2 5

5515 rows × 7 columns

Exercise 7

The syntax of dame_flame is similar to the syntax of sklearn. If you start with a dataset called my_data containing a treat column with the treatment assignment and an outcome column with the outcome of interest (\(Y\)), the syntax for basic matching would be:

import dame_flame
model = dame_flame.matching.DAME(repeats=False, verbose=3, want_pe=True)
model.fit(
    my_data,
    treatment_column_name="treat",
    outcome_column_name="outcome",
)
result = model.predict(my_data)

Where the arguments:

  • repeats=False says that I only want each observation to get matched once. We’ll talk about what happens if we use repeats=True below.

  • verbose=3 tells DAME to report everything it’s doing as it goes.

  • want_pe says “please include the predictive error in your printout at each step”. This is a measure of match quality.

So run DAME on your data!

[22]:
model = dame_flame.matching.DAME(repeats=False, verbose=3, want_pe=True)
model.fit(
    for_matching,
    treatment_column_name="has_college",
    outcome_column_name="annual_earnings",
)
result = model.predict(for_matching)
Completed iteration 0 of matching
        Number of matched groups formed in total:  370
        Unmatched treated units:  644 out of a total of  1150 treated units
        Unmatched control units:  3187 out of a total of  4365 control units
        Number of matches made this iteration:  1684
        Number of matches made so far:  1684
        Covariates dropped so far:  set()
        Predictive error of covariate set used to match:  1199312680.0957854
Completed iteration 1 of matching
        Number of matched groups formed in total:  494
        Unmatched treated units:  25 out of a total of  1150 treated units
        Unmatched control units:  180 out of a total of  4365 control units
        Number of matches made this iteration:  3626
        Number of matches made so far:  5310
        Covariates dropped so far:  frozenset({'county'})
        Predictive error of covariate set used to match:  1199421883.1095908
Completed iteration 2 of matching
        Number of matched groups formed in total:  494
        Unmatched treated units:  25 out of a total of  1150 treated units
        Unmatched control units:  180 out of a total of  4365 control units
        Number of matches made this iteration:  0
        Number of matches made so far:  5310
        Covariates dropped so far:  frozenset({'simplified_race'})
        Predictive error of covariate set used to match:  1204727749.8949614
Completed iteration 3 of matching
        Number of matched groups formed in total:  505
        Unmatched treated units:  8 out of a total of  1150 treated units
        Unmatched control units:  129 out of a total of  4365 control units
        Number of matches made this iteration:  68
        Number of matches made so far:  5378
        Covariates dropped so far:  frozenset({'county', 'simplified_race'})
        Predictive error of covariate set used to match:  1204742613.479154
Completed iteration 4 of matching
        Number of matched groups formed in total:  505
        Unmatched treated units:  8 out of a total of  1150 treated units
        Unmatched control units:  129 out of a total of  4365 control units
        Number of matches made this iteration:  0
        Number of matches made so far:  5378
        Covariates dropped so far:  frozenset({'class94'})
        Predictive error of covariate set used to match:  1205072671.3262901
Completed iteration 5 of matching
        Number of matched groups formed in total:  508
        Unmatched treated units:  5 out of a total of  1150 treated units
        Unmatched control units:  120 out of a total of  4365 control units
        Number of matches made this iteration:  12
        Number of matches made so far:  5390
        Covariates dropped so far:  frozenset({'class94', 'county'})
        Predictive error of covariate set used to match:  1205171280.4727237
Completed iteration 6 of matching
        Number of matched groups formed in total:  509
        Unmatched treated units:  4 out of a total of  1150 treated units
        Unmatched control units:  119 out of a total of  4365 control units
        Number of matches made this iteration:  2
        Number of matches made so far:  5392
        Covariates dropped so far:  frozenset({'class94', 'simplified_race'})
        Predictive error of covariate set used to match:  1210524158.7436352
Completed iteration 7 of matching
        Number of matched groups formed in total:  511
        Unmatched treated units:  0 out of a total of  1150 treated units
        Unmatched control units:  110 out of a total of  4365 control units
        Number of matches made this iteration:  13
        Number of matches made so far:  5405
        Covariates dropped so far:  frozenset({'class94', 'county', 'simplified_race'})
        Predictive error of covariate set used to match:  1210539313.933855
5405 units matched. We finished with no more treated units to match

Interpreting DAME output

The output you get from doing this should be reports from about 8 iterations of matching. In each iteration, you’ll see a description of the number of matches made in the iteration, the number of treatment units still unmatched, and the number of control units unmatched.

In the first iteration, the algorithm tries to match observations that match on all the variables in your data. That’s why in the first iteration, you see the set of variables being dropped is an empty set – it hasn’t dropped any variables:

Completed iteration 0 of matching
    Number of matched groups formed in total:  370
    Unmatched treated units:  644 out of a total of  1150 treated units
    Unmatched control units:  3187 out of a total of  4365 control units
    Number of matches made this iteration:  1684
    Number of matches made so far:  1684
    Covariates dropped so far:  set()
    Predictive error of covariate set used to match:  1199312680.0957854

(Note: depending on how you binned ages, you may get slightly different results than this.)

But as we can see from this output, the algorithm found 1,684 observations with perfect matches—treated and untreated units that agree exactly on all the variables we included. But we also see we still have 644 unmatched treated units, so what do we do?

The answer is that if we want to match more of our treated observations, we have to try to match on a subset of our variables.

But what variable should we drop? This is the secret sauce of DAME. DAME picks the variables to drop by trying to predict our outcome \(Y\) using all our variables (by default using a ridge regression), then it drops the matching variable that is contributing the least to that prediction. Since our goal in matching is to eliminate baseline differences (\(E(Y_0|D=1) - E(Y_0|D=0)\)), dropping the covariates least related to \(Y\) makes sense.
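To see the intuition, here’s a rough sketch of that ranking logic using scikit-learn’s Ridge and the for_matching dataframe from above. This is only an illustration of the idea, not dame_flame’s actual internal code, and the numbers won’t match DAME’s reported prediction errors exactly:

import pandas as pd
from sklearn.linear_model import Ridge

covariates = ["female", "simplified_race", "county", "class94", "discretized_age"]

def ridge_mse(df, cols, outcome="annual_earnings"):
    # One-hot encode the (categorical) matching variables, fit a ridge
    # regression, and return the in-sample mean squared error.
    X = pd.get_dummies(df[cols].astype("category"), drop_first=True)
    fit = Ridge().fit(X, df[outcome])
    return ((df[outcome] - fit.predict(X)) ** 2).mean()

# How much does the fit degrade when each covariate is left out?
mse_without = {
    dropped: ridge_mse(for_matching, [c for c in covariates if c != dropped])
    for dropped in covariates
}

# The covariate whose removal hurts the prediction least is the one DAME
# should be willing to drop first (county, in the output above).
print(min(mse_without, key=mse_without.get))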

As a result, in the second iteration (called iteration 1, since it uses 0-based indexing), we see that the variable it drops first is county, and it’s subsequently able to make another 3,626 new matches on the remaining variables!

Completed iteration 1 of matching
    Number of matched groups formed in total:  494
    Unmatched treated units:  25 out of a total of  1150 treated units
    Unmatched control units:  180 out of a total of  4365 control units
    Number of matches made this iteration:  3626
    Number of matches made so far:  5310
    Covariates dropped so far:  frozenset({'county'})
    Predictive error of covariate set used to match:  1199421883.1095908

And so DAME continues until, after 8 iterations, it has matched all treated observations.

Exercise 8

Congratulations! You just ran your first one-to-many matching!

The next step is to think about which of the matches that DAME generated are good enough for inclusion in our analysis. As you may recall, one of the choices you have to make as a researcher when doing matching is how “good” a match has to be in order to be included in your final data set. By default, DAME will keep dropping matching variables until it has been able to match all the treated observations or runs out of variables. It will do this no matter how bad the matches start to become – if it ends up with a treated observation and a control observation that can only be matched on gender, it will match them just on gender, even though we probably don’t think that’s a “good” match.

The way to control this behavior is to tell DAME manually when to stop using the early_stop_iterations argument.

So when is a good time to stop? There’s no objective or “right” answer to that question. It fundamentally comes down to a trade-off between bias (which gets higher if you allow more low-quality matches into your data) and variance (which will go down as you increase the number of matches you keep).

But one way to start the process of picking a cut point is to examine how the quality of matches evolves over iterations. DAME keeps this information in model.pe_each_iter. This shows, for each iteration, the “prediction error” of the set of matching variables used for matching in that step. This “prediction error” is the mean-squared error of regressing \(Y\) on the subset of matching variables being used in a given iteration (by default in a ridge regression). By design, of course, this is always increasing.

To see how this evolves, plot your pe against iteration numbers. You can also see the pe values for each iteration reported in the output from when DAME ran above if you want to make sure you’re lining up the errors with iterations right.

Are there any points where the match quality seems to fall off dramatically?

[23]:
model.pe_each_iter

[23]:
[1199312680.0957854,
 1199421883.1095908,
 1204727749.8949614,
 1204742613.479154,
 1205072671.3262901,
 1205171280.4727237,
 1210524158.7436352,
 1210539313.933855]
[24]:
for_pe = pd.DataFrame(
    {"pe": model.pe_each_iter, "i": range(0, len(model.pe_each_iter))}
)
for_pe
[24]:
pe i
0 1.199313e+09 0
1 1.199422e+09 1
2 1.204728e+09 2
3 1.204743e+09 3
4 1.205073e+09 4
5 1.205171e+09 5
6 1.210524e+09 6
7 1.210539e+09 7
[25]:
alt.Chart(for_pe).encode(x="i", y=alt.Y("pe", scale=alt.Scale(zero=False))).mark_line()
[25]:
[26]:
# Yup! Iterations 2 and 6 are really the big ones...

Exercise 9

Suppose we want to ensure we have at least 5,000 observations in our matched data—where might you cut off the data to get a sample size of at least that but before a big quality falloff?

[27]:
# I'd stop after iteration 1 (the second iteration)—things fall off fast
# starting after that, but with very few added matches.

Exercise 10

Re-run your matching, stopping at the point you picked above using early_stop_iterations.

[28]:
model = dame_flame.matching.DAME(
    repeats=False, verbose=3, want_pe=True, early_stop_iterations=1
)
model.fit(
    for_matching,
    treatment_column_name="has_college",
    outcome_column_name="annual_earnings",
)
result = model.predict(for_matching)
Completed iteration 0 of matching
        Number of matched groups formed in total:  370
        Unmatched treated units:  644 out of a total of  1150 treated units
        Unmatched control units:  3187 out of a total of  4365 control units
        Number of matches made this iteration:  1684
        Number of matches made so far:  1684
        Covariates dropped so far:  set()
        Predictive error of covariate set used to match:  1199312680.0957854
Completed iteration 1 of matching
        Number of matched groups formed in total:  494
        Unmatched treated units:  25 out of a total of  1150 treated units
        Unmatched control units:  180 out of a total of  4365 control units
        Number of matches made this iteration:  3626
        Number of matches made so far:  5310
        Covariates dropped so far:  frozenset({'county'})
        Predictive error of covariate set used to match:  1199421883.1095908
5310 units matched. We stopped after iteration 1

Getting Back a Dataset

OK, my one current complaint with DAME is that it doesn’t just give you back a nice dataset of your matches for analysis. If we look at our result (the output of model.predict()), it’s almost what we want, except it’s dropped our treatment and outcome columns, and it’s put a string * in any entry where a value wasn’t used for matching:

  female simplified_race   county   class94   discretized_age
0  1.0     0.0              10.0      3.0          5.0
1  0.0     2.0              *         3.0          3.0
2  0.0     0.0              8.0        3.0         6.0
3  0.0     0.0              *         1.0          4.0
4  0.0     0.0              24.0      3.0          3.0

So for now (though I think this will get updated in the package), we’ll have to do it ourselves! Just copy-paste this:

def get_dataframe(model, result_of_fit):

    # Get original data
    better = model.input_data.loc[result_of_fit.index]
    if not better.index.is_unique:
        raise ValueError("Need index values in input data to be unique")

    # Get match groups for clustering
    better["match_group"] = np.nan
    better["match_group_size"] = np.nan
    for idx, group in enumerate(model.units_per_group):
        better.loc[group, "match_group"] = idx
        better.loc[group, "match_group_size"] = len(group)

    # Get weights. I THINK this is right?! At least with repeats=False?
    t = model.treatment_column_name
    better["t_in_group"] = better.groupby("match_group")[t].transform(np.sum)

    # Make weights
    better["weights"] = np.nan
    better.loc[better[t] == 1, "weights"] = 1  # treatments are 1

    # Controls start as proportional to num of treatments
    # each observation is matched to.
    better.loc[better[t] == 0, "weights"] = better["t_in_group"] / (
        better["match_group_size"] - better["t_in_group"]
    )

    # Then re-normalize for num unique control observations.
    control_weights = better[better[t] == 0]["weights"].sum()

    num_control_obs = len(better[better[t] == 0].index.drop_duplicates())
    renormalization = num_control_obs / control_weights
    better.loc[better[t] == 0, "weights"] = (
        better.loc[better[t] == 0, "weights"] * renormalization
    )
    assert better.weights.notnull().all()

    better = better.drop(["t_in_group"], axis="columns")

    # Make sure right length and values!
    assert len(result_of_fit) == len(better)
    assert better.loc[better[t] == 0, "weights"].sum() == num_control_obs

    return better

Exercise 11

Copy-paste that code and run it with your (fitted) model and the result you got back from model.predict(). Then we’ll work with the output of that. You should get back a single dataframe of the same length as your matched result.

[29]:
result

[29]:
female simplified_race county class94 discretized_age
0 1.0 0.0 10.0 3.0 5.0
1 0.0 2.0 * 3.0 3.0
2 0.0 0.0 8.0 3.0 6.0
3 0.0 0.0 * 1.0 4.0
4 0.0 0.0 24.0 3.0 3.0
... ... ... ... ... ...
5509 0.0 0.0 * 3.0 6.0
5510 1.0 3.0 247.0 3.0 3.0
5511 0.0 3.0 * 3.0 5.0
5512 0.0 2.0 246.0 3.0 2.0
5513 0.0 0.0 99.0 3.0 2.0

5310 rows × 5 columns

[30]:
def get_dataframe(model, result_of_fit):

    # Get original data
    better = model.input_data.loc[result_of_fit.index]
    if not better.index.is_unique:
        raise ValueError("Need index values in input data to be unique")

    # Get match groups for clustering
    better["match_group"] = np.nan
    better["match_group_size"] = np.nan
    for idx, group in enumerate(model.units_per_group):
        better.loc[group, "match_group"] = idx
        better.loc[group, "match_group_size"] = len(group)

    # Get weights. I THINK this is right?! At least with repeats=False?
    t = model.treatment_column_name
    better["t_in_group"] = better.groupby("match_group")[t].transform(np.sum)

    # Make weights
    better["weights"] = np.nan
    better.loc[better[t] == 1, "weights"] = 1  # treatments are 1

    # Controls start as proportional to num of treatments
    # each observation is matched to.
    better.loc[better[t] == 0, "weights"] = better["t_in_group"] / (
        better["match_group_size"] - better["t_in_group"]
    )

    # Then re-normalize for num unique control observations.
    control_weights = better[better[t] == 0]["weights"].sum()

    num_control_obs = len(better[better[t] == 0].index.drop_duplicates())
    renormalization = num_control_obs / control_weights
    better.loc[better[t] == 0, "weights"] = (
        better.loc[better[t] == 0, "weights"] * renormalization
    )
    assert better.weights.notnull().all()

    better = better.drop(["t_in_group"], axis="columns")

    # Make sure right length and values!
    assert len(result_of_fit) == len(better)
    assert better.loc[better[t] == 0, "weights"].sum() == num_control_obs

    return better

[31]:
matched_data = get_dataframe(model, result)
[32]:
matched_data.head()
[32]:
annual_earnings female simplified_race has_college county class94 discretized_age match_group match_group_size weights
0 42900.0 1 0.0 0 10 3 5 59.0 5.0 0.930000
1 31200.0 0 2.0 0 31 3 3 411.0 108.0 0.070189
2 20020.0 0 0.0 1 8 3 6 52.0 3.0 1.000000
3 22859.2 0 0.0 0 44 1 4 424.0 28.0 1.240000
4 73860.8 0 0.0 1 24 3 3 106.0 7.0 1.000000

Check Your Matches and Analyze

Exercise 12

We previously tested balance on simplified_race, and by county. Check those again. Are there still statistically significant differences in college education by simplified_race?

Note that when you test for this, you’ll need to take into account the weights column you got back from get_dataframe. What DAME does is not actually the 1-to-1 matching described in our readings – instead, it puts all the observations that match exactly (however many there are) into the same “group”. (These groups are identified in the dataframe you got from get_dataframe by the column match_group, and the size of each group is in match_group_size.)

So to analyze the data, you need to use the wls (weighted least squares) function in statsmodels. For example, if your data is called matched_data, you might run:

smf.wls(
    "has_college ~ C(simplified_race)", matched_data, weights=matched_data["weights"]
).fit().summary()
[33]:
import statsmodels.formula.api as smf

smf.wls(
    "has_college ~ C(simplified_race)", matched_data, weights=matched_data["weights"]
).fit().summary()
[33]:
WLS Regression Results
Dep. Variable: has_college R-squared: 0.000
Model: WLS Adj. R-squared: -0.001
Method: Least Squares F-statistic: 1.134e-12
Date: Sat, 04 Mar 2023 Prob (F-statistic): 1.00
Time: 13:59:41 Log-Likelihood: -3736.0
No. Observations: 5310 AIC: 7480.
Df Residuals: 5306 BIC: 7506.
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.2119 0.007 31.608 0.000 0.199 0.225
C(simplified_race)[T.1.0] 3.469e-17 0.018 1.92e-15 1.000 -0.036 0.036
C(simplified_race)[T.2.0] -5.378e-17 0.019 -2.86e-15 1.000 -0.037 0.037
C(simplified_race)[T.3.0] 1.18e-16 0.020 5.83e-15 1.000 -0.040 0.040
Omnibus: 860.389 Durbin-Watson: 2.000
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1353.227
Skew: 1.234 Prob(JB): 1.41e-294
Kurtosis: 2.851 Cond. No. 3.95


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Exercise 13

Now use a weighted least squares regression on your matched data to regress annual earnings on just having a college education. What is the apparent effect of a BA? How does that compare to our initial estimate using the raw CPS data (before matching)?

[34]:
smf.wls(
    "annual_earnings ~ has_college", matched_data, weights=matched_data["weights"]
).fit().summary()
[34]:
WLS Regression Results
Dep. Variable: annual_earnings R-squared: 0.058
Model: WLS Adj. R-squared: 0.057
Method: Least Squares F-statistic: 324.1
Date: Sat, 04 Mar 2023 Prob (F-statistic): 2.19e-70
Time: 13:59:41 Log-Likelihood: -61753.
No. Observations: 5310 AIC: 1.235e+05
Df Residuals: 5308 BIC: 1.235e+05
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3.909e+04 351.293 111.287 0.000 3.84e+04 3.98e+04
has_college 1.374e+04 763.203 18.003 0.000 1.22e+04 1.52e+04
Omnibus: 2934.035 Durbin-Watson: 2.006
Prob(Omnibus): 0.000 Jarque-Bera (JB): 33100.529
Skew: 2.424 Prob(JB): 0.00
Kurtosis: 14.230 Cond. No. 2.58


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[35]:
# dame_flame.utils.post_processing.ATE(matching_object=model)

Exercise 14

Now include our other matching variables as controls (i.e., all the covariates you gave DAME to match on). Does the coefficient change?

[36]:
smf.wls(
    "annual_earnings ~ has_college + C(simplified_race)"
    " + C(discretized_age) + female + C(county)",
    matched_data,
    weights=matched_data["weights"],
).fit().summary()
[36]:
WLS Regression Results
Dep. Variable: annual_earnings R-squared: 0.238
Model: WLS Adj. R-squared: 0.188
Method: Least Squares F-statistic: 4.786
Date: Sat, 04 Mar 2023 Prob (F-statistic): 1.62e-132
Time: 13:59:42 Log-Likelihood: -61189.
No. Observations: 5310 AIC: 1.230e+05
Df Residuals: 4984 BIC: 1.252e+05
Df Model: 325
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 4.761e+04 2429.505 19.595 0.000 4.28e+04 5.24e+04
C(simplified_race)[T.1.0] -8344.9150 1067.331 -7.818 0.000 -1.04e+04 -6252.476
C(simplified_race)[T.2.0] -6753.9175 1140.523 -5.922 0.000 -8989.844 -4517.991
C(simplified_race)[T.3.0] -3220.6308 1202.997 -2.677 0.007 -5579.035 -862.227
C(discretized_age)[T.3] 8584.0505 868.037 9.889 0.000 6882.316 1.03e+04
C(discretized_age)[T.4] 1.251e+04 923.078 13.558 0.000 1.07e+04 1.43e+04
C(discretized_age)[T.5] 1.266e+04 964.214 13.131 0.000 1.08e+04 1.46e+04
C(discretized_age)[T.6] 9235.0616 1189.062 7.767 0.000 6903.976 1.16e+04
C(discretized_age)[T.7] 1.347e+04 2975.342 4.528 0.000 7639.580 1.93e+04
C(county)[T.1] -1.114e+04 3231.653 -3.446 0.001 -1.75e+04 -4799.550
C(county)[T.2] -1.279e+04 3245.734 -3.942 0.000 -1.92e+04 -6430.115
C(county)[T.3] -9142.9921 1.19e+04 -0.771 0.441 -3.24e+04 1.41e+04
C(county)[T.4] -6471.4363 2990.234 -2.164 0.030 -1.23e+04 -609.261
C(county)[T.5] -6378.3577 4178.131 -1.527 0.127 -1.46e+04 1812.617
C(county)[T.6] -1.627e+04 1.04e+04 -1.566 0.118 -3.66e+04 4103.579
C(county)[T.7] -1.023e+04 3835.754 -2.666 0.008 -1.77e+04 -2706.649
C(county)[T.8] -1.153e+04 3175.916 -3.629 0.000 -1.78e+04 -5300.388
C(county)[T.9] -1.382e+04 5130.249 -2.694 0.007 -2.39e+04 -3762.096
C(county)[T.10] -1.501e+04 3427.926 -4.380 0.000 -2.17e+04 -8293.693
C(county)[T.11] -1.418e+04 3040.008 -4.664 0.000 -2.01e+04 -8218.173
C(county)[T.12] -6849.3954 3121.285 -2.194 0.028 -1.3e+04 -730.303
C(county)[T.13] -1.381e+04 3368.446 -4.101 0.000 -2.04e+04 -7209.235
C(county)[T.14] -1.369e+04 3918.748 -3.493 0.000 -2.14e+04 -6007.378
C(county)[T.15] -9190.4689 4799.618 -1.915 0.056 -1.86e+04 218.895
C(county)[T.16] -1.061e+04 3616.372 -2.935 0.003 -1.77e+04 -3523.982
C(county)[T.17] -1.601e+04 7731.556 -2.070 0.038 -3.12e+04 -848.909
C(county)[T.18] -1788.7046 4787.690 -0.374 0.709 -1.12e+04 7597.275
C(county)[T.19] -1.684e+04 6555.900 -2.569 0.010 -2.97e+04 -3989.067
C(county)[T.20] -1.4e+04 4328.098 -3.235 0.001 -2.25e+04 -5515.772
C(county)[T.21] -6184.3776 3595.547 -1.720 0.085 -1.32e+04 864.476
C(county)[T.22] -1.826e+04 3694.928 -4.941 0.000 -2.55e+04 -1.1e+04
C(county)[T.23] -1.242e+04 3244.837 -3.828 0.000 -1.88e+04 -6059.940
C(county)[T.24] -1.104e+04 3171.193 -3.482 0.001 -1.73e+04 -4824.300
C(county)[T.25] -1.506e+04 3496.105 -4.309 0.000 -2.19e+04 -8211.049
C(county)[T.26] -1.391e+04 3137.745 -4.432 0.000 -2.01e+04 -7755.819
C(county)[T.27] -1.337e+04 3345.523 -3.997 0.000 -1.99e+04 -6814.638
C(county)[T.28] -1.297e+04 5671.204 -2.287 0.022 -2.41e+04 -1853.870
C(county)[T.29] -4845.0790 5203.125 -0.931 0.352 -1.5e+04 5355.335
C(county)[T.30] -2248.8602 5067.398 -0.444 0.657 -1.22e+04 7685.469
C(county)[T.31] -1.155e+04 4710.979 -2.452 0.014 -2.08e+04 -2313.897
C(county)[T.32] -7114.5598 3632.908 -1.958 0.050 -1.42e+04 7.538
C(county)[T.33] -1.272e+04 3067.035 -4.149 0.000 -1.87e+04 -6711.354
C(county)[T.34] -1.32e+04 3443.339 -3.833 0.000 -1.99e+04 -6447.428
C(county)[T.35] -9324.2429 3384.603 -2.755 0.006 -1.6e+04 -2688.932
C(county)[T.36] -1.505e+04 3658.198 -4.115 0.000 -2.22e+04 -7881.001
C(county)[T.37] -1.023e+04 3762.247 -2.720 0.007 -1.76e+04 -2856.766
C(county)[T.38] -1.308e+04 3310.408 -3.951 0.000 -1.96e+04 -6590.153
C(county)[T.39] -1.177e+04 3252.230 -3.618 0.000 -1.81e+04 -5391.513
C(county)[T.40] -1.516e+04 3677.299 -4.123 0.000 -2.24e+04 -7954.028
C(county)[T.41] -9224.6049 2690.524 -3.429 0.001 -1.45e+04 -3949.994
C(county)[T.42] -1.365e+04 3460.095 -3.944 0.000 -2.04e+04 -6864.991
C(county)[T.43] -1.165e+04 3698.161 -3.151 0.002 -1.89e+04 -4404.402
C(county)[T.44] -1.07e+04 2961.782 -3.611 0.000 -1.65e+04 -4889.799
C(county)[T.45] -7052.4496 3127.876 -2.255 0.024 -1.32e+04 -920.437
C(county)[T.46] -9064.0230 3380.497 -2.681 0.007 -1.57e+04 -2436.761
C(county)[T.47] -1.681e+04 3013.844 -5.579 0.000 -2.27e+04 -1.09e+04
C(county)[T.48] -8897.8731 3296.249 -2.699 0.007 -1.54e+04 -2435.774
C(county)[T.49] 2.125e+04 7993.324 2.659 0.008 5580.256 3.69e+04
C(county)[T.50] -2.251e+04 1.13e+04 -1.997 0.046 -4.46e+04 -409.459
C(county)[T.51] -5880.2544 3452.043 -1.703 0.089 -1.26e+04 887.270
C(county)[T.52] -1.006e+04 8297.064 -1.213 0.225 -2.63e+04 6201.302
C(county)[T.53] -7396.8710 1.97e+04 -0.376 0.707 -4.59e+04 3.12e+04
C(county)[T.54] -1.349e+04 8853.843 -1.524 0.128 -3.08e+04 3865.030
C(county)[T.55] 6016.8132 9691.581 0.621 0.535 -1.3e+04 2.5e+04
C(county)[T.56] -7435.2407 7281.280 -1.021 0.307 -2.17e+04 6839.272
C(county)[T.57] 8434.9441 1.4e+04 0.604 0.546 -1.89e+04 3.58e+04
C(county)[T.58] -5918.5652 5278.657 -1.121 0.262 -1.63e+04 4429.925
C(county)[T.59] -1.631e+04 9894.838 -1.649 0.099 -3.57e+04 3086.114
C(county)[T.60] -1.499e+04 6719.761 -2.231 0.026 -2.82e+04 -1819.273
C(county)[T.61] -8015.9218 1.09e+04 -0.732 0.464 -2.95e+04 1.34e+04
C(county)[T.62] 1.171e+04 1.32e+04 0.884 0.377 -1.43e+04 3.77e+04
C(county)[T.63] -1.124e+04 8626.461 -1.303 0.193 -2.82e+04 5671.473
C(county)[T.64] 1.57e+04 8596.970 1.826 0.068 -1156.314 3.26e+04
C(county)[T.65] -1.964e+04 9254.939 -2.122 0.034 -3.78e+04 -1498.590
C(county)[T.66] -2.286e+04 1.47e+04 -1.550 0.121 -5.18e+04 6046.864
C(county)[T.67] -1.286e+04 2.97e+04 -0.434 0.664 -7.1e+04 4.53e+04
C(county)[T.68] 265.8046 9594.437 0.028 0.978 -1.85e+04 1.91e+04
C(county)[T.69] -1.382e+04 9042.247 -1.528 0.127 -3.15e+04 3909.886
C(county)[T.70] -2.086e+04 1.33e+04 -1.567 0.117 -4.7e+04 5231.063
C(county)[T.71] -1.914e+04 8999.435 -2.127 0.033 -3.68e+04 -1501.949
C(county)[T.72] 1.203e+04 1.91e+04 0.630 0.528 -2.54e+04 4.95e+04
C(county)[T.73] -4151.5319 5648.737 -0.735 0.462 -1.52e+04 6922.479
C(county)[T.74] -1.341e+04 7534.403 -1.780 0.075 -2.82e+04 1357.560
C(county)[T.75] -1.719e+04 3924.766 -4.381 0.000 -2.49e+04 -9500.690
C(county)[T.76] -3.746e+04 2.62e+04 -1.430 0.153 -8.88e+04 1.39e+04
C(county)[T.77] 237.8624 6897.951 0.034 0.972 -1.33e+04 1.38e+04
C(county)[T.78] 6770.8989 2.24e+04 0.302 0.763 -3.72e+04 5.07e+04
C(county)[T.79] 4.146e+04 8462.736 4.899 0.000 2.49e+04 5.8e+04
C(county)[T.80] -1.741e+04 5589.953 -3.115 0.002 -2.84e+04 -6454.909
C(county)[T.81] -1.143e+04 5721.800 -1.998 0.046 -2.27e+04 -216.304
C(county)[T.82] -1.658e+04 1.61e+04 -1.032 0.302 -4.81e+04 1.49e+04
C(county)[T.83] -2.307e+04 1.25e+04 -1.840 0.066 -4.77e+04 1514.953
C(county)[T.84] -8999.9730 1.22e+04 -0.740 0.459 -3.28e+04 1.48e+04
C(county)[T.85] 3419.8384 9046.856 0.378 0.705 -1.43e+04 2.12e+04
C(county)[T.86] 1.824e+04 1.43e+04 1.279 0.201 -9720.390 4.62e+04
C(county)[T.87] -2.403e+04 1.09e+04 -2.205 0.027 -4.54e+04 -2667.316
C(county)[T.88] -1.326e+04 1.74e+04 -0.763 0.445 -4.73e+04 2.08e+04
C(county)[T.89] -1.574e+04 8531.472 -1.845 0.065 -3.25e+04 985.816
C(county)[T.90] -8870.8793 4968.614 -1.785 0.074 -1.86e+04 869.791
C(county)[T.91] -1.04e+04 9791.880 -1.062 0.288 -2.96e+04 8799.744
C(county)[T.92] -1.687e+04 1.17e+04 -1.441 0.150 -3.98e+04 6076.622
C(county)[T.93] -4593.8182 9052.966 -0.507 0.612 -2.23e+04 1.32e+04
C(county)[T.94] -66.5470 9761.637 -0.007 0.995 -1.92e+04 1.91e+04
C(county)[T.95] -1.199e+04 5085.007 -2.357 0.018 -2.2e+04 -2017.441
C(county)[T.96] -1.27e+04 1.03e+04 -1.228 0.219 -3.3e+04 7572.284
C(county)[T.97] -2.457e+04 9432.408 -2.605 0.009 -4.31e+04 -6080.404
C(county)[T.98] 2197.3011 1.53e+04 0.144 0.886 -2.78e+04 3.22e+04
C(county)[T.99] -9289.2093 3098.282 -2.998 0.003 -1.54e+04 -3215.213
C(county)[T.100] 1.121e+04 7876.656 1.423 0.155 -4235.032 2.66e+04
C(county)[T.101] -1.684e+04 9598.189 -1.754 0.079 -3.57e+04 1979.720
C(county)[T.102] -2597.8743 1.35e+04 -0.192 0.847 -2.91e+04 2.39e+04
C(county)[T.103] -1.014e+04 9066.754 -1.118 0.264 -2.79e+04 7638.542
C(county)[T.104] -252.2631 1.16e+04 -0.022 0.983 -2.3e+04 2.25e+04
C(county)[T.105] -1.145e+04 1.3e+04 -0.880 0.379 -3.7e+04 1.41e+04
C(county)[T.106] -3.539e+04 3.35e+04 -1.055 0.291 -1.01e+05 3.03e+04
C(county)[T.107] -3.364e+04 1.71e+04 -1.962 0.050 -6.72e+04 -29.720
C(county)[T.109] -4795.8027 8034.001 -0.597 0.551 -2.05e+04 1.1e+04
C(county)[T.110] -8716.5745 1.06e+04 -0.822 0.411 -2.95e+04 1.21e+04
C(county)[T.111] -1.946e+04 2.19e+04 -0.889 0.374 -6.24e+04 2.35e+04
C(county)[T.112] -7351.2007 1.42e+04 -0.516 0.606 -3.53e+04 2.06e+04
C(county)[T.113] -2.293e+04 2.6e+04 -0.882 0.378 -7.39e+04 2.8e+04
C(county)[T.114] -1.908e+04 1.61e+04 -1.185 0.236 -5.07e+04 1.25e+04
C(county)[T.115] -2798.0105 1e+04 -0.280 0.780 -2.24e+04 1.68e+04
C(county)[T.116] -8785.2988 1.35e+04 -0.650 0.516 -3.53e+04 1.77e+04
C(county)[T.117] -2.473e+04 4.28e+04 -0.577 0.564 -1.09e+05 5.92e+04
C(county)[T.118] -1.698e+04 1.1e+04 -1.538 0.124 -3.86e+04 4669.861
C(county)[T.120] -1.704e+04 1.32e+04 -1.287 0.198 -4.3e+04 8908.447
C(county)[T.121] -2.767e+04 1.32e+04 -2.093 0.036 -5.36e+04 -1747.007
C(county)[T.123] -1.291e+04 5091.347 -2.536 0.011 -2.29e+04 -2931.559
C(county)[T.124] -3.209e+04 1e+04 -3.197 0.001 -5.18e+04 -1.24e+04
C(county)[T.125] -1.795e+04 1.25e+04 -1.440 0.150 -4.24e+04 6490.095
C(county)[T.126] -6948.1460 9793.690 -0.709 0.478 -2.61e+04 1.23e+04
C(county)[T.127] -4111.7736 9598.180 -0.428 0.668 -2.29e+04 1.47e+04
C(county)[T.128] -1.407e+04 3.35e+04 -0.420 0.675 -7.98e+04 5.17e+04
C(county)[T.129] -1.338e+04 1.22e+04 -1.094 0.274 -3.73e+04 1.06e+04
C(county)[T.130] -2.958e+04 1.24e+04 -2.379 0.017 -5.4e+04 -5205.688
C(county)[T.131] -9462.1890 2.13e+04 -0.445 0.656 -5.11e+04 3.22e+04
C(county)[T.132] -7889.2394 5638.330 -1.399 0.162 -1.89e+04 3164.369
C(county)[T.133] 2175.4884 1.51e+04 0.144 0.886 -2.75e+04 3.18e+04
C(county)[T.134] 5.467e+04 1.8e+04 3.045 0.002 1.95e+04 8.99e+04
C(county)[T.135] -2.083e+04 1.38e+04 -1.509 0.131 -4.79e+04 6228.141
C(county)[T.136] -4420.8608 5997.620 -0.737 0.461 -1.62e+04 7337.113
C(county)[T.137] -4087.2387 8100.626 -0.505 0.614 -2e+04 1.18e+04
C(county)[T.138] -7939.1789 8029.739 -0.989 0.323 -2.37e+04 7802.643
C(county)[T.139] -692.3178 1.02e+04 -0.068 0.946 -2.08e+04 1.94e+04
C(county)[T.140] -1.651e+04 1.37e+04 -1.202 0.229 -4.34e+04 1.04e+04
C(county)[T.141] -8251.4595 1.22e+04 -0.676 0.499 -3.22e+04 1.57e+04
C(county)[T.142] -1373.8309 2.13e+04 -0.065 0.948 -4.3e+04 4.03e+04
C(county)[T.143] -5545.8845 5862.828 -0.946 0.344 -1.7e+04 5947.838
C(county)[T.144] -4.741e+04 2.12e+04 -2.232 0.026 -8.91e+04 -5771.921
C(county)[T.145] -9556.0440 3.47e+04 -0.276 0.783 -7.75e+04 5.84e+04
C(county)[T.146] -1.903e+04 1.59e+04 -1.196 0.232 -5.02e+04 1.22e+04
C(county)[T.147] -2.8e+04 1.52e+04 -1.848 0.065 -5.77e+04 1707.239
C(county)[T.148] -2.012e+04 1.76e+04 -1.146 0.252 -5.45e+04 1.43e+04
C(county)[T.149] -1.012e+04 4824.670 -2.097 0.036 -1.96e+04 -657.758
C(county)[T.150] -2.052e+04 3.35e+04 -0.612 0.541 -8.63e+04 4.52e+04
C(county)[T.151] -2.235e+04 1.13e+04 -1.974 0.048 -4.45e+04 -157.867
C(county)[T.152] -1.249e+04 7332.914 -1.703 0.089 -2.69e+04 1887.273
C(county)[T.153] -1.074e+04 2.99e+04 -0.360 0.719 -6.93e+04 4.78e+04
C(county)[T.155] -1.174e+04 1.32e+04 -0.892 0.373 -3.75e+04 1.41e+04
C(county)[T.157] -2.379e+04 1.34e+04 -1.777 0.076 -5e+04 2454.505
C(county)[T.158] -3.57e+04 1.24e+04 -2.878 0.004 -6e+04 -1.14e+04
C(county)[T.159] -3961.2072 2.67e+04 -0.149 0.882 -5.62e+04 4.83e+04
C(county)[T.160] -1.811e+04 8356.451 -2.167 0.030 -3.45e+04 -1729.232
C(county)[T.161] 6091.1910 8574.294 0.710 0.477 -1.07e+04 2.29e+04
C(county)[T.162] 9087.0830 1.55e+04 0.586 0.558 -2.13e+04 3.95e+04
C(county)[T.163] -2.133e+04 2.02e+04 -1.054 0.292 -6.1e+04 1.84e+04
C(county)[T.164] -2.283e+04 1.51e+04 -1.511 0.131 -5.24e+04 6795.581
C(county)[T.165] -2.21e+04 2.13e+04 -1.037 0.300 -6.39e+04 1.97e+04
C(county)[T.166] -1.551e+04 9441.670 -1.642 0.101 -3.4e+04 3002.990
C(county)[T.167] -3.947e+04 3.24e+04 -1.218 0.223 -1.03e+05 2.41e+04
C(county)[T.168] 1.007e+04 7103.076 1.417 0.156 -3857.374 2.4e+04
C(county)[T.169] 1.283e+04 1.14e+04 1.121 0.262 -9600.934 3.53e+04
C(county)[T.170] -1.615e+04 1.4e+04 -1.150 0.250 -4.37e+04 1.14e+04
C(county)[T.171] 2.365e+04 7177.844 3.295 0.001 9576.024 3.77e+04
C(county)[T.172] -3.626e+04 3.24e+04 -1.119 0.263 -9.98e+04 2.73e+04
C(county)[T.173] -2.629e+04 2.38e+04 -1.106 0.269 -7.29e+04 2.03e+04
C(county)[T.174] -2.314e+04 1.59e+04 -1.458 0.145 -5.43e+04 7983.357
C(county)[T.176] -8331.5679 1.76e+04 -0.472 0.637 -4.29e+04 2.62e+04
C(county)[T.177] -7787.9729 6943.898 -1.122 0.262 -2.14e+04 5825.123
C(county)[T.178] -8426.9047 9527.061 -0.885 0.376 -2.71e+04 1.03e+04
C(county)[T.179] 5061.4958 9108.387 0.556 0.578 -1.28e+04 2.29e+04
C(county)[T.180] -9156.0300 7253.999 -1.262 0.207 -2.34e+04 5065.000
C(county)[T.181] -6042.8525 1.16e+04 -0.522 0.602 -2.87e+04 1.67e+04
C(county)[T.182] -1.6e+04 8180.157 -1.956 0.051 -3.2e+04 40.230
C(county)[T.183] 1688.4654 4051.315 0.417 0.677 -6253.894 9630.825
C(county)[T.184] -1.566e+04 3739.789 -4.189 0.000 -2.3e+04 -8333.281
C(county)[T.185] -1.158e+04 1.5e+04 -0.769 0.442 -4.11e+04 1.79e+04
C(county)[T.186] -1.279e+04 1.07e+04 -1.198 0.231 -3.37e+04 8134.088
C(county)[T.187] -6652.3137 8781.097 -0.758 0.449 -2.39e+04 1.06e+04
C(county)[T.188] -1.083e+04 3548.268 -3.053 0.002 -1.78e+04 -3878.254
C(county)[T.189] -2.122e+04 5407.161 -3.925 0.000 -3.18e+04 -1.06e+04
C(county)[T.190] -2.661e+04 1.48e+04 -1.803 0.071 -5.55e+04 2321.109
C(county)[T.191] -3036.3270 1.92e+04 -0.158 0.874 -4.06e+04 3.45e+04
C(county)[T.192] 6940.9690 9281.647 0.748 0.455 -1.13e+04 2.51e+04
C(county)[T.193] -9444.8116 1.05e+04 -0.902 0.367 -3e+04 1.11e+04
C(county)[T.194] 2498.3312 1.35e+04 0.185 0.853 -2.4e+04 2.9e+04
C(county)[T.195] -2.03e+04 9595.165 -2.116 0.034 -3.91e+04 -1493.798
C(county)[T.196] -2753.6229 6585.461 -0.418 0.676 -1.57e+04 1.02e+04
C(county)[T.197] 1.33e+04 5757.604 2.311 0.021 2015.983 2.46e+04
C(county)[T.198] 7328.9890 7484.437 0.979 0.328 -7343.801 2.2e+04
C(county)[T.199] -2.136e+04 8594.063 -2.485 0.013 -3.82e+04 -4510.853
C(county)[T.200] -7508.9207 3016.843 -2.489 0.013 -1.34e+04 -1594.581
C(county)[T.201] -2.737e+04 1.51e+04 -1.811 0.070 -5.7e+04 2251.151
C(county)[T.202] -1237.8204 2.25e+04 -0.055 0.956 -4.54e+04 4.29e+04
C(county)[T.203] -6666.3212 1.53e+04 -0.435 0.664 -3.67e+04 2.34e+04
C(county)[T.204] -1.464e+04 9850.437 -1.486 0.137 -3.39e+04 4674.099
C(county)[T.205] -1.706e+04 6442.944 -2.648 0.008 -2.97e+04 -4427.328
C(county)[T.206] -3.733e+04 1.51e+04 -2.465 0.014 -6.7e+04 -7640.473
C(county)[T.209] 6087.6907 1.35e+04 0.452 0.651 -2.03e+04 3.25e+04
C(county)[T.210] -2.578e+04 2.65e+04 -0.973 0.330 -7.77e+04 2.61e+04
C(county)[T.211] -1093.8388 7382.081 -0.148 0.882 -1.56e+04 1.34e+04
C(county)[T.213] -2.153e+04 2.35e+04 -0.916 0.360 -6.76e+04 2.45e+04
C(county)[T.214] -2.351e+04 1.27e+04 -1.858 0.063 -4.83e+04 1299.507
C(county)[T.215] -1.66e+04 1.12e+04 -1.488 0.137 -3.85e+04 5271.613
C(county)[T.216] -9052.8944 5337.794 -1.696 0.090 -1.95e+04 1411.531
C(county)[T.217] -2.914e+04 9643.602 -3.021 0.003 -4.8e+04 -1.02e+04
C(county)[T.218] -1.642e+04 1.66e+04 -0.987 0.324 -4.9e+04 1.62e+04
C(county)[T.219] -482.9429 8190.878 -0.059 0.953 -1.65e+04 1.56e+04
C(county)[T.220] 1.038e+04 1.11e+04 0.934 0.350 -1.14e+04 3.22e+04
C(county)[T.221] 7271.3058 1.15e+04 0.634 0.526 -1.52e+04 2.97e+04
C(county)[T.222] -1.622e+04 1.02e+04 -1.588 0.112 -3.63e+04 3804.653
C(county)[T.223] -1.878e+04 5199.813 -3.612 0.000 -2.9e+04 -8586.505
C(county)[T.224] -9192.1466 1.45e+04 -0.636 0.525 -3.75e+04 1.92e+04
C(county)[T.225] 2641.2389 1.98e+04 0.134 0.894 -3.61e+04 4.14e+04
C(county)[T.226] -8830.2891 6639.675 -1.330 0.184 -2.18e+04 4186.397
C(county)[T.227] -2.143e+04 1.09e+04 -1.970 0.049 -4.28e+04 -102.683
C(county)[T.228] 3.233e+04 1.71e+04 1.894 0.058 -1135.703 6.58e+04
C(county)[T.229] -1.428e+04 1.91e+04 -0.748 0.454 -5.17e+04 2.31e+04
C(county)[T.230] -1.28e+04 6264.355 -2.043 0.041 -2.51e+04 -517.493
C(county)[T.231] -8271.2748 7245.759 -1.142 0.254 -2.25e+04 5933.602
C(county)[T.232] -2.299e+04 8519.159 -2.699 0.007 -3.97e+04 -6289.960
C(county)[T.233] -1.285e+04 8491.101 -1.513 0.130 -2.95e+04 3796.614
C(county)[T.234] -1.735e+04 3.1e+04 -0.559 0.576 -7.82e+04 4.35e+04
C(county)[T.235] -1.766e+04 3.16e+04 -0.559 0.576 -7.96e+04 4.43e+04
C(county)[T.236] -1.217e+04 8349.230 -1.457 0.145 -2.85e+04 4200.746
C(county)[T.237] -1.141e+04 4505.618 -2.533 0.011 -2.02e+04 -2581.176
C(county)[T.238] -1.613e+04 7139.990 -2.258 0.024 -3.01e+04 -2127.974
C(county)[T.239] -2.712e+04 2.14e+04 -1.266 0.206 -6.91e+04 1.49e+04
C(county)[T.240] 6490.6376 9154.156 0.709 0.478 -1.15e+04 2.44e+04
C(county)[T.241] -3359.7031 6189.659 -0.543 0.587 -1.55e+04 8774.753
C(county)[T.242] -1.613e+04 1.27e+04 -1.272 0.203 -4.1e+04 8724.775
C(county)[T.243] -1.788e+04 1.39e+04 -1.286 0.199 -4.51e+04 9383.449
C(county)[T.244] -3.071e+04 1.43e+04 -2.150 0.032 -5.87e+04 -2703.321
C(county)[T.245] -8153.4624 1.35e+04 -0.603 0.547 -3.47e+04 1.84e+04
C(county)[T.246] 5281.9970 4036.138 1.309 0.191 -2630.610 1.32e+04
C(county)[T.247] -9765.4698 5639.622 -1.732 0.083 -2.08e+04 1290.671
C(county)[T.248] 143.0204 8692.204 0.016 0.987 -1.69e+04 1.72e+04
C(county)[T.249] -1.476e+04 3.35e+04 -0.440 0.660 -8.05e+04 5.1e+04
C(county)[T.250] 1.616e+04 1.05e+04 1.541 0.123 -4398.223 3.67e+04
C(county)[T.251] -1.651e+04 9569.822 -1.726 0.084 -3.53e+04 2247.701
C(county)[T.252] -1.57e+04 2.01e+04 -0.781 0.435 -5.51e+04 2.37e+04
C(county)[T.253] -1.801e+04 1.58e+04 -1.141 0.254 -4.89e+04 1.29e+04
C(county)[T.254] -5819.5313 1.24e+04 -0.470 0.638 -3.01e+04 1.85e+04
C(county)[T.256] -9352.6145 1.31e+04 -0.712 0.476 -3.51e+04 1.64e+04
C(county)[T.257] -1.474e+04 4650.454 -3.170 0.002 -2.39e+04 -5626.575
C(county)[T.258] -2.95e+04 1.86e+04 -1.587 0.113 -6.6e+04 6945.417
C(county)[T.259] 5.662e+04 1.78e+04 3.176 0.002 2.17e+04 9.16e+04
C(county)[T.260] -2.075e+04 1.42e+04 -1.461 0.144 -4.86e+04 7093.704
C(county)[T.261] -1.214e+04 9508.704 -1.277 0.202 -3.08e+04 6499.043
C(county)[T.262] -1.499e+04 3.16e+04 -0.474 0.635 -7.69e+04 4.7e+04
C(county)[T.263] 1.094e+04 1.25e+04 0.872 0.383 -1.36e+04 3.55e+04
C(county)[T.264] -2479.7477 1.18e+04 -0.210 0.834 -2.57e+04 2.07e+04
C(county)[T.265] -2.203e+04 1.05e+04 -2.105 0.035 -4.25e+04 -1517.160
C(county)[T.266] 6210.2882 3.35e+04 0.185 0.853 -5.95e+04 7.19e+04
C(county)[T.267] -1.138e+04 2.33e+04 -0.488 0.626 -5.71e+04 3.44e+04
C(county)[T.268] -2.097e+04 1.51e+04 -1.389 0.165 -5.06e+04 8629.225
C(county)[T.269] -1.836e+04 7924.529 -2.317 0.021 -3.39e+04 -2824.110
C(county)[T.270] -1.361e+04 1.72e+04 -0.789 0.430 -4.74e+04 2.02e+04
C(county)[T.271] -1.204e+04 1.55e+04 -0.777 0.437 -4.24e+04 1.84e+04
C(county)[T.272] -1.485e+04 8880.399 -1.672 0.095 -3.23e+04 2561.281
C(county)[T.273] -1.149e+04 2.63e+04 -0.438 0.662 -6.3e+04 4e+04
C(county)[T.274] -9891.3781 4694.196 -2.107 0.035 -1.91e+04 -688.687
C(county)[T.275] -5173.7930 1.15e+04 -0.449 0.653 -2.77e+04 1.74e+04
C(county)[T.276] -1.966e+04 1.66e+04 -1.182 0.237 -5.22e+04 1.29e+04
C(county)[T.277] -3.402e+04 2.86e+04 -1.188 0.235 -9.01e+04 2.21e+04
C(county)[T.278] 2322.2439 1.9e+04 0.122 0.903 -3.49e+04 3.95e+04
C(county)[T.279] -1.119e+04 1.84e+04 -0.609 0.543 -4.72e+04 2.48e+04
C(county)[T.280] -1.345e+04 6306.646 -2.133 0.033 -2.58e+04 -1085.930
C(county)[T.281] -8403.4354 2.99e+04 -0.281 0.778 -6.69e+04 5.01e+04
C(county)[T.282] -3681.0429 1.2e+04 -0.307 0.759 -2.72e+04 1.99e+04
C(county)[T.283] -5339.8789 9936.127 -0.537 0.591 -2.48e+04 1.41e+04
C(county)[T.284] -1.452e+04 8179.287 -1.775 0.076 -3.06e+04 1519.240
C(county)[T.285] -3.43e+04 2.13e+04 -1.613 0.107 -7.6e+04 7382.687
C(county)[T.286] 7684.0660 1.16e+04 0.664 0.507 -1.5e+04 3.04e+04
C(county)[T.287] -1.775e+04 5134.756 -3.456 0.001 -2.78e+04 -7680.041
C(county)[T.288] 2038.4699 1.76e+04 0.116 0.908 -3.26e+04 3.66e+04
C(county)[T.289] -8041.0146 1.11e+04 -0.726 0.468 -2.97e+04 1.37e+04
C(county)[T.290] -1.581e+04 7099.249 -2.227 0.026 -2.97e+04 -1889.232
C(county)[T.291] -7663.1814 1.35e+04 -0.567 0.571 -3.42e+04 1.88e+04
C(county)[T.292] -1.567e+04 7319.153 -2.141 0.032 -3e+04 -1323.652
C(county)[T.293] -5825.2555 1.34e+04 -0.433 0.665 -3.22e+04 2.05e+04
C(county)[T.294] 1.947e+04 9840.528 1.978 0.048 177.105 3.88e+04
C(county)[T.295] -1.75e+04 1.76e+04 -0.996 0.319 -5.2e+04 1.7e+04
C(county)[T.296] -1096.1249 1.8e+04 -0.061 0.951 -3.64e+04 3.42e+04
C(county)[T.297] -1.141e+04 4502.234 -2.535 0.011 -2.02e+04 -2586.102
C(county)[T.298] -8732.4700 2.87e+04 -0.304 0.761 -6.5e+04 4.75e+04
C(county)[T.299] -1.74e+04 2.99e+04 -0.583 0.560 -7.59e+04 4.11e+04
C(county)[T.300] -1.986e+04 1.04e+04 -1.907 0.057 -4.03e+04 560.482
C(county)[T.301] -4.329e+04 3.24e+04 -1.336 0.182 -1.07e+05 2.03e+04
C(county)[T.302] 4581.0131 7593.047 0.603 0.546 -1.03e+04 1.95e+04
C(county)[T.303] -1.484e+04 7473.001 -1.985 0.047 -2.95e+04 -186.892
C(county)[T.304] -2.222e+04 8772.347 -2.533 0.011 -3.94e+04 -5022.926
C(county)[T.305] -2514.2672 7217.330 -0.348 0.728 -1.67e+04 1.16e+04
C(county)[T.306] -2.336e+04 1.99e+04 -1.174 0.240 -6.24e+04 1.57e+04
C(county)[T.307] -1.116e+04 9796.527 -1.139 0.255 -3.04e+04 8045.695
C(county)[T.308] 8.973e+04 1.66e+04 5.411 0.000 5.72e+04 1.22e+05
C(county)[T.309] -1.179e+04 5120.504 -2.302 0.021 -2.18e+04 -1747.726
C(county)[T.310] -1.268e+04 3.35e+04 -0.378 0.705 -7.84e+04 5.31e+04
C(county)[T.311] -1.624e+04 1.01e+04 -1.611 0.107 -3.6e+04 3524.476
C(county)[T.312] -9191.1488 9873.286 -0.931 0.352 -2.85e+04 1.02e+04
C(county)[T.313] -1.765e+04 4.28e+04 -0.412 0.680 -1.02e+05 6.63e+04
C(county)[T.314] -1.793e+04 7170.454 -2.501 0.012 -3.2e+04 -3875.671
C(county)[T.315] 879.4379 1.04e+04 0.084 0.933 -1.96e+04 2.13e+04
C(county)[T.316] -5807.9640 6758.678 -0.859 0.390 -1.91e+04 7442.020
C(county)[T.317] 1.794e+04 9335.924 1.922 0.055 -362.354 3.62e+04
C(county)[T.318] -4980.2142 8582.095 -0.580 0.562 -2.18e+04 1.18e+04
C(county)[T.319] 1051.3237 1.42e+04 0.074 0.941 -2.68e+04 2.89e+04
C(county)[T.320] -1.431e+04 1.5e+04 -0.951 0.341 -4.38e+04 1.52e+04
C(county)[T.321] -1.684e+04 6968.208 -2.416 0.016 -3.05e+04 -3176.799
C(county)[T.322] -406.6903 6523.846 -0.062 0.950 -1.32e+04 1.24e+04
C(county)[T.323] -1.853e+04 5726.625 -3.236 0.001 -2.98e+04 -7302.079
C(county)[T.324] -2501.1256 7866.013 -0.318 0.751 -1.79e+04 1.29e+04
C(county)[T.325] -1.951e+04 2.24e+04 -0.873 0.383 -6.33e+04 2.43e+04
has_college 1.325e+04 742.630 17.843 0.000 1.18e+04 1.47e+04
female -8657.2875 609.011 -14.215 0.000 -9851.217 -7463.358
Omnibus: 2414.039 Durbin-Watson: 1.981
Prob(Omnibus): 0.000 Jarque-Bera (JB): 22245.534
Skew: 1.945 Prob(JB): 0.00
Kurtosis: 12.242 Cond. No. 198.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Exercise 15

If you stopped matching after the second iteration (Iteration 1) back in Exercise 10, you may be wondering if that was a good choice! Let’s check by restricting our attention to ONLY exact matches (iteration = 0). Run that match.

[37]:
model2 = dame_flame.matching.DAME(
    repeats=False, verbose=3, want_pe=True, early_stop_iterations=0
)
model2.fit(
    for_matching,
    treatment_column_name="has_college",
    outcome_column_name="annual_earnings",
)
result2 = model2.predict(for_matching)
Completed iteration 0 of matching
        Number of matched groups formed in total:  370
        Unmatched treated units:  644 out of a total of  1150 treated units
        Unmatched control units:  3187 out of a total of  4365 control units
        Number of matches made this iteration:  1684
        Number of matches made so far:  1684
        Covariates dropped so far:  set()
        Predictive error of covariate set used to match:  1199312680.0957854
1684 units matched. We stopped after iteration 0
[38]:
matched_data2 = get_dataframe(model2, result2)

Exercise 16

Now use a weighted linear regression on your matched data to regress annual earnings on just having a college education. Is that different from what you had when you allowed more low-quality matches?

[39]:
smf.wls(
    "annual_earnings ~ has_college", matched_data2, weights=matched_data2["weights"]
).fit().summary()
[39]:
WLS Regression Results
Dep. Variable: annual_earnings R-squared: 0.049
Model: WLS Adj. R-squared: 0.048
Method: Least Squares F-statistic: 86.65
Date: Sat, 04 Mar 2023 Prob (F-statistic): 3.92e-20
Time: 13:59:42 Log-Likelihood: -19512.
No. Observations: 1684 AIC: 3.903e+04
Df Residuals: 1682 BIC: 3.904e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3.914e+04 664.386 58.907 0.000 3.78e+04 4.04e+04
has_college 1.128e+04 1212.039 9.308 0.000 8904.805 1.37e+04
Omnibus: 855.250 Durbin-Watson: 2.037
Prob(Omnibus): 0.000 Jarque-Bera (JB): 6653.000
Skew: 2.256 Prob(JB): 0.00
Kurtosis: 11.629 Cond. No. 2.42


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Other Forms of Matching

OK, hopefully this gives you a taste of matching! There are, of course, many other permutations to be aware of though.

  • Matching with replacement. In this exercise, we set repeats=False, so each observation could only end up in our final dataset once. However, if we use repeats=True, an untreated observation that is the closest match for multiple treated observations may get put in the dataset multiple times. We can still use this dataset in almost the same way, though, except we have to make use of weights so that if an observation appears, say, twice, each appearance gets 1/2 the weight of an observation appearing only once.

  • Matching with continuous variables: DAME is used for exact matching, but if you have lots of continuous variables, you can also match on those. In fact, the Almost Exact Matching Lab also has a library called MALTS that will do matching with continuous variables. That package does something like Mahalanobis Distance matching, but unlike Mahalanobis, which calculates the distance between observations in terms of the difference in all the matching variables normalized by each matching variable’s standard deviation, MALTS does something much more clever. (Here’s the paper describing the technique if you want all the details). Basically, it figures out how well each matching variable predicts our outcome \(Y\), then weights the different variables by their predictive power instead of just normalizing by something arbitrary like their standard deviation. As a result, final matches will prioritize matching more closely on variables that are outcome-relevant. In addition, when it sees a categorical variable, it recognizes that and only pairs observations when they are an exact match on that categorical variable.

  • If your dataset is huge, use FLAME: this dataset is small, but if you have lots of observations and lots of matching variables, the computational complexity of this task explodes, so the AEML created FLAME, which works with millions of observations at only a small cost to match quality. (A minimal usage sketch follows below.)
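As a minimal sketch, dame_flame exposes FLAME with the same fit/predict interface we used for DAME above. As in the earlier syntax example, my_data, "treat", and "outcome" are hypothetical placeholder names:

import dame_flame

# FLAME in dame_flame mirrors the DAME interface used earlier in this exercise.
model = dame_flame.matching.FLAME(repeats=False, verbose=3)
model.fit(
    my_data,
    treatment_column_name="treat",
    outcome_column_name="outcome",
)
result = model.predict(my_data)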

Absolutely positively need the solutions?

Don’t use this link until you’ve really, really spent time struggling with your code! Doing so only results in you cheating yourself.

Link