{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Matching Exercise \n", "\n", "In this exercise, we'll be evaluating how getting a college degree impacts earnings in the US using matching.\n", "\n", "## Matching Packages: Python v. R\n", "\n", "Just as the best tools for machine learning tend to be in Python since they're developed by CS people (who prefer Python), most of the best tools for causal inference are implemented in R since innovation in causal inference tends to be lead by social scientists using R. As a result, the most well developed matching package is called [MatchIt](https://kosukeimai.github.io/MatchIt/index.html), and is only available in R (though you can always call it from Python using `rpy2`). \n", "\n", "In the last couple years, though, a group of computer scientists and statisticians here at Duke have made some great advancements in matching (especially the computational side of things), and they recently released a set of matching packages in both R and Python that we'll be using today. They have some great algorithms we'll use today, but be aware these packages aren't as mature, and aren't general purpose packages yet. So if you ever get deep into matching, be aware you will probably still want to make at least partial use of the R package [MatchIt](https://kosukeimai.github.io/MatchIt/index.html), as well as some other R packages for new innovative techniques (like [Matching Frontier estimation](https://projects.iq.harvard.edu/frontier/home)), or [Adaptive Hyper-Box Matching](https://almost-matching-exactly.github.io/AHB-R-package/). \n", "\n", "## Installing dame-flame. \n", "\n", "For this lesson, begin by installing `dame-flame` with `pip install dame-flame` (it's not on conda yet). \n", "\n", "[DAME](https://almost-matching-exactly.github.io/DAME-FLAME-Python-Package) is an algorithm that we can use for a version of coarse exact matching. The package only accepts a list of categorical variables, and then attempts to match pairs that match exactly on those variables. That means that if you want to match on, say, age, you have to break it up into categories (say, under 18, 18-29, 30-39, etc. etc.). \n", "\n", "(NOTE: As of 2024, their documentation site is weird: click the dropdowns next to headings to see the content, otherwise the documentation looks deserted)\n", "\n", "Of course, one cannot always find exact matches on all variables, so what DAME does is:\n", "\n", "1. Find all observations that match on *all* matching variables. \n", "2. Figure out which matching variable is least useful in predicting the outcome of interest $Y$ and drops that, then tries to match the remaining observations on the narrowed set of matching variables. \n", "3. This repeats until you run out of variables, all observations are matched, or you hit a stopping run (namely: quality of matches falls below a threshold). \n", "\n", "In addition, the lab has also created FLAME, which does the same thing, but employs some tricks to make it *massively* more computationally efficient, meaning it can be used on datasets with millions of observations (which most matching algorithms cannot). It's a little less accurate, but an amazing contribution never the less. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Setup\n", "\n", "To save you some time and let you focus on matching, I've *pre-cleaned* about one month worth of of data from the US Current Population Survey data we used for our [gender discrimination analysis](exercises/exercises_regression_incomeineq.ipynb). You can download the data [from here](https://github.com/nickeubank/MIDS_Data/blob/master/Current_Population_Survey/cps_for_matching.dta?raw=true\"), or read it directly with:\n", "\n", "```python \n", "cps = pd.read_stata(\n", " \"https://github.com/nickeubank/MIDS_Data/blob/master\"\n", " \"/Current_Population_Survey/cps_for_matching.dta?raw=true\"\n", ")\n", "```\n", "\n", "Load the data and quickly familiarize yourself with its contents. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting To Know Your Data\n", "\n", "Before you start matching, it is important to examine your data to ensure that matching is feasible (you have some overlap the the features of people in the treated and untreated groups), and also that there is a reason *to* match: either you're unsure about some of the functional forms at play, or your have some imbalance between the two groups. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 1\n", "\n", "Show the raw difference of `annual_earnings` between those with and without a college degree (`has_college`). Is the difference statistically significant?" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 2\n", "\n", "Next we can check for balance. Check the share of people in different racial groups who have college degrees. Are those differences statistically significant? (Remember how to check for difference in distributions of categorical variables).\n", "\n", "Race is coded as White Non-Hispanic (0), Black Non-Hispanic (1), Hispanic (2), Other (3).\n", "\n", "Does the data seem balanced?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 3\n", "\n", "One of the other advantages of matching is that even when you have balanced data, you don't have to go through the process of testing out different functional forms to see what fits the data base. \n", "\n", "In our last exercise, we looked at the relationship between gender and earnings \"controlling for age\", where we just put in age as a linear control. Plot a non-linear regression of `annual_earnings` on age (`PolyFit(order=3)` is fine.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Does the relationship look linear?\n", "\n", "Does this speak to why it's nice to not have to think about functional forms with matching as much?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Matching!\n", "\n", "Because DAME is an implementation of exact matching, we have to discretize all of our continuous variables. Thankfully, in this case we only have `age`, so this shouldn't be too hard!\n", "\n", "### Exercise 4\n", "\n", "Create a new variable that discretizes age into a single value for each decade of age. \n", "\n", "Because CPS only has employment data on people 18 or over, though, include people who are 18 or 19 with the 20 year olds so that group isn't too small, and if you see any other really small groups, please merge those too. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 5\n", "\n", "We also have to covert our string variables into numeric variables for DAME, so convert `county` and `class94` to a numeric vector of intergers. \n", "\n", "(Note: it's not clear whether `class94` belongs: if it reflects people choosing fields based on passion, it belongs; if people choose certain jobs because of their degrees, its not something we'd actually want in our regression. \n", "\n", "Hint: if you use `pd.Categorical` to convert you var to a categorical, you can pull the underlying integer codes with `.codes`. " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Let's Do Matching with DAME\n", "\n", "### Exercise 6\n", "\n", "First, drop all the variables you *don't* want in matching (e.g. your original `age` variable), and any observations for which `annual_earnings` is missing. \n", "\n", "You will probably also have to drop a column named `index`: DAME will try and match on ANY included variables, and so because there was a column called `index` in the data we imported, if we leave it in DAME will try (and obviously fail) to match on index. \n", "\n", "Also, it's best to reset your index, as `dame_flame` using index labels (e.g., the values in `df.index`) to identify matches. So you want to be sure those are unique." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 7\n", "\n", "The syntax of `dame_flame` is similar to the syntax of `sklearn`. If you start with a dataset called `my_data` with a `treat` variable with treatment assignment and an `outcome` variable for my outcome of interest ($Y$), the syntax to do basic matching would be:\n", "\n", "```python\n", "import dame_flame\n", "model = dame_flame.matching.DAME(\n", " repeats=False,\n", " verbose=3,\n", " want_pe=True,\n", " stop_unmatched_t=True,\n", ")\n", "model.fit(\n", " for_matching,\n", " treatment_column_name=\"has_college\",\n", " outcome_column_name=\"annual_earnings\",\n", ")\n", "result = model.predict(for_matching)\n", "```\n", "\n", "Where the arguments:\n", "\n", "- `repeats=False` says that I only want each observation to get matched once. We'll talk about what happens if we use `repeats=True` below. \n", "- `verbose=3` tells dame to report everything it's doing as it goes. \n", "- `want_pe` says \"please include the predictive error in your printout at each step\". This is a measure of match quality. \n", "- `stop_unmatched_t` says \"once you've matched all the treatment units, you can stop.\"\n", "\n", "So run DAME on your data!" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Interpreting DAME output\n", "\n", "The output you get from doing this *should* be reports from about 8 iterations of matching. In each iteration, you'll see a description of the number of matches made in the iteration, the number of treatment units still unmatched, and the number of control units unmatched. \n", "\n", "In the first iteration, the algorithm tries to match observations that match on *all* the variables in your data. 
, "\n", "This is why, in the first iteration, you see that the set of variables being dropped is an empty set (`Covariates dropped so far: set()`) — it *hasn't* dropped any variables:\n", "\n", "```\n", "Completed iteration 0 of matching\n", "\tNumber of matched groups formed in total: 370\n", "\tUnmatched treated units: 644 out of a total of 1150 treated units\n", "\tUnmatched control units: 3187 out of a total of 4365 control units\n", "\tNumber of matches made this iteration: 1684\n", "\tNumber of matches made so far: 1684\n", "\tCovariates dropped so far: set()\n", "\tPredictive error of covariate set used to match: 1199312680.0957854\n", "\n", "```\n", "\n", "(Note: depending on how you binned ages, you may get slightly different results than this.)\n", "\n", "But as we can see from this output, the algorithm matched 1,684 observations perfectly: it formed 370 matched groups, each containing at least one treated and one control unit, in which observations had *exactly* the same values of all the variables we included. But we also see we still have 644 *unmatched* treated units, so what do we do?" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The answer is that if we want to match more of our treated units, we have to try to match on a subset of our variables.\n", "\n", "But what variable should we drop? This is the secret sauce of DAME. DAME picks the variables to drop by trying to predict our outcome $Y$ using all our variables (by default using a ridge regression), then dropping the matching variable that contributes the least to that prediction. Since our goal in matching is to eliminate baseline differences ($E(Y_0|D=1) - E(Y_0|D=0)$), dropping the covariates least related to $Y$ makes sense.\n", "\n", "As a result, in the second iteration (called iteration 1, since the package uses 0-based indexing), we see that the variable it drops first is `county`, and it's subsequently able to make another 3,626 matches on the remaining variables!\n", "\n", "```\n", "Completed iteration 1 of matching\n", "\tNumber of matched groups formed in total: 494\n", "\tUnmatched treated units: 25 out of a total of 1150 treated units\n", "\tUnmatched control units: 180 out of a total of 4365 control units\n", "\tNumber of matches made this iteration: 3626\n", "\tNumber of matches made so far: 5310\n", "\tCovariates dropped so far: frozenset({'county'})\n", "\tPredictive error of covariate set used to match: 1199421883.1095908\n", "\n", "```\n", "\n", "And so DAME continues until it's matched all treated observations, and even then it keeps going to evaluate different covariates it might exclude." ] },
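{ "cell_type": "markdown", "metadata": {}, "source": [ "If you want to inspect the matches directly, the fitted model object carries a few useful attributes. A quick sketch (these are the same attributes the helper function below relies on; `units_per_group` holds the index labels of the observations in each matched group):\n", "\n", "```python\n", "# How many matched groups did we end up with?\n", "print(len(model.units_per_group))\n", "\n", "# Pull the actual observations in the first matched group.\n", "first_group = model.units_per_group[0]\n", "print(for_matching.loc[first_group])\n", "```\n" ] },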
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 8\n", "\n", "Congratulations! You just ran your first one-to-many matching! \n", "\n", "The next step is to think about which of the matches DAME generated are good enough for inclusion in our analysis. As you may recall, one of the choices you have to make as a researcher when doing matching is how \"good\" a match has to be in order to be included in your final data set. By default, DAME will keep dropping matching variables until it has been able to match all the treated observations or it runs out of variables. It will do this no matter how bad the matches become: if it ends up with a treated observation and a control observation that can only be matched on gender, it will match them on gender alone, even though we probably don't think that's a \"good\" match.\n", "\n", "The way to control this behavior is to tell DAME when to stop manually, using the `early_stop_iterations` argument.\n", "\n", "So when is a good time to stop? There's no objective or \"right\" answer to that question. It fundamentally comes down to a trade-off between bias (which gets higher as you allow more low-quality matches into your data) and variance (which goes down as you increase the number of matches you keep).\n", "\n", "But one way to start the process of picking a cut point is to examine how the quality of matches evolves over iterations. DAME keeps this information in `model.pe_each_iter`. This shows, for each iteration, the \"prediction error\" resulting from dropping the variables excluded at each step. This \"prediction error\" is the difference between the mean squared error of regressing $Y$ on *all* our matching variables (by default in a ridge regression) and the mean squared error using only the subset of variables being used for matching in a given iteration. By design, of course, this is always increasing. \n", "\n", "To see how this evolves, plot your `pe` values against iteration numbers. You can also see the `pe` values for each iteration reported in the output from when DAME ran above if you want to make sure you're lining up the errors with iterations correctly. \n", "\n", "Are there any points where the match quality seems to fall off dramatically? " ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 9\n", "\n", "Suppose we want to ensure we have at least 5,000 observations in our matched data—where might you cut off the data to get a sample of at least that size but before a big quality falloff?" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 10\n", "\n", "Re-run your matching, stopping at the point you picked above using `early_stop_iterations`." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Getting Back a Dataset\n", "\n", "OK, my one current complaint with DAME is that it doesn't just give you back a nice dataset of your matches for analysis. If we look at our results (the `result` we got back from `model.predict`), it's *almost* what we want, except it has dropped our treatment and outcome columns, and put a string `*` in any entry where a value *wasn't* used for matching:\n", "\n", "```\n", "  female  simplified_race  county  class94  discretized_age\n", "0    1.0              0.0    10.0      3.0              5.0\n", "1    0.0              2.0       *      3.0              3.0\n", "2    0.0              0.0     8.0      3.0              6.0\n", "3    0.0              0.0       *      1.0              4.0\n", "4    0.0              0.0    24.0      3.0              3.0\n", "```\n", "\n", "So for now (though I think this will get updated in the package), we'll have to do it ourselves! Just copy-paste this:\n", "\n", "```python\n", "import numpy as np\n", "\n", "\n", "def get_dataframe(model, result_of_fit):\n", "\n", "    # Get original data\n", "    better = model.input_data.loc[result_of_fit.index]\n", "    if not better.index.is_unique:\n", "        raise ValueError(\"Need index values in input data to be unique\")\n", "\n", "    # Get match groups for clustering\n", "    better[\"match_group\"] = np.nan\n", "    better[\"match_group_size\"] = np.nan\n", "    for idx, group in enumerate(model.units_per_group):\n", "        better.loc[group, \"match_group\"] = idx\n", "        better.loc[group, \"match_group_size\"] = len(group)\n", "\n", "    # Get weights. I THINK this is right?! At least for repeats=False?\n", "    t = model.treatment_column_name\n", "    better[\"t_in_group\"] = better.groupby(\"match_group\")[t].transform(\"sum\")\n", "\n", "    # Make weights\n", "    better[\"weights\"] = np.nan\n", "    better.loc[better[t] == 1, \"weights\"] = 1  # treated units get weight 1\n", "\n", "    # Controls start with weight equal to the ratio of treated\n", "    # to control observations in their matched group.\n", "    better.loc[better[t] == 0, \"weights\"] = better[\"t_in_group\"] / (\n", "        better[\"match_group_size\"] - better[\"t_in_group\"]\n", "    )\n", "\n", "    # Then re-normalize so control weights sum to the number of\n", "    # unique control observations.\n", "    control_weights = better[better[t] == 0][\"weights\"].sum()\n", "\n", "    num_control_obs = len(better[better[t] == 0].index.drop_duplicates())\n", "    renormalization = num_control_obs / control_weights\n", "    better.loc[better[t] == 0, \"weights\"] = (\n", "        better.loc[better[t] == 0, \"weights\"] * renormalization\n", "    )\n", "    assert better.weights.notnull().all()\n", "\n", "    better = better.drop([\"t_in_group\"], axis=\"columns\")\n", "\n", "    # Make sure right length and values!\n", "    assert len(result_of_fit) == len(better)\n", "    assert np.isclose(better.loc[better[t] == 0, \"weights\"].sum(), num_control_obs)\n", "\n", "    return better\n", "```\n" ] },
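{ "cell_type": "markdown", "metadata": {}, "source": [ "Usage is then just (a sketch, assuming the `model` and `result` names from the Exercise 7 code above):\n", "\n", "```python\n", "# model is the fitted DAME object; result is what model.predict() returned\n", "matched_data = get_dataframe(model, result)\n", "matched_data.head()\n", "```\n" ] },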
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Exercise 11\n", "\n", "Copy-paste that code and run it with your (fit) model and the `result` you got back from `model.predict`. Then we'll work with the output. You should get back a single dataframe with the same number of rows as the matched results returned by `model.predict`." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Check Your Matches and Analyze\n", "\n", "### Exercise 12\n", "\n", "We previously tested balance on `simplified_race`. Check that again. Are there still statistically significant differences in college education by `simplified_race`?\n", "\n", "Note that when you test for this, you'll need to take into account the `weights` column you got back from `get_dataframe`. What DAME does is not actually the 1-to-1 matching described in our readings — instead, it puts all the observations that match exactly into the same \"group\". (These groups are identified in the dataframe you got from `get_dataframe` by the column `match_group`, and the size of each group is in `match_group_size`.)\n", "\n", "So to analyze the data, you need to use the `wls` (weighted least squares) function in `statsmodels`. For example, if your data is called `matched_data`, you might run:\n", "\n", "```python\n", "smf.wls(\n", "    \"has_college ~ C(simplified_race)\", matched_data, weights=matched_data[\"weights\"]\n", ").fit().summary()\n", "```" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Exercise 13\n", "\n", "Now use a weighted least squares regression on your matched data to regress annual earnings on *just* having a college education. What is the apparent effect of a BA? How does that compare to our initial estimate using the raw CPS data (before matching)?" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 14\n", "\n", "Now include our other matching variables as controls (i.e., all the covariates you gave DAME to match on). Does the coefficient on `has_college` change? (A sketch of one possible specification follows below.)" ] },
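{ "cell_type": "markdown", "metadata": {}, "source": [ "For example, one possible specification for Exercise 14 (just a sketch; swap in whatever covariates you actually matched on):\n", "\n", "```python\n", "# Weighted least squares with the matching covariates as controls\n", "smf.wls(\n", "    \"annual_earnings ~ has_college + female + C(simplified_race)\"\n", "    \" + C(discretized_age) + C(class94) + C(county)\",\n", "    matched_data,\n", "    weights=matched_data[\"weights\"],\n", ").fit().summary()\n", "```\n" ] },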
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 15\n", "\n", "If you stopped matching after the second iteration (Iteration 1) back in Exercise 10, you may be wondering if that was a good choice! Let's check by restricting our attention to ONLY exact matches (i.e., stopping after iteration 0). Run that match." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Exercise 16\n", "\n", "Now use a weighted linear regression on your matched data to regress annual earnings on *just* having a college education. Is that different from what you had when you allowed more low-quality matches?" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Other Forms of Matching\n", "\n", "OK, hopefully this gives you a taste of matching! There are, of course, *many* other permutations to be aware of, though. \n", "\n", "- Matching with replacement. In this exercise, we set `repeats=False`, so each observation could only end up in our final dataset once. However, if we use `repeats=True` and an untreated observation is the closest observation to multiple treated observations, it may get put in the dataset multiple times. We can still use this dataset in *almost* the same way, except we have to make use of weights, so that if an observation appears, say, twice, each appearance gets a weight that's 1/2 the weight of an observation appearing only once. \n", "- Matching with continuous variables: DAME is used for exact matching, but if you have lots of continuous variables, you can also match on those directly. In fact, the Almost Exact Matching Lab also has a library called [MALTS](https://almost-matching-exactly.github.io/MALTS/) that will do matching with continuous variables. That package does something *like* Mahalanobis Distance matching, but unlike Mahalanobis, which calculates the distance between observations in terms of the difference in all the matching variables normalized by each matching variable's standard deviation, MALTS does something much more clever. (Here's [the paper](https://arxiv.org/abs/1811.07415) describing the technique if you want all the details.) Basically, it figures out how well each matching variable predicts our outcome $Y$, then weights the different variables by their predictive power instead of just normalizing by something arbitrary like their standard deviation. As a result, final matches will prioritize matching more closely on variables that are outcome-relevant. In addition, when it sees a categorical variable, it recognizes that and only pairs observations that are an exact match on that categorical variable. \n", "- If your dataset is huge, use `FLAME`: this dataset is small, but if you have lots of observations and lots of matching variables, the computational complexity of this task explodes, so the Almost Exact Matching Lab created FLAME, which works with millions of observations at only a small cost to match quality. " ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.8" }, "vscode": { "interpreter": { "hash": "718fed28bf9f8c7851519acf2fb923cd655120b36de3b67253eeb0428bd33d2d" } } }, "nbformat": 4, "nbformat_minor": 4 }