# Propensity Score Matching Exercise

To practice propensity score matching, let's estimate how college education affects earnings in the US, using data from the US Current Population Survey (CPS) on US wages in 2019.

```
# Load survey
import pandas as pd
cps = pd.read_stata('https://github.com/nickeubank/MIDS_Data/blob/master/Current_Population_Survey/morg18.dta?raw=true')
# Limit to people currently employed and working full time.
cps = cps[cps.lfsr94 == 'Employed-At Work']
cps = cps[cps.uhourse >= 35]
# Convert earnings per hour from cents to dollars.
cps['earnhre_dollars'] = cps['earnhre'] / 100
cps['annual_earnings'] = cps['earnhre_dollars'] * cps['uhourse'] * 52
# Create gender and college education indicators.
cps['female'] = (cps.sex == 2).astype('int')
# Note: in the CPS coding, grade92 == 43 is a bachelor's degree, so `> 43`
# flags only postgraduate degrees; use `>= 43` to include bachelor's degrees.
cps['has_college_educ'] = (cps.grade92 > 43).astype('int')
cps.describe()
```

| | county | smsastat | age | sex | grade92 | race | ethnic | marital | uhourse | earnhre | ... | gedhigr | yrcoll | grprof | gr6cor | ms123 | occ2012 | earnhre_dollars | annual_earnings | female | has_college_educ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 133814.000000 | 132638.000000 | 133814.000000 | 133814.000000 | 133814.000000 | 133814.000000 | 18480.000000 | 133814.000000 | 133814.000000 | 65755.000000 | ... | 3107.000000 | 36240.000000 | 0.0 | 0.0 | 0.0 | 133814.000000 | 65755.000000 | 65755.000000 | 133814.000000 | 133814.000000 |
| mean | 25.735020 | 1.173932 | 43.335458 | 1.440320 | 41.059680 | 1.434274 | 2.581872 | 3.253359 | 42.596515 | 1940.998783 | ... | 6.640489 | 2.853256 | NaN | NaN | NaN | 3989.409128 | 19.409988 | 41757.890924 | 0.440320 | 0.148295 |
| std | 61.578816 | 0.379052 | 13.335412 | 0.496427 | 2.512128 | 1.270713 | 2.417939 | 2.676927 | 7.002970 | 1008.707762 | ... | 1.321649 | 0.963869 | NaN | NaN | NaN | 2708.186730 | 10.087078 | 23164.092147 | 0.496427 | 0.355394 |
| min | 0.000000 | 1.000000 | 16.000000 | 1.000000 | 31.000000 | 1.000000 | 1.000000 | 1.000000 | 35.000000 | 17.000000 | ... | 1.000000 | 1.000000 | NaN | NaN | NaN | 10.000000 | 0.170000 | 397.800000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 1.000000 | 32.000000 | 1.000000 | 39.000000 | 1.000000 | 1.000000 | 1.000000 | 40.000000 | 1300.000000 | ... | 6.000000 | 2.000000 | NaN | NaN | NaN | 1550.000000 | 13.000000 | 27040.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 1.000000 | 43.000000 | 1.000000 | 41.000000 | 1.000000 | 1.000000 | 1.000000 | 40.000000 | 1675.000000 | ... | 7.000000 | 3.000000 | NaN | NaN | NaN | 4050.000000 | 16.750000 | 35360.000000 | 0.000000 | 0.000000 |
| 75% | 29.000000 | 1.000000 | 54.000000 | 2.000000 | 43.000000 | 1.000000 | 4.000000 | 7.000000 | 40.000000 | 2300.000000 | ... | 8.000000 | 3.000000 | NaN | NaN | NaN | 5700.000000 | 23.000000 | 49920.000000 | 1.000000 | 0.000000 |
| max | 810.000000 | 2.000000 | 85.000000 | 2.000000 | 46.000000 | 26.000000 | 8.000000 | 7.000000 | 99.000000 | 9999.000000 | ... | 8.000000 | 5.000000 | NaN | NaN | NaN | 9750.000000 | 99.990000 | 361920.000000 | 1.000000 | 1.000000 |

8 rows × 24 columns

## Exercise 1

How many observations have a college degree, and how many do not?

## Exercise 2

Show the raw difference in `earnhre_dollars` between the group with a college degree and the group without one.
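As a rough illustration of the group counts and the raw difference in means, here is a minimal sketch on a made-up five-row frame. The `toy` data are invented for illustration only; the column names mirror those created in the loading cell, and the numbers say nothing about the CPS.

```python
import pandas as pd

# Invented toy data; NOT the CPS sample.
toy = pd.DataFrame({
    "has_college_educ": [1, 1, 0, 0, 0],
    "earnhre_dollars": [30.0, 26.0, 18.0, 20.0, 16.0],
})

# Number of observations in each treatment group.
counts = toy["has_college_educ"].value_counts()

# Mean hourly earnings per group, then treated minus control.
group_means = toy.groupby("has_college_educ")["earnhre_dollars"].mean()
raw_diff = group_means.loc[1] - group_means.loc[0]

print(counts)
print(raw_diff)
```

The same two calls should carry over to `cps`, though rows with missing `earnhre_dollars` are silently skipped by `mean()`, so the group sizes behind the means can be smaller than the raw counts.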

## Exercise 3

Select covariates that may be correlated with both the treatment and the dependent variable, then use these covariates to fit a logistic regression model and obtain propensity scores.
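To make the idea of a propensity score concrete, the sketch below fits a logistic regression by Newton-Raphson on synthetic data and uses the fitted probabilities as scores. Everything here is an assumption for illustration: the simulated `age` and `female` covariates, the helper name `fit_logit`, and the choice to hand-roll the fit in NumPy rather than use a statistics library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic covariates (hypothetical stand-ins for real CPS variables).
age = rng.uniform(20, 65, n)
female = rng.integers(0, 2, n)

# Simulated treatment whose probability depends on the covariates.
true_logit = -2.0 + 0.03 * age + 0.5 * female
treated = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))


def fit_logit(X, y, n_iter=25):
    """Logistic regression via Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (y - p)                        # score vector
        hess = X.T @ (X * (p * (1 - p))[:, None])   # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta


# Design matrix: intercept, standardized age, female indicator.
X = np.column_stack([np.ones(n), (age - age.mean()) / age.std(), female])
beta = fit_logit(X, treated)

# Propensity score = fitted probability of treatment given covariates.
pscore = 1 / (1 + np.exp(-X @ beta))
print(beta)
print(pscore.min(), pscore.max())
```

In practice a library routine such as a logit model from `statsmodels` or `scikit-learn` would usually replace `fit_logit`; the hand-written version only shows what is being estimated.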

## Exercise 4

Evaluate the common support of the treated and control groups.
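One simple way to look at common support is to compare the range and the binned distribution of propensity scores in the two groups. The sketch below assumes scores are already available; here they are drawn from two Beta distributions purely for illustration, and the min/max overlap rule is only one of several possible definitions of common support.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative propensity scores (assumed, not estimated from the CPS).
ps_treated = rng.beta(3, 2, 500)
ps_control = rng.beta(2, 3, 1500)

# Overlap region: scores that fall inside the range of BOTH groups.
low = max(ps_treated.min(), ps_control.min())
high = min(ps_treated.max(), ps_control.max())

in_support_t = (ps_treated >= low) & (ps_treated <= high)
in_support_c = (ps_control >= low) & (ps_control <= high)

# Counts per score bin on shared bin edges, a text version of two histograms.
bins = np.linspace(0, 1, 11)
hist_t, _ = np.histogram(ps_treated, bins=bins)
hist_c, _ = np.histogram(ps_control, bins=bins)

print(f"common support: [{low:.3f}, {high:.3f}]")
print("treated kept:", in_support_t.sum(), "of", len(ps_treated))
print("control kept:", in_support_c.sum(), "of", len(ps_control))
print("treated per bin:", hist_t)
print("control per bin:", hist_c)
```

Bins where one group has many units and the other has almost none are the regions where matches will be scarce or poor; plotting the two histograms on the same axes conveys the same information visually.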

## Exercise 5

Obtain a matched sample using the k:1 nearest-neighbor method. Show the first ten rows of the matched data.
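The mechanics of k:1 nearest-neighbor matching on the propensity score can be sketched in a few lines. This version assumes matching with replacement and no caliper, which is a simplifying choice rather than something the exercise prescribes; `knn_match` is a hypothetical helper name and the scores are made up.

```python
import numpy as np


def knn_match(ps_treated, ps_control, k=1):
    """For each treated unit, return the indices of the k controls with the
    closest propensity scores (matching with replacement)."""
    dist = np.abs(ps_treated[:, None] - ps_control[None, :])
    return np.argsort(dist, axis=1)[:, :k]


# Made-up propensity scores for three treated and five control units.
ps_treated = np.array([0.30, 0.62, 0.80])
ps_control = np.array([0.10, 0.28, 0.35, 0.60, 0.90])

matches = knn_match(ps_treated, ps_control, k=2)

# Each matched control contributes 1/k per match; controls reused across
# treated units accumulate weight, and unmatched controls get weight 0.
weights = np.bincount(matches.ravel(), minlength=len(ps_control)) / matches.shape[1]

print(matches)   # row i lists the controls matched to treated unit i
print(weights)
```

The full pairwise distance matrix is fine for small examples but grows as treated × control, so for a sample the size of the CPS a tree-based or sorted search is likely to be more practical.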

## Exercise 6

Conduct a t-test between the treatment and control groups using the matched data. Interpret the result. Are the covariates balanced?
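A balance check usually compares each covariate across groups before and after matching. The sketch below computes a Welch t-statistic and a standardized mean difference by hand on simulated ages; the data, the helper names `welch_t` and `std_mean_diff`, and the way the "matched" controls are simulated are all assumptions for illustration.

```python
import numpy as np


def welch_t(a, b):
    """Welch's t-statistic for a difference in means (unequal variances)."""
    se2 = a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(se2)


def std_mean_diff(a, b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd


rng = np.random.default_rng(2)

# Simulated covariate: controls are older before matching, similar after.
age_treated = rng.normal(40, 10, 400)
age_control_raw = rng.normal(45, 10, 400)
age_control_matched = rng.normal(40, 10, 400)

smd_before = std_mean_diff(age_treated, age_control_raw)
smd_after = std_mean_diff(age_treated, age_control_matched)

print("t before:", welch_t(age_treated, age_control_raw))
print("t after: ", welch_t(age_treated, age_control_matched))
print("SMD before:", smd_before, "SMD after:", smd_after)
```

`scipy.stats.ttest_ind(a, b, equal_var=False)` returns the same statistic together with a p-value. Because t-statistics shrink mechanically when matching reduces the sample, the standardized mean difference is often reported alongside them as a size-independent measure of balance.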

## Exercise 7

Fit four separate regression models to estimate the effect of college education on earnings per hour:

- an OLS model including only the treatment variable
- an OLS model including the treatment variable and covariates
- a weighted least squares model including only the treatment variable, using the weights obtained from propensity score matching
- a weighted least squares model including the treatment variable and covariates, using the weights obtained from propensity score matching

Compare the four models and interpret the results.
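To see what the matching weights do in a weighted least squares fit, the sketch below solves the weighted normal equations directly on a tiny invented dataset. The outcomes, the weights, and the helper name `wls` are assumptions for illustration; treated units get weight 1 and controls get the kind of fractional weights that k:1 matching with replacement can produce.

```python
import numpy as np


def wls(X, y, w):
    """Weighted least squares coefficients: solve (X'WX) b = X'Wy."""
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)


# Invented data: three treated units followed by five controls.
treat = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y = np.array([30.0, 28.0, 35.0, 12.0, 20.0, 24.0, 26.0, 31.0])
w = np.array([1.0, 1.0, 1.0, 0.0, 0.5, 1.0, 1.0, 0.5])  # hypothetical matching weights

# Model with an intercept and the treatment indicator only.
X = np.column_stack([np.ones(len(y)), treat])
b_ols = wls(X, y, np.ones(len(y)))   # equal weights reproduce OLS
b_wls = wls(X, y, w)

print("OLS treatment coefficient:", b_ols[1])
print("WLS treatment coefficient:", b_wls[1])
```

With only an intercept and the treatment indicator, the WLS coefficient equals the weighted treated mean minus the weighted control mean, so the unmatched control (weight 0) drops out entirely. For standard errors and covariate-adjusted versions, a library fit such as `statsmodels.api.WLS(y, X, weights=w)` is the more usual route.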