Propensity Score Matching Exercise

To practice propensity score matching, let’s estimate how a college education affects earnings in the US, using data from the US Current Population Survey (CPS) on US wages in 2019.

[2]:
# Load survey
import pandas as pd

cps = pd.read_stata('https://github.com/nickeubank/MIDS_Data/blob/master/Current_Population_Survey/morg18.dta?raw=true')

# Limit to people currently employed and working full time.
cps = cps[cps.lfsr94 == 'Employed-At Work']
cps = cps[cps.uhourse >= 35]

# Convert hourly earnings from cents to dollars, then compute annual earnings.
cps['earnhre_dollars'] = cps['earnhre'] / 100
cps['annual_earnings'] = cps['earnhre_dollars'] * cps['uhourse'] * 52

# Create indicator variables for gender and college education.
cps['female'] = (cps.sex == 2).astype('int')
cps['has_college_educ'] = (cps.grade92 > 43).astype('int')

cps.describe()
[2]:
county smsastat age sex grade92 race ethnic marital uhourse earnhre ... gedhigr yrcoll grprof gr6cor ms123 occ2012 earnhre_dollars annual_earnings female has_college_educ
count 133814.000000 132638.000000 133814.000000 133814.000000 133814.000000 133814.000000 18480.000000 133814.000000 133814.000000 65755.000000 ... 3107.000000 36240.000000 0.0 0.0 0.0 133814.000000 65755.000000 65755.000000 133814.000000 133814.000000
mean 25.735020 1.173932 43.335458 1.440320 41.059680 1.434274 2.581872 3.253359 42.596515 1940.998783 ... 6.640489 2.853256 NaN NaN NaN 3989.409128 19.409988 41757.890924 0.440320 0.148295
std 61.578816 0.379052 13.335412 0.496427 2.512128 1.270713 2.417939 2.676927 7.002970 1008.707762 ... 1.321649 0.963869 NaN NaN NaN 2708.186730 10.087078 23164.092147 0.496427 0.355394
min 0.000000 1.000000 16.000000 1.000000 31.000000 1.000000 1.000000 1.000000 35.000000 17.000000 ... 1.000000 1.000000 NaN NaN NaN 10.000000 0.170000 397.800000 0.000000 0.000000
25% 0.000000 1.000000 32.000000 1.000000 39.000000 1.000000 1.000000 1.000000 40.000000 1300.000000 ... 6.000000 2.000000 NaN NaN NaN 1550.000000 13.000000 27040.000000 0.000000 0.000000
50% 0.000000 1.000000 43.000000 1.000000 41.000000 1.000000 1.000000 1.000000 40.000000 1675.000000 ... 7.000000 3.000000 NaN NaN NaN 4050.000000 16.750000 35360.000000 0.000000 0.000000
75% 29.000000 1.000000 54.000000 2.000000 43.000000 1.000000 4.000000 7.000000 40.000000 2300.000000 ... 8.000000 3.000000 NaN NaN NaN 5700.000000 23.000000 49920.000000 1.000000 0.000000
max 810.000000 2.000000 85.000000 2.000000 46.000000 26.000000 8.000000 7.000000 99.000000 9999.000000 ... 8.000000 5.000000 NaN NaN NaN 9750.000000 99.990000 361920.000000 1.000000 1.000000

8 rows × 24 columns

Exercise 1

How many observations have a college degree, and how many do not?

Exercise 2

Show the raw difference in earnhre_dollars between the group with a college degree and the group without one.
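
As a warm-up for Exercises 1 and 2, here is a minimal sketch of counting group sizes and taking a raw difference in means with pandas. It runs on a made-up toy DataFrame (the values are assumptions for illustration, not CPS results); only the column names are borrowed from the code above.

```python
import pandas as pd

# Toy data standing in for the CPS sample (values are invented).
toy = pd.DataFrame({
    "has_college_educ": [1, 1, 0, 0, 0],
    "earnhre_dollars": [30.0, 26.0, 18.0, 20.0, 16.0],
})

# Group sizes: number of treated (1) and untreated (0) rows.
group_counts = toy["has_college_educ"].value_counts()

# Raw (unadjusted) difference in mean hourly earnings, treated minus control.
# Note that .mean() skips missing values by default.
group_means = toy.groupby("has_college_educ")["earnhre_dollars"].mean()
raw_diff = group_means.loc[1] - group_means.loc[0]

print(group_counts)
print(raw_diff)
```

In the real data, earnhre is missing for many rows (the describe() output above shows 65,755 non-missing values out of 133,814), so it is worth deciding explicitly whether to drop those rows before comparing groups.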

Exercise 3

Select covariates that may be correlated with both the treatment and the dependent variable, then use these covariates to fit a logistic regression model and obtain propensity scores.
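
To make the idea concrete, the sketch below estimates propensity scores on simulated data. It fits the logistic regression with a few Newton-Raphson steps in plain NumPy so that it has no extra dependencies; in practice a library such as statsmodels or scikit-learn would normally be used instead. The helper names (fit_logit, propensity_scores), the simulated covariates, and the coefficients are all assumptions for illustration.

```python
import numpy as np

def fit_logit(X, t, n_iter=25):
    """Fit a logistic regression of t on X (plus intercept) by Newton-Raphson."""
    X1 = np.column_stack([np.ones(len(X)), X])   # add intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))     # current fitted probabilities
        grad = X1.T @ (t - p)                    # gradient of the log-likelihood
        hess = X1.T @ (X1 * (p * (1.0 - p))[:, None])  # negative Hessian
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def propensity_scores(X, beta):
    """Predicted probability of treatment for each row of X."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-X1 @ beta))

# Simulated covariates and treatment (not CPS data).
rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(20, 60, n)
female = rng.integers(0, 2, n)
X = np.column_stack([(age - age.mean()) / age.std(), female])  # standardized age
true_logit = -1.0 + 0.8 * X[:, 0] + 0.5 * X[:, 1]
treated = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

beta_hat = fit_logit(X, treated)
pscore = propensity_scores(X, beta_hat)
print(beta_hat)
```

The propensity score is simply the fitted probability of treatment given the covariates, so every observation, treated or not, receives one.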

Exercise 4

Evaluate the common support of the treated and control groups.
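
Common support is usually inspected visually, for example with overlaid histograms of the propensity score by group. A numeric companion is sketched below: it finds the range of scores observed in both groups and flags observations inside it. The min/max rule used here is one convention among several, and the function name and toy scores are assumptions.

```python
import numpy as np

def common_support_bounds(pscore, treated):
    """Return (low, high): the range of scores observed in BOTH groups."""
    ps_t = pscore[treated == 1]
    ps_c = pscore[treated == 0]
    low = max(ps_t.min(), ps_c.min())
    high = min(ps_t.max(), ps_c.max())
    return low, high

# Toy propensity scores (invented values).
pscore = np.array([0.05, 0.20, 0.35, 0.50, 0.40, 0.60, 0.80, 0.95])
treated = np.array([0, 0, 0, 0, 1, 1, 1, 1])

low, high = common_support_bounds(pscore, treated)
in_support = (pscore >= low) & (pscore <= high)
print(low, high, in_support.sum())
```

In this toy example only the scores between 0.40 and 0.50 are shared by both groups, which would signal very poor overlap.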

Exercise 5

Obtain a matched sample using the k:1 nearest-neighbor method. Show the first ten rows of the matched data.
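
One way to think about k:1 nearest-neighbor matching is sketched below in plain NumPy: each treated unit is paired with the k controls whose propensity scores are closest, with replacement and no caliper. Those design choices, the helper names (knn_match, matching_weights), and the toy scores are assumptions; dedicated matching packages offer more options. The all-pairs distance matrix is fine for small samples but would need a smarter search for data the size of the CPS.

```python
import numpy as np

def knn_match(pscore, treated, k=1):
    """For each treated unit, return the indices of its k nearest controls."""
    treated_idx = np.flatnonzero(treated == 1)
    control_idx = np.flatnonzero(treated == 0)
    # Absolute score distance for every (treated, control) pair.
    dist = np.abs(pscore[treated_idx][:, None] - pscore[control_idx][None, :])
    nearest = np.argsort(dist, axis=1, kind="stable")[:, :k]
    return treated_idx, control_idx[nearest]

def matching_weights(n, treated_idx, matches):
    """Weight 1 per treated unit; each use of a control adds 1/k to its weight."""
    k = matches.shape[1]
    weights = np.zeros(n)
    weights[treated_idx] = 1.0
    np.add.at(weights, matches.ravel(), 1.0 / k)
    return weights

# Toy propensity scores (invented values): two treated, five controls.
pscore = np.array([0.30, 0.70, 0.10, 0.28, 0.35, 0.65, 0.90])
treated = np.array([1, 1, 0, 0, 0, 0, 0])

treated_idx, matches = knn_match(pscore, treated, k=2)
weights = matching_weights(len(pscore), treated_idx, matches)
print(matches)
print(weights)
```

Controls that are never selected keep a weight of zero, so the matched sample consists of the rows with positive weight.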

Exercise 6

Conduct t-tests comparing the treatment and control groups in the matched data. Interpret the results. Are the covariates balanced?
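
A balance check for a single covariate can be sketched as follows. It computes a standardized mean difference (SMD) alongside a Welch t-statistic by hand on toy numbers; the function name and data are assumptions. For p-values, scipy.stats.ttest_ind with equal_var=False is a common choice. A rule of thumb often quoted is that an absolute SMD below about 0.1 suggests reasonable balance, though that threshold is a convention rather than a test.

```python
import numpy as np

def balance_stats(x, treated):
    """Return (standardized mean difference, Welch t-statistic) for covariate x."""
    x_t, x_c = x[treated == 1], x[treated == 0]
    diff = x_t.mean() - x_c.mean()
    var_t, var_c = x_t.var(ddof=1), x_c.var(ddof=1)
    smd = diff / np.sqrt((var_t + var_c) / 2.0)
    t_stat = diff / np.sqrt(var_t / len(x_t) + var_c / len(x_c))
    return smd, t_stat

# Toy matched sample (invented ages): three treated, three controls.
age = np.array([30.0, 40.0, 50.0, 31.0, 41.0, 51.0])
treated = np.array([1, 1, 1, 0, 0, 0])

smd, t_stat = balance_stats(age, treated)
print(smd, t_stat)
```

Unlike the t-statistic, the SMD does not shrink or grow with sample size, which is why it is often reported next to the t-test when comparing balance before and after matching.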

Exercise 7

Fit four separate regression models to estimate the effect of college education on earnings per hour:

- an OLS model, including only the treatment variable
- an OLS model, including the treatment variable and covariates
- a weighted least squares model, including only the treatment variable, using the weights obtained from propensity score matching
- a weighted least squares model, including the treatment variable and covariates, using the weights obtained from propensity score matching

Compare the four models above and interpret the results.
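
For the weighted models, the mechanics of weighted least squares can be sketched in NumPy by rescaling each row by the square root of its weight and solving an ordinary least squares problem. This is only an illustration on invented numbers, and the wls helper is hypothetical; statsmodels provides a WLS class that also reports standard errors, which the comparison between models will need.

```python
import numpy as np

def wls(y, X, weights):
    """Coefficients minimizing sum(w * (y - intercept - X b)^2)."""
    X1 = np.column_stack([np.ones(len(y)), X])   # add intercept column
    sw = np.sqrt(weights)
    beta, *_ = np.linalg.lstsq(X1 * sw[:, None], y * sw, rcond=None)
    return beta

# Toy outcome, treatment indicator, and matching weights (invented values).
y = np.array([30.0, 26.0, 18.0, 20.0, 16.0, 40.0])
treated = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
weights = np.array([1.0, 1.0, 0.5, 1.0, 0.5, 0.0])  # last control was never matched

beta = wls(y, treated, weights)
print(beta)  # [intercept, treatment coefficient]
```

With only the treatment indicator on the right-hand side, the treatment coefficient equals the difference between the weighted means of the two groups, and a zero-weight row has no influence on the fit.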

Absolutely positively need the solutions?

Don’t use this link until you’ve really, really spent time struggling with your code! Doing so only results in you cheating yourself.

Link