Predicting Mortgage Delinquency Risk

Note: this is a new exercise, so if you find something weird, please bring it to my attention.

You have been hired by a mortgage servicing firm (a company that buys mortgages and then collects mortgage payments from homeowners) to build a model to answer the question:

Given all available information about a newly issued mortgage, what is the likelihood that the mortgage will enter delinquency (the homeowner will be at least 30 days late on a mortgage payment) during the first two years of the mortgage?

The servicer’s hope, of course, is to differentiate between mortgages worth purchasing (those that will be consistently paid) and mortgages it wishes to avoid.

For this task, you have been given REAL data on a sample of all US Standard single family home mortgages purchased or insured by Freddie Mac in a single calendar year along with payment data from that and two subsequent years.

Gradescope Autograding

Please follow all standard guidance for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called results and ensuring your notebook runs from the start to completion without any errors.

For this assignment, please name your file exercise_passive_prediction.ipynb before uploading.

You can check that you have answers for all questions in your results dictionary with this code:

assert set(results.keys()) == {
    "ex2_merge_type",
    "ex4_num_mortgages",
    "ex5_num_obs",
    "ex7_num_mortgages",
    "ex7_share_delinquent",
    "ex10_num_obs",
    "ex12_roc_auc",
    "ex14_false_omission_rate",
    "ex16_num_obs",
    "ex16_share_delinquent",
    "ex17_false_omission_rate",
}

Submission Limits

Please remember that you are only allowed FOUR submissions to the autograder. Your last submission (if you submit 4 or fewer times), or your fourth submission (if you submit more than 4 times), will determine your grade. Submissions that error out will not count against this total.

That’s one more than usual in case there are issues with exercise clarity.

Data Cleaning and Organization

Data for this exercise can be found here. This folder includes both the data to be used and documentation, though you can find supplemental documentation here.

The only difference between this data and the original Freddie Mac sampled data is that I’ve limited the scope of service data to three calendar years.

Exercise 1

Begin by loading both:

  • the mortgage origination file (sample_orig_2004.txt). This should contain information on all mortgages issued in 2004, along with non-time varying features of these mortgages (the initial amount, the credit score of the applicant, etc.), and

  • the servicing data (sample_svcg_2004orig_3years.txt). This contains monthly records of all recorded payments (or non-payments) for all mortgages issued in 2004 during the calendar years of 2004, 2005, and 2006.

So the autograder can see the data, be sure to load it directly from a URL (don’t download and load from your own system).

Load the data AND ensure your data has column names. You will likely need to reference the documentation to figure out how to do so.
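For reference, here is a minimal sketch of this loading step. The Freddie Mac files are pipe-delimited with no header row, so column names must be supplied from the documentation; the names below are illustrative placeholders (not the real, full list), and an in-memory string stands in for the real URL:

```python
import io
import pandas as pd

# Illustrative column names -- take the real, full list from the documentation.
ORIG_COLS = ["Credit_Score", "First_Payment_Date", "Loan_Sequence_Number"]

# Stand-in for the real file; pd.read_csv accepts the data URL directly,
# e.g. pd.read_csv("https://.../sample_orig_2004.txt", sep="|", ...).
raw = io.StringIO("751|200404|F04Q10001\n698|200405|F04Q10002\n")

# sep="|" for pipe-delimited data; header=None because there is no header row.
orig = pd.read_csv(raw, sep="|", header=None, names=ORIG_COLS)
print(orig.shape)  # (2, 3)
```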

Exercise 2

What is the unit of observation in sample_orig_2004.txt and in sample_svcg_2004orig_3years.txt?

Exercise 3

Merge your two datasets. Be sure to use the validate keyword argument in merge.

You will find some records in the origination files not in the servicing file. We need data from both files, so just do an inner join.

Assuming that you list the data associated with sample_orig_2004.txt first and sample_svcg_2004orig_3years.txt second, what keyword are you passing to validate? Store your answer as a string (use one of: "1:1", "m:1", "1:m", "m:m") in a dictionary called results under the key ex2_merge_type.
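A toy sketch of the merge with validate, using made-up frames (one row per loan on the origination side, many per loan on the servicing side):

```python
import pandas as pd

# Toy stand-ins: origination data is one row per loan, servicing data is
# one row per loan-month.
orig = pd.DataFrame({
    "Loan_Sequence_Number": ["A", "B", "C"],
    "Credit_Score": [750, 700, 680],
})
svcg = pd.DataFrame({
    "Loan_Sequence_Number": ["A", "A", "B"],
    "Delinquency_Status": ["0", "1", "0"],
})

# validate asserts the expected key structure and raises if it's violated;
# the inner join drops loan "C", which has no servicing records.
merged = orig.merge(svcg, on="Loan_Sequence_Number", how="inner", validate="1:m")
print(len(merged))  # 3
```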

Exercise 4

Mortgages come in many shapes and flavors; however, your servicer is only interested in predicting delinquency for the more standard form of mortgage. Subset your data to only include:

  • Mortgages taken out for purchase of a property,

  • With first payments due in the quarter of origination or the first quarter after origination.

(In a perfect world we would just limit our analysis to mortgages where the first payment is due the month after origination. Unfortunately we only know the quarter of origination, so the only way to subset for relatively vanilla mortgages is to look for mortgages where the first payment was due in the same quarter or the quarter after origination.)

Subset for these mortgages. How many unique mortgages remain in the data?

Hint: You may need to read the documentation for the Loan Sequence Number variable.

Store the resulting number of unique mortgages in results under the key ex4_num_mortgages.
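One way to sketch the quarter logic, assuming the First Payment Date is a YYYYMM integer and the Loan Sequence Number starts with the origination year and quarter (e.g., "F04Q1..."); confirm both formats against the documentation before relying on this:

```python
import pandas as pd

# Made-up rows: two loans with first payments in or just after the
# origination quarter, one with a much later first payment.
df = pd.DataFrame({
    "Loan_Sequence_Number": ["F04Q100001", "F04Q100002", "F04Q400003"],
    "First_Payment_Date": [200403, 200404, 200507],
})

# Origination quarter parsed from the sequence number ("F04Q1..." -> "2004Q1").
orig_q = pd.PeriodIndex(
    "20" + df["Loan_Sequence_Number"].str[1:3] + df["Loan_Sequence_Number"].str[3:5],
    freq="Q",
)
# First-payment quarter from the YYYYMM date.
first_pay_q = pd.to_datetime(
    df["First_Payment_Date"].astype(str), format="%Y%m"
).dt.to_period("Q")

# Keep loans whose first payment is due in the origination quarter or the next.
keep = (first_pay_q == orig_q) | (first_pay_q == orig_q + 1)
vanilla = df[keep]
print(len(vanilla))  # 2
```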

Exercise 5

The servicer wants to predict delinquency during the first 24 payment due dates (you may assume payments are due every month starting with the month the first payment is due). Subset the data to these first 24 (possible) payment due dates.

Note that not all loans will have 24 records in the servicing file in the first 24 months as a result of data merging issues on behalf of Freddie Mac. As noted in the Freddie Mac documentation:

For a given loan, each monthly reporting period in the monthly performance data file combines data elements from multiple reporting cycles and systems at Freddie Mac. As such, perceived data anomalies may be a result of timing mismatches between default/delinquency reporting cycles and investor reporting cycles. Examples of some commonly occurring anomalies in the data are included throughout this section. In all cases, the best information available at the time the Dataset is generated, subject to operational constraints, is used.

So subset for the first two years of (possible) payments, resulting in up to 24 observations per mortgage (but potentially fewer, given the data cleanliness issues).

After this subsetting, store the number of remaining observations (not mortgages, observations) in results under the key "ex5_num_obs".
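A sketch of the subsetting, under the assumption that both date fields are YYYYMM integers; comparing calendar months (rather than counting rows per loan, e.g. with groupby().head(24)) is what keeps gaps in the servicing records from shifting the 24-month window:

```python
import pandas as pd

# Made-up servicing rows for one loan, including a record 24 months after
# the first due date (which should be excluded).
df = pd.DataFrame({
    "Loan_Sequence_Number": ["A", "A", "A"],
    "Monthly_Reporting_Period": [200402, 200403, 200602],
    "First_Payment_Date": [200402, 200402, 200402],
})

rep = pd.to_datetime(
    df["Monthly_Reporting_Period"].astype(str), format="%Y%m"
).dt.to_period("M")
first = pd.to_datetime(
    df["First_Payment_Date"].astype(str), format="%Y%m"
).dt.to_period("M")

# Months elapsed since the first due date; keep due dates 0 through 23.
months_since = (rep - first).apply(lambda off: off.n)
first24 = df[(months_since >= 0) & (months_since < 24)]
print(len(first24))  # 2
```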

Exercise 6

For each unique mortgage in your dataset, create an indicator variable that takes on a value of 1 if, at any time during this period, the mortgage has been delinquent.

Delinquency status is stored in the variable CURRENT LOAN DELINQUENCY STATUS, and is coded as:

CURRENT LOAN DELINQUENCY STATUS – A value corresponding to the number of days the borrower is delinquent, based on the due date of last paid installment (“DDLPI”) reported by servicers to Freddie Mac, and is calculated under the Mortgage Bankers Association (MBA) method. If a loan has been acquired by REO, then the Current Loan Delinquency Status will reflect the value corresponding to that status (instead of the value corresponding to the number of days the borrower is delinquent).

0 = Current, or less than 30 days delinquent

1 = 30-59 days delinquent

2 = 60–89 days delinquent

3 = 90–119 days delinquent

And so on…

RA = REO Acquisition
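A minimal sketch of the indicator, with illustrative column names; treating the status as a string means the comparison also catches the non-numeric "RA" code:

```python
import pandas as pd

# Made-up loan-month records: loan "A" goes 30+ days delinquent once,
# loan "B" stays current throughout.
df = pd.DataFrame({
    "Loan_Sequence_Number": ["A", "A", "B", "B"],
    "Current_Loan_Delinquency_Status": ["0", "1", "0", "0"],
})

# Flag any status other than "0" (current), then take the max within each
# loan: 1 if the loan was ever delinquent, 0 otherwise.
ever_delinq = (
    df["Current_Loan_Delinquency_Status"].astype(str).ne("0")
    .groupby(df["Loan_Sequence_Number"]).max().astype(int)
    .rename("ever_delinquent")
)
print(ever_delinq.to_dict())  # {'A': 1, 'B': 0}
```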

Exercise 7

At this point, you should be able to drop all servicing variables reported on a monthly basis and just keep information about the original mortgage issuance (and still keep an indicator for whether the mortgage has ever been delinquent).

Store the final number of mortgages in your data under ex7_num_mortgages and the share (between 0 and 1) of mortgages that have been delinquent under ex7_share_delinquent.

Modelling Delinquency Risk

Your data should now be relatively tidy, in the technical sense of the term. And that means it should be relatively straightforward for you to build a model that answers the question “Given the features of a newly originated mortgage, how likely is the mortgage holder to fall into delinquency within the first two years after origination?”

Exercise 8

First, we need to identify the target for our model, identify useful predictors in the data, and do some feature engineering.

Let’s begin with identifying some features that probably aren’t going to be useful. For example, "Metropolitan Statistical Area (MSA) Or Metropolitan Division" is probably not an appropriate feature to include in this analysis. Can you figure out why? Make sure to show (quantitatively) why not.

Hint: your reasoning should involve more than just the share of missing values.

Hint 2: how many observations for a given city do you think you’d need to determine if that city had especially high mortgage delinquency rates?

Hint 3: if not all possible values of a variable are present in your training data, what problem might that cause during testing and deployment?

Exercise 9

For your analysis, include the following variables:

Credit Score
First Time Homebuyer Flag
Number of Units
Mortgage Insurance Percentage (MI %)
Occupancy Status
Original Debt-to-Income (DTI) Ratio
Original UPB
Original Loan-to-Value (LTV)
Original Interest Rate
Prepayment Penalty Mortgage (PPM) Flag
Amortization Type (Formerly Product Type)
Property State
Property Type
Original Loan Term
Number of Borrowers
Interest Only (I/O) Indicator

Be sure to clean these variables. When doing so, please treat missing data as missing (e.g., np.nan, not as a distinct category).
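A sketch of sentinel-to-NaN cleaning with replace. The specific codes shown (9999 for Credit Score, 999 for DTI, "9" for the homebuyer flag) should be checked against the documentation for each variable before use:

```python
import numpy as np
import pandas as pd

# Made-up rows with one sentinel-coded "missing" value per column.
df = pd.DataFrame({
    "Credit_Score": [751, 9999, 698],
    "Original_DTI": [33, 999, 41],
    "First_Time_Homebuyer_Flag": ["Y", "9", "N"],
})

# Map each variable's documented "not available" code to np.nan so missing
# data is treated as missing rather than as a distinct category.
sentinels = {"Credit_Score": 9999, "Original_DTI": 999, "First_Time_Homebuyer_Flag": "9"}
for col, code in sentinels.items():
    df[col] = df[col].replace(code, np.nan)

print(df["Credit_Score"].isna().sum())  # 1
```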

Exercise 10

The next step in our analysis is to convert our categorical variables to one-hot-encodings and use train_test_split to split our data.

To ensure replicability, before you train_test_split your data, please sort your data by Loan Sequence Number. This will ensure when we split the data with a random seed below, everyone will get the same split and the autograder will function.

You may create your one-hot-encodings however you wish, but I’m a fan of the patsy library’s dmatrices function.

Hint: You should end up with 8 categorical variables, including some binary flags and Number_of_Borrowers, Number_of_Units (which you could argue should be continuous, but I think are better treated as categorical).

Store the number of observations in your final dataset in ex10_num_obs.
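If you'd rather not use patsy, here is a sketch of the same step with pandas' get_dummies (column names and values here are made up):

```python
import pandas as pd

# Sort by Loan Sequence Number first so the later random split is
# reproducible across everyone's notebooks.
df = pd.DataFrame({
    "Loan_Sequence_Number": ["F04Q10003", "F04Q10001", "F04Q10004", "F04Q10002"],
    "ever_delinquent": [0, 1, 0, 1],
    "Credit_Score": [750.0, 620.0, 700.0, 640.0],
    "Property_State": ["NC", "CA", "NC", "TX"],
}).sort_values("Loan_Sequence_Number")

y = df["ever_delinquent"]
# Expand each categorical into one indicator column per level.
X = pd.get_dummies(df[["Credit_Score", "Property_State"]],
                   columns=["Property_State"])
print(X.columns.tolist())
# ['Credit_Score', 'Property_State_CA', 'Property_State_NC', 'Property_State_TX']
```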

Exercise 11

Use train_test_split from sklearn.model_selection to split the data.

Use 0.2 as the test_size and random_state=42.

Exercise 12

Now fit a GradientBoostingClassifier (from sklearn.ensemble) to the data, setting random_state=42. Then, using roc_auc_score, compute your ROC AUC score against the test data.

Please round your answer to four decimal places - sometimes sklearn models have very small instabilities even after setting a random seed.

Store in results under the key "ex12_roc_auc".
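An end-to-end sketch of the split/fit/score pipeline on synthetic data; in the exercise, X and y would be your cleaned mortgage features and delinquency indicator:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic features; the outcome is driven by the first feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Same split settings as the exercise.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# roc_auc_score needs predicted probabilities for the positive class,
# not the hard 0/1 labels that .predict() returns.
auc = round(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]), 4)
print(auc)
```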

Exercise 13

Use the predict method to generate a confusion matrix. What problem do you see with the result?
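To see the issue in miniature: with a rare positive class, .predict() (which thresholds at 0.5) can label nearly everything negative, leaving the predicted-positive column of the confusion matrix empty even though true positives exist:

```python
from sklearn.metrics import confusion_matrix

# Imbalanced toy outcome: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# What a 0.5-threshold classifier can produce on imbalanced data:
y_pred = [0] * 10

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)  # [[8 0]
           #  [2 0]]
```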

Exercise 14

To address the problem from Exercise 13, use .predict_proba() to set your own threshold for classification. Your stakeholder is mostly concerned with False Negatives (mortgages classified as safe that actually are not), so use an 8% probability threshold to get a good balance of a low False Omission Rate (the share of predicted negatives that are false negatives) with a reasonable amount of mortgages still being considered “viable.”

In other words, treat a mortgage as risky (1) if the predicted probability is greater than 0.08. Be sure to use “greater than” rather than “greater than or equal to”.

What is the False Omission Rate at an 8% classification threshold from the model above?

Store the result under the key "ex14_false_omission_rate".
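A small sketch of thresholding predicted probabilities and computing the False Omission Rate, FOR = FN / (FN + TN), with made-up probabilities:

```python
import numpy as np

# Illustrative predict_proba[:, 1] values and true outcomes.
proba = np.array([0.02, 0.05, 0.09, 0.30, 0.07])
y_true = np.array([0, 1, 1, 1, 0])

# Strictly greater than the 8% cutoff, as the exercise specifies.
pred = (proba > 0.08).astype(int)

# Among predicted negatives, what share are actually delinquent?
neg = pred == 0
false_omission_rate = (y_true[neg] == 1).mean()
print(false_omission_rate)  # 1/3 = 0.333...
```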

Exercise 15

Your stakeholder wants to buy as many mortgages as it can while maintaining a delinquency rate among purchased mortgages below 5%. Based on your answer above, do you feel your model can provide that level of performance?

Now To The Future

The preceding analysis is precisely the type of analysis you would do if, in late 2006, you’d been asked to evaluate mortgage performance in the last two years for use going forward. So let’s see how your model performs now!

In this folder you will find data on mortgages originated in 2007 along with servicing data from 2007, 2008, and 2009.

Exercise 16

Please load this data (again, from a URL to help the autograder) and clean it in the same manner as before. As a sanity check, how many observations do you have in the final dataset (after you’ve removed observations with missing values to allow you to generate predicted delinquency rates)?

Store the final number of observations in "ex16_num_obs" and the share of those mortgages that are delinquent in "ex16_share_delinquent".

Exercise 17

Had your stakeholder purchased mortgages using your model (with the 8% cutoff), what would the resulting False Omission Rate (the share of predicted negatives that are false negatives) have been? (That is, compare the predictions from the model trained above with these mortgages’ realized outcomes.) Store your result under the key "ex17_false_omission_rate".

Exercise 18

How did the retrospective performance of your model compare to its actual performance going forward? Why? Did you stay below the stakeholder’s 5% target for mortgages that were predicted to be safe but in the end were not?