# Predicting Mortgage Delinquency Risk

**Note: this is a new exercise, so if you find something weird, please bring it to my attention.**

You have been hired by a mortgage servicing firm (a company that buys mortgages and then collects mortgage payments from homeowners) to build a model to answer the question:

**Given all available information about a newly issued mortgage, what is the likelihood that the mortgage will enter delinquency (the homeowner will be at least 30 days late on a mortgage payment) during the first two years of the mortgage?**

The servicer’s hope, obviously, is to differentiate between mortgages it should try to purchase (those that will be consistently paid) and mortgages it should avoid.

For this task, you have been given REAL data on a sample of all US standard single-family home mortgages purchased or insured by Freddie Mac in a single calendar year, along with payment data from that year and the two subsequent years.

## Gradescope Autograding

Please follow all standard guidance for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called `results` and ensuring your notebook runs from start to completion without any errors.

For this assignment, please name your file `exercise_passive_prediction.ipynb` before uploading.

You can check that you have answers for all questions in your `results` dictionary with this code:

```
assert set(results.keys()) == {
    "ex2_merge_type",
    "ex4_num_mortgages",
    "ex5_num_obs",
    "ex7_num_mortgages",
    "ex7_share_delinquent",
    "ex10_num_obs",
    "ex12_roc_auc",
    "ex14_false_omission_rate",
    "ex16_num_obs",
    "ex16_share_delinquent",
    "ex17_false_omission_rate",
}
```

### Submission Limits

Please remember that you are **only allowed FOUR submissions to the autograder.** Your last submission (if you submit 4 or fewer times), or your fourth submission (if you submit more than 4 times), will determine your grade. Submissions that error out will **not** count against this total.

That’s one more than usual in case there are issues with exercise clarity.

## Data Cleaning and Organization

Data for this exercise can be found here. This folder includes both the data to be used and documentation, though you can find supplemental documentation here.

The only difference between this data and the original Freddie Mac sampled data is that I’ve limited the scope of service data to three calendar years.

### Exercise 1

Begin by loading both:

- the mortgage origination file (`sample_orig_2004.txt`). This *should* contain information on all mortgages issued in 2004, along with non-time-varying features of these mortgages (the initial amount, the credit score of the applicant, etc.), and
- the servicing data (`sample_svcg_2004orig_3years.txt`). This contains monthly records of all recorded payments (or non-payments) for all mortgages issued in 2004 during the calendar years 2004, 2005, and 2006.

So the autograder can see the data, be sure to load it directly from a URL (don’t download and load from your own system).

Load the data AND ensure your data has column names. You will likely need to reference the documentation to figure out how to do so.

### Exercise 2

What is the unit of observation in `sample_orig_2004.txt` and in `sample_svcg_2004orig_3years.txt`?

### Exercise 3

Merge your two datasets. Be sure to use the `validate` keyword argument in `merge`.

You will find some records in the origination file that are not in the servicing file. We need data from both files, so just do an `inner` join.

Assuming that you list the data associated with `sample_orig_2004.txt` first and `sample_svcg_2004orig_3years.txt` second, what keyword are you passing to `validate`? Store your answer as a string (use one of: `"1:1"`, `"m:1"`, `"1:m"`, `"m:m"`) in a dictionary called `results` under the key `ex2_merge_type`
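As a generic illustration of how `validate` behaves (toy data with made-up column names, not the Freddie Mac files and not the answer to this exercise), pandas raises a `MergeError` whenever the claimed key relationship does not hold:

```
import pandas as pd

# Toy tables: "key" is unique on the left, repeated on the right.
left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["a", "a", "b"], "y": [10, 20, 30]})

# validate="1:m" asserts merge keys are unique on the left side but may
# repeat on the right; pandas raises pd.errors.MergeError if violated.
merged = left.merge(right, on="key", how="inner", validate="1:m")
print(len(merged))  # 3 rows: "a" matches twice, "b" once
```

Trying `validate="1:1"` on these same toy tables would raise `MergeError`, which is exactly the safety check you want when merging the real files.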

### Exercise 4

Mortgages come in many shapes and flavors, but your servicer is only interested in predicting default for the more standard form of mortgage. Subset your data to only include:

- Mortgages taken out for purchase of a property,
- With first payments due in the quarter of origination or the first quarter after origination.

(In a perfect world we would just limit our analysis to mortgages where the first payment is due the month after origination. Unfortunately we only know the *quarter* of origination, so the only way to subset for relatively vanilla mortgages is to look for mortgages where the first payment was due in the same quarter or the quarter after origination.)

Subset for these mortgages. How many unique mortgages remain in the data?

Hint: You may need to read the documentation for the `Loan Sequence Number` variable.

Store the resulting number of unique mortgages in `results` under the key `ex4_num_mortgages`.

### Exercise 5

The servicer wants to predict delinquency during the first 24 payment due dates (you may assume payments are due every month starting with the month the first payment is due). Subset the data to these first 24 (possible) payment due dates.

Note that not all loans will have 24 records in the servicing file in the first 24 months as a result of data merging issues on Freddie Mac's end. As noted in the Freddie Mac documentation:

For a given loan, each monthly reporting period in the monthly performance data file combines data elements from multiple reporting cycles and systems at Freddie Mac. As such, perceived data anomalies may be a result of timing mismatches between default/delinquency reporting cycles and investor reporting cycles. Examples of some commonly occurring anomalies in the data are included throughout this section. In all cases, the best information available at the time the Dataset is generated, subject to operational constraints, is used.

So subset for the first two years of (possible) payments, resulting in *up to* 24 observations per mortgage (but potentially fewer given the data cleanliness issues).

After this subsetting, store the number of remaining observations (not mortgages, observations) in `results` under the key `"ex5_num_obs"`.
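One possible sketch of this subsetting, with a hypothetical `"id"` column standing in for the loan identifier and rows assumed sorted by reporting period within each loan. Note that `cumcount` numbers *records*, not calendar months, so if months are missing mid-series it may be safer to use a loan-age field from the file instead:

```
import pandas as pd

# Toy servicing data: loan "A" has 30 monthly records, loan "B" has 10.
svcg = pd.DataFrame({"id": ["A"] * 30 + ["B"] * 10})

# Number each loan's records 0, 1, 2, ... (assumes rows already sorted
# by reporting period within each loan).
svcg["month_num"] = svcg.groupby("id").cumcount()

# Keep only the first 24 (possible) payment months per loan.
first_two_years = svcg[svcg["month_num"] < 24]
print(len(first_two_years))  # 24 records for A plus all 10 for B -> 34
```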

### Exercise 6

For each unique mortgage in your dataset, create an indicator variable that takes on a value of 1 if, at any time during this period, the mortgage has been delinquent.

Delinquency status is stored in the variable `CURRENT LOAN DELINQUENCY STATUS`, and is coded as:

CURRENT LOAN DELINQUENCY STATUS – A value corresponding to the number of days the borrower is delinquent, based on the due date of last paid installment (“DDLPI”) reported by servicers to Freddie Mac, and is calculated under the Mortgage Bankers Association (MBA) method. If a loan has been acquired by REO, then the Current Loan Delinquency Status will reflect the value corresponding to that status (instead of the value corresponding to the number of days the borrower is delinquent).

- 0 = Current, or less than 30 days delinquent
- 1 = 30-59 days delinquent
- 2 = 60-89 days delinquent
- 3 = 90-119 days delinquent
- And so on…
- RA = REO Acquisition
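The ever-delinquent indicator can be sketched with a groupby-max on toy data (the `"id"` and `"status"` column names are stand-ins, not the dataset's actual names). Note that because the status field mixes numbers with the string `"RA"`, it arrives as strings, and whether REO acquisition should count as delinquent is a judgment call worth documenting:

```
import pandas as pd

# Toy version of the monthly delinquency field.
svcg = pd.DataFrame({
    "id":     ["A", "A", "A", "B", "B", "C"],
    "status": ["0", "1", "0", "0", "0", "RA"],
})

# Treat anything other than "0" (including "RA") as delinquent this month.
svcg["delinquent_now"] = (svcg["status"] != "0").astype(int)

# max() over each loan's monthly records: 1 if ever delinquent, else 0.
ever = svcg.groupby("id")["delinquent_now"].max().rename("ever_delinquent")
print(ever.to_dict())  # {'A': 1, 'B': 0, 'C': 1}
```

Merging `ever` back onto the loan-level data gives each mortgage its indicator.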

### Exercise 7

At this point, you should be able to drop all servicing variables reported on a monthly basis and just keep information about the original mortgage issuance (and still keep an indicator for whether the mortgage has ever been delinquent).

Store the final number of mortgages in your data under `ex7_num_mortgages` and the share (between 0 and 1) of mortgages that have been delinquent under `ex7_share_delinquent`.

## Modelling Delinquency Risk

Your data should now be relatively tidy, in the technical sense of the term. And that means it should be relatively straightforward for you to build a model that answers the question “Given the features of a newly originated mortgage, how likely is the mortgage holder to fall into delinquency within the first two years after origination?”

### Exercise 8

First, we need to identify the target for our model, identify useful predictors from the data, and do feature engineering.

Let’s begin with identifying some features that probably *aren’t* going to be useful. For example, `"Metropolitan Statistical Area (MSA) Or Metropolitan Division"` is probably *not* an appropriate feature to include in this analysis. Can you figure out why? Make sure to show (quantitatively) why not.

Hint: the reason should be more than just the missing data rate.

Hint 2: how many observations for a given city do you think you’d need to determine if that city had especially high mortgage delinquency rates?

Hint 3: if not all possible values of a variable are present in your training data, what problem might that cause during testing and deployment?

### Exercise 9

For your analysis, include the following variables:

```
Credit Score
First Time Homebuyer Flag
Number of Units
Mortgage Insurance Percentage (MI %)
Occupancy Status
Original Debt-to-Income (DTI) Ratio
Original UPB
Original Loan-to-Value (LTV)
Original Interest Rate
Channel
Prepayment Penalty Mortgage (PPM) Flag
Amortization Type (Formerly Product Type)
Property State
Property Type
Original Loan Term
Number of Borrowers
Interest Only (I/O) Indicator
```

Be sure to clean these variables. When doing so, please treat missing data as missing (e.g., `np.nan`, not as a distinct category).
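A minimal sketch of converting sentinel codes to proper missing values. The sentinel codes (9999, 999) and column names below are illustrative placeholders; check the Freddie Mac documentation for the actual codes each field uses to flag "not available":

```
import numpy as np
import pandas as pd

# Toy data with placeholder sentinel codes for missing values.
df = pd.DataFrame({
    "credit_score": [712, 9999, 680],
    "dti": [35, 999, 28],
})

# Replace each column's sentinel with np.nan so pandas treats it as missing.
df = df.replace({"credit_score": {9999: np.nan}, "dti": {999: np.nan}})
print(df["credit_score"].isna().sum())  # 1
```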

### Exercise 10

The next step in our analysis is to convert our categorical variables to one-hot encodings and use `train_test_split` to split our data.

To ensure replicability, **before** you `train_test_split` your data, please sort your data by `Loan Sequence Number`. This will ensure that when we split the data with a random seed below, everyone gets the same split and the autograder will function.

You may create your one-hot encodings however you wish, but I’m a fan of the patsy library’s `dmatrices` function.

Hint: You should end up with 8 categorical variables, including some binary flags plus `Number_of_Borrowers` and `Number_of_Units` (which you could argue should be continuous, but I think are better treated as categorical).

Store the number of observations in your final dataset in `ex10_num_obs`
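If you prefer not to use patsy, `pd.get_dummies` is one alternative way to build the one-hot encodings. The column names below are hypothetical stand-ins for the cleaned mortgage features:

```
import pandas as pd

# Toy data: one continuous column, two categoricals (hypothetical names).
df = pd.DataFrame({
    "Credit_Score": [712, 680, 745],
    "Occupancy_Status": ["P", "I", "S"],
    "Channel": ["R", "B", "R"],
})

# drop_first=True omits one category per variable, which avoids perfect
# collinearity in linear models (tree-based models don't strictly need it).
encoded = pd.get_dummies(
    df, columns=["Occupancy_Status", "Channel"], drop_first=True
)
print(encoded.shape)  # (3, 4)
```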

### Exercise 11

Use `train_test_split` from `sklearn.model_selection` to split the data. Use `0.2` as the `test_size` and use `random_state=42`.

### Exercise 12

Now fit a `GradientBoostingClassifier` to the data (from `sklearn.ensemble`). Set `random_state=42`. Using `roc_auc_score`, compute your ROC AUC score against the test data.

**Please round your answer to four decimal places** - sometimes sklearn models have very small instabilities even after setting a random seed.

Store in `results` under the key `"ex12_roc_auc"`
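The fit-and-score pattern looks like this on synthetic stand-in data (the real features and target come from your cleaned mortgage dataset):

```
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the mortgage data (~90% negatives).
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# roc_auc_score wants positive-class probabilities, not predicted labels.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(round(auc, 4))
```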

### Exercise 13

Use the `predict` method to generate a confusion matrix. What problem do you see with the result?

### Exercise 14

To address the problem from Exercise 13, use `.predict_proba()` to set your own threshold for classification. Your stakeholder is mostly concerned with False Negatives (mortgages classified as safe that actually are not), so use an 8% probability threshold to get a good balance of a low False Omission Rate (the share of predicted negatives that are false negatives) with a reasonable number of mortgages still being considered “viable.”

In other words, treat a mortgage as risky (`1`) if the predicted probability is greater than 0.08. Be sure to use “greater than” rather than “greater than or equal to”.

What is the False Omission Rate at an 8% classification threshold from the model above?

Store the result under the key `"ex14_false_omission_rate"`.
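The thresholding and the False Omission Rate, FOR = FN / (FN + TN), can be sketched with made-up probabilities and outcomes:

```
import numpy as np

# Made-up predicted probabilities and realized outcomes (1 = delinquent).
proba = np.array([0.02, 0.05, 0.09, 0.30, 0.07, 0.01])
actual = np.array([0, 0, 1, 1, 1, 0])

# Strictly greater than the 8% cutoff, per the exercise instructions.
pred = (proba > 0.08).astype(int)

predicted_negative = pred == 0
fn = np.sum(predicted_negative & (actual == 1))  # missed delinquencies
tn = np.sum(predicted_negative & (actual == 0))  # correctly flagged safe
false_omission_rate = fn / (fn + tn)
print(false_omission_rate)  # 1 of the 4 predicted negatives is bad -> 0.25
```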

### Exercise 15

Your stakeholder wants to buy as many mortgages as it can while maintaining a delinquency rate of purchased mortgages below 5%. Based on your answer above, do you feel like your model can provide that level of performance?

## Now To The Future

The preceding analysis is precisely the type of analysis you would do if, in late 2006, you’d been asked to evaluate mortgage performance in the last two years for use going forward. So let’s see how your model performs now!

In this folder you will find data on mortgages originated in 2007 along with servicing data from 2007, 2008, and 2009.

### Exercise 16

Please load this data (again, from a URL to help the autograder) and clean it in the same manner as before. As a sanity check, how many observations do you have in the final dataset (after you’ve removed observations with missing values to allow you to generate predicted delinquency rates)?

Store the final number of observations in `"ex16_num_obs"` and the share of those mortgages that are delinquent in `"ex16_share_delinquent"`.

### Exercise 17

Had your stakeholder purchased mortgages using your model (with the 8% cutoff), what would the resulting False Omission Rate (share of predicted negatives that are false negatives) have been? (e.g., compare the predicted values for mortgages using the model trained above with realized outcomes). Store your result under the key `"ex17_false_omission_rate"`.

### Exercise 18

How did the retrospective performance of your model compare to its actual performance moving forward? Why? Did you stay below the stakeholder's 5% target for the share of mortgages that were predicted to be safe but in the end were not?