Predicting Mortgage Delinquency Risk

Note: this is a new exercise, so if you find something weird, please bring it to my attention.

You have been hired by a mortgage servicing firm (a company that buys mortgages and then collects mortgage payments from homeowners) to build a model to answer the question:

Given all available information about a newly issued mortgage, what is the likelihood that the mortgage will enter delinquency (the homeowner will be at least 30 days late on a mortgage payment) during the first two years of the mortgage?

The servicer’s hope, obviously, is to differentiate between mortgages it should try to purchase (those that will be paid consistently) and mortgages it should avoid.

For this task, you have been given REAL data on a sample of all US Standard single-family home mortgages purchased or insured by Freddie Mac in a single calendar year, along with payment data from that year and the two subsequent years.

Gradescope Autograding

Please follow all standard guidance for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called results and ensuring your notebook runs from the start to completion without any errors.

For this assignment, please name your file exercise_passive_prediction.ipynb before uploading.

You can check that you have answers for all questions in your results dictionary with this code:

assert set(results.keys()) == {
    "ex2_merge_type",
    "ex4_num_mortgages",
    "ex5_num_obs",
    "ex7_num_mortgages",
    "ex7_share_delinquent",
    "ex10_num_obs",
    "ex12_roc_auc",
    "ex14_false_omission_rate",
    "ex16_num_obs",
    "ex16_share_delinquent",
    "ex17_false_omission_rate",
}

Submission Limits

Please remember that you are only allowed FOUR submissions to the autograder. Your last submission (if you submit 4 or fewer times) or your fourth submission (if you submit more than 4 times) will determine your grade. Submissions that error out will not count against this total.

That’s one more than usual in case there are issues with exercise clarity.

Data Cleaning and Organization

Data for this exercise can be found here. This folder includes both the data to be used and documentation, though you can find supplemental documentation here.

The only difference between this data and the original Freddie Mac sample data is that I’ve limited the scope of the servicing data to three calendar years.

Exercise 1

Begin by loading both:

  • the mortgage origination file (sample_orig_2004.txt). This should contain information on all mortgages issued in 2004, along with non-time varying features of these mortgages (the initial amount, the credit score of the applicant, etc.), and

  • the servicing data (sample_svcg_2004orig_3years.txt). This contains monthly records of all recorded payments (or non-payments) for all mortgages issued in 2004 during the calendar years of 2004, 2005, and 2006.

So that the autograder can see the data, be sure to load it directly from a URL (don’t download and load from your own system).

Load the data AND ensure your data has column names. You will likely need to reference the documentation to figure out how to do so.

[1]:
import pandas as pd
import numpy as np

pd.set_option("mode.copy_on_write", True)
[2]:
origination_colnames = [
    "Credit Score",
    "First Payment Date",
    "First Time Homebuyer Flag",
    "Maturity Date",
    "Metropolitan Statistical Area (MSA) Or Metropolitan Division",
    "Mortgage Insurance Percentage (MI %)",
    "Number of Units",
    "Occupancy Status",
    "Original Combined Loan-to-Value (CLTV)",
    "Original Debt-to-Income (DTI) Ratio",
    "Original UPB",
    "Original Loan-to-Value (LTV)",
    "Original Interest Rate",
    "Channel",
    "Prepayment Penalty Mortgage (PPM) Flag",
    "Amortization Type (Formerly Product Type)",
    "Property State",
    "Property Type",
    "Postal Code",
    "Loan Sequence Number",
    "Loan Purpose",
    "Original Loan Term",
    "Number of Borrowers",
    "Seller Name",
    "Servicer Name",
    "Super Conforming Flag",
    "Pre-HARP Loan Sequence Number",
    "Program Indicator",
    "HARP Indicator",
    "Property Valuation Method",
    "Interest Only (I/O) Indicator",
    "Mortgage Insurance Cancellation Indicator",
]
[3]:
service_colnames = [
    "Loan Sequence Number",
    "Monthly Reporting Period",
    "Current Actual UPB",
    "Current Loan Delinquency Status",
    "Loan Age",
    "Remaining Months to Legal Maturity",
    "Defect Settlement Date",
    "Modification Flag",
    "Zero Balance Code",
    "Zero Balance Effective Date",
    "Current Interest Rate",
    "Current Deferred UPB",
    "Due Date of Last Paid Installment (DDLPI)",
    "MI Recoveries",
    "Net Sales Proceeds",
    "Non MI Recoveries",
    "Expenses",
    "Legal Costs",
    "Maintenance and Preservation Costs",
    "Taxes and Insurance",
    "Miscellaneous Expenses",
    "Actual Loss Calculation",
    "Modification Cost",
    "Step Modification Flag",
    "Deferred Payment Plan",
    "Estimated Loan-to-Value (ELTV)",
    "Zero Balance Removal UPB",
    "Delinquent Accrued Interest",
    "Delinquency Due to Disaster",
    "Borrower Assistance Status Code",
    "Current Month Modification Cost",
    "Interest Bearing UPB",
]
[4]:
mortgages = pd.read_csv(
    "https://github.com/nickeubank/MIDS_Data/raw/master/mortgages/2004/sample_orig_2004.txt",
    sep="|",
    names=origination_colnames,
)
servicing = pd.read_csv(
    "https://github.com/nickeubank/MIDS_Data/raw/master/mortgages/2004/sample_svcg_2004orig_3years.txt",
    sep="|",
    names=service_colnames,
)

print(f"2004 data has {len(mortgages):,.0f} new mortgages")
print(f"2004 data has {len(servicing):,.0f} servicing records")
2004 data has 50,000 new mortgages
2004 data has 1,287,161 servicing records
/var/folders/fs/h_8_rwsn5hvg9mhp0txgc_s9v6191b/T/ipykernel_20548/3651092739.py:6: DtypeWarning: Columns (4,8,24) have mixed types. Specify dtype option on import or set low_memory=False.
  servicing = pd.read_csv(
[5]:
# Life becomes way easier if I add this here,
# though you may not realize it till
# around exercise 7
servicing = servicing[
    [
        "Monthly Reporting Period",
        "Current Loan Delinquency Status",
        "Loan Sequence Number",
    ]
]

Exercise 2

What is the unit of observation in sample_orig_2004.txt and in sample_svcg_2004orig_3years.txt?
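One quick, optional way to check this empirically (a minimal sketch, not required for the answer) is to test whether Loan Sequence Number uniquely identifies rows in each file:

# Does "Loan Sequence Number" uniquely identify rows in each file?
print(mortgages["Loan Sequence Number"].is_unique)  # expect True: one row per mortgage
print(servicing["Loan Sequence Number"].is_unique)  # expect False: many rows per mortgage

# In the servicing file, a mortgage-month pair should identify a row:
print(
    not servicing.duplicated(
        subset=["Loan Sequence Number", "Monthly Reporting Period"]
    ).any()
)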

Exercise 3

Merge your two datasets. Be sure to use the validate keyword argument in merge.

You will find some records in the origination file that are not in the servicing file. We need data from both files, so just do an inner join.

Assuming that you list the data associated with sample_orig_2004.txt first and sample_svcg_2004orig_3years.txt second, what keyword are you passing to validate? Store your answer as a string (use one of: "1:1", "m:1", "1:m", "m:m") in a dictionary called results under the key ex2_merge_type.

[6]:
results = {}
results["ex2_merge_type"] = "1:m"
[7]:
combined = pd.merge(
    mortgages,
    servicing,
    on="Loan Sequence Number",
    how="inner",
    validate="1:m",
    indicator=True,
)
combined._merge.value_counts()
assert (combined._merge != "right_only").all()
[8]:
combined.sample().T
[8]:
163509
Credit Score 711
First Payment Date 200404
First Time Homebuyer Flag N
Maturity Date 201903
Metropolitan Statistical Area (MSA) Or Metropolitan Division NaN
Mortgage Insurance Percentage (MI %) 0
Number of Units 1
Occupancy Status P
Original Combined Loan-to-Value (CLTV) 67
Original Debt-to-Income (DTI) Ratio 21
Original UPB 93000
Original Loan-to-Value (LTV) 67
Original Interest Rate 5.375
Channel T
Prepayment Penalty Mortgage (PPM) Flag N
Amortization Type (Formerly Product Type) FRM
Property State MO
Property Type SF
Postal Code 63300
Loan Sequence Number F04Q10238951
Loan Purpose N
Original Loan Term 180
Number of Borrowers 2
Seller Name ABN AMRO MORTGAGE GROUP, INC.
Servicer Name CITIMORTGAGE, INC.
Super Conforming Flag NaN
Pre-HARP Loan Sequence Number NaN
Program Indicator 9
HARP Indicator NaN
Property Valuation Method 9
Interest Only (I/O) Indicator N
Mortgage Insurance Cancellation Indicator 9
Monthly Reporting Period 200509
Current Loan Delinquency Status 0
_merge both

Exercise 4

Mortgages come in many shapes and flavors; however, your servicer is only interested in predicting delinquency for the more standard form of mortgage. Subset your data to only include:

  • Mortgages taken out for purchase of a property,

  • With first payments due in the quarter of origination or the first quarter after origination.

(In a perfect world we would just limit our analysis to mortgages where the first payment is due the month after origination. Unfortunately we only know the quarter of origination, so the only way to subset for relatively vanilla mortgages is to look for mortgages where the first payment was due in the same quarter or the quarter after origination.)

Subset for these mortgages. How many unique mortgages remain in the data?

Hint: You may need to read the documentation for the Loan Sequence Number variable.

Store the resulting number of unique mortgages in results under the key ex4_num_mortgages.

[9]:
# For purchase
combined = combined[combined["Loan Purpose"] == "P"]

# First payment
combined["year_first"] = combined["First Payment Date"] // 100
combined["month_first"] = combined["First Payment Date"] % 100
combined["quarter_first"] = ((combined["month_first"] - 1) // 3) + 1
assert combined["quarter_first"].isin(range(1, 5)).all()

combined["year_decimal_first"] = (
    combined["year_first"] + (combined["month_first"] - 1) / 12
)

# Origination. Get from loan sequence number.
combined["year_orig"] = combined["Loan Sequence Number"].str.get(2).astype("int") + 2000
assert (combined["year_orig"] == 2004).all()
combined["quarter_orig"] = combined["Loan Sequence Number"].str.get(4).astype("int")
assert combined["quarter_orig"].isin(range(1, 5)).all()


# Keep only if same or following quarter
same_quarter = (combined["year_first"] == combined["year_orig"]) & (
    combined["quarter_first"] == combined["quarter_orig"]
)
next_quarter_same_year = (combined["year_first"] == combined["year_orig"]) & (
    combined["quarter_first"] == (combined["quarter_orig"] + 1)
)
next_quarter_next_year = (
    (combined["year_first"] == (combined["year_orig"] + 1))
    & (combined["quarter_first"] == 1)
    & (combined["quarter_orig"] == 4)
)
combined = combined[same_quarter | next_quarter_same_year | next_quarter_next_year]

results["ex4_num_mortgages"] = combined["Loan Sequence Number"].nunique()
print(
    f"After subsetting there are {results['ex4_num_mortgages']:,.0f} unique mortgages"
)
After subsetting there are 17,504 unique mortgages

Exercise 5

The servicer wants to predict delinquency during the first 24 payment due dates (you may assume payments are due every month starting with the month the first payment is due). Subset the data to these first 24 (possible) payment due dates.

Note that not all loans will have 24 records in the servicing file in the first 24 months, as a result of data merging issues on Freddie Mac’s side. As noted in the Freddie Mac documentation:

For a given loan, each monthly reporting period in the monthly performance data file combines data elements from multiple reporting cycles and systems at Freddie Mac. As such, perceived data anomalies may be a result of timing mismatches between default/delinquency reporting cycles and investor reporting cycles. Examples of some commonly occurring anomalies in the data are included throughout this section. In all cases, the best information available at the time the Dataset is generated, subject to operational constraints, is used.

So subset for the first two years of (possible) payments, resulting in up to 24 observations per mortgage (but potentially fewer, given the data cleanliness issues).

After this subsetting, store the number of remaining observations (not mortgages, observations) in results under the key "ex5_num_obs".

[10]:
combined["year_reporting"] = combined["Monthly Reporting Period"] // 100
combined["month_reporting"] = combined["Monthly Reporting Period"] % 100
combined["year_decimal_reporting"] = (
    combined["year_reporting"] + (combined["month_reporting"] - 1) / 12
)

combined["age"] = np.round(
    (combined["year_decimal_reporting"] - combined["year_decimal_first"]), 4
)
combined["age"].value_counts().sort_index()
[10]:
age
-0.0833    14436
 0.0000    16613
 0.0833    16970
 0.1667    17088
 0.2500    17051
 0.3333    16975
 0.4167    16899
 0.5000    16788
 0.5833    16629
 0.6667    16494
 0.7500    16342
 0.8333    16203
 0.9167    16038
 1.0000    15861
 1.0833    15683
 1.1667    15535
 1.2500    15370
 1.3333    15192
 1.4167    15052
 1.5000    14940
 1.5833    14823
 1.6667    14704
 1.7500    14586
 1.8333    14466
 1.9167    13159
 2.0000    11743
 2.0833    10369
 2.1667     8766
 2.2500     7129
 2.3333     5561
 2.4167     4474
 2.5000     3416
 2.5833     2342
 2.6667     1397
 2.7500      600
 2.8333       16
Name: count, dtype: int64
[11]:
two_years = combined[(0 <= combined["age"]) & (combined["age"] < 2)]

# Make sure I don't have an off-by-one problem.
assert len(two_years["age"].value_counts()) == 24
[12]:
results["ex5_num_obs"] = len(two_years)
print(f"Number of obs is {results['ex5_num_obs']:,.0f}")
Number of obs is 379,461

Exercise 6

For each unique mortgage in your dataset, create an indicator variable that takes on a value of 1 if, at any time during this period, the mortgage has been delinquent.

Delinquency status is stored in the variable CURRENT LOAN DELINQUENCY STATUS, and is coded as:

CURRENT LOAN DELINQUENCY STATUS – A value corresponding to the number of days the borrower is delinquent, based on the due date of last paid installment (“DDLPI”) reported by servicers to Freddie Mac, and is calculated under the Mortgage Bankers Association (MBA) method. If a loan has been acquired by REO, then the Current Loan Delinquency Status will reflect the value corresponding to that status (instead of the value corresponding to the number of days the borrower is delinquent).

0 = Current, or less than 30 days delinquent

1 = 30-59 days delinquent

2 = 60–89 days delinquent

3 = 90–119 days delinquent

And so on…

RA = REO Acquisition

[13]:
# Current Loan Delinquency Status
two_years["Current Loan Delinquency Status"] = (
    two_years["Current Loan Delinquency Status"].replace("RA", 99).astype("int")
)

two_years["ever_delinquent"] = (
    two_years.groupby("Loan Sequence Number")[
        "Current Loan Delinquency Status"
    ].transform("max")
    > 0
).astype("int")

Exercise 7

At this point, you should be able to drop all servicing variables reported on a monthly basis and just keep information about the original mortgage issuance (and still keep an indicator for whether the mortgage has ever been delinquent).

Store the final number of mortgages in your data under ex7_num_mortgages and the share (between 0 and 1) of mortgages that have been delinquent under ex7_share_delinquent.

[14]:
two_years.columns
[14]:
Index(['Credit Score', 'First Payment Date', 'First Time Homebuyer Flag',
       'Maturity Date',
       'Metropolitan Statistical Area (MSA) Or Metropolitan Division',
       'Mortgage Insurance Percentage (MI %)', 'Number of Units',
       'Occupancy Status', 'Original Combined Loan-to-Value (CLTV)',
       'Original Debt-to-Income (DTI) Ratio', 'Original UPB',
       'Original Loan-to-Value (LTV)', 'Original Interest Rate', 'Channel',
       'Prepayment Penalty Mortgage (PPM) Flag',
       'Amortization Type (Formerly Product Type)', 'Property State',
       'Property Type', 'Postal Code', 'Loan Sequence Number', 'Loan Purpose',
       'Original Loan Term', 'Number of Borrowers', 'Seller Name',
       'Servicer Name', 'Super Conforming Flag',
       'Pre-HARP Loan Sequence Number', 'Program Indicator', 'HARP Indicator',
       'Property Valuation Method', 'Interest Only (I/O) Indicator',
       'Mortgage Insurance Cancellation Indicator', 'Monthly Reporting Period',
       'Current Loan Delinquency Status', '_merge', 'year_first',
       'month_first', 'quarter_first', 'year_decimal_first', 'year_orig',
       'quarter_orig', 'year_reporting', 'month_reporting',
       'year_decimal_reporting', 'age', 'ever_delinquent'],
      dtype='object')
[15]:
# This is made easier by drops earlier.
two_years = two_years.drop(
    columns=[
        "Current Loan Delinquency Status",
        "Monthly Reporting Period",
        "month_reporting",
        "year_reporting",
        "year_decimal_reporting",
        "age",
        "_merge",
    ]
)
two_years = two_years.drop_duplicates()
assert two_years["Loan Sequence Number"].is_unique
results["ex7_share_delinquent"] = two_years.ever_delinquent.mean()
results["ex7_num_mortgages"] = len(two_years)

print(
    f"There are now {results['ex7_num_mortgages']:,.0f} unique mortgages in the data."
)
print(
    f"During the period studied, {results['ex7_share_delinquent']:.1%}"
    " of loans are delinquent at some point."
)
There are now 17,471 unique mortgages in the data.
During the period studied, 7.1% of loans are delinquent at some point.

Modelling Delinquency Risk

Your data should now be relatively tidy, in the technical sense of the term. And that means it should be relatively straightforward for you to build a model that answers the question “Given the features of a newly originated mortgage, how likely is the mortgage holder to fall into delinquency within the first two years after origination?”

Exercise 8

First, we need to identify the target for our model and useful predictors from the data, and do some feature engineering.

Let’s begin by identifying some features that probably aren’t going to be useful. For example, "Metropolitan Statistical Area (MSA) Or Metropolitan Division" is probably not an appropriate feature to include in this analysis. Can you figure out why? Make sure to show (quantitatively) why not.

Hint: the reason should involve more than just the missing data rate.

Hint 2: how many observations for a given city do you think you’d need to determine if that city had especially high mortgage delinquency rates?

Hint 3: if not all possible values of a variable are present in your training data, what problem might that cause during testing and deployment?

[16]:
missing = (
    two_years["Metropolitan Statistical Area (MSA) Or Metropolitan Division"]
    .isna()
    .mean()
)
print(f"MSA Missing for {missing:.1%} of observations.")
MSA Missing for 27.6% of observations.
[17]:
has_msa = two_years[
    two_years["Metropolitan Statistical Area (MSA) Or Metropolitan Division"].notnull()
]
has_msa["msa_has_delinquent"] = has_msa.groupby(
    "Metropolitan Statistical Area (MSA) Or Metropolitan Division"
)["ever_delinquent"].transform(np.sum)

no_delinquents = (has_msa["msa_has_delinquent"] == 0).mean()
few_delinquents = (has_msa["msa_has_delinquent"] < 4).mean()

print(
    f"Another {no_delinquents:.1%} of observations are in MSAs with zero delinquencies, "
    f"and {few_delinquents:.1%} are in MSAs with fewer than 4 delinquencies."
)
Another 10.0% of observations are in MSAs with zero delinquencies, and 39.1% are in MSAs with fewer than 4 delinquencies.

It’s missing for about 27% of the mortgages in the data, another 10% of observations are in MSAs with zero delinquencies, and 39% are in MSAs with 3 or fewer. That’s just too sparse: when coefficients are estimated for those categories, they’re bound to over-fit.

Put differently: would you be comfortable estimating the delinquency rate for, say, the city of Denver, CO from three or fewer observations? Of course not, but when you generate predictions for new data, that’s precisely what you’re doing: applying a delinquency estimate derived from a handful of observations to any new observation in that MSA.
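To make the sparsity concrete, here is a quick illustrative sketch (not part of the required solution) summarizing how many loans each MSA contributes. It also gestures at hint 3: any MSA that never appears in the training split would show up as an unseen category at test time.

# Illustrative: per-MSA sample sizes and delinquency counts.
msa_col = "Metropolitan Statistical Area (MSA) Or Metropolitan Division"
msa_summary = has_msa.groupby(msa_col)["ever_delinquent"].agg(["size", "sum"])

# Many MSAs contribute only a handful of loans, so per-MSA delinquency
# estimates would rest on tiny samples.
print(f"{(msa_summary['size'] < 30).mean():.1%} of MSAs have fewer than 30 loans")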

Exercise 9

For your analysis, include the following variables:

Credit Score
First Time Homebuyer Flag
Number of Units
Mortgage Insurance Percentage (MI %)
Occupancy Status
Original Debt-to-Income (DTI) Ratio
Original UPB
Original Loan-to-Value (LTV)
Original Interest Rate
Channel
Prepayment Penalty Mortgage (PPM) Flag
Amortization Type (Formerly Product Type)
Property State
Property Type
Original Loan Term
Number of Borrowers
Interest Only (I/O) Indicator

Be sure to clean these variables. When doing so, please treat missing data as missing (e.g., np.nan, not as a distinct category).

[18]:
# Clean up!
for i in [
    "Channel",
    "Property Valuation Method",
    "First Time Homebuyer Flag",
    "Occupancy Status",
]:
    two_years[i] = two_years[i].replace(9, np.nan)
    two_years[i] = two_years[i].replace("9", np.nan)

for i in ["Number of Units", "Property Type", "Number of Borrowers"]:
    two_years[i] = two_years[i].replace(99, np.nan)
    two_years[i] = two_years[i].replace("99", np.nan)


for i in [
    "Mortgage Insurance Percentage (MI %)",
    "Original Debt-to-Income (DTI) Ratio",
    "Original Loan-to-Value (LTV)",
]:
    two_years[i] = two_years[i].replace(999, np.nan)
    two_years[i] = two_years[i].replace("999", np.nan)

two_years["Credit Score"] = two_years["Credit Score"].replace(9999, np.nan)
two_years["Credit Score"] = two_years["Credit Score"].replace("9999", np.nan)
[19]:
for_ml = two_years[
    [
        "Credit Score",
        "First Time Homebuyer Flag",
        "Number of Units",
        "Mortgage Insurance Percentage (MI %)",
        "Occupancy Status",
        "Original Debt-to-Income (DTI) Ratio",
        "Original UPB",
        "Original Loan-to-Value (LTV)",
        "Original Interest Rate",
        "Channel",
        "Prepayment Penalty Mortgage (PPM) Flag",
        "Amortization Type (Formerly Product Type)",
        "Property State",
        "Property Type",
        "Original Loan Term",
        "Number of Borrowers",
        "Interest Only (I/O) Indicator",
        "ever_delinquent",
        "Loan Sequence Number",
    ]
]

Exercise 10

The next step in our analysis is to convert our categorical variables to one-hot-encodings and use train_test_split to split our data.

To ensure replicability, before you train_test_split your data, please sort your data by Loan Sequence Number. This ensures that, when we split the data with a random seed below, everyone gets the same split and the autograder functions correctly.

You may create your one-hot-encodings however you wish, but I’m a fan of the patsy library’s dmatrices function.

Hint: You should end up with 8 categorical variables, including some binary flags and Number_of_Borrowers, Number_of_Units (which you could argue should be continuous, but I think are better treated as categorical).

Store the number of observations in your final dataset in ex10_num_obs.

[20]:
for_ml = for_ml.sort_values("Loan Sequence Number")

import patsy
import re

# Patsy can't handle this type of punctuation in formulas
for_ml.columns = [re.sub(" ", "_", c) for c in for_ml.columns]
for_ml.columns = [re.sub("%", "", c) for c in for_ml.columns]
for_ml.columns = [re.sub("/", "", c) for c in for_ml.columns]
for_ml.columns = [re.sub("\(", "", c) for c in for_ml.columns]
for_ml.columns = [re.sub("\)", "", c) for c in for_ml.columns]
for_ml.columns = [re.sub("-", "", c) for c in for_ml.columns]

for_ml.columns
[20]:
Index(['Credit_Score', 'First_Time_Homebuyer_Flag', 'Number_of_Units',
       'Mortgage_Insurance_Percentage_MI_', 'Occupancy_Status',
       'Original_DebttoIncome_DTI_Ratio', 'Original_UPB',
       'Original_LoantoValue_LTV', 'Original_Interest_Rate', 'Channel',
       'Prepayment_Penalty_Mortgage_PPM_Flag',
       'Amortization_Type_Formerly_Product_Type', 'Property_State',
       'Property_Type', 'Original_Loan_Term', 'Number_of_Borrowers',
       'Interest_Only_IO_Indicator', 'ever_delinquent',
       'Loan_Sequence_Number'],
      dtype='object')
[21]:
for_ml["Property_Type"].value_counts()
[21]:
Property_Type
SF    12117
PU     3224
CO     1741
MH      264
CP      120
Name: count, dtype: int64
[22]:
y, X = patsy.dmatrices(
    "ever_delinquent ~ Credit_Score"
    "+ First_Time_Homebuyer_Flag"
    "+ C(Number_of_Units)"
    "+ Mortgage_Insurance_Percentage_MI_"
    "+ Occupancy_Status"
    "+ Original_DebttoIncome_DTI_Ratio"
    "+ Original_UPB"
    "+ Original_LoantoValue_LTV"
    "+ Original_Interest_Rate"
    "+ Channel"
    "+ Prepayment_Penalty_Mortgage_PPM_Flag"
    "+ Amortization_Type_Formerly_Product_Type"
    "+ Property_State"
    "+ Property_Type"
    "+ Original_Loan_Term"
    "+ C(Number_of_Borrowers)"
    "+ Interest_Only_IO_Indicator",
    for_ml,
)
[23]:
# Eyeball check result
X
[23]:
DesignMatrix with shape (17052, 76)
  Columns:
    ['Intercept',
     'First_Time_Homebuyer_Flag[T.Y]',
     'C(Number_of_Units)[T.2.0]',
     'C(Number_of_Units)[T.3.0]',
     'C(Number_of_Units)[T.4.0]',
     'Occupancy_Status[T.P]',
     'Occupancy_Status[T.S]',
     'Channel[T.C]',
     'Channel[T.R]',
     'Channel[T.T]',
     'Prepayment_Penalty_Mortgage_PPM_Flag[T.Y]',
     'Property_State[T.AL]',
     'Property_State[T.AR]',
     'Property_State[T.AZ]',
     'Property_State[T.CA]',
     'Property_State[T.CO]',
     'Property_State[T.CT]',
     'Property_State[T.DC]',
     'Property_State[T.DE]',
     'Property_State[T.FL]',
     'Property_State[T.GA]',
     'Property_State[T.GU]',
     'Property_State[T.HI]',
     'Property_State[T.IA]',
     'Property_State[T.ID]',
     'Property_State[T.IL]',
     'Property_State[T.IN]',
     'Property_State[T.KS]',
     'Property_State[T.KY]',
     'Property_State[T.LA]',
     'Property_State[T.MA]',
     'Property_State[T.MD]',
     'Property_State[T.ME]',
     'Property_State[T.MI]',
     'Property_State[T.MN]',
     'Property_State[T.MO]',
     'Property_State[T.MS]',
     'Property_State[T.MT]',
     'Property_State[T.NC]',
     'Property_State[T.ND]',
     'Property_State[T.NE]',
     'Property_State[T.NH]',
     'Property_State[T.NJ]',
     'Property_State[T.NM]',
     'Property_State[T.NV]',
     'Property_State[T.NY]',
     'Property_State[T.OH]',
     'Property_State[T.OK]',
     'Property_State[T.OR]',
     'Property_State[T.PA]',
     'Property_State[T.PR]',
     'Property_State[T.RI]',
     'Property_State[T.SC]',
     'Property_State[T.SD]',
     'Property_State[T.TN]',
     'Property_State[T.TX]',
     'Property_State[T.UT]',
     'Property_State[T.VA]',
     'Property_State[T.VI]',
     'Property_State[T.VT]',
     'Property_State[T.WA]',
     'Property_State[T.WI]',
     'Property_State[T.WV]',
     'Property_State[T.WY]',
     'Property_Type[T.CP]',
     'Property_Type[T.MH]',
     'Property_Type[T.PU]',
     'Property_Type[T.SF]',
     'C(Number_of_Borrowers)[T.2.0]',
     'Credit_Score',
     'Mortgage_Insurance_Percentage_MI_',
     'Original_DebttoIncome_DTI_Ratio',
     'Original_UPB',
     'Original_LoantoValue_LTV',
     'Original_Interest_Rate',
     'Original_Loan_Term']
  Terms:
    'Intercept' (column 0)
    'First_Time_Homebuyer_Flag' (column 1)
    'C(Number_of_Units)' (columns 2:5)
    'Occupancy_Status' (columns 5:7)
    'Channel' (columns 7:10)
    'Prepayment_Penalty_Mortgage_PPM_Flag' (column 10)
    'Amortization_Type_Formerly_Product_Type' (columns 11:11)
    'Property_State' (columns 11:64)
    'Property_Type' (columns 64:68)
    'C(Number_of_Borrowers)' (column 68)
    'Interest_Only_IO_Indicator' (columns 69:69)
    'Credit_Score' (column 69)
    'Mortgage_Insurance_Percentage_MI_' (column 70)
    'Original_DebttoIncome_DTI_Ratio' (column 71)
    'Original_UPB' (column 72)
    'Original_LoantoValue_LTV' (column 73)
    'Original_Interest_Rate' (column 74)
    'Original_Loan_Term' (column 75)
  (to view full data, use np.asarray(this_obj))
[24]:
results["ex10_num_obs"] = X.shape[0]
print(f"The final dataset has {results['ex10_num_obs']:,.0f} observations")
The final dataset has 17,052 observations

Exercise 11

Use train_test_split from sklearn.model_selection to split the data.

When you do, use 0.2 as the test_size and set random_state=42.

[25]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Exercise 12

Now fit a GradientBoostingClassifier (from sklearn.ensemble) to the data, setting random_state=42. Then, using roc_auc_score, get your ROC AUC score against the test data.

Please round your answer to two decimal places; sklearn models sometimes have very small instabilities even after setting a random seed.

Store in results under the key "ex12_roc_auc".

[26]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix
import matplotlib.pyplot as plt


gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_train, np.ravel(y_train))  # ravel gives a 1-D target, avoiding a DataConversionWarning

# Predictions
y_pred = gb.predict(X_test)
y_pred_proba = gb.predict_proba(X_test)[:, 1]

# Score and confusion matrix
auc_score = roc_auc_score(y_test, y_pred_proba)
conf_matrix = confusion_matrix(y_test, y_pred)

results["ex12_roc_auc"] = np.round(auc_score, 2)
print(f"My ROC AUC score rounded to 2 places is {results['ex12_roc_auc']:.2f}")
My ROC AUC score rounded to 2 places is 0.71

Exercise 13

Use the predict method to generate a confusion matrix. What problem do you see with the result?

[27]:
print(f"Confusion matrix:\n{conf_matrix}")
Confusion matrix:
[[3146    9]
 [ 252    4]]

The model achieves its accuracy by almost always predicting “no delinquency” (the dominant class), a canonical issue with unbalanced data: it catches almost none of the actual delinquencies.
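To see how little the raw classifier adds, here is an illustrative comparison against a trivial baseline that always predicts the majority class:

# Illustrative: raw accuracy barely beats always predicting "no delinquency".
baseline_accuracy = 1 - np.ravel(y_test).mean()  # accuracy of always predicting 0
model_accuracy = (y_pred == np.ravel(y_test)).mean()
print(f"Baseline accuracy: {baseline_accuracy:.1%}; model accuracy: {model_accuracy:.1%}")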

Exercise 14

To address the problem from Exercise 13, use .predict_proba() to set your own classification threshold. Your stakeholder is mostly concerned with False Negatives (mortgages classified as safe that actually are not), so use an 8% probability threshold to strike a good balance between a low False Omission Rate (the share of predicted negatives that are false negatives) and a reasonable number of mortgages still being considered “viable.”

In other words, treat a mortgage as risky (1) if the predicted probability is greater than 0.08. Be sure to use “greater than” rather than “greater than or equal to”.

What is the False Omission Rate at an 8% classification threshold from the model above?

Store the result under the key "ex14_false_omission_rate".

[28]:
CUTOFF = 0.08
conf_matrix = confusion_matrix(y_test, (y_pred_proba > CUTOFF))
print(f"Confusion matrix with {CUTOFF} cutoff:\n{conf_matrix}")

results["ex14_false_omission_rate"] = float(
    conf_matrix[1, 0] / np.sum(conf_matrix[:, 0])
)
print(
    f"The False Omission Rate at {CUTOFF} is:\n{results['ex14_false_omission_rate']:.2%}"
)
Confusion matrix with 0.08 cutoff:
[[2387  768]
 [ 123  133]]
The False Omission Rate at 0.08 is:
4.90%
[29]:
pd_confusion_matrix = pd.crosstab(
    y_test.squeeze(),
    (y_pred_proba > CUTOFF).astype("int"),
    rownames=["Actual"],
    colnames=["Predicted"],
)
[30]:
pd_confusion_matrix
[30]:
Predicted     0    1
Actual
0.0        2387  768
1.0         123  133

Exercise 15

Your stakeholder wants to buy as many mortgages as it can while keeping the delinquency rate of purchased mortgages below 5%. Based on your answer above, do you feel your model can provide that level of performance?

Well, the result above says yes (4.9%), but how confident should we be in that number? We’ll see!
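One illustrative way to see the volume side of that trade-off is to check what share of test-set mortgages the model still deems “viable” at the 8% cutoff:

# Illustrative: share of test-set mortgages predicted safe at the cutoff.
viable_share = (y_pred_proba <= CUTOFF).mean()
print(f"{viable_share:.1%} of test mortgages are considered viable at the {CUTOFF:.0%} cutoff")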

Now To The Future

The preceding analysis is precisely the type of analysis you would have done if, in late 2006, you’d been asked to evaluate mortgage performance over the previous two years for use going forward. So let’s see how your model performs now!

In this folder you will find data on mortgages originated in 2007 along with servicing data from 2007, 2008, and 2009.

Exercise 16

Please load this data (again, from a URL to help the autograder) and clean it in the same manner as before. As a sanity check, how many observations do you have in the final dataset (after you’ve removed observations with missing values to allow you to generate predicted delinquency rates)?

Store the final number of observations in "ex16_num_obs" and the share of those mortgages that are delinquent in "ex16_share_delinquent"

[31]:
###########
# Load
###########

mortgages_2007 = pd.read_csv(
    "https://github.com/nickeubank/MIDS_Data/raw/master/mortgages/2007/sample_orig_2007.txt",
    sep="|",
    names=origination_colnames,
)
servicing_2007 = pd.read_csv(
    "https://github.com/nickeubank/MIDS_Data/raw/master/mortgages/2007/sample_svcg_2007orig_3years.txt",
    sep="|",
    names=service_colnames,
)

print(f"2007 data has {len(mortgages_2007):,.0f} new mortgages")
print(f"2007 data has {len(servicing_2007):,.0f} servicing records")
2007 data has 50,000 new mortgages
2007 data has 1,277,771 servicing records
/var/folders/fs/h_8_rwsn5hvg9mhp0txgc_s9v6191b/T/ipykernel_20548/507551674.py:10: DtypeWarning: Columns (8,24) have mixed types. Specify dtype option on import or set low_memory=False.
  servicing_2007 = pd.read_csv(
[32]:
########
# NOTE: In any serious real workflow,
# the way to do this would be to write a .py script
# and parameterize it, not copy-paste the code.
# This kind of duplication invites problems if, for
# example, you change code above and forget to change it
# down here as well.
########

# Life becomes way easier if I add this here,
servicing_2007 = servicing_2007[
    [
        "Monthly Reporting Period",
        "Current Loan Delinquency Status",
        "Loan Sequence Number",
    ]
]

combined_2007 = pd.merge(
    mortgages_2007,
    servicing_2007,
    on="Loan Sequence Number",
    how="inner",
    validate="1:m",
    indicator=True,
)
combined_2007._merge.value_counts()
assert (combined_2007._merge != "right_only").all()

# For purchase
combined_2007 = combined_2007[combined_2007["Loan Purpose"] == "P"]

# First payment
combined_2007["year_first"] = combined_2007["First Payment Date"] // 100
combined_2007["month_first"] = combined_2007["First Payment Date"] % 100
combined_2007["quarter_first"] = ((combined_2007["month_first"] - 1) // 3) + 1
assert combined_2007["quarter_first"].isin(range(1, 5)).all()

combined_2007["year_decimal_first"] = (
    combined_2007["year_first"] + (combined_2007["month_first"] - 1) / 12
)

# Origination. Get from loan sequence number.
combined_2007["year_orig"] = (
    combined_2007["Loan Sequence Number"].str.get(2).astype("int") + 2000
)
assert (combined_2007["year_orig"] == 2007).all()
combined_2007["quarter_orig"] = (
    combined_2007["Loan Sequence Number"].str.get(4).astype("int")
)
assert combined_2007["quarter_orig"].isin(range(1, 5)).all()


# Keep only if same or following quarter
same_quarter_2007 = (combined_2007["year_first"] == combined_2007["year_orig"]) & (
    combined_2007["quarter_first"] == combined_2007["quarter_orig"]
)
next_quarter_same_year_2007 = (
    combined_2007["year_first"] == combined_2007["year_orig"]
) & (combined_2007["quarter_first"] == (combined_2007["quarter_orig"] + 1))
next_quarter_next_year_2007 = (
    (combined_2007["year_first"] == (combined_2007["year_orig"] + 1))
    & (combined_2007["quarter_first"] == 1)
    & (combined_2007["quarter_orig"] == 4)
)
combined_2007 = combined_2007[
    same_quarter_2007 | next_quarter_same_year_2007 | next_quarter_next_year_2007
]

combined_2007["year_reporting"] = combined_2007["Monthly Reporting Period"] // 100
combined_2007["month_reporting"] = combined_2007["Monthly Reporting Period"] % 100
combined_2007["year_decimal_reporting"] = (
    combined_2007["year_reporting"] + (combined_2007["month_reporting"] - 1) / 12
)

combined_2007["age"] = np.round(
    (combined_2007["year_decimal_reporting"] - combined_2007["year_decimal_first"]), 4
)
combined_2007["age"].value_counts().sort_index()

two_years_2007 = combined_2007[(0 <= combined_2007["age"]) & (combined_2007["age"] < 2)]
assert len(two_years_2007["age"].value_counts()) == 24

# Current Loan Delinquency Status
two_years_2007["Current Loan Delinquency Status"] = (
    two_years_2007["Current Loan Delinquency Status"].replace("RA", 99).astype("int")
)

two_years_2007["ever_delinquent"] = (
    two_years_2007.groupby("Loan Sequence Number")[
        "Current Loan Delinquency Status"
    ].transform("max")
    > 0
).astype("int")

# This is made easier by drops earlier.
two_years_2007 = two_years_2007.drop(
    columns=[
        "Current Loan Delinquency Status",
        "Monthly Reporting Period",
        "month_reporting",
        "year_reporting",
        "year_decimal_reporting",
        "age",
        "_merge",
    ]
)
two_years_2007 = two_years_2007.drop_duplicates()
assert two_years_2007["Loan Sequence Number"].is_unique

###########
# Clean up missing values
###########

for i in [
    "Channel",
    "Property Valuation Method",
    "First Time Homebuyer Flag",
    "Occupancy Status",
]:
    two_years_2007[i] = two_years_2007[i].replace(9, np.nan)
    two_years_2007[i] = two_years_2007[i].replace("9", np.nan)

for i in ["Number of Units", "Property Type", "Number of Borrowers"]:
    two_years_2007[i] = two_years_2007[i].replace(99, np.nan)
    two_years_2007[i] = two_years_2007[i].replace("99", np.nan)

for i in [
    "Mortgage Insurance Percentage (MI %)",
    "Original Debt-to-Income (DTI) Ratio",
    "Original Loan-to-Value (LTV)",
]:
    two_years_2007[i] = two_years_2007[i].replace(999, np.nan)
    two_years_2007[i] = two_years_2007[i].replace("999", np.nan)

two_years_2007["Credit Score"] = two_years_2007["Credit Score"].replace(9999, np.nan)
two_years_2007["Credit Score"] = two_years_2007["Credit Score"].replace("9999", np.nan)

for_ml_2007 = two_years_2007[
    [
        "Credit Score",
        "First Time Homebuyer Flag",
        "Number of Units",
        "Mortgage Insurance Percentage (MI %)",
        "Occupancy Status",
        "Original Debt-to-Income (DTI) Ratio",
        "Original UPB",
        "Original Loan-to-Value (LTV)",
        "Original Interest Rate",
        "Channel",
        "Prepayment Penalty Mortgage (PPM) Flag",
        "Amortization Type (Formerly Product Type)",
        "Property State",
        "Property Type",
        "Original Loan Term",
        "Number of Borrowers",
        "Interest Only (I/O) Indicator",
        "ever_delinquent",
        "Loan Sequence Number",
    ]
]

for_ml_2007 = for_ml_2007.sort_values("Loan Sequence Number")

for_ml_2007.columns = [re.sub(" ", "_", c) for c in for_ml_2007.columns]
for_ml_2007.columns = [re.sub("%", "", c) for c in for_ml_2007.columns]
for_ml_2007.columns = [re.sub("/", "", c) for c in for_ml_2007.columns]
for_ml_2007.columns = [re.sub("\(", "", c) for c in for_ml_2007.columns]
for_ml_2007.columns = [re.sub("\)", "", c) for c in for_ml_2007.columns]
for_ml_2007.columns = [re.sub("-", "", c) for c in for_ml_2007.columns]

for_ml_2007.columns
[32]:
Index(['Credit_Score', 'First_Time_Homebuyer_Flag', 'Number_of_Units',
       'Mortgage_Insurance_Percentage_MI_', 'Occupancy_Status',
       'Original_DebttoIncome_DTI_Ratio', 'Original_UPB',
       'Original_LoantoValue_LTV', 'Original_Interest_Rate', 'Channel',
       'Prepayment_Penalty_Mortgage_PPM_Flag',
       'Amortization_Type_Formerly_Product_Type', 'Property_State',
       'Property_Type', 'Original_Loan_Term', 'Number_of_Borrowers',
       'Interest_Only_IO_Indicator', 'ever_delinquent',
       'Loan_Sequence_Number'],
      dtype='object')
[33]:
y_2007, X_2007 = patsy.dmatrices(
    "ever_delinquent ~ Credit_Score"
    "+ First_Time_Homebuyer_Flag"
    "+ C(Number_of_Units)"
    "+ Mortgage_Insurance_Percentage_MI_"
    "+ Occupancy_Status"
    "+ Original_DebttoIncome_DTI_Ratio"
    "+ Original_UPB"
    "+ Original_LoantoValue_LTV"
    "+ Original_Interest_Rate"
    "+ Channel"
    "+ Prepayment_Penalty_Mortgage_PPM_Flag"
    "+ Amortization_Type_Formerly_Product_Type"
    "+ Property_State"
    "+ Property_Type"
    "+ Original_Loan_Term"
    "+ C(Number_of_Borrowers)"
    "+ Interest_Only_IO_Indicator",
    for_ml_2007,
)
[34]:
results["ex16_num_obs"] = X_2007.shape[0]
results["ex16_share_delinquent"] = float(y_2007.mean())
print(f"Num obs in the 2007 data is {results['ex16_num_obs']:,.0f}")
print(
    f"share of these 2007 morgages that are delinquent is {results['ex16_share_delinquent']:.2%}"
)
Num obs in the 2007 data is 21,972
The share of these 2007 mortgages that are delinquent is 10.96%

Exercise 17

Had your stakeholder purchased mortgages using your model (with the 8% cutoff), what would the resulting False Omission Rate (share of predicted negatives that are false negatives) have been? (I.e., compare the predictions of the model trained above with the realized outcomes of the 2007 mortgages.)

Store your result under the key "ex17_false_omission_rate". Round your answer to 4 decimal places.

[35]:
# Predict
y_2007_pred_proba = gb.predict_proba(X_2007)[:, 1]
conf_matrix_2007 = confusion_matrix(y_2007, (y_2007_pred_proba > CUTOFF))
print(f"Confusion matrix with {CUTOFF} cutoff:\n{conf_matrix_2007}")

results["ex17_false_omission_rate"] = float(
    np.round(conf_matrix_2007[1, 0] / np.sum(conf_matrix_2007[:, 0]), 4)
)

print(
    f"The False Omission Rate at {CUTOFF} for 2007 data is:"
    f"\n{results['ex17_false_omission_rate']:.2%}"
)
Confusion matrix with 0.08 cutoff:
[[14487  5076]
 [  847  1562]]
The False Omission Rate at 0.08 for 2007 data is:
5.52%

Exercise 18

How did the retrospective performance of your model compare to its actual performance moving forward? Why? Did you stay below the stakeholder’s 5% target for mortgages that were predicted to be safe but in the end were not?

Nope. A train-test split only tells you about the accuracy you can expect internally, on data drawn from the same population as your training data; it doesn’t account for the uncertainty introduced by threats to external validity, like the shift in the mortgage market between 2004 and 2007.
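An illustrative side-by-side comparison makes the gap explicit:

# Illustrative: retrospective vs. forward-looking performance.
print(f"False Omission Rate on 2004 test split: {results['ex14_false_omission_rate']:.2%}")
print(f"False Omission Rate on 2007 mortgages:  {results['ex17_false_omission_rate']:.2%}")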

[38]:
assert set(results.keys()) == {
    "ex2_merge_type",
    "ex4_num_mortgages",
    "ex5_num_obs",
    "ex7_num_mortgages",
    "ex7_share_delinquent",
    "ex10_num_obs",
    "ex12_roc_auc",
    "ex14_false_omission_rate",
    "ex16_num_obs",
    "ex16_share_delinquent",
    "ex17_false_omission_rate",
}