Interpretable Modelling of Credit Risk

As detailed in Cynthia Rudin’s excellent commentary on interpretability (arXiv version here), there are a plethora of reasons to avoid black box models when they are being used to make high-stakes decisions that may have life-altering effects on real people. Efforts to develop “explainable black box models,” while appealing for their potential to let us continue using the same tools we always have and to create explanations after the fact, are inherently flawed. As Rudin notes in my single favorite passage from her paper:

Explainable ML methods provide explanations that are not faithful to what the original model computes. Explanations must be wrong. They cannot have perfect fidelity with respect to the original model. If the explanation was completely faithful to what the original model computes, the explanation would equal the original model, and one would not need the original model in the first place, only the explanation. (In other words, this is a case where the original model would be interpretable.) This leads to the danger that any explanation method for a black box model can be an inaccurate representation of the original model in parts of the feature space.

An inaccurate (low-fidelity) explanation model limits trust in the explanation, and by extension, trust in the black box that it is trying to explain. An explainable model that has a 90% agreement with the original model indeed explains the original model most of the time. However, an explanation model that is correct 90% of the time is wrong 10% of the time. If a tenth of the explanations are incorrect, one cannot trust the explanations, and thus one cannot trust the original black box. If we cannot know for certain whether our explanation is correct, we cannot know whether to trust either the explanation or the original model.

With this motivation in mind, in this exercise we will use a cutting-edge interpretable modeling framework to model credit risk using data from the 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2010). This data covers the period from 2006 to 2009, and “comes from a private label credit card operation of a Brazilian credit company and its partner shops.” (The competition was won by TIMi, who purely by coincidence helped me complete my PhD dissertation research!).

We will be working with Generalized Additive Models (GAMs) (not to be confused with Generalized Linear Models (GLMs) — GLMs are a special case of GAMs). In particular, we will be using the pyGAM library, though this is far from the only GAM implementation out there. mgcv in R is probably considered the gold standard, as it was developed by a pioneering researcher of GAMs. statsmodels also has an implementation, and GAMs are hiding in plain sight behind many other tools, like Meta’s Prophet time series forecasting library (which is GAM-based).
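
To make the additive structure concrete, here is roughly what a GAM looks like in pyGAM syntax (the feature indices below are placeholders, not the credit data we’ll use):

from pygam import LogisticGAM, s, f

# g(E[y]) = beta_0 + f_1(x_0) + f_2(x_1) + f_3(x_2)
# s(i) is a smooth spline over column i; f(i) is a categorical (factor) term.
# If every smooth term were forced to be linear, this would reduce to a GLM.
gam = LogisticGAM(s(0) + s(1) + f(2))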

Data Prep

Exercise 1

The PAKDD 2010 data is in this repository. You can find column names in PAKDD2010_VariablesList.XLS and the actual data in PAKDD2010_Modeling_Data.txt.

Note: you may run into a string-encoding issue loading the PAKDD2010_Modeling_Data.txt data. All I’ll say is that most Latin-based languages used latin-1 as a text encoding prior to broad adoption of UTF-8. (Don’t know about UTF? Check out this video!)

Load the data (including column names).
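
Here’s a minimal loading sketch; the delimiter, the layout of the variable-list spreadsheet, and the exact encoding are assumptions you should verify against the files themselves:

import pandas as pd

# Column names live in the XLS file (assumption: the names are in the first column).
var_list = pd.read_excel("PAKDD2010_VariablesList.XLS")
col_names = var_list.iloc[:, 0].tolist()

# The modeling data has no header row; a pre-UTF-8 Latin encoding handles the
# Brazilian Portuguese strings. (The tab delimiter is an assumption.)
df = pd.read_csv(
    "PAKDD2010_Modeling_Data.txt",
    sep="\t",
    header=None,
    names=col_names,
    encoding="latin-1",
    low_memory=False,
)
print(df.shape)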

Exercise 2

There are a few variables with a lot of missing values (more than half missing). Given the limited documentation for this data, it’s a little hard to be sure why, but given the effect on sample size and which variables are missing, let’s go ahead and drop them. You should end up dropping 6 variables.

Hint: Some variables have missing values that aren’t immediately obvious.

(This is not strictly necessary at this stage, given that we’ll be doing more feature selection down the line, but it keeps things simpler to know we don’t have to worry about missingness later.)
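
One way to surface the not-so-obvious missing values and then drop the worst offenders (a sketch that continues from the loaded df above; the blank-string sentinel is an assumption worth checking with value_counts()):

import numpy as np

# Treat empty / whitespace-only strings as missing so they show up in the count.
df = df.replace(r"^\s*$", np.nan, regex=True)

missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share.head(10))

# Drop columns with more than half of their values missing.
df = df.drop(columns=missing_share[missing_share > 0.5].index)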

Exercise 3

Let’s start off by fitting a model that uses the following variables:

"QUANT_DEPENDANTS",
"QUANT_CARS",
"MONTHS_IN_RESIDENCE",
"PERSONAL_MONTHLY_INCOME",
"QUANT_BANKING_ACCOUNTS",
"AGE",
"SEX",
"MARITAL_STATUS",
"OCCUPATION_TYPE",
"RESIDENCE_TYPE",
"RESIDENCIAL_STATE",
"RESIDENCIAL_CITY",
"RESIDENCIAL_BOROUGH",
"RESIDENCIAL_ZIP_3"

(GAMs don’t have any automatic feature selection methods, so these are based on my own sense of features that are likely to matter. A full analysis would entail a few passes of feature refinement.)

Plot and otherwise characterize the distributions of all the variables we may use. If you see anything bananas, adjust how those terms enter your model. Yes, pyGAM has flexible functional forms, but giving the model features that are engineered to be more substantively meaningful (e.g., taking the log of income) will aid model estimation.

You should probably do something about the functional form of at least PERSONAL_MONTHLY_INCOME and QUANT_DEPENDANTS.
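
One possible treatment, assuming the df from earlier (the exact transform and the cap are judgment calls you should justify from your own plots):

import numpy as np

# log1p tames the long right tail in income while keeping zero incomes finite.
df["LOG_PERSONAL_MONTHLY_INCOME"] = np.log1p(df["PERSONAL_MONTHLY_INCOME"])

# A handful of implausibly large dependant counts; capping is one reasonable fix.
df["QUANT_DEPENDANTS"] = df["QUANT_DEPENDANTS"].clip(upper=10)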

Exercise 4

Geographic segregation means residency data often contains LOTS of information. But there’s a problem with RESIDENCIAL_CITY and RESIDENCIAL_BOROUGH. What is the problem?

In any real project, this would be something absolutely worth resolving, but for this exercise, we’ll just drop all three string RESIDENCIAL_ variables.

Model Fitting

Exercise 5

First, use train_test_split to do an 80/20 split of your data. Then, using the TARGET_LABEL_BAD variable, fit a classification model on this data. Optimize with gridsearch. Use splines for continuous variables and factors for categoricals.

At this point we’d ideally be working with 11 variables. However, pyGAM can get a little slow with lots of observations plus factor features with many unique values (e.g., 50,000 observations and the many levels of RESIDENCIAL_ZIP_3 takes about 15 minutes on my computer).

So let’s start by fitting a model that also excludes RESIDENCIAL_ZIP_3. In that configuration, you should get a model fit in 10-15 seconds.
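
A sketch of one way to set this up, assuming the engineered df from earlier; the column names, the target column header, and the choice of which counts to treat as factors are assumptions you should adapt to your own data:

from sklearn.model_selection import train_test_split
from pygam import LogisticGAM, s, f

# pyGAM needs numeric arrays, so encode the string categoricals as integer codes.
cat_cols = ["SEX", "MARITAL_STATUS", "OCCUPATION_TYPE", "RESIDENCE_TYPE"]
for c in cat_cols:
    df[c] = df[c].astype("category").cat.codes

features = [
    "MONTHS_IN_RESIDENCE", "LOG_PERSONAL_MONTHLY_INCOME", "AGE",   # continuous
    "QUANT_DEPENDANTS", "QUANT_CARS", "QUANT_BANKING_ACCOUNTS",    # small counts
    "SEX", "MARITAL_STATUS", "OCCUPATION_TYPE", "RESIDENCE_TYPE",  # categorical
]

model_df = df[features + ["TARGET_LABEL_BAD"]].dropna()  # pyGAM cannot handle NaNs
X = model_df[features].to_numpy()
y = model_df["TARGET_LABEL_BAD"].to_numpy()  # check the exact target header in your file

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Splines for the continuous columns; factors for everything else (treating the
# small count variables as factors rather than splines is a judgment call).
terms = (
    s(0) + s(1) + s(2)
    + f(3) + f(4) + f(5) + f(6) + f(7) + f(8) + f(9)
)

# gridsearch() fits the model over a grid of smoothing penalties and keeps the best.
gam = LogisticGAM(terms).gridsearch(X_train, y_train)
gam.summary()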

Exercise 6

Create a (naive) confusion matrix using the predicted values you get with predict() on your test data. Our stakeholder cares about two things:

  • maximizing the number of people to whom they extend credit, and

  • the false omission rate (the share of people identified as “safe bets” who aren’t, and who thus default).

How many “good bets” does the model predict (true negatives), and what is the False Omission Rate (the share of predicted negatives that are false negatives)?

Looking at the confusion matrix, how did the model maximize accuracy?
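
For reference, here’s a minimal sketch of pulling the relevant numbers out of the confusion matrix, assuming the gam, X_test, and y_test objects from the fitting sketch above:

from sklearn.metrics import confusion_matrix

y_pred = gam.predict(X_test).astype(int)  # hard 0/1 predictions at the default 0.5 threshold

# confusion_matrix orders the cells as [[TN, FP], [FN, TP]] for 0/1 labels.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print(f"Predicted 'good bets' (true negatives): {tn}")
print(f"False Omission Rate: {fn / (fn + tn):.3f}")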

Exercise 7

Suppose your stakeholder wants to minimize the share of approved borrowers who go on to default. How low of a False Omission Rate (the share of predicted negatives that are false negatives) can you get (assuming more than, say, 10 true negatives), and how many “good bets” (true negatives) do they get at that risk level?

Hint: use predict_proba()

Note: One can use class weights to shift the emphasis of the original model fitting, but for the moment let’s just play with predict_proba() and thresholds.
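
A sketch of the threshold-sweeping idea, again assuming gam, X_test, and y_test from above (the grid of thresholds is arbitrary):

import numpy as np

probs = gam.predict_proba(X_test)  # predicted probability of default for each person

# Only extend credit when the predicted default risk falls below the threshold.
for threshold in np.linspace(0.05, 0.50, 10):
    pred_bad = (probs >= threshold).astype(int)
    tn = int(np.sum((pred_bad == 0) & (y_test == 0)))  # good bets correctly identified
    fn = int(np.sum((pred_bad == 0) & (y_test == 1)))  # approved people who default
    if tn > 10:
        print(f"threshold={threshold:.2f}  TN={tn}  FOR={fn / (fn + tn):.3f}")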

Exercise 8

If the stakeholder wants to maximize true negatives and can tolerate a false omission rate of 19%, how many true negatives will they be able to enroll?

Let’s See This Interpretability!

We’re using GAMs for their interpretability, so let’s use it!

Exercise 9

Plot the partial dependence plots for all your continuous features with 95% confidence intervals (I have three at this stage).

If you get an error like this when generating partial dependence plots:

----> pdep, confi = gam.partial_dependence(term=i, X=XX, width=0.95)

...
ValueError: X data is out of domain for categorical feature 4. Expected data on [1.0, 2.0], but found data on [0.0, 0.0]

it’s because you have a variable set as a factor that doesn’t include 0 among its values. pyGAM assumes 0 identifies the excluded category. Just recode the variable so that 0 is used for one of the categories.
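
For reference, the basic plotting pattern looks something like this (a sketch assuming the gam model and term ordering from the fitting sketch above, where the three continuous splines come first):

import matplotlib.pyplot as plt

# Indices of the spline terms to plot (an assumption; they follow the order in
# which terms were added to the model).
continuous_terms = [0, 1, 2]

fig, axes = plt.subplots(1, len(continuous_terms), figsize=(12, 4))
for ax, i in zip(axes, continuous_terms):
    XX = gam.generate_X_grid(term=i)
    pdep, confi = gam.partial_dependence(term=i, X=XX, width=0.95)
    ax.plot(XX[:, i], pdep)                   # partial dependence
    ax.plot(XX[:, i], confi, c="r", ls="--")  # 95% confidence interval
    ax.set_title(f"term {i}")
plt.tight_layout()
plt.show()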

Exercise 10

How does the partial dependence with respect to age look?

Exercise 11

Refit your model, but this time impose monotonicity or concavity/convexity on the relationship between age and credit risk (which makes more sense to you?). Fit the model and plot the new partial dependence.
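
A sketch of how the constraint can be imposed in pyGAM, assuming the same term ordering as before (with AGE as term 2); swap in constraints="concave" or "monotonic_inc" if that matches your reasoning better:

from pygam import LogisticGAM, s, f

# Same terms as before, but the age spline is constrained to be monotonically
# decreasing in the predicted risk of default.
terms_constrained = (
    s(0) + s(1) + s(2, constraints="monotonic_dec")
    + f(3) + f(4) + f(5) + f(6) + f(7) + f(8) + f(9)
)
gam_constrained = LogisticGAM(terms_constrained).gridsearch(X_train, y_train)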

Exercise 12

Functional form constraints are often about fairness or meeting regulatory requirements, but they can also prevent overfitting.

Does this change the number of “true negatives” you can enroll below a false omission rate of 19%?

Exercise 13

In the preceding exercises, we allowed pyGAM to choose its own smoothing parameters / coefficient penalties. This makes life easy, but it isn’t always optimal, especially because when it does so, it picks the same smoothing penalty (the lambda in .summary()) for all terms.

(If you haven’t seen them yet, penalties are designed to limit overfitting by, basically, “penalizing” big coefficients on different terms. This tends to push models towards smoother fits.)

To get around this, we can do a grid or random search. This is definitely a little slow, but let’s give it a try!

Then, following the example given in the docs linked above, let’s do a random search. Make sure your initial random points have a shape of 100 x (the number of terms in your model).
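
Following the pattern in the pyGAM docs, a random search sketch (assuming the terms, X_train, and y_train objects from earlier; 10 is a placeholder for your model’s term count):

import numpy as np
from pygam import LogisticGAM

n_terms = 10  # number of (non-intercept) terms in your model
lams = np.random.rand(100, n_terms)  # 100 random points on [0, 1]
lams = lams * 6 - 3                  # shift to [-3, 3]
lams = 10 ** lams                    # spread log-uniformly over [1e-3, 1e3]

gam_random = LogisticGAM(terms).gridsearch(X_train, y_train, lam=lams)
gam_random.summary()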

Exercise 14

How many true negatives can you get now at a less than 19% False Omission Rate?

Exercise 15

Add an interaction term between age and personal income.
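
In pyGAM, an interaction between two continuous features is expressed with a tensor term, te(). A sketch, again assuming the earlier column ordering (AGE at index 2, log income at index 1):

from pygam import LogisticGAM, s, f, te

# te(2, 1) adds a tensor-product smooth over age and log income.
terms_interact = (
    s(0) + s(1) + s(2)
    + f(3) + f(4) + f(5) + f(6) + f(7) + f(8) + f(9)
    + te(2, 1)
)
gam_interact = LogisticGAM(terms_interact).gridsearch(X_train, y_train)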

Exercise 16

Now visualize the partial dependence interaction term.
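
One way to do this is with a 3D surface plot, following the tensor-term example in the pyGAM docs (a sketch assuming gam_interact from above, where the tensor term is term index 10):

import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d  # noqa: F401 -- registers the 3d projection

i = 10  # index of the te() term (the last term added before the intercept)
XX = gam_interact.generate_X_grid(term=i, meshgrid=True)
Z = gam_interact.partial_dependence(term=i, X=XX, meshgrid=True)

ax = plt.axes(projection="3d")
ax.plot_surface(XX[0], XX[1], Z, cmap="viridis")
ax.set_xlabel("AGE")
ax.set_ylabel("LOG_PERSONAL_MONTHLY_INCOME")
plt.show()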

Exercise 17

Finally, another popular interpretable model is the ExplainableBoostingClassifier. You can learn more about it here, though it may not make much sense if you aren’t familiar with gradient boosting yet. Still, at least one of your classmates prefers it to pyGAM, so give it a try using this code:

from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show
import warnings

# Fit the Explainable Boosting Machine on the same training data used above.
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")

    # Global explanations: overall importance and shape of each feature's effect.
    ebm_global = ebm.explain_global()
    show(ebm_global)

    # Local explanations: per-observation breakdowns of individual predictions.
    ebm_local = ebm.explain_local(X_train, y_train)
    show(ebm_local)