Your First Stakeholder

Welcome to Solving Problems With Data! For your first class exercise, imagine you are a data science consultant who has been hired by a leading credit advisory firm to help with a pressing data science need: investigating machine learning models for estimating creditworthiness that better comply with regulation.

While this is an exercise, and we are just imagining that you have been hired by a credit advisory firm, the project proposal you will be working with is entirely authentic. It was submitted as a Capstone project proposal to the Duke MIDS program, and it was accepted and undertaken by a team of MIDS students (which means we actually know how the project turned out).

Your task is to review this proposal as a team, then detail how you would approach addressing the client's needs. Please be as concrete as possible, detailing exactly what you would do over the first four weeks of your engagement.

The Proposal

Begin Actual Proposal

The Acme Corporation [not the company’s real name, obviously] is a leading credit advisory firm that operates at the intersection of deep credit expertise and advanced data science. We work with top banks and leading fintechs. Our data scientists build statistical models to predict consumer behavior and inform decisions such as whom to lend to, in what amount, and on what terms. One might think this would be a core capability for major lenders, but our expertise has led them to rely on us to solve the most challenging data science problems.

Machine learning models are key to driving this performance increase. One of the great advantages of machine learning over traditional regressions is its ability to capture non-linear effects. The downside is that machine learning models can be hard to interpret. There are several ways to explain what is occurring in a model, e.g. partial dependence plots, Shapley values, etc. The financial services market is highly regulated and requires a deep understanding of the models used in lending. US law requires lenders to be able to give the top variables on an individual basis, and the direction of each variable's impact must make sense. For example, in lending you could not have a variable predicting that a person is lower risk if they are currently delinquent on a loan. This must hold for every individual prediction in the model, not just on overall averages. The Acme Corporation has several methods to address the explainability problem for ML models.

Given this high burden, The Acme Corporation is always looking to test model specifications that improve the current baseline model performance while being completely transparent about the model itself. Our current champion model type is Gradient Boosted Machine (GBM). For this project, we would like to explore the viability of Explainable Boosting Machines (EBM).

The project would include the following:

  • Use a publicly available dataset as a baseline comparison. The Acme Corporation has proposed a potential dataset below, but is open to others if the team wants to use another. The Acme Corporation can help determine whether another dataset is similar to the data used for typical model builds. Because we are a consulting firm, our clients do not want us sharing their data, so publicly available datasets are necessary.

  • Research how EBM methods work and compare them to GBMs.

  • Create a baseline model from a GBM to make a comparison.

  • Build and optimize an EBM and compare its performance against a GBM. This would entail both model performance and explainability tools.

  • Create top variables on the EBM to ensure compliance with US law. Compare top variables of the EBM versus the GBM

  • Add overlays to EBM code to enforce monotonicity between variable impacts and target.

The publicly available dataset to use is the KDD Cup dataset. It is stored here. This dataset has both the sample size and the number of predictors needed for a machine learning model. It has roughly 191k observations and 481 columns. The dataset was used in a competition to construct the best model for optimizing direct mail response strategies. For the project, the team will need to define a response variable and then build out the best GBM and EBM models. The data is already split into build and validation sets, and the team should use that split. Any other feature creation is encouraged, as long as the output is explainable.

The Acme Corporation can support this project through bi-weekly meetings with the project team. In those meetings, The Acme Corporation can provide guidance on how to address the problem along with feedback on the results. The expected output is a write-up and presentation describing the detailed mechanics of EBMs along with a documented model performance difference between an EBM and a GBM. A documented modeling code pipeline using EBMs is also needed. As a stretch goal, the team could create a methodology for adverse actions that satisfies the regulatory needs. This is an important area of work at The Acme Corporation, and we would highly value input and research from a top university.

Acme Corporation Contacts:

[redacted]

End Actual Proposal
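
To make the proposal's per-individual explanation requirement more concrete, here is a minimal Python sketch. It assumes an additive model (EBMs are one example) in which each variable contributes its own score term to an applicant's prediction. The function names, feature names, and score values are all hypothetical and chosen only for illustration; they do not come from the proposal or from any particular library.

```python
from typing import Dict, List, Tuple


def top_adverse_reasons(
    contributions: Dict[str, float], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k variables that raise this applicant's risk score the most.

    `contributions` maps a variable name to its additive score term for one
    applicant (assumed here to be in log-odds of default, so positive values
    push predicted risk up).
    """
    adverse = [(name, score) for name, score in contributions.items() if score > 0]
    adverse.sort(key=lambda item: item[1], reverse=True)
    return adverse[:k]


def direction_violations(
    contributions: Dict[str, float], expected_sign: Dict[str, int]
) -> List[str]:
    """List variables whose contribution contradicts the expected direction.

    `expected_sign` maps a variable name to +1 (should raise risk for this
    applicant's value) or -1 (should lower risk). Variables without an
    expectation are not checked.
    """
    return [
        name
        for name, sign in expected_sign.items()
        if name in contributions and contributions[name] * sign < 0
    ]


# Hypothetical score terms for a single applicant (illustrative values only).
example_applicant = {
    "currently_delinquent": 0.9,
    "credit_utilization": 0.4,
    "months_since_last_inquiry": 0.1,
    "years_of_credit_history": -0.6,
}

example_reasons = top_adverse_reasons(example_applicant, k=2)
example_violations = direction_violations(
    example_applicant, {"currently_delinquent": +1, "years_of_credit_history": -1}
)
```

Here `example_reasons` lists `currently_delinquent` and then `credit_utilization`, and `example_violations` is empty because both checked variables point in the expected direction. Running the check for every scored individual, rather than on averaged effects, mirrors the proposal's point that the direction must always make sense.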

Your Task

Outline, in order of priority, the tasks the members of your team would undertake to address the needs of this stakeholder. Please detail what you would do in the first week, second week, third week, and fourth week (you may assume you are working on this full time). Include discussion of how you might divide tasks among team members. Also detail the deliverables you would aim to provide to the client.
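
When scoping the proposal's monotonicity item, it may help to see what one possible overlay could look like. The Python sketch below assumes a variable's effect is stored as a list of per-bin scores, and it replaces those scores with the closest non-decreasing (or non-increasing) sequence using the pool-adjacent-violators algorithm, i.e. isotonic regression. This is an illustrative assumption about how such an overlay might work, not the method used by The Acme Corporation or by any specific EBM library. The function name and example values are hypothetical.

```python
from typing import List, Optional


def monotonize(
    scores: List[float],
    weights: Optional[List[float]] = None,
    increasing: bool = True,
) -> List[float]:
    """Weighted pool-adjacent-violators (isotonic regression) on bin scores.

    Returns the monotone sequence closest to `scores` in weighted least
    squares. `weights` could be, for example, the number of observations
    falling in each bin.
    """
    if weights is None:
        weights = [1.0] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("scores and weights must have the same length")

    # A non-increasing fit is a non-decreasing fit on the negated scores.
    signed = scores if increasing else [-s for s in scores]

    # Each block stores [weighted mean, total weight, number of bins pooled].
    blocks: List[List[float]] = []
    for score, weight in zip(signed, weights):
        blocks.append([score, weight, 1])
        # Pool backwards while the newest block violates the ordering.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean_b, w_b, n_b = blocks.pop()
            mean_a, w_a, n_a = blocks.pop()
            total = w_a + w_b
            pooled = (mean_a * w_a + mean_b * w_b) / total
            blocks.append([pooled, total, n_a + n_b])

    fitted: List[float] = []
    for mean, _, count in blocks:
        fitted.extend([mean] * int(count))
    return fitted if increasing else [-value for value in fitted]


# Hypothetical per-bin scores for "number of past delinquencies" (0, 1, 2, 3+).
raw_scores = [-0.5, 0.25, 0.0, 0.75]
constrained_scores = monotonize(raw_scores)
```

In this example the second and third bins are out of order, so they are pooled to their average and `constrained_scores` becomes `[-0.5, 0.125, 0.125, 0.75]`. Any such adjustment changes the model's predictions, so performance would need to be re-evaluated afterwards.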