Converting Stakeholder Prompts into Questions

As a data scientist, it will probably be the case that the people you work with will not come to you with well-formed research questions they want investigated. Instead, it is much more likely they will come to you with a problem they are facing, and in some cases a poorly formed idea of how they think you can solve their problem using data science (a solution that will almost certain invoke the magic of “machine learning”).

When this happens, it is your job to do two things:

  1. Figure out the real core problem / need of the stakeholder,

  2. Determine what questions, if answered, would help you address that problem.

What’s The Real Problem / Need

You might think that the on thing that the stakeholder knows best is nature of their problem. But one thing you will quickly find is that stakeholders will very often mis-identify or mis-represent their true problem / need.

For example, suppose you are approached by a real estate group interested in developments in the United Kingdom. They tell you:

We do a lot of work building large residential buildings. One problem we often have is that after we start digging the fundations of our buildings, we discover that we’re actually on top of ancient ruins of some sort that the stupid government wants to preserve for stupid posterity. Then we have to spend piles of money to have archeologists come to dig out these old bones with toothbrushes, all of which also puts us behind schedule. So what we want you to do is design a machine learning algorithm that we can use to detect archeological sites from the ground penetrating radar scans we do after we’ve cleared out a job site so we can avoiding digging foundations in the spots on our plots where archeological sites might be located.

So what does this stakeholder want you to do? At first glance, it might seem like the answer is “develop a machine learning algorithm for detecting archeological sites in ground penetrating radar scans”.

But here’s the thing: most people ask data scientists for help because they don’t know much about data science. As a result, they specific thing they ask for may not actually be the best approach available given the state of data science.

In this case, for example, I would argue that the real need of this company is to know where archeological sites are likely to be located, full stop. Finding a way to identify archeological remains in ground radar scans might help them, but there are two problems with that approach. First, it’s not clear it’s feasable. Training a supervised machine learning algorithm requires labeled training data, which in this case would mean ground penetrating radar scans of identified-but-not-yet-excavated-ruins. And its not at all clear such data could be obtained.

But second, and more importantly, it seems like it’d be much more helpful and cost effective if you could make a guess about whether there are archeological sites at a location before the company buys the land and clears away the existing buildings!

So what would I propose to a stake holder like this? I would say: “If I could tell you the likelihood that an archeological site is under a plot of land *you’re thinking about buying, would that be more helpful to you?”

Now to be clear, you may not always be right in your assessment. They may come back and say “No, sorry – there’s this other consideration we didn’t explain initially” (like “we’ve already bought all this land. We’re committed.”). But before diving in, you always want to be sure you’re solving the right problem. And you should never assume that the problem to be addressed has been properly specified.

What Questions Do I Need To Ask?

Data science tools aren’t designed to solve problems; they answer questions. And so to have any hope of successfully solving the problem presented by your stakeholder, you must first determine what questions, if answered, would help solve the problem at hand.

In the example of the real estate development company above, for example, a question which, if answered, would make it essentially solve the stakeholder’s problem would be:

Can I predict the presence of archeological ruins using publicly available data, such as data topology, hydrology, soil types, and geography?

Unlike using machine learning to process ground penetrating radar scans, the training data required to build a supervised machine learning algorithm to answer this question – information on the locations of previously located archeological ruins – is likely public, and there are lots of public sources of data on landscape features and soil types. You would have to deal with the fact that people will only have figured out if archeological ruins are present in places where development has taken place in the past (so you’d want to build your training data using only the locations of large real estate developments that would find ruins if ruins were present), but this is emminently doable.

Again, I am not saying this is the best strategy – there may be considerations we don’t know about yet – but this approach is certainly something I would bring back to the real estate company before doing any actual data work.

Make Your Questions Specific and Actionable

In developing your questions, it is important to make them specific and actionable. A specific and actionable question is one that makes it very clear what you need to do next. For example, suppose an international aid organization told you they were worried that urbanization in Africa, Asia and Latin America was impacting efforts to reduce infant mortality. Some examples of specific, actionable questions are: “Is infant mortality higher among recent migrants to urban centers, controlling for income?” or “are the causes of infant mortality among recent migrants to urban centers different from those living in rural area?” Reading those questions, you can probably immediately think of what data you’d need to collect, and what regressions you’d want to run to generate answers to those questions.

Vague questions would be “is urbanization impacting efforts to reduce infant mortality?”, or “does urbanization affect infant mortality?” Note that when you read these, they don’t seem to obviously imply a way forward.

Perhaps the best way to figure out if your question is answerable is to write down what an answer to your question would look like. Seriously – try it. Can you write down, on a piece of paper, the graph, regression table, or machine learning diagnostic statistics (complete with labels on your axes, names for variables, etc.) that would constitute an answer to your question? If not, it’s probably too vague!

Many Questions?

To be clear, it may not always be the case that you can solve your stakeholder’s problem by answering a single question. For example, suppose you’re working with an educational agency trying to decide which of two pilot programs designed to provide early education to children they should launch nationally. To help them reach a decision, you may need to ask several questions, like:

  1. What was the effect of the first pilot program on child learning?

  2. What was the effect of the second pilot program on child learning?

  3. What were the costs of the two program (so we can determine the cost effectiveness of each program)?

But in this situations, it is still best to break you question down into smaller components.