Taxonomy of Questions

A central focus of this course will be thinking about how the tools of data science can best be brought to bear on different types of questions about the world. To that end, we must begin by introducing a taxonomy of questions.

To be clear, this taxonomy is my own, and thus you may not find that everyone you encounter will immediately recognize the distinctions that, by the end of this class, I hope will be clear to you. Indeed, part of the reason that data science is so fragmented is that different disciplines tend to focus (almost myopically) on certain classes of questions. As a result, they often fail to recognize that while the tools they use may be effective for their preferred questions, different tools may be required by those interested in different questions. In this class, we will strive to recognize the merits of all forms of inquiry, and develop the skills necessary to properly approach any question that comes our way.

In this course, we will use a three-fold taxonomy of questions:

  • Descriptive Questions: Questions about the current (or past) state of the world. Descriptive questions are often about measuring things that haven’t previously been measured, or identifying previously unseen patterns.

  • Causal Questions: Questions about causes and effects – why does the world look the way it does? What caused the current state of the world? What is the effect of, say, a drug?

  • Predictive Questions: Questions about the future, or questions that require extrapolation beyond current data.

Positive versus Normative Questions

Before we dive into each of these types of questions, it is worth pausing to emphasize that this is a taxonomy of questions about how the world is, not about how the world should be. That’s because while data science is an amazing tool for telling us about the state of the world, it cannot, on its own, answer should questions. Answering should questions requires evaluating the desirability of different possible outcomes, and that can only be done on the basis of a system of values. Data science may tell us the consequences of different courses of action, but it cannot tell us whether those consequences make a given course of action worthwhile.

To illustrate, suppose you are interested in reducing opioid overdoses. Your rigorous data science analysis may tell you that increasing the regulation of opioid prescriptions will reduce overdoses by some amount X and reduce access to opioids for those with chronic pain by some amount Y. But does that mean you should enact the policy? Well, that depends on how much value you place on patients with chronic pain having access to opioids, and how much value you place on preventing overdoses. And the answers to those questions simply can’t come from your data.
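To make that division of labor concrete, here is a toy sketch. The estimates X and Y are treated as outputs of the data analysis, while the weights placed on each outcome have to come from the decision-maker. Every number and name here (`policy_is_worthwhile`, the weights) is hypothetical and chosen only for illustration.

```python
def policy_is_worthwhile(overdoses_prevented, patients_losing_access,
                         value_per_overdose_prevented,
                         cost_per_patient_losing_access):
    """Return True if the weighted benefit exceeds the weighted cost.

    The first two arguments are positive (empirical) estimates. The last two
    are normative weights that no dataset can supply.
    """
    benefit = overdoses_prevented * value_per_overdose_prevented
    cost = patients_losing_access * cost_per_patient_losing_access
    return benefit > cost


# Hypothetical estimates from the data analysis (X and Y in the text).
X = 100    # overdoses prevented
Y = 2000   # chronic pain patients who lose access

# Two decision-makers look at the same data but hold different values.
decision_a = policy_is_worthwhile(X, Y, value_per_overdose_prevented=50,
                                  cost_per_patient_losing_access=1)
decision_b = policy_is_worthwhile(X, Y, value_per_overdose_prevented=10,
                                  cost_per_patient_losing_access=1)
print(decision_a, decision_b)
```

The data (X and Y) are identical in both calls; only the value weights change, and with them the answer to the should question.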

In this course, we will refer to questions about how the world is as “positive questions”, and questions about how the world should be as “normative questions”. But be aware these terms aren’t universal. Some people use the term “descriptive” instead of positive, and “prescriptive” instead of normative. Since we are using the term “descriptive” for a different purpose in this class, I will use normative and positive throughout this course to avoid confusion.

Descriptive Questions

Descriptive questions are questions about the current or past state of the world. In the words of John Gerring (we’ll read more of his work for a future class):

A descriptive argument describes some aspect of the world. In doing so, it aims to answer what questions (e.g., when, whom, out of what, in what manner) about a phenomenon or a set of phenomena. Descriptive arguments are about what is/was. For example: “Over the course of the past two centuries there have been three major waves of democratization.”

Of all the types of questions we will study in this class, descriptive questions tend to be the least appreciated, but I would argue that in many ways they are the most important. That is because descriptive analyses are often the foundation for all other work. After all, it is only by first understanding the patterns in our world that we can then move on to asking questions about how they arose, or how they may evolve in the future.

To illustrate, let’s consider a few important descriptive analyses:

Descriptive Example 1: Nope, This Time Isn’t Different

A great example of descriptive analysis comes from economics. In 2009, Carmen Reinhart and Kenneth Rogoff published a book called This Time Is Different: Eight Centuries of Financial Folly. In it, they comprehensively analyze hundreds of years of economic history across more than sixty countries to document how, despite pundits regularly declaring that “this time things are different,” financial crises occur with remarkable frequency, duration, and ferocity. They offer some theories about why this may be, but their core contribution is documenting clearly that whatever it is that drives financial crises, it is not something specific to any geography or period, but rather something common to human economic systems the world over. In the process, they wipe away dozens of attempts to explain specific financial crises as special cases.

Descriptive Example 3: Disease Surveillance

It is hard to think of a public health discovery that didn’t start with disease surveillance – the practice of keeping descriptive statistics about causes of death or disease. Efforts to understand HIV began when public health officials saw a huge rise in gay men dying from diseases that shouldn’t have been fatal for young, otherwise healthy men, and research into the role of cigarettes in causing lung cancer started when data showed that lung cancer rates were exploding across the world.

Descriptive Example 4: Global Warming

Before we began to develop a rigorous understanding of the dynamics that were causing our climate to warm, we first had to become aware that, well, our climate was warming! Yup, this is yet another discipline that began with a “simple” descriptive analysis: what is the temperature of the Earth, and how has it changed over time? (I put “simple” in quotes because there’s nothing actually simple about measuring the temperature of the entire world over, initially, decades and, later, centuries.)

Last Thoughts

By now you can hopefully recognize the importance of descriptive analyses. Description is rarely the end of the scientific inquiry, but it is often the start, and without it, it is hard to imagine where we would be today.

And while descriptive analyses may seem simple (e.g. Vera Rubin just measured the rate of rotation of stars in a galaxy, and Carmen Reinhart and Ken Rogoff just documented every financial crisis in the past several hundred years) because they often don’t entail sophisticated machine learning algorithms or extremely complicated statistical models, in truth descriptive analyses are often extremely difficult to undertake in their own way. For example, because they usually entail measuring something that hasn’t been measured before, descriptive analyses generally require massive data collection efforts and innovations in measurement. And as we will also discuss in future classes, in some ways that puts even greater demands on the researcher (and the researcher’s judgement and case knowledge) than some of the other types of questions we will discuss.

Causal Questions

Causal questions are questions about causes and effects, and why we see certain patterns in the world. They often take the form of “What is the effect of X on Y”, where X could be a drug and Y a disease, or where X is a government policy and Y is a public health or economic outcome.

In many cases, as described above, causal questions are prompted by the answers to descriptive questions. Vera Rubin discovered that the rotational curves of galaxies couldn’t be explained by current physics, so now dozens of experiments are trying to find particles whose presence would cause the patterns she has documented.

To borrow once more from John Gerring:

causal arguments attempt to answer why questions. Specifically, they assert that one or more factors generate change in some outcome, or generated change on some particular outcome. They imply a counterfactual. For example: “The third wave of democratization was caused, in part, by the end of the Cold War.” It will be seen that descriptive arguments are nested within causal arguments. Both X and Y are descriptive statements about the world upon which the causal argument rests.

Causal questions are, honestly, some of the easiest to come up with:

  • What is the effect of minimum wage laws on unemployment?

  • What is the effect of DRUG X on DISEASE Y?

  • What is the effect of Iowa voting first in US Presidential Primaries on the types of politicians who become president?

At the same time, though, they can also be awfully hard to answer…

Why Causal Inference is Hard

As we will explore in lots of detail, causal questions can be very hard to answer. To understand why, suppose we were interested in the effect of an increase in cigarette taxes that took place in Durham in 2019 on smoking rates.

In a magical, idealized world, the way we would answer this causal question is by creating two worlds: one in which our causal treatment (i.e. the tax increase) was enacted, and one where it was not. Then we could ask “were smoking rates different in the 2019 Durham with a policy change as compared to the world where no policy change took place in Durham in 2019?”

In the real world, however, we can never actually see both the world with the policy change and a world without the policy change for the same unit of observation at the same moment in time. In the language of causal inference, we can never directly observe our counterfactual – the outcome that didn’t actually occur. This is what is referred to as the fundamental problem of causal inference.

To get around this fundamental problem, we must instead estimate what we think would have happened in Durham had there been no tax increase. For example, we might assume that Durham in 2018 (before the tax increase) is a good stand-in for a 2019 Durham without a tax increase; or we might think other counties in North Carolina in 2019 are a good model for what Durham would have been like absent a policy change.

But because we can never see Durham in 2019 absent the policy change, we can never really be sure if we’ve modelled our counterfactual accurately. We can only use our knowledge of the cases and circumstantial evidence to argue that what we’ve estimated is a good stand-in for our counterfactual world. And that, in a nutshell, is the art of causal inference.
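The two stand-ins described above can be sketched in a few lines. This is only a minimal illustration: the smoking rates below are invented, not real data, and the dictionary layout and variable names are my own assumptions.

```python
# Hypothetical smoking rates (share of adults who smoke). These numbers are
# invented for illustration and are NOT real data.
smoking_rate = {
    ("Durham", 2018): 0.20,
    ("Durham", 2019): 0.17,             # observed, with the tax increase
    ("Other NC counties", 2018): 0.21,
    ("Other NC counties", 2019): 0.19,
}

observed = smoking_rate[("Durham", 2019)]

# Stand-in 1: assume Durham in 2018 looks like a 2019 Durham without the tax.
counterfactual_before_after = smoking_rate[("Durham", 2018)]
effect_before_after = observed - counterfactual_before_after

# Stand-in 2: assume other counties in 2019 look like an untaxed Durham.
counterfactual_cross_section = smoking_rate[("Other NC counties", 2019)]
effect_cross_section = observed - counterfactual_cross_section

print(round(effect_before_after, 3), round(effect_cross_section, 3))
```

Notice that the two stand-ins give different estimates of the effect. Nothing in the data can tell us which one is closer to the truth, because the true counterfactual is never observed; that judgement has to come from what we know about the cases.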

Prediction / Extrapolation Questions

Finally, we come to what is likely the most trendy topic in data science: prediction!

In this course, we will use the terms “prediction” and “extrapolation” pretty interchangeably. Thus when we say “prediction”, we won’t just mean “offering guesses about what will happen in the future,” but also “what might happen if we consider cases that are outside the domain for which we currently have data.” That means that not only does building a model of future stock market returns count as prediction, but so too does using data from a set of mammogram scans that have already been analyzed by human radiologists to build a model to analyze mammograms that haven’t been reviewed by human radiologists.

Examples of predictive questions include things like:

  • In what parts of the city are we most likely to see opioid overdoses tomorrow?

  • What features of customers predict the likelihood they will actually buy something when they come to my website?

  • What features of MRI scans predict Alzheimer’s?

In this course, we will discuss two general forms of predictive analyses: predictive analyses based on causal inference, and predictive analyses based on supervised machine learning.

Prediction Based on Causal Methods

When one answers a causal question, one is generating an answer that, by its very nature, should be safe to use for prediction. If we run an experiment to test the effects of, say, statins on cholesterol, and we find that statins cause a reduction in cholesterol, then presumably we know what will happen if we give statins to more people.

As we will discuss in detail in this course, however, while causal methods often give us a very good basis for prediction, it is also important to understand their limitations. A drug study that only examined the effects of statins on White men between 45 and 75 with dangerously high cholesterol, for example, is probably a good basis for predicting the effect of statins on a 65-year-old White man with sky-high cholesterol; but how likely is it to predict outcomes for a 65-year-old Black man with sky-high cholesterol, or a 45-year-old woman with sky-high cholesterol, or a 65-year-old White man with only moderately high cholesterol?
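One way to make this concern concrete is to ask, for each new patient, whether they fall inside the population the study actually covered. The sketch below is only an illustration: the `Patient` class, the `within_study_population` helper, and the category labels are hypothetical, and the woman’s race is left unspecified because the example above doesn’t state it.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    age: int
    sex: str
    race: str
    cholesterol: str  # "moderately high" or "dangerously high"


def within_study_population(p: Patient) -> bool:
    """True if the patient resembles the (hypothetical) trial participants."""
    return (
        45 <= p.age <= 75
        and p.sex == "male"
        and p.race == "white"
        and p.cholesterol == "dangerously high"
    )


# The four patients from the paragraph above.
patients = [
    Patient(65, "male", "white", "dangerously high"),
    Patient(65, "male", "black", "dangerously high"),
    Patient(45, "female", "unspecified", "dangerously high"),
    Patient(65, "male", "white", "moderately high"),
]

covered = [within_study_population(p) for p in patients]
print(covered)
```

Only the first patient is inside the range of the study’s data. Predictions for the other three are extrapolations, and whether they are trustworthy depends on assumptions the study itself cannot check.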

Prediction Based on Supervised Machine Learning

If prediction is trendy right now, supervised machine learning is out-of-this-world trendy.

In this class, we’ll treat all supervised machine learning (SML) models as tools for “prediction”, since in essence all a supervised machine learning tool is designed to do is to predict what the agent that labeled the SML’s training data would do when given unlabeled data. For example, what an SML model that is trained using mammogram scans and diagnoses provided by human radiologists is fundamentally trying to do is guess, when given a new mammogram scan, what a human radiologist would label that data. (Note: there are some machine learning approaches – like reinforcement learning – that don’t fit this model, but those are currently pretty rare in data science applications.)
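To see what “predicting the labeler” means, consider this deliberately tiny sketch. It uses a one-nearest-neighbor rule, which is just one simple stand-in for an SML model, and it pretends each scan can be summarized by a single made-up number. The `radiologist_label` rule, the threshold, and the scan values are all hypothetical.

```python
def radiologist_label(scan_feature):
    """Hypothetical stand-in for how a human radiologist labels a scan."""
    return "suspicious" if scan_feature >= 0.6 else "normal"


# Training data: scans (one made-up feature each) plus the human's labels.
training_scans = [0.10, 0.30, 0.50, 0.55, 0.65, 0.70, 0.90]
training_labels = [radiologist_label(x) for x in training_scans]


def predict(new_scan):
    """One-nearest-neighbor: copy the label of the most similar training scan."""
    nearest = min(range(len(training_scans)),
                  key=lambda i: abs(training_scans[i] - new_scan))
    return training_labels[nearest]


# The model never sees the ground truth about disease, only the human labels,
# so its guesses are guesses about what the radiologist would have said.
print(predict(0.62), predict(0.25))
```

If the radiologist’s labels were systematically wrong in some region, the model would faithfully learn to be wrong there too.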

As we’ll discuss, this can be deeply problematic, as the behavior of a supervised machine learning algorithm becomes deeply intertwined with the nature of the training data it has been given. For example, suppose an SML algorithm is trained on real world drug arrest data that reflects the fact that a Black citizen is more likely to be arrested for drugs than a White citizen despite similar rates of drug use. If that algorithm is then used in court to evaluate whether it’s safe to release someone on bail pending trial, then that algorithm will say that a Black defendant is more likely to be arrested in the future (reflecting the racial bias in its training data) despite the fact that the Black defendant may be no more likely to use drugs than a White defendant, making it less likely that the Black defendant will be released on bail.

And to be clear, this type of bias doesn’t find its way into SML models because the people training them were careless in writing their SML; the tendency of SML to reflect biases is intrinsic to SML itself. What makes SML exciting as a prediction tool is that it’s designed to find patterns in the data that explain variation in outcomes without programmers having to identify those patterns explicitly. But that means that so long as we live in a society where race, gender, nationality, etc. shape outcomes, SML algorithms will do their best to use these sources of variation because they explain variation. And the only goal of an SML model is to explain variation.
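A small simulation can show how this happens even when nobody writes a biased line of code. In the sketch below, two groups (labeled A and B as neutral stand-ins) have exactly the same underlying rate of drug use, but users in group A are more likely to be arrested. All rates are invented for illustration, and the “model” is the simplest one I could think of: it predicts each group’s arrest risk from the arrest frequency in its training data.

```python
import random

random.seed(0)  # make the simulation repeatable

USE_RATE = 0.10  # hypothetical: same true drug-use rate in both groups
ARREST_RATE_GIVEN_USE = {"A": 0.30, "B": 0.10}  # hypothetical unequal enforcement


def simulate_person(group):
    """Draw one person: do they use drugs, and are they arrested for it?"""
    uses = random.random() < USE_RATE
    arrested = uses and random.random() < ARREST_RATE_GIVEN_USE[group]
    return {"group": group, "uses": uses, "arrested": arrested}


training_data = [simulate_person(g) for g in ("A", "B") for _ in range(50_000)]


def rate(rows, key):
    """Share of rows for which rows[key] is True."""
    return sum(r[key] for r in rows) / len(rows)


by_group = {g: [r for r in training_data if r["group"] == g] for g in ("A", "B")}

# "Training": with arrest as the label and group as the only feature, the
# prediction that best fits the data is each group's observed arrest frequency.
predicted_arrest_risk = {g: rate(rows, "arrested") for g, rows in by_group.items()}

# What we actually care about, which the model never sees as a label.
true_use_rate = {g: rate(rows, "uses") for g, rows in by_group.items()}

print("predicted arrest risk:", predicted_arrest_risk)
print("true drug-use rate:   ", true_use_rate)
```

The model scores group A as far riskier than group B even though drug use is essentially identical across the two groups. It has done exactly what it was asked to do, which is explain variation in the labels it was given.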

But don’t worry – we won’t only talk about SML bias. We’ll also discuss – using the framework we develop in our discussion of causal inference – when SML algorithms can be safely used to make predictions, and when they tend to be very “fragile”, and unable to adapt to new contexts.

Conclusion

This has been a quick tour of the taxonomy of questions we’ll be examining in this class, and some of the issues associated with each. I know that it introduces a lot of material. Please do your best to wrestle with what is provided here, and read this over a few times. It will help contextualize everything we do moving forward. At the same time, however, know that we’ll return to all of the topics introduced here in much more detail later, so don’t worry if you don’t feel like you’ve fully internalized everything this covers.