Taxonomy of Questions¶
A central focus of this course will be thinking about how the tools of data science can best be brought to bear on different types of questions about the world. To that end, we must begin by introducing a taxonomy of questions.
To be clear, this taxonomy is my own, and thus you may not find that everyone you encounter will immediately recognize the distinctions that, by the end of this class, I hope will be clear to you. Indeed, part of the reason that data science is so fragmented is that different disciplines tend to focus (almost myopically) on certain classes of questions. As a result, they often fail to recognize that while the tools they use may be effective for their preferred questions, different tools may be required by those interested in different questions. In this class, we will strive to recognize the merits of all forms of inquiry, and develop the skills necessary to properly approach any question that comes our way.
In this course, we will use a three-fold taxonomy of questions:
Descriptive Questions: Questions about the current (or past) state of the world. Descriptive questions are often about measuring things that haven’t previously been measured, or identifying previously unseen patterns.
Causal Questions / Predictions with Manipulation: Causality is fundamentally about understanding causes and effects – why does the world look the way it does? What caused the current state of the world? And crucially, if we were to try and manipulate the world in some way, what would the effect of that manipulation be?
Classification Questions / Prediction Without Manipulation: Questions about trying to guess an observation’s unobserved “type”, or to predict behavior that will occur in the future (but in a world where we are passive observers; in other words, situations where we want to predict future events we don’t play an active role in shaping).
Positive versus Normative Questions¶
Before we dive into each of these types of questions, it is worth pausing to emphasize that this is a taxonomy of questions about how the world is, not about how the world should be. That’s because while data science is an amazing tool for telling us about the state of the world, it cannot, on its own, answer should questions. That is because answering should questions requires evaluating the desirability of different possible outcomes, and that can only be done on the basis of a system of values. Data science may tell us the consequences of different courses of action, but it cannot tell us whether those consequences make a given course of action worthwhile.
To illustrate, suppose you are interested in reducing opioid overdoses. Your rigorous data science analysis may tell you that increasing the regulation of opioid prescriptions will reduce overdoses by some amount X and reduce access to opioids for those with chronic pain by some amount Y. But does that mean you should enact the policy? Well, that depends on how much value you place on patients with chronic pain having access to opioids, and how much value you place on preventing overdoses. And the answers to those questions simply can’t come from your data.
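To make that point concrete, here is a minimal sketch. Everything in it is hypothetical: the values of X and Y are invented, and the function `should_enact` and its "weights" are my own illustrative device, not a recommended way of making policy. The point is only that two people who fully agree on the data can still disagree on the decision.

```python
# Hypothetical outputs of the positive analysis (invented numbers, not real estimates).
OVERDOSES_PREVENTED = 100        # "X" in the text
PATIENTS_LOSING_ACCESS = 2000    # "Y" in the text


def should_enact(weight_on_overdose, weight_on_lost_access):
    """Return True if the policy looks worthwhile under the given value weights.

    The weights are value judgements supplied by the decision-maker;
    nothing in the data determines them.
    """
    benefit = OVERDOSES_PREVENTED * weight_on_overdose
    cost = PATIENTS_LOSING_ACCESS * weight_on_lost_access
    return benefit > cost


# Same X and Y, different values, different answers.
decision_a = should_enact(weight_on_overdose=50, weight_on_lost_access=1)
decision_b = should_enact(weight_on_overdose=10, weight_on_lost_access=1)
print(decision_a, decision_b)  # True False
```

Notice that the data only ever enters through the two constants at the top. The arguments passed to `should_enact` have to come from somewhere else.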
In this course, we will refer to questions about how the world is as “positive questions”, and questions about how the world should be as “normative questions”. But be aware these terms aren’t universal. Some people use the term “descriptive” instead of positive, and “prescriptive” instead of normative. Since we are using the term “descriptive” for a different purpose in this class, I will use normative and positive throughout this course to avoid confusion.
Descriptive questions are questions about the current or past state of the world. In the words of John Gerring (we’ll read more of his work for a future class):
A descriptive argument describes some aspect of the world. In doing so, it aims to answer what questions (e.g., when, whom, out of what, in what manner) about a phenomenon or a set of phenomena. Descriptive arguments are about what is/was. For example: “Over the course of the past two centuries there have been three major waves of democratization.”
Of all the types of questions we will study in this class, descriptive questions tend to be the least appreciated, but I would argue that in many ways they are the most important. That is because descriptive analyses are often the foundation for all other work. After all, it is only by first understanding the patterns in our world that we can then move on to asking questions about how they arose, or how they may evolve in the future.
To illustrate, let’s consider a few important descriptive analyses:
Descriptive Example 1: Nope, This Time Isn’t Different¶
A great example of descriptive analysis comes from economics. In 2009, Carmen Reinhart and Kenneth Rogoff published a book called This Time Is Different: Eight Centuries of Financial Folly. In it, they comprehensively analyze hundreds of years of economic history across more than sixty countries to document how, despite pundits regularly declaring that “this time things are different,” financial crises occur with remarkable frequency, duration, and ferocity. They offer some theories about why this may be, but their core contribution is documenting clearly that whatever it is that drives financial crises, it is not something specific to any geography or period, but rather something common to human economic systems the world over. In the process, they wipe away dozens of attempts to explain specific financial crises as special cases.
Descriptive Example 3: Disease Surveillance¶
It is hard to think of a public health discovery that didn’t start with disease surveillance – the practice of keeping descriptive statistics about causes of death or disease. Efforts to understand HIV began when public health officials saw a huge rise in gay men dying from diseases that shouldn’t have been fatal for young, otherwise healthy men, and research into the role of cigarettes in causing lung cancer started when data showed that lung cancer rates were exploding across the world.
Descriptive Example 4: Global Warming¶
Before we began to develop a rigorous understanding of the dynamics that were causing our climate to warm, we first had to become aware that, well, our climate was warming! Yup, yet another current discipline that began with a “simple” descriptive analysis: what is the temperature of the Earth, and how has it changed over time? (I put “simple” in quotes because there’s nothing actually simple about measuring the temperature of the entire world over, initially, decades and, later, centuries.)
Descriptive Questions and EDA¶
I have found that many students equate efforts to answer descriptive questions with doing Exploratory Data Analysis (EDA). While there is a sense in which EDA is a form of descriptive analysis, it is unfair to descriptive analysis to equate the two activities. When most people talk about EDA, they’re talking about noodling around with a data set that they have been given before they start to analyze it in detail. In other words, it is a relatively casual exploration of a provided data set meant primarily to help the user get familiar with the specific variables they plan to put in a fancier model.
Real descriptive analysis, by contrast, often necessitates collecting / compiling novel data that is designed to shed light on a phenomenon of interest. It is not just a step towards modelling, but rather a research endeavor in its own right, particularly when one starts doing more complicated forms of descriptive analyses (e.g., fitting unsupervised machine learning models to learn about the latent structure of the data; fitting descriptive statistical models; etc.).
By now you can hopefully recognize the importance of descriptive analyses. Description is rarely the end of the scientific inquiry, but it is often the start, and without it, it is hard to imagine where we would be today.
And while descriptive analyses may seem simple (e.g. Vera Rubin just measured the rate of rotation of stars in a galaxy, and Carmen Reinhart and Ken Rogoff just documented every financial crisis in the past several hundred years) because they often don’t entail sophisticated machine learning algorithms or extremely complicated statistical models, in truth descriptive analyses are often extremely difficult to undertake in their own way. For example, because they usually entail measuring something that hasn’t been measured before, descriptive analyses generally require massive data collection efforts and innovations in measurement. And as we will also discuss in future classes, in some ways they put even greater demands on the researcher (and the researcher’s judgement and case knowledge) than some of the other types of questions we will discuss.
Causal Questions and Prediction-with-Manipulation¶
Causal questions are questions about causes and effects, and why we see certain patterns in the world. They often take the form of “What is the effect of X on Y”, where X could be a drug and Y is a disease, or where X is a government policy and Y is a public health or economic outcome.
In many cases, as described above, causal questions are prompted by the answers to descriptive questions. Vera Rubin discovered that the rotational curves of galaxies couldn’t be explained by current physics, so now dozens of experiments are trying to find particles whose presence would cause the patterns she has documented.
To borrow once more from John Gerring:
causal arguments attempt to answer why questions. Specifically, they assert that one or more factors generate change in some outcome, or generated change on some particular outcome. They imply a counterfactual. For example: “The third wave of democratization was caused, in part, by the end of the Cold War.” It will be seen that descriptive arguments are nested within causal arguments. Both X and Y are descriptive statements about the world upon which the causal argument rests.
Causal questions are, honestly, some of the easiest to come up with:
What is the effect of minimum wage laws on unemployment?
What is the effect of DRUG X on DISEASE Y?
What is the effect of Iowa voting first in US Presidential Primaries/Caucuses on the types of politicians who become president?
At the same time, though, they can also be awfully hard to answer…
Why Causal Inference is Hard¶
As we have explored in lots of detail, causal questions can be very hard to answer.
What follows should be review, but since people occasionally stumble on these pages out of sequence, I’ll reiterate some of what we’ve discussed before: To understand why, suppose we were interested in the effect of an increase in cigarette taxes that took place in Durham in 2019 on smoking rates.
In a magical, idealized world, the way we would answer this causal question is by creating two worlds: one in which our causal treatment (e.g. the tax increase) was enacted, and one where it was not. Then we could ask “were smoking rates different in the 2019 Durham with a policy change as compared to the world where no policy change took place in Durham in 2019?”
In the real world, however, we can never actually see both the world with the policy change and a world without the policy change for the same unit of observation at the same moment in time. In the language of causal inference, we can never directly observe our counter-factual – the outcome that didn’t actually occur. This is what is referred to as the fundamental problem of causal inference.
To get around this fundamental problem, we must instead estimate what we think would have happened in Durham had there been no tax increase. For example, we might assume that Durham in 2018 (before the tax increase) is a good stand-in for a 2019 Durham without a tax increase; or we might think other counties in North Carolina in 2019 are a good model for what Durham would have been like absent a policy change.
But because we can never see Durham in 2019 absent the policy change, we can never really be sure if we’ve modelled our counter-factual accurately. We can only use our knowledge of the cases and circumstantial evidence to argue that what we’ve estimated is a good stand-in for our counter-factual world. And that, in a nutshell, is the art of causal inference.
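To see how much rides on the choice of stand-in, here is a minimal sketch using a toy simulation. None of these numbers are real: the smoking rates, the statewide trend, and the size of the tax effect are all invented, and the variable names are hypothetical. Because it is a simulation, we get to peek at both potential outcomes for Durham, which is exactly what we can never do with real data. The third estimate combines the two stand-ins described above (an approach often called difference-in-differences); it recovers the truth here only because I built a common trend into the simulation.

```python
# Invented smoking rates (percent of adults). Nothing here is real data.
TRUE_TAX_EFFECT = -2.0    # the tax lowers Durham's 2019 smoking rate by 2 points
STATEWIDE_TREND = -1.0    # smoking fell everywhere between 2018 and 2019

durham_2018 = 18.0
other_counties_2018 = 21.0    # other NC counties start from a higher level

# Both potential outcomes for Durham in 2019. In reality we only ever see one.
durham_2019_no_tax = durham_2018 + STATEWIDE_TREND
durham_2019_with_tax = durham_2019_no_tax + TRUE_TAX_EFFECT
other_counties_2019 = other_counties_2018 + STATEWIDE_TREND

observed = durham_2019_with_tax    # the tax happened, so this is what we observe

# Stand-in 1: treat Durham in 2018 as the counter-factual.
before_after_estimate = observed - durham_2018

# Stand-in 2: treat other counties in 2019 as the counter-factual.
cross_county_estimate = observed - other_counties_2019

# Stand-in 3: assume Durham would have followed the other counties' trend.
trend_counterfactual = durham_2018 + (other_counties_2019 - other_counties_2018)
trend_adjusted_estimate = observed - trend_counterfactual

print("true effect:          ", TRUE_TAX_EFFECT)          # -2.0
print("before/after estimate:", before_after_estimate)    # -3.0
print("cross-county estimate:", cross_county_estimate)    # -5.0
print("trend-adjusted:       ", trend_adjusted_estimate)  # -2.0
```

The before/after comparison is off because it attributes the statewide trend to the tax, and the cross-county comparison is off because Durham started from a different level. With real data we would not know which assumptions hold, which is why arguing for a stand-in takes case knowledge.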
Causal Inference, Manipulations, and Prediction¶
The father of the counter-factual model of causality we have worked with in this class once famously said “there is no causation without manipulation.” This is probably a bit of a simplification, but not much of one; especially in the context of business or public policy, we are generally interested in causal inference because we are considering enacting a manipulation (say, a new tax policy, or building a new store that could be in a neighborhood with wealthy residents or a neighborhood with middle-income residents), and we want to be able to predict the result of that manipulation.
In the three examples listed above, for example, we’re likely interested in the effect of minimum wage laws on unemployment because we’re thinking about changing (“manipulating”) a city’s minimum wage laws; in the effect of Drug X on Disease Y because we probably want to use Drug X to treat Disease Y / take Drug X off the market to prevent more of Disease Y; and the effect of Iowa’s first-in-the-nation caucuses because we might want to change the rules around the Presidential primary process.
As we’ll discuss below, however, not all “prediction” questions are about understanding the effect of manipulations, and recognizing whether the question you are asking is one that entails understanding the effect of a manipulation or just predicting future outcomes is really important.
Classification Questions / Prediction-Without-Manipulation¶
Finally, we come to what is likely the most trendy topic in data science: prediction and classification! In other words, the domain of supervised machine learning.
In this course, we will group “classification” and “prediction-without-manipulation” into one bin. Thus this section isn’t just about figuring out “what group does this observation likely belong to?”, but also “what do we think this observation may do in the future.” That means that not only does building a model of future stock market returns count as prediction, but so too does using data from a set of mammogram scans that have already been analyzed by human radiologists to build a model for identifying tumors in mammograms that haven’t been reviewed by human radiologists.
Examples of classification / prediction without manipulation questions include things like:
In what parts of the city are we most likely to see opioid overdoses tomorrow?
What features of customers predict the likelihood they will actually buy something when they come to my website?
What features of MRI scans predict Alzheimer’s?
Most classification / prediction-without-manipulation is the domain of supervised machine learning (SML). SML is the label applied to an incredibly powerful set of tools, but it is a set of tools that tends to have one fundamental flaw: it is really bad at out-of-sample extrapolation. Basically, SML tools tend to give garbage predictions if you give them input data that looks really different from your training data.
(Why? I would argue that it is mostly because SML tools aren’t trying to model actual physical or social processes that have analogues in the real world; they are just looking for consistent structures in the data. This works well if the data to which you want to apply your model has the same internal structure as your training data, but if you give it data that doesn’t look like your training data, there’s no guarantee the same patterns will exist.)
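Here is a minimal sketch of that failure. I’m using a hand-rolled nearest-neighbors regression as a stand-in for “an SML tool” (my choice for illustration; other algorithms fail in different ways, but none comes with a guarantee outside the training range). The data-generating process `true_process` and all numbers are invented.

```python
def knn_predict(train_x, train_y, x_new, k=2):
    """Predict y at x_new as the average y of the k nearest training points."""
    by_distance = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x_new))
    nearest = by_distance[:k]
    return sum(train_y[i] for i in nearest) / k


def true_process(x):
    """Hypothetical process that generated the data (unknown to the model)."""
    return x ** 2


# Training data only covers x between 0 and 10.
train_x = [i * 0.5 for i in range(21)]    # 0.0, 0.5, ..., 10.0
train_y = [true_process(x) for x in train_x]

inside = 5.25    # looks like the training data
outside = 50.0   # looks nothing like the training data

for x in (inside, outside):
    predicted = knn_predict(train_x, train_y, x)
    print(f"x={x}: predicted={predicted}, truth={true_process(x)}")
# x=5.25: predicted=27.625, truth=27.5625
# x=50.0: predicted=95.125, truth=2500.0
```

Inside the range of the training data the prediction is nearly perfect. Far outside it, the model just repeats what it saw at the edge of the training data, because it learned the pattern in the sample and not the process that produced it.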
What does this have to do with manipulation? Well, a “manipulation”, almost by definition, is an intervention that results in changes in the world. So whenever we introduce a significant manipulation, we are moving to a world that is “out-of-sample.”
To be clear, out-of-sample extrapolations are always hard for statistical models (see: our discussion of external validity considerations for causal models!). But in trying to estimate a causal relationship, causal inference is often doing its best to understand real-world relationships between factors (say, that smoking causes lung cancer deaths), not just correlations. SML, by contrast, is an explicitly “we just want good correlations” endeavor. So if you asked an SML model what would happen if you improved lung cancer treatments, it would say “well, lung cancer deaths and smoking are positively correlated, so fewer lung cancer deaths would reduce smoking!” (this is a transparently ridiculous example, but similarly ridiculous things happen in more subtle ways with complicated black box algorithms).
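The smoking example can be sketched in a few lines. This is a toy simulation under assumptions I’ve made up: regional smoking rates are drawn at random, smoking causes deaths with an invented coefficient, and a “better treatment” manipulation halves deaths without touching smoking. A simple least-squares line stands in for the predictive model, and the helper `fit_line` is my own.

```python
import random
from statistics import mean

random.seed(0)

# Invented data-generating process: smoking causes lung cancer deaths.
smoking = [random.uniform(10, 30) for _ in range(200)]      # % smokers per region
deaths = [2.0 * s + random.gauss(0, 3) for s in smoking]    # deaths per 100,000


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    covariance = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    variance = sum((xi - x_bar) ** 2 for xi in x)
    slope = covariance / variance
    return y_bar - slope * x_bar, slope


# A purely predictive model: guess a region's smoking rate from its deaths.
intercept, slope = fit_line(deaths, smoking)

# Manipulation: a better treatment halves deaths. Smoking itself is untouched.
deaths_after = [0.5 * d for d in deaths]
predicted_smoking_after = [intercept + slope * d for d in deaths_after]

predicted_change = mean(predicted_smoking_after) - mean(smoking)
actual_change = 0.0    # by construction, nobody's smoking behavior changed

print(f"slope of smoking on deaths: {slope:.2f}")
print(f"model's predicted change in smoking: {predicted_change:.1f} points")
print(f"actual change in smoking: {actual_change:.1f} points")
```

The model works fine for its original job (guessing smoking rates from death rates in the world that produced the data). It fails only when we ask it about a world we have manipulated, where the correlation it learned no longer holds.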
So when using Supervised Machine Learning models, it’s best to use them in contexts where you want to predict outcomes in a world you aren’t manipulating. That way you are more likely to get valid predictions.
To be clear, you may want to make predictions of what would happen if you did nothing so that you can then try to do something (e.g., predict bad health outcomes so you can target those patients for extra care!), but in those situations, just remember all your model can tell you is what would happen in a world without your manipulation; don’t use it to predict outcomes post-intervention.
Oh, but be careful about situations where you have a manipulation that seems like it occurs after you’ve used your model to make a prediction, but where awareness that a model is being employed itself constitutes a manipulation that changes the world enough to ruin your model. For example, in your other reading you’ll read about a school that employed an SML model to grade student essays. This may seem benign—the model is grading essays after they’ve been written, so the manipulation is after the model has been used—but when students realized an SML model was being used, it changed their behavior. This change in behavior resulted in the world changing enough that the model was now effectively doing “out-of-sample” extrapolations that turned out to be meaningless!
This has been a quick tour of the taxonomy of questions we’ll be examining in this class, and some of the issues associated with each. I know that it introduces a lot of material. Please do your best to wrestle with what is provided here, and read this over a few times. It will help contextualize everything we do moving forward. At the same time, however, know that we’ll return to all of the topics introduced here in much more detail later, so don’t worry if you don’t feel like you’ve fully internalized everything this covers.