Internal and External Validity
When evaluating any study, it is often helpful to think about two different types of study validity: internal and external.
The internal validity of a study is the degree to which it has accurately interpreted its case. In the context of causal inference research, internal validity is about whether a study has accurately measured a causal effect in the context being studied. That is to say, how confident are we that the reported causal effect is the real causal effect for the specific entities in the study?
External validity, by contrast, is about whether we think the results of a given study are likely to generalize to other contexts.
To illustrate the difference, suppose a new video streaming service sent out an e-mail offering new users a deal on subscriptions, and then measured the difference in sign-up rates between the users who got the deal and users who just got a generic e-mail with information about the service.
The internal validity of the study is the degree to which the study accurately measured the causal effect of the offer on signup rates. Internal validity hinges on things we’ve talked about a lot in class, like whether the people who received the deal had the same potential outcomes as the people who got the generic email.
The external validity of the study, by contrast, is about whether we think the estimated effect is the same effect we would see if we tried to send out a similar email to recruit customers to an established streaming service (instead of a new one), or if we tried to use a similar offer to recruit people to a new music streaming service.
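To make the distinction concrete, here is a minimal simulation sketch of the email example. All of the numbers (baseline sign-up rates, the size of the offer's effect, the sample size) and the function name `estimate_offer_effect` are hypothetical and chosen purely for illustration. Because the deal is assigned by coin flip, the difference in sign-up rates should land close to the true effect in whichever population is simulated (internal validity), but nothing in the first study tells us the effect would be the same in the second population (external validity).

```python
import random


def estimate_offer_effect(n_users, base_rate, true_lift, seed=0):
    """Simulate a randomized email experiment and return the
    difference in sign-up rates (deal email minus generic email)."""
    rng = random.Random(seed)
    signups = {True: 0, False: 0}
    emails_sent = {True: 0, False: 0}
    for _ in range(n_users):
        got_deal = rng.random() < 0.5  # coin-flip assignment to the deal email
        signup_prob = base_rate + (true_lift if got_deal else 0.0)
        emails_sent[got_deal] += 1
        signups[got_deal] += rng.random() < signup_prob
    return (signups[True] / emails_sent[True]
            - signups[False] / emails_sent[False])


# Hypothetical parameters: the offer is assumed to matter a lot for an
# unknown new service and very little for an established one.
new_service = estimate_offer_effect(200_000, base_rate=0.05, true_lift=0.05)
established_service = estimate_offer_effect(
    200_000, base_rate=0.20, true_lift=0.01, seed=1
)

print(f"Estimated effect, new service:         {new_service:.3f}")
print(f"Estimated effect, established service: {established_service:.3f}")
```

Under these assumed parameters, each estimate is a good answer to the question "what was the effect in this population?", yet using the first estimate to predict the second would be badly off.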
All studies are subject to both types of concerns, and as we’ll discuss below, there are often trade-offs between internal and external validity, especially in causal research.
As described above, internal validity is about whether we think a study has accurately interpreted the case it is studying. In the context of causal inference, this amounts to asking whether the study properly estimated the causal effect studied for the set of entities it was focused on.
Internal validity is about the concerns we’ve been focused on in this class so far, like whether our control and treatment groups have the same potential outcomes. As we will discuss again and again, the validity of any causal estimate depends on whether we believe a set of unverifiable assumptions are met. Evaluating whether those assumptions are met is a big part of evaluating the internal validity of a study.
With that said, internal validity also rests on all sorts of more banal concerns, like whether the researchers actually measured their outcomes correctly, calculated the right standard errors, and chose a reasonable model specification.
Internal validity is often the focus of causal inference textbooks and classes, and we’ll introduce even more conditions that have to be met for a study to have good internal validity as we continue (like SUTVA). But while internal validity is important, it’s not the only thing we care about, because there’s also…
External validity is fundamentally about the generalizability of a study: whether the causal estimate found in a study is likely to also be a good guess for the causal effect in a different context.
External validity is one of the most important things to think about as a consumer of other people’s research, because when you read other people’s research, you’re usually doing so because you’re looking for information you can use to address a specific problem you face. In these situations, it’s critical that you always ask yourself: are the results from this study likely to also be valid in the context of my problem?
Of course, when asking about the external validity of a study, we have to specify the setting to which we want to generalize its results. A study that looks at how Duke undergraduates’ consumer behavior changes when faced with different types of ads on Google may have good external validity in terms of its generalizability to other elite universities like Emory, Vanderbilt, or UNC. But it might not generalize to the US population as a whole.
This means that external validity is different from internal validity in an important way: when faced with the same facts about a study, everyone should generally agree on the internal validity of a study, but the external validity of a study really depends on how you want to use the results.
External Validity Considerations
There are many reasons that the results of a study may not generalize to a new context. Here are a handful of the most common issues to bear in mind:
The study population may be different from the population in the new context.
Almost by definition, the entities in the new context will be different from the entities in the original study (even if we’re working with the same people, we’re looking at them at a different time). But the key question for external validity is whether the entities in the new context are different in a way that would impact their response to a given treatment.
It’s not hard to think of reasons that different populations may respond differently to a given treatment. For example, suppose a company finds ads for luxury cars increase sales among rich people in New York. It’s hard to imagine that the same ad run in a poor neighborhood in Detroit would have the same effect.
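One way to see why population composition matters is to write the average effect in a population as a weighted average of subgroup effects, weighted by each subgroup's share of that population. The sketch below uses made-up effect sizes and population shares (and a hypothetical helper, `population_average_effect`); it assumes, for simplicity, that each subgroup responds the same way in both contexts and that only the mix of subgroups changes.

```python
def population_average_effect(effect_by_group, share_by_group):
    """Weighted average of subgroup effects, weighted by population shares."""
    total_share = sum(share_by_group.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    return sum(effect_by_group[g] * share_by_group[g] for g in share_by_group)


# Hypothetical effect of the luxury car ad on purchase probability.
effect_by_group = {"high_income": 0.04, "low_income": 0.0}

# Hypothetical composition of the study population and the new context.
study_shares = {"high_income": 0.9, "low_income": 0.1}
new_context_shares = {"high_income": 0.1, "low_income": 0.9}

study_effect = population_average_effect(effect_by_group, study_shares)
new_context_effect = population_average_effect(effect_by_group, new_context_shares)

print(f"Average effect in study population: {study_effect:.3f}")
print(f"Average effect in new context:      {new_context_effect:.3f}")
```

Even with identical subgroup-level effects, the average effect differs sharply across the two contexts simply because the populations are composed differently. In practice, the subgroup effects themselves may also differ, which this sketch does not model.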
As you think about population differences, make sure you consider not only standard demographic attributes (age, gender, wealth, education), but also cultural or social differences. Many issues businesses deal with – especially advertising and brand image – may be culturally specific, and so may not generalize to all communities.
This may all seem obvious as you read it, but using unrepresentative samples in research and medicine and then making recommendations for the general public is a huge problem in the real world.
White men are massively over-represented in medical trials, for example. Unsurprisingly, this means that when the results of those trials are generalized to the population as a whole, we suddenly discover (SURPRISE) that the predicted results didn’t always hold for women or people of color! (e.g., drug doses set for men are often too high for women; some heart drugs work great for White men, but often interact poorly with a gene common in Asians and Pacific Islanders; and multiple sclerosis turns out to be driven by a different mutation in Black patients than in patients of European descent).
And for the longest time, psychology research was based almost entirely on studies conducted using student volunteers. But of course, students at elite universities are not a representative population – they’re disproportionately Western, Educated, from Industrialized, Rich, and Democratic countries (they’re WEIRD). And as a result, our academic model of human behavior is really just a model of a bunch of WEIRD kids.
Unrepresentative training data is also one of the reasons that so many machine learning algorithms are just plain racist (this isn’t causal inference, but it’s the same idea) – if you train a facial recognition algorithm using predominantly white faces, it turns out that it will either fail to detect Black faces or, worse, misidentify people of color (which is a really bad thing when those algorithms are being used by the police).
So while internal validity issues may seem more sophisticated and thus interesting, don’t overlook the importance of these kinds of external validity issues!
The treatment might differ between study and new context
A study may declare that it has measured the effect of billboard ads on sales, or of infinite scroll on engagement. But it’s always important to remember that while we may interpret studies in these general terms, in reality the billboard study probably measured the effect of a specific set of billboard ads on sales, and the infinite scroll study looked at the effect of infinite scroll in a specific app.
So always be careful to think about what exactly the treatment in a study was, and whether it’s likely to generalize to the case you care about.
There may be scaling effects
Often when we’re thinking about external validity, we’re not just thinking about re-using a treatment or intervention; we’re thinking about scaling it up.
But an intervention that is applied to a few people, or is only in place for a short period, may not be a good model for what happens when that same intervention is applied at scale or permanently. For example, the returns to showing people a TV ad about your company for the first time are probably not the same as the returns to airing that ad the 1,000,000th time. Or sales from selling a special product at one store for a limited time may not be a good indicator of the sales you would see if your “special product” were available everywhere all the time.
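The TV ad example can be sketched numerically. The code below assumes, purely for illustration, that each additional airing of an ad generates a fixed fraction (`decay`) of the sales generated by the previous airing. That geometric form, the function names, and all of the numbers are assumptions rather than anything estimated from data. The point is only that extrapolating linearly from a small pilot can badly overstate the effect at scale when returns diminish.

```python
def cumulative_ad_effect(n_airings, first_airing_effect, decay):
    """Total added sales when each airing yields `decay` times the
    effect of the airing before it (assumed diminishing returns)."""
    return sum(first_airing_effect * decay**k for k in range(n_airings))


def naive_extrapolation(n_airings, first_airing_effect):
    """Scale-up that assumes every airing works as well as the first."""
    return n_airings * first_airing_effect


# Hypothetical pilot result: the first airing adds 100 sales.
pilot_effect = cumulative_ad_effect(1, first_airing_effect=100, decay=0.5)

# Scaling to 3 airings under the assumed 50% decay per airing.
actual_at_scale = cumulative_ad_effect(3, first_airing_effect=100, decay=0.5)
predicted_at_scale = naive_extrapolation(3, first_airing_effect=100)

print(pilot_effect, actual_at_scale, predicted_at_scale)
```

Here the pilot is a perfectly accurate measurement of the first airing, but multiplying it by the number of airings predicts far more sales than the assumed diminishing-returns process actually delivers.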
People may also respond differently to an intervention when it gets big or becomes permanent. To illustrate, I’d like to tell a story about a famous experiment in India (paper).
Rural health clinics in India have a huge problem: nurse absenteeism. In the late 2000s, an NGO (along with some MIT economists) decided to see if they could fix it. The NGO started keeping track of when nurses clocked in and out, and then shared the information with the government, which then applied fines or punishments to nurses who weren’t showing up for work.
Initially, the intervention was successful, leading to very large increases in attendance (doubling it in fact!) after a few months. But as nurses came to realize this wasn’t just a little study but actually something that was going to be around for a while, they mobilized politically, and soon administrators were allowing nurses to claim an increasing number of “exempt days”, avoiding punishment. And so sure enough, nurses stopped coming to work, and absenteeism had returned to pre-intervention levels 16 months after the program began.
This is an example of what economists call a “general equilibrium” effect – when we introduce a treatment to the world, the world responds. But often these responses don’t happen in small trials the same way they do when policies go big, creating serious generalizability problems.
Relatedly: if you are a public policy person or an economic development person, I cannot recommend this paper by Angus Deaton and Nancy Cartwright enough for discussing the limitations of RCTs for learning about the effects of policy or nature of social processes. It’s a long, very thoughtful paper, but it’s really, really good.
Trade-Offs Between Internal and External Validity
OK, great. So let’s just maximize internal AND external validity whenever we’re doing research!
The problem is that there is often a trade-off between internal and external validity. That’s because the best way to ensure internal validity is to try to control everything you can about the entities being studied, which often means doing things like bringing people into a lab setting, or making an experiment quick so you can monitor people carefully. After all, the more control you have, the more sure you are that all the assumptions necessary for your design to generate a valid causal estimate are met.
But… the more you try to control everything, the more artificial the environment you’re studying becomes, and (potentially) the less likely the results you see in the lab are to match what we’d see in the real world.
We’ve actually already seen examples of this in our reading. As you may recall, in the first chapter of Mastering ’Metrics, we learned about two studies of the effects of having health insurance.
The first was a study conducted by RAND in which participants were enrolled in different kinds of insurance. Everyone who was enrolled got at least catastrophic coverage, and then participants were enrolled in plans with different co-pay structures.
The second was the Oregon Health Plan (OHP, the Oregon version of Medicaid). In the OHP study, a lottery was conducted to determine which applicants to Medicaid would actually be allowed to enroll (since they didn’t have the funds to enroll everyone who was eligible).
On the one hand, the RAND study could probably be said to have better internal validity – everyone who enrolled got the insurance policy they were assigned. In the OHP program, by contrast, only about 25% of the people who won the lottery actually enrolled in OHP, meaning that the lottery winners who actually got insurance may have been different from the average person who won the lottery (something called “low compliance”, which we’ll talk about next week).
But participants in the RAND study were much older, wealthier, and better educated than the average uninsured American. Moreover, to get people to enroll in their carefully conducted study, RAND had to give everyone in the study at least catastrophic coverage. That means the people in the study didn’t look like average Americans, and the control group wasn’t fully uninsured.
In the OHP study, by contrast, the people in the study were exactly the population of people who are uninsured in the US, and the exact type of insurance they got was Medicaid.
So in terms of internal validity, RAND was probably more successful; but in terms of external validity with respect to whether the results would generalize to expansion of government insurance in the US, many would argue that despite its internal validity problems, the OHP study is probably more informative.
This tension between internal and external validity exists everywhere: bringing users into a lab allows researchers to study how users interact with different user interfaces with remarkable precision, and researchers can be sure participants are only using the interface they’re supposed to; but who knows if those same users would act differently if they were at home and their dog was barking and they were hurrying to answer an email from their boss?
Internal validity is really important, and understanding how to evaluate internal validity is hard because of the challenges caused by the fundamental problem of causal inference.
But remember that internal validity isn’t everything: when evaluating research, always think about both internal and external validity.