Beyond The Experiment

The Role for Causal Inference with Observational Data in Industry

I don’t think there’s anyone who would question the importance of A/B testing in the tech sector today. It is used constantly to help companies refine their products, better target their ads, and incrementally innovate.

But while A/B testing is definitely a skill a good data scientist should have in their toolbox, it’s important to understand that it is not the be-all and end-all of causal inference in industry. In this reading, we’ll discuss why it’s important to learn how to think critically about causality when working with observational data (“observational data” is data that we gather by passively observing the world – or that somebody else collected by passively observing the world – rather than data that we get by directly manipulating subjects in an experiment).

A/B Testing Isn’t Always Feasible

The first reason it's important to know about tools that go beyond A/B testing is that A/B testing isn't always feasible. A/B testing only works when you can efficiently randomize people into different treatment arms, and when you can do so at low cost (if running a test is expensive, then even if you want to run the test, you should be doing some analysis first to make sure it's worthwhile!). But this isn't possible for every decision, even in big tech companies. Here's just a handful of situations where you can't run an A/B test:

  • You Don't Control The Relevant User Behavior: Suppose you want to know the effect on user satisfaction of, say, following a specific person on Facebook or Twitter, or of choosing to use a certain function in your product. The "ideal experiment" would be to randomly assign some users to follow the person in question / use your new function, and block others from doing so. But of course you can't do that! So you have to find other ways to compare followers and non-followers that account for the ways they may differ in terms of user satisfaction or behavior (see the sketch after this list).

  • It's a One-Shot Deal: Suppose you're trying to decide whether to run a Super Bowl ad, or open a second store. You can't A/B test these small-N events. When dealing with these types of "big strategic" decisions, the best you can do is look to your company's past experiences / the experiences of other companies that you think are comparable, using the best available tools for observational causal inference.

  • Randomization is Bad Business: It's one thing to randomize someone's landing page color scheme; it's another thing entirely to randomly show people different prices for your products – customers wouldn't stand for it! Similarly, sometimes you want a big rollout – like the launch of a new title on Netflix, or the premiere of a new movie – and an A/B test would interfere with that. (Not an abstract example: here's Netflix talking about this constraint).

  • Time Horizons are Too Long: A/B testing is popular in tech because you get so much data from users so quickly that tests are easy to run on short time horizons. But if you're thinking about opening more stores across the country, you can't put your business on hold, pick a few locations, set up new shops, and wait for sales data; you need a sense of what's going to happen now. As we read in Kohavi, Tang and Xu, A/B tests work best when we have an Overall Evaluation Criterion (OEC) that is "measurable in the short term (the duration of an experiment) yet believed to causally drive long-term strategic objectives." But when that is not available / the shortest experiment that would allow you to measure a meaningful OEC would take too long, you need other options!
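To make the first bullet concrete, here is a minimal sketch (in Python, using pandas and statsmodels) of one simple observational approach: regression adjustment that compares followers and non-followers while controlling for observed differences between them. The data and column names (follows_account, tenure_months, prior_activity) are hypothetical, and a real analysis would need a richer set of confounders and a more careful design (matching, weighting, and so on); this only shows the shape of the adjustment.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data: user satisfaction, whether the user follows the account in
# question, and two observed characteristics that plausibly differ
# between followers and non-followers. All values are illustrative.
users = pd.DataFrame({
    "satisfaction":    [7, 8, 5, 6, 9, 4, 8, 6],
    "follows_account": [1, 1, 0, 0, 1, 0, 1, 0],
    "tenure_months":   [24, 30, 6, 12, 36, 3, 18, 9],
    "prior_activity":  [50, 60, 10, 20, 80, 5, 40, 15],
})

# A naive follower vs. non-follower comparison conflates following with
# being a heavier, longer-tenured user. Adding the observed covariates
# adjusts for those measured differences (but not for unmeasured ones).
model = smf.ols(
    "satisfaction ~ follows_account + tenure_months + prior_activity",
    data=users,
).fit()
print(model.params["follows_account"])
```

The key limitation is that this only adjusts for differences you can measure; unmeasured differences between followers and non-followers will still bias the comparison.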

Going Beyond Testing for the Future

Another limitation of A/B tests is that we usually use them to run small-scale tests that motivate larger, future rollouts. But sometimes we just want to measure the effect of a single, large intervention.

In many areas of business, it's increasingly common to hire outside companies to do very complicated projects, like administering government programs or providing a service to employees. In these situations, companies have found that paying vendors based on inputs (hours worked, money spent) doesn't actually encourage vendors to do a very good job, or to try to be efficient – if you're being paid on the basis of what you spend, you obviously won't have an incentive to cut costs!

In these situations, companies are increasingly turning to something called "at-risk contracting." In these arrangements, the company doing the hiring offers to pay the vendor if they achieve a specific outcome. For example, a hospital might hire a company to reduce in-hospital infections and pay them based on reductions in infections, or a company may hire a vendor to minimize the downtime of their internal network and pay them based on downtime. By tying compensation to the outcome that the hiring party actually cares about, the vendor is well incentivized to do precisely what the hiring company wants!

But in these arrangements, it's critical to both parties that there's a good way to measure the effect of the vendor's efforts, or someone will get overpaid or underpaid. That's not an average effect; it's the specific effect of a single intervention, and something for which experiments are not well suited.

In these circumstances, people unfortunately often just measure outcomes before the vendor starts and compare them to outcomes when the vendor finishes. But if the outcome in question is subject to natural variation – such as seasonality – then we expect the outcome to vary whether the vendor does anything or not, so you need a better causal design (e.g. difference-in-differences) to take those types of confounders into account.
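As an illustration, here is a minimal difference-in-differences sketch (in Python, using pandas and statsmodels) with toy data and hypothetical column names: we compare the before/after change for the unit the vendor worked on against the before/after change for comparison units over the same period, which nets out seasonality and other shocks common to both.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy monthly data: infection counts for a hospital that hired the vendor
# (treated = 1) and a comparison hospital (treated = 0), before and after
# the vendor's start date (post = 0/1). All values are illustrative.
df = pd.DataFrame({
    "infections": [30, 32, 28, 21, 22, 24, 29, 30, 27, 28, 27, 29],
    "treated":    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "post":       [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
})

# The coefficient on treated:post is the diff-in-diff estimate: the change
# in the treated hospital's infections beyond the change observed in the
# comparison hospital over the same months.
model = smf.ols("infections ~ treated + post + treated:post", data=df).fit()
print(model.params["treated:post"])
```

The estimate is only credible under the parallel-trends assumption: absent the vendor, the treated and comparison units would have moved together.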

Don’t want to be limited

Here's another important thought: professionally, if you only know how to do A/B testing, you risk limiting your opportunities for advancement. Big companies are getting better and better at providing analysts with extremely user-friendly A/B testing tools, which means running A/B tests is getting easier and easier. Twitter, for example, has an internal tool called Duck, Duck, Goose that makes running an A/B test a relatively low-difficulty task. And while running a good A/B test is never easy (as we have discussed at length, even with A/B testing, the internal and external validity of our causal estimates will always depend on a wide range of assumptions), it is something that more and more people are learning to do well and that companies are making easier. Thus, if all you know how to do are A/B tests, you may find it increasingly hard to differentiate yourself in the data science job market.

Again, this isn’t to say you shouldn’t learn to do A/B testing, just don’t only learn A/B testing.

Not enough to be able to do it; you must be able to explain it

One final reason it's important to understand how to apply the logic of causal inference to observational data is that even if you only work with A/B test data, you will almost inevitably interact with other people in your company who want to use observational data. In these situations, it's not enough that you understand the considerations that go into making causal inferences from observational data; you also need to be able to explain them to your colleagues.