Endogenous Stopping
When running experiments, it is often tempting to watch the data roll in as the experiment runs. In A/B testing, you may watch because it’s easy; in medical studies, you may watch because the trial is expensive and you’d like to stop as soon as you can, or because you want to know if lots of patients start experiencing negative side effects.
But it turns out that it is critically important to the legitimacy of experiments that you not stop an experiment early because the data looks good (or bad).
Ending an experiment because of the intermediate results is called “stopping endogenously”, and it will render your experiment statistically invalid. The math on this gets very complicated, but the basic idea is that the apparent results of your experiment will fluctuate over time, and the law of large numbers only guarantees that in the long run your \(\widehat{ATE}\) will be close to the true \(ATE\). At any given moment, the results may show your treatment as more amazing than it really is, or more terrible than it really is; probability only ensures that those moments will be relatively rare. But if you choose to stop an experiment because you’ve hit one of those moments (which should be fleeting), you’ll end up with erroneous results.
To illustrate this point, Ramesh Johari, Leo Pekelis, and David Walsh ran a fake A/B test on a large website in which the two treatment conditions (A and B) were exactly the same. They ran this over several days, then plotted, for each moment in time, whether the data would say A is better than B if the experiment were stopped and analyzed at that time. As the figure shows, over the long run the data show no significant difference between A and B, but there are moments where random fluctuations make the difference look significant. So if you had chosen to stop the experiment as soon as you hit one of those moments, you’d be in deep trouble!
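The same effect can be reproduced with a small simulation. The sketch below is not the authors’ code or data; it is a minimal illustration under assumed settings (a 10% conversion rate in both arms, 2,000 visitors per arm, a peek every 100 visitors, and a two-proportion z-test at the 5% level). All function and parameter names are hypothetical. It runs many A/A tests and compares how often you would declare a “significant” difference if you stopped at the first significant peek versus if you only analyzed the data once at the planned end.

```python
import math
import random


def z_stat(conv_a, conv_b, n):
    """Two-proportion z statistic for two arms of equal size n (pooled variance)."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 0.0  # no variation observed yet, so no evidence of a difference
    return (conv_b / n - conv_a / n) / se


def simulate_aa_test(rng, rate=0.1, n_max=2000, peek_every=100, z_crit=1.96):
    """Run one A/A test (both arms share the same true rate).

    Returns (significant_at_any_peek, significant_at_end).
    Assumes n_max is a multiple of peek_every, so the last peek is the final analysis.
    """
    conv_a = conv_b = 0
    any_peek = False
    z = 0.0
    for n in range(1, n_max + 1):
        conv_a += rng.random() < rate  # one new visitor in arm A
        conv_b += rng.random() < rate  # one new visitor in arm B
        if n % peek_every == 0:
            z = z_stat(conv_a, conv_b, n)
            if abs(z) > z_crit:
                any_peek = True  # an endogenous stopper would have stopped here
    return any_peek, abs(z) > z_crit


def false_positive_rates(n_sims=500, seed=0):
    """Share of A/A tests declared significant: (with peeking, fixed horizon)."""
    rng = random.Random(seed)
    peek_hits = end_hits = 0
    for _ in range(n_sims):
        any_peek, at_end = simulate_aa_test(rng)
        peek_hits += any_peek
        end_hits += at_end
    return peek_hits / n_sims, end_hits / n_sims


if __name__ == "__main__":
    peek_rate, end_rate = false_positive_rates()
    print(f"False positive rate, stop at first significant peek: {peek_rate:.3f}")
    print(f"False positive rate, analyze once at the end:        {end_rate:.3f}")
```

Because there is no true difference between the arms, every “significant” result here is a false positive. Under these assumed settings the peeking rate should come out well above the nominal 5%, while the fixed-horizon rate should stay close to it; the exact figures depend on the seed and the parameters chosen.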

To be clear, that doesn’t mean there are no ways to stop experiments early based on results (see Johari, Pekelis, and Walsh’s paper for statistically sound ways to do so), but don’t do it unless you really understand the statistics (even if your boss really wants you to!).