A/B Testing the Udacity Website

In these exercises, we’ll be analyzing data on user behavior from an experiment run by Udacity, the online education company. More specifically, we’ll be looking at a test Udacity ran to improve the onboarding process on their site.

Udacity’s test is an example of an “A/B” test, in which some portion of users visiting a website (or using an app) are randomly selected to see a new version of the site. An analyst can then compare the behavior of users who see the new design to that of users who see the normal website to estimate the effect of rolling out the proposed changes to all users. While this kind of experiment has its own name in industry (A/B testing), to be clear it’s just a randomized experiment, and so everything we’ve learned about potential outcomes and randomized experiments applies here.

(Udacity has generously provided the data from this test under an Apache open-source license, and you can find their original writeup here. If you’re interested in learning more about A/B testing in particular, it seems only fair, while we use their data, to flag that they have a full course on the subject here.)

Udacity’s Test

The test is described by Udacity as follows:

At the time of this experiment, Udacity courses currently have two options on the course overview page: “start free trial”, and “access course materials”.

Current Conditions Before Change

  • If the student clicks “start free trial”, they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. After 14 days, they will automatically be charged unless they cancel first.

  • If the student clicks “access course materials”, they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.

Description of Experimented Change

  • In the experiment, Udacity tested a change where if the student clicked “start free trial”, they were asked how much time they had available to devote to the course.

  • If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free.

  • At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead. This screenshot shows what the experiment looks like.

Udacity’s hope is that…

this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn’t have enough time – without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches’ capacity to support students who are likely to complete the course.

Import the Data

Exercise 1

Begin by importing Udacity’s data on user behavior by going to http://www.github.com/nickeubank/MIDS_Data/ and using the udacity_AB_testing folder, or by clicking here. Note that there are TWO datasets for this test: one for the control data (users who saw the original design), and one for the treatment data (users who saw the experimental design). Udacity decided to show their test site to 1/2 of visitors, so there are roughly the same number of users appearing in each dataset (though this is not a requirement of A/B tests).
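As a rough sketch of the loading step, the two files can be read into separate DataFrames with pandas. The in-memory text below is a made-up stand-in for the downloaded files: the rows are toy numbers, and the "Date" column name is an assumption, so check the real files before relying on it.

```python
import io

import pandas as pd

# Stand-ins for the two CSV files. These rows are made-up toy numbers,
# and the "Date" column name is an assumption about the real files.
control_csv = io.StringIO(
    "Date,Pageviews,Clicks,Enrollments,Payments\n"
    "day 1,1000,80,20,10\n"
    "day 2,1100,90,25,12\n"
)
treatment_csv = io.StringIO(
    "Date,Pageviews,Clicks,Enrollments,Payments\n"
    "day 1,990,82,15,11\n"
    "day 2,1120,88,18,12\n"
)

# With the real data, pass a file path or URL instead of a buffer.
control = pd.read_csv(control_csv)
treatment = pd.read_csv(treatment_csv)

print(control.shape, treatment.shape)
```

Keeping the two arms in separate DataFrames at this stage makes it easy to inspect each one before combining them.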

Exercise 2

Explore the data. Can you identify the unit of observation of the data (i.e. what is represented by each row)?

Pick your measures

Exercise 3

The easiest way to analyze this data is to stack it into a single dataset where each observation is a day-treatment-arm (so you should end up with two rows per day, one for those who were in the treatment group, and one for those who were in the control group). Note that currently nothing in the data identifies whether a given observation is a treatment group observation or a control group observation, so you’ll want to make sure to add a “treatment” indicator variable.

The variables in the data are:

  • Pageviews: number of unique users visiting homepage

  • Clicks: number of those users clicking “Start Free Trial”

  • Enrollments: Number of people enrolling in trial

  • Payments: Number of people who eventually pay for the service. Note the payment column reports payments for the users who first visited the site on the reported date, not payments occurring on the reported date.
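One way to do the stacking described above is to tag each DataFrame with an indicator and concatenate them. This is a minimal sketch on made-up toy rows; the names `control_df`, `treatment_df`, and the "Date" column are illustrative assumptions, not names taken from the real files.

```python
import pandas as pd

# Toy stand-ins for the two arms (made-up numbers, two days each).
control_df = pd.DataFrame(
    {"Date": ["day 1", "day 2"], "Pageviews": [1000, 1100], "Clicks": [80, 90]}
)
treatment_df = pd.DataFrame(
    {"Date": ["day 1", "day 2"], "Pageviews": [990, 1120], "Clicks": [82, 88]}
)

# Add a 0/1 "treatment" indicator to each arm, then stack the rows.
stacked = pd.concat(
    [control_df.assign(treatment=0), treatment_df.assign(treatment=1)],
    ignore_index=True,
)

# Sanity check: every day should now appear exactly twice.
rows_per_day = stacked.groupby("Date").size()
print(stacked)
print(rows_per_day)
```

Checking the number of rows per day is a cheap way to confirm that the result really has one row per day-treatment-arm.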

Exercise 4

Given Udacity’s goals, what outcome are they hoping will be impacted by their manipulation?

Or, to ask the same question in the language of the Potential Outcomes Framework, what is the \(Y\)?

Or to ask the same question in the language of Kohavi, Tang and Xu, what is the Overall Evaluation Criterion (OEC)?

(I’m only asking one question; I’m just trying to phrase it using the different terminologies we’ve encountered to help you see how they all fit together.)

Exercise 5

Given Udacity’s goals, what outcome are they hoping will not be impacted by their manipulation? In other words, what do they want to measure to ensure their treatment doesn’t have unintended negative consequences?

Note that this isn’t quite how Kohavi, Tang, and Xu use the term “guardrail metrics”: they use it only for things we measure to ensure the experiment is working the way it should. Some people, however, also use “guardrail metrics” for outcomes that could be affected even when the experiment is working correctly, but which the organization considers important enough to track to make sure they aren’t harmed.

Validating The Data

Exercise 6

Whenever you are working with experimental data, the first thing you want to do is verify that users actually were randomly sorted into the two arms of the experiment. In this data, half of users were supposed to be shown the old version of the site and half were supposed to see the new version.

Pageviews tells you how many unique users visited the welcome site we are experimenting on. Pageviews is what is sometimes called an “invariant” or “guardrail” variable, meaning that it shouldn’t vary across treatment arms—after all, people have to visit the site before they get a chance to see the treatment, so there’s no way that being assigned to treatment or control should affect the number of pageviews assigned to each group.

“Invariant” variables are also an example of what are known as “pre-treatment” variables, because they (here, pageviews) are determined before users are manipulated in any way. That makes pageviews analogous to gender or age in experiments where you have demographic data: a person’s age and gender are determined before they experience any manipulation, so the distribution of any pre-treatment attribute should be the same across the two arms of our experiment. This is what we’ve previously called “checking for balance.” If pre-treatment attributes aren’t balanced, then we may worry our attempt to randomly assign people to different groups failed. Kohavi, Tang and Xu call this a “trust-based guardrail metric” because it helps us determine if we should trust our data.

To test the quality of the randomization, calculate the average number of pageviews for the treated group and for the control group. Do they look similar?
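The comparison of group averages can be sketched with a pandas groupby, assuming a stacked dataset with a 0/1 `treatment` column as described in Exercise 3. The numbers below are made up for illustration and say nothing about the real data.

```python
import pandas as pd

# Toy stacked data (made-up numbers): three days per arm.
stacked = pd.DataFrame(
    {
        "treatment": [0, 0, 0, 1, 1, 1],
        "Pageviews": [1000, 1100, 1200, 990, 1120, 1160],
    }
)

# Average pageviews within each arm of the experiment.
mean_pageviews = stacked.groupby("treatment")["Pageviews"].mean()

# Treated mean minus control mean.
difference = mean_pageviews.loc[1] - mean_pageviews.loc[0]

print(mean_pageviews)
print(difference)
```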

Exercise 7

“Similar” is a tricky concept: obviously, we expect some differences across groups since users were randomly divided across treatment arms. The question is whether the differences between groups are larger than we’d expect to emerge given our random assignment process. To evaluate this, let’s use a t-test to test the statistical significance of the differences we see.

Note: Remember that scipy functions don’t always handle pandas objects the way you’d expect, so when you use a scipy function, it is safest to pass the numpy arrays underlying your data using the .values attribute (e.g. df.my_column.values).

Does the difference in pageviews look statistically significant?
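A minimal sketch of such a test with `scipy.stats.ttest_ind`, again on made-up toy numbers and assuming a stacked dataset with a 0/1 `treatment` column. By default this function assumes equal variances in the two groups.

```python
import pandas as pd
from scipy import stats

# Toy stacked data (made-up numbers): five days per arm.
stacked = pd.DataFrame(
    {
        "treatment": [0] * 5 + [1] * 5,
        "Pageviews": [1000, 1100, 1200, 1050, 1150, 990, 1120, 1160, 1060, 1140],
    }
)

# Pull out the underlying numpy arrays for each arm.
control_views = stacked.loc[stacked["treatment"] == 0, "Pageviews"].values
treated_views = stacked.loc[stacked["treatment"] == 1, "Pageviews"].values

# Two-sample t-test of the difference in means (treated minus control).
result = stats.ttest_ind(treated_views, control_views)

print(result.statistic, result.pvalue)
```

A large p-value here means the observed gap is well within what random assignment alone would produce, which is what we hope to see for a pre-treatment variable.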

Exercise 8

Pageviews is not the only “pre-treatment” variable in this data we can use to evaluate balance/use as a guardrail metric. What other measure is pre-treatment? Review the description of the experiment if you’re not sure.

Exercise 9

Check if the other pre-treatment variable is also balanced.

# Yup, good balance! The difference has a p-value of 0.93, nowhere near statistical significance.

Estimating the Effect of Experiment

Exercise 10

Now that we’ve established we have good balance (meaning we think randomization was likely successful), we can evaluate the effects of the experiment. Test whether the OEC and the metric you don’t want affected have different average values in the control group and treatment group.

Because we’ve randomized, this is a consistent estimate of the Average Treatment Effect of Udacity’s website change.

Did Udacity achieve their goals?

Note: You may discover some issues with your data. Can you figure out what’s going on, and adjust?

Exercise 11

One of the magic things about experiments is that all you have to do is compare averages to get an average treatment effect. However, you can do other things to try to increase the statistical power of your experiments, like adding controls in a linear regression model.

As you likely know, a bivariate regression is exactly equivalent to a t-test (specifically, the two-sample t-test that assumes equal variances in both groups), so let’s start by re-estimating the effect of treatment on your OEC using a linear regression. Can you replicate the results from your t-test? They shouldn’t just be close; they should be numerically equivalent (i.e. exactly the same to the limits of floating point number precision).
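To see why the two approaches agree, here is a sketch that fits the bivariate regression by hand with numpy least squares and compares it with scipy's equal-variance t-test. The outcome values are made up, and in practice you would likely use a regression package rather than this hand-rolled version; it is written out only so the arithmetic is visible.

```python
import numpy as np
from scipy import stats

# Made-up toy outcomes for each arm.
y_control = np.array([10.0, 12.0, 11.0, 13.0, 9.0])
y_treated = np.array([12.0, 15.0, 13.0, 14.0, 11.0])

# Two-sample t-test assuming equal variances (scipy's default).
ttest = stats.ttest_ind(y_treated, y_control)

# Bivariate OLS: y = a + b * treatment, with an intercept column.
y = np.concatenate([y_control, y_treated])
treatment = np.concatenate([np.zeros(len(y_control)), np.ones(len(y_treated))])
X = np.column_stack([np.ones_like(treatment), treatment])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classical (homoskedastic) standard errors.
residuals = y - X @ beta
dof = len(y) - X.shape[1]
sigma2 = residuals @ residuals / dof
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# t-statistic and two-sided p-value for the treatment coefficient.
ols_t = beta[1] / se[1]
ols_p = 2 * stats.t.sf(abs(ols_t), dof)

print(beta[1], ols_t, ols_p)
print(ttest.statistic, ttest.pvalue)
```

The treatment coefficient is the difference in group means, and its t-statistic and p-value match the t-test because both use the same pooled variance estimate and the same degrees of freedom.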

Exercise 12

Now add indicator variables for the day of each observation. Do the standard errors on your treatment variable change? If so, in what direction?

You should have found that your standard errors decreased by about 30%. This is why, although just comparing means works, you should add any additional variables you have as covariates in your analysis. Moreover, in other settings you may find this effect is even larger: the date indicators we added to our data are perfectly balanced between treatment and control, so we aren’t adding a lot of information to the model by adding them as variables. As we’ll see in later exercises, adding variables like “gender” or “age” (which will never be perfectly balanced across treatment and control) will help even more.
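The mechanics can be illustrated on simulated data. The sketch below invents outcomes with large day-to-day swings and a small treatment effect, then compares the treatment standard error with and without day indicators. Everything here (the helper `ols_standard_errors`, the simulated values, the column names) is hypothetical, and the size of the reduction will not match the real data.

```python
import numpy as np
import pandas as pd


def ols_standard_errors(X, y):
    """Return OLS coefficients and classical (homoskedastic) standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = residuals @ residuals / dof
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se


# Simulated data: 10 days, one control row and one treated row per day.
rng = np.random.default_rng(0)
days = np.arange(10)
df = pd.DataFrame(
    {"day": np.tile(days, 2), "treatment": np.repeat([0, 1], len(days))}
)
day_effect = 100.0 * days  # large day-to-day variation
df["y"] = (
    day_effect[df["day"].to_numpy()]
    + 5.0 * df["treatment"]
    + rng.normal(0, 1, len(df))
)
y = df["y"].to_numpy()

# Model 1: intercept + treatment only.
X_plain = np.column_stack(
    [np.ones(len(df)), df["treatment"].to_numpy(dtype=float)]
)

# Model 2: add one indicator per day (dropping one to avoid collinearity).
day_dummies = pd.get_dummies(df["day"], drop_first=True, dtype=float).to_numpy()
X_days = np.column_stack([X_plain, day_dummies])

beta_plain, se_plain = ols_standard_errors(X_plain, y)
beta_days, se_days = ols_standard_errors(X_days, y)

print("treatment estimate:", beta_plain[1], beta_days[1])
print("treatment std. error:", se_plain[1], se_days[1])
```

Because each day contributes exactly one treated and one control row, the treatment estimate itself is unchanged by the day indicators; only the residual variance, and hence the standard error, shrinks.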

Exercise 13

Given your results, what would you tell Udacity about their trial?

Exercise 14

As a last exercise, instead of adding indicators for each date, add indicators for day of the week (e.g. Monday, Tuesday, etc.).

(This is just for data manipulation practice!)
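For the data manipulation itself, one possible approach is to parse the dates, extract the weekday name, and expand it into indicator columns. The sketch below uses made-up ISO-formatted dates and a hypothetical `date` column; if the real date column lacks a year or uses another format, the parsing step will need adjusting.

```python
import pandas as pd

# Toy data with ISO-formatted dates (an assumption about the format).
df = pd.DataFrame(
    {"date": ["2024-01-01", "2024-01-02", "2024-01-08"], "treatment": [0, 1, 0]}
)

# Parse the strings into datetimes, then pull out the weekday name.
df["date"] = pd.to_datetime(df["date"])
df["weekday"] = df["date"].dt.day_name()

# One 0/1 indicator column per weekday that appears in the data.
weekday_dummies = pd.get_dummies(df["weekday"], prefix="dow", dtype=int)
df = pd.concat([df, weekday_dummies], axis=1)

print(df)
```

When these indicators go into a regression with an intercept, one weekday has to be left out as the reference category (for example by passing `drop_first=True` to `pd.get_dummies`).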