Trustworthy online experiments

Some time ago I read the book Trustworthy Online Controlled Experiments: a practical guide to A/B testing by Kohavi, Tang, and Xu. I was pleasantly surprised by how practical it was! The authors have collectively worked at Microsoft, Amazon, Google, and LinkedIn, so they had a lot of great real-life examples of both success and failure. While they don’t go deep, they cover many topics that you usually won’t find outside internal company guides. I’ve summarised my learnings from the book, adding my own experience in some areas, so I can refer back to it later.

Background

Motivation

It’s hard to guess which changes will be successful, and the successes are usually small incremental gains. By reducing the overhead, tech companies have made it possible for almost anybody to test an idea. Every now and then, someone gets lucky and a small change has a massive effect. At Bing, adding some of the ad description to the headline turned out to generate an extra $100M/year. Similarly, despite a VP being against it, personalised recommendations based on a user’s cart turned out to be very profitable for Amazon. They were then able to improve on the original ‘people who bought X bought Y’ by generalising it to ‘people who viewed X bought Y’, and later ‘people who searched X bought Y’. Another area that’s been shown again and again to have a positive effect on many metrics is performance improvement (e.g. speeding up page load can increase sales).

However, don’t get too excited: you shouldn’t expect many of your experiments to succeed like this. It can also go the other way. For example, Bing expected that integrating with social networks would have a strong positive effect, but after two years of experimenting without seeing value, the effort was abandoned.

Why do we need randomised experiments?

Imagine we launched a new feature and observed users who interacted with it. Maybe those users are 50% less likely to churn than those who didn’t interact. Does that mean our feature causes less churn? We can’t say anything about the effect of the treatment on churn. The users self-selected into the treatment, and are therefore unlikely to be otherwise comparable to those who didn’t. Perhaps heavy users are more likely to interact with the feature, and they naturally have a lower churn rate (regardless of whether the feature is there or not).

In its simplest form, an experiment solves this by randomly splitting users into two groups: one sees no change from normal (control), and the other sees some new feature or change (variant). User interactions in both groups are monitored and logged (instrumented). Then at the end of the experiment, we use this data to assess the differences between the two groups on various metrics. By randomising which users are exposed, we know that the only difference between the two groups is due to the change (plus some random variation). This is why experiments are the gold standard for establishing causality, and sit near the top of the hierarchy of evidence.
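In practice, the random split is usually implemented deterministically, e.g. by hashing a user ID together with an experiment-specific salt so that the same user always lands in the same group. A minimal sketch of that idea (the function name, salt, and split are illustrative, not from the book):

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str,
                   variants=("control", "variant"), weights=(0.5, 0.5)) -> str:
    """Deterministically assign a user to a group by hashing id + salt."""
    digest = hashlib.md5(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # roughly uniform value in [0, 1)
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if bucket < cumulative:
            return variant
    return variants[-1]  # guard against floating point edge cases

assign_variant("user_123", "checkout_coupon_test")  # same inputs -> same group every time
```

Using a per-experiment salt means assignments are independent across experiments, while staying stable for a given user within one experiment.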

When possible, a controlled experiment is often the easiest, most reliable, and most sensitive mechanism to evaluate changes. We’re only able to measure the effect of being exposed (some may not even notice the change), but this is generally enough to guide product development. Experiments can also be great for rolling out features because they provide an easy way to roll back if there are issues.

Tenets for success

It helps a lot if a company meets the following criteria:

These can develop over time, and chapter 4 of the book has some good advice on the stages of maturity and what to focus on if you’re in the earlier stages.


Part 1. Pre-experiment design

General set up

Once you understand the hypothesis and treatment (e.g. what is the mechanism by which we expect to see an effect), below are some of the questions to think about:

The above should then be wrapped up into a clear experiment plan and hypothesis. For example, a good hypothesis from the book is “Adding a coupon code field to the checkout page will degrade revenue per user for users who start the purchase process”.

Deciding Experiment Metrics

Measure a salesman by the orders he gets (output), not the calls he makes (activity) – Andrew Grove

Decide which metrics are important before the experiment starts to avoid the temptation of building a story around the results (which might be false positives). They should be quantifiable, measurable in the short term (during the experiment), and sensitive to changes within a reasonable time period. Typically there will be a small number of “primary” metrics, some “secondary” metrics to help understand the primary movements, and guardrail metrics (e.g. monitoring things like performance).

Overall Evaluation Criteria (OEC)

In the book they talk about the concept of an OEC, which (potentially) combines multiple metrics to give a single metric that’s believed to be causally related to the long-term business goals. For example, for a search engine, the OEC could be a combination of usage (e.g. sessions per user), relevance (e.g. successful sessions, or time to success), and ad revenue metrics. It could also be a single metric like active days per user (if the target is to increase user engagement). To combine multiple metrics, they can first be normalised (0-1) and then assigned a weight. This requires deciding how much each component should contribute. For example, how much churn is tolerable if engagement and revenue increase more than enough to compensate? If growth is a priority, there might be a low tolerance for churn. In practice, you will still want to track the components as well to understand movements.
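As a rough sketch of the “normalise then weight” idea (the component metrics, ranges, and weights below are invented for illustration, not taken from the book):

```python
def normalise(value, lo, hi):
    """Scale a metric to 0-1 given an agreed 'reasonable range'."""
    return min(max((value - lo) / (hi - lo), 0.0), 1.0)

# Hypothetical components: (value, lo, hi, weight), with weights agreed up front.
components = {
    "sessions_per_user":   (3.2,  0.0, 5.0, 0.5),
    "success_rate":        (0.62, 0.0, 1.0, 0.3),
    "revenue_per_session": (0.45, 0.0, 1.0, 0.2),
}

oec = sum(w * normalise(v, lo, hi) for v, lo, hi, w in components.values())
print(round(oec, 3))
```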

OEC Examples (goal -> feature -> metrics)

How to evaluate metrics

Improving sensitivity with variance reduction

At this point, you may have identified some important metrics, but they’re under-powered. One way to improve the power (sensitivity) of a test is to reduce variance. When variance is reduced, there’s less overlap between the distributions of control and variant, making the effect clearer (one technique, CUPED, is sketched below).

Variance reduction

Ways to reduce variance:
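One widely used option is CUPED, which adjusts each user’s metric using a pre-experiment covariate (typically the same metric measured before the experiment started). A minimal numpy sketch, with simulated data rather than anything from the book:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """CUPED: adjust metric y using a pre-experiment covariate x_pre.

    theta is chosen to minimise the variance of the adjusted metric; the
    adjustment leaves the expected treatment effect unchanged because
    x_pre was measured before exposure.
    """
    theta = np.cov(y, x_pre)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())

rng = np.random.default_rng(0)
x_pre = rng.normal(10, 3, size=10_000)       # e.g. pre-period spend per user
y = x_pre + rng.normal(0, 2, size=10_000)    # in-experiment spend, correlated with pre-period
print(np.var(y), np.var(cuped_adjust(y, x_pre)))  # adjusted variance is much lower
```

The stronger the correlation between the pre-period covariate and the experiment metric, the bigger the variance reduction (and the smaller the sample needed for the same power).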


Part 2. During the experiment

There can be a number of issues that pop up during an experiment. Usually they are around instrumentation, randomisation, ramping, or concurrent experiments. Monitoring and alerts are useful here to catch the problems early (e.g. test for a significant difference in the size of the control/test samples). I might come back to write on this in the future, but for now I’m mostly skipping over the topic to focus on where I spend the most time.
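For example, a sample ratio mismatch (SRM) check compares the observed group sizes to the configured split with a chi-squared goodness-of-fit test; a very small p-value suggests the assignment or logging is broken. A small sketch (the counts and alert threshold are illustrative):

```python
from scipy.stats import chisquare

observed = [50_421, 49_119]      # users logged in control / variant
expected_ratio = [0.5, 0.5]      # the split we configured
total = sum(observed)
expected = [r * total for r in expected_ratio]

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:              # a deliberately conservative threshold for alerting
    print(f"Possible sample ratio mismatch (p={p_value:.2e}); "
          "investigate before trusting any results")
```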


Part 3. Post experiment analysis

In the A/B case, we have two independent random samples, and summary statistics for each of our metrics in each group. Comparing the means gives us an estimate of the average treatment effect (ATE) in the population. By chance, we expect some random variation between the two groups, and so the observed difference will vary around the true effect in the population (i.e. even if the null is true and there is no effect, we will still observe differences). Therefore, it’s important we correctly estimate the standard error (how much we expect estimates to vary with repeated sampling), so we can assess whether a difference is ‘large enough’ for us to reject the null hypothesis. To do this, we compare the observed difference to the range of values we would observe if the null was true (i.e. assuming the null is true, how likely is it that we would observe a difference at least as extreme as this?). If the result is surprising, we reject the null and conclude there is likely an effect (the result is significant).

Which test to use?

In general, the t-test or Welch’s t-test is most common for comparing means, and a z-test or g-test for proportions. For binary metrics, see here, and for continuous metrics see here. For ratio metrics, and comparisons of the percentage change (also a ratio), the variance is not the same as that of an average. In these cases, the Delta method can be used, or alternatively bootstrapping (see this post).
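A rough illustration of the first two cases, using simulated data (the metrics, sample sizes, and counts are made up):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)

# Continuous metric (e.g. revenue per user): Welch's t-test (unequal variances).
control = rng.gamma(shape=2.0, scale=10.0, size=20_000)
variant = rng.gamma(shape=2.0, scale=10.3, size=20_000)
t_stat, p_cont = stats.ttest_ind(variant, control, equal_var=False)

# Binary metric (e.g. conversion): two-sample z-test for proportions.
conversions = np.array([1_830, 1_952])   # converted users in control / variant
exposed = np.array([20_000, 20_000])     # users exposed in each group
z_stat, p_prop = proportions_ztest(conversions, exposed)

print(f"continuous: diff={variant.mean() - control.mean():.3f}, p={p_cont:.3f}")
print(f"binary: diff={conversions[1]/exposed[1] - conversions[0]/exposed[0]:.4f}, p={p_prop:.3f}")
```

For ratio metrics (or percentage changes), the same comparisons apply but the standard error needs the Delta method or bootstrapping, as noted above.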

Common issues:

Analysing segments

The effect of an experiment is not always uniform (i.e. there are heterogeneous treatment effects), so it can be useful to break down the results for different segments. For example: market or country (there could be different reactions by region, or localisation issues); device or platform (web/mobile/app, iOS/Android, and browser breakdowns can help with spotting bugs); time of day and day of week (plotting effects over time); and user characteristics (e.g. whether they joined recently, say within the last month, or account type: single/shared, business/personal). Some considerations:

Simpson’s paradox with ramping (example from Kohavi et al., chapter 3)
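A toy illustration of how this can happen (all numbers invented): suppose the treatment is ramped from 1% of traffic to 50% across two days, and the underlying conversion rate is higher on day one. The treatment can win on both days yet look worse when the days are naively pooled, because the pooled treatment numbers are dominated by the low-converting day.

```python
# Day 1: treatment at 1% of traffic, high-converting day.
# Day 2: treatment at 50% of traffic, low-converting day.
days = [
    # (control_users, control_conv, treatment_users, treatment_conv)
    (990_000, 49_500, 10_000, 520),      # day 1: 5.00% vs 5.20% -> treatment wins
    (500_000, 10_000, 500_000, 10_500),  # day 2: 2.00% vs 2.10% -> treatment wins
]

for i, (cu, cc, tu, tc) in enumerate(days, start=1):
    print(f"day {i}: control {cc/cu:.2%} vs treatment {tc/tu:.2%}")

cu, cc = sum(d[0] for d in days), sum(d[1] for d in days)
tu, tc = sum(d[2] for d in days), sum(d[3] for d in days)
print(f"pooled: control {cc/cu:.2%} vs treatment {tc/tu:.2%}")  # treatment now looks worse
```

The fix is to analyse periods with a stable traffic split (or weight appropriately), rather than pooling raw counts across ramp stages.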

Making a decision

Our goal is to use the data we’ve collected to make a decision: ship the change, or don’t.

Considerations:

Potential outcomes:

Other topics

Measuring long-term effects

Measuring long-term effects is always a challenge, given we typically want to launch a successful change as soon as possible. There are two common approaches:

Opt-in products

Delayed Activation (e.g. opt-in products): Usually you would randomise assignments for some recruitment period, and then continue to observe for some observation window to give time for people to activate. Otherwise you might underestimate the effect size as newer users haven’t had time to activate.

Experiments with an expected impact on performance

Sometimes we expect a new feature will negatively impact performance (e.g. calculations are required before showing the page, slowing down the page load). Before we spend time optimising a feature to address the performance impact though, we first want to know if it has any value. One way to separate the effect of the performance from the effect of the feature is to add a variant that does the calculations but doesn’t show the change:

The difference between the control and variant 1 is the effect of any performance loss (as the only difference is the calculations etc. being done in variant 1). The difference between variant 1 and variant 2 gives us the effect of the feature (assuming no performance impact), as both should have the same performance. Finally, the difference between variant 2 and the control is the total effect of switching to this feature (including the effect of any performance change). If this total effect is positive, any performance loss is outweighed by the effect of the feature and so we could ship the change. However, if the total effect is negative, but the feature effect is positive and the performance effect is negative, we’ll need to work on optimisations to address the performance issues.
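A tiny sketch of this decomposition (the mean metric values per group are invented):

```python
# Hypothetical mean values of the primary metric in each group.
control   = 10.00  # no calculations, no feature shown
variant_1 = 9.80   # runs the calculations, does not show the feature
variant_2 = 10.10  # runs the calculations and shows the feature

performance_effect = variant_1 - control    # -0.20: cost of the slowdown alone
feature_effect     = variant_2 - variant_1  # +0.30: value of the feature at equal performance
total_effect       = variant_2 - control    # +0.10: what shipping as-is would deliver

print(performance_effect, feature_effect, total_effect)
```

In this made-up case the total effect is positive, so the feature outweighs the slowdown; if the total were negative but the feature effect positive, the performance work would need to come first.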

Threats to validity

Particularly if you were not involved in the experiment design stage, it’s worthwhile thinking about whether there could be any issues with the validity of the results. This can be broken into internal, external, and statistical validity:

1) Internal validity (truth in the experiment):

Does it measure what we intended to measure? Experiments control for confounders by randomising the treatment, but there are some common issues that might still make us question the internal validity:

2) External validity (truth in the population): Can the results be generalised to new people/situations/time periods beyond what was originally tested? Some of the key threats to this are:

3) Statistical validity: Were appropriate statistical tests chosen (usually built into the company’s tooling), and were the assumptions of the test(s) met?

Strategy decisions

When goals can be measured with well-defined metrics, experiments can help companies “hill climb” to a local optimum. However, if you want to try something far from the current product, it takes more resources to get to an MVP and it’s more likely to fail. This is because testing big changes requires longer and larger experiments (they’re more likely to have primacy/novelty effects), and you may need many experiments to isolate what is and isn’t working. That said, experiments can still help reduce the uncertainty and provide guardrails for changes. Beyond that, a useful question to consider is “how will additional information help decision making?”.

When not to experiment?

Sometimes a decision needs to be made, and an experiment is simply not practical. For example: the population of users is too small and you would have to run the experiment for an impractically long time; your retention period for the ID is limited, so you can’t run the experiment long enough; or there are legal/ethical reasons. Experimentation is just one tool in the toolbox, and sometimes qualitative approaches make more sense (e.g. usability tests, interviews), or a quasi-experiment methodology can be useful (e.g. a geo experiment randomising at the country/region level, or randomising time instead of users).

Estimating the cumulative impact of experiments

To calculate team impact, some teams still add up the estimated average effect on a target metric across all the experiments they decided to launch. This assumes each launch is additive (independent), and that the effect is stable over time. However, even with those assumptions it will over-estimate the impact due to magnitude (Type M) error. This post explains it well, but essentially we truncate the distribution of effect sizes by imposing a significance threshold. If teams run under-powered experiments, the over-estimation will be worse.
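A quick simulation of this selection effect (all numbers are illustrative): if the true lift is small and the experiment is under-powered, the experiments that happen to reach significance will, on average, report an inflated effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_lift, sigma, n = 0.2, 10.0, 1_000   # small true effect, under-powered sample size
significant_estimates = []

for _ in range(2_000):                    # simulate many independent experiments
    control = rng.normal(0.0, sigma, n)
    variant = rng.normal(true_lift, sigma, n)
    diff = variant.mean() - control.mean()
    _, p = stats.ttest_ind(variant, control, equal_var=False)
    if p < 0.05 and diff > 0:             # only "winning" experiments get launched and counted
        significant_estimates.append(diff)

print(f"true lift: {true_lift}")
print(f"mean estimate among launched experiments: {np.mean(significant_estimates):.2f}")
```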

Two better options are:

Conclusions

I’ll finish with a quote that resonated with me:

Good data scientists are skeptics: they look at anomalies, they question results, and they invoke Twyman’s law when the results look too good. (Kohavi et al., chapter 3)