Trustworthy online experiments

Some time ago I read the book Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing by Kohavi, Tang, and Xu. I was pleasantly surprised by how practical it...

Interviewing for DS jobs in the US

Some notes on my experience interviewing for Data Scientist jobs in the US in 2021. For context, I was based in the Netherlands working as a Data Scientist and required...

Statistical power: concept and calculations

The statistical power of a test is simply the probability of detecting an effect when one exists. The concept may be more familiar in the context of a medical test....

Metric validation for AB testing

In AB testing, the t-test is one of the most commonly used tests. However, if the assumptions are not met, the results are not valid. A key assumption we make...

Confidence intervals

The confidence interval quantifies the uncertainty of a sample estimate. When we estimate a population parameter with a sample statistic, it’s unlikely it equals the population value exactly. For example,...

Non-inferiority testing

Non-inferiority tests are just one-sided tests with a margin, but they are quite useful in experimentation. For example, let’s say you’re adding a new feature to your web shop. You’re...

What is a p-value, and how does it relate to error rates?

The p-value might seem simple at first, but the definition tends to confuse people. Formally, the p-value is the probability of obtaining a result at least as extreme as the...

Hypothesis testing: Two sample tests for proportions

This post covers the most commonly used statistical tests for comparing a binary (success/failure) metric in two independent samples. For example, imagine we run an A/B experiment where our primary...

Hypothesis testing: Two sample tests for continuous metrics

This post covers the statistical tests I use most often for comparing a continuous metric in two independent samples. In general, I recommend using Welch’s t-test, and if the assumptions...

Weighted statistics and the t-test

Sometimes the sample data we have doesn’t represent the population well. For example, maybe you run a survey and the response rate is higher for males than females. If the...