Causal inference

The primary goal of this guide is to store my notes, but I hope it may also serve as a practical resource for other data scientists in industry. It begins with an intuitive overview of causal effects and the notation commonly used to describe them, then moves through how to define a causal question, assess identifiability, and estimate effects from data using models. While the framework is grounded primarily in potential outcomes, I also make use of causal diagrams (DAGs) to support the discussion on identification.



Introduction

You have likely heard the phrase “correlation is not causation.” Formally, we would say a treatment has a causal effect on an individual if their outcome under treatment differs from what it would have been without it. This includes both active interventions (like sending an email), and passive exposures (like a policy change or app outage). The fundamental problem is that we can never observe both outcomes for the same individual. If you send a customer an email, you can observe whether they return afterward, but you will never know for sure what they would have done without it. The potential outcomes framework makes this explicit by distinguishing between the factual (observed outcome) and the counterfactual (unobserved alternative). While we can’t observe both for one individual, we can estimate the average treatment effect (ATE) across groups, provided the treated and untreated are otherwise comparable. Further, by conditioning on relevant variables, we can estimate the conditional average treatment effect (CATE), enabling more personalised decisions.

In randomised experiments, comparability is achieved by design. By randomly assigning treatment, we ensure that differences in outcomes between treated and untreated groups can be attributed to the treatment. However, experiments are not always feasible due to legal, ethical, financial, or logistical constraints. In such cases, we rely on observational data, and distinguishing causation from correlation becomes more difficult. When two variables are associated, observing the value of one tells us something about the value of the other. However, observing two variables X and Y moving together cannot tell us whether X causes Y, Y causes X, or whether both are caused by a third variable Z. To use a classic example, beer and ice cream sales may rise together in summer, not because one causes the other, but because both are driven by temperature.

Causal inference from observational data requires careful thinking about the data-generating process: defining an intervention (treatment), understanding how treatment is assigned, and clarifying assumptions. For any method to identify causal effects, three core conditions must hold: a well-defined intervention (something concrete needs to be changing), exchangeability (treated and untreated groups are comparable in the absence of treatment), and positivity (every individual has a nonzero probability of receiving each treatment level).

Ultimately, causal claims from observational data rely heavily on domain knowledge and the plausibility of the assumptions. Since perfect data is rare, transparency and sensitivity analyses are essential for assessing potential biases. This transparency is both a strength and a vulnerability, as it invites scrutiny. However, relying on methods with hidden or unacknowledged assumptions does not lead to more trustworthy decisions; it only conceals the uncertainty rather than addressing it.

Description, Prediction, or Explanation?

Before jumping into causal inference theory and methods, it’s crucial to understand where causal questions fit within the broader data science landscape. As a data scientist, you’re often handed vague requests. A critical part of the job is to clarify what the stakeholder is really asking, and what they intend to do with the answer. Most questions fall into one of three broad categories:

- Description: summarising what happened (patterns, trends, and characteristics of the population).
- Prediction: forecasting what will happen, without asking why (e.g. which customers are likely to churn).
- Explanation (causal inference): estimating what would happen if we intervened (e.g. would a discount reduce churn?).

Confusion often arises in real-world projects when questions span multiple categories. You might start with descriptive analysis to identify patterns and understand your population, build predictive models to forecast outcomes, and ultimately tackle causal questions about how to influence results. The common mistake is treating these categories as interchangeable. Even when using the same tools, the assumptions, goals, and interpretations differ. Crucially, a model with high predictive accuracy does not necessarily reveal the true effect of any variable. Conversely, a model designed for causal estimation may not excel at prediction. These are fundamentally different tasks.

To illustrate this point, suppose you build a model to predict beer sales, and find that ice cream sales are a strong predictor. The model performs well and the coefficient is significantly positive, so you conclude that increasing ice cream sales will also increase beer sales. However, the true driver is likely weather. Acting on this mistaken belief could waste time, or even reduce overall sales if the two products compete. It is easy to make this mistake. Predictive models often produce convincing results that seem to explain outcomes. When stakeholders want actionable insights, there is pressure to move quickly and skip formal causal analysis. As a result, we often get what Haber et al. (2022) call “Schrödinger’s causal inference”: an analysis that avoids stating a causal goal, but still makes causal claims and recommendations.

While it often requires more upfront effort, more formal causal thinking prevents teams from pursuing strategies that have no real effect, saving time in the long run by avoiding costly detours.

Choosing the Right Level of Evidence

Causal thinking promotes a deeper understanding of problems by focusing on mechanisms and clarifying assumptions. However, not every causal question requires a full formal study. In many operational settings, a strong correlation paired with contextual knowledge is enough to justify quick action. For example, if conversions drop suddenly and you notice a product change was deployed at the same time, it’s reasonable to suspect the change as the cause and test a rollback. In these cases, you are still reasoning causally, but the cost of being wrong is low, and the response is reversible. The key is to be honest about the level of uncertainty and the type of evidence being used. As a counterexample, suppose you’re evaluating whether the company should expand to a new market. Let’s say prior expansions were successful, but focused on high-growth regions. Naively comparing this market to the prior examples may overestimate the expected outcome. The decision is high cost and difficult to reverse, so it would be beneficial to use a more formal causal approach to control for the market differences.

Workflow

The distinction between prediction and causation naturally leads us to the workflow for explanatory questions. If you’ve done a lot of predictive modeling, this will feel familiar, but with some important shifts in mindset. In causal inference, our goal isn’t just to model outcomes well, but to estimate what would happen under different interventions. That means clearly defining counterfactual scenarios and ensuring that our models reflect the underlying data-generating process, rather than aiming for high predictive accuracy.

The workflow below outlines one principled approach to identifying causal effects, though it’s not the only path:

1. Formulate the causal question: define the treatment, outcome, population, estimand, and effect measure.
2. Assess identifiability: make the assumptions explicit (e.g. with a target trial and/or causal graph) and check consistency, exchangeability, and positivity.
3. Prepare the data to emulate the target trial (eligibility, treatment assignment, follow-up, outcomes).
4. Estimate the effect with an appropriate model or estimator.
5. Quantify uncertainty and run sensitivity analyses to assess how robust the conclusions are to violations of the assumptions.

I’m presenting this workflow upfront to provide a clear framework for how the pieces fit together. In the next sections, we’ll dive into the theory and notation behind each step, but having the full picture in mind will help keep the details grounded.


Notation

Before we go further, let’s cover some notation and definitions. Skimming this now will make the explanations easier to follow, but you can also refer back as you go. Note that capital letters are used for random variables, and lower case letters for particular values of those variables.

| Notation / Term | Definition |
|---|---|
| \(T_i\) | Treatment indicator for observation \(i\). Takes the value 1 if treated, 0 if not treated. |
| \(Y_i\) | The observed outcome for observation \(i\). |
| \(Y_{0i}\) | The potential outcome for observation \(i\) without treatment (i.e. counterfactual under control). |
| \(Y_{1i}\) | The potential outcome for observation \(i\) with treatment (i.e. counterfactual under treatment). |
| \(Y_{1i} - Y_{0i}\) | The individual treatment effect for observation \(i\). This is not directly observable. |
| \(ATE = E[Y_1 - Y_0]\) | The average treatment effect (ATE): the expected treatment effect across the entire population. |
| \(ATT = E[Y_1 - Y_0 \mid T=1]\) | The average treatment effect on the treated (ATT): the expected effect for those who actually received treatment. |
| \(\tau(x) = E[Y_1 - Y_0 \mid X = x]\) | The conditional average treatment effect (CATE): the expected effect for individuals with covariates \(X = x\). |
| \(\perp\!\!\!\perp\) | Denotes independence. For example, \(A \perp\!\!\!\perp B\) means that variables \(A\) and \(B\) are statistically independent. |

If \(T_i = t\), then \(Y_{ti}=Y_i\). That is, if observation \(i\) was actually treated (\(t=1\)), their potential outcome under treatment equals their observed outcome. Meanwhile, \(Y_t\) denotes the random variable representing the potential outcome under treatment \(t\) across the entire population.


An introduction to causal treatment effects

Individual effect

A treatment has a causal effect on an individual’s outcome if their outcome with the treatment differs from their outcome without it. Let’s pretend we have two identical planets, and the only difference is that on Earth 1 no one plays tennis (\(Y^\text{t=0}\)) and on Earth 2 everyone plays tennis (\(Y^\text{t=1}\)). The outcome is whether you get cancer, and the treatment is playing tennis. The table below shows there is a causal effect of tennis on cancer for Alice, Bob and Eva because the difference in the potential outcomes is non-zero. Alice and Eva get cancer if they don’t play tennis, and don’t get cancer if they do. The inverse is true for Bob.

| Name | \(Y^\text{t=0}\) | \(Y^\text{t=1}\) | Difference |
|---|---|---|---|
| Alice | 1 | 0 | -1 |
| Bob | 0 | 1 | 1 |
| Charlie | 1 | 1 | 0 |
| David | 0 | 0 | 0 |
| Eva | 1 | 0 | -1 |
| Fred | 0 | 0 | 0 |
| Average | 3/6 | 2/6 | -1/6 |

Average effect

If we take the average treatment effect in the example above (\(ATE=-\frac{1}{6}\)), we would conclude tennis decreases the chance of getting cancer. We can also see this in the average cancer rate for those with and without treatment (as the average of the differences is equal to the difference of the averages). With tennis (\(Y^\text{t=1}\)) they had a \(\frac{2}{6}\) chance, whereas without it (\(Y^\text{t=0}\)) they would have had a \(\frac{3}{6}\) chance, so there is an average causal effect because \(E[Y^\text{t=1}] \neq E[Y^\text{t=0}]\). Keep in mind there can be individual causal effects even when there is no average causal effect: for example, Alice’s effect of \(-1\) could be offset by Bob’s effect of \(+1\), so the average could be zero even though tennis changes both of their outcomes. While the average causal effect is most commonly used, you can also compare other summaries of the potential outcomes, such as the median or variance.
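To make the arithmetic concrete, here is a minimal sketch in Python (my own illustration, assuming pandas is available) that reproduces the table above and shows the two equivalent ways of computing the ATE.

```python
import pandas as pd

# Potential outcomes from the two-planets thought experiment (1 = cancer, 0 = no cancer)
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eva", "Fred"],
    "y0": [1, 0, 1, 0, 1, 0],   # outcome if they don't play tennis
    "y1": [0, 1, 1, 0, 0, 0],   # outcome if they play tennis
})

# Individual effects are only visible because this is a hypothetical world
df["effect"] = df["y1"] - df["y0"]

# Average of the differences equals the difference of the averages
print(df["effect"].mean())                  # -1/6
print(df["y1"].mean() - df["y0"].mean())    # -1/6
```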

The fundamental problem

The fundamental problem of causal inference is that we can never observe the same person/observation with and without treatment at the same point in time. While we might not be able to measure individual effects, we can measure average effects. This is where the potential outcomes framework comes in: the actual observed outcome is the factual, and the unobserved outcome is the counterfactual. If we compare those with and without treatment and they are otherwise similar, we can estimate the average effect.

| Name | \(Y^\text{t=0}\) | \(Y^\text{t=1}\) |
|---|---|---|
| Alice | 1 | ? |
| Bob | ? | 1 |
| Charlie | 1 | ? |
| David | ? | 0 |
| Eva | ? | 0 |
| Fred | 0 | ? |
| Average | 2/3 | 1/3 |

\[ATE = E[Y_1 - Y_0] = E[Y_1] - E[Y_0] = 1/3 - 2/3 = -1/3\]

Now, in our two-Earths example, if we randomly selected which individuals we observe under each condition, this would be an unbiased estimate of the average causal effect. However, if we just took a sample of people from our one Earth and compared those who play tennis to those who don’t, we run into the problem of bias because all else is not equal (there are other factors influencing the cancer rate that also impact the likelihood of playing tennis, i.e. confounders). We can say from this there’s an association, but not necessarily causation.

Bias in observational data

Bias is the key concept that separates association from causation. Intuitively, when comparing groups in observational data, we should ask why a difference might exist that isn’t due to the treatment itself. For instance, playing tennis requires time and money, which means tennis players may also have higher incomes and engage in other health-promoting behaviors, such as eating healthy food and having regular doctor visits. This could explain some or all of the observed difference in cancer rates.

We cannot observe the counterfactual outcome, \(Y_0\), for the tennis group. In other words, we can’t know what their cancer risk would be if they didn’t play tennis. However, we can reason that \(E[Y_0 \mid T=1]\) (the cancer risk tennis players would have had without tennis) is likely lower than \(E[Y_0 \mid T=0]\) (the cancer risk of non-tennis players), implying that the baselines differ (it’s not a fair comparison). Using the potential outcomes framework we can express this mathematically. An association is measured as the difference in observed outcomes:

\[E[Y \mid T=1] - E[Y \mid T=0] = E[Y_1 \mid T=1] - E[Y_0 \mid T=0]\]

We can separate the treatment effect (ATT) and the bias components by first adding and subtracting the counterfactual outcome \(E[Y_0 \mid T=1]\) (the outcome for treated individuals had they not received the treatment):

\[E[Y \mid T=1] - E[Y \mid T=0] = E[Y_1 \mid T=1] - E[Y_0 \mid T=0] + E[Y_0 \mid T=1] - E[Y_0 \mid T=1]\]

Then re-arranging the equation:

\[= E[Y_1 - Y_0 \mid T=1] + E[Y_0 \mid T=1] - E[Y_0 \mid T=0]\]

Where:

- \(E[Y_1 - Y_0 \mid T=1]\) is the average treatment effect on the treated (ATT); and
- \(E[Y_0 \mid T=1] - E[Y_0 \mid T=0]\) is the bias term: the difference in baseline (untreated) outcomes between the treated and untreated groups.

In our tennis example, we suspect that \(E[Y_0 \mid T=1] < E[Y_0 \mid T=0]\), meaning that tennis players, even without tennis, may have a lower cancer risk than non-tennis players. This bias will likely underestimate the true causal effect of tennis on cancer rates. Note, this was an example of confounding bias, but later we will cover other sources of bias.
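We can also verify this decomposition numerically. The sketch below is my own simulation (with made-up numbers): a binary confounder makes treatment more likely while lowering the baseline outcome, and the naive difference in means works out to exactly the ATT plus the bias term.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# Confounder (e.g. high income): lowers the baseline outcome and raises treatment uptake
x = rng.binomial(1, 0.4, n)

# Potential outcomes: constant treatment effect of -0.5 for everyone
y0 = 2.0 - 1.0 * x + rng.normal(0, 1, n)
y1 = y0 - 0.5

# Treatment is more likely when x = 1, so the groups are not exchangeable
t = rng.binomial(1, np.where(x == 1, 0.7, 0.2))
y = np.where(t == 1, y1, y0)    # observed outcome

naive = y[t == 1].mean() - y[t == 0].mean()       # E[Y|T=1] - E[Y|T=0]
att = (y1 - y0)[t == 1].mean()                    # E[Y1 - Y0 | T=1] = -0.5
bias = y0[t == 1].mean() - y0[t == 0].mean()      # E[Y0|T=1] - E[Y0|T=0]

print(naive, att + bias)   # identical: the decomposition holds exactly in-sample
```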

Addressing the bias

To obtain an unbiased estimate of the treatment effect, we need the treated and untreated groups to be exchangeable before the treatment is applied (\(E[Y_0 \mid T=1] = E[Y_0 \mid T=0]\), so the bias term becomes 0). In this case, the difference in means can give us a valid causal estimate for the Average Treatment Effect on the Treated (ATT):

\[ATT = E[Y_1 - Y_0 \mid T=1]\] \[= E[Y_1 \mid T=1] - E[Y_0 \mid T=1]\] \[= E[Y_1 \mid T=1] - E[Y_0 \mid T=0]\] \[= E[Y \mid T=1] - E[Y \mid T=0]\]

If, in addition, the treated and untreated groups would respond to the treatment similarly (i.e. they are exchangeable with respect to \(Y_1\) as well as \(Y_0\)), then the treatment effect is the same for both groups. In that case:

\[E[Y_1 - Y_0 \mid T=1] = E[Y_1 - Y_0 \mid T=0]\] \[ATT = E[Y_1 - Y_0 \mid T=1] = ATE\]

This is the goal of causal inference: to remove bias so we are left with a consistent estimate of the estimand.


Formulating a causal question

A well-defined causal question begins with a clear decision context: What intervention is being considered? What outcome are you trying to influence? And for whom? These questions help define the unit of analysis, the treatment, the outcome, and the target population. Once this is clear, we can formulate the question more precisely.

Estimands

We often refer to the average treatment effect (ATE), which is the expected difference in outcomes if everyone received treatment versus if no one did. However, the ATE is just one example of an estimand you might target. To clarify the terminology:

| Consideration | Why it matters | Definition |
|---|---|---|
| Estimand | Defines the “who” and “what” of the causal question (for example, ATE, ATT). | The causal effect you want to learn about: the target quantity, or the “who” and “what” of the question. |
| Effect measure | Specifies how the effect is quantified (for example, difference in means, odds ratio). | How you express or quantify the effect: the scale or metric used. |
| Estimator | Describes the method or algorithm used to estimate the effect from data. | Your method or algorithm for estimating the effect from data. |
| Estimate | The actual numerical result you compute, subject to sampling variability. | The numerical result you get, which can vary due to sampling and randomness. |

To make this more relatable, consider this analogy: the effect measure is the type of cake you want to bake (chocolate, carrot, etc.), the estimand is the perfect picture of the cake in the cookbook, the estimator is the recipe, and the estimate is the cake you actually bake, complete with cracks, uneven frosting, or oven quirks.

Example

Different estimands target different populations or conditions. For example, if you’re offering a 20% personalised discount to customers, are you interested in its impact on all customers, only those who received it, or specific subgroups like frequent shoppers? We can consider different estimands depending on the business goal in this case:

| Estimand | Description | Use Case | Business Question |
|---|---|---|---|
| ATE (Average Treatment Effect) | Effect averaged over the entire population. | Evaluating broad rollout potential. | What is the average impact of offering a 20% discount across all customers? |
| ATT (Average Treatment Effect on the Treated) | Effect among those who actually received the treatment. | Evaluating past campaign performance. | How effective was the discount among those who received it? |
| ATC (Average Treatment Effect on the Controls) | Effect among those who did not receive the treatment. | Expanding treatment to new customers. | What is the expected effect if we applied the discount to previously untreated customers? |
| CATE (Conditional Average Treatment Effect) | Effect conditional on covariates or subgroups (e.g. high spenders); often used as a proxy for Individual Treatment Effects (ITEs). | Designing personalised or segmented campaigns. | How much more effective is the discount for frequent shoppers compared to occasional ones? |
| LATE (Local Average Treatment Effect) | Effect among compliers (those whose treatment is influenced by an instrument, e.g. randomised assignment). | Using natural experiments or A/B tests with imperfect adherence. | What is the effect of the discount on customers who redeem it (when randomly assigned to receive one)? |
| QTE (Quantile Treatment Effect) | Effect at specific outcome quantiles; captures heterogeneity beyond the mean. | Detecting tail effects (e.g. high-value or low-value shifts). | How does the discount affect different points in the spend distribution (e.g. light vs. heavy spenders)? |
| Policy Value | Expected outcome under a treatment assignment rule (e.g. treat only high-CATE users). | Making decisions based on targeting models. | What is the impact of using a targeting strategy based on predicted treatment effects? |

Implicit target

In practice, it can be unclear which estimand your chosen estimator actually targets. Different estimation methods implicitly assign different weights to subgroups within your data, which influences whether your estimate reflects the average treatment effect (ATE), the effect on the treated (ATT), or another causal quantity. For example, linear regression often produces a weighted average of subgroup effects, but the weights depend on the distribution of treatment and covariates in your sample. Even in a random sample, if some groups show more treatment variation or have larger sample sizes, they may receive more weight, causing the estimate to reflect their treatment effect more than the overall ATE. Additionally, negative weights may appear, indicating extrapolation beyond the observed data and potentially reducing the validity of the estimate. Examining these implied weights can help clarify the causal interpretation of your estimate and ensure it aligns with your decision context and causal question.
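To see how this can play out, here is a small simulation I've added (with arbitrary numbers): two equally sized groups have different treatment effects and very different treatment probabilities. A single OLS regression on treatment and the group indicator returns a variance-weighted average of the group effects rather than the ATE.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200_000

# Two equally sized subgroups with different treatment effects and
# very different treatment probabilities (so treatment variance differs by group)
g = rng.binomial(1, 0.5, n)
t = rng.binomial(1, np.where(g == 1, 0.5, 0.05))
effect = np.where(g == 1, 2.0, 0.5)
y = 1.0 + effect * t + g + rng.normal(0, 1, n)
df = pd.DataFrame({"y": y, "t": t, "g": g})

true_ate = 0.5 * 0.5 + 0.5 * 2.0                       # population-weighted average = 1.25
ols_coef = smf.ols("y ~ t + g", data=df).fit().params["t"]
print(true_ate, ols_coef)   # OLS leans towards the high-variation group (roughly 1.75 here)
```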

Effect measures

Once you’ve defined the estimand, the next decision is how to express the size of the effect. That is, the effect measure. This determines the scale of the estimate, which influences both interpretation and communication. In most cases, you’re comparing expected potential outcomes under treatment (\(E[Y_1]\)) and control (\(E[Y_0]\)). The ATE, for instance, uses an additive scale:

\[\text{ATE} = E[Y_1 - Y_0] = E[Y_1] - E[Y_0]\]

But other scales, such as ratios or log-transformed differences, may be more appropriate depending on the outcome type or audience. Sticking with the discount example, we can show how different effect measures apply to binary (purchase) and continuous (spending) outcomes:

| Outcome Type | Difference (Additive) | Ratio (Relative) | Multiplicative (Log or Odds-based) |
|---|---|---|---|
| Description | Absolute difference in outcome between treatment and control groups. | Relative (proportional) change due to treatment. | Nonlinear transformations of the effect. Often used when the scale is skewed or to stabilize variance. |
| Binary outcome (did the customer purchase?) | \(\text{Risk Diff} = \Pr(Y_1 = 1) - \Pr(Y_0 = 1)\). “The discount increases purchase likelihood by 10 percentage points.” | \(\text{Risk Ratio} = \Pr(Y_1 = 1) / \Pr(Y_0 = 1)\). “Customers with the discount are 1.5× more likely to buy.” | \(\text{Odds Ratio} = \frac{\Pr(Y_1 = 1)/(1 - \Pr(Y_1 = 1))}{\Pr(Y_0 = 1)/(1 - \Pr(Y_0 = 1))}\). “The odds of purchase double with the discount.” |
| Continuous outcome (how much did the customer spend?) | \(\text{Mean Diff} = E[Y_1] - E[Y_0]\). “Average spend increases by $5 with the discount.” | \(\text{Ratio of Means} = E[Y_1] / E[Y_0]\). “Discounted customers spend 1.2× more on average.” | \(\log(E[Y_1] / E[Y_0])\) or \(\frac{E[Y_1] - E[Y_0]}{E[Y_0]} \times 100\%\). “Spend increases by 20% with the discount.” |

Note: Recall that \(Y_t\) is the random variable representing the potential outcome under treatment \(t\) across the entire population.
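As a quick worked example (with made-up rates that match the table's narrative), here is how the measures above could be computed:

```python
# Hypothetical conversion rates and average spend for treated vs control
p1, p0 = 0.30, 0.20          # purchase probability with / without the discount
m1, m0 = 30.0, 25.0          # average spend with / without the discount

risk_diff = p1 - p0                                # 0.10 -> "+10 percentage points"
risk_ratio = p1 / p0                               # 1.5  -> "1.5x more likely to buy"
odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))     # ~1.71

mean_diff = m1 - m0                                # 5.0  -> "+$5 average spend"
ratio_of_means = m1 / m0                           # 1.2
pct_change = 100 * (m1 - m0) / m0                  # 20%

print(risk_diff, risk_ratio, round(odds_ratio, 2))
print(mean_diff, ratio_of_means, pct_change)
```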

When choosing an effect measure:

Summary

Estimands define the target of your analysis: the causal effect you’re trying to learn about, for a specific population under specific conditions. Effect measures define the scale used to express that effect, which influences how results are interpreted, compared, and communicated. Choosing the right estimand and effect measure is a foundational part of causal inference, ensuring your findings directly answer the right question.

Once you’ve defined these, the next challenge is figuring out how to identify that causal quantity from data. In randomised experiments, identification is usually straightforward: treatment is independent of potential outcomes by design. But with observational data, identifying a causal effect requires careful reasoning about causal structure, specifically, which variables to adjust for, and which to leave alone. Therefore, we’ll start with a brief section on Randomised Control Trials, and then spend a lot more time on observational data.


Causal identification in randomised experiments

In a Randomised Controlled Trial (RCT) there are still missing values for the counterfactual outcomes, but they occur by chance due to the random assignment of treatment (i.e. they are missing at random). Imagine flipping a coin for each person in your sample: if it’s heads, they get the treatment; if it’s tails, they get the control. Since the assignment is random, the two groups (treatment and control) are similar on average, making them exchangeable (i.e. the treatment effect would have been the same if the assignments were flipped). This is an example of a marginally randomised experiment, meaning it has a single unconditional (marginal) randomisation probability for all participants (e.g. 50% chance of treatment with a fair coin, though it could also be some other ratio).

The potential outcomes in this case are independent of the treatment, which we express mathematically as: \(Y_t \perp\!\!\!\perp T\). You could convince yourself of this by comparing pre-treatment values of the two groups to make sure there’s no significant difference. Given the expected outcome under control, \(E[Y_0 \mid T=0]\), is equal to the expected outcome under control for those who received treatment, \(E[Y_0 \mid T=1]\), any difference between the groups after treatment can be attributed to the treatment itself: \(E[Y \mid T=1] - E[Y \mid T=0] = E[Y_1 - Y_0] = ATE\).

It’s important not to confuse this with the statement: \(Y \perp\!\!\!\perp T\) which would imply that the actual outcome is independent of the treatment, meaning there is no effect of the treatment on the outcome.

Conditional randomisation & standardisation

Another type of randomised experiment is a conditionally randomised experiment, where we have multiple randomisation probabilities that depend on some variable (S). For example, in the medical world we may apply a different probability of treatment depending on the severity of the illness. We flip one (unfair) coin for non-severe cases with a 30% probability of treatment, and another for severe cases with a 70% probability of treatment. Now we can’t just compare the treated to the untreated because they have different proportions of severe cases.

However, they are conditionally exchangeable: \(Y_t \perp\!\!\!\perp T \mid S=s\). This is to say that within each subset (severity group), the treated and untreated are exchangeable. It’s essentially two marginally randomised experiments; within each, we can calculate the causal effect. It’s possible the effect differs in each subset (stratum), in which case there’s heterogeneity (i.e. the effect differs across different levels of S). However, we may still need to calculate the population ATE (e.g. maybe we won’t have the variable S for future cases). We do this by taking the weighted average of the stratum effects, where the weight is the proportion of the population in the subset:

\[Pr[Y_t=1] = Pr[Y_t = 1 \mid S =0]Pr[S = 0] + Pr[Y_t = 1 \mid S = 1]Pr[S = 1]\] \[\text{Risk ratio} = \frac{\sum_\text{s} Pr[Y = 1 \mid S=s, T=1]Pr[S=s]}{\sum_\text{s} Pr[Y = 1 \mid S=s, T=0]Pr[S=s]}\]

This is called standardisation, and conditional exchangeability is also referred to as ignorable treatment assignment, no omitted variable bias or exogeneity.

Inverse Probability Weighting (IPW)

Inverse probability weighting is mathematically equivalent to the standardisation described above. Each individual is weighted by the inverse of the conditional probability of receiving the treatment they actually received. Thus their weight depends on their treatment value T and the covariate S (which, as above, determined their probability of treatment): if they are treated, the weight is \(1 / Pr[T = 1 \mid S = s]\); if they are untreated, it’s \(1 / Pr[T = 0 \mid S = s]\). To write this compactly for all individuals, we use the conditional density of treatment given S: \(W^T = 1 / f[T \mid S]\). Both IPW and standardisation are controlling for S by simulating what would have happened if we hadn’t used S to determine the probability of treatment.
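Here is a small end-to-end sketch (my own simulation) of a conditionally randomised design: severity S determines the treatment probability, and both standardisation and IPW recover the same marginal risks.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 400_000

s = rng.binomial(1, 0.3, n)                   # severity: 30% of the population is severe
p_t = np.where(s == 1, 0.7, 0.3)              # conditional randomisation probabilities
t = rng.binomial(1, p_t)
y = rng.binomial(1, 0.2 + 0.4 * s - 0.1 * t)  # outcome risk depends on S and T
df = pd.DataFrame({"s": s, "t": t, "y": y})

# Standardisation: stratum-specific risks weighted by the population distribution of S
risk = df.groupby(["s", "t"])["y"].mean()
p_s = df["s"].value_counts(normalize=True)
std_risk = {a: sum(risk[(sv, a)] * p_s[sv] for sv in (0, 1)) for a in (0, 1)}

# IPW: weight each person by 1 / Pr[T = t_i | S = s_i] (known by design here)
w = 1 / np.where(t == 1, p_t, 1 - p_t)
ipw_risk = {a: np.average(y[t == a], weights=w[t == a]) for a in (0, 1)}

# The two approaches give (essentially) the same marginal risks and risk ratio
print(std_risk[1] / std_risk[0], ipw_risk[1] / ipw_risk[0])
```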

Challenges

Even in a randomised experiment, certain issues can threaten the validity of causal conclusions. One foundational assumption in causal inference is the Stable Unit Treatment Value Assumption (SUTVA). It has two components: 1) no interference between units: one individual’s outcome does not depend on the treatment assigned to others; and, 2) well-defined treatments: each unit receives a consistent version of the treatment, with no hidden variation. Two common violations of SUTVA are interference and leakage:

| Aspect | Interference | Leakage of Treatments |
|---|---|---|
| Definition | When an individual’s outcome is influenced by the treatment or behavior of others. | When individuals in the control group (or other non-treatment groups) receive some form of treatment. |
| Context | Common in social networks, product launches, team-based incentives, or any setting with interaction between units. | Common in experiments or rollouts where treatment access is imperfectly controlled or shared across users. |
| Impact on Counterfactuals | The counterfactual is not well-defined because an individual’s outcome depends on others’ treatments (violates SUTVA). | The counterfactual can still be defined, but it becomes biased due to treatment contamination (may violate SUTVA). |
| Cause | Results from interactions between individuals (e.g. peer effects, social contagion, team dynamics). | Results from imperfect treatment assignment or adherence to protocols (e.g. non-compliance or treatment spillover). |
| Example | A user’s likelihood of subscribing may depend on whether their friends or coworkers were also targeted with the offer. | I send you a discount offer, and you forward it to a friend who was assigned to the control group. |
| Effect on Treatment Effect | Makes it difficult to estimate treatment effects due to non-independence of outcomes. | Dilutes the treatment effect, making it harder to assess the true effectiveness of the treatment. |
| Type of Problem | A structural problem in defining counterfactuals due to interdependence between individuals. | A practical problem with treatment implementation, leading to contamination between groups. |

Analysis risks

Even with perfect randomisation at the design stage, careless analysis can break the assumptions we rely on to draw valid causal conclusions. The issues below are some of the most common ways experiment results get distorted in practice:

This isn’t an exhaustive list, but it highlights a key point: randomised experiments aren’t magical, and you can still break randomisation if you’re not careful.

Crossover Experiments

While less common in industry, it’s worth mentioning a special case where we can observe both potential outcomes for each individual, by assigning them to different conditions at different times. For example, we might give a user the treatment at time \(t = 0\) and the control at time \(t = 1\), then compare their outcomes across those periods to estimate an individual-level treatment effect. This design is called a crossover experiment, and it can, in theory, eliminate between-subject variation entirely. However, for this approach to yield valid causal estimates, several strong assumptions must hold:

- No carryover effects: the treatment given in one period does not affect outcomes in later periods.
- Stable effects over time: the treatment effect does not depend on when the treatment is given.
- Stable baseline: the outcome under control would have been the same in both periods (no underlying time trend).

In practice, these assumptions can be difficult to justify. For example, a user exposed to a new recommendation algorithm may change their behavior in a lasting way, making it difficult to observe a clean post-treatment control condition. Still, in tightly controlled settings where these assumptions are plausible, such as short-session A/B tests of user interface changes, crossover designs can offer high efficiency by letting individuals serve as their own control.


Causal identification in observational studies

Randomised experiments are considered the gold standard in causal inference because they eliminate many sources of bias by ensuring treatment is assigned independently of potential outcomes. However, they’re not always feasible or ethical. In many industry settings, constraints related to time, data infrastructure, or business needs make it difficult to run a proper experiment, or to wait for long-term outcomes. For instance, suppose you’re testing a new user onboarding experience. You can run an A/B test for two weeks, but maybe what you really care about is its impact on user retention six months later. Long-term experiments often face practical limitations: users may churn or become inactive before the outcome is observed, tracking identifiers like cookies or sessions may not persist long enough to follow users over time, and business pressures may require decisions before long-term data is available.

In such cases, we turn to observational data. The advantage is that it’s abundant in most businesses. The challenge is that treatment is not assigned at random, so treated and untreated groups may differ in ways that affect the outcome. These differences, called confounding, can bias our estimates of causal effects. For example:

To identify causal effects from observational data, we need to make assumptions and apply adjustment techniques to account for such confounding. The rest of this guide focuses on helping you navigate the choices and interpret causal estimates responsibly, even when randomisation isn’t available.

Identifiability conditions

To estimate causal effects with observational data, we often assume that treatment is conditionally random given a set of observed covariates (S). Under this assumption, the causal effect can be identified if the following three conditions hold:

1. Consistency: The treatment corresponds to a well-defined intervention with precise counterfactual outcomes. This overlaps with the Stable Unit Treatment Value Assumption (SUTVA), which adds the requirement of no interference between units, meaning one individual’s treatment does not affect another’s outcome. Vague or ill-defined treatments undermine consistency. Examples:

2. Exchangeability (also called ignorability): Given covariates S, the treatment assignment is independent of the potential outcomes. In other words, all confounders must be measured and included in S. If this is the case, then within each level of S, the distribution of potential outcomes is independent of treatment: \(Y_t \perp\!\!\!\perp T \mid S = s\). If important confounders are unmeasured, causal estimates may be biased. Since there’s no statistical test for exchangeability, defending it relies on domain expertise and thoughtful variable selection. This is often the hardest condition to justify in industry settings. Example:

3. Positivity (also called overlap): Every individual must have a non-zero probability of receiving every level of treatment, given S. That is, within each covariate-defined subgroup, there must be both treated and untreated individuals. Positivity violations can sometimes be addressed by redefining the population of interest using exclusion criteria, but this reduces generalisability and should be explicitly acknowledged. Examples:

While consistency and positivity are often manageable with careful study design and clear definitions, exchangeability typically requires more substantive domain knowledge and careful attention to potential confounding.

When assumptions fail

If these assumptions do not hold, all is not lost. There are alternative strategies that can help recover causal estimates under weaker or different sets of assumptions. For example, Instrumental Variable (IV) analysis leverages a variable that influences treatment assignment but does not directly affect the outcome (and is not related to unmeasured confounders). This can help achieve exchangeability in situations where confounding cannot be fully addressed through adjustment alone, effectively mimicking random assignment. These methods are covered in later sections.

Target Trial

A useful approach for framing a causal question is to first design a hypothetical randomised experiment, often called the target trial. This imagined experiment helps clarify exactly what causal effect we want to estimate and what conditions must hold to identify it from observational data. The target trial forces you to specify key components: the treatment or intervention, eligibility criteria, the assignment mechanism, and the outcome. Once these are defined, you can assess whether your observational data is rich enough to emulate the trial. For example, do you have sufficient information to identify who would be eligible? Can you determine treatment assignment timing? If your emulation is successful, the causal effect estimated from observational data should correspond to the causal effect in the target trial.

This thought experiment is particularly useful for evaluating the core assumptions required for identification:

If a target trial cannot be specified, then any associations observed in the data may still be useful for generating hypotheses but they cannot be interpreted as causal effects without stronger assumptions or alternative identification strategies.

Causal Graphs

Representing assumptions visually can further enhance understanding and communication. By mapping variables and their causal relationships (including unmeasured variables), causal graphs help you reason about which variables need adjustment. Typically, these graphs are drawn as directed acyclic graphs (DAGs). Each variable is represented as a node, and arrows (or edges) between nodes indicate causal relationships, usually flowing from left to right. With just two variables, we would say they are independent if there is no arrow between them (for example, the outcomes of rolling two random dice). Keep in mind, graphs show relationships that apply to any individual in the population. Although uncommon, it’s possible that the average effect or association is not observed in a sample, such as when positive and negative effects perfectly cancel each other out.

As more variables are added, it becomes important to consider the possible paths between them and whether those paths are open or blocked. The key principle is that any variable is independent of its non-descendants conditional on its parents. This means that once you know the variables that directly cause it (its parents), the variable does not depend on any other earlier variables that are not in its direct line of influence (its non-descendants). If all paths between two variables are blocked once we control for other variables, then those variables are conditionally independent. For example, a user’s likelihood of canceling a subscription might be independent of their reduced feature usage once we know their overall satisfaction score, as satisfaction explains both behaviors, leaving no additional link.

Understanding how and why paths are blocked or open leads us to the next key concept: the types of paths that exist in a causal graph and how they affect identifiability.

Key Causal Pathways

There are three key types of paths to look for: mediators, confounders, and colliders.

Mediator vs Confounder vs Collider

1. Mediator: A variable that lies on the causal path from treatment to outcome, capturing part or all of the effect of treatment.

Conditioning on a mediator helps clarify the mechanism by which one variable affects another. Consider a situation where Sales are influenced by both Ad Spend and Web Traffic:

\[\text{Ad Spend} \rightarrow \text{Traffic} \rightarrow \text{Sales}\]

Here, increasing Ad Spend increases Traffic, which in turn boosts Sales. If we condition on Traffic, the relationship between Ad Spend and Sales disappears because Traffic fully mediates (explains) this relationship. Mathematically:

\[\text{Sales} \perp\!\!\!\perp \text{Ad Spend} \mid \text{Traffic}\]

If we fit a regression model without considering traffic, we might mistakenly conclude there’s a direct relationship between ad spend and sales. Once traffic is included, ad spend might no longer be statistically significant, as its effect on sales is mediated by traffic. We might therefore want to optimise ad spend to maximise traffic, using a separate model to predict the effect of the increased traffic on sales. However, if we wanted to estimate the total effect of ad spend on sales, we wouldn’t want to control for traffic as this blocks the indirect path (underestimating the total).

2. Confounder: A variable that causally influences both the treatment and the outcome, potentially introducing spurious correlation if not controlled.

A variable that causes two other variables. Conditioning on this variable is necessary to avoid a spurious association between the two other variables, biasing the causal effect estimate. Imagine that Smoking has a causal effect on both Lung Cancer and Carrying a Lighter:

\[\text{Lighter} \leftarrow \text{Smoking} \rightarrow \text{Lung Cancer}\]

While carrying a lighter doesn’t cause lung cancer, if we compare lung cancer rates between people who carry a lighter and those who don’t, we might find a spurious association. This happens because smoking influences both lung cancer and carrying a lighter. To avoid bias, we must condition on smoking, breaking the spurious link between carrying a lighter and lung cancer (\(\text{Lighter} \perp\!\!\!\perp \text{Lung Cancer} \mid \text{Smoking}\)).

In a sales example, if both ad spend and sales increase during the holidays due to seasonality (Ad Spend ← Season → Sales), failing to control for the season could lead to overstating the effect of ad spend on sales.

3. Collider: A variable that is caused by two or more other variables, opening a non-causal path if conditioned on.

Be cautious not to condition on a collider as it can create a spurious relationship between the influencing variables by opening a non-causal path between treatment and outcome (collider bias). In this example, both Ad Spend and Product Quality affect Sales, but they are independent of each other:

\[\text{Ad Spend} \rightarrow \text{Sales} \leftarrow \text{Product Quality}\]

Once we condition on sales, we create a spurious association between ad spend and product quality (\(\text{Ad Spend} \perp\!\!\!\perp \text{Product Quality}\), but \(\text{Ad Spend} \not\!\perp\!\!\!\perp \text{Product Quality} \mid \text{Sales}\)). Among observations with a similar level of sales, high ad spend tends to coincide with lower product quality (and vice versa), because something has to account for that sales level. This creates a false impression that the two variables are related, even though both only influence sales.

General Rules

Conditioning on non-colliders (confounders or mediators) blocks the path, making the other variables independent. Conditioning on a collider (or its descendants) creates a dependency between the variables influencing it, which opens the path. Not conditioning on the collider blocks the path, keeping the influencing variables independent.
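These rules are easy to check by simulation. The sketch below (my own, using numpy, pandas and statsmodels with arbitrary coefficients) generates data from the three structures above and compares regression coefficients with and without conditioning on the middle variable.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 100_000

def noise():
    return rng.normal(0, 1, n)

# Confounder: season -> ad spend, season -> sales (no direct ad -> sales effect)
season = noise()
df_conf = pd.DataFrame({"season": season,
                        "ad": 0.8 * season + noise(),
                        "sales": 1.2 * season + noise()})
print(smf.ols("sales ~ ad", df_conf).fit().params["ad"])            # spurious (not 0)
print(smf.ols("sales ~ ad + season", df_conf).fit().params["ad"])   # ~0 once adjusted

# Mediator: ad spend -> traffic -> sales
ad = noise()
traffic = 1.0 * ad + noise()
df_med = pd.DataFrame({"ad": ad, "traffic": traffic,
                       "sales": 0.7 * traffic + noise()})
print(smf.ols("sales ~ ad", df_med).fit().params["ad"])             # total effect ~0.7
print(smf.ols("sales ~ ad + traffic", df_med).fit().params["ad"])   # ~0 (path blocked)

# Collider: ad spend -> sales <- quality, with ad and quality independent
ad2, quality = noise(), noise()
df_col = pd.DataFrame({"ad": ad2, "quality": quality,
                       "sales": 0.6 * ad2 + 0.6 * quality + noise()})
print(smf.ols("quality ~ ad", df_col).fit().params["ad"])           # ~0 (independent)
print(smf.ols("quality ~ ad + sales", df_col).fit().params["ad"])   # non-zero: collider bias
```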

Paths

By visualising causal relationships in a graph, you can better understand how variables are connected and whether observed associations are due to direct causal effects, confounding, mediation, or colliders. By distinguishing between these different relationships, we can handle them appropriately and build more accurate models and inferences. While you don’t always need a graph, it becomes increasingly useful as the number of variables grows. For example, let’s look at a slightly more complex graph with both confounders and colliders. By visualising the graph it’s easier to identify whether X and Y are independent based on whether there’s an open path between them:

Paths

Let’s look at a labelled example. Suppose we want to understand the direct causal relationship between the number of police officers and the number of victims of violent crimes in a city. Depending on what variables we control for, we open different paths and will get different results:

Crime graph

| Model | Variables Included | Notes |
|---|---|---|
| 1. Victims ~ Police | Only treatment (Police) | Doesn’t control for the confounder (Crime), so may yield biased results (e.g. more police associated with more crime). Mediator (Response) not controlled, so overestimates the direct effect. |
| 2. Victims ~ Police + Response + Crime + Coverage + Perception + Budget | Treatment, mediator, confounder, and collider path | Opens a backdoor path via the collider (Media Coverage), which can create a spurious association between Police and Budget. If both are influenced by a common cause (e.g. City Budget), this can bias the estimate. |
| 3. Victims ~ Police + Response + Crime + Budget | Treatment, mediator, confounder, and irrelevant variable | Avoids collider bias by excluding Media Coverage. Including Budget is unnecessary but not harmful unless it reintroduces bias through other unblocked paths. |

Note, if we wanted to measure the total effect of the number of police on the number of victims, we would only control for the confounder in this case, as the mediator would absorb some or all of the effect.

Systematic bias

Let’s revisit the concept of bias, this time with the help of causal graphs. Systematic bias occurs when an observed association between treatment and outcome is distorted by factors other than the true causal effect. In graphical terms, this often corresponds to open non-causal paths linking treatment and outcome, producing spurious associations even when the true effect is zero (sometimes called bias under the null). Common causes, or confounders, create backdoor paths that induce such spurious associations, while conditioning on common effects, or colliders, opens other non-causal paths and leads to selection bias. Measurement error in confounders or key variables can also contribute by preventing complete adjustment. Additionally, systematic bias can arise from other sources like model misspecification or data issues, which affect the accuracy of effect estimates. In this section, we’ll focus on confounding and collider bias, which can be clearly illustrated using causal graphs.

Confounding bias

In observational studies, treatment is not randomised and so it can share causes with the outcome. These common causes (confounders) open backdoor paths between treatment and outcome, creating spurious associations. If you don’t control for them, the observed relationship can reflect both the causal effect of treatment and the effect of these confounders.

Confounding examples

If we know the causal DAG that produced the data (i.e. it includes all causes shared by the treatment and outcome, measured and unmeasured), we can assess conditional exchangeability using the backdoor criterion:

All backdoor paths between T and Y are blocked by conditioning on A, and A contains no variables that are descendants of T.

In practice, this assumption is untestable, so we often reason about the likely magnitude and direction of bias:

Other methods rely on different unverifiable assumptions to deal with confounding. Difference-in-Differences (DiD) can be used with a negative outcome control (e.g. the pre-treatment value of the outcome) to adjust for confounding. For example, if studying Aspirin’s effect on blood pressure, pre-treatment blood pressure (C) isn’t caused by the Aspirin (A), but shares the same confounders (e.g. pre-existing heart disease). Differences in C between the treatment groups give us an idea of the bias in the Aspirin–blood pressure (Y) relationship (assuming the confounding effect is the same):

\[E[Y_1 - Y_0 \mid A = 1] = \left( E[Y \mid A = 1] - E[Y \mid A = 0] \right) - \left( E[C \mid A = 1] - E[C \mid A = 0] \right)\]
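A small numeric sketch of this adjustment (my own simulation of the aspirin example, where an unmeasured confounder affects C and Y equally):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300_000

# Unmeasured confounder (e.g. pre-existing heart disease) raises blood pressure
# and makes aspirin use more likely
u = rng.binomial(1, 0.3, n)
a = rng.binomial(1, np.where(u == 1, 0.6, 0.2))

c = 120 + 10 * u + rng.normal(0, 5, n)             # pre-treatment blood pressure (negative control)
y = 120 + 10 * u - 3.0 * a + rng.normal(0, 5, n)   # post-treatment outcome (true effect = -3)

naive = y[a == 1].mean() - y[a == 0].mean()          # confounded comparison
control_diff = c[a == 1].mean() - c[a == 0].mean()   # bias estimate from the negative control
adjusted = naive - control_diff                      # difference-in-differences style correction

print(naive, adjusted)   # naive is badly biased; adjusted is close to -3
```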

Ultimately, the choice of adjustment method for confounding depends on which assumptions are most plausible. With observational data, uncertainty is unavoidable, so being explicit about assumptions and using sensitivity tests to quantify the potential bias allows others to critically assess your conclusions in a productive way.

Selection bias

Selection bias refers to bias that arises when individuals or units are selected in a way that affects the generalisability of the results (i.e. it’s not representative of the target population). Here we focus on selection bias under the null, meaning there will be an association even if there’s no causal effect between the treatment and the outcome. This occurs when you condition on a common effect of the treatment and outcome, or causes of each. For example, when observations are removed due to loss to follow up, missing data, or individuals opting in/out:

Confounding examples

This can arise even in randomised experiments. Conditional-on-Positives (CoPE) bias occurs when analysis is restricted to “positive” cases for a continuous outcome. For instance, splitting “amount spent” into a participation model (0 vs >0) and a conditional amount model (>0) ignores that treatment can shift people from zero to positive spending. The spenders in treatment and control then differ systematically, and comparing them yields the causal effect plus a selection bias term:

\[E[Y_0 \mid Y_1 > 0] - E[Y_0 \mid Y_0 > 0]\]

Typically, control spenders look inflated (negative bias), as those who spend only because of treatment spend less than habitual spenders. We need to account for the fact that treatment affects both the participation decision (whether someone spends anything) and the spending amount (conditional on spending).

Other than design improvements, selection bias can sometimes be corrected with weights (e.g. via Inverse Propensity Weighting). As with confounding, we should also consider the expected direction and magnitude of the bias to understand the impact.

Example: graphs in the wild

The relationships between variables in causal models can be much more complex in reality, making it challenging to identify the role of each variable. This is especially true when the time ordering of variables is unclear. Let’s consider the graph below looking at the causal effect of PFAS levels on health outcomes, controlling for obesity both before and after exposure:

PFAS effect Source: Inoue et al., 2020

In the graph, we see a bi-directional relationship between obesity and PFAS levels:

To estimate the total effect of PFAS exposure on health outcomes, we want to control for confounders but avoid controlling for mediators, as doing so would block the indirect effect of PFAS on health outcomes via obesity. This is where the issue arises because both obesity and PFAS levels are cumulative measures (i.e. they reflect long-term, prior exposure). Most studies only measure obesity at baseline, and the temporality between PFAS exposure and obesity is not clear. The authors of this graph note that conditioning on obesity_post (post-PFAS exposure) could:

They suggest not stratifying on obesity status if it’s likely a mediator of the PFAS-health relationship being studied (as the stratum effects could be biased). On the other hand, it’s generally safe to adjust for the common causes of PFAS and obesity (e.g. sociodemographic factors, lifestyle, and dietary habits) because these factors are less likely to act as mediators in this causal pathway. This is just one example, but it highlights the need to carefully consider the causal graph, and not over-adjust in ways that could bias the causal effects. When you want to measure the total effect, focus on controlling for confounders, not mediators.

Random variability

Random variability (or sampling variability) is the uncertainty that arises because we estimate population parameters using sample data. This is distinct from the sources of systematic bias (confounding, selection, measurement) which are issues of identification.

Statistical consistency

We can think of an estimator as a rule that takes sample data and produces a value for the parameter of interest in the population (the estimand). For example, we calculate the point estimate \(\hat{P}(Y=1 \mid A=a)\) for the population value \(P(Y=1 \mid A=a)\). Estimates will vary around the population value, but an estimator is considered consistent if the estimates get closer to the estimand as the sample size increases. We therefore have greater confidence in an estimate from a larger sample, which is reflected in a smaller confidence interval around it. Note, “consistency” here is a statistical property and different to the identifiability condition which uses “consistency” to mean a well-defined intervention.

Bringing this back to our cake example from earlier: imagine many bakers all trying to bake a cake using the same recipe (estimator). Each baker uses slightly different ingredients or ovens, so their cakes (estimates) vary. However, if you take the average cake quality across many bakers (many samples), that average gets closer to the perfect cake in the cookbook photo (the estimand) as the number of bakers increases. Consistency means that the recipe works on average: with many bakers, the results converge to the ideal everyone is aiming for.

To contrast this with bias, now imagine the recipe (estimator) consistently calls for the wrong amount of sugar (i.e. model misspecification). No matter how many bakers try it, their cakes will systematically differ from the perfect cake in the cookbook (the estimand). This persistent deviation is bias, it shifts the average cake away from the target, unlike random variation which only increases the spread (variance) around the true cake. In other words, bias affects the center of the distribution of estimates, while random variability affects its width.
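The sketch below (a simulation I've added) illustrates the consistency side of this picture: as the sample grows, difference-in-means estimates from a randomised design stay centred on the estimand while their spread shrinks.

```python
import numpy as np

rng = np.random.default_rng(5)
true_effect = 2.0

def diff_in_means_estimates(n, n_reps=500):
    """Repeat a randomised experiment of size n and return the n_reps estimates."""
    t = rng.binomial(1, 0.5, (n_reps, n))
    y = true_effect * t + rng.normal(0, 3, (n_reps, n))
    treated_mean = (y * t).sum(axis=1) / t.sum(axis=1)
    control_mean = (y * (1 - t)).sum(axis=1) / (1 - t).sum(axis=1)
    return treated_mean - control_mean

for n in (100, 1_000, 10_000):
    est = diff_in_means_estimates(n)
    # Centre stays at the estimand (2.0); the spread (random variability) shrinks with n
    print(n, round(est.mean(), 3), round(est.std(), 3))

# A systematically biased estimator (e.g. a confounded comparison) would keep the
# same offset from 2.0 no matter how large n gets; only its spread would shrink.
```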

Random variability in experiments

The above is also true in experiments. Random assignment of the treatment doesn’t guarantee exchangeability in the sample; rather, it means any non-exchangeability is due to random variation and not systematic bias. Sometimes you’re very unlucky and approximate exchangeability doesn’t hold (e.g. you find a significant difference in one or more pre-treatment variables that are confounders).

There’s ongoing debate about whether to adjust for chance imbalances in randomised experiments, particularly for pre-treatment variables that were not part of the randomisation procedure. The basic conditionality principle tells us we can estimate the ATE without adjusting for covariates because treatment is randomised (unconfounded), but the extended version justifies conditioning on any pre-treatment variables X that were not used in the randomisation to improve precision (by explaining outcome variance) and to estimate sub-group effects (heterogeneity). In practice, a pragmatic approach to avoid over-adjustment is to adjust only for strong predictors of the outcome that are imbalanced enough to plausibly affect inference.

Confidence intervals

Most methods use standard formulas to calculate standard errors and confidence intervals. In some cases, multiple layers of uncertainty require more advanced techniques, such as bootstrapping, to fully capture all sources of random variability. However, confidence intervals only reflect uncertainty due to random variation; they do not account for systematic biases, whose uncertainty does not diminish as sample size increases.
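When a closed-form standard error isn't available (or you don't trust it), a nonparametric bootstrap over rows is a common fallback. A minimal sketch, assuming a simple difference-in-means estimator:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy observed data from a randomised test: treatment flags and outcomes
t = rng.binomial(1, 0.5, 2_000)
y = 1.0 * t + rng.normal(0, 2, 2_000)

def diff_in_means(t, y):
    return y[t == 1].mean() - y[t == 0].mean()

# Resample whole rows with replacement and re-estimate many times
boot = []
for _ in range(2_000):
    idx = rng.integers(0, len(y), len(y))
    boot.append(diff_in_means(t[idx], y[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])
print(diff_in_means(t, y), (lo, hi))   # point estimate with a 95% percentile interval
```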

Summary

Causal inference from observational data relies on strong assumptions to compensate for the lack of randomisation. Identification typically requires three core conditions: consistency (well-defined treatments lead to well-defined outcomes), exchangeability (no unmeasured confounding given observed covariates), and positivity (non-zero probability of treatment across covariate strata). Exchangeability is often the hardest to justify in real-world industry settings. The target trial framework and causal graphs (DAGs) help structure, assess, and communicate these assumptions, as well as the reasoning behind potential sources of bias. Visualising the relationships helps distinguish mediators, confounders, and colliders, so that bias is not introduced by conditioning (or not conditioning) on these variables. However, real-world graphs can be complex, requiring careful consideration and expert knowledge. In general, you want to avoid over-adjusting, and focus on controlling for confounders.


Data Preparation

Returning to the target trial framework, one of its benefits is that it provides a clear blueprint for data preparation. It forces you to explicitly define: the outcome, target population (eligibility criteria), exposure definition, assignment process, follow-up period, and analysis plan. This helps clarify what data is needed and how it should be processed. In practice, getting to a clean analysis dataset often involves merging multiple sources, engineering features, and resolving issues like time alignment, units, or granularity. Therefore, during this stage, it’s important to run exploratory checks to catch potential problems early, such as:

Measurement error in key variables: Mislabeling treatments or misrecording outcome timestamps can significantly bias results. For example, if some users were marked as having received a notification but never actually did, the estimated treatment effect will be diluted. Similarly, if outcome events are logged with inconsistent timezones or delays (e.g. due to batch jobs), you might misalign treatment and outcome windows. Carefully validating treatment flags and verifying outcome timing logic (especially relative to exposure), can prevent subtle but critical biases.

Missing data: Missingness is rarely random in real-world data. For instance, users who churn stop generating events, resulting in censored outcomes. Or, income data might be missing more often for higher-earning users who opt out of surveys. If the probability of missingness is related to the treatment, outcome, or covariates, ignoring it can introduce bias. It’s important to diagnose the missingness pattern (e.g. checking the proportion missing by variable, treatment and outcome), and consider appropriate handling strategies, such as imputation, modeling dropout, or sensitivity analyses.
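A couple of quick checks along these lines in pandas (the column names here are hypothetical):

```python
import pandas as pd

# Hypothetical analysis dataset: 'treated', outcome 'converted', covariate 'income'
df = pd.DataFrame({
    "treated":   [1, 1, 0, 0, 1, 0],
    "converted": [1, None, 0, 0, 1, None],
    "income":    [55_000, None, 42_000, None, None, 38_000],
})

# Proportion missing per column
print(df.isna().mean())

# Does missingness differ by treatment group? (a red flag for informative missingness)
print(df.assign(income_missing=df["income"].isna())
        .groupby("treated")["income_missing"].mean())
```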

This is also a good point to evaluate the identifiability assumptions: consistency, exchangeability, and positivity.


Estimating causal effects with models

So far, we’ve focused on identification: understanding when causal effects can be recovered from observational data under certain assumptions. We now turn to estimation: how to use models to compute those effects in practice. While this section emphasises observational data, the same tools can be applied to randomised experiments (for example, to improve precision by adjusting for covariates).

What is a model?

A model describes the expected relationship between treatment, outcome, and covariates. It defines the conditional distribution of the outcome given these inputs, either through a specified functional form or a flexible algorithm learned from data. In causal inference, models are used to estimate potential outcomes and causal effects under assumptions about the data-generating process.

Many commonly used models are parametric, meaning they assume a fixed functional form. For example, a linear model assumes the expected outcome increases at a constant rate with treatment:

\[E[Y \mid T] = \theta_0 + \theta_1 T\]

This is useful when treatment is continuous and the number of observed treatment levels is large or infinite, since the model summarises the effect using only a few parameters. However, to make this work, it must impose strong assumptions like linearity.

In contrast, nonparametric models make fewer assumptions and allow the data to determine the shape of the relationship. Returning to the linear model above, it also serves as a saturated model when treatment \(T\) is binary. In this case, the model estimates two conditional means (one for each treatment group), with no further assumptions. Although written in a linear form, it imposes no constraints and fits the data exactly, making it nonparametric in this specific setting.

Why Models Are Needed

Even if identification assumptions are satisfied, causal effects often can’t be computed directly from raw data. Reasons include:

Models help address these challenges. They allow adjustment for covariates (supporting exchangeability), interpolation across treatment levels (helping with positivity), and estimation of potential outcomes under alternative treatment scenarios (relying on consistency). Models can also stabilise estimates by pooling information across similar observations. In this way, modeling isn’t just about computation, it’s how the identifying assumptions are operationalised in practice.

Prediction vs. Inference

Models, by default, describe statistical associations in the observed data; they are not inherently causal. As covered in the introduction to this guide, it’s important to distinguish between using models for prediction (\(E[Y \mid T, X]\)) and for causal inference (\(E[Y(t)]\)). A model may suggest that customers who received a discount spent more, but this observed correlation does not prove the discount caused the increase. Similarly, if you predict bookings based on hotel price without controlling for seasonality, you might incorrectly think increasing the price causes increased demand. In other words, accurately forecasting outcomes does not guarantee valid causal effect estimates.

While predictive models can be evaluated using performance metrics on held-out data, causal inference depends on careful study design and justified assumptions. Specifically, the three identifiability conditions must be met: exchangeability, positivity, and consistency. In addition to these causal assumptions, model-specific assumptions must also hold. For parametric models, these include:

Since all models simplify reality, some degree of misspecification is inevitable. This makes sensitivity analysis essential for assessing how results change under alternative settings. Transparency about assumptions and rigor in testing their robustness help build confidence in the resulting causal estimates.

Extrapolation

A risk with models like regression is that they introduce the possibility of extrapolation. Interpolation refers to estimating outcomes within the range of observed data and is generally safe. Extrapolation occurs when making predictions outside the support of the data, for example estimating effects for treatment levels or covariate profiles that are rare or unobserved. This relies entirely on model assumptions and can lead to misleading results. In causal inference, extrapolation should be avoided unless supported by strong domain knowledge.


Common ATE models

The following sections will go through some of the common models used to estimate the ATE, including:

1. Adjustment-Based Methods: These rely on adjusting for observed confounders to identify causal effects.

2. Quasi-Experimental Designs: These exploit natural experiments or design features to identify causal effects without relying solely on observed confounding adjustment.

Each of these methods builds on the core identification assumptions but uses different strategies to estimate causal effects, depending on the data structure, assumptions, and sources of bias. There is no one-size-fits-all approach, as most methods are designed for a narrow set of conditions. Therefore, it’s useful to be familiar with a range of approaches, and to go deeper on those most relevant to your use case.


Adjustment based methods

Outcome regression

Generalised Linear Models (GLMs) such as linear regression are a popular choice for estimating the ATE when there is no complex longitudinal structure in the data. For example, using a binary variable T to indicate whether an observation was treated, for a continuous outcome we get:

\[Y_i = \beta_0 + \beta_1 T_i + \epsilon_i\]

Where:

In this single variable case, the linear regression is computing the difference in group means. If T is randomly assigned then \(\beta_1\) unbiasedly estimates the ATE:

\[ATE = \beta_1 = \frac{Cov(Y_i, T_i)}{Var(T_i)}\]

However, when using observational data, the key assumption of exchangeability often fails (\(E[Y_0 \mid T=0] \neq E[Y_0 \mid T=1]\)). To address this, we can extend the model to control for all confounders \(X_k\) (i.e. variables that affect both treatment and outcome):

\[Y_i = \beta_0 + \beta_1 T_i + \beta_k X_{ki} + \epsilon_i\]

Now \(\beta_1\) estimates the average treatment effect (difference in group means), conditional on the confounders (i.e. the conditional difference in means). Under linearity and correct model specification, this equals the marginal ATE (population-average ATE).

To break this down, you can imagine we instead did this in 2 steps: 1) regress T on X to isolate the variation in T not explained by the confounders (i.e. the residuals, which are orthogonal to X by design); 2) regress Y on the residuals from step 1 (the part of T that isn’t explained by X). If T was randomised, then X would have no predictive power and the residuals would just be the original T. Conceptually, we’re isolating the variation in treatment assignment that is unrelated to confounders (the residuals), and then estimating how this variation relates to the outcome to isolate the treatment effect net of confounding.

The normal equation does this in one step for multiple regression using the inverse matrix (\(\left(X^\top X\right)^{-1}\)) to “adjust” each predictor’s covariance with Y for the correlations among the predictors.
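
To make this concrete, here is a minimal sketch of the two-step (Frisch-Waugh-Lovell) logic on simulated data; the data-generating process and variable names are hypothetical, and statsmodels is used for the regressions. Both routes recover the same coefficient on treatment.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 5_000

# Hypothetical data-generating process: x confounds t and y, true treatment effect = 2
x = rng.normal(size=n)
t = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(float)  # treatment depends on x
y = 2.0 * t + 1.5 * x + rng.normal(size=n)

# One step: multiple regression of Y on T and X
multi = sm.OLS(y, sm.add_constant(np.column_stack([t, x]))).fit()

# Two steps: residualise T on X, then regress Y on those residuals
t_resid = sm.OLS(t, sm.add_constant(x)).fit().resid
two_step = sm.OLS(y, sm.add_constant(t_resid)).fit()

print(multi.params[1], two_step.params[1])  # both ≈ 2, identical up to floating point
```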

Considerations

Model assumptions

When using linear regression to estimate the ATE, the most important model assumptions are correct functional form (linearity in parameters) and no unmeasured confounding (exogeneity: X is not correlated with the error term), because violations of these lead to biased and inconsistent estimates of the treatment effect. These can be partially assessed by inspecting residual plots, adding interaction or nonlinear terms, and relying on domain knowledge to ensure all major confounders are included.

The other assumptions concern the errors and primarily affect inference rather than the point estimate: independence of errors, homoscedasticity (constant variance of the errors), and normality (in small samples). Violations can distort standard errors and confidence intervals, impacting our ability to quantify uncertainty around the ATE. Standard approaches to address these include using robust or cluster-robust standard errors, transforming variables, or modeling correlations in clustered or time-series data.

Avoid over-adjusting

To have an unbiased estimate we need to include all confounders in the model. However, this is often the most challenging aspect when working with observational data in industry settings. Unless you have a tightly controlled process with well-understood and measured mechanisms, some unmeasured confounding is likely to remain. It can be tempting to include every available variable to “adjust as much as possible,” but over-adjusting can actually introduce bias or increase variance, reducing the power to detect the true effect. Therefore, it’s important to be deliberate in choosing which variables to include:

✅ Predictors of the outcome: Including variables that predict the outcome helps reduce unexplained variation (noise) and improves the precision of the treatment effect estimate.

❌ Pure predictors of the treatment: These variables affect treatment assignment but have no relation to the outcome. Including them can reduce the variation in treatment, inflating standard errors and reducing power, though they do not bias the estimate if unrelated to the outcome.

❌ Mediators: Variables that lie on the causal pathway from treatment to outcome. Controlling for mediators blocks part of the treatment effect, causing underestimation of the total effect.

❌ Colliders: Variables influenced by both treatment and outcome (or their causes). Conditioning on colliders opens a backdoor path and induces spurious associations, biasing the estimate. For example, controlling for job type when estimating the effect of education on income may introduce bias because job type is affected by both education and income.

It’s worth mentioning one common mistake: using variable selection methods from predictive modeling to select confounders (e.g. step-wise methods or regularisation). A variable’s ability to predict the outcome (an association) does not tell us whether it’s a confounder. Therefore, blindly applying automated variable selection methods is a bad idea. Confounding is about relationships with both the treatment and the outcome, so variable selection for causal models must be grounded in domain knowledge and causal reasoning.

Implicit weighting

Coming back to our example of estimating the ATE using $\beta_1$ from \(Y = \beta_0 + \beta_1 T + \beta_k X_{k}\): this effectively estimates treatment effects within each subgroup (stratum) defined by $X$, then averages those subgroup effects. However, the implicit weights assigned to each subgroup depend not just on subgroup size, but also on the amount of treatment variation within that subgroup. Subgroups with more treatment variation provide more information for estimating $\beta_1$, so the model gives them more influence in order to minimise prediction error.

Mathematically, $\hat{\beta}_1$ is a weighted average of subgroup-specific treatment effects:

\[\hat{\beta}_1 = \sum_{s} w_s \hat{\beta}_{1,s}\]

Where $s$ indexes subgroups, $\hat{\beta}_{1,s}$ is the treatment effect in subgroup $s$, and the weights $w_s$ are proportional to:

\[w_s \propto \text{Var}(T \mid X = s) \times n_s\]

Here, $n_s$ is the size of subgroup $s$, and $\text{Var}(T \mid X = s)$ is the variance of treatment within subgroup $s$ (e.g. for binary treatment it would be $p_s(1-p_s)$, where $p_s$ is the treated proportion in subgroup $s$). This means that subgroups with limited treatment variation (small $\text{Var}(T \mid X = s)$) get very little weight, even if large in size.

For example, suppose $X$ indicates platform type: 800 mobile users (790 treated, 10 control), and 200 desktop users (100 treated, 100 control). The true treatment effect is 1 for mobile users and 2 for desktop users. Even though mobile users are 80% of the sample, their near-complete lack of treatment variation means the constant-effect regression places most weight on the desktop subgroup, resulting in an estimate closer to 2, instead of the true population weighted ATE: $0.8 \times 1 + 0.2 \times 2 = 1.2$.

ATE with unaccounted for treatment variation

This issue is a model misspecification: forcing a constant treatment effect means the regression compromises on a slope that fits the high-overlap subgroup best, even if it misrepresents the true ATE. Specifically, bias arises when:

If X does not affect Y, the treatment effect is constant across subgroups, so the ATE estimate is unbiased even if there are differences in treatment variation. Conversely, if X affects Y by moderating the treatment effect (i.e. heterogeneous effects) but treatment variation is equal across subgroups (i.e. $X$ does not affect $T$, giving an equal treated/control split), the regression’s implicit weights match subgroup sizes, and the ATE is unbiased.

When positivity is violated or nearly violated (i.e. some subgroups are mostly treated or control), this misspecification worsens. To fit a single slope, the model extrapolates into regions of X with no treated or control units. In linear models, this can produce negative weights, effectively counterbalancing influence from data-rich regions with “imaginary” influence from unsupported regions, leading to unstable and potentially biased ATE estimates.

Practical checks:

Correctly estimating ATE with heterogeneous effects:

To properly estimate the ATE in the presence of heterogeneity (and treatment variation), you can explicitly model interactions between treatment and covariates (e.g. $T \times X$ terms). This allows the regression to estimate a separate treatment effect for each subgroup. Then, use g-computation (covered in more detail in the next section) to recover the population ATE:

This approach ensures the ATE reflects subgroup effects weighted by population prevalence, while adjusting for confounding.
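
As an illustration, the sketch below simulates the hypothetical platform example from earlier: a constant-effect regression is pulled towards the desktop effect, while an interaction model followed by g-computation recovers the population-weighted ATE of roughly 1.2. The data-generating process and column names are invented for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Hypothetical data matching the platform example: effect = 1 on mobile, 2 on desktop
mobile = pd.DataFrame({"desktop": 0, "treated": [1] * 790 + [0] * 10})
desktop = pd.DataFrame({"desktop": 1, "treated": [1] * 100 + [0] * 100})
df = pd.concat([mobile, desktop], ignore_index=True)
effect = np.where(df["desktop"] == 1, 2.0, 1.0)
df["y"] = 5 + df["desktop"] + effect * df["treated"] + rng.normal(0, 1, len(df))

# Constant-effect regression: implicit variance weighting pulls the estimate towards 2
naive = smf.ols("y ~ treated + desktop", data=df).fit()
print("constant-effect estimate:", naive.params["treated"])

# Interaction model + g-computation: predict Y(1) and Y(0) for everyone, then average
inter = smf.ols("y ~ treated * desktop", data=df).fit()
y1 = inter.predict(df.assign(treated=1))
y0 = inter.predict(df.assign(treated=0))
print("g-computation ATE:", (y1 - y0).mean())  # ≈ 0.8*1 + 0.2*2 = 1.2
```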

Continuous treatments

This framework also applies when the treatment T is continuous, such as dosage, time spent on a feature, or monetary exposure. In that case, \(\beta_1\) is interpreted as a marginal effect: the expected change in outcome Y for a one-unit increase in treatment T (assuming a constant effect), holding covariates X constant. The same assumptions are required, but using linear regression imposes a constant treatment effect across the entire range of T. If the true relationship is nonlinear, this can lead to biased or misleading estimates.

Confounding examples

When the treatment effect varies across its range, like in the example above, more flexible models may be needed (e.g. splines, GAMs, or kernel regression). To estimate the ATE, you then use standardisation (g-computation): compute the predicted outcomes at different treatment levels for each unit, and average the difference across the population. This gives you a robust estimate of the average effect, even when the treatment effect is not constant, and can be combined with bootstrapping to get confidence intervals.

Binary outcomes

For binary outcomes, linear regression is still commonly used (known as Linear Probability Models, LPM), especially when the goal is to estimate the ATE on the probability scale. The model is fitted just like before, but because the outcome is binary, there are important differences in interpretation and limitations. Specifically, $\beta_1$ estimates the difference in outcome probabilities between treated and untreated groups (the ATE on the probability scale). The model assumes a constant marginal effect, meaning treatment shifts the probability by the same amount for everyone, regardless of their baseline risk. This can oversimplify reality if effects vary across individuals.

LPMs have other practical drawbacks: they can predict probabilities outside the $[0,1]$ range, and the variance of residuals depends on the predicted probability (heteroskedasticity), peaking around 0.5. This violates ordinary least squares (OLS) assumptions, potentially distorting standard errors. Therefore, it is important to use robust (heteroskedasticity-consistent) standard errors and be cautious when predicted probabilities are close to 0 or 1.

If more accurate modeling of the relationship between covariates and the outcome is needed, especially for prediction or rare outcomes, logistic regression is often preferred. However, logistic regression coefficients represent log-odds ratios rather than probability differences, so $\beta_1$ is not directly interpretable as the ATE. To recover the ATE on the probability scale, two common methods are:

  1. Average Marginal Effects (AME): For each observation, compute the marginal effect of treatment:

    \[\text{ME}_i = \hat{P}_i \times (1 - \hat{P}_i) \times \hat{\beta}_1\]

    Where $\hat{P}_i$ is the predicted probability for individual $i$. The AME is the average of these marginal effects across all observations. Intuitively, you can think of the logistic function as a hill: the slope (marginal effect) varies depending on where you stand (the covariates). Some individuals are near the hilltop or base, where the slope is flat, and others are on the steep middle slope. By averaging everyone’s slope, the AME summarises the average steepness of the hill: the average effect of treatment on the probability scale.

  2. Marginal Standardisation (g-computation): After fitting the logistic model, predict each individual’s outcome twice: once with $T_i = 1$ and once with $T_i = 0$, while keeping covariates $X_i$ fixed at their observed values. The ATE is:

    \[\text{ATE} = \frac{1}{n} \sum_{i=1}^n \left[ \hat{Y}_i(T=1) - \hat{Y}_i(T=0) \right]\]

    This represents the average difference in predicted probabilities if everyone were treated versus if no one were treated. In single-time-point settings, this is equivalent to g-computation (covered in the next section).

For both methods, standard errors can be estimated using the Delta method or bootstrapping. The Delta method provides a fast, analytic approximation based on a Taylor expansion, while bootstrapping offers a flexible approach that is often more robust, particularly in complex models or small samples.
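
As a rough sketch of both approaches on hypothetical simulated data (the columns `y`, `t`, and `x` and the coefficients are invented), using statsmodels:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical data: binary treatment, one confounder, binary outcome
x = rng.normal(size=n)
t = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(int)
p = 1 / (1 + np.exp(-(-1 + 0.8 * t + 0.5 * x)))
y = (rng.random(n) < p).astype(int)
df = pd.DataFrame({"y": y, "t": t, "x": x})

model = smf.logit("y ~ t + x", data=df).fit(disp=0)

# 1) Average marginal effect: slope of the logistic curve averaged over observations
p_hat = model.predict(df)
ame = (p_hat * (1 - p_hat) * model.params["t"]).mean()

# 2) Marginal standardisation (g-computation): difference in predicted probabilities
p1 = model.predict(df.assign(t=1))
p0 = model.predict(df.assign(t=0))
ate = (p1 - p0).mean()

print(f"AME ≈ {ame:.3f}, g-computation ATE ≈ {ate:.3f}")
```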

Causal strength

Assuming the model is specified correctly, the reliability of outcome regression for causal inference hinges on the validity of the identifiability assumptions:

G-computation (standardisation)

G-computation (also known as standardisation and the g-formula), takes a different approach to effect estimation. The “g” stands for generalised, as in the generalised formula for computing causal effects from observational data. Instead of relying on the treatment coefficient from a regression model, it explicitly estimates each individual’s potential outcomes under treatment and control, and then averages them to compute the average treatment effect (ATE):

  1. Fit an outcome model: Estimate $\hat{Y}(T, X)$, a model for the outcome $Y$ as a function of treatment $T$ and covariates $X$. This is typically a parametric model (e.g. linear or logistic regression), although more flexible models can also be used.

  2. Predict potential outcomes: For each observation $i$, use the fitted model to predict:

    • $\hat{Y}_i(1) = \hat{Y}(T=1, X_i)$: the predicted outcome if treated
    • $\hat{Y}_i(0) = \hat{Y}(T=0, X_i)$: the predicted outcome if not treated

    This simulates both counterfactual outcomes for each individual.

  3. Compute the ATE

    Average the individual level differences to get the ATE:

    \[\widehat{ATE} = \frac{1}{n} \sum_{i=1}^n \left( \hat{Y}_i(1) - \hat{Y}_i(0) \right)\]

    The target estimand (e.g., ATE, ATT) is explicit, because you define it through the population over which you average the potential outcomes.

If the outcome model is linear and assumes a constant treatment effect, such as \(Y = \beta_0 + \beta_T T + \beta_X^\top X + \varepsilon\), then standardisation will produce the same ATE estimate as you’d get from $\hat{\beta}_T$. This is because the predicted difference for each individual is: \(\hat{Y}_i(1) - \hat{Y}_i(0) = \hat{\beta}_T\). However, this equivalence holds only under a linear model with no effect heterogeneity (i.e. constant treatment effect), and when the model is correctly specified.

Benefits

The key benefit is that it generalises beyond linear models:

However, it shares the same core vulnerability as any modeling-based method: if the outcome model is misspecified, the causal estimates can be biased.

Using ML models

While it’s traditionally associated with parametric outcome models (like linear or logistic regression), the method itself is model agnostic. Any model that predicts $\hat{Y}(T, X)$ well can be used, including machine learning approaches that can better capture non-linear relationships and interactions in the data. However, increased model flexibility comes with some risks:

Since the final causal estimate is computed by averaging these predicted potential outcomes, some individual-level prediction errors may cancel out. However, systematic biases, especially those that differ by treatment group, can lead to biased estimates of the treatment effect. To mitigate these risks:

While ML models offer flexibility and the ability to model complex relationships, they often come at the cost of reduced interpretability. Therefore, rigorous validation, both predictive and causal, is essential.

Confidence intervals

The simulation-like steps of g-computation create multiple layers of uncertainty (estimating the outcome model, predicting potential outcomes, then averaging them). Given this complexity, there’s often no closed-form expression for the standard error of the treatment effect estimate.

However, in certain cases, namely when the outcome model is linear and we assume a constant treatment effect, a closed-form solution is available using the Delta method. Under a linear model (e.g., $Y = \beta_0 + \beta_T T + \beta_X^\top X$), the predicted outcomes under treatment and control differ only in the coefficient on $T$, so:

\[\hat{Y}_i(1) - \hat{Y}_i(0) = \hat{\beta}_T\]

The standard error of the ATE is just the standard error of $\hat{\beta}_T$, and a 95% confidence interval can be constructed as:

\[\hat{\beta}_T \pm 1.96 \cdot \text{SE}(\hat{\beta}_T)\]

In more realistic settings where treatment effects vary and outcome models are flexible or nonlinear, we use resampling methods to quantify uncertainty:
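
For example, a simple nonparametric (case-resampling) bootstrap re-fits the outcome model on each resample and recomputes the standardised effect. The sketch below assumes a hypothetical pandas DataFrame with columns `y`, `t`, and `x`, and uses an OLS with an interaction purely for illustration; any outcome model could be substituted.

```python
import numpy as np
import statsmodels.formula.api as smf

def g_comp_ate(data):
    """Standardised ATE from a simple outcome model (hypothetical columns y, t, x)."""
    m = smf.ols("y ~ t * x", data=data).fit()
    return (m.predict(data.assign(t=1)) - m.predict(data.assign(t=0))).mean()

def bootstrap_ci(df, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI: re-fit the outcome model on each resample."""
    rng = np.random.default_rng(seed)
    estimates = [
        g_comp_ate(df.sample(frac=1.0, replace=True,
                             random_state=int(rng.integers(1_000_000_000))))
        for _ in range(n_boot)
    ]
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return g_comp_ate(df), (lower, upper)
```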

Causal strength

G-computation relies on the assumption that the outcome model is correctly specified and sufficiently flexible to capture the true relationship between treatment, covariates, and outcome. Mis-specification or poor fit can bias results, especially in regions with sparse data or complex interactions. Using flexible machine learning models can help address this, but it requires careful model tuning, diagnostics, and validation. If this holds, it provides a transparent and interpretable way to estimate causal effects by simulating counterfactual outcomes.

The key identifiability assumptions for valid causal inference still remain:

Matching

Matching tries to approximate a randomised experiment by directly comparing similar observations. That is, for each treated observation we look for a similar (exchangeable) untreated observation to compare against, based on some set of covariates X. Unlike regression, which adjusts for covariates during estimation, matching first selects a balanced subset of the data based on the covariates and then does a direct comparison. Exact matching, where matched units share identical covariate values, quickly becomes infeasible as the number of covariates grows or when covariates are continuous. Therefore, approximate matching methods are more common. This requires us to define a measure of proximity (distance) to select the “closest” matches. For example, nearest neighbor matching pairs treated units with untreated units based on a distance metric such as Euclidean or Mahalanobis distance.

Other methods not covered here include propensity score matching (in the next section), and optimal or genetic matching. Matching is naturally suited for estimating ATT, because it uses the treated group as the reference and finds comparable controls. However, we can approximate ATE with bi-directional matching for both the treated and control group.

Distance measures

Matching identifies untreated units closest to each treated unit using a distance metric over covariates. The choice of metric affects covariate balance and thus the credibility of treatment effect estimates. Common choices include Mahalanobis, Euclidean, and propensity score distances (discussed in the next section).

Mahalanobis distance

Mahalanobis distance measures how far apart two points are in a multivariate sense, accounting for both the scale of each covariate and the relationships (correlations) between covariates. This prevents variables with large variance or high correlation from dominating the matching. It’s especially useful when matching on multiple continuous covariates. The distance between treated and control units is computed using the inverse covariance matrix of the covariates:

\[d(i, j) = \sqrt{(X_i - X_j)^\top S^{-1} (X_i - X_j)}\]

Where:

This rescales the differences between covariates based on their variance and correlation structure. Differences on variables with high variance count less, and correlated covariates are downweighted to avoid double-counting shared information. The calculation can be broken down as:

  1. Multiply the inverse covariance matrix $S^{-1}$ by the difference vector $X_i - X_j$ to get a transformed vector that weights each variable’s difference according to its variance (diagonal entries) and correlations with other variables (off-diagonal entries). Variables with high variance have smaller diagonal values in $S^{-1}$, so differences contribute less to the total distance. The off-diagonal entries adjust for correlated variables, downweighting differences aligned with expected correlations (e.g., height and weight moving together), while differences orthogonal to those correlations (unexpected variation) are upweighted.

  2. Multiply the result from step 1 by the transpose of the difference vector to obtain a scalar measuring the weighted squared distance between the two points. Taking the square root converts this into the Mahalanobis distance.
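
A minimal numpy sketch of the calculation, using hypothetical covariate matrices for the treated and control groups (one row per unit, one column per covariate):

```python
import numpy as np

def mahalanobis_matrix(X_treated, X_control):
    """Pairwise Mahalanobis distances between treated and control units."""
    X_all = np.vstack([X_treated, X_control])
    S_inv = np.linalg.inv(np.cov(X_all, rowvar=False))        # inverse covariance of the covariates
    diffs = X_treated[:, None, :] - X_control[None, :, :]     # (n_t, n_c, p) difference vectors
    # Quadratic form (X_i - X_j)' S^{-1} (X_i - X_j) for every treated/control pair
    return np.sqrt(np.einsum("ijk,kl,ijl->ij", diffs, S_inv, diffs))

# Hypothetical example: 5 treated and 8 control units with 3 covariates each
rng = np.random.default_rng(0)
X_t, X_c = rng.normal(size=(5, 3)), rng.normal(size=(8, 3))
d = mahalanobis_matrix(X_t, X_c)   # d[i, j] = distance between treated i and control j
```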

Euclidean

Euclidean distance is the simplest and most intuitive way to measure similarity between two units based on their covariates. It calculates the straight-line distance between points in multidimensional space by treating each covariate as a dimension. Specifically, it computes the difference between each covariate for units i and j, squares these differences, sums them, and takes the square root:

\[d(i, j) = \sqrt{(\mathbf{X}_i - \mathbf{X}_j)^\top (\mathbf{X}_i - \mathbf{X}_j)}\]

Because it relies on squared differences, variables with larger scales or ranges can dominate the distance calculation. Therefore, standardizing or normalizing covariates beforehand is common practice to ensure each covariate contributes proportionally. Another limitation is that Euclidean distance assumes covariates are independent and uncorrelated, which may not hold in practice, potentially leading to less optimal matches compared to methods like Mahalanobis distance. Despite this, its simplicity and computational efficiency make it a good choice when covariates are on similar scales and roughly uncorrelated.

Binary / categorical variables

Binary variables are typically treated as numeric (0/1) and included directly in distance calculations. This generally works well when the variables are balanced and consistently coded. For Euclidean distance, binary variables should be scaled or normalized with other covariates to prevent them from disproportionately influencing the distance. For Mahalanobis distance, scaling is implicit via the covariance matrix; however, when categories are rare, the covariance matrix can become unstable or near-singular, which requires inspection and possibly corrective action (e.g., grouping rare categories).

Categorical variables are usually handled by one-hot encoding each category into separate binary variables, which can then be treated like binary variables in distance computations. When certain categorical variables are critical confounders, exact matching or coarsened exact matching (grouping categories into broader bins) may be preferred to ensure treated and control units are matched exactly or approximately on these variables. This helps avoid imbalance that could bias causal estimates.

If rare categories or small sample sizes threaten the stability of the covariance matrix or matching quality, combining levels (coarsening) or using alternative distance measures may be appropriate (e.g. Gower distance computes a scaled distance per variable, handling different data types, then averages them).

Neighbor matching

Once we have a measure of distance $d(i,j)$, we match each treated unit to control unit(s) with the smallest distance. After sorting the distances $d_{ij}$ for all control units $j$ relative to treated unit $i$, we can select one or more matches:

Matching is often performed with replacement, allowing a control unit to be matched multiple times to different treated units. This tends to improve match quality and reduce bias, especially when treatment and control groups are imbalanced. However, matching with replacement can inflate variance and overuse certain control units. In settings where groups are large and well-balanced, matching without replacement may be preferred, or a limit might be set on the number of matches per control.

Caliper restriction

Caliper matching is a refinement of nearest neighbor matching that restricts matches to only those control units whose distance to the treated unit is within a specified threshold, called the caliper. The caliper sets a maximum acceptable distance between matched units, ensuring that matches are sufficiently close and thereby improving the quality of matches (reducing bias). However, there is a bias-variance trade-off: if the caliper is too small, it can exclude many treated units, reducing sample size and statistical power; if it is too large, it risks including poor matches, increasing bias. Typically you would start with recommended defaults (e.g. 0.2 standard deviations), then assess the covariate balance (see the next section) and sample retention (you might be losing generalizability and precision), iteratively adjusting as needed.
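
A greedy sketch of nearest-neighbour matching (with replacement) under a caliper, assuming a precomputed treated-by-control distance matrix such as the Mahalanobis distances from the earlier sketch; the function name and caliper value are illustrative:

```python
import numpy as np

def nn_match_with_caliper(d, caliper):
    """Greedy 1:1 nearest-neighbour matching with replacement on a distance matrix d (treated x control).

    Returns {treated_index: control_index}, skipping treated units with no match within the caliper.
    Assumes d was built with a standardised metric, so the caliper is on the same scale.
    """
    matches = {}
    for i in range(d.shape[0]):
        j = int(np.argmin(d[i]))      # closest control for treated unit i
        if d[i, j] <= caliper:        # only accept sufficiently close matches
            matches[i] = j
    return matches

# e.g. using the hypothetical distance matrix d from the previous sketch
# matches = nn_match_with_caliper(d, caliper=0.5)
```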

Estimating the treatment effect

After matching, several common estimators can be used to calculate treatment effects from matched data.

1. ATT via Treated Matching

The average treatment effect on the treated (ATT) is estimated as:

\[\widehat{ATT} = \frac{1}{N_T} \sum_{i \in T=1} \left( Y_i - \frac{1}{M_i} \sum_{j \in \text{matched}(i)} Y_j \right)\]

For each treated unit $i$, subtract the average outcome of its $M_i$ matched control units $j$. Only treated units contribute to this estimator. Matching can be performed with or without replacement. Standard errors can be estimated using bootstrapping to account for the dependence induced by matching.
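
In the 1-to-1 special case (each treated unit matched to a single control, so $M_i = 1$), the estimator reduces to an average of pair differences. The sketch below assumes the hypothetical `matches` dictionary from the earlier matching sketch:

```python
import numpy as np

def att_from_matches(y_treated, y_control, matches):
    """ATT as the average of (treated outcome - matched control outcome) over matched treated units."""
    diffs = [y_treated[i] - y_control[j] for i, j in matches.items()]
    return float(np.mean(diffs))
```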

2. Symmetric Matching

The average treatment effect (ATE) can be estimated via 1-to-1 symmetric matching:

\[\widehat{ATE} = \frac{1}{N} \sum_{i=1}^{N} (2T_i - 1)\left(Y_i - Y_{jm(i)}\right)\]

Each unit $i$ is matched to exactly one unit $jm(i)$ from the opposite treatment group. The term $(2T_i - 1)$ equals $+1$ for treated units and $-1$ for controls, ensuring differences $Y_i - Y_{jm(i)}$ are consistently signed as treated minus control outcomes. All $N = N_T + N_C$ units contribute once, making the estimator symmetric. This approach works best when treatment and control groups are balanced and have good overlap.

3. ATE via Reweighting

A more general ATE estimator that accounts for asymmetry in matching or reuse of units is:

\[\widehat{ATE} = \frac{1}{2} \left[ \frac{1}{N_T} \sum_{i \in T=1} \left( Y_i - \frac{1}{M_i} \sum_j Y_j \right) + \frac{1}{N_C} \sum_{j \in T=0} \left( \frac{1}{M_j} \sum_i Y_i - Y_j \right) \right]\]

This combines the ATT for treated units matched to controls and the ATC for control units matched to treated units. By weighting both groups equally, it estimates the full ATE even when the groups are imbalanced. Bootstrapping is recommended for standard errors due to the dependence created by matching.

All of these approaches approximate the ATE under strong overlap (positivity) and good covariate balance. When treatment assignment is highly imbalanced, ATT or ATC may be more reliable than the symmetric or reweighted ATE. To summarise:

Bias in matching

Matching reduces confounding by balancing observed covariates between treated and control groups, but two key sources of bias can remain:

Covariate balance

Assessing covariate balance is a crucial step both before and after matching to ensure that treated and control groups are comparable. Good balance means the distributions of covariates are similar across groups, supporting the assumption that observed outcome differences reflect the treatment effect rather than confounding. A common metric for assessing balance is the standardized mean difference (SMD) for each covariate, defined as:

\[\text{SMD}_k = \frac{\bar{X}_{k,T} - \bar{X}_{k,C}}{\sqrt{\frac{s_{k,T}^2 + s_{k,C}^2}{2}}}\]

Where:

Because the SMD is unitless, it allows comparison across covariates with different scales. Values close to zero indicate good balance, with thresholds such as 0.1 or 0.05 commonly used as benchmarks. By comparing SMDs before and after matching, you can evaluate how well the matching procedure reduced imbalance. Visual tools such as love plots or density plots can complement numerical metrics, providing intuitive insights into covariate distributions across groups.
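
A small helper for computing (optionally weighted) SMDs is sketched below; the treatment column and covariate names are hypothetical, and the weights argument can take match weights or inverse probability weights when assessing the adjusted sample:

```python
import numpy as np
import pandas as pd

def standardized_mean_differences(df, treatment_col, covariates, weights=None):
    """SMD per covariate; pass weights (e.g. match or IP weights) to assess the adjusted sample."""
    w = np.ones(len(df)) if weights is None else np.asarray(weights, dtype=float)
    treated = df[treatment_col] == 1
    smds = {}
    for col in covariates:
        x = df[col].astype(float)
        m1 = np.average(x[treated], weights=w[treated])
        m0 = np.average(x[~treated], weights=w[~treated])
        v1 = np.average((x[treated] - m1) ** 2, weights=w[treated])
        v0 = np.average((x[~treated] - m0) ** 2, weights=w[~treated])
        smds[col] = (m1 - m0) / np.sqrt((v1 + v0) / 2)
    return pd.Series(smds)

# Flag covariates that remain imbalanced after adjustment, e.g.:
# standardized_mean_differences(df, "treated", ["age", "tenure"], weights=ipw).abs().gt(0.1)
```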

Matching balance plots

Regression adjustment after matching (RAAM)

To mitigate residual imbalance bias, it’s common to fit a regression model on the matched sample to adjust for any remaining differences in covariates. After matching, you construct a dataset containing all treated units and their matched controls. This can be structured in a stacked format like below:

| unit_id | matched_set_id | treatment | X₁ | X₂ | outcome |
| --- | --- | --- | --- | --- | --- |
| 101 | 1 | 1 | … | … | … |
| 205 | 1 | 0 | … | … | … |
| 309 | 1 | 0 | … | … | … |
| 102 | 2 | 1 | … | … | … |
| 220 | 2 | 0 | … | … | … |

Some control units may appear in multiple matched sets if matching is performed with replacement. Rather than duplicating these units, you can assign weights based on how many times each unit appears across sets. Now we have a more homogeneous sample, with smaller differences between treated and control units. We then fit a regression model to estimate the treatment effect while adjusting for remaining differences in covariates:

\[Y_i = \alpha + \tau T_i + \sum_{k=1}^p \beta_k X_{ki} + \epsilon_i\]

Where:

The regression model does not use the matched set IDs directly. Matching restricts the sample to comparable units, and regression adjusts for any small differences that remain. Conceptually, the regression projects units onto their expected outcomes given covariates, then estimates the average difference between treated and control units on that adjusted scale. Because some units may be reused, standard errors should be corrected. You can either: use robust standard errors clustered on matched set ID, or apply bootstrapping (repeat matching and regression on resampled data).
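
A minimal statsmodels sketch, assuming a hypothetical matched dataset `matched_df` with reuse weights and a matched-set identifier as in the table above:

```python
import statsmodels.formula.api as smf

# Hypothetical matched dataset: columns outcome, treatment, x1, x2, matched_set_id, weight
model = smf.wls(
    "outcome ~ treatment + x1 + x2",
    data=matched_df,
    weights=matched_df["weight"],                 # accounts for controls reused across sets
).fit(cov_type="cluster", cov_kwds={"groups": matched_df["matched_set_id"]})

print(model.params["treatment"], model.bse["treatment"])  # effect estimate and clustered SE
```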

This hybrid approach leverages the strengths of both methods:

To get the ATE via reweighting we need to perform bidirectional matching (match treated units to controls and control units to treated units). To combine this with regression adjustment, you can either:

As always, careful selection of confounders and balance diagnostics are crucial for reliable causal inference.

Causal strength

Matching helps reduce bias by creating comparable groups (approximating a randomised experiment), but it relies on strong assumptions:

Propensity score methods

Propensity score methods also reduce confounding by balancing covariates between treated and untreated groups. However, instead of directly adjusting for all confounders \(X\), we estimate a single balancing score:

\[e(X) = \mathbb{P}(T = 1 \mid X)\]

This is the probability of receiving treatment given covariates. If treatment assignment depends on \(X\), then individuals with different \(X\) values may have different chances of being treated. But if \(X\) includes all confounders, then treatment is conditionally independent of the potential outcomes:

\[(Y(0), Y(1)) \perp T \mid X\]

Rosenbaum and Rubin (1983) showed that this implies:

\[(Y(0), Y(1)) \perp T \mid e(X)\]

So instead of adjusting for high-dimensional \(X\), we can adjust for the scalar \(e(X)\) and it will work in the same way. This simplifies matching, stratification, or weighting methods as individuals with similar propensity scores can be treated as if randomly assigned, even if their raw covariates differ. In effect, the propensity score compresses all relevant confounding information from \(X\) into a single variable.

Estimating $e(X)$

Since the treatment assignment mechanism is unknown in observational data, we estimate the propensity score $e(X)$ using a predictive model. Typically this would be logistic regression for binary treatments. More flexible models like random forests or gradient boosting can also be used. But the goal isn’t high predictive accuracy; it’s covariate balance. A model that includes variables highly predictive of treatment but unrelated to the outcome (i.e. not confounders) may boost predictive accuracy but worsen precision by increasing variance in the ATE estimate. For example, suppose you’re estimating the effect of a training program on productivity. You fit two models:

Despite the higher accuracy, Model B may yield worse balance on the confounders, causing biased causal estimates. That’s because the model is optimizing for prediction, allowing irrelevant variables to dominate instead of balancing the confounders.

Checking positivity

It’s important that treated and untreated units overlap. That is, everyone must have some chance of receiving and not receiving treatment: \(0 < e(X) < 1\). This is known as the positivity assumption. You can check this by comparing the distributions of propensity scores in the treated and untreated groups. Ideally, they should substantially overlap across the full range, and be smooth, with no scores close to 0 or 1. Propensity scores near 0 or 1 mean someone is almost guaranteed to be treated (or not), which causes estimators like the inverse probability weight to blow up, increasing the variance of the ATE (making it sensitive to a few influential observations). As a rule of thumb, no individual should get a weight over 20. Some practitioners trim the propensity scores to limit the range (e.g. 0.05 to 0.95), but this can introduce bias. Another option is to trim observations outside the overlap region (i.e. drop treated units with no comparable untreated match), but this limits generalisability.
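
As a sketch, the checks might look like the following, assuming a hypothetical DataFrame `df` with a binary `treated` column and confounders `x1` and `x2`:

```python
import numpy as np
import statsmodels.formula.api as smf

# Estimate the propensity score with a logistic regression (hypothetical columns)
ps_model = smf.logit("treated ~ x1 + x2", data=df).fit(disp=0)
df["e_x"] = ps_model.predict(df)

# Overlap: compare the score distributions by treatment group
print(df.groupby("treated")["e_x"].describe(percentiles=[0.01, 0.05, 0.95, 0.99]))

# Extreme weights flag (rule of thumb: investigate weights above ~20)
w = np.where(df["treated"] == 1, 1 / df["e_x"], 1 / (1 - df["e_x"]))
print("max weight:", w.max(), "| share > 20:", (w > 20).mean())
```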

Overlap example

Propensity scores + matching

Propensity scores offer a solution to the curse of dimensionality in matching. Instead of matching on all covariates directly, we match on a single summary score: the propensity score. This reduces the dimensionality of the problem and ensures that units are matched based on their likelihood of receiving treatment. In this way it focuses the comparison on units that are similar in terms of confounding risk.

As discussed in the matching section, a caliper can be set to control the maximum allowable difference in propensity scores for a pair to be matched (e.g. 0.2 standard deviations). However, there are some important caveats:

Matching is intuitive and transparent. However, it depends entirely on the quality of the propensity score model. If the model is misspecified, the resulting matches can still be unbalanced, and bias will remain.

Using e(X) as a weight

The estimated propensity score can also be used as a weight to construct a pseudo-population where treatment assignment is independent of observed covariates. In the next sections, we’ll walk through three widely used weighting-based approaches:

| Method | What it does (in plain terms) | Efficiency & Robustness | When to use it (and what to watch out for) |
| --- | --- | --- | --- |
| IPW | Reweights observations by how unlikely their treatment was, based on covariates. | Low efficiency; sensitive to extreme weights; not robust to misspecification. | Use when the treatment model is well understood and overlap is strong. May require trimming or weight stabilization. |
| AIPW | Combines outcome predictions with IPW to reduce bias if either model is wrong. | Moderately efficient; doubly robust; variance can be high without cross-fitting. | Useful when you want extra protection against misspecification. Works well with flexible outcome models, but cross-fitting is important to avoid overfitting. |
| TMLE | Starts with an outcome model, then “nudges” it using the treatment model to improve accuracy. | Most efficient; doubly robust; targeting step improves finite-sample performance. | Preferred when using ML models and valid inference is important. Requires careful implementation (e.g. cross-fitting) but yields strong theoretical guarantees. |

1) Inverse Probability Weighting (IPW)

Inverse Probability Weighting (IPW) leverages the propensity score $e(X)$ to reweight each unit’s outcome, effectively creating a synthetic population where treatment assignment is independent of $X$ (i.e. adjusting for confounding). The weight is inversely proportional to their probability of receiving their observed treatment. Specifically:

In this way it gives more influence to people who are “underrepresented” in their group: those treated despite low probability, or untreated despite high probability. We’re creating two comparable synthetic populations here (each the same size as the original), one where everyone is treated and one where no one is. Comparing them then gives us the conditional mean difference using only \(e(X)\) instead of the full set of covariates.

Recall the conditional mean difference (difference in outcomes between the treated and untreated):

\[\mathbb{E}[Y \mid X, T = 1] - \mathbb{E}[Y \mid X, T = 0]\]

Re-writing this using the propensity scores, the left hand side becomes:

\[\mathbb{E}\left[ \left. \frac{Y}{e(x)} \,\right|\, X, T = 1 \right] P(T)\]

We take the outcomes of treated people, reweight them using the inverse of their treatment probability (\(\frac{Y}{e(x)}\)). This downweights those with a high chance of treatment and upweights those with a low chance so we don’t over-represent those who were likely to be treated. This is then scaled by how common treatment is overall to give us an estimate of what everyone would have experienced, had they all received the treatment.

Then the right hand side follows the same idea for the untreated:

\[\mathbb{E}\left[ \left. \frac{Y}{1 - e(x)} \,\right|\, X, T = 0 \right] (1 - P(T))\]

This uses \(1-e(x)\) in the denominator (the probability of not being treated, given X), so \(\frac{Y}{1-e(x)}\) gives more weight to those unlikely to be untreated (but were). Scaling by how common it is to be untreated (1-P(T)), this estimates what the entire population would have experienced if no one had received the treatment, by appropriately reweighting the observed outcomes of untreated individuals.

Putting these together and simplifying, the expected potential outcomes under treatment and control are estimated as:

\[\mathbb{E}\left[ \frac{T \cdot Y}{e(X)} \right] \quad \text{and} \quad \mathbb{E}\left[ \frac{(1 - T) \cdot Y}{1 - e(X)} \right]\]

Subtracting these gives an estimate of the Average Treatment Effect (ATE):

\[\mathbb{E}\left[ \frac{T \cdot Y}{e(X)} - \frac{(1 - T) \cdot Y}{1 - e(X)} \right]\]

This can also be written more compactly as:

\[\mathbb{E} \left[ Y \cdot \frac{T - e(X)}{e(X)(1 - e(X))} \right]\]

Given the propensity score is itself estimated, we need to account for this additional uncertainty in the standard error of the ATE (e.g. by bootstrapping the ATE sampling distribution).
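
A minimal sketch of the estimator, plus a bootstrap that re-fits the propensity model on each resample so its estimation uncertainty flows into the standard error (the column names are hypothetical):

```python
import numpy as np
import statsmodels.formula.api as smf

def ipw_ate(y, t, e):
    """IPW estimate of the ATE from outcomes, binary treatment, and propensity scores."""
    y, t, e = map(np.asarray, (y, t, e))
    return np.mean(t * y / e - (1 - t) * y / (1 - e))

def bootstrap_ipw_se(df, n_boot=500, seed=0):
    """Re-estimate the propensity model on each resample so its uncertainty is reflected in the SE."""
    rng = np.random.default_rng(seed)
    ates = []
    for _ in range(n_boot):
        s = df.sample(frac=1.0, replace=True, random_state=int(rng.integers(1_000_000_000)))
        e_hat = smf.logit("treated ~ x1 + x2", data=s).fit(disp=0).predict(s)
        ates.append(ipw_ate(s["y"], s["treated"], e_hat))
    return np.std(ates)
```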

2) Augmented Inverse Probability Weighting (AIPW)

Propensity scores can also be combined with an outcome model (e.g. regression) to produce a doubly robust estimator. The key advantage is that you get a consistent estimate of the average treatment effect (ATE) if either the outcome model or the propensity score model is correctly specified. Since both methods alone have weaknesses (outcome models can be biased, and propensity score weighting can be unstable), combining them lets one “lean on” the other when it is misspecified.

For example, consider the Augmented Inverse Probability Weighting (AIPW) estimator:

\[\widehat{ATE}_{AIPW} = \frac{1}{n} \sum_{i=1}^n \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{T_i \left( Y_i - \hat{\mu}_1(X_i) \right)}{\hat{e}(X_i)} - \frac{(1 - T_i) \left( Y_i - \hat{\mu}_0(X_i) \right)}{1 - \hat{e}(X_i)} \right]\]

Where:

Note that AIPW can use either separate outcome models for treated and untreated groups or a single joint model with treatment as an input. Typically, two separate models are preferred when treatment effect heterogeneity is suspected or when there is sufficient data in both groups to flexibly model outcomes. A single joint model is often used when sample size is limited or when the treatment effect is believed to modify the outcome additively or smoothly across covariates. While simpler models are generally preferred for the propensity score to avoid instability, flexible machine learning models can be used for the outcome model to better capture complex and nonlinear relationships. Although double robustness provides some protection against misspecification, cross-fitting is recommended to reduce overfitting. This involves splitting the data into folds and evaluating each fold using models trained on the other folds, which helps preserve the theoretical guarantees of AIPW.

The key idea is to create two synthetic populations to estimate potential outcomes: one where everyone is treated (to estimate $Y(1)$) and one where everyone is untreated (to estimate $Y(0)$). Like standardisation, AIPW uses an outcome model to provide baseline predictions $\hat{\mu}_1(X), \hat{\mu}_0(X)$. But AIPW goes further by incorporating a correction based on the treatment assignment model. Specifically, the residuals (observed outcome minus predicted outcome) are weighted by the inverse of the propensity score for the observed treatment. This gives more influence to residuals from units with unlikely treatment assignments; typically the regions where the outcome model is more likely to be biased. For units not observed in a treatment group, the weighted residual is zero, so their contribution relies solely on the model prediction, as in standardisation. By combining outcome regression with inverse probability weighting, AIPW reduces bias from model misspecification and remains consistent if either model is correct. Because the variance reflects uncertainty from both models and residual noise, bootstrapping is typically the simplest and most robust way to compute standard errors.

Example

Imagine we want to estimate the effect of a promotion on spending for all users, but income (an important confounder), is imperfectly measured. High-income customers in our data were more likely to get the promotion and tend to spend more, so a naive regression may overestimate the promotion effect. If some customer types are rarely treated, outcome predictions there may be poor. By weighting the residuals inversely by the probability of receiving treatment, we “rescale” these residuals to reflect the overall population better, providing a correction in underrepresented regions. The residual tells us how wrong the outcome model was, and the weighting ensures this correction is applied in proportion to where the model is likely to be wrong (i.e. in underrepresented areas).

This means we have two chances to get unbiased estimates: either the outcome model is correct, or the propensity score model is correct.
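
A compact sketch of the estimator matching the formula above, assuming you have already produced outcome-model predictions `mu1` and `mu0` and propensity scores `e` (all hypothetical inputs):

```python
import numpy as np

def aipw_ate(y, t, e, mu1, mu0):
    """Doubly robust AIPW estimate of the ATE.

    y: outcomes; t: binary treatment (0/1); e: estimated propensity scores;
    mu1, mu0: predicted outcomes under treatment and control from the outcome model.
    """
    y, t, e, mu1, mu0 = map(np.asarray, (y, t, e, mu1, mu0))
    correction_1 = t * (y - mu1) / e               # weighted residuals among the treated
    correction_0 = (1 - t) * (y - mu0) / (1 - e)   # weighted residuals among the controls
    return np.mean(mu1 - mu0 + correction_1 - correction_0)
```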

3) Targeted Maximum Likelihood Estimation (TMLE)

TMLE builds on AIPW, using the same inputs but a different method for combining models to improve efficiency, especially in finite samples. While AIPW plugs in the model estimates directly into a closed-form expression, TMLE takes an extra step to update the outcome model using a likelihood-based adjustment. This makes TMLE particularly useful when using flexible machine learning models, such as random forests or gradient boosting, while still retaining statistical guarantees like valid confidence intervals.

TMLE starts from the insight that both outcome models (predicting \(E[Y \mid T, X]\)) and treatment models (estimating \(P(T \mid X)\)) can be error-prone, especially when estimated with flexible algorithms. TMLE reduces these errors by targeting the initial outcome estimate using information from the treatment model. The result is a final estimator with desirable properties like consistency, efficiency, and double robustness.

At a high level, TMLE proceeds as follows:

Estimate the outcome model

Use any supervised learning method to estimate the expected outcome as a function of treatment and covariates:

\[\hat{Q}(T, X) \approx \mathbb{E}[Y \mid T, X]\]

This could be linear regression, random forests, gradient boosting, or any other reasonable predictor. The goal is a good initial estimate of expected outcomes.

Estimate the treatment model (propensity score)

Estimate the conditional probability of receiving treatment T given covariates X:

\[\hat{g}(T \mid X)\]

For binary treatment, this is the usual propensity score. For continuous or multi-valued treatments, more complex models are needed to estimate the treatment assignment mechanism.

Compute the clever covariate

The clever covariate captures how much each observation contributes to the correction. For binary treatment, it is typically:

\[H(T, X) = \frac{I(T = 1)}{\hat{g}(1 \mid X)} - \frac{I(T = 0)}{\hat{g}(0 \mid X)}\]

This covariate encodes both the direction and size of the adjustment needed to address confounding.

Update the outcome model

The initial outcome model $\hat{Q}$ is updated through a targeted maximum likelihood step. This involves solving a likelihood-based estimating equation that incorporates the clever covariate, effectively nudging $\hat{Q}$ so that it better respects the observed treatment mechanism. In practice, this targeting step often involves fitting a small regression (for example, logistic or linear) using the clever covariate as a feature and the initial predicted outcomes as offsets.

This targeting step ensures that the final estimator is consistent if either the outcome model or the treatment model is correct, and efficient when both are correct. This is the double robustness property.

Estimate the treatment effect

Once the updated outcome model $\hat{Q}^*$ is obtained, potential outcomes under treatment and control are estimated:

\[\hat{\mu}_1 = \frac{1}{n} \sum_{i=1}^n \hat{Q}^*(1, X_i), \quad \hat{\mu}_0 = \frac{1}{n} \sum_{i=1}^n \hat{Q}^*(0, X_i)\]

Then the average treatment effect (ATE) is computed as:

\[\widehat{\text{ATE}} = \hat{\mu}_1 - \hat{\mu}_0\]

TMLE combines outcome and treatment modeling in a way that protects against misspecification in either one. Like AIPW, it is doubly robust: consistent if at least one of the models is correctly specified. In addition, the targeting step improves efficiency by reducing bias in the initial outcome predictions using information from the treatment model. To ensure valid inference when using flexible learners, TMLE typically includes cross-fitting. The data is split into folds, and each fold is evaluated using models trained on the others. This reduces overfitting and preserves the theoretical guarantees of TMLE, even with high-capacity models.
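
The sketch below is a deliberately simplified version of these steps for a continuous outcome, using a single linear fluctuation and no cross-fitting; dedicated implementations (for example, the R tmle package or the Python zEpid library) additionally handle bounded outcomes, logistic fluctuations, and cross-fitting.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression

def tmle_ate_continuous(y, t, X):
    """Minimal TMLE sketch for a continuous outcome with a linear fluctuation (illustrative only).

    y: continuous outcome; t: binary treatment indicator (0/1); X: covariate matrix.
    """
    # 1) Initial outcome model Q(T, X)
    q_model = GradientBoostingRegressor().fit(np.column_stack([t, X]), y)
    q_obs = q_model.predict(np.column_stack([t, X]))
    q1 = q_model.predict(np.column_stack([np.ones_like(t), X]))
    q0 = q_model.predict(np.column_stack([np.zeros_like(t), X]))

    # 2) Treatment model g(X) = P(T = 1 | X)
    g = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    g = np.clip(g, 0.01, 0.99)  # guard against extreme propensity scores

    # 3) Clever covariate evaluated at the observed treatment
    h_obs = t / g - (1 - t) / (1 - g)

    # 4) Targeting step: one fluctuation parameter via least squares with no intercept
    eps = np.sum(h_obs * (y - q_obs)) / np.sum(h_obs ** 2)

    # 5) Updated predictions and the plug-in ATE
    q1_star = q1 + eps * (1 / g)
    q0_star = q0 + eps * (-1 / (1 - g))
    return np.mean(q1_star - q0_star)
```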

Diagnostics: Checking Covariate Balance

Regardless of the method chosen from above, it is essential to check whether the treatment and control groups are balanced on key covariates to draw valid causal conclusions. This means their distributions should be similar enough that comparisons approximate a randomized experiment.

With weighting methods (IPW, AIPW, TMLE), balance is assessed in the weighted sample: each observation is reweighted by the inverse of its estimated propensity score, and the resulting covariate distributions are compared across groups. With matching, balance is evaluated in the matched sample, which includes only treated and control units that were successfully paired. If variable-ratio or replacement matching is used, match weights should be incorporated into diagnostics.

Typical balance diagnostics include:

Diagnostics should be performed both before and after adjustment to assess how much imbalance remains. If balance is still poor, consider the following:

In short, covariate balance diagnostics are a necessary part of any causal analysis. They tell you whether your adjustment method has sufficiently addressed observed confounding. If balance is poor, the resulting estimates may still be biased, and further refinement or design changes may be needed.

Causal Strength

Propensity score methods provide a useful framework for estimating causal effects when confounding is strong and high-dimensional, but fully observed. Methods like IPW, AIPW, TMLE, and matching differ in how they adjust for confounding, but all rely on the same core assumptions. Correct model specification and the Stable Unit Treatment Value Assumption (SUTVA) are required for all, but the reliability of results ultimately depends on the validity of the following identifiability conditions:

Causal inference using propensity scores requires more than just applying a method. It depends on careful diagnostics, transparent assumptions, and a willingness to revise the model or approach if balance cannot be achieved. When these elements are in place, propensity score methods can provide credible estimates of causal effects from observational data.

Generalized Propensity Score (GPS)

Most common propensity score methods are designed for binary treatments and do not naturally extend to continuous treatments without modification. The Generalized Propensity Score (GPS) extends the concept to continuous treatments (such as dosage levels) by modeling the conditional density of the treatment given covariates:

\[r(t, X) = f_{T|X}(t | X)\]

Unlike the binary propensity score, which gives a scalar probability of treatment assignment, the GPS is a continuous function: it describes the conditional probability density of receiving a particular treatment level given covariates. In other words, it tells us how likely an observation is to receive a particular treatment level based on its characteristics.

Specifically, for a continuous treatment T, the GPS at level t and covariates X is defined as the conditional density $r(t, X) = f_{T \mid X}(t \mid X)$. In practice, the GPS is often estimated using a parametric model for $T \mid X$, most commonly assuming a normal distribution. For example:
  1. Fit a linear regression model to estimate expected treatment: Regress the treatment variable T on covariates X:

    \[T = X^\top \beta + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)\]
  2. Estimate parameters: From the fitted model, obtain estimates $\hat{\beta}$ and the residual variance $\hat{\sigma}^2$. Not everyone receives exactly the predicted treatment. This variation reflects randomness, measurement error, or unobserved factors, and is used to construct the conditional treatment density.

  3. Compute the conditional mean: For each observation i, compute the predicted treatment level based on its covariates:

    \[\hat{\mu}_i = X_i^\top \hat{\beta}\]
  4. Evaluate the conditional density (GPS): We ask how likely the observed treatment level $t_i$ is, given the predicted value $\hat{\mu}_i$. The estimated GPS is the value of the normal density at the observed treatment:

    \[\hat{r}(t_i, X_i) = \frac{1}{\sqrt{2\pi\hat{\sigma}^2}} \exp\left( -\frac{(t_i - \hat{\mu}_i)^2}{2\hat{\sigma}^2} \right)\]

Non-parametric alternatives such as kernel density estimation or Gaussian processes can be used when the normality assumption is not appropriate.
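
A sketch of these steps under the normality assumption, using statsmodels for the regression and scipy for the density; the inputs are hypothetical numpy arrays (T of shape (n,), X of shape (n, p)):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def estimate_gps(T, X):
    """GPS under a normal model for T | X: fit OLS, then evaluate the conditional density at the observed T."""
    X_design = sm.add_constant(X)
    fit = sm.OLS(T, X_design).fit()
    mu_hat = fit.predict(X_design)    # conditional mean of treatment for each unit
    sigma2_hat = fit.mse_resid        # residual variance estimate
    return norm.pdf(T, loc=mu_hat, scale=np.sqrt(sigma2_hat))
```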

Once estimated, the GPS serves as a balancing score. Just like the binary propensity score, the idea is that conditional on the GPS, the covariates X are independent of treatment T. This mimics a randomized design and allows us to isolate the effect of treatment intensity. The GPS can then be used in different causal estimation methods:

To estimate a causal effect such as the average treatment effect between two treatment levels $t_1$ and $t_2$, compute $\mu(t_2) - \mu(t_1)$ from the estimated dose-response curve.

Diagnostics

Diagnostics for GPS methods play the same role as in binary propensity score methods: checking for covariate balance and ensuring sufficient overlap. However, the continuous treatment adds complexity.

Causal strength

GPS methods rely on extensions of the standard assumptions for causal inference:

GPS provides a principled way to recover full dose-response relationships without reducing treatment to binary form. But it can be sensitive to model misspecification and issues with support. Diagnostics such as GPS balance checks and overlap plots are essential to validate assumptions and ensure credible inference.

Orthogonal Machine Learning (Double ML)

Orthogonal Machine Learning (OML) is a method specifically designed to estimate causal effects using flexible machine learning models, while guarding against overfitting and bias from high-dimensional or complex covariates. While ML models can capture complex relationships, they don’t necessarily focus on estimating the treatment effect. So-called “nuisance” variables, which influence both the treatment and outcome but are not of primary interest, can absorb part of the treatment effect. This leads to a biased coefficient estimate. OML addresses this by separating (orthogonalizing) the variation in treatment and outcome that is explained by covariates, allowing the causal effect to be estimated from the residual variation that remains.

While the nuisance models use ML, the final step estimates the treatment effect using a simple parametric model, typically a linear regression on these residuals. This means it assumes the causal effect is well approximated by a linear or similarly simple function after confounding has been removed. Therefore, it still depends on correct specification of the final treatment effect model to yield unbiased causal estimates.

Orthogonalization

The fundamental idea is to decompose both the treatment and the outcome into components that are orthogonal, or uncorrelated, with respect to the nuisance variables. To do this, first we fit flexible ML models to predict the treatment and outcome using the nuisance variables. Using the residuals of those models (to remove the effect of the nuisance variables), we then fit a simple regression to isolate the effect of the treatment on the outcome:

  1. Predict Treatment: Fit an ML model \(M_t(X)\) to predict the treatment \(T\) from the covariates \(X\) (like a propensity score model). The residuals \(\tilde{t} = T - M_t(X)\) represent the portion of \(T\) that is unrelated to \(X\).

  2. Predict Outcome: Fit an ML model \(M_y(X)\) to predict the outcome \(Y\) from \(X\). The residuals \(\tilde{y} = Y - M_y(X)\) represent the portion of \(Y\) that is unrelated to \(X\).

  3. Estimate the Treatment Effect: The causal effect $\theta$ is then estimated by regressing \(\tilde{y}\) on \(\tilde{t}\), ensuring that the estimated effect reflects the part of \(Y\) that is attributable to \(T\) after adjusting for \(X\).

    $\tilde{y}_i = \alpha + \theta \tilde{t}_i + \varepsilon_i$

A key strength of this approach is that small errors in estimating the nuisance models do not bias the estimated treatment effect, provided those errors are orthogonal to the treatment residuals. However, if the ML models overfit, they absorb some of the genuine treatment and outcome variation into the nuisance predictions, and the final regression can then underestimate the treatment effect.

Cross-fitting

To avoid overfitting and to ensure valid confidence intervals, it is recommended to use cross-fitting. This involves training the nuisance models on one subset of the data and computing residuals on a separate subset. Specifically, the data is split into $K$ folds. For each fold $k$, you:

  1. Train the treatment $M_t(X)$ and outcome $M_y(X)$ models on the other $K-1$ folds (excluding fold $k$),
  2. Use those models to compute the residuals $\tilde{t}_i = T_i - M_t(X_i)$ and $\tilde{y}_i = Y_i - M_y(X_i)$ only on fold $k$.

Each residual is computed using models that didn’t see that observation (out-of-fold), reducing bias from overfitting. After processing all folds, you concatenate the residuals $\tilde{t}_i$ and $\tilde{y}_i$ to create the full-length vectors of orthogonalized treatment and outcome. Then you can estimate the treatment effect as before by regressing \(\tilde{y}\) on \(\tilde{t}\).
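
As a minimal sketch (assuming scikit-learn is available; the gradient boosting models and simulated data are just illustrative choices), cross-fitted orthogonalization can be written as:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

def orthogonal_ml(X, T, Y, n_folds=5):
    """Cross-fitted residual-on-residual regression (illustrative sketch)."""
    # Out-of-fold predictions: each observation is predicted by models trained
    # on the other folds, so the residuals are not contaminated by overfitting.
    t_hat = cross_val_predict(GradientBoostingRegressor(), X, T, cv=n_folds)
    y_hat = cross_val_predict(GradientBoostingRegressor(), X, Y, cv=n_folds)

    t_res = T - t_hat   # variation in treatment not explained by X
    y_res = Y - y_hat   # variation in outcome not explained by X

    # Final stage: simple linear regression of outcome residuals on treatment residuals;
    # the slope is the estimated treatment effect, with standard errors from OLS.
    return sm.OLS(y_res, sm.add_constant(t_res)).fit()

# Illustrative usage with simulated data (true effect = 1.5)
rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 5))
T = X[:, 0] + rng.normal(size=2_000)
Y = 1.5 * T + 2.0 * X[:, 0] + rng.normal(size=2_000)
print(orthogonal_ml(X, T, Y).params)
```

In practice you would more likely reach for an established implementation (e.g. LinearDML in the econml package) rather than rolling your own, but the sketch shows how little machinery the idea actually needs.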

Benefits

In outcome regression, we get more precise estimates as we collect more data. Specifically, our estimation error shrinks at a rate of $1/\sqrt{n}$, called root-$n$ consistency. This means that as the sample size $n$ grows, our parameter estimates converge quickly to the true value, and the distribution of the estimator becomes approximately normal. This is what enables us to construct valid confidence intervals and perform reliable hypothesis tests.

Standard ML models generally converge more slowly. They prioritize prediction accuracy over statistical inference. For example, they often introduce bias to reduce variance (e.g. via regularization, tree pruning, or early stopping), which improves out-of-sample prediction performance. Their goal is to minimize a loss function, not to estimate interpretable parameters, so there’s no guarantee their coefficients (if they have any) are unbiased, consistent, or asymptotically normal. In high-dimensional settings, ML models may need exponentially more data to learn the underlying function accurately. This slows convergence further and makes it harder to quantify uncertainty. As a result, confidence intervals and hypothesis tests based directly on standard ML models are often invalid or unavailable.

The great thing about OML with cross-fitting is that it can achieve root-$n$ consistency, enabling reliable and statistically valid estimates (including valid confidence intervals and hypothesis tests), even when using complex machine learning models and high-dimensional covariates.

Causal strength

Orthogonal Machine Learning extends the familiar idea of adjusting for covariates in outcome regression to more flexible settings with machine learning models. When dealing with high-dimensional or complex confounding it can reduce bias, assuming the first-stage models approximate the true relationships well (no substantial misspecification). Cross-fitting reduces the risk of overfitting and bias, but poor first-stage performance can still affect precision. Importantly, the reliability of OML still rests on the same core identification assumptions as the other observational methods:

Orthogonalization improves robustness to errors in nuisance estimation, but it does not remove the need for thoughtful variable selection, strong overlap, or the absence of unmeasured confounding. As always, careful diagnostics, sensitivity analyses, and an understanding of the data-generating process are essential to ensure valid causal inference.


Adjustment methods comparison

To recap, adjustment-based methods estimate causal effects by controlling for confounding variables in observed data. However, they differ in how they construct comparability between treated and control groups: some model outcomes directly, some weight or match observations, and others combine strategies for robustness. Each has its strengths and weaknesses to consider:

| Method | How It Works | When to Use | Key Strengths | Key Limitations |
| --- | --- | --- | --- | --- |
| Outcome Regression | Fits a model for $\mathbb{E}[Y \mid T, X]$ using the full dataset, then predicts outcomes under treatment and control for each unit and averages the difference to estimate ATE. | Preferable when overlap is strong and the outcome model is correctly specified or can be well approximated. | Uses all data; flexible modeling of outcomes; direct estimate of conditional effects. | Sensitive to misspecification of outcome model; extrapolation risk in poor overlap regions. |
| G-computation (standardisation / g-formula) | Predicts potential outcomes for all units under each treatment level and averages over the covariate distribution to get the marginal ATE. | Preferable when flexible outcome modeling is feasible and marginal effects are the focus. | Intuitive population-level estimate; uses entire sample. | Shares regression’s sensitivity to model misspecification; no explicit confounding adjustment via treatment modeling. |
| Matching (on covariates or PS) | Creates matched pairs or sets of treated and control units based on similarity, and estimates ATE from these matched samples. | Preferable when overlap is limited and prioritizing local comparability for valid ATE. | Avoids extrapolation; emphasizes direct comparability; less model-dependent. | Reduces sample size; may increase variance; excludes unmatched units. |
| IPW | Weights units by the inverse probability of their received treatment, creating a synthetic balanced sample; estimates ATE from weighted averages. | Preferable when overlap is sufficient and the propensity score model is well specified and calibrated. | Model-free outcome estimation; consistent if propensity model is correct; conceptually straightforward. | Can produce unstable estimates with extreme weights from poor overlap. |
| AIPW (Doubly Robust) | Combines outcome regression and IPW to estimate ATE; remains consistent if either the outcome or propensity model is correct. | Preferable when using flexible outcome and treatment models and robustness is important. | Double robustness reduces bias; more stable than IPW alone; suitable with flexible models. | Still sensitive to extreme weights; cross-fitting recommended with machine learning. |
| TMLE (Targeted MLE) | Starts with an initial outcome model, then uses the propensity score to “target” or update it, refining the ATE estimate for efficiency and valid inference. | Preferable when using machine learning and seeking efficient, statistically rigorous ATE estimates. | Double robust with improved finite-sample properties; compatible with ML; provides valid confidence intervals. | More complex implementation; cross-fitting required for reliable inference. |
| Generalized Propensity Score (GPS) | Extends propensity scores to continuous or multi-valued treatments; estimates dose-response curves or marginal ATE accordingly. | Preferable when treatments are continuous or multi-level, and dose-response or marginal effects are needed. | Enables causal inference beyond binary treatments; estimates marginal effects. | Requires strong assumptions and precise treatment density modeling. |
| OML (Orthogonal ML) | Uses orthogonal moments and cross-fitting to debias machine learning–based ATE estimators, especially in high-dimensional covariate spaces. | Preferable for high-dimensional settings with complex outcome and treatment relationships modeled flexibly. | Provides valid inference with complex ML; handles high-dimensional data; robust to overfitting. | More abstract and technically involved; requires careful tuning and validation. |

To give a specific example, let’s think about how the different methods would handle sparse or imbalanced subgroups (where there are few treated or control observations, so nearly violating positivity):

This is an example where you might favor robust methods like AIPW or TMLE, especially when you have complex relationships and some risk of model misspecification, but also want to mitigate issues caused by limited overlap or sparse data regions.


Quasi-Experimental Designs

Now we’re moving on to quasi-experimental designs. The adjustment-based methods discussed earlier (e.g. regression, matching, propensity scores, OML) require that all relevant confounders are observed and correctly included in the model; an assumption that is often violated in practice. Quasi-experimental designs address this limitation by leveraging natural experiments, study design features, or structural breaks that introduce variation in treatment assignment for reasons unrelated to potential outcomes. This can make their identifying assumptions more plausible, but also limits their scope and generalisability.

Common quasi-experimental methods include:

Instrumental Variable (IV) analysis

When unobserved confounders bias our estimates, we can’t directly measure the causal effect of a treatment T on an outcome Y. Instrumental variable (IV) analysis offers a way to estimate this effect indirectly, under specific conditions. The key is to find an instrument (Z) that influences the treatment T, but has no direct or indirect effect on the outcome Y except through T. In other words, any change in Y that results from changes in Z must operate entirely through T. If this holds, we can isolate the causal effect of T on Y by dividing the effect of Z on Y by the effect of Z on T. However, if Z affects Y through other pathways (violating the exclusion restriction), then the observed changes in Y cannot be attributed solely to T, and the IV estimate is invalid.

IV graph

Key assumptions:

IV populations

Monotonicity is considered optional because it’s not essential for the identification of the causal effect; rather, it affects the interpretation of the estimate. If monotonicity holds, you get a clean interpretation of the treatment effect as the LATE for the compliers. If the compliers are not special in any way, the results should generalise well to the whole population, but this isn’t guaranteed like in a randomised experiment. If monotonicity doesn’t hold, the IV estimate becomes more complex, representing a weighted average of treatment effects (WATE) across different groups (rather than a uniform effect), making it harder to interpret and generalise.

How it works

As described above, the association of the instrument Z with the outcome Y will be a linear multiple of the association of the instrument Z with the treatment T:

IV formulas on graph

Since IV estimates rely on marginal differences (i.e. changes in T and Y), we can (and should) use linear regression even if the outcome is binary or categorical. The goal is not to perfectly model Y, but to isolate variation in T that is exogenous (i.e. uncorrelated with confounders), using Z. Importantly, linear regression in the first stage (T ~ Z) ensures that we isolate the part of T that is orthogonal to unobserved confounders (the error).

Wald Estimator

For a single binary instrument and binary treatment, the Wald estimator directly applies the logic above:

\[\hat{\beta}_{\text{IV}} = \frac{E[Y \mid Z = 1] - E[Y \mid Z = 0]}{E[T \mid Z = 1] - E[T \mid Z = 0]} = \frac{\alpha \cdot \beta}{\alpha} = \frac{\text{Reduced form}}{\text{First stage}}\]

If the instrument perfectly predicts treatment (i.e. 100% compliance), the denominator is 1, and the estimator reduces to the difference in outcomes (like a standard randomized controlled trial). However, if the instrument is weak (i.e. α is close to 0), the denominator becomes small, leading to unstable and biased estimates.

Two-stage least squares (2SLS)

In practice, 2SLS is the standard method for IV estimation (e.g. IV2SLS in Python’s linearmodels). It proceeds in two stages: first, regress the treatment T on the instrument Z (and any covariates) and save the fitted values \(\hat{T}\); second, regress the outcome Y on \(\hat{T}\) (and the same covariates), so the coefficient on \(\hat{T}\) uses only the variation in treatment that is driven by the instrument.

In the simple case of one instrument and no covariates, 2SLS produces the same result as the Wald estimator. But unlike Wald, 2SLS extends to: multiple instruments; continuous treatments and instruments; the inclusion of control variables (added to both stages); and valid standard errors (accounts for the fact \(\hat{T}\) is an estimate and not an observed value).
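
As a minimal sketch with IV2SLS (the data below is simulated, with the true effect set to 1.0 and an unobserved confounder u that would bias a naive regression):

```python
import numpy as np
import pandas as pd
from linearmodels.iv import IV2SLS

# Simulated example: z nudges treatment t; u is an unobserved confounder of t and y
rng = np.random.default_rng(1)
n = 5_000
u = rng.normal(size=n)                        # unobserved confounder
z = rng.binomial(1, 0.5, size=n)              # instrument (e.g. a random nudge)
t = (0.4 * z + 0.5 * u + rng.normal(size=n) > 0.5).astype(int)
y = 1.0 * t - 0.8 * u + rng.normal(size=n)    # true effect of t on y is 1.0
df = pd.DataFrame({"y": y, "t": t, "z": z})

# Exogenous terms sit outside the brackets, [endogenous ~ instruments] inside
res = IV2SLS.from_formula("y ~ 1 + [t ~ z]", data=df).fit(cov_type="robust")
print(res.summary)   # res.params["t"] is the IV (LATE) estimate
```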

Example

Let’s solidify this logic with a simplified example. Suppose we want to estimate the effect of ad exposure (T) on purchases (Y). A temporary bug (Z) in the ad system randomly caused some users to be 30 percentage points less likely to see ads (first stage effect: α=0.3). Those affected by the bug (Z=1) had a 1 percentage point lower conversion rate during the session (reduced form effect: α⋅β=0.01). This tells us that a 30 percentage point decrease in ad exposure leads to a 1 percentage point decrease in purchases. Using the Wald estimator to scale up, the effect of removing ads for all users is estimated as \(\frac{0.01}{0.3}\approx 0.033\), meaning removing all ads would cause a ~3 percentage point drop in conversion.

IV formulas on graph

This is an overly simplified example: we assume all IV assumptions hold. In reality, we’d need to test exchangeability (e.g., did the bug disproportionately affect certain segments or device types? Are the periods comparable and generalizable? Could users have noticed the change and reacted differently?). If there are measurable confounders of Z and Y, we can add these as covariates in 2SLS (e.g. device type).

Selecting an instrument

Finding a valid instrument is challenging in practice. In industry, common sources of instruments include bugs, experiments (especially unrelated A/B tests that nudge compliance with treatment), feature rollouts, and external shocks such as policy changes or outages (i.e. natural experiments). While these are commonly available, recall they need to have a strong enough effect on the treatment to be viable (i.e. the first stage estimate must be significant). It’s also possible to use user, device, or product traits as instruments (e.g. OS type, app version), but these are often directly or indirectly correlated with outcomes of interest (e.g. conversion rates), violating the exclusion restriction.

Some examples of instruments for inspiration:

Multiple instruments

A core limitation of IV analysis is that it estimates the causal effect for the compliers (LATE), i.e. individuals whose treatment status is influenced by the instrument Z. Since different instruments influence different subsets of the population, LATEs can vary across studies or contexts. This raises an important question: how representative is a given LATE of the true Average Treatment Effect (ATE) for the broader population? One strategy to address this is to use multiple, independent instruments. Each instrument identifies a LATE for its own set of compliers. By collecting these localized causal effects, we can explore how treatment effects vary across subgroups, and, under certain conditions, approximate the ATE by combining them.

Two main approaches:

  1. Separate models (one instrument at a time): Estimate the LATE for each instrument separately and compare them. This shows how treatment effects vary across the different complier groups, but each estimate remains local to its own instrument.

  2. Joint model (multiple instruments together): Include all instruments simultaneously in the model. This produces a weighted average of the LATEs, where weights reflect each instrument’s first-stage strength (i.e. how strongly it shifts the treatment). All instruments must be valid (satisfying relevance and exclusion). For example, Angrist & Krueger (1991) combined quarter of birth (affects students on the margin of school-entry age) and compulsory schooling laws (affects those who would have dropped out without the law) to instrument for education, leveraging variation in both school entry and legal leaving age.

However, averaging multiple LATEs doesn’t guarantee you recover the ATE. As an analogy: each LATE is like a torch illuminating one part of a room. To reconstruct the full picture (the ATE), you either need to shine lights in the right places or understand the room’s layout. You can only recover the ATE if one of the following is true:

In practice, compliers often differ systematically. They may be more persuadable, more marginal, or behave differently in ways that affect treatment response. Understanding and adjusting for these differences is essential when trying to generalize from LATEs to broader conclusions.

Adding Covariates

Additional covariates are often necessary to satisfy identifiability assumptions when using observational instruments (e.g. adding confounders of the instrument Z and the outcome Y to both stages in 2SLS). However, this does not estimate the Local Average Treatment Effect (LATE) conditional on those covariates. Instead, the result is a weighted average of conditional LATEs, where the weights are determined by how the instrument varies across covariate strata and how those strata relate to the outcome. If treatment effects are heterogeneous across strata, or if defiers are present, these weights can become non-intuitive. They may be negative, exceed one, or fail to reflect subgroup sizes or first-stage strength. As a result, even with a valid instrument, the IV estimate \(\beta\) can be distorted and difficult to interpret as the effect for any specific population subgroup.

To address this, Słoczynski (2024) proposes two remedies:

Bias

Even with a valid instrument, a weak correlation between Z and T leads to noisy and potentially biased IV estimates. When the instrument is weak, the first-stage regression poorly predicts the treatment (\(\hat{T}\)). Although the noise is symmetric, the estimated first-stage coefficient can be close to zero or even change sign. Since the IV estimator can be expressed as \(\hat{\beta} = \frac{Cov(Z,Y)}{Cov(Z,T)}\), dividing by a small or near-zero denominator causes the estimate to “blow up,” producing a few extremely large positive outliers. Sensitivity tests are therefore important, because even small violations of the assumptions can result in a large bias. Importantly, the direction of the bias in IV estimates typically matches that of OLS. For example, if confounders are positively related to the treatment but negatively related to the outcome, both OLS and IV estimates will underestimate the true effect (negative bias).

Bias by method

However, IV is consistent, meaning it approaches the population value as the sample size increases, whereas under confounding the OLS estimate remains biased:

Bias by sample size

Deep IV

Deep IV is a method that extends instrumental variable (IV) analysis to handle nonlinear relationships and heterogeneous treatment effects using deep learning. It addresses the limitations of traditional IV approaches like Two-Stage Least Squares (2SLS), which assume linearity and constant treatment effects. Instead of fitting simple linear models, Deep IV uses neural networks to flexibly model the treatment and outcome processes in two stages: first, a network models the conditional distribution of the treatment given the instrument and covariates; second, an outcome network is trained by integrating its predictions over that estimated treatment distribution, recovering a flexible (potentially heterogeneous) treatment-response function.

It can be especially useful in settings with high-dimensional data, such as personalized recommendations, pricing strategies, or policy evaluation. However, it still relies on the standard IV assumptions (exclusion restriction and instrument relevance), requires careful tuning, and can be computationally expensive.

Causal strength

The ideal setting for IV is more restrictive than other methods. However, when a strong instrument is available, IV can be a powerful tool. It enables estimation of causal effects even when unmeasured confounding is present, and can also handle issues like simultaneity bias (feedback loop) and measurement error in the treatment (if uncorrelated with the instrument). However, IV relies on a distinct set of assumptions that are largely theoretical and often untestable, so the credibility of any causal claims hinges on their plausibility in the specific context.

These assumptions mean IV is best for cases where there’s a lot of unmeasured confounding, a binary and time-fixed treatment, and a strong instrument. Further, the effect should be homogeneous (or you should be genuinely interested in the effect for the compliers), and monotonicity should hold.

Difference in differences

Difference-in-Differences (DiD) is another quasi-experimental technique used to estimate causal effects when we observe outcomes before and after a treatment for both a treated and control group. It is especially useful for macro-level interventions such as policy evaluations (e.g. assessing the effect of a new law on unemployment) or marketing campaigns that lack granular tracking (e.g. TV ads or billboards).

It would be a mistake to simply compare outcomes before and after the intervention in the treated group alone, as this ignores other time-varying factors unrelated to treatment that may have affected the outcome (e.g. seasonality, any underlying trend). The control group serves as a counterfactual, showing what would have happened to the treated group had the intervention not occurred. However, we also can’t just compare the treated and control group in the post-treatment period, as there can be pre-existing differences between the groups.

To get around this, we look at the difference between the pre-post difference of each group (cancelling out any pre-existing difference between the groups by comparing groups, and cancelling out any time-invariant confounders by comparing time periods). The key assumption is that, in the absence of treatment, both groups would have followed parallel trends. If that’s true, then any divergence in outcome trends after treatment can be attributed to the treatment effect.

Suppose we have two time periods: pre-treatment (t=0) and post-treatment (t=1); and, two groups: treated (D=1) and control (D=0). Using potential outcomes notation, we can write these as \(Y_D(t)\) (e.g. \(Y_1(0) \mid D=1\) is the potential outcome under treatment before treatment is implemented, averaged among those who were actually treated).

Using this notation, the DiD estimator (ATT) is:

\[\hat{ATT} = \left( \mathbb{E}[Y_1(1) \mid D=1] - \mathbb{E}[Y_0(1) \mid D=0] \right) - \left( \mathbb{E}[Y_1(0) \mid D=1] - \mathbb{E}[Y_0(0) \mid D=0] \right)\]

Where:

Re-arranging the formula to group the treated and control terms, we switch to the observed outcomes to get the Difference-in-Differences (DiD) formula:

\[\widehat{\text{DiD}} = \big(\overline{Y}_{\text{treated, post}} - \overline{Y}_{\text{treated, pre}}\big) - \big(\overline{Y}_{\text{control, post}} - \overline{Y}_{\text{control, pre}}\big)\]

To estimate what would have happened to the treated group if they hadn’t been treated (the counterfactual), we’re taking the pre-treatment average for the treated group, and adding the change observed in the control group (since they weren’t treated):

\[\mathbb{E}[Y_0(1) \mid D = 1] = \overline{Y}_{\text{treated, pre}} + \left( \overline{Y}_{\text{control, post}} - \overline{Y}_{\text{control, pre}} \right)\]

Regression

While you only need the group averages for the DiD formula above, converting it to a regression allows us to easily compute the standard error and confidence intervals. For this we need granular data (i.e. one row for each unit i and time period t). Using \(\text{Post}_t\) to indicate the time period (1 if after treatment, 0 if before), and \(\text{Treated}_i\) to indicate treatment (1 if unit i is in the treated group, 0 otherwise), we get:

\[Y_{it} = \alpha + \beta \cdot \text{Post}_t + \gamma \cdot \text{Treated}_i + \delta \cdot (\text{Treated}_i \times \text{Post}_t) + \varepsilon_{it}\]

Where:

  - \(\alpha\): baseline outcome for the control group in the pre-treatment period
  - \(\beta\): change over time common to both groups (the time trend)
  - \(\gamma\): pre-existing difference between the treated and control groups
  - \(\delta\): the DiD estimate, i.e. the additional post-treatment change in the treated group (ATT)
  - \(\varepsilon_{it}\): error term

Coming back to the DiD formula, we’re estimating the four group means from the model:

| Group | Post_t | Treated_i | Interaction | Expected Y |
| --- | --- | --- | --- | --- |
| Control, Pre | 0 | 0 | 0 | $\alpha$ |
| Control, Post | 1 | 0 | 0 | $\alpha + \beta$ |
| Treated, Pre | 0 | 1 | 0 | $\alpha + \gamma$ |
| Treated, Post | 1 | 1 | 1 | $\alpha + \beta + \gamma + \delta$ |

Then we can do the same subtractions as before to get the DiD estimator:

\[\widehat{\text{DiD}} = \big(\overline{Y}_{\text{treated, post}} - \overline{Y}_{\text{treated, pre}}\big) - \big(\overline{Y}_{\text{control, post}} - \overline{Y}_{\text{control, pre}}\big)\]
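
In code, this is a single OLS with an interaction term. A minimal sketch with statsmodels, using simulated panel data with a true effect of 2.0 and unit-clustered standard errors (all names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated panel: 200 units x 2 periods, true treatment effect = 2.0
rng = np.random.default_rng(2)
units = np.arange(200)
df = pd.DataFrame([(u, t) for u in units for t in (0, 1)], columns=["unit", "post"])
df["treated"] = (df["unit"] < 100).astype(int)      # first 100 units are treated
unit_effect = rng.normal(size=200)[df["unit"]]      # time-invariant differences between units
df["y"] = (1.0 + 0.5 * df["post"] + 1.5 * df["treated"] + unit_effect
           + 2.0 * df["treated"] * df["post"]
           + rng.normal(scale=0.5, size=len(df)))

# The interaction coefficient is the DiD estimate (delta / ATT)
did = smf.ols("y ~ treated * post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["unit"]}
)
print(did.params["treated:post"], did.bse["treated:post"])
```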

To visualise what we’re doing, let’s create the counterfactual line by adding \(\gamma\) (i.e. the difference between treated and control at the pre-treatment time point) to the control line. This assumes the growth in the control is the same as the treated would have been had they not received treatment. The gap between the counterfactual and the observed value for the treated at the post-treatment time point is our DiD estimate \(\delta\) (ATT).

DiD plot

To show how this is differencing out unobserved confounders that are constant over time, suppose the true data-generating process is:

\[Y_{it} = \alpha_i + \gamma_t + \delta D_{it} + \varepsilon_{it}\]

Where:

  - \(\alpha_i\): unit fixed effect (time-invariant characteristics of unit \(i\))
  - \(\gamma_t\): time effect common to all units in period \(t\)
  - \(D_{it}\): treatment indicator (1 if unit \(i\) is treated at time \(t\))
  - \(\delta\): the treatment effect
  - \(\varepsilon_{it}\): noise

By looking at differences over time within units, DiD cancels out \(\alpha_i\) (the time-invariant characteristics), because they don’t change between periods. Then, by comparing across groups, it cancels out \(\gamma_t\) (the common time based effects), as long as both groups are equally affected. This leaves just the treatment effect (\(\delta\)) and noise (\(\varepsilon\)). However, given we assume parallel trends in the absence of treatment, it does not remove bias from time-varying confounders that affect the groups differently (e.g. one group improves faster over time regardless of treatment).

Example

Suppose a new education policy (e.g. a new curriculum) is implemented only in one school zone, beginning in year t=1. You want to evaluate the impact of this policy on student test scores. Students are observed over time in both the treated and control zones.

\[Y_{it} = \alpha_i + \gamma_t + \delta D_{it} + \varepsilon_{it}\]

Extensions

Covariates: You can add observed covariates to the DiD model to improve precision or to adjust for observed time-varying differences between the groups (residual imbalance in characteristics that change over time and differ across groups). If the covariates are purely prognostic, this mainly reduces residual variance (narrower confidence intervals); if they are needed to account for differential trends, the identifying assumption becomes parallel trends conditional on those covariates. Below, \(\mathbf{X}_{it}\) is the vector of covariates and \(\mathbf{\beta_3}\) are the corresponding coefficients:

\[Y_{it} = \beta_0 + \beta_1 \text{Post}_t + \beta_2 \text{Treated}_i + \delta \big(\text{Treated}_i \times \text{Post}_t\big) + \mathbf{\beta_3} \mathbf{X}_{it} + \epsilon_{it}\]

Multiple time periods: The classic DiD setup compares two time points: pre-treatment and post-treatment. But DiD is easily generalized to panel data with multiple time periods. This lets you see how the treatment effect evolves over time — before and after treatment. Instead of one single \(\delta\), you have a vector of treatment effects \(\delta_\tau\) indexed by time since treatment:

\[Y_{it} = \alpha_i + \gamma_t + \sum_{\tau \geq 0} \delta_\tau \cdot 1(t = \tau) \cdot D_i + \boldsymbol{\beta} \mathbf{X}_{it} + \epsilon_{it}\]

Say you have data for 3 years after treatment starts (\(\tau\) =0,1,2):

By estimating a separate \(\delta_\tau\) for each year, you can see how the treatment effect evolves. For example, it may: peak immediately and decline; build up gradually; remain stable; or behave in some other pattern. This richer model is often called an event study or dynamic DiD because it studies the evolution of treatment effects over time.

Assumptions:

Parallel Trends:
The key identifying assumption of Difference-in-Differences (DiD) is that, in the absence of treatment, the average outcome for the treated group and the control group would have followed the same trend over time. Formally, if we let \(Y_{it}(0)\) be the potential outcome without treatment for unit \(i\) at time \(t\), then:

\[\mathbb{E}[Y_{it}(0) - Y_{i,t-1}(0) \mid \text{treated}] = \mathbb{E}[Y_{it}(0) - Y_{i,t-1}(0) \mid \text{control}]\]

In other words, we assume we can use the control as the counterfactual, so any differences in outcomes between treated and control groups before the treatment persist but do not change over time. If this assumption fails (e.g. the treated group is on a systematically different trajectory), the DiD estimate will be biased. In industry this is often an issue because the choice of who to treat is targeted (e.g. testing a new feature in a lower performing market to try to improve it). In practice, we might find a control by plotting pre-treatment trends in the outcome for each potential control vs the target group. If they appear parallel before treatment, the assumption is more credible:

DiD assumption plot

In addition to eyeballing it, we can fit a regression on the pre-treatment time period, using interaction terms (Treated x Time) to assess whether there is a pre-treatment difference in the slopes. Starting with a simple case of two time periods pre-treatment:

\[Y_{it} = \beta_0 + \beta_1 \, \text{Treated}_i + \beta_2 \, \text{Time}_t + \beta_3 \, (\text{Treated}_i \times \text{Time}_t) + \varepsilon_{it}\]
| Term | Value | Coefficient | Meaning |
| --- | --- | --- | --- |
| $\text{Treated}_i$ | 1 if unit $i$ is treated, 0 otherwise | $\beta_1$ | Measures the average difference in outcomes between treated and control groups at time = 0 (the baseline period). |
| $\text{Time}_t$ | 1 if time = 1 (later pre-period), 0 if time = 0 (baseline) | $\beta_2$ | Measures the average change in outcomes from time 0 to time 1 for the control group, that is, the baseline trend in the absence of treatment. |
| $\text{Treated}_i \times \text{Time}_t$ | 1 only if unit is treated and time = 1 | $\beta_3$ | Measures the difference in trends from time 0 to 1 between treated and control groups. Specifically, it compares the change for treated units to that for control units. A non-zero value means the groups had different trends before treatment, violating the parallel trends assumption. |

In this case, we could just use the t-test results for \(\beta_3\) to assess whether there is a significant difference. However, we can generalise this to multiple time periods and use an F-test to check if any of the interactions are significant:

\[Y_{it} = \beta_0 + \beta_1 \text{Treated}_i + \sum_{t=1}^T \beta_{2t} \, \text{Time}_t + \sum_{t=1}^T \beta_{3t} \, \big(\text{Treated}_i \times \text{Time}_t\big) + \varepsilon_{it}\]

Where:

  - \(\beta_{2t}\): the change in the control group at pre-treatment period \(t\) relative to the baseline period
  - \(\beta_{3t}\): the difference in that change between treated and control groups at pre-treatment period \(t\)
  - \(T\): the number of pre-treatment periods

In Python, after identifying the names of the interactions, you can write the hypothesis (\(H_0: \beta_{31} = \beta_{32} = \cdots = \beta_{3T} = 0\)) as a string and pass it to f_test in statsmodels (to restrict the test to only the interactions). If it’s significant, this indicates a violation of the assumption. If there are many pre-treatment time periods, one option is to aggregate them, or pick certain lags (e.g. -1 month, -2 months). However, keep in mind this only checks that the trends were parallel pre-treatment; it doesn’t rule out a post-treatment difference that is unrelated to treatment.
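
A sketch of the joint test with statsmodels (simulated pre-period data; here the restriction is built as a matrix rather than a hypothesis string, which avoids having to escape the interaction term names):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated pre-treatment panel with columns: y, treated (0/1), time (period index)
rng = np.random.default_rng(3)
pre = pd.DataFrame({
    "unit": np.repeat(np.arange(100), 3),
    "time": np.tile([0, 1, 2], 100),
})
pre["treated"] = (pre["unit"] < 50).astype(int)
pre["y"] = 1.0 + 0.3 * pre["time"] + 0.8 * pre["treated"] + rng.normal(scale=0.5, size=len(pre))

# Interact treatment group with time dummies over the pre-period
m = smf.ols("y ~ treated * C(time)", data=pre).fit()

# Joint F-test that all Treated x Time interactions are zero (parallel pre-trends)
names = list(m.params.index)
inter = [n for n in names if n.startswith("treated:")]
R = np.zeros((len(inter), len(names)))
for i, n in enumerate(inter):
    R[i, names.index(n)] = 1.0
print(m.f_test(R))   # a significant result indicates diverging pre-trends
```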

No Spillovers (Stable Unit Treatment Value Assumption - SUTVA):
The treatment assigned to one unit (e.g. one school zone) does not affect the outcomes of units in the control group. For example, if the new education policy in the treated zone indirectly improves resources or student outcomes in the control zone, this assumption is violated, potentially biasing estimates. This is harder to assess directly from the data, and relies instead on being familiar with the intervention and outcomes.

No Composition Change (No Differential Attrition or Migration):
The composition of the treated and control groups remains stable over time, and changes in group membership are not related to the treatment. For example, if after the policy starts, higher-performing students move into the treated zone (or control zone students move away), the comparison groups are no longer comparable, and the DiD estimator may be biased. If you have stable IDs over time, you can check for this by looking for units that appear in both groups or switch groups over time.

Causal strength

DiD is not suitable for all cases, but offers a simple way to control for unobserved, time-invariant confounders and common shocks when we have panel data (measurements over time), and the assumptions are plausible.

DiD is generally best for situations where:

Two-Way Fixed Effects (TWFE)

TWFE is essentially a regression implementation of DiD that generalizes to more than two time periods, staggered treatment timing (units get treated at different times), and continuous treatment variables. By adding the unit itself as a control to the model (i.e. a binary variable for each ID), we control for all of the things that are fixed for a unit even if we can’t measure them (e.g. over a short time period a person’s IQ is likely unchanged).

The TWFE model is typically written as:

\[Y_{it} = \alpha_i + \lambda_t + \tau D_{it} + \varepsilon_{it}\]

Where:

  - \(\alpha_i\): unit fixed effect (absorbs all time-invariant characteristics of unit \(i\))
  - \(\lambda_t\): time fixed effect (absorbs shocks common to all units in period \(t\))
  - \(D_{it}\): treatment status of unit \(i\) at time \(t\)
  - \(\tau\): the treatment effect
  - \(\varepsilon_{it}\): error term

When estimating this model in a regression, you don’t estimate \(\alpha_i\) and \(\lambda_t\) directly. Instead, they are encoded as dummy variables, dropping one to avoid perfect multicollinearity:

The resulting regression looks something like this:

\[Y_{it} = \tau D_{it} + \sum_{i=2}^{N} \beta_i \cdot \text{unit}_i + \sum_{t=2}^{T} \gamma_t \cdot \text{time}_t + \varepsilon_{it}\]

Demeaning (Within Transformation)

In practice, to avoid adding a dummy variable for every unit and time period (which can be computationally expensive when there are many units), we can use a method called “demeaning”. When you include dummy variables in a regression, you’re estimating and removing their average effect (or conditional average if there are additional covariates). Demeaning does exactly the same thing, but more efficiently: it subtracts the averages directly, without estimating a coefficient for each dummy.

Step 1: Compute Averages

\[\bar{Y}_i = \frac{1}{T_i} \sum_t Y_{it}, \quad \bar{D}_i = \frac{1}{T_i} \sum_t D_{it}\] \[\bar{Y}_t = \frac{1}{N_t} \sum_i Y_{it}, \quad \bar{D}_t = \frac{1}{N_t} \sum_i D_{it}\] \[\bar{Y} = \frac{1}{NT} \sum_{i,t} Y_{it}, \quad \bar{D} = \frac{1}{NT} \sum_{i,t} D_{it}\]

Step 2: Two-Way Demeaning

Create demeaned variables by subtracting the averages, which removes both \(\alpha_i\) and \(\lambda_t\). For a balanced panel, the two-way demeaned variables are \(\tilde{Y}_{it} = Y_{it} - \bar{Y}_i - \bar{Y}_t + \bar{Y}\) and \(\tilde{D}_{it} = D_{it} - \bar{D}_i - \bar{D}_t + \bar{D}\) (the grand mean is added back because it gets subtracted twice). Regressing on unit dummies controls for unit averages, so subtracting the unit means is a manual but more efficient way of removing the effect:

Step 3: Regress the demeaned outcome on the demeaned treatment

The de-meaning process yields the same estimate as running a full regression with one-hot dummies for each unit and time period:

\[\tilde{Y}_{it} = \tau \tilde{D}_{it} + \tilde{\varepsilon}_{it}\]
| Component | Explanation |
| --- | --- |
| \(\alpha_i\) | Removed by subtracting the unit average |
| \(\lambda_t\) | Removed by subtracting the time average |
| \(\tau\) | Estimated by regressing the demeaned \(\tilde{Y}_{it}\) on the demeaned \(\tilde{D}_{it}\) |
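
As a sketch, linearmodels’ PanelOLS applies this within transformation for you; the manual two-way demeaning at the end gives the same coefficient on a balanced panel. The simulated panel and names below are illustrative (all treated units adopt at the same time, so plain TWFE is appropriate here):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from linearmodels.panel import PanelOLS

# Simulated panel (common adoption date): 50 units x 10 periods, true effect = 1.0
rng = np.random.default_rng(4)
N, T = 50, 10
df = pd.DataFrame([(i, t) for i in range(N) for t in range(T)], columns=["unit", "time"])
alpha = rng.normal(size=N)[df["unit"]]     # unit fixed effects
lam = rng.normal(size=T)[df["time"]]       # time fixed effects
df["d"] = ((df["unit"] < 25) & (df["time"] >= 5)).astype(int)   # treated units, post period
df["y"] = alpha + lam + 1.0 * df["d"] + rng.normal(scale=0.5, size=len(df))

# PanelOLS with entity and time effects performs the two-way within transformation
panel = df.set_index(["unit", "time"])
res = PanelOLS.from_formula("y ~ d + EntityEffects + TimeEffects", data=panel).fit()
print(res.params["d"])

# Equivalent manual two-way demeaning (balanced panel)
for col in ["y", "d"]:
    df[col + "_dm"] = (df[col]
                       - df.groupby("unit")[col].transform("mean")
                       - df.groupby("time")[col].transform("mean")
                       + df[col].mean())
print(sm.OLS(df["y_dm"], df["d_dm"]).fit().params)
```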

Assumptions

There are the functional form assumptions:

But also the strict exogeneity assumptions:

Bias

Recent econometrics research showed that staggered DiD with TWFE is biased when treatment effects are heterogeneous (post-2018, notably Goodman-Bacon, Sun & Abraham, Callaway & Sant’Anna). This is especially problematic when different groups adopt treatment at different times (e.g., state policies rolled out in different years), or when treatment effects grow or shrink over time. These patterns are common in real-world interventions, for example, an e-commerce site rolling out a feature by country, with the effect of the feature taking time to fully materialize.

The core problem is that some units act as controls for others after they’ve already been treated, causing TWFE to assign negative weights to certain comparisons. If the treatment effect increases over time, this tends to bias the overall estimate downward; if the effect decreases over time, it biases the estimate upward. As a result, the TWFE estimate isn’t just a weighted average of the group-specific effects, it’s a distorted mix of comparisons that can even have the wrong sign if treatment effects vary enough.

Stagger example

For staggered rollouts, researchers have increasingly moved toward more robust alternatives that explicitly model treatment effect heterogeneity, such as the Callaway & Sant’Anna group-time ATT estimator, the Sun & Abraham interaction-weighted event study, and the Goodman-Bacon decomposition for diagnosing which comparisons drive a TWFE estimate.

Using these approaches allows you to estimate group-specific treatment effects over time and avoid the misleading averages produced by naïve TWFE models in heterogeneous settings.

Causal strength

TWFE models extend DiD to more flexible settings with multiple time periods and variation in treatment timing. They offer a simple way to control for unobserved time-invariant confounders (via unit fixed effects) and time shocks common across units (via time fixed effects), using a linear regression framework.

TWFE is especially useful when:

Synthetic controls

Synthetic controls overcome the problem of finding a single good control for a DiD model. Instead of selecting one comparison unit, we take a weighted average of multiple control units to create a synthetic control that closely matches the treated unit’s trend during the pre-intervention period.

For example, imagine an e-commerce company runs a test for a new product in 1 of the 100 cities it operates in. Instead of choosing one comparison city, which may still not perfectly match the treated city, we construct a weighted combination of the remaining 99 cities to form a synthetic version. We then use this synthetic control as a proxy for what would have happened to the treated city post-intervention if it had not been treated.

Synthetic outcome formula

\[\hat{Y}^N_{1t} = \sum_{j=2}^{J+1} w_j Y_{jt}\]

Where:

  - \(\hat{Y}^N_{1t}\): the synthetic (untreated) outcome for the treated unit at time \(t\)
  - \(Y_{jt}\): the outcome of control unit \(j\) at time \(t\), with units \(j = 2, \dots, J+1\) being the controls
  - \(w_j\): the weight given to control unit \(j\)

Finding the weights

The weights \(w_j\) are chosen to minimize the difference between the pre-treatment outcomes of the treated unit and its synthetic counterpart:

\[\min_{w} \sum_{t=1}^T \left( Y_{1t} - \sum_{j=2}^{J+1} w_j Y_{jt} \right)^2\]

Where \(Y_{1t}\) is the treated unit’s outcome at time \(t\), and the sum runs over the \(T\) pre-treatment periods only.

We can solve for these weights by running a regression where the units (for example, cities) are the features (columns) and the time periods are the observations (rows). However, this requires constrained least squares with:

  - non-negative weights (\(w_j \geq 0\) for all \(j\)), and
  - weights that sum to one (\(\sum_j w_j = 1\)).

These constraints ensure that the synthetic control stays within the range of observed data and avoids extrapolation.
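
A minimal sketch of the constrained least squares step with scipy (array shapes and names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def synth_weights(y_treated_pre, y_controls_pre):
    """
    y_treated_pre:  (T_pre,) pre-treatment outcomes of the treated unit
    y_controls_pre: (T_pre, J) pre-treatment outcomes of the J control units
    Returns non-negative weights that sum to 1.
    """
    J = y_controls_pre.shape[1]

    def loss(w):
        return np.sum((y_treated_pre - y_controls_pre @ w) ** 2)

    res = minimize(
        loss,
        x0=np.full(J, 1.0 / J),
        bounds=[(0.0, 1.0)] * J,
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
        method="SLSQP",
    )
    return res.x

# Illustrative usage: 99 control cities, 30 pre-treatment weeks (simulated)
rng = np.random.default_rng(5)
controls = rng.normal(size=(30, 99)).cumsum(axis=0)
treated = controls[:, :5].mean(axis=1) + rng.normal(scale=0.1, size=30)
w = synth_weights(treated, controls)
# The synthetic outcome in any period is then controls_at_t @ w
```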

Assessing significance: Placebo tests

Since we typically have few units (small sample size), standard errors from regression are unreliable. Instead, we use placebo tests: apply the same synthetic control procedure to each untreated unit as if it had been treated at the same time, then compare the treated unit’s estimated effect to this distribution of placebo effects. The rank of the real effect within the placebo distribution acts like a p-value.

This non-parametric test provides a practical way to assess whether the estimated treatment effect is meaningful or could plausibly have arisen by chance.

Bayesian structural time series

Another approach to estimating the effect of an intervention on time series data is Bayesian structural time series (BSTS), popularized by Google’s CausalImpact package.

Where synthetic controls build a weighted average of other units to form a comparison, BSTS builds a model-based forecast of what would have happened to the treated unit if the intervention had not occurred. It incorporates: a local trend component, seasonal components, and a regression on contemporaneous control series (covariates) that were not themselves affected by the intervention.

The estimated effect of the intervention is calculated as the difference between the observed data and the predicted counterfactual in the post-treatment period.

Model structure

A typical BSTS model looks like this:

\[Y_t = \mu_t + \tau_t + \beta X_t + \epsilon_t\]

Where:

  - \(\mu_t\): the local level / trend component
  - \(\tau_t\): the seasonal component
  - \(\beta X_t\): a regression on contemporaneous control series, typically with spike-and-slab priors on \(\beta\)
  - \(\epsilon_t\): observation noise

The spike-and-slab priors are especially useful when there are many potential control variables. The spike controls whether a variable is likely to be included, while the slab shrinks included variables toward zero to guard against overfitting and multicollinearity.

How it works

  1. Fit the model on the pre-intervention period to learn patterns in the time series.
  2. Forecast the counterfactual to generate predictions for the post-intervention period, representing what would have happened without the intervention.
  3. Compare the observed data to the predicted counterfactual to estimate the causal impact.
  4. Quantify uncertainty using Bayesian credible intervals derived from the posterior distribution of the forecast.
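
A minimal sketch of the workflow with one of the Python ports of the R package (e.g. pycausalimpact or tfcausalimpact, both imported as causalimpact; method names can differ slightly between ports). The file name, column layout, and dates are purely illustrative: the treated series must be the first column and the controls the remaining columns.

```python
import pandas as pd
from causalimpact import CausalImpact   # assumes a Python port of the R package is installed

# Illustrative data: DatetimeIndex, first column = treated market, rest = control markets
data = pd.read_csv("markets.csv", index_col=0, parse_dates=True)

pre_period = ["2023-01-01", "2023-06-30"]    # used to fit the model
post_period = ["2023-07-01", "2023-09-30"]   # intervention period to evaluate

ci = CausalImpact(data, pre_period, post_period)
print(ci.summary())   # estimated average and cumulative effect with credible intervals
ci.plot()             # observed vs counterfactual, pointwise and cumulative impact
```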

Pre-selecting the control pool

While CausalImpact uses spike-and-slab priors to automatically select the most predictive control markets for the structural time series model, if you have many potential controls it can be useful to first reduce the pool to more viable candidates. Pre-selecting the control pool improves computational efficiency, reduces noise from unrelated markets, and can help avoid issues with multicollinearity between highly correlated controls.

One practical approach is to select control markets that have demonstrated similar trends to the treated market during the pre-intervention period. Similarity can be measured using methods like correlation between the pre-period series or dynamic time warping (DTW) distance.

The MarketMatching package, developed by Stitch Fix, provides a convenient implementation of this idea. It uses DTW to score similarity between time series and select a subset of controls with trends closest to the treated unit. These selected markets can then be used as covariates in the CausalImpact model.

After pre-selecting controls, it’s still important to validate the model by testing its performance on placebo interventions in the pre-treatment period. This helps ensure that the final model is well-calibrated and reduces the risk of detecting spurious effects.

Interpreting the results

By default, CausalImpact produces a three-panel plot:

  1. Observed versus predicted, showing the actual series against the counterfactual forecast
  2. Pointwise impact, the estimated effect at each time point
  3. Cumulative impact, the running total of the effect over time

CausalImpact plot

Cumulative impact is most meaningful when the outcome represents a flow, such as daily sales. For metrics defined at a point in time, like the number of customers, the pointwise impact is often more interpretable.

Benefits

Limitations

Testing robustness

To assess robustness, placebo tests can be used:

Choosing between DiD / synthetic control / BSTS

| Method | Best for | Key assumptions | Strengths | Limitations |
| --- | --- | --- | --- | --- |
| Difference-in-Differences (DiD) | Many treated & control units with parallel trends | Parallel trends (or conditional on covariates) | Simple to implement, interpretable | Sensitive to trend differences, no flexible control selection |
| Synthetic Control | One treated unit, good set of comparison units | Treated can be approximated by weighted avg. of controls | Transparent, intuitive counterfactual | Requires good controls, limited flexibility in dynamics |
| Bayesian Structural Time Series (BSTS) | One treated unit, complex or noisy time series | Pre-period model well specified, stable covariate relationships | Rich time series dynamics, automatic control selection | More complex setup, relies on prior choices, sensitive to model misspecification |

Recommendation:

Regression Discontinuity

Regression Discontinuity Design (RDD) is a quasi-experimental method used for estimating causal effects when treatment is assigned based on a cutoff in a continuous variable (called the running variable or forcing variable). Because we focus on a narrow window around the cutoff, we assume that everything else affecting Y changes smoothly at the cutoff (continuity assumption), so any jump must be due to treatment. For example, if we decide anyone who scores above 80 on a test gets some reward, those who scored 80 are likely very similar to those who scored 81. Graphically, RDD looks like two regression lines, one on each side of the cutoff. The discontinuity (the jump) between those two lines at the cutoff is what we interpret as the causal effect of the treatment.

After centering the running variable, we can specify the Regression Discontinuity Design (RDD) model as:

\[Y_i = \alpha_0 + \alpha_1 \cdot \tilde{r}_i + \alpha_2 \cdot 1\{\tilde{r}_i > 0\} + \alpha_3 \cdot 1\{\tilde{r}_i > 0\} \cdot \tilde{r}_i + \varepsilon_i\]

Where:

  - \(\tilde{r}_i = r_i - c\): the running variable centered at the cutoff \(c\)
  - \(1\{\tilde{r}_i > 0\}\): a dummy indicating the observation is above the cutoff (i.e. treated)
  - \(\varepsilon_i\): error term

| Coefficient | Measures | Interpretation |
| --- | --- | --- |
| $\alpha_0$ | Baseline (untreated) | Expected outcome at cutoff (coming from below) |
| $\alpha_1$ | Pre-cutoff slope | Trend of outcome as $r_i \to c^-$ |
| $\alpha_2$ | Treatment effect at cutoff | Causal estimate: jump when we switch on the dummy (\(\lim_{r_i \to c^+} \mathbb{E}[y_i \mid r_i] - \lim_{r_i \to c^-} \mathbb{E}[y_i \mid r_i]\)) |
| $\alpha_3$ | Change in slope due to treatment | Difference in slopes (change when the cutoff dummy switches to 1) |

With our running variable (centered at 0), cutoff dummy, and their interaction, we’re able to fit the pre and post cutoff slopes:

RDD plot
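
A minimal sketch with statsmodels, using simulated data with a true jump of 3.0 at the cutoff; the kernel-weighted version at the end previews the local weighting discussed in the diagnostics section below (all names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated sharp RDD: reward for scores above 80, true jump at the cutoff = 3.0
rng = np.random.default_rng(6)
c = 80.0
df = pd.DataFrame({"r": rng.uniform(50, 110, 2_000)})
df["above"] = (df["r"] > c).astype(int)
df["r_c"] = df["r"] - c                    # centered running variable
df["y"] = (10 + 0.10 * df["r_c"] + 3.0 * df["above"]
           + 0.05 * df["above"] * df["r_c"]
           + rng.normal(scale=2.0, size=len(df)))

# alpha_2 (the 'above' coefficient) is the jump at the cutoff, i.e. the treatment effect
rdd = smf.ols("y ~ r_c * above", data=df).fit()
print(rdd.params["above"])

# Optional: emphasise points near the cutoff with triangular kernel weights within bandwidth h
h = 10.0
w = np.clip(1 - np.abs(df["r_c"]) / h, 0, None)
rdd_local = smf.wls("y ~ r_c * above", data=df, weights=w).fit()
print(rdd_local.params["above"])
```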

This can also be combined with earlier methods:

RD Alone RD + DiD (RDDiD) RD + Matching/Covariates
Controls for confounding at the cutoff only. Adds control for time trends in panel or repeated cross-section. Improves covariate balance, reduces residual variance.
Local effect at the threshold. Local difference in trends at the threshold. Local effect, but more efficient.
You believe continuity in potential outcomes holds and no time trend confounding is likely. There’s a policy or event at a specific time that might affect the groups even without treatment. You see imbalance in covariates near the cutoff, especially with small samples.

Diagnostics / improvements

Given we’re essentially fitting two regression slopes, it’s possible the model will focus too much on fitting all the points and not on the points at the cutoff (which is what our causal estimate measures). One solution to this is to give more weight to the points near the cutoff, emphasizing the part of the data most relevant for identifying the discontinuity. There are a few ways to do this, such as restricting the sample to a bandwidth around the cutoff, or weighting observations with a kernel (e.g. a triangular kernel) that decays with distance from the cutoff.

It’s good practice to try multiple bandwidths, kernels, and model specifications to check the sensitivity of your results.

McCrary Test

If units can precisely manipulate their position relative to the cutoff (e.g., by sorting or gaming eligibility), the assumption of continuity in potential outcomes may be violated, threatening identification. The McCrary test is a diagnostic for checking manipulation of the running variable around the cutoff. It tests whether there is a discontinuity in the density (distribution) of the running variable at the threshold by comparing densities just to the left and right of the cutoff. A significant jump suggests potential manipulation or sorting, while a smooth density supports the assumption that units cannot manipulate assignment precisely, strengthening the credibility of the RDD.

Causal strength

Under mild conditions (continuity of potential outcomes at cutoff), RD is as good as randomization near the threshold, making it very convincing when the assumptions hold. However, it only estimates the local average treatment effect (LATE) around the cutoff, not global effects (i.e. it doesn’t necessarily generalize to samples far from the cutoff).

Fuzzy RDD (non-compliance)

Standard (sharp) Regression Discontinuity Design (RDD) can be viewed as a special case of an Instrumental Variables (IV) design with perfect compliance at the cutoff. However, in many practical situations, treatment assignment is not perfectly determined by crossing the cutoff. Instead, the probability of receiving treatment increases discontinuously at the threshold, but some units above the cutoff might not receive treatment, and some below might receive it. This setting is known as fuzzy regression discontinuity design and can be estimated using an IV approach, where crossing the cutoff acts as an instrument for actual treatment receipt. The causal effect identified in fuzzy RDD is the local average treatment effect (LATE) for compliers, units whose treatment status is influenced by whether they cross the threshold.

Because the cutoff indicator \(Z_i = 1\{r_i \geq c\}\) only shifts the probability of receiving treatment \(D_i\) rather than determining it, we use \(Z_i\) as an instrument for \(D_i\) to estimate the causal effect of treatment on the outcome:

First stage: Model the effect of the instrument on treatment assignment:

\[D_i = \pi_0 + \pi_1 Z_i + f(r_i) + \epsilon_i^D\]

where \(f(r_i)\) flexibly models the relationship between the running variable and treatment (for example, separate slopes or polynomial terms on each side of the cutoff).

Second stage: Model the outcome as a function of predicted treatment:

\[Y_i = \beta_0 + \beta_1 \hat{D}_i + g(r_i) + \epsilon_i^Y\]

where \(\hat{D}_i\) is the fitted value of treatment from the first stage, and \(g(r_i)\) models the running variable in the outcome equation.

Since \(Z_i\) induces variation in \(D_i\) only near the cutoff \(c\), the 2SLS estimate \(\hat{\beta}_1\) recovers the local average treatment effect (LATE) for compliers at the threshold. This IV approach corrects for imperfect compliance, making fuzzy RDD a localized instrumental variables design.
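
A minimal sketch with IV2SLS (simulated data with a true effect of 2.0; in practice you would also restrict to a bandwidth around the cutoff and allow separate slopes on each side, as discussed above):

```python
import numpy as np
import pandas as pd
from linearmodels.iv import IV2SLS

# Simulated fuzzy RDD: crossing the cutoff raises the probability of treatment by ~50pp
rng = np.random.default_rng(8)
n = 4_000
r_c = rng.uniform(-20, 20, n)                      # centered running variable
z = (r_c > 0).astype(int)                          # cutoff indicator (instrument)
d = rng.binomial(1, 0.2 + 0.5 * z)                 # imperfect compliance
y = 5 + 0.1 * r_c + 2.0 * d + rng.normal(size=n)   # true effect of d is 2.0
df = pd.DataFrame({"y": y, "d": d, "z": z, "r_c": r_c})

# 2SLS: z instruments for d, controlling for the running variable
fuzzy = IV2SLS.from_formula("y ~ 1 + r_c + [d ~ z]", data=df).fit(cov_type="robust")
print(fuzzy.params["d"])   # LATE for compliers at the threshold
```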

Key assumptions:


Quasi-Experiment methods comparison

Quasi-experimental methods estimate causal effects by leveraging structural features in the data, such as policy shocks, eligibility cutoffs, or staggered adoption across units. These designs help approximate randomized experiments in settings where randomization is not possible. Some methods apply to cross-sectional data, where each unit is observed at a single point in time, while others require panel data, which tracks the same units over multiple periods. Methods like Synthetic Control or Bayesian Structural Time Series are designed for settings with a single treated unit, such as a country, product, or business unit affected by an intervention. In contrast, approaches like Difference-in-Differences and Two-Way Fixed Effects are better suited for multiple treated units, where treatment is introduced at different times or to different groups. These methods differ in the variation they rely on and often estimate local or group-average treatment effects, rather than a general population ATE. In practice, the choice of method is guided by the structure of the data and the nature of the intervention, making the appropriate approach relatively clear.

| Method | How It Works | Data Needed | Key Strengths | Key Limitations |
| --- | --- | --- | --- | --- |
| Instrumental Variables (IV) | Uses a variable (instrument) that shifts treatment but affects the outcome only through treatment. Commonly estimated with 2SLS. | Cross-sectional or panel | Identifies local ATE (LATE) for compliers; handles unobserved confounding. | Requires strong instrument validity assumptions; estimates a local effect, not always ATE. |
| Difference-in-Differences (DiD) | Compares pre/post changes in outcomes between treated and untreated groups, assuming parallel trends. | Panel | Straightforward ATE estimate if assumptions hold; controls for time-invariant confounding. | Assumes treated and control would have followed same trend; biased if trends differ. |
| Two-Way Fixed Effects (TWFE) | Extends DiD to multiple groups and time periods by modeling unit and time fixed effects. | Panel | Can estimate ATE across staggered adoption designs; widely used. | Can produce misleading ATEs with treatment effect heterogeneity; sensitive to specification. |
| Synthetic Controls | Constructs a weighted combination of control units to create a synthetic version of the treated unit, matching pre-treatment trends. | Panel (few treated units) | Transparent and intuitive; suitable for policy evaluation with a single treated unit. | Doesn’t generalize to population-level ATEs; not well suited for many treated units. |
| Bayesian Structural Time Series (BSTS) | Uses time series models with latent components to forecast counterfactuals for treated units. A Bayesian analog to synthetic control. | Panel (treated unit time series) | Captures uncertainty formally; flexible modeling of time trends. | Requires strong time series assumptions; typically used for single-unit interventions. |
| Regression Discontinuity (RDD) | Exploits a cutoff in a continuous assignment variable to compare units just above and below the threshold. | Cross-sectional (can be panel) | Identifies local ATE at the cutoff; intuitive design; credible under smoothness assumptions. | Only valid near the threshold; extrapolation to full ATE not justified. |

Treatment and outcome types

Most methods have well established guidance and examples for binary treatments with continuous outcomes. Continuous treatments require adaptations to estimate dose-response relationships, such as generalized propensity scores (GPS), non-linear IV, or fuzzy RDD. Similarly, binary outcomes are generally supported with appropriate modeling, but may require different estimation strategies (e.g. logistic regression instead of OLS):

Binary Outcomes: Special Considerations

Binary outcomes (such as click-through, conversion, or retention) require special treatment in causal inference due to their discrete, non-continuous nature. As noted above, both modeling choices and interpretation of effects differ meaningfully from continuous outcome settings:

These modeling and estimation challenges are compounded by several deeper statistical issues:

For ATE, aggregation over units often mitigates these issues. However, extra care is still required in both modeling and interpretation, and choosing the right scale (risk difference vs. odds ratio) is critical depending on the decision context.


Evaluation

There’s no definitive test to confirm whether a causal estimate is correct, because causal inference always involves estimating unobservable quantities. We only ever see one potential outcome per individual, so we can’t directly check our work. Most methods therefore rely on untestable assumptions, like no unmeasured confounding. As a result, evaluation is not about proving ground truth, but about assessing internal validity (whether the effect is credibly identified in the data) and external validity (whether the result generalizes beyond the study setting). A good approach is to actively seek scenarios where your assumptions could break down. If you try to disprove your conclusion and fail, you strengthen the case for a real effect. In the end, the goal is to build a well-supported argument using careful study design, domain knowledge, and a range of diagnostic tools.

Sensitivity Analysis

Sensitivity analysis assesses how much your estimated effect depends on specific modeling choices and assumptions by testing how the result changes under plausible alternatives. If small, reasonable changes in inputs or specifications lead to large swings in the result, this suggests the estimate may not be well-identified by the data.

Common checks include:

These tests do not prove that your estimate is correct, but they reveal whether the model behaves as expected under plausible variations. Sensitivity analysis helps assess the strength of your identification strategy and the reliability of the effect under key assumptions.

Consistency and Robustness Checks

In addition to sensitivity analysis, it is important to assess how well your results generalize across contexts, datasets, and methods. These broader checks are essential for evaluating both external validity and the overall credibility of the effect.

These checks complement sensitivity analysis by probing the robustness and transportability of the findings. Together, they help evaluate whether the estimated effect is not only statistically credible, but also meaningful and applicable in practice.

Quantifying the Effect of an Unmeasured Confounder

When you suspect a confounder is missing from your analysis, it’s useful to assess how strong it would need to be to change your conclusions. Below are several common approaches to quantify this vulnerability.

Simulating confounder strength

You can model a hypothetical unmeasured confounder $U$ by assigning it varying strengths of association with both the treatment $T$ and the outcome $Y$, then examine how the estimated treatment effect changes. The specific parameterization depends on whether you’re working with binary or continuous outcomes and treatments:
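As a rough illustration for a continuous outcome and binary treatment, the sketch below simulates a confounder $U$ with adjustable strengths of association with $T$ and $Y$, then refits a naive model that omits $U$. The coefficients, link function, and sample size are arbitrary choices for illustration, not a recommended parameterization.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000
true_effect = 1.0

def estimate_with_simulated_confounder(gamma_t, gamma_y):
    """Simulate U -> T and U -> Y with chosen strengths, then estimate the
    treatment effect while (incorrectly) omitting U from the model."""
    u = rng.normal(size=n)                                  # unmeasured confounder
    t = rng.binomial(1, 1 / (1 + np.exp(-gamma_t * u)))     # U shifts treatment odds
    y = true_effect * t + gamma_y * u + rng.normal(size=n)  # U also shifts the outcome
    naive = sm.OLS(y, sm.add_constant(t)).fit()             # omits U, as in practice
    return naive.params[1]

# Sweep confounder strengths and see how far the estimate drifts from the truth
for gamma in [0.0, 0.5, 1.0, 2.0]:
    print(f"gamma={gamma}: estimated effect = {estimate_with_simulated_confounder(gamma, gamma):.2f}")
```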

Tipping point analysis

Rather than simulate a range of confounder strengths, tipping point analysis solves for the minimum strength an unmeasured confounder would need to reduce the treatment effect to a chosen threshold, often the null (zero effect or RR = 1).

For linear models, this involves solving for the combinations of $R^2_{UT}$ and $R^2_{UY}$ that would produce a given bias. This defines a robustness frontier: the boundary of confounder strengths that would invalidate your inference. Software like the sensemakr R package automates this for regression settings.

E-value

The E-value is a specific tipping point metric for binary outcomes (or when reporting results on the risk ratio scale). It computes the minimum risk ratio that an unmeasured confounder would need to have with both the treatment and outcome, beyond what’s already adjusted for, to fully explain away an observed association.

If your observed effect is a risk ratio $\text{RR}_{TY} > 1$, the E-value is:

\[\text{E-value} = \text{RR}_{TY} + \sqrt{ \text{RR}_{TY} \cdot (\text{RR}_{TY} - 1) }\]

For example, if $\text{RR}_{TY} = 2$:

\[\text{E-value} = 2 + \sqrt{2} \approx 3.41\]

This means an unmeasured confounder would need a risk ratio of at least 3.41 with both treatment and outcome to explain away the result. You can also compute an E-value for the lower bound of the confidence interval, which gives a more conservative benchmark.
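The calculation is simple enough to implement directly; a minimal helper for the formula above (risk ratios below 1 are first inverted, following the usual convention):

```python
import math

def e_value(rr):
    """E-value for a point estimate on the risk ratio scale."""
    if rr < 1:
        rr = 1 / rr  # the formula assumes RR > 1, so invert protective effects
    return rr + math.sqrt(rr * (rr - 1))

print(e_value(2.0))  # ~3.41, matching the worked example above
```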

Rosenbaum bounds

For matching or instrumental variable studies, where unmeasured confounding cannot be addressed through standard covariate adjustment, Rosenbaum bounds assess how sensitive your results are to hidden bias in treatment assignment.

Instead of assuming perfectly matched pairs (that is, treatment is effectively random within matched sets), Rosenbaum bounds allow treatment assignment to depend on an unmeasured confounder up to a bounded degree. They compute a sensitivity parameter $\Gamma$ bounding how far the odds of treatment can differ for two units $i$ and $j$ with the same observed covariates:

\[\frac{1}{\Gamma} \leq \frac{\pi_i (1 - \pi_j)}{\pi_j (1 - \pi_i)} \leq \Gamma\]

where $\pi_i$ is unit $i$'s probability of receiving treatment.

You can then ask: at what value of $\Gamma$ does the treatment effect stop being statistically significant?

Summary

These approaches don’t require knowing what the missing confounder is, only reasoning about how strong it would need to be. And while they can’t recover missing data, they make uncertainty explicit and provide a framework for assessing robustness.


Conditional ATE (CATE) methods

Estimating the Average Treatment Effect (ATE) is often just the starting point. In many real-world applications, the treatment effect is not constant across all individuals. Some groups may benefit more than others, and for some, the treatment might even be harmful. Understanding this heterogeneity is critical for targeted interventions, personalised recommendations, and fair policies.

Heterogeneity (effect modifiers)

The average treatment effect (ATE) answers the question of whether we should roll out a treatment to the entire population, whereas estimating the conditional ATE (CATE) for different segments answers the question of who we should treat. Imagine we could observe Y at different values of T for each observation; then we could estimate their sensitivity to treatment from the slope of the line (\(\frac{dY}{dT}\)). If the slope is large (high elasticity) they are more responsive to treatment, and if it’s small (low elasticity) they’re not. We could then group people based on their sensitivity and implement the treatment accordingly.

Now we typically can’t observe the same observation under different treatments, as discussed above, but we can estimate this value from a sample. For example, suppose we want to estimate how sensitive demand (Y) is to price (T), i.e. the price elasticity of demand. To do this, we could randomly assign users to see the same product at slightly different prices within a reasonable range, say ±10% of the base price. Fitting a linear regression model, we can get the ATE directly from \(\beta_1\) (the average price elasticity):

\[Y_i = \beta_0 + \beta_1 T_i + \epsilon_i\] \[\frac{dY_i}{dT_i} = \beta_1\]

Adding interaction terms allows the treatment sensitivity (elasticity) to vary for different groups:

\[Y_i = \beta_0 + \beta_1 T_i + \beta_2 X_{i1} + \beta_3 X_{i2} + \beta_4 (T_i \cdot X_{i1}) + \beta_5 (T_i \cdot X_{i2}) + \epsilon_i\]

Where:

The Conditional Average Treatment Effect (CATE) for observation i is then the partial derivative of \(Y_i\) with respect to \(T_i\), conditional on X (using the interaction terms):

\[\frac{dY_i}{dT_i} = \beta_1 + \beta_4 X_{i1} + \beta_5 X_{i2}\]

Now the marginal effect of T depends on the user’s covariates (income and purchase frequency), allowing you to segment users by elasticity and personalize pricing or treatment accordingly. If we plot Y vs X for each level of T, the lines are no longer parallel because we’ve allowed the slope to vary based on the levels of X. For example:

[Figure: heterogeneous effect example]
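A minimal sketch of the interaction approach above with statsmodels; the columns income and purchase_freq are placeholder names for the two effect modifiers \(X_1\) and \(X_2\), and the simulated data is purely illustrative:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data: price (T), two candidate effect modifiers, and demand (Y)
rng = np.random.default_rng(1)
n = 5_000
df = pd.DataFrame({
    "price": rng.uniform(0.9, 1.1, n),          # randomised price multiplier (+/- 10%)
    "income": rng.normal(0, 1, n),
    "purchase_freq": rng.normal(0, 1, n),
})
df["demand"] = (10 - 2 * df["price"] + 0.5 * df["price"] * df["income"]
                + rng.normal(0, 1, n))

# Interaction terms let the price sensitivity vary with the modifiers
model = smf.ols("demand ~ price * (income + purchase_freq)", data=df).fit()

# CATE (elasticity) per row: beta_1 + beta_4 * income + beta_5 * purchase_freq
df["cate"] = (model.params["price"]
              + model.params["price:income"] * df["income"]
              + model.params["price:purchase_freq"] * df["purchase_freq"])
print(df["cate"].describe())
```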

Variables that change the magnitude or direction of the treatment effect across their levels are called effect modifiers. A surrogate effect modifier is a variable that isn’t a true modifier itself, but is correlated with the true effect modifier and used as a proxy. These are useful when the true effect modifier is unobserved or difficult to measure; for example, click behaviour could be a surrogate for a latent trait like motivation. Existing literature or user research can be useful for validating surrogates when you don’t have data on the true modifier.

Evaluation

As mentioned for ATE, there’s no direct test to confirm individual CATE estimates are correct, because we never observe both potential outcomes for any single observation. Individual treatment effects are fundamentally unobservable. However, we can evaluate models of heterogeneity using group-level or out-of-sample checks. The best strategy, when feasible, is to use randomized experimental data for the evaluation. While it may not be possible for model training, running smaller-scale experiments solely for validation is often achievable. Using the randomised treatment, we’re able to get a reliable estimate of the ATE for groups of observations by comparing the treated to the control. Practical approaches for this include:

[Figure: heterogeneous effect evaluation]

Note: if we use regression to compute the sensitivity, we can also easily add a confidence interval using the standard error.

Without randomised data, a common approach is to use cross-validation: train your heterogeneity model on one part of the data, then evaluate how predicted CATEs correspond to observed differences in a holdout set. While this helps us understand whether the results are stable, it doesn’t tell us whether we’re capturing the causal effect (i.e. there can still be bias). Similarly, model performance metrics tell us how well we’re fitting the data. This is useful to avoid overfitting (using cross-validation), but it doesn’t correspond with the accuracy of the causal estimate. For example, a strong \(R^2\) means we’ve fit the data well, but it doesn’t tell us whether we’re capturing the causal relationship of interest (i.e. it could be biased due to confounding).

Ultimately, strong model metrics, good group-level correspondence, and favorable uplift curves based on randomised data provide practical confidence that the heterogeneity model is capturing useful variation for decision-making, even if individual CATEs remain uncertain.

Meta learners

The example above used linear regression to show how we can reframe predictive modeling, not to predict raw outcome values ($Y$), but to predict how the outcome changes with treatment (\(\frac{dY_i}{dT_i}\)). The challenge for applying more flexible or complex machine learning models is that the ground truth isn’t observed for each observation (i.e. we only observe one potential outcome). Yet to apply these models directly, we need to define a target that reflects the Conditional Average Treatment Effect (\(\tau(x)\)) instead of the raw Y values. For binary treatments, one effective solution to this is target transformation (F-learners).

Given:

We define the transformed outcome as:

\[\tilde{Y}_i = \frac{Y_i - \bar{Y}}{p(1 - p)} \cdot (T_i - p)\]

With random assignment (or unconfoundedness), this transformation satisfies:

\[\mathbb{E}[\tilde{Y} \mid X = x] = \tau(x)\]

This allows us to train any regression or machine learning model to predict \(\tilde{Y}_i\), and the model’s predictions will approximate \(\tau(x)\), the individual-level treatment effect. This approach is called an F-learner: you transform the outcome and then fit a model to predict it. However, the transformed outcome \(\tilde{Y}_i\) is often noisy, especially when treatment assignment is highly imbalanced, so this method works best with large datasets where the noise averages out.
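A minimal sketch of the transformed-outcome approach, assuming a randomised binary treatment (here p is simply estimated as the treated share) and a simulated dataset with a known heterogeneous effect:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_f_learner(X, y, t):
    """Fit a model to the transformed outcome so predictions approximate CATE."""
    p = t.mean()                                    # P(T=1), assuming randomisation
    y_tilde = (y - y.mean()) / (p * (1 - p)) * (t - p)
    model = GradientBoostingRegressor()
    model.fit(X, y_tilde)
    return model                                    # model.predict(X) ~= tau(x)

# Toy data with a known heterogeneous effect: tau(x) = 1 + x0
rng = np.random.default_rng(2)
n = 20_000
X = rng.normal(size=(n, 3))
t = rng.binomial(1, 0.5, n)
y = X[:, 0] + (1 + X[:, 0]) * t + rng.normal(size=n)

cate_model = fit_f_learner(X, y, t)
print(cate_model.predict(X[:5]))  # noisy estimates of 1 + x0 for the first rows
```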

This is just one example of a meta learner for CATE estimation, with “meta” referring to the fact that they wrap around standard supervised learning models to estimate CATE.

Learner What it Estimates Models Fit Formula for CATE \(\hat{\tau}(x)\) Pros Cons
S-learner \(\mathbb{E}[Y \mid X, T]\) Single model \(\hat{f}(X, T)\) (fit model predicting \(Y\) using \(X\) and \(T\) as features) \(\hat{\tau}(x) = \hat{f}(x, 1) - \hat{f}(x, 0)\) Simple; works with any supervised ML model Can underfit heterogeneity, especially if regularization shrinks or drops \(T\) coefficient
T-learner \(\mathbb{E}[Y \mid X=x, T=1]\) and \(\mathbb{E}[Y \mid X=x, T=0]\) Two models: \(\hat{\mu}_1(x) = \mathbb{E}[Y \mid X=x, T=1]\) and \(\hat{\mu}_0(x) = \mathbb{E}[Y \mid X=x, T=0]\) \(\hat{\tau}(x) = \hat{\mu}_1(x) - \hat{\mu}_0(x)\) Flexible; allows different complexity for treatment/control groups Needs sufficient \(n\) in both groups; imbalance can distort differences
X-learner Weighted combination of imputed individual treatment effects 1. Fit \(\hat{\mu}_1(x), \hat{\mu}_0(x)\) as in T-learner
2. Impute: \(D_1 = Y - \hat{\mu}_0(x), \quad D_0 = \hat{\mu}_1(x) - Y\)
3. Fit \(g_1(x), g_0(x)\) to imputed ITEs
4. Estimate \(e(x) = P(T=1 \mid X=x)\)
\(\hat{\tau}(x) = e(x) \cdot g_0(x) + (1 - e(x)) \cdot g_1(x)\) Handles imbalance well; leverages both groups Requires multiple models; sensitive to propensity or outcome model misspecification
F-learner \(\mathbb{E}[\tilde{Y} \mid X = x]\), with transformed outcome \(\tilde{Y}\) Model \(\hat{g}(x) \approx \mathbb{E}[\tilde{Y} \mid X = x]\), where \(\tilde{Y} = \frac{Y - \bar{Y}}{p(1-p)}(T - p), \quad p = P(T=1)\) \(\hat{\tau}(x) = \hat{g}(x)\) Works with any ML model on transformed \(\tilde{Y}\) High variance; works better with large datasets
R-learner Orthogonalized residual treatment effect 1. Estimate: \(\hat{m}(x) = \mathbb{E}[Y \mid X=x]\), \(\hat{e}(x) = P(T=1 \mid X=x)\)
2. Minimize: \(\sum_i ( (Y_i - \hat{m}(X_i)) - (T_i - \hat{e}(X_i)) g(X_i) )^2\)
\(\hat{\tau}(x) = \hat{g}(x)\) Reduces bias from nuisance parameters; integrates well with orthogonalization and DML Requires reliable nuisance models; more computationally intensive
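To make the table concrete, here is a minimal T-learner sketch on the same style of toy data used for the F-learner above (an S-learner would instead fit one model with the treatment included as a feature); any supervised regressor could stand in for the gradient boosting models:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_t_learner(X, y, t):
    """Fit separate outcome models for treated and control, then difference them."""
    mu1 = GradientBoostingRegressor().fit(X[t == 1], y[t == 1])
    mu0 = GradientBoostingRegressor().fit(X[t == 0], y[t == 0])
    return lambda X_new: mu1.predict(X_new) - mu0.predict(X_new)

# Toy data with tau(x) = 1 + x0, as in the F-learner sketch
rng = np.random.default_rng(3)
n = 20_000
X = rng.normal(size=(n, 3))
t = rng.binomial(1, 0.5, n)
y = X[:, 0] + (1 + X[:, 0]) * t + rng.normal(size=n)

cate = fit_t_learner(X, y, t)
print(cate(X[:5]))  # noisy estimates of 1 + x0
```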

Causal Forests

Causal forests are a nonparametric method for estimating heterogeneous treatment effects. They extend random forests to the causal inference setting, using modified tree-splitting rules that are sensitive to treatment effect variation to estimate CATE.

The algorithm proceeds in two main phases:

  1. Honest splitting: The data is randomly divided into two subsets. One subset is used to build trees (i.e. decide how to split), and the other is used to estimate treatment effects within each leaf. This prevents overfitting and enables valid inference.

  2. Splitting criterion: Instead of minimizing prediction error (as in standard regression trees), splits are chosen to maximize heterogeneity in treatment effects between child nodes. This focuses the model on capturing meaningful variation in treatment response.

Each tree estimates local treatment effects in its leaves, and the forest averages these estimates to produce a smooth CATE function $\hat{\tau}(x)$.

Causal forests make minimal parametric assumptions and can flexibly handle complex interactions and nonlinearities. They also support valid variance estimates for CATEs, enabling hypothesis testing and confidence intervals at the individual or group level.
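If you use a library rather than implementing this yourself, the sketch below assumes the econml package’s CausalForestDML estimator; constructor arguments and defaults may differ between versions, so treat this as a starting point rather than a recipe.

```python
import numpy as np
from econml.dml import CausalForestDML
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Toy data with tau(x) = 1 + x0
rng = np.random.default_rng(4)
n = 5_000
X = rng.normal(size=(n, 3))
T = rng.binomial(1, 0.5, n)
Y = X[:, 0] + (1 + X[:, 0]) * T + rng.normal(size=n)

# Nuisance models for outcome and treatment; honest splitting is handled internally
est = CausalForestDML(
    model_y=RandomForestRegressor(min_samples_leaf=20),
    model_t=RandomForestClassifier(min_samples_leaf=20),
    discrete_treatment=True,
)
est.fit(Y, T, X=X)
tau_hat = est.effect(X)                      # point estimates of CATE
lo, hi = est.effect_interval(X, alpha=0.05)  # confidence intervals per observation
```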

BART (Bayesian Additive Regression Trees)

BART is a Bayesian nonparametric method that models complex functions by summing across many shallow regression trees. For causal inference, BART is typically applied to estimate potential outcomes $\hat{Y}(1)$ and $\hat{Y}(0)$ separately, allowing the CATE to be computed as their difference.

Key features of BART for CATE estimation:

A typical BART setup for causal inference models $Y \sim f(X, T)$, allowing the treatment effect to vary with covariates. Alternatively, you can estimate $f_1(X) = E[Y \mid T=1, X]$ and $f_0(X) = E[Y \mid T=0, X]$ separately, then take $\hat{\tau}(X) = f_1(X) - f_0(X)$.

BART is particularly strong when you care about uncertainty-aware estimates or want a smooth, nonparametric model of individual treatment effects.

Representation learning methods (TARNet / CFRNet / DragonNet)

Representation learning methods for CATE estimation use neural networks to learn a shared feature representation that captures relevant structure in the covariates before estimating potential outcomes. The goal is to improve generalization by balancing treated and control groups in a learned latent space, especially in the presence of covariate shift.

TARNet (Treatment-Agnostic Representation Network) is the baseline architecture: it first passes covariates through a shared representation network and then branches into two outcome heads—one for treated and one for control units.

CFRNet (Counterfactual Regression Network) builds on TARNet by adding a regularization term (typically based on Integral Probability Metrics like MMD or Wasserstein distance) that explicitly encourages the learned representation to balance the distribution of treated and control units.

DragonNet extends this further by jointly predicting the propensity score within the network and incorporating a targeted regularization loss to improve bias correction and efficiency.

These methods are flexible and can model complex, non-linear treatment effect heterogeneity. However, they typically require large amounts of training data and careful hyperparameter tuning. Because of their black-box nature, interpretability is limited compared to tree-based models. Despite these challenges, representation learning approaches have become popular for personalized intervention tasks and counterfactual prediction when large observational datasets are available.
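As a rough sketch of the TARNet idea (not a faithful reproduction of any published implementation), a shared representation with two outcome heads in PyTorch might look like the following; the layer sizes and training loop are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

class TARNet(nn.Module):
    def __init__(self, n_features, hidden=64):
        super().__init__()
        # Shared representation learned from covariates
        self.repr = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Separate outcome heads for control (y0) and treated (y1)
        self.head0 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.head1 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        z = self.repr(x)
        return self.head0(z).squeeze(-1), self.head1(z).squeeze(-1)

def train_step(model, opt, x, t, y):
    """One gradient step on the factual outcomes only."""
    y0_hat, y1_hat = model(x)
    y_hat = torch.where(t == 1, y1_hat, y0_hat)   # pick the head matching the treatment
    loss = nn.functional.mse_loss(y_hat, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# After training, CATE predictions are tau(x) = y1_hat - y0_hat.
# CFRNet would add a balancing penalty (e.g. MMD between treated and control
# representations) to the loss; DragonNet would add a propensity head.
```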

Comparison

Method How It Works Data Needed Key Strengths Key Limitations
Meta-Learners (S-, T-, X-, F-) Use standard ML models in structured ways to estimate treatment effects, e.g., separate models per group or using transformed targets. Cross-sectional Modular; can plug in any ML model; widely used and understood. Susceptible to overfitting; no built-in uncertainty quantification.
R-Learner Residualizes both outcome and treatment, then regresses the residuals to estimate treatment effects. Cross-sectional Orthogonalization reduces bias; compatible with flexible models. Sensitive to errors in nuisance models; more complex than meta-learners.
Causal Forests (GRF) Builds tree-based ensemble models that split to maximize treatment effect heterogeneity using “honest” sample splitting. Cross-sectional or panel Captures non-linear, non-additive heterogeneity; includes uncertainty estimates. Less interpretable; slower training; may require large samples.
BART (Bayesian Additive Regression Trees) Bayesian ensemble of trees modeling both outcomes and treatment heterogeneity, with posterior uncertainty. Cross-sectional or panel Provides full posterior over effects; robust to model misspecification. Computationally intensive; harder to scale to very large datasets.
Representation Learning (TARNet / CFRNet / DragonNet) Use deep neural networks to learn shared representations of covariates before modeling treatment-specific outcomes. CFRNet adds balancing loss; DragonNet also predicts propensity scores. Cross-sectional (large N) Flexible architectures; balance improves generalization; end-to-end training. Requires large training data; tuning is complex; lower interpretability.
GANITE Uses Generative Adversarial Networks to model counterfactual outcomes directly, via adversarial training. Cross-sectional (large N) Can explicitly generate counterfactuals; models complex data distributions. Hard to train and tune; interpretability is low; GANs require care to avoid mode collapse.

General CATE estimation challenges

Estimating heterogeneous treatment effects (CATE) is essential for personalization and targeting, but it is substantially harder and noisier than estimating overall average treatment effects (ATE). The fundamental difficulty is that we only observe one outcome for each individual. This means CATE is estimated in expectation over groups of individuals with similar covariates, not at the individual level.

Several challenges follow:

The issues with binary outcomes are not specific to CATE (as covered earlier), but their impact is more pronounced when estimating heterogeneous effects. For ATE, the averaging helps smooth out variance and nonlinearities to some extent. For CATE, the nonlinearity, baseline dependence, and variance can distort effect estimates and create false signals of heterogeneity.

Focusing on Decisions, Not Just Estimates

Fortunately, we usually do not need precise individual-level CATE estimates. In practice, we want good decisions about who to prioritize. Several strategies help shift focus from noisy point estimates to decision-relevant metrics:

By emphasizing stable rankings and group-level effects over noisy individual estimates, you can still make effective personalized decisions even when CATE estimates are imperfect.


Longitudinal Methods

In many applied settings, data are collected repeatedly over time for the same individuals or units, creating longitudinal datasets. Analyzing longitudinal data introduces unique challenges such as time-varying confounding, informative loss to follow-up, and complex outcome structures like time-to-event data. Addressing these issues requires specialized methods that properly account for the evolving relationships between treatments, covariates, and outcomes over time. In this section, we review key longitudinal causal methods designed to handle these challenges.

Time-Varying Confounding

In many real-world settings, both treatments and confounders change over time. A user’s health status, engagement level, or behavior might influence future treatment, and also future outcomes. This creates a feedback loop where past treatment affects future confounders, which in turn affect subsequent treatment. If your goal is to measure the cumulative effect of treatment (e.g. the effect of repeated notifications), then specialised methods are necessary.

For example, suppose you’re interested in whether mobile notifications increase weekly user engagement (e.g. minutes of app use, or days active). You observe users across multiple weeks, and each week they may or may not receive a push notification. However, users who were less engaged in the prior week are more likely to be targeted for a notification in the current week. At the same time, past engagement is predictive of current engagement. This makes prior engagement a time-varying confounder: it affects both the treatment (notifications) and the outcome (future engagement), but is itself influenced by earlier treatment. Naively adjusting for a time-varying confounder in a static model can block part of the treatment effect or introduce collider bias, leading to biased estimates. This is similar to the issue of controlling for post treatment variables with a single treatment, but in longitudinal settings with repeated treatments, this issue becomes more complex and common, generally requiring specialized methods.

Simplification: single treatment point

Starting with the simplest option, in some cases you can mitigate time-varying confounding by selecting a single treatment time point per user. Using the notifications example:

This setup lets you approximate a static causal effect, avoiding the key pitfall of time-varying confounding, namely, that your confounder is influenced by prior treatment. However, there are a number of limitations:

If you’re okay estimating the effect of a one-time treatment, and can clearly define pre-treatment covariates, then selecting a single point may be a reasonable simplification.

G-methods

If your goal is to estimate the cumulative or dynamic effect of repeated treatments (e.g. the effect of receiving multiple notifications over time), then you can’t just control for all covariates at the final time step, or even at each step, without risking bias. G-methods can account for the feedback loops that occur when prior treatment influences future covariates, which in turn affect subsequent treatment and outcomes (i.e. time-varying confounding). Common G-methods for longitudinal problems include: G-computation (or the G-formula), Inverse Probability Weighting (IPW), and Marginal Structural Models (MSMs).

All of these require aligning treatment, covariate, and outcome data longitudinally, and they rely on the same core assumptions: positivity, consistency, and sequential exchangeability (no unmeasured confounding at each time).

Before applying any of these, you’ll need to:

G-computation (or the G-formula)

As in the static case, G-computation estimates causal effects by explicitly modeling outcomes under different treatment regimes. However, in longitudinal settings, we simulate sequences of outcomes over time. G-computation avoids conditioning on post-treatment variables directly by simulating what would happen under full treatment strategies (e.g. “always treat,” “never treat,” or “treat only if inactive last week”).

The key idea is:

  1. Model the outcome at each time step as a function of past treatment and covariate history:

    \[E[Y_t \mid A_{0:t}, L_{0:t}]\]
  2. Model covariate evolution (if needed): If covariates are affected by past treatment, model them too:

    \[P(L_t \mid A_{0:t-1}, L_{0:t-1})\]
  3. Simulate counterfactual outcomes under hypothetical treatment strategies: For each user, simulate forward from baseline to generate treatment, covariate, and outcome paths consistent with a given policy:

    • At each time step:

      • Set treatment $A_t$ based on the strategy
      • Predict $L_t$ using the covariate model (if applicable)
      • Predict $Y_t$ using the outcome model
  4. Average outcomes across users to estimate the causal effect of each strategy:

    \[\hat{E}[Y^{\bar{a}}] = \frac{1}{n} \sum_{i=1}^{n} \hat{Y}_i^{\bar{a}}\]

Example

Suppose you want to estimate the effect of sending a notification each week for two weeks on engagement in week 3. Each user has:

We do not model engagement in week 2 because we assume the final outcome $Y$ depends only on prior engagement and treatment, not directly on engagement in week 2. This simplifies simulation by removing the need to predict $L_2$.

Steps:

  1. Fit the outcome model:
    Model $Y$ as a function of $[L_0, A_1, L_1, A_2]$ (e.g., logistic regression). This allows you to simulate $Y$ under any counterfactual history.

  2. Fit the covariate model:
    Model $L_1$ as a function of $[L_0, A_1]$ (e.g., logistic regression). This allows you to simulate week 1 engagement under different treatments.

  3. Simulate outcomes for each user and treatment strategy:

    • Always-treat strategy ($A_1 = 1$, $A_2 = 1$):
      1. Set $A_1 = 1$
      2. Predict $\hat{L}_1 \sim P(L_1 \mid L_0, A_1 = 1)$
      3. Set $A_2 = 1$
      4. Predict $\hat{Y} \sim P(Y \mid L_0, A_1 = 1, \hat{L}_1, A_2 = 1)$
    • Never-treat strategy ($A_1 = 0$, $A_2 = 0$):
      Repeat the steps above with $A_1 = 0$ and $A_2 = 0$

    Optionally, repeat the simulation multiple times per user and average the results to reduce randomness from sampling.

  4. Estimate causal effects:
    Average predicted $Y$ across users for each strategy. Then take the difference:

    \[\hat{E}[Y \mid \text{always treat}] - \hat{E}[Y \mid \text{never treat}]\]
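A minimal sketch of these four steps for binary engagement indicators, assuming a DataFrame df with one row per user and placeholder columns L0, A1, L1, A2, Y; the logistic models and simulation count are illustrative choices:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def g_computation(df, a1, a2, n_sims=50, seed=0):
    """Average simulated week-3 engagement under the fixed strategy (A1=a1, A2=a2)."""
    rng = np.random.default_rng(seed)

    # Step 1: outcome model  Y ~ L0, A1, L1, A2
    y_model = LogisticRegression().fit(df[["L0", "A1", "L1", "A2"]], df["Y"])
    # Step 2: covariate model  L1 ~ L0, A1
    l1_model = LogisticRegression().fit(df[["L0", "A1"]], df["L1"])

    sims = []
    for _ in range(n_sims):
        # Step 3: simulate forward from baseline, overriding the observed treatment
        sim = pd.DataFrame({"L0": df["L0"].to_numpy(), "A1": a1})
        p_l1 = l1_model.predict_proba(sim[["L0", "A1"]])[:, 1]
        sim["L1"] = rng.binomial(1, p_l1)
        sim["A2"] = a2
        p_y = y_model.predict_proba(sim[["L0", "A1", "L1", "A2"]])[:, 1]
        sims.append(p_y.mean())  # Step 4: average over users

    return float(np.mean(sims))

# Effect of "always treat" vs "never treat" on week-3 engagement:
# ate = g_computation(df, a1=1, a2=1) - g_computation(df, a1=0, a2=0)
```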

Note that we do not model how treatment was originally assigned (e.g. whether $A_1$ depends on $L_0$, which it very well may). Instead, we explicitly set treatment values according to predefined strategies, overriding the observed treatment. This means we do not require $A_1$ to be independent of baseline covariates like $L_0$.

However, the example above assumes that the treatment strategy is fixed in advance (e.g. always treat or never treat). In practice, treatment decisions are often dynamic (based on user behavior). For instance, you might send a notification in week 2 only if the user was inactive in week 1. In such cases, treatment depends on simulated covariates like $\hat{L}_1$, and the simulation must be adapted to reflect that dependency.

When treatment depends on simulated covariates

If your treatment rule depends on user behavior, and user behavior depends on past treatment, then you can’t simulate one without the other. You must recursively simulate covariates and then use them to apply your treatment rule at each time step. For example:

Suppose your strategy is: “notify in week 2 only if the user was inactive in week 1.”
You must simulate the user’s week 1 engagement before you can assign treatment in week 2.

Steps:

  1. Start from baseline $L_0$ and assign $A_1$
  2. Predict $\hat{L}_1 \sim P(L_1 \mid L_0, A_1)$
  3. Apply policy: if $\hat{L}_1 = 0$ (inactive), then set $A_2 = 1$; else $A_2 = 0$
  4. Use $[L_0, A_1, \hat{L}_1, A_2]$ to predict $Y$

Repeat this process for each user. Then average predicted outcomes to estimate effects under this policy. Compare against other strategies as before.

When the outcome spans multiple time points

If your outcome is measured over time (e.g., weekly engagement from week 1 to week 4), then you must simulate the outcome recursively at each time step. Each prediction depends on the full simulated treatment and covariate history up to that point. For example, suppose your outcome is engagement in weeks 1–3, and you want to simulate the effect of sending notifications each week. You must predict $Y_1$, $Y_2$, and $Y_3$ in sequence, conditioning each on the relevant treatment and covariates observed so far.

Steps:

  1. Start from baseline covariates $L_0$
  2. Assign $A_1$ (e.g. always treat)
  3. Predict $\hat{L}_1 \sim P(L_1 \mid L_0, A_1)$
  4. Predict $\hat{Y}_1 \sim P(Y_1 \mid L_0, A_1, \hat{L}_1)$
  5. Assign $A_2$ based on your strategy (e.g., always treat or policy rule)
  6. Predict $\hat{L}_2 \sim P(L_2 \mid \hat{L}_1, A_2)$
  7. Predict $\hat{Y}_2 \sim P(Y_2 \mid \hat{L}_1, A_2, \hat{L}_2, \hat{Y}_1)$
  8. Repeat for $A_3$, $\hat{L}_3$, and $\hat{Y}_3$

At the end, you’ll have a full simulated outcome trajectory under each treatment strategy:

$[\hat{Y}_1, \hat{Y}_2, \hat{Y}_3]$

You can then define your final outcome as: a specific time point (e.g. $\hat{Y}_3$); a cumulative measure (e.g. total weeks engaged: $\hat{Y}_1 + \hat{Y}_2 + \hat{Y}_3$); or a binary indicator (e.g. ever engaged). The table below links some example causal questions to the corresponding final estimate:

Causal Question Final Estimate
What’s the effect of always treating on engagement in week 8? $\hat{E}[Y_8^{\bar{a}}]$
What’s the effect of always treating on total engagement over 8 weeks? $\hat{E}\left[\sum_{t=1}^8 Y_t^{\bar{a}}\right]$
What’s the effect on probability of engaging in any week? $\hat{E}\left[\mathbb{I}\left(\max_t Y_t^{\bar{a}} = 1\right)\right]$
What’s the effect on number of active weeks? $\hat{E}\left[\sum_t \mathbb{I}(Y_t = 1)\right]$
What’s the probability of being engaged in every week? $\hat{E}\left[\mathbb{I}(Y_1 = 1 \land \dots \land Y_8 = 1)\right]$

Repeat the simulation across users and strategies, then compare average outcomes as before.

Model choice

As in the static case, this approach is more of a framework. Any predictive model that can estimate the conditional expectation of the outcome and time-varying covariates can be used. Logistic regression is a standard choice when the outcome is binary, and linear regression is often used for continuous outcomes. However, more flexible models such as random forests, gradient boosting, or neural networks can also be used, especially when you expect non-linear relationships or complex interactions. The key requirement is that the models are well-calibrated and accurately capture the conditional distributions needed for simulating counterfactual outcomes.

Estimating Uncertainty

Since G-computation involves multiple modeled steps and stochastic simulations, it’s good practice to use bootstrapping and nested simulations to estimate confidence intervals and capture both model and simulation uncertainty.

IPW (Inverse Probability Weighting)

Inverse Probability Weighting (IPW) is an alternative to G-computation for handling time-varying confounding in longitudinal data. The core idea is to create a pseudo-population in which treatment assignment at each time point is independent of measured confounders. This allows for unbiased estimation of causal effects by effectively balancing the treated and untreated groups over time. IPW relies on the assumptions of positivity (every individual has a nonzero probability of receiving each treatment level at each time), consistency, and sequential exchangeability (no unmeasured confounding at each time step). Correctly specifying the treatment assignment models (propensity scores) is crucial for valid inference.

How it works in time-varying settings:

Example

Suppose you want to estimate the effect of sending push notifications each week on user engagement. Each user is observed over multiple weeks, with variables:

At each week $t$, users may receive a notification based on their prior engagement and notification history (time-varying confounding). For example, less engaged users might be more likely to receive notifications. We want to estimate the causal effect of $A_1$ and $A_2$ on engagement in week 3, adjusting for time-varying confounding.

Steps:

  1. Estimate propensity scores for each week: For binary treatments, this is typically done using logistic regression.

    For week 1, model:

    \[P(A_1 = a_1 \mid L_0)\]

    For week 2, model:

    \[P(A_2 = a_2 \mid L_0, A_1, L_1)\]

    And similarly for later weeks.

  2. Calculate stabilized weights:

    For each user, the weight up to week 2 is:

    \[w_i = \frac{P(A_1 = a_1) \times P(A_2 = a_2)}{P(A_1 = a_1 \mid L_0) \times P(A_2 = a_2 \mid L_0, A_1, L_1)}\]

    This downweights users with unlikely treatment histories conditional on their covariates, balancing the treated and untreated groups.

  3. Fit weighted outcome model:

    Fit a weighted regression of week 3 engagement $Y$ on treatment history $A_1, A_2$, without adjusting for covariates because the weights account for confounding.

    \[E[Y \mid A_1, A_2]\]

    Typical choices include:

    • Linear probability model: coefficients interpreted as average differences in probability (percentage points).
    • Logistic regression: coefficients interpreted as log-odds ratios (less intuitive but suitable for binary outcomes).
  4. Interpret results:

    The coefficients on $A_1$ and $A_2$ estimate the marginal causal effect of receiving notifications in weeks 1 and 2 on engagement in week 3, averaged over the population. For example, if the coefficient for $A_2$ in a linear model was 0.15, we could say: “receiving a notification in week 2 increases the probability of engagement in week 3 by 15 percentage points, on average.” The total effect of receiving notifications in both weeks is the sum of the coefficients (under a linear model assumption).

    This differs from G-computation, which estimates conditional expectations and uses simulation; IPW estimates the marginal effect by reweighting the observed data.

  5. Quantify uncertainty:

    Because the weights and outcome model are both estimated, standard errors should reflect uncertainty in both steps. A common approach is bootstrapping:

    • Resample the data with replacement many times (e.g., 500–1000 iterations).
    • For each bootstrap sample:
      • Re-estimate propensity score models and stabilized weights.
      • Refit the weighted outcome model.
      • Record treatment effect estimates (e.g., coefficients or contrasts).
    • Use the empirical distribution of estimates to compute confidence intervals (e.g., percentile-based or normal approximation).
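A minimal sketch of the weighting and outcome-model steps above, assuming a DataFrame df with placeholder columns L0, A1, L1, A2, Y, a binary treatment, and a linear probability model for the outcome:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

def stabilized_weights(df):
    """Stabilized IP weights for a two-period binary treatment history."""
    # Denominators: P(A_t = a_t | history), from per-period propensity models
    p1 = LogisticRegression().fit(df[["L0"]], df["A1"]).predict_proba(df[["L0"]])[:, 1]
    p2 = (LogisticRegression()
          .fit(df[["L0", "A1", "L1"]], df["A2"])
          .predict_proba(df[["L0", "A1", "L1"]])[:, 1])
    d1 = np.where(df["A1"] == 1, p1, 1 - p1)
    d2 = np.where(df["A2"] == 1, p2, 1 - p2)
    # Numerators: marginal P(A_t = a_t), which stabilizes the weights
    n1 = np.where(df["A1"] == 1, df["A1"].mean(), 1 - df["A1"].mean())
    n2 = np.where(df["A2"] == 1, df["A2"].mean(), 1 - df["A2"].mean())
    return (n1 * n2) / (d1 * d2)

def fit_weighted_outcome_model(df, weights):
    """Weighted linear probability model of Y on treatment history only."""
    X = sm.add_constant(df[["A1", "A2"]])
    return sm.WLS(df["Y"], X, weights=weights).fit()

# fit_weighted_outcome_model(df, stabilized_weights(df)).params
```

For standard errors, wrap both functions in a bootstrap loop as described in step 5 rather than relying on the WLS standard errors, which ignore uncertainty in the estimated weights.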

Practical considerations

IPW weights can also be used to estimate treatment effects on outcomes measured at multiple time points by fitting weighted models for each outcome, or the full outcome trajectory. However, to explicitly model the effect of treatment histories on longitudinal outcomes, Marginal Structural Models (MSMs) are typically employed.

Marginal Structural Models (MSMs)

Marginal Structural Models (MSMs) provide a framework for estimating the causal effects of dynamic treatment strategies in the presence of time-varying confounding. Like IPW and G-computation, MSMs aim to recover causal effects from observational data, but they go a step further by explicitly modeling how treatment history influences outcomes over time. The key idea behind an MSM is to model the marginal, or population-average, relationship between treatment and outcome. This is typically done using a generalized linear model (GLM), such as linear regression for continuous outcomes or logistic regression for binary outcomes. These models yield coefficients that can be interpreted as average treatment effects across the population.

Typically, IPW is implemented first to reweight the data, creating a pseudo-population in which treatment assignment is independent of past covariates. The MSM is then fit to this weighted data to estimate causal effects. This combination of reweighting followed by outcome modeling distinguishes MSMs from IPW alone and allows them to handle complex longitudinal treatment patterns without conditioning directly on post-treatment variables.

Example with a single final outcome:

Suppose you want to understand the effect of receiving weekly notifications on user engagement measured at week 3. The data includes:

Users may be more likely to receive notifications if they showed low engagement in previous weeks, creating time-varying confounding.

Steps:

  1. Estimate stabilized inverse probability weights (IPW):

    For each week $t$, estimate the probability of receiving the observed treatment $A_t$ conditional on past covariates and treatments. These are typically modeled with logistic regression:

    \[P(A_t = a_t \mid L_0, A_1, L_1, \dots, A_{t-1}, L_{t-1})\]

    Then compute stabilized weights as the product of marginal treatment probabilities divided by these conditional probabilities:

    \[w_i^{\text{stab}} = \prod_{t=1}^2 \frac{P(A_t = a_t)}{P(A_t = a_t \mid L_0, A_1, L_1, \dots, A_{t-1}, L_{t-1})}\]

    Note: Check for extreme weights and consider truncation or smoothing to improve stability.

  2. Specify the MSM:

    Choose a parametric model relating the expected potential outcome $Y^{\bar{a}}$ to treatment history. For example:

    \[E[Y^{\bar{a}}] = \beta_0 + \beta_1 A_1 + \beta_2 A_2 + \beta_3 \text{(baseline covariates)}\]
  3. Fit the MSM using weighted regression:

    Fit the model for $Y$ on treatment history variables using the stabilized IP weights, without including time-varying confounders as covariates. This weighted regression estimates marginal causal effects averaged over confounders.

  4. Interpret coefficients:

    In this context, marginal effects refer to population-average causal effects. Rather than measuring the effect conditional on specific covariate values (as in traditional regression), MSMs estimate the average effect of treatment history across the entire study population, after balancing confounders via weighting. This means the effects describe how the average outcome would change if everyone in the population received a particular treatment pattern, marginalizing over differences in covariates.

    • $\beta_1$ estimates the average causal effect of receiving a notification in week 1 on engagement at week 3.
    • $\beta_2$ estimates the average causal effect of receiving a notification in week 2 on engagement at week 3.

MSMs with multiple outcome time points

When outcomes are measured repeatedly over time (e.g., weekly engagement from week 1 to week 3 or beyond), MSMs can be extended to model the entire outcome trajectory as a function of treatment history.

For example, suppose you observe engagement outcomes $Y_1, Y_2, Y_3$ and want to estimate how notifications in weeks 1 and 2 affect engagement over this period.

  1. Compute stabilized IP weights as before, based on treatment and covariate histories.

  2. Specify an MSM modeling the marginal mean outcome at each time point $t$:

    \[E[Y_t^{\bar{a}}] = \beta_0 + \beta_1 A_1 + \beta_2 A_2 + \beta_3 t + \beta_4 (A_1 \times t) + \beta_5 (A_2 \times t) + \beta_6 \text{(baseline covariates)}\]

    This specification allows treatment effects to vary over time.

  3. Fit this model using weighted regression or generalized estimating equations (GEE) to account for within-subject correlation, applying the IP weights.

  4. Interpret coefficients as the average causal effect of treatment history on the trajectory of engagement over time.

Practical Considerations:

MSMs fill an important niche between fully conditional approaches like G-computation and non-parametric weighting approaches like IPW. By explicitly modeling outcomes while using weighting to balance confounders, they offer a blend of flexibility and interpretability.

Method comparison

Aspect Simplification: Single Treatment Point G-computation (G-formula) IPW (Inverse Probability Weighting) MSMs (Marginal Structural Models)
Method Overview & Modeling Approach Uses standard regression to estimate treatment effect at one fixed time point, conditioning only on baseline covariates. No modeling of treatment assignment or covariate evolution. Sequential modeling of outcomes and covariates over time, simulating full treatment and covariate trajectories under different treatment strategies. Requires correctly specified outcome and covariate models. Models treatment assignment probabilities (propensity scores) at each time point to calculate weights that create a pseudo-population with balanced confounders; no outcome model needed for weighting. Combines IP weights with a weighted regression model of outcome on treatment history, estimating marginal treatment effects averaged over the population without conditioning on confounders directly.
Estimand & Interpretation Conditional effect of a one-time treatment at a specific time, given baseline covariates. Conditional expected outcomes under specified dynamic treatment strategies, accounting for treatment-confounder feedback. Marginal (population-average) causal effect of treatment history, reflecting average effect across the population. Marginal (population-average) treatment effects capturing how treatment history affects outcomes on average over the population, can include time-varying effects.
Handling Time-varying Confounders Avoids time-varying confounding by only adjusting for baseline covariates; ignores repeated treatments and their feedback. Explicitly models time-varying confounders and their evolution, simulating their interaction with treatment over time. Adjusts for time-varying confounders indirectly by weighting observations according to inverse probability of observed treatment history given covariates. Same as IPW, with the addition of outcome modeling to capture treatment effect marginally; weights balance confounders before outcome regression.
Assumptions No unmeasured confounding at baseline; treatment assignment at the chosen time is independent of unmeasured factors given baseline covariates. Positivity, consistency, sequential exchangeability; correct model specification for both covariate and outcome processes. Positivity, consistency, sequential exchangeability; correct specification of treatment assignment (propensity) models at all times. Same as IPW, plus correct specification of MSM outcome model; relies on stable weights for reliable estimates.
Strengths Simple and easy to implement; interpretable; requires fewer assumptions; suitable for short-term or single treatment effect questions. Highly flexible; can evaluate complex, dynamic treatment regimes with feedback; allows direct simulation of counterfactual outcomes. Conceptually straightforward; avoids outcome modeling bias; targets marginal causal effects; scalable to longer time horizons. Balances confounding with weights and incorporates outcome regression for efficiency; interpretable marginal effects; flexible for dynamic treatments and repeated outcomes.
Limitations Discards information on repeated treatments; may suffer residual confounding if baseline covariates are incomplete; not suitable for cumulative or dynamic effects. Computationally intensive; sensitive to misspecification of multiple models; simulation can be complex and resource-heavy. Sensitive to extreme or unstable weights; requires careful model checking and diagnostics; no direct outcome modeling may limit efficiency. Shares weighting challenges with IPW; outcome model must be correctly specified; can be unstable with complex or sparse data; requires balance between flexibility and complexity.
Recommended Use Cases Short time horizon; interest in a single treatment effect; when only baseline covariates are reliable; simpler scenarios. Evaluating specific static or dynamic treatment rules involving time-varying confounding and feedback loops; when flexible outcome modeling is feasible. Estimating marginal population-level effects over longer time horizons; when avoiding outcome model bias is important; computational tractability preferred. When population-average causal effects are desired with complex treatment histories; balancing interpretability with flexibility; often preferred for longer or repeated outcome data.

Dropout (loss to follow-up)

In longitudinal studies and real-world observational settings, it’s common for some users or subjects to stop providing data before the study ends. This is called loss to follow-up or dropout. If dropout is related to both the treatment and the outcome, it can bias your estimates. For example, if users who disengage differ systematically from those who stay, treating the missing data as random will lead to distorted results. This is known as informative censoring, because the missingness carries information about the outcome.

Ignoring it

Before applying specialized methods, it’s worth asking: what happens if you don’t model dropout at all? There are a few common ways practitioners try to “handle” dropout without explicitly modeling it—each with its own risks:

In industry, dropout is rarely random. Users may disengage because of poor experience, lack of value from treatment, or external pressures that are also related to outcomes. Ignoring this can:

Rather than ignore dropout, you should either adjust for the dropout mechanism or try to recover information about what would have happened to those who left. Methods like censoring indicators, inverse probability of censoring weighting (IPCW), and imputation (when done carefully and transparently) can help you estimate valid causal effects even in the presence of missing outcome data.

Censoring Indicators

At each time point $t$, define a binary variable $C_t$ that equals 1 if the unit has dropped out before time $t$ (i.e., is no longer observed at or after time $t$), and 0 otherwise. These indicators are included as covariates in the outcome model to explicitly account for dropout status.

How it works

Including $C_t$ in the model lets the regression condition on dropout status, explicitly accounting for whether someone has dropped out earlier. Conceptually, this splits the outcome prediction into two groups: those still active ($C_t=0$) and those who have dropped out ($C_t=1$), allowing the model to adjust treatment effect estimates accordingly. The treatment effect estimate becomes conditional on remaining uncensored up to time $t$, answering questions like: “Among users who stayed active through week 2, what was the effect of receiving notifications in week 1?”

You can include an interaction between treatment and $C_t$ to estimate separate treatment effects for those at risk of dropout and those who remained engaged. This captures potential heterogeneity and partially adjusts for confounding introduced by selective dropout.

Example

Suppose you model week 2 engagement as:

\[Engagement_2 \sim Email_1 + Email_2 + Engagement_1 + C_2 + Email_2 \times C_2\]

The marginal (population-average) treatment effect is recovered by weighting these conditional effects by the observed dropout distribution:

\[ATE = \beta_{\text{Email}_2} \cdot P(C_2 = 0) + (\beta_{\text{Email}_2} + \beta_{\text{Email}_2 \times C_2}) \cdot P(C_2 = 1)\]

where $P(C_2=0)$ and $P(C_2=1)$ are the proportions of users still active or dropped out before time 2.

This method is straightforward to implement and works well when:

Note this approach assumes no strong unmeasured confounding between dropout and outcome beyond what is captured by covariates and $C_t$.

Limitations

Censoring indicators flag dropout but do not recover missing outcome data or adjust for the underlying dropout mechanism. This limits their ability to correct bias when dropout is strongly informative or related to unmeasured factors.

Since censoring indicators rely on the assumption that dropout is independent of unmeasured factors given observed covariates and $C_t$, violation of this assumption can leave residual bias. In such scenarios, methods like Inverse Probability of Censoring Weighting (IPCW) or joint modeling offer more robust adjustment.

Inverse Probability of Censoring Weighting (IPCW)

Inverse Probability of Censoring Weighting (IPCW) provides a principled way to recover unbiased treatment effect estimates in the presence of dropout by reweighting the data to account for who remained in the sample at each time point. Instead of modeling the probability of receiving treatment (as in IPW), you model the probability of remaining uncensored (i.e. not dropping out) up to each time point. It’s especially useful when dropout is informative (i.e. related to prior covariates, treatment, or intermediate outcomes), but you assume it’s conditionally independent of the outcome given these variables.

How it works

  1. Model censoring: At each time point, estimate the probability that a user is still active (uncensored) given their observed history up to that point—this includes past treatments, covariates, and outcomes.

    For example, you might fit a logistic regression or survival model:

    \[P(C_t = 0 \mid \text{history up to } t)\]
  2. Compute weights: For each individual at each time point, calculate the inverse of their probability of remaining uncensored:

    \[w_t = \frac{1}{P(C_t = 0 \mid \text{history})}\]

    These weights increase the influence of individuals who are similar to those who dropped out—effectively upweighting underrepresented trajectories.

  3. Fit outcome model using weights: Use these weights in your outcome regression to estimate the treatment effect. This creates a “pseudo-population” where dropout is independent of covariates and treatment—approximating the full original cohort.

Example

Let’s return to our example of estimating the effect of sending push notifications each week on user engagement, but now we have drop-out (some users stop engaging). Your data includes:

At each week $t$, users may receive notifications based on prior engagement and notification history (time-varying confounding), and may also drop out based on their history (for simplicity, dropout here means not being active in week 3). You want to estimate the effect of notifications on week 3 engagement while adjusting for both confounding and informative dropout.

Steps:

  1. Estimate treatment (notification) propensity scores at each week:
    Model the probability of receiving a notification at week $t$ conditional on past covariates and treatment:

    \[P(A_1 = a_1 \mid L_0), \quad P(A_2 = a_2 \mid L_0, A_1, L_1)\]
  2. Calculate stabilized treatment weights:
    For each user, calculate treatment weights up to week 2:

    \[w_i^{treatment} = \frac{P(A_1 = a_1) \times P(A_2 = a_2)}{P(A_1 = a_1 \mid L_0) \times P(A_2 = a_2 \mid L_0, A_1, L_1)}\]
  3. Model dropout (censoring) probabilities at each week:
    Estimate the probability of remaining uncensored (still engaged) at each week $t$ given past treatment and covariate history:

    \[P(C_t = 0 \mid L_0, A_1, L_1, A_2, L_2, \ldots)\]
  4. Calculate stabilized censoring weights:
    Calculate inverse probability of censoring weights to upweight users similar to those who dropped out:

    \[w_i^{censor} = \frac{P(C_1 = 0) \times P(C_2 = 0 \mid C_1=0)}{P(C_1 = 0 \mid \text{history}) \times P(C_2 = 0 \mid C_1 = 0, \text{history})}\]
  5. Combine treatment and censoring weights:
    Multiply treatment and censoring weights to obtain the final weight adjusting for confounding and dropout:

    \[w_i = w_i^{treatment} \times w_i^{censor}\]
  6. Fit weighted outcome model:
    Fit a weighted regression of week 3 engagement on treatment history without adjusting for covariates:

    \[E[Y_3 \mid A_1, A_2]\]
  7. Interpret results:
    Coefficients estimate the causal effect of notifications on week 3 engagement, accounting for time-varying confounding and informative dropout. For example, a coefficient of 0.15 on $A_2$ means receiving a notification in week 2 increases the probability of engagement in week 3 by 15 percentage points on average.

  8. Quantify uncertainty:
    Use bootstrapping to account for variability in estimating both treatment and censoring models.
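A minimal sketch of the censoring-weight and combination steps, reusing the stabilized_weights helper from the IPW sketch earlier and assuming a placeholder column C2 equal to 1 if the user dropped out before week 3:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

def censoring_weights(df):
    """Stabilized inverse probability of censoring weights (C2 = 1 means dropped out)."""
    stay = (df["C2"] == 0).astype(int)
    # P(still observed at week 3 | history)
    p_stay = (LogisticRegression()
              .fit(df[["L0", "A1", "L1", "A2"]], stay)
              .predict_proba(df[["L0", "A1", "L1", "A2"]])[:, 1])
    return stay.mean() / p_stay        # marginal probability in the numerator

# w = stabilized_weights(df) * censoring_weights(df)   # treatment x censoring weights
# kept = (df["C2"] == 0).to_numpy()                    # only uncensored rows have Y observed
# X = sm.add_constant(df.loc[kept, ["A1", "A2"]])
# sm.WLS(df.loc[kept, "Y"], X, weights=w[kept]).fit().params
```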

Nuance in defining dropout

IPCW requires clear definitions of censoring: usually, censoring is defined as the last observed time point before permanent loss of follow-up:

When using IPCW or other dropout methods, explicitly define what counts as dropout in your context, considering user behavior. For many product metrics, a certain period of inactivity might count as dropout, rather than absence at a single time point.

Assumptions

Pros and cons

Advantages:

Limitations:

When to use

Use IPCW when you believe dropout is related to treatment or outcome, but you can model it using observed covariates. It’s particularly useful in longitudinal product data where churn or disengagement is a non-random signal, but not inevitable, and you want to generalize results to the full user base rather than just the engaged subset.

Multiple Imputation

Multiple imputation (MI) is a popular approach to handle missing outcome data due to dropout by filling in (imputing) plausible values for missing observations multiple times, creating several complete datasets. Each dataset is then analyzed separately, and results are combined to produce overall estimates that reflect uncertainty due to missing data.

How It Works

  1. Model the missing data mechanism:
    Using observed data (including baseline covariates, treatment history, and previous outcomes), you build a model to predict missing outcomes. This can be done with regression, machine learning methods, or other flexible models.

  2. Generate multiple imputations:
    For each missing outcome, the model generates multiple plausible values (e.g., 5 to 20 datasets), reflecting the uncertainty about what the true value might have been.

  3. Analyze each completed dataset:
    Apply your planned analysis (e.g. regression modeling of outcome on treatment) to each imputed dataset independently.

  4. Pool results:
    Combine the estimates and standard errors from each imputed dataset to obtain a single set of inference results that properly account for missing data uncertainty. Specifically, for $m$ imputed datasets, let $\hat{\theta}_j$ be the estimate from the $j^{th}$ dataset and $U_j$ its estimated variance.

    • Pooled estimate: \(\bar{\theta} = \frac{1}{m} \sum_{j=1}^m \hat{\theta}_j\)

    • Within-imputation variance: \(\bar{U} = \frac{1}{m} \sum_{j=1}^m U_j\)

    • Between-imputation variance: \(B = \frac{1}{m-1} \sum_{j=1}^m (\hat{\theta}_j - \bar{\theta})^2\)

    • Total variance: \(T = \bar{U} + \left(1 + \frac{1}{m}\right) B\)

    The standard error of $\bar{\theta}$ is $\sqrt{T}$. Confidence intervals and hypothesis tests can then be constructed using $\bar{\theta}$ and $T$, often employing a $t$-distribution with adjusted degrees of freedom.
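The pooling step is simple to implement directly; a minimal helper given a list of per-dataset estimates and their variances (the example numbers are made up for illustration):

```python
import numpy as np

def pool_rubins_rules(estimates, variances):
    """Combine estimates from m imputed datasets into one estimate and standard error."""
    estimates, variances = np.asarray(estimates), np.asarray(variances)
    m = len(estimates)
    theta_bar = estimates.mean()          # pooled estimate
    u_bar = variances.mean()              # within-imputation variance
    b = estimates.var(ddof=1)             # between-imputation variance
    total_var = u_bar + (1 + 1 / m) * b
    return theta_bar, np.sqrt(total_var)  # estimate and its standard error

est, se = pool_rubins_rules([0.12, 0.15, 0.13, 0.16, 0.14],
                            [0.002, 0.002, 0.003, 0.002, 0.002])
```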

When to use

Limitations

Example

Suppose you want to estimate the effect of sending push notifications each week on user engagement, focusing on engagement measured at week 3. Some users stop using the product after week 2 (dropout), so their engagement at week 3 is missing.

Steps:

This approach assumes that, conditional on the observed data, the dropout mechanism is ignorable and that your imputation model sufficiently captures the key predictors of dropout and engagement.

In practice, multiple imputation is a valuable tool when dropout is moderate and primarily related to observed data. However, when dropout is strongly informative or related to unmeasured factors, methods like IPCW or joint modeling are generally preferred to reduce bias.

Joint Modeling of Outcomes and Dropout

Joint modeling is an advanced approach that simultaneously models the outcome process and the dropout (or censoring) process within a unified statistical framework. Unlike simpler methods that treat dropout as a nuisance or handle it separately, joint modeling explicitly recognizes that dropout and the outcome may be related through shared underlying factors.

How it works

In joint modeling, two sub-models are specified:

  1. Outcome model: Describes the trajectory of the outcome over time (e.g., user engagement over multiple weeks).
  2. Dropout model: Captures the probability or timing of dropout, often modeled as a survival or time-to-event process.

These models are linked through shared random effects or latent variables, which represent unobserved characteristics that influence both the outcome and dropout. This allows the method to account for informative dropout, where the likelihood of dropping out depends on unmeasured factors related to the outcome.

When to use

Benefits

Limitations

Example

Suppose you track weekly engagement of users receiving push notifications. You suspect users who disengage early (dropout) do so because of unobserved dissatisfaction that also affects their engagement. A joint model would simultaneously model engagement over time and the dropout process, linking them through shared latent factors. By doing this, the model can better adjust for dropout bias and provide more accurate estimates of how notifications affect engagement, even when dropout is informative and related to unmeasured user traits.

Joint modeling complements other approaches like inverse probability weighting or multiple imputation by explicitly modeling the dropout mechanism and outcome together. It is a powerful tool when you face complex dropout patterns that simpler methods cannot adequately address.

General tips

Comparison of methods

| Method | Key Steps | When to Use / Pros | Limitations | Use in Static Causal Methods | Use in Longitudinal Causal Methods |
|---|---|---|---|---|---|
| Censoring Indicators | Add dropout flags to outcome models or jointly model dropout + outcome | Simple to implement; makes dropout explicit | May not fully adjust for complex or informative dropout | Add dropout indicators in regression, matching, DiD, etc. | Include time-varying dropout indicators in G-computation or MSMs |
| IPCW | Model dropout → compute inverse probabilities → use weights in outcome model | Robust for informative censoring if dropout model is well specified | Requires a good model; unstable weights can inflate variance | Weight complete cases by dropout probabilities | Combine with treatment IPW (double weights) in MSMs or G-computation |
| Multiple Imputation | Impute missing data multiple times → analyze each → pool results (Rubin's rules) | Familiar method; handles MAR well; easy to apply | Assumes MAR; can mislead if dropout is MNAR | Impute missing $Y$ → use regression, matching, DiD, synthetic control | Impute $Y_t$ at each time step → use in longitudinal outcome models |
| Joint Modeling | Jointly model outcome + dropout (e.g. shared random effects or latent classes) | Handles MNAR; models the dropout mechanism directly | Complex; requires correct specification and more assumptions | Less common; possible when dropout is tightly tied to outcome | Powerful for MNAR dropout in longitudinal settings; used with mixed models |

Time-to-Event Models

Causal questions often involve not just whether an event occurs, but when (for example, when a customer will churn or a machine will fail). Traditional survival analysis methods, such as the Kaplan-Meier estimator or Cox proportional hazards model, are commonly used to analyze these outcomes. However, these approaches typically estimate associations rather than causal effects. Moving from association to causation in time-to-event settings requires methods that adjust for confounding, appropriately handle censoring (for example, when the event is not observed during the follow-up window), and accommodate time-varying treatments and covariates. Below, we first review common associational methods and then introduce methods capable of estimating causal effects.

Standard approaches

Kaplan-Meier Estimator

The Kaplan-Meier curve estimates the probability of survival over time without making assumptions about the distribution of event times. It is a nonparametric method that accounts for right censoring due to end of follow-up or loss to follow-up. Survival curves can be compared across groups (e.g., treatment versus control), but without adjustment for confounding. Therefore, it typically describes associations, not causal effects.

Cox Proportional Hazards Model

The Cox proportional hazards model is a widely used approach that estimates how different factors (like treatment) affect the instantaneous risk of the event occurring at any given moment in time. You can think of the hazard as the chance the event happens right now, assuming it hasn’t happened yet. Formally, the hazard function quantifies the instantaneous event rate at time $t$, conditional on survival up to that point:

\[\lambda(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t + \Delta t \mid T \geq t)}{\Delta t}\]

The Cox model estimates how covariates (such as treatment and other variables) affect this hazard:

\[\lambda(t \mid A, L) = \lambda_0(t) \exp(\beta_A A + \beta_L L)\]

Where:

  • $\lambda(t \mid A, L)$ is the hazard at time $t$ for an individual with treatment $A$ and covariates $L$
  • $\lambda_0(t)$ is the unspecified baseline hazard (the hazard when $A = 0$ and $L = 0$)
  • $\beta_A$ and $\beta_L$ are the log hazard ratios for treatment and the covariates

This model is semi-parametric, leaving the baseline hazard unspecified but assuming a multiplicative form for the effect of treatment and covariates. The coefficients are estimated by analyzing the timing of observed events and censoring (when someone leaves the study without experiencing the event).

The exponentiated coefficient $\exp(\beta_A)$ is interpreted as a hazard ratio (HR), which compares the relative hazard under treatment versus control, holding covariates constant:

\[\text{HR}(t) = \frac{\lambda_1(t)}{\lambda_0(t)}\]

For example, a hazard ratio of 0.75 means that, at time $t$, the treated group has a 25% lower instantaneous event rate among those still at risk. However, these estimates represent causal effects only if treatment is effectively randomized (or all confounders are measured and properly adjusted for) and censoring is non-informative (unrelated to the event risk). In practice, Cox models often estimate associations rather than causal effects for several reasons:

  1. Hazard ratios condition on survival up to time $t$. If treatment influences who survives, this induces selection bias. For example, if treatment removes high-risk individuals early, survivors in the treated group may appear artificially healthier than those in the control group.

  2. The hazard ratio is typically not constant over time, even though the Cox model assumes proportional hazards unless extended with time-varying effects. Reported hazard ratios are weighted averages of time-specific hazard ratios, with weights driven by the event distribution rather than causal mechanisms.

  3. The hazard ratio is not a direct measure of absolute survival benefit. It does not indicate how many more people survive at a specific time point (such as 12 months) under treatment versus control. For that, survival probabilities are more informative.

Survival Curves

While hazard ratios summarize relative risk, survival curves estimate the probability of surviving beyond a given time point. In the Cox model, the survival curve for an individual with covariates $X$ is derived from the estimated baseline survival curve $\hat{S}_0(t)$ as:

\[\hat{S}(t \mid X) = \hat{S}_0(t)^{\exp(\beta^\top X)}\]

Where:

  • $\hat{S}_0(t)$ is the estimated baseline survival function (survival with all covariates at their reference values)
  • $X$ is the individual's covariate vector and $\beta$ the estimated coefficients

To illustrate, consider a subscription service. We can use a Cox model to estimate survival curves for customers who received a promotional offer versus those who didn’t, holding other factors (like age and plan type) constant. The y-axis shows the estimated probability that a customer remains subscribed at a given time (in days), and the x-axis shows the number of days since subscription began. The promotion offer curve is consistently higher, suggesting an association between the offer and improved retention (those that received the offer were more likely to still be subscribed at all time points), and the growing gap indicates some cumulative benefit.

[Figure: survival curves example, promotional offer vs no offer]

However, because this is a standard Cox model, the survival curves reflect associational rather than causal differences, meaning other factors not included in the model could explain part of the observed effect. Conditional survival curves such as this give the average survival probability over time for individuals with a specific covariate profile. They don’t give counterfactual survival, i.e. what survival would have been had everyone received treatment $A$ versus treatment $B$. Estimating such causal survival curves requires specialized causal inference methods that simulate or estimate potential outcomes under different treatment strategies.
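For reference, a minimal sketch of the subscription example with the survival package (the data frame subs and its columns are hypothetical):

```r
library(survival)

# Fit the Cox model: time subscribed, churn indicator, promo flag, and covariates.
fit <- coxph(Surv(days_subscribed, churned) ~ promo + age + plan_type, data = subs)
exp(coef(fit))                  # hazard ratios, e.g. the HR for promo vs no promo

# Conditional survival curves for two otherwise-identical customer profiles.
profiles <- data.frame(promo = c(0, 1), age = 35, plan_type = "basic")
curves <- survfit(fit, newdata = profiles)
plot(curves, col = c("grey40", "steelblue"),
     xlab = "Days since subscription", ylab = "P(still subscribed)")
```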

Causal Methods

Pooled Logistic Regression

Pooled logistic regression models discrete-time hazards using standard logistic regression. The key idea is to convert survival analysis into a classification problem. Rather than modeling the exact timing of an event, we break time into discrete intervals (e.g. days or weeks) and ask: did the event happen in this interval, given that it hasn’t happened yet? This involves restructuring the data so that each individual contributes one row per time period they were at risk, then fitting a standard logistic regression model (including time, treatment, and covariates). The model estimates the conditional probability of the event occurring in each interval, which serves as a discrete-time approximation of the hazard function:

\[\text{logit} \left[ P(Y_{it} = 1 \mid Y_{i,t-1} = 0, A_{i,t}, L_{i,t}) \right] = \beta_0(t) + \beta_A A_{i,t} + \beta_L L_{i,t}\]

By setting treatment to a fixed value across all individuals and time points, and predicting the hazards for each interval, you can construct counterfactual survival curves. This allows for direct estimation of causal effects, assuming time-varying confounding is appropriately controlled via covariate adjustment or inverse probability weighting.

Steps:

  1. Reshape data Structure the dataset so each person contributes one row per time period they are at risk, stopping at the time of the event or censoring.

  2. Fit pooled logistic regression Estimate the probability of the event occurring in each interval using logistic regression with time indicators, treatment, and time-varying covariates.

  3. Simulate counterfactual survival To estimate survival under a fixed treatment strategy, set treatment to the desired level for all rows and compute predicted hazards:

    \[\hat{S}(t) = \prod_{k=1}^t (1 - \hat{h}(k))\]

    where $\hat{h}(k)$ is the predicted hazard (probability of the event) at time k. Simulating under different treatment strategies yields causal survival curves.

Example

Suppose we want to estimate the effect of sending weekly promotional emails on time to unsubscribe. We have data on 500 users tracked weekly for up to 12 weeks. Each week, we observe:

  • whether the user was sent a promotional email that week ($A_t$)
  • the user's engagement status that week ($L_t$)
  • whether the user unsubscribed that week ($Y_t$)

First we need to reshape the data (convert to long format, so each row is one user-week observation, stopping at unsubscribe or censoring). Then we can model the discrete-time hazard (probability of unsubscribe) as a function of: week indicators (to flexibly model baseline hazard over time); treatment at that week; and engagement status at that week. For example, it might look something like: glm(Y ~ factor(week) + A + L, family = binomial, data = long_data).

Simulating Counterfactual Survival Curves

Now we can estimate survival under fixed treatment strategies (always send promos vs never send promos):

  1. Create counterfactual datasets: Copy the original long dataset twice:

    • In one copy, set $A_t = 1$ for all user-weeks (always send promotions)
    • In the other, set $A_t = 0$ for all user-weeks (never send promotions)

    Keep $L_t$ and time variables as observed or appropriately updated depending on assumptions (e.g., time-varying covariates can be set to observed or fixed values depending on the method).

  2. Predict hazards: Use the fitted model to predict the probability of unsubscribing each week for every user in each counterfactual dataset:

    \[\hat{h}_{i,t} = P(Y_{i,t} = 1 \mid Y_{i,t-1} = 0, A_{i,t}, L_{i,t})\]
  3. Calculate survival probabilities: For each user, calculate the survival probability at each week $t$:

    \[\hat{S}_{i}(t) = \prod_{k=1}^t (1 - \hat{h}_{i,k})\]
  4. Aggregate over users: Average these individual survival probabilities across all users to get the estimated survival curve under each treatment strategy:

    \[\hat{S}(t) = \frac{1}{N} \sum_{i=1}^N \hat{S}_i(t)\]
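Putting these steps together in R (continuing the hypothetical long_data and model formula from above, using dplyr for the grouped cumulative products, and keeping time-varying $L_t$ at its observed values):

```r
library(dplyr)

fit <- glm(Y ~ factor(week) + A + L, family = binomial, data = long_data)

counterfactual_survival <- function(fit, data, a) {
  data$A <- a                                                 # 1. fix treatment for all user-weeks
  data$h <- predict(fit, newdata = data, type = "response")   # 2. predicted weekly hazards
  data %>%
    group_by(user_id) %>%
    arrange(week, .by_group = TRUE) %>%
    mutate(S = cumprod(1 - h)) %>%                            # 3. per-user survival curve
    group_by(week) %>%
    summarise(S_hat = mean(S), .groups = "drop")              # 4. average across users
}

survival_always <- counterfactual_survival(fit, long_data, a = 1)
survival_never  <- counterfactual_survival(fit, long_data, a = 0)
```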

Plotting survival_always and survival_never over weeks shows the estimated probability of remaining subscribed under each promotional strategy. The difference between the curves reflects the causal effect of sending weekly promotional emails on retention, provided the assumptions of no unmeasured confounding and correct model specification hold.

Inverse Probability of Censoring Weighting (IPCW)

IPCW adjusts for informative censoring by reweighting observed individuals based on the inverse of their probability of remaining uncensored.

Steps:

  1. Fit a model for censoring probability at each time $t$, conditioning on covariate and treatment history.
  2. Compute stabilized weights:
\[w_{i,t} = \prod_{k=1}^t \frac{P(C_{i,k} = 0)}{P(C_{i,k} = 0 \mid A_{i,1:k-1}, L_{i,1:k-1})}\]
  3. Apply these weights in the outcome model (e.g. pooled logistic or Cox) to obtain unbiased estimates of treatment effects.
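A sketch of how the stabilized censoring weights might be computed on person-period (long-format) data. The column names (C, A_lag, L_lag, user_id, week) are hypothetical, A_lag and L_lag stand in for summaries of history up to $k-1$, and the numerator model here conditions on time only, a common stabilisation choice:

```r
library(dplyr)

# C = 1 if the user becomes censored in interval k (rows only exist while at risk).
num_fit <- glm(C ~ factor(week), family = binomial, data = long_data)
den_fit <- glm(C ~ factor(week) + A_lag + L_lag, family = binomial, data = long_data)

long_data$p_num <- 1 - predict(num_fit, newdata = long_data, type = "response")
long_data$p_den <- 1 - predict(den_fit, newdata = long_data, type = "response")

long_data <- long_data %>%
  group_by(user_id) %>%
  arrange(week, .by_group = TRUE) %>%
  mutate(w_cens = cumprod(p_num / p_den)) %>%   # cumulative stabilized censoring weight
  ungroup()
```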

Marginal Structural Models (MSMs)

MSMs estimate causal effects of treatment strategies when treatment and confounders vary over time, and treatment affects future confounders.

Use inverse probability of treatment and censoring weights to adjust for time-varying confounding:

\[w_{i} = \prod_{k=1}^t \frac{P(A_{i,k} \mid A_{i,1:k-1})}{P(A_{i,k} \mid A_{i,1:k-1}, L_{i,1:k})} \times \frac{P(C_{i,k} = 0)}{P(C_{i,k} = 0 \mid A_{i,1:k-1}, L_{i,1:k-1})}\]

Then, fit a weighted outcome model such as:

\[\text{logit} \left[ P(Y_{it} = 1 \mid A_{i,t}) \right] = \beta_0(t) + \beta_A A_{i,t}\]
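A sketch of the final weighted fit, assuming stabilized treatment weights (sw_treat, computed analogously to the censoring weights from models for treatment given past treatment and covariates) and the censoring weights from the previous section are already columns in the long-format data:

```r
# Combine treatment and censoring weights.
long_data$sw <- long_data$sw_treat * long_data$w_cens

# Weighted pooled logistic MSM; quasibinomial avoids warnings about non-integer weights.
msm_fit <- glm(Y ~ factor(week) + A, family = quasibinomial,
               data = long_data, weights = sw)
exp(coef(msm_fit)["A"])   # odds ratio for the discrete-time hazard under the MSM

# Note: standard errors should come from a robust (sandwich) estimator or a bootstrap
# over users, since weighting and repeated measures induce within-user correlation.
```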

Parametric g-Formula

The parametric g-formula estimates causal effects by explicitly modeling the full joint distribution of the data and simulating counterfactual outcomes under different treatment strategies.

Steps:

  1. Fit models for:

    • Covariates $L_t$
    • Treatment $A_t$
    • Event $D_t$
    • Censoring $C_t$
  2. Simulate full individual trajectories under specified treatment regimes.

  3. Compute survival probabilities under each regime and compare them to estimate causal effects.

\[\hat{S}^a(t) = \text{average survival at } t \text{ under regime } a\]
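A deliberately simplified sketch of the simulation, with a single binary time-varying covariate L, an event indicator D, week-1 lags set to baseline values, censoring handled implicitly by simulating complete trajectories, and all names hypothetical:

```r
# 1. Fit models on the observed person-period data.
L_fit <- glm(L ~ factor(week) + A_lag + L_lag, family = binomial, data = long_data)
D_fit <- glm(D ~ factor(week) + A + L, family = binomial, data = long_data)

# 2-3. Simulate trajectories for N pseudo-individuals under a fixed regime a.
simulate_regime <- function(a, N = 10000, K = 12) {
  L_prev <- rbinom(N, 1, mean(long_data$L[long_data$week == 1]))  # baseline covariate draw
  A_prev <- rep(0, N)
  surv <- rep(1, N)
  S_t <- numeric(K)
  for (k in 1:K) {
    newdat <- data.frame(week = k, A = a, A_lag = A_prev, L_lag = L_prev)
    newdat$L <- rbinom(N, 1, predict(L_fit, newdata = newdat, type = "response"))
    h_k <- predict(D_fit, newdata = newdat, type = "response")    # event hazard at week k
    surv <- surv * (1 - h_k)                                      # per-individual survival
    S_t[k] <- mean(surv)                                          # average survival under regime a
    L_prev <- newdat$L
    A_prev <- rep(a, N)
  }
  S_t
}

S_always <- simulate_regime(a = 1)
S_never  <- simulate_regime(a = 0)
```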

Notes on Survival Models

Survival analysis methods must accommodate right censoring, which commonly arises due to fixed end-of-study follow-up or loss to follow-up. One key distinction between the approaches above: pooled logistic regression is a parametric model for discrete time, whereas the Cox model is semi-parametric and treats time as continuous.

Comparison of methods

| Method | Purpose | Adjusts for confounding | Handles time-varying treatment/covariates | Supports causal inference |
|---|---|---|---|---|
| Kaplan-Meier | Compare survival curves | No | No | No |
| Cox model | Estimate hazard ratios | Some | With extensions | Only under strong assumptions |
| Pooled Logistic | Flexible discrete hazard model | Yes | Yes | Yes (with g-formula or MSM) |
| IPCW | Adjust for censoring bias | Yes | Yes | Yes |
| MSM | Estimate marginal causal effects | Yes | Yes | Yes |
| Parametric g-Formula | Simulate counterfactual outcomes | Yes | Yes | Yes |

Preparing Data for Time-to-Event Analysis

Survival-style models require:

  • a time variable (time to the event, or time to censoring if the event is not observed)
  • an event indicator (1 if the event occurred, 0 if censored)
  • treatment and covariates (baseline and, where relevant, time-varying)

The data is typically transformed to either:

  • wide format: one row per individual with time-to-event and event indicator columns (used by Kaplan-Meier and Cox models)
  • long (person-period) format: one row per individual per time interval at risk (used by pooled logistic regression, IPCW, and MSMs)

Confidence Intervals for Causal Time-to-Event Models

Causal estimates from time-to-event models require valid uncertainty estimates. Common methods:

  • bootstrapping (resampling individuals, not rows, and repeating the full estimation pipeline)
  • robust (sandwich) variance estimators, particularly for weighted approaches such as IPCW and MSMs

Even with valid standard errors, misspecified models (e.g. incorrect functional forms for time, non-proportional hazards, omitted confounders) can lead to biased point estimates and misleading confidence intervals. Confidence intervals only reflect precision, not accuracy. It’s critical to assess model assumptions and conduct sensitivity analyses where possible.

Effect heterogeneity over time

The causal effect of a treatment or exposure may not be constant throughout the entire follow-up period. Instead, the treatment effect can vary at different time points, meaning there is effect heterogeneity over time (time-varying treatment effects). Accounting for this heterogeneity is crucial for understanding when and how an intervention exerts its influence, which can inform better decisions.

Several approaches allow estimation of time-varying treatment effects:

  • including treatment-by-time interactions (e.g. in a pooled logistic or Cox model) so the treatment effect can change over follow-up
  • estimating period-specific effects, such as separate hazard ratios or survival differences for each interval
  • flexible specifications such as splines for the treatment-time interaction

Effect heterogeneity over time can be visualized by plotting estimated hazard ratios, survival differences, or cumulative incidence differences across time points. These dynamic estimates help identify critical periods when treatment is most or least effective, supporting targeted intervention strategies or monitoring plans.


Other common methods

Mediation analysis

Mediation analysis is a statistical technique used to understand the mechanism or causal pathway by which a treatment (or exposure) influences an outcome through an intermediate variable known as the mediator. Rather than simply estimating whether a treatment affects an outcome, mediation analysis seeks to quantify how much of this effect occurs directly and how much is transmitted indirectly through the mediator.

Key concepts

Goals

The primary goal of mediation analysis is to decompose the total effect of the treatment on the outcome into two components:

  • Direct effect: the part of the treatment's effect on the outcome that does not operate through the mediator.
  • Indirect (mediated) effect: the part transmitted through the treatment's impact on the mediator.

Mathematically, this decomposition helps to clarify how the treatment brings about change in the outcome, offering richer insights beyond the average treatment effect (ATE).

Typical workflow

Mediation analysis is generally conducted on individual-level data collected in a single study or experiment. The most common approach involves a series of regression models to estimate direct and indirect effects:

  1. Estimate the total effect of \(T\) on \(Y\):
    Fit a regression model predicting \(Y\) from \(T\) (and relevant covariates). The coefficient of \(T\) here captures the total effect.

  2. Estimate the effect of \(T\) on the mediator \(M\):
    Fit a regression model predicting \(M\) from \(T\) (and covariates). This captures how the treatment influences the mediator.

  3. Estimate the effect of the mediator \(M\) on the outcome \(Y\), controlling for \(T\):
    Fit a regression model predicting \(Y\) from both \(T\) and \(M\) (and covariates). The coefficient on \(M\) here captures the mediator’s effect on the outcome, while the coefficient on \(T\) estimates the direct effect.

  4. Calculate the indirect effect:
    Multiply the effect of \(T \to M\) by the effect of \(M \to Y | T\) to estimate the indirect effect. Note, this approach only works if there’s no interaction between the mediator and treatment (i.e. the effect of the mediator on the outcome doesn’t vary by treatment level). See the next section for how to handle interactions.

  5. Calculate the direct effect:
    The coefficient of \(T\) in the outcome model controlling for \(M\) is the direct effect.

  6. Inference and uncertainty:
    Since the indirect effect is a product of two coefficients, its sampling distribution is usually non-normal. Bootstrapping or other resampling techniques are commonly used to estimate confidence intervals and test significance.
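A minimal sketch of this workflow in R, with hypothetical column names (treat for the treatment \(T\), M for the mediator, Y for the outcome, X for a covariate):

```r
set.seed(1)

total_fit    <- lm(Y ~ treat + X, data = df)       # 1. total effect of treatment
mediator_fit <- lm(M ~ treat + X, data = df)       # 2. treatment -> mediator
outcome_fit  <- lm(Y ~ treat + M + X, data = df)   # 3. mediator -> outcome, controlling for treatment

indirect <- coef(mediator_fit)["treat"] * coef(outcome_fit)["M"]   # 4. indirect effect
direct   <- coef(outcome_fit)["treat"]                             # 5. direct effect

# 6. Percentile bootstrap CI for the indirect effect.
boot_indirect <- replicate(2000, {
  idx <- sample(nrow(df), replace = TRUE)
  coef(lm(M ~ treat + X, data = df[idx, ]))["treat"] *
    coef(lm(Y ~ treat + M + X, data = df[idx, ]))["M"]
})
quantile(boot_indirect, c(0.025, 0.975))
```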

Other methods

Structural Equation Modeling (SEM)

A flexible framework that models direct and indirect pathways simultaneously using a system of structural equations, allowing for complex mediators and latent variables.

Causal Mediation Analysis

Uses counterfactual definitions of natural direct and indirect effects (e.g. the effect of treatment if the mediator were held fixed at its untreated value). These approaches allow for treatment-mediator interactions and enable more rigorous causal interpretation, under strong assumptions about ignorability.

Assumptions

For mediation analysis to yield valid causal interpretations, several assumptions must hold:

  • no unmeasured confounding of the treatment-outcome relationship
  • no unmeasured confounding of the mediator-outcome relationship (conditional on treatment and covariates)
  • no unmeasured confounding of the treatment-mediator relationship
  • no mediator-outcome confounder that is itself affected by the treatment

By carefully applying mediation analysis, researchers can gain insights not only about whether a treatment works, but also how it works, illuminating the causal mechanisms underlying observed effects.

Allowing interactions

When the effect of the mediator on the outcome varies by treatment level (i.e., there is a treatment-mediator interaction), standard linear decomposition (e.g. multiplying coefficients) may not give valid direct and indirect effects. In such cases, counterfactual-based methods (e.g., causal mediation analysis using potential outcomes) are more appropriate, as they allow for and estimate these interactions explicitly.

Meta mediation analysis

Meta-mediation extends this idea to a higher level. Instead of analyzing individual-level data, meta-mediation uses summary effect estimates from multiple independent experiments or studies (meta-experiments). It estimates how much of the variation in treatment effects on the outcome across experiments can be explained by variation in treatment effects on a mediator. In other words, meta-mediation performs mediation analysis across experiments, using the estimated treatment effects on the mediator and outcome from each experiment to infer causal mediation pathways at the meta-experiment level.

Suppose you have multiple experiments (meta-experiments), each perturbing some feature and measuring:

  • an offline metric \(M\) (measured before or outside the experiment)
  • an online metric \(O\) (the mediator, measured during the experiment)
  • a business KPI \(Y\) (the final outcome of interest)

Our goal is to understand how much the offline metric \(M\) affects the business KPI \(Y\), and how much of this effect is mediated through the online metric \(O\).

Imagine \(n\) independent experiments, each producing estimated treatment effects on \(M\), \(O\), and \(Y\):

\[\hat{\delta}_M^{(i)}, \quad \hat{\delta}_O^{(i)}, \quad \hat{\delta}_Y^{(i)}\]

where \(i = 1, 2, \ldots, n\).

We assume the causal relations:

\[\hat{\delta}_O^{(i)} = \beta_{MO} \hat{\delta}_M^{(i)} + \epsilon_O^{(i)}\] \[\hat{\delta}_Y^{(i)} = \beta_{OY} \hat{\delta}_O^{(i)} + \epsilon_Y^{(i)}\]

Here,

  • \(\beta_{MO}\) captures how treatment effects on the offline metric translate into treatment effects on the online metric
  • \(\beta_{OY}\) captures how treatment effects on the online metric translate into treatment effects on the business KPI
  • \(\epsilon_O^{(i)}\) and \(\epsilon_Y^{(i)}\) are zero-mean noise terms reflecting estimation error in each experiment

Two-Stage Regression

We perform two regressions across the meta-experiments:

  1. Regress \(\hat{\delta}_O\) on \(\hat{\delta}_M\):
\[\hat{\delta}_O^{(i)} = \alpha_1 + \hat{\beta}_{MO} \hat{\delta}_M^{(i)} + u_i\]
  2. Regress \(\hat{\delta}_Y\) on \(\hat{\delta}_O\):
\[\hat{\delta}_Y^{(i)} = \alpha_2 + \hat{\beta}_{OY} \hat{\delta}_O^{(i)} + v_i\]

The total mediated effect of the offline metric on the business KPI is estimated by the product:

\[\hat{\beta}_{\text{total}} = \hat{\beta}_{MO} \times \hat{\beta}_{OY}\]

Estimating Confidence Intervals (CIs) Using Bootstrap

Because the total effect is a product of two regression coefficients, its uncertainty is non-trivial. To estimate CIs:

  1. Resample the \(n\) experiments with replacement.
  2. Re-run both regressions on each bootstrap sample and recompute \(\hat{\beta}_{\text{total}} = \hat{\beta}_{MO} \times \hat{\beta}_{OY}\).
  3. Take the 2.5th and 97.5th percentiles of the bootstrap distribution of \(\hat{\beta}_{\text{total}}\):

\[\text{CI}_{95\%} = \left[ \text{percentile}_{2.5\%}(\hat{\beta}_{\text{total}}), \quad \text{percentile}_{97.5\%}(\hat{\beta}_{\text{total}}) \right]\]
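A sketch of the two-stage regression and bootstrap, assuming a hypothetical data frame meta with one row per experiment and columns delta_M, delta_O, delta_Y:

```r
stage1 <- lm(delta_O ~ delta_M, data = meta)   # estimates beta_MO
stage2 <- lm(delta_Y ~ delta_O, data = meta)   # estimates beta_OY
beta_total <- coef(stage1)["delta_M"] * coef(stage2)["delta_O"]

set.seed(1)
boot_total <- replicate(5000, {
  idx <- sample(nrow(meta), replace = TRUE)    # resample experiments with replacement
  coef(lm(delta_O ~ delta_M, data = meta[idx, ]))["delta_M"] *
    coef(lm(delta_Y ~ delta_O, data = meta[idx, ]))["delta_O"]
})
quantile(boot_total, c(0.025, 0.975))          # 95% CI for the mediated effect
```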

Interpretation

Assumptions

This meta-mediation approach relies on several key assumptions to produce valid causal estimates:

  1. Randomization and No Unmeasured Confounding
    Each experiment in the meta-analysis is assumed to be a randomized controlled trial or equivalent, ensuring that treatment assignment is independent of potential outcomes. This justifies interpreting estimated effects \(\hat{\delta}_M^{(i)}, \hat{\delta}_O^{(i)}, \hat{\delta}_Y^{(i)}\) as unbiased causal effects.

  2. Linearity and Additivity of Effects
    The model assumes a linear relationship between the offline metric and the online metric, and between the online metric and the business KPI:

    \[\hat{\delta}_O^{(i)} = \beta_{MO} \hat{\delta}_M^{(i)} + \epsilon_O^{(i)}\] \[\hat{\delta}_Y^{(i)} = \beta_{OY} \hat{\delta}_O^{(i)} + \epsilon_Y^{(i)}\]

    This means the mediation is additive and effects do not vary nonlinearly or interact with other unmodeled factors.

  3. No Direct Effect of Offline Metric on Business KPI Outside the Mediator
    The framework assumes that all causal influence of the offline metric \(M\) on the business KPI \(Y\) operates through the online metric \(O\) (the mediator). There is no unmodeled direct pathway from \(M\) to \(Y\) that bypasses \(O\).

  4. Independent and Identically Distributed (i.i.d.) Errors
    The noise terms \(\epsilon_O^{(i)}\) and \(\epsilon_Y^{(i)}\) are assumed to be independent across experiments and have mean zero. This justifies the use of standard regression techniques and bootstrap inference.

  5. Sufficient Variation Across Experiments
    The meta-experiments must provide enough variation in the treatment effects on \(M\) and \(O\) to reliably estimate the regression coefficients \(\beta_{MO}\) and \(\beta_{OY}\).

  6. Stable Unit Treatment Value Assumption (SUTVA)
    There are no interference effects between units or experiments — that is, the treatment effect for one experiment is unaffected by others.

Violations of these assumptions can bias estimates or invalidate confidence intervals. For example, if the offline metric affects the business KPI through pathways other than the online metric, the mediation estimate will be incomplete. Similarly, unmeasured confounding or nonlinear effects require more advanced methods beyond this linear meta-mediation framework.

Summary

Meta-mediation can estimate how a treatment’s effect on an upstream (offline) metric transmits through a mediator (online metric) to impact a final outcome (business KPI). It focuses on decomposing causal pathways using aggregated effects from multiple experiments.

This methodology helps organizations understand which offline improvements lead to better business outcomes by driving changes in online metrics, guiding metric prioritization and investment. In practice, you often use propensity scores or matching to reduce confounding before performing mediation analysis, or you use IV methods to address confounding in mediator or treatment assignment when needed.

Surrogate/Proxy Validation

In randomised experiments, we often rely on short-term proxy or surrogate metrics as stand-ins for more important but harder to measure outcomes (e.g. north stars like long term retention or revenue). Validating whether a proxy reliably predicts the true outcome across multiple experiments is critical for trustworthy decision-making. However, a common challenge is measurement error: each experiment’s estimated treatment effect on the proxy metric is noisy, and this noise can be correlated with the noise in the estimated effect on the north star metric (in the same experiment). This correlation can bias naïve regressions relating proxy effects to outcome effects, typically attenuating estimates (biasing towards zero), and obscuring true relationships.

To correct for this bias, the leave-one-out instrumental variable (LOO-IV) method leverages the logic of instrumental variables but in a meta-analytic context. Instead of estimating treatment effects, it focuses on the relationship between proxy and outcome effect estimates across experiments. The key idea is to construct an instrument by removing the noisy estimate from the experiment in question, thus creating a proxy measure that is correlated with the true effect but uncorrelated with the noise in that experiment’s outcome estimate.

Steps

  1. Collect summary statistics from multiple experiments: For each experiment $i$, obtain:

    • $\hat{\beta}_{\text{proxy}, i}$: Estimated treatment effect on the proxy metric.
    • $\hat{\beta}_{\text{outcome}, i}$: Estimated treatment effect on the outcome (north star) metric.
  2. Construct the leave-one-out instrument: For each experiment $i$, compute:

    \[Z_i = \frac{1}{N - 1} \sum_{\substack{j=1 \\ j \neq i}}^{N} \hat{\beta}_{\text{proxy}, j}\]

    This is the average proxy effect excluding experiment $i$.

  3. First-stage regression (predict proxy effect): Regress the noisy proxy effect on the instrument:

    \[\hat{\beta}_{\text{proxy}, i} = \pi_0 + \pi_1 Z_i + u_i\]
  4. Second-stage regression (predict outcome effect): Use predicted proxy effects from step 3 to regress on outcome effects:

    \[\hat{\beta}_{\text{outcome}, i} = \alpha + \gamma \hat{\beta}_{\text{proxy}, i}^{\text{pred}} + \varepsilon_i\]
  5. Interpretation: The coefficient $\gamma$ estimates how changes in the proxy metric relate to changes in the true outcome metric, corrected for measurement error in the proxy.
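A sketch of this procedure, assuming a hypothetical data frame effects with one row per experiment (columns beta_proxy and beta_outcome) and using the AER package's ivreg() for the two-stage least squares step:

```r
library(AER)

n <- nrow(effects)
# Leave-one-out instrument: mean proxy effect across all other experiments.
effects$Z <- (sum(effects$beta_proxy) - effects$beta_proxy) / (n - 1)

# Two-stage least squares: instrument the noisy proxy effect with Z.
loo_iv <- ivreg(beta_outcome ~ beta_proxy | Z, data = effects)
summary(loo_iv)   # the coefficient on beta_proxy estimates gamma
```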

Example

Imagine you run 50 A/B experiments testing UI changes. Each experiment reports:

  • the estimated treatment effect on click-through rate (CTR), the proxy metric
  • the estimated treatment effect on retention, the north star metric

Because CTR estimates are noisy and measured on the same users as retention, simply regressing retention effects on CTR effects underestimates the true relationship. Using the leave-one-out method, for each experiment:

  • compute the average CTR effect across the other 49 experiments and use it as the instrument
  • regress the experiment's CTR effect on this instrument (first stage), then regress the retention effect on the predicted CTR effect (second stage)

This approach produces an unbiased estimate of how well CTR predicts retention changes, helping decide if CTR is a valid proxy for long-term success.

Summary

The leave-one-out instrumental variable method provides a practical and flexible way to validate surrogate or proxy metrics across multiple experiments by correcting for correlated measurement error. It does not require the strong homogeneity assumptions of other methods and is especially useful when experiments are heterogeneous and noise levels vary.


Conclusion

Causal inference is inherently difficult because its core assumptions are untestable. Unlike predictive modeling, where performance can be evaluated against ground truth, causal analyses always carry the risk of bias, whether from unmeasured confounding, model misspecification, or violated identification assumptions. You’ll always get a result, and it may look reasonable, but that doesn’t mean it’s correct. That’s why interpretation, sensitivity analysis, and clear communication about what the result does and does not mean are essential.

In industry, where the pressure to move fast is intense, this kind of careful consideration often gets overlooked. Even if it’s just a quick whiteboard exercise, sketching a causal graph early on can be invaluable. Bring in both product and technical stakeholders to help surface and challenge assumptions, and keep a copy to refer back to. Adding notes on the expected direction of effects (i.e. a signed graph) can also help avoid hindsight bias or post-hoc rationalizations.

A common pushback to this structured approach is: “Shouldn’t we let the data speak for itself instead of assuming?” While tempting, the data alone usually can’t answer causal questions. Observing an association between A and B doesn’t reveal whether A causes B, B causes A, or if both are driven by some unmeasured third factor. In most cases, domain knowledge is essential to build a plausible causal structure. That said, some situations, such as natural experiments, time ordering, or policy changes, can provide useful leverage.

It can be discouraging to realize how rarely we can identify effects with full confidence from observational data. However, even imperfect causal estimates can be valuable. When decisions must be made under uncertainty, thoughtful causal reasoning is far better than ignoring causality altogether.


Appendix

Commonly Confused Terms

| Term | Meaning in One Field | Common Confusion or Contrast |
|---|---|---|
| Marginal effect | In epidemiology/econometrics: average population-level effect | In ML/stats: partial derivative in a GLM or change in outcome for one unit change in input |
| Conditional effect | In ML: subgroup-specific (e.g. CATE) | Often confused with marginal effects; reverse meanings in some contexts |
| Confounder | In causal inference: common cause of treatment and outcome | In industry: sometimes any variable correlated with outcome |
| Collider | Variable caused by both treatment and outcome | Often mistaken for a confounder; adjusting for it introduces bias |
| Interaction | In regression/ML: product term between variables | In causal inference: may imply effect heterogeneity (CATE) |
| Effect modification / Moderation | Heterogeneity of effect by subgroup | Often confused with mediation (i.e., how vs. for whom an effect occurs) |
| Mediation | Effect through an intermediate variable | Mistaken for moderation (effect heterogeneity); needs different methods |
| Instrument / IV | Valid instrument must satisfy exclusion + relevance assumptions | In industry, often misapplied to any "natural variation" |
| Propensity score | Treatment probability used for balancing | Sometimes misused as a predictor or feature in a non-causal model |
| Exchangeability | Assumption that potential outcomes are independent of treatment given covariates | Often confused with covariate balance or random sampling |
| Selection bias | Bias from conditioning on a collider or restricted sample | In industry, used loosely to mean any sample skew |
| Surrogate / Proxy | Stand-in for a hard-to-measure outcome | Risk of invalid inference if it's not a valid surrogate |
| Model | In ML: a predictive system (e.g., neural net) | In causal inference: often an estimation tool based on assumptions |
| Bias | In ML: systematic prediction error | In causal inference: failure to estimate the true causal effect due to confounding or misspecification |

do-calculus

Up to now, I’ve focused on the potential outcomes framework. The do-calculus framework offers a complementary graph-based approach to causal inference, providing formal rules to determine when and how causal effects can be identified from observational data using a causal DAG. The rules rely on altering the graph by “cutting” edges to represent interventions (removing incoming arrows to nodes intervened upon) or blocking paths.

| Rule | Name | Purpose | Graphical Condition | PO Analogy | Effect on Expression |
|---|---|---|---|---|---|
| Rule 1 | Insertion/deletion of observations | Drop or add observed variable from conditioning set | \(Y \perp Z \mid X, W \text{ in } G_{\text{do}(X)}\) | Ignoring irrelevant covariates | \(P(Y \mid do(X), Z, W) = P(Y \mid do(X), W)\) |
| Rule 2 | Action/observation exchange | Replace \(do(Z)\) with conditioning on \(Z\) | \(Y \perp Z \mid X, W \text{ in } G_{\text{do}(X), \overline{Z}}\) | Treating observed covariates as fixed treatments | \(P(Y \mid do(X), do(Z), W) = P(Y \mid do(X), Z, W)\) |
| Rule 3 | Insertion/deletion of actions | Remove unnecessary intervention | \(Y \perp Z \mid X, W \text{ in } G_{\text{do}(X), \overline{Z(W)}}\) | Dropping irrelevant treatment terms | \(P(Y \mid do(X), do(Z), W) = P(Y \mid do(X), W)\) |

Here, \(G_{\text{do}(X)}\) denotes the causal graph after intervening on \(X\) (removing incoming edges into \(X\)), and \(G_{\text{do}(X), \overline{Z}}\) denotes the graph after intervening on \(X\) and removing outgoing edges from \(Z\).

The backdoor and frontdoor criteria are specific applications of do-calculus that give usable adjustment formulas without needing to chain all 3 rules manually:

1. Backdoor Criterion

If there exists a set of variables \(Z\) that blocks all backdoor paths from \(X\) to \(Y\) (paths that end with an arrow into \(X\)), and no member of \(Z\) is a descendant of \(X\), then:

\[P(Y \mid do(X = x)) = \sum_z P(Y \mid X = x, Z = z) P(Z = z)\]

This corresponds to covariate adjustment in the potential outcomes framework (exchangeability conditional on \(Z\)).

2. Frontdoor Criterion

When no set of confounders exists that can block backdoor paths from \(X\) to \(Y\) (e.g., due to unmeasured confounding), but there is a mediator \(M\) that fully transmits the effect of \(X\) on \(Y\), and there are no unmeasured confounders affecting both \(X\) and \(M\) or \(M\) and \(Y\), the frontdoor adjustment formula applies:

\[P(Y \mid do(X = x)) = \sum_m P(M = m \mid X = x) \sum_{x'} P(Y \mid M = m, X = x') P(X = x')\]

This works by first estimating the causal effect of \(X\) on \(M\), then the effect of \(M\) on \(Y\) while holding \(X\) constant, and combining these to recover the unbiased effect of \(X\) on \(Y\). The validity of this approach depends on the absence of confounding between \(X\) and \(M\), and on \(X\) blocking all backdoor paths from \(M\) to \(Y\). Essentially, the frontdoor criterion applies the backdoor adjustment consecutively to these two causal steps. Given it has very strong assumptions and requires a specific causal structure, this is less commonly used in practice.

Assumptions

Do-calculus assumes a correct and complete causal DAG. Omitted variables or misspecified directions can lead to wrong conclusions. It cannot fix issues with measurement error, selection bias unrelated to the graph structure, or violations of consistency. Several packages (like DoWhy or CausalInference in Python) implement do-calculus identification algorithms to automate these calculations.


References / further reading