Causal Inference#

“Correlation is not causation.” — Every statistician, always.

Standard machine learning excels at prediction: given features \(X\), estimate \(E[Y|X]\) as accurately as possible. But many of the most important business questions are causal: not “which customers are likely to churn?” but “for which customers does sending a retention offer actually reduce churn?”

These are fundamentally different questions. A predictive model that tells you a customer has a 90% churn probability gives you no information about whether your intervention will help. Causal inference provides the framework to answer the intervention question directly.


Why Causal Inference in Applied Data Science?#

In industry, decisions are actions — discounts, emails, drugs, policy changes. The relevant quantity is always the effect of the action, not the baseline prediction. Causal inference methods allow us to:

  • Estimate the effect of a treatment or intervention from observational data (without running an expensive randomized trial).

  • Identify which individuals benefit most from an intervention — not just whether it works on average.

  • Avoid selection bias: the systematic difference between who gets treated and who doesn’t in the real world.


Key Concepts#

Potential Outcomes Framework#

For each individual \(i\), define two potential outcomes:

  • \(Y_i(1)\): the outcome if individual \(i\) is treated (\(T=1\)).

  • \(Y_i(0)\): the outcome if individual \(i\) is not treated (\(T=0\)).

The Individual Treatment Effect (ITE) is:

\[\tau_i = Y_i(1) - Y_i(0)\]

The fundamental problem of causal inference is that we only ever observe one of these — we never see both \(Y_i(1)\) and \(Y_i(0)\) for the same person at the same time. Causal inference methods are strategies for estimating the unobserved counterfactual.
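Because the counterfactual is never observed in real data, simulation is the one setting where both potential outcomes can be seen side by side. A minimal sketch (all numbers illustrative), assuming a constant true effect of 2:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Simulated ground truth: BOTH potential outcomes for each individual.
# In real data we never observe both -- only a simulation allows this.
y0 = rng.normal(10, 1, n)      # outcome without treatment, Y(0)
y1 = y0 + 2.0                  # outcome with treatment, Y(1); true ITE = 2
t = rng.integers(0, 2, n)      # treatment assignment

# What we actually observe: exactly one potential outcome per person.
y_obs = np.where(t == 1, y1, y0)

true_ite = y1 - y0             # knowable only because this is a simulation
print(true_ite)                # [2. 2. 2. 2. 2.]
```

The line defining `y_obs` is the fundamental problem in code: whichever arm an individual lands in, the other entry of `(y0, y1)` is discarded forever.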

Average Treatment Effect (ATE) vs. CATE#

| Quantity | Formula | Question answered |
|----------|---------|-------------------|
| ATE | \(E[\tau_i] = E[Y(1) - Y(0)]\) | Does the treatment work on average? |
| ATT | \(E[\tau_i \mid T=1]\) | Does it work for those who were treated? |
| CATE | \(\tau(x) = E[Y(1) - Y(0) \mid X=x]\) | Does it work for individuals like this? |

The Conditional Average Treatment Effect (CATE) is the workhorse of personalized decision-making. Instead of a single number, CATE is a function of individual features \(X\) — it tells you the expected effect for a person with a particular profile.
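The ATE/CATE distinction can be made concrete with a small simulation (segment names and effect sizes are illustrative): a randomized treatment whose true effect differs by segment. The overall difference in means estimates the ATE, while per-segment differences recover the CATE:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.integers(0, 2, n)            # binary feature (e.g. customer segment)
t = rng.integers(0, 2, n)            # randomized treatment
tau = np.where(x == 1, 3.0, 1.0)     # true CATE: 3 for x=1, 1 for x=0
y = 5 + 2 * x + tau * t + rng.normal(0, 1, n)

# ATE: overall difference in means (valid because t is randomized).
ate_hat = y[t == 1].mean() - y[t == 0].mean()

# CATE: the same contrast, computed within each value of x.
cate_hat = {
    v: y[(t == 1) & (x == v)].mean() - y[(t == 0) & (x == v)].mean()
    for v in (0, 1)
}
print(round(ate_hat, 2))                              # ≈ 2.0
print({k: round(v, 2) for k, v in cate_hat.items()})  # ≈ {0: 1.0, 1: 3.0}
```

The single ATE of roughly 2 masks the fact that one segment benefits three times as much as the other; targeting on CATE is what makes the decision personalized.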

Confounding and Selection Bias#

In a randomized controlled trial (RCT), treatment is assigned randomly, so \(T\) is independent of \(Y(0)\) and \(Y(1)\). In observational data, this is rarely true. Customers who receive a loyalty email may have already been more likely to book. Patients who get a drug may already be sicker. This confounding biases naive comparisons.
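The sicker-patients example can be simulated directly (all effect sizes illustrative). A naive treated-vs-untreated comparison is badly biased, while adjusting for the observed confounder recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
sick = rng.integers(0, 2, n)              # confounder: baseline sickness
# Sicker patients are far more likely to receive the drug (selection bias).
t = rng.random(n) < np.where(sick == 1, 0.8, 0.2)
# True treatment effect is +1; sickness lowers the outcome by 3.
y = 10 - 3 * sick + 1 * t + rng.normal(0, 1, n)

# Naive comparison mixes the treatment effect with the confounding.
naive = y[t].mean() - y[~t].mean()

# Adjust by stratifying on the confounder, then averaging the per-stratum
# contrasts over the population distribution of the confounder.
strata = [y[t & (sick == v)].mean() - y[~t & (sick == v)].mean() for v in (0, 1)]
adjusted = np.average(strata, weights=[np.mean(sick == 0), np.mean(sick == 1)])

print(round(naive, 2))      # ≈ -0.8: the drug looks harmful
print(round(adjusted, 2))   # ≈  1.0: true effect recovered
```

Stratification works here only because the lone confounder is observed and discrete; the identification assumptions below generalize this idea to richer \(X\).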

Causal methods handle confounding through a set of identification assumptions — most commonly:

  • Unconfoundedness (Ignorability): \((Y(0), Y(1)) \perp T \mid X\) — conditional on observed features, treatment is as-good-as-random.

  • Overlap (Positivity): \(0 < P(T=1 \mid X) < 1\) — every individual has a nonzero chance of being in either group.
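Unconfoundedness is untestable from data alone, but overlap can be checked empirically by estimating \(P(T=1 \mid X)\) and flagging strata where it collapses to 0 or 1. A minimal sketch with a single discrete feature (values illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
x = rng.integers(0, 3, n)                 # discrete feature with 3 levels
p_true = np.array([0.5, 0.9, 0.0])[x]     # level 2 is never treated
t = rng.random(n) < p_true

# Empirical propensity per level: overlap fails where P(T=1|x) hits 0 or 1.
for level in range(3):
    p_hat = t[x == level].mean()
    flag = "" if 0 < p_hat < 1 else "  <-- overlap violated"
    print(f"x={level}: P(T=1|x) = {p_hat:.2f}{flag}")
```

For level 2 there are no treated units at all, so no method can estimate the effect there without extrapolation; in practice one either trims such strata or reports that the estimand is undefined for them.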

Because \(X\) carries so much weight — it must contain confounders for validity, effect modifiers for CATE resolution, and must exclude bad controls that introduce bias — choosing what to put in \(X\) is one of the most consequential decisions in any causal analysis. See What Goes Into X? Feature Roles in Causal Inference for a detailed breakdown.


The Meta-Learner Family#

A meta-learner is a strategy that uses standard supervised learning models as building blocks to estimate CATE. Rather than inventing a new algorithm from scratch, meta-learners orchestrate existing models (gradient boosting, random forests, neural networks) to produce causal estimates.

The four main meta-learners differ in how they use the treatment variable:

| Learner | Core idea | Key limitation |
|---------|-----------|----------------|
| S-Learner | One model with \(T\) as a feature | May ignore \(T\) if its signal is weak |
| T-Learner | Separate models for treated/control | High variance when one group is small |
| X-Learner | Imputes individual effects; iterates | Complex; adds assumptions |
| DR-Learner | Combines outcome models + propensity weighting; robust to misspecification of one nuisance model | Unstable when propensity scores are near 0 or 1 |
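As a sketch of how meta-learners orchestrate base models, here are minimal S- and T-learners using ordinary least squares as the (illustrative) base learner; with a linear base model, the S-learner needs an \(X \times T\) interaction term to express heterogeneous effects at all:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
x = rng.normal(size=n)
t = rng.integers(0, 2, n).astype(float)   # randomized treatment
tau = 1.0 + 2.0 * x                       # true CATE varies with x
y = 3.0 + x + tau * t + rng.normal(0, 1, n)

def ols(features, target):
    """Least-squares fit; returns the coefficient vector."""
    return np.linalg.lstsq(features, target, rcond=None)[0]

x_grid = np.array([-1.0, 0.0, 1.0])       # profiles to score

# S-learner: ONE model with T (and an x*T interaction) as features.
beta = ols(np.column_stack([np.ones(n), x, t, x * t]), y)
s_cate = beta[2] + beta[3] * x_grid       # effect of flipping T from 0 to 1

# T-learner: SEPARATE models for treated and control, then difference.
def fit_group(mask):
    return ols(np.column_stack([np.ones(mask.sum()), x[mask]]), y[mask])

b1, b0 = fit_group(t == 1), fit_group(t == 0)
t_cate = (b1[0] - b0[0]) + (b1[1] - b0[1]) * x_grid

print(np.round(s_cate, 2))  # ≈ [-1.  1.  3.], the true 1 + 2x
print(np.round(t_cate, 2))  # ≈ [-1.  1.  3.]
```

Both recover the true CATE here because the data is simulated and randomized; the table's limitations bite when the base learner regularizes away \(T\) (S-learner) or one arm is tiny (T-learner).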

Meta-learners are not limited to tabular data. When causal variables are embedded in unstructured text — support tickets, clinical notes, customer emails — LLMs can extract or generate the structured inputs these estimators need. See Causal Inference in NLP and LLMs for the full treatment of Causal NLP.


Sections in This Chapter#