# Learning to Rank
Learning to Rank (LTR) is the application of supervised machine learning to train a ranking model from labelled query–document pairs. It bridges classical IR (where relevance functions are hand-crafted) and modern neural approaches.
## Problem framing
Given a query \(q\) and a candidate set of documents \(\{d_1, \dots, d_n\}\), learn a scoring function \(f(q, d)\) such that sorting by \(f\) maximises a ranking metric (NDCG, MAP, MRR) on held-out queries.
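For reference, here is a minimal NDCG implementation using the \(2^{\text{rel}} - 1\) gain discussed later in this section (a sketch; in production you would reach for a library routine such as scikit-learn's `ndcg_score`):

```python
import numpy as np

def dcg_at_k(rels, k):
    """Discounted cumulative gain with the 2**rel - 1 gain."""
    rels = np.asarray(rels, dtype=float)[:k]
    discounts = np.log2(np.arange(2, len(rels) + 2))  # positions 1..k -> log2(2..k+1)
    return float(np.sum((2.0 ** rels - 1.0) / discounts))

def ndcg_at_k(rels, k):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# A ranking that buries a label-4 document at position 3:
print(ndcg_at_k([2, 0, 4, 1], k=4))  # ~0.63
```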
## Three paradigms
| Paradigm | Loss | Typical model | Benchmark dataset |
|---|---|---|---|
| Pointwise | Regression / classification on the relevance label | Gradient boosted trees | — |
| Pairwise | Prefer the more relevant document in each pair | RankSVM, RankNet | LETOR |
| Listwise | Directly optimise a list-level metric | LambdaMART, ListNet | MSLR-WEB10K |
Listwise methods — especially LambdaMART — dominate classic LTR benchmarks and are still the industry default for production ranking systems.
## Relevance labels: getting them right
The quality of your relevance labels is the single most important factor in training a good LTR model. A surprisingly common mistake is using rank-derived fractional relevance instead of magnitude-based integer buckets, and it silently sabotages model training.
### The anti-pattern: rank-based fractions
Consider a scheme like:
```python
relevance = (n_booked - rank + 1) / n_booked
```
This maps the first-booked item to 1.0, the second to 0.67, the third to 0.33 (here with n_booked = 3), and so on, regardless of when the bookings were made. Two very different scenarios become indistinguishable:
| Scenario | Bookings | Rank-derived relevances |
|---|---|---|
| A (last-minute market) | 3, 2, 1 days before arrival | 1.0, 0.67, 0.33 |
| B (high-demand market) | 120, 45, 5 days before arrival | 1.0, 0.67, 0.33 |
The model gets no signal about absolute demand strength. A panicked last-minute price-slash looks functionally identical to a premium booking secured months in advance.
### Why fractions break NDCG gradients
LambdaMART and other listwise algorithms derive their update signal from the NDCG gain formula \(\text{Gain} = 2^{\text{rel}} - 1\). The exponential is intentional: it creates large gradients that force the model to aggressively correct misordered highly-relevant items.
With fractional labels in \([0, 1]\) the gain spread collapses:
| Relevance scheme | Best gain | Worst gain | Spread |
|---|---|---|---|
| Fractions (0–1) | \(2^{1.0} - 1 = 1.0\) | \(2^{0.33} - 1 \approx 0.26\) | ≈ 0.74 |
| Integer buckets (0–4) | \(2^{4} - 1 = 15\) | \(2^{1} - 1 = 1\) | 14 |
A model trained on fractions therefore sees a penalty nearly 19× smaller (a spread of 14 versus roughly 0.74) for misordering its most relevant result. It will not try hard to fix misordered items; this is gradient starvation.
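The arithmetic behind the table is easy to verify directly:

```python
def gain(rel):
    return 2.0 ** rel - 1.0

frac_spread = gain(1.0) - gain(0.33)  # ~0.74 with fractional labels
int_spread = gain(4) - gain(1)        # 14.0 with integer buckets
print(int_spread / frac_spread)       # ~19x larger gradient spread
```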
### The fix: absolute integer buckets
Replace the rank-based fraction with a business-driven bucketing of the underlying continuous signal (e.g. booking lead time in days):
```python
import pandas as pd
import numpy as np

# Absolute integer buckets over the continuous signal (booking lead time).
bins = [-np.inf, 0, 14, 30, 90, np.inf]
labels = [0, 1, 2, 3, 4]
# 0 = Unbooked / past
# 1 = Last minute (1–14 days)
# 2 = Short window (15–30 days)
# 3 = Standard window (31–90 days)
# 4 = High demand (91+ days)

df['relevance'] = pd.cut(
    df['lead_time_days'].fillna(-1),  # unbooked rows fall into bucket 0
    bins=bins,
    labels=labels,
).astype(int)
```
With this scheme Scenario A yields [1, 1, 1] — the model correctly learns
“all of these are weak last-minute bookings” — while Scenario B yields
[4, 3, 1], giving the model rich gradient signal to learn what drives premium
demand.
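A quick toy check (hypothetical dataframe, reusing `bins` and `labels` from the snippet above):

```python
toy = pd.DataFrame({
    'scenario': ['A', 'A', 'A', 'B', 'B', 'B'],
    'lead_time_days': [3, 2, 1, 120, 45, 5],
})
toy['relevance'] = pd.cut(toy['lead_time_days'], bins=bins, labels=labels).astype(int)
print(toy.groupby('scenario')['relevance'].apply(list))
# scenario A -> [1, 1, 1], scenario B -> [4, 3, 1]
```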
### Static buckets vs. percentile (market-aware) bucketing
Static thresholds (like the 90-day cutoff above) assume all markets behave similarly. In a heterogeneous marketplace this breaks down:
| Market | Typical booking window | Static bucket for a 7-day booking |
|---|---|---|
| Ski resort in Aspen | 6–9 months ahead | 1 (last-minute) |
| Business hotel in Manhattan | 2–5 days ahead | 1 (last-minute) |
With static buckets, the Manhattan comp-set's best bookings land in the same bucket as Aspen's genuinely weak ones, and it never sees a 3 or 4: gradient starvation returns for that entire city.
Percentile (quantile) bucketing solves this by computing bucket boundaries relative to each market’s historical distribution:
```python
df['relevance'] = (
    df.groupby('market_id')['lead_time_days']
      # labels=False returns integer bin codes and, unlike an explicit label
      # list, stays valid when duplicates='drop' merges tied quantile edges.
      .transform(lambda x: pd.qcut(x, q=5, labels=False, duplicates='drop'))
      .fillna(0)   # unbooked rows (NaN lead time) fall back to the lowest grade
      .astype(int)
)
```
Now a 7-day booking in Manhattan that is in the top 20% for that market
receives a 4 — the same gradient weight as a 180-day Aspen booking. Every
comp-set always has a winner and a loser, maximising the learning signal across
all markets.
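A worthwhile sanity check after any bucketing scheme (on the hypothetical dataframe above) is that every market actually spans the full label range:

```python
# Markets whose max grade is below 4 (or with few distinct grades) are
# still label-compressed and will see weak gradients.
print(df.groupby('market_id')['relevance'].agg(['min', 'max', 'nunique']))
```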
- **When to prefer static buckets:** your business objective is explicitly absolute (e.g. “only care about bookings locked in 90+ days in advance”), your market is highly homogeneous, or computing per-market quantiles is operationally infeasible.
- **When to prefer percentile bucketing:** you operate across diverse geographies or categories and want the model to learn relative demand dynamics within each context.
## LambdaMART
LambdaMART combines:

- **Lambda gradients**: pseudo-gradients derived from pairwise preference swaps, weighted by the change in NDCG they would produce.
- **MART** (Multiple Additive Regression Trees): gradient boosted decision trees (GBDT).
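Concretely, for a document pair \((i, j)\) where \(i\) is more relevant than \(j\), the lambda gradient in Burges' formulation is the RankNet gradient scaled by the NDCG change a swap of the two would produce:

\[
\lambda_{ij} = \frac{-\sigma}{1 + e^{\sigma (s_i - s_j)}} \, \bigl|\Delta \text{NDCG}_{ij}\bigr|
\]

where \(s_i, s_j\) are the current model scores and \(\sigma\) is a shape parameter.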
Implementations: `lightgbm` (`objective='lambdarank'`), `xgboost` (`objective='rank:ndcg'`), `ranklib`.
```python
import lightgbm as lgb

params = {
    "objective": "lambdarank",
    "metric": "ndcg",
    "ndcg_eval_at": [1, 5, 10],
    "num_leaves": 64,
    "learning_rate": 0.05,
}
train_data = lgb.Dataset(X_train, label=y_train, group=group_train)
model = lgb.train(params, train_data, num_boost_round=200)
```
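The `group` argument tells LightGBM how many consecutive rows of `X_train` belong to each query, so rows must be sorted by query. A hypothetical construction from a dataframe with a `query_id` column:

```python
# sort=False keeps groups in order of appearance, matching the row order of X_train.
group_train = df_train.groupby('query_id', sort=False).size().to_numpy()
```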
## Neural rankers
Cross-encoders (e.g. fine-tuned BERT) jointly encode query + document and output a relevance score. They are more accurate than bi-encoders but far slower (no pre-indexing), so they are typically used as a re-ranker on the top-k candidates returned by a first-stage retriever (BM25 or dense).
```
query + candidate → [CLS] token → linear head → relevance score
```
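A minimal re-ranking sketch using the sentence-transformers `CrossEncoder` API (the checkpoint is one public MS MARCO model; `query` and `candidates` are assumed to come from a first-stage retriever):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Score each (query, document) pair jointly, then sort descending by score.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates),
                                     key=lambda pair: pair[0], reverse=True)]
```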
## Two-stage pipeline
- **Stage 1, retrieval (fast, high recall):** BM25 or dense retriever → top-1000 candidates.
- **Stage 2, re-ranking (slow, high precision):** LambdaMART or cross-encoder → final top-10.
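Gluing the stages together, a sketch assuming the `rank_bm25` package for stage 1 and the `reranker` from the previous snippet for stage 2 (`corpus` is a hypothetical list of document strings):

```python
from rank_bm25 import BM25Okapi

# Stage 1: cheap lexical retrieval over the full corpus, tuned for recall.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
bm25_scores = bm25.get_scores(query.lower().split())
top_1000 = sorted(range(len(corpus)), key=lambda i: bm25_scores[i], reverse=True)[:1000]

# Stage 2: expensive cross-encoder on the shortlist only, tuned for precision.
rerank_scores = reranker.predict([(query, corpus[i]) for i in top_1000])
final_top_10 = [corpus[i] for i, _ in sorted(zip(top_1000, rerank_scores),
                                             key=lambda pair: pair[1], reverse=True)[:10]]
```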
## Custom loss functions
The existing notebook (`Learning to rank/LTR with custom loss function.ipynb`) demonstrates how to implement a bespoke ranking loss in a GBDT framework, allowing direct optimisation of business-specific objectives beyond standard NDCG.
Key ideas:

- Replace lambda gradients with domain-specific pair weights (sketched below).
- Incorporate business constraints (diversity, freshness, inventory) into the loss.
- Evaluate offline (NDCG) and online (CTR, conversion) in tandem.
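A minimal sketch of the first idea, assuming LightGBM ≥ 4.0 (where a callable can be passed as the `objective` in `params`) and reusing `train_data` from the snippet above. It implements a plain pairwise logistic objective; the pair weight `w` is the hook where stock lambdarank's \(|\Delta \text{NDCG}|\) factor would be replaced by a domain-specific weight:

```python
import numpy as np
import lightgbm as lgb

def pairwise_logistic_objective(preds, dataset):
    """Toy pairwise logistic gradient, computed within each query group."""
    y = dataset.get_label()
    grad = np.zeros_like(preds)
    hess = np.zeros_like(preds)
    start = 0
    for size in dataset.get_group():        # documents per query
        idx = range(start, start + int(size))
        for i in idx:
            for j in idx:
                if y[i] > y[j]:             # document i should outrank document j
                    p = 1.0 / (1.0 + np.exp(preds[i] - preds[j]))
                    w = 1.0                 # <- domain-specific pair weight goes here
                    grad[i] -= w * p
                    grad[j] += w * p
                    hess[i] += w * p * (1.0 - p)
                    hess[j] += w * p * (1.0 - p)
        start += int(size)
    return grad, hess

custom_params = {"objective": pairwise_logistic_objective,
                 "num_leaves": 64, "learning_rate": 0.05}
custom_model = lgb.train(custom_params, train_data, num_boost_round=200)
```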
## Public datasets
| Dataset | Queries | Docs per query | Relevance |
|---|---|---|---|
| MSLR-WEB10K | 10,000 | ~120 | 0–4 graded |
| MSLR-WEB30K | 30,000 | ~120 | 0–4 graded |
| Yahoo! LTR (C14) | 29,921 | ~24 | 0–4 graded |
| ISTELLA | 33,018 | ~103 | 0–4 graded |
The MSLR-WEB10K dataset is already present in `applied_data_science_book/Learning to rank/MSLR-WEB10K/`.