Learning to Rank#

Learning to Rank (LTR) is the application of supervised machine learning to train a ranking model from labelled query–document pairs. It bridges classical IR (where relevance functions are hand-crafted) and modern neural approaches.

Problem framing#

Given a query q and a candidate set of documents {d_1, …, d_n}, learn a scoring function f(q, d) such that sorting by f maximises a ranking metric (NDCG, MAP, MRR) on held-out queries.
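For concreteness, here is a minimal sketch of NDCG@k (function names are illustrative; it uses the exponential gain \(2^{\text{rel}} - 1\) discussed later in this section):

```python
import numpy as np

def dcg_at_k(rels, k):
    """DCG: exponential gain 2^rel - 1 with a log2 positional discount."""
    rels = np.asarray(rels, dtype=float)[:k]
    gains = 2.0 ** rels - 1.0
    discounts = np.log2(np.arange(2, rels.size + 2))
    return float(np.sum(gains / discounts))

def ndcg_at_k(rels, k):
    """DCG normalised by the ideal (descending-sorted) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Graded relevances in the order the model ranked the documents:
print(ndcg_at_k([3, 2, 0, 1], k=4))  # ~0.99: only one swap in the tail
```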

Three paradigms#

| Paradigm | Loss | Typical model | Benchmark dataset |
| --- | --- | --- | --- |
| Pointwise | Regression / classification on the relevance label | Gradient boosted trees | |
| Pairwise | Prefer \(d_i\) over \(d_j\) when \(\text{rel}(d_i) > \text{rel}(d_j)\) | RankSVM, RankNet | LETOR |
| Listwise | Directly optimise a list-level metric | LambdaMART, ListNet | MSLR-WEB10K |

Listwise methods — especially LambdaMART — dominate classic LTR benchmarks and are still the industry default for production ranking systems.
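To make the pairwise row concrete, here is a minimal sketch of a RankNet-style logistic pair loss (the names and the σ value are illustrative):

```python
import numpy as np

def pairwise_logistic_loss(s_i, s_j, sigma=1.0):
    """RankNet-style loss for a pair where document i is the more
    relevant one: small when s_i >> s_j, large when the pair is
    inverted."""
    return float(np.log1p(np.exp(-sigma * (s_i - s_j))))

print(pairwise_logistic_loss(2.0, 0.5))  # ~0.20, pair correctly ordered
print(pairwise_logistic_loss(0.5, 2.0))  # ~1.70, pair inverted
```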

Relevance labels: getting them right#

The quality of your relevance labels is the single most important factor in training a good LTR model. A surprisingly common mistake is using rank-derived fractional relevance instead of magnitude-based integer buckets, and it silently sabotages model training.

The anti-pattern: rank-based fractions#

Consider a scheme like:

```
relevance = (n_booked - rank + 1) / n_booked
```

For a listing with three bookings, this maps the first-booked item to 1.0, the second to 0.67, and the third to 0.33, regardless of when they were booked. Two very different scenarios become indistinguishable:

| Scenario | Bookings | Rank-derived relevances |
| --- | --- | --- |
| A — last-minute market | 3, 2, 1 days before arrival | 1.0, 0.67, 0.33 |
| B — high-demand market | 120, 45, 5 days before arrival | 1.0, 0.67, 0.33 |

The model gets no signal about absolute demand strength. A panicked last-minute price-slash looks functionally identical to a premium booking secured months in advance.
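A two-line check makes the information loss explicit: the labels depend only on rank, so both scenarios produce the same vector.

```python
n_booked = 3  # both scenarios have three bookings

# The same rank-derived labels come out regardless of lead time:
labels = [(n_booked - rank + 1) / n_booked for rank in (1, 2, 3)]
print(labels)  # [1.0, 0.67, 0.33] (rounded) for Scenario A and B alike
```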

Why fractions break NDCG gradients#

LambdaMART and other listwise algorithms derive their update signal from the NDCG gain formula \(\text{Gain} = 2^{\text{rel}} - 1\). The exponential is intentional: it creates large gradients that force the model to aggressively correct misordered highly-relevant items.

With fractional labels in \([0, 1]\) the gain spread collapses:

| Relevance scheme | Best gain | Worst gain | Spread |
| --- | --- | --- | --- |
| Fractions (0–1) | \(2^{1.0} - 1 = 1.0\) | \(2^{0.33} - 1 \approx 0.26\) | ≈ 0.74 |
| Integer buckets (0–4) | \(2^4 - 1 = 15\) | \(2^1 - 1 = 1\) | 14 |

A model trained on fractions therefore sees a penalty roughly 18× smaller for misordering its most and least relevant items. It will not try hard to fix misordered results: this is gradient starvation.
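The collapse is easy to verify directly; the integer labels below are the Scenario B labels produced by the fix in the next subsection.

```python
gain = lambda rel: 2 ** rel - 1

for scheme in ([1.0, 0.67, 0.33],   # rank-derived fractions
               [4, 3, 1]):          # absolute integer buckets
    gains = [gain(r) for r in scheme]
    print(gains, "spread:", max(gains) - min(gains))
# [1.0, 0.59, 0.26]  spread: 0.74   (rounded)
# [15, 7, 1]         spread: 14
```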

The fix: absolute integer buckets#

Replace the rank-based fraction with a business-driven bucketing of the underlying continuous signal (e.g. booking lead time in days):

```python
import numpy as np
import pandas as pd

bins   = [-np.inf, 0, 14, 30, 90, np.inf]
labels = [0, 1, 2, 3, 4]
# 0 = Unbooked / past
# 1 = Last minute     (1–14 days)
# 2 = Short window    (15–30 days)
# 3 = Standard window (31–90 days)
# 4 = High demand     (90+ days)

df['relevance'] = pd.cut(
    df['lead_time_days'].fillna(-1),   # unbooked rows fall into bucket 0
    bins=bins,
    labels=labels,
).astype(int)
```

With this scheme Scenario A yields [1, 1, 1] — the model correctly learns “all of these are weak last-minute bookings” — while Scenario B yields [4, 3, 1], giving the model rich gradient signal to learn what drives premium demand.

Static buckets vs. percentile (market-aware) bucketing#

Static thresholds (like the 90-day cutoff above) assume all markets behave similarly. In a heterogeneous marketplace this breaks down:

| Market | Typical booking window | Static bucket for a 7-day booking |
| --- | --- | --- |
| Ski resort in Aspen | 6–9 months ahead | 1 (last-minute) |
| Business hotel in Manhattan | 2–5 days ahead | 1 (last-minute) |

With static buckets a 7-day booking lands in bucket 1 in both markets, even though it is among the best outcomes the Manhattan hotel ever sees. The Manhattan comp-set never produces a 3 or 4, so gradient starvation returns for that entire city.

Percentile (quantile) bucketing solves this by computing bucket boundaries relative to each market’s historical distribution:

```python
# labels=False returns integer bucket codes (0 = shortest lead times,
# 4 = longest). With an explicit label list, pd.qcut raises when
# duplicate quantile edges are dropped in low-variance markets.
df['relevance'] = (
    df.groupby('market_id')['lead_time_days']
      .transform(lambda x: pd.qcut(x, q=5, labels=False,
                                   duplicates='drop'))
      .fillna(0)        # rows with no lead time fall into bucket 0
      .astype(int)
)
```

Now a 7-day booking in Manhattan that is in the top 20% for that market receives a 4 — the same gradient weight as a 180-day Aspen booking. Every comp-set always has a winner and a loser, maximising the learning signal across all markets.

When to prefer static buckets: your business objective is explicitly absolute (e.g. “only care about bookings locked in 90+ days in advance”), your market is highly homogeneous, or computing per-market quantiles is operationally infeasible.

When to prefer percentile bucketing: you operate across diverse geographies or categories and want the model to learn relative demand dynamics within each context.

LambdaMART#

LambdaMART combines:

  • Lambda gradients — pseudo-gradients derived from pairwise preference swaps, weighted by the change in NDCG they would produce.

  • MART (Multiple Additive Regression Trees) — gradient boosted decision trees (GBDT).
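For a pair \((i, j)\) with \(\text{rel}(d_i) > \text{rel}(d_j)\) and current model scores \(s_i, s_j\), the lambda pseudo-gradient in Burges' formulation is

\[
\lambda_{ij} = \frac{-\sigma}{1 + e^{\sigma (s_i - s_j)}}\,\bigl|\Delta \text{NDCG}_{ij}\bigr|,
\]

where \(|\Delta \text{NDCG}_{ij}|\) is the change in NDCG from swapping the two documents. The exponential gain enters through this swap term, which is why the label scale discussed above matters so much.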

Implementations: lightgbm (objective='lambdarank'), xgboost (objective='rank:ndcg'), ranklib.

```python
import lightgbm as lgb

params = {
    "objective": "lambdarank",
    "metric": "ndcg",
    "ndcg_eval_at": [1, 5, 10],
    "num_leaves": 64,
    "learning_rate": 0.05,
}

# `group_train` lists the number of documents per query,
# in the same order as the rows of X_train.
train_data = lgb.Dataset(X_train, label=y_train, group=group_train)
model = lgb.train(params, train_data, num_boost_round=200)
```

Neural rankers#

Cross-encoders (e.g. fine-tuned BERT) jointly encode query + document and output a relevance score. They are more accurate than bi-encoders but far slower (no pre-indexing), so they are typically used as a re-ranker on the top-k candidates returned by a first-stage retriever (BM25 or dense).

query + candidate → [CLS] token → linear head → relevance score
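As a minimal sketch, re-ranking with the sentence-transformers CrossEncoder API (the checkpoint, query, and candidates are illustrative):

```python
from sentence_transformers import CrossEncoder

# One public MS MARCO cross-encoder checkpoint; any compatible
# checkpoint works the same way.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "cheap flights to tokyo"
candidates = ["Tokyo flight deals", "Hotels in Kyoto", "Tokyo travel guide"]

# Each (query, document) pair is encoded jointly and scored.
scores = model.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda p: -p[1])
```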

Two-stage pipeline#

```text
Stage 1 — Retrieval (fast, high recall)
  BM25 or dense retriever → top-1000 candidates

Stage 2 — Re-ranking (slow, high precision)
  LambdaMART or cross-encoder → final top-10
```
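Glued together, the two stages might look like the sketch below, assuming the rank_bm25 package for stage 1, a raw-text list `documents`, and the cross-encoder `model` from the previous snippet:

```python
from rank_bm25 import BM25Okapi

# Stage 1: cheap lexical retrieval over the whole corpus.
bm25 = BM25Okapi([doc.split() for doc in documents])
candidates = bm25.get_top_n(query.split(), documents, n=1000)

# Stage 2: expensive neural scoring over the shortlist only.
scores = model.predict([(query, doc) for doc in candidates])
top_10 = [doc for doc, _ in
          sorted(zip(candidates, scores), key=lambda p: -p[1])[:10]]
```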

Custom loss functions#

The existing notebook (Learning to rank/LTR with custom loss function.ipynb) demonstrates how to implement a bespoke ranking loss in a GBDT framework, allowing direct optimisation of business-specific objectives beyond standard NDCG.

Key ideas:

  • Replace lambda gradients with domain-specific pair weights (the gradient/hessian plumbing is sketched after this list).

  • Incorporate business constraints (diversity, freshness, inventory) into the loss.

  • Evaluate offline (NDCG) and online (CTR, conversion) in tandem.
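As a minimal sketch of that plumbing, here is a custom objective that uses a label-weighted squared error as a stand-in for real business-specific pair weights. It assumes LightGBM ≥ 4.0, where lgb.train accepts a callable objective in params with signature (preds, train_data), and reuses `train_data` from the LambdaMART snippet; the weighting scheme itself is hypothetical.

```python
import numpy as np
import lightgbm as lgb

def weighted_pointwise_obj(preds, train_data):
    """Stand-in custom objective: squared error pulling each score
    towards its label, up-weighted for high-relevance documents so
    their mistakes cost more. A real system would weight by revenue,
    freshness, inventory, etc."""
    y = train_data.get_label()
    w = 1.0 + y                      # hypothetical domain weight
    grad = w * (preds - y)
    hess = w * np.ones_like(preds)
    return grad, hess

params = {
    "objective": weighted_pointwise_obj,  # callable instead of a string
    "metric": "ndcg",
    "ndcg_eval_at": [10],
}
model = lgb.train(params, train_data, num_boost_round=200)
```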

Public datasets#

| Dataset | Queries | Docs per query | Relevance |
| --- | --- | --- | --- |
| MSLR-WEB10K | 10,000 | ~120 | 0–4 graded |
| MSLR-WEB30K | 30,000 | ~120 | 0–4 graded |
| Yahoo! LTR (C14) | 29,921 | ~24 | 0–4 graded |
| ISTELLA | 33,018 | ~103 | 0–4 graded |

The MSLR-WEB10K dataset is already present in applied_data_science_book/Learning to rank/MSLR-WEB10K/.
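MSLR files use the LETOR/SVMlight format (rel qid:n 1:v 2:v …), which scikit-learn parses directly. A minimal loading sketch, assuming the standard five-fold layout inside that directory:

```python
import numpy as np
from sklearn.datasets import load_svmlight_file

X, y, qid = load_svmlight_file(
    "applied_data_science_book/Learning to rank/MSLR-WEB10K/Fold1/train.txt",
    query_id=True,
)

# Rows for each query are contiguous and qids ascend in MSLR files,
# so the sorted unique counts double as LightGBM group sizes.
_, group = np.unique(qid, return_counts=True)
```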