Learning to Rank#

Learning to Rank (LTR) is the application of supervised machine learning to train a ranking model from labelled query–document pairs. It bridges classical IR (where relevance functions are hand-crafted) and modern neural approaches.

Problem framing#

Given a query q and a candidate set of documents {d_1, …, d_n}, learn a scoring function f(q, d) such that sorting by f maximises a ranking metric (NDCG, MAP, MRR) on held-out queries.
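For concreteness, here is a minimal sketch of NDCG@k (function names are illustrative; it uses the exponential gain \(2^{\text{rel}} - 1\) discussed later in this section):

```python
import numpy as np

def dcg_at_k(rels, k):
    """DCG: exponential gain 2^rel - 1 with a log2 positional discount."""
    rels = np.asarray(rels, dtype=float)[:k]
    gains = 2.0 ** rels - 1.0
    discounts = np.log2(np.arange(2, rels.size + 2))
    return float(np.sum(gains / discounts))

def ndcg_at_k(rels, k):
    """DCG normalised by the ideal (descending-sorted) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Graded relevances in the order the model ranked the documents:
print(ndcg_at_k([3, 2, 0, 1], k=4))  # ~0.99: only one swap in the tail
```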

Three paradigms#

| Paradigm | Loss | Typical model | Benchmark dataset |
| --- | --- | --- | --- |
| Pointwise | Regression / classification on the relevance label | Gradient boosted trees | |
| Pairwise | Prefer \(d_i\) over \(d_j\) when \(\text{rel}(d_i) > \text{rel}(d_j)\) | RankSVM, RankNet | LETOR |
| Listwise | Directly optimise a list-level metric | LambdaMART, ListNet | MSLR-WEB10K |

Listwise methods — especially LambdaMART — dominate classic LTR benchmarks and are still the industry default for production ranking systems.
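To make the pairwise row concrete, here is a minimal sketch of a RankNet-style logistic pair loss (the names and the σ value are illustrative):

```python
import numpy as np

def pairwise_logistic_loss(s_i, s_j, sigma=1.0):
    """RankNet-style loss for a pair where document i is the more
    relevant one: small when s_i >> s_j, large when the pair is
    inverted."""
    return float(np.log1p(np.exp(-sigma * (s_i - s_j))))

print(pairwise_logistic_loss(2.0, 0.5))  # ~0.20, pair correctly ordered
print(pairwise_logistic_loss(0.5, 2.0))  # ~1.70, pair inverted
```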

Relevance labels: getting them right#

The quality of your relevance labels is the single most important factor in training a good LTR model. A surprisingly common mistake is using rank-derived fractional relevance instead of magnitude-based integer buckets, and it silently sabotages model training.

The anti-pattern: rank-based fractions#

Consider a scheme like:

```
relevance = (n_booked - rank + 1) / n_booked
```

For a listing with three bookings, this maps the first-booked item to 1.0, the second to 0.67, and the third to 0.33, regardless of when they were booked. Two very different scenarios become indistinguishable:

| Scenario | Bookings | Rank-derived relevances |
| --- | --- | --- |
| A — last-minute market | 3, 2, 1 days before arrival | 1.0, 0.67, 0.33 |
| B — high-demand market | 120, 45, 5 days before arrival | 1.0, 0.67, 0.33 |

The model gets no signal about absolute demand strength. A panicked last-minute price-slash looks functionally identical to a premium booking secured months in advance.
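A two-line check makes the information loss explicit: the labels depend only on rank, so both scenarios produce the same vector.

```python
n_booked = 3  # both scenarios have three bookings

# The same rank-derived labels come out regardless of lead time:
labels = [(n_booked - rank + 1) / n_booked for rank in (1, 2, 3)]
print(labels)  # [1.0, 0.67, 0.33] (rounded) for Scenario A and B alike
```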

Why fractions break NDCG gradients#

LambdaMART and other listwise algorithms derive their update signal from the NDCG gain formula \(\text{Gain} = 2^{\text{rel}} - 1\). The exponential is intentional: it creates large gradients that force the model to aggressively correct misordered highly-relevant items.

With fractional labels in \([0, 1]\) the gain spread collapses:

| Relevance scheme | Best gain | Worst gain | Spread |
| --- | --- | --- | --- |
| Fractions (0–1) | \(2^{1.0} - 1 = 1.0\) | \(2^{0.33} - 1 \approx 0.26\) | ≈ 0.74 |
| Integer buckets (0–4) | \(2^4 - 1 = 15\) | \(2^1 - 1 = 1\) | 14 |

A model trained on fractions therefore sees a penalty roughly 18× smaller for misordering its most and least relevant items. It will not try hard to fix misordered results: this is gradient starvation.
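The collapse is easy to verify directly; the integer labels below are the Scenario B labels produced by the fix in the next subsection.

```python
gain = lambda rel: 2 ** rel - 1

for scheme in ([1.0, 0.67, 0.33],   # rank-derived fractions
               [4, 3, 1]):          # absolute integer buckets
    gains = [gain(r) for r in scheme]
    print(gains, "spread:", max(gains) - min(gains))
# [1.0, 0.59, 0.26]  spread: 0.74   (rounded)
# [15, 7, 1]         spread: 14
```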

The fix: absolute integer buckets#

Replace the rank-based fraction with a business-driven bucketing of the underlying continuous signal (e.g. booking lead time in days):

```python
import numpy as np
import pandas as pd

bins   = [-np.inf, 0, 14, 30, 90, np.inf]
labels = [0, 1, 2, 3, 4]
# 0 = Unbooked / past
# 1 = Last minute     (1–14 days)
# 2 = Short window    (15–30 days)
# 3 = Standard window (31–90 days)
# 4 = High demand     (90+ days)

df['relevance'] = pd.cut(
    df['lead_time_days'].fillna(-1),   # unbooked rows fall into bucket 0
    bins=bins,
    labels=labels,
).astype(int)
```

With this scheme Scenario A yields [1, 1, 1] — the model correctly learns “all of these are weak last-minute bookings” — while Scenario B yields [4, 3, 1], giving the model rich gradient signal to learn what drives premium demand.

Static buckets vs. percentile (market-aware) bucketing#

Static thresholds (like the 90-day cutoff above) assume all markets behave similarly. In a heterogeneous marketplace this breaks down:

| Market | Typical booking window | Static bucket for a 7-day booking |
| --- | --- | --- |
| Ski resort in Aspen | 6–9 months ahead | 1 (last-minute) |
| Business hotel in Manhattan | 2–5 days ahead | 1 (last-minute) |

With static buckets a 7-day booking lands in bucket 1 in both markets, even though it is among the best outcomes the Manhattan hotel ever sees. The Manhattan comp-set never produces a 3 or 4, so gradient starvation returns for that entire city.

Percentile (quantile) bucketing solves this by computing bucket boundaries relative to each market’s historical distribution:

```python
# labels=False returns integer bucket codes (0 = shortest lead times,
# 4 = longest). With an explicit label list, pd.qcut raises when
# duplicate quantile edges are dropped in low-variance markets.
df['relevance'] = (
    df.groupby('market_id')['lead_time_days']
      .transform(lambda x: pd.qcut(x, q=5, labels=False,
                                   duplicates='drop'))
      .fillna(0)        # rows with no lead time fall into bucket 0
      .astype(int)
)
```

Now a 7-day booking in Manhattan that is in the top 20% for that market receives a 4 — the same gradient weight as a 180-day Aspen booking. Every comp-set always has a winner and a loser, maximising the learning signal across all markets.

When to prefer static buckets: your business objective is explicitly absolute (e.g. “only care about bookings locked in 90+ days in advance”), your market is highly homogeneous, or computing per-market quantiles is operationally infeasible.

When to prefer percentile bucketing: you operate across diverse geographies or categories and want the model to learn relative demand dynamics within each context.

LambdaMART#

LambdaMART combines:

  • Lambda gradients — pseudo-gradients derived from pairwise preference swaps, weighted by the change in NDCG they would produce.

  • MART (Multiple Additive Regression Trees) — gradient boosted decision trees (GBDT).
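For a pair \((i, j)\) with \(\text{rel}(d_i) > \text{rel}(d_j)\) and current model scores \(s_i, s_j\), the lambda pseudo-gradient in Burges' formulation is

\[
\lambda_{ij} = \frac{-\sigma}{1 + e^{\sigma (s_i - s_j)}}\,\bigl|\Delta \text{NDCG}_{ij}\bigr|,
\]

where \(|\Delta \text{NDCG}_{ij}|\) is the change in NDCG from swapping the two documents. The exponential gain enters through this swap term, which is why the label scale discussed above matters so much.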

Implementations: lightgbm (objective='lambdarank'), xgboost (objective='rank:ndcg'), ranklib.

```python
import lightgbm as lgb

params = {
    "objective": "lambdarank",
    "metric": "ndcg",
    "ndcg_eval_at": [1, 5, 10],
    "num_leaves": 64,
    "learning_rate": 0.05,
}

# `group_train` lists the number of documents per query,
# in the same order as the rows of X_train.
train_data = lgb.Dataset(X_train, label=y_train, group=group_train)
model = lgb.train(params, train_data, num_boost_round=200)
```

Neural rankers#

Cross-encoders (e.g. fine-tuned BERT) jointly encode query + document and output a relevance score. They are more accurate than bi-encoders but far slower (no pre-indexing), so they are typically used as a re-ranker on the top-k candidates returned by a first-stage retriever (BM25 or dense).

query + candidate → [CLS] token → linear head → relevance score
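As a minimal sketch, re-ranking with the sentence-transformers CrossEncoder API (the checkpoint, query, and candidates are illustrative):

```python
from sentence_transformers import CrossEncoder

# One public MS MARCO cross-encoder checkpoint; any compatible
# checkpoint works the same way.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "cheap flights to tokyo"
candidates = ["Tokyo flight deals", "Hotels in Kyoto", "Tokyo travel guide"]

# Each (query, document) pair is encoded jointly and scored.
scores = model.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda p: -p[1])
```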

Two-stage pipeline#

```text
Stage 1 — Retrieval (fast, high recall)
  BM25 or dense retriever → top-1000 candidates

Stage 2 — Re-ranking (slow, high precision)
  LambdaMART or cross-encoder → final top-10
```
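Glued together, the two stages might look like the sketch below, assuming the rank_bm25 package for stage 1, a raw-text list `documents`, and the cross-encoder `model` from the previous snippet:

```python
from rank_bm25 import BM25Okapi

# Stage 1: cheap lexical retrieval over the whole corpus.
bm25 = BM25Okapi([doc.split() for doc in documents])
candidates = bm25.get_top_n(query.split(), documents, n=1000)

# Stage 2: expensive neural scoring over the shortlist only.
scores = model.predict([(query, doc) for doc in candidates])
top_10 = [doc for doc, _ in
          sorted(zip(candidates, scores), key=lambda p: -p[1])[:10]]
```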

Custom loss functions#

The existing notebook (Learning to rank/LTR with custom loss function.ipynb) demonstrates how to implement a bespoke ranking loss in a GBDT framework, allowing direct optimisation of business-specific objectives beyond standard NDCG.

Key ideas:

  • Replace lambda gradients with domain-specific pair weights (the gradient/hessian plumbing is sketched after this list).

  • Incorporate business constraints (diversity, freshness, inventory) into the loss.

  • Evaluate offline (NDCG) and online (CTR, conversion) in tandem.
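As a minimal sketch of that plumbing, here is a custom objective that uses a label-weighted squared error as a stand-in for real business-specific pair weights. It assumes LightGBM ≥ 4.0, where lgb.train accepts a callable objective in params with signature (preds, train_data), and reuses `train_data` from the LambdaMART snippet; the weighting scheme itself is hypothetical.

```python
import numpy as np
import lightgbm as lgb

def weighted_pointwise_obj(preds, train_data):
    """Stand-in custom objective: squared error pulling each score
    towards its label, up-weighted for high-relevance documents so
    their mistakes cost more. A real system would weight by revenue,
    freshness, inventory, etc."""
    y = train_data.get_label()
    w = 1.0 + y                      # hypothetical domain weight
    grad = w * (preds - y)
    hess = w * np.ones_like(preds)
    return grad, hess

params = {
    "objective": weighted_pointwise_obj,  # callable instead of a string
    "metric": "ndcg",
    "ndcg_eval_at": [10],
}
model = lgb.train(params, train_data, num_boost_round=200)
```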

Public datasets#

| Dataset | Queries | Docs per query | Relevance |
| --- | --- | --- | --- |
| MSLR-WEB10K | 10,000 | ~120 | 0–4 graded |
| MSLR-WEB30K | 30,000 | ~120 | 0–4 graded |
| Yahoo! LTR (C14) | 29,921 | ~24 | 0–4 graded |
| ISTELLA | 33,018 | ~103 | 0–4 graded |

The MSLR-WEB10K dataset is already present in applied_data_science_book/Learning to rank/MSLR-WEB10K/.
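MSLR files use the LETOR/SVMlight format (rel qid:n 1:v 2:v …), which scikit-learn parses directly. A minimal loading sketch, assuming the standard five-fold layout inside that directory:

```python
import numpy as np
from sklearn.datasets import load_svmlight_file

X, y, qid = load_svmlight_file(
    "applied_data_science_book/Learning to rank/MSLR-WEB10K/Fold1/train.txt",
    query_id=True,
)

# Rows for each query are contiguous and qids ascend in MSLR files,
# so the sorted unique counts double as LightGBM group sizes.
_, group = np.unique(qid, return_counts=True)
```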