Applied Data Science: Diary of a Lonely Scientist

Semantic Search

Contents

  • From sparse to dense retrieval
  • Distributional semantics: topical vs. typical relatedness
    • Two types of relatedness
    • How classic embedding models map onto this distinction
  • Text embeddings
  • Approximate Nearest Neighbour (ANN) search
  • Vector databases
  • Retrieval-Augmented Generation (RAG)
  • Evaluation

Semantic Search#

Sparse retrieval (BM25) relies on exact token overlap between query and document. Semantic search addresses the vocabulary mismatch problem by mapping both queries and documents into a shared dense vector space where proximity reflects meaning rather than surface form.

From sparse to dense retrieval#

|                      | Sparse (BM25)              | Dense (bi-encoder)         |
|----------------------|----------------------------|----------------------------|
| Representation       | Bag-of-words vector        | Fixed-size embedding       |
| Index size           | Proportional to vocabulary | Proportional to corpus size |
| Latency              | Sub-millisecond            | ~1–10 ms with ANN          |
| Handles synonyms     | No                         | Yes                        |
| Training data needed | No                         | Yes (or pre-trained model) |
| Explainability       | High                       | Low                        |

In practice the two are complementary: hybrid search (BM25 + dense) often outperforms either alone.
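One simple, widely used fusion method is Reciprocal Rank Fusion (RRF), which combines rankings without having to calibrate BM25 and cosine scores against each other. A minimal sketch (the document IDs and hit lists below are illustrative):

```python
def rrf_fuse(rankings, k=60):
    """Combine several ranked lists of doc IDs into one fused ranking.

    Each list contributes 1 / (k + rank) per document; k dampens the
    influence of top positions so no single retriever dominates.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]    # lexical ranking
dense_hits = ["d1", "d9", "d3"]   # semantic ranking
fused = rrf_fuse([bm25_hits, dense_hits])
```

With the conventional k = 60, documents that appear in both lists (here d1 and d3) float above those found by only one retriever.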

Distributional semantics: topical vs. typical relatedness#

Before picking an embedding model, it helps to understand the two fundamentally different ways words can be “semantically close” — because different model architectures optimise for different kinds of closeness.

Two types of relatedness#

Topical (associative) relatedness — words belong to the same subject matter or semantic field. They co-occur frequently in the same documents, but one cannot necessarily substitute for the other in a sentence.

Examples: doctor & hospital, coffee & mug, sun & beach

Typical (substitutional / functional) relatedness — words share semantic features and function similarly in a sentence. They can often replace each other without changing grammatical structure.

Examples: car & bus, big & huge, run & walk

*(Figure: topical vs. typical semantic relatedness.)*

How classic embedding models map onto this distinction#

| Model | Mechanism | Prioritises |
|-------|-----------|-------------|
| Word2Vec (Skip-gram / CBOW) | Local sliding window (~5 words); predicts nearby context words | Typical: learns words that appear next to the same context words |
| LSA (Latent Semantic Analysis) | SVD on a global term–document co-occurrence matrix | Topical: learns words that appear in the same overall documents |
| GloVe | Factorises a global word–word co-occurrence matrix | Both: strong on analogies (typical) while still capturing topic (topical) |

This distinction has direct practical consequences:

  • Query expansion built on Word2Vec will surface good synonyms (laptop → notebook) but may miss thematically related terms (laptop → charger).

  • LSA-style models are better at clustering documents by topic, but poorer at synonym detection.

  • GloVe is a safer default when you want a single static embedding for general-purpose retrieval.

  • Contextual models (BERT, Sentence-Transformers) learn both types simultaneously from billions of tokens, which is a key reason they outperform all static embeddings on retrieval benchmarks.
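The query-expansion point can be made concrete with a toy nearest-neighbour expansion. The 2-D vectors below are hand-made stand-ins, not real Word2Vec output: they are arranged so that *notebook* sits near *laptop* (typical relatedness) while *charger*, a topical neighbour, scores lower, mimicking how window-based models tend to behave:

```python
import math

# Hand-crafted toy vectors (NOT trained embeddings).
toy_vectors = {
    "laptop":   (0.90, 0.10),
    "notebook": (0.85, 0.15),   # near-synonym: almost the same direction
    "charger":  (0.30, 0.80),   # topically related: different direction
    "banana":   (-0.70, 0.20),  # unrelated
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def expand(query, vectors, top_n=1):
    """Return the top_n terms nearest to the query (excluding itself)."""
    candidates = [(t, cosine(vectors[query], v))
                  for t, v in vectors.items() if t != query]
    candidates.sort(key=lambda pair: -pair[1])
    return [t for t, _ in candidates[:top_n]]
```

Here `expand("laptop", toy_vectors)` surfaces the substitutable synonym first; the thematically related *charger* only appears once `top_n` is widened.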

Text embeddings#

A bi-encoder independently embeds the query and each document with the same encoder (e.g. a fine-tuned BERT / Sentence-Transformers model). Similarity is computed as the dot product or cosine of the two vectors.
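The flow can be sketched end to end with a stand-in encoder. A hashed bag-of-words projection plays the encoder's role here purely so the example is self-contained and deterministic; the shared-encoder-plus-dot-product structure is the point, not the embedding quality:

```python
import math
import zlib

DIM = 64  # fixed embedding size, regardless of text length

def encode(text):
    """Map text to a fixed-size, L2-normalised vector (toy encoder).

    A real bi-encoder would run a trained model here; both queries and
    documents must go through the SAME encoder.
    """
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

docs = [
    "the cat sat on the mat",
    "stock markets fell sharply",
    "a cat chased a mouse",
]
doc_vecs = [encode(d) for d in docs]       # documents embedded offline

def search(query, top_k=2):
    q = encode(query)                      # query embedded at request time
    scores = [sum(a * b for a, b in zip(q, d)) for d in doc_vecs]
    order = sorted(range(len(docs)), key=lambda i: -scores[i])
    return [docs[i] for i in order[:top_k]]
```

Because both vectors are L2-normalised, the dot product equals cosine similarity, and document vectors can be pre-computed and indexed once.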

Popular pre-trained models:

  • all-MiniLM-L6-v2 — fast, good general-purpose baseline (384 dims)

  • text-embedding-3-small — OpenAI, strong out-of-the-box

  • e5-large-v2 — strong on BEIR benchmark

  • bge-m3 — multilingual, supports sparse + dense hybrid natively

Approximate Nearest Neighbour (ANN) search#

Brute-force cosine search over millions of vectors is too slow. ANN indexes trade a small amount of recall for orders-of-magnitude speed-up:

| Algorithm | Library | Notes |
|-----------|---------|-------|
| HNSW | hnswlib, Faiss, Weaviate | Best recall/speed trade-off; in-memory |
| IVF-PQ | Faiss | Compressed; scales to billions of vectors |
| ScaNN | Google ScaNN | Optimised for Google-scale workloads |
| DiskANN | Microsoft | SSD-based; low memory footprint |
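The recall-for-speed trade can be seen in a toy random-hyperplane LSH index (a different family from the algorithms above, but the simplest to sketch): each vector is hashed to a few sign bits, and only the query's own bucket is scored exactly:

```python
import random

random.seed(0)
DIM, BITS = 8, 4
# Random hyperplanes drawn once; each contributes one bit of the hash.
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]

def lsh_code(v):
    """One sign bit per hyperplane: which side of the plane v falls on."""
    return tuple(int(sum(p_i * v_i for p_i, v_i in zip(p, v)) > 0)
                 for p in planes)

def normalise(v):
    n = sum(x * x for x in v) ** 0.5
    return [x / n for x in v]

# Index 200 random unit vectors: bucket IDs by hash code.
vectors = [normalise([random.gauss(0, 1) for _ in range(DIM)])
           for _ in range(200)]
index = {}
for i, v in enumerate(vectors):
    index.setdefault(lsh_code(v), []).append(i)

def ann_search(query):
    """Score only the query's own bucket exactly (may miss true neighbours)."""
    candidates = index.get(lsh_code(query), [])
    return max(candidates,
               key=lambda i: sum(q * x for q, x in zip(query, vectors[i])),
               default=None)
```

A true nearest neighbour that lands in another bucket is simply missed; production systems recover recall with multiple hash tables or, as in HNSW and IVF, entirely different candidate-generation schemes.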

Vector databases#

Managed ANN search + metadata filtering + CRUD:

  • Pinecone — fully managed, serverless option

  • Weaviate — open-source, built-in hybrid search

  • Qdrant — open-source, Rust-based, fast filtering

  • Milvus / Zilliz — open-source, GPU-accelerated

  • pgvector — vector extension for PostgreSQL; simplest ops story

Retrieval-Augmented Generation (RAG)#

Semantic search is the retrieval backbone of RAG systems:

query → embed → ANN search → top-k chunks → LLM prompt → answer

Retrieval quality has an outsized impact on answer quality — improving the retriever is usually more effective than prompt engineering.
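The pipeline can be skeletonised with stubs to show where retrieval plugs in. `embed`, `ann_search`, and `llm` below are placeholders (a real system would call an embedding model, a vector index, and a generation model); only the wiring is meant to be taken literally:

```python
def embed(text):
    # Stub: a real system returns a dense vector from an embedding model.
    return text

def ann_search(query_vec, corpus, top_k=2):
    # Stub: naive token-overlap "retrieval" so the example runs end to end;
    # a real system queries an ANN index of pre-embedded chunks.
    q = set(query_vec.lower().split())
    scored = sorted(corpus, key=lambda c: -len(q & set(c.lower().split())))
    return scored[:top_k]

def llm(prompt):
    # Stub: a real system sends the prompt to a generation model.
    return f"[answer based on prompt of {len(prompt)} chars]"

def rag_answer(query, corpus):
    chunks = ann_search(embed(query), corpus)                  # retrieve
    prompt = ("Answer using the context below.\n\n"
              + "\n".join(chunks) + f"\n\nQ: {query}")         # augment
    return llm(prompt), chunks                                 # generate

corpus = [
    "HNSW is an ANN index.",
    "BM25 is a sparse retriever.",
    "Paris is in France.",
]
answer, used = rag_answer("what is an ANN index", corpus)
```

Swapping a better retriever into `ann_search` changes `used`, and therefore the prompt, without touching the rest of the pipeline, which is why retriever improvements propagate so directly to answer quality.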

Evaluation#

Use the same NDCG / MRR / Recall@k metrics as for sparse retrieval. The BEIR benchmark provides a standard heterogeneous test suite of 18 retrieval datasets spanning nine task types.
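For reference, the three metrics for a single query with binary relevance judgements can be written compactly (assuming `ranked` is the system's ordering and `relevant` a non-empty set of relevant doc IDs):

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant docs retrieved in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant document, 0 if none found."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(r + 1)
              for r, doc in enumerate(ranked[:k], start=1)
              if doc in relevant)
    ideal = sum(1.0 / math.log2(r + 1)
                for r in range(1, min(len(relevant), k) + 1))
    return dcg / ideal
```

Corpus-level figures are the mean of these per-query values over the full query set.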



By Andrey Spiridonov

© Copyright 2023.