Hélain Zimmermann

Hybrid Search: Combining Dense and Sparse Retrieval

Most RAG systems I see in production fail not because the LLM is weak, but because retrieval silently drops the one document that actually mattered. Dense retrieval misses exact keywords. Sparse retrieval misses semantics. Users do not care which one you chose; they just see irrelevant answers.

Hybrid search is how we stop choosing.

Rather than betting everything on either a vector index or a BM25 index, we combine them and let each do what it does best: sparse retrieval for exact lexical matches, dense retrieval for semantic similarity. When tuned correctly, hybrid search gives you meaning and precision, with surprisingly little additional complexity.

This article goes beyond toy explanations and focuses on practical design patterns and implementation details that matter in production RAG systems.

Why hybrid search matters in real systems

Dense and sparse retrieval each have structural blind spots.

Where dense retrieval fails

Dense retrieval (embeddings) shines at semantic similarity, but struggles with:

  • Rare entities and identifiers: invoice numbers, SKUs, error codes, cryptographic keys
  • Exact phrasing requirements: legal clauses, compliance language, contractual terms
  • Acronyms and technical jargon: often underrepresented in the embedding training data
  • Code and logs: minor token changes can be crucial, but embeddings may smooth them out

Anyone who has built a RAG system for technical documentation or legal text has seen this: the embedding model returns conceptually related paragraphs, but skips the one containing the exact function name, column name, or regulatory clause.

Where sparse retrieval fails

Sparse retrieval (BM25, keyword search) excels at lexical matching, but misses:

  • Paraphrases and semantic similarity: "terminate the agreement" vs "end the contract"
  • Cross-lingual matches: query in English, docs in French
  • Contextual meaning: "apple" the company vs the fruit
  • Long queries: term frequency and document length normalization can behave strangely

"Semantic Search vs Keyword Search: When to Use What" goes deeper into this tradeoff. The core point is simple: both approaches are incomplete views of similarity.

Practical motivation: safety and debuggability

Two more reasons I strongly prefer hybrid search in production:

  1. Safety-critical recall: if you are building compliance tools, medical assistants, or internal policy bots, you really do not want to miss edge-case documents that contain a specific phrase such as "except where explicitly authorized".
  2. Debuggability: with sparse retrieval in the loop, you can inspect exactly which terms matched and why a document was retrieved. This can be much more interpretable than purely opaque vector similarity.

Hybrid search design patterns

There is no single "hybrid search". Most systems fall into one of a few patterns.

Pattern 1 - Score fusion on a shared candidate set

This is the classic approach: you query both the sparse index and the dense index, then combine their scores.

Steps:

  1. Send the query q to the sparse retriever, get top-k_s docs with scores s_sparse(d)
  2. Send q to dense retriever, get top-k_d docs with scores s_dense(d)
  3. Normalize scores
  4. Merge candidates and compute a hybrid score

Hybrid score is usually a weighted sum:

ALPHA = 0.6  # weight for dense component

def normalize(scores):
    # simple min-max normalization per retriever
    if not scores:
        return {}
    values = list(scores.values())
    mn, mx = min(values), max(values)
    if mx == mn:
        return {k: 0.5 for k in scores}  # all equal
    return {k: (v - mn) / (mx - mn) for k, v in scores.items()}


def hybrid_scores(dense_results, sparse_results, alpha=ALPHA):
    # dense_results and sparse_results: list of (doc_id, score)
    dense_dict = {doc_id: score for doc_id, score in dense_results}
    sparse_dict = {doc_id: score for doc_id, score in sparse_results}

    dense_norm = normalize(dense_dict)
    sparse_norm = normalize(sparse_dict)

    all_ids = set(dense_norm) | set(sparse_norm)
    final = {}
    for doc_id in all_ids:
        d = dense_norm.get(doc_id, 0.0)
        s = sparse_norm.get(doc_id, 0.0)
        final[doc_id] = alpha * d + (1 - alpha) * s

    # sort by hybrid score descending
    return sorted(final.items(), key=lambda x: x[1], reverse=True)

Where to run this:

  • Client side: with two independent APIs (e.g. OpenSearch BM25 + a vector DB)
  • Server side: inside a retrieval service that abstracts the hybrid logic away

Pros:

  • Very flexible: you can change normalization, weights, or add extra signals
  • Works with any combination of sparse and dense backends
  • Easy to reason about

Cons:

  • Two queries per user request if backends are separate
  • You must tune normalization and alpha carefully
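
If tuning normalization and alpha is a burden, a rank-based fusion is one alternative worth trying. The sketch below uses reciprocal rank fusion (RRF), which is a different technique from the weighted sum above: it ignores raw scores and combines only the positions of each document in the two result lists. The function name and the constant k=60 (a commonly used default) are assumptions for illustration, not something prescribed by a specific backend.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked lists of (doc_id, score) using ranks only.

    k=60 is a commonly used default; treat it as a tunable assumption.
    """
    fused = defaultdict(float)
    for results in result_lists:
        # Rank by each retriever's own score, best first
        ranked = sorted(results, key=lambda x: x[1], reverse=True)
        for rank, (doc_id, _score) in enumerate(ranked, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)


# Toy results: cosine similarities on one side, BM25 scores on the other
dense = [("a", 0.91), ("b", 0.88), ("c", 0.40)]
sparse = [("c", 14.2), ("a", 9.7), ("d", 3.1)]
fused = reciprocal_rank_fusion([dense, sparse])
```

Because only ranks matter, the very different score scales of the two retrievers never need to be reconciled. The tradeoff is that you lose the information carried by score gaps.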

Pattern 2 - Cascaded retrieval (dense then sparse or vice versa)

Sometimes you cannot afford to search the full corpus twice. In that case, you cascade:

  1. Use a fast retriever to get a candidate pool of size K
  2. Use a more expensive retriever to re-rank only those K candidates

Two common variants:

  • BM25 first, then dense re-ranking: good when you want exact match recall with better semantic ranking
  • Dense first, then BM25 re-ranking: useful when queries are long or noisy and BM25 alone performs poorly

A simple dense re-ranker based on cosine similarity over precomputed embeddings (for small K, a cross-encoder can take its place):

import numpy as np


def rerank_with_dense(query_emb, docs, doc_embs):
    """Cosine similarity based re-ranker.
    docs: list[dict] with 'id', 'text'
    doc_embs: dict[id] -> vector
    """

    def cosine(a, b):
        # small epsilon avoids division by zero for all-zero vectors
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    scores = []
    for d in docs:
        emb = doc_embs[d["id"]]
        scores.append((d["id"], cosine(query_emb, emb)))

    return sorted(scores, key=lambda x: x[1], reverse=True)

This pattern works well when you use BM25 on fine-grained chunks to guarantee lexical coverage, then a dense re-ranker that reasons over slightly larger windows for semantic coherence.
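
To make the two-stage flow concrete, here is a minimal sketch of the cascade itself, with both stages passed in as callables. The names cascade_retrieve, toy_bm25 and toy_dense_rerank are hypothetical, and the toy functions stand in for a real BM25 index and a real re-ranker.

```python
from typing import Callable, List, Tuple

Scored = List[Tuple[str, float]]


def cascade_retrieve(
    query: str,
    first_stage: Callable[[str, int], Scored],
    second_stage: Callable[[str, List[str]], Scored],
    pool_size: int = 100,
    top_k: int = 10,
) -> Scored:
    """Stage 1 builds a candidate pool, stage 2 re-scores only that pool."""
    pool = first_stage(query, pool_size)
    candidate_ids = [doc_id for doc_id, _ in pool]
    if not candidate_ids:
        return []
    reranked = second_stage(query, candidate_ids)
    return sorted(reranked, key=lambda x: x[1], reverse=True)[:top_k]


# Toy stand-ins for real retrievers (hypothetical, for illustration only)
def toy_bm25(query: str, k: int) -> Scored:
    return [("doc1", 7.2), ("doc2", 5.1), ("doc3", 1.3)][:k]


def toy_dense_rerank(query: str, doc_ids: List[str]) -> Scored:
    sims = {"doc1": 0.42, "doc2": 0.87, "doc3": 0.15}
    return [(d, sims[d]) for d in doc_ids]


top = cascade_retrieve(
    "end the contract", toy_bm25, toy_dense_rerank, pool_size=2, top_k=2
)
```

Note that anything the first stage misses is gone for good, so pool_size is the main recall lever in this pattern.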

Pattern 3 - Native hybrid indices in vector databases

Many modern vector databases and search engines implement hybrid search natively:

  • OpenSearch: query with a bool clause combining match and knn subqueries
  • Elasticsearch: a knn search combined with a regular query in the same request, with boosts controlling each side's contribution
  • Qdrant, Weaviate, Milvus: combine vector search with sparse or keyword (BM25-style) scoring, alongside metadata filters

These engines typically handle score normalization internally. That simplifies your code, but you still have to tune parameters.

Example with a hypothetical Python client:

from my_vectordb import Client

client = Client()

query_vector = embed_query("How do I terminate my employment contract?")

results = client.search(
    collection="docs",
    hybrid={
        "dense": {
            "vector": query_vector,
            "top_k": 50,
            "weight": 0.6,
        },
        "sparse": {
            "query": "\"terminate\" AND \"employment contract\"",
            "top_k": 50,
            "weight": 0.4,
        },
        "top_k": 10,
    },
)

Hybrid search begins at query representation. For dense retrieval it is obvious that you need an embedding. For sparse retrieval you also have more options than naive keyword search.

Dense query encoding

When encoding queries, two best practices from "Embedding Models Compared: OpenAI vs Open-Source" are especially relevant:

  • Use a model trained for retrieval, not for general sentence similarity or classification
  • If your domain is specialized (legal, medical, code), consider a domain-specific model

Typical pattern:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")


def embed_query(text: str):
    # Many retrieval models perform better with special prefixes
    return model.encode(["query: " + text])[0]

Make sure you use the same or compatible models for document and query embeddings.
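
With E5-style models, "compatible" also covers the input prefixes: queries are encoded with "query: " and documents with "passage: ". A minimal sketch of keeping the two sides consistent follows. The helper names are hypothetical, and encode is any batch encoder you pass in, for example the encode method of the SentenceTransformer model above.

```python
from typing import Callable, List, Sequence


def format_query(text: str) -> str:
    # Prefix used on the query side by E5-style retrieval models
    return "query: " + text


def format_passage(text: str) -> str:
    # Prefix used on the document side by E5-style retrieval models
    return "passage: " + text


def embed_documents(
    texts: Sequence[str], encode: Callable[[List[str]], Sequence]
) -> Sequence:
    """Apply the document-side prefix, then encode the whole batch.

    encode is an assumption: any callable taking a list of strings,
    e.g. SentenceTransformer(...).encode.
    """
    return encode([format_passage(t) for t in texts])
```

Other model families use different prefixes or none at all, so check the model card before copying this convention.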

Sparse query representation

Sparse retrieval is more powerful than just sending the raw user query. Practical techniques:

  • Query expansion: add synonyms or abbreviations
  • Phrase matching: explicit quotes for exact phrases the user typed
  • Field boosting: title and headings often carry more signal than the body

Pseudo-code for BM25 with a search engine like OpenSearch:

query = "terminate employment contract"

sparse_query = {
    "bool": {
        "should": [
            {"match": {"title": {"query": query, "boost": 3.0}}},
            {"match": {"body": {"query": query}}},
            {"match_phrase": {"body": {"query": "employment contract", "boost": 2.0}}},
        ]
    }
}

This combination of phrase queries, multi-field search and boosting is often a cheap but powerful upgrade.
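
Query expansion fits into the same structure. The sketch below builds the bool/should query programmatically and appends synonym clauses with a lower boost. The function name, the synonym table and the 0.5 boost are all assumptions for illustration; in practice the synonyms would come from domain experts or query logs.

```python
from typing import Dict, List


def build_sparse_query(query: str, synonyms: Dict[str, List[str]]) -> dict:
    """Build a bool/should query with field boosts and synonym expansion."""
    should = [
        {"match": {"title": {"query": query, "boost": 3.0}}},
        {"match": {"body": {"query": query}}},
    ]
    tokens = set(query.lower().split())
    for term, alternatives in synonyms.items():
        if term in tokens:
            for alt in alternatives:
                # Expansions get a lower boost than the user's own words
                should.append(
                    {"match": {"body": {"query": alt, "boost": 0.5}}}
                )
    return {"bool": {"should": should}}


# Hypothetical synonym table
SYNONYMS = {"terminate": ["end", "cancel"], "contract": ["agreement"]}
sparse_query = build_sparse_query("terminate employment contract", SYNONYMS)
```

Keeping expansions at a lower boost means they can rescue recall without outranking documents that contain the user's actual wording.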

Calibrating dense and sparse scores

The non-obvious hard part in hybrid search is not calling two retrievers. It is making their scores comparable.

Key approaches:

1. Per-query normalization

Compute min-max or z-score normalization per retriever and per query. The earlier normalize function is a simple example.

Pros:

  • Easy to implement
  • No need for labeled data

Cons:

  • Still somewhat arbitrary
  • Sensitive to outliers in each candidate set
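
The z-score variant is not shown earlier, so here is a minimal sketch of it, assuming the same dict of doc_id to raw score used by the normalize function. The function name is hypothetical.

```python
import statistics


def zscore_normalize(scores):
    """Per-query z-score normalization for a dict of doc_id -> raw score."""
    if not scores:
        return {}
    values = list(scores.values())
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # All scores equal: every document sits exactly at the mean
        return {k: 0.0 for k in scores}
    return {k: (v - mean) / std for k, v in scores.items()}


bm25_raw = {"a": 12.0, "b": 8.0, "c": 4.0}
bm25_z = zscore_normalize(bm25_raw)
```

One caveat: with z-scores, 0.0 means "average for this candidate set", so a document missing from one retriever should not default to 0.0 as it does with min-max. A fill value at or below the lowest observed z-score is a more reasonable choice.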

2. Learned calibration layer

For high-traffic systems I prefer a small learned model that takes raw retrieval scores and outputs a calibrated hybrid score.

Inputs might include:

  • Dense score
  • Sparse score
  • Document length
  • Term coverage ratios (how many query terms matched)
  • Field match indicators (title match, heading match, etc.)

You can train a simple logistic regression or gradient-boosted tree on clickthrough or offline relevance labels.

Sketch:

from sklearn.linear_model import LogisticRegression
import numpy as np

# X: [dense_score, sparse_score, title_match, term_coverage]
X = np.load("features.npy")
y = np.load("labels.npy")  # 1 if relevant, 0 otherwise

model = LogisticRegression()
model.fit(X, y)


def hybrid_score_example(dense_score, sparse_score, title_match, term_cov):
    x = np.array([[dense_score, sparse_score, title_match, term_cov]])
    return float(model.predict_proba(x)[0, 1])

You can later re-use this model as a lightweight re-ranker over the top-K candidates returned by independent dense and sparse queries. The evaluation approaches for RAG systems apply directly here: construct relevance datasets and track metrics like NDCG and recall to validate your calibration.
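
The non-score features have to come from somewhere. Below is one possible way to compute term coverage and the title-match indicator, using a deliberately naive tokenizer. All function names are hypothetical, and a real system would reuse the analyzer of its search engine so that features agree with what BM25 actually matched.

```python
import re


def tokenize(text: str):
    # Naive lowercase alphanumeric tokenizer (an assumption, not an analyzer)
    return re.findall(r"[a-z0-9_]+", text.lower())


def term_coverage(query: str, doc_text: str) -> float:
    """Fraction of distinct query terms that appear in the document."""
    q_terms = set(tokenize(query))
    if not q_terms:
        return 0.0
    d_terms = set(tokenize(doc_text))
    return len(q_terms & d_terms) / len(q_terms)


def title_match(query: str, title: str) -> int:
    """1 if at least one query term appears in the title, else 0."""
    return int(bool(set(tokenize(query)) & set(tokenize(title))))


def build_features(query, doc, dense_score, sparse_score):
    # Feature order: [dense_score, sparse_score, title_match, term_coverage]
    full_text = doc["title"] + " " + doc["body"]
    return [
        dense_score,
        sparse_score,
        title_match(query, doc["title"]),
        term_coverage(query, full_text),
    ]


doc = {
    "title": "Ending a contract",
    "body": "How to terminate an employment agreement.",
}
features = build_features("terminate employment contract early", doc, 0.8, 5.0)
```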

Practical tuning strategies

Hybrid search without tuning is just noise. Here is how I tune systems in practice.

Step 1 - Establish single-modality baselines

  1. Implement pure sparse retrieval (BM25)
  2. Implement pure dense retrieval
  3. Evaluate both on the same dataset

Track at least:

  • Recall@K (how often the relevant doc is in the top K)
  • MRR or NDCG@K (ranking quality)
  • Latency
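
For reference, here is a minimal sketch of the first two metrics. It assumes runs maps each query id to a ranked list of doc ids and qrels maps each query id to the set of relevant doc ids; both names and the evaluate helper are illustrative. With a single relevant document per query, this Recall@K reduces to the "is it in the top K" rate described above.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10
) -> Dict[str, float]:
    """Average both metrics over every query that has relevance labels."""
    n = len(qrels)
    if n == 0:
        return {"recall@k": 0.0, "mrr": 0.0}
    recall = sum(recall_at_k(runs.get(q, []), rel, k) for q, rel in qrels.items())
    mrr = sum(reciprocal_rank(runs.get(q, []), rel) for q, rel in qrels.items())
    return {"recall@k": recall / n, "mrr": mrr / n}


runs = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9", "d4"]}
qrels = {"q1": {"d1"}, "q2": {"d5"}}
metrics = evaluate(runs, qrels, k=3)
```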

Step 2 - Tune the fusion weight

Start with simple score fusion and tune alpha on a validation set.

import numpy as np

alphas = np.linspace(0.0, 1.0, 11)  # 0.0, 0.1, ..., 1.0

best_alpha = None
best_metric = float("-inf")  # works even if the metric can be negative

for alpha in alphas:
    metric = evaluate_hybrid(alpha)  # your offline eval function
    if metric > best_metric:
        best_metric = metric
        best_alpha = alpha

print("Best alpha:", best_alpha, "metric:", best_metric)

You will often see something like:

  • Pure BM25: good recall on IDs, poor semantics
  • Pure dense: good semantics, misses edge cases
  • Hybrid with alpha between 0.4 and 0.7: best overall

Step 3 - Segment by query type

Not all queries are equal. Hybrid parameters that work for "how to" questions might be bad for specific ID lookups.

Heuristics:

  • If the query contains long numeric tokens or patterns like ERR_1234 or INV-2024-001, increase sparse weight
  • If the query is long and conversational, increase dense weight

You can implement this with simple rules or a classifier. For example:

import re


def estimate_query_type(query: str) -> str:
    # matches ERR_1234, INV-2024-001, and bare numeric tokens of 6+ digits
    if re.search(r"[A-Z]{2,}[-_]?\d{3,}|\d{6,}", query):
        return "id_lookup"
    if len(query.split()) > 12:
        return "long_natural"
    return "default"


def choose_alpha(query: str) -> float:
    qtype = estimate_query_type(query)
    if qtype == "id_lookup":
        return 0.2  # favor sparse
    if qtype == "long_natural":
        return 0.7  # favor dense
    return 0.5

You can later learn this mapping from data.

Integrating hybrid search into RAG pipelines

Hybrid retrieval is most impactful when combined with strong LLM prompting and post-processing.

Multi-stage RAG with hybrid retrieval

A pattern I often use:

  1. Clarify or rewrite the user query with an LLM
  2. Hybrid retrieval over chunked documents
  3. Optional second-stage re-ranking with a cross-encoder or LLM
  4. Context preparation: merge overlapping chunks, highlight matched terms
  5. LLM generation with instructions to quote or reference exact phrases

Hybrid retrieval is especially important in stage 2 when your corpus is noisy or heterogeneous, for example a mix of PDFs, emails, logs, and wiki pages.
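
Step 4 is easy to get wrong, so here is a minimal sketch of merging overlapping chunks. It assumes each retrieved chunk carries character offsets into its source document, which is a property of your chunker rather than something every pipeline has. The Chunk class and merge_overlapping function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    doc_id: str
    start: int  # character offset in the source document (inclusive)
    end: int    # character offset in the source document (exclusive)
    score: float


def merge_overlapping(chunks: List[Chunk]) -> List[Chunk]:
    """Merge overlapping or touching chunks from the same document.

    A merged span keeps the best score of its parts.
    """
    merged: List[Chunk] = []
    for c in sorted(chunks, key=lambda c: (c.doc_id, c.start)):
        last = merged[-1] if merged else None
        if last and last.doc_id == c.doc_id and c.start <= last.end:
            last.end = max(last.end, c.end)
            last.score = max(last.score, c.score)
        else:
            # Copy so the caller's chunks are not mutated
            merged.append(Chunk(c.doc_id, c.start, c.end, c.score))
    return sorted(merged, key=lambda c: c.score, reverse=True)


retrieved = [
    Chunk("a", 0, 100, 0.6),
    Chunk("a", 80, 180, 0.9),
    Chunk("b", 0, 50, 0.7),
    Chunk("a", 300, 400, 0.2),
]
context_spans = merge_overlapping(retrieved)
```

Merging before generation avoids feeding the LLM the same sentences twice, which wastes context and can skew what it quotes.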

Guardrails and privacy

When building privacy-preserving systems, hybrid search offers some additional levers:

  • You can configure the sparse index to exclude specific fields by default
  • You can apply per-index or per-field access control for sensitive data
  • You can implement allowlist / denylist filters at the BM25 level even before dense retrieval
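
Engine-level filters are preferable because filtered documents never leave the index. When the backend cannot express the rule, an application-side filter applied to both result lists before fusion is a reasonable fallback. The sketch below assumes a hypothetical mapping from doc id to the groups allowed to read it; the function and parameter names are illustrative.

```python
from typing import AbstractSet, Dict, Iterable, List, Set, Tuple


def filter_candidates(
    results: Iterable[Tuple[str, float]],
    doc_groups: Dict[str, Set[str]],
    user_groups: AbstractSet[str],
    denylist: AbstractSet[str] = frozenset(),
) -> List[Tuple[str, float]]:
    """Drop denylisted docs and docs the user's groups cannot read."""
    allowed = []
    for doc_id, score in results:
        if doc_id in denylist:
            continue
        # Fail closed: a doc with no ACL entry is treated as not visible
        if doc_groups.get(doc_id, set()) & user_groups:
            allowed.append((doc_id, score))
    return allowed


doc_groups = {"d1": {"hr"}, "d2": {"legal"}, "d3": {"hr", "legal"}}
results = [("d1", 0.9), ("d2", 0.8), ("d3", 0.7), ("d4", 0.6)]
visible = filter_candidates(
    results, doc_groups, user_groups={"hr"}, denylist={"d3"}
)
```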

Implementation considerations

A few low-level details that tend to bite engineers later.

Indexing pipeline

Your indexing code should build both indices from a single, well-defined document representation.

from dataclasses import dataclass


@dataclass
class Doc:
    id: str
    title: str
    body: str
    metadata: dict


def index_document(doc: Doc, dense_client, sparse_client, embed_model):
    text_for_embedding = doc.title + "\n" + doc.body
    emb = embed_model.encode([text_for_embedding])[0]

    # index into vector DB
    dense_client.upsert({
        "id": doc.id,
        "vector": emb,
        "payload": {
            "title": doc.title,
            "body": doc.body,
            **doc.metadata,
        },
    })

    # index into search engine for BM25
    sparse_client.index(index="docs", id=doc.id, body={
        "title": doc.title,
        "body": doc.body,
        **doc.metadata,
    })

Latency and caching

Two retrievers usually mean higher latency. To keep things under control:

  • Cache embeddings for frequent queries (or precompute for common templates)
  • Co-locate dense and sparse services to minimize network overhead
  • Use approximate nearest neighbor for dense retrieval with tuned recall/latency tradeoffs
  • Limit candidate set sizes early, then re-rank a small subset

Measure each component individually, not just overall end-to-end latency.
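
A minimal sketch of both ideas, caching query embeddings and timing each component separately, using only the standard library. The slow_embed function is a stand-in for a real embedding call, and the timed helper and timings dict are hypothetical names.

```python
import time
from contextlib import contextmanager
from functools import lru_cache

timings = {}


@contextmanager
def timed(name):
    """Accumulate wall-clock time per named component."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + (time.perf_counter() - start)


calls = {"embed": 0}


def slow_embed(text):
    # Stand-in for a real embedding call (hypothetical)
    calls["embed"] += 1
    return tuple(float(ord(ch)) for ch in text[:4])


@lru_cache(maxsize=10_000)
def cached_embed(text):
    # Returns a tuple so cached values cannot be mutated by callers
    return slow_embed(text)


for q in ["reset password", "reset password", "refund policy"]:
    with timed("embed"):
        cached_embed(q)
```

An in-process cache like this only helps within one worker. For a multi-instance service, the same idea applies with a shared cache keyed on the normalized query text.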

Key Takeaways

  • Hybrid search combines dense and sparse retrieval to cover each other's blind spots: semantics from embeddings and exact matching from BM25.
  • Score fusion, cascaded retrieval, and native hybrid indices in modern vector databases are the main implementation patterns.
  • Proper score normalization or a learned calibration layer is essential. Raw scores from dense and sparse retrievers are not directly comparable.
  • Query-aware weighting (for example higher sparse weight for ID lookups) significantly improves performance, even with simple heuristics.
  • Multi-stage RAG pipelines benefit most from hybrid retrieval, especially in noisy or high-stakes domains like legal, medical, or internal policy search.
  • Latency, privacy, and access control must be designed jointly with retrieval, not as afterthoughts.
