Hybrid Search, Reranking, and HyDE: The Upgrades That Actually Move the Needle

When the basic loop retrieves the wrong passages, the next step is usually not an agent. It is better retrieval.

Advanced retrieval is a collection of practical techniques that improve different parts of the same pipeline - what gets searched, how results are combined, and which passages the model actually sees. Most teams reach for these before considering structural changes to the architecture.

Query rewriting

Users write questions for people, not indexes. A query like "does it cover contractors too?" is meaningless without context. Query rewriting reformulates the user's input into a form that retrieves better - resolving pronouns, expanding abbreviations, or making implicit constraints explicit. The original and rewritten queries should be retained for tracing, since a rewriting model can silently change the user's intent.
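Here is a minimal sketch of the rewrite step. It assumes a generic `llm` callable that maps a prompt string to a completion, a hypothetical stand-in for whatever model client you use. The prompt wording and the `RewrittenQuery` record are illustrative assumptions. What matters is that the original and rewritten queries travel together, so a trace shows exactly what the rewriter changed.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative prompt (an assumption, not a prescribed template).
REWRITE_PROMPT = (
    "Rewrite the latest question as a standalone search query. Resolve pronouns "
    "from the conversation, expand abbreviations, and state implicit constraints "
    "explicitly. Return only the query.\n\n"
    "Conversation:\n{history}\n\nLatest question: {question}\n\nQuery:"
)


@dataclass(frozen=True)
class RewrittenQuery:
    original: str   # the user's words, kept verbatim for tracing
    rewritten: str  # the text actually sent to the retriever


def rewrite_query(
    question: str, history: list[str], llm: Callable[[str], str]
) -> RewrittenQuery:
    """Reformulate `question` for retrieval. `llm` is any prompt -> text callable."""
    prompt = REWRITE_PROMPT.format(history="\n".join(history), question=question)
    rewritten = llm(prompt).strip()
    # If the model returns nothing usable, fall back to the user's own words.
    return RewrittenQuery(original=question, rewritten=rewritten or question)


# Usage with a stand-in model; swap in a real LLM client in practice.
fake_llm = lambda prompt: "Does the remote-work policy apply to contractors?"
q = rewrite_query(
    "does it cover contractors too?",
    ["User: What does the remote-work policy say?"],
    fake_llm,
)
print(q.original, "->", q.rewritten)
```

Log both fields with every retrieval. When an answer goes wrong, the trace then shows whether the rewriter or the retriever was responsible.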

Hybrid dense and keyword retrieval

Dense (vector) search catches semantic similarity - finding a passage about "contract termination" when the user asks about "ending an agreement." Keyword search catches exact matches: error codes, product identifiers, model numbers, proper nouns. Neither alone is sufficient for a corpus that mixes prose with identifiers. Hybrid retrieval combines both signals before ranking, and is the minimum viable strategy for most enterprise content.
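The following sketch shows one way to combine the two signals, under stated assumptions. The keyword side is a small in-memory Okapi BM25 scorer. The dense side uses a hypothetical `embed` function. The two are mixed as a weighted sum of min-max-normalized scores, with an assumed `alpha` of 0.5. The tokenizer keeps hyphenated identifiers such as error codes intact, which is exactly where keyword search earns its place.

```python
import math
import re
from collections import Counter
from typing import Callable, Sequence

Vector = Sequence[float]


def tokenize(text: str) -> list[str]:
    # Keeps hyphenated identifiers such as "E-4012" together as one token.
    return re.findall(r"\w+(?:-\w+)*", text.lower())


class BM25:
    """Okapi BM25 over an in-memory corpus; k1 and b are common defaults, not tuned values."""

    def __init__(self, docs: Sequence[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avgdl = (sum(self.lengths) / len(self.lengths)) or 1.0
        df = Counter(term for tf in self.tfs for term in tf)  # documents containing each term
        n = len(docs)
        # Non-negative IDF variant, as used by Lucene.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def scores(self, query: str) -> list[float]:
        terms = tokenize(query)
        out = []
        for tf, dl in zip(self.tfs, self.lengths):
            norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            out.append(sum(
                self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                for t in terms if t in tf
            ))
        return out


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def min_max(xs: Sequence[float]) -> list[float]:
    # Rescale to [0, 1] so dense and keyword scores are comparable before mixing.
    lo, hi = min(xs), max(xs)
    return [0.0] * len(xs) if hi == lo else [(x - lo) / (hi - lo) for x in xs]


def hybrid_search(
    query: str,
    docs: Sequence[str],
    doc_vecs: Sequence[Vector],
    embed: Callable[[str], Vector],  # hypothetical embedding function
    bm25: BM25,
    alpha: float = 0.5,              # weight on the dense signal (assumed)
    top_k: int = 5,
) -> list[tuple[str, float]]:
    q_vec = embed(query)
    dense = min_max([cosine(q_vec, v) for v in doc_vecs])
    keyword = min_max(bm25.scores(query))
    fused = [alpha * d + (1 - alpha) * k for d, k in zip(dense, keyword)]
    order = sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)
    return [(docs[i], fused[i]) for i in order[:top_k]]
```

In production, the keyword scores would usually come from a search engine's BM25 implementation and the vectors from your vector store. Min-max normalization plus a weighted sum is only one fusion choice. Reciprocal Rank Fusion, covered in the next section, avoids comparing raw scores altogether.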

Multi-query retrieval and Reciprocal Rank Fusion

A single embedding of an ambiguous or compound question may not capture all dimensions of the required answer. Multi-query retrieval generates several paraphrased versions of the question, retrieves for each independently, and merges the result sets. Reciprocal Rank Fusion then combines the ranked lists: passages that multiple retrieval variants agree on rise to the top, reducing the effect of any single query's blind spots.
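Reciprocal Rank Fusion scores each passage as score(d) = Σ 1 / (k + rank_i(d)), summed over every ranked list i in which d appears, with ranks starting at 1. The sketch below uses k = 60, the value commonly used since RRF was introduced. `paraphrase` and `search` are hypothetical callables standing in for an LLM-backed paraphraser and your retriever. Including the original question as one of the variants is a design choice, not a requirement.

```python
from collections import defaultdict
from typing import Callable, Hashable, Sequence


def reciprocal_rank_fusion(
    ranked_lists: Sequence[Sequence[Hashable]], k: int = 60
) -> list[tuple[Hashable, float]]:
    """score(d) = sum over lists of 1 / (k + rank of d), with ranks starting at 1."""
    scores: dict[Hashable, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Passages that several lists rank highly accumulate the largest totals.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def multi_query_retrieve(
    question: str,
    paraphrase: Callable[[str, int], list[str]],       # hypothetical: LLM-backed paraphraser
    search: Callable[[str, int], Sequence[Hashable]],  # hypothetical: returns ranked doc ids
    n_variants: int = 3,
    top_k: int = 10,
) -> list[tuple[Hashable, float]]:
    variants = [question] + paraphrase(question, n_variants)
    ranked_lists = [search(v, top_k) for v in variants]  # one retrieval per variant
    return reciprocal_rank_fusion(ranked_lists)[:top_k]


# Three variants agree that "b" is relevant, so it rises to the top.
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "a", "d"], ["b", "c", "e"]])
print([doc for doc, _ in fused[:3]])
```

RRF uses only rank positions, so it needs no score normalization. Lists from different retrievers, or from dense and keyword search, can be fused directly.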

Reranking

The first-stage retriever is optimized for recall - casting a wide net quickly. A cross-encoder reranker is optimized for precision - it scores the query and each candidate passage jointly to determine which ones most likely contain the answer. The standard pattern is to retrieve a broad candidate set (top 50), then rerank down to the handful of passages that go into the prompt (top 5). Reranking cannot surface evidence the first stage never found, so candidate recall and ranking quality should be evaluated separately.
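The two-stage pattern reduces to a few lines. In the sketch below, `retrieve` and `score_batch` are hypothetical callables: the first is any recall-oriented retriever, and the second wraps a cross-encoder that scores (query, passage) pairs. The 50/5 defaults mirror the numbers above. The commented wiring at the end assumes the sentence-transformers package and one of its public MS MARCO cross-encoder checkpoints.

```python
from typing import Callable, Sequence


def retrieve_then_rerank(
    query: str,
    retrieve: Callable[[str, int], Sequence[str]],                 # hypothetical first-stage retriever
    score_batch: Callable[[str, Sequence[str]], Sequence[float]],  # hypothetical cross-encoder wrapper
    n_candidates: int = 50,
    n_final: int = 5,
) -> list[tuple[str, float]]:
    # Stage 1 (recall): cast a wide, cheap net. If the answer is not in
    # `candidates`, no reranker can recover it.
    candidates = list(retrieve(query, n_candidates))
    if not candidates:
        return []
    # Stage 2 (precision): score each (query, passage) pair jointly, in one batch.
    scores = score_batch(query, candidates)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:n_final]


# One possible score_batch, assuming the sentence-transformers package and the
# public "cross-encoder/ms-marco-MiniLM-L-6-v2" checkpoint:
#
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   score_batch = lambda q, ps: model.predict([(q, p) for p in ps]).tolist()
```

Scoring all candidates in one batch keeps reranking to a single model pass per query. Most of the per-query latency then scales with `n_candidates`.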

Context compression

Retrieved passages often contain more text than is directly relevant. Context compression strips irrelevant sentences from surviving chunks before they reach the model, reducing token usage and keeping the model's attention on the evidence that actually matters. It is applied after reranking, not before.
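This is a minimal sentence-level compressor, sketched under assumptions. It uses a naive regex sentence splitter and a placeholder lexical-overlap scorer. In practice you would swap in a cross-encoder or an LLM judgment through the `score` parameter. The 0.3 threshold is likewise an assumed value to tune on your own data. The input is expected to be the reranked passages, best first.

```python
import re
from typing import Callable, Sequence


def split_sentences(text: str) -> list[str]:
    # Naive splitter: breaks after ".", "!" or "?" followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def lexical_overlap(query: str, sentence: str) -> float:
    """Placeholder relevance score: share of query terms that appear in the sentence."""
    q = set(re.findall(r"\w+", query.lower()))
    s = set(re.findall(r"\w+", sentence.lower()))
    return len(q & s) / len(q) if q else 0.0


def compress_context(
    query: str,
    passages: Sequence[str],                               # already reranked, best first
    score: Callable[[str, str], float] = lexical_overlap,  # swap in a stronger scorer in practice
    threshold: float = 0.3,                                # assumed cut-off; tune on your data
) -> list[str]:
    compressed = []
    for passage in passages:
        kept = [s for s in split_sentences(passage) if score(query, s) >= threshold]
        if kept:  # a passage with no surviving sentence is dropped entirely
            compressed.append(" ".join(kept))
    return compressed


print(compress_context(
    "notice period for contractors",
    ["Contractors must give 30 days notice. The office closes at 6 pm. "
     "Employees follow a separate notice period.",
     "The cafeteria menu changes weekly."],
))
```

Surviving sentences keep their original order within each passage. This helps the model read them as coherent evidence rather than a shuffled list.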

HyDE (Hypothetical Document Embeddings)

Questions and answers often occupy different regions of embedding space, because questions are phrased differently from the documents that answer them. HyDE addresses this by having the model generate a hypothetical answer to the question, then embedding that hypothetical answer instead of the original query. The hypothetical answer's embedding tends to land closer to real documents because it is written in the same register as the source material. Its facts may be wrong, which is acceptable: it is used only to search and is never passed to the model as evidence. It tends to help most in legal, medical, and technical domains, where source vocabulary differs most from how users phrase their questions.
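This sketch shows the HyDE flow, assuming three hypothetical callables: `llm` (prompt to text), `embed` (text to vector), and `vector_search` (vector to ranked hits). The prompt wording is an illustrative assumption. The HyDE paper samples several hypothetical documents and averages their embeddings; this sketch uses a single hypothesis for brevity.

```python
from typing import Callable, Sequence

Vector = Sequence[float]

# Illustrative prompt; match its style to your corpus (e.g. policy or clinical prose).
HYDE_PROMPT = (
    "Write a short passage, in the style of the reference documents, that answers "
    "the question below.\n\nQuestion: {question}\n\nPassage:"
)


def hyde_search(
    question: str,
    llm: Callable[[str], str],                     # hypothetical prompt -> text callable
    embed: Callable[[str], Vector],                # hypothetical embedding function
    vector_search: Callable[[Vector, int], list],  # hypothetical: vector -> ranked hits
    top_k: int = 5,
) -> tuple[str, list]:
    hypothetical = llm(HYDE_PROMPT.format(question=question))
    # Search with the hypothesis's embedding instead of the question's. The
    # hypothesis itself is never passed to the model as evidence.
    hits = vector_search(embed(hypothetical), top_k)
    return hypothetical, hits  # return the hypothesis too, for tracing
```

Returning the hypothesis alongside the hits serves the same purpose as keeping the original and rewritten queries in query rewriting. If retrieval drifts, the trace shows which generated text caused it.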

What these techniques cost

These techniques do not all carry the same production cost, and the costs compound when combined. Hybrid retrieval is almost always worth adding first: BM25 is fast, the infrastructure cost is moderate, and the recall improvement on exact-term queries is consistent. Reranking adds 200-800 ms per query depending on candidate set size and model, so measure this latency before deploying. Multi-query retrieval needs one LLM call to generate the paraphrases, then multiplies your embedding and search calls by their number, typically three to five. HyDE adds one LLM inference call per query to generate the hypothesis. Context compression adds another. A pipeline combining multi-query, reranking, and compression therefore adds three to four inference calls to every user query, and at moderate traffic that cost compounds quickly. Add techniques where evaluation shows a gap, not speculatively.
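The call arithmetic can be made explicit with a small tally. This is a sketch under stated assumptions: one LLM call produces all paraphrases, reranking is a single batched cross-encoder pass, compression is LLM-based, and the HyDE hypothesis's embedding replaces the query's own. `PipelineConfig` and the daily traffic figure are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    multi_query_variants: int = 0  # 0 means multi-query is off
    rerank: bool = False
    hyde: bool = False
    compress: bool = False


def extra_calls_per_query(cfg: PipelineConfig) -> dict[str, int]:
    """Model calls added on top of a baseline of one query embedding and one search."""
    calls = {"llm": 0, "embedding": 0, "rerank": 0}
    if cfg.multi_query_variants:
        calls["llm"] += 1                               # one call writes all paraphrases
        calls["embedding"] += cfg.multi_query_variants  # each paraphrase is embedded and searched
    if cfg.hyde:
        calls["llm"] += 1  # generate the hypothesis (its embedding replaces the query's)
    if cfg.compress:
        calls["llm"] += 1  # assumes an LLM-based compressor
    if cfg.rerank:
        calls["rerank"] += 1  # assumes one batched cross-encoder pass over all candidates
    return calls


cfg = PipelineConfig(multi_query_variants=4, rerank=True, compress=True)
per_query = extra_calls_per_query(cfg)
daily_queries = 100_000  # hypothetical traffic level
per_day = {kind: n * daily_queries for kind, n in per_query.items()}
print(per_query)  # {'llm': 2, 'embedding': 4, 'rerank': 1}
print(per_day)
```

Under these assumptions, the multi-query, reranking, and compression combination comes to three non-embedding inference calls per query. Adding HyDE raises it to four. The paraphrase embeddings come on top of that.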

Which technique should you try first?

Start from the failures you are currently observing. Each one points to the technique above that targets it:

- Exact terms, product codes, or identifiers are not being found in results: hybrid dense and keyword retrieval.
- Relevant passages exist but rank below less useful ones: reranking.
- User questions are vague, ambiguous, or phrased differently from source language: query rewriting.
- Questions have multiple angles that a single query cannot fully capture: multi-query retrieval with Reciprocal Rank Fusion.
- Retrieved context is too long or contains mostly irrelevant sentences: context compression.
- Domain language differs significantly from how users phrase questions: HyDE.

When several apply, start with the lowest-cost, highest-impact technique; as the cost discussion above notes, that is usually hybrid retrieval.

Remember: candidate retrieval finds possible evidence. Reranking decides which evidence deserves attention. Evaluate them separately - conflating them into a single "retrieval quality" score makes it impossible to know which stage to improve.
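Evaluating the two stages separately can be as simple as two metrics per labelled query. Recall@k measures the candidate set, and precision@k measures the reranked passages the model sees. This sketch assumes you have a small set of queries with hand-labelled relevant passage IDs. The metric choice is one reasonable option, not the only one.

```python
from typing import Hashable, Sequence, Set


def recall_at_k(candidates: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Candidate stage: share of relevant passages found anywhere in the top k."""
    if not relevant:
        return 0.0
    return len(set(candidates[:k]) & set(relevant)) / len(relevant)


def precision_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Reranking stage: share of the k passages shown to the model that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for p in ranked[:k] if p in relevant) / k


def stage_report(candidates, reranked, relevant, n_candidates=50, n_final=5) -> dict[str, float]:
    # Low candidate recall: improve the first stage (hybrid, rewriting, multi-query).
    # High recall but low final precision: improve the reranker.
    return {
        "candidate_recall": recall_at_k(candidates, relevant, n_candidates),
        "final_precision": precision_at_k(reranked, relevant, n_final),
    }


report = stage_report(
    candidates=[f"p{i}" for i in range(50)],
    reranked=["p3", "p1", "p2", "p4", "p5"],
    relevant={"p3", "p70"},
)
print(report)  # {'candidate_recall': 0.5, 'final_precision': 0.2}
```

Average both numbers over the labelled set. A single blended score would hide whether "p70" was never retrieved or was retrieved and then ranked out.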

RAG Architecture Series - 6 Parts
