RAG Architecture Series Part 3 of 6

Hybrid Search, Reranking, and HyDE: The Upgrades That Actually Move the Needle

If naive RAG's failure modes cluster around precision and recall, Advanced RAG is the set of techniques that targets those two metrics directly, by wrapping the naive core with pre-retrieval and post-retrieval stages.

The diagram shows the three-stage structure. Each stage has a specific job: pre-retrieval improves what you ask for, retrieval casts a wide net, and post-retrieval narrows the results to what actually matters.

Pre-retrieval: improving the query before it hits the index

Query rewriting reformulates ambiguous queries into a form that retrieves better. Multi-query expansion generates several paraphrased versions of the question, retrieves for each, and merges the results. Decomposition splits compound questions into sub-queries retrieved independently, directly addressing the multi-hop failure mode from Part 2.
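As a rough illustration of multi-query expansion, the sketch below retrieves for the original question plus a few paraphrases and merges the results, dropping duplicates. The names `multi_query_retrieve`, `toy_paraphrase`, `toy_retrieve`, and `TOY_INDEX` are hypothetical stand-ins: in a real system the paraphrases would come from an LLM and the retriever would query your index.

```python
def multi_query_retrieve(question, paraphrase, retrieve, n_variants=3, top_k=5):
    """Retrieve for the question and its paraphrases, then merge without duplicates."""
    queries = [question] + paraphrase(question, n_variants)
    seen, merged = set(), []
    for query in queries:
        for doc_id in retrieve(query, top_k):
            if doc_id not in seen:  # keep the first occurrence only
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


def toy_paraphrase(question, n):
    # Hypothetical canned variants; a real system would call an LLM here.
    variants = [
        "how to reset password",
        "password recovery steps",
        "forgot login credentials",
    ]
    return variants[:n]


# Hypothetical lookup table standing in for a search index.
TOY_INDEX = {
    "how do I reset my password": ["doc_1", "doc_2"],
    "how to reset password": ["doc_2", "doc_3"],
    "password recovery steps": ["doc_4"],
}


def toy_retrieve(query, top_k):
    return TOY_INDEX.get(query, [])[:top_k]


merged = multi_query_retrieve(
    "how do I reset my password", toy_paraphrase, toy_retrieve, n_variants=2
)
print(merged)
```

Merging by first appearance is the simplest possible choice and an assumption of this sketch; a rank-aware merge would usually serve better when the variants disagree.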

[Figure: the three-stage Advanced RAG pipeline. Pre-retrieval (improve the query before it hits the index): query rewriting, multi-query expansion, decomposition, HyDE. Retrieval (cast a wide net, optimise for recall): dense search, sparse/BM25 search, RRF fusion, top-50 candidates. Post-retrieval (narrow down, optimise for precision): cross-encoder rerank, contextual compression, top-5 to the LLM. Recall first, precision second: this sequencing closes most of Naive RAG's quality gap.]

Hybrid retrieval: combining two fundamentally different search mechanisms

Dense (vector) search catches semantic similarity. Sparse (BM25/keyword) search catches exact matches: error codes, product IDs, names. Neither alone is sufficient for enterprise content, which constantly mixes prose with identifiers. Reciprocal Rank Fusion (RRF) merges both by scoring each document on its rank in every result list, not on the raw scores, so documents that multiple retrieval methods agree on rise to the top.
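To make the fusion step concrete, here is a minimal sketch of RRF. Each document receives the sum of 1 / (k + rank) over the lists it appears in. The constant k = 60 is a commonly used default and is assumed here, and the document IDs and the two hit lists are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    A document's fused score is the sum of 1 / (k + rank) over every list
    it appears in, with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search results
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical BM25 results

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

doc_a and doc_c appear in both lists, so they outrank doc_b and doc_d, which each appear in only one. Because only ranks are used, the dense and sparse scores never need to be put on a common scale.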

Post-retrieval: refining results before they reach the model

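Cross-encoder reranking is the highest-leverage technique here. The standard pattern: retrieve the top 50 with fast vector search, then rerank down to the top 5 with a cross-encoder that scores each (query, document) pair jointly. Joint scoring is more precise than comparing independently computed embeddings, but it is too slow to run over a whole corpus, which is why it is applied only to the shortlist. Contextual compression then strips irrelevant sentences from the surviving chunks.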
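The retrieve-then-rerank pattern can be sketched as follows. The `rerank` function takes any callable that scores a (query, document) pair. `toy_overlap_score` is a deliberately crude, hypothetical stand-in for a cross-encoder, used only so the example runs without a model, and the candidate texts are invented.

```python
def rerank(query, candidates, score_pair, top_n=5):
    """Keep the top_n candidates ranked by their joint (query, document) score."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def toy_overlap_score(query, doc):
    # Stand-in for a cross-encoder: fraction of query terms present in the doc.
    query_terms = set(query.lower().split())
    doc_terms = set(doc.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)


# Hypothetical first-stage candidates (a real pipeline would pass ~50 here).
candidates = [
    "Restart the service to clear error E42",
    "Billing cycles run monthly",
    "Error E42 means the cache is full",
]

top = rerank("error E42 cache full", candidates, toy_overlap_score, top_n=2)
print(top)
```

In practice `score_pair` would wrap a trained cross-encoder model. The surrounding logic stays the same: score every shortlisted pair, sort, and truncate.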

Spotlight: HyDE (Hypothetical Document Embeddings)

Questions and answers often live in different regions of embedding space. HyDE's fix: have the LLM generate a hypothetical answer to the question, then embed that answer instead of the question. The hypothetical answer's embedding lands much closer to real documents because it is written in the same register as those documents, even if its details are wrong. HyDE is particularly valuable in legal, medical, and technical support domains.
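A minimal sketch of the HyDE flow, assuming the LLM and the embedding model are passed in as plain callables. Everything named here (`hyde_search`, `toy_embed`, `fake_llm`, `VOCAB`, the two documents) is hypothetical: the toy embedding just counts words from a tiny fixed vocabulary, and the "LLM" returns a canned answer.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hyde_search(question, generate_answer, embed, index, top_k=3):
    """index is a list of (doc_text, doc_vector) pairs."""
    hypothetical = generate_answer(question)  # an LLM call in a real system
    query_vec = embed(hypothetical)           # embed the answer, not the question
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


# Toy stand-ins so the sketch runs without any model.
VOCAB = ["reset", "password", "settings", "invoice", "refund"]


def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def fake_llm(question):
    return "go to settings then reset password"


docs = [
    "open settings and choose reset password",
    "request a refund from the invoice page",
]
index = [(doc, toy_embed(doc)) for doc in docs]

question = "I can't log in, what do I do?"
results = hyde_search(question, fake_llm, toy_embed, index, top_k=1)
print(results)
```

The question shares no vocabulary with either document, so embedding it directly gives the toy retriever nothing to match on. The hypothetical answer uses the documents' own wording, which is the effect HyDE relies on.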

The takeaway: optimise retrieval for recall first by casting a wide net, then optimise for precision by turning a noisy top-50 into a trustworthy top-5.
