Deep dive into retrieval: why pure generation hallucinates, vector similarity search, dense vs sparse retrieval, chunking strategies, and multi-stage retrieval with reranking
15 minutes • Intermediate Level • Dec 2024
Building On Previous Knowledge
In the previous progression, you learned how LLMs generate text token-by-token using learned probability distributions. This creates a fundamental problem: the model can only generate from patterns it memorized during training.
If the answer isn’t in the training data, the model will either refuse to answer or—more dangerously—generate a plausible-sounding but fabricated response.
This progression solves that problem by introducing retrieval: giving the model access to external knowledge at inference time.
What Goes Wrong Without This:
Retrieval Failure Patterns
Symptom: Your AI assistant confidently answers questions about your company's products with completely fabricated information.
Cause: The LLM generates plausible text from patterns, but has no access to your actual documentation. High confidence ≠ correctness.

Symptom: Retrieval returns documents with high similarity scores, but the RAG system still produces incorrect answers.
Cause: You treated retrieval as similarity search. Similarity is SYMMETRIC (A similar to B = B similar to A). Relevance is NOT.

Symptom: RAG works for demo queries but fails for real user queries.
Cause: Demo queries match document phrasing. Real queries use different vocabulary. Query-document mismatch.
The Limits of Pure Generation
LLMs are trained on static data with a cutoff date. They generate from learned patterns, not live facts.
Pure Generation Problems
Problems with pure generation:
1. KNOWLEDGE CUTOFF
Q: "Who won the 2024 election?"
A: "I don't have information past my training cutoff..."
2. HALLUCINATION
Q: "What's the API for uploading files in our product?"
A: "Use POST /api/upload with multipart/form-data..."
(confidently wrong—made up based on patterns)
3. NO PRIVATE DATA
Q: "What did the client say in yesterday's email?"
A: Cannot access—not in training data
4. OUTDATED FACTS
Q: "What's the current price of Bitcoin?"
A: Training data price, not live price
The model generates plausible text, but plausible ≠ true.
Retrieval: Grounding Generation in Facts
Instead of asking the model to recall facts, give it facts to use.
Retrieval vs Pure Generation
Without retrieval:
User query → LLM → Generated answer (may hallucinate)
With retrieval:
User query → Search knowledge base → Relevant docs
                          ↓
[Query + Docs] → LLM → Grounded answer
The LLM now has context to work with.
The Retrieval Pipeline
Retrieval Pipeline
INDEXING (offline):

  Documents
       ↓
  ┌───────────┐
  │   Chunk   │  Split into manageable pieces
  └───────────┘
       ↓
  ┌───────────┐
  │   Embed   │  Convert chunks to vectors
  └───────────┘
       ↓
  ┌───────────┐
  │   Index   │  Store in vector database
  └───────────┘

QUERY (online):

  User query
       ↓
  ┌───────────┐
  │   Embed   │  Same embedding model as indexing
  └───────────┘
       ↓
  ┌───────────┐
  │  Search   │  Find similar vectors in index
  └───────────┘
       ↓
  Top-K most similar chunks
Vector Similarity Search
Core mechanic: find vectors closest to query vector.
Why it works: Embedding models map similar meanings to similar vectors. “How do I reset my password?” is close to “Password reset instructions” even though they share few words.
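You can check this claim directly with the same all-MiniLM-L6-v2 model used in the code example at the end of this section. Exact scores will vary, but the related pair should score clearly higher than the unrelated one:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

query = "How do I reset my password?"
related = "Password reset instructions"
unrelated = "Quarterly revenue grew 12% year over year."

# Encode all three sentences into vectors
q, r, u = model.encode([query, related, unrelated])

# Cosine similarity: related pair scores high despite different wording
print(util.cos_sim(q, r).item())  # similar meaning
print(util.cos_sim(q, u).item())  # unrelated meaning, much lower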
Query: "error code E1234"
Sparse (BM25): Finds docs with exact string "E1234" ✓Dense: May not find if "E1234" wasn't in training data✗Query: "my application keeps crashing"
Sparse (BM25): Needs exact word "crashing" ✗Dense: Matches "app failure", "program stops working" ✓Hybrid: Best of both
Reciprocal Rank Fusion (RRF)
When combining dense and sparse results, use RRF to merge ranked lists:
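RRF scores each document as the sum of 1/(k + rank) over every ranked list it appears in, where rank is 1-based and k is a smoothing constant (60 is the usual default). A minimal sketch, with hypothetical doc IDs:

from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Merge several ranked lists of doc IDs into a single ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each appearance contributes 1 / (k + rank); better ranks contribute more
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical rankings from a dense retriever and from BM25
dense_ranking = ["doc_a", "doc_b", "doc_c", "doc_d"]
sparse_ranking = ["doc_c", "doc_a", "doc_e"]
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))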
RRF is simple and often performs just as well as learned score combination.
Chunking: Why and How
Documents are too long to embed as single units:
Embedding models have token limits (512-8192)
Long texts dilute specific information
Retrieval granularity matters
Why Chunking Matters
Document: 50-page manual
Bad: One embedding for entire document
  → Query matches but relevant info is buried in noise
Good: Chunk into ~500 token pieces
  → Query matches the specific relevant section
Chunking Strategies
Chunking strategies:
Fixed-size: Every N tokens
  Simple, may break mid-sentence
Sentence-based: Split at sentence boundaries
  Preserves complete thoughts
Paragraph-based: Split at paragraph breaks
  Preserves larger context
Semantic: Split where topic changes
  Best quality, more complex
Recursive: Try larger splitters first, fall back to smaller ones
  Hierarchical, respects document structure
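The recursive strategy is what most off-the-shelf splitters implement. A minimal sketch, using character counts as a stand-in for token counts and an assumed separator hierarchy (paragraphs, then lines, then sentences, then words). Production splitters also merge small adjacent pieces back up to the size limit, which this sketch skips:

def recursive_split(text: str, max_chars: int = 1500,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split with the coarsest separator that works; recurse on pieces
    that are still too long, falling back to finer separators."""
    if len(text) <= max_chars:
        return [text]
    for sep in separators:
        parts = [p for p in text.split(sep) if p]
        if len(parts) > 1:
            chunks: list[str] = []
            for part in parts:
                # A part may still exceed max_chars; recurse with the same hierarchy
                chunks.extend(recursive_split(part, max_chars, separators))
            return chunks
    # No separator left: hard-split on character count
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]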
Overlap
Include some text from previous chunk to preserve context at boundaries:
Chunk Overlap
Chunk 1: "...the password reset link. Click it to..."
Chunk 2: "...reset link. Click it to create a new password..."
            ↑ overlap region
Why: Context at boundaries isn't lost.
Tradeoff: More storage, potential duplicate retrieval.
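A minimal sketch of fixed-size chunking with overlap, splitting on whitespace words as a rough proxy for tokens (a real pipeline would count with the embedding model's tokenizer):

def chunk_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size chunks that share `overlap` words with the previous chunk."""
    assert 0 <= overlap < chunk_size
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # advance by less than a full chunk to create overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the document
    return chunks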
Multi-Stage Retrieval
Single-stage retrieval trades off speed vs accuracy. Multi-stage gets both.
Multi-Stage Retrieval
SINGLE-STAGE RETRIEVAL
--------------------------------------------------------------------
Query → Embedding Model → Vector Search → Top 10 Results
Fast (~20ms), but accuracy limited by bi-encoder's ability
to independently embed query and docs.
MULTI-STAGE RETRIEVAL (Retrieve → Rerank)
--------------------------------------------------------------------
Stage 1: Fast retrieval (bi-encoder)
Query → Top 100 candidates (~20ms)
Uses: Dense/sparse/hybrid retrieval
Stage 2: Accurate reranking (cross-encoder)
Rerank 100 → Top 10 (~200ms for 100 pairs)
Uses: Cross-encoder model
Total: ~250ms, but significantly better accuracy
Bi-Encoder vs Cross-Encoder
BI-ENCODER (used in retrieval):

  Query ─────→ [Encoder] ─────→ query_vector
                                      ↓
                        cosine_similarity = score
                                      ↑
  Doc   ─────→ [Encoder] ─────→ doc_vector

  ✓ Can pre-compute doc vectors (once)
  ✓ Fast similarity search at query time
  ✓ Scales to millions of documents
  ✗ Query and doc don't "see" each other
  ✗ Lower accuracy

CROSS-ENCODER (used in reranking):

  [CLS] query [SEP] document [SEP] ─────→ [BERT] ─────→ score

  ✓ Query and doc interact via attention
  ✓ Higher accuracy (5-10% improvement)
  ✗ Must encode every (query, doc) pair
  ✗ Can't pre-compute anything
  ✗ Slow: O(n) for n documents
Why Two Stages?
Two-Stage Tradeoffs
┌──────────────────────────┬───────────┬──────────┬───────────────────┐
│ Method                   │ Latency   │ Accuracy │ Use Case          │
├──────────────────────────┼───────────┼──────────┼───────────────────┤
│ Bi-encoder only          │ ~20ms     │ 85%      │ Speed-critical    │
│ Cross-encoder only       │ ~20s/1M   │ 95%      │ Tiny corpus only  │
│ Bi-encoder → Cross-enc   │ ~250ms    │ 93%      │ Production        │
└──────────────────────────┴───────────┴──────────┴───────────────────┘
The bi-encoder filters to candidates.
The cross-encoder reranks for precision.
Best of both worlds.
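A sketch of the rerank stage using the CrossEncoder class from sentence-transformers. The checkpoint name is one commonly used MS MARCO reranker (an assumption, not prescribed by this lesson), and `candidates` would come from your stage-1 retriever:

from sentence_transformers import CrossEncoder

# Load once; this is the slow, accurate stage-2 model
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[tuple[str, float]]:
    """Score every (query, candidate) pair jointly and keep the best top_k."""
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)  # one cross-encoder forward pass per pair
    ranked = sorted(zip(candidates, scores), key=lambda item: float(item[1]), reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:top_k]]

# Usage: stage 1 returns ~100 candidate chunks, stage 2 keeps the 10 best
# top_10 = rerank(user_query, stage1_candidates, top_k=10)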
Retrieval Quality Metrics
How do you know retrieval is working?
Retrieval Quality Metrics
Recall@K
"Of all relevant docs, how many are in my top-K?"
5 relevant docs exist, top-10 retrieval finds 4
Recall@10 = 4/5 = 0.80
Critical for RAG: if relevant doc isn't retrieved,
the LLM can't use it.
Precision@K
"Of my top-K results, how many are relevant?"
Top-10 has 4 relevant, 6 irrelevant
Precision@10 = 4/10 = 0.40
Matters for: context window efficiency, noise reduction
MRR (Mean Reciprocal Rank)
"How high is the first relevant result?"
First relevant at position 3 → RR = 1/3
Average across queries = MRR
For RAG, Recall@K usually matters most. If the answer isn’t retrieved, generation fails.
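A sketch of the three metrics over hypothetical doc IDs; `retrieved` is the ranked ID list a retriever returned for one query, and `relevant` is the ground-truth set for that query. MRR is the average of the per-query reciprocal rank:

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant docs that appear in the top-k results."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k if k else 0.0

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result (0.0 if none is retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Example matching the numbers above: 5 relevant docs exist, 4 are in the top 10
retrieved = ["d9", "d8", "d1", "d2", "d3", "d7", "d4", "d6", "d5", "d0"]
relevant = {"d1", "d2", "d3", "d4", "d11"}
print(recall_at_k(retrieved, relevant, 10))     # 4/5 = 0.80
print(precision_at_k(retrieved, relevant, 10))  # 4/10 = 0.40
print(reciprocal_rank(retrieved, relevant))     # first relevant at rank 3 → 0.33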
Query: "How do I bake cookies?"
Document: "I baked cookies yesterday and they were delicious."
Similarity: HIGH (same topic, same words)
Relevance: ZERO (describes past event, doesn't answer the question)
Similarity is SYMMETRIC. Relevance is NOT.
A relevant document must be similar, but a similar document
isn't necessarily relevant.
The Query-Document Mismatch Problem
Query-Document Mismatch
Query: "How do I fix the login bug?"
(question format, user language)
Doc: "Authentication failures can be resolved by..."
(statement format, technical language)
Problem: Different phrasing may have lower similarity
even when doc answers the query.
Solutions:
Query Expansion: Add synonyms and related terms
HyDE: Generate hypothetical answer, embed that instead
Query Rewriting: Transform user query to match document style
Fine-tuned Retrievers: Train on your domain’s (query, doc) pairs
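Of these, HyDE is the easiest to sketch: draft a hypothetical answer, embed the draft, and search with that vector instead of the raw question. In the sketch below, draft_hypothetical_answer is a hypothetical stand-in returning a canned string; in practice it would be an LLM call.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def draft_hypothetical_answer(question: str) -> str:
    """Stand-in for an LLM call that drafts a plausible (possibly wrong) answer.
    The content doesn't need to be correct, only phrased like the documents."""
    return "Authentication failures can be resolved by resetting the user's credentials."

def hyde_query_vector(question: str):
    """HyDE: embed the drafted answer so the query vector sits closer
    to answer-style documents than the raw question would."""
    return model.encode(draft_hypothetical_answer(question), normalize_embeddings=True)

# The resulting vector is searched against the index exactly like a normal query vector
query_vector = hyde_query_vector("How do I fix the login bug?")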
Chunking Mistakes
Chunking Pitfalls
The correct answer might be split across two chunks.
Neither chunk alone answers the query.
Both chunks score medium similarity.
Retrieval "succeeds" (returns chunks).
RAG fails (no chunk contains the answer).
Chunking is a system design decision, not preprocessing trivia.
Code Example
Basic semantic search with sentence-transformers:
import numpy as np
from sentence_transformers import SentenceTransformer

# Initialize embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample knowledge base
documents = [
    "To reset your password, go to Settings > Security > Reset Password.",
    "Our API rate limit is 100 requests per minute for free tier.",
    "Contact support@example.com for billing questions.",
    "The application requires Python 3.9 or higher.",
    "Two-factor authentication can be enabled in Settings > Security.",
]

# Index: Embed all documents (normalize so dot product = cosine similarity)
doc_embeddings = model.encode(documents, normalize_embeddings=True)

def search(query: str, top_k: int = 3) -> list[tuple[str, float]]:
    """Semantic search: find most similar documents to query."""
    # Embed query with the same model and the same normalization
    query_embedding = model.encode(query, normalize_embeddings=True)

    # Compute cosine similarities
    # (embeddings are unit-normalized, so dot product = cosine similarity)
    similarities = np.dot(doc_embeddings, query_embedding)

    # Get top-k indices, highest similarity first
    top_indices = np.argsort(similarities)[::-1][:top_k]

    # Return documents with scores
    results = []
    for idx in top_indices:
        results.append((documents[idx], float(similarities[idx])))
    return results

# Test queries
queries = [
    "How do I change my password?",
    "What are the API limits?",
    "I need help with my bill",
]

for query in queries:
    print(f"\nQuery: {query}")
    results = search(query, top_k=2)
    for doc, score in results:
        print(f"  [{score:.3f}] {doc[:50]}...")
Key Takeaways
1. Pure LLM generation has limits
- Knowledge cutoff, hallucinations, no private data access
2. Retrieval grounds generation in facts
- Find relevant docs, then generate with context
3. Dense retrieval uses semantic similarity
- Embed query and docs, find closest vectors
4. Sparse retrieval (BM25) uses keyword matching
- Better for exact terms, combine with dense = hybrid
5. Chunking matters
- Documents → smaller pieces for granular retrieval
- Overlap preserves context at boundaries
6. Multi-stage retrieval improves accuracy
- Fast bi-encoder for recall, slow cross-encoder for precision
7. Similarity ≠ Relevance
- High similarity score doesn't mean the doc answers the query
Verify Your Understanding
Before proceeding, you should be able to:
Explain why a similarity score of 0.95 can still be useless, using a concrete example.
Given this scenario, identify the problem:
Query: “How do I reset my password?”
Doc1: “Password reset failed for user X at 3:42pm” (similarity = 0.91)