
AI Engineering Series

Generation to Retrieval - Grounding LLMs in Facts

Deep dive into retrieval: why pure generation hallucinates, vector similarity search, dense vs sparse retrieval, chunking strategies, and multi-stage retrieval with reranking

Building On Previous Knowledge

In the previous progression, you learned how LLMs generate text token-by-token using learned probability distributions. This creates a fundamental problem: the model can only generate from patterns it memorized during training.

If the answer isn’t in the training data, the model will either refuse to answer or—more dangerously—generate a plausible-sounding but fabricated response.

This progression solves that problem by introducing retrieval: giving the model access to external knowledge at inference time.

What Goes Wrong Without This:

[Figure: Retrieval Failure Patterns]

The Limits of Pure Generation

LLMs are trained on static data with a cutoff date. They generate from learned patterns, not live facts.

[Figure: Pure Generation Problems]

The model generates plausible text, but plausible ≠ true.


Retrieval: Grounding Generation in Facts

Instead of asking the model to recall facts, give it facts to use.

[Figure: Retrieval vs Pure Generation]

The Retrieval Pipeline

[Figure: Retrieval Pipeline]

Core mechanic: find the document vectors closest to the query vector.

[Figure: Vector Similarity Search]

Why it works: Embedding models map similar meanings to similar vectors. “How do I reset my password?” is close to “Password reset instructions” even though they share few words.
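
A minimal sketch of that claim, using the same sentence-transformers setup as the full code example later in this article:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('all-MiniLM-L6-v2')

# The two phrases from the example above, plus an unrelated sentence
sentences = [
    "How do I reset my password?",
    "Password reset instructions",
    "The weather is nice today",
]
embeddings = model.encode(sentences)

# Cosine similarity between the query and each candidate
print(cos_sim(embeddings[0], embeddings[1]))  # high: same meaning, few shared words
print(cos_sim(embeddings[0], embeddings[2]))  # much lower: unrelated meaning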


Dense vs Sparse Retrieval

Two fundamentally different approaches:

[Figure: Dense vs Sparse Retrieval]
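
To make the contrast concrete, here is a sparse retrieval sketch. It assumes the rank_bm25 package and deliberately naive whitespace tokenization; nothing about it is specific to this article's stack:

from rank_bm25 import BM25Okapi

documents = [
    "To reset your password, go to Settings > Security > Reset Password.",
    "Our API rate limit is 100 requests per minute for free tier.",
    "Contact support@example.com for billing questions.",
]

# BM25 scores documents by weighted term overlap, not by meaning
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

# Queries sharing exact terms ("api", "rate", "limit") score well...
print(bm25.get_scores("api rate limit".lower().split()))
# ...but a paraphrase with no shared terms scores zero everywhere
print(bm25.get_scores("how many calls am i allowed".lower().split()))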

In practice: Combine both (hybrid search).

[Figure: Hybrid Search Benefits]

Reciprocal Rank Fusion (RRF)

When combining dense and sparse results, use RRF to merge ranked lists:

[Figure: Reciprocal Rank Fusion]

RRF is simple and often performs just as well as learned score combination.
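
As a concrete sketch: the usual RRF formulation gives each document a score of 1 / (k + rank), summed over every ranked list it appears in, with k commonly set to 60. In plain Python:

from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc IDs into a single fused ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Example: dense and sparse retrievers disagree on ordering
dense_results = ["doc_a", "doc_b", "doc_c"]
sparse_results = ["doc_c", "doc_a", "doc_d"]
print(reciprocal_rank_fusion([dense_results, sparse_results]))
# doc_a wins: it sits near the top of both lists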


Chunking: Why and How

Documents are too long to embed as single units:

  1. Embedding models have token limits (512-8192)
  2. Long texts dilute specific information
  3. Retrieval granularity matters

[Figure: Why Chunking Matters]

Chunking Strategies

[Figure: Chunking Strategies]

Overlap

Include some text from the previous chunk to preserve context at boundaries:

[Figure: Chunk Overlap]
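
A minimal sketch of fixed-size chunking with overlap, splitting on words for simplicity (production chunkers more often split on sentences, paragraphs, or the embedding model's own tokens):

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks; each chunk repeats `overlap` words from the previous one."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the document
    return chunks

# Toy document: with a 50-word chunk and 10-word overlap, boundary
# sentences appear in two chunks, so either chunk can be retrieved.
doc = "Two-factor authentication adds a second verification step. " * 40
chunks = chunk_text(doc, chunk_size=50, overlap=10)
print(len(chunks), "chunks,", len(chunks[0].split()), "words in the first")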

Multi-Stage Retrieval

Single-stage retrieval trades off speed vs accuracy. Multi-stage gets both.

[Figure: Multi-Stage Retrieval]

Bi-Encoder vs Cross-Encoder

[Figure: Bi-Encoder vs Cross-Encoder]

Why Two Stages?

[Figure: Two-Stage Tradeoffs]
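
A minimal two-stage sketch with sentence-transformers: a fast bi-encoder narrows the collection to a handful of candidates, then a cross-encoder reranks only those. The cross-encoder checkpoint named here is just a commonly used public example, not a requirement:

import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

documents = [
    "To reset your password, go to Settings > Security > Reset Password.",
    "Password reset failed for user admin at 3:42pm.",
    "Two-factor authentication can be enabled in Settings > Security.",
    "Our API rate limit is 100 requests per minute for free tier.",
]
doc_embeddings = bi_encoder.encode(documents, normalize_embeddings=True)

query = "How do I reset my password?"

# Stage 1: cheap vector search over the whole collection, keep top candidates
query_embedding = bi_encoder.encode(query, normalize_embeddings=True)
candidate_idx = np.argsort(doc_embeddings @ query_embedding)[::-1][:3]

# Stage 2: the expensive cross-encoder scores only those candidates
pairs = [(query, documents[i]) for i in candidate_idx]
rerank_scores = cross_encoder.predict(pairs)
best = candidate_idx[int(np.argmax(rerank_scores))]
print(documents[best])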

Retrieval Quality Metrics

How do you know retrieval is working?

[Figure: Retrieval Quality Metrics]

For RAG, Recall@K usually matters most. If the answer isn’t retrieved, generation fails.
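
A minimal sketch of both metrics for a single query, given the set of document IDs known to be relevant (building that labeled set is usually the hard part):

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    return len(set(retrieved[:k]) & relevant) / k

retrieved = ["doc_3", "doc_7", "doc_1", "doc_9", "doc_2"]
relevant = {"doc_1", "doc_4"}

print(recall_at_k(retrieved, relevant, k=5))     # 0.5: one of two relevant docs was retrieved
print(precision_at_k(retrieved, relevant, k=5))  # 0.2: one of five results is relevant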


Common Pitfalls

Misconception: “High similarity score = relevant result”

[Figure: Similarity vs Relevance]

The Query-Document Mismatch Problem

[Figure: Query-Document Mismatch]

Solutions:

  1. Query Expansion: Add synonyms and related terms
  2. HyDE: Generate a hypothetical answer and embed that instead (sketched after this list)
  3. Query Rewriting: Transform user query to match document style
  4. Fine-tuned Retrievers: Train on your domain’s (query, doc) pairs
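
A minimal sketch of the HyDE idea from item 2. The generate_hypothetical_answer function below is a stand-in for whatever LLM call you use, hard-coded here so the sketch runs; the point is that the text being embedded reads like a document rather than like a question:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def generate_hypothetical_answer(query: str) -> str:
    """Stand-in for an LLM call, e.g. prompting a model to draft a short answer.
    The draft may be factually wrong; it only needs to sound like the target docs."""
    return "Go to Settings > Security > Reset Password and follow the prompts."

query = "How do I reset my password?"
hypothetical_doc = generate_hypothetical_answer(query)

# Embed the hypothetical answer instead of the raw question: the intent is
# that this vector lands in "answer space", nearer to answer-style documents.
query_vec = model.encode(hypothetical_doc, normalize_embeddings=True)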

Chunking Mistakes

[Figure: Chunking Pitfalls]

Code Example

Basic semantic search with sentence-transformers:

import numpy as np
from sentence_transformers import SentenceTransformer

# Initialize embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample knowledge base
documents = [
    "To reset your password, go to Settings > Security > Reset Password.",
    "Our API rate limit is 100 requests per minute for free tier.",
    "Contact support@example.com for billing questions.",
    "The application requires Python 3.9 or higher.",
    "Two-factor authentication can be enabled in Settings > Security.",
]

# Index: Embed all documents
doc_embeddings = model.encode(documents, normalize_embeddings=True)

def search(query: str, top_k: int = 3) -> list[tuple[str, float]]:
    """Semantic search: find most similar documents to query."""
    # Embed query with same model (normalized, like the documents)
    query_embedding = model.encode(query, normalize_embeddings=True)

    # Compute cosine similarities
    # (embeddings are normalized, so dot product = cosine similarity)
    similarities = np.dot(doc_embeddings, query_embedding)

    # Get top-k indices
    top_indices = np.argsort(similarities)[::-1][:top_k]

    # Return documents with scores
    results = []
    for idx in top_indices:
        results.append((documents[idx], similarities[idx]))
    return results

# Test queries
queries = [
    "How do I change my password?",
    "What are the API limits?",
    "I need help with my bill",
]

for query in queries:
    print(f"\nQuery: {query}")
    results = search(query, top_k=2)
    for doc, score in results:
        print(f"  [{score:.3f}] {doc[:50]}...")

Key Takeaways

[Figure: Key Takeaways]

Verify Your Understanding

Before proceeding, you should be able to:

Explain why a similarity score of 0.95 can still be useless, using a concrete example.

Given this scenario, identify the problem:

  • Query: “How do I reset my password?”
  • Doc1: “Password reset failed for user X at 3:42pm” (similarity = 0.91)
  • Doc2: “Go to Settings > Security > Reset Password” (similarity = 0.87)

Which document is more relevant? Why is it ranked lower?

Identify the error in this statement: “BM25 is outdated, dense retrieval is always better.”

Your retrieval returns 10 documents. 8 are similar. 2 answer the question. Which metric captures this problem—Recall@10 or Precision@10?


What’s Next

After this, you can:

  • Continue → Retrieval → RAG — putting it all together
  • Build → Semantic search for your documents
