Deep dive into retrieval: why pure generation hallucinates, vector similarity search, dense vs sparse retrieval, chunking strategies, and multi-stage retrieval with reranking
15 minutes • Intermediate Level • Dec 2024
Building On Previous Knowledge
In the previous progression, you learned how LLMs generate text token-by-token using learned probability distributions. This creates a fundamental problem: the model can only generate from patterns it memorized during training.
If the answer isn’t in the training data, the model will either refuse to answer or—more dangerously—generate a plausible-sounding but fabricated response.
This progression solves that problem by introducing retrieval: giving the model access to external knowledge at inference time.
What Goes Wrong Without This:
Retrieval Failure Patterns
Symptom: Your AI assistant confidently answers questions about your company's products with completely fabricated information.
Cause: The LLM generates plausible text from patterns, but has no access to your actual documentation. High confidence ≠ correctness.

Symptom: Retrieval returns documents with high similarity scores, but the RAG system still produces incorrect answers.
Cause: You treated retrieval as similarity search. Similarity is SYMMETRIC (A similar to B = B similar to A). Relevance is NOT.

Symptom: RAG works for demo queries but fails for real user queries.
Cause: Demo queries match document phrasing. Real queries use different vocabulary. Query-document mismatch.
The Limits of Pure Generation
LLMs are trained on static data with a cutoff date. They generate from learned patterns, not live facts.
Pure Generation Problems
Problems with pure generation:
1. KNOWLEDGE CUTOFF
Q: "Who won the 2024 election?"
A: "I don't have information past my training cutoff..."
2. HALLUCINATION
Q: "What's the API for uploading files in our product?"
A: "Use POST /api/upload with multipart/form-data..."
(confidently wrong—made up based on patterns)
3. NO PRIVATE DATA
Q: "What did the client say in yesterday's email?"
A: Cannot access—not in training data
4. OUTDATED FACTS
Q: "What's the current price of Bitcoin?"
A: Training data price, not live price
The model generates plausible text, but plausible ≠ true.
Retrieval: Grounding Generation in Facts
Instead of asking the model to recall facts, give it facts to use.
Retrieval vs Pure Generation
Without retrieval:
User query → LLM → Generated answer (may hallucinate)
With retrieval:
User query → Search knowledge base → Relevant docs
                          ↓
[Query + Docs] → LLM → Grounded answer
The LLM now has context to work with.
The Retrieval Pipeline
Retrieval Pipeline
INDEXING (offline):

  Documents
       ↓
  ┌───────────┐
  │   Chunk   │  Split into manageable pieces
  └───────────┘
       ↓
  ┌───────────┐
  │   Embed   │  Convert chunks to vectors
  └───────────┘
       ↓
  ┌───────────┐
  │   Index   │  Store in vector database
  └───────────┘

QUERY (online):

  User query
       ↓
  ┌───────────┐
  │   Embed   │  Same embedding model as indexing
  └───────────┘
       ↓
  ┌───────────┐
  │  Search   │  Find similar vectors in index
  └───────────┘
       ↓
  Top-K most similar chunks
Vector Similarity Search
Core mechanic: find vectors closest to query vector.
Why it works: Embedding models map similar meanings to similar vectors. “How do I reset my password?” is close to “Password reset instructions” even though they share few words.
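You can check this claim directly with the same all-MiniLM-L6-v2 model used in the code example at the end of this section. Exact scores will vary, but the related pair should score clearly higher than the unrelated one:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

query = "How do I reset my password?"
related = "Password reset instructions"
unrelated = "Quarterly revenue grew 12% year over year."

# Encode all three sentences into vectors
q, r, u = model.encode([query, related, unrelated])

# Cosine similarity: related pair scores high despite different wording
print(util.cos_sim(q, r).item())  # similar meaning
print(util.cos_sim(q, u).item())  # unrelated meaning, much lower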
Query: "error code E1234"
Sparse (BM25): Finds docs with exact string "E1234" ✓Dense: May not find if "E1234" wasn't in training data✗Query: "my application keeps crashing"
Sparse (BM25): Needs exact word "crashing" ✗Dense: Matches "app failure", "program stops working" ✓Hybrid: Best of both
Reciprocal Rank Fusion (RRF)
When combining dense and sparse results, use RRF to merge ranked lists:
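RRF scores each document as the sum of 1/(k + rank) over every ranked list it appears in, where rank is 1-based and k is a smoothing constant (60 is the usual default). A minimal sketch, with hypothetical doc IDs:

from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Merge several ranked lists of doc IDs into a single ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each appearance contributes 1 / (k + rank); better ranks contribute more
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical rankings from a dense retriever and from BM25
dense_ranking = ["doc_a", "doc_b", "doc_c", "doc_d"]
sparse_ranking = ["doc_c", "doc_a", "doc_e"]
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))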
RRF is simple and often performs just as well as learned score combination.
Chunking: Why and How
Documents are too long to embed as single units:
Embedding models have token limits (512-8192)
Long texts dilute specific information
Retrieval granularity matters
Why Chunking Matters
Document: 50-page manual
Bad: One embedding for entire document
  → Query matches but relevant info is buried in noise
Good: Chunk into ~500 token pieces
  → Query matches the specific relevant section
Chunking Strategies
Chunking strategies:
Fixed-size: Every N tokens
  Simple, may break mid-sentence
Sentence-based: Split at sentence boundaries
  Preserves complete thoughts
Paragraph-based: Split at paragraph breaks
  Preserves larger context
Semantic: Split where topic changes
  Best quality, more complex
Recursive: Try larger splitters first, fall back to smaller ones
  Hierarchical, respects document structure
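The recursive strategy is what most off-the-shelf splitters implement. A minimal sketch, using character counts as a stand-in for token counts and an assumed separator hierarchy (paragraphs, then lines, then sentences, then words). Production splitters also merge small adjacent pieces back up to the size limit, which this sketch skips:

def recursive_split(text: str, max_chars: int = 1500,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split with the coarsest separator that works; recurse on pieces
    that are still too long, falling back to finer separators."""
    if len(text) <= max_chars:
        return [text]
    for sep in separators:
        parts = [p for p in text.split(sep) if p]
        if len(parts) > 1:
            chunks: list[str] = []
            for part in parts:
                # A part may still exceed max_chars; recurse with the same hierarchy
                chunks.extend(recursive_split(part, max_chars, separators))
            return chunks
    # No separator left: hard-split on character count
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]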
Overlap
Include some text from previous chunk to preserve context at boundaries:
Chunk Overlap
Chunk 1: "...the password reset link. Click it to..."
Chunk 2: "...reset link. Click it to create a new password..."
            ↑ overlap region
Why: Context at boundaries isn't lost.
Tradeoff: More storage, potential duplicate retrieval.
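A minimal sketch of fixed-size chunking with overlap, splitting on whitespace words as a rough proxy for tokens (a real pipeline would count with the embedding model's tokenizer):

def chunk_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size chunks that share `overlap` words with the previous chunk."""
    assert 0 <= overlap < chunk_size
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # advance by less than a full chunk to create overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the document
    return chunks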
Multi-Stage Retrieval
Single-stage retrieval trades off speed vs accuracy. Multi-stage gets both.
Multi-Stage Retrieval
SINGLE-STAGE RETRIEVAL
--------------------------------------------------------------------
Query → Embedding Model → Vector Search → Top 10 Results
Fast (~20ms), but accuracy limited by bi-encoder's ability
to independently embed query and docs.
MULTI-STAGE RETRIEVAL (Retrieve → Rerank)
--------------------------------------------------------------------
Stage 1: Fast retrieval (bi-encoder)
Query → Top 100 candidates (~20ms)
Uses: Dense/sparse/hybrid retrieval
Stage 2: Accurate reranking (cross-encoder)
Rerank 100 → Top 10 (~200ms for 100 pairs)
Uses: Cross-encoder model
Total: ~250ms, but significantly better accuracy
Bi-Encoder vs Cross-Encoder
BI-ENCODER (used in retrieval):

  Query ─────→ [Encoder] ─────→ query_vector
                                      ↓
                        cosine_similarity = score
                                      ↑
  Doc   ─────→ [Encoder] ─────→ doc_vector

  ✓ Can pre-compute doc vectors (once)
  ✓ Fast similarity search at query time
  ✓ Scales to millions of documents
  ✗ Query and doc don't "see" each other
  ✗ Lower accuracy

CROSS-ENCODER (used in reranking):

  [CLS] query [SEP] document [SEP] ─────→ [BERT] ─────→ score

  ✓ Query and doc interact via attention
  ✓ Higher accuracy (5-10% improvement)
  ✗ Must encode every (query, doc) pair
  ✗ Can't pre-compute anything
  ✗ Slow: O(n) for n documents
Why Two Stages?
Two-Stage Tradeoffs
┌──────────────────────────┬───────────┬──────────┬───────────────────┐
│ Method                   │ Latency   │ Accuracy │ Use Case          │
├──────────────────────────┼───────────┼──────────┼───────────────────┤
│ Bi-encoder only          │ ~20ms     │ 85%      │ Speed-critical    │
│ Cross-encoder only       │ ~20s/1M   │ 95%      │ Tiny corpus only  │
│ Bi-encoder → Cross-enc   │ ~250ms    │ 93%      │ Production        │
└──────────────────────────┴───────────┴──────────┴───────────────────┘
The bi-encoder filters to candidates.
The cross-encoder reranks for precision.
Best of both worlds.
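A sketch of the rerank stage using the CrossEncoder class from sentence-transformers. The checkpoint name is one commonly used MS MARCO reranker (an assumption, not prescribed by this lesson), and `candidates` would come from your stage-1 retriever:

from sentence_transformers import CrossEncoder

# Load once; this is the slow, accurate stage-2 model
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[tuple[str, float]]:
    """Score every (query, candidate) pair jointly and keep the best top_k."""
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)  # one cross-encoder forward pass per pair
    ranked = sorted(zip(candidates, scores), key=lambda item: float(item[1]), reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:top_k]]

# Usage: stage 1 returns ~100 candidate chunks, stage 2 keeps the 10 best
# top_10 = rerank(user_query, stage1_candidates, top_k=10)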
Retrieval Quality Metrics
How do you know retrieval is working?
Retrieval Quality Metrics
Recall@K
"Of all relevant docs, how many are in my top-K?"
5 relevant docs exist, top-10 retrieval finds 4
Recall@10 = 4/5 = 0.80
Critical for RAG: if relevant doc isn't retrieved,
the LLM can't use it.
Precision@K
"Of my top-K results, how many are relevant?"
Top-10 has 4 relevant, 6 irrelevant
Precision@10 = 4/10 = 0.40
Matters for: context window efficiency, noise reduction
MRR (Mean Reciprocal Rank)
"How high is the first relevant result?"
First relevant at position 3 → RR = 1/3
Average across queries = MRR
For RAG, Recall@K usually matters most. If the answer isn’t retrieved, generation fails.
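A sketch of the three metrics over hypothetical doc IDs; `retrieved` is the ranked ID list a retriever returned for one query, and `relevant` is the ground-truth set for that query. MRR is the average of the per-query reciprocal rank:

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant docs that appear in the top-k results."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k if k else 0.0

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result (0.0 if none is retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Example matching the numbers above: 5 relevant docs exist, 4 are in the top 10
retrieved = ["d9", "d8", "d1", "d2", "d3", "d7", "d4", "d6", "d5", "d0"]
relevant = {"d1", "d2", "d3", "d4", "d11"}
print(recall_at_k(retrieved, relevant, 10))     # 4/5 = 0.80
print(precision_at_k(retrieved, relevant, 10))  # 4/10 = 0.40
print(reciprocal_rank(retrieved, relevant))     # first relevant at rank 3 → 0.33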
Query: "How do I bake cookies?"
Document: "I baked cookies yesterday and they were delicious."
Similarity: HIGH (same topic, same words)
Relevance: ZERO (describes past event, doesn't answer the question)
Similarity is SYMMETRIC. Relevance is NOT.
A relevant document must be similar, but a similar document
isn't necessarily relevant.
The Query-Document Mismatch Problem
Query-Document Mismatch
Query: "How do I fix the login bug?"
(question format, user language)
Doc: "Authentication failures can be resolved by..."
(statement format, technical language)
Problem: Different phrasing may have lower similarity
even when doc answers the query.
Solutions:
Query Expansion: Add synonyms and related terms
HyDE: Generate hypothetical answer, embed that instead
Query Rewriting: Transform user query to match document style
Fine-tuned Retrievers: Train on your domain’s (query, doc) pairs
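Of these, HyDE is the easiest to sketch: draft a hypothetical answer, embed the draft, and search with that vector instead of the raw question. In the sketch below, draft_hypothetical_answer is a hypothetical stand-in returning a canned string; in practice it would be an LLM call.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def draft_hypothetical_answer(question: str) -> str:
    """Stand-in for an LLM call that drafts a plausible (possibly wrong) answer.
    The content doesn't need to be correct, only phrased like the documents."""
    return "Authentication failures can be resolved by resetting the user's credentials."

def hyde_query_vector(question: str):
    """HyDE: embed the drafted answer so the query vector sits closer
    to answer-style documents than the raw question would."""
    return model.encode(draft_hypothetical_answer(question), normalize_embeddings=True)

# The resulting vector is searched against the index exactly like a normal query vector
query_vector = hyde_query_vector("How do I fix the login bug?")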
Chunking Mistakes
Chunking Pitfalls
The correct answer might be split across two chunks.
Neither chunk alone answers the query.
Both chunks score medium similarity.
Retrieval "succeeds" (returns chunks).
RAG fails (no chunk contains the answer).
Chunking is a system design decision, not preprocessing trivia.
Code Example
Basic semantic search with sentence-transformers:
import numpy as np
from sentence_transformers import SentenceTransformer

# Initialize embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample knowledge base
documents = [
    "To reset your password, go to Settings > Security > Reset Password.",
    "Our API rate limit is 100 requests per minute for free tier.",
    "Contact support@example.com for billing questions.",
    "The application requires Python 3.9 or higher.",
    "Two-factor authentication can be enabled in Settings > Security.",
]

# Index: Embed all documents (normalize so dot product = cosine similarity)
doc_embeddings = model.encode(documents, normalize_embeddings=True)

def search(query: str, top_k: int = 3) -> list[tuple[str, float]]:
    """Semantic search: find most similar documents to query."""
    # Embed query with the same model and the same normalization
    query_embedding = model.encode(query, normalize_embeddings=True)

    # Compute cosine similarities
    # (embeddings are unit-normalized, so dot product = cosine similarity)
    similarities = np.dot(doc_embeddings, query_embedding)

    # Get top-k indices, highest similarity first
    top_indices = np.argsort(similarities)[::-1][:top_k]

    # Return documents with scores
    results = []
    for idx in top_indices:
        results.append((documents[idx], float(similarities[idx])))
    return results

# Test queries
queries = [
    "How do I change my password?",
    "What are the API limits?",
    "I need help with my bill",
]

for query in queries:
    print(f"\nQuery: {query}")
    results = search(query, top_k=2)
    for doc, score in results:
        print(f"  [{score:.3f}] {doc[:50]}...")
Key Takeaways
1. Pure LLM generation has limits
- Knowledge cutoff, hallucinations, no private data access
2. Retrieval grounds generation in facts
- Find relevant docs, then generate with context
3. Dense retrieval uses semantic similarity
- Embed query and docs, find closest vectors
4. Sparse retrieval (BM25) uses keyword matching
- Better for exact terms, combine with dense = hybrid
5. Chunking matters
- Documents → smaller pieces for granular retrieval
- Overlap preserves context at boundaries
6. Multi-stage retrieval improves accuracy
- Fast bi-encoder for recall, slow cross-encoder for precision
7. Similarity ≠ Relevance
- High similarity score doesn't mean the doc answers the query
Verify Your Understanding
Before proceeding, you should be able to:
Explain why a similarity score of 0.95 can still be useless, using a concrete example.
Given this scenario, identify the problem:
Query: “How do I reset my password?”
Doc1: “Password reset failed for user X at 3:42pm” (similarity = 0.91)