
Retrieval to RAG - The Complete Pipeline

Summary

Deep dive into RAG: prompt construction, reranking, failure modes, the debugging decision tree, and how to diagnose when things go wrong

RAG debugging tree: every wrong answer is one of three things

Most teams try random fixes when RAG breaks. This tree saves those hours, and the single diagnostic question at its root saves most of them.

Building On Previous Knowledge

The previous chapter ended with a Takeaway: Recall@K is the load-bearing metric for retrieval — but Recall@K only tells you the chunk was returned, not that the LLM used it correctly. Retrieval gave us ingredients; now we have to cook them properly, and the cooking has its own failure modes.

The LLM might ignore the retrieved context. It might hallucinate despite having good context. It might use the context but synthesise incorrectly. Two systems, two independent ways to fail.

Where most RAG tutorials stop: they show you “Retrieve + Augment + Generate”, hand you a LangChain code snippet, and ship. They never tell you what to do when the output is wrong. Teams respond by guessing at prompt edits — when the actual bug is in retrieval.

This chapter delivers the differentiator that public coverage misses: the debugging decision tree. One diagnostic question at the root partitions every RAG failure into one of three branches, each with a different fix list. The original RAG paper [lewis2020] introduced the architecture; the RAGAS evaluation framework [ragas] gave us metrics to score the branches separately. The tree is what ties them together in production.

Takeaway: RAG is a two-component system (retrieval + generation), and the load-bearing engineering skill is being able to diagnose which component broke — random prompt-tuning is the most expensive bug in the field.

What Goes Wrong Without This:

RAG Failure Patterns
Symptom: "RAG doesn't work" (your team gives up on the approach).
Cause:   No debugging methodology. When the output is wrong, random changes
         are made. Nobody identifies whether the problem is retrieval
         or generation.

Symptom: RAG works perfectly for demo queries, fails for real user queries.
Cause:   Demo queries were crafted to match document phrasing.
         Real user queries are messy and use different vocabulary.

Symptom: The LLM confidently produces an answer that contradicts
         the retrieved documents.
Cause:   Weak grounding instruction in the prompt. The LLM's prior knowledge
         is more "confident" than the provided context.

The Complete RAG Pipeline

RAG = Retrieval-Augmented Generation. The term comes from Lewis et al. 2020 [lewis2020], which introduced the architecture for knowledge-intensive NLP tasks. The original paper paired a Dense Passage Retriever (DPR) [karpukhin2020] with a BART seq2seq generator over a dense Wikipedia index. The production pattern shipped today is the same shape with different parts swapped in — any embedding model, any vector store, any chat LLM.

RAG Pipeline
User: "What's the refund policy for premium plans?"


  1. RETRIEVE                                        
                                                     
  Query → Embed → Search vector DB → Top-K docs
                                                     
  Retrieved:                                         
  • "Premium plans have a 30-day refund window..."   
  • "To request a refund, contact support..."        
  • "Refunds are processed within 5 business days"   

                       
                       

  2. AUGMENT                                         
                                                     
  Construct prompt with retrieved context:           
                                                     
  "Based on the following information:               
   [Retrieved docs]                                  
                                                     
   Answer the user's question:                       
   [User query]"                                     

                       
                       

  3. GENERATE                                        
                                                     
  LLM produces grounded answer using context         
                                                     
  "Premium plans have a 30-day refund policy.        
   To request a refund, contact support@... and      
   expect processing within 5 business days."        

Takeaway: RAG is two systems composed — a retriever and a generator. Each is independently testable, independently breakable, and independently fixable. End-to-end thinking is what makes RAG bugs feel mysterious.


Prompt Construction

How you present retrieved context to the LLM matters:

Basic RAG Prompt Template


                                                                  
  You are a helpful assistant. Answer questions based only        
  on the provided context. If the context doesn't contain         
  enough information, say "I don't have enough information."      
                                                                  
  Context:                                                        
  ---                                                             
  {retrieved_document_1}                                          
  ---                                                             
  {retrieved_document_2}                                          
  ---                                                             
  {retrieved_document_3}                                          
                                                                  
  Question: {user_query}                                          
                                                                  
  Answer:                                                         
                                                                  

Key elements:

Prompt Key Elements
1. GROUNDING INSTRUCTION
 "Answer based only on the provided context"
  Reduces hallucination, keeps model on topic

2. FALLBACK INSTRUCTION
 "If context doesn't contain enough information, say so"
  Prevents confident wrong answers

3. CLEAR SEPARATION
 Use delimiters (---, XML tags) between chunks
  Model can distinguish sources

4. SOURCE ATTRIBUTION (optional)
 Include metadata: "From: billing_policy.md, Section 3"
  Enables citations in response
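
The four elements above can be assembled mechanically. The following is a minimal sketch, not a canonical template: the `Chunk` dataclass, the `build_prompt` name, and the XML-style `<chunk source="...">` delimiter are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # hypothetical metadata, e.g. "billing_policy.md, Section 3"
    text: str


FALLBACK = "I don't have enough information."


def build_prompt(query: str, chunks: list[Chunk]) -> str:
    """Assemble a RAG prompt containing the four key elements."""
    # Elements 3 and 4: one delimited block per chunk, each labelled with its source.
    context = "\n".join(
        f'<chunk source="{c.source}">\n{c.text}\n</chunk>' for c in chunks
    )
    return (
        # Element 1: grounding instruction.
        "Answer the question based only on the provided context.\n"
        # Element 2: fallback instruction.
        f"If the context doesn't contain enough information, say: {FALLBACK}\n"
        "Name the source of every chunk you rely on.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )


prompt = build_prompt(
    "What's the refund window for premium plans?",
    [Chunk("billing_policy.md, Section 3", "Premium plans have a 30-day refund window.")],
)
print(prompt)
```

Keeping the fallback sentence in a constant makes it easy to detect later whether the model actually used it.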

Takeaway: prompt construction is grounding instruction + fallback instruction + clear separators + optional attribution. These four levers control generation-side hallucination; they cannot compensate for a chunk that retrieval never returned.


Reranking: Quality Over Quantity

Initial retrieval is fast but imprecise. Reranking improves quality:

Reranking Pipeline
Without reranking:

Query → Retrieve top-20 → Use top-5 in prompt

Problem: Top-5 by embedding similarity may not be
the most relevant for answering the question.

With reranking:

Query → Retrieve top-20 → Rerank → Use top-5 in prompt

Reranker: Cross-encoder that scores (query, doc) pairs
More accurate than bi-encoder similarity, but slower
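
The second stage of that pipeline can be sketched as a function that rescores each (query, doc) pair and keeps the best few. This is an illustration under stated assumptions: `rerank` and `token_overlap_score` are hypothetical names, and the token-overlap scorer is only a toy stand-in so the sketch runs offline. In a real system `score_fn` would wrap a cross-encoder model.

```python
from typing import Callable


def token_overlap_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query tokens present in the doc."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float] = token_overlap_score,
    top_n: int = 5,
) -> list[str]:
    """Score every (query, doc) pair jointly and keep the top_n best documents."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    # sort() is stable, so ties keep their first-stage retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


candidates = [
    "Shipping takes 3 days",
    "Premium plans have a 30-day refund window",
    "Contact support for a refund",
]
print(rerank("premium refund window", candidates, top_n=2))
```

The cost profile is the point of the two-stage design: the scorer runs once per candidate, so it is only affordable on the short list the first stage returns.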

How Many Documents?

More context isn’t always better:

Trade-offs in K (number of retrieved docs):

Small K (1-3):
 + Less noise, focused context
 + Lower cost (fewer tokens)
 - May miss relevant information
 - Low recall

Large K (10-20):
 + Higher recall, more coverage
 + Redundancy can help
 - More noise, irrelevant content
 - Higher cost, possible "lost in the middle"

“Lost in the middle” problem: LLMs attend more to the beginning and end of the context than to the middle [liu2023], so information placed mid-context may be ignored.
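
One common mitigation is to reorder the chunks so the strongest ones sit at the edges of the context. A minimal sketch, assuming the input list is already sorted best-first; `reorder_for_edges` is a hypothetical helper name, and whether this helps for a given model should be measured rather than assumed.

```python
def reorder_for_edges(ranked_docs: list[str]) -> list[str]:
    """Put the best-ranked docs at the start and end, the weakest in the middle.

    Assumes `ranked_docs` is sorted best-first.
    """
    front = ranked_docs[0::2]       # ranks 1, 3, 5, ... fill in from the start
    back = ranked_docs[1::2][::-1]  # ranks 2, 4, ... fill in from the end
    return front + back


print(reorder_for_edges(["d1", "d2", "d3", "d4", "d5"]))
# → ['d1', 'd3', 'd5', 'd4', 'd2']
```

The two best chunks (`d1`, `d2`) land first and last, while the weakest (`d5`) lands in the middle, where a missed chunk costs the least.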

Practical guidance:

Choosing K Values
Factoid questions:       K = 3-5 (need specific answer)
Complex questions:       K = 5-10 (need multiple aspects)
Research/synthesis:      K = 10-20 (need comprehensive coverage)

After reranking: Use top 3-5 from reranked results

Takeaway: reranking turns “retrieve top-20, send top-5” into a precision pass: a cheap bi-encoder for recall, an expensive cross-encoder for the final ranking. The “more docs = better” intuition is wrong; every irrelevant chunk in the prompt is a distractor the LLM may act on.


RAG Failure Modes

When RAG goes wrong:

RAG Failure Modes
1. RETRIEVAL FAILURE
 Relevant document exists but wasn't retrieved

 Causes:
 • Query-document vocabulary mismatch
 • Poor chunking (answer split across chunks)
 • Embedding model doesn't capture domain semantics
 • K too small

 Diagnosis: Check if relevant doc is in top-100

2. CONTEXT IGNORED
 Relevant doc retrieved but LLM didn't use it

 Causes:
 • Lost in the middle (long context)
 • LLM's prior knowledge conflicts with context
 • Poor prompt construction
 • Answer requires synthesis across multiple chunks

 Diagnosis: Is the answer literally in the context?

3. HALLUCINATION DESPITE CONTEXT
 LLM generates plausible but incorrect information

 Causes:
 • Weak grounding instruction
 • Context partially relevant, LLM fills gaps
 • Model confident in prior knowledge

 Diagnosis: Does response contain info not in context?

4. MISSING INFORMATION
 Information doesn't exist in knowledge base

 Correct behavior: LLM should say "I don't know"
 Failure: LLM makes up answer anyway

 Solution: Strong fallback instruction in prompt

Takeaway: four named failure modes — retrieval failure, context-ignored, hallucination-despite-context, knowledge-gap. Each has its own root cause and its own fix list. Calling them all “the RAG isn’t working” is the misdiagnosis trap the next section solves.


The RAG Debugging Decision Tree

One diagnostic question partitions every wrong RAG answer into one of three branches. The question is: “Is the correct answer in the retrieved documents?” Teams that skip it spend hours tuning the prompt when the bug is in retrieval — or rebuilding the index when the bug is in generation. The hero diagram at the top of this chapter shows the full tree; what follows is the ASCII form for at-the-keyboard use.

RAG Debugging Decision Tree
Output is wrong
     │
     ▼
Is the correct answer in the retrieved documents?
     │
     ├── YES → GENERATION PROBLEM
     │          • Check prompt construction
     │          • Check grounding instruction strength
     │          • Check for "lost in the middle" (reorder context)
     │          • Check if model's prior conflicts with context
     │
     └── NO → Does the correct document exist in corpus?
                 │
                 ├── YES → RETRIEVAL PROBLEM
                 │          • Check query-document vocabulary mismatch
                 │          • Check chunking (answer split across chunks?)
                 │          • Check embedding model domain fit
                 │          • Check K (too small?)
                 │
                 └── NO → KNOWLEDGE GAP
                            • Add missing information to corpus
                            • Or implement "I don't know" fallback

Commit this decision tree to memory. It will save you hours of random debugging.

How to actually run the root question in practice:

  1. Save the user query, the retrieved chunks, and the model’s answer to a log.
  2. Grep the retrieved chunks for the ground-truth answer span (you have it for at least your eval set).
  3. If the span is in the chunks → generation problem: check the prompt, the context ordering, and the fallback instruction. A RAGAS faithfulness score below your threshold points the same way.
  4. If the span is not in the chunks but exists in the corpus → retrieval problem: check chunking, the embedding model, hybrid (keyword + vector) search, or reranking.
  5. If the span doesn’t exist in the corpus at all → knowledge gap: add to corpus, or ship the “I don’t know” fallback the prompt promised.
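
The five steps above can be sketched as a small triage function. This is an illustration, not a complete diagnostic: `diagnose` and the branch labels are hypothetical names, and the match is a whitespace- and case-insensitive substring test, so a paraphrased answer would need fuzzy matching or an LLM judge instead.

```python
import re


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def diagnose(answer_span: str, retrieved_chunks: list[str], corpus: list[str]) -> str:
    """Return which branch of the decision tree a wrong answer falls on."""
    span = _normalize(answer_span)
    if any(span in _normalize(chunk) for chunk in retrieved_chunks):
        return "generation"     # answer was retrieved; the LLM misused it
    if any(span in _normalize(doc) for doc in corpus):
        return "retrieval"      # answer exists but never reached the prompt
    return "knowledge_gap"      # answer is not in the corpus at all


corpus = [
    "Premium plans have a 30-day refund policy.",
    "Refunds are processed within 5 business days.",
]
print(diagnose("30-day refund", retrieved_chunks=corpus[:1], corpus=corpus))
print(diagnose("5 business days", retrieved_chunks=corpus[:1], corpus=corpus))
print(diagnose("90-day warranty", retrieved_chunks=corpus[:1], corpus=corpus))
```

Run over an eval set, the counts per branch show where to spend engineering time before anyone edits a prompt.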

Takeaway: one diagnostic question — “is the correct answer in the retrieved docs?” — splits every RAG failure into generation / retrieval / knowledge-gap, with a different fix list for each. Most teams skip the question. Don’t skip the question.


Evaluation

RAG has two components to evaluate:

RAG Evaluation Framework

  RETRIEVAL EVALUATION                                            

                                                                  
  Recall@K: Are relevant docs in top-K?                           
  Precision@K: Are top-K docs relevant?                           
  MRR: How high is first relevant doc?                            
                                                                  
  Requires: Ground truth (query → relevant doc mappings)
  Can be automated with labeled dataset                           
                                                                  

  GENERATION EVALUATION                                           

                                                                  
  Faithfulness: Is answer supported by retrieved context?         
  Relevance: Does answer address the question?                    
  Completeness: Does answer cover all aspects?                    
  Correctness: Is the answer factually correct?                   
                                                                  
  Requires: Human evaluation or LLM-as-judge                      
  Harder to automate than retrieval metrics                       
                                                                  

  END-TO-END EVALUATION                                           

                                                                  
  Answer correctness: Given query, is final answer right?         
                                                                  
  Note: End-to-end can mask where failures occur.                 
  If answer is wrong, is it retrieval or generation fault?        
  Evaluate components separately for debugging.                   
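
The three retrieval metrics named in this framework are short enough to compute by hand. A minimal sketch using their standard definitions; the function names and the doc IDs are illustrative, and precision here divides by K even when fewer than K docs were returned, which is one of several conventions.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k slots occupied by relevant docs."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, k=3))     # 0.5
print(reciprocal_rank(retrieved, relevant))      # 0.5
```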
                                                                  

The RAGAS framework [ragas] operationalises this split with four canonical metrics:

  • faithfulness: “how factually consistent a response is with the retrieved context”. Computed as (number of claims supported by context) / (total claims in response). A score below 1.0 means the generator added unsupported claims; flag it as generation-side hallucination.
  • context_precision@K: “the retriever’s ability to rank relevant chunks higher than irrelevant ones”. Defined as the rank-weighted mean of precision@k over the retrieved chunks: the sum over k = 1..K of (P@k · v_k), divided by the number of relevant chunks in the top K, where v_k is 1 if the chunk at rank k is relevant and 0 otherwise. A low score means the top-K is noisy; fix retrieval ranking (rerank, hybrid).
  • context_recall: fraction of the ground-truth answer that is in the retrieved context. A low score means the right chunks didn’t come back; fix retrieval (chunking, embedding model, K).
  • answer_relevancy: how well the answer addresses the question regardless of factual correctness. A low score plus high faithfulness means the model answered a different question; fix the prompt.

Read the four scores together and the debugging tree branch is usually obvious before you grep a single chunk.
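
The faithfulness and context-precision formulas reduce to simple arithmetic once the per-claim and per-chunk verdicts exist. The sketch below does not call the RAGAS library; it assumes the verdicts (which RAGAS obtains from an LLM) have already been produced and only illustrates the two ratios. Function names are hypothetical.

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """(claims supported by context) / (total claims in the response)."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_precision_at_k(relevance: list[int]) -> float:
    """Rank-weighted mean of precision@k.

    relevance[k-1] is v_k: 1 if the chunk at rank k is relevant, else 0.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    weighted = 0.0
    for k, v_k in enumerate(relevance, start=1):
        hits += v_k
        weighted += (hits / k) * v_k  # P@k only counts at ranks holding a relevant chunk
    return weighted / total_relevant


print(faithfulness([True, True, False, True]))   # 0.75
print(context_precision_at_k([1, 1, 0]))         # 1.0
print(context_precision_at_k([0, 0, 1]))         # one relevant chunk, buried at rank 3
```

The last two calls retrieve the same number of relevant chunks per relevant total, yet the score drops when the relevant chunk is ranked last, which is exactly the ranking sensitivity the metric is meant to capture.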

Takeaway: end-to-end “is the answer right?” hides which component broke. RAGAS’s faithfulness / context-precision / context-recall split is the production-grade way to read the decision tree at scale.


Advanced Patterns

Beyond basic RAG:

Advanced RAG Patterns

  QUERY TRANSFORMATION                                            

  Query expansion: Add synonyms, rephrase                         
  Query decomposition: Break complex query into sub-queries       
  HyDE: Generate hypothetical answer, embed that                  
                                                                  

  ITERATIVE RETRIEVAL                                             

  Multi-hop: First retrieval informs second retrieval             
  "Who is the CEO of the company that acquired Twitter?"          
  Step 1: Retrieve → "X Corp acquired Twitter"
  Step 2: Retrieve → "Elon Musk is CEO of X Corp"
                                                                  

  SELF-REFLECTION                                                 

  Generate → Check if answer uses context → If not, retry
  Generate → Verify answer against sources → Correct if needed
                                                                  

  AGENTIC RAG                                                     

  LLM decides when to retrieve, what to search                    
  Can search multiple sources, combine results                    
  More flexible but harder to control                             
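
Of the query-transformation patterns, HyDE is the easiest to show in a few lines: search with the embedding of a hypothetical answer instead of the embedding of the query. The sketch below is an assumption-laden illustration, with `hyde_retrieve` as a hypothetical name and the LLM and embedding model injected as plain callables. `toy_generate` and `toy_embed` are offline stand-ins, not real models.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hyde_retrieve(
    query: str,
    docs: list[str],
    generate_fn: Callable[[str], str],
    embed_fn: Callable[[str], Sequence[float]],
    top_k: int = 3,
) -> list[str]:
    """Rank docs by similarity to a generated hypothetical answer."""
    hypothetical = generate_fn(query)   # an LLM call in a real system
    h_vec = embed_fn(hypothetical)      # embed the hypothetical answer, not the query
    ranked = sorted(docs, key=lambda d: cosine(h_vec, embed_fn(d)), reverse=True)
    return ranked[:top_k]


# Toy stand-ins so the sketch runs offline (both are hypothetical).
VOCAB = ["refund", "days", "shipping", "premium"]


def toy_embed(text: str) -> list[float]:
    lowered = text.lower()
    return [float(lowered.count(word)) for word in VOCAB]


def toy_generate(query: str) -> str:
    return "Premium plans can get a refund within 30 days."


docs = ["Shipping takes 3 days.", "Premium plans have a 30-day refund window."]
print(hyde_retrieve("Can I get my money back?", docs, toy_generate, toy_embed, top_k=1))
```

The example query shares no vocabulary with either document, which is the mismatch HyDE targets: the hypothetical answer is phrased like a document, so it lands closer to the right chunk than the raw query does. The price is one extra LLM call per query.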

Takeaway: advanced RAG (HyDE, multi-hop, self-reflection, agentic) is the right tool when the basic pipeline plateaus — but it adds latency, cost, and failure modes. Earn the complexity; don’t lead with it.


Common Pitfalls & Misconceptions

Misconception: “RAG is just Retrieve + Generate”
Why it’s wrong: That’s the happy path. Production RAG has four named failure modes (retrieval-miss, context-ignored, hallucination-despite-context, knowledge-gap), each with different causes.
What to do instead: Memorise the four failure modes and the decision-tree root question. Understanding the failures is understanding RAG.

Misconception: “If retrieval is good, generation will be good”
Why it’s wrong: Lost-in-the-Middle is real: LLMs attend more to context-start and context-end positions (Liu 2023 [liu2023], Chroma 2025 [chroma-rot]). Even perfect retrieval gets ignored without grounding instructions.
What to do instead: Tighten the grounding instruction (“answer based only on context”) and reorder context so the answer span lives near the start or end.

Misconception: “More retrieved documents = better answers”
Why it’s wrong: Retrieving 20 docs when 3 are relevant adds 17 distractors. Cost goes up, latency goes up, and Lost-in-the-Middle gets worse.
What to do instead: Start with K=3–5. Only increase K when measured Recall@K is the bottleneck. After reranking, drop to top 3–5 of the reranked list.

Misconception: “We tuned the prompt and the RAG output got better”
Why it’s wrong: A common false positive. The prompt change probably masked one symptom while the underlying retrieval problem still misfires on other queries.
What to do instead: Run the decision tree first. If retrieval is broken, no prompt change is a real fix; it’s a hardcoded patch for that one query class.

Misconception: “LLM-as-judge eval gave us 0.92, so we’re shipping”
Why it’s wrong: LLM-judge scores drift with the judge model version. Same RAG, same answers, different OpenAI release → different score. Production-agents Ch08 covers this in depth [pa-testing].
What to do instead: Pin the judge model version. Run a held-out human-labelled eval set quarterly to calibrate. Don’t trust month-over-month LLM-judge drift as a signal.

Misconception: “The model contradicted the context, so we need a better model”
Why it’s wrong: The model isn’t broken. The prompt is. Without a strong grounding instruction the model’s prior wins ties; with the wrong context order it ignores the middle.
What to do instead: Two prompt fixes: grounding + fallback instructions, and reordering context so critical info is at start/end. Then re-evaluate.

Misconception: “Our RAG worked in dev but breaks in production”
Why it’s wrong: Dev queries are crafted (you wrote them to match docs). Real users use synonyms, abbreviations, typos, multi-language input, and out-of-distribution phrasing.
What to do instead: Add HyDE or query rewriting, or fine-tune the retriever on real (query, doc) pairs from production logs. Evaluate against the real query distribution.

Takeaway: every RAG misconception traces to the same root error — treating RAG as one system instead of two. The decision tree, the four failure modes, and the RAGAS metrics all exist to break that habit.


Code Example

A complete RAG pipeline pinned to current library versions, with the debugging-tree instrumentation baked in. The rag() function returns both the answer and the retrieved chunks so you can run the root diagnostic question programmatically:

# Tested on:
#   openai==1.40.0
#   sentence-transformers==3.0.1
#   numpy==1.26.4
# Python 3.11
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

embedder = SentenceTransformer("all-MiniLM-L6-v2")
llm = OpenAI()

# Knowledge base ---------------------------------------------------------------
documents = [
    {"id": "policy_1", "content": "Premium plans have a 30-day refund policy. Users can request a full refund within 30 days of purchase."},
    {"id": "policy_2", "content": "To request a refund, email support@example.com with your order ID and reason for refund."},
    {"id": "policy_3", "content": "Refunds are processed within 5 business days. The amount will be credited to the original payment method."},
]
doc_index = embedder.encode(
    [d["content"] for d in documents],
    normalize_embeddings=True,
)


def retrieve(query: str, top_k: int = 3) -> list[dict]:
    q = embedder.encode(query, normalize_embeddings=True)
    ranks = np.argsort(doc_index @ q)[::-1][:top_k]
    return [documents[i] for i in ranks]


def generate_with_context(query: str, context_docs: list[dict]) -> str:
    context = "\n---\n".join(f"[Source: {d['id']}]\n{d['content']}" for d in context_docs)

    # Grounding + fallback are both load-bearing — never one without the other.
    prompt = (
        "Answer the question based ONLY on the provided context.\n"
        "If the context doesn't contain enough information, reply exactly: "
        "\"I don't have enough information to answer that.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )
    resp = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content


def rag(query: str) -> dict:
    chunks = retrieve(query, top_k=3)
    answer = generate_with_context(query, chunks)
    return {
        "query": query,
        "retrieved_docs": [d["id"] for d in chunks],
        "retrieved_chunks": [d["content"] for d in chunks],  # for the decision-tree root question
        "answer": answer,
    }


# Debugging-tree usage: when the answer is wrong, grep retrieved_chunks for the ground-truth span.
result = rag("What's the refund policy and how do I get one?")
print("Query:    ", result["query"])
print("Retrieved:", result["retrieved_docs"])
print("Answer:   ", result["answer"])

The retrieved_chunks field is what the decision tree consumes. Save it to your logs. When a user reports a bad answer, the first action is grep against the ground-truth span — not tuning the prompt.
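
Saving that field can be as simple as appending one JSON object per request to a file. A minimal sketch, assuming a local JSONL file is an acceptable log sink; `log_rag_result`, `load_rag_log`, and the `rag_log.jsonl` path are hypothetical names, and a production system would more likely write to its existing logging or tracing stack.

```python
import json
import time
from pathlib import Path


def log_rag_result(result: dict, path: str = "rag_log.jsonl") -> None:
    """Append one RAG result (query, retrieved chunks, answer) as a JSON line."""
    record = {"ts": time.time(), **result}
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_rag_log(path: str = "rag_log.jsonl") -> list[dict]:
    """Read the log back for offline diagnosis."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Each record carries everything the root question needs (the query, the retrieved chunks, and the answer), so a reported bad answer can be diagnosed later without re-running retrieval against an index that may have changed.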


Verify Your Understanding

Before continuing, you should be able to answer these from memory:

  1. State the root question of the RAG debugging decision tree. Then describe the three branches and the fix list you reach for in each. Name the misdiagnosis that skipping the question produces.
  2. Apply the tree to a concrete failure. Wrong answer: “The refund policy is 60 days.” Ground truth: “30 days.” Retrieved chunk #1 says: “Premium plans have a 30-day refund policy.” Which branch are you on? What’s the first fix?
  3. Reranking vs increasing K. Name one query class where reranking beats raising K, and one query class where raising K beats reranking. If your answer is “always rerank” or “never rerank”, you haven’t reasoned about the trade-off.
  4. Dev-works / prod-breaks. Your RAG passes every dev query, fails on real production queries. Give three concrete hypotheses with named fixes — one each from retrieval, generation, and the dev/prod data-distribution gap.
  5. Map RAGAS metrics onto the tree. Faithfulness = 0.55, context-precision@5 = 0.30, context-recall = 0.85, answer-relevancy = 0.92. Which branch of the decision tree are you on? What’s the most likely root cause?

What’s Next

RAG is one-shot: retrieve once, generate once. The next chapter — RAG → Agents — extends the same data path into multi-step loops where the model decides when to retrieve, what tool to call, and how to stop. The pedagogical bridge to the production-agents series begins there.


References

  • [lewis2020] Lewis, P. et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. arXiv:2005.11401. Source of the RAG architecture name; pairs a Dense Passage Retriever with a BART generator over a dense Wikipedia index. Two formulations: RAG-Sequence (one retrieval, full generation) and RAG-Token (different passages per token). Cited in §§ Building On Previous Knowledge, The Complete RAG Pipeline.
  • [ragas] Es, S. et al. RAGAS: Automated Evaluation of Retrieval-Augmented Generation. EACL 2024 + ongoing OSS framework. docs.ragas.io. Source of the four canonical metrics (faithfulness, context_precision, context_recall, answer_relevancy) used to instrument the debugging decision tree. Cited in §§ Building On Previous Knowledge, Evaluation, Common Pitfalls & Misconceptions.
  • [liu2023] Liu, N. F. et al. Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172. Original U-shape attention-vs-position finding — context-start and context-end positions outperform context-middle. Cited in § Common Pitfalls & Misconceptions.
  • [chroma-rot] Hong, K., Troynikov, A., Huber, J. Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma Technical Report, 2025-07-14. trychroma.com/research/context-rot. Tested 18 LLMs across Anthropic, OpenAI, Google, Alibaba; “performance grows increasingly unreliable as input length grows.” Cited in § Common Pitfalls & Misconceptions.
  • [karpukhin2020] Karpukhin, V. et al. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020. arXiv:2004.04906. The retriever component of the original RAG paper; the practical baseline for any modern dense-retrieval RAG system. Cited in § The Complete RAG Pipeline.
  • [pa-testing] Production Agents — Part 8: Testing & Evaluation. Operator-grade companion to RAG evaluation, including the LLM-judge drift trap referenced in this chapter’s pitfalls table. Cross-series bridge.
  • [hyde] Gao, L. et al. Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE). ACL 2023. arXiv:2212.10496. Generate a hypothetical answer, embed that — fixes query-document phrasing mismatch without training. Cited in § The RAG Debugging Decision Tree and § Advanced Patterns.