Most teams try random fixes when RAG breaks. This decision tree saves those hours, and the one diagnostic question at its root saves most of them.
Building On Previous Knowledge
The previous chapter ended with a Takeaway: Recall@K is the load-bearing metric for retrieval — but Recall@K only tells you the chunk was returned, not that the LLM used it correctly. Retrieval gave us ingredients; now we have to cook them properly, and the cooking has its own failure modes.
The LLM might ignore the retrieved context. It might hallucinate despite having good context. It might use the context but synthesise incorrectly. Two systems, two independent ways to fail.
Where most RAG tutorials stop: they show you “Retrieve + Augment + Generate”, hand you a LangChain code snippet, and ship. They never tell you what to do when the output is wrong. Teams respond by guessing at prompt edits — when the actual bug is in retrieval.
This chapter delivers the differentiator that public coverage misses: the debugging decision tree. One diagnostic question at the root partitions every RAG failure into one of three branches, each with a different fix list. The original RAG paper [lewis2020] introduced the architecture; the RAGAS evaluation framework [ragas] gave us metrics to score the branches separately. The tree is what ties them together in production.
Takeaway: RAG is a two-component system (retrieval + generation), and the load-bearing engineering skill is being able to diagnose which component broke — random prompt-tuning is the most expensive bug in the field.
What Goes Wrong Without This:
- Symptom: "RAG doesn't work" (your team gives up on the approach). Cause: No debugging methodology. When output is wrong, random changes are made. Nobody identified whether the problem is retrieval or generation.
- Symptom: RAG works perfectly for demo queries, fails for real user queries. Cause: Demo queries were crafted to match document phrasing. Real user queries are messy and use different vocabulary.
- Symptom: The LLM confidently produces an answer that contradicts the retrieved documents. Cause: Weak grounding instruction in prompt. The LLM's prior knowledge is more "confident" than the provided context.
The Complete RAG Pipeline
RAG = Retrieval-Augmented Generation. The term comes from Lewis et al. 2020 [lewis2020], which introduced the architecture for knowledge-intensive NLP tasks. The original paper paired a Dense Passage Retriever (DPR) [karpukhin2020] with a BART seq2seq generator over a dense Wikipedia index. The production pattern shipped today is the same shape with different parts swapped in — any embedding model, any vector store, any chat LLM.
User: "What's the refund policy for premium plans?"

┌─────────────────────────────────────────────────────┐
│ 1. RETRIEVE                                         │
│                                                     │
│ Query → Embed → Search vector DB → Top-K docs       │
│                                                     │
│ Retrieved:                                          │
│ • "Premium plans have a 30-day refund window..."    │
│ • "To request a refund, contact support..."         │
│ • "Refunds are processed within 5 business days"    │
└─────────────────────────────────────────────────────┘
                          │
                          ↓
┌─────────────────────────────────────────────────────┐
│ 2. AUGMENT                                          │
│                                                     │
│ Construct prompt with retrieved context:            │
│                                                     │
│ "Based on the following information:                │
│  [Retrieved docs]                                   │
│                                                     │
│  Answer the user's question:                        │
│  [User query]"                                      │
└─────────────────────────────────────────────────────┘
                          │
                          ↓
┌─────────────────────────────────────────────────────┐
│ 3. GENERATE                                         │
│                                                     │
│ LLM produces grounded answer using context          │
│                                                     │
│ "Premium plans have a 30-day refund policy.         │
│  To request a refund, contact support@... and       │
│  expect processing within 5 business days."         │
└─────────────────────────────────────────────────────┘
Takeaway: RAG is two systems composed — a retriever and a generator. Each is independently testable, independently breakable, and independently fixable. End-to-end thinking is what makes RAG bugs feel mysterious.
Prompt Construction
How you present retrieved context to the LLM matters:
┌──────────────────────────────────────────────────────────────────┐
│ Basic RAG prompt template                                        │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│ You are a helpful assistant. Answer questions based only         │
│ on the provided context. If the context doesn't contain          │
│ enough information, say "I don't have enough information."       │
│                                                                  │
│ Context:                                                         │
│ ---                                                              │
│ {retrieved_document_1}                                           │
│ ---                                                              │
│ {retrieved_document_2}                                           │
│ ---                                                              │
│ {retrieved_document_3}                                           │
│                                                                  │
│ Question: {user_query}                                           │
│                                                                  │
│ Answer:                                                          │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
Key elements:
1. GROUNDING INSTRUCTION
   "Answer based only on the provided context"
   → Reduces hallucination, keeps model on topic
2. FALLBACK INSTRUCTION
   "If context doesn't contain enough information, say so"
   → Prevents confident wrong answers
3. CLEAR SEPARATION
   Use delimiters (---, XML tags) between chunks
   → Model can distinguish sources
4. SOURCE ATTRIBUTION (optional)
   Include metadata: "From: billing_policy.md, Section 3"
   → Enables citations in response
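The four elements above can be combined in a single prompt builder. The following is a minimal sketch, not a canonical template: the function name `build_prompt`, the `<doc>` tag layout, and the `source` key on each chunk are assumptions made for illustration.

```python
def build_prompt(query: str, chunks: list[dict]) -> str:
    """Assemble a RAG prompt from retrieved chunks.

    Each chunk is assumed (hypothetically) to look like:
    {"id": "policy_1", "source": "billing_policy.md, Section 3", "content": "..."}
    """
    blocks = []
    for chunk in chunks:
        source = chunk.get("source", "unknown")
        # Clear separation + source attribution: one tagged block per chunk.
        blocks.append(
            f'<doc id="{chunk["id"]}" source="{source}">\n'
            f'{chunk["content"]}\n'
            "</doc>"
        )
    context = "\n".join(blocks)

    return (
        # Grounding instruction.
        "Answer based only on the provided context.\n"
        # Fallback instruction.
        "If the context doesn't contain enough information, "
        'say "I don\'t have enough information."\n'
        # Attribution request, usable because each block carries its source.
        "Cite the source of every document you rely on.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )


example_prompt = build_prompt(
    "What's the refund policy for premium plans?",
    [{"id": "policy_1",
      "source": "billing_policy.md, Section 3",
      "content": "Premium plans have a 30-day refund window."}],
)
print(example_prompt)
```

Tagged blocks are one delimiter choice among several; plain `---` separators work the same way as long as the model can tell where one chunk ends and the next begins.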
Takeaway: prompt construction is grounding-instruction + fallback-instruction + clear separators + optional attribution. These four levers control how faithfully the generator uses whatever retrieval hands it; they cannot repair a retrieval miss.
Reranking: Quality Over Quantity
Initial retrieval is fast but imprecise. Reranking improves quality:
Without reranking:
  Query → Retrieve top-20 → Use top-5 in prompt
  Problem: Top-5 by embedding similarity may not be the most relevant for answering the question.

With reranking:
  Query → Retrieve top-20 → Rerank → Use top-5 in prompt
  Reranker: Cross-encoder that scores (query, doc) pairs
  More accurate than bi-encoder similarity, but slower
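The retrieve-then-rerank flow can be sketched with a pluggable scoring function. This is an illustrative sketch only: `rerank` and `lexical_overlap` are hypothetical names, and the word-overlap scorer is a toy stand-in so the example runs without downloading a model. In a real pipeline the scorer would typically wrap a cross-encoder, for example the `predict` method of sentence-transformers' `CrossEncoder` applied to (query, doc) pairs.

```python
from typing import Callable


def lexical_overlap(query: str, doc: str) -> float:
    """Toy stand-in scorer: fraction of query words that appear in the doc.

    NOT a cross-encoder; it only exists to make the sketch self-contained.
    """
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float] = lexical_overlap,
    top_n: int = 5,
) -> list[str]:
    """Score every (query, doc) pair and keep the top_n highest-scoring docs."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    # Python's sort is stable, so tied docs keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


# First-stage candidates in the order the bi-encoder returned them.
first_stage = [
    "Refunds are processed within 5 business days.",
    "Our premium plans include priority support.",
    "Premium plans have a 30-day refund window.",
]
reranked = rerank("premium refund window", first_stage, top_n=2)
print(reranked)
```

Keeping the scorer as a parameter means the first-stage retriever and the reranker can be swapped and evaluated independently.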
How Many Documents?
More context isn’t always better:
Trade-offs in K (number of retrieved docs):

Small K (1-3):
  ✓ Less noise, focused context
  ✓ Lower cost (fewer tokens)
  ✗ May miss relevant information
  ✗ Low recall

Large K (10-20):
  ✓ Higher recall, more coverage
  ✓ Redundancy can help
  ✗ More noise, irrelevant content
  ✗ Higher cost, possible "lost in the middle"
“Lost in the middle” problem: LLMs attend more to beginning and end of context. Information in the middle may be ignored.
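One common mitigation is to reorder the ranked chunks so the strongest ones sit at the edges of the context and the weakest in the middle. The sketch below assumes the input list is ordered best-first; the function name `reorder_for_edges` is hypothetical.

```python
def reorder_for_edges(ranked_chunks: list[str]) -> list[str]:
    """Place the best chunks at the start and end, the weakest in the middle.

    Input must be ordered best-first (rank 1 at index 0).
    """
    front: list[str] = []
    back: list[str] = []
    for position, chunk in enumerate(ranked_chunks):
        # Alternate: ranks 1, 3, 5, ... fill the front; ranks 2, 4, ... fill the back.
        if position % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so rank 2 ends up in the very last position.
    return front + back[::-1]


ordered = reorder_for_edges(["rank1", "rank2", "rank3", "rank4", "rank5"])
print(ordered)
```

With five chunks, rank 1 lands first, rank 2 lands last, and rank 5 ends up in the middle slot, which is the position the model is least likely to use.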
Practical guidance:
Factoid questions:   K = 3-5    (need specific answer)
Complex questions:   K = 5-10   (need multiple aspects)
Research/synthesis:  K = 10-20  (need comprehensive coverage)
After reranking:     Use top 3-5 from reranked results
Takeaway: reranking turns “retrieve top-20, send top-5” into a precision pass: a cheap bi-encoder for recall, an expensive cross-encoder for the final ranking. The “more docs = better” intuition is wrong; every irrelevant chunk in the prompt is a distractor the generator can latch onto.
RAG Failure Modes
When RAG goes wrong:
1. RETRIEVAL FAILURE
   Relevant document exists but wasn't retrieved
   Causes:
   • Query-document vocabulary mismatch
   • Poor chunking (answer split across chunks)
   • Embedding model doesn't capture domain semantics
   • K too small
   Diagnosis: Check if relevant doc is in top-100

2. CONTEXT IGNORED
   Relevant doc retrieved but LLM didn't use it
   Causes:
   • Lost in the middle (long context)
   • LLM's prior knowledge conflicts with context
   • Poor prompt construction
   • Answer requires synthesis across multiple chunks
   Diagnosis: Is the answer literally in the context?

3. HALLUCINATION DESPITE CONTEXT
   LLM generates plausible but incorrect information
   Causes:
   • Weak grounding instruction
   • Context partially relevant, LLM fills gaps
   • Model confident in prior knowledge
   Diagnosis: Does response contain info not in context?

4. MISSING INFORMATION
   Information doesn't exist in knowledge base
   Correct behavior: LLM should say "I don't know"
   Failure: LLM makes up answer anyway
   Solution: Strong fallback instruction in prompt
Takeaway: four named failure modes — retrieval failure, context-ignored, hallucination-despite-context, knowledge-gap. Each has its own root cause and its own fix list. Calling them all “the RAG isn’t working” is the misdiagnosis trap the next section solves.
The RAG Debugging Decision Tree
One diagnostic question partitions every wrong RAG answer into one of three branches. The question is: “Is the correct answer in the retrieved documents?” Teams that skip it spend hours tuning the prompt when the bug is in retrieval — or rebuilding the index when the bug is in generation. The hero diagram at the top of this chapter shows the full tree; what follows is the ASCII form for at-the-keyboard use.
Output is wrong
      │
      ▼
Is the correct answer in the retrieved documents?
      │
      ├─── YES ──▶ GENERATION PROBLEM
      │            │
      │            ├─ Check prompt construction
      │            ├─ Check grounding instruction strength
      │            ├─ Check for "lost in the middle" (reorder context)
      │            └─ Check if model's prior conflicts with context
      │
      └─── NO ───▶ Does the correct document exist in corpus?
                   │
                   ├─── YES ──▶ RETRIEVAL PROBLEM
                   │            │
                   │            ├─ Check query-document vocabulary mismatch
                   │            ├─ Check chunking (answer split across chunks?)
                   │            ├─ Check embedding model domain fit
                   │            └─ Check K (too small?)
                   │
                   └─── NO ───▶ KNOWLEDGE GAP
                                │
                                ├─ Add missing information to corpus
                                └─ Or implement "I don't know" fallback
Commit this decision tree to memory. It will save you hours of random debugging.
How to actually run the root question in practice:
- Save the user query, the retrieved chunks, and the model’s answer to a log.
- Grep the retrieved chunks for the ground-truth answer span (you have it for at least your eval set).
- If the span is in the chunks → generation problem: prompt, ordering, fallback instruction, or RAGAS-faithfulness score below threshold.
- If the span is not in the chunks but exists in the corpus → retrieval problem: chunking, embedding model, hybrid path, or rerank.
- If the span doesn’t exist in the corpus at all → knowledge gap: add to corpus, or ship the “I don’t know” fallback the prompt promised.
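The steps above can be sketched as a small triage function. It assumes you have a ground-truth answer span for the query and uses a case-insensitive substring match as a stand-in for "grep"; the function name `diagnose` and the branch labels are illustrative. Paraphrased answers will not match literally, so treat a "not found" result as a prompt to look by hand rather than as proof.

```python
def diagnose(answer_span: str, retrieved_chunks: list[str], corpus: list[str]) -> str:
    """Return which branch of the decision tree a wrong answer falls into."""
    needle = answer_span.lower()

    def contains_span(texts: list[str]) -> bool:
        # Literal, case-insensitive match; a fuzzy matcher could be swapped in.
        return any(needle in text.lower() for text in texts)

    if contains_span(retrieved_chunks):
        return "generation"      # the answer was in the prompt; the LLM misused it
    if contains_span(corpus):
        return "retrieval"       # the answer exists but never reached the prompt
    return "knowledge_gap"       # the answer is not in the corpus at all


corpus = [
    "Premium plans have a 30-day refund policy.",
    "Refunds are processed within 5 business days.",
    "Support is available on weekdays.",
]

# The chunk with the answer was retrieved, yet the model said "60 days".
branch_a = diagnose("30-day refund", [corpus[0], corpus[2]], corpus)
# The chunk with the answer exists in the corpus but was not retrieved.
branch_b = diagnose("30-day refund", [corpus[2]], corpus)
# Nothing in the corpus covers the question.
branch_c = diagnose("student discount", [corpus[2]], corpus)
print(branch_a, branch_b, branch_c)
```

Running this over a logged eval set gives a per-branch failure count, which tells you whether the next week of work belongs in the prompt, the retriever, or the corpus.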
Takeaway: one diagnostic question — “is the correct answer in the retrieved docs?” — splits every RAG failure into generation / retrieval / knowledge-gap, with a different fix list for each. Most teams skip the question. Don’t skip the question.
Evaluation
RAG has two components to evaluate:
┌──────────────────────────────────────────────────────────────────┐
│ RETRIEVAL EVALUATION                                             │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│ Recall@K:     Are relevant docs in top-K?                        │
│ Precision@K:  Are top-K docs relevant?                           │
│ MRR:          How high is first relevant doc?                    │
│                                                                  │
│ Requires: Ground truth (query → relevant doc mappings)           │
│ Can be automated with labeled dataset                            │
│                                                                  │
├──────────────────────────────────────────────────────────────────┤
│ GENERATION EVALUATION                                            │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│ Faithfulness: Is answer supported by retrieved context?          │
│ Relevance:    Does answer address the question?                  │
│ Completeness: Does answer cover all aspects?                     │
│ Correctness:  Is the answer factually correct?                   │
│                                                                  │
│ Requires: Human evaluation or LLM-as-judge                       │
│ Harder to automate than retrieval metrics                        │
│                                                                  │
├──────────────────────────────────────────────────────────────────┤
│ END-TO-END EVALUATION                                            │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│ Answer correctness: Given query, is final answer right?          │
│                                                                  │
│ Note: End-to-end can mask where failures occur.                  │
│ If answer is wrong, is it retrieval or generation fault?         │
│ Evaluate components separately for debugging.                    │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
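The three retrieval metrics in the box are simple enough to compute directly from a labeled dataset. The sketch below assumes retrieved results are lists of document IDs in rank order and that ground truth is a set of relevant IDs per query; the function names are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k slots occupied by relevant docs."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print("Recall@3:   ", recall_at_k(retrieved, relevant, 3))
print("Precision@3:", precision_at_k(retrieved, relevant, 3))
print("RR:         ", reciprocal_rank(retrieved, relevant))
```

In this example only one of the two relevant documents lands in the top 3, and it sits at rank 2, so Recall@3 and the reciprocal rank are both 0.5 while Precision@3 is one third.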
The RAGAS framework [ragas] operationalises this split with four canonical metrics:
- faithfulness: “how factually consistent a response is with the retrieved context”. Computed as (number of claims supported by context) / (total claims in response). Score < 1.0 means the generator added unsupported claims; flag as generation-side hallucination.
- context_precision@K: “the retriever’s ability to rank relevant chunks higher than irrelevant ones”. Defined as the rank-weighted mean of precision@k over the retrieved chunks: Σ_k (P@k · v_k) / (number of relevant chunks in the top K), where v_k is 1 if the chunk at rank k is relevant and 0 otherwise. Low score means the top-K is noisy; fix retrieval ranking (rerank, hybrid).
- context_recall: fraction of the ground-truth answer that is in the retrieved context. Low score means the right chunks didn’t come back; fix retrieval (chunking, embedding model, K).
- answer_relevancy: how well the answer addresses the question regardless of factual correctness. Low score plus high faithfulness means the model answered a different question; fix the prompt.
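To make the two formulas concrete, here is the arithmetic on its own. This is a sketch of the formulas as stated above, not the RAGAS library API: the function names are hypothetical, and the per-claim and per-chunk verdicts are taken as given inputs (in practice they come from a judge, human or LLM).

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """(claims supported by context) / (total claims in the response)."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_precision_at_k(relevance: list[int]) -> float:
    """Rank-weighted precision over the retrieved chunks.

    relevance[i] is v_k for rank k = i + 1: 1 if that chunk is relevant, else 0.
    Computes sum_k(P@k * v_k) / (number of relevant chunks in the list).
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    hits_so_far = 0
    weighted_sum = 0.0
    for k, v_k in enumerate(relevance, start=1):
        hits_so_far += v_k
        precision_at_k = hits_so_far / k
        weighted_sum += precision_at_k * v_k
    return weighted_sum / total_relevant


# A response with 4 claims, 3 of them supported by the retrieved context.
f_score = faithfulness([True, True, False, True])

# Same two relevant chunks, ranked well vs. ranked poorly.
good_ranking = context_precision_at_k([1, 0, 1])   # (1/1 + 2/3) / 2
poor_ranking = context_precision_at_k([0, 1, 1])   # (1/2 + 2/3) / 2
print(f_score, round(good_ranking, 4), round(poor_ranking, 4))
```

The two rankings contain the same chunks, yet the score drops when the first relevant chunk slips from rank 1 to rank 2. That sensitivity to order is what separates context precision from plain Precision@K.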
Read the four scores together and the debugging tree branch is usually obvious before you grep a single chunk.
Takeaway: end-to-end “is the answer right?” hides which component broke. RAGAS’s faithfulness / context-precision / context-recall split is the production-grade way to read the decision tree at scale.
Advanced Patterns
Beyond basic RAG:
┌──────────────────────────────────────────────────────────────────┐
│ QUERY TRANSFORMATION                                             │
├──────────────────────────────────────────────────────────────────┤
│ Query expansion:     Add synonyms, rephrase                      │
│ Query decomposition: Break complex query into sub-queries        │
│ HyDE:                Generate hypothetical answer, embed that    │
│                                                                  │
├──────────────────────────────────────────────────────────────────┤
│ ITERATIVE RETRIEVAL                                              │
├──────────────────────────────────────────────────────────────────┤
│ Multi-hop: First retrieval informs second retrieval              │
│ "Who is the CEO of the company that acquired Twitter?"           │
│ Step 1: Retrieve → "X Corp acquired Twitter"                     │
│ Step 2: Retrieve → "Elon Musk is CEO of X Corp"                  │
│                                                                  │
├──────────────────────────────────────────────────────────────────┤
│ SELF-REFLECTION                                                  │
├──────────────────────────────────────────────────────────────────┤
│ Generate → Check if answer uses context → If not, retry          │
│ Generate → Verify answer against sources → Correct if needed     │
│                                                                  │
├──────────────────────────────────────────────────────────────────┤
│ AGENTIC RAG                                                      │
├──────────────────────────────────────────────────────────────────┤
│ LLM decides when to retrieve, what to search                     │
│ Can search multiple sources, combine results                     │
│ More flexible but harder to control                              │
└──────────────────────────────────────────────────────────────────┘
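As one example of these patterns, the self-reflection loop (generate, check grounding, retry) can be sketched with the generator and the grounding check passed in as callables. Everything named here is an assumption for illustration: `generate_with_reflection`, the `(query, chunks, attempt)` generator signature, and the stub functions in the usage example, which stand in for a real LLM call and a real faithfulness check.

```python
from typing import Callable

FALLBACK = "I don't have enough information."


def generate_with_reflection(
    query: str,
    chunks: list[str],
    generate: Callable[[str, list[str], int], str],
    is_grounded: Callable[[str, list[str]], bool],
    max_attempts: int = 3,
) -> str:
    """Generate, verify the answer against the chunks, retry if it fails.

    `generate` receives the attempt number so the caller can tighten the
    prompt on retries. Falls back to an explicit refusal when every
    attempt fails the grounding check.
    """
    for attempt in range(max_attempts):
        answer = generate(query, chunks, attempt)
        if is_grounded(answer, chunks):
            return answer
    return FALLBACK


# --- Stubs standing in for a real LLM and a real grounding check ---
def stub_generate(query: str, chunks: list[str], attempt: int) -> str:
    # Pretend the first attempt hallucinates and the retry is correct.
    return "60 days" if attempt == 0 else "30 days"


def stub_is_grounded(answer: str, chunks: list[str]) -> bool:
    # Literal containment; a production check would judge claims, not strings.
    return any(answer in chunk for chunk in chunks)


final_answer = generate_with_reflection(
    "How long is the refund window?",
    ["Users can request a full refund within 30 days of purchase."],
    stub_generate,
    stub_is_grounded,
)
print(final_answer)
```

Each retry is another LLM call, so `max_attempts` is a direct multiplier on worst-case latency and cost, which is the trade-off this section warns about.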
Takeaway: advanced RAG (HyDE, multi-hop, self-reflection, agentic) is the right tool when the basic pipeline plateaus — but it adds latency, cost, and failure modes. Earn the complexity; don’t lead with it.
Common Pitfalls & Misconceptions
| Misconception | Why it’s wrong | What to do instead |
|---|---|---|
| “RAG is just Retrieve + Generate” | That’s the happy path. Production RAG has four named failure modes (retrieval-miss, context-ignored, hallucination-despite-context, knowledge-gap), each with different causes. | Memorise the four failure modes and the decision-tree root question. Understanding the failures is understanding RAG. |
| “If retrieval is good, generation will be good” | Lost-in-the-Middle is real: LLMs attend more to context-start and context-end positions (Liu 2023 [liu2023], Chroma 2025 [chroma-rot]). Even perfect retrieval gets ignored without grounding instructions. | Tighten the grounding instruction (“answer based only on context”) and reorder context so the answer span lives near the start or end. |
| “More retrieved documents = better answers” | Retrieving 20 docs when 3 are relevant adds 17 distractors. Cost goes up, latency goes up, and Lost-in-the-Middle gets worse. | Start with K=3–5. Only increase K when measured Recall@K is the bottleneck. After reranking, drop to top 3–5 of the reranked list. |
| “We tuned the prompt and the RAG output got better” | A common false-positive. The prompt change probably masked one symptom while the underlying retrieval problem still misfires on other queries. | Run the decision tree first. If retrieval is broken, no prompt change is a real fix; it’s a hardcoded patch for that one query class. |
| “LLM-as-judge eval gave us 0.92, so we’re shipping” | LLM-judge scores drift with the judge model version. Same RAG, same answers, different OpenAI release → different score. Production-agents Ch08 covers this in depth [pa-testing]. | Pin the judge model version. Run a held-out human-labelled eval set quarterly to calibrate. Don’t trust month-over-month LLM-judge drift as a signal. |
| “The model contradicted the context, so we need a better model” | The model isn’t broken. The prompt is. Without a strong grounding instruction the model’s prior wins ties; with the wrong context order it ignores the middle. | Two prompt fixes: grounding + fallback instructions, and reordering context so critical info is at start/end. Then re-evaluate. |
| “Our RAG worked in dev but breaks in production” | Dev queries are crafted (you wrote them to match docs). Real users use synonyms, abbreviations, typos, multi-language input, and out-of-distribution phrasing. | Add HyDE / query rewriting / fine-tune the retriever on real (query, doc) pairs from production logs. Eval against real query distribution. |
Takeaway: every RAG misconception traces to the same root error — treating RAG as one system instead of two. The decision tree, the four failure modes, and the RAGAS metrics all exist to break that habit.
Code Example
A complete RAG pipeline pinned to current library versions, with the debugging-tree instrumentation baked in. The rag() function returns both the answer and the retrieved chunks so you can run the root diagnostic question programmatically:
# Tested on:
#   openai==1.40.0
#   sentence-transformers==3.0.1
#   numpy==1.26.4
#   Python 3.11

import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

embedder = SentenceTransformer("all-MiniLM-L6-v2")
llm = OpenAI()

# Knowledge base ---------------------------------------------------------------
documents = [
    {"id": "policy_1", "content": "Premium plans have a 30-day refund policy. Users can request a full refund within 30 days of purchase."},
    {"id": "policy_2", "content": "To request a refund, email support@example.com with your order ID and reason for refund."},
    {"id": "policy_3", "content": "Refunds are processed within 5 business days. The amount will be credited to the original payment method."},
]

# Unit-normalised embeddings, so a dot product equals cosine similarity.
doc_index = embedder.encode(
    [d["content"] for d in documents],
    normalize_embeddings=True,
)


def retrieve(query: str, top_k: int = 3) -> list[dict]:
    q = embedder.encode(query, normalize_embeddings=True)
    ranks = np.argsort(doc_index @ q)[::-1][:top_k]
    return [documents[i] for i in ranks]


def generate_with_context(query: str, context_docs: list[dict]) -> str:
    context = "\n---\n".join(f"[Source: {d['id']}]\n{d['content']}" for d in context_docs)
    # Grounding + fallback are both load-bearing; never one without the other.
    prompt = (
        "Answer the question based ONLY on the provided context.\n"
        "If the context doesn't contain enough information, reply exactly: "
        "\"I don't have enough information to answer that.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )
    resp = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content


def rag(query: str) -> dict:
    chunks = retrieve(query, top_k=3)
    answer = generate_with_context(query, chunks)
    return {
        "query": query,
        "retrieved_docs": [d["id"] for d in chunks],
        "retrieved_chunks": [d["content"] for d in chunks],  # for the decision-tree root question
        "answer": answer,
    }


# Debugging-tree usage: when the answer is wrong, grep retrieved_chunks for the ground-truth span.
result = rag("What's the refund policy and how do I get one?")
print("Query:    ", result["query"])
print("Retrieved:", result["retrieved_docs"])
print("Answer:   ", result["answer"])
The retrieved_chunks field is what the decision tree consumes. Save it to your logs. When a user reports a bad answer, the first action is to grep those chunks for the ground-truth span, not to tune the prompt.
Verify Your Understanding
Before continuing, you should be able to answer these from memory:
- State the root question of the RAG debugging decision tree. Then describe the three branches and the fix list you reach for in each. Name the misdiagnosis that skipping the question produces.
- Apply the tree to a concrete failure. Wrong answer: “The refund policy is 60 days.” Ground truth: “30 days.” Retrieved chunk #1 says: “Premium plans have a 30-day refund policy.” Which branch are you on? What’s the first fix?
- Reranking vs increasing K. Name one query class where reranking beats raising K, and one query class where raising K beats reranking. If your answer is “always rerank” or “never rerank”, you haven’t reasoned about the trade-off.
- Dev-works / prod-breaks. Your RAG passes every dev query, fails on real production queries. Give three concrete hypotheses with named fixes — one each from retrieval, generation, and the dev/prod data-distribution gap.
- Map RAGAS metrics onto the tree. Faithfulness = 0.55, context-precision@5 = 0.30, context-recall = 0.85, answer-relevancy = 0.92. Which branch of the decision tree are you on? What’s the most likely root cause?
What’s Next
RAG is one-shot: retrieve once, generate once. The next chapter — RAG → Agents — extends the same data path into multi-step loops where the model decides when to retrieve, what tool to call, and how to stop. The pedagogical bridge to the production-agents series begins there.
References
- [lewis2020] Lewis, P. et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. arXiv:2005.11401. Source of the RAG architecture name; pairs a Dense Passage Retriever with a BART generator over a dense Wikipedia index. Two formulations: RAG-Sequence (one retrieval, full generation) and RAG-Token (different passages per token). Cited in §§ Building On Previous Knowledge, The Complete RAG Pipeline.
- [ragas] Es, S. et al. RAGAS: Automated Evaluation of Retrieval-Augmented Generation. EACL 2024 + ongoing OSS framework. docs.ragas.io. Source of the four canonical metrics (faithfulness, context_precision, context_recall, answer_relevancy) used to instrument the debugging decision tree. Cited in §§ Building On Previous Knowledge, Evaluation, Common Pitfalls & Misconceptions.
- [liu2023] Liu, N. F. et al. Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172. Original U-shape attention-vs-position finding: context-start and context-end positions outperform context-middle. Cited in § Common Pitfalls & Misconceptions.
- [chroma-rot] Hong, K., Troynikov, A., Huber, J. Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma Technical Report, 2025-07-14. trychroma.com/research/context-rot. Tested 18 LLMs across Anthropic, OpenAI, Google, Alibaba; “performance grows increasingly unreliable as input length grows.” Cited in § Common Pitfalls & Misconceptions.
- [karpukhin2020] Karpukhin, V. et al. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020. arXiv:2004.04906. The retriever component of the original RAG paper; the practical baseline for any modern dense-retrieval RAG system. Cited in § The Complete RAG Pipeline.
- [pa-testing] Production Agents — Part 8: Testing & Evaluation. Operator-grade companion to RAG evaluation, including the LLM-judge drift trap referenced in this chapter’s pitfalls table. Cross-series bridge.
- [hyde] Gao, L. et al. Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE). ACL 2023. arXiv:2212.10496. Generate a hypothetical answer, embed that — fixes query-document phrasing mismatch without training. Cited in § The RAG Debugging Decision Tree and § Advanced Patterns.