
AI Engineering Series

Retrieval to RAG - The Complete Pipeline

Deep dive into RAG: prompt construction, reranking, failure modes, the debugging decision tree, and how to diagnose when things go wrong

Building On Previous Knowledge

In the previous progression, you learned how retrieval finds relevant documents for a query. This created a new problem: having the right documents doesn’t automatically produce the right answer.

The LLM might ignore the retrieved context. It might hallucinate despite having good context. It might use the context but synthesize incorrectly. Retrieval gave us ingredients; now we need to cook them properly.

This progression solves that problem by introducing the complete RAG pipeline: how to construct prompts, when to use reranking, and critically—how to diagnose failures when things go wrong.

What Goes Wrong Without This:

[Figure: RAG Failure Patterns]

The Complete RAG Pipeline

RAG (Retrieval-Augmented Generation) combines retrieval with LLM generation:

[Figure: RAG Pipeline]

Prompt Construction

How you present retrieved context to the LLM matters:

[Figure: Basic RAG Prompt Template]

Key elements:

[Figure: Prompt Key Elements]
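To make those elements concrete, here is a minimal sketch of a prompt builder (the function name and exact instruction wording are illustrative; the same template appears in the full code example at the end of this article):

def build_rag_prompt(query: str, context_docs: list[dict]) -> str:
    """Assemble a grounded RAG prompt from retrieved documents (illustrative sketch)."""
    # Source labels: let the model (and you) attribute claims to documents
    context = "\n---\n".join(
        f"[Source: {doc['id']}]\n{doc['content']}" for doc in context_docs
    )
    return (
        # Grounding instruction: answer only from the provided context
        "Answer the question based only on the provided context.\n"
        # Fallback instruction: admit when the context is insufficient
        "If the context doesn't contain enough information, say "
        "\"I don't have enough information to answer that.\"\n\n"
        # Clear separation between context, question, and the answer slot
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )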

Reranking: Quality Over Quantity

Initial retrieval is fast but imprecise. Reranking improves quality:

[Figure: Reranking Pipeline]
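A sketch of the second stage using a cross-encoder reranker from sentence-transformers (the model name, function, and K values here are illustrative choices, not prescribed by this article):

from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, document) pair jointly: slower than
# bi-encoder retrieval, but noticeably more precise.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], final_k: int = 3) -> list[dict]:
    """Re-score first-stage candidates and keep only the best few."""
    pairs = [(query, doc["content"]) for doc in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in ranked[:final_k]]

# Two-stage usage: retrieve broadly (cheap), then rerank narrowly (precise).
# `retrieve` is the first-stage function from the code example later in this article.
# top_docs = rerank(query, retrieve(query, top_k=20), final_k=3)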

How Many Documents?

More context isn’t always better:

[Figure: Trade-offs in K]

“Lost in the middle” problem: LLMs attend more to beginning and end of context. Information in the middle may be ignored.

Practical guidance:

[Figure: Choosing K Values]
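One common mitigation for the "lost in the middle" effect, beyond simply lowering K, is to reorder the documents you keep so the strongest evidence sits at the start and end of the context. A minimal sketch (the strategy and function name are illustrative, not from the figure above):

def reorder_for_position_bias(docs_best_first: list[dict]) -> list[dict]:
    """Given docs sorted best-first, push the weakest ones toward the middle.

    Alternates documents between the front and the back of the context so the
    model's attention on the beginning and end lands on the best evidence.
    """
    front, back = [], []
    for i, doc in enumerate(docs_best_first):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

# Example: relevance ranks [1, 2, 3, 4, 5] are laid out as [1, 3, 5, 4, 2]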

RAG Failure Modes

When RAG goes wrong:

[Figure: RAG Failure Modes]

The RAG Debugging Decision Tree

When RAG output is wrong, use this systematic approach:

[Figure: RAG Debugging Decision Tree]

Commit this decision tree to memory. It will save you hours of random debugging.
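The full tree lives in the diagram above; its first branches can be expressed roughly as follows (a sketch with illustrative names, where each boolean is a judgment you make by inspecting the retrieved documents and the corpus):

def diagnose_rag_failure(answer_in_retrieved_docs: bool,
                         answer_in_corpus: bool) -> str:
    """Triage a wrong RAG answer into the three broad failure classes (sketch)."""
    if answer_in_retrieved_docs:
        # Evidence reached the model but was ignored or synthesized badly
        return "Generation problem: tighten grounding instructions, prompt format, doc ordering."
    if answer_in_corpus:
        # Evidence exists but retrieval never surfaced it
        return "Retrieval problem: revisit chunking, embeddings, query formulation, or K."
    # Evidence does not exist anywhere in the knowledge base
    return "Knowledge gap: add or update documents; no prompt change will fix this."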


Evaluation

RAG has two components to evaluate:

[Figure: RAG Evaluation Framework]
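A sketch of scoring the two components separately: recall@K for retrieval against labeled relevant documents, and a crude grounding check for generation (real setups typically use an LLM judge or NLI model for faithfulness; all names here are illustrative):

import re

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Retrieval metric: fraction of labeled relevant documents that were retrieved."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def numbers_are_grounded(answer: str, context_docs: list[dict]) -> bool:
    """Generation check (crude proxy): every number in the answer appears in the context."""
    context = " ".join(doc["content"] for doc in context_docs)
    return all(num in context for num in re.findall(r"\d+", answer))

# Report the two scores separately: a retrieval failure and a generation failure
# need different fixes, as the debugging decision tree shows. Using the documents
# list from the code example below:
# recall_at_k(["policy_1", "policy_3"], {"policy_1", "policy_2"})              # -> 0.5
# numbers_are_grounded("Refunds take 5 business days.", documents)             # -> True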

Advanced Patterns

Beyond basic RAG:

[Figure: Advanced RAG Patterns]

Common Misconceptions

“RAG is just Retrieve + Generate”

This is the happy path. The unhappy paths are:

  • Retrieved docs don’t contain the answer
  • Retrieved docs contain the answer but LLM ignores them
  • Retrieved docs are used but LLM synthesizes incorrectly
  • Retrieved docs conflict with each other

RAG is a system with multiple failure modes. Understanding the failure modes IS understanding RAG.

“If retrieval is good, generation will be good”

“Lost in the middle” is real—LLMs attend more to the beginning and end of context. Without strong grounding instructions, the model may prefer its training data.

Good retrieval is necessary but not sufficient.

“More retrieved documents = better answers”

More documents = more noise, higher cost, and “lost in the middle” problems. If you retrieve 20 documents and only 3 are relevant, you’ve added 17 distractors.

There’s an optimal K for your use case. It’s usually smaller than you think.


Code Example

Complete RAG pipeline with retrieval and generation:

import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

# Initialize models
embedder = SentenceTransformer('all-MiniLM-L6-v2')
llm_client = OpenAI()

# Knowledge base
documents = [
    {
        "id": "policy_1",
        "content": "Premium plans have a 30-day refund policy. Users can request a full refund within 30 days of purchase.",
    },
    {
        "id": "policy_2",
        "content": "To request a refund, email support@example.com with your order ID and reason for refund.",
    },
    {
        "id": "policy_3",
        "content": "Refunds are processed within 5 business days. The amount will be credited to the original payment method.",
    },
]

# Index documents
doc_texts = [d["content"] for d in documents]
doc_embeddings = embedder.encode(doc_texts)

def retrieve(query: str, top_k: int = 3) -> list[dict]:
    """Retrieve most relevant documents for query."""
    query_embedding = embedder.encode(query)
    similarities = np.dot(doc_embeddings, query_embedding)
    top_indices = np.argsort(similarities)[::-1][:top_k]
    return [documents[i] for i in top_indices]

def generate_with_context(query: str, context_docs: list[dict]) -> str:
    """Generate answer using retrieved context."""
    # Format context
    context = "\n---\n".join([
        f"[Source: {doc['id']}]\n{doc['content']}"
        for doc in context_docs
    ])

    # Construct prompt with grounding instruction
    prompt = f"""Answer the question based only on the provided context.
If the context doesn't contain enough information, say "I don't have enough information to answer that."

Context:
{context}

Question: {query}

Answer:"""

    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )

    return response.choices[0].message.content

def rag(query: str) -> dict:
    """Complete RAG pipeline."""
    # 1. Retrieve
    retrieved = retrieve(query, top_k=3)

    # 2. Generate
    answer = generate_with_context(query, retrieved)

    return {
        "query": query,
        "retrieved_docs": [d["id"] for d in retrieved],
        "answer": answer,
    }

# Test
result = rag("What's the refund policy and how do I get one?")
print(f"Query: {result['query']}")
print(f"Retrieved: {result['retrieved_docs']}")
print(f"Answer: {result['answer']}")

Key Takeaways

[Figure: Key Takeaways]

Verify Your Understanding

Before considering yourself RAG-capable:

Use the debugging decision tree from memory. Given a wrong RAG output, what’s your first diagnostic question?

Given this scenario, diagnose the problem:

RAG returns: “The refund policy is 60 days.” Ground truth: “The refund policy is 30 days.” Retrieved doc contains: “Premium plans have a 30-day refund policy.”

Is this a retrieval problem, generation problem, or knowledge gap? How do you know?

When would you use reranking vs. just increasing K? If your answer is “always rerank” or “never rerank,” you don’t understand the trade-offs.

Your RAG system works in development but fails in production. List 3 specific hypotheses for why this might happen.


What’s Next

After this, you can:

  • Continue → RAG → Agents — from single-shot RAG to multi-step agents
  • Build → Production RAG system with proper evaluation
