Generation to Retrieval - Grounding LLMs in Facts | Intentional / Deliberate / Engineering

Three-panel hero. Left: small-chunk regime at 128 tokens — six chunks, two relevant, the answer to 'How do I revoke an API key?' is fragmented across chunks #15, #92, #93. A red panel reads 'precision: 2 / 6 chunks relevant.' Centre: a recall-at-5 curve over chunk size (128, 256, 512, 1024, 2048 tokens) — it peaks around 512 tokens at ~0.96, drops to 0.62 at 128 and 0.71 at 2048. Right: large-chunk regime at 2048 tokens — one chunk contains the answer plus 1500 tokens of unrelated context. A red panel reads 'Lost-in-the-Middle kicks in.' Footer: 'Optimise for recall of the answer span, not chunk size in isolation.' — Chunk size is non-monotonic — bigger chunks aren't strictly better

Building On Previous Knowledge

The previous chapter showed that LLMs generate token-by-token from a learned probability distribution — and crucially, even with temperature = 0 and a fixed seed, generation isn’t bit-identical across hardware. That non-determinism is the second-order problem. The first-order one is more obvious: the model can only generate from patterns it memorised during training. New facts, your private data, this week’s news — none of it is in there.

If the answer isn’t in the training data, the model refuses to answer or, worse, fabricates a plausible-sounding response.

Where most RAG tutorials stop: they tell you to “chunk your documents”, “embed them”, “query the vector store”, and ship. They never explain why chunk-size 128 fragments answers across pages, why chunk-size 2048 triggers Lost-in-the-Middle, why dense retrieval misses exact strings like E1234, or why BM25 — a 1994 algorithm — still beats fancy embeddings on keyword-heavy queries. This chapter walks the data path through Karpukhin et al. 2020 (DPR) [karpukhin2020], Robertson & Zaragoza’s BM25 [robertson2009], and Chroma’s 2025 Context Rot measurements [chroma-rot]. Every claim has a number you can verify.

Takeaway: retrieval grounds generation in facts the model never saw — but the configuration choices (chunk size, sparse vs dense, single-stage vs reranked) determine whether the grounding is precise enough to matter.

What Goes Wrong Without This:

Retrieval Failure Patterns

Symptom: Your AI assistant confidently answers questions about your
         company's products with completely fabricated information.
Cause:   The LLM generates plausible text from patterns, but has no
         access to your actual documentation. High confidence ≠ correctness.

Symptom: Retrieval returns documents with high similarity scores,
         but the RAG system still produces incorrect answers.
Cause:   You treated retrieval as similarity search. Similarity is
         SYMMETRIC (A similar to B = B similar to A). Relevance is NOT.

Symptom: RAG works for demo queries but fails for real user queries.
Cause:   Demo queries match document phrasing. Real queries use
         different vocabulary. Query-document mismatch.

The Limits of Pure Generation

LLMs are trained on static data with a cutoff date. They generate from learned patterns, not live facts.

Pure Generation Problems

Problems with pure generation:

1. KNOWLEDGE CUTOFF
 Q: "Who won the 2024 election?"
 A: "I don't have information past my training cutoff..."

2. HALLUCINATION
 Q: "What's the API for uploading files in our product?"
 A: "Use POST /api/upload with multipart/form-data..."
 (confidently wrong—made up based on patterns)

3. NO PRIVATE DATA
 Q: "What did the client say in yesterday's email?"
 A: Cannot access—not in training data

4. OUTDATED FACTS
 Q: "What's the current price of Bitcoin?"
 A: Training data price, not live price

The model generates plausible text, but plausible ≠ true.

Takeaway: every LLM has a knowledge cutoff, hallucinates confidently on facts it doesn’t have, and cannot read your private data — three failure modes that no amount of bigger-model scaling will fix without external context.

Retrieval: Grounding Generation in Facts

Instead of asking the model to recall facts, give it facts to use.

Retrieval vs Pure Generation

Without retrieval:
User query → LLM → Generated answer (may hallucinate)

With retrieval:
User query → Search knowledge base → Relevant docs
↓
[Query + Docs] → LLM → Grounded answer

The LLM now has context to work with.

The Retrieval Pipeline

Retrieval Pipeline

INDEXING (offline):

Documents
     │
     ↓
┌───────────┐
│   Chunk   │  Split into manageable pieces
└───────────┘
     │
     ↓
┌───────────┐
│   Embed   │  Convert chunks to vectors
└───────────┘
     │
     ↓
┌───────────┐
│   Index   │  Store in vector database
└───────────┘

QUERY (online):

User query
     │
     ↓
┌───────────┐
│   Embed   │  Same embedding model as indexing
└───────────┘
     │
     ↓
┌───────────┐
│  Search   │  Find similar vectors in index
└───────────┘
     │
     ↓
Top-K most similar chunks

Vector Similarity Search

Core mechanic: find vectors closest to query vector.

Vector Similarity Search

Query embedding: [0.2, 0.8, -0.1, ...]

Document embeddings in index:
doc1: [0.25, 0.75, -0.05, ...] → sim = 0.98 ← most similar
doc2: [0.1, 0.6, 0.3, ...] → sim = 0.85
doc3: [-0.5, 0.1, 0.8, ...] → sim = 0.23
doc4: [0.22, 0.78, -0.08, ...] → sim = 0.97

Return top-K (e.g., K=3): [doc1, doc4, doc2]

Why it works: Embedding models map similar meanings to similar vectors. “How do I reset my password?” is close to “Password reset instructions” even though they share few words.
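The top-K mechanic above can be sketched in a few lines. This is a minimal illustration, assuming NumPy and a brute-force scan over a toy three-dimensional index; the vectors and the function name `top_k_cosine` are made up for the example, and production systems usually replace the full scan with an approximate nearest-neighbour index.

```python
import numpy as np


def top_k_cosine(query: np.ndarray, index: np.ndarray, k: int) -> list[tuple[int, float]]:
    """Return (row_id, cosine_similarity) for the k rows of `index` closest to `query`."""
    q = query / np.linalg.norm(query)
    rows = index / np.linalg.norm(index, axis=1, keepdims=True)
    sims = rows @ q                      # dot product of unit vectors = cosine similarity
    order = np.argsort(-sims)[:k]        # highest similarity first
    return [(int(i), float(sims[i])) for i in order]


# Toy index: four hypothetical document embeddings (illustrative values only).
index = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
])
query = np.array([2.0, 0.0, 0.0])        # length differs, direction matches row 0

results = top_k_cosine(query, index, k=2)
print(results)
```

Because both sides are normalised first, only direction matters: the query is twice as long as row 0 and still scores a similarity of 1.0 against it.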

Takeaway: retrieval factors generation into two stages: find relevant context in an index built offline, then generate against it online. The model becomes a reading-comprehension engine over text it has just been shown, not a memory engine over text it once saw.

Dense vs Sparse Retrieval

Two fundamentally different approaches:

Dense vs Sparse Retrieval

SPARSE RETRIEVAL (BM25, TF-IDF):
Representation: High-dimensional sparse vectors
(vocab_size dimensions, mostly zeros)

"The cat sat" → [0, 0, ..., 1, 0, ..., 1, 0, ..., 1, ...]
                            ↑          ↑          ↑
                           cat        sat        the

Matching: Exact keyword overlap
Strengths: Precise keyword matching, interpretable
Weakness: Misses synonyms, requires exact terms

DENSE RETRIEVAL (Embeddings):
Representation: Low-dimensional dense vectors
(384-1536 dimensions, all non-zero)

"The cat sat" → [0.23, -0.41, 0.89, 0.12, ...]

Matching: Semantic similarity
Strengths: Captures meaning, handles synonyms
Weakness: May miss exact matches, less interpretable

The dense vs sparse contest isn’t settled. Karpukhin et al. 2020’s Dense Passage Retrieval paper showed dense retrievers outperform “a strong Lucene-BM25 system largely by 9%–19% absolute in terms of top-20 passage retrieval accuracy” on open-domain QA [karpukhin2020] — but that’s on natural-language questions paired with Wikipedia passages. Flip the workload to keyword-heavy queries (“error code E1234”, “stack trace NullPointerException”) and BM25 wins. The canonical BM25 formula (Robertson & Zaragoza 2009 [robertson2009]) uses two parameters — k₁ controls term-frequency saturation, b controls document-length normalisation. Lucene ships k₁ = 1.2, b = 0.75 by default; the literature recommends tuning k₁ within [1.2, 2.0]. Don’t tune them until you’ve measured.
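To make the roles of k₁ and b concrete, here is a minimal from-scratch sketch of Okapi BM25 scoring. It assumes pre-tokenised documents and uses one common non-negative IDF variant, log(1 + (N − n + 0.5) / (n + 0.5)); libraries differ in the exact IDF form, so treat the function `bm25_scores` and the toy corpus as illustrative rather than as any library's implementation.

```python
import math
from collections import Counter


def bm25_scores(query_terms: list[str], docs: list[list[str]],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every tokenised document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # b scales the penalty for documents longer than average;
            # k1 controls how quickly repeated occurrences stop adding score.
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical toy corpus: only the first document contains the exact code.
docs = [
    "error code e1234 means the token expired".split(),
    "the application crashed after the update".split(),
    "reset the token in settings".split(),
]
scores = bm25_scores(["e1234"], docs)
print(scores)
```

Documents that lack the query term score exactly zero, which is the exact-match behaviour that makes BM25 strong on rare identifiers.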

In practice: Combine both (hybrid search).

Hybrid Search Benefits

Query: "error code E1234"

Sparse (BM25): Finds docs with exact string "E1234" ✓
Dense: May not find if "E1234" wasn't in training data ✗

Query: "my application keeps crashing"

Sparse (BM25): Needs exact word "crashing" ✗
Dense: Matches "app failure", "program stops working" ✓

Hybrid: Best of both

Reciprocal Rank Fusion (RRF)

When combining dense and sparse results, use RRF to merge ranked lists:

Reciprocal Rank Fusion

RRF_score(doc) = Σ 1 / (k + rank_i(doc))

Where:

- rank_i(doc) = position of doc in ranking i (1-indexed)
- k = constant (typically 60)

Example:
Dense ranking: [doc_A (rank 1), doc_B (rank 2), doc_C (rank 3)]
Sparse ranking: [doc_B (rank 1), doc_C (rank 2), doc_A (rank 3)]

RRF scores (k=60):
doc_A: 1/(60+1) + 1/(60+3) = 0.0164 + 0.0159 = 0.0323
doc_B: 1/(60+2) + 1/(60+1) = 0.0161 + 0.0164 = 0.0325 ← Highest!
doc_C: 1/(60+3) + 1/(60+2) = 0.0159 + 0.0161 = 0.0320

Final ranking: [doc_B, doc_A, doc_C]

RRF is simple and often performs just as well as learned score combination — Cormack, Clarke, Buettcher 2009 published the original k=60 default and the SIGIR evaluation showing it outperforms Condorcet and learned rank-aggregation methods [cormack2009].

Takeaway: dense retrieval wins on natural-language questions (DPR’s 9–19% lead on open-domain QA); sparse BM25 wins on keyword-heavy or out-of-distribution queries. Hybrid + RRF gets you both with one configuration knob (k=60) and no learned model.

Chunking: Why and How

Documents are too long to embed as single units:

Embedding models have token limits (512-8192)
Long texts dilute specific information
Retrieval granularity matters

Why Chunking Matters

Document: 50-page manual

Bad: One embedding for entire document
→ Query matches but relevant info buried in noise

Good: Chunk into ~500 token pieces
→ Query matches specific relevant section

Chunking Strategies

Fixed-size: Every N tokens
Simple, may break mid-sentence

Sentence-based: Split at sentence boundaries
Preserves complete thoughts

Paragraph-based: Split at paragraph breaks
Preserves larger context

Semantic: Split where topic changes
Best quality, more complex

Recursive: Try larger splitters first, fall back
Hierarchical, respects structure
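As an illustration of the sentence-based strategy, the sketch below packs whole sentences into chunks under a token budget. It assumes a naive regex sentence splitter and whitespace-separated words as a stand-in for model tokens; the function name `sentence_chunks` is hypothetical, and a real pipeline would count tokens with the embedding model's own tokenizer.

```python
import re


def sentence_chunks(text: str, max_tokens: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `max_tokens` words.

    A single sentence longer than the budget becomes its own oversized chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Open Settings. Select Security. Click Reset Password. Confirm by email."
chunks = sentence_chunks(text, max_tokens=5)
print(chunks)
```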

Overlap

Include some text from previous chunk to preserve context at boundaries:

Chunk Overlap

Chunk 1: "...the password reset link. Click it to..."
Chunk 2: "...reset link. Click it to create a new password..."
                  ↑
            overlap region

Why: Context at boundaries isn't lost.
Tradeoff: More storage, potential duplicate retrieval.
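Fixed-size chunking with overlap can be sketched as a sliding window whose step is the chunk size minus the overlap. This is a minimal example that assumes whitespace-separated words stand in for model tokens; `chunk_with_overlap` is a hypothetical helper name.

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split `tokens` into windows of `chunk_size` that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


tokens = "Open the password reset link . Click it to create a new password".split()
chunks = chunk_with_overlap(tokens, chunk_size=6, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk begins with the last two tokens of the previous one, so a phrase that straddles a boundary appears whole in at least one chunk, at the cost of storing those tokens twice.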

Why Chunk-Size Is Non-Monotonic

This is the chapter’s load-bearing claim, and the one that most public RAG tutorials skip: bigger chunks are not strictly better. The recall-vs-chunk-size curve has an interior maximum, not a monotonic shape. The hero diagram shows the curve qualitatively; the mechanism is the tension between two effects:

Too small (e.g. 128 tokens) → the answer fragments across multiple chunks. The retriever returns chunk #15 (“revoked from Settings”) but misses chunk #92 (“click confirm”). The generator now has half the procedure. Recall of the answer span is only partial, even though the retrieved chunks “matched” the query.
Too large (e.g. 2048 tokens) → each chunk contains the answer and 1,500 tokens of unrelated context. Two compounding harms: (1) dense-embedding precision drops because one vector now represents many topics; (2) Lost-in-the-Middle kicks in at generation time. Liu et al. 2023 named the U-shaped accuracy-vs-position pattern: models use information at the start or end of the context better than information in the middle [liu2023]. Chroma’s 2025 Context Rot report then tested 18 LLMs across Anthropic, OpenAI, Google, and Alibaba and found “performance grows increasingly unreliable as input length grows”, with target information at the start outperforming target information in the middle [chroma-rot].

The Chroma study added a counter-intuitive finding: “models perform worse when the haystack preserves a logical flow of ideas. Shuffling the haystack and removing local coherence consistently improves performance” across all 18 models tested [chroma-rot]. Coherent long chunks aren’t free — coherence itself becomes a distractor.

The peak commonly lands around 256–512 tokens for English natural-language corpora with a text-embedding-3-small-class model — the peak shifts with embedding model, corpus structure, and query type. Test, don’t guess.
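One way to measure this on your own corpus is to score retrieval by how much of a labelled answer span the retrieved chunks cover, rather than by whether any chunk matched. The sketch below assumes the answer and every chunk are identified by half-open token offsets into the source document; that representation and the name `answer_span_recall` are illustrative assumptions, not part of any library.

```python
def answer_span_recall(answer_span: tuple[int, int],
                       retrieved_chunks: list[tuple[int, int]]) -> float:
    """Fraction of the answer's tokens covered by at least one retrieved chunk.

    Spans are half-open (start, end) token offsets into the same document.
    """
    start, end = answer_span
    covered: set[int] = set()
    for chunk_start, chunk_end in retrieved_chunks:
        covered.update(range(max(start, chunk_start), min(end, chunk_end)))
    return len(covered) / (end - start)


# Hypothetical case: the answer occupies tokens 100-179 and straddles a 128-token boundary.
answer = (100, 180)
partial = answer_span_recall(answer, [(0, 128)])            # only the first chunk retrieved
full = answer_span_recall(answer, [(0, 128), (128, 256)])   # both chunks retrieved
print(partial, full)
```

Averaging this score over a labelled query set at each candidate chunk size gives an empirical version of the curve described above.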

Takeaway: chunk size has an interior optimum, not a monotonic curve. Both 128-token fragmentation and 2048-token Lost-in-the-Middle hurt recall; the peak usually lands at 256–512 tokens for English NL queries, but measure on your corpus.

Multi-Stage Retrieval

Single-stage retrieval trades off speed vs accuracy. Multi-stage gets both.

Multi-Stage Retrieval

SINGLE-STAGE RETRIEVAL
--------------------------------------------------------------------

Query → Embedding Model → Vector Search → Top 10 Results

Fast (~20ms), but accuracy limited by bi-encoder's ability
to independently embed query and docs.

MULTI-STAGE RETRIEVAL (Retrieve → Rerank)
──────────────────────────────────────────

Stage 1: Fast retrieval (bi-encoder)
Query → Top 100 candidates (~20ms)
Uses: Dense/sparse/hybrid retrieval

Stage 2: Accurate reranking (cross-encoder)
Rerank 100 → Top 10 (~200ms for 100 pairs)
Uses: Cross-encoder model

Total: ~250ms, but significantly better accuracy

Bi-Encoder vs Cross-Encoder

BI-ENCODER (used in retrieval):

Query ─────→ [Encoder] ─────→ query_vector
                                   ↓
                        cosine_similarity = score
                                   ↑
Doc   ─────→ [Encoder] ─────→ doc_vector

✓ Can pre-compute doc vectors (once)
✓ Fast similarity search at query time
✓ Scales to millions of documents
✗ Query and doc don't "see" each other
✗ Lower accuracy

CROSS-ENCODER (used in reranking):

[CLS] query [SEP] document [SEP] ─────→ [BERT] ─────→ score

✓ Query and doc interact via attention
✓ Higher accuracy (5-10% improvement)
✗ Must encode every (query, doc) pair
✗ Can't pre-compute anything
✗ Slow: O(n) for n documents

Why Two Stages?

Two-Stage Tradeoffs

┌──────────────────────────┬───────────┬──────────┬───────────────────┐
│ Method                   │ Latency   │ Accuracy │ Use Case          │
├──────────────────────────┼───────────┼──────────┼───────────────────┤
│ Bi-encoder only          │ ~20ms     │ 85%      │ Speed-critical    │
│ Cross-encoder only       │ ~2000s/1M │ 95%      │ Tiny corpus only  │
│ Bi-encoder → Cross-enc   │ ~250ms    │ 93%      │ Production        │
└──────────────────────────┴───────────┴──────────┴───────────────────┘

The bi-encoder filters to candidates.
The cross-encoder reranks for precision.
Best of both worlds.
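The retrieve-then-rerank control flow can be sketched independently of any model. In the example below both scorers are deliberately crude stand-ins: word overlap plays the role of the bi-encoder and ordered-bigram overlap plays the role of the cross-encoder. All names are hypothetical; in a real system the two callables would wrap an embedding search and a cross-encoder model.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(query: str, docs: list[str], cheap_score: Scorer,
                         expensive_score: Scorer, n_candidates: int, top_k: int) -> list[str]:
    """Stage 1 shortlists with the cheap scorer; stage 2 reorders the shortlist only."""
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:n_candidates]
    return sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)[:top_k]


def word_overlap(query: str, doc: str) -> float:
    """Stand-in for a bi-encoder: query and doc are compared as unordered word sets."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def ordered_bigram_overlap(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: rewards query word pairs that appear in order."""
    q, d = query.lower().split(), doc.lower().split()
    doc_bigrams = set(zip(d, d[1:]))
    return sum(1 for bigram in zip(q, q[1:]) if bigram in doc_bigrams)


docs = [
    "my password reset failed and i do not know how",
    "to reset my password open settings",
    "contact billing support",
]
query = "how do i reset my password"

cheap_top = sorted(docs, key=lambda d: word_overlap(query, d), reverse=True)[0]
reranked = retrieve_then_rerank(query, docs, word_overlap, ordered_bigram_overlap,
                                n_candidates=2, top_k=1)
print("stage 1 only:", cheap_top)
print("after rerank:", reranked[0])
```

The expensive scorer runs on `n_candidates` documents rather than the whole corpus, which is what keeps the second stage affordable.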

Takeaway: single-stage retrieval is bound by the bi-encoder’s ability to embed query and doc independently; multi-stage adds a cross-encoder reranker that lets the query and doc attend to each other — typical 5–10pp accuracy gain for ~10× latency in the reranking pass.

Retrieval Quality Metrics

How do you know retrieval is working?

Retrieval Quality Metrics

Recall@K
"Of all relevant docs, how many are in my top-K?"

5 relevant docs exist, top-10 retrieval finds 4
Recall@10 = 4/5 = 0.80

Critical for RAG: if relevant doc isn't retrieved,
the LLM can't use it.

Precision@K
"Of my top-K results, how many are relevant?"

Top-10 has 4 relevant, 6 irrelevant
Precision@10 = 4/10 = 0.40

Matters for: context window efficiency, noise reduction

MRR (Mean Reciprocal Rank)
"How high is the first relevant result?"

First relevant at position 3 → RR = 1/3
Average across queries = MRR

For RAG, Recall@K usually matters most. If the answer isn’t retrieved, generation fails.
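The three metrics are short enough to compute directly. The sketch below reproduces the numbers from the examples above on a hypothetical ranked list; the document IDs and function names are made up for illustration.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant documents that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Share of the top k results that are relevant."""
    return len(set(ranked[:k]) & relevant) / k


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / position of the first relevant result, or 0.0 if none is retrieved."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked_list, relevant_set) pairs, one per query."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Five relevant docs exist; the top 10 contains four of them, the first at position 3.
relevant = {"d1", "d2", "d3", "d4", "d5"}
ranked = ["x1", "x2", "d1", "d2", "x3", "d3", "x4", "d4", "x5", "x6"]

recall = recall_at_k(ranked, relevant, k=10)
precision = precision_at_k(ranked, relevant, k=10)
rr = reciprocal_rank(ranked, relevant)
print(recall, precision, rr)
```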

Takeaway: Recall@K is the primary metric for RAG — if the answer span isn’t in the top-K, no generator can recover. Precision@K and MRR matter for cost and UX; Recall@K matters for correctness.

Common Pitfalls & Misconceptions

The table below names the failure modes that show up most often in production RAG. Each row is a concrete bug class — the subsections that follow expand the trickiest three.

Symptom	Cause	Fix
Similarity scores are high but answers are still wrong	Similarity is symmetric; relevance is not. A doc that matches the query may still not answer it (“How do I bake cookies?” vs “I baked cookies yesterday”).	Rerank with a cross-encoder; evaluate Recall@K against held-out answer spans, not blind similarity thresholds.
RAG works on demo queries, fails on real user queries	Query-document phrasing mismatch — demo queries echo the docs; real users use synonyms (“login bug” vs “authentication failure”).	Add HyDE (generate hypothetical answer, embed that), query rewriting, or fine-tune the retriever on real (query, doc) pairs from your domain.
Retrieval “succeeds” but generator produces wrong answer	Answer is split across two chunks; neither alone is sufficient. Each chunk scores medium similarity, generator gets fragments.	Increase chunk overlap (15–25% of chunk size is a common default), or move to a coarser chunk size — then re-measure Recall@K of the answer span, not chunks.
Top-K chunks all match the query but bury the answer	Chunk size too large; Lost-in-the-Middle hides the answer span inside ~2K tokens of noise (Chroma 2025 measured this across 18 LLMs).	Reduce chunk size to 256–512 tokens, or post-process retrieved chunks (sentence-window retrieval) so the generator sees a tight passage, not a 2K-token block.
Dense retrieval misses exact error codes	Dense embeddings smear rare tokens (`E1234`, `NullPointerException`) into nearby semantic space; they don’t preserve exact strings the way BM25 does.	Use hybrid (BM25 + dense) with RRF combination. BM25 anchors the exact-match path; dense handles paraphrase.
Embedding model upgrade silently breaks production	Different embedding models produce different vector spaces — old indexed chunks aren’t comparable to new query embeddings.	Re-index the entire corpus when changing embedding models. Treat the embedding model as part of the index schema, not a swappable config flag.
Chunks hit the embedding model’s token limit and silently truncate	Embedding models cap at 512–8192 tokens; depending on the stack, longer inputs are either rejected or truncated without warning (sentence-transformers, for example, truncates at the model’s `max_seq_length`).	Pin chunk size below the embedding model’s `max_input_tokens` (e.g. 8191 for `text-embedding-3-large`). Assert chunk-token-count at index time.

Takeaway: RAG bugs almost always trace to one of these seven classes — and most look like “the model is bad” until you instrument retrieval separately from generation.

Misconception: “High similarity score = relevant result”

Similarity vs Relevance

Query: "How do I bake cookies?"
Document: "I baked cookies yesterday and they were delicious."

Similarity: HIGH (same topic, same words)
Relevance: ZERO (describes past event, doesn't answer the question)

Similarity is SYMMETRIC. Relevance is NOT.
A similar document isn't necessarily relevant, and a relevant
document isn't always the most similar one.

The Query-Document Mismatch Problem

Query-Document Mismatch

Query: "How do I fix the login bug?"
       (question format, user language)

Doc:   "Authentication failures can be resolved by..."
       (statement format, technical language)

Problem: Different phrasing may have lower similarity
even when doc answers the query.

Solutions:

Query Expansion: Add synonyms and related terms
HyDE: Generate hypothetical answer, embed that instead
Query Rewriting: Transform user query to match document style
Fine-tuned Retrievers: Train on your domain’s (query, doc) pairs

Chunking Mistakes

Chunking Pitfalls

The correct answer might be split across two chunks.
Neither chunk alone answers the query.
Both chunks score medium similarity.
Retrieval "succeeds" (returns chunks).
RAG fails (no chunk contains the answer).

Chunking is a system design decision, not preprocessing trivia.

Code Example

A minimal hybrid (BM25 + dense) retriever in roughly 45 lines, pinned to the library versions listed in the header comment. Uses sentence-transformers [sbert] for the dense path and rank-bm25 [rank-bm25] for the sparse path:

# Tested on:
#   sentence-transformers==3.0.1
#   rank-bm25==0.2.2
#   numpy==1.26.4
# Python 3.11
import re

import numpy as np
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi

# 1. Corpus -------------------------------------------------------------------
documents = [
    "To reset your password, go to Settings > Security > Reset Password.",
    "Our API rate limit is 100 requests per minute for free tier.",
    "Authentication failures (error code E1234) require a password reset.",
    "Contact support@example.com for billing questions.",
    "The application requires Python 3.9 or higher.",
    "Two-factor authentication can be enabled in Settings > Security.",
]

# 2. Dense index (bi-encoder, normalized embeddings → dot = cosine) -----------
dense_model = SentenceTransformer("all-MiniLM-L6-v2")
dense_index = dense_model.encode(documents, normalize_embeddings=True)


def tokenise(text: str) -> list[str]:
    """Lowercase and keep alphanumeric runs, so 'E1234)' in a document matches 'E1234' in a query."""
    return re.findall(r"[a-z0-9]+", text.lower())


# 3. Sparse index (rank-bm25 defaults: k1=1.5, b=0.75) ------------------------
tokenised = [tokenise(d) for d in documents]
sparse_index = BM25Okapi(tokenised)


def rrf_combine(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Reciprocal Rank Fusion (Cormack et al. 2009); k=60 is the constant used in that paper."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def hybrid_search(query: str, top_k: int = 3) -> list[tuple[str, float]]:
    q_dense = dense_model.encode(query, normalize_embeddings=True)
    dense_rank = np.argsort(dense_index @ q_dense)[::-1].tolist()

    # Tokenise the query exactly as the documents were tokenised at index time.
    sparse_rank = np.argsort(sparse_index.get_scores(tokenise(query)))[::-1].tolist()

    fused = rrf_combine([dense_rank, sparse_rank])[:top_k]
    # The score returned with each document is its dense cosine similarity, not its RRF score.
    return [(documents[i], float(dense_index[i] @ q_dense)) for i in fused]


for q in [
    "How do I change my password?",      # paraphrase — dense should win
    "error code E1234",                  # exact match — BM25 should win
    "I need help with my bill",          # paraphrase
]:
    print(f"\nQuery: {q!r}")
    for doc, score in hybrid_search(q, top_k=2):
        print(f"  [{score:.3f}] {doc[:60]}")

On a corpus this small, the hybrid result should match or beat either retriever alone, and the gap typically widens with bigger corpora and more varied queries. The “error code E1234” query is the decisive test: pure dense often misses it; pure BM25 catches it; RRF gets both paths.

Verify Your Understanding

Before continuing, you should be able to answer these from memory:

Similarity ≠ relevance, with a concrete example. Pick a query whose top-similarity doc is irrelevant. Explain in one sentence why similarity is symmetric and relevance is not.
The Doc1 vs Doc2 trap. Query "How do I reset my password?" — Doc1 "Password reset failed for user X at 3:42pm" (sim=0.91) vs Doc2 "Go to Settings > Security > Reset Password" (sim=0.87). Which is more relevant? Name the mechanism that lets cosine put the wrong doc on top.
“BM25 is outdated, dense retrieval is always better.” Identify the error. Cite a concrete query class where BM25 wins and explain why (hint: rare exact strings, OOV vocabulary).
Recall@10 vs Precision@10. Retrieval returns 10 docs; 8 score high similarity, only 2 actually answer the question. Which metric exposes this failure? What does the other metric tell you instead?
Walk through the chunk-size curve. At chunk size 128 you get fragmented answers; at 2048 you get Lost-in-the-Middle. Name the two opposing forces and predict where the peak lands for an English natural-language corpus on text-embedding-3-small.

What’s Next

Retrieval gave us relevant chunks. The next chapter — Retrieval → RAG — wires those chunks into a generation prompt, covers prompt construction and reranking strategies, and walks the debugging decision tree: when retrieval is wrong vs when generation is wrong, and why most teams misdiagnose the failure.

References

[karpukhin2020] Karpukhin, V. et al. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020. arXiv:2004.04906. Source for the 9–19% top-20 retrieval-accuracy gain of DPR over Lucene-BM25 on open-domain QA. Cited in §§ Building On Previous Knowledge, Dense vs Sparse Retrieval.
[robertson2009] Robertson, S., Zaragoza, H. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 2009. Canonical BM25 reference; covers the k₁, b parameters and the term-frequency saturation argument. Cited in § Dense vs Sparse Retrieval.
[chroma-rot] Hong, K., Troynikov, A., Huber, J. Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma Technical Report, 2025-07-14. trychroma.com/research/context-rot. Measured retrieval degradation across 18 LLMs as input length grew: Anthropic (Claude Opus 4, Sonnet 4, Sonnet 3.7, Sonnet 3.5, Haiku 3.5), OpenAI (o3, GPT-4.1, GPT-4.1 mini, GPT-4.1 nano, GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo), Google (Gemini 2.5 Pro, 2.5 Flash, 2.0 Flash), Alibaba (Qwen3-235B, Qwen3-32B, Qwen3-8B). Cited in §§ Chunking — Why Chunk-Size Is Non-Monotonic, Common Pitfalls.
[liu2023] Liu, N. F. et al. Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172. Original U-shaped performance-vs-position finding that the Chroma 2025 work expanded. Cited in § Chunking: Why Chunk-Size Is Non-Monotonic.
[cormack2009] Cormack, G., Clarke, C., Buettcher, S. Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods. SIGIR 2009. Source for the k=60 RRF default. Cited in § Dense vs Sparse Retrieval — Reciprocal Rank Fusion and § Code Example.
[sbert] Reimers, N., Gurevych, I. sentence-transformers library. sbert.net. Bi-encoder + cross-encoder reference implementations; all-MiniLM-L6-v2 and the canonical ms-marco-MiniLM-L-6-v2 cross-encoder reranker. Cited in §§ Multi-Stage Retrieval — Bi-Encoder vs Cross-Encoder, Code Example.
[rank-bm25] Brown, D. rank-bm25 Python package. GitHub: dorianbrown/rank_bm25. The pure-Python BM25Okapi implementation used in the Code Example. Cited in § Code Example.