The sampling math is clean and well-documented. The floating-point hardware underneath is not, and temperature=0 doesn't paper over that.
Building On Previous Knowledge
The previous chapter ended with one Takeaway: scaled dot-product attention is four linear-algebra steps, and the critical one is the √d_k divisor. Attention produces a sequence of context-aware hidden states — one vector per position, each one a soft mixture over every other token. This chapter picks up the data path at the final hidden state and turns it into a token.
The transformer’s last layer projects each position’s hidden state through a W_vocab matrix into a (vocab_size,) vector of logits — raw, unnormalised scores, one per token in the vocabulary. Generation is what happens between that logit vector and the next word the user sees. The math is small: temperature scales, top-k clips, top-p clips adaptively, softmax normalises, multinomial samples. The hero diagram at the top of this chapter walks the pipeline end-to-end.
Where most generation tutorials stop: they explain temperature, top-k, top-p, and recommend temperature=0 for “deterministic” output. Then a developer pins temperature=0, ships, and discovers that the same prompt against the same OpenAI model returns different responses across providers, GPU types, or even API versions. The chapter’s named differentiator: temperature=0 is not bit-deterministic across hardware — and the reasons are documented in NVIDIA’s own cuBLAS reference [cublas]. Holtzman et al. 2020 [holtzman2020] explains why greedy decoding produces bland text; the hardware-level explanation for why “deterministic” isn’t is the half public coverage skips.
Takeaway: generation has a clean four-step sampling pipeline and a dirty hardware caveat — production teams that rely on temperature=0 for reproducibility ship a bug; production teams that cache outputs ship reproducibility.
What Goes Wrong Without This:
- Symptom: Your LLM-powered app gives different answers every time. Cause: Temperature > 0 introduces randomness. This is a feature, not a bug, but you might not want it for your use case.
- Symptom: Model outputs are repetitive and boring. Cause: You set temperature = 0 (greedy decoding). The model always picks the highest-probability token, which leads to mode collapse.
- Symptom: The LLM generates coherent first paragraphs, then rambles. Cause: Autoregressive generation accumulates errors. Each token conditions on previous (possibly wrong) tokens.
The Generation Pipeline
Text generation is autoregressive: generate one token, append it, generate the next.
Input: "The capital of France is"

Step 1: Tokenize
  → [464, 3139, 286, 4881, 318]

Step 2: Forward pass through transformer
  Input embeddings → Attention layers → FFN layers
  → Final hidden states for each position

Step 3: Project to vocabulary (LM head)
  Last position's hidden state → (vocab_size,) logits
  hidden_state @ W_vocab → [2.3, -1.1, 0.5, ..., 1.8]
  (50,257 values, one per token)

Step 4: Convert to probabilities
  softmax(logits / temperature) → probabilities
  [0.001, 0.0002, 0.3, ..., 0.001]
  (sum = 1.0, each value ∈ [0, 1])

Step 5: Sample next token
  Based on probabilities → token "Paris" (ID 6342)

Step 6: Repeat from Step 2 with extended sequence
  New input: "The capital of France is Paris"
  → Generate next token...
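The loop above can be sketched in a few lines of plain Python. This is a toy under stated assumptions: `VOCAB` and `toy_logits` are hypothetical stand-ins for a real tokenizer and a real transformer forward pass, and only the control flow (score, normalise, sample, append, repeat) mirrors the steps listed.

```python
import math
import random

# Hypothetical 7-token vocabulary; a real model has tens of thousands of entries.
VOCAB = ["The", " capital", " of", " France", " is", " Paris", "<eos>"]


def toy_logits(token_ids):
    """Stand-in for Steps 2-3 (forward pass + LM head): favours the next vocab entry."""
    favoured = min(token_ids[-1] + 1, len(VOCAB) - 1)
    return [5.0 if i == favoured else 0.0 for i in range(len(VOCAB))]


def softmax(logits):
    """Step 4: turn logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt_ids, max_new_tokens, rng):
    """Autoregressive loop: sample one token, append it, feed the longer sequence back in."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(ids))
        next_id = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]  # Step 5
        ids.append(next_id)  # Step 6: the new token becomes part of the input
        if VOCAB[next_id] == "<eos>":
            break
    return ids


prompt = [0, 1, 2, 3, 4]  # "The capital of France is"
output = generate(prompt, max_new_tokens=5, rng=random.Random(0))
print("".join(VOCAB[i] for i in output))
```

Because the next token is sampled, the printed continuation depends on the seed passed to `random.Random`; the prompt prefix is always preserved.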
Logits to Probabilities
The model outputs raw scores (logits), not probabilities:
Logits: [2.3, -1.1, 0.5, 4.1, -0.3, ...]
  index 0 (2.3) = "the", index 3 (4.1) = "Paris"

Softmax converts to probabilities:
  P(token_i) = exp(logit_i) / Σ_j exp(logit_j)

After softmax (treating the five logits shown as the whole vocabulary):
  [0.137, 0.005, 0.023, 0.826, 0.010]
  "Paris" ≈ 83% probability

The token with the highest logit gets the highest probability. But it's not 100%: other tokens have a non-zero chance.
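The softmax formula is short enough to check by hand. The sketch below applies it to just the five logits shown, which is an assumption made for illustration: a real vocabulary has tens of thousands of entries, and their mass would shrink every value slightly.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp().
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.3, -1.1, 0.5, 4.1, -0.3]  # toy 5-token vocabulary
probs = softmax(logits)
print([round(p, 3) for p in probs])
print("most likely index:", probs.index(max(probs)))
```

The ordering of the probabilities always matches the ordering of the logits; softmax changes the spacing, not the ranking.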
Takeaway: generation is autoregressive — each token conditions only on tokens already in the context, and the model decides one token at a time by sampling from a softmax-normalised vocabulary distribution.
Temperature: Controlling Randomness
Temperature scales logits before softmax, controlling distribution “sharpness”:
T → 0 (very small):
  logits / T grows without bound, so the largest logit dominates
  Softmax puts ~100% on the top token and ~0% on the rest
  At exactly T = 0 the division is undefined; implementations switch to argmax (greedy)
  Output: conservative, repeatable for a fixed logit vector, may be repetitive

T = 1.0 (default):
  logits unchanged
  Sample according to trained distribution
  Output: balanced creativity and coherence

T = 2.0 (high):
  logits / 2 → flatter distribution
  Low-probability tokens become more likely
  Output: more random, creative, potentially incoherent

Visual (three-token vocabulary):
                      T=0.1    T=1.0    T=2.0
  Token A (logit 4):  ~100%    84%      63%
  Token B (logit 2):  <1%      11%      23%
  Token C (logit 1):  <1%       4%      14%
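The effect of the temperature knob can be reproduced with a small script. This is a sketch that assumes a three-token vocabulary with logits 4, 2 and 1; the function name `softmax_with_temperature` is illustrative, not a library API.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature. Temperature must be strictly positive."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.0, 1.0]  # tokens A, B, C
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}:", [f"{p:.1%}" for p in probs])
```

Lower temperatures push almost all of the mass onto token A, higher ones spread it out, and the ranking A > B > C never changes.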
Practical guidance:
| Temperature | Use Case | Behavior |
|---|---|---|
| 0.0 - 0.3 | Factual Q&A | Conservative, near-deterministic |
| 0.3 - 0.7 | Code gen | Balanced, mostly predictable |
| 0.7 - 1.0 | Creative | More varied, still coherent |
| 1.0 - 1.5 | Brainstorm | High variety, some wild |
Takeaway: temperature is a one-knob exploration/exploitation dial — T<1 sharpens the distribution toward the highest-probability tokens, T>1 flattens it, T=0 collapses to argmax. It is not a determinism switch.
Decoding Strategies
Temperature alone doesn’t solve everything. Other strategies modify which tokens are considered.
Greedy Decoding
Always pick the highest-probability token:

  P = [0.02, 0.001, 0.85, 0.004, ...]
  Select: token_2 (0.85)

Pros: Deterministic for a given logit vector, fast (no sampling)
Cons: Repetitive, no exploration, misses better paths

  "The best the best the best the best..." ← mode collapse
Top-K Sampling
Only consider the K most likely tokens:

  Original: [0.4, 0.3, 0.15, 0.08, 0.04, 0.02, 0.01, ...]
            (followed by many tiny probabilities)

  top_k=3:  [0.4, 0.3, 0.15, 0, 0, 0, 0, ...]
            Renormalize the three survivors to sum to 1.0

  After renormalization: [0.47, 0.35, 0.18, 0, 0, ...]
            (all other tokens are now impossible)

Sample from the reduced distribution.

Benefit: Prevents very unlikely tokens from being chosen.
Risk: K is fixed, but the shape of the distribution varies from step to step. Sometimes top-5 is enough. Sometimes top-50 is needed.
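The renormalisation step is plain arithmetic. A minimal sketch over the seven probabilities shown, using a hypothetical helper `top_k_filter` that works on probabilities directly (production code usually masks logits instead):

```python
def top_k_filter(probs, k):
    """Keep the k largest probabilities, zero the rest, renormalise to sum to 1."""
    # Indices of the k most likely tokens (ties keep their original order).
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    kept_mass = sum(probs[i] for i in keep)
    return [p / kept_mass if i in keep else 0.0 for i, p in enumerate(probs)]


probs = [0.4, 0.3, 0.15, 0.08, 0.04, 0.02, 0.01]
filtered = top_k_filter(probs, k=3)
print([round(p, 2) for p in filtered])  # [0.47, 0.35, 0.18, 0.0, 0.0, 0.0, 0.0]
```

The surviving tokens keep their relative proportions; only the total is rescaled.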
Top-P (Nucleus) Sampling
Include tokens, most likely first, until cumulative probability reaches P:

  Sorted probs: [0.4, 0.3, 0.15, 0.08, 0.04, ...]
  Cumulative:   [0.4, 0.7, 0.85, 0.93, 0.97, ...]

  top_p=0.9 → include up to 0.93, the first cumulative value at or above 0.9 (four tokens)

Adaptive: includes more tokens when the distribution is flat, fewer when one token dominates.
This is often better than top-k because it adapts to context.
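The adaptive behaviour is easiest to see by counting how many tokens survive for differently shaped distributions. The sketch below assumes the convention used in the example above (the token that crosses the threshold is kept); `top_p_filter` and the three toy distributions are illustrative.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most-likely tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:  # the token that crosses the threshold is included
            break
    kept_mass = sum(probs[i] for i in keep)
    return [q / kept_mass if i in keep else 0.0 for i, q in enumerate(probs)]


def survivors(probs, p):
    return sum(1 for q in top_p_filter(probs, p) if q > 0)


example = [0.4, 0.3, 0.15, 0.08, 0.04, 0.02, 0.01]
dominated = [0.95, 0.02, 0.01, 0.01, 0.01]
flat = [0.2, 0.2, 0.2, 0.2, 0.2]

print(survivors(example, 0.9))    # 4
print(survivors(dominated, 0.9))  # 1
print(survivors(flat, 0.9))       # 5
```

The same `p` yields one candidate when a single token dominates and five when the distribution is flat, which a fixed `k` cannot do.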
Combining Strategies
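Real systems often combine:

1. Apply temperature
2. Apply top-k (e.g., k=50)
3. Apply top-p (e.g., p=0.9)
4. Sample from result

Each filter removes tokens that "shouldn't" be generated.
Order matters: dividing by a positive temperature preserves the ranking of the logits, so it never changes which tokens pass top-k, but it does reshape the probabilities and therefore changes how many tokens pass top-p. Top-p is also computed over whatever survives top-k.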
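A quick way to see how temperature interacts with the two filters is to run them on one logit vector at several temperatures. The sketch below uses a made-up five-token logit vector and hypothetical helper names (`top_k_ids`, `nucleus_size`).

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_ids(probs, k):
    """Indices of the k most likely tokens."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def nucleus_size(probs, p):
    """Number of tokens needed, most likely first, for cumulative probability to reach p."""
    cumulative = 0.0
    for n, q in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += q
        if cumulative >= p:
            return n
    return len(probs)


logits = [4.0, 2.0, 1.0, 0.5, 0.0]  # illustrative values only
for t in (0.5, 1.0, 2.0):
    probs = softmax(logits, t)
    print(f"T={t}: top-2 ids={top_k_ids(probs, 2)}, nucleus(p=0.9)={nucleus_size(probs, 0.9)}")
```

Under these assumptions the top-2 set is identical at every temperature, because dividing by a positive constant preserves rank, while the nucleus grows from 1 to 2 to 4 tokens as the distribution flattens.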
Holtzman, Buys, Du, Forbes, Choi 2020 introduced nucleus sampling because greedy and beam-search decoding produce text that is “bland and strangely repetitive” [holtzman2020]. Nucleus sampling, in their words, gives “diversity without sacrificing fluency and coherence” by truncating “the unreliable tail” of the distribution adaptively — keeping more tokens when the distribution is flat, fewer when one token dominates. That adaptiveness is why top-p has displaced top-k as the production default for open-ended generation.
Takeaway: greedy, top-k, and top-p are three points on the same exploration spectrum — greedy is degenerate (mode-collapse), top-k is rigid, top-p is the adaptive default. Production pipelines typically apply temperature → top-k → top-p in that order.
Practical Generation Settings
Recommended settings for common use cases:
Factual/Deterministic
{
  "temperature": 0,
  "top_p": 1,
  "max_tokens": 256
}
Or with slight randomness:
{
  "temperature": 0.2,
  "top_p": 0.95,
  "max_tokens": 256
}
Code Generation
{
  "temperature": 0.3,
  "top_p": 0.95,
  "max_tokens": 1024
}
Creative Writing
{
  "temperature": 0.9,
  "top_p": 0.95,
  "max_tokens": 2048
}
When in Doubt
{
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 512
}
Takeaway: production-grade defaults are use-case-specific — factual lookups want T≈0.2, code wants T≈0.3, creative writing wants T≈0.9, and a top_p of 0.9–0.95 is the safe nucleus across all of them.
The Autoregressive Problem
Generation has a fundamental limitation:
Each token conditions only on PREVIOUS tokens.
The model can't "look ahead" and fix mistakes.

  Step 1: "The answer is"
  Step 2: "The answer is definitely"
  Step 3: "The answer is definitely 42"   ← committed to "definitely"
  Step 4: "The answer is definitely 42."

What if "42" was wrong? The model already said "definitely" and can't take it back.
Errors propagate through the rest of the sequence.
Consequences:
1. Early mistakes compound
   A wrong direction at step 10 affects all subsequent tokens.
2. Hallucination momentum
   Once the model starts hallucinating, it continues the pattern.
   "The author of Hamlet was Francis Bacon..." continues confidently.
3. No self-correction without explicit mechanisms
   The model doesn't naturally "notice" it's wrong.
   Chain-of-thought helps but doesn't eliminate the problem.
Takeaway: autoregressive generation has no look-ahead — early tokens lock in commitments the model can’t retract. Mitigations live above the loop (chain-of-thought, self-reflection prompts, retry-with-different-seed), not inside it.
Why “Deterministic” Generation Doesn’t Exist
This is the chapter’s load-bearing claim. Most public coverage stops at “set temperature=0 and the output is deterministic.” That is wrong at the hardware level, and the failure mode is concrete:
- You set temperature=0. The argmax of the logit vector is supposed to win every time.
- The runner-up logit is 4.1003198. The winner is 4.1003217. The gap is ~2e-6.
- The same prompt routed to a different GPU SKU computes those logits via a slightly different reduction order in cuBLAS’s matmul kernels, and 4.1003198 becomes 4.1003241.
- Argmax now picks the other token. Same prompt, same seed, same model weights, different word.
The hero diagram at the top of this chapter shows the three runs side by side. NVIDIA’s own cuBLAS documentation [cublas] states the constraint directly: bit-wise reproducibility holds only “when executed on GPUs with the same architecture and the same number of SMs”, and “no longer holds when multiple CUDA streams are active.” Some routines also have alternate implementations, switched on with cublasSetAtomicsMode, that explicitly trade reproducibility for speed: they accumulate results with atomic adds, the order of those adds varies from run to run, and floating-point addition is not associative.
Even on identical hardware, batching effects break reproducibility: the same prompt in batch position 0 vs batch position 7 sees different padding tokens, different attention masking edge cases, and slightly different floating-point accumulation paths. Provider APIs amortise across batches dynamically — your temperature=0 request gets batched with whatever else is in the queue.
Even on identical hardware with the same batch position, provider-side changes break reproducibility: a provider switching numeric format (e.g. FP16 → BF16), swapping kernels, or rolling out a new infra version moves the logits by small amounts. Two of those logits being within 1e-6 of each other is enough to flip the argmax.
1. FLOATING-POINT PRECISION
   Different GPUs/CPUs compute slightly differently
   exp(12.345) on GPU A ≠ exp(12.345) on GPU B (last bits)
   When tokens have similar probabilities, the winner can change
2. BATCHING EFFECTS
   Same prompt in different batch positions → different padding
   Attention patterns slightly affected
3. API VERSION CHANGES
   Provider updates model weights, quantization, infrastructure
   "Same model" may not be the same computation
4. PARALLEL COMPUTATION ORDER
   Operations aren't strictly ordered in parallel execution
   (a + b) + c vs a + (b + c) → floating-point results differ
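The non-associativity in item 4 can be demonstrated on a CPU in plain Python, with no GPU involved. The second half of the sketch is deliberately contrived: the magnitudes are chosen so that a change in reduction order moves a "logit" far enough to flip an argmax in float64, whereas on real hardware the shift is in the last few bits and only matters when two logits are nearly tied.

```python
# Grouping changes the rounded result of a floating-point sum.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False


def accumulate(terms):
    """Sum left to right, rounding after every addition (one fixed reduction order)."""
    total = 0.0
    for term in terms:
        total += term
    return total


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


# The same three partial products, reduced in two different orders.
order_a = accumulate([1e16, 1.0, -1e16])  # the 1.0 is absorbed by 1e16, then cancelled
order_b = accumulate([1e16, -1e16, 1.0])  # the big terms cancel first, the 1.0 survives
rival_logit = 0.5

print(argmax([order_a, rival_logit]))  # 1: the rival wins
print(argmax([order_b, rival_logit]))  # 0: the same logit now wins
```

Nothing about the inputs changed between the two runs; only the order in which the terms were added.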
Practical implications:
- Don't assume same prompt → same output, ever
- If you need reproducibility, cache outputs
- Test with multiple runs, not just one
- Use seed parameter if available (helps but doesn't guarantee)
Takeaway: temperature=0 stops sampling; it does not stop floating-point. Bit-deterministic generation requires same-architecture / same-SM-count / same-batch-position / same-model-version — four conditions production rarely controls. If your application needs reproducibility, cache outputs.
Stopping Generation
How does the model know when to stop?
1. EOS TOKEN
   Model generates <|endoftext|> or equivalent
   Trained to output this when "done"
2. MAX TOKENS
   Hit the limit you specified (max_tokens=256)
   May cut off mid-sentence
3. STOP SEQUENCES
   Custom strings that terminate generation
   stop=["\n", "Human:", "```"]
4. TIMEOUT
   API or system timeout (less common)
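The first three exits can be sketched as a single loop. This is an illustration under assumptions: tokens are modelled as plain strings, `next_token` is a hypothetical callable standing in for one decoding step, and the timeout case is left out because it usually lives in the serving layer rather than the decoding loop.

```python
EOS = "<|endoftext|>"  # assumed end-of-sequence marker


def generate_until_stop(next_token, max_tokens, stop_sequences=(), eos_token=EOS):
    """Collect tokens until EOS, a stop sequence, or the max_tokens budget is hit.

    Returns (text, finish_reason).
    """
    pieces = []
    for _ in range(max_tokens):
        token = next_token()
        if token == eos_token:
            return "".join(pieces), "eos"
        pieces.append(token)
        text = "".join(pieces)
        for stop in stop_sequences:
            idx = text.find(stop)
            if idx != -1:
                # Strip the stop sequence itself so it never reaches the user.
                return text[:idx], "stop_sequence"
    return "".join(pieces), "max_tokens"


stream = iter(["Hello", " world", "\n", "Human:", " hi", EOS])
text, reason = generate_until_stop(lambda: next(stream), max_tokens=16, stop_sequences=["Human:"])
print(repr(text), reason)  # 'Hello world\n' stop_sequence
```

Returning the finish reason alongside the text makes truncation by `max_tokens` visible to the caller instead of looking like a normal completion.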
Takeaway: generation needs an explicit exit signal — EOS, max_tokens, stop sequence, or timeout. Most production bugs in this layer trace to missing or wrong stop-sequence configuration (“Human:” not stripped, max_tokens too small, EOS not respected in custom finetune). For the operator-grade deep dive on budgeting these stops across an agent loop, see Production Agents Part 4: Cost Control [pa-cost].
Code Example
A minimal sampling pipeline pinned to torch==2.3.0 implementing the four-stage filter chain. Note the explicit greedy branch — and the comment that flags what greedy doesn’t give you:
# Tested on torch==2.3.0 + Python 3.11
import torch
import torch.nn.functional as F


def sample_next_token(
    logits: torch.Tensor,  # (vocab_size,)
    temperature: float = 1.0,
    top_k: int = 50,
    top_p: float = 0.9,
    generator: torch.Generator | None = None,  # seed control, see caveat below
) -> int:
    """Sample one token via the temperature → top-k → top-p → softmax → multinomial pipeline.

    NOTE: temperature=0 takes the deterministic argmax of THIS logit vector. It does NOT
    guarantee the same logit vector across hardware, batches, or model versions. For
    bit-identical reproducibility, cache the outputs upstream.
    """
    # Stage 1: temperature, with a greedy short-circuit on T = 0
    if temperature == 0:
        return logits.argmax().item()
    logits = logits / temperature

    # Stage 2: top-k filter (fixed cap, clamped so top_k > vocab_size cannot raise)
    if top_k > 0:
        k = min(top_k, logits.size(-1))
        kth = torch.topk(logits, k)[0][..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))

    # Stage 3: top-p (nucleus) filter (adaptive cap)
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        # Drop tokens once the cumulative prob has exceeded p. Shifting the mask right by
        # one keeps the token that crosses the threshold and always keeps the first token.
        drop = cum_probs > top_p
        drop[..., 1:] = drop[..., :-1].clone()
        drop[..., 0] = False
        logits = logits.masked_fill(
            torch.zeros_like(logits, dtype=torch.bool).scatter_(0, sorted_idx, drop),
            float("-inf"),
        )

    # Stage 4: softmax → multinomial sample
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1, generator=generator).item()


# Example: same logit vector, three production presets
vocab_size = 50257
logits = torch.randn(vocab_size)
factual = sample_next_token(logits, temperature=0.2, top_k=10, top_p=0.9)
creative = sample_next_token(logits, temperature=0.9, top_k=50, top_p=0.95)
greedy = sample_next_token(logits, temperature=0)  # argmax: same vector, same token, every time
Karpathy’s nanoGPT ships a closely related sampling function (temperature scaling plus top-k, without a top-p stage) [karpathy-nanogpt]; the production-grade equivalent in Hugging Face transformers is LogitsProcessorList with TemperatureLogitsWarper, TopKLogitsWarper, TopPLogitsWarper composed in that order [hf-logitsprocessor].
Common Pitfalls & Misconceptions
| Misconception | Why it’s wrong | What to do instead |
|---|---|---|
| “temperature=0 makes generation deterministic” | It stops sampling; it does not stop floating-point. Same prompt + same seed + same model on a different GPU SKU can flip the argmax when two logits are within ~1e-6. cuBLAS reproducibility holds only “with the same architecture and the same number of SMs” [cublas]. | If you need reproducibility, cache outputs. Don’t rely on temperature=0 across providers, batch sizes, or API versions. |
| “Greedy decoding is best for factual tasks” | Greedy produces “bland and strangely repetitive” text on open-ended tasks (Holtzman 2020 [holtzman2020]). Even for factual tasks it picks the locally-likely token, not the globally-correct one, so error propagation is worse, not better. | Use temperature=0.2 + top_p=0.9 for factual Q&A. Reserve T=0 for cases where the answer is a single high-confidence token and you genuinely need argmax. |
| “Top-k is fine, why use top-p?” | Top-k uses a fixed cap regardless of distribution shape. When one token dominates (e.g. after “the”), top-50 keeps 49 garbage tokens. When the distribution is flat, top-50 may exclude valid candidates. | Top-p adapts to the entropy of the distribution at each step. Keep top-p as your default, with top-k as a coarse safety cap (e.g. top_k=50, top_p=0.9). |
| “Higher temperature = more creative” | Above T≈1.5, the model starts emitting low-probability garbage tokens: typos, wrong-language switches, broken syntax. It is not “more creative”, it is less coherent. | Stay in T=0.7–1.0 for creative tasks. Use T>1.2 only when you can post-filter (e.g. brainstorming with downstream human review). |
| “Setting seed=42 makes the API call reproducible” | Provider seeds (e.g. seed in OpenAI Chat Completions [openai-seed]) only seed the sampler within the provider’s infrastructure. They do not pin the GPU SKU, the batch composition, the kernel version, or the model snapshot. The same seed across two API versions can produce different output. | Treat provider seed parameters as best-effort. For real reproducibility, cache the response keyed on prompt + model + version, and persist it. |
| “The model knows when to stop” | The model emits an EOS token if it was trained to, on inputs that look like training data. Custom finetunes often forget; long-running generation hits max_tokens mid-sentence; chat templates can fail to strip “Human:” stop sequences. | Always set max_tokens explicitly. Configure stop=["Human:", "</answer>", ...] for chat-templated models. Inspect the finish reason in production logs. |
| “Autoregressive generation will self-correct if it goes wrong” | It won’t. Each token conditions only on previous tokens, including its own previous errors. Hallucination momentum is real: the model continues the pattern it started. | Add explicit error-recovery prompts (“if you’re not sure, say ‘I don’t know’ and stop”). For agent loops, use the three stop conditions from Ch06. |
Takeaway: most generation bugs in production trace back to one of these seven misconceptions, and the first row alone accounts for a large share of “the model used to work and now it doesn’t” support tickets.
Verify Your Understanding
Before continuing, you should be able to answer these from memory:
- Temperature at three points. What happens mathematically and practically at T=0, T=1, T=2? Name the failure mode at each extreme.
- Top-k arithmetic. Given logits [3.0, 2.0, 1.0, 0.5] and top_k=2, which tokens can be selected? Compute the renormalised probabilities (hint: softmax over [3.0, 2.0]).
- Why does autoregressive generation produce repetitive text? Name the mechanism (hint: error propagation + likelihood-max bias). Why does nucleus sampling [holtzman2020] help, and why doesn’t it fully solve the problem?
- The temperature=0 reproducibility trap. Your colleague swears temperature=0 is deterministic. They run the same prompt twice on the same A100 and get the same answer. They run it on an H100 and get a different answer. Walk through what changed and name the specific cuBLAS-level mechanism. What does caching the response give you that no settings combination can?
- Top-k vs top-p. Give a concrete query where top-p beats top-k, and another where top-k is sufficient. Explain why top-p is the better default.
What’s Next
Generation produces text token-by-token from learned probabilities. The next chapter — Generation → Retrieval — addresses the fact that the model can only generate from what it was trained on. Retrieval grounds the generator in external facts so it can answer questions whose answers aren’t in its weights.
References
- [holtzman2020] Holtzman, A., Buys, J., Du, L., Forbes, M., Choi, Y. The Curious Case of Neural Text Degeneration. ICLR 2020. arXiv:1904.09751. Source of nucleus (top-p) sampling. Documents that greedy and likelihood-max beam search produce “bland and strangely repetitive” text; nucleus sampling gives “diversity without sacrificing fluency and coherence.” Cited in §§ Building On Previous Knowledge, Decoding Strategies, Common Pitfalls & Misconceptions, Verify Your Understanding.
- [cublas] NVIDIA. cuBLAS Library Documentation. docs.nvidia.com/cuda/cublas. Explicitly states bit-wise reproducibility requires “same architecture and same number of SMs”, that the guarantee “no longer holds when multiple CUDA streams are active”, and that routines run with atomics enabled (via cublasSetAtomicsMode) are “not guaranteed to be bit-wise reproducible.” The hardware-level root cause of temperature=0 non-determinism. Cited in §§ Building On Previous Knowledge, Why “Deterministic” Generation Doesn’t Exist, Common Pitfalls & Misconceptions.
- [karpathy-nanogpt] Karpathy, A. nanoGPT: sample.py. GitHub: karpathy/nanoGPT. Practitioner reference for a closely related sampling loop (temperature and top-k) to the pipeline implemented in this chapter’s Code Example. Cited in § Code Example.
- [openai-seed] OpenAI. Chat Completions API: seed parameter. platform.openai.com/docs/api-reference/chat/create#chat-create-seed. Best-effort sampler seed; the docs explicitly note that determinism is “not guaranteed” and that the system_fingerprint field changes across model snapshots. Cited in § Common Pitfalls & Misconceptions.
- [hf-logitsprocessor] Hugging Face. transformers Generation: LogitsProcessor. huggingface.co/docs/transformers/internal/generation_utils. Production-grade equivalent of the manual pipeline in the Code Example: TemperatureLogitsWarper, TopKLogitsWarper, TopPLogitsWarper, composed in order. Cited in § Code Example.
- [pa-cost] Production Agents, Part 4: Cost Control & Token Budgets. Operator-grade companion on max_tokens budgets and stopping discipline in production. Cross-series bridge from § Stopping Generation.