Attention to Generation - Producing Text Token by Token | Intentional / Deliberate / Engineering

Two-panel hero. Left, the canonical per-token sampling pipeline: logits ∈ ℝ^|V| → ÷ temperature → top-k filter (k=50) → top-p nucleus filter (p=0.9) → softmax → multinomial sample → next token. A footer note quotes Holtzman 2020: greedy + likelihood-max yields 'bland and strangely repetitive' text; nucleus sampling gives 'diversity without sacrificing fluency and coherence.' Right: three same-prompt runs with temperature=0 and seed=42. A100 run #1 picks 'Paris' (logit 4.1003217, runner-up 'Lyon' at 4.1003198). A100 run #2 reproduces 'Paris' bit-identically. H100 run #3 picks 'Lyon' (logit 4.1003241): reduction order changed, winner flipped. A red flashing panel reads 'temperature=0 stops sampling. It does not stop floating-point.' Four causes listed: cuBLAS reduction order varies across SM counts and CUDA streams, atomic-add is not bit-reproducible, argmax can flip when two logits are within ~1e-6, API upgrades silently change quantisation. Footer: 'Cache outputs when you need reproducibility. Settings cannot give it to you across hardware.' Caption: Generation = sampling pipeline + a non-determinism caveat tutorials skip

Building On Previous Knowledge

The previous chapter ended with one Takeaway: scaled dot-product attention is four linear-algebra steps, and the critical one is the √d_k divisor. Attention produces a sequence of context-aware hidden states — one vector per position, each one a soft mixture over every other token. This chapter picks up the data path at the final hidden state and turns it into a token.

The transformer’s last layer projects each position’s hidden state through a W_vocab matrix into a (vocab_size,) vector of logits — raw, unnormalised scores, one per token in the vocabulary. Generation is what happens between that logit vector and the next word the user sees. The math is small: temperature scales, top-k clips, top-p clips adaptively, softmax normalises, multinomial samples. The hero diagram at the top of this chapter walks the pipeline end-to-end.

Where most generation tutorials stop: they explain temperature, top-k, top-p, and recommend temperature=0 for “deterministic” output. Then a developer pins temperature=0, ships, and discovers that the same prompt against the same OpenAI model returns different responses across providers, GPU types, or even API versions. The chapter’s named differentiator: temperature=0 is not bit-deterministic across hardware — and the reasons are documented in NVIDIA’s own cuBLAS reference [cublas]. Holtzman et al. 2020 [holtzman2020] explains why greedy decoding produces bland text; the hardware-level explanation for why “deterministic” isn’t is the half public coverage skips.

Takeaway: generation has a clean four-step sampling pipeline and a dirty hardware caveat — production teams that rely on temperature=0 for reproducibility ship a bug; production teams that cache outputs ship reproducibility.

What Goes Wrong Without This:

Generation Failure Patterns

Symptom: Your LLM-powered app gives different answers every time.
Cause:   Temperature > 0 introduces randomness. This is a feature,
         not a bug, but you might not want it for your use case.

Symptom: Model outputs are repetitive and boring.
Cause:   You set temperature = 0 (greedy decoding).
         The model always picks the highest-probability token = mode collapse.

Symptom: LLM generates coherent first paragraphs, then rambles.
Cause:   Autoregressive generation accumulates errors.
         Each token conditions on previous (possibly wrong) tokens.

The Generation Pipeline

Text generation is autoregressive: generate one token, append it, generate the next.

Generation Pipeline

Input: "The capital of France is"

Step 1: Tokenize
→ [464, 3139, 286, 4881, 318]

Step 2: Forward pass through transformer
Input embeddings → Attention layers → FFN layers
→ Final hidden states for each position

Step 3: Project to vocabulary (LM head)
Last position's hidden state → (vocab_size,) logits
hidden_state @ W_vocab → [2.3, -1.1, 0.5, ..., 1.8]
↑
50,257 values (one per token)

Step 4: Convert to probabilities
softmax(logits / temperature) → probabilities
[0.001, 0.0002, 0.3, ..., 0.001]
↑
Sum = 1.0, each value ∈ [0,1]

Step 5: Sample next token
Based on probabilities → token "Paris" (ID 6342)

Step 6: Repeat from Step 2 with extended sequence
New input: "The capital of France is Paris"
→ Generate next token...
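The six steps above can be sketched as a loop. This is a minimal illustration, not a real model: `VOCAB`, `toy_logits`, and `generate` are hypothetical names, and the "forward pass" is a hard-coded rule standing in for the transformer and LM head.

```python
import math
import random

# Hypothetical 7-token vocabulary, for illustration only.
VOCAB = ["The", " capital", " of", " France", " is", " Paris", "."]


def toy_logits(context):
    """Stand-in for Steps 2-3: a real model would run the transformer and LM head.

    This toy rule simply favours whichever token follows the last one in VOCAB.
    """
    favoured = (context[-1] + 1) % len(VOCAB)
    return [4.0 if i == favoured else 0.0 for i in range(len(VOCAB))]


def softmax(logits, temperature=1.0):
    """Step 4: temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt_ids, max_new_tokens, temperature=1.0, seed=0):
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(ids), temperature)               # Steps 2-4
        next_id = rng.choices(range(len(VOCAB)), weights=probs)[0]  # Step 5: sample
        ids.append(next_id)                                         # Step 6: extend, repeat
    return ids


prompt = [0, 1, 2, 3, 4]  # "The capital of France is"
out = generate(prompt, max_new_tokens=2, temperature=0.05)
print("".join(VOCAB[i] for i in out))
```

The only state carried between iterations is the growing list of token ids, which is what "autoregressive" means in practice: each new token is chosen from a distribution computed over everything generated so far.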

Logits to Probabilities

The model outputs raw scores (logits), not probabilities:

Logits to Probabilities

Logits: [2.3, -1.1, 0.5, 4.1, -0.3, ...]
          ↑               ↑
        "the"          "Paris"

Softmax converts to probabilities:
P(token_i) = exp(logit_i) / Σ_j exp(logit_j)

After softmax (illustrative, taken over the five logits shown):
[0.137, 0.005, 0.023, 0.826, 0.010]
                        ↑
        "Paris" ≈ 83% probability

The token with highest logit gets highest probability.
But it's not 100%—other tokens have non-zero chance.
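To make the conversion concrete, here is a small sketch that applies softmax to the five logits shown. It assumes the vocabulary contains only those five tokens, so the numbers are for this toy case and will not match figures quoted for a full 50,257-token vocabulary.

```python
import math


def softmax(logits):
    """exp(logit_i) / sum_j exp(logit_j), shifted by the max to avoid overflow."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.3, -1.1, 0.5, 4.1, -0.3]  # index 0 = "the", index 3 = "Paris"
probs = softmax(logits)

for logit, p in zip(logits, probs):
    print(f"logit {logit:>5} -> probability {p:.3f}")
```

Subtracting the maximum before exponentiating does not change the result, because the shift cancels between numerator and denominator; it only keeps `exp` from overflowing on large logits.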

Takeaway: generation is autoregressive — each token conditions only on tokens already in the context, and the model decides one token at a time by sampling from a softmax-normalised vocabulary distribution.

Temperature: Controlling Randomness

Temperature scales logits before softmax, controlling distribution “sharpness”:

Temperature Effects

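T → 0 (T = 0 is handled as a special case):
Dividing by 0 is undefined; as T → 0, softmax(logits / T) approaches
a one-hot vector on the max logit, so implementations switch to argmax
Always pick highest probability token (greedy)
Output: no sampling randomness, conservative, may be repetitive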

T = 1.0 (default):
logits unchanged
Sample according to trained distribution
Output: balanced creativity and coherence

T = 2.0 (high):
logits / 2 → flatter distribution
Low probability tokens become more likely
Output: more random, creative, potentially incoherent

Visual (three-token vocabulary, probabilities rounded):
                     T=0.1    T=1.0    T=2.0
Token A (logit 4):   ~100%    84%      63%
Token B (logit 2):   <0.01%   11%      23%
Token C (logit 1):   <0.01%   4%       14%
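A table of this kind can be computed directly. The sketch below assumes a vocabulary of exactly three tokens with logits 4, 2, and 1; `softmax_t` is a hypothetical helper name.

```python
import math


def softmax_t(logits, temperature):
    """Softmax of logits / temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.0, 1.0]  # tokens A, B, C
table = {t: softmax_t(logits, t) for t in (0.1, 1.0, 2.0)}

for t, probs in table.items():
    print(f"T={t}: " + ", ".join(f"{p:.1%}" for p in probs))
```

Lower temperatures widen the gaps between scaled logits, so nearly all probability mass lands on token A; higher temperatures shrink the gaps and move mass toward B and C.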

Practical guidance:

Temperature Guidelines

┌──────────────────┬───────────────┬───────────────────────────────┐
│  Temperature     │  Use Case     │  Behavior                     │
├──────────────────┼───────────────┼───────────────────────────────┤
│  0.0 - 0.3       │  Factual Q&A  │  Conservative, near-greedy    │
│  0.5 - 0.7       │  Code gen     │  Balanced, mostly predictable │
│  0.7 - 1.0       │  Creative     │  More varied, still coherent  │
│  1.0 - 1.5       │  Brainstorm   │  High variety, some wild      │
└──────────────────┴───────────────┴───────────────────────────────┘

Takeaway: temperature is a one-knob exploration/exploitation dial — T<1 sharpens the distribution toward the highest-probability tokens, T>1 flattens it, T=0 collapses to argmax. It is not a determinism switch.

Decoding Strategies

Temperature alone doesn’t solve everything. Other strategies modify which tokens are considered.

Greedy Decoding

Always pick highest probability token:

P = [0.02, 0.001, 0.85, 0.004, ...]
↓
Select: token_2 (0.85)

Pros: Deterministic given identical logits, fast (no sampling)
Cons: Repetitive, no exploration, misses better paths

"The best the best the best the best..." ← mode collapse

Top-K Sampling

Only consider the K most likely tokens:

Original: [0.4, 0.3, 0.15, 0.08, 0.04, 0.02, 0.01, ...]
└── many tiny probabilities

top_k=3: [0.4, 0.3, 0.15, 0, 0, 0, 0, ...]
└─────────────┘
Renormalize these to sum to 1.0

After renormalization: [0.47, 0.35, 0.18, 0, 0, ...]
└── others impossible

Sample from reduced distribution.

Benefit: Prevents very unlikely tokens from being chosen.
Risk: K is fixed, but vocabulary distribution varies.
Sometimes top-5 is enough. Sometimes top-50 is needed.
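The filter-and-renormalise step can be sketched in a few lines. `top_k_filter` is a hypothetical helper that works on a plain list of probabilities; real implementations usually mask logits instead, which gives the same distribution after softmax.

```python
def top_k_filter(probs, k):
    """Zero out everything except the k most likely tokens, then renormalise."""
    # Indices sorted by probability, highest first (ties keep the lower index).
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(ranked[:k])
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


probs = [0.4, 0.3, 0.15, 0.08, 0.04, 0.02, 0.01]
filtered = top_k_filter(probs, k=3)
print([round(p, 2) for p in filtered])
```

With `k=3` the surviving mass is 0.85, so each kept probability is divided by 0.85 and the rest become impossible.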

Top-P (Nucleus) Sampling

Include tokens until cumulative probability reaches P:

Sorted probs: [0.4, 0.3, 0.15, 0.08, 0.04, ...]
Cumulative: [0.4, 0.7, 0.85, 0.93, 0.97, ...]
↑
Top-p=0.9 → include up to 0.93

Adaptive: includes more tokens when distribution is flat,
fewer when one token dominates.

This is often better than top-k because it adapts to context.
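A minimal sketch of the nucleus cut, assuming the convention used above that the token which crosses the threshold is kept. `top_p_filter` is a hypothetical helper name.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)            # the token that crosses the threshold is included
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept_set = set(kept)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept_set else 0.0 for i in range(len(probs))]


def nucleus_size(probs, p):
    return sum(1 for x in top_p_filter(probs, p) if x > 0)


example = [0.4, 0.3, 0.15, 0.08, 0.04, 0.02, 0.01]
peaked = [0.95, 0.03, 0.02]
flat = [0.25, 0.25, 0.25, 0.25]

print(nucleus_size(example, 0.9), nucleus_size(peaked, 0.9), nucleus_size(flat, 0.9))
```

The same `p=0.9` keeps a different number of tokens in each case, which is the adaptiveness described above: the cut follows the shape of the distribution rather than a fixed count.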

Combining Strategies

Real systems often combine:

1. Apply temperature
2. Apply top-k (e.g., k=50)
3. Apply top-p (e.g., p=0.9)
4. Sample from result

Each filter removes tokens that "shouldn't" be generated.
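Order matters: temperature preserves token ranking, so it never changes
which tokens pass top-k, but it reshapes the probabilities and therefore
changes how many tokens pass top-p.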
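How temperature interacts with the two filters can be checked on a toy example. Dividing by a positive temperature preserves the ranking of the logits, so the top-k set stays the same, while the size of the nucleus changes because the probabilities themselves change. The five logits and helper names below are illustrative assumptions.

```python
import math


def softmax_t(logits, temperature):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_ids(logits, k):
    """Indices of the k largest logits."""
    return set(sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k])


def nucleus_size(probs, p):
    """Number of top tokens needed for cumulative probability to reach p."""
    cumulative, count = 0.0, 0
    for prob in sorted(probs, reverse=True):
        cumulative += prob
        count += 1
        if cumulative >= p:
            break
    return count


logits = [4.0, 2.0, 1.0, 0.5, 0.0]
temperatures = (0.5, 1.0, 2.0)

top_k_sets = [top_k_ids([x / t for x in logits], k=2) for t in temperatures]
nucleus_sizes = [nucleus_size(softmax_t(logits, t), p=0.9) for t in temperatures]

print("top-k sets:", top_k_sets)
print("nucleus sizes:", nucleus_sizes)
```

In this example the top-2 set is identical at every temperature, while the nucleus grows as temperature rises and the distribution flattens.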

Holtzman, Buys, Du, Forbes, Choi 2020 introduced nucleus sampling because greedy and beam-search decoding produce text that is “bland and strangely repetitive” [holtzman2020]. Nucleus sampling, in their words, gives “diversity without sacrificing fluency and coherence” by truncating “the unreliable tail” of the distribution adaptively — keeping more tokens when the distribution is flat, fewer when one token dominates. That adaptiveness is why top-p has displaced top-k as the production default for open-ended generation.

Takeaway: greedy, top-k, and top-p are three points on the same exploration spectrum — greedy is degenerate (mode-collapse), top-k is rigid, top-p is the adaptive default. Production pipelines typically apply temperature → top-k → top-p in that order.

Practical Generation Settings

Recommended settings for common use cases:

Factual/Deterministic

{
  "temperature": 0,
  "top_p": 1,
  "max_tokens": 256
}

Or with slight randomness:

{
  "temperature": 0.2,
  "top_p": 0.95,
  "max_tokens": 256
}

Code Generation

{
  "temperature": 0.3,
  "top_p": 0.95,
  "max_tokens": 1024
}

Creative Writing

{
  "temperature": 0.9,
  "top_p": 0.95,
  "max_tokens": 2048
}

When in Doubt

{
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 512
}

Takeaway: production-grade defaults are use-case-specific — factual lookups want T≈0.2, code wants T≈0.3, creative writing wants T≈0.9, and a top_p of 0.9–0.95 is the safe nucleus across all of them.

The Autoregressive Problem

Generation has a fundamental limitation:

Autoregressive Limitation

Each token conditions only on PREVIOUS tokens.
The model can't "look ahead" and fix mistakes.

Step 1: "The answer is"
Step 2: "The answer is definitely"
Step 3: "The answer is definitely 42" ← committed to "definitely"
Step 4: "The answer is definitely 42."

What if "42" was wrong?
Model already said "definitely"—can't take it back.
Error propagation through the sequence.

Consequences:

Error Propagation Consequences

1. Early mistakes compound
 Wrong direction at step 10 affects all subsequent tokens.

2. Hallucination momentum
 Once model starts hallucinating, it continues the pattern.
 "The author of Hamlet was Francis Bacon..." continues confidently.

3. No self-correction without explicit mechanisms
 Model doesn't naturally "notice" it's wrong.
 Chain-of-thought helps but doesn't eliminate the problem.

Takeaway: autoregressive generation has no look-ahead — early tokens lock in commitments the model can’t retract. Mitigations live above the loop (chain-of-thought, self-reflection prompts, retry-with-different-seed), not inside it.

Why “Deterministic” Generation Doesn’t Exist

This is the chapter’s load-bearing claim. Most public coverage stops at “set temperature=0 and the output is deterministic.” That is wrong at the hardware level, and the failure mode is concrete:

You set temperature=0. The argmax of the logit vector is supposed to win every time.
The runner-up logit is 4.1003198. The winner is 4.1003217. The gap is ~2e-6.
The same prompt routed to a different GPU SKU computes those logits via a slightly different reduction order in cuBLAS’s matmul kernels — and 4.1003198 becomes 4.1003241.
Argmax now picks the other token. Same prompt, same seed, same model weights, different word.

The hero diagram at the top of this chapter shows the three runs side by side. NVIDIA’s own cuBLAS documentation [cublas] states the constraint directly: bit-wise reproducibility holds only “when executed on GPUs with the same architecture and the same number of SMs”, and “no longer holds when multiple CUDA streams are active.” Routines that use atomics, once enabled through cublasSetAtomicsMode, explicitly trade reproducibility for speed: atomic adds complete in whatever order the hardware schedules them, and floating-point addition is not associative.

Even on identical hardware, batching effects break reproducibility: the same prompt in batch position 0 vs batch position 7 sees different padding tokens, different attention masking edge cases, and slightly different floating-point accumulation paths. Provider APIs amortise across batches dynamically — your temperature=0 request gets batched with whatever else is in the queue.

Even on identical hardware with the same batch position, model upgrades break reproducibility: a provider changing numeric precision (for example FP16 to BF16), swapping kernels, or rolling out a new infra version moves the logits by small amounts. Two of those logits being within 1e-6 of each other is enough to flip the argmax.

Sources of Non-Determinism

1. FLOATING-POINT PRECISION
 Different GPUs/CPUs compute slightly differently
 exp(12.345) on GPU A can differ from exp(12.345) on GPU B in the last bits
 When tokens have similar probabilities, winner can change

2. BATCHING EFFECTS
 Same prompt in different batch positions → different padding
 Attention patterns slightly affected

3. API VERSION CHANGES
 Provider updates model weights, quantization, infrastructure
 "Same model" may not be same computation

4. PARALLEL COMPUTATION ORDER
 Operations aren't strictly ordered in parallel execution
 (a + b) + c vs a + (b + c) → floating point differs
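The effect of summation order can be reproduced on a CPU with ordinary Python floats. This is a toy illustration of non-associativity, not a reproduction of cuBLAS behaviour: the per-token "contributions" are made-up numbers, and the explicit loop in `accumulate` stands in for a reduction whose order depends on the kernel.

```python
def accumulate(terms):
    """Left-to-right floating-point sum, written as a loop so the order is explicit."""
    total = 0.0
    for term in terms:
        total += term
    return total


def greedy_pick(contributions, reverse_order=False):
    """Sum each token's contributions in the given order, then take the argmax."""
    logits = {}
    for token, terms in contributions.items():
        ordered = list(reversed(terms)) if reverse_order else terms
        logits[token] = accumulate(ordered)
    return max(logits, key=logits.get), logits


# Hypothetical partial sums feeding two near-tied logits.
contributions = {
    "Paris": [0.1, 0.2, 0.3],
    "Lyon": [0.3, 0.2, 0.1],
}

winner_a, logits_a = greedy_pick(contributions)
winner_b, logits_b = greedy_pick(contributions, reverse_order=True)

print(winner_a, logits_a)
print(winner_b, logits_b)
```

Both tokens receive the same three numbers, so mathematically their logits are equal. In floating point the two orderings differ in the last bit, and greedy selection follows whichever ordering the reduction happened to use.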

Practical implications:

Dealing with Non-Determinism

- Don't assume same prompt → same output, ever
- If you need reproducibility, cache outputs
- Test with multiple runs, not just one
- Use seed parameter if available (helps but doesn't guarantee)
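The "cache outputs" advice can be sketched as follows. Everything here is a hypothetical illustration: `ResponseCache`, its in-memory dictionary, and the `generate_fn` callback are assumptions, and a production version would persist entries to durable storage.

```python
import hashlib
import json


class ResponseCache:
    """In-memory response cache keyed on prompt + model + version + settings."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def make_key(prompt, model, version, params):
        # sort_keys makes the key independent of dict insertion order.
        payload = json.dumps(
            {"prompt": prompt, "model": model, "version": version, "params": params},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_generate(self, prompt, model, version, params, generate_fn):
        key = self.make_key(prompt, model, version, params)
        if key not in self._store:
            self._store[key] = generate_fn(prompt)  # only called on a cache miss
        return self._store[key]


calls = []


def flaky_generate(prompt):
    """Stand-in for a model call that returns something different every time."""
    calls.append(prompt)
    return f"answer #{len(calls)}"


cache = ResponseCache()
params = {"temperature": 0, "top_p": 1, "max_tokens": 256}
first = cache.get_or_generate("Capital of France?", "model-x", "v1", params, flaky_generate)
second = cache.get_or_generate("Capital of France?", "model-x", "v1", params, flaky_generate)
third = cache.get_or_generate("Capital of France?", "model-x", "v2", params, flaky_generate)
```

Including the model version in the key means an upgrade produces a cache miss rather than silently serving a response generated by a different snapshot.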

Takeaway: temperature=0 stops sampling; it does not stop floating-point. Bit-deterministic generation requires same-architecture / same-SM-count / same-batch-position / same-model-version — four conditions production rarely controls. If your application needs reproducibility, cache outputs.

Stopping Generation

How does the model know when to stop?

Stopping Conditions

1. EOS TOKEN
 Model generates <|endoftext|> or equivalent
 Trained to output this when "done"

2. MAX TOKENS
 Hit the limit you specified (max_tokens=256)
 May cut off mid-sentence

3. STOP SEQUENCES
 Custom strings that terminate generation
 stop=["\n", "Human:", "```"]

4. TIMEOUT
 API or system timeout (less common)
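The first three conditions can be combined in one loop. This is a simplified sketch: tokens are plain strings, `generate_with_stops` and its finish-reason labels are hypothetical names, and the timeout condition is left out because it depends on the serving stack.

```python
def generate_with_stops(token_stream, eos_token, max_tokens, stop_sequences):
    """Consume tokens until EOS, a stop sequence, or the token budget is hit.

    Returns (text, finish_reason).
    """
    text = ""
    emitted = 0
    for token in token_stream:
        if token == eos_token:
            return text, "eos"
        text += token
        emitted += 1
        # A stop sequence may span several tokens, so search the accumulated text.
        for stop in stop_sequences:
            cut = text.find(stop)
            if cut != -1:
                return text[:cut], "stop_sequence"  # strip the stop string itself
        if emitted >= max_tokens:
            return text, "max_tokens"
    return text, "stream_ended"


EOS = "<|endoftext|>"
r_eos = generate_with_stops(["Hello", " world", EOS, "ignored"], EOS, 10, ["Human:"])
r_stop = generate_with_stops(["A", "\n", "Hu", "man:", " hi"], EOS, 10, ["Human:"])
r_max = generate_with_stops(["a", "b", "c"], EOS, 2, ["Human:"])

print(r_eos, r_stop, r_max)
```

Returning the finish reason alongside the text makes truncated outputs visible in logs: a `max_tokens` finish usually means the response was cut off rather than completed.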

Takeaway: generation needs an explicit exit signal — EOS, max_tokens, stop sequence, or timeout. Most production bugs in this layer trace to missing or wrong stop-sequence configuration (“Human:” not stripped, max_tokens too small, EOS not respected in custom finetune). For the operator-grade deep dive on budgeting these stops across an agent loop, see Production Agents Part 4: Cost Control [pa-cost].

Code Example

A minimal sampling pipeline pinned to torch==2.3.0 implementing the four-stage filter chain. Note the explicit greedy branch — and the comment that flags what greedy doesn’t give you:

# Tested on torch==2.3.0 + Python 3.11
import torch
import torch.nn.functional as F


def sample_next_token(
    logits: torch.Tensor,  # (vocab_size,)
    temperature: float = 1.0,
    top_k: int = 50,
    top_p: float = 0.9,
    generator: torch.Generator | None = None,  # seed control — see caveat below
) -> int:
    """Sample one token via the temperature → top-k → top-p → softmax → multinomial pipeline.

    NOTE: temperature=0 takes the deterministic argmax of THIS logit vector. It does NOT
    guarantee the same logit vector across hardware, batches, or model versions. For
    bit-identical reproducibility, cache the outputs upstream.
    """
    # Stage 1: temperature — greedy short-circuit on T = 0
    if temperature == 0:
        return logits.argmax().item()
    logits = logits / temperature

    # Stage 2: top-k filter (fixed cap)
    if top_k > 0:
        k = min(top_k, logits.size(-1))  # guard: torch.topk raises if k exceeds vocab size
        kth = torch.topk(logits, k)[0][..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))

    # Stage 3: top-p (nucleus) filter (adaptive cap)
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        # Drop tokens whose cumulative prob exceeds p — keep the first token unconditionally
        drop = cum_probs > top_p
        drop[..., 1:] = drop[..., :-1].clone()
        drop[..., 0] = False
        logits = logits.masked_fill(
            torch.zeros_like(logits, dtype=torch.bool).scatter_(0, sorted_idx, drop),
            float("-inf"),
        )

    # Stage 4: softmax → multinomial sample
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1, generator=generator).item()


# Example: same logit vector, three production presets
vocab_size = 50257
logits = torch.randn(vocab_size)

factual = sample_next_token(logits, temperature=0.2, top_k=10, top_p=0.9)
creative = sample_next_token(logits, temperature=0.9, top_k=50, top_p=0.95)
greedy = sample_next_token(logits, temperature=0)  # argmax — same vector, same token, every time

Karpathy’s nanoGPT ships a simpler version of the same loop (temperature scaling, an optional top-k filter, then multinomial sampling, with no nucleus step) [karpathy-nanogpt]; the production-grade equivalent in Hugging Face transformers is a LogitsProcessorList with TemperatureLogitsWarper, TopKLogitsWarper, and TopPLogitsWarper composed in that order [hf-logitsprocessor].

Common Pitfalls & Misconceptions

Misconception	Why it’s wrong	What to do instead
“`temperature=0` makes generation deterministic”	It stops sampling; it does not stop floating-point. Same prompt + same seed + same model on a different GPU SKU can flip the argmax when two logits are within `~1e-6`. cuBLAS reproducibility holds only “with the same architecture and the same number of SMs” [cublas].	If you need reproducibility, cache outputs. Don’t rely on `temperature=0` across providers, batch sizes, or API versions.
“Greedy decoding is best for factual tasks”	Greedy produces “bland and strangely repetitive” text on open-ended tasks (Holtzman 2020 [holtzman2020]). Even for factual tasks it picks the locally most likely token, not the globally correct one, and it does nothing to stop an early wrong token from propagating.	Use `temperature=0.2` + `top_p=0.9` for factual Q&A. Reserve `T=0` for cases where the answer is a single high-confidence token and you genuinely need argmax.
“Top-k is fine, why use top-p?”	Top-k uses a fixed cap regardless of distribution shape. When one token dominates (e.g. after `The capital of France is`), top-50 keeps 49 garbage tokens. When the distribution is flat, top-50 may exclude valid candidates.	Top-p adapts to the entropy of the distribution at each step. Keep top-p as your default, with top-k as a coarse safety cap (e.g. `top_k=50, top_p=0.9`).
“Higher temperature = more creative”	Above `T≈1.5`, the model starts emitting low-probability garbage tokens: typos, wrong-language switches, broken syntax. It is not “more creative”, it is less coherent.	Stay in `T=0.7–1.0` for creative tasks. Use `T>1.2` only when you can post-filter (e.g. brainstorming with downstream human review).
“Setting `seed=42` makes the API call reproducible”	A provider seed (such as `seed` in OpenAI Chat Completions [openai-seed]) only seeds the sampler within the provider’s infrastructure. It does not pin the GPU SKU, the batch composition, the kernel version, or the model snapshot. Same `seed` across two API versions can produce different output.	Treat provider `seed` parameters as best-effort. For real reproducibility, cache the response keyed on prompt + model + version, and persist it.
“The model knows when to stop”	The model emits an EOS token if it was trained to, on inputs that look like training data. Custom finetunes often forget; long-running generation hits `max_tokens` mid-sentence; chat templates can fail to strip “Human:” stop sequences.	Always set `max_tokens` explicitly. Configure `stop=["Human:", "</answer>", ...]` for chat-templated models. Inspect the finish reason in production logs.
“Autoregressive generation will self-correct if it goes wrong”	It won’t. Each token conditions only on previous tokens, including its own previous errors. Hallucination momentum is real: the model continues the pattern it started.	Add explicit error-recovery prompts (“if you’re not sure, say ‘I don’t know’ and stop”). For agent loops, use the three stop conditions from Ch06.

Takeaway: most generation bugs in production trace back to one of these seven misconceptions, and the first row is a frequent root cause of “the model used to work and now it doesn’t” support tickets.

Verify Your Understanding

Before continuing, you should be able to answer these from memory:

Temperature at three points. What happens mathematically and practically at T=0, T=1, T=2? Name the failure mode at each extreme.
Top-k arithmetic. Given logits [3.0, 2.0, 1.0, 0.5] and top_k=2, which tokens can be selected? Compute the renormalised probabilities (hint: softmax over [3.0, 2.0]).
Why does autoregressive generation produce repetitive text? Name the mechanism (hint: error propagation + likelihood-max bias). Why does nucleus sampling [holtzman2020] help, and why doesn’t it fully solve the problem?
The temperature=0 reproducibility trap. Your colleague swears temperature=0 is deterministic. They run the same prompt twice on the same A100 and get the same answer. They run it on an H100 and get a different answer. Walk through what changed and name the specific cuBLAS-level mechanism. What does caching the response give you that no settings combination can?
Top-k vs top-p. Give a concrete query where top-p beats top-k, and another where top-k is sufficient. Explain why top-p is the better default.

What’s Next

Generation produces text token-by-token from learned probabilities. The next chapter — Generation → Retrieval — addresses the fact that the model can only generate from what it was trained on. Retrieval grounds the generator in external facts so it can answer questions whose answers aren’t in its weights.

References

[holtzman2020] Holtzman, A., Buys, J., Du, L., Forbes, M., Choi, Y. The Curious Case of Neural Text Degeneration. ICLR 2020. arXiv:1904.09751. Source of nucleus (top-p) sampling. Documents that greedy and likelihood-max beam search produce “bland and strangely repetitive” text; nucleus sampling gives “diversity without sacrificing fluency and coherence.” Cited in §§ Building On Previous Knowledge, Decoding Strategies, Common Pitfalls & Misconceptions, Verify Your Understanding.
[cublas] NVIDIA. cuBLAS Library Documentation. docs.nvidia.com/cuda/cublas. Explicitly states bit-wise reproducibility requires “same architecture and same number of SMs”, that the guarantee “no longer holds when multiple CUDA streams are active”, and that routines run with atomics enabled (via cublasSetAtomicsMode) are “not guaranteed to be bit-wise reproducible.” The hardware-level root cause of temperature=0 non-determinism. Cited in §§ Building On Previous Knowledge, Why “Deterministic” Generation Doesn’t Exist, Common Pitfalls & Misconceptions.
[karpathy-nanogpt] Karpathy, A. nanoGPT (sample.py and the generate method in model.py). GitHub: karpathy/nanoGPT. Practitioner reference for the temperature, top-k, and multinomial stages of the sampling pipeline implemented in this chapter’s Code Example; it has no nucleus step. Cited in § Code Example.
[openai-seed] OpenAI. Chat Completions API — seed parameter. platform.openai.com/docs/api-reference/chat/create#chat-create-seed. Best-effort sampler seed; the docs explicitly note that determinism is “not guaranteed” and that the system_fingerprint field changes across model snapshots. Cited in § Common Pitfalls & Misconceptions.
[hf-logitsprocessor] Hugging Face. transformers Generation — LogitsProcessor. huggingface.co/docs/transformers/internal/generation_utils. Production-grade equivalent of the manual pipeline in the Code Example: TemperatureLogitsWarper, TopKLogitsWarper, TopPLogitsWarper, composed in order. Cited in § Code Example.
[pa-cost] Production Agents — Part 4: Cost Control & Token Budgets. Operator-grade companion on max_tokens budgets and stopping discipline in production. Cross-series bridge from § Stopping Generation.