From Attention to Generation - Producing Text Token by Token
Deep dive into text generation: the generation pipeline, temperature and sampling, decoding strategies, and why deterministic generation doesn't exist
12 minutes • Intermediate Level • Dec 2024
Building On Previous Knowledge
In the previous progression, you learned how attention lets tokens incorporate information from context. The transformer processes the entire sequence, and each position ends up with a rich representation that “knows about” the whole input.
But there’s still a gap: how does a rich vector become the next word?
This progression bridges that gap. After all the attention and feedforward layers, the model outputs a probability distribution over the entire vocabulary. Generation is the process of sampling from that distribution, one token at a time.
What Goes Wrong Without This:
Generation Failure Patterns
Symptom: Your LLM-powered app gives different answers every time.
Cause: Temperature > 0 introduces randomness. This is a feature,
not a bug—but you might not want it for your use case.
Symptom: Model outputs are repetitive and boring.
Cause: You set temperature = 0 (greedy decoding).
Model always picks highest probability = mode collapse.
Symptom: LLM generates coherent first paragraphs, then rambles.
Cause: Autoregressive generation accumulates errors.
Each token conditions on previous (possibly wrong) tokens.
The Generation Pipeline
Text generation is autoregressive: generate one token, append it, generate the next.
GENERATION PIPELINE:
Input: "The capital of France is"
Step 1: Tokenize → [464, 3139, 286, 4881, 318]
Step 2: Forward pass through transformer
Input embeddings → Attention layers → FFN layers
→ Final hidden states for each position
Step 3: Project to vocabulary (LM head)
Last position's hidden state → (vocab_size,) logits
hidden_state @ W_vocab → [2.3, -1.1, 0.5, ..., 1.8]
↑
50,257 values (one per token)
Step 4: Convert to probabilities
softmax(logits / temperature) → probabilities
[0.001, 0.0002, 0.3, ..., 0.001]
↑
Sum = 1.0, each value ∈ [0,1]
Step 5: Sample next token
Based on probabilities → token "Paris" (ID 6342)
Step 6: Repeat from Step 2 with extended sequence
New input: "The capital of France is Paris"
→ Generate next token...
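To make the loop concrete, here is a minimal sketch of the pipeline above in PyTorch. The model and tokenizer objects are hypothetical stand-ins for any causal LM interface that returns logits of shape (1, seq_len, vocab_size); real libraries differ in the details.

import torch

def generate(model, tokenizer, prompt: str, max_new_tokens: int = 20) -> str:
    # Hypothetical interfaces: tokenizer.encode -> list[int],
    # model(ids) -> logits of shape (1, seq_len, vocab_size)
    ids = torch.tensor([tokenizer.encode(prompt)])         # Step 1: tokenize
    for _ in range(max_new_tokens):
        logits = model(ids)                                # Step 2: forward pass
        next_logits = logits[0, -1, :]                     # Step 3: last position's logits
        probs = torch.softmax(next_logits, dim=-1)         # Step 4: probabilities
        next_id = torch.multinomial(probs, num_samples=1)  # Step 5: sample
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # Step 6: extend and repeat
    return tokenizer.decode(ids[0].tolist())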
Logits to Probabilities
The model outputs raw scores (logits), not probabilities:
Logits: [2.3, -1.1, 0.5, 4.1, -0.3, ...]
          ↑               ↑
        "the"          "Paris"
Softmax converts to probabilities:
P(token_i) = exp(logit_i) / Σ exp(logit_j)
After softmax: [0.02, 0.001, 0.004, 0.85, 0.002, ...]
                                     ↑
                      "Paris" = 85% probability
The token with highest logit gets highest probability.
But it's not 100%—other tokens have non-zero chance.
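You can verify the conversion directly. This snippet applies softmax to the five example logits above (a real model has ~50K logits, so the exact probabilities differ from the annotated ones):

import torch

logits = torch.tensor([2.3, -1.1, 0.5, 4.1, -0.3])
probs = torch.softmax(logits, dim=-1)
print(probs)        # ≈ [0.14, 0.005, 0.02, 0.83, 0.01] -- highest logit wins, but not with certainty
print(probs.sum())  # 1.0 -- softmax always normalizes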
Temperature: Controlling Randomness
Temperature scales logits before softmax, controlling distribution “sharpness”:
Temperature Effects:
T = 0.0 (or very small):
    Dividing by a tiny T sharpens the distribution toward one-hot
    (implementations special-case T = 0 as argmax rather than dividing by zero)
    Always picks the highest-probability token (greedy)
    Output: deterministic, conservative, may be repetitive
T = 1.0 (default):
    Logits unchanged; sample according to the trained distribution
    Output: balanced creativity and coherence
T = 2.0 (high):
    logits / 2 → flatter distribution; low-probability tokens become more likely
    Output: more random, creative, potentially incoherent
Visual (softmax of logits [4, 2, 1] at each temperature, rounded):
                     T=0.1    T=1.0    T=2.0
Token A (logit 4):   ~100%     84%      63%
Token B (logit 2):    ~0%      11%      23%
Token C (logit 1):    ~0%       4%      14%
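The table can be reproduced in a couple of lines (the percentages above are rounded from this output):

import torch

logits = torch.tensor([4.0, 2.0, 1.0])
for t in (0.1, 1.0, 2.0):
    print(t, torch.softmax(logits / t, dim=-1))
# t=0.1 -> ~[1.00, 0.00, 0.00]  (near one-hot)
# t=1.0 ->  [0.84, 0.11, 0.04]
# t=2.0 ->  [0.63, 0.23, 0.14]  (flatter)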
Practical guidance:
Temperature Guidelines
┌─────────────┬─────────────┬──────────────────────────────┐
│ Temperature │ Use Case    │ Behavior                     │
├─────────────┼─────────────┼──────────────────────────────┤
│ 0.0 - 0.3   │ Factual Q&A │ Conservative, deterministic  │
│ 0.5 - 0.7   │ Code gen    │ Balanced, mostly predictable │
│ 0.7 - 1.0   │ Creative    │ More varied, still coherent  │
│ 1.0 - 1.5   │ Brainstorm  │ High variety, some wild      │
└─────────────┴─────────────┴──────────────────────────────┘
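In application code these ranges often become named presets; the names and exact values below are illustrative, not a standard:

# Hypothetical presets encoding the guidelines above
SAMPLING_PRESETS = {
    "factual":    {"temperature": 0.2, "top_p": 0.9},   # conservative Q&A
    "code":       {"temperature": 0.6, "top_p": 0.9},   # mostly predictable
    "creative":   {"temperature": 0.9, "top_p": 0.95},  # varied but coherent
    "brainstorm": {"temperature": 1.2, "top_p": 0.98},  # high variety
}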
Decoding Strategies
Temperature alone doesn’t solve everything. Other strategies modify which tokens are considered.
Greedy Decoding
Always pick the highest-probability token:
P = [0.02, 0.001, 0.85, 0.004, ...]
              ↓
Select: token_2 (0.85)
Pros: Deterministic, fast (no sampling)
Cons: Repetitive, no exploration, misses better paths
"The best the best the best the best..." ← mode collapse
Top-K Sampling
Only consider the K most likely tokens:
Original: [0.4, 0.3, 0.15, 0.08, 0.04, 0.02, 0.01, ...]
└── many tiny probabilities
top_k=3: [0.4, 0.3, 0.15, 0, 0, 0, 0, ...]
└─────────────┘ Renormalize these to sum to 1.0
After renormalization: [0.47, 0.35, 0.18, 0, 0, ...]
└── others impossible
Sample from reduced distribution.
Benefit: Prevents very unlikely tokens from being chosen.
Risk: K is fixed, but vocabulary distribution varies.
Sometimes top-5 is enough. Sometimes top-50 is needed.
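A minimal sketch of the top-k filter on the example distribution, including the renormalization step:

import torch

probs = torch.tensor([0.4, 0.3, 0.15, 0.08, 0.04, 0.02, 0.01])
topk_vals, topk_idx = torch.topk(probs, k=3)
filtered = torch.zeros_like(probs)
filtered[topk_idx] = topk_vals / topk_vals.sum()  # renormalize the survivors
print(filtered)  # [0.47, 0.35, 0.18, 0, 0, 0, 0]
next_id = torch.multinomial(filtered, num_samples=1)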
Top-P (Nucleus) Sampling
Include tokens until cumulative probability reaches P:
Sorted probs: [0.4, 0.7, 0.15, 0.08, 0.04, ...]
Cumulative:   [0.4, 0.7, 0.85, 0.93, 0.97, ...]
                                ↑
Top-p = 0.9 → include through 0.93, the first cumulative value to reach 0.9
Adaptive: includes more tokens when distribution is flat,
fewer when one token dominates.
This is often better than top-k because it adapts to context.
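The same idea as a sketch: sort, accumulate, and cut the tail once the mass reaches p (keeping the token that crosses the threshold):

import torch

probs = torch.tensor([0.4, 0.3, 0.15, 0.08, 0.04, 0.02, 0.01])
p = 0.9
sorted_probs, sorted_idx = torch.sort(probs, descending=True)
cumulative = torch.cumsum(sorted_probs, dim=-1)          # [0.40, 0.70, 0.85, 0.93, ...]
cutoff = int(torch.searchsorted(cumulative, p)) + 1      # first index reaching 0.9
nucleus = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # 4 tokens survive here
next_id = sorted_idx[torch.multinomial(nucleus, num_samples=1)]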
Combining Strategies
Real systems often combine:
1. Apply temperature
2. Apply top-k (e.g., k=50)
3. Apply top-p (e.g., p=0.9)
4. Sample from result
Each filter removes tokens that "shouldn't" be generated.
Order matters: temperature rescales the distribution before filtering.
Scaling preserves ranking, so top-k membership is unchanged, but the
reshaped probability mass changes how many tokens pass the top-p cutoff.
Autoregressive Error Accumulation
Each token conditions only on PREVIOUS tokens.
The model can't "look ahead" and fix mistakes.
Step 1: "The answer is"
Step 2: "The answer is definitely"
Step 3: "The answer is definitely 42" ← committed to "definitely"
Step 4: "The answer is definitely 42."
What if "42" was wrong?
Model already said "definitely"—can't take it back.
Error propagation through the sequence.
Consequences:
Error Propagation Consequences
1. Early mistakes compound
Wrong direction at step 10 affects all subsequent tokens.
2. Hallucination momentum
Once model starts hallucinating, it continues the pattern.
"The author of Hamlet was Francis Bacon..." continues confidently.
3. No self-correction without explicit mechanisms
Model doesn't naturally "notice" it's wrong.
Chain-of-thought helps but doesn't eliminate the problem.
Why “Deterministic” Generation Doesn’t Exist
Even with temperature = 0, outputs can vary:
Sources of Non-Determinism:
1. FLOATING-POINT PRECISION
Different GPUs/CPUs compute slightly differently
exp(12.345) on GPU A ≠ exp(12.345) on GPU B (last bits)
When tokens have similar probabilities, winner can change
2. BATCHING EFFECTS
Same prompt in different batch positions → different padding
Attention patterns slightly affected
3. API VERSION CHANGES
Provider updates model weights, quantization, infrastructure
"Same model" may not be same computation
4. PARALLEL COMPUTATION ORDER
Operations aren't strictly ordered in parallel execution
(a + b) + c vs a + (b + c) → floating point differs
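The associativity point is easy to demonstrate in plain Python; GPU parallel reductions group sums differently between runs, which can perturb the last bits of a logit:

a, b, c = 0.1, 1e16, -1e16
print((a + b) + c)  # 0.0 -- 0.1 is rounded away when added to the huge value first
print(a + (b + c))  # 0.1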
Practical implications:
Dealing with Non-Determinism
- Don't assume same prompt → same output, ever
- If you need reproducibility, cache outputs
- Test with multiple runs, not just one
- Use seed parameter if available (helps but doesn't guarantee)
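For example, with a local PyTorch model, fixing the seed makes sampling repeatable on the same hardware and library version (it does not protect against the other sources above):

import torch

torch.manual_seed(42)
probs = torch.softmax(torch.randn(100), dim=-1)
print(torch.multinomial(probs, num_samples=1))  # same token on every run of this script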
Stopping Generation
How does the model know when to stop?
STOPPING CONDITIONS:
1. EOS TOKEN
Model generates <|endoftext|> or equivalent
Trained to output this when "done"
2. MAX TOKENS
Hit the limit you specified (max_tokens=256)
May cut off mid-sentence
3. STOP SEQUENCES
Custom strings that terminate generation
stop=["\n", "Human:", "```"]
4. TIMEOUT
API or system timeout (less common)
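A sketch of where these checks sit in the generation loop. The eos_token_id value, model, and tokenizer are illustrative stand-ins, and hosted APIs run these checks server-side; the sampling details appear in the next section.

import torch

def generate_until_stop(model, tokenizer, ids, max_new_tokens=256,
                        eos_token_id=50256, stop=("Human:",)):
    # ids: (1, seq_len) tensor of token IDs
    for _ in range(max_new_tokens):                        # condition 2: max tokens
        logits = model(ids)[0, -1, :]
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        if next_id.item() == eos_token_id:                 # condition 1: EOS token
            break
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
        text = tokenizer.decode(ids[0].tolist())
        if any(s in text for s in stop):                   # condition 3: stop sequences
            break
    return ids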
Code Example
Sampling from a probability distribution:
import torch
import torch.nn.functional as F

def sample_next_token(
    logits: torch.Tensor,  # (vocab_size,)
    temperature: float = 1.0,
    top_k: int = 50,
    top_p: float = 0.9,
) -> int:
    """
    Sample next token using temperature, top-k, and top-p.
    """
    # Step 1: Apply temperature
    if temperature > 0:
        logits = logits / temperature
    else:
        # Greedy: return argmax
        return logits.argmax().item()

    # Step 2: Apply top-k
    if top_k > 0:
        indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
        logits[indices_to_remove] = float('-inf')

    # Step 3: Apply top-p (nucleus)
    if top_p < 1.0:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        # Remove tokens with cumulative probability above threshold
        sorted_indices_to_remove = cumulative_probs > top_p
        # Keep at least one token
        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
        sorted_indices_to_remove[..., 0] = 0
        indices_to_remove = sorted_indices[sorted_indices_to_remove]
        logits[indices_to_remove] = float('-inf')

    # Step 4: Convert to probabilities and sample
    probs = F.softmax(logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)
    return next_token.item()

# Example usage
vocab_size = 50257
logits = torch.randn(vocab_size)

# Different settings for different use cases
factual_token = sample_next_token(logits, temperature=0.2, top_k=10, top_p=0.9)
creative_token = sample_next_token(logits, temperature=0.9, top_k=50, top_p=0.95)
greedy_token = sample_next_token(logits, temperature=0)  # always argmax
Key Takeaways
1. Generation is autoregressive: one token at a time, each conditioned on previous
2. Model outputs logits → softmax → probabilities over the vocabulary
3. Temperature controls distribution sharpness:
- T=0: deterministic (greedy)
- T=1: as trained
- T>1: more random
4. Top-k and top-p filter the token distribution:
- Top-k: only consider k most likely
- Top-p: consider tokens until cumulative probability reaches p
5. Autoregressive generation can't look ahead—early errors propagate
6. True determinism doesn't exist due to floating-point and infrastructure variations
Verify Your Understanding
Before proceeding, you should be able to:
Explain what happens when you set temperature to 0 vs 1 vs 2 — What changes mathematically? What changes practically?
Given logits [3.0, 2.0, 1.0, 0.5] and top_k=2, what tokens can be selected? Calculate the renormalized probabilities.
Why does autoregressive generation sometimes produce repetitive text? What’s the mechanism, and how do sampling strategies help?
Your application needs consistent outputs for the same input. What can you do? What can’t you guarantee?
Explain the difference between top-k and top-p. When might top-p be better?