
AI Engineering Series

Attention to Generation - Producing Text Token by Token

Deep dive into text generation: the generation pipeline, temperature and sampling, decoding strategies, and why deterministic generation doesn't exist

Building On Previous Knowledge

In the previous progression, you learned how attention lets tokens incorporate information from context. The transformer processes the entire sequence, and each position ends up with a rich representation that “knows about” the whole input.

But there’s still a gap: how does a rich vector become the next word?

This progression bridges that gap. After all the attention and feedforward layers, the model outputs a probability distribution over the entire vocabulary. Generation is the process of sampling from that distribution, one token at a time.

What Goes Wrong Without This:

Generation Failure Patterns

The Generation Pipeline

Text generation is autoregressive: generate one token, append it, generate the next.

Generation Pipeline
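
To make the loop concrete, here is a minimal sketch assuming a Hugging Face-style causal language model and tokenizer (model, tokenizer, and the greedy argmax choice are placeholders; sampling strategies come later in this article):

import torch

def generate(model, tokenizer, prompt: str, max_new_tokens: int = 50) -> str:
    """Minimal autoregressive loop: predict one token, append it, repeat."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")  # (1, seq_len)

    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(input_ids).logits         # (1, seq_len, vocab_size)
            next_token_logits = logits[0, -1, :]     # scores for the NEXT position only
            next_token = next_token_logits.argmax()  # greedy choice, for simplicity

            # Append the chosen token and feed the longer sequence back in
            input_ids = torch.cat([input_ids, next_token.view(1, 1)], dim=1)

            if next_token.item() == tokenizer.eos_token_id:
                break  # the model signalled it is finished

    return tokenizer.decode(input_ids[0])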

Logits to Probabilities

The model outputs raw scores (logits), not probabilities:

Logits to Probabilities
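
As a minimal sketch with a made-up four-token vocabulary: softmax exponentiates each logit and normalizes, so the raw scores become probabilities that sum to 1.

import torch
import torch.nn.functional as F

# Toy "vocabulary" of 4 tokens with raw scores from the model
logits = torch.tensor([2.0, 1.0, 0.1, -1.0])

# Softmax: exponentiate, then normalize so everything sums to 1
probs = F.softmax(logits, dim=-1)

print(probs)        # approximately [0.64, 0.23, 0.10, 0.03]
print(probs.sum())  # 1.0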

Temperature: Controlling Randomness

Temperature scales logits before softmax, controlling distribution “sharpness”:

Temperature Effects
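
A minimal sketch with the same toy logits: dividing the logits by the temperature before softmax sharpens the distribution when T < 1 and flattens it when T > 1. With these made-up numbers, the top token gets roughly 86% of the probability at T = 0.5, 64% at T = 1, and 45% at T = 2.

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1, -1.0])

# Temperature divides the logits before softmax:
#   T < 1 sharpens the distribution (mass concentrates on the top token)
#   T = 1 leaves it unchanged
#   T > 1 flattens it (unlikely tokens become more likely)
for temperature in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / temperature, dim=-1)
    print(f"T={temperature}:", [round(p, 3) for p in probs.tolist()])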

Practical guidance:

Temperature Guidelines

Decoding Strategies

Temperature alone doesn’t solve everything. Other strategies modify which tokens are considered.

Greedy Decoding

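A one-line sketch with the same toy logits: greedy decoding always takes the single highest-scoring token, which is deterministic given identical logits but tends toward repetitive text.

import torch

logits = torch.tensor([2.0, 1.0, 0.1, -1.0])

# Greedy decoding: always pick the highest-scoring token (index 0 here)
next_token = logits.argmax().item()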

Top-K Sampling

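A minimal sketch with the same toy logits: keep only the k highest-scoring tokens, mask the rest, renormalize, and sample among the survivors.

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1, -1.0])
k = 2

# Keep only the top-k logits; everything else gets -inf (probability 0 after softmax)
topk_values, topk_indices = torch.topk(logits, k)
filtered = torch.full_like(logits, float('-inf'))
filtered[topk_indices] = topk_values

probs = F.softmax(filtered, dim=-1)  # approximately [0.73, 0.27, 0.00, 0.00]
next_token = torch.multinomial(probs, num_samples=1).item()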

Top-P (Nucleus) Sampling

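A minimal sketch with the same toy logits: sort tokens by probability, keep the smallest set whose cumulative probability reaches top_p (including the token that crosses the threshold), renormalize, and sample from that nucleus.

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1, -1.0])
top_p = 0.9

probs = F.softmax(logits, dim=-1)                # approximately [0.64, 0.23, 0.10, 0.03]
sorted_probs, sorted_indices = torch.sort(probs, descending=True)
cumulative = torch.cumsum(sorted_probs, dim=-1)  # approximately [0.64, 0.87, 0.97, 1.00]

# Keep tokens until the cumulative probability reaches top_p,
# including the first token that crosses the threshold
keep = (cumulative - sorted_probs) < top_p       # [True, True, True, False] here
nucleus = sorted_indices[keep]

nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize over the nucleus
next_token = nucleus[torch.multinomial(nucleus_probs, num_samples=1)].item()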

Combining Strategies

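In practice these knobs are used together: the code example at the end of this article applies temperature scaling first, then top-k filtering, then top-p filtering, and only then samples from whatever probability mass survives.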

Practical Generation Settings

Recommended settings for common use cases:

Factual/Deterministic

{
  "temperature": 0,
  "top_p": 1,
  "max_tokens": 256
}

Or with slight randomness:

{
  "temperature": 0.2,
  "top_p": 0.95,
  "max_tokens": 256
}

Code Generation

{
  "temperature": 0.3,
  "top_p": 0.95,
  "max_tokens": 1024
}

Creative Writing

{
  "temperature": 0.9,
  "top_p": 0.95,
  "max_tokens": 2048
}

When in Doubt

{
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 512
}
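
A hedged sketch of how these settings are typically passed to a provider, assuming the OpenAI Python SDK (the model name and prompt are placeholders; most other providers expose similar parameters under similar names):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Summarize how attention works in two sentences."}],
    temperature=0.2,      # factual use case: low randomness
    top_p=0.95,
    max_tokens=256,
)
print(response.choices[0].message.content)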

The Autoregressive Problem

Generation has a fundamental limitation:

Autoregressive Limitation

Consequences:

Error Propagation Consequences

Why “Deterministic” Generation Doesn’t Exist

Even with temperature = 0, outputs can vary:

Sources of Non-Determinism

Practical implications:

Dealing with Non-Determinism
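
A small sketch of what you can and cannot control locally (illustrative only; the toy distribution is made up): seeding the sampler makes repeated runs reproduce the same random choices on one machine with one library version, but it does not touch other sources of variation, such as floating-point reductions on GPUs that can perturb the logits themselves.

import torch

probs = torch.tensor([0.6, 0.3, 0.1])

# Same seed -> the sampler makes the same choice on this machine
torch.manual_seed(42)
first = torch.multinomial(probs, num_samples=1).item()

torch.manual_seed(42)
second = torch.multinomial(probs, num_samples=1).item()

assert first == second  # reproducible here, on this hardware and library version

# What this does NOT guarantee: identical logits across GPUs, batch sizes,
# or library versions, so even temperature=0 outputs can still differ.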

Stopping Generation

How does the model know when to stop?

Stopping Conditions
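
A rough sketch of the usual checks (the function and parameter names are illustrative, not a specific library's API): generation normally halts when the model emits its end-of-sequence token, when the max_tokens budget is exhausted, or when a caller-supplied stop sequence appears.

def should_stop(generated_ids, generated_text, eos_token_id, max_new_tokens, stop_sequences):
    """Illustrative stopping checks for a generation loop."""
    # 1. The model emitted its end-of-sequence token
    if generated_ids and generated_ids[-1] == eos_token_id:
        return True
    # 2. The hard token budget (max_tokens) is exhausted
    if len(generated_ids) >= max_new_tokens:
        return True
    # 3. A caller-supplied stop sequence appeared in the decoded text
    if any(seq in generated_text for seq in stop_sequences):
        return True
    return False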

Code Example

Sampling from a probability distribution:

import torch
import torch.nn.functional as F

def sample_next_token(
    logits: torch.Tensor,  # (vocab_size,)
    temperature: float = 1.0,
    top_k: int = 50,
    top_p: float = 0.9,
) -> int:
    """
    Sample next token using temperature, top-k, and top-p.
    """
    # Step 1: Apply temperature
    if temperature > 0:
        logits = logits / temperature
    else:
        # Greedy: return argmax
        return logits.argmax().item()

    # Step 2: Apply top-k
    if top_k > 0:
        indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
        logits[indices_to_remove] = float('-inf')

    # Step 3: Apply top-p (nucleus)
    if top_p < 1.0:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

        # Remove tokens once the cumulative probability exceeds top_p, shifting
        # the mask right so the first token that crosses the threshold is still
        # kept (this also guarantees at least one token survives)
        sorted_indices_to_remove = cumulative_probs > top_p
        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
        sorted_indices_to_remove[..., 0] = 0

        indices_to_remove = sorted_indices[sorted_indices_to_remove]
        logits[indices_to_remove] = float('-inf')

    # Step 4: Convert to probabilities and sample
    probs = F.softmax(logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)

    return next_token.item()

# Example usage
vocab_size = 50257
logits = torch.randn(vocab_size)

# Different settings for different use cases
factual_token = sample_next_token(logits, temperature=0.2, top_k=10, top_p=0.9)
creative_token = sample_next_token(logits, temperature=0.9, top_k=50, top_p=0.95)
greedy_token = sample_next_token(logits, temperature=0)  # always argmax

Key Takeaways

  • The model outputs logits over the whole vocabulary; softmax turns them into probabilities, and generation samples one token at a time, feeding each choice back into the input.
  • Temperature controls how sharp or flat that distribution is; top-k and top-p control which tokens are eligible at all.
  • Generation is autoregressive and cannot revise earlier tokens, so early mistakes propagate.
  • Even at temperature 0, bit-identical outputs cannot be guaranteed; design for consistency rather than assuming determinism.
  • Generation stops when the model emits its end-of-sequence token or the max_tokens budget runs out.

Verify Your Understanding

Before proceeding, you should be able to:

Explain what happens when you set temperature to 0 vs 1 vs 2 — What changes mathematically? What changes practically?

Given logits [3.0, 2.0, 1.0, 0.5] and top_k=2, what tokens can be selected? Calculate the renormalized probabilities.

Why does autoregressive generation sometimes produce repetitive text? What’s the mechanism, and how do sampling strategies help?

Your application needs consistent outputs for the same input. What can you do? What can’t you guarantee?

Explain the difference between top-k and top-p. When might top-p be better?


What’s Next

After this, you can:

  • Continue → Generation → Retrieval — grounding generation in external facts
  • Apply → Build LLM applications with appropriate generation settings
