
AI Engineering Series

Text to Tokens - The Foundation

Deep dive into tokenization: why models can't read text directly, subword algorithms like BPE, practical patterns, and the pitfalls that cause production failures

Why Tokenization Matters

Tokenization seems like preprocessing trivia. It’s not.

Every LLM interaction starts here. Tokenization determines what the model can see, how much it costs, and what it can generate. Bugs at this layer cause failures that are extremely difficult to diagnose because they look like model problems.

If you skip understanding tokenization:

  • You’ll wonder why non-English users complain about costs (hint: they’re paying 3-5x more per concept)
  • You’ll struggle with structured output when tokens don’t align with your expected boundaries
  • You’ll miscount context window usage and either truncate critical information or pay for tokens you don’t need

What Goes Wrong Without This:

Tokenization Failure Patterns

Why Models Can’t Read Text

Neural networks are math machines. They do matrix multiplications. They need numbers.

The Tokenization Gap

Text is a sequence of characters. Characters have Unicode values, but using raw Unicode doesn’t work:

Raw Unicode Problems
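
To make the gap concrete, here is a small sketch comparing raw Unicode code points with the token IDs a real tokenizer produces (using tiktoken, the same library as the code example later in this article). Exact IDs and splits depend on the tokenizer version; the point is that code points give one number per character with no shared structure, while subword tokens compress frequent chunks.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4-era vocabulary

text = "tokenization"

# Raw Unicode: one code point per character, ~150k possible values,
# and no notion of which character sequences are frequent.
print([ord(ch) for ch in text])

# Subword tokens: frequent chunks collapse into single IDs.
tokens = enc.encode(text)
print(tokens)                              # far fewer numbers than characters
print([enc.decode([t]) for t in tokens])   # the pieces each ID stands for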

Why Not Just Split on Words?

Word-Level Tokenization Problems
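
Before the subword answer, a toy sketch of the core failure of a pure word-level vocabulary: any word not seen during training has no ID at all (the classic out-of-vocabulary problem). The vocabulary and the <UNK> convention below are made up for illustration.

# Toy word-level "tokenizer": the vocabulary is frozen at training time.
train_text = "the cat sat on the mat"
vocab = {word: i for i, word in enumerate(sorted(set(train_text.split())))}

def encode_words(text: str) -> list[int]:
    # Unseen words have no ID and collapse to a placeholder (-1 here, i.e. <UNK>).
    return [vocab.get(word, -1) for word in text.split()]

print(encode_words("the cat sat"))        # every word is known
print(encode_words("the dragon yawned"))  # unseen words are lost entirely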

Subword Tokenization

The solution: break words into pieces. Common pieces become tokens.

Subword Tokenization Benefits

Three dominant algorithms:

Algorithm | Used By | Key Idea
BPE (Byte Pair Encoding) | GPT, LLaMA, Claude | Merge the most frequent byte pairs
WordPiece | BERT | Merge the pair that maximizes likelihood
SentencePiece | T5, multilingual models | Language-agnostic; operates on the raw text stream with no pre-tokenization

BPE: How It Works

Start with character-level vocabulary. Repeatedly merge the most frequent adjacent pair.

BPE Training Process
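
A minimal sketch of that loop on a toy corpus: start from characters, count adjacent pairs weighted by word frequency, merge the most frequent pair into a new symbol, and repeat. Real tokenizers work over bytes and far larger corpora, but the loop is the same; the corpus and the number of merges below are illustrative.

from collections import Counter

def pair_counts(words: dict) -> Counter:
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge(words: dict, pair: tuple) -> dict:
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word starts as a tuple of characters, with a frequency.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3, tuple("newer"): 4}

for step in range(6):
    counts = pair_counts(corpus)
    best = max(counts, key=counts.get)     # the most frequent adjacent pair
    corpus = merge(corpus, best)
    print(f"merge {step + 1}: {best} -> {''.join(best)!r}")

print(list(corpus))  # frequent words have collapsed into single symbols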

Result: Common words become single tokens. Rare words decompose.

BPE Results

Practical Tokenization Patterns

Whitespace handling:

Whitespace in Tokens
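
A quick check with tiktoken: the same word with and without a leading space maps to different token IDs, because BPE vocabularies like OpenAI's typically fold the preceding space into the token. Exact IDs depend on the tokenizer version.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["Hello", " Hello", "Hello ", "  Hello"]:
    tokens = enc.encode(text)
    print(f"{text!r:12} -> {tokens} -> {[enc.decode([t]) for t in tokens]}")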

Case sensitivity:

Case Sensitivity
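
The same experiment for case: "hello", "Hello", and "HELLO" are different byte sequences, so they get different token IDs, and unusual casing often splits into more pieces. A small sketch (exact splits vary by vocabulary):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["hello", "Hello", "HELLO", "hELLO"]:
    tokens = enc.encode(text)
    print(f"{text!r:10} -> {len(tokens)} token(s): {[enc.decode([t]) for t in tokens]}")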

Numbers:

Number Tokenization
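
Numbers are just character strings to the tokenizer, and long digit runs are usually split into chunks of a few digits rather than kept whole, which is one reason models are unreliable at arithmetic. A sketch (the exact chunking depends on the vocabulary):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["7", "42", "1234", "123456789", "3.14159", "1,000,000"]:
    pieces = [enc.decode([t]) for t in enc.encode(text)]
    print(f"{text!r:14} -> {pieces}")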

Code:

Code Tokenization
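
Code mixes identifiers, punctuation, and indentation, and each behaves differently: common keywords and runs of spaces often become single tokens, while project-specific identifiers shatter into pieces. A sketch with tiktoken (splits vary by vocabulary):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

snippet = "def calculate_total(items):\n    return sum(item.price for item in items)"
tokens = enc.encode(snippet)
print(f"{len(snippet)} characters -> {len(tokens)} tokens")
print([enc.decode([t]) for t in tokens])  # note how indentation, keywords, and identifiers split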

Tokenizer Differences Matter:

Tokenizer Incompatibility
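
Token counts and IDs are tokenizer-specific, so never count with one tokenizer and truncate, cache, or budget with another. A quick comparison across three vocabularies that ship with tiktoken:

import tiktoken

text = "Tokenization differs across tokenizers, especially for café, naïve, and 日本語."

for name in ["gpt2", "p50k_base", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name:12} -> {len(enc.encode(text))} tokens")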

The Context Window Problem

Every model has a maximum context length. Measured in tokens, not characters.

Model Context Lengths

Tokens ≠ Characters ≠ Words:

Token Counting Rule of Thumb
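
A commonly cited rule of thumb for English is roughly 4 characters, or about three-quarters of a word, per token, but the only count you can trust comes from the model's own tokenizer. A small sketch of budgeting that way; the fits_in_budget helper and the 1,000-token budget are illustrative, not a real API.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def fits_in_budget(text: str, max_tokens: int) -> bool:
    """Budget in tokens, using the model's own tokenizer, never in characters or words."""
    return len(enc.encode(text)) <= max_tokens

text = "Some document you plan to stuff into the prompt. " * 100
print(f"characters: {len(text)}")
print(f"words:      {len(text.split())}")
print(f"tokens:     {len(enc.encode(text))}")
print(f"fits in an illustrative 1,000-token budget: {fits_in_budget(text, 1000)}")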

Tokenization Pitfalls

Non-English Languages

Non-English Token Inefficiency
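
To see the gap, compare roughly equivalent sentences and count tokens per sentence: the non-English versions typically need noticeably more tokens for the same meaning. The translations below are approximate, and the exact ratios depend on the tokenizer and the text.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Roughly equivalent sentences (translations are approximate).
samples = {
    "English":  "How can I reset my password?",
    "French":   "Comment puis-je réinitialiser mon mot de passe ?",
    "Japanese": "パスワードをリセットするにはどうすればよいですか？",
    "Hindi":    "मैं अपना पासवर्ड कैसे रीसेट करूँ?",
}

for lang, text in samples.items():
    print(f"{lang:9} {len(enc.encode(text)):3} tokens  ({len(text)} characters)")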

Rare Words and Neologisms

Handling New Words
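
New or rare words still encode; they just shatter into many small pieces, which costs more tokens and gives the model a weaker signal. A sketch with a common word, a well-known proper noun, and a made-up product name (the last is hypothetical, purely for illustration):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["database", "Kubernetes", "Zyntrifex"]:  # "Zyntrifex" is made up
    tokens = enc.encode(word)
    print(f"{word!r:14} -> {len(tokens)} token(s): {[enc.decode([t]) for t in tokens]}")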

Adversarial Inputs

Unicode Tricks
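
Strings that look identical on screen can tokenize very differently: zero-width characters, full-width forms, and homoglyphs all change the token sequence, which matters for filters, deduplication, and prompt-injection defenses. A sketch:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

variants = {
    "plain":            "password",
    "zero-width space": "pass\u200bword",  # invisible character in the middle
    "full-width":       "ｐａｓｓｗｏｒｄ",   # looks similar, different code points
}

for label, text in variants.items():
    tokens = enc.encode(text)
    print(f"{label:17} {len(tokens)} tokens: {tokens}")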

Token Boundaries Affect Generation

Generation Boundary Issues
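
This is also what lies behind the customerID question in the verification section below: different casings and separators of the "same" field name are entirely different token sequences, so the model can drift between them unless the prompt and examples pin down the exact form. A sketch of the splits:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for field in ["customerID", "customerId", "customer_id", "customer_ID"]:
    pieces = [enc.decode([t]) for t in enc.encode(field)]
    print(f"{field!r:15} -> {pieces}")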

Special Tokens

Every tokenizer has reserved tokens for structure:

Special Token Types
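
With tiktoken you can see how these are guarded: by default, encoding text that contains a special-token string raises an error, so user input cannot smuggle one in, and you must opt in explicitly. A sketch with cl100k_base's end-of-text marker:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "first document <|endoftext|> second document"

try:
    enc.encode(text)  # special-token strings are rejected by default
except ValueError as err:
    print(f"refused: {err}")

# Opting in maps the marker to its reserved ID instead of ordinary subwords.
print(enc.encode(text, allowed_special={"<|endoftext|>"}))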

Code Example

Minimal tokenization exploration with tiktoken (OpenAI’s tokenizer):

import tiktoken

# Load GPT-4 tokenizer
enc = tiktoken.encoding_for_model("gpt-4")

def explore_tokenization(text: str) -> None:
    """Show how text becomes tokens."""
    tokens = enc.encode(text)
    print(f"Text: {text!r}")
    print(f"Tokens: {tokens}")
    print(f"Count: {len(tokens)}")

    # Decode each token to see the pieces
    pieces = [enc.decode([t]) for t in tokens]
    print(f"Pieces: {pieces}")
    print()

# Common word - single token
explore_tokenization("hello")

# Longer word - splits into subword pieces
explore_tokenization("unbelievable")

# Code - mixed tokenization
explore_tokenization("def calculate_total():")

# Numbers - often split
explore_tokenization("123456789")

# Non-English - more tokens per concept
explore_tokenization("Hello")      # English
explore_tokenization("Bonjour")    # French
explore_tokenization("こんにちは")  # Japanese

Key Takeaways


Verify Your Understanding

Before proceeding, you should be able to:

Explain why “Hello” and “ Hello” are different tokens. What does whitespace handling in tokenizers tell you about how prompts are represented?

Predict relative token counts for English vs. Japanese — Which will have more tokens? By roughly how much? Why does this matter for cost and performance?

Given a tokenizer, explain how BPE builds the vocabulary — Walk through the merge process.

Identify the tokenization trap: You want the model to output JSON with a specific field name “customerID”. The model sometimes outputs “customer_ID” or “customerId”. What might be happening at the token level?


What’s Next

After this, you can:

  • Continue → Tokens → Embeddings — how tokens become vectors
  • Apply → Build semantic search once you understand embeddings
