Deep dive into tokenization: why models can't read text directly, subword algorithms like BPE, practical patterns, and the pitfalls that cause production failures
12 minutes • Intermediate Level • Dec 2024
Why Tokenization Matters
Tokenization seems like preprocessing trivia. It’s not.
Every LLM interaction starts here. Tokenization determines what the model can see, how much it costs, and what it can generate. Bugs at this layer cause failures that are extremely difficult to diagnose because they look like model problems.
If you skip understanding tokenization:
You’ll wonder why non-English users complain about costs (hint: they’re paying 3-5x more per concept)
You’ll struggle with structured output when tokens don’t align with your expected boundaries
You’ll miscount context window usage and either truncate critical information or pay for wasted tokens
What Goes Wrong Without This:
Tokenization Failure Patterns
Symptom: Your LLM can't do basic arithmetic reliably.
Cause: Numbers tokenize inconsistently. "1234" might be one token,
two tokens, or four tokens depending on the tokenizer.
Symptom: Japanese/Arabic/Hindi users report higher costs.
Cause: Tokenizer trained primarily on English. Other languages
require 3-5x more tokens for the same semantic content.
Symptom: Model outputs "Hel" instead of "Hello"
Cause: Token boundaries don't align with word boundaries.
"Hello" and " Hello" are different tokens.
Why Models Can’t Read Text
Neural networks are math machines. They do matrix multiplications. They need numbers.
The Tokenization Gap
"Hello world" → ??? → [0.23, -0.41, 0.89, ...]
The ??? is tokenization.
Text is a sequence of characters. Characters have Unicode values, but using raw Unicode doesn’t work:
Raw Unicode Problems
"cat" → [99, 97, 116] # ASCII/Unicode
"Cat" → [67, 97, 116] # Different numbers!
"CAT" → [67, 65, 84] # Totally different
Models would have to learn that these mean the same thing
from raw numbers alone. Possible, but wasteful.
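You can reproduce these numbers directly in plain Python with ord(), no tokenizer required:

# Inspect the raw code points of each string
for word in ["cat", "Cat", "CAT"]:
    print(word, [ord(c) for c in word])

# cat [99, 97, 116]
# Cat [67, 97, 116]
# CAT [67, 65, 84]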
Why Not Just Split on Words?
Word-Level Tokenization Problems
Problem 1: Vocabulary explosion
English has ~170,000 words in common use.
Add proper nouns, technical terms, typos...
"COVID-19" — in your vocabulary?
"ketankhairnar" — definitely not.
Problem 2: Out-of-vocabulary (OOV)
Unknown words → <UNK> token
Model has no idea what <UNK> means.
Problem 3: No subword sharing
"run", "running", "runner" are clearly related.
Word-level treats them as completely separate.
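A toy word-level tokenizer makes all three problems concrete. This is a minimal sketch with a hypothetical tiny vocabulary, not any real tokenizer:

# Hypothetical tiny vocabulary to illustrate word-level tokenization
vocab = {"<UNK>": 0, "the": 1, "cat": 2, "run": 3, "running": 4, "runner": 5}

def word_tokenize(text: str) -> list[int]:
    """Map each whitespace-separated word to an ID, or <UNK> if unseen."""
    return [vocab.get(word, vocab["<UNK>"]) for word in text.lower().split()]

print(word_tokenize("the cat running"))    # [1, 2, 4] - fine
print(word_tokenize("the cat sprinting"))  # [1, 2, 0] - "sprinting" collapses to <UNK>
# "run", "running", "runner" get unrelated IDs: no shared structure to learn from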
Subword Tokenization
The solution: break words into pieces. Common pieces become tokens.
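The most common way to pick those pieces is byte-pair encoding (BPE): start from individual characters, then repeatedly merge the most frequent adjacent pair into a new vocabulary symbol. Here is a minimal sketch on a toy corpus; real tokenizers run this over bytes and a far larger corpus:

from collections import Counter

# Toy corpus: word (as a tuple of symbols) -> frequency
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(3):  # each merge adds one new symbol to the vocabulary
    pair = most_frequent_pair(corpus)
    corpus = merge(corpus, pair)
    print("merged", pair, "->", list(corpus))

Common sequences like "est" become single symbols after a few merges, while rare words simply fall back to smaller pieces. There is no <UNK>: anything can be spelled out from the base symbols.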
Most tokenizers include leading whitespace in tokens:
"Hello world" → ["Hello", " world"]
^ space is part of token
This is why " world" and "world" are different tokens.
Case sensitivity:
Case Sensitivity
Usually case-sensitive:
"Hello" → [token_123]
"hello" → [token_456]
"HELLO" → [token_789] or decomposed: ["HE", "LLO"]
Numbers:
Number Tokenization
Numbers are often tokenized digit-by-digit or in chunks:
"123" → ["123"] # if common
"12345" → ["123", "45"] # chunked
"3.14159" → ["3", ".", "14", "159"]
Arithmetic is hard because digits aren't reliably grouped.
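You can see the chunking for yourself. Whether a given number is one token or several depends on the encoding, so the splits are examples to check rather than guarantees:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

for number in ["123", "12345", "3.14159", "1234567890"]:
    tokens = enc.encode(number)
    print(number, "->", [enc.decode([t]) for t in tokens])

# Digit grouping typically shifts with length, so the model never sees
# a stable "digit" unit to do arithmetic over.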
Rule of thumb for English:
1 token ≈ 4 characters ≈ 0.75 words
"The quick brown fox jumps over the lazy dog"
Characters: 43
Words: 9
Tokens: ~11
But this varies by language and content type.
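The heuristic is easy to sanity-check on your own text:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

text = "The quick brown fox jumps over the lazy dog"
tokens = enc.encode(text)
print("Characters: ", len(text))
print("Words:      ", len(text.split()))
print("Tokens:     ", len(tokens))
print("Chars/token:", len(text) / len(tokens))  # roughly 4 for typical English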
Tokenization Pitfalls
Non-English Languages
Non-English Token Inefficiency
Tokenizers trained primarily on English.
Other languages get worse token efficiency.
English:  "hello"      → 1 token
Japanese: "こんにちは"  → 3-5 tokens (same meaning!)
Arabic:   "مرحبا"      → 4-6 tokens
Same semantic content, 3-5x more tokens.
This means:
• Higher costs
• Smaller effective context window
• Sometimes worse model performance
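A rough way to quantify the gap for your own content is to compare token counts for equivalent text. The price constant below is a placeholder for illustration, not a real rate, and the greeting pairing is only approximate:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

# Roughly equivalent greetings (pairing is illustrative)
samples = {"English": "hello", "Japanese": "こんにちは", "Arabic": "مرحبا"}

PRICE_PER_1K_TOKENS = 0.01  # hypothetical rate, for illustration only

baseline = len(enc.encode(samples["English"]))
for language, text in samples.items():
    n = len(enc.encode(text))
    print(f"{language}: {n} tokens "
          f"({n / baseline:.1f}x English, ~${n * PRICE_PER_1K_TOKENS / 1000:.6f})")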
Rare Words and Neologisms
Handling New Words
"COVID-19" (pre-2020 tokenizer): ["CO", "VID", "-", "19"]
"ChatGPT" (early tokenizer): ["Chat", "G", "PT"]
Model must infer meaning from pieces.
Usually works, but costs more tokens.
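Printing the pieces shows how far a tokenizer has caught up with a term: newer encodings may keep it together, older ones split it apart. The exact splits below depend on the encoding:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

for term in ["COVID-19", "ChatGPT", "ketankhairnar"]:
    tokens = enc.encode(term)
    print(term, "->", [enc.decode([t]) for t in tokens], f"({len(tokens)} tokens)")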
Adversarial Inputs
Unicode Tricks
Unicode tricks can break tokenizers:
"Hello" vs "Ηello" # second H is Greek Eta
• Look identical
• Different tokens
• Model might behave differently
Whitespace characters:
Regular space vs non-breaking space vs zero-width space
• Visually same
• Different tokens
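Homoglyphs and invisible characters are easy to surface once you look at the code points. A defensive check, sketched with the standard library plus tiktoken:

import unicodedata
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

latin = "Hello"
homoglyph = "\u0397ello"    # starts with GREEK CAPITAL LETTER ETA, not Latin H
invisible = "Hel\u200blo"   # contains a ZERO WIDTH SPACE

def non_ascii_names(text: str) -> list[str]:
    """Name every non-ASCII code point to surface lookalikes and invisibles."""
    return [unicodedata.name(c, "UNNAMED") for c in text if not c.isascii()]

for text in [latin, homoglyph, invisible]:
    print(repr(text), enc.encode(text), non_ascii_names(text))

# Visually similar strings produce different token sequences,
# so cache keys, guardrails, and exact-match filters can be bypassed.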
Token Boundaries Affect Generation
Generation Boundary Issues
If you want the model to output exactly "Hello":
Prompt: "Say Hello"
Output: "Hello" ✓
But if tokenizer makes "Hello" → ["Hel", "lo"]:
Model might generate "Hel" then something else.
Special Tokens
Every tokenizer has reserved tokens for structure:
Special Token Types
┌────────────────┬────────────────────────────────┐
│ Token          │ Purpose                        │
├────────────────┼────────────────────────────────┤
│ <BOS> / <s>    │ Beginning of sequence          │
│ <EOS> / </s>   │ End of sequence                │
│ <PAD>          │ Padding for batch processing   │
│ <UNK>          │ Unknown token (rare in BPE)    │
│ <|im_start|>   │ Message boundary (chat models) │
│ [INST]         │ Instruction marker             │
└────────────────┴────────────────────────────────┘
These are NOT in your text. They're added by formatting.
Chat templates use them to structure conversations.
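tiktoken treats special tokens differently from ordinary text: by default it refuses to encode them when they appear in user input, and you must opt in explicitly. A small sketch using <|endoftext|>, which is a registered special token in the GPT-4 encoding:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

text = "<|endoftext|>"

# By default, special-token text inside user input raises an error
try:
    enc.encode(text)
except ValueError as err:
    print("Refused by default:", err)

# Explicitly allowing it yields the single reserved token ID
print(enc.encode(text, allowed_special={"<|endoftext|>"}))

This default is a safety feature: it stops user-supplied text from injecting structural tokens into a prompt.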
Code Example
Minimal tokenization exploration with tiktoken (OpenAI’s tokenizer):
import tiktoken

# Load GPT-4 tokenizer
enc = tiktoken.encoding_for_model("gpt-4")

def explore_tokenization(text: str) -> None:
    """Show how text becomes tokens."""
    tokens = enc.encode(text)
    print(f"Text: {text!r}")
    print(f"Tokens: {tokens}")
    print(f"Count: {len(tokens)}")
    # Decode each token to see the pieces
    pieces = [enc.decode([t]) for t in tokens]
    print(f"Pieces: {pieces}")
    print()

# Common word - single token
explore_tokenization("hello")

# Compound word - multiple tokens
explore_tokenization("unbelievable")

# Code - mixed tokenization
explore_tokenization("def calculate_total():")

# Numbers - often split
explore_tokenization("123456789")

# Non-English - more tokens per concept
explore_tokenization("Hello")       # English
explore_tokenization("Bonjour")     # French
explore_tokenization("こんにちは")   # Japanese
Key Takeaways
1. Tokenization is the foundation of every LLM interaction
- Determines cost, context limits, and model behavior
2. Subword tokenization (BPE) balances vocabulary size with coverage
- Common words → single tokens
- Rare words → decomposed into pieces
3. Tokenizers are model-specific
- Never mix tokenizers and models
- Same text → different token counts across providers
4. Non-English text is more expensive
- 3-5x more tokens for same semantic content
- Important for international applications
5. Token boundaries affect generation
- "Hello" and " Hello" are different tokens
- Matters for structured output
Verify Your Understanding
Before proceeding, you should be able to:
Explain why “Hello” and “ Hello” are different tokens — What does whitespace handling in tokenizers tell you about how prompts are represented?
Predict relative token counts for English vs. Japanese — Which will have more tokens? By roughly how much? Why does this matter for cost and performance?
Given a tokenizer, explain how BPE builds the vocabulary — Walk through the merge process.
Identify the tokenization trap: You want the model to output JSON with a specific field name “customerID”. The model sometimes outputs “customer_ID” or “customerId”. What might be happening at the token level?