
Text to Tokens - The Foundation

Summary

Deep dive into tokenization: why models can't read text directly, subword algorithms like BPE, practical patterns, and the pitfalls that cause production failures

BPE: subword merges first, raw bytes as the last-resort fallback

Tokenisation is where text becomes integers. Where the integers land decides cost, coverage, and context.

Why Tokenization Matters

Tokenization is the first transformation in the LLM data path. It decides what the model can see, how much it costs, and what it can generate. Bugs here look like model bugs — and that is why they’re expensive.

Most tokenisation tutorials say “words become numbers” and stop at a whitespace .split() example. They never show what Byte Pair Encoding actually does to unbelievable, why a unicorn emoji can cost up to four tokens, or why a Japanese sentence often pays 3–5× what an English sentence with the same meaning pays (Petrov et al. 2023 measured up to 15× for some scripts [petrov2023]). This chapter follows the data path through Sennrich et al. 2016 [sennrich2016] and the OpenAI tiktoken tokeniser [tiktoken], both verifiable, both load-bearing.

Why this matters in practice:

  • Non-English users often pay 3–5× more per concept (Japanese and Arabic text can need several times the tokens of an English equivalent; measure your own corpus)
  • Structured output drifts when token boundaries don’t align with field names you ask for
  • Context-window math is in tokens, not characters — miscount and you truncate or overpay

Takeaway: tokenisation is the boundary between text and math, and every later bug compounds on top of it — token-level reasoning is the floor of every other reasoning layer.

What Goes Wrong Without This:

Tokenization Failure Patterns
Symptom: Your LLM can't do basic arithmetic reliably.
Cause:   Numbers tokenize inconsistently. "1234" might be one token,
         two tokens, or four tokens depending on the tokenizer.

Symptom: Japanese/Arabic/Hindi users report higher costs.
Cause:   Tokenizer trained primarily on English. Other languages
         often require 3-5x more tokens for the same semantic content.

Symptom: Model outputs "Hel" instead of "Hello".
Cause:   Token boundaries don't align with word boundaries.
         "Hello" and " Hello" are different tokens.

Why Models Can’t Read Text

Neural networks are math machines. They do matrix multiplications. They need numbers.

The Tokenization Gap
"Hello world" → ??? → [0.23, -0.41, 0.89, ...]

The ??? is tokenization.

Text is a sequence of characters. Characters have Unicode values, but using raw Unicode doesn’t work:

Raw Unicode Problems
"cat" → [99, 97, 116]        # ASCII/Unicode
"Cat" → [67, 97, 116]        # Different numbers!
"CAT" → [67, 65, 84]         # Totally different

Models would have to learn that these mean the same thing
from raw numbers alone. Possible, but wasteful.
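
The code points listed above can be reproduced with Python's built-in ord(). This is a minimal sketch; code_points is a hypothetical helper name used only for illustration.

```python
def code_points(text: str) -> list[int]:
    """Map each character to its Unicode code point."""
    return [ord(ch) for ch in text]


for word in ("cat", "Cat", "CAT"):
    print(f"{word!r:>6} -> {code_points(word)}")

# Upper and lower case differ by a fixed offset of 32 in ASCII, but nothing
# in the raw integers tells a model that 99 and 67 are the same letter.
print(ord("c") - ord("C"))
```

The integers are a faithful encoding of the text, but the relationship between case variants has to be learned from data rather than read off the numbers.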

Why Not Just Split on Words?

Word-Level Tokenization Problems
Problem 1: Vocabulary explosion
English has ~170,000 words in common use.
Add proper nouns, technical terms, typos...
"COVID-19" — in your vocabulary?
"ketankhairnar" — definitely not.

Problem 2: Out-of-vocabulary (OOV)
Unknown words → <UNK> token
Model has no idea what <UNK> means.

Problem 3: No subword sharing
"run", "running", "runner" are clearly related.
Word-level treats them as completely separate.

Takeaway: characters are too granular (sequences get long, semantics get lost); words are too brittle (vocabulary explodes, OOV is fatal). Subword tokenisation is the negotiated settlement.
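
The out-of-vocabulary problem is easy to reproduce. The sketch below is a toy word-level tokeniser, not any library's implementation; build_vocab and encode are hypothetical helpers, and the two-sentence corpus is made up for illustration.

```python
UNK = "<UNK>"


def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign one integer ID per distinct whitespace-separated word."""
    vocab = {UNK: 0}
    for sentence in corpus:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Look up each word; anything unseen collapses to the <UNK> ID."""
    return [vocab.get(word, vocab[UNK]) for word in text.split()]


vocab = build_vocab(["I run daily", "she was running"])
print(encode("I run", vocab))      # both words were seen in training
print(encode("a runner", vocab))   # both words are unseen: every ID is <UNK>
```

Note that "runner" gets no credit for sharing a stem with "run" and "running": the word-level vocabulary has no way to express that overlap.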


Subword Tokenization

The solution: break words into pieces. Common pieces become tokens.

Subword Tokenization Benefits
"unhappiness" → ["un", "happi", "ness"]

Benefits:
• Finite vocabulary (typically 32K-200K tokens)
• Rare words decompose into known pieces
• Morphology captured ("un-" prefix, "-ness" suffix)

Three dominant algorithms:

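 Algorithm                   Used By                    Key Idea

 BPE (Byte Pair Encoding)    GPT, LLaMA, Claude         Merge the most frequent adjacent symbol pair
 WordPiece                   BERT                       Merge the pair that most increases corpus likelihood
 SentencePiece               T5, multilingual models    Language-agnostic toolkit: trains BPE or unigram models on raw text, treating whitespace as an ordinary symbol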

Implementations of BPE, WordPiece, and the Unigram model used by SentencePiece live in the Hugging Face tokenizers library [hf-tokenizers], typically accessed via transformers.AutoTokenizer.

BPE: How It Works

BPE was repurposed from a 1994 data-compression algorithm by Gage [gage1994]. Sennrich, Haddow, Birch 2016 adapted it for neural machine translation [sennrich2016]. The algorithm is small enough to fit on the back of an envelope:

  1. Initialise the symbol vocabulary as the character vocabulary.
  2. Count every adjacent symbol-pair in the training corpus.
  3. Merge the most frequent pair into a single new symbol.
  4. Repeat for num_merges iterations (typically 30K–100K).

Quoting §3.2 of the paper: “We iteratively count all symbol pairs and replace each occurrence of the most frequent pair ('A', 'B') with a new symbol 'AB'. Each merge operation produces a new symbol which represents a character n-gram.” [sennrich2016]

The paper’s worked example (Figure 1) learns merges from the toy dictionary {'low', 'lowest', 'newer', 'wider'} and reports r + </w>, l + o, lo + w, e + r</w> as the learned operations. The word counts below (5, 2, 6, 3) are illustrative additions so that each pair frequency can be checked by hand. With those counts, a strictly greedy learner takes both 9-count merges before the 7-count ones:

BPE on the Sennrich 2016 toy dictionary: first four merges
Dictionary: {'low</w>': 5, 'lowest</w>': 2, 'newer</w>': 6, 'wider</w>': 3}
            (</w> marks end-of-word)

Merge rule:  pair  →  merged_symbol
r + </w>   →  r</w>     (count: 6 + 3 = 9, from newer + wider; tied with e + r)
e + r</w>  →  er</w>    (count: 6 + 3 = 9, using prior merge)
l + o      →  lo        (count: 5 + 2 = 7, from low + lowest)
lo + w     →  low       (count: 5 + 2 = 7, using prior merge)

After 4 merges, "lower" (OOV at training time) segments as ["low", "er</w>"]:
both pieces are in the learned vocabulary, no <UNK> needed.

That last line is the load-bearing claim. Quote from §4 of the same paper: “BPE meets our goal of being open-vocabulary, and the learned merge operations can be applied to the test set to obtain a segmentation with no unknown symbols.” [sennrich2016]
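
The four steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's reference code: the helper names (pair_counts, apply_merge, learn_bpe, segment) are hypothetical, the word counts are the illustrative ones used above, and ties are broken by whichever pair was counted first. That tie-break may produce a different merge order from the table above, but the segmentation of "lower" comes out the same.

```python
from collections import Counter


def pair_counts(vocab: dict[tuple[str, ...], int]) -> Counter:
    """Count every adjacent symbol pair, weighted by word frequency."""
    counts: Counter = Counter()
    for symbols, freq in vocab.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def apply_merge(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    """Replace each left-to-right occurrence of pair with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Greedy BPE: repeatedly merge the most frequent adjacent pair."""
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(vocab)
        if not counts:
            break
        best = max(counts, key=counts.get)  # on a tie, the first pair counted wins
        merges.append(best)
        vocab = {apply_merge(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Apply the learned merges, in order, to a possibly unseen word."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


merges = learn_bpe({"low": 5, "lowest": 2, "newer": 6, "wider": 3}, num_merges=4)
print(merges)
print(segment("lower", merges))
```

"lower" never appears in the training dictionary, yet it segments into two learned symbols. That is the open-vocabulary property in action.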

Production BPE adds one more trick: byte-level fallback. Modern tokenisers (GPT-2’s byte-level BPE, GPT-4’s cl100k_base, GPT-4o’s o200k_base, LLaMA-2’s SentencePiece byte-fallback mode, and LLaMA-3’s tiktoken-style BPE) keep 256 single-byte tokens at the bottom of the vocabulary. When a Unicode character was never seen at training time (a new emoji, a rare CJK glyph, an Egyptian hieroglyph), it decomposes into its raw UTF-8 byte sequence. The model never emits an <UNK>. The hero diagram at the top of this chapter shows '🦄' in the worst case: four byte tokens (0xF0 0x9F 0xA6 0x84). An encoding that saw the emoji often enough during training may merge some of those bytes and spend fewer tokens.

BPE Results (illustrative; exact splits depend on the encoding)
"the"             → ["the"]                          # 1 token  (common, in-vocab)
"understanding"   → ["understanding"]                # 1 token in large vocabularies (verify for your encoding)
"defenestration"  → ["def", "en", "estr", "ation"]   # 4 tokens (rare, decomposed)
"🦄"              → [0xF0, 0x9F, 0xA6, 0x84]          # up to 4 byte tokens (byte fallback, worst case)

Takeaway: BPE is a compression algorithm pressed into vocabulary service — merges absorb frequent patterns into single tokens, and the 256-byte fallback layer guarantees zero <UNK> for any UTF-8 input.
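
The byte sequence behind the fallback can be inspected with the standard library alone. This is a sketch of the raw UTF-8 layer only, not of any tokeniser's merge table; utf8_bytes is a hypothetical helper name.

```python
def utf8_bytes(text: str) -> list[str]:
    """Return the UTF-8 encoding of text as one hex string per byte."""
    return [f"0x{b:02X}" for b in text.encode("utf-8")]


print(utf8_bytes("a"))    # ASCII: one byte
print(utf8_bytes("é"))    # Latin-1 supplement: two bytes
print(utf8_bytes("🦄"))   # outside the Basic Multilingual Plane: four bytes

# The mapping is lossless: the bytes decode back to the original character.
print(bytes([0xF0, 0x9F, 0xA6, 0x84]).decode("utf-8"))
```

Because every string has a UTF-8 encoding and every byte value has a token, the byte count is also an upper bound on the token count for that string.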


Practical Tokenization Patterns

Whitespace handling:

Whitespace in Tokens
Most tokenizers include leading whitespace in tokens:

"Hello world" → ["Hello", " world"]
                           ^ space is part of token

This is why " world" and "world" are different tokens.
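
A rough way to see where the leading space goes is a simplified pre-tokenisation regex. The pattern below is an assumption made for illustration: production tokenisers use richer, Unicode-aware split patterns, and BPE merges then run inside each piece.

```python
import re

# Simplified pre-tokenisation: an optional leading space sticks to the word
# or punctuation run that follows it; leftover whitespace forms its own piece.
PRETOKEN = re.compile(r" ?\w+| ?[^\w\s]+|\s+")


def pre_split(text: str) -> list[str]:
    """Split text into pieces that keep their leading space."""
    return PRETOKEN.findall(text)


print(pre_split("Hello world"))
print(pre_split("world"), pre_split(" world"))
```

The same word yields two different pieces depending on whether it starts the string or follows a space, which is why prompts that end with a trailing space can steer generation toward unusual tokens.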

Case sensitivity:

Case Sensitivity
Usually case-sensitive:
"Hello" → [token_123]
"hello" → [token_456]
"HELLO" → [token_789] or decomposed: ["HE", "LLO"]

Numbers:

Number Tokenization
Numbers often tokenize digit-by-digit or in chunks:

"123"      → ["123"]                    # if common
"12345"    → ["123", "45"]              # chunked
"3.14159"  → ["3", ".", "141", "59"]    # digits grouped left to right, up to three at a time

Arithmetic is hard because digits aren't reliably grouped.
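
The chunking can be imitated with a one-line regex. This sketch assumes a rule of "runs of at most three digits, taken left to right", which mirrors the digit rule in the split pattern published for cl100k_base; other tokenisers group digits differently, so the splits may not match the illustrative ones above. chunk_number is a hypothetical helper name.

```python
import re


def chunk_number(text: str) -> list[str]:
    """Group digits left to right in runs of at most three; other characters stand alone."""
    return re.findall(r"\d{1,3}|\D", text)


for s in ("123", "12345", "3.14159", "1000000"):
    print(f"{s!r:>10} -> {chunk_number(s)}")
```

The last case is the instructive one: 1000000 splits as 100 | 000 | 0, which does not line up with thousands groups (1 | 000 | 000). The model sees chunks that ignore place value, which is one reason column-wise arithmetic is unreliable.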

Code:

Code Tokenization
Code tokenization varies wildly:

Python: "def foo():"  → ["def", " foo", "():"]
JSON:   '{"key":'     → ['{', '"', 'key', '"', ':']

Common patterns (def, function, return) → single tokens
Rare identifiers → decomposed to pieces

Tokenizer Differences Matter:

Tokenizer Incompatibility
OpenAI (GPT-4):     "Hello world" → [9906, 1917]
Anthropic (Claude): "Hello world" → [different IDs]
Meta (LLaMA):       "Hello world" → [different IDs]

The token IDs mean completely different things.
You cannot mix tokenizers and models.

The OpenAI tiktoken library [tiktoken] ships several encodings, including cl100k_base (gpt-3.5-turbo, gpt-4, text-embedding-3-*), o200k_base (gpt-4o, gpt-4.1, gpt-5, the o-series reasoning models), and the legacy p50k_base and r50k_base (Codex / davinci). Always pin the encoding to the model: tiktoken.encoding_for_model("gpt-4o") returns o200k_base; tiktoken.encoding_for_model("gpt-4") returns cl100k_base. Newer model names resolve only in tiktoken releases that know about them, so keep the library version in step with the models you call. The OpenAI tokeniser playground [openai-tokenizer-playground] is the fastest way to verify a count interactively before you commit to a context budget.

Takeaway: tokeniser IDs are model-private dictionaries — same text yields different integers across providers, and there is no gettext-style translation table.


The Context Window Problem

Every model has a maximum context length. Measured in tokens, not characters.

Model Context Lengths

 Model          Context Length (tokens)    

 GPT-4          8K / 32K (128K with GPT-4 Turbo)
 Claude 3       200K                       
 LLaMA 3        8K (extendable)            
 Gemini 1.5     1M (preview)               

Tokens ≠ Characters ≠ Words:

Token Counting Rule of Thumb
Rule of thumb for English:
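1 token ≈ 4 characters ≈ 0.75 words

"The quick brown fox jumps over the lazy dog"
Characters: 43
Words: 9
Tokens: ~9-11 (the 4-character rule predicts ~11; an encoding that holds each of these common words as one token gives 9)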

But this varies by language and content type.

The 1 token ≈ 4 characters rule only holds for English. The tiktoken README states each token corresponds to “about 4 bytes” on average across English text [tiktoken]; for non-Latin scripts the ratio inverts because each character occupies 2–4 UTF-8 bytes and few merges apply. Always tokenise a representative sample of your real input before sizing a context budget — never extrapolate from the English heuristic.

Takeaway: tokens, not characters or words, are the unit the model bills and budgets — assume a 3–5× spread by language until you have measured your actual corpus.
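
A first-pass budget check can be written around the rule of thumb, as long as it is treated as an estimate to be replaced by real counts. The helper names (estimate_tokens, fits_budget), the safety_factor parameter, and the numbers in the usage lines are all assumptions made for illustration.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough English-only estimate from the 4-characters-per-token rule."""
    return max(1, round(len(text) / chars_per_token))


def fits_budget(estimated_tokens: int, context_limit: int,
                reserved_for_output: int, safety_factor: float = 1.0) -> bool:
    """True if the scaled prompt estimate still leaves room for the reply."""
    return estimated_tokens * safety_factor + reserved_for_output <= context_limit


pangram = "The quick brown fox jumps over the lazy dog"
print(len(pangram), estimate_tokens(pangram))

# The same 1,500-token English estimate, checked against an 8K window
# with 1,024 tokens reserved for the reply:
print(fits_budget(1500, 8192, 1024))                      # English assumption
print(fits_budget(1500, 8192, 1024, safety_factor=5.0))   # pessimistic 5x spread
```

The second check fails while the first passes: a prompt that looks comfortable under the English heuristic can overflow once the language spread is applied. Swap the estimate for a real tokeniser count as soon as a representative sample is available.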


Common Pitfalls & Misconceptions

The table below condenses the most common ways tokenisation surprises production code. Each row names a concrete failure mode — not a theoretical one.

Symptom: LLM can’t add two five-digit numbers reliably.
Cause:   Numbers tokenise inconsistently: "12345" may be one token, ["123","45"], or ["1","2","3","4","5"] depending on the encoding, and digit boundaries can differ between tokeniser families.
Fix:     Force consistent digit handling: pre-format numbers with separators (12,345), or use a structured output schema that constrains generation. For arithmetic, route to a tool; don’t ask the LLM to “do math”.

Symptom: Japanese/Arabic users complain about API costs.
Cause:   Tokeniser trained primarily on English; non-Latin scripts get few merges and fall back to multi-byte sequences. A 5-character Japanese string is 15 UTF-8 bytes, so in the worst case it costs up to 15 tokens.
Fix:     Tokenise a representative sample with tiktoken.encoding_for_model(...) in the target language before pricing. Budget context limits in tokens, not characters. Consider model selection (some models tokenise CJK more efficiently).

Symptom: Model outputs "Hel" instead of "Hello".
Cause:   "Hello" and " Hello" are different tokens; the model picked the wrong leading-whitespace variant and your stopping condition matched a partial.
Fix:     Use logit-bias to forbid the partial-token variants, or use a structured output schema. Inspect generated tokens (not just decoded text) when debugging stop conditions.

Symptom: JSON field name drifts between customerID, customer_id, customerId.
Cause:   Each variant is one or more different tokens. The model can switch token sequences mid-generation if the prompt doesn’t anchor a single form.
Fix:     Use constrained decoding / JSON mode (response_format) so the schema enforces field names at the token level, not just at the validation step.

Symptom: <UNK> token appears in production output.
Cause:   You’re using a word-level or older WordPiece tokeniser without byte-level fallback (e.g. legacy BERT). Modern BPE tokenisers can’t emit <UNK>: the 256-byte fallback layer absorbs any UTF-8 input.
Fix:     Migrate to a tokeniser with a byte-level base (cl100k_base, o200k_base, LLaMA-2 SentencePiece byte-fallback, LLaMA-3 BPE) or accept that legacy tokenisers will OOV on emoji and new scripts.

Symptom: Output looks fine but downstream parser rejects it.
Cause:   Unicode lookalikes: a Greek capital eta Η (U+0397) renders identically to Latin H (U+0048) but tokenises differently. Adversarial inputs use this to bypass filters.
Fix:     Normalise input via NFC or NFKC (Unicode Standard Annex #15 [unicode-tr15]) before tokenising on security-sensitive paths. Normalisation folds compatibility variants such as fullwidth letters but does not map cross-script lookalikes like Greek Η onto Latin H, so add a mixed-script or confusables check for those. For output validation, byte-compare critical fields, not visual strings.

Symptom: Token count differs between encode() and count_tokens() SDK helpers.
Cause:   Provider SDKs sometimes add chat-template overhead (system prompts, <|im_start|>, etc.) inside count_tokens but not in encode. The “extra” tokens are the chat scaffolding.
Fix:     Always count via the same path the request goes through. For OpenAI: tiktoken.encoding_for_model(...).encode(...) for raw text; use the message-format helpers for chat-completion budgeting.

Takeaway: surprises at the application layer almost always trace to surprises at the token layer — when behaviour is weird, decode the integers before you blame the model.


Edge Cases In Detail

The pitfalls table names what to watch for; this section drills into the four edge cases that bite hardest in production.

Non-English Languages

Non-English Token Inefficiency
Tokenizers trained primarily on English.
Other languages get worse token efficiency.

English:  "hello"       → 1 token
Japanese: "こんにちは"   → 3-5 tokens (same meaning!)
Arabic:   "مرحبا"       → 4-6 tokens

Same semantic content, 3-5x more tokens.
This means:
• Higher costs
• Smaller effective context window
• Sometimes worse model performance
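
Part of the gap is visible before any tokeniser is involved: the same number of characters occupies very different numbers of UTF-8 bytes. The sketch below uses only the standard library; utf8_profile is a hypothetical helper name. The byte count is an upper bound on tokens for a byte-level tokeniser, not a prediction of the actual count.

```python
def utf8_profile(text: str) -> tuple[int, int]:
    """Return (characters, UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


for label, greeting in [("English", "hello"),
                        ("Japanese", "こんにちは"),
                        ("Arabic", "مرحبا")]:
    chars, nbytes = utf8_profile(greeting)
    print(f"{label:<9}{chars} chars  {nbytes} bytes  "
          f"(at most {nbytes} tokens under pure byte fallback)")
```

All three greetings are five characters long, but they occupy 5, 15, and 10 bytes respectively. Learned merges claw some of that back, and how much depends on how well the script was represented in the tokeniser's training data.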

Rare Words and Neologisms

Handling New Words
"COVID-19" (pre-2020 tokenizer): ["CO", "VID", "-", "19"]
"ChatGPT" (early tokenizer):     ["Chat", "G", "PT"]

Model must infer meaning from pieces.
Usually works, but costs more tokens.

Adversarial Inputs

Unicode Tricks
Unicode tricks can break tokenizers:

"Hello" vs "Ηello" # the H in the second string is Greek capital Eta (U+0397)
• Look identical
• Different tokens
• Model might behave differently

Whitespace characters:
Regular space vs non-breaking space vs zero-width space
• Visually identical or invisible
• Different tokens
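
Lookalikes are straightforward to detect once you compare code points instead of glyphs. The sketch below uses Python's unicodedata module; describe is a hypothetical helper, and the fullwidth example is added here for contrast. It also shows what normalisation does and does not fix.

```python
import unicodedata

latin = "Hello"
greek = "\u0397ello"       # starts with GREEK CAPITAL LETTER ETA
fullwidth = "\uFF28ello"   # starts with FULLWIDTH LATIN CAPITAL LETTER H


def describe(text: str) -> str:
    """Name the first character and show the UTF-8 bytes of the whole string."""
    first = text[0]
    return (f"U+{ord(first):04X} {unicodedata.name(first)}  "
            f"bytes={text.encode('utf-8').hex(' ')}")


for s in (latin, greek, fullwidth):
    print(describe(s))

# NFKC folds compatibility variants such as fullwidth forms and no-break spaces...
print(unicodedata.normalize("NFKC", fullwidth) == latin)
print(unicodedata.normalize("NFKC", "a\u00a0b") == "a b")
# ...but it does not map a Greek letter onto a Latin one.
print(unicodedata.normalize("NFKC", greek) == latin)
```

Normalisation is therefore a necessary first step rather than a complete defence: cross-script lookalikes survive it and need a separate mixed-script or confusables check.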

Token Boundaries Affect Generation

Generation Boundary Issues
If you want the model to output exactly "Hello":

Prompt: "Say Hello"
Output: "Hello" 

But if the tokenizer splits "Hello" → ["Hel", "lo"]:
Model might generate "Hel" then something else.

Takeaway: non-English inefficiency, neologism decomposition, Unicode adversarial inputs, and token-boundary generation drift are the four classes that cover most production tokeniser bugs. Special-token misuse (next section) is the fifth.


Special Tokens

Every tokenizer has reserved tokens for structure:

Special Token Types

 Token           Purpose

 <BOS> / <s>     Beginning of sequence
 <EOS> / </s>    End of sequence
 <PAD>           Padding for batch processing   
 <UNK>           Unknown token (rare in BPE)    
 <|im_start|>    Message boundary (chat models) 
 [INST]          Instruction marker             


These are NOT in your text. They're added by formatting.
Chat templates use them to structure conversations.

Takeaway: special tokens are model-private control characters — chat templates render them from messages, but they also reserve vocabulary slots that user text must never collide with.
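
To make "added by formatting" concrete, here is a sketch of a ChatML-style renderer. The layout, the <|im_end|> closing marker, and the render_chatml helper are assumptions for illustration; each model family defines its own template, so use the template shipped with your model rather than this one.

```python
IM_START, IM_END = "<|im_start|>", "<|im_end|>"
RESERVED = (IM_START, IM_END)


def render_chatml(messages: list[dict[str, str]]) -> str:
    """Wrap each message in start/end markers; refuse content that contains them."""
    parts = []
    for msg in messages:
        if any(marker in msg["content"] for marker in RESERVED):
            raise ValueError("message content contains a reserved special-token string")
        parts.append(f"{IM_START}{msg['role']}\n{msg['content']}{IM_END}\n")
    return "".join(parts)


prompt = render_chatml([
    {"role": "system", "content": "You are terse."},
    {"role": "user", "content": "Say Hello"},
])
print(prompt)
```

The collision check is the important design choice: if user text could smuggle in a marker string, it could forge a message boundary. Rejecting it is one option, and escaping it or encoding user text with special tokens disabled are others.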


Code Example

Minimal tokenisation exploration with tiktoken (OpenAI’s official Python tokeniser). Run as shown against the pinned version:

# Tested on tiktoken==0.7.0 + Python 3.11
# pip install "tiktoken>=0.7,<0.8"
# Note: tiktoken downloads each encoding's vocabulary on first use and caches it.
import tiktoken


def show_tokens(text: str, encoding: tiktoken.Encoding) -> None:
    """Print the token count and the raw bytes behind each token of text."""
    ids = encoding.encode(text)
    # decode_single_token_bytes keeps partial UTF-8 sequences visible;
    # decode([t]) would replace them with U+FFFD.
    pieces = [encoding.decode_single_token_bytes(t) for t in ids]
    print(f"  {text!r:<25}{len(ids)} tokens  {pieces}")


# Two model families, two encodings:
gpt4 = tiktoken.encoding_for_model("gpt-4")        # → cl100k_base
gpt4o = tiktoken.encoding_for_model("gpt-4o")      # → o200k_base

for enc, name in [(gpt4, "cl100k_base"), (gpt4o, "o200k_base")]:
    print(f"\n=== {name} ===")
    show_tokens("hello", enc)              # common word
    show_tokens("unbelievable", enc)       # long word: BPE may split it
    show_tokens("def calculate_total():",  # code: keywords and identifiers mixed
                enc)
    show_tokens("123456789", enc)          # numbers: chunking varies
    show_tokens("Hello",     enc)          # English baseline
    show_tokens("Bonjour",   enc)          # French
    show_tokens("こんにちは", enc)          # Japanese: look for partial-byte tokens
    show_tokens("🦄",         enc)          # emoji: up to 4 byte-level tokens

The encoding choice changes the token count for the same string; that is the load-bearing observation. cl100k_base and o200k_base are not interchangeable, and the older encoding generally fragments emoji and CJK text more aggressively. Run the script and compare the rows.


Verify Your Understanding

Before continuing, you should be able to answer these from memory:

  1. Why are "Hello" and " Hello" different tokens? Explain what BPE includes in a token and why a leading space changes the token sequence. What does this imply for prompt-engineering and stop conditions?
  2. Walk through BPE on {'low': 5, 'lowest': 2, 'newer': 6, 'wider': 3}. Name the first merge the algorithm learns and explain why. After 10 merges, segment the word "lower" — show that no <UNK> is needed.
  3. English vs Japanese token-count ratio. A 5-character Japanese sentence and a 30-character English sentence have similar semantic content. Which costs more tokens? Why? What is the practical cost ratio for non-English users?
  4. Byte-level fallback. A user pastes a brand-new emoji that didn’t exist when the tokeniser was trained. The model never emits <UNK>. How? How many tokens does a 4-byte UTF-8 character cost in the worst case?
  5. The customerID trap. You ask the model for JSON with field "customerID" and it sometimes emits "customer_id" or "customerId". Explain at the token level why this happens, and name two production-grade fixes.

What’s Next

Tokenisation is the gate between text and integers. The next chapter — Tokens → Embeddings — picks up the integers and shows how they become vectors with learned meaning: why one-hot fails, how cosine similarity captures semantics, and the difference between token-level, sentence-level, and contextual embeddings.


References

  • [sennrich2016] Sennrich, R., Haddow, B., Birch, A. Neural Machine Translation of Rare Words with Subword Units. ACL 2016, pages 1715–1725. arXiv:1508.07909. Source for the BPE algorithm in NMT, the {'low','lowest','newer','wider'} worked example, and the open-vocabulary claim. Cited in §§ Why Tokenization Matters, Subword Tokenization — BPE: How It Works.
  • [gage1994] Gage, P. A New Algorithm for Data Compression. The C Users Journal, Feb 1994. Original BPE as a compression algorithm — Sennrich 2016 adapted it for vocabulary learning. Cited in § Subword Tokenization — BPE: How It Works.
  • [tiktoken] OpenAI. tiktoken. GitHub: openai/tiktoken. Canonical Python tokeniser for OpenAI models; ships cl100k_base (gpt-3.5/gpt-4), o200k_base (gpt-4o / gpt-4.1 / gpt-5 / o-series), p50k_base, r50k_base, gpt2. Cited in §§ Why Tokenization Matters, Practical Tokenization Patterns, The Context Window Problem, Code Example.
  • [hf-tokenizers] Hugging Face. tokenizers library and transformers.AutoTokenizer docs. huggingface.co/docs/tokenizers. Reference implementation for BPE, WordPiece, SentencePiece across the open-model ecosystem. Cited in § Subword Tokenization.
  • [openai-tokenizer-playground] OpenAI. Tokeniser Playground. platform.openai.com/tokenizer. Interactive verification tool: paste any string, see the token boundaries and IDs for each encoding. Cited in § Practical Tokenization Patterns.
  • [unicode-tr15] Unicode Consortium. Unicode Normalization Forms (UAX #15). unicode.org/reports/tr15. NFC / NFKC normalisation; folds canonical and compatibility variants before comparison, but does not by itself resolve cross-script lookalikes. Cited in § Common Pitfalls & Misconceptions.
  • [petrov2023] Petrov, A. et al. Language Model Tokenizers Introduce Unfairness Between Languages. NeurIPS 2023. arXiv:2305.15425. Measured tokeniser cost ratios across 50+ languages; documents up to 15× for some scripts vs English. Cited in § Why Tokenization Matters.