Tokens to Embeddings - Vectors That Capture Meaning
Deep dive into embeddings: why one-hot encoding fails, how meaning emerges from training, measuring similarity, and the difference between token and sentence embeddings
12 minutes • Intermediate Level • Dec 2024
Building On Previous Knowledge
In the previous progression, you learned how text becomes a sequence of token IDs. This created a problem: token ID 3797 is just an arbitrary number. It has no inherent relationship to token ID 3798, even if one means “cat” and the other means “kitten.”
Neural networks need representations where similar meanings are mathematically close. Token IDs don’t provide this.
This progression solves that problem by introducing embeddings: learned vectors where semantic similarity becomes geometric proximity.
What Goes Wrong Without This:
Embedding Failure Patterns
Symptom: Semantic search returns irrelevant results despite high scores.
Cause: Embedding model trained on general web text. Your domain
uses specialized vocabulary the model doesn't understand.
Symptom: Multilingual search works poorly.
Cause: Embedding model trained primarily on English.
Other languages are poorly aligned in the embedding space.
Symptom: Embedding-based system degrades without code changes.
Cause: You updated the embedding model or the provider did.
Old embeddings no longer align with new query embeddings.
This is called "embedding drift."
Token IDs Are Meaningless
After tokenization, you have a sequence of integers:
Token IDs Have No Meaning
"The cat sat on the mat"
→ [464, 3797, 3332, 319, 262, 2603]
These numbers are arbitrary vocabulary indices.
Token 3797 = "cat"
Token 3798 = "catch" # no meaningful relationship to 3797
The model needs a representation where "cat" and "kitten"
are mathematically close, while "cat" and "democracy" are far apart.
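If you want to see these IDs for yourself, here is a small sketch using the tiktoken library with the "gpt2" encoding, which appears to be the vocabulary behind the IDs above; any other tokenizer would assign completely different numbers:

import tiktoken

# GPT-2 BPE vocabulary (50,257 tokens); other tokenizers use different IDs
enc = tiktoken.get_encoding("gpt2")

ids = enc.encode("The cat sat on the mat")
print(ids)  # expect [464, 3797, 3332, 319, 262, 2603], as in the example above

# Adjacent IDs decode to unrelated strings
print(enc.decode([3797]), "|", enc.decode([3798]))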
Why One-Hot Encoding Fails
The naive approach: represent each token as a vector with one “1” and all other positions “0.”
One-Hot Encoding
Vocabulary size: 50,257 tokens
Token "cat" (ID 3797):
[0, 0, 0, ..., 1, ..., 0, 0]
^ position 3797
50,257 dimensions. 50,256 zeros. One "1".
Token "kitten" (ID 4521):
[0, 0, 0, ..., 1, ..., 0, 0]
^ position 4521
Three fatal problems:
Problem 1: No Semantic Information
Similarity between any two different tokens = 0
dot_product(cat, kitten) = 0 # orthogonal!
dot_product(cat, democracy) = 0 # also orthogonal!
Every pair of words is equally "unrelated."
The representation contains no meaning.
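A minimal numpy sketch of this, with a toy 6-token vocabulary standing in for the real 50,257 and made-up token IDs:

import numpy as np

VOCAB_SIZE = 6                      # toy vocabulary for illustration
CAT, KITTEN, DEMOCRACY = 1, 2, 5    # made-up IDs, not real tokenizer IDs

def one_hot(token_id: int) -> np.ndarray:
    """All zeros except a single 1 at the token's index."""
    v = np.zeros(VOCAB_SIZE)
    v[token_id] = 1.0
    return v

# Any two distinct one-hot vectors are orthogonal: the dot product is always 0
print(np.dot(one_hot(CAT), one_hot(KITTEN)))     # 0.0
print(np.dot(one_hot(CAT), one_hot(DEMOCRACY)))  # 0.0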
Problem 2: Curse of Dimensionality
Vocabulary = 50,000 → 50,000-dimensional vectors
Each vector is 99.998% zeros
Memory: 50,000 x 50,000 x 4 bytes = 10 GB just for embeddings
Compute: mostly multiplying by zero
Problem 3: No Transfer Learning
Learning about "cat" teaches nothing about "kitten"
They're orthogonal—completely independent.
Must see every word many times to learn anything about it.
Embeddings solve all three:
Dense (no wasted dimensions)
Semantic (similar words → similar vectors)
Transfer (related words share structure)
Embeddings: Vectors That Capture Meaning
An embedding is a dense vector of floating-point numbers representing a concept.
What Embeddings Encode
Token ID 3797 ("cat") → [0.23, -0.41, 0.89, 0.12, ..., -0.33]
+------------ 768 dimensions -----------+
This vector encodes everything the model learned about "cat":
• It's an animal
• It's furry
• It's a pet
• It appears in similar contexts as "dog", "kitten", "pet"
All of this compressed into ~768 numbers.
Lookup is O(1): given token ID, grab that row from the matrix.
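A sketch of that lookup; the matrix here is filled with random numbers standing in for learned values:

import numpy as np

VOCAB_SIZE, DIM = 50_257, 768

# In a trained model these values are learned; random here for illustration
embedding_matrix = np.random.randn(VOCAB_SIZE, DIM).astype(np.float32)

token_id = 3797                          # "cat" in the example above
cat_vector = embedding_matrix[token_id]  # O(1): just index the row

print(cat_vector.shape)  # (768,)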
How Meaning Emerges
Embeddings aren’t designed. They’re learned from data.
Learning Meaning From Context
Training objective: predict next token (or masked token)
"The cat sat on the ___"
Model sees millions of examples where:
• "cat" appears near "dog", "pet", "furry", "meow"
• "cat" appears after "the", "a", "my"
• "cat" appears before "sat", "slept", "ran"
Gradient descent adjusts embeddings so:
• Similar contexts → similar embeddings
• "cat" and "kitten" vectors become close
• "cat" and "democracy" vectors stay far
The famous Word2Vec result:
Vector Arithmetic Captures Relationships
vector("king") - vector("man") + vector("woman") ≈vector("queen")
Embeddings capture relationships:
king:queen :: man:woman
paris:france :: tokyo:japan
This emerges from context, not explicit programming.
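You can reproduce this kind of analogy with pretrained static vectors, for example through gensim's downloader; the model name below is one readily available option, and exact results vary by model:

import gensim.downloader as api

# Pretrained GloVe vectors (static embeddings); downloaded on first use
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Typically returns 'queen' as the closest word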
Measuring Similarity
Two vectors are similar if they point in similar directions.
Cosine Similarity
Cosine similarity: measures the angle between vectors
1.0 = identical direction (very similar)
0.0 = orthogonal (unrelated)
-1.0 = opposite direction (antonyms, sometimes)
sim("cat", "kitten") ≈ 0.85 # very similar
sim("cat", "dog") ≈ 0.75 # related but different
sim("cat", "democracy") ≈ 0.12 # unrelated
Dot Product
Dot Product for Normalized Vectors
Most embedding models normalize vectors to unit length.
When ||v|| = 1 for all vectors:
cosine_similarity(a, b) = dot_product(a, b)
This makes similarity computation fast: just matrix multiply.
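A quick numpy check of that equivalence:

import numpy as np

a = np.random.randn(768)
b = np.random.randn(768)

# Normalize to unit length, as most embedding models do for their outputs
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_units = np.dot(a_unit, b_unit)

print(np.isclose(cosine, dot_of_units))  # True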
Euclidean Distance
Euclidean distance: straight-line distance in vector space
distance("cat", "kitten") ≈ 0.3 # close together
distance("cat", "democracy") ≈ 1.8 # far apart
Lower distance = more similar (opposite of cosine)
When to use which:
┌────────────────────┬──────────────────────┬────────────────────────────────────────┐
│ Metric             │ Best For             │ Note                                   │
├────────────────────┼──────────────────────┼────────────────────────────────────────┤
│ Cosine similarity  │ Semantic similarity  │ Direction matters, magnitude doesn't   │
│ Dot product        │ Ranking, attention   │ Faster; magnitude affects result       │
│ Euclidean distance │ Clustering, k-NN     │ Position in space, not just direction  │
└────────────────────┴──────────────────────┴────────────────────────────────────────┘
Rule of thumb: Use cosine similarity for text embeddings. It’s the standard.
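To see why the distinction matters, here is a toy numpy example where one vector is just a longer copy of the other: cosine ignores the length difference, while dot product and Euclidean distance do not:

import numpy as np

# Toy vectors pointing in the same direction, one twice as long
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

cosine    = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0   (direction only)
dot       = np.dot(a, b)                                            # 28.0  (magnitude matters)
euclidean = np.linalg.norm(a - b)                                   # ~3.74 (position matters)

print(cosine, dot, euclidean)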
Token vs Sentence Embeddings
Two different things, often confused:
Token vs Sentence Embeddings
TOKEN EMBEDDINGS
----------------
One vector per token
"The cat sat" → 3 vectors, one for each token

These are INSIDE the model, between layers.
Not directly useful for semantic search.

SENTENCE/TEXT EMBEDDINGS
------------------------
One vector per text chunk
"The cat sat" → 1 vector representing whole meaning

These are OUTPUT of specialized embedding models.
Used for semantic search, clustering, classification.

Examples: OpenAI text-embedding-3, Cohere embed, sentence-BERT
How sentence embeddings are created:
Sentence Embedding Methods
Method 1: Mean pooling (average all token vectors)
[v1, v2, v3, v4] → (v1 + v2 + v3 + v4) / 4
Method 2: CLS token (use special token's output)
[CLS] The cat sat [SEP] → use embedding of [CLS]
Method 3: Trained pooler (learned combination)
Model learns optimal way to combine token embeddings
Modern embedding models use Method 3 with contrastive training.
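A minimal sketch of Method 1, mean pooling, with random numbers standing in for the per-token vectors a model would produce:

import numpy as np

# Stand-in for a model's per-token output: 4 tokens, 768 dimensions each
token_vectors = np.random.randn(4, 768)

# Mean pooling: average across the token axis to get one sentence vector
sentence_vector = token_vectors.mean(axis=0)

print(sentence_vector.shape)  # (768,)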
Embedding Dimensions
Common dimension sizes and their tradeoffs:
Embedding Dimension Tradeoffs
┌────────────┬──────────────┬───────────┬─────────────────────┐
│ Dimensions │ Memory/Speed │ Quality   │ Example Models      │
├────────────┼──────────────┼───────────┼─────────────────────┤
│ 384        │ Fast, small  │ Good      │ all-MiniLM-L6-v2    │
│ 768        │ Medium       │ Better    │ BERT, e5-base       │
│ 1024       │ Slower       │ Very good │ e5-large            │
│ 1536       │ Slow         │ Excellent │ text-embedding-3    │
│ 3072       │ Very slow    │ Best      │ text-embedding-3-L  │
└────────────┴──────────────┴───────────┴─────────────────────┘
Higher dimensions = more information capacity
But: diminishing returns, plus higher compute and storage costs (model weight matrices scale roughly quadratically with hidden size)
Practical guidance:
Choosing Embedding Dimensions
Prototype / cost-sensitive: 384 dims (all-MiniLM)
Production / quality-matters: 768-1024 dims (e5-base/large)
Best quality, cost no object: 1536+ dims (OpenAI large)
Most applications: 768 is the sweet spot.
Contextual vs Static Embeddings
Static embeddings (Word2Vec, GloVe): one vector per word, always.
Static Embedding Problem
"I went to the bank to deposit money"
"I sat on the river bank"
Static: "bank" →same vector in both sentences
Problem: completely different meanings!
Contextual embeddings (BERT, GPT, modern): vector depends on context.
Contextual Embeddings Solve This
"I went to the bank to deposit money"
"bank" → [financial institution vector]
"I sat on the river bank"
"bank" → [riverside vector]
Same word, different vectors based on surrounding words.
All modern models use contextual embeddings. Each token’s embedding changes based on what’s around it. This happens through attention (next progression).
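A sketch of this using bert-base-uncased through the Hugging Face transformers library: pull out the hidden state for the token "bank" in each sentence and compare. The exact similarity value varies, but it sits clearly below 1.0 even though the surface word is identical.

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the contextual embedding of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v_money = bank_vector("I went to the bank to deposit money")
v_river = bank_vector("I sat on the river bank")

sim = torch.nn.functional.cosine_similarity(v_money, v_river, dim=0)
print(sim.item())  # same word, noticeably different vectors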
Positional Information
Embeddings alone don’t encode position:
Word Order Problem
"dog bites man" vs "man bites dog"
Same tokens, same embeddings, completely different meaning!
Models add positional encoding:
Positional Encoding
final_embedding = token_embedding + position_embedding
Position embeddings: learned vectors for each position
Position 0: [0.1, -0.2, ...]
Position 1: [0.3, 0.4, ...]
Position 2: [-0.1, 0.5, ...]
Or: sinusoidal functions (original Transformer)
Or: RoPE (rotary position embeddings, modern LLMs)
This lets the model distinguish word order.
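A sketch of the learned-position variant shown above (GPT-2/BERT style), with random matrices standing in for the learned tables:

import numpy as np

SEQ_LEN, DIM, VOCAB_SIZE = 6, 768, 50_257

# Random stand-ins for tables that are learned during training
token_embedding_table    = np.random.randn(VOCAB_SIZE, DIM)
position_embedding_table = np.random.randn(SEQ_LEN, DIM)

token_ids = np.array([464, 3797, 3332, 319, 262, 2603])  # "The cat sat on the mat"

token_embeddings    = token_embedding_table[token_ids]              # (6, 768)
position_embeddings = position_embedding_table[np.arange(SEQ_LEN)]  # (6, 768)

# In models with learned absolute positions, this sum enters the first layer
final_embeddings = token_embeddings + position_embeddings
print(final_embeddings.shape)  # (6, 768)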
Code Example
Using a real embedding model to see embeddings in action:
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a popular embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compute cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Get embeddings
texts = [
    "The cat sat on the mat",
    "A kitten was resting on the rug",
    "Python is a programming language",
    "I love machine learning",
]
embeddings = model.encode(texts)

print(f"Embedding shape: {embeddings.shape}")  # (4, 384)
print(f"First embedding (first 10 dims): {embeddings[0][:10]}")

# Compare similarities
print("\nSimilarity matrix:")
for i, text_i in enumerate(texts):
    for j, text_j in enumerate(texts):
        sim = cosine_similarity(embeddings[i], embeddings[j])
        print(f"  [{i}][{j}] {sim:.3f}", end="")
    print(f"  ← {text_i[:30]}...")

# Expected output:
# Cat sentences similar to each other, different from Python/ML
Key Takeaways
1. Token IDs are arbitrary integers with no inherent meaning
2. Embeddings are dense vectors (384-3072 dims) encoding semantics
3. Meaning emerges from training on context, not explicit rules
4. Similar meaning → similar vectors (measurable with cosine similarity)
5. Modern embeddings are contextual: same word, different vector based on context
6. Position is added separately (positional encoding)
7. Token embeddings ≠ sentence embeddings
- Token: one vector per token, inside the model
- Sentence: one vector per text, output of embedding model
Verify Your Understanding
Before proceeding, you should be able to:
Explain “king - man + woman = queen” without using the words “vector” or “embedding” — A genuine explanation might involve: “Words that appear in similar contexts develop similar internal representations…”
Why do contextual embeddings solve a problem that static embeddings have? — Give a specific example sentence where static embeddings fail.
Given these two sentences:
“The bank was closed for the holiday”
“The river bank was eroded by flooding”
Will a static embedding model give “bank” the same vector in both? Will a contextual model? Why does this matter?
Your embedding model gives similarity = 0.92 for two texts. What does this tell you? List at least two things it does NOT tell you.