Same token 'cat', two representations. One teaches the model nothing; the other teaches it everything.
Building On Previous Knowledge
The previous chapter ended with one Takeaway: tokenisation produces a sequence of integers, and bugs at this layer compound through every layer above. Those integers are arbitrary. Token ID 3797 happens to mean “cat”; token ID 3798 happens to mean “catch”; the numbers carry no semantic relationship to each other.
Neural networks need representations where similar meanings are mathematically close. Cosine, dot product, and Euclidean distance all assume the geometry of the input space carries information. Token IDs don’t. This chapter shows the bridge: embeddings — learned dense vectors where semantic similarity becomes geometric proximity.
Where most embedding tutorials stop: they show the king − man + woman ≈ queen analogy from Mikolov 2013 [mikolov2013], wave at “similar meanings have similar vectors”, and ship. They never explain the difference between static embeddings (word2vec, GloVe — one vector per word, always) and contextual embeddings (BERT [devlin2018], modern LLMs — vector depends on context). They never warn you about embedding drift: the silent production bug where re-embedding your corpus with a new model makes every old similarity score meaningless. This chapter delivers both — the geometric intuition the hero diagram makes concrete, and the production-grade caveats the tutorials skip.
Takeaway: embeddings are the bridge from arbitrary integers to learned geometry — and every retrieval, classification, clustering, or RAG system in this series sits on top of that geometry, so the choice of embedding model is a first-order architectural decision.
What Goes Wrong Without This:
- Symptom: Semantic search returns irrelevant results despite high scores. Cause: Embedding model trained on general web text. Your domain uses specialized vocabulary the model doesn't understand.
- Symptom: Multilingual search works poorly. Cause: Embedding model trained primarily on English. Other languages are poorly aligned in the embedding space.
- Symptom: Embedding-based system degrades without code changes. Cause: You updated the embedding model or the provider did. Old embeddings no longer align with new query embeddings. This is called "embedding drift."
Token IDs Are Meaningless
After tokenization, you have a sequence of integers:
"The cat sat on the mat" → [464, 3797, 3332, 319, 262, 2603]
These numbers are arbitrary vocabulary indices.
Token 3797 = "cat"
Token 3798 = "catch"   # no meaningful relationship to 3797
The model needs a representation where "cat" and "kitten" are mathematically close, while "cat" and "democracy" are far apart.
Takeaway: token IDs are vocabulary indices, not features — they carry no relational structure, so any downstream geometry (cosine, dot product, nearest-neighbour search) sees random noise unless you re-encode them into a meaningful space.
Why One-Hot Encoding Fails
The naive approach: represent each token as a vector with one “1” and all other positions “0.”
Vocabulary size: 50,257 tokens
Token "cat" (ID 3797): [0, 0, 0, ..., 1, ..., 0, 0]   (the 1 is at position 3797)
50,257 dimensions. 50,256 zeros. One "1".
Token "kitten" (ID 4521): [0, 0, 0, ..., 1, ..., 0, 0]   (the 1 is at position 4521)
Three fatal problems:
1. No semantic similarity
Similarity between any two different tokens = 0
dot_product(cat, kitten) = 0      # orthogonal!
dot_product(cat, democracy) = 0   # also orthogonal!
Every pair of words is equally "unrelated." The representation contains no meaning.
2. Wasted memory and compute
Vocabulary = 50,000 → 50,000-dimensional vectors
Each vector is 99.998% zeros
Memory: 50,000 x 50,000 x 4 bytes = 10 GB just to store the one-hot vectors
Compute: mostly multiplying by zero
3. No generalisation
Learning about "cat" teaches nothing about "kitten"
They're orthogonal, so completely independent
Must see every word many times to learn anything about it
Embeddings solve all three:
- Dense (no wasted dimensions)
- Semantic (similar words → similar vectors)
- Transfer (related words share structure)
Takeaway: one-hot encoding is the strawman that motivates embeddings. It fails in three independent ways (no semantics, memory that grows with the square of the vocabulary, no transfer), and all three dissolve when you replace the sparse one-hot with a dense learned vector.
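The first two failures can be checked with a few lines of arithmetic. The sketch below uses a toy 10-token vocabulary and made-up token positions (both are assumptions for illustration), plus the rounded 50,000-token figure from the text for the memory comparison. It is plain Python so it runs without dependencies.

```python
def one_hot(token_id: int, vocab_size: int) -> list[float]:
    """Return a vector of zeros with a single 1.0 at position token_id."""
    vec = [0.0] * vocab_size
    vec[token_id] = 1.0
    return vec


def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


VOCAB_SIZE = 10  # toy size so the vectors are easy to inspect
# Hypothetical toy IDs standing in for "cat", "kitten", "democracy".
cat, kitten, democracy = (one_hot(i, VOCAB_SIZE) for i in (3, 4, 7))

cat_kitten = dot(cat, kitten)        # 0.0: orthogonal
cat_democracy = dot(cat, democracy)  # 0.0: just as orthogonal
cat_cat = dot(cat, cat)              # 1.0: a token only matches itself

# float32 storage at the text's rounded vocabulary size.
one_hot_bytes = 50_000 * 50_000 * 4  # every token as a 50,000-dim one-hot
dense_bytes = 50_000 * 768 * 4       # every token as a 768-dim dense vector

print(cat_kitten, cat_democracy, cat_cat)
print(f"one-hot: {one_hot_bytes / 1e9:.1f} GB, dense: {dense_bytes / 1e6:.1f} MB")
```

The dot products show that one-hot vectors cannot rank "kitten" above "democracy" for a "cat" query; the byte counts show the gap between a one-hot table and a dense table of the same vocabulary.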
Embeddings: Vectors That Capture Meaning
An embedding is a dense vector of floating-point numbers representing a concept.
Token ID 3797 ("cat") → [0.23, -0.41, 0.89, 0.12, ..., -0.33]   (768 dimensions)
This vector encodes everything the model learned about "cat":
• It's an animal
• It's furry
• It's a pet
• It appears in similar contexts as "dog", "kitten", "pet"
All of this compressed into ~768 numbers.
The Embedding Matrix
Models store embeddings in a lookup table:
Embedding Matrix (vocab_size x embedding_dim)
| Token ID | Embedding vector | Token |
|---|---|---|
| 0 | [0.12, -0.34, 0.56, ..., 0.78] | "the" |
| 1 | [0.23, 0.45, -0.67, ..., 0.89] | "a" |
| 2 | [-0.11, 0.22, 0.33, ..., -0.44] | "is" |
| ... | ... | ... |
| 50256 | [0.91, -0.82, 0.73, ..., 0.64] | last token |
GPT-2: 50,257 tokens x 768 dimensions = 38.6M parameters
Lookup is O(1): given token ID, grab that row from the matrix.
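A minimal sketch of that lookup, assuming a toy 6 x 4 matrix filled with random numbers in place of learned weights. It also checks that indexing a row gives the same result as multiplying a one-hot vector by the matrix, which is why the lookup can replace the one-hot representation entirely. The function names are illustrative, not from any library.

```python
import random

random.seed(0)
VOCAB_SIZE, EMBED_DIM = 6, 4  # toy sizes; the text quotes 50,257 x 768 for GPT-2

# One row of EMBED_DIM floats per token ID. Random values stand in for learned weights.
embedding_matrix = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids: list[int]) -> list[list[float]]:
    """Look up one row per token ID: constant work per token."""
    return [embedding_matrix[t] for t in token_ids]


def one_hot_matmul(token_id: int) -> list[float]:
    """The slow equivalent: a one-hot row vector multiplied by the matrix."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(VOCAB_SIZE)]
    return [
        sum(one_hot[i] * embedding_matrix[i][j] for i in range(VOCAB_SIZE))
        for j in range(EMBED_DIM)
    ]


vectors = embed([2, 5, 2])   # a 3-token sequence; token 2 appears twice
gpt2_params = 50_257 * 768   # size of the GPT-2 table quoted in the text

print(len(vectors), len(vectors[0]))
print(f"{gpt2_params:,} parameters")
```

The repeated token ID returns the identical row both times, which is the static, context-free behaviour that later layers build on.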
How Meaning Emerges
Embeddings aren’t designed. They’re learned from data.
Training objective: predict next token (or masked token)
"The cat sat on the ___"
Model sees millions of examples where:
• "cat" appears near "dog", "pet", "furry", "meow"
• "cat" appears after "the", "a", "my"
• "cat" appears before "sat", "slept", "ran"
Gradient descent adjusts embeddings so:
• Similar context → similar embeddings
• "cat" and "kitten" vectors become close
• "cat" and "democracy" vectors stay far
The famous Word2Vec result:
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
Embeddings capture relationships:
king:queen :: man:woman
paris:france :: tokyo:japan
This emerges from context, not explicit programming.
Mikolov et al. 2013 [mikolov2013] introduced the two architectures that made this practical at scale — CBOW (Continuous Bag of Words) predicts the centre word from its context, and Skip-gram predicts the surrounding context from the centre word. Both train on raw text without any labelled relationship data. The king-queen analogy emerges as a side effect of the prediction objective — nobody programs it; gradient descent discovers that “royalty” is a direction in vector space.
Takeaway: embeddings are not designed — they are learned from context. The training objective is “predict the next (or masked) token”, and the geometry that emerges encodes relationships the model was never explicitly taught.
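The analogy arithmetic can be traced by hand on a toy example. The vectors below are hand-built 2-D stand-ins (axis 0 roughly "royalty", axis 1 roughly "gender"), not learned word2vec weights, so the sketch shows only how the vector arithmetic and nearest-neighbour step work, not that a trained model produces these numbers.

```python
import math

# Hypothetical hand-built vectors: [royalty, gender]. Real embeddings are learned
# and have hundreds of dimensions with no labelled axes.
toy = {
    "king":  [0.9,  0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1,  0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.7, 0.0],
}


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a: str, b: str, c: str) -> str:
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(toy[a], toy[b], toy[c])]
    candidates = [w for w in toy if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(toy[w], target))


result = analogy("king", "man", "woman")
print(result)
```

Excluding the three input words from the candidates is the usual convention for analogy tests; without it the nearest neighbour is often one of the inputs.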
Measuring Similarity
Two vectors are similar if they point in similar directions.
Cosine Similarity
Cosine similarity: measure the angle between vectors
1.0 = identical direction (very similar)
0.0 = orthogonal (unrelated)
-1.0 = opposite direction (antonyms, sometimes)
sim("cat", "kitten") ≈ 0.85      # very similar
sim("cat", "dog") ≈ 0.75         # related but different
sim("cat", "democracy") ≈ 0.12   # unrelated
Dot Product
Most embedding models normalize vectors to unit length.
When ||v|| = 1 for all vectors:
cosine_similarity(a, b) = dot_product(a, b)
This makes similarity computation fast: just matrix multiply.
Euclidean Distance
Euclidean distance: straight-line distance in vector space
distance("cat", "kitten") ≈ 0.3      # close together
distance("cat", "democracy") ≈ 1.8   # far apart
Lower distance = more similar (opposite of cosine)
When to use which:
| Metric | Best For | Note |
|---|---|---|
| Cosine similarity | Semantic similarity | Direction matters, magnitude doesn’t |
| Dot product | Ranking, attention | Faster; magnitude affects result |
| Euclidean distance | Clustering, k-NN | Position in space, not just direction |
Rule of thumb: Use cosine similarity for text embeddings. It’s the standard.
Takeaway: cosine measures direction, dot product is cosine for unit-normalised vectors, Euclidean measures position. For text similarity, normalise + dot product is the production-grade choice — same answer as cosine, faster matmul.
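The relationships between the three metrics can be verified numerically. This is a small sketch in plain Python with two made-up 2-D vectors; it checks that cosine ignores magnitude while dot product does not, that dot product equals cosine once vectors are unit length, and that for unit vectors squared Euclidean distance is 2 − 2·cos, so all three give the same ranking after normalisation.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: list[float]) -> float:
    return math.sqrt(dot(a, a))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (norm(a) * norm(b))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(a: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


a = [3.0, 4.0]
b = [4.0, 3.0]
a_scaled = [30.0, 40.0]  # same direction as a, ten times longer

cos_ab = cosine_similarity(a, b)             # 24 / 25 = 0.96
cos_scaled = cosine_similarity(a_scaled, b)  # unchanged by the rescale
dot_ab = dot(a, b)                           # 24.0
dot_scaled = dot(a_scaled, b)                # 240.0: magnitude leaks into the score

a_hat, b_hat = normalize(a), normalize(b)
dot_unit = dot(a_hat, b_hat)                 # equals cos_ab for unit vectors
dist_unit = euclidean_distance(a_hat, b_hat) # dist^2 = 2 - 2 * cos for unit vectors

print(round(cos_ab, 4), round(dot_unit, 4), round(dist_unit ** 2, 4))
```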
Token vs Sentence Embeddings
Two different things, often confused:
TOKEN EMBEDDINGS
One vector per token
"The cat sat" → 3 vectors, one for each token
These are INSIDE the model, between layers.
Not directly useful for semantic search.
SENTENCE/TEXT EMBEDDINGS
One vector per text chunk
"The cat sat" → 1 vector representing whole meaning
These are OUTPUT of specialized embedding models.
Used for semantic search, clustering, classification.
Examples: OpenAI text-embedding-3, Cohere embed, sentence-BERT
How sentence embeddings are created:
Method 1: Mean pooling (average all token vectors)
[v1, v2, v3, v4] → (v1 + v2 + v3 + v4) / 4
Method 2: CLS token (use special token's output)
[CLS] The cat sat [SEP] → use embedding of [CLS]
Method 3: Trained pooler (learned combination)
Model learns optimal way to combine token embeddings
Many current embedding models keep the pooling simple (Method 1 or 2) and rely on contrastive fine-tuning of the encoder to make the pooled vector useful.
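Methods 1 and 2 are short enough to sketch directly. The example below assumes the token vectors have already come out of some encoder (the numbers are made up) and that an attention mask marks padding with 0, which is the common convention in batched inference. The function names are illustrative.

```python
def mean_pool(token_vectors: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average the vectors of real tokens; positions with mask 0 (padding) are ignored."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(v[j] for v in kept) / len(kept) for j in range(dim)]


def cls_pool(token_vectors: list[list[float]]) -> list[float]:
    """Use the output at position 0, where BERT-style models place [CLS]."""
    return token_vectors[0]


# Made-up 2-dim encoder outputs for "[CLS] cat sat [PAD]".
token_vectors = [
    [0.0, 2.0],  # [CLS]
    [1.0, 0.0],  # "cat"
    [2.0, 4.0],  # "sat"
    [9.0, 9.0],  # padding: must not influence the result
]
attention_mask = [1, 1, 1, 0]

sentence_vec = mean_pool(token_vectors, attention_mask)
cls_vec = cls_pool(token_vectors)
print(sentence_vec, cls_vec)
```

Forgetting the mask is a frequent bug: padding vectors get averaged in, and the same sentence then embeds differently depending on the batch it was padded with.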
Reimers & Gurevych 2019 [reimers2019] introduced Sentence-BERT (SBERT), a siamese network that fine-tunes BERT for sentence-level similarity. Raw BERT scores a pair only by running both sentences through the network together, so finding the most similar pair among 10,000 sentences takes about 50 million inference passes (10,000 × 9,999 / 2). SBERT embeds each sentence once and compares the resulting vectors directly. The headline production result is the speedup: that search drops from approximately 65 hours with raw BERT to about 5 seconds with SBERT, while preserving accuracy. That’s ~46,000×, and it’s a large part of why dense semantic search runs on sentence-level embeddings rather than token-level comparisons [reimers2019].
Common dimension sizes and their tradeoffs:
| Dimensions | Memory/Speed | Quality | Example Models |
|---|---|---|---|
| 384 | Fast, small | Good | all-MiniLM-L6-v2 |
| 768 | Medium | Better | BERT, e5-base |
| 1024 | Slower | Very good | e5-large |
| 1536 | Slow | Excellent | text-embedding-3-small |
| 3072 | Very slow | Best | text-embedding-3-large |
Higher dimensions = more information capacity. But: diminishing returns, and index memory and similarity compute grow linearly with dimension.
Practical guidance:
Prototype / cost-sensitive: 384 dims (all-MiniLM)
Production / quality-matters: 768-1024 dims (e5-base/large)
Best quality, cost no object: 1536+ dims (OpenAI text-embedding-3)
Most applications: 768 is the sweet spot.
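The memory side of this tradeoff is simple arithmetic. The sketch below assumes raw float32 storage (4 bytes per value) and counts only the vectors themselves; real vector indexes add overhead for graph or cluster structures, and quantisation can shrink the figure, so treat the output as a rough baseline rather than a sizing guide.

```python
def index_size_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw storage for num_vectors embeddings of width dims, in gigabytes (10^9 bytes)."""
    return num_vectors * dims * bytes_per_value / 1e9


NUM_VECTORS = 1_000_000  # hypothetical corpus size after chunking

for dims in (384, 768, 1024, 1536, 3072):
    print(f"{dims:>5} dims: {index_size_gb(NUM_VECTORS, dims):.2f} GB per million vectors")

ratio = index_size_gb(NUM_VECTORS, 3072) / index_size_gb(NUM_VECTORS, 768)
```

Storage scales linearly with dimension, so moving from 768 to 3,072 dimensions multiplies the raw footprint by four for the same corpus.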
Production references at the three tiers: all-MiniLM-L6-v2 [sbert-model] for the 384-dim fast baseline; text-embedding-3-small (1,536-dim) and text-embedding-3-large (3,072-dim) from OpenAI [openai-embeddings] for high-quality dense retrieval; and BGE-M3 [bge-m3] when the corpus is multilingual or you need a hybrid (dense + sparse + multi-vector) signal in a single model. Production-agents [pa-overview] covers the operator-grade patterns for embedding rollouts (model-versioning, atomic re-indexing, blue/green index cutover).
Takeaway: token embeddings live inside the model (one per token position, used by attention); sentence embeddings live outside the model (one per text chunk, used by retrieval). Confusing the two is the most common boundary mistake at the embeddings/RAG layer — every production RAG system uses sentence-level dense embeddings, not token-level.
Contextual vs Static Embeddings
Static embeddings (Word2Vec, GloVe): one vector per word, always.
"I went to the bank to deposit money"
"I sat on the river bank"
Static: "bank" → same vector in both sentences
Problem: completely different meanings!
Contextual embeddings (BERT, GPT, modern): vector depends on context.
"I went to the bank to deposit money"
"bank" → [financial institution vector]
"I sat on the river bank"
"bank" → [riverside vector]
Same word, different vectors based on surrounding words.
All modern models use contextual embeddings. Each token’s embedding changes based on what’s around it. The mechanism that makes this work — attention — is the subject of Ch02; BERT [devlin2018] was the model that proved at scale that contextual representations outperform static ones on every downstream task that benefits from disambiguation.
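A toy sketch can make the static/contextual difference concrete without a real model. Here the "contextual" function simply blends each token's static vector with the sentence average; that blend is a deliberately crude stand-in for attention, chosen only so the example stays a few lines long. The vectors and function names are made up for illustration.

```python
# Hypothetical 2-D static vectors: axis 0 leans "finance", axis 1 leans "nature".
static = {
    "the":   [0.0, 0.0],
    "money": [1.0, 0.0],
    "river": [0.0, 1.0],
    "bank":  [0.5, 0.5],  # ambiguous on its own
}


def static_embed(tokens: list[str]) -> list[list[float]]:
    """One fixed vector per word, regardless of the sentence."""
    return [static[t] for t in tokens]


def toy_contextual_embed(tokens: list[str]) -> list[list[float]]:
    """Blend each static vector 50/50 with the sentence mean.

    This is NOT attention; it only mimics the effect that neighbours shift a token's vector.
    """
    vecs = static_embed(tokens)
    dim = len(vecs[0])
    mean = [sum(v[j] for v in vecs) / len(vecs) for j in range(dim)]
    return [[0.5 * v[j] + 0.5 * mean[j] for j in range(dim)] for v in vecs]


finance_sentence = ["the", "money", "bank"]
river_sentence = ["the", "river", "bank"]

static_same = static_embed(finance_sentence)[2] == static_embed(river_sentence)[2]
bank_finance = toy_contextual_embed(finance_sentence)[2]
bank_river = toy_contextual_embed(river_sentence)[2]

print(static_same)
print(bank_finance, bank_river)
```

The static lookup returns the identical "bank" vector in both sentences, while the context-mixed version leans toward the finance axis in one and the nature axis in the other.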
Positional Information
Embeddings alone don’t encode position:
"dog bites man" vs "man bites dog"
Same tokens, same embeddings, completely different meaning!
Models add positional encoding:
final_embedding = token_embedding + position_embedding
Position embeddings: learned vectors for each position
Position 0: [0.1, -0.2, ...]
Position 1: [0.3, 0.4, ...]
Position 2: [-0.1, 0.5, ...]
Or: sinusoidal functions (original Transformer)
Or: RoPE (rotary position embeddings, modern LLMs; applied as a rotation of the query and key vectors inside attention rather than added to the token embedding)
This lets the model distinguish word order.
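The sinusoidal variant is the easiest to sketch because it needs no learned weights. The code below follows the formulation from the original Transformer paper (sine on even indices, cosine on odd indices, wavelengths set by the constant 10000) and adds it to made-up 4-dim token embeddings; the token vectors and helper names are assumptions for illustration.

```python
import math


def sinusoidal_position(pos: int, dim: int) -> list[float]:
    """Sinusoidal encoding for one position: sin on even indices, cos on odd indices."""
    vec = []
    for i in range(dim):
        # Each sin/cos pair (indices 2k and 2k+1) shares the exponent 2k / dim.
        angle = pos / (10000 ** ((i - i % 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def add_positions(token_embeddings: list[list[float]]) -> list[list[float]]:
    """final_embedding = token_embedding + position_embedding, per position."""
    dim = len(token_embeddings[0])
    return [
        [t + p for t, p in zip(tok, sinusoidal_position(pos, dim))]
        for pos, tok in enumerate(token_embeddings)
    ]


# Made-up static token embeddings.
emb = {
    "dog":   [0.1, 0.2, 0.3, 0.4],
    "bites": [0.5, 0.6, 0.7, 0.8],
    "man":   [0.9, 1.0, 1.1, 1.2],
}

s1 = add_positions([emb[w] for w in ("dog", "bites", "man")])
s2 = add_positions([emb[w] for w in ("man", "bites", "dog")])

print(sinusoidal_position(0, 4))
print(s1[0] != s2[2])  # "dog" at position 0 vs "dog" at position 2
```

Without the position term the two sentences are the same bag of three vectors; with it, "dog" at position 0 and "dog" at position 2 are different inputs to the model.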
Takeaway: static embeddings (word2vec, GloVe) assign one vector per word forever; contextual embeddings (BERT and every modern LLM) assign a vector per token in context. The disambiguation of polysemy (“bank” as river vs financial institution) is impossible without contextual embeddings — and that’s exactly why no greenfield production search system has shipped on static embeddings since ~2019.
Code Example
A semantic-search comparison across four short texts, with the library versions it was tested on pinned in the header comments. The output makes the geometry concrete: the cat sentences cluster, the ML sentences cluster, and the two clusters are far apart:
# Tested on:
#   sentence-transformers==3.0.1
#   numpy==1.26.4
#   Python 3.11
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim, fast, public baseline
def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
texts = [
    "The cat sat on the mat",             # 0
    "A kitten was resting on the rug",    # 1: paraphrase of 0
    "Python is a programming language",   # 2
    "I love machine learning",            # 3: same domain as 2
]
# Tip: pass normalize_embeddings=True so dot product == cosine, which allows a plain matmul later.
embeddings = model.encode(texts, normalize_embeddings=True)
print(f"shape: {embeddings.shape}")  # (4, 384)
print("\ncosine similarity matrix:")
for i, t_i in enumerate(texts):
    # One row of the matrix: similarity of text i to every text j.
    row = " ".join(f"{cosine_sim(embeddings[i], embeddings[j]):.3f}" for j in range(len(texts)))
    print(f" {row} ← {t_i[:34]}")
# Expected output (approximate, exact values depend on model weights):
# 1.000 0.692 0.094 0.071 ← The cat sat on the mat
# 0.692 1.000 0.087 0.064 ← A kitten was resting on the rug
# 0.094 0.087 1.000 0.481 ← Python is a programming language
# 0.071 0.064 0.481 1.000 ← I love machine learning
#
# The two cat sentences score about 0.69 (semantically close).
# Cat vs Python sentences score below 0.10 (semantically far).
# This is the geometry one-hot encoding cannot produce.
Common Pitfalls & Misconceptions
| Misconception | Why it’s wrong | What to do instead |
|---|---|---|
| “Embeddings are just numbers, so any model will do” | Different embedding models live in different vector spaces. An embedding from all-MiniLM-L6-v2 (384-dim) and one from text-embedding-3-large (3,072-dim) are not comparable, can’t be averaged, and don’t fit in the same index. | Treat the embedding model as part of the index schema. Pin the model version. When changing models, re-embed the whole corpus, never part of it. |
| “I’ll upgrade my embedding model in production without re-indexing” | This is embedding drift. Old chunks live in the old vector space; new queries live in the new space; cosine similarity is meaningless across them. Retrieval silently degrades to noise. | Re-embed the entire corpus atomically. Run new + old indexes in parallel during cutover. Score both, switch when new index matches or beats old. |
| “Static embeddings (word2vec, GloVe) are fine for retrieval” | Static embeddings give “bank” one vector forever, so river bank and financial bank collide. Modern retrieval needs polysemy disambiguation, which only contextual embeddings provide. | Use a contextual sentence-embedding model (all-MiniLM-L6-v2, text-embedding-3-*, BGE, etc.). Reserve word2vec/GloVe for the historical record. |
| “Higher embedding dimension = better quality, always” | Returns diminish above ~768 dims for most English NL tasks. 3,072-dim embeddings cost 4× the memory and 4× the matmul of 768-dim, often with only marginal recall gains. | Start at 384 to 768 dims. Move higher only when you measure a retrieval-quality bottleneck and the higher-dim model is the cheapest fix. |
| “Cosine similarity = 0.92 means the texts mean the same thing” | High cosine means the model thinks they’re similar in this space, not that humans agree they mean the same. Two abstracts from different papers can score 0.92 because they share domain vocabulary. | Calibrate on labelled pairs from your domain. Threshold values are corpus-specific, so set them empirically, not by intuition. |
| “Mean pooling of raw token embeddings = sentence embedding” | Averaging the outputs of a BERT that was never fine-tuned for similarity works in a pinch, but Reimers & Gurevych 2019 [reimers2019] showed it underperforms a fine-tuned sentence encoder by a wide margin on STS tasks, often scoring below averaged GloVe vectors. The pooling step is not the problem (SBERT itself mean-pools by default); the missing fine-tuning is. | Use a dedicated sentence-embedding model (SBERT, text-embedding-3-*, Cohere embed-v3). Don’t mean-pool raw BERT or GPT token embeddings. |
| “Multilingual embedding model = good for every language” | Multilingual models are trained on a mix that heavily over-represents English. Low-resource languages get poorly-aligned embeddings, and semantic search fails silently. | Evaluate on a held-out test set per language. Consider models built for multilingual retrieval (e.g. Cohere multilingual, BGE-M3), or a language-specific model where one exists, when serving non-English corpora. |
Takeaway: every embedding bug in production traces to one of seven things — and embedding drift (row 2) is the silent one that breaks systems weeks after the change that caused it.
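One cheap defence against rows 1 and 2 is to store the embedding model's identifier alongside the index and reject any vector that claims a different one. The class below is a hypothetical in-memory sketch, not the API of any vector database: `VectorIndex`, its fields, and the model names are all invented for illustration, and a real system would persist the identifier in the index metadata.

```python
from dataclasses import dataclass, field


@dataclass
class VectorIndex:
    """Hypothetical in-memory index that records which model produced its vectors."""
    model_id: str
    dims: int
    vectors: dict[str, list[float]] = field(default_factory=dict)

    def _check(self, vector: list[float], model_id: str) -> None:
        """Refuse vectors from another model or of another width."""
        if model_id != self.model_id:
            raise ValueError(f"index built with {self.model_id!r}, got {model_id!r}")
        if len(vector) != self.dims:
            raise ValueError(f"expected {self.dims} dims, got {len(vector)}")

    def add(self, doc_id: str, vector: list[float], model_id: str) -> None:
        self._check(vector, model_id)
        self.vectors[doc_id] = vector

    def search(self, query: list[float], model_id: str, top_k: int = 3) -> list[str]:
        """Rank documents by dot product (assumes unit-normalised vectors)."""
        self._check(query, model_id)
        scored = [
            (sum(q * d for q, d in zip(query, vec)), doc_id)
            for doc_id, vec in self.vectors.items()
        ]
        scored.sort(reverse=True)
        return [doc_id for _, doc_id in scored[:top_k]]


index = VectorIndex(model_id="model-a@v1", dims=2)
index.add("doc-cats", [1.0, 0.0], model_id="model-a@v1")
index.add("doc-python", [0.0, 1.0], model_id="model-a@v1")

hits = index.search([0.8, 0.6], model_id="model-a@v1")
print(hits)

try:
    index.search([0.8, 0.6], model_id="model-b@v1")  # simulated model upgrade
except ValueError as err:
    print(f"rejected: {err}")
```

The check turns silent drift into a loud failure at query time. It does not catch a provider changing weights behind an unchanged model name, which is why pinned versions and a small regression set of known query/result pairs are still needed.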
Verify Your Understanding
Before continuing, you should be able to answer these from memory:
- Explain king − man + woman ≈ queen without using the words “vector” or “embedding”. A real explanation involves: “Words that appear in similar contexts develop similar internal representations, and gendered vs non-gendered turns out to be a direction the model learns…”
- Why do contextual embeddings solve a problem static embeddings can’t? Give two concrete sentences where the same word should have different representations and explain what static vs contextual produces for each. Name the underlying mechanism.
- The “bank” disambiguation test. “The bank was closed for the holiday” vs “The river bank was eroded by flooding.” For each, name what a static model produces, what a contextual model produces, and one downstream task (search? classification?) where the difference is load-bearing.
- Two texts score cos = 0.92. List three things this tells you and two things it explicitly does not tell you. Why is the threshold for “same meaning” corpus-specific?
- Embedding-drift diagnosis. Your semantic search worked great on Friday and returns garbage on Monday. Nothing changed in your code. Walk through three concrete hypotheses for why this might happen at the embeddings layer, and name the operational defence that catches each one.
What’s Next
Embeddings turn arbitrary token IDs into a geometry of meaning. The next chapter, Embeddings → Attention, picks up the dense vectors and shows the mechanism that lets every token’s representation incorporate information from every other token: scaled dot-product attention, the √d_k divisor that keeps the softmax from saturating as dimensions grow, and multi-head attention as the engine of contextualisation.
References
- [mikolov2013] Mikolov, T., Chen, K., Corrado, G., Dean, J. Efficient Estimation of Word Representations in Vector Space. ICLR Workshop 2013. arXiv:1301.3781. Introduced the CBOW and Skip-gram architectures and the famous king − man + woman ≈ queen analogy. The paper that proved geometry can encode semantics at scale. Cited in §§ Building On Previous Knowledge, Embeddings: Vectors That Capture Meaning.
- [reimers2019] Reimers, N., Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019. arXiv:1908.10084. Siamese / triplet network fine-tuning that drops the most-similar-pair search over 10,000 sentences from ~65 hours to ~5 seconds (~46,000× speedup). The architecture every modern sentence-embedding model traces back to. Cited in §§ Token vs Sentence Embeddings, Common Pitfalls & Misconceptions.
- [devlin2018] Devlin, J. et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805. The model that proved at scale that contextual representations outperform static embeddings on every disambiguation-sensitive downstream task. Cited in §§ Building On Previous Knowledge, Contextual vs Static Embeddings.
- [sbert-model] Reimers, N. all-MiniLM-L6-v2. Sentence-Transformers / Hugging Face: huggingface.co/sentence-transformers/all-MiniLM-L6-v2. 384-dim, Apache-2.0-licensed, fast public baseline; the model used in this chapter’s Code Example. Cited in §§ Token vs Sentence Embeddings, Code Example.
- [openai-embeddings] OpenAI. text-embedding-3 family. platform.openai.com/docs/guides/embeddings. 1,536-dim (text-embedding-3-small) and 3,072-dim (text-embedding-3-large); the production reference for high-quality dense retrieval. Cited in §§ Token vs Sentence Embeddings, Common Pitfalls & Misconceptions.
- [bge-m3] Chen, J. et al. BGE-M3. arXiv:2402.03216. Multilingual + multi-granular embedding model (dense + sparse + multi-vector); the production reference for non-English retrieval. Cited in §§ Token vs Sentence Embeddings, Common Pitfalls & Misconceptions.
- [pa-overview] Production Agents — Part 0: Overview. The operator-grade companion series; embedding-drift, model-versioning, and re-indexing strategies appear there as production patterns. Cross-series bridge.