
AI Engineering Series

Tokens to Embeddings - Vectors That Capture Meaning

Deep dive into embeddings: why one-hot encoding fails, how meaning emerges from training, measuring similarity, and the difference between token and sentence embeddings

Building On Previous Knowledge

In the previous progression, you learned how text becomes a sequence of token IDs. This created a problem: token ID 3797 is just an arbitrary number. It has no inherent relationship to token ID 3798, even if one means “cat” and the other means “kitten.”

Neural networks need representations where similar meanings are mathematically close. Token IDs don’t provide this.

This progression solves that problem by introducing embeddings: learned vectors where semantic similarity becomes geometric proximity.

What Goes Wrong Without This:

[Figure: Embedding Failure Patterns]

Token IDs Are Meaningless

After tokenization, you have a sequence of integers:

[Figure: Token IDs Have No Meaning]
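
A quick way to see this for yourself. The snippet below assumes the tiktoken package purely for illustration; any tokenizer shows the same effect, and the specific IDs carry no meaning:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["cat", "kitten", "car"]:
    print(f"{word!r} -> {enc.encode(word)}")

# The printed IDs are just indices into a vocabulary. Nothing about the
# numbers themselves says that "cat" and "kitten" are related in meaning,
# or that "cat" and "car" are not.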

Why One-Hot Encoding Fails

The naive approach: represent each token as a vector with one “1” and all other positions “0.”

[Figure: One-Hot Encoding]
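
A minimal NumPy sketch of what that looks like, reusing the hypothetical token IDs from earlier and an illustrative 50,000-token vocabulary:

import numpy as np

vocab_size = 50_000                      # illustrative vocabulary size

def one_hot(token_id: int) -> np.ndarray:
    vec = np.zeros(vocab_size)
    vec[token_id] = 1.0
    return vec

cat, kitten, carburetor = one_hot(3797), one_hot(3798), one_hot(41002)

# Every pair of distinct words is equally unrelated: the dot product is always 0
print(np.dot(cat, kitten))       # 0.0
print(np.dot(cat, carburetor))   # 0.0
print(cat.shape)                 # (50000,): one dimension per vocabulary entry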

Three fatal problems:

Problem 1: No Semantic Information. Every one-hot vector is orthogonal to every other, so "cat" is exactly as far from "kitten" as it is from "carburetor".
Problem 2: Curse of Dimensionality. Each vector is as long as the vocabulary (often 50,000+ positions) and almost entirely zeros.
Problem 3: No Transfer Learning. Nothing the model learns about one word carries over to related words, because their representations share no structure.

Embeddings solve all three:

  • Dense (no wasted dimensions)
  • Semantic (similar words → similar vectors)
  • Transfer (related words share structure)

Embeddings: Vectors That Capture Meaning

An embedding is a dense vector of floating-point numbers representing a concept.

[Figure: What Embeddings Encode]

The Embedding Matrix

Models store embeddings in a lookup table:

[Figure: Embedding Matrix]

Lookup is O(1): given a token ID, grab that row from the matrix.
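
A minimal NumPy sketch of the lookup. The sizes are illustrative, and in a real model the matrix values are learned during training rather than random:

import numpy as np

vocab_size, embed_dim = 50_000, 384
embedding_matrix = np.random.randn(vocab_size, embed_dim).astype(np.float32)  # learned in practice

token_ids = [3797, 3798, 17]                     # output of the tokenizer
token_embeddings = embedding_matrix[token_ids]   # row lookup, O(1) per token
print(token_embeddings.shape)                    # (3, 384)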

How Meaning Emerges

Embeddings aren’t designed. They’re learned from data.

[Figure: Learning Meaning From Context]

The famous Word2Vec result: the vector for "king", minus the vector for "man", plus the vector for "woman", lands very close to the vector for "queen".

[Figure: Vector Arithmetic Captures Relationships]
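
You can reproduce this with classic static word vectors. A sketch assuming the gensim package and its downloadable GloVe vectors (fetched on first use):

import gensim.downloader as api

word_vectors = api.load("glove-wiki-gigaword-50")   # small pretrained GloVe vectors

# king - man + woman ≈ ?
print(word_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" typically appears at or near the top of the results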

Measuring Similarity

Two embeddings count as similar when they are geometrically close. Three metrics come up constantly.

Cosine Similarity

The cosine of the angle between the two vectors: dot(a, b) / (|a| · |b|). It ranges from -1 to 1 and ignores magnitude entirely.

Dot Product

The sum of element-wise products. For vectors normalized to unit length, the dot product equals cosine similarity.

Euclidean Distance

The straight-line distance between the two points, |a - b|. Smaller means more similar.
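
A minimal NumPy sketch comparing the three metrics on the same pair of vectors:

import numpy as np

a = np.array([0.2, 0.9, -0.4])
b = np.array([0.1, 0.8, -0.5])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)
euclidean = np.linalg.norm(a - b)

print(f"cosine:    {cosine:.3f}")    # close to 1.0: very similar direction
print(f"dot:       {dot:.3f}")       # depends on magnitude as well as direction
print(f"euclidean: {euclidean:.3f}") # close to 0.0: very similar position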

When to use which:

| Metric | Best For | Note |
|---|---|---|
| Cosine similarity | Semantic similarity | Direction matters, magnitude doesn't |
| Dot product | Ranking, attention | Faster; magnitude affects result |
| Euclidean distance | Clustering, k-NN | Position in space, not just direction |

Rule of thumb: Use cosine similarity for text embeddings. It’s the standard.


Token vs Sentence Embeddings

Two different things, often confused:

  • Token embeddings: one vector per token; this is what the model computes with internally.
  • Sentence embeddings: one vector for an entire text (a sentence, paragraph, or document), produced by pooling token embeddings; this is what search and retrieval systems store and compare.

How sentence embeddings are created:

[Figure: Sentence Embedding Methods]
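
A common method is mean pooling: average the token vectors into one. A minimal sketch with illustrative shapes (real implementations also weight by the attention mask to ignore padding):

import numpy as np

# Token embeddings for a 7-token sentence from some encoder (illustrative shape)
token_embeddings = np.random.randn(7, 384)

# Mean pooling: average across tokens → one vector for the whole sentence
sentence_embedding = token_embeddings.mean(axis=0)
print(sentence_embedding.shape)  # (384,)

Libraries such as sentence-transformers do this for you: model.encode(...) in the code example below already returns the pooled sentence vector.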

Embedding Dimensions

Common dimension sizes and their tradeoffs:

[Figure: Embedding Dimension Tradeoffs]

Practical guidance:

[Figure: Choosing Embedding Dimensions]
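
The core tradeoff is storage and compute versus representational capacity. A rough back-of-the-envelope sketch with illustrative sizes:

# Rough storage cost of 1 million embeddings stored as float32
num_vectors = 1_000_000
bytes_per_float = 4

for dim in (384, 768, 1536, 3072):
    gigabytes = num_vectors * dim * bytes_per_float / 1e9
    print(f"{dim:>5} dims -> {gigabytes:.1f} GB")

# Similarity search cost also scales linearly with the dimension,
# so higher-dimensional embeddings are slower to compare as well as bigger to store.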

Contextual vs Static Embeddings

Static embeddings (Word2Vec, GloVe): one vector per word, always.

[Figure: Static Embedding Problem]

Contextual embeddings (BERT, GPT, modern): vector depends on context.

[Figure: Contextual Embeddings Solve This]

All modern models use contextual embeddings. Each token’s embedding changes based on what’s around it. This happens through attention (next progression).
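
A sketch of the classic "bank" example using BERT via the transformers library. The model choice and the exact similarity value are illustrative:

from transformers import AutoModel, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the contextual embedding of the 'bank' token in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden_states[tokens.index("bank")]

v1 = bank_vector("the bank was closed for the holiday")
v2 = bank_vector("the river bank was eroded by flooding")

similarity = torch.nn.functional.cosine_similarity(v1, v2, dim=0)
print(f"similarity between the two 'bank' vectors: {similarity.item():.3f}")
# Noticeably below 1.0: the same word gets a different vector in each context.
# A static embedding would give both occurrences the identical vector.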

Positional Information

Embeddings alone don’t encode position. Without something extra, "the dog bit the man" and "the man bit the dog" would look like the same bag of vectors to the model.

[Figure: Word Order Problem]

Models add positional encoding:

[Figure: Positional Encoding]
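
One classic scheme is the sinusoidal encoding from the original Transformer paper; many modern models learn positional embeddings or use rotary variants instead, but the idea is the same: inject position into the vectors the model sees. A minimal NumPy sketch:

import numpy as np

def sinusoidal_positional_encoding(seq_len: int, dim: int) -> np.ndarray:
    """Classic sinusoidal positional encoding (Vaswani et al., 2017)."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, dim, 2)[None, :]           # (1, dim/2)
    angles = positions / (10000 ** (dims / dim))   # (seq_len, dim/2)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Token embeddings and positional encodings are simply added together
token_embeddings = np.random.randn(6, 16)          # 6 tokens, 16 dims (toy sizes)
model_input = token_embeddings + sinusoidal_positional_encoding(6, 16)
print(model_input.shape)                           # (6, 16)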

Code Example

Using a real embedding model to see embeddings in action:

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a popular embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compute cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Get embeddings
texts = [
    "The cat sat on the mat",
    "A kitten was resting on the rug",
    "Python is a programming language",
    "I love machine learning",
]

embeddings = model.encode(texts)

print(f"Embedding shape: {embeddings.shape}")  # (4, 384)
print(f"First embedding (first 10 dims): {embeddings[0][:10]}")

# Compare similarities
print("\nSimilarity matrix:")
for i, text_i in enumerate(texts):
    for j, text_j in enumerate(texts):
        sim = cosine_similarity(embeddings[i], embeddings[j])
        print(f"  [{i}][{j}] {sim:.3f}", end="")
    print(f"  ← {text_i[:30]}...")

# Expected output:
# Cat sentences similar to each other, different from Python/ML

Key Takeaways

  • Token IDs are arbitrary integers; embeddings turn them into dense vectors where similar meanings sit close together.
  • One-hot encoding fails on three counts: no semantics, huge dimensionality, no transfer.
  • Embeddings are learned from context, not designed by hand; an embedding matrix maps each token ID to its vector with an O(1) lookup.
  • Cosine similarity is the default metric for comparing text embeddings.
  • Token embeddings and sentence embeddings are different things; sentence embeddings are pooled from token embeddings.
  • Modern models use contextual embeddings, so the same word gets different vectors in different contexts, plus positional encoding to capture word order.

Verify Your Understanding

Before proceeding, you should be able to:

Explain “king - man + woman = queen” without using the words “vector” or “embedding” — A genuine explanation might involve: “Words that appear in similar contexts develop similar internal representations…”

Why do contextual embeddings solve a problem that static embeddings have? — Give a specific example sentence where static embeddings fail.

Given these two sentences:

  • “The bank was closed for the holiday”
  • “The river bank was eroded by flooding”

Will a static embedding model give “bank” the same vector in both? Will a contextual model? Why does this matter?

Your embedding model gives similarity = 0.92 for two texts. What does this tell you? List at least two things it does NOT tell you.


What’s Next

After this, you can:

  • Continue → Embeddings → Attention — how tokens “look at” each other
  • Build → Semantic search with what you’ve learned
