Math Intuitions | Concepts

TL;DR

Embeddings are vectors in high-dimensional space where similar meanings cluster together. Understanding dot products, cosine similarity, and matrix multiplication is essential for working with embeddings and attention mechanisms.

Visual Overview

Embedding Space

EMBEDDING SPACE (visualized in 2D, real embeddings are 384-4096 dims)
┌───────────────────────────────────────────────────────────┐
│                                                           │
│         │                                                 │
│     cat *           * dog                                 │
│         │                                                 │
│         │        * puppy                                  │
│         │                                                 │
│   ──────┼──────────────────────────                       │
│         │                                                 │
│         │  * car      * truck                             │
│         │                                                 │
│         │         * vehicle                               │
│                                                           │
│   Semantic similarity = geometric proximity               │
│   "cat" is closer to "dog" than to "car"                  │
│                                                           │
└───────────────────────────────────────────────────────────┘

Key insight: When a model converts text to embeddings, it’s placing words/sentences at coordinates in a space where distance = meaning difference.

What Dimensions Represent

Dimensions

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Each dimension captures some learned feature.           │
│                                                           │
│   Hypothetical (models don't label dimensions):           │
│     Dimension 1: animate vs inanimate                     │
│     Dimension 2: size                                     │
│     Dimension 3: domesticated vs wild                     │
│     ...                                                   │
│     Dimension 768: ???                                    │
│                                                           │
│   In practice: Dimensions aren't interpretable            │
│   individually. The geometry of relationships is what     │
│   matters.                                                │
│                                                           │
└───────────────────────────────────────────────────────────┘

Dot Product

The dot product is the fundamental operation in neural networks. Attention, similarity, and layer computations all use it.

Dot Product

DOT PRODUCT FORMULA
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   a · b = SUM(a_i × b_i)                                  │
│                                                           │
│   Example (3D vectors):                                   │
│     a = [3, 4, 0]                                         │
│     b = [2, 1, 2]                                         │
│                                                           │
│     a · b = (3×2) + (4×1) + (0×2) = 6 + 4 + 0 = 10        │
│                                                           │
└───────────────────────────────────────────────────────────┘
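The formula above can be sketched in a few lines of plain Python. The function name `dot` is illustrative; numerical libraries provide their own optimized equivalents.

```python
def dot(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


# The 3D example from the box above.
a = [3, 4, 0]
b = [2, 1, 2]
print(dot(a, b))  # 10
```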

GEOMETRIC MEANING
┌───────────────────────────────────────────────────────────┐
│                                                           │
│ a · b = |a| × |b| × cos(θ)                                │
│                                                           │
│ Where:                                                    │
│ |a| = length of vector a                                  │
│ |b| = length of vector b                                  │
│ θ = angle between vectors                                 │
│                                                           │
└───────────────────────────────────────────────────────────┘

DOT PRODUCT SIGN
┌───────────────────────────────────────────────────────────┐
│                                                           │
│              b                                            │
│              ↑                                            │
│              │  a·b > 0 (similar direction)               │
│     ─────────┼──────→ a                                   │
│              │                                            │
│              │  a·b < 0 (opposite direction)              │
│              ↓                                            │
│                                                           │
│ Same direction (0°): cos(0°) = 1 → positive               │
│ Perpendicular (90°): cos(90°) = 0 → zero                  │
│ Opposite (180°): cos(180°) = -1 → negative                │
│                                                           │
└───────────────────────────────────────────────────────────┘
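To connect the sign of the dot product to the angle, the geometric identity can be rearranged to θ = arccos(a · b / (|a| × |b|)). The sketch below is a minimal illustration with made-up 2D vectors; the helper names `dot`, `norm`, and `angle_degrees` are assumptions for this example only.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Length of a vector: the square root of its dot product with itself."""
    return math.sqrt(dot(a, a))


def angle_degrees(a, b):
    """Angle between two non-zero vectors, recovered from a·b = |a||b|cos(θ)."""
    cos_theta = dot(a, b) / (norm(a) * norm(b))
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding drift
    return math.degrees(math.acos(cos_theta))


a = [1.0, 0.0]
for b in ([2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]):
    # Prints the dot product (positive, zero, negative) next to the angle.
    print(b, dot(a, b), angle_degrees(a, b))
```

For the earlier 3D example, a = [3, 4, 0] and b = [2, 1, 2] have lengths 5 and 3, so cos(θ) = 10 / 15 and the angle is roughly 48°.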

In attention: Query · Key computes relevance. A high dot product means this key is relevant to this query.

Cosine Similarity

Cosine similarity is a normalized dot product. It measures direction alignment, ignoring magnitude.

Cosine Similarity

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   cos_sim(a, b) = (a · b) / (|a| × |b|)                   │
│                                                           │
│   Range: [-1, 1]                                          │
│      1.0 = identical direction (parallel)                 │
│      0.0 = orthogonal (unrelated)                         │
│     -1.0 = opposite direction (antonyms, in some spaces)  │
│                                                           │
└───────────────────────────────────────────────────────────┘
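A minimal sketch of the formula above, using only the standard library. The function name `cosine_similarity` and the sample vectors are illustrative; the zero-vector check is an added safeguard, since the formula is undefined when either length is 0.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


v = [1.0, 2.0]
print(cosine_similarity(v, [10.0, 20.0]))  # same direction, 10x longer: about 1
print(cosine_similarity(v, [-2.0, 1.0]))   # perpendicular: 0
print(cosine_similarity(v, [-1.0, -2.0]))  # opposite direction: about -1
```

Note that scaling a vector by 10 leaves the result unchanged, which is the "ignoring magnitude" property described above.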

WHY NORMALIZE?
┌───────────────────────────────────────────────────────────┐
│                                                           │
│ WITHOUT NORMALIZATION:                                    │
│                                                           │
│ Vector lengths vary:                                      │
│ "king" might have ‖v‖ = 10                                │
│ "queen" might have ‖v‖ = 8                                │
│                                                           │
│ Raw dot product:                                          │
│ king · queen = 76                                         │
│ king · dog = 80 ← Higher! But "dog" isn't                 │
│ more similar                                              │
│                                                           │
│ Problem: Length dominates, not direction.                 │
│                                                           │
│ WITH NORMALIZATION:                                       │
│                                                           │
│ Cosine similarity:                                        │
│ cos(king, queen) = 0.95                                   │
│ cos(king, dog) = 0.30                                     │
│                                                           │
│ Now direction dominates. "queen" is more similar.         │
│                                                           │
└───────────────────────────────────────────────────────────┘

In practice: Many embedding models output normalized vectors (length = 1), or can be configured to. When both vectors are normalized, dot product = cosine similarity.
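The equivalence can be checked directly: dividing each vector by its length first, then taking a plain dot product, gives the same number as the cosine formula. This is a small sketch with illustrative helper names (`dot`, `normalize`), not any particular library's API.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a non-zero vector to length 1, keeping its direction."""
    length = math.sqrt(dot(v, v))
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in v]


a = [3.0, 4.0, 0.0]  # length 5
b = [2.0, 1.0, 2.0]  # length 3

# Cosine similarity computed from the raw vectors.
cos_ab = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Plain dot product of the unit-length versions.
a_hat, b_hat = normalize(a), normalize(b)
print(cos_ab, dot(a_hat, b_hat))  # both are about 0.667
```

This is why vector stores can use the cheaper dot product for ranking when the stored embeddings are already unit length.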

Distance Metrics

Euclidean Distance (L2)

Euclidean Distance

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   d(a, b) = sqrt(SUM((a_i - b_i)²))                       │
│                                                           │
│   "Straight line" distance in space.                      │
│                                                           │
│         │                                                 │
│       a *                                                 │
│         │                                                 │
│         │  d = 5                                          │
│         │                                                 │
│         │   * b                                           │
│   ──────┴───────                                          │
│                                                           │
└───────────────────────────────────────────────────────────┘
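A short sketch of the distance formula. The diagram does not give coordinates, so the points below are hypothetical values chosen so that the distance comes out to 5.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared differences between coordinates."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical coordinates: a 3-4-5 right triangle.
a = [0.0, 4.0]
b = [3.0, 0.0]
print(euclidean_distance(a, b))  # 5.0
```

On Python 3.8 and later, `math.dist(a, b)` from the standard library computes the same quantity.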

Cosine Distance

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   cos_dist(a, b) = 1 - cos_sim(a, b)                      │
│                                                           │
│   Range: [0, 2]                                           │
│     0 = identical direction                               │
│     1 = orthogonal                                        │
│     2 = opposite direction                                │
│                                                           │
└───────────────────────────────────────────────────────────┘
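The three landmark values in the box can be reproduced with axis-aligned vectors. This sketch assumes non-zero inputs; `cosine_distance` is an illustrative name.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity, so that smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a = [1.0, 0.0]
print(cosine_distance(a, [2.0, 0.0]))   # 0.0 (identical direction)
print(cosine_distance(a, [0.0, 3.0]))   # 1.0 (orthogonal)
print(cosine_distance(a, [-4.0, 0.0]))  # 2.0 (opposite direction)
```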

When to Use What

Metric	When	Why
Cosine similarity	Text embeddings	Direction = semantic meaning
Cosine distance	Retrieval ranking	Lower = more similar
Euclidean (L2)	Some image embeddings	Magnitude can carry info
Dot product	Normalized vectors	Fast, equals cosine sim

Default choice: Cosine similarity for text. Most text embedding models are trained with a cosine or dot-product objective; check the model's documentation for the metric it expects.
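As a rough illustration of retrieval ranking, the sketch below scores a query against a handful of stored vectors and sorts by cosine similarity. The 2D coordinates are invented for this example (real embeddings come from a model and have hundreds of dimensions), and `rank_by_cosine` is a hypothetical helper, not a vector database API.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(query, documents):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2D "embeddings", made up for illustration only.
documents = {
    "dog": [0.9, 0.8],
    "puppy": [0.8, 0.5],
    "car": [0.2, -0.7],
    "truck": [0.6, -0.6],
}
query = [0.3, 0.9]  # stands in for the embedding of "cat"

for name, score in rank_by_cosine(query, documents):
    print(f"{name}: {score:.2f}")
```

With these made-up coordinates the animals rank above the vehicles. Sorting by cosine distance in ascending order would produce the same ordering.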

Matrix Multiplication

Neural networks are stacks of matrix multiplications. Understanding this operation clarifies how models transform representations.

Matrix Multiplication

MATRIX × VECTOR = NEW VECTOR
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   [ 2  0 ]   [ 3 ]     [  6 ]                             │
│   [      ] × [   ]  =  [    ]                             │
│   [ 0  3 ]   [ 2 ]     [  6 ]                             │
│                                                           │
│   This matrix scales x by 2, y by 3.                      │
│                                                           │
└───────────────────────────────────────────────────────────┘
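A matrix-vector product is one dot product per matrix row, which the following sketch makes explicit. `matvec` is an illustrative name; the matrix is stored as a list of rows.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector: one dot product per row."""
    for row in matrix:
        if len(row) != len(vector):
            raise ValueError("each row must have as many entries as the vector")
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# The scaling matrix from the box above: x is doubled, y is tripled.
scale = [
    [2, 0],
    [0, 3],
]
print(matvec(scale, [3, 2]))  # [6, 6]
```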

TRANSFORMATION VIEW
┌───────────────────────────────────────────────────────────┐
│                                                           │
│ A matrix defines a transformation of space.               │
│ Multiplying transforms points.                            │
│                                                           │
│ Rotation: Points rotate around origin                     │
│ Scaling: Points stretch/compress                          │
│ Projection: Higher-dim → lower-dim                        │
│ Combination: All of the above                             │
│                                                           │
│ Neural network layer = matrix multiply + activation       │
│ Each layer transforms the representation into a new       │
│ space.                                                    │
│                                                           │
└───────────────────────────────────────────────────────────┘

DIMENSION CHANGES
┌───────────────────────────────────────────────────────────┐
│                                                           │
│ Matrix shape: (output_dim, input_dim)                     │
│ Vector shape: (input_dim,)                                │
│ Result shape: (output_dim,)                               │
│                                                           │
│ Example:                                                  │
│ Input embedding: 768 dimensions                           │
│ Weight matrix: (3072, 768)                                │
│ Output: 3072 dimensions ← expanded                        │
│                                                           │
│ Transformer FFN: 768 → 3072 → 768 (expand then            │
│ compress)                                                 │
│                                                           │
└───────────────────────────────────────────────────────────┘
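The shape rule can be seen in a scaled-down feed-forward block. The sketch below uses 4 → 16 → 4 as a stand-in for 768 → 3072 → 768 (same 4x expansion ratio) so it runs instantly in plain Python. The random weights, the ReLU activation, and the omission of bias terms are simplifying assumptions for illustration; real models use trained weights and often a different activation.

```python
import random


def matvec(matrix, vector):
    """One dot product per row; matrix shape is (output_dim, input_dim)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def relu(vector):
    return [max(0.0, x) for x in vector]


rng = random.Random(0)
d_model, d_ff = 4, 16  # stand-ins for 768 and 3072

W_up = random_matrix(d_ff, d_model, rng)    # shape (16, 4): expands
W_down = random_matrix(d_model, d_ff, rng)  # shape (4, 16): compresses

x = [rng.uniform(-1, 1) for _ in range(d_model)]
hidden = relu(matvec(W_up, x))  # matrix multiply + activation
out = matvec(W_down, hidden)

print(len(x), len(hidden), len(out))  # 4 16 4
```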

In Attention

Attention is built from these primitives:

Attention Computation

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   1. Project inputs to Q, K, V spaces:                    │
│      Q = X @ W_Q    (768 → 64 per head)                   │
│      K = X @ W_K                                          │
│      V = X @ W_V                                          │
│                                                           │
│   2. Compute attention scores:                            │
│      scores = Q @ K.T   ← Dot products between all        │
│                           Q-K pairs                       │
│                                                           │
│   3. Scale and softmax:                                   │
│      weights = softmax(scores / sqrt(d_k))                │
│                                                           │
│   4. Weighted sum of values:                              │
│      output = weights @ V                                 │
│                                                           │
│   Each operation is dot products or matrix multiplies.    │
│                                                           │
└───────────────────────────────────────────────────────────┘
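The four steps above can be sketched for a single attention head in plain Python. This is a minimal illustration under several assumptions: tiny dimensions (3 tokens, 4 input dims, d_k = 2), random untrained weights, no masking, and no multi-head concatenation. Because the box writes `X @ W_Q` with tokens as rows of X, the projection matrices here have shape (input_dim, d_k).

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    inner = len(B)
    if any(len(row) != inner for row in A):
        raise ValueError("inner dimensions do not match")
    return [
        [sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def transpose(M):
    return [list(col) for col in zip(*M)]


def softmax(row):
    """Exponentiate and normalize so the entries sum to 1."""
    peak = max(row)  # subtracting the max avoids overflow
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(X, W_Q, W_K, W_V):
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)  # 1. project
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                           # 2. all Q-K dot products
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]  # 3.
    return matmul(weights, V), weights                         # 4. weighted sum of values


rng = random.Random(0)
rand = lambda rows, cols: [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

X = rand(3, 4)  # 3 tokens, 4 dims each
output, weights = attention(X, rand(4, 2), rand(4, 2), rand(4, 2))
print(len(output), len(output[0]))       # 3 2
print([round(sum(row), 6) for row in weights])  # each row sums to 1
```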

Why projections? Different W_Q, W_K, W_V let the model learn different “views” of the input. Query projection emphasizes “what am I looking for?” Key projection emphasizes “what do I contain?” Value projection emphasizes “what information should I contribute?”

Dimensionality and Capacity

More dimensions = more capacity to represent distinctions.

Dimensionality Tradeoff

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   384D: Good separation for many tasks                    │
│     • Fast inference                                      │
│     • Small storage                                       │
│     • May lose fine distinctions                          │
│                                                           │
│   768D: Rich separation (BERT-sized)                      │
│     • "bank" (financial) far from "bank" (river)          │
│     • Nuanced relationships preserved                     │
│                                                           │
│   4096D: Highest capacity of the three                    │
│     • Captures subtle distinctions                        │
│     • Expensive to compute and store                      │
│                                                           │
└───────────────────────────────────────────────────────────┘

Common dimensions:

Small/fast: 384 (e5-small, all-MiniLM)
Standard: 768 (BERT, many embedding models)
Large: 1024-4096 (GPT-scale, high-quality embeddings)
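The storage side of the tradeoff is simple arithmetic: vectors × dimensions × bytes per value. The sketch below assumes 32-bit floats (4 bytes per value), one million stored vectors, and no index overhead or compression; all three are assumptions chosen for illustration.

```python
def index_size_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw storage for dense vectors, ignoring index overhead."""
    return num_vectors * dims * bytes_per_value


for dims in (384, 768, 4096):
    gib = index_size_bytes(1_000_000, dims) / 2**30
    print(f"{dims}D: {gib:.2f} GiB")
# 384D: 1.43 GiB
# 768D: 2.86 GiB
# 4096D: 15.26 GiB
```

The cost of each similarity computation grows linearly with the dimension count as well, since a dot product touches every coordinate once.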

When This Matters

Situation	Concept to apply
Choosing an embedding model	Dimensionality tradeoff
Understanding retrieval	Cosine similarity for ranking
Understanding attention	Q·K dot products, softmax, V weighting
Debugging “wrong results returned”	Check distance metric matches model
Understanding layer transformations	Matrix multiply as space transformation
Optimizing inference	Matrix multiplies (batches of dot products) dominate the compute
