Embeddings are vectors in high-dimensional space where similar meanings cluster together. Understanding dot products, cosine similarity, and matrix multiplication is essential for working with embeddings and attention mechanisms.
Visual Overview
EMBEDDING SPACE (visualized in 2D, real embeddings are 384-4096 dims)

        cat *      * dog
              * puppy
    ───────────────┼───────────────────────
                         * car    * truck
                              * vehicle

    Semantic similarity = geometric proximity
    "cat" is closer to "dog" than to "car"
Key insight: When a model converts text to embeddings, it’s placing words/sentences at coordinates in a space where distance = meaning difference.
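To make the geometry concrete, here is a minimal sketch using made-up 2D coordinates; a real embedding model assigns these positions during training and uses hundreds or thousands of dimensions.

```python
import numpy as np

# Hypothetical 2D coordinates, chosen by hand for illustration only.
embeddings = {
    "cat":   np.array([0.9, 0.8]),
    "dog":   np.array([0.8, 0.9]),
    "puppy": np.array([0.7, 0.7]),
    "car":   np.array([-0.7, -0.6]),
}

def distance(a, b):
    # Straight-line (Euclidean) distance between two points
    return np.linalg.norm(a - b)

print(distance(embeddings["cat"], embeddings["dog"]))  # small: close in meaning
print(distance(embeddings["cat"], embeddings["car"]))  # large: far in meaning
```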
What Dimensions Represent
DIMENSIONS

    Each dimension captures some learned feature.

    Hypothetical (models don't label dimensions):
      Dimension 1:   animate vs inanimate
      Dimension 2:   size
      Dimension 3:   domesticated vs wild
      ...
      Dimension 768: ???

    In practice: dimensions aren't interpretable individually.
    The geometry of relationships is what matters.
Dot Product
The dot product is the fundamental operation in neural networks. Attention, similarity, and layer computations all use it.
DOT PRODUCT FORMULA

    a · b = SUM(a_i × b_i)

    Example (3D vectors):
      a = [3, 4, 0]
      b = [2, 1, 2]

      a · b = (3×2) + (4×1) + (0×2) = 6 + 4 + 0 = 10

GEOMETRIC MEANING

    a · b = |a| × |b| × cos(θ)

    Where:
      |a| = length of vector a
      |b| = length of vector b
      θ   = angle between vectors

DOT PRODUCT SIGN

                 b
                 ↑         a·b > 0 (similar direction)
    ─────────────┼──────────→ a
                 ↓         a·b < 0 (opposite direction)

    Same direction (0°):   cos(0)   =  1  → positive
    Perpendicular (90°):   cos(90)  =  0  → zero
    Opposite (180°):       cos(180) = -1  → negative
In attention: Query · Key computes relevance. High dot product = this key is relevant to this query.
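A quick check of both forms of the dot product, using the 3D vectors from the example above (numpy assumed):

```python
import numpy as np

a = np.array([3.0, 4.0, 0.0])
b = np.array([2.0, 1.0, 2.0])

# Algebraic form: multiply elementwise, then sum -> (3*2) + (4*1) + (0*2) = 10
dot = np.dot(a, b)

# Geometric form: |a| * |b| * cos(theta) gives the same number
norm_a, norm_b = np.linalg.norm(a), np.linalg.norm(b)
cos_theta = dot / (norm_a * norm_b)          # recover cos(theta) from the dot product
print(dot, norm_a * norm_b * cos_theta)      # 10.0 10.0
```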
Cosine Similarity
Cosine similarity is a normalized dot product. It measures direction alignment, ignoring magnitude.
COSINE SIMILARITY

    cos_sim(a, b) = (a · b) / (|a| × |b|)

    Range: [-1, 1]
       1.0 = identical direction (parallel)
       0.0 = orthogonal (unrelated)
      -1.0 = opposite direction (antonyms, in some spaces)

WHY NORMALIZE?

    WITHOUT NORMALIZATION:

      Vector lengths vary:
        "king"  might have ‖v‖ = 10
        "queen" might have ‖v‖ = 8

      Raw dot product:
        king · queen = 75
        king · dog   = 80   ← Higher! But "dog" isn't more similar

      Problem: length dominates, not direction.

    WITH NORMALIZATION:

      Cosine similarity:
        cos(king, queen) = 0.95
        cos(king, dog)   = 0.30

      Now direction dominates. "queen" is more similar.
In practice: Most embedding models output normalized vectors (length = 1). When vectors are normalized, dot product = cosine similarity.
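As a sketch of why normalization matters; the vectors below are toy values, not outputs of a real model:

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of lengths: direction only
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

king  = np.array([4.0, 3.0, 1.0])
queen = np.array([3.5, 2.5, 1.2])   # similar direction to "king"
dog   = np.array([20.0, 2.0, 1.0])  # much longer vector, different direction

print(np.dot(king, queen), np.dot(king, dog))  # 22.7 87.0  <- raw dot favors "dog"
print(cosine_similarity(king, queen),          # ~0.997
      cosine_similarity(king, dog))            # ~0.85   <- direction favors "queen"

# After normalizing to unit length, the plain dot product equals cosine similarity
k_hat = king / np.linalg.norm(king)
q_hat = queen / np.linalg.norm(queen)
print(np.dot(k_hat, q_hat))                    # same value as cosine_similarity(king, queen)
```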
Distance Metrics
Euclidean Distance (L2)
EUCLIDEAN DISTANCE

    d(a, b) = sqrt(SUM((a_i - b_i)²))

    "Straight line" distance in space.

      a *
           \
             \   d = 5
               \
                 * b
    ─────────────┴─────────────
Cosine Distance
COSINE DISTANCE

    cos_dist(a, b) = 1 - cos_sim(a, b)

    Range: [0, 2]
      0 = identical direction
      1 = orthogonal
      2 = opposite direction
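A small comparison of the two metrics, assuming numpy; note how they can disagree when only the length differs:

```python
import numpy as np

def euclidean_distance(a, b):
    # L2: straight-line distance, sensitive to vector length
    return np.linalg.norm(a - b)

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 = same direction, 2 = opposite
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
b = np.array([10.0, 0.0])   # same direction, ten times longer

print(euclidean_distance(a, b))  # 9.0 -> "far apart" by L2
print(cosine_distance(a, b))     # 0.0 -> identical by direction
```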
When to Use What
| Metric            | When                  | Why                          |
|-------------------|-----------------------|------------------------------|
| Cosine similarity | Text embeddings       | Direction = semantic meaning |
| Cosine distance   | Retrieval ranking     | Lower = more similar         |
| Euclidean (L2)    | Some image embeddings | Magnitude can carry info     |
| Dot product       | Normalized vectors    | Fast, equals cosine sim      |
Default choice: Cosine similarity for text. It’s what embedding models are trained to optimize.
Matrix Multiplication
Neural networks are stacks of matrix multiplications. Understanding this operation clarifies how models transform representations.
MATRIX × VECTOR = NEW VECTOR

    [ 2  0 ]     [ 3 ]     [ 6 ]
    [      ]  ×  [   ]  =  [   ]
    [ 0  3 ]     [ 2 ]     [ 6 ]

    This matrix scales x by 2, y by 3.

TRANSFORMATION VIEW

    A matrix defines a transformation of space.
    Multiplying transforms points.

      Rotation:     Points rotate around origin
      Scaling:      Points stretch/compress
      Projection:   Higher-dim → lower-dim
      Combination:  All of the above

    Neural network layer = matrix multiply + activation.
    Each layer transforms the representation into a new space.

DIMENSION CHANGES

    Matrix shape: (output_dim, input_dim)
    Vector shape: (input_dim,)
    Result shape: (output_dim,)

    Example:
      Input embedding: 768 dimensions
      Weight matrix:   (3072, 768)
      Output:          3072 dimensions  ← expanded

    Transformer FFN: 768 → 3072 → 768 (expand then compress)
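Here is a small sketch of both the 2x2 example and the dimension-changing FFN shapes above; the weight values are random placeholders, not trained parameters:

```python
import numpy as np

# The 2x2 example: scale x by 2 and y by 3
M = np.array([[2.0, 0.0],
              [0.0, 3.0]])
v = np.array([3.0, 2.0])
print(M @ v)                       # [6. 6.]

# Dimension changes with the FFN shapes used above (random weights for illustration)
x  = np.random.randn(768)          # input embedding, 768 dims
W1 = np.random.randn(3072, 768)    # expand:   (output_dim, input_dim)
W2 = np.random.randn(768, 3072)    # compress: back to the model dimension

h = np.maximum(W1 @ x, 0.0)        # matrix multiply + ReLU activation
y = W2 @ h
print(x.shape, h.shape, y.shape)   # (768,) (3072,) (768,)
```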
In Attention
Attention is built from these primitives:
ATTENTION COMPUTATION

    1. Project inputs to Q, K, V spaces:
         Q = X @ W_Q   (768 → 64 per head)
         K = X @ W_K
         V = X @ W_V

    2. Compute attention scores:
         scores = Q @ K.T   ← dot products between all Q-K pairs

    3. Scale and softmax:
         weights = softmax(scores / sqrt(d_k))

    4. Weighted sum of values:
         output = weights @ V

    Each operation is dot products or matrix multiplies.
Why projections? Different W_Q, W_K, W_V let the model learn different “views” of the input. Query projection emphasizes “what am I looking for?” Key projection emphasizes “what do I contain?” Value projection emphasizes “what information should I contribute?”
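Putting the four steps together as runnable numpy, with random weights standing in for the learned projections and toy sizes (5 tokens, one 64-dimensional head):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, d_k = 5, 768, 64
X = np.random.randn(seq_len, d_model)          # one "sentence" of 5 token vectors

# Random stand-ins for the learned projection matrices W_Q, W_K, W_V
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # 1. project to Q, K, V spaces
scores  = Q @ K.T                              # 2. all pairwise Q·K dot products
weights = softmax(scores / np.sqrt(d_k))       # 3. scale and softmax
output  = weights @ V                          # 4. weighted sum of values

print(scores.shape, weights.shape, output.shape)   # (5, 5) (5, 5) (5, 64)
```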
Dimensionality and Capacity
More dimensions = more capacity to represent distinctions.
DIMENSIONALITY TRADEOFF

    384D:   Good separation for many tasks
            • Fast inference
            • Small storage
            • May lose fine distinctions

    768D:   Rich separation (BERT-sized)
            • "bank" (financial) far from "bank" (river)
            • Nuanced relationships preserved

    4096D:  Maximum expressiveness
            • Captures subtle distinctions
            • Expensive to compute and store
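The storage side of this tradeoff is easy to estimate; a rough sketch assuming float32 vectors (4 bytes per dimension) and an illustrative corpus of one million embeddings:

```python
# Rough storage estimate: dims * 4 bytes (float32) per vector
num_vectors = 1_000_000
for dims in (384, 768, 4096):
    gigabytes = num_vectors * dims * 4 / 1e9
    print(f"{dims:>4} dims: {gigabytes:5.1f} GB for {num_vectors:,} vectors")
# 384 dims ~ 1.5 GB, 768 dims ~ 3.1 GB, 4096 dims ~ 16.4 GB
```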