
Math Intuitions

Geometric intuitions for vectors, cosine similarity, dot products, and matrix multiplication in AI

TL;DR

Embeddings are vectors in high-dimensional space where similar meanings cluster together. Understanding dot products, cosine similarity, and matrix multiplication is essential for working with embeddings and attention mechanisms.

Visual Overview

Embedding Space

Key insight: When a model converts text to embeddings, it’s placing words/sentences at coordinates in a space where distance = meaning difference.
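A minimal sketch of this idea, using made-up 3-dimensional "embeddings" rather than a real model (real models use hundreds or thousands of dimensions). The coordinates below are invented purely for illustration: nearby vectors stand in for similar meanings.

```python
import numpy as np

# Toy 3-dimensional "embeddings" with invented coordinates:
# "cat" and "kitten" are placed close together, "airplane" far away.
embeddings = {
    "cat":      np.array([0.90, 0.80, 0.10]),
    "kitten":   np.array([0.85, 0.75, 0.15]),
    "airplane": np.array([0.10, 0.20, 0.95]),
}

def distance(a, b):
    """Straight-line (Euclidean) distance between two points."""
    return np.linalg.norm(a - b)

print(distance(embeddings["cat"], embeddings["kitten"]))    # small: similar meaning
print(distance(embeddings["cat"], embeddings["airplane"]))  # large: different meaning
```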


What Dimensions Represent

Each dimension is a learned feature axis. Individual dimensions rarely map to a single human-readable concept; meaning is distributed across many dimensions at once, and it is the overall pattern of values that encodes semantics.

Dot Product

The dot product is the fundamental operation in neural networks. Attention, similarity, and layer computations all use it.

Formally, a · b = a₁b₁ + a₂b₂ + … + aₙbₙ: multiply corresponding components and add them up. The result is large when the vectors point in similar directions and have large magnitudes.

In attention: Query · Key computes relevance. A high dot product means this key is relevant to this query.
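A small sketch of the operation itself, with arbitrary example vectors: multiply matching components, then sum.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Dot product: multiply corresponding components, then sum.
manual = sum(x * y for x, y in zip(a, b))  # 1*4 + 2*5 + 3*6 = 32
fast = np.dot(a, b)                        # same result, vectorized

print(manual, fast)  # 32.0 32.0
```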


Cosine Similarity

Cosine similarity is a normalized dot product. It measures direction alignment, ignoring magnitude.

Formally, cos(a, b) = (a · b) / (‖a‖ ‖b‖): the dot product divided by the product of the two vector lengths. It ranges from −1 (opposite directions) through 0 (orthogonal) to 1 (same direction).

In practice: Most embedding models output normalized vectors (length = 1). When vectors are normalized, dot product = cosine similarity.
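A sketch with arbitrary example vectors: cosine similarity divides the dot product by both lengths, and for unit-length vectors that division is a no-op, so a plain dot product suffices.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the magnitude

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(a, b))  # 1.0: direction matches, magnitude ignored

# Normalize to unit length: now a plain dot product equals cosine similarity.
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
print(np.dot(a_hat, b_hat))     # 1.0
```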


Distance Metrics

Euclidean Distance (L2)

Euclidean distance is the straight-line distance between two points: the square root of the summed squared differences across all dimensions. It is sensitive to vector magnitude as well as direction.

Cosine Distance

Cosine distance is 1 − cosine similarity. It ranges from 0 (same direction) to 2 (opposite direction), so lower means more similar.
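A sketch comparing the two metrics on the same pair of example vectors: the vectors point the same way but differ in magnitude, so cosine distance is zero while Euclidean distance is not.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, larger magnitude

euclidean = np.linalg.norm(a - b)                                   # sensitive to magnitude
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
cosine_dist = 1.0 - cosine_sim                                      # 0 here: directions match

print(euclidean, cosine_dist)   # ~3.74  0.0
```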

When to Use What

| Metric | When | Why |
| --- | --- | --- |
| Cosine similarity | Text embeddings | Direction = semantic meaning |
| Cosine distance | Retrieval ranking | Lower = more similar |
| Euclidean (L2) | Some image embeddings | Magnitude can carry info |
| Dot product | Normalized vectors | Fast, equals cosine sim |

Default choice: Cosine similarity for text. It’s what embedding models are trained to optimize.


Matrix Multiplication

Neural networks are stacks of matrix multiplications. Understanding this operation clarifies how models transform representations.

Multiplying a batch of input vectors by a weight matrix W maps them into a new space: each output component is a dot product between an input vector and one column of W.
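A sketch of a single linear layer as a matrix multiply, with made-up sizes (4-dimensional inputs projected down to 2 dimensions) and random values standing in for learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(3, 4))   # 3 input vectors, each 4-dimensional
W = rng.normal(size=(4, 2))   # weight matrix: maps 4 dims -> 2 dims

Y = X @ W                     # each output row lives in the new 2-D space
print(Y.shape)                # (3, 2)

# Each output entry is a dot product: row i of X with column j of W.
print(np.allclose(Y[0, 0], np.dot(X[0], W[:, 0])))  # True
```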

In Attention

Attention is built from these primitives:

The core computation is Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V: dot products between queries and keys score relevance, softmax turns the scores into weights, and those weights mix the values.

Why projections? Different W_Q, W_K, W_V let the model learn different “views” of the input. Query projection emphasizes “what am I looking for?” Key projection emphasizes “what do I contain?” Value projection emphasizes “what information should I contribute?”
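A sketch of single-head scaled dot-product attention. The projection matrices are randomly initialized here just to show shapes and data flow; the sizes (seq_len, d_model, d_k) are illustrative, not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8

X = rng.normal(size=(seq_len, d_model))   # one token embedding per row

# Learned in a real model; random here for illustration.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V       # three "views" of the input

scores = Q @ K.T / np.sqrt(d_k)           # Q·K relevance, scaled
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax

output = weights @ V                      # weighted mix of the values
print(weights.shape, output.shape)        # (4, 4) (4, 8)
```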


Dimensionality and Capacity

More dimensions = more capacity to represent distinctions.

Higher-dimensional embeddings can encode finer-grained distinctions, but each stored vector costs more memory, each comparison costs more compute, and training them well generally requires more data.

Common dimensions:

  • Small/fast: 384 (e5-small, all-MiniLM)
  • Standard: 768 (BERT, many embedding models)
  • Large: 1024-4096 (GPT-scale, high-quality embeddings)
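A back-of-the-envelope sketch of the storage side of this tradeoff, assuming float32 vectors (4 bytes per value) and an illustrative corpus of 1 million stored embeddings.

```python
N_VECTORS = 1_000_000          # illustrative corpus size
BYTES_PER_FLOAT32 = 4

for dim in (384, 768, 1536, 4096):
    gib = N_VECTORS * dim * BYTES_PER_FLOAT32 / 2**30
    print(f"{dim:>5} dims -> {gib:.1f} GiB of raw vectors")
# 384 -> ~1.4 GiB, 768 -> ~2.9 GiB, 1536 -> ~5.7 GiB, 4096 -> ~15.3 GiB
```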

When This Matters

| Situation | Concept to apply |
| --- | --- |
| Choosing an embedding model | Dimensionality tradeoff |
| Understanding retrieval | Cosine similarity for ranking |
| Understanding attention | Q·K dot products, softmax, V weighting |
| Debugging "wrong results returned" | Check distance metric matches model |
| Understanding layer transformations | Matrix multiply as space transformation |
| Optimizing inference | Dot products are the computational bottleneck |
Interview Notes
  • Interview relevance: comes up in roughly 65% of ML interviews.
  • Production impact: underpins embeddings and attention in deployed systems.
  • Performance: foundation for retrieval systems.