
Math Intuitions

Summary

Geometric intuitions for vectors, cosine similarity, dot products, and matrix multiplication in AI

TL;DR

Embeddings are vectors in high-dimensional space where similar meanings cluster together. Understanding dot products, cosine similarity, and matrix multiplication is essential for working with embeddings and attention mechanisms.

Visual Overview

Embedding Space
EMBEDDING SPACE (visualized in 2D, real embeddings are 384-4096 dims)

                                                           
                                                          
     cat *           * dog                                 
                                                          
                 * puppy                                  
                                                          
                          
                                                          
           * car      * truck                             
                                                          
                  * vehicle                               
                                                           
   Semantic similarity = geometric proximity               
   "cat" is closer to "dog" than to "car"                  
                                                           

Key insight: When a model converts text to embeddings, it’s placing words/sentences at coordinates in a space where distance = meaning difference.
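The picture above can be made concrete with a toy example. The sketch below uses hand-picked 2D coordinates (an assumption for illustration; real embeddings come from a model and have hundreds of dimensions) and looks up each word's nearest neighbor by straight-line distance. The `points` table and the `nearest` helper are hypothetical names, not part of any library.

```python
import math

# Hypothetical 2D coordinates, chosen by hand to mimic the diagram.
# Real embeddings are produced by a model and have hundreds of dimensions.
points = {
    "cat": (1.0, 5.0),
    "dog": (3.0, 5.0),
    "puppy": (2.5, 4.0),
    "car": (1.5, 1.5),
    "truck": (3.0, 1.5),
    "vehicle": (2.5, 0.5),
}

def nearest(word):
    """Return the other word whose point is closest to `word`."""
    others = (w for w in points if w != word)
    return min(others, key=lambda w: math.dist(points[word], points[w]))

print(nearest("cat"))    # puppy (an animal, not a vehicle)
print(nearest("truck"))  # vehicle (a vehicle, not an animal)
```

With these coordinates, "cat" is also closer to "dog" (distance 2.0) than to "car" (about 3.5), which is the relationship the diagram illustrates.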


What Dimensions Represent

Dimensions

                                                           
   Each dimension captures some learned feature.           
                                                           
   Hypothetical (models don't label dimensions):           
     Dimension 1: animate vs inanimate                     
     Dimension 2: size                                     
     Dimension 3: domesticated vs wild                     
     ...                                                   
     Dimension 768: ???                                    
                                                           
   In practice: Dimensions aren't interpretable            
   individually. The geometry of relationships is what     
   matters.                                                
                                                           


Dot Product

The dot product is the workhorse operation of neural networks: attention scores, similarity search, and every layer's matrix multiply are built from it.

Dot Product
DOT PRODUCT FORMULA

                                                           
   a · b = SUM(a_i x b_i)                                  
                                                           
   Example (3D vectors):                                   
     a = [3, 4, 0]                                         
     b = [2, 1, 2]                                         
                                                           
     a · b = (3x2) + (4x1) + (0x2) = 6 + 4 + 0 = 10        
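The formula translates almost directly into code. This is a minimal pure-Python sketch (the `dot` helper is a name chosen here; in practice a library such as NumPy would typically be used), reproducing the worked example.

```python
def dot(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

a = [3, 4, 0]
b = [2, 1, 2]
print(dot(a, b))  # 10
```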
                                                           


GEOMETRIC MEANING

 
 a · b = |a| × |b| × cos(θ) 
 
 Where: 
 |a| = length of vector a 
 |b| = length of vector b 
 θ = angle between vectors 
 


DOT PRODUCT SIGN

 
 a·b > 0: b points in a similar direction to a 
 a·b < 0: b points in the opposite direction to a 
 
 Same direction (0°):  cos(0°)   =  1  → positive 
 Perpendicular (90°):  cos(90°)  =  0  → zero 
 Opposite (180°):      cos(180°) = -1  → negative 
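Rearranging a · b = |a| × |b| × cos(θ) gives the angle between two vectors. The sketch below (helper names `norm` and `angle_degrees` are assumptions made for this example) applies it to the vectors from the formula example and then checks the sign of the dot product for the three reference cases.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Length |a| of a vector."""
    return math.sqrt(dot(a, a))

def angle_degrees(a, b):
    """Angle between a and b, recovered from a · b = |a| |b| cos(theta)."""
    cos_theta = dot(a, b) / (norm(a) * norm(b))
    # Clamp to [-1, 1] to guard against floating-point drift.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))

a = [3, 4, 0]
b = [2, 1, 2]
print(norm(a), norm(b))               # 5.0 3.0
print(round(angle_degrees(a, b), 1))  # 48.2

# Sign of the dot product for the three reference cases.
print(dot([1, 0], [2, 0]))   # 2  (same direction, positive)
print(dot([1, 0], [0, 3]))   # 0  (perpendicular, zero)
print(dot([1, 0], [-2, 0]))  # -2 (opposite direction, negative)
```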
 

In attention: Query · Key computes relevance. A high dot product means this key is relevant to this query.


Cosine Similarity

Cosine similarity is a normalized dot product. It measures direction alignment, ignoring magnitude.

Cosine Similarity

                                                           
   cos_sim(a, b) = (a · b) / (|a| x |b|)                   
                                                           
   Range: [-1, 1]                                          
      1.0 = identical direction (parallel)                 
      0.0 = orthogonal (unrelated)                         
     -1.0 = opposite direction (antonyms, in some spaces)  
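A short sketch of the formula, assuming plain Python lists as vectors. The `cosine_similarity` name is chosen here for illustration. The first call shows that scaling a vector by 10 leaves the similarity at (approximately) 1, which is what "ignoring magnitude" means.

```python
import math

def cosine_similarity(a, b):
    """(a · b) / (|a| |b|); undefined for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

v = [1.0, 2.0]
print(cosine_similarity(v, [10.0, 20.0]))   # close to 1: same direction, 10x longer
print(cosine_similarity(v, [-2.0, 1.0]))    # 0.0: orthogonal
print(cosine_similarity(v, [-1.0, -2.0]))   # close to -1: opposite direction
```

Floating-point rounding can leave the first and last results a hair away from exactly 1 and -1, so comparisons are best done with a tolerance.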
                                                           


WHY NORMALIZE?

 
 WITHOUT NORMALIZATION: 
 
 Vector lengths vary: 
 "king" might have ‖v‖ = 10 
 "queen" might have ‖v‖ = 8 
 "dog" might have ‖v‖ = 27 
 
 Raw dot product: 
 king · queen = 76 
 king · dog = 81  ← Higher! But "dog" isn't 
 more similar; its vector is just longer 
 
 Problem: Length dominates, not direction. 
 
 WITH NORMALIZATION: 
 
 Cosine similarity: 
 cos(king, queen) = 0.95 
 cos(king, dog) = 0.30 
 
 Now direction dominates. "queen" is more similar. 
 

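In practice: Many embedding models and APIs return normalized vectors (length = 1), but not all do, so check the model's documentation or normalize the vectors yourself. When both vectors are normalized, dot product = cosine similarity.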
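The effect of length on the raw dot product is easy to reproduce. The sketch below uses three hypothetical 2D vectors (not real embeddings) where the longest vector wins on raw dot product, and the best-aligned one wins once everything is scaled to unit length.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(a):
    """Scale a vector to length 1, keeping its direction."""
    length = math.sqrt(dot(a, a))
    return [x / length for x in a]

# Hypothetical 2D vectors, picked so the effect is easy to see.
query = [3.0, 4.0]        # length 5
aligned = [4.0, 3.0]      # length 5, close in direction to query
long_vec = [0.0, 20.0]    # length 20, further off in direction

# Raw dot product: the long vector wins purely because of its length.
print(dot(query, aligned), dot(query, long_vec))   # 24.0 80.0

# After normalizing, the dot product is the cosine similarity,
# and the better-aligned vector wins.
q, al, lv = normalize(query), normalize(aligned), normalize(long_vec)
print(round(dot(q, al), 2), round(dot(q, lv), 2))  # 0.96 0.8
```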


Distance Metrics

Euclidean Distance (L2)

Euclidean Distance

                                                           
   d(a, b) = sqrt(SUM((a_i - b_i)²))                       
                                                           
   "Straight line" distance in space.                      
                                                           
                                                          
       a *                                                 
                                                         
           d = 5                                         
                                                         
            * b                                           
                                             
                                                           

Cosine Distance

Cosine Distance

                                                           
   cos_dist(a, b) = 1 - cos_sim(a, b)                      
                                                           
   Range: [0, 2]                                           
     0 = identical direction                               
     1 = orthogonal                                        
     2 = opposite direction                                
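Both distances fit in a few lines. This is a sketch with helper names chosen here (`euclidean_distance`, `cosine_distance`); the points `a` and `b` are hypothetical coordinates picked to be 5 units apart, matching the d = 5 in the Euclidean diagram.

```python
import math

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Two hypothetical points 5 units apart.
a = [1.0, 6.0]
b = [4.0, 2.0]
print(euclidean_distance(a, b))  # 5.0

print(cosine_distance([1.0, 0.0], [3.0, 0.0]))   # 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 2.0]))   # 1.0 (orthogonal)
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # 2.0 (opposite)
```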
                                                           

When to Use What

Metric             | When                  | Why
Cosine similarity  | Text embeddings       | Direction = semantic meaning
Cosine distance    | Retrieval ranking     | Lower = more similar
Euclidean (L2)     | Some image embeddings | Magnitude can carry info
Dot product        | Normalized vectors    | Fast, equals cosine sim

Default choice: Cosine similarity for text. Most text embedding models are trained with a cosine-based objective; if a model's documentation names a different metric, use that one.
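For unit-length vectors the choice of metric matters less than it may seem, because squared Euclidean distance is a fixed function of cosine similarity. A short derivation, using only the definitions from the sections above:

```latex
\[
\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\,(a \cdot b)
            = 2 - 2\,\mathrm{cos\_sim}(a, b)
            = 2\,\mathrm{cos\_dist}(a, b)
\qquad \text{when } \|a\| = \|b\| = 1 .
\]
```

So on normalized vectors, ranking by highest dot product, highest cosine similarity, or lowest Euclidean distance returns the same order. The metrics only disagree when vector lengths differ.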


Matrix Multiplication

Neural networks are largely stacks of matrix multiplications with nonlinear activations in between. Understanding this operation clarifies how models transform representations.

Matrix Multiplication
MATRIX x VECTOR = NEW VECTOR

                                                           
   [ 2  0 ]   [ 3 ]     [  6 ]                             
   [      ] x [   ]  =  [    ]                             
   [ 0  3 ]   [ 2 ]     [  6 ]                             
                                                           
   This matrix scales x by 2, y by 3.                      
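The same calculation in code, as a minimal sketch. The `matvec` helper is a name chosen here, with the matrix stored as a list of rows. Each output entry is the dot product of one matrix row with the input vector.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("each row must be as long as the vector")
    # Each output entry is the dot product of one row with the vector.
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

scale = [[2, 0],
         [0, 3]]
print(matvec(scale, [3, 2]))  # [6, 6]
```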
                                                           


TRANSFORMATION VIEW

 
 A matrix defines a transformation of space. 
 Multiplying transforms points. 
 
 Rotation: Points rotate around origin 
 Scaling: Points stretch/compress 
 Projection: Higher-dim → lower-dim 
 Combination: All of the above 
 
 Neural network layer = matrix multiply + activation 
 Each layer transforms the representation into a new 
 space. 
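A few of these transformations, sketched with small hand-written matrices. The weight values in the toy layer are hypothetical, ReLU is used as one common choice of activation, and the bias term is left out for brevity.

```python
import math

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def relu(vector):
    """Element-wise max(0, x), a common activation."""
    return [max(0.0, x) for x in vector]

# Rotation by 90 degrees counter-clockwise around the origin.
t = math.radians(90)
rotate = [[math.cos(t), -math.sin(t)],
          [math.sin(t),  math.cos(t)]]
print([round(x, 6) for x in matvec(rotate, [1.0, 0.0])])  # [0.0, 1.0]

# Projection from 3D to 2D: keep x and y, drop z.
project = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0]]
print(matvec(project, [5.0, 7.0, 9.0]))  # [5.0, 7.0]

# A toy layer: matrix multiply, then activation (bias omitted).
weights = [[1.0, -1.0],
           [-2.0, 0.5]]   # hypothetical values
print(relu(matvec(weights, [3.0, 1.0])))  # [2.0, 0.0]
```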
 


DIMENSION CHANGES

 
 Matrix shape: (output_dim, input_dim) 
 Vector shape: (input_dim,) 
 Result shape: (output_dim,) 
 
 Example: 
 Input embedding: 768 dimensions 
 Weight matrix: (3072, 768) 
 Output: 3072 dimensions (expanded) 
 
 Transformer FFN: 768 → 3072 → 768 (expand then 
 compress) 
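The shape rule can be checked with a scaled-down feed-forward block. In this sketch the sizes 8 and 32 stand in for 768 and 3072 (same 4x ratio) so it runs instantly in pure Python; the random weights, the ReLU activation, and the omitted biases are simplifying assumptions, not a description of any particular model.

```python
import random

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    """Matrix of shape (rows, cols) with small random entries."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# Scaled-down stand-ins for 768 and 3072 (same 4x ratio).
d_model, d_ff = 8, 32
rng = random.Random(0)

w_up = random_matrix(d_ff, d_model, rng)     # shape (32, 8): expand
w_down = random_matrix(d_model, d_ff, rng)   # shape (8, 32): compress

x = [rng.uniform(-1, 1) for _ in range(d_model)]
hidden = [max(0.0, h) for h in matvec(w_up, x)]   # ReLU used for simplicity
out = matvec(w_down, hidden)

print(len(x), len(hidden), len(out))  # 8 32 8
```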
 


In Attention

Attention is built from these primitives:

Attention Computation

                                                           
   1. Project inputs to Q, K, V spaces:                    
      Q = X @ W_Q    (768 → 64 per head)                   
      K = X @ W_K                                          
      V = X @ W_V                                          
                                                           
   2. Compute attention scores:                            
      scores = Q @ K.T    ← Dot products between all        
                            Q-K pairs                       
                                                           
   3. Scale and softmax:                                   
      weights = softmax(scores / sqrt(d_k))                
                                                           
   4. Weighted sum of values:                              
      output = weights @ V                                 
                                                           
   Each operation is dot products or matrix multiplies.    
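The four steps can be sketched end to end in pure Python. This is a single-head version with no masking and tiny hypothetical sizes (3 tokens, model width 4, head width 2); the weight values are made up, and real implementations use batched tensor libraries rather than nested lists.

```python
import math

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, w_q, w_k, w_v):
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)  # step 1
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                            # step 2
    weights = [softmax([s / math.sqrt(d_k) for s in row])       # step 3
               for row in scores]
    return matmul(weights, v), weights                          # step 4

# Hypothetical toy sizes: 3 tokens, model width 4, head width 2.
x = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 2.0],
     [1.0, 1.0, 1.0, 1.0]]
w_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
w_k = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
w_v = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0], [0.0, 3.0]]

output, weights = attention(x, w_q, w_k, w_v)
print(len(output), len(output[0]))              # 3 2
print([round(sum(row), 6) for row in weights])  # [1.0, 1.0, 1.0]
```

Each row of `weights` sums to 1, so every output row is a weighted average of the value vectors.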
                                                           

Why projections? Different W_Q, W_K, W_V let the model learn different “views” of the input. Query projection emphasizes “what am I looking for?” Key projection emphasizes “what do I contain?” Value projection emphasizes “what information should I contribute?”


Dimensionality and Capacity

More dimensions = more capacity to represent distinctions.

Dimensionality Tradeoff

                                                           
   384D: Good separation for many tasks                    
     • Fast inference                                      
     • Small storage                                       
     • May lose fine distinctions                          
                                                           
   768D: Rich separation (BERT-sized)                      
     • "bank" (financial) far from "bank" (river)          
     • Nuanced relationships preserved                     
                                                           
   4096D: Maximum expressiveness                           
     • Captures subtle distinctions                        
     • Expensive to compute and store                      
                                                           

Common dimensions:

  • Small/fast: 384 (e5-small, all-MiniLM)
  • Standard: 768 (BERT, many embedding models)
  • Large: 1024-4096 (GPT-scale, high-quality embeddings)
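The storage side of the tradeoff is simple arithmetic. The sketch below assumes float32 storage (4 bytes per value), one million vectors, and no index overhead; all three are assumptions for illustration, and quantized or compressed indexes would come out smaller.

```python
BYTES_PER_FLOAT32 = 4

def index_size_gb(num_vectors, dims):
    """Raw storage for float32 vectors, in GB (10**9 bytes), no index overhead."""
    return num_vectors * dims * BYTES_PER_FLOAT32 / 1e9

for dims in (384, 768, 4096):
    print(dims, index_size_gb(1_000_000, dims))
```

Under these assumptions, one million vectors take roughly 1.5 GB at 384 dimensions, 3.1 GB at 768, and 16.4 GB at 4096. Similarity search cost grows in the same proportion, since each comparison is one dot product over all dimensions.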

When This Matters

Situation                            | Concept to apply
Choosing an embedding model          | Dimensionality tradeoff
Understanding retrieval              | Cosine similarity for ranking
Understanding attention              | Q·K dot products, softmax, V weighting
Debugging “wrong results returned”   | Check distance metric matches model
Understanding layer transformations  | Matrix multiply as space transformation
Optimizing inference                 | Dot products are the computational bottleneck

Production signal

Why this concept matters

Interview: 65% of ML interviews
Production: Understanding embeddings and attention
Performance: Foundation for retrieval systems