Activation Functions

TL;DR

Activation functions add non-linearity to neural networks. ReLU is the classic choice, GELU is standard in transformers, and SwiGLU powers modern LLMs like Llama. Understanding these helps you read model architectures and debug training issues.

Visual Overview

Why Activation Functions

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Without activation (linear only):                       │
│     layer1: y = W1 × x + b1                               │
│     layer2: y = W2 × (W1 × x + b1) + b2                   │
│            = W2 × W1 × x + W2 × b1 + b2                   │
│            = W' × x + b'  ← still just linear!            │
│                                                           │
│   Deep network collapses to single linear transformation. │
│   Activation functions add non-linearity → actual depth.  │
│                                                           │
└───────────────────────────────────────────────────────────┘
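
The collapse shown above can be checked numerically. The following is a minimal sketch in plain Python; the weights, biases, and input are hypothetical values chosen only so the arithmetic is easy to follow, and the helper names (`matvec`, `matmul`, `add`) are not from any library.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

# Hypothetical weights and biases, for illustration only.
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]
W2, b2 = [[3.0, 0.0], [1.0, 1.0]], [-1.0, 2.0]
x = [2.0, -3.0]

# Two stacked linear layers with no activation in between.
two_layer = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The equivalent single layer: W' = W2 W1 and b' = W2 b1 + b2.
W_prime = matmul(W2, W1)
b_prime = add(matvec(W2, b1), b2)
one_layer = add(matvec(W_prime, x), b_prime)

print(two_layer, one_layer)  # both print the same vector
```

Inserting any non-linear function between the two layers breaks this equivalence, which is the point of the section.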

ReLU (Rectified Linear Unit)

ReLU

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   ReLU(x) = max(0, x)                                     │
│                                                           │
│        │                                                  │
│      3 │        /                                         │
│        │       /                                          │
│      1 │      /                                           │
│        │─────*─────────                                   │
│     -1 │     │                                            │
│        └─────┴─────────                                   │
│          -2  0   2                                        │
│                                                           │
│   Simple. Fast. Works well in practice.                   │
│   Default choice for hidden layers.                       │
│                                                           │
└───────────────────────────────────────────────────────────┘

Why ReLU Helps Gradient Flow

ReLU vs Sigmoid Gradients

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Sigmoid gradient: max 0.25, usually much smaller        │
│     Deep network: 0.25 × 0.25 × 0.25 × 0.25 ≈ 0.004       │
│     → Gradients vanish, early layers stop learning        │
│                                                           │
│   ReLU gradient:                                          │
│     x > 0: gradient = 1                                   │
│     x < 0: gradient = 0                                   │
│                                                           │
│     Deep network (active neurons): 1 × 1 × 1 × 1 = 1      │
│     → No vanishing! Gradients flow cleanly.               │
│                                                           │
└───────────────────────────────────────────────────────────┘
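
The arithmetic in the diagram can be reproduced with a short sketch. It multiplies per-layer activation gradients along a single path and ignores the weight matrices, as the diagram does. The pre-activation values and the helper name `chained_gradient` are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s), which peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 for negative ones."""
    return 1.0 if x > 0 else 0.0

def chained_gradient(grad_fn, pre_activations):
    """Product of activation gradients along one path through the network."""
    product = 1.0
    for z in pre_activations:
        product *= grad_fn(z)
    return product

# Best case for sigmoid: every pre-activation sits at 0, where the gradient is largest.
sigmoid_path = chained_gradient(sigmoid_grad, [0.0, 0.0, 0.0, 0.0])

# A ReLU path where every neuron is active (hypothetical positive inputs).
relu_path = chained_gradient(relu_grad, [0.5, 2.0, 1.0, 3.0])

print(sigmoid_path, relu_path)
```

Even in the best case the sigmoid product shrinks by a factor of four per layer, while the active ReLU path passes the gradient through unchanged.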

Gotcha: “Dying ReLU”. If a neuron always receives negative input, its gradient is always 0, so its weights stop updating and it never learns.

Variants that fix dying ReLU:

Leaky ReLU: max(0.01x, x), a small fixed slope for negative inputs
PReLU: max(alpha*x, x), where alpha is learned
ELU: x if x > 0, else alpha*(e^x - 1), a smooth negative region
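
The variants differ only in how they treat negative inputs, which a small sketch makes concrete. The default values `slope=0.01` and `alpha=1.0` are common choices assumed here, not values the text prescribes. PReLU has the same form as Leaky ReLU, with the slope treated as a trainable parameter.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    """Like ReLU, but negative inputs keep a small slope instead of becoming 0."""
    return x if x > 0 else slope * x

def elu(x, alpha=1.0):
    """Smoothly approaches -alpha for large negative inputs."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

for x in (-2.0, -0.5, 0.0, 1.5):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  leaky={leaky_relu(x):.4f}  elu={elu(x):.4f}")
```

Because Leaky ReLU and ELU return non-zero values (and non-zero gradients) for negative inputs, a neuron that drifts into the negative region can still recover.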

GELU (Gaussian Error Linear Unit)

GELU

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   GELU(x) = x × Φ(x)                                      │
│                                                           │
│   Where Φ = CDF of standard normal distribution.          │
│                                                           │
│        │                                                  │
│      3 │        /                                         │
│        │       /                                          │
│      1 │      /                                           │
│        │  ───*/                                           │
│     -1 │                                                 │
│        └─────┴─────────                                   │
│          -2  0   2                                        │
│                                                           │
│   Smooth version of ReLU.                                 │
│   Default in transformers (GPT, BERT).                    │
│                                                           │
└───────────────────────────────────────────────────────────┘

Why GELU for transformers:

Smoother gradients than ReLU
Slight regularization effect from probabilistic gating
Small negative values can pass through (unlike ReLU)
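
To see these properties in numbers, here is a minimal sketch of GELU using only the standard library. `gelu` follows the definition x × Φ(x) via `math.erf`; `gelu_tanh` is the widely used tanh approximation, included as an assumption about what many implementations compute rather than something stated above.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def relu(x):
    return max(0.0, x)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}  tanh_approx={gelu_tanh(x):.4f}")
```

For small negative inputs GELU returns a small negative value where ReLU returns exactly 0, and for large positive inputs the two are nearly identical.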

SwiGLU and GeGLU (Modern Gated Activations)

Used in: Llama, Mistral, PaLM, and most modern LLMs

SwiGLU and Gated Activations

SWIGLU
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   SwiGLU(x, W, V) = Swish(xW) * (xV)                      │
│                                                           │
│   Where:                                                  │
│     Swish(x) = x × sigmoid(x)                             │
│     * = element-wise multiplication                       │
│                                                           │
│   Two linear projections, one gated by Swish activation.  │
│                                                           │
└───────────────────────────────────────────────────────────┘

WHY GATED ACTIVATIONS?
┌───────────────────────────────────────────────────────────┐
│                                                           │
│ Standard FFN:                                             │
│ output = activation(x @ W1) @ W2                          │
│                                                           │
│ Gated FFN:                                                │
│ output = (activation(x @ W_gate) * (x @ W_up)) @ W_d      │
│                     ↑                  ↑                  │
│              "what to keep"   "candidate values"          │
│                                                           │
│ The gate learns WHICH information to let through.         │
│ More expressive than fixed activation functions.          │
│                                                           │
└───────────────────────────────────────────────────────────┘
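
The gated FFN formula can be sketched for a single token vector in plain Python. The tiny weight matrices below are hypothetical (model width 2, hidden width 3), biases are omitted, and `vecmat` is an illustrative helper standing in for `x @ W`.

```python
import math

def swish(x):
    """Swish (also called SiLU): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def vecmat(x, W):
    """Row vector x times matrix W (a list of rows), i.e. x @ W."""
    return [sum(xi * w for xi, w in zip(x, col)) for col in zip(*W)]

def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = [swish(g) for g in vecmat(x, W_gate)]   # "what to keep"
    up = vecmat(x, W_up)                           # "candidate values"
    hidden = [g * u for g, u in zip(gate, up)]     # element-wise gating
    return vecmat(hidden, W_down)                  # project back to model width

# Hypothetical weights, for illustration only.
W_gate = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
W_up = [[1.0, 1.0, 1.0], [0.0, 1.0, 2.0]]
W_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

out = swiglu_ffn([1.0, -1.0], W_gate, W_up, W_down)
print(out)
```

Replacing `swish` with a GELU function in the same structure gives GeGLU.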

Tradeoff: SwiGLU needs three weight matrices instead of two, so at the same hidden width the FFN has 50% more parameters. The quality gain is generally considered worth the extra cost.
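
The parameter cost is simple arithmetic, sketched below. The model width of 4096 and the 4× hidden-width convention are assumptions for illustration. The last line shows one common way to offset the third matrix, shrinking the hidden width to about two thirds; this is a convention used by several gated-FFN models, not something the text above requires.

```python
def ffn_params(d_model, d_ff, gated):
    """Weight count of one FFN block, ignoring biases."""
    n_matrices = 3 if gated else 2
    return n_matrices * d_model * d_ff

d_model = 4096          # hypothetical model width
d_ff = 4 * d_model      # assumed conventional hidden width

standard = ffn_params(d_model, d_ff, gated=False)
gated_same_width = ffn_params(d_model, d_ff, gated=True)
gated_shrunk = ffn_params(d_model, (2 * d_ff) // 3, gated=True)

print(f"standard FFN:          {standard:,}")
print(f"gated, same width:     {gated_same_width:,}")
print(f"gated, 2/3 the width:  {gated_shrunk:,}")
```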

Sigmoid

┌───────────────────────────────────────────────────────────┐
│                                                           │
│   σ(x) = 1 / (1 + e^(-x))                                 │
│                                                           │
│   Output range: (0, 1)                                    │
│                                                           │
│        │                                                  │
│      1 │         *───────                                 │
│        │       /                                          │
│    0.5 │──────*                                           │
│        │    /                                             │
│      0 │───*───────────                                   │
│        └─────┴─────────                                   │
│          -4  0   4                                        │
│                                                           │
│   Use for: binary output, gates (LSTM), probabilities     │
│                                                           │
└───────────────────────────────────────────────────────────┘

Gotcha: Vanishing gradients. Sigmoid saturates at both extremes, where its gradient approaches 0.
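
Saturation is easy to see by evaluating the derivative at a few points. This is a small sketch using the formula above; the sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid, expressed through its own output: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (0.0, 2.0, 5.0, 10.0, -10.0):
    print(f"x={x:+5.1f}  sigmoid={sigmoid(x):.6f}  gradient={sigmoid_grad(x):.6f}")
```

The gradient is largest at x = 0 and falls off quickly in both directions, so a unit whose input is far from 0 passes almost no learning signal backward.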

Softmax

Softmax and Temperature Scaling

SOFTMAX
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   softmax(x_i) = e^(x_i) / Σ(e^(x_j))                     │
│                                                           │
│   Converts logits → probability distribution (sums to 1)  │
│                                                           │
│   Example:                                                │
│     logits:  [2.0, 1.0, 0.1]                              │
│                                                           │
│     exp:     [7.39, 2.72, 1.11]  (e^x for each)           │
│     sum:     11.22                                        │
│                                                           │
│     softmax: [0.66, 0.24, 0.10]  (each / sum)             │
│              └────── sums to 1.0 ──────┘                  │
│                                                           │
│   Use for: multi-class classification output layer        │
│                                                           │
└───────────────────────────────────────────────────────────┘
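
The worked example above can be reproduced with a few lines of plain Python. One detail in this sketch is an addition rather than part of the formula as written: the maximum logit is subtracted before exponentiating, a standard trick that leaves the result unchanged but avoids overflow for large logits.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)                            # stability shift, does not change the result
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 2) for p in probs])
print(sum(probs))
```

Without the shift, logits in the hundreds or thousands would overflow `math.exp`; with it, the largest exponent is always 0.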

TEMPERATURE SCALING
┌───────────────────────────────────────────────────────────┐
│                                                           │
│ softmax(x_i / T)                                          │
│                                                           │
│ T < 1: sharper (more confident)                           │
│ T > 1: softer (more uniform)                              │
│ T = 1: standard                                           │
│                                                           │
│ ┌─────────────────────────────────────┐                   │
│ │ T=0.5: [0.86, 0.12, 0.02] sharp     │                   │
│ │ T=1.0: [0.66, 0.24, 0.10] normal    │                   │
│ │ T=2.0: [0.50, 0.30, 0.19] soft      │                   │
│ └─────────────────────────────────────┘                   │
│                                                           │
└───────────────────────────────────────────────────────────┘
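
A minimal sketch of temperature scaling, recomputing the distribution for the logits [2.0, 1.0, 0.1] at three temperatures. The function name `softmax_with_temperature` and the positive-temperature check are assumptions of this sketch.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits divided by temperature T (T must be positive)."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / T for v in logits]
    m = max(scaled)                            # shift for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: {[round(p, 2) for p in probs]}")
```

Lower temperatures push probability mass toward the largest logit; higher temperatures flatten the distribution toward uniform.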

Where to Use What

Layer	Activation	Reason
Hidden (MLP)	ReLU or GELU	Fast, works well
Hidden (Transformer FFN)	GELU, SwiGLU, GeGLU	Smoother, standard
Binary output	Sigmoid	Maps to a probability in (0, 1)
Multi-class output	Softmax	Distribution over classes
Regression output	None (linear)	Unconstrained range
Attention scores	Softmax	Weights sum to 1
Gates (LSTM/GRU)	Sigmoid	Gate values in (0, 1)

Debugging Activation Issues

MODEL OUTPUTS ALL SAME VALUE
┌───────────────────────────────────────────────────────────┐
│                                                           │
│   Symptoms:                                               │
│     • All predictions identical regardless of input       │
│     • Loss doesn't decrease                               │
│                                                           │
│   Causes:                                                 │
│     • Dead ReLUs: All neurons stuck at 0                  │
│     • Bad initialization: Weights too small/large         │
│     • All activations saturated (sigmoid/tanh at extreme) │
│                                                           │
│   Debug steps:                                            │
│     1. Check activation statistics (mean, std, % zeros)   │
│     2. Try Leaky ReLU instead of ReLU                     │
│     3. Check weight initialization (Xavier/He)            │
│     4. Reduce learning rate if activations are exploding  │
│                                                           │
└───────────────────────────────────────────────────────────┘
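
Debug step 1 can be sketched as a small helper that summarizes one layer's activations. The function name `activation_stats` and the sample values are hypothetical; in a real model the input would be the activations captured from a layer during a forward pass.

```python
import math

def activation_stats(values):
    """Mean, standard deviation, and percentage of exact zeros in a list of activations."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    zeros = sum(1 for v in values if v == 0.0)
    return {"mean": mean, "std": math.sqrt(variance), "pct_zeros": 100.0 * zeros / n}

# Hypothetical post-ReLU activations from two layers.
healthy_layer = [0.0, 0.7, 1.2, 0.0, 0.3, 2.1, 0.0, 0.9]
dead_layer = [0.0] * 8

print(activation_stats(healthy_layer))
print(activation_stats(dead_layer))
```

A layer that reports close to 100% zeros with near-zero standard deviation across many different inputs is a strong hint of dead ReLUs.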

MODEL VERY CONFIDENT BUT WRONG
┌───────────────────────────────────────────────────────────┐
│                                                           │
│ Symptoms:                                                 │
│ • Softmax outputs near 0 or 1                             │
│ • High confidence on wrong predictions                    │
│                                                           │
│ Causes:                                                   │
│ • Overfitting                                             │
│ • Missing regularization                                  │
│ • Logits too large going into softmax                     │
│                                                           │
│ Debug steps:                                              │
│ 1. Add dropout before final layer                         │
│ 2. Use label smoothing                                    │
│ 3. Add temperature scaling at inference                   │
│ 4. Check for data leakage                                 │
│                                                           │
└───────────────────────────────────────────────────────────┘
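
Debug step 2, label smoothing, can be sketched as follows. This assumes the common form that mixes the one-hot target with a uniform distribution; `smooth_labels` is a hypothetical helper and `epsilon=0.1` is an illustrative value, not a recommendation from the text.

```python
def smooth_labels(target_index, num_classes, epsilon=0.1):
    """Return (1 - epsilon) * one_hot + epsilon / num_classes for each class."""
    off_value = epsilon / num_classes
    return [
        (1.0 - epsilon) + off_value if i == target_index else off_value
        for i in range(num_classes)
    ]

targets = smooth_labels(target_index=0, num_classes=4)
print(targets)
print(sum(targets))
```

Training against these softened targets penalizes the model for pushing any single softmax output all the way to 1, which tends to reduce overconfidence.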

When This Matters

Situation	What to know
Understanding transformer FFN	Usually GELU or SwiGLU (the original Transformer used ReLU)
Reading Llama/Mistral architecture	SwiGLU in FFN blocks
Debugging “model isn’t learning”	Check for dead ReLUs
Model outputs all same	Activation saturation or dead neurons
Understanding attention	Softmax creates probability weights
Temperature in generation	Softmax(logits/T) controls randomness
Choosing for new model	GELU for transformers, ReLU for CNNs