Activation Functions

ReLU, GELU, SwiGLU, softmax, and sigmoid: what they do and when to use them

TL;DR

Activation functions add non-linearity to neural networks. ReLU is the classic choice, GELU is standard in transformers, and SwiGLU powers modern LLMs like Llama. Understanding these helps you read model architectures and debug training issues.

Visual Overview

Why Activation Functions
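
Stacking linear layers without an activation between them collapses into a single linear map, so the network can only ever learn linear functions; the non-linearity is what gives depth its expressive power. A minimal PyTorch sketch (layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 8)  # small batch of inputs; sizes are arbitrary

# Two linear layers with no activation in between...
lin1 = nn.Linear(8, 16, bias=False)
lin2 = nn.Linear(16, 8, bias=False)
stacked = lin2(lin1(x))

# ...are exactly one linear layer with the combined weight matrix W2 @ W1.
combined = x @ (lin2.weight @ lin1.weight).T
print(torch.allclose(stacked, combined, atol=1e-6))  # True: no extra expressive power

# Inserting a non-linearity (here ReLU) breaks the collapse and lets the
# network represent non-linear functions.
nonlinear = lin2(torch.relu(lin1(x)))
```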

ReLU (Rectified Linear Unit)

Why ReLU Helps Gradient Flow

ReLU vs Sigmoid Gradients

Gotcha: “Dying ReLU” — if a neuron always receives negative input, its gradient is always 0 and it never learns.

Variants that fix dying ReLU (sketched below):

  • Leaky ReLU: max(0.01x, x) — small slope for negative inputs
  • PReLU: max(alpha*x, x) — the negative slope alpha is learned
  • ELU: x if x > 0, else alpha*(e^x - 1) — smooth negative region
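
A minimal PyTorch sketch of ReLU and the variants above, plus a simple dead-unit check (the batch and layer sizes are arbitrary illustrations):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)

relu  = F.relu(x)                             # max(0, x)
leaky = F.leaky_relu(x, negative_slope=0.01)  # max(0.01x, x)
elu   = F.elu(x, alpha=1.0)                   # x if x > 0, else alpha*(exp(x) - 1)
prelu = nn.PReLU(init=0.25)                   # like leaky ReLU, but the slope is learned
print(prelu(x))

# Dead-ReLU check: a unit whose pre-activation is negative for every input
# in a batch gets zero gradient from ReLU on that batch and stops learning.
pre_act = torch.randn(256, 512)               # pretend pre-activations: (batch, units)
dead_fraction = (pre_act <= 0).all(dim=0).float().mean().item()
print(f"fraction of units dead on this batch: {dead_fraction:.3f}")
```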

GELU (Gaussian Error Linear Unit)

Why GELU for transformers:

  • Smoother gradients than ReLU
  • Slight regularization effect from probabilistic gating
  • Small negative values can pass through (unlike ReLU)
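
A minimal sketch of GELU, assuming the exact form GELU(x) = x * Φ(x) (Φ is the standard normal CDF) and the common tanh approximation; both are built into PyTorch:

```python
import math
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)

# Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
gelu_exact = 0.5 * x * (1 + torch.erf(x / math.sqrt(2)))

# Tanh approximation used in many earlier transformer implementations.
gelu_tanh = 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

assert torch.allclose(F.gelu(x), gelu_exact, atol=1e-6)
assert torch.allclose(F.gelu(x, approximate="tanh"), gelu_tanh, atol=1e-6)

# Unlike ReLU, small negative inputs give small negative outputs
# instead of being clipped to exactly zero.
print(F.gelu(torch.tensor([-0.5])))  # ~ -0.154
```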

SwiGLU and GeGLU (Modern Gated Activations)

Used in: Llama, Mistral, PaLM, and most modern LLMs

Tradeoff: SwiGLU needs three weight matrices instead of two (more parameters per FFN block), but the quality improvement is generally worth it.
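
A minimal sketch of a SwiGLU feed-forward block of the kind used in Llama-style models; the three projections are the gate, up, and down matrices mentioned above, and the sizes here are illustrative only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN: down(SiLU(gate(x)) * up(x)), three matrices instead of two."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)  # W_gate
        self.up   = nn.Linear(d_model, d_hidden, bias=False)  # W_up
        self.down = nn.Linear(d_hidden, d_model, bias=False)  # W_down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(x) = x * sigmoid(x) is the "Swish" in SwiGLU;
        # swapping it for GELU gives GeGLU.
        return self.down(F.silu(self.gate(x)) * self.up(x))

# Llama-style models typically shrink d_hidden (roughly 2/3 of the usual 4*d_model)
# to offset the extra matrix; exact sizes vary by model.
ffn = SwiGLUFFN(d_model=512, d_hidden=1376)
out = ffn(torch.randn(2, 16, 512))  # (batch, seq_len, d_model)
print(out.shape)                    # torch.Size([2, 16, 512])
```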


Sigmoid

Gotcha: Vanishing gradients — sigmoid saturates at both extremes, so gradients shrink toward 0.
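
A minimal sketch of sigmoid and its saturation: σ(x) = 1/(1 + e^(-x)) squashes inputs to (0, 1), and its derivative σ(x)(1 - σ(x)) peaks at 0.25 and vanishes for large |x|:

```python
import torch

x = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0], requires_grad=True)
s = torch.sigmoid(x)   # 1 / (1 + exp(-x)), output in (0, 1)
s.sum().backward()     # x.grad holds d(sigmoid)/dx = s * (1 - s)

print(s)       # ~ [0.0000, 0.1192, 0.5000, 0.8808, 1.0000]
print(x.grad)  # ~ [0.0000, 0.1050, 0.2500, 0.1050, 0.0000]  <- vanishes at the extremes
```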


Softmax

Softmax and Temperature Scaling
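
A minimal sketch of softmax with temperature scaling, softmax(logits / T): T < 1 sharpens the distribution toward the largest logit, T > 1 flattens it toward uniform (the logits here are arbitrary):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])  # arbitrary example logits

for T in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / T, dim=-1)  # exp(z_i / T) / sum_j exp(z_j / T)
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]} sum={probs.sum().item():.2f}")

# T -> 0 approaches greedy argmax; large T approaches uniform randomness.
```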

Where to Use What

Layer                     | Activation           | Reason
Hidden (MLP)              | ReLU or GELU         | Fast, works well
Hidden (Transformer FFN)  | GELU, SwiGLU, GeGLU  | Smoother, standard
Binary output             | Sigmoid              | Maps to [0,1] probability
Multi-class output        | Softmax              | Distribution over classes
Regression output         | None (linear)        | Unconstrained range
Attention scores          | Softmax              | Weights sum to 1
Gates (LSTM/GRU)          | Sigmoid              | Control flow in [0,1]

Debugging Activation Issues

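A minimal debugging sketch, assuming a PyTorch model: forward hooks record, per activation layer, what fraction of ReLU outputs are exactly zero (dead units) and what fraction of sigmoid outputs are pinned near 0 or 1 (saturation). The toy model and thresholds are arbitrary illustrations:

```python
import torch
import torch.nn as nn

def attach_activation_monitors(model: nn.Module) -> dict:
    """Record dead/saturated activation statistics during forward passes."""
    stats = {}

    def make_hook(name, module):
        def hook(mod, inputs, output):
            flat = output.detach().flatten()
            if isinstance(module, nn.ReLU):
                # A high zero-fraction across many batches suggests dying ReLUs.
                stats[name] = ("zero_frac", (flat == 0).float().mean().item())
            elif isinstance(module, nn.Sigmoid):
                # Outputs stuck near 0 or 1 mean tiny gradients (saturation).
                sat = ((flat < 0.01) | (flat > 0.99)).float().mean().item()
                stats[name] = ("saturated_frac", sat)
        return hook

    for name, module in model.named_modules():
        if isinstance(module, (nn.ReLU, nn.Sigmoid)):
            module.register_forward_hook(make_hook(name, module))
    return stats

# Usage on a toy model (sizes arbitrary):
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
stats = attach_activation_monitors(model)
model(torch.randn(128, 32))
print(stats)  # e.g. {'1': ('zero_frac', 0.49), '3': ('saturated_frac', 0.0)}
```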

When This Matters

Situation                            | What to know
Understanding transformer FFN        | GELU or SwiGLU, not ReLU
Reading Llama/Mistral architecture   | SwiGLU in FFN blocks
Debugging “model isn’t learning”     | Check for dead ReLUs
Model outputs all same               | Activation saturation or dead neurons
Understanding attention              | Softmax creates probability weights
Temperature in generation            | Softmax(logits/T) controls randomness
Choosing for new model               | GELU for transformers, ReLU for CNNs
Interview Notes

  • 💼 Interview relevance: 60% of ML architecture interviews
  • 🏭 Production impact: understanding model architecture
  • Performance: GELU/SwiGLU in modern LLMs