Activation Functions

ReLU, GELU, SwiGLU, softmax, and sigmoid: what they do and when to use them

TL;DR

Activation functions add non-linearity to neural networks. ReLU is the classic choice, GELU is standard in transformers, and SwiGLU powers modern LLMs like Llama. Understanding these helps you read model architectures and debug training issues.

Visual Overview

Why Activation Functions
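
Stacking linear layers without an activation between them collapses into a single linear map, so the network can only ever learn linear functions; the non-linearity is what gives depth its expressive power. A minimal PyTorch sketch (layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 8)  # small batch of inputs; sizes are arbitrary

# Two linear layers with no activation in between...
lin1 = nn.Linear(8, 16, bias=False)
lin2 = nn.Linear(16, 8, bias=False)
stacked = lin2(lin1(x))

# ...are exactly one linear layer with the combined weight matrix W2 @ W1.
combined = x @ (lin2.weight @ lin1.weight).T
print(torch.allclose(stacked, combined, atol=1e-6))  # True: no extra expressive power

# Inserting a non-linearity (here ReLU) breaks the collapse and lets the
# network represent non-linear functions.
nonlinear = lin2(torch.relu(lin1(x)))
```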

ReLU (Rectified Linear Unit)

Why ReLU Helps Gradient Flow

ReLU vs Sigmoid Gradients

Gotcha: “Dying ReLU” — if a neuron always receives negative input, its gradient is always 0 and it never learns.

Variants that fix dying ReLU (sketched below):

  • Leaky ReLU: max(0.01x, x) — small slope for negative inputs
  • PReLU: max(alpha*x, x) — the negative slope alpha is learned
  • ELU: x if x > 0, else alpha*(e^x - 1) — smooth negative region
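
A minimal PyTorch sketch of ReLU and the variants above, plus a simple dead-unit check (the batch and layer sizes are arbitrary illustrations):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)

relu  = F.relu(x)                             # max(0, x)
leaky = F.leaky_relu(x, negative_slope=0.01)  # max(0.01x, x)
elu   = F.elu(x, alpha=1.0)                   # x if x > 0, else alpha*(exp(x) - 1)
prelu = nn.PReLU(init=0.25)                   # like leaky ReLU, but the slope is learned
print(prelu(x))

# Dead-ReLU check: a unit whose pre-activation is negative for every input
# in a batch gets zero gradient from ReLU on that batch and stops learning.
pre_act = torch.randn(256, 512)               # pretend pre-activations: (batch, units)
dead_fraction = (pre_act <= 0).all(dim=0).float().mean().item()
print(f"fraction of units dead on this batch: {dead_fraction:.3f}")
```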

GELU (Gaussian Error Linear Unit)

Why GELU for transformers:

  • Smoother gradients than ReLU
  • Slight regularization effect from probabilistic gating
  • Small negative values can pass through (unlike ReLU)
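
A minimal sketch of GELU, assuming the exact form GELU(x) = x * Φ(x) (Φ is the standard normal CDF) and the common tanh approximation; both are built into PyTorch:

```python
import math
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)

# Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
gelu_exact = 0.5 * x * (1 + torch.erf(x / math.sqrt(2)))

# Tanh approximation used in many earlier transformer implementations.
gelu_tanh = 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

assert torch.allclose(F.gelu(x), gelu_exact, atol=1e-6)
assert torch.allclose(F.gelu(x, approximate="tanh"), gelu_tanh, atol=1e-6)

# Unlike ReLU, small negative inputs give small negative outputs
# instead of being clipped to exactly zero.
print(F.gelu(torch.tensor([-0.5])))  # ~ -0.154
```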

SwiGLU and GeGLU (Modern Gated Activations)

Used in: Llama, Mistral, PaLM, and most modern LLMs

Tradeoff: SwiGLU needs three weight matrices instead of two (more parameters per FFN block), but the quality improvement is generally worth it.
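
A minimal sketch of a SwiGLU feed-forward block of the kind used in Llama-style models; the three projections are the gate, up, and down matrices mentioned above, and the sizes here are illustrative only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN: down(SiLU(gate(x)) * up(x)), three matrices instead of two."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)  # W_gate
        self.up   = nn.Linear(d_model, d_hidden, bias=False)  # W_up
        self.down = nn.Linear(d_hidden, d_model, bias=False)  # W_down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(x) = x * sigmoid(x) is the "Swish" in SwiGLU;
        # swapping it for GELU gives GeGLU.
        return self.down(F.silu(self.gate(x)) * self.up(x))

# Llama-style models typically shrink d_hidden (roughly 2/3 of the usual 4*d_model)
# to offset the extra matrix; exact sizes vary by model.
ffn = SwiGLUFFN(d_model=512, d_hidden=1376)
out = ffn(torch.randn(2, 16, 512))  # (batch, seq_len, d_model)
print(out.shape)                    # torch.Size([2, 16, 512])
```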


Sigmoid

Gotcha: Vanishing gradients — sigmoid saturates at both extremes, so gradients shrink toward 0.
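
A minimal sketch of sigmoid and its saturation: σ(x) = 1/(1 + e^(-x)) squashes inputs to (0, 1), and its derivative σ(x)(1 - σ(x)) peaks at 0.25 and vanishes for large |x|:

```python
import torch

x = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0], requires_grad=True)
s = torch.sigmoid(x)   # 1 / (1 + exp(-x)), output in (0, 1)
s.sum().backward()     # x.grad holds d(sigmoid)/dx = s * (1 - s)

print(s)       # ~ [0.0000, 0.1192, 0.5000, 0.8808, 1.0000]
print(x.grad)  # ~ [0.0000, 0.1050, 0.2500, 0.1050, 0.0000]  <- vanishes at the extremes
```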


Softmax

Softmax and Temperature Scaling
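
A minimal sketch of softmax with temperature scaling, softmax(logits / T): T < 1 sharpens the distribution toward the largest logit, T > 1 flattens it toward uniform (the logits here are arbitrary):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])  # arbitrary example logits

for T in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / T, dim=-1)  # exp(z_i / T) / sum_j exp(z_j / T)
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]} sum={probs.sum().item():.2f}")

# T -> 0 approaches greedy argmax; large T approaches uniform randomness.
```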

Where to Use What

Layer                     | Activation           | Reason
Hidden (MLP)              | ReLU or GELU         | Fast, works well
Hidden (Transformer FFN)  | GELU, SwiGLU, GeGLU  | Smoother, standard
Binary output             | Sigmoid              | Maps to [0,1] probability
Multi-class output        | Softmax              | Distribution over classes
Regression output         | None (linear)        | Unconstrained range
Attention scores          | Softmax              | Weights sum to 1
Gates (LSTM/GRU)          | Sigmoid              | Control flow in [0,1]

Debugging Activation Issues

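A minimal debugging sketch, assuming a PyTorch model: forward hooks record, per activation layer, what fraction of ReLU outputs are exactly zero (dead units) and what fraction of sigmoid outputs are pinned near 0 or 1 (saturation). The toy model and thresholds are arbitrary illustrations:

```python
import torch
import torch.nn as nn

def attach_activation_monitors(model: nn.Module) -> dict:
    """Record dead/saturated activation statistics during forward passes."""
    stats = {}

    def make_hook(name, module):
        def hook(mod, inputs, output):
            flat = output.detach().flatten()
            if isinstance(module, nn.ReLU):
                # A high zero-fraction across many batches suggests dying ReLUs.
                stats[name] = ("zero_frac", (flat == 0).float().mean().item())
            elif isinstance(module, nn.Sigmoid):
                # Outputs stuck near 0 or 1 mean tiny gradients (saturation).
                sat = ((flat < 0.01) | (flat > 0.99)).float().mean().item()
                stats[name] = ("saturated_frac", sat)
        return hook

    for name, module in model.named_modules():
        if isinstance(module, (nn.ReLU, nn.Sigmoid)):
            module.register_forward_hook(make_hook(name, module))
    return stats

# Usage on a toy model (sizes arbitrary):
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
stats = attach_activation_monitors(model)
model(torch.randn(128, 32))
print(stats)  # e.g. {'1': ('zero_frac', 0.49), '3': ('saturated_frac', 0.0)}
```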

When This Matters

Situation                            | What to know
Understanding transformer FFN        | GELU or SwiGLU, not ReLU
Reading Llama/Mistral architecture   | SwiGLU in FFN blocks
Debugging “model isn’t learning”     | Check for dead ReLUs
Model outputs all same               | Activation saturation or dead neurons
Understanding attention              | Softmax creates probability weights
Temperature in generation            | Softmax(logits/T) controls randomness
Choosing for new model               | GELU for transformers, ReLU for CNNs
Interview Notes

  • 💼 Interview relevance: 60% of ML architecture interviews
  • 🏭 Production impact: understanding model architecture
  • Performance: GELU/SwiGLU in modern LLMs