
Activation Functions

Summary

ReLU, GELU, SwiGLU, softmax, and sigmoid: what they do and when to use them

TL;DR

Activation functions add non-linearity to neural networks. ReLU is the classic choice, GELU is standard in transformers, and SwiGLU powers modern LLMs like Llama. Understanding these helps you read model architectures and debug training issues.

Visual Overview

Why Activation Functions

                                                           
   Without activation (linear only):                       
     layer1: y = W1 × x + b1                               
     layer2: y = W2 × (W1 × x + b1) + b2                   
            = W2 × W1 × x + W2 × b1 + b2                   
            = W' × x + b'   → still just linear!
                                                           
   Deep network collapses to single linear transformation. 
   Activation functions add non-linearity → actual depth.
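
The collapse above can be checked numerically. The following is a minimal sketch in plain Python; the weights, biases, and input are arbitrary illustrative values, and the helper names (`matvec`, `matmul`, `add`, `relu_vec`) are made up for this example.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both given as lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def relu_vec(v):
    return [max(0.0, a) for a in v]

# Arbitrary illustrative parameters (not taken from any real model).
W1, b1 = [[1.0, 2.0], [0.5, -1.0]], [0.1, -0.2]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [0.3, 0.4]
x = [-1.0, 2.0]

# Two stacked linear layers with no activation in between.
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The equivalent single layer: W' = W2 W1, b' = W2 b1 + b2.
W_prime = matmul(W2, W1)
b_prime = add(matvec(W2, b1), b2)
one_layer = add(matvec(W_prime, x), b_prime)

# The same two layers with ReLU in between no longer match W' x + b'.
with_relu = add(matvec(W2, relu_vec(add(matvec(W1, x), b1))), b2)

print(two_layers, one_layer, with_relu)
```

The first two results agree up to floating-point rounding, while the ReLU version differs because one hidden unit is negative for this input and gets clipped to zero.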
                                                           


ReLU (Rectified Linear Unit)


                                                           
   ReLU(x) = max(0, x)                                     
                                                           
                                                          
   [Plot: ReLU(x) for x from -2 to 2. Flat at 0 for x < 0, then a straight line with slope 1 for x > 0.]
                                                           
   Simple. Fast. Works well in practice.                   
   Default choice for hidden layers.                       
                                                           

Why ReLU Helps Gradient Flow

ReLU vs Sigmoid Gradients

                                                           
   Sigmoid gradient: max 0.25, usually much smaller        
     Deep network: 0.25 × 0.25 × 0.25 × 0.25 = 0.004       
      → Gradients vanish, early layers stop learning
                                                           
   ReLU gradient:                                          
     x > 0: gradient = 1                                   
     x < 0: gradient = 0                                   
                                                           
     Deep network (active neurons): 1 × 1 × 1 × 1 = 1      
      → No vanishing! Gradients flow cleanly.
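
The arithmetic above can be reproduced with a short sketch. It multiplies only the activation derivatives along one path through four layers and deliberately ignores the weight terms that also appear in a real backward pass. The function names and the sample pre-activation values are assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of sigmoid: s * (1 - s), which peaks at 0.25 when x == 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of max(0, x); taken as 0 at x == 0 by convention here.
    return 1.0 if x > 0 else 0.0

def chain(grad_fn, pre_activations):
    """Product of per-layer activation derivatives along one path."""
    product = 1.0
    for z in pre_activations:
        product *= grad_fn(z)
    return product

best_case_sigmoid = chain(sigmoid_grad, [0.0] * 4)  # 0.25 ** 4
typical_sigmoid = chain(sigmoid_grad, [2.0] * 4)    # away from 0, much smaller
active_relu = chain(relu_grad, [2.0] * 4)           # all units active

print(best_case_sigmoid, typical_sigmoid, active_relu)
```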
                                                           

Gotcha: “Dying ReLU”. If a neuron always receives negative input, its gradient is always 0 and it never learns.

Variants that fix dying ReLU:

  • Leaky ReLU: max(0.01x, x), a small fixed slope for negatives
  • PReLU: max(alpha*x, x), with alpha learned during training
  • ELU: x if x > 0, else alpha*(e^x - 1), a smooth negative region
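
For comparison, here is a minimal scalar sketch of these variants. The default values `slope=0.01` and `alpha=1.0` are assumptions chosen for illustration. PReLU has the same form as Leaky ReLU, except that the slope is a trained parameter rather than a constant.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # For 0 < slope < 1 this equals max(slope * x, x).
    return x if x > 0 else slope * x

def elu(x, alpha=1.0):
    # Smoothly approaches -alpha as x goes to negative infinity.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

for x in (-2.0, -0.5, 0.0, 1.5):
    print(x, relu(x), leaky_relu(x), round(elu(x), 4))
```

Unlike ReLU, both variants return a nonzero value for negative input, so the unit still receives a gradient there.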

GELU (Gaussian Error Linear Unit)


                                                           
   GELU(x) = x × Φ(x)                                      
                                                           
   Where Φ = CDF of standard normal distribution.          
                                                           
                                                          
   [Plot: GELU(x) for x from -2 to 2. Close to ReLU, but smooth around 0 with a small dip below zero for negative x.]
                                                           
   Smooth version of ReLU.                                 
   Default in transformers (GPT, BERT).                    
                                                           

Why GELU for transformers:

  • Smoother gradients than ReLU
  • Slight regularization effect from probabilistic gating
  • Small negative values can pass through (unlike ReLU)
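
The definition GELU(x) = x × Φ(x) can be written directly with the error function from the standard library. This sketch also includes the tanh approximation that many implementations use; the function names are chosen for this example.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

def gelu_tanh(x):
    """Common tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, round(gelu(x), 4), round(gelu_tanh(x), 4))
```

The negative inputs produce small negative outputs rather than exactly zero, which is the "small negative values can pass through" behavior listed above.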

SwiGLU and GeGLU (Modern Gated Activations)

Used in: Llama, Mistral, PaLM, and many other modern LLMs

SwiGLU and Gated Activations
SWIGLU

                                                           
   SwiGLU(x, W, V) = Swish(xW) * (xV)                      
                                                           
   Where:                                                  
     Swish(x) = x × sigmoid(x)                             
     * = element-wise multiplication                       
                                                           
   Two linear projections; the Swish-activated one gates the other.
   GeGLU is the same construction with GELU in place of Swish.
                                                           


WHY GATED ACTIVATIONS?

 
 Standard FFN: 
 output = activation(x @ W1) @ W2 
 
 Gated FFN: 
 output = (activation(x @ W_gate) * (x @ W_up)) @ W_d 
   activation(x @ W_gate): the gate, "what to keep" 
   x @ W_up: the "candidate values" 
 
 The gate learns WHICH information to let through. 
 More expressive than fixed activation functions. 
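
A minimal sketch of the gated FFN formula for a single input vector, in plain Python. The tiny weight matrices are made-up values (model width 2, hidden width 3), and the helper names `swish`, `vecmat`, and `swiglu_ffn` are chosen for this example. Real implementations use batched tensor operations instead.

```python
import math

def swish(z):
    """Swish (also called SiLU): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def vecmat(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]

def swiglu_ffn(x, W_gate, W_up, W_d):
    gate = [swish(g) for g in vecmat(x, W_gate)]  # "what to keep"
    up = vecmat(x, W_up)                          # "candidate values"
    hidden = [g * u for g, u in zip(gate, up)]    # element-wise product
    return vecmat(hidden, W_d)                    # project back to model width

# Illustrative weights only.
W_gate = [[0.5, -1.0, 2.0], [1.0, 0.0, -0.5]]
W_up = [[1.0, 1.0, 0.0], [0.0, 2.0, 1.0]]
W_d = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

y = swiglu_ffn([1.0, 2.0], W_gate, W_up, W_d)
print(y)
```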
 

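Tradeoff: a SwiGLU FFN needs 3 weight matrices instead of 2. Models commonly shrink the hidden dimension (often to about 2/3 of the usual size) to keep the parameter count comparable, and the gated version still tends to give better quality.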
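
A rough parameter count makes the tradeoff concrete. The sketch below ignores biases and assumes a hypothetical model width of 4096, a 4x expansion for the standard FFN, and a hidden size shrunk to 2/3 for the gated FFN, which is one common way to compensate for the third matrix. All names and sizes are illustrative.

```python
def ffn_params(d_model, d_hidden, gated):
    """Weight count for one FFN block, ignoring biases."""
    n_matrices = 3 if gated else 2
    return n_matrices * d_model * d_hidden

d_model = 4096                   # hypothetical model width
d_standard = 4 * d_model         # assumed 4x expansion for a standard FFN
d_gated = (2 * d_standard) // 3  # assumed 2/3 shrink for the gated FFN

standard = ffn_params(d_model, d_standard, gated=False)
gated_same_width = ffn_params(d_model, d_standard, gated=True)
gated_shrunk = ffn_params(d_model, d_gated, gated=True)

print(standard, gated_same_width, gated_shrunk)
```

At the same hidden width the gated block has 1.5x the weights; with the shrunk hidden width the two counts are nearly equal.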


Sigmoid


                                                           
   σ(x) = 1 / (1 + e^(-x))                                 
                                                           
   Output range: (0, 1)                                    
                                                           
                                                          
   [Plot: σ(x) for x from -4 to 4. S-shaped curve rising from 0 toward 1, passing through 0.5 at x = 0.]
                                                           
   Use for: binary output, gates (LSTM), probabilities     
                                                           

Gotcha: Vanishing gradients. Sigmoid saturates for large |x|, where its gradient approaches 0.
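
To see the saturation in numbers, the following sketch evaluates sigmoid and its derivative at a few points. The two-branch form is one common way to avoid overflow in `exp` for large negative inputs; it is shown here as an assumption about good practice, not as any particular library's implementation.

```python
import math

def sigmoid(x):
    """Sigmoid that avoids overflow in exp for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.5f}  grad={sigmoid_grad(x):.6f}")
```

The gradient is largest at x = 0 and falls off quickly on both sides.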


Softmax

Softmax and Temperature Scaling
SOFTMAX

                                                           
   softmax(x_i) = e^(x_i) / Σ(e^(x_j))                     
                                                           
   Converts logits → probability distribution (sums to 1)
                                                           
   Example:                                                
     logits:  [2.0, 1.0, 0.1]                              
                                                           
     exp:     [7.39, 2.72, 1.11]  (e^x for each)           
     sum:     11.22                                        
                                                           
     softmax: [0.66, 0.24, 0.10]  (each / sum)             
               (sums to 1.0)
                                                           
   Use for: multi-class classification output layer        
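
The worked example can be reproduced with a few lines of plain Python. This sketch subtracts the maximum logit before exponentiating, a standard trick that leaves the result unchanged but avoids overflow for large logits.

```python
import math

def softmax(logits):
    m = max(logits)                            # shift for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 2) for p in probs])  # [0.66, 0.24, 0.1]
```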
                                                           


TEMPERATURE SCALING

 
 softmax(x_i / T) 
 
 T < 1: sharper (more confident) 
 T > 1: softer (more uniform) 
 T = 1: standard 
 
  
  For logits [2.0, 1.0, 0.1]:  
  T=0.5: [0.86, 0.12, 0.02] sharp  
  T=1.0: [0.66, 0.24, 0.10] normal  
  T=2.0: [0.50, 0.30, 0.19] soft  
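
Temperature scaling only divides the logits by T before the softmax. A minimal sketch follows; the function name and the positivity check are choices made for this example.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / T for v in logits]
    m = max(scaled)                            # shift for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 2) for p in probs])
```

Lower T pushes probability mass toward the largest logit; higher T flattens the distribution toward uniform.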
  
 


Where to Use What

Layer                    | Activation          | Reason
Hidden (MLP)             | ReLU or GELU        | Fast, works well
Hidden (Transformer FFN) | GELU, SwiGLU, GeGLU | Smoother, standard
Binary output            | Sigmoid             | Maps to a probability in (0, 1)
Multi-class output       | Softmax             | Distribution over classes
Regression output        | None (linear)       | Unconstrained range
Attention scores         | Softmax             | Weights sum to 1
Gates (LSTM/GRU)         | Sigmoid             | Control flow in (0, 1)

Debugging Activation Issues

MODEL OUTPUTS ALL SAME VALUE

                                                           
   Symptoms:                                               
     • All predictions identical regardless of input       
     • Loss doesn't decrease                               
                                                           
   Causes:                                                 
     • Dead ReLUs: All neurons stuck at 0
     • Bad initialization: Weights too small/large
     • All activations saturated (sigmoid/tanh at extreme)
                                                           
   Debug steps:                                            
     1. Check activation statistics (mean, std, % zeros)   
     2. Try Leaky ReLU instead of ReLU                     
     3. Check weight initialization (Xavier/He)            
     4. Reduce learning rate if activations are exploding  
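
Step 1 above can be sketched as a small helper that summarizes one layer's activations. The sample values are hypothetical post-ReLU outputs invented for illustration; in practice they would be captured from the model with whatever hook mechanism the framework provides.

```python
import math

def activation_stats(values):
    """Mean, standard deviation, and percentage of exact zeros."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    zeros = sum(1 for v in values if v == 0.0)
    return {"mean": mean, "std": math.sqrt(var), "pct_zeros": 100.0 * zeros / n}

# Hypothetical post-ReLU activations from one layer.
healthy = [0.0, 0.7, 1.2, 0.0, 0.3, 2.1, 0.0, 0.9]
dead = [0.0] * 8

print(activation_stats(healthy))
print(activation_stats(dead))
```

A layer that reports 100% zeros with zero standard deviation across many different inputs is a strong hint of dead ReLUs.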
                                                           


MODEL VERY CONFIDENT BUT WRONG

 
 Symptoms: 
 • Softmax outputs near 0 or 1 
 • High confidence on wrong predictions 
 
 Causes: 
 • Overfitting 
 • Missing regularization 
 • Logits too large going into softmax 
 
 Debug steps: 
 1. Add dropout before final layer 
 2. Use label smoothing 
 3. Add temperature scaling at inference 
 4. Check for data leakage 
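
As an illustration of step 2, here is a minimal sketch of label smoothing in its usual form, where the target becomes (1 - eps) × one-hot + eps / K for K classes. The function names, the value eps=0.1, and the two example predictions are assumptions for this sketch.

```python
import math

def smooth_labels(true_class, num_classes, eps=0.1):
    """Mix the one-hot target with a uniform distribution over classes."""
    off = eps / num_classes
    return [(1.0 - eps) + off if k == true_class else off
            for k in range(num_classes)]

def cross_entropy(target, probs):
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

hard = [1.0, 0.0, 0.0]
smoothed = smooth_labels(0, 3, eps=0.1)

overconfident = [0.999, 0.0005, 0.0005]  # hypothetical model outputs
moderate = [0.9, 0.05, 0.05]

print(cross_entropy(hard, overconfident), cross_entropy(hard, moderate))
print(cross_entropy(smoothed, overconfident), cross_entropy(smoothed, moderate))
```

With hard labels the overconfident prediction gets the lower loss; with smoothed labels the moderate one does, which is why smoothing discourages extreme softmax outputs.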
 


When This Matters

Situation                          | What to know
Understanding transformer FFN      | GELU or SwiGLU, not ReLU
Reading Llama/Mistral architecture | SwiGLU in FFN blocks
Debugging “model isn’t learning”   | Check for dead ReLUs
Model outputs all same             | Activation saturation or dead neurons
Understanding attention            | Softmax creates probability weights
Temperature in generation          | Softmax(logits/T) controls randomness
Choosing for new model             | GELU for transformers, ReLU for CNNs

Production signal

Why this concept matters

Interview: 60% of ML architecture interviews
Production: Understanding model architecture
Performance: GELU/SwiGLU in modern LLMs