
Rate Limiting

Summary

Controlling the rate of requests to a service to prevent overload, ensure fair usage, and protect against abuse

TL;DR

Rate limiting controls how many requests a client can make to a service within a time window. It protects backend resources from overload, ensures fair usage across clients, and defends against abuse like brute-force attacks. The two most common algorithms are token bucket (allows bursts) and sliding window (smooth enforcement).

Visual Overview

Rate Limiting Flow

                                                    
  Client Request → Rate Limiter → Backend Service
                        ↓
              Check: Within limit?
                    /     \
                  YES      NO
                   ↓        ↓
              ALLOWED    REJECTED
              (process)   (429 + Retry-After)
                                                    


WHY RATE LIMIT?

  1. Resource Protection
      • Prevent DB connection exhaustion

  2. Fair Usage
      • One client can't starve others

  3. Cost Control
      • Limit cloud/API spend

  4. Security
      • Stop brute-force, credential stuffing

Core Explanation

What is Rate Limiting?

Real-World Analogy: Think of rate limiting like a nightclub bouncer. The club has a capacity of 100 people. The bouncer lets people in one at a time, but if the club is full, new arrivals must wait outside. Some VIPs (premium API users) might get a higher limit or skip the line.

Rate limiting enforces boundaries on how many requests a client can make:

  • Per user: Each authenticated user gets N requests/minute
  • Per IP: Unauthenticated requests limited by IP address
  • Per API key: Different tiers get different limits
  • Global: Protect the entire service from overload
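These scopes mostly come down to which key the requests are counted under. A minimal sketch, assuming a hypothetical `limiter_key` helper; the key prefixes and the precedence (user, then API key, then IP) are illustrative choices, not a standard:

```python
from typing import Optional


def limiter_key(user_id: Optional[str] = None,
                api_key: Optional[str] = None,
                ip: Optional[str] = None) -> str:
    """Pick the most specific identity available for counting requests.

    The prefixes ("user", "key", "ip", "global") are illustrative only.
    """
    if user_id:
        return f"ratelimit:user:{user_id}"
    if api_key:
        return f"ratelimit:key:{api_key}"
    if ip:
        return f"ratelimit:ip:{ip}"
    # No identity at all: fall back to one shared, service-wide counter.
    return "ratelimit:global"


print(limiter_key(user_id="user_123", ip="203.0.113.7"))  # ratelimit:user:user_123
print(limiter_key(ip="203.0.113.7"))                      # ratelimit:ip:203.0.113.7
```

In practice a global limit is usually checked in addition to the per-client key rather than only as a fallback.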

How It Works

Every rate limiter needs to answer two questions:

  1. Identification: Who is making this request? (user ID, IP, API key)
  2. Counting: How many requests have they made recently?

The implementation varies by algorithm, but the flow is consistent:

Request arrives → Identify client → Check counter → Allow or Reject
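This flow can be sketched with the simplest possible counter, a fixed window held in process memory. It only illustrates the identify, count, and decide steps; the class name and parameters are assumptions, and it is not one of the two algorithms discussed next:

```python
import time
from collections import defaultdict
from typing import Optional


class FixedWindowLimiter:
    """Counts requests per client in fixed, non-overlapping windows."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        # (client_id, window_index) -> requests admitted in that window.
        # Old windows are never evicted here; a real limiter would expire them.
        self.counts = defaultdict(int)

    def handle(self, client_id: str, now: Optional[float] = None) -> int:
        """Return the HTTP status a request would get: 200 or 429."""
        now = time.time() if now is None else now
        window_index = int(now // self.window_seconds)
        counter_key = (client_id, window_index)      # identify client and window
        if self.counts[counter_key] >= self.limit:   # check counter
            return 429                               # reject
        self.counts[counter_key] += 1
        return 200                                   # allow


limiter = FixedWindowLimiter(limit=3, window_seconds=60)
statuses = [limiter.handle("203.0.113.7", now=1000.0) for _ in range(4)]
print(statuses)  # [200, 200, 200, 429]
```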

Two Main Algorithms

Token Bucket vs Sliding Window
TOKEN BUCKET

  Bucket: [●●●●●] Capacity: 5 tokens                
                                                    
  Request 1: [●●●●○] consume 1 → ALLOWED
  Request 2: [●●●○○] consume 1 → ALLOWED
  Request 3: [●●○○○] consume 1 → ALLOWED
  Request 4: [●○○○○] consume 1 → ALLOWED
  Request 5: [○○○○○] consume 1 → ALLOWED
  Request 6: [○○○○○] no tokens → REJECTED

  After 2 seconds (refill 1/sec):
  Bucket: [●●○○○] → can handle 2 more requests
                                                    
   Allows bursts up to capacity                    
   Steady-state rate = refill rate                 


SLIDING WINDOW COUNTER

  Window: 60 seconds, Limit: 10 requests            
                                                    
  Previous window: 8 requests                       
  Current window: 3 requests                        
  Current position: 25% into window                 
                                                    
  Weighted count = 8 × 0.75 + 3 = 9                 
  9 < 10 → ALLOWED
                                                    
   Prevents boundary burst problem                 
   Smooth rate enforcement                         

Token Bucket:

  • Tokens accumulate at a steady rate (refill)
  • Each request consumes one token
  • Burst allowed up to bucket capacity
  • Best for: APIs where legitimate traffic is bursty
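The token bucket diagram (capacity 5, refill 1 token/second) can be replayed with a small in-memory sketch. The `TokenBucket` class below is a hypothetical single-process illustration; time is passed in explicitly so the run is deterministic:

```python
class TokenBucket:
    """In-memory token bucket for one client."""

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate   # tokens added per second
        self.tokens = capacity           # the bucket starts full
        self.last_refill = now

    def allow(self, now: float) -> bool:
        # Refill lazily: add the tokens earned since the last call, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(capacity=5, refill_rate=1.0)
burst = [bucket.allow(now=0.0) for _ in range(6)]
print(burst)   # [True, True, True, True, True, False]
later = [bucket.allow(now=2.0) for _ in range(3)]
print(later)   # [True, True, False]
```

The burst of five is served immediately, and two seconds later exactly two more requests fit, matching the diagram.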

Sliding Window Counter:

  • Counts requests in a sliding time window
  • Weights previous window to prevent boundary bursts
  • Smooth enforcement, no burst allowance
  • Best for: Strict rate enforcement, billing limits
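The weighted count from the diagram (8 × 0.75 + 3 = 9) can be reproduced with a small sketch. The `SlidingWindowCounter` class is a hypothetical single-client, in-memory illustration that keeps only two counters:

```python
from typing import Optional


class SlidingWindowCounter:
    """Sliding window counter for one client, built from two fixed-window counts."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_index: Optional[int] = None
        self.prev_count = 0   # requests admitted in the previous window
        self.curr_count = 0   # requests admitted in the current window

    def allow(self, now: float) -> bool:
        index = int(now // self.window_seconds)
        if index != self.window_index:
            # Roll over. Only an immediately preceding window carries weight.
            is_next = self.window_index is not None and index == self.window_index + 1
            self.prev_count = self.curr_count if is_next else 0
            self.curr_count = 0
            self.window_index = index
        elapsed_fraction = (now % self.window_seconds) / self.window_seconds
        # Weight the previous window by how much of it still overlaps the sliding window.
        estimate = self.prev_count * (1 - elapsed_fraction) + self.curr_count
        if estimate < self.limit:
            self.curr_count += 1
            return True
        return False


limiter = SlidingWindowCounter(limit=10, window_seconds=60)
first = [limiter.allow(now=10.0) for _ in range(8)]    # previous window: 8 requests
second = [limiter.allow(now=61.0) for _ in range(3)]   # current window: 3 requests
at_25_percent = [limiter.allow(now=75.0) for _ in range(2)]
print(at_25_percent)   # [True, False]
```

At 25% into the window the estimate is 9, so one more request is admitted; the next one sees an estimate of 10 and is rejected.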

Real Systems Using Rate Limiting

System            Algorithm Style      Notes                                                     Use Case
GitHub API        Token bucket style   Tiered by authentication; check current docs for limits   Developer API access
Stripe            Sliding window       Different limits for live vs test mode                    Payment processing
Twitter/X API     Tiered windows       Varies significantly by endpoint and tier                 Social media API
AWS API Gateway   Token bucket         Fully configurable per stage                              API management
Cloudflare        Leaky bucket         Rule-based configuration                                  Edge rate limiting

Note: Specific limits change frequently. Always verify current limits in official documentation.

Case Study: API Gateway Rate Limiting

Multi-Tier Rate Limiting
MULTI-TIER API LIMITS (Illustrative Example)

  Tier         Rate Limit     Burst    Pricing  

  Free         Low            Small    Free     
  Developer    Medium         Medium   Basic    
  Pro          High           Large    Premium  
  Enterprise   Custom         Custom   Custom   

  (Exact limits vary by provider—verify current docs) 
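A tier table like this typically becomes a small lookup from API key to limiter parameters. The sketch below is an assumption-heavy illustration: the numbers, the `TierLimit` type, and the key-to-tier mapping are all made up and not taken from any provider:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierLimit:
    refill_rate: float   # steady-state requests per second
    capacity: int        # burst size


# Placeholder numbers for illustration only.
TIER_LIMITS = {
    "free": TierLimit(refill_rate=1, capacity=5),
    "developer": TierLimit(refill_rate=10, capacity=20),
    "pro": TierLimit(refill_rate=100, capacity=200),
}

# Hypothetical mapping; in practice this would come from a database or cache.
API_KEY_TIERS = {"key_abc": "pro", "key_xyz": "developer"}


def limits_for(api_key: str) -> TierLimit:
    """Resolve an API key to its limits; unknown keys get the lowest tier."""
    tier = API_KEY_TIERS.get(api_key, "free")
    return TIER_LIMITS[tier]


print(limits_for("key_abc"))   # TierLimit(refill_rate=100, capacity=200)
print(limits_for("unknown"))   # TierLimit(refill_rate=1, capacity=5)
```

Custom enterprise limits would need a per-key override on top of this table.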

REQUEST FLOW WITH TIERS

  Request + API Key
         ↓
  Lookup tier from API key
         ↓
  Check rate limit for tier
         ↓
   Within limit? → YES → Process
         ↓
         NO
         ↓
   Return 429 + Retry-After header                
   + X-RateLimit-Remaining: 0                     
   + X-RateLimit-Reset: 1640000000                
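The rejection branch can be sketched as a function that builds the status code and headers. `rejection_response` is a hypothetical helper; `Retry-After` is a standard HTTP header, while the `X-RateLimit-*` names are a widely used convention rather than a formal standard:

```python
import math


def rejection_response(limit: int, reset_at: float, now: float) -> tuple[int, dict[str, str]]:
    """Build the status and headers for a rate-limited request.

    reset_at and now are Unix timestamps in seconds.
    """
    retry_after = max(0, math.ceil(reset_at - now))   # whole seconds until the reset
    headers = {
        "Retry-After": str(retry_after),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": str(int(reset_at)),
    }
    return 429, headers


status, headers = rejection_response(limit=100, reset_at=1640000000, now=1639999970.5)
print(status, headers["Retry-After"])   # 429 30
```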
             

When to Use Rate Limiting

✓ Perfect Use Cases

Rate Limiting Use Cases
PUBLIC APIS
Scenario: REST API exposed to external developers
Requirement: Prevent abuse, ensure fair access
Configuration: 1000 requests/hour per API key
Trade-off: Legitimate high-volume users need higher tier

MULTI-TENANT SAAS
Scenario: Shared infrastructure, many customers
Requirement: One customer can't impact others
Configuration: Per-tenant limits based on plan
Trade-off: Complexity in limit management

LOGIN ENDPOINTS
Scenario: Authentication endpoints
Requirement: Prevent credential stuffing attacks
Configuration: 5 attempts/minute per IP
Trade-off: Legitimate users with typos may be blocked

WEBHOOK RECEIVERS
Scenario: Receiving webhooks from external services
Requirement: Prevent flood from misbehaving senders
Configuration: 100 requests/second per source
Trade-off: May drop legitimate spikes

✕ When NOT to Use (or Use Carefully)

When Rate Limiting May Not Fit
INTERNAL SERVICE-TO-SERVICE
Problem: Adds latency to trusted internal calls
Alternative: Circuit breakers, back-pressure
When OK: Still useful for resource protection

REAL-TIME CRITICAL PATHS
Problem: 429s disrupt user experience
Alternative: Graceful degradation, queuing
When OK: As last resort protection

BATCH PROCESSING
Problem: Legitimate jobs need high throughput
Alternative: Separate batch endpoints with higher limits
When OK: With burst capacity for batch jobs

Interview Application

Common Interview Question

Q: “You’re designing an API for a public service. How would you implement rate limiting? What algorithm would you choose?”

Strong Answer:

“I’d implement rate limiting at the API gateway level with a token bucket algorithm. Here’s my approach:

Why Token Bucket:

  1. Allows legitimate bursts: Users often make multiple quick requests (page load, app startup)
  2. Simple state: Just two values per client (tokens, last_refill_time)
  3. Configurable: Capacity controls burst size, refill rate controls steady-state

Implementation:

  • Store state in Redis for distributed rate limiting
  • Key: ratelimit:{user_id} with token count and timestamp
  • Use a Redis Lua script for atomic check-and-decrement (plain MULTI/EXEC cannot branch on the token count it reads)

Configuration:

  • Capacity: 100 tokens (max burst)
  • Refill: 10 tokens/second (10 req/sec steady-state)
  • Different limits per API tier

Response Headers:

  • X-RateLimit-Limit: Maximum requests allowed
  • X-RateLimit-Remaining: Requests left before the limit is hit (tokens left in the bucket)
  • X-RateLimit-Reset: Unix timestamp when the limit resets (for a token bucket, when the bucket is full again)
  • Retry-After: Seconds to wait (on 429)

Edge Cases:

  • Clock skew: Use Redis server time, not client time
  • Distributed: Single Redis cluster for consistency
  • Failover: Fail open (allow) if Redis unavailable—better to risk abuse than block all users”
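The fail-open behaviour from the answer can be sketched as a thin wrapper around any limiter check. `allow_with_fail_open` and its `store_errors` parameter are hypothetical names; with redis-py one would likely pass `redis.exceptions.RedisError` as the error type, since its errors do not derive from the built-in `ConnectionError`:

```python
import logging
from typing import Callable, Tuple, Type

logger = logging.getLogger("ratelimit")


def allow_with_fail_open(
    check: Callable[[str], bool],
    key: str,
    store_errors: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> bool:
    """Run a limiter check; if the backing store is unreachable, allow the request."""
    try:
        return check(key)
    except store_errors:
        # Fail open: an outage of the limiter should not take the whole API down.
        logger.warning("rate limiter unavailable, failing open for %s", key)
        return True


def broken_check(key: str) -> bool:
    raise ConnectionError("store down")   # simulates an unreachable store


def always_reject(key: str) -> bool:
    return False


print(allow_with_fail_open(broken_check, "user_123"))    # True
print(allow_with_fail_open(always_reject, "user_123"))   # False
```

For sensitive endpoints such as login, failing closed may be the safer choice; that decision depends on the threat model.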

Follow-up: How would you handle distributed rate limiting across multiple regions?

“For multi-region, I’d use local rate limiting with global synchronization:

  1. Each region has local Redis for low-latency checks
  2. Async sync between regions (eventual consistency)
  3. Accept that users might get slightly more than limit globally
  4. Alternative: Single global Redis with latency cost

The trade-off is accuracy vs latency. For most APIs, slightly exceeding limits across regions is acceptable.”
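The accuracy cost of eventual consistency can be seen in a toy simulation. Everything here is an assumption made for illustration: the `RegionalCounter` class, the region names, a single window with no reset, and a `sync` function standing in for asynchronous replication:

```python
class RegionalCounter:
    """Per-region view of one client's usage within a single window."""

    def __init__(self, name: str, limit: int):
        self.name = name
        self.limit = limit
        self.local = 0         # requests admitted in this region
        self.remote_seen = 0   # other regions' admits as of the last sync

    def allow(self) -> bool:
        if self.local + self.remote_seen >= self.limit:
            return False
        self.local += 1
        return True


def sync(regions: list) -> None:
    """Stand-in for asynchronous replication: share every region's local count."""
    total = sum(region.local for region in regions)
    for region in regions:
        region.remote_seen = total - region.local


us = RegionalCounter("us-east", limit=10)
eu = RegionalCounter("eu-west", limit=10)

# Before any sync, each region only sees its own traffic.
admitted = sum(us.allow() for _ in range(8)) + sum(eu.allow() for _ in range(8))
print(admitted)                 # 16, above the global limit of 10

sync([us, eu])
print(us.allow(), eu.allow())   # False False
```

The overshoot is bounded by how much traffic arrives between syncs, which is the accuracy given up in exchange for local latency.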

Code Example

Token Bucket Rate Limiter (Python + Redis)

import time
import redis

class TokenBucketRateLimiter:
    """
    Distributed token bucket rate limiter using Redis.

    Allows bursts up to capacity while enforcing average rate.
    """

    def __init__(self, redis_client: redis.Redis, capacity: int, refill_rate: float):
        """
        Args:
            redis_client: Redis connection
            capacity: Maximum tokens (burst size)
            refill_rate: Tokens added per second
        """
        self.redis = redis_client
        self.capacity = capacity
        self.refill_rate = refill_rate

    def is_allowed(self, key: str) -> tuple[bool, dict]:
        """
        Check if request is allowed and consume a token if so.

        Returns:
            (allowed: bool, info: dict with remaining, reset_at)
        """
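        # Note: this is the application server's clock; reading Redis server time
        # (the TIME command) inside the script would avoid skew between app servers.
        now = time.time()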
        bucket_key = f"ratelimit:{key}"

        # Lua script for atomic check-and-update
        # This runs entirely on Redis server (no race conditions)
        lua_script = """
        local key = KEYS[1]
        local capacity = tonumber(ARGV[1])
        local refill_rate = tonumber(ARGV[2])
        local now = tonumber(ARGV[3])

        -- Get current state
        local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
        local tokens = tonumber(bucket[1]) or capacity
        local last_refill = tonumber(bucket[2]) or now

        -- Calculate tokens to add since last refill
        local elapsed = now - last_refill
        local tokens_to_add = elapsed * refill_rate
        tokens = math.min(capacity, tokens + tokens_to_add)

        -- Check if we can consume a token
        local allowed = 0
        if tokens >= 1 then
            tokens = tokens - 1
            allowed = 1
        end

        -- Update state
        redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)  -- HMSET is deprecated since Redis 4.0
        redis.call('EXPIRE', key, 3600)  -- Clean up after 1 hour idle

        return {allowed, tokens, now + (capacity - tokens) / refill_rate}
        """

        result = self.redis.eval(
            lua_script,
            1,  # number of keys
            bucket_key,
            self.capacity,
            self.refill_rate,
            now
        )

        allowed = result[0] == 1
        remaining = int(result[1])
        reset_at = int(result[2])

        return allowed, {
            "remaining": remaining,
            "limit": self.capacity,
            "reset_at": reset_at
        }


# Usage example
if __name__ == "__main__":
    redis_client = redis.Redis(host='localhost', port=6379, db=0)

    # 10 requests/second with burst of 20
    limiter = TokenBucketRateLimiter(
        redis_client=redis_client,
        capacity=20,       # Allow burst of 20 requests
        refill_rate=10     # Refill 10 tokens/second
    )

    user_id = "user_123"

    # Simulate requests
    for i in range(25):
        allowed, info = limiter.is_allowed(user_id)

        if allowed:
            print(f"Request {i+1}: ALLOWED (remaining: {info['remaining']})")
        else:
            # reset_at is when the bucket is full again; a single token is available sooner
            print(f"Request {i+1}: REJECTED (bucket full at: {info['reset_at']})")
            # In real code: return 429 with a Retry-After header

Express.js Middleware Example

const express = require('express');
const rateLimit = require('express-rate-limit');
const RedisStore = require('rate-limit-redis');
const Redis = require('ioredis');

const app = express();

const redis = new Redis({
  host: process.env.REDIS_HOST,
  port: 6379,
});

// Create rate limiter middleware
const apiLimiter = rateLimit({
  store: new RedisStore({
    sendCommand: (...args) => redis.call(...args),
  }),

  // 100 requests per 15 minutes
  windowMs: 15 * 60 * 1000,
  max: 100,

  // Return rate limit info in headers
  standardHeaders: true,
  legacyHeaders: false,

  // Custom key generator (by user ID if authenticated, else IP)
  keyGenerator: (req) => {
    return req.user?.id || req.ip;
  },

  // Custom response when rate limited
  handler: (req, res) => {
    res.status(429).json({
      error: 'Too many requests',
      message: 'Please try again later',
      // resetTime is a Date; report the seconds remaining, not an absolute timestamp
      retryAfter: Math.ceil((req.rateLimit.resetTime - Date.now()) / 1000),
    });
  },
});

// Apply to all API routes
app.use('/api/', apiLimiter);

// Stricter limit for auth endpoints (no store is configured here, so counts are kept in memory per process)
const authLimiter = rateLimit({
  windowMs: 60 * 1000,  // 1 minute
  max: 5,               // 5 attempts per minute
  message: 'Too many login attempts, please try again later',
});

app.use('/api/auth/login', authLimiter);

Quick Self-Check

  • Can explain rate limiting in 60 seconds?
  • Understand difference between token bucket and sliding window?
  • Know what HTTP 429 means and what headers to return?
  • Can implement distributed rate limiting with Redis?
  • Understand why sliding window prevents boundary burst?
  • Know when to use rate limiting vs circuit breaker?

Production signal

Why this concept matters

Interview: 80% of API design interviews
Production: All production APIs
Performance: Resource protection
Scale: Fair usage enforcement