Rate Limiting | Concepts

TL;DR

Rate limiting controls how many requests a client can make to a service within a time window. It protects backend resources from overload, ensures fair usage across clients, and defends against abuse like brute-force attacks. The two most common algorithms are token bucket (allows bursts) and sliding window (smooth enforcement).

Visual Overview

Rate Limiting Flow

┌────────────────────────────────────────────────────┐
│                                                    │
│  Client Request → Rate Limiter → Backend Service   │
│                       │                            │
│                       ↓                            │
│              Check: Within limit?                  │
│                    /     \                         │
│                  YES      NO                       │
│                   ↓        ↓                       │
│              ALLOWED    REJECTED                   │
│              (process)   (429 + Retry-After)       │
│                                                    │
└────────────────────────────────────────────────────┘

WHY RATE LIMIT?
┌────────────────────────────────────────────────────┐
│  1. Resource Protection                            │
│     └─ Prevent DB connection exhaustion            │
│                                                    │
│  2. Fair Usage                                     │
│     └─ One client can't starve others              │
│                                                    │
│  3. Cost Control                                   │
│     └─ Limit cloud/API spend                       │
│                                                    │
│  4. Security                                       │
│     └─ Stop brute-force, credential stuffing       │
└────────────────────────────────────────────────────┘

Core Explanation

What is Rate Limiting?

Real-World Analogy: Think of rate limiting like a nightclub bouncer who admits at most 100 people per hour. Once that quota is used up, new arrivals must wait outside until the next hour begins. Some VIPs (premium API users) might get a higher limit or skip the line.

Rate limiting enforces boundaries on how many requests a client can make:

Per user: Each authenticated user gets N requests/minute
Per IP: Unauthenticated requests limited by IP address
Per API key: Different tiers get different limits
Global: Protect the entire service from overload

How It Works

Every rate limiter needs to answer two questions:

Identification: Who is making this request? (user ID, IP, API key)
Counting: How many requests have they made recently?

The implementation varies by algorithm, but the flow is consistent:

Request arrives → Identify client → Check counter → Allow or Reject
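
The flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the `identify_client` and `handle_request` helpers, the `LIMIT` value, and the in-memory counter are all hypothetical names introduced for the example, and the counter never resets (windowing is what the algorithms in the next section add).

```python
from collections import defaultdict
from typing import Optional

LIMIT = 3  # hypothetical: max requests per client in this demo

# In-memory counters keyed by client identity (assumption: single process, no expiry)
counters = defaultdict(int)


def identify_client(user_id: Optional[str], api_key: Optional[str], ip: str) -> str:
    """Pick the most specific identity available: user, then API key, then IP."""
    if user_id:
        return f"user:{user_id}"
    if api_key:
        return f"key:{api_key}"
    return f"ip:{ip}"


def handle_request(user_id: Optional[str], api_key: Optional[str], ip: str) -> int:
    """Return an HTTP status: 200 if allowed, 429 if over the limit."""
    client = identify_client(user_id, api_key, ip)  # Identify client
    if counters[client] >= LIMIT:                   # Check counter
        return 429                                  # Reject
    counters[client] += 1
    return 200                                      # Allow
```

Because the counter is keyed by identity, a request with an API key and an anonymous request from the same IP are counted separately.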

Two Main Algorithms

Token Bucket vs Sliding Window

TOKEN BUCKET
┌────────────────────────────────────────────────────┐
│  Bucket: [●●●●●] Capacity: 5 tokens                │
│                                                    │
│  Request 1: [●●●●○] consume 1 → ALLOWED            │
│  Request 2: [●●●○○] consume 1 → ALLOWED            │
│  Request 3: [●●○○○] consume 1 → ALLOWED            │
│  Request 4: [●○○○○] consume 1 → ALLOWED            │
│  Request 5: [○○○○○] consume 1 → ALLOWED            │
│  Request 6: [○○○○○] no tokens → REJECTED           │
│                                                    │
│  After 2 seconds (refill 1/sec):                   │
│  Bucket: [●●○○○] → can handle 2 more requests      │
│                                                    │
│  ✓ Allows bursts up to capacity                    │
│  ✓ Steady-state rate = refill rate                 │
└────────────────────────────────────────────────────┘

SLIDING WINDOW COUNTER
┌────────────────────────────────────────────────────┐
│  Window: 60 seconds, Limit: 10 requests            │
│                                                    │
│  Previous window: 8 requests                       │
│  Current window: 3 requests                        │
│  Current position: 25% into window                 │
│                                                    │
│  Weighted count = 8 × 0.75 + 3 = 9                 │
│  9 < 10 → ALLOWED                                  │
│                                                    │
│  ✓ Prevents boundary burst problem                 │
│  ✓ Smooth rate enforcement                         │
└────────────────────────────────────────────────────┘

Token Bucket:

Tokens accumulate at a steady rate (refill)
Each request consumes one token
Burst allowed up to bucket capacity
Best for: APIs where legitimate traffic is bursty
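
To make the token bucket behavior concrete, here is a minimal single-process sketch. It assumes in-memory state and a lazy refill computed from elapsed time; the class name `InMemoryTokenBucket` and the injectable `clock` parameter are illustrative choices, not part of any library. The usage at the bottom replays the diagram above (capacity 5, refill 1 token/sec).

```python
import time
from typing import Callable


class InMemoryTokenBucket:
    """Single-process token bucket; state is (tokens, last_refill)."""

    def __init__(self, capacity: int, refill_rate: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate   # tokens added per second
        self.clock = clock               # injectable so time can be controlled
        self.tokens = float(capacity)    # start with a full bucket
        self.last_refill = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill lazily from elapsed time, never exceeding capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1             # each request consumes one token
            return True
        return False


# Replay the diagram with a fake clock
now = [0.0]
bucket = InMemoryTokenBucket(capacity=5, refill_rate=1.0, clock=lambda: now[0])
burst = [bucket.allow() for _ in range(6)]   # requests 1-5 allowed, request 6 rejected
now[0] += 2.0                                # two seconds pass, two tokens refill
later = [bucket.allow() for _ in range(3)]   # two allowed, third rejected
```

Refilling lazily on each call avoids a background timer: the bucket only needs the token count and the time of the last refill.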

Sliding Window Counter:

Counts requests in a sliding time window
Weights previous window to prevent boundary bursts
Smooth enforcement, no burst allowance
Best for: Strict rate enforcement, billing limits
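
The weighted count from the diagram can be sketched as follows. This is an illustrative single-process version; the class name `SlidingWindowCounter` and the injectable `clock` are assumptions made for the example. It keeps two counters (previous and current fixed window) and weights the previous one by how much of it still overlaps the sliding window.

```python
import time
from typing import Callable


class SlidingWindowCounter:
    """Approximate sliding window using two fixed-window counters."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.window_start = clock()
        self.prev = 0   # requests counted in the previous window
        self.curr = 0   # requests counted in the current window

    def allow(self) -> bool:
        now = self.clock()
        elapsed_windows = int((now - self.window_start) // self.window)
        if elapsed_windows >= 1:
            # Roll over: after exactly one window, current becomes previous;
            # after a longer gap, both counters have expired
            self.prev = self.curr if elapsed_windows == 1 else 0
            self.curr = 0
            self.window_start += elapsed_windows * self.window
        position = (now - self.window_start) / self.window  # 0.0 to 1.0
        estimate = self.prev * (1.0 - position) + self.curr
        if estimate < self.limit:
            self.curr += 1
            return True
        return False


# Reproduce the diagram: 8 requests in the previous window, then 25% into the next
now = [0.0]
limiter = SlidingWindowCounter(limit=10, window_seconds=60, clock=lambda: now[0])
first = [limiter.allow() for _ in range(8)]
now[0] = 75.0                                  # 15s into the second 60s window
second = [limiter.allow() for _ in range(5)]   # estimates: 6, 7, 8, 9, then 10
```

With 3 requests already counted in the current window, the estimate is 8 × 0.75 + 3 = 9, so the next request is allowed; the one after it sees an estimate of 10 and is rejected. A plain fixed-window counter would have reset to zero at the boundary and allowed a full 10 more requests immediately.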

Real Systems Using Rate Limiting

System	Algorithm Style	Notes	Use Case
GitHub API	Fixed window style	Tiered by authentication; check current docs for limits	Developer API access
Stripe	Token bucket	Different limits for live vs test mode	Payment processing
Twitter/X API	Tiered windows	Varies significantly by endpoint and tier	Social media API
AWS API Gateway	Token bucket	Fully configurable per stage	API management
Cloudflare	Sliding window counter	Rule-based configuration	Edge rate limiting

Note: Specific limits change frequently. Always verify current limits in official documentation.

Case Study: API Gateway Rate Limiting

Multi-Tier Rate Limiting

MULTI-TIER API LIMITS (Illustrative Example)
┌────────────────────────────────────────────────────┐
│  Tier        │ Rate Limit    │ Burst   │ Pricing   │
│──────────────┼───────────────┼─────────┼────────── │
│  Free        │ Low           │ Small   │ Free      │
│  Developer   │ Medium        │ Medium  │ Basic     │
│  Pro         │ High          │ Large   │ Premium   │
│  Enterprise  │ Custom        │ Custom  │ Custom    │
└────────────────────────────────────────────────────┘
(Exact limits vary by provider; verify current docs.)

REQUEST FLOW WITH TIERS
┌────────────────────────────────────────────────────┐
│  Request + API Key                                 │
│         ↓                                          │
│  Lookup tier from API key                          │
│         ↓                                          │
│  Check rate limit for tier                         │
│         ↓                                          │
│  ┌─────────────────────────────────────┐           │
│  │ Within limit? ──YES──► Process      │           │
│  │       │                             │           │
│  │      NO                             │           │
│  │       ↓                             │           │
│  │ Return 429 + Retry-After header     │           │
│  │ + X-RateLimit-Remaining: 0          │           │
│  │ + X-RateLimit-Reset: 1640000000     │           │
│  └─────────────────────────────────────┘           │
└────────────────────────────────────────────────────┘
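
The rejection branch of this flow can be sketched as a small helper that builds the status code and headers. The function name `build_response` and its parameters are hypothetical; the header names are the ones shown in the diagram, plus `X-RateLimit-Limit`. Timestamps are assumed to be Unix seconds.

```python
import math


def build_response(allowed: bool, limit: int, remaining: int,
                   reset_at: float, now: float) -> tuple[int, dict[str, str]]:
    """Return (status, headers) for a rate-limit decision.

    reset_at and now are Unix timestamps in seconds.
    """
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_at)),
    }
    if allowed:
        return 200, headers
    # Rejected: tell the client how many whole seconds to wait
    headers["Retry-After"] = str(max(0, math.ceil(reset_at - now)))
    return 429, headers


status, headers = build_response(False, limit=100, remaining=0,
                                 reset_at=1640000000, now=1639999970.5)
```

Rounding `Retry-After` up rather than down avoids telling a client to retry a fraction of a second before the limit actually resets.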

When to Use Rate Limiting

✓ Perfect Use Cases

Rate Limiting Use Cases

PUBLIC APIS
Scenario: REST API exposed to external developers
Requirement: Prevent abuse, ensure fair access
Configuration: 1000 requests/hour per API key
Trade-off: Legitimate high-volume users need higher tier

MULTI-TENANT SAAS
Scenario: Shared infrastructure, many customers
Requirement: One customer can't impact others
Configuration: Per-tenant limits based on plan
Trade-off: Complexity in limit management

LOGIN ENDPOINTS
Scenario: Authentication endpoints
Requirement: Prevent credential stuffing attacks
Configuration: 5 attempts/minute per IP
Trade-off: Legitimate users with typos may be blocked

WEBHOOK RECEIVERS
Scenario: Receiving webhooks from external services
Requirement: Prevent flood from misbehaving senders
Configuration: 100 requests/second per source
Trade-off: May drop legitimate spikes

✕ When NOT to Use (or Use Carefully)

When Rate Limiting May Not Fit

INTERNAL SERVICE-TO-SERVICE
Problem: Adds latency to trusted internal calls
Alternative: Circuit breakers, back-pressure
When OK: Still useful for resource protection

REAL-TIME CRITICAL PATHS
Problem: 429s disrupt user experience
Alternative: Graceful degradation, queuing
When OK: As last resort protection

BATCH PROCESSING
Problem: Legitimate jobs need high throughput
Alternative: Separate batch endpoints with higher limits
When OK: With burst capacity for batch jobs

Interview Application

Common Interview Question

Q: “You’re designing an API for a public service. How would you implement rate limiting? What algorithm would you choose?”

Strong Answer:

“I’d implement rate limiting at the API gateway level with a token bucket algorithm. Here’s my approach:

Why Token Bucket:

Allows legitimate bursts: Users often make multiple quick requests (page load, app startup)

Simple state: Just two values per client (tokens, last_refill_time)

Configurable: Capacity controls burst size, refill rate controls steady-state

Implementation:

Store state in Redis for distributed rate limiting

Key: ratelimit:{user_id} with token count and timestamp

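Use a Redis Lua script for atomic check-and-decrement (plain MULTI/EXEC cannot branch on a value read inside the transaction, so it would need WATCH plus retries)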

Configuration:

Capacity: 100 tokens (max burst)

Refill: 10 tokens/second (10 req/sec steady-state)

Different limits per API tier

Response Headers:

X-RateLimit-Limit: Maximum requests allowed

X-RateLimit-Remaining: Requests left in window

X-RateLimit-Reset: Unix timestamp when limit resets

Retry-After: Seconds to wait (on 429)

Edge Cases:

Clock skew: Use Redis server time (the TIME command), not each app server's local clock

Distributed: Single Redis cluster for consistency

Failover: Fail open (allow) if Redis unavailable—better to risk abuse than block all users”
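
The fail-open choice in the answer above might look like the following wrapper. It is a sketch under assumptions: `check_fail_open` is a hypothetical helper, and it catches Python's built-in `ConnectionError` and `TimeoutError` to stay self-contained. With redis-py you would more likely catch `redis.exceptions.RedisError`, since its connection errors do not derive from the built-in ones.

```python
import logging
from typing import Callable

logger = logging.getLogger("ratelimit")


def check_fail_open(check: Callable[[str], bool], key: str) -> bool:
    """Run a rate-limit check, allowing the request if the backing store fails."""
    try:
        return check(key)
    except (ConnectionError, TimeoutError) as exc:
        # Store is unreachable: fail open so users are not blocked by an outage
        logger.warning("rate limiter unavailable, failing open: %s", exc)
        return True


def broken_store(key: str) -> bool:
    """Stand-in for a limiter whose backing store is down."""
    raise ConnectionError("store unreachable")


def over_limit(key: str) -> bool:
    """Stand-in for a limiter that reports the client is over its limit."""
    return False
```

Here `check_fail_open(broken_store, "user:1")` allows the request, while `check_fail_open(over_limit, "user:1")` still rejects it. Logging each fail-open event matters in practice, because an outage of the limiter is otherwise invisible.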

Follow-up: How would you handle distributed rate limiting across multiple regions?

“For multi-region, I’d use local rate limiting with global synchronization:

Each region has local Redis for low-latency checks

Async sync between regions (eventual consistency)

Accept that users might get slightly more than the limit globally

Alternative: Single global Redis with latency cost

The trade-off is accuracy vs latency. For most APIs, slightly exceeding limits across regions is acceptable.”

Code Example

Token Bucket Rate Limiter (Python + Redis)

import time
import redis

class TokenBucketRateLimiter:
    """
    Distributed token bucket rate limiter using Redis.

    Allows bursts up to capacity while enforcing average rate.
    """

    def __init__(self, redis_client: redis.Redis, capacity: int, refill_rate: float):
        """
        Args:
            redis_client: Redis connection
            capacity: Maximum tokens (burst size)
            refill_rate: Tokens added per second
        """
        self.redis = redis_client
        self.capacity = capacity
        self.refill_rate = refill_rate

    def is_allowed(self, key: str) -> tuple[bool, dict]:
        # ... omitted: keep concept snippets short
        ...


# (Setup of `limiter` and `user_id` is part of the omitted code.)

# Simulate requests
for i in range(25):
    allowed, info = limiter.is_allowed(user_id)

    if allowed:
        print(f"Request {i+1}: ALLOWED (remaining: {info['remaining']})")
    else:
        print(f"Request {i+1}: REJECTED (retry at: {info['reset_at']})")
        # In real code: return 429 with Retry-After header

Express.js Middleware Example

const express = require('express');
const rateLimit = require('express-rate-limit');
const RedisStore = require('rate-limit-redis');
const Redis = require('ioredis');

const app = express();

const redis = new Redis({
  host: process.env.REDIS_HOST,
  port: 6379,
});

// Create rate limiter middleware
const apiLimiter = rateLimit({
  store: new RedisStore({
    sendCommand: (...args) => redis.call(...args),
  }),

  // 100 requests per 15 minutes
  windowMs: 15 * 60 * 1000,
  max: 100,

  // Return rate limit info in headers
  standardHeaders: true,
  legacyHeaders: false,
  // ... omitted: keep concept snippets short
});

app.use('/api/', apiLimiter);

// Stricter limit for auth endpoints
const authLimiter = rateLimit({
  windowMs: 60 * 1000,  // 1 minute
  max: 5,               // 5 attempts per minute
  message: 'Too many login attempts, please try again later',
});

app.use('/api/auth/login', authLimiter);

See It In Action:

Rate Limiting Explainer - Visual walkthrough of token bucket vs sliding window

Related Concepts:

Token Bucket - Burst-tolerant algorithm
Sliding Window - Smooth enforcement algorithm
Load Balancing - Distributing traffic across servers
Circuit Breaker - Failing fast when downstream is unhealthy

Quick Self-Check

Can explain rate limiting in 60 seconds?
Understand difference between token bucket and sliding window?
Know what HTTP 429 means and what headers to return?
Can implement distributed rate limiting with Redis?
Understand why sliding window prevents boundary burst?
Know when to use rate limiting vs circuit breaker?