📚 What You'll Learn
- Why rate limiting is essential for protecting APIs
- How the 3 main algorithms work (with simple analogies!)
- Which algorithm to pick for your use case
- How to implement rate limiting with Redis & Python
- Production considerations for scale and reliability
What is Rate Limiting?
Rate limiting is a technique to control how many requests a user (or client) can make to your API within a specific time window. It's like a traffic cop for your servers — it makes sure no single user hogs all the resources.
💡 Think of it like this...
Imagine a popular coffee shop that only has 2 baristas. If 100 customers rush in at once, chaos ensues! Rate limiting is like having a queue system — maybe each customer can only order 3 drinks per hour. This keeps things fair and prevents the shop from being overwhelmed.
Why Do We Need It?
- Prevent abuse: Stop malicious users or bots from overloading your servers
- Ensure fairness: One user shouldn't consume all the resources meant for everyone
- Protect revenue: If you have paid tiers, rate limiting enforces usage quotas
- Manage costs: Every API call costs money (compute, bandwidth, etc.)
- Improve reliability: Prevents cascading failures during traffic spikes
Common Use Cases
Rate limiting is used across many different scenarios:
API Rate Limiting
Control the volume of client requests and ensure fair access (Twitter, GitHub, Google Maps)
Web Server Protection
Defend against DoS attacks, prevent server overload during traffic spikes
Database Rate Limiting
Limit queries per user to prevent resource exhaustion
Login Rate Limiting
Stop brute-force attacks by limiting login attempts
Types of Rate Limiting
There are several ways to identify and limit traffic. Each has its advantages and limitations:
1. IP-Based Rate Limiting
Limits the number of requests from a single IP address within a specified time period.
| ✅ Advantages | ⚠️ Limitations |
|---|---|
| Simple to implement at network level | VPNs/proxies can spoof IPs |
| Effective against single-source attacks | Shared IPs may block legitimate users |
2. Server-Based Rate Limiting
Controls the number of requests to a specific server within a time period.
3. Geography-Based Rate Limiting
Restricts traffic based on the geographic location of the IP address. Useful for blocking or throttling regions that are frequent sources of abuse.
4. User/API Key-Based Rate Limiting
The most granular approach — limits requests per authenticated user or API key. Essential for SaaS and paid API tiers.
💡 Best Practice
Combine multiple types! Use IP-based as a first line of defense, then apply user-based limits for authenticated traffic.
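To sketch what "combining types" looks like in code, here is a minimal layered check. The function and method names (`check_request`, `allow`) are illustrative, not from any particular library; any limiter object exposing an `allow(key) -> bool` method would fit.

```python
def check_request(ip, api_key, ip_limiter, user_limiter):
    """Layered rate limiting: IP limit first, then the per-user limit
    for authenticated traffic. Limiter objects are assumed to expose
    an allow(key) -> bool method."""
    if not ip_limiter.allow(ip):
        return False  # first line of defense: blanket per-IP limit
    if api_key is not None and not user_limiter.allow(api_key):
        return False  # stricter, per-user quota for authenticated calls
    return True
```

Unauthenticated requests (no API key) only face the IP limit, which matches the fallback role described above.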
Requirements & Constraints
What We Need to Build
- Per-user limits: Limit each API key to X requests per minute/hour
- Per-IP limits: As a fallback for unauthenticated requests
- Low latency: The rate check should add minimal overhead (target: <2ms)
- High availability: Rate limiting shouldn't become a single point of failure
- Real-time policy updates: Change limits without restarting services
⚠️ Key Constraints to Consider
- High throughput: Edge servers may handle 100K+ requests/second
- Latency sensitivity: Every millisecond counts in the request path
- Memory limits: Can't store unlimited counters per node
- Distributed nature: Multiple servers need a consistent view
High-Level Design
At a high level, when a request comes in, your API Gateway asks the Rate Limiter: "Should I allow this request?" The Rate Limiter checks a shared datastore (usually Redis) and returns a yes/no decision.
Key Components
- API Gateway: entry point that asks the rate limiter about every incoming request
- Rate Limiter: runs the chosen algorithm and returns the allow/deny decision
- Redis Cluster: shared datastore holding the counters
- Config Service: pushes policy updates in real time, without service restarts
- Metrics: tracks request rates, rejections, and limiter latency
Rate Limiting Algorithms
There are three main algorithms for rate limiting. Each has its own trade-offs.
Token Bucket
Imagine a bucket that holds tokens. Every second, new tokens are added. Each request costs one token. If empty — request denied!
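To make the mechanics concrete, here is a minimal in-memory sketch (class and method names are illustrative; the Redis-backed version later in this article is what you'd run in production):

```python
import time

class TokenBucket:
    """Minimal in-memory token bucket, for illustration only."""
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # bucket size = max allowed burst
        self.tokens = capacity      # start full
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Note the key property: a full bucket lets a client burst up to `capacity` requests at once, then settle to `rate` requests per second.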
Sliding Window
Instead of a fixed window, imagine a window that "slides" with time: at 00:30, you count every request made between 23:30 and 00:30.
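A common way to implement this is a "sliding window log": keep a timestamp per request and discard entries older than the window. A minimal in-memory sketch (names are illustrative):

```python
import time
from collections import deque

class SlidingWindowLog:
    """Minimal in-memory sliding window log, for illustration only."""
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps = deque()   # one entry per accepted request

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop timestamps that have slid out of the window
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_requests:
            self.timestamps.append(now)
            return True
        return False
```

This is the most accurate approach, but it costs one stored timestamp per request, which is why the Flask example later uses a Redis sorted set for the same idea.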
Leaky Bucket
Picture a bucket with a small hole at the bottom. Water (requests) flows in and leaks out at a constant rate.
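The same picture as a minimal in-memory "meter" sketch (illustrative names; a real deployment would often implement this as a queue in front of the protected service):

```python
import time

class LeakyBucket:
    """Minimal in-memory leaky bucket meter, for illustration only."""
    def __init__(self, leak_rate, capacity):
        self.leak_rate = leak_rate  # units drained per second (the hole)
        self.capacity = capacity    # bucket size
        self.level = 0.0            # current water level
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Drain the bucket for the time elapsed since the last check
        self.level = max(0.0, self.level - (now - self.last) * self.leak_rate)
        self.last = now
        if self.level + 1 <= self.capacity:
            self.level += 1         # this request adds one unit of "water"
            return True
        return False
```

Unlike the token bucket, the leaky bucket never lets output exceed `leak_rate`, which is exactly why it suits fragile downstream services.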
Which Algorithm Should You Use?
| Scenario | Best Algorithm | Why |
|---|---|---|
| General API rate limiting | Token Bucket | Allows healthy bursts |
| Strict requests-per-minute | Sliding Window | Most accurate |
| Protecting fragile service | Leaky Bucket | Smooths traffic |
| Mix of bursty/steady clients | Token Bucket | Burst-friendly |
Implementation with Redis
Redis is the go-to choice for rate limiting because it's blazing fast, supports atomic operations, and has built-in key expiration.
Data Model
For each user, we store two fields: tokens (tokens remaining) and ts (timestamp of the last refill).
Key format: rl:user:{user_id}
Lua Script (Atomic Token Bucket)
```lua
-- Token Bucket Rate Limiter (Redis + Lua)
local key = KEYS[1]
local rate = tonumber(ARGV[1])      -- tokens added per second
local capacity = tonumber(ARGV[2])  -- max tokens in bucket
local now = tonumber(ARGV[3])       -- current timestamp
local requested = tonumber(ARGV[4]) -- tokens needed

local data = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(data[1]) or capacity
local last_ts = tonumber(data[2]) or now

-- Refill based on elapsed time, capped at capacity
local elapsed = math.max(0, now - last_ts)
local refill = math.min(capacity, tokens + elapsed * rate)

if refill < requested then
  return 0 -- REJECT
else
  local new_tokens = refill - requested
  redis.call('HMSET', key, 'tokens', new_tokens, 'ts', now)
  redis.call('EXPIRE', key, 3600) -- clean up idle buckets after an hour
  return 1 -- ALLOW
end
```
Python Implementation
Now let's implement rate limiting in Python with multiple approaches.
1. Token Bucket (Python + Redis)
```python
import time

import redis


class TokenBucketRateLimiter:
    def __init__(self, redis_client, rate, capacity):
        self.redis = redis_client
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # max tokens (burst size)
        # register_script uploads the script once; later calls use its hash
        self.lua_script = self.redis.register_script("""
            local key = KEYS[1]
            local rate = tonumber(ARGV[1])
            local capacity = tonumber(ARGV[2])
            local now = tonumber(ARGV[3])
            local requested = tonumber(ARGV[4])

            local data = redis.call('HMGET', key, 'tokens', 'ts')
            local tokens = tonumber(data[1]) or capacity
            local last_ts = tonumber(data[2]) or now

            local elapsed = math.max(0, now - last_ts)
            local refill = math.min(capacity, tokens + elapsed * rate)

            if refill < requested then
                return {0, refill}
            else
                redis.call('HMSET', key, 'tokens', refill - requested, 'ts', now)
                redis.call('EXPIRE', key, 3600)
                return {1, refill - requested}
            end
        """)

    def allow_request(self, user_id, tokens_requested=1):
        key = f"rl:user:{user_id}"
        result = self.lua_script(
            keys=[key],
            args=[self.rate, self.capacity, time.time(), tokens_requested],
        )
        # Note: Redis truncates Lua numbers to integers in the reply,
        # so the remaining-token count loses its fractional part
        return bool(result[0]), float(result[1])


# Usage
r = redis.Redis(host='localhost', port=6379)
limiter = TokenBucketRateLimiter(r, rate=10, capacity=100)
allowed, remaining = limiter.allow_request("user_123")
print(f"Allowed: {allowed}, Remaining: {remaining}")
```
2. Flask Decorator Example
```python
import time
from functools import wraps

import redis
from flask import Flask, jsonify, request

app = Flask(__name__)
redis_client = redis.Redis(host='localhost', port=6379)


def rate_limit(max_requests, window_seconds):
    """Sliding window log using a Redis sorted set keyed by client IP."""
    def decorator(f):
        @wraps(f)
        def wrapped(*args, **kwargs):
            key = f"rl:{f.__name__}:{request.remote_addr}"
            now = time.time()
            pipe = redis_client.pipeline()
            pipe.zremrangebyscore(key, 0, now - window_seconds)  # drop expired entries
            pipe.zcard(key)                                      # count what's left
            pipe.zadd(key, {str(now): now})                      # record this request
            pipe.expire(key, window_seconds)
            results = pipe.execute()
            # results[1] is the count before this request was added; note that
            # rejected requests still land in the set and count toward the window
            if results[1] >= max_requests:
                return jsonify({"error": "Rate limit exceeded"}), 429
            return f(*args, **kwargs)
        return wrapped
    return decorator


@app.route('/api/data')
@rate_limit(max_requests=10, window_seconds=60)
def get_data():
    return jsonify({"data": "success"})
```
Scaling & Performance
Edge Caching
Maintain local counters at each edge node. These sync with Redis periodically. Trade-off: slight overshoot during sync delays.
Sharding
Use consistent hashing to distribute keys across multiple Redis nodes.
Batching
Batch counter updates (10-50 requests at once) to reduce Redis write QPS.
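A sketch of the edge-caching and batching ideas combined: each node accumulates counter deltas locally and flushes them to the central store in batches. All names here are illustrative, and `flush_fn` stands in for the actual Redis sync call (e.g. a pipelined INCRBY per key).

```python
class EdgeCounter:
    """Per-node counter that batches updates to a central store (sketch)."""
    def __init__(self, flush_fn, flush_every=10):
        self.pending = {}         # key -> delta accumulated since last flush
        self.seen = 0             # requests recorded since last flush
        self.flush_fn = flush_fn  # stands in for the Redis sync call
        self.flush_every = flush_every

    def record(self, key):
        self.pending[key] = self.pending.get(key, 0) + 1
        self.seen += 1
        if self.seen >= self.flush_every:
            self.flush_fn(self.pending)  # one round trip for many requests
            self.pending, self.seen = {}, 0
```

The trade-off noted above is visible here: between flushes, other nodes don't see this node's traffic, so clients can briefly overshoot their global limit.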
Availability & Fault Tolerance
| Strategy | When Redis is Down | Trade-off |
|---|---|---|
| Fail-Open | Allow all requests | Risk of abuse |
| Fail-Closed | Reject all requests | Service degradation |
| Graceful Degradation | Relax limits (10x) | Balanced approach |
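The fail-open and fail-closed strategies can be expressed as a thin wrapper around the rate check. A sketch, assuming `limiter_fn` returns an `(allowed, remaining)` tuple like `allow_request` above; the generic built-in `ConnectionError` stands in for the Redis client's own connection exception (`redis.exceptions.ConnectionError` in redis-py):

```python
def check_with_fallback(limiter_fn, user_id, fail_open=True):
    """Wrap a Redis-backed rate check so a Redis outage doesn't take the API down."""
    try:
        return limiter_fn(user_id)
    except ConnectionError:  # in real code: redis.exceptions.ConnectionError
        # Fail-open: let traffic through; fail-closed: reject everything
        return (fail_open, 0.0)
```

Graceful degradation would replace the fixed `fail_open` flag with a fallback limiter, for example a local in-memory bucket with 10x the normal limit.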
Key Trade-offs
⚖️ Design Trade-offs
- Accuracy vs Latency: Local caching = fast but imprecise
- Simplicity vs Precision: Fixed window = simple; Sliding = more memory
- Consistency vs Availability: Strict counting needs sync
Monitoring & Metrics
- Request rate by key: Who's hitting your API and how much?
- Rejection rate: How often are limits being hit?
- Redis latency: Is the rate limiter adding too much overhead?
- Near-limit alerts: Warn when a customer is at 80% of their quota
Rate Limiting in Different Layers
| Layer | Where It Happens | Best For |
|---|---|---|
| 🌐 API Gateway | Before services (Kong, AWS) | Global limits, DDoS protection |
| ⚙️ Application | Inside app code | Fine-grained business logic |
| 🔧 Service | Within microservices | Protecting downstream |
| 🗄️ Database | Before DB queries | Preventing query floods |
⚠️ Never rely solely on client-side rate limiting! Always enforce on the server side.
Real-World Examples
| Company | Rate Limiting Approach |
|---|---|
| Twitter/X API | 15-900 requests per 15-min window; OAuth token-based |
| GitHub API | 5,000/hour authenticated; 60/hour unauthenticated |
| Stripe API | 100 reads/sec; 25 writes/sec with token bucket |
| AWS API Gateway | Token bucket with configurable rate & burst |
| Cloudflare | Multi-layer edge limiting; geography-based rules |
📚 Cloud Provider Rate Limiters
- AWS: API Gateway Throttling, WAF Rate Rules
- Azure: API Management Rate Limiting & Quota Policies
- GCP: API Gateway Quota Policies, Cloud Armor
Conclusion
Key Takeaway
For most use cases, a Redis-based token bucket using Lua scripts provides the best balance of accuracy, latency, and simplicity. For extreme scale, combine edge caching with centralized storage.
Rate limiting is a critical piece of any production API. It protects your services, ensures fair usage, and gives you control over your system's resources.