🚦 Rate Limiting — System Design

Learn how to design a production-ready rate limiting system that protects your APIs from abuse while keeping things smooth for real users. We'll break down the key algorithms and show you how to implement them with Redis.

20 min read · Intermediate · Redis, Python, API Gateway

📚 What You'll Learn

  • Why rate limiting is essential for protecting APIs
  • How the 3 main algorithms work (with simple analogies!)
  • Which algorithm to pick for your use case
  • How to implement rate limiting with Redis & Python
  • Production considerations for scale and reliability

What is Rate Limiting?

Rate limiting is a technique to control how many requests a user (or client) can make to your API within a specific time window. It's like a traffic cop for your servers — it makes sure no single user hogs all the resources.

💡 Think of it like this...

Imagine a popular coffee shop that only has 2 baristas. If 100 customers rush in at once, chaos ensues! Rate limiting is like having a queue system — maybe each customer can only order 3 drinks per hour. This keeps things fair and prevents the shop from being overwhelmed.

Why Do We Need It?

  • Prevent abuse: Stop malicious users or bots from overloading your servers
  • Ensure fairness: One user shouldn't consume all the resources meant for everyone
  • Protect revenue: If you have paid tiers, rate limiting enforces usage quotas
  • Manage costs: Every API call costs money (compute, bandwidth, etc.)
  • Improve reliability: Prevents cascading failures during traffic spikes

Common Use Cases

Rate limiting is used across many different scenarios:

  • 🔌 API Rate Limiting — control the volume of client requests and ensure fair access (Twitter, GitHub, Google Maps)
  • 🌐 Web Server Protection — defend against DoS attacks and prevent server overload during traffic spikes
  • 🗄️ Database Rate Limiting — limit queries per user to prevent resource exhaustion
  • 🔐 Login Rate Limiting — stop brute-force attacks by limiting login attempts

Types of Rate Limiting

There are several ways to identify and limit traffic. Each has its advantages and limitations:

1. IP-Based Rate Limiting

Limits the number of requests from a single IP address within a specified time period.

✅ Advantages                                ⚠️ Limitations
Simple to implement at the network level    VPNs and proxies can mask real IPs
Effective against single-source attacks     Shared IPs may block legitimate users

2. Server-Based Rate Limiting

Controls the number of requests to a specific server within a time period.

3. Geography-Based Rate Limiting

Restricts traffic based on the geographic location of the IP address. Useful for blocking traffic from regions associated with abuse.

4. User/API Key-Based Rate Limiting

The most granular approach — limits requests per authenticated user or API key. Essential for SaaS and paid API tiers.

💡 Best Practice

Combine multiple types! Use IP-based as a first line of defense, then apply user-based limits for authenticated traffic.
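That layering can be sketched in a few lines of Python. This is an illustrative single-process version — the counters, `check_request` helper, and key names are hypothetical stand-ins for whatever store and naming scheme you actually use:

```python
import time
from collections import defaultdict

class FixedWindowCounter:
    """Minimal in-memory fixed-window counter, keyed by any string."""
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.counts = defaultdict(lambda: [0, 0.0])  # key -> [count, window_start]

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        count, start = self.counts[key]
        if now - start >= self.window:
            self.counts[key] = [1, now]  # new window: reset and admit
            return True
        if count >= self.max_requests:
            return False
        self.counts[key][0] += 1
        return True

def check_request(ip, user_id, ip_limiter, user_limiter):
    # First line of defense: coarse per-IP limit (catches unauthenticated abuse)
    if not ip_limiter.allow(f"ip:{ip}"):
        return False
    # Then the tighter per-user quota for authenticated traffic
    if user_id is not None and not user_limiter.allow(f"user:{user_id}"):
        return False
    return True
```

The per-IP limit is deliberately looser than the per-user limit, so shared IPs (offices, NATs) aren't punished while individual API keys still hit their own quota first.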

Requirements & Constraints

What We Need to Build

  • Per-user limits: Limit each API key to X requests per minute/hour
  • Per-IP limits: As a fallback for unauthenticated requests
  • Low latency: The rate check should add minimal overhead (target: <2ms)
  • High availability: Rate limiting shouldn't become a single point of failure
  • Real-time policy updates: Change limits without restarting services

⚠️ Key Constraints to Consider

  • High throughput: Edge servers may handle 100K+ requests/second
  • Latency sensitivity: Every millisecond counts in the request path
  • Memory limits: Can't store unlimited counters per node
  • Distributed nature: Multiple servers need a consistent view

High-Level Design

At a high level, when a request comes in, your API Gateway asks the Rate Limiter: "Should I allow this request?" The Rate Limiter checks a shared datastore (usually Redis) and returns a yes/no decision.

Key Components

  • 🚪 API Gateway — entry point that consults the rate limiter on every request
  • 🚦 Rate Limiter — applies the algorithm and returns an allow/deny decision
  • 💾 Redis Cluster — shared, low-latency store for counters and buckets
  • ⚙️ Config Service — pushes limit policy changes without service restarts
  • 📊 Metrics — tracks request and rejection rates for alerting

Rate Limiting Algorithms

There are three main algorithms for rate limiting. Each has its own trade-offs.

Token Bucket

Imagine a bucket that holds tokens. Every second, new tokens are added. Each request costs one token. If empty — request denied!

✓ Pros: Allows bursts, simple to implement
✗ Cons: Slightly imprecise at boundaries
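The refill-and-spend mechanic can be sketched as a single-process Python class before involving Redis (illustrative names; a Redis-backed version follows later in this article):

```python
import time

class TokenBucket:
    """Single-process token bucket: refills continuously, allows bursts up to capacity."""
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start full
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Note the bucket starts full, which is what allows an idle client to burst up to `capacity` requests at once.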

Sliding Window

Instead of a fixed window, imagine a window that "slides" with time: with a one-hour limit, at 0:30 you count every request made between 23:30 and 0:30.

✓ Pros: More accurate, no boundary issues
✗ Cons: Uses more memory
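A minimal sliding-window log keeps one timestamp per request and evicts anything older than the window — which is exactly why it uses more memory than a single counter (in-memory sketch; names are illustrative):

```python
import time
from collections import deque

class SlidingWindowLog:
    """Keeps a timestamp per request; the window 'slides' with the clock."""
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.log = deque()  # request timestamps, oldest first

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop timestamps that have slid out of the window
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) >= self.max_requests:
            return False
        self.log.append(now)
        return True
```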

Leaky Bucket

Picture a bucket with a small hole at the bottom. Water (requests) flows in and leaks out at a constant rate.

✓ Pros: Smooths traffic, protects downstream
✗ Cons: Doesn't allow bursts

Which Algorithm Should You Use?

Scenario                        Best Algorithm    Why
General API rate limiting       Token Bucket      Allows healthy bursts
Strict requests-per-minute      Sliding Window    Most accurate
Protecting a fragile service    Leaky Bucket      Smooths traffic
Mix of bursty/steady clients    Token Bucket      Burst-friendly

Implementation with Redis

Redis is the go-to choice for rate limiting because it's blazing fast, supports atomic operations, and has built-in key expiration.

Data Model

For each user, we store: tokens (remaining) and ts (last refill timestamp).

Key format: rl:user:{user_id}

Lua Script (Atomic Token Bucket)

📄 token_bucket.lua Lua
-- Token Bucket Rate Limiter (Redis + Lua)
local key = KEYS[1]
local rate = tonumber(ARGV[1])           -- tokens added per second
local capacity = tonumber(ARGV[2])       -- max tokens in bucket
local now = tonumber(ARGV[3])            -- current timestamp
local requested = tonumber(ARGV[4])      -- tokens needed

local data = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(data[1]) or capacity
local last_ts = tonumber(data[2]) or now

local elapsed = math.max(0, now - last_ts)
local refill = math.min(capacity, tokens + elapsed * rate)

if refill < requested then
  return 0  -- REJECT
else
  local new_tokens = refill - requested
  redis.call('HMSET', key, 'tokens', new_tokens, 'ts', now)
  redis.call('EXPIRE', key, 3600)
  return 1  -- ALLOW
end

Python Implementation

Now let's implement rate limiting in Python with multiple approaches.

1. Token Bucket (Python + Redis)

📄 token_bucket.py Python
import time
import redis

class TokenBucketRateLimiter:
    def __init__(self, redis_client, rate, capacity):
        self.redis = redis_client
        self.rate = rate
        self.capacity = capacity
        self.lua_script = self.redis.register_script("""
            local key = KEYS[1]
            local rate = tonumber(ARGV[1])
            local capacity = tonumber(ARGV[2])
            local now = tonumber(ARGV[3])
            local requested = tonumber(ARGV[4])
            
            local data = redis.call('HMGET', key, 'tokens', 'ts')
            local tokens = tonumber(data[1]) or capacity
            local last_ts = tonumber(data[2]) or now
            
            local elapsed = math.max(0, now - last_ts)
            local refill = math.min(capacity, tokens + elapsed * rate)
            
            if refill < requested then
                return {0, refill}
            else
                redis.call('HMSET', key, 'tokens', refill - requested, 'ts', now)
                redis.call('EXPIRE', key, 3600)
                return {1, refill - requested}
            end
        """)
    
    def allow_request(self, user_id, tokens_requested=1):
        key = f"rl:user:{user_id}"
        # Note: Redis truncates Lua numbers to integers in script replies,
        # so `remaining` comes back rounded down
        result = self.lua_script(keys=[key], args=[self.rate, self.capacity, time.time(), tokens_requested])
        return bool(result[0]), float(result[1])

# Usage
r = redis.Redis(host='localhost', port=6379)
limiter = TokenBucketRateLimiter(r, rate=10, capacity=100)
allowed, remaining = limiter.allow_request("user_123")
print(f"Allowed: {allowed}, Remaining: {remaining}")

2. Flask Decorator Example

📄 flask_rate_limit.py Python
from functools import wraps
from flask import Flask, request, jsonify
import redis, time

app = Flask(__name__)
redis_client = redis.Redis(host='localhost', port=6379)

def rate_limit(max_requests, window_seconds):
    def decorator(f):
        @wraps(f)
        def wrapped(*args, **kwargs):
            key = f"rl:{f.__name__}:{request.remote_addr}"
            now = time.time()
            pipe = redis_client.pipeline()
            pipe.zremrangebyscore(key, 0, now - window_seconds)  # evict entries outside the window
            pipe.zcard(key)                                      # count requests still in the window
            pipe.zadd(key, {str(now): now})                      # record this request
            pipe.expire(key, window_seconds)                     # clean up idle keys
            results = pipe.execute()

            # results[1] is the count *before* this request was added; the
            # rejected request is still recorded, which slightly penalizes
            # clients that keep hammering past the limit
            if results[1] >= max_requests:
                return jsonify({"error": "Rate limit exceeded"}), 429
            return f(*args, **kwargs)
        return wrapped
    return decorator

@app.route('/api/data')
@rate_limit(max_requests=10, window_seconds=60)
def get_data():
    return jsonify({"data": "success"})

Scaling & Performance

Edge Caching

Maintain local counters at each edge node. These sync with Redis periodically. Trade-off: slight overshoot during sync delays.
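A rough sketch of that idea, with a plain dict standing in for the central Redis cluster (all names hypothetical). Between syncs each edge node decides from its last snapshot plus its own unflushed increments, which is exactly where the overshoot comes from:

```python
class EdgeCounter:
    """Local counter that flushes to a shared store on a sync interval.
    `central` is a dict standing in for the real Redis cluster."""
    def __init__(self, central, limit, sync_interval=10.0):
        self.central = central
        self.limit = limit
        self.sync_interval = sync_interval
        self.local = {}      # key -> increments not yet flushed
        self.snapshot = {}   # key -> global count seen at last sync
        self.last_sync = {}  # key -> timestamp of last sync

    def allow(self, key, now):
        if now - self.last_sync.get(key, float("-inf")) >= self.sync_interval:
            # Flush local increments, then refresh our view of the global count
            self.central[key] = self.central.get(key, 0) + self.local.get(key, 0)
            self.snapshot[key] = self.central[key]
            self.local[key] = 0
            self.last_sync[key] = now
        seen = self.snapshot.get(key, 0) + self.local.get(key, 0)
        if seen >= self.limit:
            return False
        self.local[key] = self.local.get(key, 0) + 1
        return True
```

With N edge nodes, the worst-case overshoot between syncs is roughly N times the per-node headroom — usually an acceptable price for keeping Redis off the hot path.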

Sharding

Use consistent hashing to distribute keys across multiple Redis nodes.

Batching

Batch counter updates (10-50 requests at once) to reduce Redis write QPS.
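A sketch of the batching idea — buffer increments per key and issue one write per batch, cutting write QPS by roughly the batch size (the dict `store` stands in for Redis, and the single flush write stands in for one `INCRBY`):

```python
class BatchedCounter:
    """Buffers increments locally and flushes one write per batch.
    `store` is a dict standing in for Redis."""
    def __init__(self, store, batch_size=10):
        self.store = store
        self.batch_size = batch_size
        self.pending = {}  # key -> buffered increments
        self.writes = 0    # flushes issued (for illustration)

    def incr(self, key):
        self.pending[key] = self.pending.get(key, 0) + 1
        if self.pending[key] >= self.batch_size:
            self.flush(key)

    def flush(self, key):
        n = self.pending.pop(key, 0)
        if n:
            # One INCRBY-style write instead of n separate ones
            self.store[key] = self.store.get(key, 0) + n
            self.writes += 1
```

In production you would also flush on a timer so quiet keys don't hold stale counts indefinitely.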

Availability & Fault Tolerance

Strategy                When Redis Is Down        Trade-off
Fail-Open               Allow all requests        Risk of abuse
Fail-Closed             Reject all requests       Service degradation
Graceful Degradation    Relax limits (e.g. 10x)   Balanced approach
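The fail-open vs fail-closed choice can be isolated in a small wrapper around any limiter. This sketch uses a hypothetical `StoreUnavailable` exception in place of a real Redis connection error, and assumes `inner` exposes an `allow_request()` method like the token bucket class above:

```python
class StoreUnavailable(Exception):
    """Stand-in for a datastore connection error
    (e.g. redis.exceptions.ConnectionError in a real deployment)."""

class FailOpenLimiter:
    """Wraps a limiter so a datastore outage degrades gracefully
    instead of taking down the request path."""
    def __init__(self, inner, fail_open=True):
        self.inner = inner
        self.fail_open = fail_open

    def allow_request(self, user_id):
        try:
            allowed, _remaining = self.inner.allow_request(user_id)
            return allowed
        except StoreUnavailable:
            # Store unreachable: fail-open admits traffic, fail-closed rejects it
            return self.fail_open
```

Which flag you pick depends on what the limiter protects: fail-open suits general API quotas, while fail-closed is safer for login endpoints where admitting everything enables brute force.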

Key Trade-offs

⚖️ Design Trade-offs

  • Accuracy vs Latency: Local caching = fast but imprecise
  • Simplicity vs Precision: Fixed window = simple; Sliding = more memory
  • Consistency vs Availability: Strict counting needs sync

Monitoring & Metrics

  • Request rate by key: Who's hitting your API and how much?
  • Rejection rate: How often are limits being hit?
  • Redis latency: Is the rate limiter adding too much overhead?
  • Near-limit alerts: Warn when a customer is at 80% of their quota

Rate Limiting in Different Layers

Layer             Where It Happens               Best For
🌐 API Gateway    Before services (Kong, AWS)    Global limits, DDoS protection
⚙️ Application    Inside app code                Fine-grained business logic
🔧 Service        Within microservices           Protecting downstream services
🗄️ Database       Before DB queries              Preventing query floods

⚠️ Never rely solely on client-side rate limiting! Always enforce on the server side.

Real-World Examples

Company            Rate Limiting Approach
Twitter/X API      15-900 requests per 15-min window; OAuth token-based
GitHub API         5,000/hour authenticated; 60/hour unauthenticated
Stripe API         100 reads/sec; 25 writes/sec with token bucket
AWS API Gateway    Token bucket with configurable rate & burst
Cloudflare         Multi-layer edge limiting; geography-based rules

📚 Cloud Provider Rate Limiters

  • AWS: API Gateway Throttling, WAF Rate Rules
  • Azure: API Management Rate Limiting & Quota Policies
  • GCP: API Gateway Quota Policies, Cloud Armor

Conclusion

Key Takeaway

For most use cases, a Redis-based token bucket using Lua scripts provides the best balance of accuracy, latency, and simplicity. For extreme scale, combine edge caching with centralized storage.

Rate limiting is a critical piece of any production API. It protects your services, ensures fair usage, and gives you control over your system's resources.