📚 What You'll Learn
- Why rate limiting is essential for protecting APIs
- How the 3 main algorithms work (with simple analogies!)
- Which algorithm to pick for your use case
- How to implement rate limiting with Redis & Python
- Production considerations for scale and reliability
What is Rate Limiting?
Rate limiting is a technique to control how many requests a user (or client) can make to your API within a specific time window. It's like a traffic cop for your servers — it makes sure no single user hogs all the resources.
💡 Think of it like this...
Imagine a popular coffee shop that only has 2 baristas. If 100 customers rush in at once, chaos ensues! Rate limiting is like having a queue system — maybe each customer can only order 3 drinks per hour. This keeps things fair and prevents the shop from being overwhelmed.
Why Do We Need It?
- Prevent abuse: Stop malicious users or bots from overloading your servers
- Ensure fairness: One user shouldn't consume all the resources meant for everyone
- Protect revenue: If you have paid tiers, rate limiting enforces usage quotas
- Manage costs: Every API call costs money (compute, bandwidth, etc.)
- Improve reliability: Prevents cascading failures during traffic spikes
Common Use Cases
Rate limiting is used across many different scenarios:
API Rate Limiting
Control the volume of client requests and ensure fair access (Twitter, GitHub, Google Maps)
Web Server Protection
Defend against DoS attacks, prevent server overload during traffic spikes
Database Rate Limiting
Limit queries per user to prevent resource exhaustion
Login Rate Limiting
Stop brute-force attacks by limiting login attempts
Types of Rate Limiting
There are several ways to identify and limit traffic. Each has its advantages and limitations:
1. IP-Based Rate Limiting
Limits the number of requests from a single IP address within a specified time period.
| ✅ Advantages | ⚠️ Limitations |
|---|---|
| Simple to implement at network level | VPNs/proxies can spoof IPs |
| Effective against single-source attacks | Shared IPs may block legitimate users |
2. Server-Based Rate Limiting
Controls the number of requests to a specific server within a time period.
3. Geography-Based Rate Limiting
Restricts traffic based on the geographic location of the IP address. Useful for blocking or throttling regions that are frequent sources of abuse.
4. User/API Key-Based Rate Limiting
The most granular approach — limits requests per authenticated user or API key. Essential for SaaS and paid API tiers.
💡 Best Practice
Combine multiple types! Use IP-based as a first line of defense, then apply user-based limits for authenticated traffic.
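To sketch what "combining types" looks like in code, here is a minimal layered check. The function and method names (`check_request`, `allow`) are illustrative, not from any particular library; any limiter object exposing an `allow(key) -> bool` method would fit.

```python
def check_request(ip, api_key, ip_limiter, user_limiter):
    """Layered rate limiting: IP limit first, then the per-user limit
    for authenticated traffic. Limiter objects are assumed to expose
    an allow(key) -> bool method."""
    if not ip_limiter.allow(ip):
        return False  # first line of defense: blanket per-IP limit
    if api_key is not None and not user_limiter.allow(api_key):
        return False  # stricter, per-user quota for authenticated calls
    return True
```

Unauthenticated requests (no API key) only face the IP limit, which matches the fallback role described above.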
Requirements & Constraints
What We Need to Build
- Per-user limits: Limit each API key to X requests per minute/hour
- Per-IP limits: As a fallback for unauthenticated requests
- Low latency: The rate check should add minimal overhead (target: <2ms)
- High availability: Rate limiting shouldn't become a single point of failure
- Real-time policy updates: Change limits without restarting services
⚠️ Key Constraints to Consider
- High throughput: Edge servers may handle 100K+ requests/second
- Latency sensitivity: Every millisecond counts in the request path
- Memory limits: Can't store unlimited counters per node
- Distributed nature: Multiple servers need a consistent view
High-Level Design
At a high level, when a request comes in, your API Gateway asks the Rate Limiter: "Should I allow this request?" The Rate Limiter checks a shared datastore (usually Redis) and returns a yes/no decision.
Key Components
- API Gateway: entry point that asks the rate limiter about every incoming request
- Rate Limiter: runs the chosen algorithm and returns the allow/deny decision
- Redis Cluster: shared datastore holding the counters
- Config Service: pushes policy updates in real time, without service restarts
- Metrics: tracks request rates, rejections, and limiter latency
Rate Limiting Algorithms
There are three main algorithms for rate limiting. Each has its own trade-offs.
Token Bucket
Imagine a bucket that holds tokens. Every second, new tokens are added. Each request costs one token. If empty — request denied!
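To make the mechanics concrete, here is a minimal in-memory sketch (class and method names are illustrative; the Redis-backed version later in this article is what you'd run in production):

```python
import time

class TokenBucket:
    """Minimal in-memory token bucket, for illustration only."""
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # bucket size = max allowed burst
        self.tokens = capacity      # start full
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Note the key property: a full bucket lets a client burst up to `capacity` requests at once, then settle to `rate` requests per second.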
Sliding Window
Instead of a fixed window, imagine a window that "slides" with time: at 00:30, you count every request made between 23:30 and 00:30.
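A common way to implement this is a "sliding window log": keep a timestamp per request and discard entries older than the window. A minimal in-memory sketch (names are illustrative):

```python
import time
from collections import deque

class SlidingWindowLog:
    """Minimal in-memory sliding window log, for illustration only."""
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps = deque()   # one entry per accepted request

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop timestamps that have slid out of the window
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_requests:
            self.timestamps.append(now)
            return True
        return False
```

This is the most accurate approach, but it costs one stored timestamp per request, which is why the Flask example later uses a Redis sorted set for the same idea.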
Leaky Bucket
Picture a bucket with a small hole at the bottom. Water (requests) flows in and leaks out at a constant rate.
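The same picture as a minimal in-memory "meter" sketch (illustrative names; a real deployment would often implement this as a queue in front of the protected service):

```python
import time

class LeakyBucket:
    """Minimal in-memory leaky bucket meter, for illustration only."""
    def __init__(self, leak_rate, capacity):
        self.leak_rate = leak_rate  # units drained per second (the hole)
        self.capacity = capacity    # bucket size
        self.level = 0.0            # current water level
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Drain the bucket for the time elapsed since the last check
        self.level = max(0.0, self.level - (now - self.last) * self.leak_rate)
        self.last = now
        if self.level + 1 <= self.capacity:
            self.level += 1         # this request adds one unit of "water"
            return True
        return False
```

Unlike the token bucket, the leaky bucket never lets output exceed `leak_rate`, which is exactly why it suits fragile downstream services.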
Which Algorithm Should You Use?
| Scenario | Best Algorithm | Why |
|---|---|---|
| General API rate limiting | Token Bucket | Allows healthy bursts |
| Strict requests-per-minute | Sliding Window | Most accurate |
| Protecting fragile service | Leaky Bucket | Smooths traffic |
| Mix of bursty/steady clients | Token Bucket | Burst-friendly |
Implementation with Redis
Redis is the go-to choice for rate limiting because it's blazing fast, supports atomic operations, and has built-in key expiration.
Data Model
For each user, we store two fields: tokens (tokens remaining) and ts (timestamp of the last refill).
Key format: rl:user:{user_id}
Lua Script (Atomic Token Bucket)
```lua
-- Token Bucket Rate Limiter (Redis + Lua)
local key = KEYS[1]
local rate = tonumber(ARGV[1])      -- tokens added per second
local capacity = tonumber(ARGV[2])  -- max tokens in bucket
local now = tonumber(ARGV[3])       -- current timestamp
local requested = tonumber(ARGV[4]) -- tokens needed

local data = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(data[1]) or capacity
local last_ts = tonumber(data[2]) or now

-- Refill based on elapsed time, capped at capacity
local elapsed = math.max(0, now - last_ts)
local refill = math.min(capacity, tokens + elapsed * rate)

if refill < requested then
  return 0 -- REJECT
else
  local new_tokens = refill - requested
  redis.call('HMSET', key, 'tokens', new_tokens, 'ts', now)
  redis.call('EXPIRE', key, 3600) -- clean up idle buckets after an hour
  return 1 -- ALLOW
end
```
Python Implementation
Now let's implement rate limiting in Python with multiple approaches.
1. Token Bucket (Python + Redis)
```python
import time

import redis


class TokenBucketRateLimiter:
    def __init__(self, redis_client, rate, capacity):
        self.redis = redis_client
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # max tokens (burst size)
        # register_script uploads the script once; later calls use its hash
        self.lua_script = self.redis.register_script("""
            local key = KEYS[1]
            local rate = tonumber(ARGV[1])
            local capacity = tonumber(ARGV[2])
            local now = tonumber(ARGV[3])
            local requested = tonumber(ARGV[4])

            local data = redis.call('HMGET', key, 'tokens', 'ts')
            local tokens = tonumber(data[1]) or capacity
            local last_ts = tonumber(data[2]) or now

            local elapsed = math.max(0, now - last_ts)
            local refill = math.min(capacity, tokens + elapsed * rate)

            if refill < requested then
                return {0, refill}
            else
                redis.call('HMSET', key, 'tokens', refill - requested, 'ts', now)
                redis.call('EXPIRE', key, 3600)
                return {1, refill - requested}
            end
        """)

    def allow_request(self, user_id, tokens_requested=1):
        key = f"rl:user:{user_id}"
        result = self.lua_script(
            keys=[key],
            args=[self.rate, self.capacity, time.time(), tokens_requested],
        )
        # Note: Redis truncates Lua numbers to integers in the reply,
        # so the remaining-token count loses its fractional part
        return bool(result[0]), float(result[1])


# Usage
r = redis.Redis(host='localhost', port=6379)
limiter = TokenBucketRateLimiter(r, rate=10, capacity=100)
allowed, remaining = limiter.allow_request("user_123")
print(f"Allowed: {allowed}, Remaining: {remaining}")
```
2. Flask Decorator Example
```python
import time
from functools import wraps

import redis
from flask import Flask, jsonify, request

app = Flask(__name__)
redis_client = redis.Redis(host='localhost', port=6379)


def rate_limit(max_requests, window_seconds):
    """Sliding window log using a Redis sorted set keyed by client IP."""
    def decorator(f):
        @wraps(f)
        def wrapped(*args, **kwargs):
            key = f"rl:{f.__name__}:{request.remote_addr}"
            now = time.time()
            pipe = redis_client.pipeline()
            pipe.zremrangebyscore(key, 0, now - window_seconds)  # drop expired entries
            pipe.zcard(key)                                      # count what's left
            pipe.zadd(key, {str(now): now})                      # record this request
            pipe.expire(key, window_seconds)
            results = pipe.execute()
            # results[1] is the count before this request was added; note that
            # rejected requests still land in the set and count toward the window
            if results[1] >= max_requests:
                return jsonify({"error": "Rate limit exceeded"}), 429
            return f(*args, **kwargs)
        return wrapped
    return decorator


@app.route('/api/data')
@rate_limit(max_requests=10, window_seconds=60)
def get_data():
    return jsonify({"data": "success"})
```
Scaling & Performance
Edge Caching
Maintain local counters at each edge node. These sync with Redis periodically. Trade-off: slight overshoot during sync delays.
Sharding
Use consistent hashing to distribute keys across multiple Redis nodes.
Batching
Batch counter updates (10-50 requests at once) to reduce Redis write QPS.
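A sketch of the edge-caching and batching ideas combined: each node accumulates counter deltas locally and flushes them to the central store in batches. All names here are illustrative, and `flush_fn` stands in for the actual Redis sync call (e.g. a pipelined INCRBY per key).

```python
class EdgeCounter:
    """Per-node counter that batches updates to a central store (sketch)."""
    def __init__(self, flush_fn, flush_every=10):
        self.pending = {}         # key -> delta accumulated since last flush
        self.seen = 0             # requests recorded since last flush
        self.flush_fn = flush_fn  # stands in for the Redis sync call
        self.flush_every = flush_every

    def record(self, key):
        self.pending[key] = self.pending.get(key, 0) + 1
        self.seen += 1
        if self.seen >= self.flush_every:
            self.flush_fn(self.pending)  # one round trip for many requests
            self.pending, self.seen = {}, 0
```

The trade-off noted above is visible here: between flushes, other nodes don't see this node's traffic, so clients can briefly overshoot their global limit.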
Availability & Fault Tolerance
| Strategy | When Redis is Down | Trade-off |
|---|---|---|
| Fail-Open | Allow all requests | Risk of abuse |
| Fail-Closed | Reject all requests | Service degradation |
| Graceful Degradation | Relax limits (10x) | Balanced approach |
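The fail-open and fail-closed strategies can be expressed as a thin wrapper around the rate check. A sketch, assuming `limiter_fn` returns an `(allowed, remaining)` tuple like `allow_request` above; the generic built-in `ConnectionError` stands in for the Redis client's own connection exception (`redis.exceptions.ConnectionError` in redis-py):

```python
def check_with_fallback(limiter_fn, user_id, fail_open=True):
    """Wrap a Redis-backed rate check so a Redis outage doesn't take the API down."""
    try:
        return limiter_fn(user_id)
    except ConnectionError:  # in real code: redis.exceptions.ConnectionError
        # Fail-open: let traffic through; fail-closed: reject everything
        return (fail_open, 0.0)
```

Graceful degradation would replace the fixed `fail_open` flag with a fallback limiter, for example a local in-memory bucket with 10x the normal limit.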
Key Trade-offs
⚖️ Design Trade-offs
- Accuracy vs Latency: Local caching = fast but imprecise
- Simplicity vs Precision: Fixed window = simple; Sliding = more memory
- Consistency vs Availability: Strict counting needs sync
Monitoring & Metrics
- Request rate by key: Who's hitting your API and how much?
- Rejection rate: How often are limits being hit?
- Redis latency: Is the rate limiter adding too much overhead?
- Near-limit alerts: Warn when a customer is at 80% of their quota
Rate Limiting in Different Layers
| Layer | Where It Happens | Best For |
|---|---|---|
| 🌐 API Gateway | Before services (Kong, AWS) | Global limits, DDoS protection |
| ⚙️ Application | Inside app code | Fine-grained business logic |
| 🔧 Service | Within microservices | Protecting downstream |
| 🗄️ Database | Before DB queries | Preventing query floods |
⚠️ Never rely solely on client-side rate limiting! Always enforce on the server side.
Real-World Examples
| Company | Rate Limiting Approach |
|---|---|
| Twitter/X API | 15-900 requests per 15-min window; OAuth token-based |
| GitHub API | 5,000/hour authenticated; 60/hour unauthenticated |
| Stripe API | 100 reads/sec; 25 writes/sec with token bucket |
| AWS API Gateway | Token bucket with configurable rate & burst |
| Cloudflare | Multi-layer edge limiting; geography-based rules |
📚 Cloud Provider Rate Limiters
- AWS: API Gateway Throttling, WAF Rate Rules
- Azure: API Management Rate Limiting & Quota Policies
- GCP: API Gateway Quota Policies, Cloud Armor
Conclusion
Key Takeaway
For most use cases, a Redis-based token bucket using Lua scripts provides the best balance of accuracy, latency, and simplicity. For extreme scale, combine edge caching with centralized storage.
Rate limiting is a critical piece of any production API. It protects your services, ensures fair usage, and gives you control over your system's resources.