What is Load Balancing?
A load balancer acts as a "traffic cop" sitting in front of your backend servers.
In modern high-traffic websites, reliably serving millions of requests typically requires adding more servers (horizontal scaling). A load balancer distributes incoming network traffic across a group of backend servers, ensuring no single server bears too much load. By spreading the work evenly, load balancing improves responsiveness and increases application availability.
🛡️ Key Benefits
- ✓ Availability: If a server dies, traffic is routed to healthy ones.
- ✓ Scalability: Add/remove servers on demand without downtime.
- ✓ Security: Can defend against DDoS and hide backend IPs.
- ✓ Performance: Routes to the fastest or closest resources.
📍 Placement
Load balancers sit at multiple layers:
- Between User and Web Servers (NGINX, AWS ALB)
- Between Web Servers and App Servers
- Between App Servers and Reporting Database
L4 vs L7 Load Balancing
Load balancers are often categorized by the OSI layer at which they operate. The choice depends on performance needs vs. routing intelligence.
| Layer 4 (Transport Layer) | Layer 7 (Application Layer) |
|---|---|
| How it works: Routes based on IP range and Port (TCP/UDP). It doesn't look inside the packet content. | How it works: Routes based on content (HTTP Headers, URL, Cookies, Message Data). |
| Pros: Extremely fast, handles millions of connections, secure (no decryption needed at LB). | Pros: Smarter routing (e.g., `/video` to video servers), Sticky Sessions, Authentication, TLS Termination. |
| Cons: "Dumb" routing. Can't route based on request type or content. | Cons: Slower (needs to buffer/inspect packets), requires decryption (CPU intensive). |
| Examples: AWS Network Load Balancer, HAProxy (TCP mode). | Examples: NGINX, HAProxy (HTTP mode), AWS Application Load Balancer. |
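To make the L7 column concrete, here is a toy Python sketch of the kind of content-based decision an L7 balancer can make and an L4 balancer cannot. The pool names and the cookie convention are illustrative, not from any real product:

```python
def route_request(path, headers):
    """Toy L7 routing decision. An L7 balancer can inspect the HTTP
    path, headers, and cookies; an L4 balancer only sees IPs and ports.
    Pool names and the cookie format are illustrative."""
    if path.startswith("/video"):
        return "video-pool"      # content-based routing by URL prefix
    if "session=" in headers.get("Cookie", ""):
        return "sticky-pool"     # sticky sessions via a session cookie
    return "default-pool"
```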
Load Balancing Algorithms
How does the load balancer choose which server to send a request to? Here are the standard strategies.
1. Round Robin & Weighted Round Robin
Round Robin: Requests are distributed sequentially. Server A -> Server B -> Server C -> Server A.
Weighted: If Server B is 2x more powerful than A, it gets 2x the traffic (Weights: A=1, B=2, C=1).
2. Least Connections
Routes traffic to the server with the fewest active connections. This is useful when processing times vary significantly (e.g., chat servers or long-polling).
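A minimal sketch of Least Connections selection, assuming the balancer tracks in-flight requests per backend (server names are illustrative):

```python
class LeastConnections:
    """Pick the backend with the fewest in-flight requests."""

    def __init__(self, servers):
        self.active = {s: 0 for s in servers}  # server -> open connections

    def acquire(self):
        # Choose the least-loaded server and count the new connection.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call when the request completes.
        self.active[server] -= 1
```

Because selection happens per request, a backend stuck on a slow long-poll naturally stops receiving new traffic until it drains.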
3. Consistent Hashing
Crucial for distributed caching systems. Traditional hashing (`key % n_servers`) fails when servers are added/removed because nearly ALL keys get remapped, causing massive cache misses.
Consistent Hashing maps servers and keys to a "Ring". A key is routed to the next server clockwise on the ring. Adding/Removing a server only affects neighbors, minimizing reorganization.
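The ring can be sketched in a few lines of Python. The virtual-node count and the MD5 hash below are illustrative choices, not requirements of the technique:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal hash ring: servers and keys share one hash space."""

    def __init__(self, servers, replicas=100):
        # Virtual nodes (replicas) smooth out the key distribution.
        self.replicas = replicas
        self._hashes = []   # sorted vnode hashes, for bisect
        self._ring = []     # parallel list of (hash, server)
        for server in servers:
            self.add_server(server)

    def _hash(self, value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_server(self, server):
        for i in range(self.replicas):
            h = self._hash(f"{server}#{i}")
            idx = bisect.bisect(self._hashes, h)
            self._hashes.insert(idx, h)
            self._ring.insert(idx, (h, server))

    def remove_server(self, server):
        kept = [(h, s) for h, s in self._ring if s != server]
        self._ring = kept
        self._hashes = [h for h, _ in kept]

    def get_server(self, key):
        # Walk clockwise: first virtual node at or after the key's hash.
        h = self._hash(key)
        idx = bisect.bisect(self._hashes, h) % len(self._hashes)
        return self._ring[idx][1]
```

Removing a server only remaps the keys that were on it; every other key keeps its old owner, which is exactly the property that protects the cache.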
Health Checks & High Availability
A load balancer must know if a backend is dead. This is done via Health Checks.
- Active Checks: The LB pings the server (e.g., `GET /health`) every 5 seconds. If it fails 3 times, it's marked "Unhealthy".
- Passive Checks: The LB observes real traffic. If multiple requests time out or fail, the server is temporarily removed from rotation.
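The active-check bookkeeping above (fail 3 times, then mark "Unhealthy") can be sketched as follows. The probe uses only the standard library; the endpoint path and thresholds mirror the text and are, of course, tunable:

```python
import urllib.request

class HealthChecker:
    """Active health checking: a backend is unhealthy after N
    consecutive probe failures; one success resets the counter.
    In a real system, check_all backends from a timer, e.g. every 5s."""

    def __init__(self, backends, path="/health", fail_threshold=3):
        self.path = path
        self.fail_threshold = fail_threshold
        self.failures = {b: 0 for b in backends}  # consecutive failures

    def record(self, backend, ok):
        self.failures[backend] = 0 if ok else self.failures[backend] + 1

    def healthy(self, backend):
        return self.failures[backend] < self.fail_threshold

    def probe(self, backend):
        # One active check: GET <backend>/health with a short timeout.
        try:
            with urllib.request.urlopen(backend + self.path, timeout=2) as r:
                self.record(backend, r.status == 200)
        except OSError:
            self.record(backend, False)
```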
HA Patterns for Load Balancers
What if the Load Balancer itself dies? You need redundancy.
Active-Passive: Two LBs run. One processes traffic, the other sits idle (heartbeating). If the main dies, the passive takes over the VIP (Virtual IP).
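A toy sketch of the passive node's side of that failover, assuming heartbeats arrive over some channel. The timeout value is illustrative; real deployments typically use tools like keepalived (VRRP) for the VIP takeover:

```python
import time

class PassiveNode:
    """Standby LB logic: if no heartbeat arrives from the active node
    within `timeout` seconds, claim the Virtual IP."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout
        self.last_heartbeat = time.monotonic()
        self.owns_vip = False

    def on_heartbeat(self):
        # Called whenever the active node's heartbeat is received.
        self.last_heartbeat = time.monotonic()

    def tick(self):
        # Called periodically; returns True once we own the VIP.
        if time.monotonic() - self.last_heartbeat > self.timeout:
            self.owns_vip = True  # take over (e.g. announce via gratuitous ARP)
        return self.owns_vip
```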
Python Implementation: Weighted Round Robin
A minimal, thread-safe Weighted Round Robin implementation, conceptually similar to what NGINX or LVS use internally (production balancers use refined variants, such as NGINX's smooth weighted round robin).
```python
import itertools
import threading

class WeightedRoundRobin:
    def __init__(self, servers):
        """
        servers: dict like {'server_a': 5, 'server_b': 1, 'server_c': 1}
        Higher weight means a larger share of requests.
        """
        self.servers = servers
        self.lock = threading.Lock()
        # Expand the pool based on weight:
        # Server A (weight 5) will appear 5 times, B (weight 1) once.
        self._pool = []
        for server, weight in servers.items():
            self._pool.extend([server] * weight)
        # Create an infinite iterator over the expanded pool
        self._cycle = itertools.cycle(self._pool)

    def get_server(self):
        """Thread-safe method to get the next server"""
        with self.lock:
            return next(self._cycle)

# Usage Simulation
lb = WeightedRoundRobin({
    'prod-api-01 (High CPU)': 3,
    'prod-api-02 (Low CPU)': 1,
    'prod-api-03 (Low CPU)': 1
})

print("Traffic Distribution:")
for i in range(10):
    print(f"Request {i+1} -> {lb.get_server()}")
```
Why this matters: Weighting lets stronger servers legitimately take a larger share of the work, so capacity is used proportionally instead of overloading the weakest machine. This simple class demonstrates the core idea.
Summary
- Use L4 for raw performance, L7 for smart routing.
- Round Robin is great for simple, stateless workloads; Least Connections for long-lived or variable-duration tasks.
- Consistent Hashing is a must for caching layers to prevent cache avalanches.
- Always implement Active Health Checks to failover gracefully.