Real-Time Messaging

WhatsApp System Design

Architecting a billion-user chat platform: WebSocket servers, Cassandra Wide-Column storage, User Presence heartbeats, and Group Chat fanout.

High-Level Architecture

The system is split into Real-Time (Ephemeral) and Storage (Persistent) layers.

Client
Load Balancer
WebSocket
Gateway
Message
Queue
DB Workers

Component Breakdown

  • WebSocket Gateway: Holds persistent TCP connections for online users. Uses Redis to map `User_ID -> Gateway_Server_IP`.
  • Chat Service: Orchestrates message routing. If User B is offline, sends to Push Notification Service.
  • Presence Service: Check if user is Online/Offline via Heartbeats.
  • Group Service: Manages Group metadata (Members, Admin).

Database Choice: Cassandra vs MySQL

Chat systems generate massive write throughput (TB of data per day). Traditional RDBMS (MySQL) struggle with this scale.

Criteria RDBMS (MySQL) Key-Value (Redis) Wide-Column (Cassandra/HBase)
Data Model Relational (Joins) Simple Key-Value Time-Series / Wide-Column
Scaling Vertical (Hard to Shard) Memory Limited Horizontal (Linear Scale)
Query Pattern Complex Queries Single Item "Get recent messages for User A"
Verdict Bad for History Good for Presence Best for Chat History
Why Cassandra? It supports extremely high write throughput and stores messages effectively ordered by time (`Partition Key: ChatID`, `Clustering Key: Timestamp`).

User Presence (Heartbeats)

How do we know if a friend is "Online"?

Heartbeat Mechanism

Detailed Flow:

  1. Client sends "I'm alive" signal every 5s.
  2. Load Balancer forwards to Presence Service.
  3. Service updates Redis with TTL=10s.
  4. If no heartbeat for 10s, key expires -> User Offline.
Fanout Optimization

Do not broadcast "Online" status to everyone. Only send updates to friends specifically viewing the chat list (Active Viewport).

WebSocket Implementation (Python/FastAPI)

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

class ConnectionManager:
    def __init__(self):
        # Map user_id to websocket connection
        self.active_connections: dict[str, WebSocket] = {}

    async def connect(self, websocket: WebSocket, user_id: str):
        await websocket.accept()
        self.active_connections[user_id] = websocket

    def disconnect(self, user_id: str):
        if user_id in self.active_connections:
            del self.active_connections[user_id]

    async def send_personal_message(self, message: str, user_id: str):
        if user_id in self.active_connections:
            # Send real-time if online
            await self.active_connections[user_id].send_text(message)
        else:
            # Fallback to Push Notification / DB
            print(f"User {user_id} is offline. Queuing message.")

manager = ConnectionManager()

@app.websocket("/ws/{user_id}")
async def websocket_endpoint(websocket: WebSocket, user_id: str):
    await manager.connect(websocket, user_id)
    try:
        while True:
            data = await websocket.receive_text()
            # Process incoming message
    except WebSocketDisconnect:
        manager.disconnect(user_id)

Summary

  • Use WebSockets for bi-directional low-latency communication.
  • Choose Cassandra/HBase for storing chat history due to write-heavy nature.
  • Implement User Presence with Redis-based heartbeats.
  • Handle Group Chats by iterating through members and managing fanout carefully.
  • Ensure Reliability with Acknowledgements (ACKs) and Retry queues.