Scalable Architecture

Notification System Design

Designing a multi-channel notification service: priority queues, rate limiting, deduplication, and reliable delivery at scale.

High-Level Architecture

A decoupled architecture using Message Queues is essential to separate the API (ingestion) from the Workers (delivery).

Service A → Notification Service API → Kafka / RabbitMQ → Workers → 3rd Party (SES/Twilio/FCM)

Component Breakdown

1. Notification Service (API)

Entry point. Validates inputs, checks user preferences (Opt-in/out), performs rate limiting, and publishes to the message queue.
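The ingestion path can be sketched as follows. This is a minimal illustration: the preference store and the queue are stand-ins (a dict and an in-memory `queue.Queue`), and `ingest_notification` is a hypothetical function name; a real service would publish to Kafka or RabbitMQ.

```python
import json
import queue

# In-memory stand-in for Kafka / RabbitMQ (illustration only)
message_queue = queue.Queue()

# Hypothetical user preference store; in production this lives in Postgres
USER_PREFS = {"u1": {"email_enabled": True}, "u2": {"email_enabled": False}}

def ingest_notification(user_id, channel, payload):
    """Validate input, check opt-in, then publish to the queue."""
    if not user_id or not payload:
        return "rejected: invalid input"
    prefs = USER_PREFS.get(user_id, {})
    if not prefs.get(f"{channel}_enabled", False):
        return "dropped: user opted out"
    message_queue.put(json.dumps(
        {"user_id": user_id, "channel": channel, "payload": payload}))
    return "queued"
```

Keeping validation and preference checks at the API edge means opted-out or malformed requests never consume queue or worker capacity.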

2. Message Queue

Buffers bursts of traffic. Provides reliable persistence. Examples: RabbitMQ (for complex routing) or Kafka (for high throughput).

3. Workers

Consume messages. Render templates. Call external providers (SendGrid, Twilio, FCM). Handle retry logic.

4. Database & Cache

Postgres: Stores notification logs, templates, user settings.
Redis: Stores deduplication keys and rate limit counters.

Prioritization Strategy

Not all notifications are equal. OTPs must arrive within seconds; marketing emails can wait hours.

Priority  | Examples                               | SLA Latency | Queue Implementation
CRITICAL  | OTP, Password Reset, Fraud Alert       | < 5 seconds | Dedicated high-priority queue with low-latency consumers
NORMAL    | Order Confirmation, Friend Request     | < 1 minute  | Standard queue
LOW       | Weekly Newsletter, Promotional Offers  | Hours       | Bulk queue (processed when idle or overnight)
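One way to apply this table is to route each event to a queue by priority at publish time. A minimal sketch, assuming the queue names and the event-type-to-priority mapping shown here (both hypothetical):

```python
# Hypothetical queue names, one per priority tier
PRIORITY_QUEUES = {
    "CRITICAL": "notifications.critical",
    "NORMAL": "notifications.normal",
    "LOW": "notifications.bulk",
}

def route_notification(event_type):
    """Map an event type to its priority queue."""
    critical = {"otp", "password_reset", "fraud_alert"}
    low = {"newsletter", "promo"}
    if event_type in critical:
        priority = "CRITICAL"
    elif event_type in low:
        priority = "LOW"
    else:
        priority = "NORMAL"  # default tier for everything else
    return PRIORITY_QUEUES[priority]
```

Separate physical queues (rather than one queue with a priority field) let you scale CRITICAL consumers independently, so a newsletter backlog can never delay an OTP.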

Deduplication Logic

Prevents spamming the user if a service erroneously retries the same event multiple times.

How it works (atomic check-and-set in Redis)
  1. Generate a unique deduplication key: hash(user_id + event_type + content_id).
  2. Atomically attempt to set the key in Redis with a TTL (e.g., 5 minutes) using SET NX.
  3. If the key already existed → discard (duplicate).
  4. If the key was newly set → proceed with sending.
# Python example: deduplication check
import hashlib

import redis  # assumes a Redis instance is reachable with default settings

redis_client = redis.Redis()

def should_send_notification(user_id, event_type, content_id):
    # Create a unique deduplication key
    raw_key = f"{user_id}:{event_type}:{content_id}"
    dedup_key = hashlib.sha256(raw_key.encode()).hexdigest()

    # SET with nx=True is atomic: returns True if the key was newly set,
    # None if it already exists (a duplicate within the 5-minute TTL)
    is_new = redis_client.set(dedup_key, "1", ex=300, nx=True)

    return bool(is_new)

Reliability & Retry Mechanism

External providers (Twilio, SendGrid) can fail. We need a robust retry strategy.

Architecture for Retries

  • Retry Queue: If a worker fails to send, push the message to a "Retry Queue" with a delay.
  • Exponential Backoff: Wait 1s, 2s, 4s, 8s... to avoid thundering herd on recovery.
  • Dead Letter Queue (DLQ): After Max Retries (e.g., 5), move to DLQ for manual inspection.
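The retry decision described above can be sketched as a single function. The queues here are plain lists for illustration; a real system would publish to a delayed queue (e.g., RabbitMQ dead-letter-exchange TTL tricks or a scheduled Kafka topic), and `handle_failure` is a hypothetical name.

```python
MAX_RETRIES = 5
BASE_DELAY_SECONDS = 1

def handle_failure(message, attempt, retry_queue, dead_letter_queue):
    """Route a failed send: retry with exponential backoff, or give up to the DLQ.

    Returns the delay in seconds before the next attempt, or None if the
    message was moved to the dead letter queue.
    """
    if attempt >= MAX_RETRIES:
        dead_letter_queue.append(message)  # parked for manual inspection
        return None
    # Exponential backoff: 1s, 2s, 4s, 8s, 16s
    delay = BASE_DELAY_SECONDS * (2 ** attempt)
    retry_queue.append((delay, message, attempt + 1))
    return delay
```

Adding random jitter to the delay is a common refinement, so that a provider outage does not produce synchronized retry waves.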
Rate Limiting: Use the Token Bucket algorithm with counters stored in Redis. Example: max 5 SMS per user per hour, to prevent abuse and cost spikes.
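A minimal token bucket sketch, kept in process memory for clarity. In production the bucket state would live in Redis (typically updated atomically via a Lua script) so that all workers share one limit per user; "5 SMS per hour" would mean a capacity of 5 and a refill rate of 5/3600 tokens per second.

```python
import time

class TokenBucket:
    """In-memory token bucket; production state belongs in Redis."""

    def __init__(self, capacity, refill_per_second):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self):
        # Refill tokens proportionally to elapsed time, capped at capacity
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.last_refill = now
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The bucket allows short bursts up to its capacity while enforcing the average rate over time, which suits notification traffic better than a hard fixed window.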

Database Schema

-- User Preferences (Opt-in/out)
CREATE TABLE notification_settings (
  user_id UUID PRIMARY KEY,
  email_enabled BOOLEAN DEFAULT TRUE,
  sms_enabled BOOLEAN DEFAULT FALSE,
  push_enabled BOOLEAN DEFAULT TRUE,
  dnd_start_time TIME, -- Do Not Disturb
  dnd_end_time TIME
);

-- Notification Logs (For audit & history)
CREATE TABLE notification_logs (
  id UUID PRIMARY KEY,
  user_id UUID,
  type VARCHAR(20), -- 'email', 'sms'
  status VARCHAR(20), -- 'sent', 'failed', 'queued'
  content TEXT,
  provider_response JSONB, -- Store external ID
  created_at TIMESTAMP DEFAULT NOW()
);

Summary

  • Use Message Queues to decouple intake from delivery.
  • Implement Priority Queues to ensure OTPs aren't blocked by newsletters.
  • Use Deduplication logic with Redis to prevent spam.
  • Handle Retries with exponential backoff and Dead Letter Queues.
  • Respect User Preferences and Rate Limits at the API level.