Scalable Architecture

Notification System Design

Designing a multi-channel notification service: priority queues, rate limiting, deduplication, and reliable delivery at scale.

High-Level Architecture

A decoupled architecture using Message Queues is essential to separate the API (ingestion) from the Workers (delivery).

Service A → Notification Service API → Kafka / RabbitMQ → Workers → 3rd Party (SES/Twilio/FCM)

Component Breakdown

1. Notification Service (API)

Entry point. Validates inputs, checks user preferences (Opt-in/out), performs rate limiting, and publishes to the message queue.
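The ingestion path can be sketched as follows. This is a minimal illustration: the preference store and the queue are stand-ins (a dict and an in-memory `queue.Queue`), and `ingest_notification` is a hypothetical function name; a real service would publish to Kafka or RabbitMQ.

```python
import json
import queue

# In-memory stand-in for Kafka / RabbitMQ (illustration only)
message_queue = queue.Queue()

# Hypothetical user preference store; in production this lives in Postgres
USER_PREFS = {"u1": {"email_enabled": True}, "u2": {"email_enabled": False}}

def ingest_notification(user_id, channel, payload):
    """Validate input, check opt-in, then publish to the queue."""
    if not user_id or not payload:
        return "rejected: invalid input"
    prefs = USER_PREFS.get(user_id, {})
    if not prefs.get(f"{channel}_enabled", False):
        return "dropped: user opted out"
    message_queue.put(json.dumps(
        {"user_id": user_id, "channel": channel, "payload": payload}))
    return "queued"
```

Keeping validation and preference checks at the API edge means opted-out or malformed requests never consume queue or worker capacity.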

2. Message Queue

Buffers bursts of traffic. Provides reliable persistence. Examples: RabbitMQ (for complex routing) or Kafka (for high throughput).

3. Workers

Consume messages. Render templates. Call external providers (SendGrid, Twilio, FCM). Handle retry logic.

4. Database & Cache

Postgres: Stores notification logs, templates, user settings.
Redis: Stores deduplication keys and rate limit counters.

Prioritization Strategy

Not all notifications are equal. OTPs must arrive within seconds; marketing emails can wait hours.

Priority  | Examples                               | SLA Latency | Queue Implementation
CRITICAL  | OTP, Password Reset, Fraud Alert       | < 5 seconds | Dedicated high-priority queue with low-latency consumers
NORMAL    | Order Confirmation, Friend Request     | < 1 minute  | Standard queue
LOW       | Weekly Newsletter, Promotional Offers  | Hours       | Bulk queue (processed when idle or overnight)
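One way to apply this table is to route each event to a queue by priority at publish time. A minimal sketch, assuming the queue names and the event-type-to-priority mapping shown here (both hypothetical):

```python
# Hypothetical queue names, one per priority tier
PRIORITY_QUEUES = {
    "CRITICAL": "notifications.critical",
    "NORMAL": "notifications.normal",
    "LOW": "notifications.bulk",
}

def route_notification(event_type):
    """Map an event type to its priority queue."""
    critical = {"otp", "password_reset", "fraud_alert"}
    low = {"newsletter", "promo"}
    if event_type in critical:
        priority = "CRITICAL"
    elif event_type in low:
        priority = "LOW"
    else:
        priority = "NORMAL"  # default tier for everything else
    return PRIORITY_QUEUES[priority]
```

Separate physical queues (rather than one queue with a priority field) let you scale CRITICAL consumers independently, so a newsletter backlog can never delay an OTP.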

Deduplication Logic

Prevents spamming the user if a service erroneously retries the same event multiple times.

How it works (atomic check-and-set in Redis)
  1. Generate a unique deduplication key: hash(user_id + event_type + content_id).
  2. Atomically attempt to set the key in Redis with a TTL (e.g., 5 minutes) using SET NX.
  3. If the key already existed → discard (duplicate).
  4. If the key was newly set → proceed with sending.
# Python example: deduplication check
import hashlib

import redis  # assumes a Redis instance is reachable with default settings

redis_client = redis.Redis()

def should_send_notification(user_id, event_type, content_id):
    # Create a unique deduplication key
    raw_key = f"{user_id}:{event_type}:{content_id}"
    dedup_key = hashlib.sha256(raw_key.encode()).hexdigest()

    # SET with nx=True is atomic: returns True if the key was newly set,
    # None if it already exists (a duplicate within the 5-minute TTL)
    is_new = redis_client.set(dedup_key, "1", ex=300, nx=True)

    return bool(is_new)

Reliability & Retry Mechanism

External providers (Twilio, SendGrid) can fail. We need a robust retry strategy.

Architecture for Retries

  • Retry Queue: If a worker fails to send, push the message to a "Retry Queue" with a delay.
  • Exponential Backoff: Wait 1s, 2s, 4s, 8s... to avoid thundering herd on recovery.
  • Dead Letter Queue (DLQ): After Max Retries (e.g., 5), move to DLQ for manual inspection.
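The retry decision described above can be sketched as a single function. The queues here are plain lists for illustration; a real system would publish to a delayed queue (e.g., RabbitMQ dead-letter-exchange TTL tricks or a scheduled Kafka topic), and `handle_failure` is a hypothetical name.

```python
MAX_RETRIES = 5
BASE_DELAY_SECONDS = 1

def handle_failure(message, attempt, retry_queue, dead_letter_queue):
    """Route a failed send: retry with exponential backoff, or give up to the DLQ.

    Returns the delay in seconds before the next attempt, or None if the
    message was moved to the dead letter queue.
    """
    if attempt >= MAX_RETRIES:
        dead_letter_queue.append(message)  # parked for manual inspection
        return None
    # Exponential backoff: 1s, 2s, 4s, 8s, 16s
    delay = BASE_DELAY_SECONDS * (2 ** attempt)
    retry_queue.append((delay, message, attempt + 1))
    return delay
```

Adding random jitter to the delay is a common refinement, so that a provider outage does not produce synchronized retry waves.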
Rate Limiting: Use the Token Bucket algorithm with counters stored in Redis. Example: max 5 SMS per user per hour, to prevent abuse and cost spikes.
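A minimal token bucket sketch, kept in process memory for clarity. In production the bucket state would live in Redis (typically updated atomically via a Lua script) so that all workers share one limit per user; "5 SMS per hour" would mean a capacity of 5 and a refill rate of 5/3600 tokens per second.

```python
import time

class TokenBucket:
    """In-memory token bucket; production state belongs in Redis."""

    def __init__(self, capacity, refill_per_second):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self):
        # Refill tokens proportionally to elapsed time, capped at capacity
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.last_refill = now
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The bucket allows short bursts up to its capacity while enforcing the average rate over time, which suits notification traffic better than a hard fixed window.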

Database Schema

-- User Preferences (Opt-in/out)
CREATE TABLE notification_settings (
  user_id UUID PRIMARY KEY,
  email_enabled BOOLEAN DEFAULT TRUE,
  sms_enabled BOOLEAN DEFAULT FALSE,
  push_enabled BOOLEAN DEFAULT TRUE,
  dnd_start_time TIME, -- Do Not Disturb
  dnd_end_time TIME
);

-- Notification Logs (For audit & history)
CREATE TABLE notification_logs (
  id UUID PRIMARY KEY,
  user_id UUID,
  type VARCHAR(20), -- 'email', 'sms'
  status VARCHAR(20), -- 'sent', 'failed', 'queued'
  content TEXT,
  provider_response JSONB, -- Store external ID
  created_at TIMESTAMP DEFAULT NOW()
);

Summary

  • Use Message Queues to decouple intake from delivery.
  • Implement Priority Queues to ensure OTPs aren't blocked by newsletters.
  • Use Deduplication logic with Redis to prevent spam.
  • Handle Retries with exponential backoff and Dead Letter Queues.
  • Respect User Preferences and Rate Limits at the API level.