High-Level Architecture
A decoupled architecture using Message Queues is essential to separate the API (ingestion) from the Workers (delivery).
[Architecture diagram: Notification Service API → Message Queue → Workers → External providers (SES/Twilio/FCM)]
Component Breakdown
1. Notification Service (API)
Entry point. Validates inputs, checks user preferences (Opt-in/out), performs rate limiting, and publishes to the message queue.
2. Message Queue
Buffers bursts of traffic. Provides reliable persistence. Examples: RabbitMQ (for complex routing) or Kafka (for high throughput).
3. Workers
Consume messages. Render templates. Call external providers (SendGrid, Twilio, FCM). Handle retry logic.
4. Database & Cache
- Postgres: Stores notification logs, templates, and user settings.
- Redis: Stores deduplication keys and rate-limit counters.
Prioritization Strategy
Not all notifications are equal. OTPs must arrive within seconds; marketing emails can wait hours.
| Priority Level | Examples | SLA Latency | Queue Implementation |
|---|---|---|---|
| CRITICAL | OTP, Password Reset, Fraud Alert | < 5 seconds | Dedicated High Priority Queue requiring low latency consumers. |
| NORMAL | Order Confirmation, Friend Request | < 1 minute | Standard Queue. |
| LOW | Weekly Newsletter, Promotional Offers | Hours | Bulk Queue (processed only when idle or at night). |
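The routing step itself can be a simple lookup from priority level to a dedicated queue. The queue names below are hypothetical, not tied to any specific broker:

```python
# Hypothetical mapping from priority level to a dedicated queue.
PRIORITY_QUEUES = {
    "CRITICAL": "notifications.critical",  # OTP, fraud alerts: low-latency consumers
    "NORMAL": "notifications.standard",    # order confirmations, social events
    "LOW": "notifications.bulk",           # newsletters, promos: off-peak processing
}

def route_to_queue(priority):
    """Pick the target queue for a notification; unknown priorities fall back to NORMAL."""
    return PRIORITY_QUEUES.get(priority, PRIORITY_QUEUES["NORMAL"])
```

Keeping CRITICAL on its own queue with dedicated consumers is what guarantees the SLA: a backlog of LOW messages can never sit in front of an OTP.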
Deduplication Logic
Prevents spamming the user if a service erroneously retries the same event multiple times.
How it works (Redis check-and-set)
- Generate a unique deduplication key: hash(user_id + event_type + content_id).
- Check whether the key already exists in Redis.
- If it exists → discard the event (duplicate).
- If it does not → store the key in Redis with a TTL (e.g., 5 minutes) and proceed.
```python
# Python example: deduplication check
import hashlib

import redis

redis_client = redis.Redis()

def should_send_notification(user_id, event_type, content_id):
    # Build a unique deduplication key
    raw_key = f"{user_id}:{event_type}:{content_id}"
    dedup_key = hashlib.sha256(raw_key.encode()).hexdigest()
    # SET with nx=True is atomic: it succeeds only if the key did not exist.
    # ex=300 expires the key after 5 minutes, matching the dedup window.
    is_new = redis_client.set(dedup_key, "1", ex=300, nx=True)
    return bool(is_new)
```
Reliability & Retry Mechanism
External providers (Twilio, SendGrid) can fail. We need a robust retry strategy.
Architecture for Retries
- Retry Queue: If a worker fails to send, push the message to a "Retry Queue" with a delay.
- Exponential Backoff: Wait 1s, 2s, 4s, 8s... to avoid thundering herd on recovery.
- Dead Letter Queue (DLQ): After Max Retries (e.g., 5), move to DLQ for manual inspection.
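The failure path above can be sketched as follows. Messages are plain dicts, and the retry queue and DLQ are Python lists standing in for real broker queues; the field names (`attempt`, `retry_after_secs`) are illustrative:

```python
import random

MAX_RETRIES = 5       # after this, the message is dead-lettered
BASE_DELAY_SECS = 1   # backoff base: 1s, 2s, 4s, 8s, ...

def next_retry_delay(attempt):
    """Exponential backoff with full jitter to avoid a thundering herd on recovery."""
    return random.uniform(0, BASE_DELAY_SECS * (2 ** attempt))

def handle_failure(message, retry_queue, dead_letter_queue):
    """Requeue a failed message with a delay, or dead-letter it after MAX_RETRIES."""
    attempt = message.get("attempt", 0)
    if attempt >= MAX_RETRIES:
        dead_letter_queue.append(message)  # parked for manual inspection
        return "dead-lettered"
    message["attempt"] = attempt + 1
    message["retry_after_secs"] = next_retry_delay(attempt)
    retry_queue.append(message)
    return "retried"
```

With a real broker the delay would be implemented with delayed delivery (e.g., RabbitMQ dead-letter TTL tricks or a scheduled redelivery), but the bookkeeping is the same.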
Database Schema
```sql
-- User Preferences (Opt-in/out)
CREATE TABLE notification_settings (
    user_id UUID PRIMARY KEY,
    email_enabled BOOLEAN DEFAULT TRUE,
    sms_enabled BOOLEAN DEFAULT FALSE,
    push_enabled BOOLEAN DEFAULT TRUE,
    dnd_start_time TIME, -- Do Not Disturb
    dnd_end_time TIME
);

-- Notification Logs (For audit & history)
CREATE TABLE notification_logs (
    id UUID PRIMARY KEY,
    user_id UUID,
    type VARCHAR(20), -- 'email', 'sms'
    status VARCHAR(20), -- 'sent', 'failed', 'queued'
    content TEXT,
    provider_response JSONB, -- Store external ID
    created_at TIMESTAMP DEFAULT NOW()
);
```
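One subtlety when applying `dnd_start_time` and `dnd_end_time` is that the Do Not Disturb window may wrap past midnight (e.g., 22:00 to 07:00). A sketch of the check a worker might run before sending, assuming the two columns are loaded as Python `time` values:

```python
from datetime import time

def in_dnd_window(now, dnd_start, dnd_end):
    """True if `now` falls inside the Do Not Disturb window.

    Handles overnight windows (e.g., 22:00-07:00). If either bound is
    NULL/None, DND is treated as disabled.
    """
    if dnd_start is None or dnd_end is None:
        return False
    if dnd_start <= dnd_end:
        # Same-day window, e.g. 12:00-14:00
        return dnd_start <= now < dnd_end
    # Overnight window wraps past midnight
    return now >= dnd_start or now < dnd_end
```

Non-critical notifications arriving inside the window would typically be deferred until it ends rather than dropped; CRITICAL messages (OTPs) usually bypass DND.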
Summary
- Use Message Queues to decouple intake from delivery.
- Implement Priority Queues to ensure OTPs aren't blocked by newsletters.
- Use Deduplication logic with Redis to prevent spam.
- Handle Retries with exponential backoff and Dead Letter Queues.
- Respect User Preferences and Rate Limits at the API level.