Backup & Disaster Recovery | RPO, RTO & Failover Patterns

RPO vs RTO: The Golden Metrics

Every DR plan starts with defining acceptable loss limits.

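[Figure: timeline running Last Backup → Disaster → System Restored. RPO (data loss) spans the last backup to the disaster; RTO (downtime) spans the disaster to the point where the system is recovered.]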

RPO (Recovery Point Objective)

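Maximum acceptable data loss, measured as time. If RPO = 1h, a disaster may cost you up to the last hour of writes, so backups or replication must run at least that often.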

RTO (Recovery Time Objective)

Maximum acceptable downtime. If RTO = 4h, service must be back within 4 hours of the disaster.
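As a rough illustration of the two definitions, the sketch below checks one incident against an RPO and an RTO. The timestamps and the helper names (`data_loss_window`, `downtime`, `meets_objectives`) are made up for the example and are not part of any tool.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, disaster: datetime) -> timedelta:
    """Everything written between the last backup and the disaster is lost."""
    return disaster - last_backup


def downtime(disaster: datetime, restored: datetime) -> timedelta:
    """Time during which the service was unavailable."""
    return restored - disaster


def meets_objectives(last_backup, disaster, restored, rpo, rto):
    """Compare the measured loss and downtime against the stated objectives."""
    return {
        "rpo_met": data_loss_window(last_backup, disaster) <= rpo,
        "rto_met": downtime(disaster, restored) <= rto,
    }


# Hypothetical incident: 45 minutes of data lost, 4h45m of downtime.
last_backup = datetime(2026, 1, 15, 13, 0)
disaster = datetime(2026, 1, 15, 13, 45)
restored = datetime(2026, 1, 15, 18, 30)

result = meets_objectives(
    last_backup, disaster, restored,
    rpo=timedelta(hours=1), rto=timedelta(hours=4),
)
print(result)  # {'rpo_met': True, 'rto_met': False}
```

Here the hourly backup keeps the loss inside the 1h RPO, but the restore overruns the 4h RTO, which is the kind of gap a DR drill is meant to expose.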

Backup Strategies

Type	Description	Storage Cost	Restore Speed
Full Backup	Copy the entire dataset.	High	Fastest (single restore, no chain to replay)
Incremental	Copy only changes since the last backup (full or incremental).	Lowest	Slowest (needs the Full + every Incremental taken since)
Differential	Copy changes since the last Full backup.	Medium	Moderate (Full + latest Differential)
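The restore-speed column follows from how many backups have to be applied. A minimal sketch of that reasoning, assuming a simple ordered backup history; the `Backup` class and `restore_chain` function are hypothetical names for this example only:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    name: str
    kind: str  # "full", "incremental", or "differential"


def restore_chain(history: list[Backup]) -> list[str]:
    """Return the backups needed to restore the latest state, in apply order."""
    full_positions = [i for i, b in enumerate(history) if b.kind == "full"]
    if not full_positions:
        raise ValueError("cannot restore without a full backup")
    start = full_positions[-1]
    chain = [history[start]]
    tail = history[start + 1:]

    # A differential already contains every change since the full backup,
    # so only the latest one is needed and anything before it can be skipped.
    diff_positions = [i for i, b in enumerate(tail) if b.kind == "differential"]
    if diff_positions:
        latest_diff = diff_positions[-1]
        chain.append(tail[latest_diff])
        tail = tail[latest_diff + 1:]

    # Whatever remains are incrementals, and each one must be applied in order.
    chain.extend(tail)
    return [b.name for b in chain]


incremental_week = [
    Backup("sun-full", "full"),
    Backup("mon-inc", "incremental"),
    Backup("tue-inc", "incremental"),
    Backup("wed-inc", "incremental"),
]
differential_week = [
    Backup("sun-full", "full"),
    Backup("mon-diff", "differential"),
    Backup("tue-diff", "differential"),
    Backup("wed-diff", "differential"),
]

print(restore_chain(incremental_week))   # ['sun-full', 'mon-inc', 'tue-inc', 'wed-inc']
print(restore_chain(differential_week))  # ['sun-full', 'wed-diff']
```

The incremental chain grows with every backup taken since the last full one, while the differential chain never exceeds two items.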

Disaster Recovery Patterns

Multi-region architecture strategies are generally driven by the trade-off between cost and RTO.

🔥 Pilot Light

Cost: Low | RTO: Mins/Hours

Database is replicated (active-passive), but App Servers are OFF. We spin them up (via Auto Scaling) only during disaster.

🌤️ Warm Standby

Cost: Medium | RTO: Seconds/Mins

Scaled-down version of App Servers runs in DR region. Can scale up quickly to handle full traffic load.

⚡ Multi-Site Active/Active

Cost: High | RTO: Near Zero

Both regions handle traffic simultaneously. DNS Weighted Routing shifts traffic if one region fails. Requires complex bi-directional DB replication.
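To make the traffic shift concrete, here is a toy model of weighted routing combined with health checks. It is only a sketch of the idea, not the behavior of any specific DNS provider; the region names and the `effective_shares` function are assumptions for illustration.

```python
def effective_shares(weights: dict[str, int], healthy: dict[str, bool]) -> dict[str, float]:
    """Split traffic across healthy regions in proportion to their weights."""
    live = {
        region: weight
        for region, weight in weights.items()
        if healthy.get(region, False) and weight > 0
    }
    total = sum(live.values())
    if total == 0:
        raise RuntimeError("no healthy region left to serve traffic")
    # Unhealthy regions receive a share of 0.0; the rest is renormalized.
    return {region: live.get(region, 0) / total for region in weights}


# Hypothetical two-region active/active setup with equal weights.
weights = {"us-east-1": 50, "eu-west-1": 50}

normal = effective_shares(weights, {"us-east-1": True, "eu-west-1": True})
failover = effective_shares(weights, {"us-east-1": False, "eu-west-1": True})

print(normal)    # {'us-east-1': 0.5, 'eu-west-1': 0.5}
print(failover)  # {'us-east-1': 0.0, 'eu-west-1': 1.0}
```

The surviving region absorbs 100% of the traffic, which is why each region in an active/active design needs enough headroom to carry the full load on its own.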

Point-in-Time Recovery (PITR)

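Snapshots are not enough. If someone deletes a table at 2:00 PM and your last snapshot was taken at 1:00 PM, restoring the snapshot loses an hour of data. With WAL (Write-Ahead Log) archiving, you can restore the snapshot and then replay transactions up to 1:59 PM, just before the mistake.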
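The idea can be sketched with a toy log replay: start from the snapshot, then apply logged changes in order until the target time. This is a simplified model, not how Postgres stores or replays WAL; the record layout and the `recover` function are assumptions for illustration.

```python
from datetime import datetime

# Toy "WAL": (commit time, operation, key, value), written after the 1:00 PM snapshot.
wal = [
    (datetime(2026, 1, 15, 13, 10), "put", "order:1", "paid"),
    (datetime(2026, 1, 15, 13, 40), "put", "order:2", "shipped"),
    (datetime(2026, 1, 15, 14, 0), "drop_all", None, None),  # the accident
]


def recover(snapshot: dict, wal: list, target: datetime) -> dict:
    """Rebuild state from a snapshot by replaying log records up to `target`."""
    state = dict(snapshot)
    for ts, op, key, value in sorted(wal, key=lambda record: record[0]):
        if ts > target:
            break  # stop replaying before anything newer than the target
        if op == "put":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
        elif op == "drop_all":
            state.clear()
    return state


snapshot_1pm = {"order:0": "paid"}

recovered = recover(snapshot_1pm, wal, target=datetime(2026, 1, 15, 13, 59))
replayed_fully = recover(snapshot_1pm, wal, target=datetime(2026, 1, 15, 14, 30))

print(recovered)       # {'order:0': 'paid', 'order:1': 'paid', 'order:2': 'shipped'}
print(replayed_fully)  # {}
```

Stopping at 1:59 PM keeps both orders written after the snapshot, while replaying past 2:00 PM would faithfully reproduce the deletion.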

Postgres Implementation

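# postgresql.conf
wal_level = replica        # 'replica' or higher is required for WAL archiving
archive_mode = on          # changing this requires a server restart
# %p = path of the WAL file to archive, %f = its file name only
archive_command = 'test ! -f /mnt/arch/%f && cp %p /mnt/arch/%f'
# Better: Use WAL-G or Barman for S3 uploads
# archive_command = 'wal-g wal-push %p'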

Restore Process

Stop DB Service.
Restore Base Backup (Snapshot).
Configure `restore_command` to fetch WAL files from S3.
Set `recovery_target_time = '2026-01-15 13:59:00'`.
Create an empty `recovery.signal` file in the data directory (PostgreSQL 12+), then start DB. Postgres replays WALs until it hits the target time.
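The configuration part of these steps could be scripted roughly as follows. This is a sketch under several assumptions: PostgreSQL 12 or later, `postgresql.conf` living in the data directory, WAL-G as the archive tool (its `wal-fetch` command is used as the `restore_command`), and a temporary directory standing in for the real data directory. The helper names are hypothetical.

```python
import tempfile
from pathlib import Path


def recovery_settings(target_time: str, restore_command: str) -> str:
    """Render the two recovery settings as postgresql.conf lines."""
    # Inside a quoted postgresql.conf value, a single quote is written as two.
    escaped = restore_command.replace("'", "''")
    return (
        f"restore_command = '{escaped}'\n"
        f"recovery_target_time = '{target_time}'\n"
    )


def prepare_recovery(data_dir: Path, target_time: str, restore_command: str) -> None:
    """Append recovery settings and request targeted recovery on next start."""
    with (data_dir / "postgresql.conf").open("a") as conf:
        conf.write("\n# Point-in-time recovery settings\n")
        conf.write(recovery_settings(target_time, restore_command))
    # An empty recovery.signal file puts the server into recovery at startup.
    (data_dir / "recovery.signal").touch()


# Demo against a throwaway directory instead of a real data directory.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    prepare_recovery(
        data_dir,
        target_time="2026-01-15 13:59:00",
        restore_command="wal-g wal-fetch %f %p",
    )
    conf_text = (data_dir / "postgresql.conf").read_text()
    signal_exists = (data_dir / "recovery.signal").exists()

print(conf_text)
```

Stopping the service, restoring the base backup, and starting the database again are left out on purpose, since they depend on how the server is installed and managed.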

Summary

RPO/RTO: Define these first. They dictate the budget (e.g., Active-Active is expensive).
Test It: A backup is worthless if the restore process fails. Run DR drills annually.
3-2-1 Rule: 3 copies of data, 2 different media types, 1 offsite (Cloud/DR Region).
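WAL Archiving: Essential for getting RPO below the snapshot interval. How low it goes depends on how quickly WAL is shipped; reaching seconds generally requires near-continuous shipping (for example, a low `archive_timeout` or streaming replication).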
