Resiliency Engineering

Backup & Disaster Recovery (DR)

Prepare for the worst: Mastering RPO vs RTO, Multi-Site Failover strategies, and Point-in-Time Recovery.

RPO vs RTO: The Golden Metrics

Every DR plan starts with defining acceptable loss limits.

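[Timeline figure: Last Backup → 💥 Disaster → System Restored. RPO (data loss) spans Last Backup to Disaster; RTO (downtime) spans Disaster to System Restored. The system operates normally before the disaster and is recovered after the restore.]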
RPO (Recovery Point Objective)
Maximum acceptable data loss, measured as time. If RPO = 1h, backups must be frequent enough that you lose at most 1 hour of data.
RTO (Recovery Time Objective)
Maximum acceptable downtime. If RTO = 4h, service must be back within 4 hours of the disaster.
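A minimal sketch of how the two metrics fall out of three timestamps, and how they compare against the objectives. The incident times below are hypothetical and chosen only for illustration; the function names are not from any library.

```python
from datetime import datetime, timedelta

def measure_incident(last_backup, disaster, restored):
    """Return (data_loss, downtime) for a single incident."""
    data_loss = disaster - last_backup  # everything written after the last backup is gone
    downtime = restored - disaster      # how long the service was unavailable
    return data_loss, downtime

def meets_objectives(data_loss, downtime, rpo, rto):
    """True only if both the RPO and the RTO were respected."""
    return data_loss <= rpo and downtime <= rto

# Hypothetical incident (illustrative values only).
last_backup = datetime(2026, 1, 15, 13, 0)
disaster = datetime(2026, 1, 15, 13, 45)
restored = datetime(2026, 1, 15, 16, 30)

data_loss, downtime = measure_incident(last_backup, disaster, restored)
ok = meets_objectives(data_loss, downtime,
                      rpo=timedelta(hours=1), rto=timedelta(hours=4))
print(f"data loss: {data_loss}, downtime: {downtime}, within objectives: {ok}")
# prints "data loss: 0:45:00, downtime: 2:45:00, within objectives: True"
```

Note that the RPO constrains the backup interval, while the RTO constrains how fast the restore procedure itself must run.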

Backup Strategies

Type | Description | Storage Cost | Restore Speed
Full Backup | Copy the entire dataset. | High | Fastest (a single backup set to restore)
Incremental | Copy only changes since the last backup of any type (full or incremental). | Lowest | Slowest (needs the full backup plus every incremental since, applied in order)
Differential | Copy all changes since the last full backup. | Medium | Moderate (needs the full backup plus the latest differential)
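The difference in restore speed comes down to how many backup sets must be applied. The sketch below works out the restore chain for a schedule; it is a simplified model (the function name and the weekday labels are made up for illustration) and it only considers schedules that use one non-full type consistently.

```python
def restore_chain(backups):
    """Return the backups needed to restore to the most recent one.

    `backups` is a chronological list of (label, kind) tuples, where kind is
    "full", "incremental", or "differential". Raises ValueError if the list
    contains no full backup.
    """
    # Every restore starts from the most recent full backup.
    last_full = max(i for i, (_, kind) in enumerate(backups) if kind == "full")
    chain = [backups[last_full]]
    for backup in backups[last_full + 1:]:
        if backup[1] == "differential":
            # A differential holds all changes since the full backup,
            # so only the latest one is needed on top of the full.
            chain = [backups[last_full], backup]
        else:
            # An incremental only holds changes since the previous backup,
            # so every one of them must be applied in order.
            chain.append(backup)
    return chain

incremental_week = [("Sun", "full"), ("Mon", "incremental"),
                    ("Tue", "incremental"), ("Wed", "incremental")]
differential_week = [("Sun", "full"), ("Mon", "differential"),
                     ("Tue", "differential"), ("Wed", "differential")]

print([label for label, _ in restore_chain(incremental_week)])
# prints ['Sun', 'Mon', 'Tue', 'Wed']
print([label for label, _ in restore_chain(differential_week)])
# prints ['Sun', 'Wed']
```

The incremental chain grows every day until the next full backup, which is also why a single corrupt incremental can break every restore point after it.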

Disaster Recovery Patterns

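Multi-region architecture strategies are generally driven by the trade-off between cost and RTO: the faster you need to recover, the more infrastructure you keep running in the second region.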

🔥 Pilot Light

Cost: Low | RTO: Mins/Hours

The database is replicated to the DR region (active-passive), but the app servers there are OFF. We spin them up (via Auto Scaling) only during a disaster.

🌤️ Warm Standby

Cost: Medium | RTO: Seconds/Mins

A scaled-down fleet of app servers runs continuously in the DR region. It can scale up quickly to handle the full traffic load.

⚡ Multi-Site Active/Active

Cost: High | RTO: Near Zero

Both regions handle traffic simultaneously. DNS weighted routing, combined with health checks, shifts traffic away from a region that fails. Requires complex bi-directional DB replication.
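To make the active/active routing behavior concrete, here is a small simulation of weighted routing that skips unhealthy regions. It is not the API of any DNS provider; the region names, weights, and the `pick_region` function are assumptions made up for this illustration.

```python
import random
from collections import Counter

def pick_region(weights, healthy, rng=random):
    """Choose a region in proportion to its weight, skipping unhealthy ones."""
    candidates = {region: weight for region, weight in weights.items()
                  if healthy.get(region, False) and weight > 0}
    if not candidates:
        raise RuntimeError("no healthy region available")
    regions = list(candidates)
    return rng.choices(regions, weights=[candidates[r] for r in regions], k=1)[0]

# Hypothetical 50/50 split between two regions.
weights = {"us-east": 50, "eu-west": 50}
rng = random.Random(42)  # seeded so repeated runs behave the same

# Normal operation: both regions receive a share of the requests.
normal = Counter(pick_region(weights, {"us-east": True, "eu-west": True}, rng)
                 for _ in range(1000))
print("both healthy:", dict(normal))

# us-east fails its health check: every request lands in eu-west.
failover = Counter(pick_region(weights, {"us-east": False, "eu-west": True}, rng)
                   for _ in range(1000))
print("us-east down:", dict(failover))
```

In a real deployment the surviving region must have enough capacity to absorb the full load, and DNS TTLs bound how quickly clients notice the change.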

Point-in-Time Recovery (PITR)

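Snapshots are not enough. If someone deletes a table at 2:00 PM and your last snapshot was taken at 1:00 PM, restoring the snapshot loses one hour of data. With WAL (Write-Ahead Log) archiving, you can restore the snapshot and then replay transactions up to 1:59 PM, just before the mistake.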
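The idea can be illustrated with a toy model: a snapshot is a copy of the state at one moment, and the log is an ordered list of changes made afterwards. This is only a conceptual sketch, not how Postgres stores WAL; the table name, rows, and record format are invented for the example.

```python
from datetime import datetime

def replay(base_state, wal, target_time):
    """Apply log records to a copy of base_state, stopping at target_time.

    `wal` must be in chronological order; each record is
    (timestamp, operation, table, row).
    """
    state = {table: list(rows) for table, rows in base_state.items()}
    for ts, op, table, row in wal:
        if ts > target_time:
            break  # stop before the first record past the target
        if op == "insert":
            state.setdefault(table, []).append(row)
        elif op == "drop":
            state.pop(table, None)
    return state

day = (2026, 1, 15)
snapshot = {"orders": [1, 2]}  # state captured at 1:00 PM
wal = [
    (datetime(*day, 13, 20), "insert", "orders", 3),
    (datetime(*day, 13, 50), "insert", "orders", 4),
    (datetime(*day, 14, 0), "drop", "orders", None),  # the accidental delete
]

print(snapshot)                                    # snapshot only: rows 3 and 4 are lost
print(replay(snapshot, wal, datetime(*day, 13, 59)))
# prints {'orders': [1, 2, 3, 4]}
print(replay(snapshot, wal, datetime(*day, 14, 30)))
# prints {} because the drop is replayed too
```

Picking the target time is the critical decision: replaying too far reproduces the mistake, stopping too early discards good transactions.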

Postgres Implementation
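# postgresql.conf
wal_level = replica        # 'replica' or higher is required for WAL archiving
archive_mode = on
# %p = path of the WAL file to archive, %f = its file name only.
# The test guards against overwriting an already-archived segment.
archive_command = 'test ! -f /mnt/arch/%f && cp %p /mnt/arch/%f'
# Better: Use WAL-G or Barman for S3 uploads
# archive_command = 'wal-g wal-push %p'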
Restore Process
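  1. Stop the DB service.
  2. Restore the base backup (snapshot) into the data directory.
  3. Configure `restore_command` to fetch WAL files from S3.
  4. Set `recovery_target_time = '2026-01-15 13:59:00'`.
  5. Create an empty `recovery.signal` file in the data directory (PostgreSQL 12+), then start the DB. Postgres replays WAL until it reaches the target time.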
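Steps 3 to 5 can be scripted. The sketch below assumes PostgreSQL 12 or later (where recovery settings live in `postgresql.conf` and an empty `recovery.signal` file triggers targeted recovery) and assumes WAL-G is already configured through its environment, so that `wal-g wal-fetch` can reach the archive. The helper name and the use of a temporary directory in place of a real data directory are illustrative only.

```python
import tempfile
from pathlib import Path

def write_recovery_config(data_dir, target_time,
                          restore_command="wal-g wal-fetch %f %p"):
    """Append PITR settings to postgresql.conf and create recovery.signal."""
    data_dir = Path(data_dir)
    settings = (
        "\n# Point-in-time recovery settings\n"
        f"restore_command = '{restore_command}'\n"
        f"recovery_target_time = '{target_time}'\n"
        # Promote to a normal read-write server once the target is reached.
        "recovery_target_action = 'promote'\n"
    )
    conf = data_dir / "postgresql.conf"
    with conf.open("a") as fh:
        fh.write(settings)
    # An empty recovery.signal tells PostgreSQL 12+ to start in recovery mode.
    (data_dir / "recovery.signal").touch()
    return conf

# Demo against a throwaway directory standing in for the restored data directory.
with tempfile.TemporaryDirectory() as fake_data_dir:
    conf_path = write_recovery_config(fake_data_dir, "2026-01-15 13:59:00")
    print(conf_path.read_text())
    print("recovery.signal present:",
          (Path(fake_data_dir) / "recovery.signal").exists())
```

Run something like this only after the DB service is stopped and the base backup has been restored; a `recovery_target_time` without a time zone is interpreted in the server's configured time zone.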

Summary

  • RPO/RTO: Define these first. They dictate the budget (e.g., Active-Active is expensive).
  • Test It: A backup is worthless if the restore process fails. Run DR drills annually.
  • 3-2-1 Rule: 3 copies of data, 2 different media types, 1 offsite (Cloud/DR Region).
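  • WAL Archiving: Essential for PITR. It shrinks RPO from the snapshot interval (hours) to the WAL shipping interval (minutes, or seconds when segments are shipped frequently).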
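The 3-2-1 rule is easy to check mechanically. A minimal sketch, assuming the production copy counts as one of the three (a common reading of the rule); the class, field names, and inventory entries are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str      # e.g. "disk", "tape", "object-storage"
    offsite: bool   # True if stored outside the primary site

def satisfies_3_2_1(copies):
    """True if there are 3+ copies, on 2+ media types, with 1+ offsite."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )

# Hypothetical inventory for illustration.
inventory = [
    BackupCopy("primary database volume", "disk", offsite=False),
    BackupCopy("local nightly snapshot", "disk", offsite=False),
    BackupCopy("DR-region bucket", "object-storage", offsite=True),
]
print(satisfies_3_2_1(inventory))      # prints True
print(satisfies_3_2_1(inventory[:2]))  # prints False: two copies, one media type, none offsite
```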