RPO vs RTO: The Golden Metrics
Every DR plan starts with defining acceptable loss limits.
RPO (Recovery Point Objective)
Max acceptable Data Loss. If RPO=1h, you lose up to 1h of data.
RTO (Recovery Time Objective)
Max acceptable Downtime. If RTO=4h, service must be back in 4h.
Backup Strategies
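Backup frequency is what makes an RPO achievable: with periodic backups, the worst-case loss is one full interval between runs. The following is a minimal sketch of that check, assuming a fixed schedule and a measured restore duration. The function names are illustrative and do not come from any tool.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """A failure just before the next backup loses one full interval of data."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True when the schedule can never lose more data than the RPO allows."""
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(measured_restore_time: timedelta, rto: timedelta) -> bool:
    """True when a measured (not guessed) restore fits inside the RTO."""
    return measured_restore_time <= rto
```

For example, `meets_rpo(timedelta(hours=6), timedelta(hours=1))` is `False`: a 6-hourly backup cannot satisfy RPO=1h no matter how fast the restore is.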
| Type | Description | Storage Cost | Restore Speed |
|---|---|---|---|
| Full Backup | Copy entire dataset. | High | Fastest (single restore, no chain) |
| Incremental | Copy only changes since last backup (full or incremental). | Lowest | Slowest (needs Full + every Incremental since) |
| Differential | Copy changes since last Full backup. | Medium | Moderate (Full + Latest Diff) |
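The restore-speed column follows from how many backups must be applied in sequence. Here is a rough Python sketch of that dependency rule, assuming the backup history is ordered oldest to newest. The `Backup` type and the backup names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    name: str
    kind: str  # "full", "incremental", or "differential"


def restore_chain(history: list[Backup]) -> list[str]:
    """Names of the backups to apply, in order, to reach the newest backup."""
    full_positions = [i for i, b in enumerate(history) if b.kind == "full"]
    if not full_positions:
        raise ValueError("cannot restore without at least one full backup")
    last_full = full_positions[-1]

    # Walk back from the newest backup toward the last full one.
    # An incremental depends on the backup right before it, so keep walking.
    # A differential depends only on the full backup, so the walk stops there.
    needed: list[Backup] = []
    for backup in reversed(history[last_full + 1:]):
        needed.append(backup)
        if backup.kind == "differential":
            break

    chain = [history[last_full]] + list(reversed(needed))
    return [b.name for b in chain]
```

With one full backup followed by three incrementals, all four are needed. With one full followed by three differentials, only the full and the latest differential are.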
Disaster Recovery Patterns
Multi-region architecture strategies are generally driven by the trade-off between cost and RTO.
🔥 Pilot Light
Cost: Low | RTO: Mins/Hours
Database is replicated (active-passive), but App Servers are OFF. We spin them up (via Auto Scaling) only during disaster.
🌤️ Warm Standby
Cost: Medium | RTO: Seconds/Mins
Scaled-down version of App Servers runs in DR region. Can scale up quickly to handle full traffic load.
⚡ Multi-Site Active/Active
Cost: High | RTO: Near Zero
Both regions handle traffic simultaneously. Weighted DNS routing, combined with health checks, shifts traffic to the surviving region if one fails. Requires complex bi-directional DB replication.
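Weighted routing only acts as failover when unhealthy regions are taken out of rotation. The sketch below is a simplified Python model of that idea, not any DNS provider's actual algorithm. The region names are hypothetical.

```python
def effective_weights(weights: dict[str, int], healthy: set[str]) -> dict[str, float]:
    """Share of traffic each region receives once unhealthy regions are excluded."""
    live = {region: w for region, w in weights.items() if region in healthy and w > 0}
    total = sum(live.values())
    if total == 0:
        raise RuntimeError("no healthy region left to receive traffic")
    # Renormalize so the surviving regions absorb the failed region's share.
    return {region: w / total for region, w in live.items()}
```

With weights `{"us-east-1": 50, "eu-west-1": 50}`, losing `us-east-1` leaves `eu-west-1` with the full load. This is why each region in an Active/Active setup needs enough capacity (or fast enough scaling) to carry all traffic alone.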
Point-in-Time Recovery (PITR)
Snapshots alone are not enough. If someone deletes a table at 2:00 PM and your last snapshot was taken at 1:00 PM, restoring it loses an hour of data. With WAL (Write-Ahead Log) archiving, you can replay transactions up to 1:59 PM.
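The idea can be shown with a toy model in Python: start from the snapshot state and re-apply logged changes in order, stopping at the target time. This is a conceptual sketch only. Real WAL records are binary page-level changes, and the table names here are invented.

```python
from datetime import datetime


def replay_until(
    snapshot: set[str],
    wal: list[tuple[datetime, str, str]],
    target: datetime,
) -> set[str]:
    """Apply (timestamp, op, table) records, in time order, up to and including target."""
    tables = set(snapshot)
    for timestamp, op, table in wal:
        if timestamp > target:
            break  # everything after the target time is deliberately not replayed
        if op == "create":
            tables.add(table)
        elif op == "drop":
            tables.discard(table)
    return tables


snapshot_1pm = {"users", "orders"}
wal = [
    (datetime(2026, 1, 15, 13, 20), "create", "invoices"),
    (datetime(2026, 1, 15, 14, 0), "drop", "orders"),  # the accident
]
recovered = replay_until(snapshot_1pm, wal, datetime(2026, 1, 15, 13, 59))
```

Here `recovered` contains `users`, `orders`, and `invoices`: the work done after the snapshot is kept, and the 2:00 PM drop is never applied.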
Postgres Implementation
# postgresql.conf
archive_mode = on
archive_command = 'test ! -f /mnt/arch/%f && cp %p /mnt/arch/%f'
# Better: Use WAL-G or Barman for S3 uploads
# archive_command = 'wal-g wal-push %p'
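The `test ! -f ... && cp` one-liner encodes two rules: never overwrite an already archived segment, and return a non-zero exit code on failure so Postgres retries later. Here is a minimal Python sketch of the same contract, assuming a local archive directory. The function name is illustrative, and durability steps such as fsync are left out.

```python
import shutil
from pathlib import Path


def archive_wal(wal_path: str, archive_dir: str) -> int:
    """Copy one WAL segment into archive_dir and return a shell-style exit code."""
    source = Path(wal_path)
    destination = Path(archive_dir) / source.name
    if destination.exists():
        return 1  # refuse to overwrite: an existing segment may be the only good copy
    # Copy under a temporary name first so a crash never leaves a
    # half-written file under the real segment name.
    temporary = destination.with_name(destination.name + ".tmp")
    shutil.copyfile(source, temporary)
    temporary.replace(destination)  # atomic rename on the same filesystem
    return 0
```

Wired into Postgres, this would be called with `%p` (path of the segment) as `wal_path`, and the process would exit with the returned code. For production, the tools named above (WAL-G, Barman) remain the better choice.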
Restore Process
- Stop DB Service.
- Restore Base Backup (Snapshot).
- Configure `restore_command` to fetch WAL files from S3.
- Set `recovery_target_time = '2026-01-15 13:59:00'`.
- Create an empty `recovery.signal` file in the data directory (PostgreSQL 12+), then start the DB. Postgres replays WALs until it hits the target time.
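The configuration part of these steps can be scripted. The sketch below uses Python and rests on several assumptions: PostgreSQL 12 or newer, a `postgresql.conf` that lives inside the data directory, and `recovery_target_action = 'promote'` as the desired behavior once the target is reached. The helper names are hypothetical.

```python
from pathlib import Path


def quote_conf(value: str) -> str:
    """Quote a string for postgresql.conf (embedded single quotes are doubled)."""
    return "'" + value.replace("'", "''") + "'"


def render_recovery_settings(restore_command: str, target_time: str) -> str:
    """Build the PITR settings as postgresql.conf lines."""
    lines = [
        f"restore_command = {quote_conf(restore_command)}",
        f"recovery_target_time = {quote_conf(target_time)}",
        # Assumption: open the database for writes once the target is reached.
        "recovery_target_action = 'promote'",
    ]
    return "\n".join(lines) + "\n"


def prepare_recovery(data_dir: str, restore_command: str, target_time: str) -> None:
    """Append the PITR settings and create recovery.signal in the data directory."""
    pgdata = Path(data_dir)
    with open(pgdata / "postgresql.conf", "a", encoding="utf-8") as conf:
        conf.write("\n# PITR settings (remove once recovery has completed)\n")
        conf.write(render_recovery_settings(restore_command, target_time))
    # The presence of this empty file tells Postgres to start in recovery mode.
    (pgdata / "recovery.signal").touch()
```

A call might look like `prepare_recovery("/var/lib/postgresql/data", "wal-g wal-fetch %f %p", "2026-01-15 13:59:00")`, run after the base backup is restored and before the service starts. The data directory path is an example only.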
Summary
- RPO/RTO: Define these first. They dictate the budget (e.g., Active-Active is expensive).
- Test It: A backup is worthless if the restore process fails. Run DR drills annually.
- 3-2-1 Rule: 3 copies of data, 2 different media types, 1 offsite (Cloud/DR Region).
- WAL Archiving: Essential for PITR. It shrinks RPO from the snapshot interval (hours) to as little as seconds.
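The 3-2-1 rule is easy to check mechanically against a backup inventory. Here is a small Python sketch, assuming the inventory is maintained by hand and counts the primary copy as one of the three. The `Copy` type and its fields are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    location: str  # e.g. "primary-db", "nas", "s3-dr-region"
    media: str     # e.g. "ssd", "disk", "object-storage"
    offsite: bool  # True if stored outside the primary site/region


def satisfies_3_2_1(copies: list[Copy]) -> bool:
    """3 copies of the data, on 2 different media types, with at least 1 offsite."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite
```

An inventory of primary database, local NAS, and a bucket in the DR region passes. Three copies in the same data center fail the offsite condition even when they sit on different media.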