# Backup Strategy

## Database Backups

### Automated Backups

- **Frequency**: Daily at 3 AM UTC
- **Retention**: Daily backups for 7 days, weekly backups for 4 weeks, monthly backups for 12 months
- **Storage**: Encrypted S3 bucket in a separate region
- **Type**: Full backup plus WAL archiving for point-in-time recovery
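
The retention tiers above could be evaluated along the following lines. This is a minimal sketch, not the pruning job itself: the function names are hypothetical, and it assumes that "weekly" means the newest backup in each ISO week and "monthly" means the newest backup in each calendar month, counting only weeks and months that actually contain a backup.

```python
from datetime import date, timedelta


def newest_per_bucket(newest_first, bucket_of, limit):
    """Pick the newest date from each of the `limit` most recent buckets."""
    picked = {}
    for d in newest_first:
        key = bucket_of(d)
        if key in picked:
            continue  # a newer backup already represents this bucket
        if len(picked) == limit:
            break  # dates are sorted, so every remaining bucket is older
        picked[key] = d
    return set(picked.values())


def backups_to_keep(backup_dates, today, daily=7, weekly=4, monthly=12):
    """Return the backup dates to retain; anything else may be pruned."""
    newest_first = sorted(set(backup_dates), reverse=True)
    # Daily tier: every backup from the last `daily` days, today included.
    keep = {d for d in newest_first if d > today - timedelta(days=daily)}
    # Weekly tier (assumed): newest backup in each recent ISO week.
    keep |= newest_per_bucket(
        newest_first, lambda d: tuple(d.isocalendar())[:2], weekly
    )
    # Monthly tier (assumed): newest backup in each recent calendar month.
    keep |= newest_per_bucket(newest_first, lambda d: (d.year, d.month), monthly)
    return keep


# Example: 400 consecutive daily backups ending on a hypothetical "today".
today = date(2024, 6, 30)
history = [today - timedelta(days=n) for n in range(400)]
keep = backups_to_keep(history, today)
print(len(history) - len(keep), "backups are eligible for pruning")
```

Returning the set to keep, rather than the set to delete, means an unexpected date is pruned only if no tier claims it, which makes the policy easier to audit.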

### Point-in-Time Recovery

- **RPO** (recovery point objective): < 15 minutes
- **RTO** (recovery time objective): < 1 hour
- **Method**: Restore the latest full backup, then replay archived WAL up to a specific timestamp
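
As a rough illustration of the method above, recovery to a timestamp starts from the newest full backup completed at or before the target, and archived WAL is then replayed forward from there. The sketch below only covers choosing that base backup; the function name and the sample timestamps are hypothetical.

```python
from datetime import datetime, timezone


def choose_base_backup(backup_times, target):
    """Return the newest full backup completed at or before `target`."""
    candidates = [t for t in backup_times if t <= target]
    if not candidates:
        raise ValueError("no full backup precedes the recovery target")
    return max(candidates)


# Daily full backups at 03:00 UTC (sample data).
backups = [datetime(2024, 6, day, 3, 0, tzinfo=timezone.utc) for day in (1, 2, 3)]
target = datetime(2024, 6, 2, 14, 30, tzinfo=timezone.utc)

base = choose_base_backup(backups, target)
print("restore full backup from", base.isoformat())
print("then replay archived WAL up to", target.isoformat())
```

A backup taken after the target cannot be used, since WAL replay only moves forward in time.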

### Backup Verification

- Monthly restore test to the staging environment
- Automated integrity checks on backup files
- Alert within 5 minutes of a backup failure
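
One common form of automated integrity check is to compare each backup file against a checksum recorded when the backup was written. The sketch below assumes SHA-256 and a recorded hex digest; the document does not specify which check is actually used, and the function names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(path, expected_sha256):
    """Return True if the file's SHA-256 matches the recorded hex digest."""
    return sha256_of(path) == expected_sha256.lower()
```

A checksum match only shows the file was not altered after it was written. It does not prove the backup can be restored, which is why the monthly restore test is still needed.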

## Redis Backups

### Configuration

- **RDB snapshots**: Every 6 hours
- **AOF persistence**: Enabled for point-in-time recovery
- **Storage**: Backed up to S3 daily

### Recovery

- Restore from the latest RDB snapshot
- Replay AOF for recent changes
- Test data integrity after restore

## Backup Monitoring

### Alerts

- Backup failure → Immediate PagerDuty alert
- Backup size anomaly → Slack notification
- Restore test failure → Jira ticket creation
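
The routing above can be pictured as a small lookup table. The sketch below also includes one possible definition of a "size anomaly": a backup whose size differs from the median of recent backups by more than 50%. That threshold, the event keys, and the function names are all assumptions for illustration; the document does not define them.

```python
from statistics import median

# Event type -> destination, mirroring the list above.
ROUTES = {
    "backup_failure": "pagerduty",
    "backup_size_anomaly": "slack",
    "restore_test_failure": "jira",
}


def is_size_anomaly(size_bytes, recent_sizes, tolerance=0.5):
    """True if the size deviates from the recent median by more than `tolerance`."""
    if not recent_sizes:
        return False  # no baseline yet, so nothing to compare against
    baseline = median(recent_sizes)
    if baseline == 0:
        return size_bytes != 0
    return abs(size_bytes - baseline) / baseline > tolerance


def alerts_for(succeeded, size_bytes, recent_sizes):
    """Return the destinations to notify for one backup run."""
    if not succeeded:
        return [ROUTES["backup_failure"]]
    if is_size_anomaly(size_bytes, recent_sizes):
        return [ROUTES["backup_size_anomaly"]]
    return []


print(alerts_for(False, 0, [100, 110, 90]))
print(alerts_for(True, 10, [100, 110, 90]))
```

Using the median rather than the mean keeps one unusually large or small backup from shifting the baseline.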

### Metrics

- Backup duration
- Backup size
- Restore time
- Data loss window (RPO)
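
Two of these metrics can be derived directly from timestamps. The sketch below assumes each backup run records its start and finish time, and that the data loss window is measured as the time since the last successfully archived WAL segment. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class BackupRun:
    started_at: datetime
    finished_at: datetime
    size_bytes: int

    @property
    def duration(self) -> timedelta:
        return self.finished_at - self.started_at


def data_loss_window(last_archived_at, now):
    """Worst-case data loss if the primary were lost at `now`."""
    return now - last_archived_at


def within_rpo(last_archived_at, now, rpo=timedelta(minutes=15)):
    """True while the data loss window is under the RPO target."""
    return data_loss_window(last_archived_at, now) < rpo


run = BackupRun(datetime(2024, 6, 1, 3, 0), datetime(2024, 6, 1, 3, 42), 10_000_000)
print("backup duration:", run.duration)
print("within RPO:", within_rpo(datetime(2024, 6, 1, 12, 0), datetime(2024, 6, 1, 12, 10)))
```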

## Emergency Procedures

### Complete Data Loss

1. Activate disaster recovery plan
2. Restore from latest backup
3. Replay WAL/AOF for recent changes
4. Verify data integrity
5. Resume operations
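
The order of these steps matters: operations should resume only after verification passes. A minimal sketch of that gating, with placeholder actions standing in for the real restore tooling (the `run_runbook` helper is hypothetical):

```python
def run_runbook(steps):
    """Run (name, action) pairs in order; stop at the first action returning False.

    Returns (completed_step_names, failed_step_name_or_None).
    """
    completed = []
    for name, action in steps:
        if not action():
            return completed, name
        completed.append(name)
    return completed, None


# Placeholder actions; real ones would call the restore tooling.
steps = [
    ("Activate disaster recovery plan", lambda: True),
    ("Restore from latest backup", lambda: True),
    ("Replay WAL/AOF for recent changes", lambda: True),
    ("Verify data integrity", lambda: False),  # simulate a failed check
    ("Resume operations", lambda: True),
]

completed, failed = run_runbook(steps)
print("completed:", completed)
print("stopped at:", failed)
```

In this simulated run the failed integrity check halts the sequence, so "Resume operations" is never reached.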

### Partial Data Corruption

1. Identify affected data
2. Restore the affected tables from backup
3. Verify data consistency
4. Resume operations