# ShieldAI Rollback Runbook
> **Last updated:** 2026-05-12
> **Owner:** Senior Engineer
> **Parent:** [FRE-4574](/FRE/issues/FRE-4574) ShieldAI Production Infrastructure & CI/CD Pipeline
> **Reviewed by:** Code Reviewer (FRE-4808) on 2026-05-12
---
## Table of Contents
1. [Overview](#1-overview)
2. [Rollback Strategies](#2-rollback-strategies)
3. [ECS Service Rollback (AWS)](#3-ecs-service-rollback-aws)
4. [Docker Compose Rollback (Local / Staging)](#4-docker-compose-rollback-local--staging)
5. [Database Migration Rollback](#5-database-migration-rollback)
6. [Automated Rollback Triggers](#6-automated-rollback-triggers)
7. [Blue-Green Deployment Rollback](#7-blue-green-deployment-rollback)
8. [Rollback Decision Tree](#8-rollback-decision-tree)
9. [Post-Rollback Verification](#9-post-rollback-verification)
10. [Testing Checklist](#10-testing-checklist)
11. [Runbook: Emergency Rollback](#11-runbook-emergency-rollback)
---
## 1. Overview
ShieldAI runs four services (api, darkwatch, spamshield, voiceprint) on AWS ECS Fargate behind an Application Load Balancer. Each service has independent deployment, health checks, and rollback capability.
**Rollback types:**
| Type | Trigger | Scope | Automation |
|------|---------|-------|------------|
| **ECS Service Rollback** | Health check failure, manual | Single or all services | ✅ CI/CD + manual script |
| **Docker Compose Rollback** | Manual (local/staging) | All services | ✅ Scripted |
| **Database Migration Rollback** | Manual | Schema changes | ⚠️ Semi-manual |
| **Blue-Green Rollback** | Manual or automated | Full environment | ✅ CI/CD |
| **RDS Point-in-Time Restore** | Manual (disaster) | Full database | ⚠️ Semi-manual |
---
## 2. Rollback Strategies
### 2.1 ECS Service-Level Rollback
Each ECS service maintains a history of task definitions. Rolling back reverts to the **previous successfully deployed task definition**.
**Prerequisites:**
- AWS CLI configured with credentials for the target environment
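- IAM permissions: `ecs:UpdateService` and `ecs:DescribeServices` (the `aws ecs wait services-stable` waiter polls `DescribeServices`; there is no separate `ecs:WaitServicesStable` action)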
### 2.2 Blue-Green Rollback
The CI/CD pipeline deploys new images to existing ECS services. If health checks fail after deployment, the `rollback` job in the deploy workflow automatically reverts all four services to their previous task definition revision.
**Pipeline flow:**
```
build-and-push → deploy-ecs → health-check → [PASS: done | FAIL: rollback]
```
### 2.3 Database Migration Rollback
ShieldAI uses Drizzle ORM for database migrations. Each migration is versioned and stored in `src/db/migrations/`. Drizzle Kit does not generate down migrations, so rolling back means applying a reverse change and then re-running the migration set (see §5.1).
---
## 3. ECS Service Rollback (AWS)
### 3.1 Automated (CI/CD Pipeline)
The deploy workflow (`.github/workflows/deploy.yml`) includes a `rollback` job that triggers on health check failure:
```yaml
rollback:
  if: failure() && needs.health-check.result == 'failure'
  # Rolls back all 4 services to previous task definition
```
**When it runs:**
- Post-deploy health check fails (HTTP 200 not received from `/health`)
- Runs after `deploy-ecs` and `health-check` jobs
- Rolls back all four services: api, darkwatch, spamshield, voiceprint
**How to verify:**
1. Navigate to the GitHub Actions run for the failed deployment
2. Check the `Rollback on Failure` job logs
3. Confirm each service shows "Rolled back" status
### 3.2 Manual Rollback Script
```bash
# Single service
./infra/scripts/rollback.sh production api
# All services
./infra/scripts/rollback.sh production all
# Staging environment
./infra/scripts/rollback.sh staging all
```
**Script behavior:**
1. Iterates over target services (or all if `all` specified)
2. Calls `aws ecs update-service` for each service to redeploy its previous task definition revision
3. Waits for service to stabilize via `aws ecs wait services-stable`
4. Reports success/failure per service
5. Exits with non-zero code if any service fails to stabilize
**Expected output:**
```
Rolling back services in cluster: shieldai-production
Rolling back api...
Waiting for api to stabilize...
api rolled back successfully
Rolling back darkwatch...
Waiting for darkwatch to stabilize...
darkwatch rolled back successfully
...
Rollback complete for api darkwatch spamshield voiceprint
```
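The script behavior listed above can be sketched as follows. This is an illustration under stated assumptions, not the contents of `infra/scripts/rollback.sh`: the function names are hypothetical, services are assumed to be named `<cluster>-<service>` as in §3.3, and the previous good revision is assumed to be the current task definition revision minus one, selected explicitly with `--task-definition`.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; function names are hypothetical.
SERVICES="api darkwatch spamshield voiceprint"

# Expand "all" into the four service names; otherwise pass the name through.
resolve_services() {
  if [ "$1" = "all" ]; then echo "$SERVICES"; else echo "$1"; fi
}

# Derive the previous revision from a task definition ARN or family:revision.
# Assumption: the last good revision is the current revision minus one.
previous_task_definition() {
  local current="$1"
  local revision="${current##*:}"
  case "$revision" in
    ''|*[!0-9]*) return 1 ;;   # not a numeric revision
  esac
  if [ "$revision" -le 1 ]; then
    return 1                   # nothing older to roll back to
  fi
  echo "${current%:*}:$(( revision - 1 ))"
}

# Roll back one service and wait for it to stabilize.
rollback_service() {
  local cluster="$1" service="$1-$2" current previous
  current=$(aws ecs describe-services --cluster "$cluster" --services "$service" \
    --query 'services[0].taskDefinition' --output text) || return 1
  previous=$(previous_task_definition "$current") || return 1
  aws ecs update-service --cluster "$cluster" --service "$service" \
    --task-definition "$previous" >/dev/null || return 1
  echo "Waiting for $2 to stabilize..."
  aws ecs wait services-stable --cluster "$cluster" --services "$service"
}

# Roll back every target service; return non-zero if any of them fails.
rollback_services() {
  local env="$1" target="$2" failed=0 svc
  echo "Rolling back services in cluster: shieldai-$env"
  for svc in $(resolve_services "$target"); do
    echo "Rolling back $svc..."
    if rollback_service "shieldai-$env" "$svc"; then
      echo "$svc rolled back successfully"
    else
      echo "$svc failed to roll back" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

A call such as `rollback_services staging api` would then mirror `./infra/scripts/rollback.sh staging api`. Confirm the target revision with `aws ecs list-task-definitions` before relying on the minus-one assumption.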
### 3.3 Manual CLI Rollback (Fallback)
If the script is unavailable, roll back individual services with the AWS CLI:
```bash
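CLUSTER="shieldai-production"
SERVICE="api"
# Look up the task definition the service is currently running
CURRENT_TD=$(aws ecs describe-services \
  --cluster "$CLUSTER" \
  --services "${CLUSTER}-${SERVICE}" \
  --query 'services[0].taskDefinition' --output text)
# Roll back to the previous task definition. `update-service` has no --rollback
# flag, so pin the revision explicitly. Assumes the last good revision is the
# current one minus one; confirm with `aws ecs list-task-definitions` first.
PREVIOUS_TD="${CURRENT_TD%:*}:$(( ${CURRENT_TD##*:} - 1 ))"
aws ecs update-service \
  --cluster "$CLUSTER" \
  --service "${CLUSTER}-${SERVICE}" \
  --task-definition "$PREVIOUS_TD" \
  --no-cli-auto-prompt
# Wait for stabilization
aws ecs wait services-stable \
  --cluster "$CLUSTER" \
  --services "${CLUSTER}-${SERVICE}"
# Verify health
curl -s -o /dev/null -w "%{http_code}" \
  "https://shieldai-production-alb.us-east-1.elb.amazonaws.com/health"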
```
---
## 4. Docker Compose Rollback (Local / Staging)
### 4.1 Production Compose Rollback
The `docker-compose.prod.yml` file deploys all services with tagged images. To roll back:
```bash
# 1. Identify the previous working tag
# Check GitHub releases or git tags for the last known good version
PREVIOUS_TAG="v1.2.3"
# 2. Stop current services
docker compose -f docker-compose.prod.yml down
# 3. Pull previous images
docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-api:${PREVIOUS_TAG}
docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-darkwatch:${PREVIOUS_TAG}
docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-spamshield:${PREVIOUS_TAG}
docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-voiceprint:${PREVIOUS_TAG}
# 4. Override tag in compose
DOCKER_TAG=${PREVIOUS_TAG} docker compose -f docker-compose.prod.yml up -d
# 5. Verify health
for svc in api darkwatch spamshield voiceprint; do
  PORT=$(case $svc in
    api) echo 3000;; darkwatch) echo 3001;;
    spamshield) echo 3002;; voiceprint) echo 3003;;
  esac)
  curl -sf "http://localhost:${PORT}/health" && echo "$svc: OK" || echo "$svc: FAIL"
done
```
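Step 1 leaves the choice of `PREVIOUS_TAG` to the operator. As a rough aid, the sketch below picks the tag that precedes a given tag in version order. `previous_tag` is a hypothetical helper, it assumes `sort -V` is available, and the tag it prints is only a candidate: whether that version was actually healthy still has to be checked against the release history.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read tags on stdin (one per line) and print the tag
# that comes immediately before $1 in version order.
# Returns non-zero if $1 is not in the list or has no predecessor.
previous_tag() {
  local current="$1"
  sort -V | awk -v cur="$current" '
    $0 == cur { found = 1; exit }      # stop at the current tag
    { prev = $0 }                      # remember the last tag seen before it
    END { if (!found || prev == "") exit 1; print prev }
  '
}
```

For example, `git tag --list 'v*' | previous_tag v1.2.4` would print the tag released just before `v1.2.4`, which can then be assigned to `PREVIOUS_TAG`.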
### 4.2 Local Dev Rollback
```bash
# Stop and remove containers
docker compose down
# Rebuild from previous commit
git checkout <previous-commit>
docker compose up -d --build
```
---
## 5. Database Migration Rollback
### 5.1 Drizzle Migration Rollback
ShieldAI uses Drizzle ORM with the PostgreSQL dialect (the production database runs on RDS). Migrations are stored in `src/db/migrations/`.
```bash
# 1. Get database credentials from AWS Secrets Manager
#    (--output text returns the raw JSON document so jq can parse it)
DB_SECRET=$(aws secretsmanager get-secret-value \
  --secret-id "shieldai-${ENVIRONMENT}-db-password" \
  --query 'SecretString' --output text)
DB_HOST=$(echo "$DB_SECRET" | jq -r '.host')
DB_PORT=$(echo "$DB_SECRET" | jq -r '.port')
DB_USER=$(echo "$DB_SECRET" | jq -r '.username')
DB_PASS=$(echo "$DB_SECRET" | jq -r '.password')
export DATABASE_URL="postgresql://${DB_USER}:${DB_PASS}@${DB_HOST}:${DB_PORT}/shieldai"
# 2. List applied migrations to identify the one to revert
#    (assumes Drizzle's default PostgreSQL bookkeeping table)
psql "$DATABASE_URL" -c 'SELECT id, hash, created_at FROM drizzle.__drizzle_migrations ORDER BY created_at;'
# 3. Revert the problematic migration. drizzle-kit has no down-migration or
#    `migrate:resolve` command, so apply a hand-written reverse SQL script and
#    delete the migration's row from the bookkeeping table.
psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -f "<revert_script>.sql"
# 4. Remove or fix the migration file, then re-apply the remaining migrations
npx drizzle-kit migrate --config=drizzle.config.ts
```
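The connection URL in step 1 is built by interpolating the credentials directly, which breaks if the password contains URL-reserved characters such as `@`, `:` or `/`. A minimal sketch of percent-encoding the credentials first, assuming ASCII-only values; `urlencode` is a hypothetical helper, not part of the repository:

```bash
#!/usr/bin/env bash
# Hypothetical helper: percent-encode a string for use inside a URL.
# Unreserved characters (letters, digits, . ~ _ -) pass through unchanged.
# Assumes ASCII input; multi-byte characters are not handled.
urlencode() {
  local s="$1" out="" c i
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:i:1}"
    case "$c" in
      [a-zA-Z0-9.~_-]) out+="$c" ;;
      *) printf -v c '%%%02X' "'$c"    # "'x" makes printf emit x's character code
         out+="$c" ;;
    esac
  done
  printf '%s' "$out"
}
```

With this in place the URL could be assembled as `postgresql://$(urlencode "$DB_USER"):$(urlencode "$DB_PASS")@${DB_HOST}:${DB_PORT}/shieldai`.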
### 5.2 RDS Point-in-Time Recovery (Disaster)
When the database itself needs recovery (e.g., data corruption, bad migration):
```bash
# 1. Find available recovery window (automated backups: every 24h, 7-14 day retention)
aws rds describe-db-instances \
  --db-instance-identifier "shieldai-production-db" \
  --query 'DBInstances[0].LatestRestorableTime'
# 2. Create restored instance (does not affect primary)
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier "shieldai-production-db" \
  --target-db-instance-identifier "shieldai-production-db-restored" \
  --restore-time "2026-05-09T08:00:00Z"
# 3. Wait until the restored instance is available
aws rds wait db-instance-available \
  --db-instance-identifier "shieldai-production-db-restored"
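# 4. Point the services at the restored instance by updating the host in the
#    database secret in Secrets Manager
DB_SECRET=$(aws secretsmanager get-secret-value \
  --secret-id "shieldai-production-db-password" \
  --query 'SecretString' --output text)
RESTORED_HOST=$(aws rds describe-db-instances \
  --db-instance-identifier "shieldai-production-db-restored" \
  --query 'DBInstances[0].Endpoint.Address' --output text)
aws secretsmanager put-secret-value \
  --secret-id "shieldai-production-db-password" \
  --secret-string "$(echo "$DB_SECRET" | jq -c --arg host "$RESTORED_HOST" '.host = $host')"
# 5. Force a new deployment so running tasks pick up the new DB endpoint
for svc in api darkwatch spamshield voiceprint; do
  aws ecs update-service \
    --cluster "shieldai-production" \
    --service "shieldai-production-${svc}" \
    --force-new-deployment
done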
```
### 5.3 RDS Snapshot Restore
```bash
# 1. List available snapshots
aws rds describe-db-snapshots \
--db-instance-identifier "shieldai-production-db"
# 2. Restore from specific snapshot
# (assumes the Terraform root module lives in infra/ and exposes a
#  vpc_security_group_id output)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "shieldai-production-db-restored" \
  --db-snapshot-identifier "rds:shieldai-production-db-2026-05-08-03-00" \
  --db-instance-class "db.t3.medium" \
  --vpc-security-group-ids "$(terraform -chdir=infra output -raw vpc_security_group_id)"
# 3. Follow steps 3-5 from Point-in-Time Recovery above
```
---
## 6. Automated Rollback Triggers
### 6.1 CI/CD Health Check Failure
**Trigger:** Post-deploy health check returns non-200 from `/health`
**Pipeline job:** `rollback` in `.github/workflows/deploy.yml`
**Condition:** `if: failure() && needs.health-check.result == 'failure'`
**Action:** Rolls back all four ECS services to previous task definition
**Timeout:** Health check retries for 5 minutes before triggering rollback
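A minimal sketch of how a retry-then-rollback gate like this could be written. The actual logic lives in `.github/workflows/deploy.yml` and may differ; `wait_for_healthy` and `probe_health` are hypothetical helpers, and the 10-second interval in the usage note is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a probe command repeatedly until it succeeds
# or the timeout elapses.
# Usage: wait_for_healthy <timeout_seconds> <interval_seconds> <probe command...>
wait_for_healthy() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( SECONDS + timeout ))
  while true; do
    if "$@"; then
      return 0        # probe succeeded: deployment is healthy
    fi
    if (( SECONDS >= deadline )); then
      return 1        # timeout reached: caller should trigger the rollback
    fi
    sleep "$interval"
  done
}

# Probe that fails on connection errors and on HTTP 4xx/5xx responses (curl -f).
probe_health() {
  curl -sf -o /dev/null "$1"
}
```

A five-minute gate could then read `wait_for_healthy 300 10 probe_health "<health URL>" || ./infra/scripts/rollback.sh production all`.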
### 6.2 ECS Container Health Check
Each container has an in-container health check defined in the ECS task definition:
```json
"healthCheck": {
"command": ["CMD-SHELL", "wget -q --spider http://localhost:{port}/health || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3,
"startPeriod": 60
}
```
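**Failure consequence:** A container is marked unhealthy after 3 consecutive failed checks (3 × 30 s = 90 seconds). Independently, the ALB marks the target unhealthy after 3 failed health checks (90 seconds) and stops routing traffic to it. The target is then drained and ECS replaces the unhealthy task.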
### 6.3 ALB Target Group Health Check
The ALB performs HTTP health checks against `/health` on each target:
| Parameter | Value |
|-----------|-------|
| Interval | 30s |
| Timeout | 5s |
| Healthy threshold | 3 |
| Unhealthy threshold | 3 |
| Expected code | 200 |
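
Both the container check (§6.2) and the ALB check use a 30-second interval with a threshold of 3, so each needs roughly 90 seconds of continuous failure before it reports unhealthy. The arithmetic, as a small sketch with a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Approximate seconds of continuous failure before a check reports unhealthy:
# check interval multiplied by the failure threshold (per-check timeouts ignored).
time_to_unhealthy() {
  local interval="$1" threshold="$2"
  echo $(( interval * threshold ))
}

time_to_unhealthy 30 3   # prints 90
```

For the container check, failures during the 60-second `startPeriod` do not count toward the retry limit, so a task that never becomes healthy is flagged about 150 seconds after it starts.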
### 6.4 CloudWatch Alarms
The following alarms are configured in `infra/modules/cloudwatch/main.tf`:
| Alarm | Threshold | Action |
|-------|-----------|--------|
| ECS CPU >80% | 80% for 2 periods (10min) | SNS notification |
| ECS Memory >85% | 85% for 2 periods (10min) | SNS notification |
| ALB 5xx >10/min | 10 for 3 periods (3min) | SNS notification |
| RDS CPU >75% | 75% for 2 periods (10min) | SNS notification |
| RDS Free Storage <500MB | 500MB for 2 periods (10min) | SNS notification |
**Alarm escalation path:**
1. CloudWatch alarm fires
2. SNS notification sent to on-call engineer
3. Engineer evaluates: if service is degraded, trigger manual rollback
4. If root cause is deployment-related, run `./infra/scripts/rollback.sh production all`
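
During step 3, a quick way to see which alarms are currently firing is to query CloudWatch from the CLI. A minimal sketch, assuming alarm names share the `shieldai-<env>` prefix (as `shieldai-production-alb-5xx` in §11.1 suggests); the function name is hypothetical:

```bash
#!/usr/bin/env bash
# List the names of all alarms for one environment that are in ALARM state.
# Assumes alarm names start with "shieldai-<env>".
alarms_in_alarm_state() {
  aws cloudwatch describe-alarms \
    --alarm-name-prefix "shieldai-$1" \
    --state-value ALARM \
    --query 'MetricAlarms[].AlarmName' --output text
}
```

Running `alarms_in_alarm_state production` prints nothing when no matching alarm is firing.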
---
## 7. Blue-Green Deployment Rollback
### 7.1 Architecture
ShieldAI uses ECS services with rolling deployments. Each deployment creates a new task definition revision. The ALB routes traffic to healthy targets only.
**Rollback mechanism:** A rollback updates the service to the previous task definition revision. This behaves like a blue-green swap because:
1. Old task definition (blue) remains registered
2. New task definition (green) is deployed
3. On rollback, ECS reverts to blue task definition
4. ALB automatically routes to healthy (blue) targets
### 7.2 Blue-Green Rollback Procedure
```bash
# 1. Check current deployment state
aws ecs list-services --cluster shieldai-production
aws ecs describe-services --cluster shieldai-production \
  --services shieldai-production-api \
  --query 'services[0].deployments'
# 2. Identify previous deployment
# The deployment with status "PRIMARY" is current.
# Look for "ACTIVE" deployment with older task definition.
# 3. Execute rollback (script handles all services)
./infra/scripts/rollback.sh production all
# 4. Verify rollback
aws ecs describe-services --cluster shieldai-production \
  --services shieldai-production-api \
  --query 'services[0].deployments[?status==`PRIMARY`].taskDefinition'
```
### 7.3 Docker Compose Blue-Green (Local)
For local/staging environments using Docker Compose, implement blue-green via service version pinning:
```bash
# Current deployment uses DOCKER_TAG env var
# Rollback by setting DOCKER_TAG to previous version
# Save current tag
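CURRENT_TAG=$(grep '^DOCKER_TAG=' .env.prod 2>/dev/null | cut -d= -f2)
# Fall back to "latest" when .env.prod is missing or has no DOCKER_TAG entry
CURRENT_TAG="${CURRENT_TAG:-latest}"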
# Rollback to previous
export DOCKER_TAG="v1.2.3"
docker compose -f docker-compose.prod.yml up -d
# Verify all services
docker compose -f docker-compose.prod.yml ps
```
---
## 8. Rollback Decision Tree
```
Is the service responding?
├── YES → Is the response correct?
│   ├── YES → Monitor, no action needed
│   └── NO → Is it a data issue?
│       ├── YES → Database Migration Rollback (§5)
│       └── NO → ECS Service Rollback (§3)
└── NO → Is it a single service or all?
    ├── Single → ECS Service Rollback (§3, specific service)
    └── All → Full Environment Rollback
        └── Is DB corrupted?
            ├── YES → RDS Point-in-Time Recovery (§5.2)
            └── NO → ECS Full Rollback + DB Migration Rollback
```
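The same tree can be written as a small function, which is handy for drills or for embedding in tooling. This is a sketch with a hypothetical function name; the answers still come from a human assessing the incident.

```bash
#!/usr/bin/env bash
# Usage: rollback_decision <responding> <correct> <data_issue> <scope> <db_corrupted>
# Yes/no arguments take "yes" or "no"; scope takes "single" or "all".
rollback_decision() {
  local responding="$1" correct="$2" data_issue="$3" scope="$4" db_corrupted="$5"
  if [ "$responding" = "yes" ]; then
    if [ "$correct" = "yes" ]; then
      echo "Monitor, no action needed"
    elif [ "$data_issue" = "yes" ]; then
      echo "Database Migration Rollback (§5)"
    else
      echo "ECS Service Rollback (§3)"
    fi
  elif [ "$scope" = "single" ]; then
    echo "ECS Service Rollback (§3, specific service)"
  elif [ "$db_corrupted" = "yes" ]; then
    echo "RDS Point-in-Time Recovery (§5.2)"
  else
    echo "ECS Full Rollback + DB Migration Rollback"
  fi
}
```

For example, `rollback_decision no no no all yes` prints `RDS Point-in-Time Recovery (§5.2)`.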
**SLA targets:**
- Single service rollback: **< 5 minutes**
- Full environment rollback: **< 15 minutes**
- Database recovery: **< 30 minutes** (Point-in-Time)
---
## 9. Post-Rollback Verification
After any rollback, verify the following:
### 9.1 Service Health
```bash
# Check the ALB health endpoint. The same URL fronts every service, so one
# request is enough here; per-service task counts are covered in §9.2.
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
  "https://shieldai-${ENVIRONMENT}-alb.us-east-1.elb.amazonaws.com/health")
echo "ALB /health: HTTP $HTTP_CODE"
```
### 9.2 ECS Service Status
```bash
# Verify all services are stable
for svc in api darkwatch spamshield voiceprint; do
  RUNNING=$(aws ecs describe-services \
    --cluster "shieldai-${ENVIRONMENT}" \
    --services "shieldai-${ENVIRONMENT}-${svc}" \
    --query 'services[0].runningCount' --output text)
  DESIRED=$(aws ecs describe-services \
    --cluster "shieldai-${ENVIRONMENT}" \
    --services "shieldai-${ENVIRONMENT}-${svc}" \
    --query 'services[0].desiredCount' --output text)
  echo "$svc: $RUNNING/$DESIRED running"
done
```
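The same check can be done with a single `describe-services` call that also returns a non-zero status when any service is short of its desired count, which makes it usable in scripts. A sketch with a hypothetical function name:

```bash
#!/usr/bin/env bash
# Usage: check_service_counts <env> <service>...
# Prints "<service>: <running>/<desired> running" for each service and returns
# non-zero if any running count differs from the desired count.
check_service_counts() {
  local env="$1" rc=0 svc name running desired
  shift
  local services=()
  for svc in "$@"; do
    services+=("shieldai-${env}-${svc}")
  done
  # --output text prints one tab-separated row per service
  while read -r name running desired; do
    [ -n "$name" ] || continue
    echo "$name: $running/$desired running"
    [ "$running" = "$desired" ] || rc=1
  done < <(aws ecs describe-services \
    --cluster "shieldai-${env}" \
    --services "${services[@]}" \
    --query 'services[].[serviceName,runningCount,desiredCount]' --output text)
  return "$rc"
}
```

For example: `check_service_counts "$ENVIRONMENT" api darkwatch spamshield voiceprint || echo "not all services are at desired count"`.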
### 9.3 Database Connectivity
```bash
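# Verify database connection from a running api task (requires ECS Exec).
# execute-command targets a task, not a service, so look up a task first.
TASK_ARN=$(aws ecs list-tasks \
  --cluster "shieldai-${ENVIRONMENT}" \
  --service-name "shieldai-${ENVIRONMENT}-api" \
  --query 'taskArns[0]' --output text)
# Swap the command if your drizzle-kit version has no `status` subcommand.
aws ecs execute-command \
  --cluster "shieldai-${ENVIRONMENT}" \
  --task "$TASK_ARN" \
  --interactive \
  --command "npx drizzle-kit status"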
```
### 9.4 CloudWatch Verification
1. Navigate to CloudWatch dashboard: `shieldai-${ENVIRONMENT}-dashboard`
2. Verify CPU/Memory utilization is within normal range
3. Verify ALB 5xx errors have returned to baseline
4. Verify no new alarms are in ALARM state
---
## 10. Testing Checklist
### 10.1 ECS Rollback Test
- [ ] Deploy a known-bad image (e.g., image with `/health` returning 500)
- [ ] Verify CI/CD health check fails within 5 minutes
- [ ] Verify `rollback` job triggers automatically
- [ ] Verify all four services revert to previous task definition
- [ ] Verify health check passes post-rollback
- [ ] Verify CloudWatch metrics show recovery
### 10.2 Manual Script Test
- [ ] Run `./infra/scripts/rollback.sh staging api` on staging
- [ ] Verify single service rolls back correctly
- [ ] Run `./infra/scripts/rollback.sh staging all` on staging
- [ ] Verify all services roll back correctly
- [ ] Verify script exits with code 0 on success
- [ ] Verify script exits with code 1 on failure
### 10.3 Docker Compose Rollback Test
- [ ] Deploy v2.0.0 of all services via docker-compose.prod.yml
- [ ] Rollback to v1.0.0 using DOCKER_TAG override
- [ ] Verify all services restart with previous images
- [ ] Verify health endpoints respond correctly
### 10.4 Database Migration Rollback Test
- [ ] Apply a test migration on staging
- [ ] Run migration rollback procedure
- [ ] Verify schema matches pre-migration state
- [ ] Verify application connects and functions correctly
### 10.5 RDS Point-in-Time Recovery Test
- [ ] Create a test RDS instance
- [ ] Insert test data
- [ ] Restore to point before data insertion
- [ ] Verify restored instance has correct data state
- [ ] Clean up test instance
### 10.6 End-to-End Rollback Drills
| Drill | Frequency | Participants |
|-------|-----------|--------------|
| ECS service rollback | Monthly | Senior Engineer |
| Full environment rollback | Quarterly | Full engineering team |
| Database recovery | Quarterly | Senior Engineer + Founding Engineer |
| Blue-green rollback | Quarterly | Full engineering team |
---
## 11. Runbook: Emergency Rollback
### 11.1 Symptoms
- ALB 5xx error rate > 10/minute for 3+ minutes
- CloudWatch alarm: `shieldai-production-alb-5xx` in ALARM state
- Customer-reported service degradation
### 11.2 Immediate Actions (0-5 minutes)
```bash
# 1. Confirm environment and scope
ENVIRONMENT="production"
# 2. Check service status
# (service names are separate, space-delimited arguments, not a comma list)
aws ecs describe-services \
  --cluster "shieldai-${ENVIRONMENT}" \
  --services \
    "shieldai-${ENVIRONMENT}-api" \
    "shieldai-${ENVIRONMENT}-darkwatch" \
    "shieldai-${ENVIRONMENT}-spamshield" \
    "shieldai-${ENVIRONMENT}-voiceprint" \
  --query 'services[*].{Name:serviceName,Running:runningCount,Desired:desiredCount,Status:status}'
# 3. Check ALB health
curl -s -o /dev/null -w "%{http_code}" \
  "https://shieldai-${ENVIRONMENT}-alb.us-east-1.elb.amazonaws.com/health"
# 4. Execute rollback
./infra/scripts/rollback.sh "${ENVIRONMENT}" all
```
### 11.3 Verification (5-10 minutes)
```bash
# 1. Wait for services to stabilize
aws ecs wait services-stable \
  --cluster "shieldai-${ENVIRONMENT}" \
  --services \
    "shieldai-${ENVIRONMENT}-api" \
    "shieldai-${ENVIRONMENT}-darkwatch" \
    "shieldai-${ENVIRONMENT}-spamshield" \
    "shieldai-${ENVIRONMENT}-voiceprint"
# 2. Verify health endpoint
curl -sf "https://shieldai-${ENVIRONMENT}-alb.us-east-1.elb.amazonaws.com/health" \
  && echo "Health: OK" || echo "Health: FAIL"
# 3. Check CloudWatch for recovery
# Navigate to CloudWatch dashboard and verify metrics
```
### 11.4 Communication Template
```
## Rollback Notification
**Environment:** production
**Time:** $(date -u '+%Y-%m-%d %H:%M UTC')
**Trigger:** [ALB 5xx alarm / manual / CI/CD health check]
**Action:** Rolled back all services to previous deployment
**Status:** [In Progress / Verified / Resolved]
**Next steps:** [Post-mortem / monitoring / investigation]
```
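The `$(date …)` placeholder in the template only expands when the text passes through a shell. A minimal sketch of rendering the template from a heredoc, with a hypothetical function name and the bracketed fields passed as arguments:

```bash
#!/usr/bin/env bash
# Usage: render_rollback_notification <environment> <trigger> <status> <next_steps>
# Prints the notification with the current UTC time filled in.
render_rollback_notification() {
  local environment="$1" trigger="$2" status="$3" next_steps="$4"
  cat <<EOF
## Rollback Notification
**Environment:** ${environment}
**Time:** $(date -u '+%Y-%m-%d %H:%M UTC')
**Trigger:** ${trigger}
**Action:** Rolled back all services to previous deployment
**Status:** ${status}
**Next steps:** ${next_steps}
EOF
}
```

For example: `render_rollback_notification production "ALB 5xx alarm" "In Progress" "Post-mortem"`.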
### 11.5 Post-Incident
1. Create incident ticket with timeline
2. Document root cause
3. Update runbook if procedure changed
4. Schedule post-mortem within 48 hours
5. Create follow-up issues for preventive measures
---
## Appendix A: Quick Reference
| Resource | Command |
|----------|---------|
| Rollback script | `./infra/scripts/rollback.sh <env> <service\|all>` |
| ECS service status | `aws ecs describe-services --cluster shieldai-<env> --services shieldai-<env>-<svc>` |
| ALB health check | `curl -s -o /dev/null -w "%{http_code}" https://shieldai-<env>-alb.us-east-1.elb.amazonaws.com/health` |
| RDS snapshots | `aws rds describe-db-snapshots --db-instance-identifier shieldai-<env>-db` |
| CloudWatch dashboard | `https://us-east-1.console.aws.amazon.com/cloudwatch/home#dashboards/dashboard/shieldai-<env>-dashboard` |
| ECS task logs | `aws logs filter-log-events --log-group-name /ecs/shieldai-<env>-<svc>` |
## Appendix B: Environment Variables
| Variable | Description | Required |
|----------|-------------|----------|
| `AWS_ACCESS_KEY_ID` | Access key ID of an IAM user with ECS and RDS permissions | Yes |
| `AWS_SECRET_ACCESS_KEY` | Secret access key for the same IAM user | Yes |
| `AWS_DEFAULT_REGION` | AWS region (default: us-east-1) | Yes |
| `GITHUB_REPOSITORY_OWNER` | GitHub org/user for container registry | Docker Compose only |
| `DOCKER_TAG` | Container image tag to deploy | Docker Compose only |
| `POSTGRES_PASSWORD` | Database password | Docker Compose only |