Michael Freno 01ffe79bbe Update ROLLBACK.md with review completion (FRE-4808)

2026-05-12 01:11:59 -04:00

ShieldAI Rollback Runbook

Last updated: 2026-05-12
Owner: Senior Engineer
Parent: FRE-4574 ShieldAI Production Infrastructure & CI/CD Pipeline
Reviewed by: Code Reviewer (FRE-4808) on 2026-05-12

Overview
Rollback Strategies
ECS Service Rollback (AWS)
Docker Compose Rollback (Local / Staging)
Database Migration Rollback
Automated Rollback Triggers
Blue-Green Deployment Rollback
Rollback Decision Tree
Post-Rollback Verification
Testing Checklist
Runbook: Emergency Rollback

1. Overview

ShieldAI runs four services (api, darkwatch, spamshield, voiceprint) on AWS ECS Fargate behind an Application Load Balancer. Each service has independent deployment, health checks, and rollback capability.

Rollback types:

Type	Trigger	Scope	Automation
ECS Service Rollback	Health check failure, manual	Single or all services	✅ CI/CD + manual script
Docker Compose Rollback	Manual (local/staging)	All services	✅ Scripted
Database Migration Rollback	Manual	Schema changes	⚠️ Semi-manual
Blue-Green Rollback	Manual or automated	Full environment	✅ CI/CD
RDS Point-in-Time Restore	Manual (disaster)	Full database	⚠️ Semi-manual

2. Rollback Strategies

2.1 ECS Service-Level Rollback

Each ECS service maintains a history of task definitions. Rolling back reverts to the previous successfully deployed task definition.

Prerequisites:

AWS CLI configured with credentials for the target environment
IAM permissions: ecs:UpdateService, ecs:DescribeServices (aws ecs wait services-stable polls DescribeServices, so it needs no separate permission)
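
Reverting to the previous task definition comes down to deriving the earlier revision from the one currently deployed. The following is a minimal sketch of that step, assuming revisions are sequential and that the immediately preceding revision is the last known good one. previous_task_definition is a hypothetical helper name, not something taken from infra/scripts/rollback.sh.

```shell
# Hypothetical helper: derive the previous task definition revision from a
# task definition ARN or a family:revision string.
# Assumes revision N-1 is still registered and is the last good deployment.
previous_task_definition() {
  local current="$1"
  local family="${current%:*}"      # everything before the final ":"
  local revision="${current##*:}"   # the trailing revision number

  # Reject inputs without a numeric revision.
  case "$revision" in
    ''|*[!0-9]*)
      echo "no numeric revision in: $current" >&2
      return 1
      ;;
  esac

  # Revision 1 has nothing earlier to fall back to.
  if [ "$revision" -le 1 ]; then
    echo "no previous revision for: $current" >&2
    return 1
  fi

  echo "${family}:$(( revision - 1 ))"
}
```

The printed value can be passed to aws ecs update-service via --task-definition, which accepts either a full ARN or family:revision.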

2.2 Blue-Green Rollback

The CI/CD pipeline deploys new images to existing ECS services. If health checks fail after deployment, the rollback job in the deploy workflow automatically reverts all four services to their previous task definition revision.

Pipeline flow:

build-and-push → deploy-ecs → health-check → [PASS: done | FAIL: rollback]

2.3 Database Migration Rollback

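ShieldAI uses Drizzle ORM for database migrations. Each migration is versioned and stored in src/db/migrations/. Drizzle does not generate down migrations, so rolling back means reverting the schema change by hand and restoring the previous migration set (see §5.1).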

3. ECS Service Rollback (AWS)

3.1 Automated (CI/CD Pipeline)

The deploy workflow (.github/workflows/deploy.yml) includes a rollback job that triggers on health check failure:

rollback:
  needs: [deploy-ecs, health-check]
  if: failure() && needs.health-check.result == 'failure'
  # Rolls back all 4 services to previous task definition

When it runs:

Post-deploy health check fails (HTTP 200 not received from /health)
Runs after deploy-ecs and health-check jobs
Rolls back all four services: api, darkwatch, spamshield, voiceprint

How to verify:

Navigate to the GitHub Actions run for the failed deployment
Check the Rollback on Failure job logs
Confirm each service shows "Rolled back" status

3.2 Manual Rollback Script

# Single service
./infra/scripts/rollback.sh production api

# All services
./infra/scripts/rollback.sh production all

# Staging environment
./infra/scripts/rollback.sh staging all

Script behavior:

Iterates over target services (or all if all specified)
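Calls aws ecs update-service for each service to revert it to the previous task definition revision (the CLI has no --rollback flag; the earlier revision must be passed with --task-definition)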
Waits for service to stabilize via aws ecs wait services-stable
Reports success/failure per service
Exits with non-zero code if any service fails to stabilize
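
The behavior listed above could be sketched roughly as follows. This shows the control flow only and is not the contents of infra/scripts/rollback.sh. rollback_services and the ROLLBACK_CMD hook are hypothetical names; the hook stands in for whatever command reverts and waits on a single service.

```shell
# Sketch of the control flow only. ROLLBACK_CMD is a hypothetical hook: the
# name of a command or function that rolls back one service and waits for it
# to stabilize. It is called as: "$ROLLBACK_CMD" <cluster> <service-name>.

ALL_SERVICES="api darkwatch spamshield voiceprint"

rollback_services() {
  local environment="$1" target="$2"
  local cluster="shieldai-${environment}"
  local services="$target"
  local failed=0 svc

  # "all" expands to every known service.
  if [ "$target" = "all" ]; then
    services="$ALL_SERVICES"
  fi

  echo "Rolling back services in cluster: ${cluster}"
  for svc in $services; do   # unquoted on purpose: split on spaces
    echo "Rolling back ${svc}..."
    if "$ROLLBACK_CMD" "$cluster" "${cluster}-${svc}"; then
      echo "${svc} rolled back successfully"
    else
      echo "${svc} failed to roll back" >&2
      failed=1               # keep going so every service gets an attempt
    fi
  done
  return "$failed"
}
```

Continuing after a failure, rather than aborting on the first one, matches the per-service reporting described above while still producing a non-zero exit code overall.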

Expected output:

Rolling back services in cluster: shieldai-production
Rolling back api...
Waiting for api to stabilize...
api rolled back successfully
Rolling back darkwatch...
Waiting for darkwatch to stabilize...
darkwatch rolled back successfully
...
Rollback complete for api darkwatch spamshield voiceprint

3.3 Manual CLI Rollback (Fallback)

If the script is unavailable, rollback individual services:

CLUSTER="shieldai-production"
SERVICE="api"

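# Rollback to previous task definition
# aws ecs update-service has no --rollback flag, so pass the earlier revision
# explicitly. This assumes revision N-1 is still registered and is the last
# known good one.
CURRENT_TD=$(aws ecs describe-services \
  --cluster "$CLUSTER" \
  --services "${CLUSTER}-${SERVICE}" \
  --query 'services[0].taskDefinition' --output text)
PREVIOUS_TD="${CURRENT_TD%:*}:$(( ${CURRENT_TD##*:} - 1 ))"

aws ecs update-service \
  --cluster "$CLUSTER" \
  --service "${CLUSTER}-${SERVICE}" \
  --task-definition "$PREVIOUS_TD" \
  --no-cli-auto-prompt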

# Wait for stabilization
aws ecs wait services-stable \
  --cluster "$CLUSTER" \
  --services "${CLUSTER}-${SERVICE}"

# Verify health
curl -s -o /dev/null -w "%{http_code}" \
  "https://shieldai-production-alb.us-east-1.elb.amazonaws.com/health"

4. Docker Compose Rollback (Local / Staging)

4.1 Production Compose Rollback

The docker-compose.prod.yml deploys all services with tagged images. To rollback:

# 1. Identify the previous working tag
# Check GitHub releases or git tags for the last known good version
PREVIOUS_TAG="v1.2.3"

# 2. Stop current services
docker compose -f docker-compose.prod.yml down

# 3. Pull previous images
docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-api:${PREVIOUS_TAG}
docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-darkwatch:${PREVIOUS_TAG}
docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-spamshield:${PREVIOUS_TAG}
docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-voiceprint:${PREVIOUS_TAG}

# 4. Override tag in compose
DOCKER_TAG=${PREVIOUS_TAG} docker compose -f docker-compose.prod.yml up -d

# 5. Verify health
for svc in api darkwatch spamshield voiceprint; do
  PORT=$(case $svc in
    api) echo 3000;; darkwatch) echo 3001;;
    spamshield) echo 3002;; voiceprint) echo 3003;;
  esac)
  curl -sf "http://localhost:${PORT}/health" && echo "$svc: OK" || echo "$svc: FAIL"
done
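
Step 1 above leaves finding the previous working tag to the operator. One way to sketch it, assuming release tags follow the vMAJOR.MINOR.PATCH pattern used in the example and that a sort supporting -V (GNU coreutils) is available. previous_tag is a hypothetical helper name.

```shell
# Hypothetical helper: print the tag immediately before the given one in
# version order. Candidate tags are read from stdin, one per line.
previous_tag() {
  local current="$1"
  # sort -V orders v1.9.0 before v1.10.0. awk remembers the tag seen just
  # before the current one and prints it; exit status is 1 if there is none.
  sort -V | awk -v cur="$current" '
    $0 == cur { if (prev != "") { print prev; found = 1 }; exit }
    { prev = $0 }
    END { exit !found }
  '
}
```

A possible invocation is PREVIOUS_TAG=$(git tag --list 'v*' | previous_tag v1.2.4). The result is only the previous tag by version number; whether it was a working release still has to be confirmed against the release history.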

4.2 Local Dev Rollback

# Stop and remove containers
docker compose down

# Rebuild from previous commit
git checkout <previous-commit>
docker compose up -d --build

5. Database Migration Rollback

5.1 Drizzle Migration Rollback

ShieldAI uses Drizzle ORM with the PostgreSQL dialect (the database runs on RDS). Migrations are stored in src/db/migrations/.

# 1. Get database credentials from AWS Secrets Manager
# --output text returns the raw JSON document; --output json would wrap it
# in a quoted string that jq cannot index into
DB_SECRET=$(aws secretsmanager get-secret-value \
  --secret-id "shieldai-${ENVIRONMENT}-db-password" \
  --query 'SecretString' --output text)

DB_HOST=$(echo "$DB_SECRET" | jq -r '.host')
DB_PORT=$(echo "$DB_SECRET" | jq -r '.port')
DB_USER=$(echo "$DB_SECRET" | jq -r '.username')
DB_PASS=$(echo "$DB_SECRET" | jq -r '.password')

# Exported so that drizzle-kit and psql can read it
export DATABASE_URL="postgresql://${DB_USER}:${DB_PASS}@${DB_HOST}:${DB_PORT}/shieldai"

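# 2. List applied migrations to identify the one to revert
# drizzle-kit has no listing command; applied migrations are recorded in the
# drizzle.__drizzle_migrations table (Drizzle's PostgreSQL default). Requires psql.
psql "$DATABASE_URL" -c \
  'SELECT id, hash, created_at FROM drizzle.__drizzle_migrations ORDER BY created_at;'

# 3. Revert the problematic migration
# drizzle-kit has no migrate:resolve or down-migration command. Apply a
# hand-written reverse SQL script for <migration_name>, delete its row from
# the tracking table, and remove its file from src/db/migrations/.
psql "$DATABASE_URL" -f "<reverse_sql_for_migration_name>"

# 4. Confirm the remaining migration set applies cleanly (should be a no-op)
npx drizzle-kit migrate --config=drizzle.config.ts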

5.2 RDS Point-in-Time Recovery (Disaster)

When the database itself needs recovery (e.g., data corruption, bad migration):

# 1. Find available recovery window (automated backups: every 24h, 7-14 day retention)
aws rds describe-db-instances \
  --db-instance-identifier "shieldai-production-db" \
  --query 'DBInstances[0].LatestRestorableTime'

# 2. Create restored instance (does not affect primary)
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier "shieldai-production-db" \
  --target-db-instance-identifier "shieldai-production-db-restored" \
  --restore-time "2026-05-09T08:00:00Z"

# 3. Verify restored instance
aws rds wait db-instance-available \
  --db-instance-identifier "shieldai-production-db-restored"

# 4. Update ECS services to point to restored instance
# Update DATABASE_URL secret in Secrets Manager
aws secretsmanager put-secret-value \
  --secret-id "shieldai-production-db-password" \
  --secret-string "$(echo "$DB_SECRET" | jq --arg host "$(aws rds describe-db-instances --db-instance-identifier shieldai-production-db-restored --query 'DBInstances[0].Endpoint.Address' --output text)" '.host = $host')"

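# 5. Trigger ECS service redeployment to pick up new DB endpoint
# Force a new deployment of the current task definition; rollback.sh would
# also revert the task definition revision, which is not wanted here.
for svc in api darkwatch spamshield voiceprint; do
  aws ecs update-service \
    --cluster shieldai-production \
    --service "shieldai-production-${svc}" \
    --force-new-deployment
done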

5.3 RDS Snapshot Restore

# 1. List available snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier "shieldai-production-db"

# 2. Restore from specific snapshot
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "shieldai-production-db-restored" \
  --db-snapshot-identifier "rds:shieldai-production-db-2026-05-08-03-00" \
  --db-instance-class "db.t3.medium" \
  --vpc-security-group-ids "$(terraform -chdir=infra output -raw vpc_security_group_id)"

# 3. Follow steps 3-5 from Point-in-Time Recovery above

6. Automated Rollback Triggers

6.1 CI/CD Health Check Failure

Trigger: Post-deploy health check returns non-200 from /health

Pipeline job: rollback in .github/workflows/deploy.yml

Condition: if: failure() && needs.health-check.result == 'failure'

Action: Rolls back all four ECS services to previous task definition

Timeout: Health check retries for 5 minutes before triggering rollback
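
The retry-then-rollback behavior described here can be approximated locally. This is a sketch under stated assumptions: the 300-second budget mirrors the 5 minutes above, while the 10-second interval, the 5-second per-request limit, and the function name wait_for_healthy are illustrative and not taken from the deploy workflow.

```shell
# Poll a health URL until it returns HTTP 200 or the time budget is spent.
# Elapsed time counts the sleep intervals only, so the real wall-clock time
# is slightly longer than the budget.
wait_for_healthy() {
  local url="$1" timeout="${2:-300}" interval="${3:-10}"
  local elapsed=0 code=""

  while [ "$elapsed" -lt "$timeout" ]; do
    # -s silences progress, -o discards the body, -w prints only the status.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url" || true)
    if [ "$code" = "200" ]; then
      echo "healthy after ${elapsed}s"
      return 0
    fi
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done

  echo "unhealthy after ${timeout}s (last status: ${code:-none})" >&2
  return 1
}
```

A non-zero return is the point at which a rollback would be triggered, for example: wait_for_healthy "$HEALTH_URL" || ./infra/scripts/rollback.sh production all.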

6.2 ECS Container Health Check

Each container has an in-container health check defined in the ECS task definition:

"healthCheck": {
  "command": ["CMD-SHELL", "wget -q --spider http://localhost:{port}/health || exit 1"],
  "interval": 30,
  "timeout": 5,
  "retries": 3,
  "startPeriod": 60
}

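Failure consequence: The container is marked unhealthy after 3 consecutive failed checks (3 × 30s = 90 seconds, once the 60-second start period has passed). The ALB independently marks the target unhealthy after 3 failed health checks (90 seconds). ECS then stops the unhealthy task and starts a replacement, while the ALB drains connections from the old target.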
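
The 90-second figure follows directly from the health check parameters. As a rough worked calculation (it ignores the per-check timeout and scheduling jitter, and time_to_unhealthy is an illustrative name):

```shell
# Approximate seconds before a container is marked unhealthy.
# Failures during startPeriod do not count toward retries, so a container
# that fails from launch needs startPeriod plus retries x interval.
time_to_unhealthy() {
  local interval="$1" retries="$2" start_period="${3:-0}"
  echo $(( start_period + interval * retries ))
}

time_to_unhealthy 30 3      # already-running container: 90
time_to_unhealthy 30 3 60   # container failing from launch: 150
```

The same arithmetic applies to the ALB thresholds: a 30-second interval with an unhealthy threshold of 3 also gives roughly 90 seconds.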

6.3 ALB Target Group Health Check

The ALB performs HTTP health checks against /health on each target:

Parameter	Value
Interval	30s
Timeout	5s
Healthy threshold	3
Unhealthy threshold	3
Expected code	200

6.4 CloudWatch Alarms

The following alarms are configured in infra/modules/cloudwatch/main.tf:

Alarm	Threshold	Action
ECS CPU >80%	80% for 2 periods (10min)	SNS notification
ECS Memory >85%	85% for 2 periods (10min)	SNS notification
ALB 5xx >10/min	10 for 3 periods (3min)	SNS notification
RDS CPU >75%	75% for 2 periods (10min)	SNS notification
RDS Free Storage <500MB	500MB for 2 periods (10min)	SNS notification

Alarm escalation path:

CloudWatch alarm fires
SNS notification sent to on-call engineer
Engineer evaluates: if service is degraded, trigger manual rollback
If root cause is deployment-related, run ./infra/scripts/rollback.sh production all

7. Blue-Green Deployment Rollback

7.1 Architecture

ShieldAI uses ECS services with rolling deployments. Each deployment creates a new task definition revision. The ALB routes traffic to healthy targets only.

Rollback mechanism: the service is updated back to the previous task definition revision (aws ecs update-service with --task-definition set to the earlier revision). This is functionally similar to a blue-green swap since:

Old task definition (blue) remains registered
New task definition (green) is deployed
On rollback, ECS reverts to blue task definition
ALB automatically routes to healthy (blue) targets

7.2 Blue-Green Rollback Procedure

# 1. Check current deployment state
aws ecs list-services --cluster shieldai-production
aws ecs describe-services --cluster shieldai-production \
  --services shieldai-production-api \
  --query 'services[0].deployments'

# 2. Identify previous deployment
# The deployment with status "PRIMARY" is current.
# Look for "ACTIVE" deployment with older task definition.

# 3. Execute rollback (script handles all services)
./infra/scripts/rollback.sh production all

# 4. Verify rollback
aws ecs describe-services --cluster shieldai-production \
  --services shieldai-production-api \
  --query 'services[0].deployments[?status==`PRIMARY`].taskDefinition'

7.3 Docker Compose Blue-Green (Local)

For local/staging environments using Docker Compose, implement blue-green via service version pinning:

# Current deployment uses DOCKER_TAG env var
# Rollback by setting DOCKER_TAG to previous version

# Save current tag
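CURRENT_TAG=$(grep -m1 '^DOCKER_TAG=' .env.prod 2>/dev/null | cut -d= -f2)
# Default separately: the pipeline's exit status is cut's, so "|| echo" never fires
CURRENT_TAG="${CURRENT_TAG:-latest}"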

# Rollback to previous
export DOCKER_TAG="v1.2.3"
docker compose -f docker-compose.prod.yml up -d

# Verify all services
docker compose -f docker-compose.prod.yml ps

8. Rollback Decision Tree

Is the service responding?
├── YES → Is the response correct?
│   ├── YES → Monitor, no action needed
│   └── NO → Is it a data issue?
│       ├── YES → Database Migration Rollback (§5)
│       └── NO → ECS Service Rollback (§3)
└── NO → Is it a single service or all?
    ├── Single → ECS Service Rollback (§3, specific service)
    └── All → Full Environment Rollback
        ├── Is DB corrupted?
        │   ├── YES → RDS Point-in-Time Recovery (§5.2)
        │   └── NO → ECS Full Rollback + DB Migration Rollback
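
The tree can also be written as a small function, which is convenient for incident tooling or for checking that every branch has an outcome. This is an illustrative encoding only; the function name and argument order are assumptions, and the answers still have to come from a human.

```shell
# Encodes the decision tree above. Arguments are the answers in the order
# the tree asks them:
#   $1 responding? (yes|no)   $2 response correct? (yes|no)
#   $3 data issue? (yes|no)   $4 scope (single|all)
#   $5 DB corrupted? (yes|no)
choose_rollback() {
  local responding="$1" correct="$2" data_issue="$3" scope="$4" db_corrupted="$5"

  if [ "$responding" = "yes" ]; then
    if [ "$correct" = "yes" ]; then
      echo "Monitor, no action needed"
    elif [ "$data_issue" = "yes" ]; then
      echo "Database Migration Rollback (§5)"
    else
      echo "ECS Service Rollback (§3)"
    fi
  elif [ "$scope" = "single" ]; then
    echo "ECS Service Rollback (§3, specific service)"
  elif [ "$db_corrupted" = "yes" ]; then
    echo "RDS Point-in-Time Recovery (§5.2)"
  else
    echo "ECS Full Rollback + DB Migration Rollback"
  fi
}
```

For example, choose_rollback no no no all yes corresponds to the "not responding, all services, DB corrupted" path.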

SLA targets:

Single service rollback: < 5 minutes
Full environment rollback: < 15 minutes
Database recovery: < 30 minutes (Point-in-Time)

9. Post-Rollback Verification

After any rollback, verify the following:

9.1 Service Health

# Check the public health endpoint through the ALB.
# The ALB URL is the same for every service, so one request is enough;
# per-service task health is covered in 9.2.
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
  "https://shieldai-${ENVIRONMENT}-alb.us-east-1.elb.amazonaws.com/health")
echo "ALB /health: HTTP $HTTP_CODE"

9.2 ECS Service Status

# Verify all services are stable
for svc in api darkwatch spamshield voiceprint; do
  RUNNING=$(aws ecs describe-services \
    --cluster "shieldai-${ENVIRONMENT}" \
    --services "shieldai-${ENVIRONMENT}-${svc}" \
    --query 'services[0].runningCount' --output text)
  DESIRED=$(aws ecs describe-services \
    --cluster "shieldai-${ENVIRONMENT}" \
    --services "shieldai-${ENVIRONMENT}-${svc}" \
    --query 'services[0].desiredCount' --output text)
  echo "$svc: $RUNNING/$DESIRED running"
done
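
The loop above prints the counts but leaves the comparison to the reader. A small filter can turn that into a pass/fail result; this is a sketch, with check_counts as a hypothetical name and the "service running desired" input format as an assumption.

```shell
# Reads lines of "<service> <running> <desired>" from stdin and reports
# whether each service is at its desired count. Returns non-zero if any
# service is short.
check_counts() {
  local status=0 svc running desired
  while read -r svc running desired; do
    if [ "$running" = "$desired" ]; then
      echo "$svc: OK ($running/$desired)"
    else
      echo "$svc: DEGRADED ($running/$desired)"
      status=1
    fi
  done
  return "$status"
}
```

It could be fed by changing the echo in the loop to emit "$svc $RUNNING $DESIRED" and piping the loop into check_counts.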

9.3 Database Connectivity

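# Verify database connection
# execute-command targets a task (not a service), so look up a running task first
TASK_ARN=$(aws ecs list-tasks \
  --cluster "shieldai-${ENVIRONMENT}" \
  --service-name "shieldai-${ENVIRONMENT}-api" \
  --query 'taskArns[0]' --output text)

aws ecs execute-command \
  --cluster "shieldai-${ENVIRONMENT}" \
  --task "$TASK_ARN" \
  --command "npx drizzle-kit status" \
  --interactive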

9.4 CloudWatch Verification

Navigate to CloudWatch dashboard: shieldai-${ENVIRONMENT}-dashboard
Verify CPU/Memory utilization is within normal range
Verify ALB 5xx errors have returned to baseline
Verify no new alarms are in ALARM state

10. Testing Checklist

10.1 ECS Rollback Test

Deploy a known-bad image (e.g., image with /health returning 500)
Verify CI/CD health check fails within 5 minutes
Verify rollback job triggers automatically
Verify all four services revert to previous task definition
Verify health check passes post-rollback
Verify CloudWatch metrics show recovery

10.2 Manual Script Test

Run ./infra/scripts/rollback.sh staging api on staging
Verify single service rolls back correctly
Run ./infra/scripts/rollback.sh staging all on staging
Verify all services roll back correctly
Verify script exits with code 0 on success
Verify script exits with code 1 on failure
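
The exit-code items in this checklist can be checked mechanically. A minimal sketch, where expect_exit is a hypothetical helper and the command under test is whatever follows the expected code:

```shell
# Run a command, discard its output, and compare its exit code with the
# expected one. Returns non-zero when they differ.
expect_exit() {
  local expected="$1" actual=0
  shift
  "$@" >/dev/null 2>&1 || actual=$?
  if [ "$actual" -eq "$expected" ]; then
    echo "PASS: '$*' exited with ${actual}"
  else
    echo "FAIL: '$*' exited with ${actual}, expected ${expected}"
    return 1
  fi
}
```

On staging this might be used as expect_exit 0 ./infra/scripts/rollback.sh staging api. Provoking the failure path (exit code 1) needs a deliberately broken target, which is left to the test plan.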

10.3 Docker Compose Rollback Test

Deploy v2.0.0 of all services via docker-compose.prod.yml
Rollback to v1.0.0 using DOCKER_TAG override
Verify all services restart with previous images
Verify health endpoints respond correctly

10.4 Database Migration Rollback Test

Apply a test migration on staging
Run migration rollback procedure
Verify schema matches pre-migration state
Verify application connects and functions correctly

10.5 RDS Point-in-Time Recovery Test

Create a test RDS instance
Insert test data
Restore to point before data insertion
Verify restored instance has correct data state
Clean up test instance

10.6 End-to-End Rollback Drills

Drill	Frequency	Participants
ECS service rollback	Monthly	Senior Engineer
Full environment rollback	Quarterly	Full engineering team
Database recovery	Quarterly	Senior Engineer + Founding Engineer
Blue-green rollback	Quarterly	Full engineering team

11. Runbook: Emergency Rollback

11.1 Symptoms

ALB 5xx error rate > 10/minute for 3+ minutes
CloudWatch alarm: shieldai-production-alb-5xx in ALARM state
Customer-reported service degradation

11.2 Immediate Actions (0-5 minutes)

# 1. Confirm environment and scope
ENVIRONMENT="production"

# 2. Check service status
aws ecs describe-services \
  --cluster "shieldai-${ENVIRONMENT}" \
  --services \
    "shieldai-${ENVIRONMENT}-api" \
    "shieldai-${ENVIRONMENT}-darkwatch" \
    "shieldai-${ENVIRONMENT}-spamshield" \
    "shieldai-${ENVIRONMENT}-voiceprint" \
  --query 'services[*].{Name:serviceName,Running:runningCount,Desired:desiredCount,Status:status}'

# 3. Check ALB health
curl -s -o /dev/null -w "%{http_code}" \
  "https://shieldai-${ENVIRONMENT}-alb.us-east-1.elb.amazonaws.com/health"

# 4. Execute rollback
./infra/scripts/rollback.sh ${ENVIRONMENT} all

11.3 Verification (5-10 minutes)

# 1. Wait for services to stabilize
aws ecs wait services-stable \
  --cluster "shieldai-${ENVIRONMENT}" \
  --services \
    "shieldai-${ENVIRONMENT}-api" \
    "shieldai-${ENVIRONMENT}-darkwatch" \
    "shieldai-${ENVIRONMENT}-spamshield" \
    "shieldai-${ENVIRONMENT}-voiceprint"

# 2. Verify health endpoint
curl -sf "https://shieldai-${ENVIRONMENT}-alb.us-east-1.elb.amazonaws.com/health" \
  && echo "Health: OK" || echo "Health: FAIL"

# 3. Check CloudWatch for recovery
# Navigate to CloudWatch dashboard and verify metrics

11.4 Communication Template

## Rollback Notification

**Environment:** production
**Time:** $(date -u '+%Y-%m-%d %H:%M UTC')
**Trigger:** [ALB 5xx alarm / manual / CI/CD health check]
**Action:** Rolled back all services to previous deployment
**Status:** [In Progress / Verified / Resolved]
**Next steps:** [Post-mortem / monitoring / investigation]
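
Since the template already embeds a date command, it can be filled in from the shell. A sketch, assuming the bracketed fields are passed as arguments; render_rollback_notice is a hypothetical name and the "Action" line is kept as written in the template.

```shell
# Print the rollback notification with the given fields filled in.
#   $1 environment   $2 trigger   $3 status   $4 next steps
render_rollback_notice() {
  local environment="$1" trigger="$2" status="$3" next_steps="$4"
  cat <<EOF
## Rollback Notification

**Environment:** ${environment}
**Time:** $(date -u '+%Y-%m-%d %H:%M UTC')
**Trigger:** ${trigger}
**Action:** Rolled back all services to previous deployment
**Status:** ${status}
**Next steps:** ${next_steps}
EOF
}
```

For example: render_rollback_notice production "ALB 5xx alarm" "In Progress" "Post-mortem".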

11.5 Post-Incident

Create incident ticket with timeline
Document root cause
Update runbook if procedure changed
Schedule post-mortem within 48 hours
Create follow-up issues for preventive measures

Appendix A: Quick Reference

Resource	Command
Rollback script	`./infra/scripts/rollback.sh <env> <service|all>`
ECS service status	`aws ecs describe-services --cluster shieldai-<env> --services shieldai-<env>-<svc>`
ALB health check	`curl -s -o /dev/null -w "%{http_code}" https://shieldai-<env>-alb.us-east-1.elb.amazonaws.com/health`
RDS snapshots	`aws rds describe-db-snapshots --db-instance-identifier shieldai-<env>-db`
CloudWatch dashboard	`https://us-east-1.console.aws.amazon.com/cloudwatch/home#dashboards/dashboard/shieldai-<env>-dashboard`
ECS task logs	`aws logs filter-log-events --log-group-name /ecs/shieldai-<env>-<svc>`

Appendix B: Environment Variables

Variable	Description	Required
`AWS_ACCESS_KEY_ID`	Access key ID for an IAM user with ECS and RDS permissions	Yes
`AWS_SECRET_ACCESS_KEY`	IAM secret key	Yes
`AWS_DEFAULT_REGION`	AWS region (default: us-east-1)	Yes
`GITHUB_REPOSITORY_OWNER`	GitHub org/user for container registry	Docker Compose only
`DOCKER_TAG`	Container image tag to deploy	Docker Compose only
`POSTGRES_PASSWORD`	Database password	Docker Compose only
