ShieldAI Rollback Runbook

Last updated: 2026-05-12
Owner: Senior Engineer
Parent: FRE-4574 ShieldAI Production Infrastructure & CI/CD Pipeline
Reviewed by: Code Reviewer (FRE-4808) on 2026-05-12


Table of Contents

  1. Overview
  2. Rollback Strategies
  3. ECS Service Rollback (AWS)
  4. Docker Compose Rollback (Local / Staging)
  5. Database Migration Rollback
  6. Automated Rollback Triggers
  7. Blue-Green Deployment Rollback
  8. Rollback Decision Tree
  9. Post-Rollback Verification
  10. Testing Checklist
  11. Runbook: Emergency Rollback

1. Overview

ShieldAI runs four services (api, darkwatch, spamshield, voiceprint) on AWS ECS Fargate behind an Application Load Balancer. Each service has independent deployment, health checks, and rollback capability.

Rollback types:

| Type | Trigger | Scope | Automation |
|------|---------|-------|------------|
| ECS Service Rollback | Health check failure, manual | Single or all services | CI/CD + manual script |
| Docker Compose Rollback | Manual (local/staging) | All services | Scripted |
| Database Migration Rollback | Manual | Schema changes | ⚠️ Semi-manual |
| Blue-Green Rollback | Manual or automated | Full environment | CI/CD |
| RDS Point-in-Time Restore | Manual (disaster) | Full database | ⚠️ Semi-manual |

2. Rollback Strategies

2.1 ECS Service-Level Rollback

Each ECS service maintains a history of task definitions. Rolling back reverts to the previous successfully deployed task definition.

Prerequisites:

  • AWS CLI configured with credentials for the target environment
  • IAM permissions: ecs:UpdateService, ecs:DescribeServices (aws ecs wait services-stable polls DescribeServices; there is no separate ecs:WaitServicesStable action)
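
As a minimal sketch of what "reverting to the previous task definition" involves, the helper below derives the family:revision string for the prior revision from a task definition ARN. The function name, account ID, and revision number are hypothetical, and it assumes that revision N-1 is the last known-good deployment, which only holds if no other revisions were registered in between.

```shell
#!/bin/sh
# Hypothetical helper: derive "family:revision-1" from a task definition ARN.
# Assumption: the revision immediately before the current one is known-good.
previous_task_definition() {
  arn="$1"
  name="${arn##*/}"          # drop everything up to "task-definition/"
  family="${name%:*}"        # e.g. shieldai-production-api
  revision="${name##*:}"     # e.g. 42
  case "$revision" in
    ''|*[!0-9]*) echo "not a task definition ARN: $arn" >&2; return 1 ;;
  esac
  if [ "$revision" -le 1 ]; then
    echo "no earlier revision for $family" >&2
    return 1
  fi
  echo "${family}:$((revision - 1))"
}

# Example with a made-up account ID and revision:
previous_task_definition \
  "arn:aws:ecs:us-east-1:123456789012:task-definition/shieldai-production-api:42"
```

The printed value can be passed to aws ecs update-service via its --task-definition option.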

2.2 Blue-Green Rollback

The CI/CD pipeline deploys new images to existing ECS services. If health checks fail after deployment, the rollback job in the deploy workflow automatically reverts all four services to their previous task definition revision.

Pipeline flow:

build-and-push → deploy-ecs → health-check → [PASS: done | FAIL: rollback]

2.3 Database Migration Rollback

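ShieldAI uses Drizzle ORM for database migrations. Each migration is versioned and stored in src/db/migrations/. Drizzle Kit does not generate down migrations, so rolling back means applying a hand-written reverse SQL script and clearing the migration's record (see §5).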


3. ECS Service Rollback (AWS)

3.1 Automated (CI/CD Pipeline)

The deploy workflow (.github/workflows/deploy.yml) includes a rollback job that triggers on health check failure:

rollback:
  needs: [deploy-ecs, health-check]
  if: failure() && needs.health-check.result == 'failure'
  # Rolls back all 4 services to previous task definition

When it runs:

  • Post-deploy health check fails (HTTP 200 not received from /health)
  • Runs after deploy-ecs and health-check jobs
  • Rolls back all four services: api, darkwatch, spamshield, voiceprint

How to verify:

  1. Navigate to the GitHub Actions run for the failed deployment
  2. Check the Rollback on Failure job logs
  3. Confirm each service shows "Rolled back" status

3.2 Manual Rollback Script

# Single service
./infra/scripts/rollback.sh production api

# All services
./infra/scripts/rollback.sh production all

# Staging environment
./infra/scripts/rollback.sh staging all

Script behavior:

  1. Iterates over target services (or all if all specified)
  2. Calls aws ecs update-service for each service, pointing it at the previous task definition revision (the AWS CLI has no update-service --rollback flag)
  3. Waits for service to stabilize via aws ecs wait services-stable
  4. Reports success/failure per service
  5. Exits with non-zero code if any service fails to stabilize
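
The behavior listed above can be sketched roughly as follows. This is not the contents of infra/scripts/rollback.sh; the function names are hypothetical, and rollback_one is a placeholder standing in for the real AWS CLI calls so that the control flow (expand "all", continue past failures, return non-zero if anything failed) is visible on its own.

```shell
#!/bin/sh
# Sketch only: the real infra/scripts/rollback.sh may differ.
SERVICES="api darkwatch spamshield voiceprint"

# Expand "all" to every service; reject unknown names.
resolve_services() {
  case "$1" in
    all) echo "$SERVICES" ;;
    api|darkwatch|spamshield|voiceprint) echo "$1" ;;
    *) echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# Placeholder for the per-service rollback. Replace the body with the
# actual AWS CLI calls (update-service, then wait services-stable).
rollback_one() {
  echo "would roll back $2 in cluster shieldai-$1"
}

# Roll back each target, keep going on failure, and return non-zero
# if any service failed.
rollback_services() {
  env_name="$1"
  targets=$(resolve_services "$2") || return 2
  failed=""
  for svc in $targets; do
    echo "Rolling back ${svc}..."
    if rollback_one "$env_name" "$svc"; then
      echo "${svc} rolled back successfully"
    else
      echo "${svc} FAILED to roll back" >&2
      failed="${failed} ${svc}"
    fi
  done
  [ -z "$failed" ]
}

rollback_services staging all
```

Continuing past a failed service, rather than aborting on the first error, means one stuck service does not block the other three from rolling back.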

Expected output:

Rolling back services in cluster: shieldai-production
Rolling back api...
Waiting for api to stabilize...
api rolled back successfully
Rolling back darkwatch...
Waiting for darkwatch to stabilize...
darkwatch rolled back successfully
...
Rollback complete for api darkwatch spamshield voiceprint

3.3 Manual CLI Rollback (Fallback)

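If the script is unavailable, roll back individual services:

CLUSTER="shieldai-production"
SERVICE="api"

# Find the task definition currently deployed
CURRENT_TD=$(aws ecs describe-services \
  --cluster "$CLUSTER" \
  --services "${CLUSTER}-${SERVICE}" \
  --query 'services[0].taskDefinition' --output text)

# Roll back to the previous revision. update-service has no --rollback flag,
# so the target revision is passed explicitly; confirm that revision N-1 is
# the last known-good deployment before running this.
FAMILY_REV="${CURRENT_TD##*/}"
PREVIOUS_TD="${FAMILY_REV%:*}:$(( ${FAMILY_REV##*:} - 1 ))"
aws ecs update-service \
  --cluster "$CLUSTER" \
  --service "${CLUSTER}-${SERVICE}" \
  --task-definition "$PREVIOUS_TD" \
  --no-cli-auto-prompt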

# Wait for stabilization
aws ecs wait services-stable \
  --cluster "$CLUSTER" \
  --services "${CLUSTER}-${SERVICE}"

# Verify health
curl -s -o /dev/null -w "%{http_code}" \
  "https://shieldai-production-alb.us-east-1.elb.amazonaws.com/health"

4. Docker Compose Rollback (Local / Staging)

4.1 Production Compose Rollback

The docker-compose.prod.yml deploys all services with tagged images. To roll back:

# 1. Identify the previous working tag
# Check GitHub releases or git tags for the last known good version
PREVIOUS_TAG="v1.2.3"

# 2. Stop current services
docker compose -f docker-compose.prod.yml down

# 3. Pull previous images
docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-api:${PREVIOUS_TAG}
docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-darkwatch:${PREVIOUS_TAG}
docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-spamshield:${PREVIOUS_TAG}
docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-voiceprint:${PREVIOUS_TAG}

# 4. Override tag in compose
DOCKER_TAG=${PREVIOUS_TAG} docker compose -f docker-compose.prod.yml up -d

# 5. Verify health
for svc in api darkwatch spamshield voiceprint; do
  PORT=$(case $svc in
    api) echo 3000;; darkwatch) echo 3001;;
    spamshield) echo 3002;; voiceprint) echo 3003;;
  esac)
  curl -sf "http://localhost:${PORT}/health" && echo "$svc: OK" || echo "$svc: FAIL"
done
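
The four docker pull lines differ only in the service name, so they can be generated. The sketch below prints the image references for a given owner and tag and refuses an empty tag or "latest", since neither pins a known-good version. The function name and the example-org owner are placeholders.

```shell
#!/bin/sh
# Print the four image references for a registry owner and tag.
# Rejects an empty tag or "latest" because neither identifies a
# specific known-good release.
rollback_images() {
  owner="$1"
  tag="$2"
  case "$tag" in
    ''|latest) echo "refusing to roll back to tag '${tag}'" >&2; return 1 ;;
  esac
  for svc in api darkwatch spamshield voiceprint; do
    echo "ghcr.io/${owner}/shieldai-${svc}:${tag}"
  done
}

# "example-org" is a placeholder owner.
rollback_images example-org v1.2.3
```

In practice the output could be piped to the pull step, for example: rollback_images "$GITHUB_REPOSITORY_OWNER" "$PREVIOUS_TAG" | xargs -n1 docker pull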

4.2 Local Dev Rollback

# Stop and remove containers
docker compose down

# Rebuild from previous commit
git checkout <previous-commit>
docker compose up -d --build

5. Database Migration Rollback

5.1 Drizzle Migration Rollback

ShieldAI uses Drizzle ORM with the PostgreSQL dialect (the database runs on RDS and is reached through a postgresql:// URL). Migrations are stored in src/db/migrations/.

# 1. Get database credentials from AWS Secrets Manager
DB_SECRET=$(aws secretsmanager get-secret-value \
  --secret-id "shieldai-${ENVIRONMENT}-db-password" \
  --query 'SecretString' --output text)

DB_HOST=$(echo "$DB_SECRET" | jq -r '.host')
DB_PORT=$(echo "$DB_SECRET" | jq -r '.port')
DB_USER=$(echo "$DB_SECRET" | jq -r '.username')
DB_PASS=$(echo "$DB_SECRET" | jq -r '.password')

export DATABASE_URL="postgresql://${DB_USER}:${DB_PASS}@${DB_HOST}:${DB_PORT}/shieldai"

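# 2. List applied migrations to identify the one to revert.
#    Drizzle records them in drizzle.__drizzle_migrations (the default table for PostgreSQL).
psql "$DATABASE_URL" -c 'SELECT id, hash, created_at FROM drizzle.__drizzle_migrations ORDER BY created_at DESC LIMIT 5;'

# 3. Apply a hand-written reverse script for the problematic migration.
#    Drizzle Kit has no down-migration or migrate:resolve command.
psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -f "<reverse_migration>.sql"

# 4. Remove the migration's row so Drizzle no longer treats it as applied
psql "$DATABASE_URL" -c 'DELETE FROM drizzle.__drizzle_migrations WHERE id = <migration_id>;'

# 5. After removing or fixing the migration file in src/db/migrations/,
#    confirm that no unexpected migrations are pending
npx drizzle-kit migrate --config=drizzle.config.ts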

5.2 RDS Point-in-Time Recovery (Disaster)

When the database itself needs recovery (e.g., data corruption, bad migration):

# 1. Find available recovery window (automated backups: every 24h, 7-14 day retention)
aws rds describe-db-instances \
  --db-instance-identifier "shieldai-production-db" \
  --query 'DBInstances[0].LatestRestorableTime'

# 2. Create restored instance (does not affect primary)
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier "shieldai-production-db" \
  --db-instance-identifier "shieldai-production-db-restored" \
  --restore-time "2026-05-09T08:00:00Z"

# 3. Verify restored instance
aws rds wait db-instance-available \
  --db-instance-identifier "shieldai-production-db-restored"

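# 4. Point the DB secret in Secrets Manager at the restored instance
RESTORED_HOST=$(aws rds describe-db-instances \
  --db-instance-identifier "shieldai-production-db-restored" \
  --query 'DBInstances[0].Endpoint.Address' --output text)
DB_SECRET=$(aws secretsmanager get-secret-value \
  --secret-id "shieldai-production-db-password" \
  --query 'SecretString' --output text)
aws secretsmanager put-secret-value \
  --secret-id "shieldai-production-db-password" \
  --secret-string "$(echo "$DB_SECRET" | jq -c --arg host "$RESTORED_HOST" '.host = $host')"

# 5. Force a new deployment so tasks pick up the new DB endpoint
#    (the task definition is unchanged, so a rollback is not what is needed here)
for svc in api darkwatch spamshield voiceprint; do
  aws ecs update-service \
    --cluster "shieldai-production" \
    --service "shieldai-production-${svc}" \
    --force-new-deployment
done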
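
Before starting a restore, it helps to confirm that the chosen --restore-time falls inside the recoverable window, since a restore request outside it is rejected. The sketch below compares three ISO-8601 timestamps. The function name and the timestamps are illustrative, and it relies on GNU date (-d), so it will not run unchanged on BSD/macOS. The latest bound corresponds to LatestRestorableTime from step 1; the earliest bound is assumed to come from the RestoreWindow reported by aws rds describe-db-instance-automated-backups.

```shell
#!/bin/sh
# Return 0 if $3 (requested restore time) lies within [$1, $2].
# Uses GNU date (-d) to parse ISO-8601 timestamps; BSD/macOS date differs.
restore_time_in_window() {
  earliest=$(date -u -d "$1" +%s) || return 2
  latest=$(date -u -d "$2" +%s) || return 2
  requested=$(date -u -d "$3" +%s) || return 2
  [ "$requested" -ge "$earliest" ] && [ "$requested" -le "$latest" ]
}

# Illustrative timestamps only.
if restore_time_in_window "2026-05-02T03:00:00Z" "2026-05-09T09:55:00Z" "2026-05-09T08:00:00Z"; then
  echo "restore time is inside the recovery window"
else
  echo "restore time is outside the recovery window"
fi
```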

5.3 RDS Snapshot Restore

# 1. List available snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier "shieldai-production-db"

# 2. Restore from specific snapshot
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "shieldai-production-db-restored" \
  --db-snapshot-identifier "rds:shieldai-production-db-2026-05-08-03-00" \
  --db-instance-class "db.t3.medium" \
  --vpc-security-group-ids "$(terraform -chdir=infra output -raw vpc_security_group_id)"

# 3. Follow steps 3-5 from Point-in-Time Recovery above

6. Automated Rollback Triggers

6.1 CI/CD Health Check Failure

Trigger: Post-deploy health check returns non-200 from /health

Pipeline job: rollback in .github/workflows/deploy.yml

Condition: if: failure() && needs.health-check.result == 'failure'

Action: Rolls back all four ECS services to previous task definition

Timeout: Health check retries for 5 minutes before triggering rollback
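
A retry-with-deadline loop of the kind described here might look like the sketch below. It is not taken from deploy.yml; the function names and the 10-second polling interval are assumptions, and only the 5-minute (300-second) budget and the HTTP 200 requirement come from this runbook.

```shell
#!/bin/sh
# Run a probe command repeatedly until it succeeds or the deadline passes.
# Usage: retry_until <timeout_seconds> <interval_seconds> <command...>
retry_until() {
  timeout="$1"; interval="$2"; shift 2
  deadline=$(( $(date +%s) + timeout ))
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# Probe: succeed only when the URL answers with HTTP 200.
probe_health() {
  [ "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1")" = "200" ]
}

# Example (substitute the URL the pipeline's health-check step targets):
#   retry_until 300 10 probe_health "https://<alb-host>/health" \
#     || echo "health check failed; rollback should run"
```

Passing the probe as a command keeps the timing logic separate from the HTTP check, so the same loop can wrap any readiness test.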

6.2 ECS Container Health Check

Each container has an in-container health check defined in the ECS task definition:

"healthCheck": {
  "command": ["CMD-SHELL", "wget -q --spider http://localhost:{port}/health || exit 1"],
  "interval": 30,
  "timeout": 5,
  "retries": 3,
  "startPeriod": 60
}

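Failure consequence: ECS marks the container unhealthy after 3 consecutive failures (about 90 seconds at a 30-second interval), and the ALB marks the target unhealthy after 3 failed health checks (also about 90 seconds). ECS then stops the unhealthy task, the target drains from the ALB, and the service scheduler starts a replacement task.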
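
The 90-second figure is simply interval × consecutive failures. The small calculation below makes that explicit and also shows the approximate worst case for a container that is broken from launch, assuming failures during startPeriod are not counted toward the retry limit. The function name is hypothetical and the results are estimates that ignore the per-check timeout.

```shell
#!/bin/sh
# Approximate seconds until a check flips to unhealthy:
# (consecutive failures needed x interval) plus an optional start period
# during which failures are not counted.
seconds_to_unhealthy() {
  interval="$1"; failures="$2"; start_period="${3:-0}"
  echo $(( start_period + interval * failures ))
}

seconds_to_unhealthy 30 3       # steady-state container or ALB check
seconds_to_unhealthy 30 3 60    # container that fails from launch
```

With the values in this runbook, the two calls print 90 and 150.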

6.3 ALB Target Group Health Check

The ALB performs HTTP health checks against /health on each target:

| Parameter | Value |
|-----------|-------|
| Interval | 30s |
| Timeout | 5s |
| Healthy threshold | 3 |
| Unhealthy threshold | 3 |
| Expected code | 200 |

6.4 CloudWatch Alarms

The following alarms are configured in infra/modules/cloudwatch/main.tf:

| Alarm | Threshold | Action |
|-------|-----------|--------|
| ECS CPU >80% | 80% for 2 periods (10min) | SNS notification |
| ECS Memory >85% | 85% for 2 periods (10min) | SNS notification |
| ALB 5xx >10/min | 10 for 3 periods (3min) | SNS notification |
| RDS CPU >75% | 75% for 2 periods (10min) | SNS notification |
| RDS Free Storage <500MB | 500MB for 2 periods (10min) | SNS notification |

Alarm escalation path:

  1. CloudWatch alarm fires
  2. SNS notification sent to on-call engineer
  3. Engineer evaluates: if service is degraded, trigger manual rollback
  4. If root cause is deployment-related, run ./infra/scripts/rollback.sh production all
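
When evaluating an alert in step 3, it can help to see every alarm that is currently firing rather than only the one that paged. A minimal sketch follows; the function name is hypothetical, and the shieldai-<env> name prefix is an assumption inferred from the alarm name shieldai-production-alb-5xx used elsewhere in this runbook.

```shell
#!/bin/sh
# List alarms currently in ALARM state whose names start with shieldai-<env>.
# Assumption: all ShieldAI alarms share the shieldai-<env> name prefix.
firing_alarms() {
  aws cloudwatch describe-alarms \
    --state-value ALARM \
    --alarm-name-prefix "shieldai-$1" \
    --query 'MetricAlarms[].AlarmName' \
    --output text
}

# Example:
#   firing_alarms production
```

An empty result means no matching alarm is in ALARM state at the moment of the call.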

7. Blue-Green Deployment Rollback

7.1 Architecture

ShieldAI uses ECS services with rolling deployments. Each deployment creates a new task definition revision. The ALB routes traffic to healthy targets only.

Rollback mechanism: the rollback points each service back at its previous task definition revision (aws ecs update-service has no --rollback flag, so the target revision is passed explicitly). This behaves like a blue-green swap because:

  1. Old task definition (blue) remains registered
  2. New task definition (green) is deployed
  3. On rollback, ECS reverts to blue task definition
  4. ALB automatically routes to healthy (blue) targets

7.2 Blue-Green Rollback Procedure

# 1. Check current deployment state
aws ecs list-services --cluster shieldai-production
aws ecs describe-services --cluster shieldai-production \
  --services shieldai-production-api \
  --query 'services[0].deployments'

# 2. Identify previous deployment
# The deployment with status "PRIMARY" is the most recent one.
# While a rollout is still in progress, the older deployment is listed as "ACTIVE";
# once the rollout completes, only PRIMARY remains.

# 3. Execute rollback (script handles all services)
./infra/scripts/rollback.sh production all

# 4. Verify rollback
aws ecs describe-services --cluster shieldai-production \
  --services shieldai-production-api \
  --query 'services[0].deployments[?status==`PRIMARY`].taskDefinition'

7.3 Docker Compose Blue-Green (Local)

For local/staging environments using Docker Compose, implement blue-green via service version pinning:

# Current deployment uses DOCKER_TAG env var
# Rollback by setting DOCKER_TAG to previous version

# Save current tag
CURRENT_TAG=$(grep -E '^DOCKER_TAG=' .env.prod 2>/dev/null | cut -d= -f2)
CURRENT_TAG="${CURRENT_TAG:-latest}"  # fall back when the file or variable is missing

# Rollback to previous
export DOCKER_TAG="v1.2.3"
docker compose -f docker-compose.prod.yml up -d

# Verify all services
docker compose -f docker-compose.prod.yml ps

8. Rollback Decision Tree

Is the service responding?
├── YES → Is the response correct?
│   ├── YES → Monitor, no action needed
│   └── NO → Is it a data issue?
│       ├── YES → Database Migration Rollback (§5)
│       └── NO → ECS Service Rollback (§3)
└── NO → Is it a single service or all?
    ├── Single → ECS Service Rollback (§3, specific service)
    └── All → Full Environment Rollback
        └── Is DB corrupted?
            ├── YES → RDS Point-in-Time Recovery (§5.2)
            └── NO → ECS Full Rollback + DB Migration Rollback
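
For on-call tooling, the tree above can be encoded as a small function. This is only a sketch: the function name, argument order, and the action labels it prints are hypothetical, chosen to mirror the branches of the tree.

```shell
#!/bin/sh
# Encode the decision tree. Arguments are yes/no answers, except scope.
#   $1 responding   $2 response_correct   $3 data_issue
#   $4 scope (single|all)   $5 db_corrupted
rollback_action() {
  if [ "$1" = yes ]; then
    if [ "$2" = yes ]; then
      echo "monitor"
    elif [ "$3" = yes ]; then
      echo "db-migration-rollback"
    else
      echo "ecs-service-rollback"
    fi
  elif [ "$4" = single ]; then
    echo "ecs-service-rollback"
  elif [ "$5" = yes ]; then
    echo "rds-point-in-time-recovery"
  else
    echo "ecs-full-rollback+db-migration-rollback"
  fi
}

# All services down and the database is corrupted:
rollback_action no no no all yes
```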

SLA targets:

  • Single service rollback: < 5 minutes
  • Full environment rollback: < 15 minutes
  • Database recovery: < 30 minutes (Point-in-Time)

9. Post-Rollback Verification

After any rollback, verify the following:

9.1 Service Health

# Check the ALB health endpoint (all four services sit behind the same ALB;
# per-service checks depend on the listener routing rules)
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
  "https://shieldai-${ENVIRONMENT}-alb.us-east-1.elb.amazonaws.com/health")
echo "ALB /health: HTTP $HTTP_CODE"

9.2 ECS Service Status

# Verify all services are stable
for svc in api darkwatch spamshield voiceprint; do
  RUNNING=$(aws ecs describe-services \
    --cluster "shieldai-${ENVIRONMENT}" \
    --services "shieldai-${ENVIRONMENT}-${svc}" \
    --query 'services[0].runningCount' --output text)
  DESIRED=$(aws ecs describe-services \
    --cluster "shieldai-${ENVIRONMENT}" \
    --services "shieldai-${ENVIRONMENT}-${svc}" \
    --query 'services[0].desiredCount' --output text)
  echo "$svc: $RUNNING/$DESIRED running"
done
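
For scripted verification, a pass/fail exit status is easier to act on than printed counts. The sketch below fetches both counts in one describe-services call and returns 0 only when they match. The function name is hypothetical; the cluster and service naming follows the shieldai-<env>-<svc> convention used above.

```shell
#!/bin/sh
# Return 0 when runningCount equals desiredCount for one service.
# Arguments: environment, service (for example: production api).
service_is_stable() {
  counts=$(aws ecs describe-services \
    --cluster "shieldai-$1" \
    --services "shieldai-$1-$2" \
    --query 'services[0].[runningCount,desiredCount]' \
    --output text) || return 2
  set -- $counts          # split "running desired" into $1 and $2
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example:
#   for svc in api darkwatch spamshield voiceprint; do
#     service_is_stable "$ENVIRONMENT" "$svc" || echo "$svc: not stable"
#   done
```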

9.3 Database Connectivity

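# Verify database connection from inside a running api task (drizzle-kit has no status command).
# Assumes ECS Exec is enabled and the image ships psql with DATABASE_URL set; substitute your own client otherwise.
TASK_ARN=$(aws ecs list-tasks \
  --cluster "shieldai-${ENVIRONMENT}" \
  --service-name "shieldai-${ENVIRONMENT}-api" \
  --query 'taskArns[0]' --output text)
aws ecs execute-command \
  --cluster "shieldai-${ENVIRONMENT}" \
  --task "$TASK_ARN" \
  --interactive \
  --command "sh -c 'psql \$DATABASE_URL -c \"SELECT 1\"'"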

9.4 CloudWatch Verification

  1. Navigate to CloudWatch dashboard: shieldai-${ENVIRONMENT}-dashboard
  2. Verify CPU/Memory utilization is within normal range
  3. Verify ALB 5xx errors have returned to baseline
  4. Verify no new alarms are in ALARM state

10. Testing Checklist

10.1 ECS Rollback Test

  • Deploy a known-bad image (e.g., image with /health returning 500)
  • Verify CI/CD health check fails within 5 minutes
  • Verify rollback job triggers automatically
  • Verify all four services revert to previous task definition
  • Verify health check passes post-rollback
  • Verify CloudWatch metrics show recovery

10.2 Manual Script Test

  • Run ./infra/scripts/rollback.sh staging api on staging
  • Verify single service rolls back correctly
  • Run ./infra/scripts/rollback.sh staging all on staging
  • Verify all services roll back correctly
  • Verify script exits with code 0 on success
  • Verify script exits with code 1 on failure

10.3 Docker Compose Rollback Test

  • Deploy v2.0.0 of all services via docker-compose.prod.yml
  • Rollback to v1.0.0 using DOCKER_TAG override
  • Verify all services restart with previous images
  • Verify health endpoints respond correctly

10.4 Database Migration Rollback Test

  • Apply a test migration on staging
  • Run migration rollback procedure
  • Verify schema matches pre-migration state
  • Verify application connects and functions correctly

10.5 RDS Point-in-Time Recovery Test

  • Create a test RDS instance
  • Insert test data
  • Restore to point before data insertion
  • Verify restored instance has correct data state
  • Clean up test instance

10.6 End-to-End Rollback Drills

| Drill | Frequency | Participants |
|-------|-----------|--------------|
| ECS service rollback | Monthly | Senior Engineer |
| Full environment rollback | Quarterly | Full engineering team |
| Database recovery | Quarterly | Senior Engineer + Founding Engineer |
| Blue-green rollback | Quarterly | Full engineering team |

11. Runbook: Emergency Rollback

11.1 Symptoms

  • ALB 5xx error rate > 10/minute for 3+ minutes
  • CloudWatch alarm: shieldai-production-alb-5xx in ALARM state
  • Customer-reported service degradation

11.2 Immediate Actions (0-5 minutes)

# 1. Confirm environment and scope
ENVIRONMENT="production"

# 2. Check service status
aws ecs describe-services \
  --cluster "shieldai-${ENVIRONMENT}" \
  --services \
    "shieldai-${ENVIRONMENT}-api" \
    "shieldai-${ENVIRONMENT}-darkwatch" \
    "shieldai-${ENVIRONMENT}-spamshield" \
    "shieldai-${ENVIRONMENT}-voiceprint" \
  --query 'services[*].{Name:serviceName,Running:runningCount,Desired:desiredCount,Status:status}'

# 3. Check ALB health
curl -s -o /dev/null -w "%{http_code}" \
  "https://shieldai-${ENVIRONMENT}-alb.us-east-1.elb.amazonaws.com/health"

# 4. Execute rollback
./infra/scripts/rollback.sh ${ENVIRONMENT} all

11.3 Verification (5-10 minutes)

# 1. Wait for services to stabilize
aws ecs wait services-stable \
  --cluster "shieldai-${ENVIRONMENT}" \
  --services \
    "shieldai-${ENVIRONMENT}-api" \
    "shieldai-${ENVIRONMENT}-darkwatch" \
    "shieldai-${ENVIRONMENT}-spamshield" \
    "shieldai-${ENVIRONMENT}-voiceprint"

# 2. Verify health endpoint
curl -sf "https://shieldai-${ENVIRONMENT}-alb.us-east-1.elb.amazonaws.com/health" \
  && echo "Health: OK" || echo "Health: FAIL"

# 3. Check CloudWatch for recovery
# Navigate to CloudWatch dashboard and verify metrics

11.4 Communication Template

## Rollback Notification

**Environment:** production
**Time:** $(date -u '+%Y-%m-%d %H:%M UTC')
**Trigger:** [ALB 5xx alarm / manual / CI/CD health check]
**Action:** Rolled back all services to previous deployment
**Status:** [In Progress / Verified / Resolved]
**Next steps:** [Post-mortem / monitoring / investigation]
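
Because the template already embeds a $(date ...) substitution, it can be filled in from the shell. A minimal sketch, with a hypothetical function name and example argument values:

```shell
#!/bin/sh
# Fill in the notification template.
# Arguments: environment, trigger, status, next steps.
rollback_notification() {
  cat <<EOF
## Rollback Notification

**Environment:** $1
**Time:** $(date -u '+%Y-%m-%d %H:%M UTC')
**Trigger:** $2
**Action:** Rolled back all services to previous deployment
**Status:** $3
**Next steps:** $4
EOF
}

rollback_notification production "ALB 5xx alarm" "In Progress" "monitoring"
```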

11.5 Post-Incident

  1. Create incident ticket with timeline
  2. Document root cause
  3. Update runbook if procedure changed
  4. Schedule post-mortem within 48 hours
  5. Create follow-up issues for preventive measures

Appendix A: Quick Reference

| Resource | Command |
|----------|---------|
| Rollback script | ./infra/scripts/rollback.sh <env> <service\|all> |
| ECS service status | aws ecs describe-services --cluster shieldai-<env> --services shieldai-<env>-<svc> |
| ALB health check | curl -s -o /dev/null -w "%{http_code}" https://shieldai-<env>-alb.us-east-1.elb.amazonaws.com/health |
| RDS snapshots | aws rds describe-db-snapshots --db-instance-identifier shieldai-<env>-db |
| CloudWatch dashboard | https://us-east-1.console.aws.amazon.com/cloudwatch/home#dashboards/dashboard/shieldai-<env>-dashboard |
| ECS task logs | aws logs filter-log-events --log-group-name /ecs/shieldai-<env>-<svc> |

Appendix B: Environment Variables

| Variable | Description | Required |
|----------|-------------|----------|
| AWS_ACCESS_KEY_ID | IAM user with ECS, RDS permissions | Yes |
| AWS_SECRET_ACCESS_KEY | IAM secret key | Yes |
| AWS_DEFAULT_REGION | AWS region (default: us-east-1) | Yes |
| GITHUB_REPOSITORY_OWNER | GitHub org/user for container registry | Docker Compose only |
| DOCKER_TAG | Container image tag to deploy | Docker Compose only |
| POSTGRES_PASSWORD | Database password | Docker Compose only |