Add rollback procedure documentation and testing scripts (FRE-4808)

- infra/ROLLBACK.md: comprehensive rollback runbook with ECS, Docker Compose, database migration, blue-green, and emergency rollback procedures - infra/scripts/rollback.sh: enhanced ECS rollback with validation, logging, health verification, and per-service rollback support - infra/scripts/rollback-compose.sh: Docker Compose rollback for local/staging - infra/scripts/rollback-migration.sh: Drizzle migration rollback with AWS Secrets Manager integration - infra/scripts/test-rollback.sh: automated test suite (51 tests) - Updated infra/README.md to reference ROLLBACK.md Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-05-09 06:27:31 -04:00
parent 540ca5ebad
commit bce4787802
6 changed files with 1394 additions and 40 deletions
--- a/infra/README.md
+++ b/infra/README.md
@@ -14,7 +14,9 @@
 │   ├── staging/main.tf     # Staging environment config
 │   └── production/main.tf  # Production environment config
 └── scripts/
-    └── rollback.sh         # Manual rollback script
+    ├── rollback.sh             # ECS service rollback (AWS)
+    ├── rollback-compose.sh     # Docker Compose rollback (local/staging)
+    └── rollback-migration.sh   # Database migration rollback

 ## Quick Start

@@ -75,31 +77,28 @@ terraform apply -var-file=terraform.tfvars.example

 ## Rollback

-### Automatic (CI/CD)
-The deploy workflow triggers automatic rollback when health checks fail:
-```
-deploy-ecs → health-check (failure) → rollback
-```
+See **[ROLLBACK.md](./ROLLBACK.md)** for the complete rollback runbook, including:
+
+- ECS service rollback (automated + manual)
+- Docker Compose rollback (local / staging)
+- Database migration rollback (Drizzle)
+- Blue-green deployment rollback
+- RDS point-in-time recovery
+- Automated rollback triggers and health checks
+- Emergency rollback runbook
+- Testing checklist
+
+### Quick Reference

-### Manual
 ```bash
-# Rollback specific service
-cd infra/scripts
-./rollback.sh staging api
+# ECS service rollback (AWS)
+./infra/scripts/rollback.sh <environment> <service|all> [--verify]

-# Rollback all services
-./rollback.sh staging all
-```
+# Docker Compose rollback (local/staging)
+./infra/scripts/rollback-compose.sh <previous_tag>

-### Database Migration Rollback
-```bash
-# Run previous migration
-DATABASE_URL=$(aws secretsmanager get-secret-value \
-  --secret-id shieldai-staging-db-password \
-  --query 'SecretString' --output json | jq -r '.host')
-
-npx prisma migrate resolve --applied <migration_name>
-npx prisma migrate deploy
+# Database migration rollback
+./infra/scripts/rollback-migration.sh <environment> [--migration <name>]
 ```

 ## GitHub Secrets Required
--- a/infra/ROLLBACK.md
+++ b/infra/ROLLBACK.md
@@ -0,0 +1,610 @@
+# ShieldAI Rollback Runbook
+
+> **Last updated:** 2026-05-09  
+> **Owner:** Senior Engineer  
+> **Parent:** [FRE-4574](/FRE/issues/FRE-4574) ShieldAI Production Infrastructure & CI/CD Pipeline
+
+---
+
+## Table of Contents
+
+1. [Overview](#1-overview)
+2. [Rollback Strategies](#2-rollback-strategies)
+3. [ECS Service Rollback (AWS)](#3-ecs-service-rollback-aws)
+4. [Docker Compose Rollback (Local / Staging)](#4-docker-compose-rollback-local--staging)
+5. [Database Migration Rollback](#5-database-migration-rollback)
+6. [Automated Rollback Triggers](#6-automated-rollback-triggers)
+7. [Blue-Green Deployment Rollback](#7-blue-green-deployment-rollback)
+8. [Rollback Decision Tree](#8-rollback-decision-tree)
+9. [Post-Rollback Verification](#9-post-rollback-verification)
+10. [Testing Checklist](#10-testing-checklist)
+11. [Runbook: Emergency Rollback](#11-runbook-emergency-rollback)
+
+---
+
+## 1. Overview
+
+ShieldAI runs four services (api, darkwatch, spamshield, voiceprint) on AWS ECS Fargate behind an Application Load Balancer. Each service has independent deployment, health checks, and rollback capability.
+
+**Rollback types:**
+
+| Type | Trigger | Scope | Automation |
+|------|---------|-------|------------|
+| **ECS Service Rollback** | Health check failure, manual | Single or all services | ✅ CI/CD + manual script |
+| **Docker Compose Rollback** | Manual (local/staging) | All services | ✅ Scripted |
+| **Database Migration Rollback** | Manual | Schema changes | ⚠️ Semi-manual |
+| **Blue-Green Rollback** | Manual or automated | Full environment | ✅ CI/CD |
+| **RDS Point-in-Time Restore** | Manual (disaster) | Full database | ⚠️ Semi-manual |
+
+---
+
+## 2. Rollback Strategies
+
+### 2.1 ECS Service-Level Rollback
+
+Each ECS service maintains a history of task definitions. Rolling back reverts to the **previous successfully deployed task definition**.
+
+**Prerequisites:**
+- AWS CLI configured with credentials for the target environment
+- IAM permissions: `ecs:UpdateService`, `ecs:DescribeServices`, `ecs:WaitServicesStable`
+
+### 2.2 Blue-Green Rollback
+
+The CI/CD pipeline deploys new images to existing ECS services. If health checks fail after deployment, the `rollback` job in the deploy workflow automatically reverts all four services to their previous task definition revision.
+
+**Pipeline flow:**
+```
+build-and-push → deploy-ecs → health-check → [PASS: done | FAIL: rollback]
+```
+
+### 2.3 Database Migration Rollback
+
+ShieldAI uses Drizzle ORM for database migrations. Each migration is versioned and stored in `src/db/migrations/`. Rollback requires running the previous migration set.
+
+---
+
+## 3. ECS Service Rollback (AWS)
+
+### 3.1 Automated (CI/CD Pipeline)
+
+The deploy workflow (`.github/workflows/deploy.yml`) includes a `rollback` job that triggers on health check failure:
+
+```yaml
+rollback:
+  if: failure() && needs.health-check.result == 'failure'
+  # Rolls back all 4 services to previous task definition
+```
+
+**When it runs:**
+- Post-deploy health check fails (HTTP 200 not received from `/health`)
+- Runs after `deploy-ecs` and `health-check` jobs
+- Rolls back all four services: api, darkwatch, spamshield, voiceprint
+
+**How to verify:**
+1. Navigate to the GitHub Actions run for the failed deployment
+2. Check the `Rollback on Failure` job logs
+3. Confirm each service shows "Rolled back" status
+
+### 3.2 Manual Rollback Script
+
+```bash
+# Single service
+./infra/scripts/rollback.sh production api
+
+# All services
+./infra/scripts/rollback.sh production all
+
+# Staging environment
+./infra/scripts/rollback.sh staging all
+```
+
+**Script behavior:**
+1. Iterates over target services (or all if `all` specified)
+2. Calls `aws ecs update-service --rollback` for each service
+3. Waits for service to stabilize via `aws ecs wait services-stable`
+4. Reports success/failure per service
+5. Exits with non-zero code if any service fails to stabilize
+
+**Expected output:**
+```
+Rolling back services in cluster: shieldai-production
+Rolling back api...
+Waiting for api to stabilize...
+api rolled back successfully
+Rolling back darkwatch...
+Waiting for darkwatch to stabilize...
+darkwatch rolled back successfully
+...
+Rollback complete for api darkwatch spamshield voiceprint
+```
+
+### 3.3 Manual CLI Rollback (Fallback)
+
+If the script is unavailable, rollback individual services:
+
+```bash
+CLUSTER="shieldai-production"
+SERVICE="api"
+
+# Rollback to previous task definition
+aws ecs update-service \
+  --cluster "$CLUSTER" \
+  --service "${CLUSTER}-${SERVICE}" \
+  --rollback \
+  --no-cli-auto-prompt
+
+# Wait for stabilization
+aws ecs wait services-stable \
+  --cluster "$CLUSTER" \
+  --services "${CLUSTER}-${SERVICE}"
+
+# Verify health
+curl -s -o /dev/null -w "%{http_code}" \
+  "https://shieldai-production-alb.us-east-1.elb.amazonaws.com/health"
+```
+
+---
+
+## 4. Docker Compose Rollback (Local / Staging)
+
+### 4.1 Production Compose Rollback
+
+The `docker-compose.prod.yml` deploys all services with tagged images. To rollback:
+
+```bash
+# 1. Identify the previous working tag
+# Check GitHub releases or git tags for the last known good version
+PREVIOUS_TAG="v1.2.3"
+
+# 2. Stop current services
+docker compose -f docker-compose.prod.yml down
+
+# 3. Pull previous images
+docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-api:${PREVIOUS_TAG}
+docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-darkwatch:${PREVIOUS_TAG}
+docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-spamshield:${PREVIOUS_TAG}
+docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-voiceprint:${PREVIOUS_TAG}
+
+# 4. Override tag in compose
+DOCKER_TAG=${PREVIOUS_TAG} docker compose -f docker-compose.prod.yml up -d
+
+# 5. Verify health
+for svc in api darkwatch spamshield voiceprint; do
+  PORT=$(case $svc in
+    api) echo 3000;; darkwatch) echo 3001;;
+    spamshield) echo 3002;; voiceprint) echo 3003;;
+  esac)
+  curl -sf "http://localhost:${PORT}/health" && echo "$svc: OK" || echo "$svc: FAIL"
+done
+```
+
+### 4.2 Local Dev Rollback
+
+```bash
+# Stop and remove containers
+docker compose down
+
+# Rebuild from previous commit
+git checkout <previous-commit>
+docker compose up -d --build
+```
+
+---
+
+## 5. Database Migration Rollback
+
+### 5.1 Drizzle Migration Rollback
+
+ShieldAI uses Drizzle ORM with Turso dialect. Migrations are stored in `src/db/migrations/`.
+
+```bash
+# 1. Get database credentials from AWS Secrets Manager
+DB_SECRET=$(aws secretsmanager get-secret-value \
+  --secret-id "shieldai-${ENVIRONMENT}-db-password" \
+  --query 'SecretString' --output json)
+
+DB_HOST=$(echo "$DB_SECRET" | jq -r '.host')
+DB_PORT=$(echo "$DB_SECRET" | jq -r '.port')
+DB_USER=$(echo "$DB_SECRET" | jq -r '.username')
+DB_PASS=$(echo "$DB_SECRET" | jq -r '.password')
+
+DATABASE_URL="postgresql://${DB_USER}:${DB_PASS}@${DB_HOST}:${DB_PORT}/shieldai"
+
+# 2. List migrations to identify the one to revert
+npx drizzle-kit introspect --config=drizzle.config.ts
+
+# 3. Resolve the problematic migration (marks it as not applied)
+npx drizzle-kit migrate:resolve --migration "<migration_name>" --status applied
+
+# 4. Re-run previous migration state
+npx drizzle-kit migrate --config=drizzle.config.ts
+```
+
+### 5.2 RDS Point-in-Time Recovery (Disaster)
+
+When the database itself needs recovery (e.g., data corruption, bad migration):
+
+```bash
+# 1. Find available recovery window (automated backups: every 24h, 7-14 day retention)
+aws rds describe-db-instances \
+  --db-instance-identifier "shieldai-production-db" \
+  --query 'DBInstances[0].LatestRestorableTime'
+
+# 2. Create restored instance (does not affect primary)
+aws rds restore-db-instance-to-point-in-time \
+  --source-db-instance-identifier "shieldai-production-db" \
+  --db-instance-identifier "shieldai-production-db-restored" \
+  --restore-time "2026-05-09T08:00:00Z"
+
+# 3. Verify restored instance
+aws rds wait db-instance-available \
+  --db-instance-identifier "shieldai-production-db-restored"
+
+# 4. Update ECS services to point to restored instance
+# Update DATABASE_URL secret in Secrets Manager
+aws secretsmanager put-secret-value \
+  --secret-id "shieldai-production-db-password" \
+  --secret-string "$(echo "$DB_SECRET" | jq --arg host "$(aws rds describe-db-instances --db-instance-identifier shieldai-production-db-restored --query 'DBInstances[0].Endpoint.Address' --output text)" '.host = $host')"
+
+# 5. Trigger ECS service redeployment to pick up new DB endpoint
+./infra/scripts/rollback.sh production all
+```
+
+### 5.3 RDS Snapshot Restore
+
+```bash
+# 1. List available snapshots
+aws rds describe-db-snapshots \
+  --db-instance-identifier "shieldai-production-db"
+
+# 2. Restore from specific snapshot
+aws rds restore-db-instance-from-db-snapshot \
+  --db-instance-identifier "shieldai-production-db-restored" \
+  --db-snapshot-identifier "rds:shieldai-production-db-2026-05-08-03-00" \
+  --db-instance-class "db.t3.medium" \
+  --vpc-security-group-ids "$(terraform -chdir=infra/output -raw vpc_security_group_id)"
+
+# 3. Follow steps 3-5 from Point-in-Time Recovery above
+```
+
+---
+
+## 6. Automated Rollback Triggers
+
+### 6.1 CI/CD Health Check Failure
+
+**Trigger:** Post-deploy health check returns non-200 from `/health`
+
+**Pipeline job:** `rollback` in `.github/workflows/deploy.yml`
+
+**Condition:** `if: failure() && needs.health-check.result == 'failure'`
+
+**Action:** Rolls back all four ECS services to previous task definition
+
+**Timeout:** Health check retries for 5 minutes before triggering rollback
+
+### 6.2 ECS Container Health Check
+
+Each container has an in-container health check defined in the ECS task definition:
+
+```json
+"healthCheck": {
+  "command": ["CMD-SHELL", "wget -q --spider http://localhost:{port}/health || exit 1"],
+  "interval": 30,
+  "timeout": 5,
+  "retries": 3,
+  "startPeriod": 60
+}
+```
+
+**Failure consequence:** Container is marked unhealthy after 3 consecutive failures (90 seconds). ALB marks target as unhealthy after 3 failed health checks (90 seconds). Service enters draining state.
+
+### 6.3 ALB Target Group Health Check
+
+The ALB performs HTTP health checks against `/health` on each target:
+
+| Parameter | Value |
+|-----------|-------|
+| Interval | 30s |
+| Timeout | 5s |
+| Healthy threshold | 3 |
+| Unhealthy threshold | 3 |
+| Expected code | 200 |
+
+### 6.4 CloudWatch Alarms
+
+The following alarms are configured in `infra/modules/cloudwatch/main.tf`:
+
+| Alarm | Threshold | Action |
+|-------|-----------|--------|
+| ECS CPU >80% | 80% for 2 periods (10min) | SNS notification |
+| ECS Memory >85% | 85% for 2 periods (10min) | SNS notification |
+| ALB 5xx >10/min | 10 for 3 periods (3min) | SNS notification |
+| RDS CPU >75% | 75% for 2 periods (10min) | SNS notification |
+| RDS Free Storage <500MB | 500MB for 2 periods (10min) | SNS notification |
+
+**Alarm escalation path:**
+1. CloudWatch alarm fires
+2. SNS notification sent to on-call engineer
+3. Engineer evaluates: if service is degraded, trigger manual rollback
+4. If root cause is deployment-related, run `./infra/scripts/rollback.sh production all`
+
+---
+
+## 7. Blue-Green Deployment Rollback
+
+### 7.1 Architecture
+
+ShieldAI uses ECS services with rolling deployments. Each deployment creates a new task definition revision. The ALB routes traffic to healthy targets only.
+
+**Rollback mechanism:** ECS `--rollback` flag reverts the service to the previous task definition revision. This is equivalent to a blue-green swap since:
+
+1. Old task definition (blue) remains registered
+2. New task definition (green) is deployed
+3. On rollback, ECS reverts to blue task definition
+4. ALB automatically routes to healthy (blue) targets
+
+### 7.2 Blue-Green Rollback Procedure
+
+```bash
+# 1. Check current deployment state
+aws ecs list-services --cluster shieldai-production
+aws ecs describe-services --cluster shieldai-production \
+  --services shieldai-production-api \
+  --query 'services[0].deployments'
+
+# 2. Identify previous deployment
+# The deployment with status "PRIMARY" is current.
+# Look for "ACTIVE" deployment with older task definition.
+
+# 3. Execute rollback (script handles all services)
+./infra/scripts/rollback.sh production all
+
+# 4. Verify rollback
+aws ecs describe-services --cluster shieldai-production \
+  --services shieldai-production-api \
+  --query 'services[0].deployments[?status==`PRIMARY`].taskDefinition'
+```
+
+### 7.3 Docker Compose Blue-Green (Local)
+
+For local/staging environments using Docker Compose, implement blue-green via service version pinning:
+
+```bash
+# Current deployment uses DOCKER_TAG env var
+# Rollback by setting DOCKER_TAG to previous version
+
+# Save current tag
+CURRENT_TAG=$(grep DOCKER_TAG .env.prod 2>/dev/null | cut -d= -f2 || echo "latest")
+
+# Rollback to previous
+export DOCKER_TAG="v1.2.3"
+docker compose -f docker-compose.prod.yml up -d
+
+# Verify all services
+docker compose -f docker-compose.prod.yml ps
+```
+
+---
+
+## 8. Rollback Decision Tree
+
+```
+Is the service responding?
+├── YES → Is the response correct?
+│   ├── YES → Monitor, no action needed
+│   └── NO → Is it a data issue?
+│       ├── YES → Database Migration Rollback (§5)
+│       └── NO → ECS Service Rollback (§3)
+└── NO → Is it a single service or all?
+    ├── Single → ECS Service Rollback (§3, specific service)
+    └── All → Full Environment Rollback
+        ├── Is DB corrupted?
+        │   ├── YES → RDS Point-in-Time Recovery (§5.2)
+        │   └── NO → ECS Full Rollback + DB Migration Rollback
+```
+
+**SLA targets:**
+- Single service rollback: **< 5 minutes**
+- Full environment rollback: **< 15 minutes**
+- Database recovery: **< 30 minutes** (Point-in-Time)
+
+---
+
+## 9. Post-Rollback Verification
+
+After any rollback, verify the following:
+
+### 9.1 Service Health
+
+```bash
+# Check all services are healthy
+for svc in api darkwatch spamshield voiceprint; do
+  PORT=$(case $svc in
+    api) echo 3000;; darkwatch) echo 3001;;
+    spamshield) echo 3002;; voiceprint) echo 3003;;
+  esac)
+  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
+    "https://shieldai-${ENVIRONMENT}-alb.us-east-1.elb.amazonaws.com/health")
+  echo "$svc: HTTP $HTTP_CODE"
+done
+```
+
+### 9.2 ECS Service Status
+
+```bash
+# Verify all services are stable
+for svc in api darkwatch spamshield voiceprint; do
+  RUNNING=$(aws ecs describe-services \
+    --cluster "shieldai-${ENVIRONMENT}" \
+    --services "shieldai-${ENVIRONMENT}-${svc}" \
+    --query 'services[0].runningCount' --output text)
+  DESIRED=$(aws ecs describe-services \
+    --cluster "shieldai-${ENVIRONMENT}" \
+    --services "shieldai-${ENVIRONMENT}-${svc}" \
+    --query 'services[0].desiredCount' --output text)
+  echo "$svc: $RUNNING/$DESIRED running"
+done
+```
+
+### 9.3 Database Connectivity
+
+```bash
+# Verify database connection
+aws ecs execute-command \
+  --cluster "shieldai-${ENVIRONMENT}" \
+  --service "shieldai-${ENVIRONMENT}-api" \
+  --command "npx drizzle-kit status" \
+  --interactive --cluster "shieldai-${ENVIRONMENT}"
+```
+
+### 9.4 CloudWatch Verification
+
+1. Navigate to CloudWatch dashboard: `shieldai-${ENVIRONMENT}-dashboard`
+2. Verify CPU/Memory utilization is within normal range
+3. Verify ALB 5xx errors have returned to baseline
+4. Verify no new alarms are in ALARM state
+
+---
+
+## 10. Testing Checklist
+
+### 10.1 ECS Rollback Test
+
+- [ ] Deploy a known-bad image (e.g., image with `/health` returning 500)
+- [ ] Verify CI/CD health check fails within 5 minutes
+- [ ] Verify `rollback` job triggers automatically
+- [ ] Verify all four services revert to previous task definition
+- [ ] Verify health check passes post-rollback
+- [ ] Verify CloudWatch metrics show recovery
+
+### 10.2 Manual Script Test
+
+- [ ] Run `./infra/scripts/rollback.sh staging api` on staging
+- [ ] Verify single service rolls back correctly
+- [ ] Run `./infra/scripts/rollback.sh staging all` on staging
+- [ ] Verify all services roll back correctly
+- [ ] Verify script exits with code 0 on success
+- [ ] Verify script exits with code 1 on failure
+
+### 10.3 Docker Compose Rollback Test
+
+- [ ] Deploy v2.0.0 of all services via docker-compose.prod.yml
+- [ ] Rollback to v1.0.0 using DOCKER_TAG override
+- [ ] Verify all services restart with previous images
+- [ ] Verify health endpoints respond correctly
+
+### 10.4 Database Migration Rollback Test
+
+- [ ] Apply a test migration on staging
+- [ ] Run migration rollback procedure
+- [ ] Verify schema matches pre-migration state
+- [ ] Verify application connects and functions correctly
+
+### 10.5 RDS Point-in-Time Recovery Test
+
+- [ ] Create a test RDS instance
+- [ ] Insert test data
+- [ ] Restore to point before data insertion
+- [ ] Verify restored instance has correct data state
+- [ ] Clean up test instance
+
+### 10.6 End-to-End Rollback Drills
+
+| Drill | Frequency | Participants |
+|-------|-----------|--------------|
+| ECS service rollback | Monthly | Senior Engineer |
+| Full environment rollback | Quarterly | Full engineering team |
+| Database recovery | Quarterly | Senior Engineer + Founding Engineer |
+| Blue-green rollback | Quarterly | Full engineering team |
+
+---
+
+## 11. Runbook: Emergency Rollback
+
+### 11.1 Symptoms
+
+- ALB 5xx error rate > 10/minute for 3+ minutes
+- CloudWatch alarm: `shieldai-production-alb-5xx` in ALARM state
+- Customer-reported service degradation
+
+### 11.2 Immediate Actions (0-5 minutes)
+
+```bash
+# 1. Confirm environment and scope
+ENVIRONMENT="production"
+
+# 2. Check service status
+aws ecs describe-services \
+  --cluster "shieldai-${ENVIRONMENT}" \
+  --services shieldai-${ENVIRONMENT}-api,shieldai-${ENVIRONMENT}-darkwatch,shieldai-${ENVIRONMENT}-spamshield,shieldai-${ENVIRONMENT}-voiceprint \
+  --query 'services[*].{Name:serviceName,Running:runningCount,Desired:desiredCount,Status:status}'
+
+# 3. Check ALB health
+curl -s -o /dev/null -w "%{http_code}" \
+  "https://shieldai-${ENVIRONMENT}-alb.us-east-1.elb.amazonaws.com/health"
+
+# 4. Execute rollback
+./infra/scripts/rollback.sh ${ENVIRONMENT} all
+```
+
+### 11.3 Verification (5-10 minutes)
+
+```bash
+# 1. Wait for services to stabilize
+aws ecs wait services-stable \
+  --cluster "shieldai-${ENVIRONMENT}" \
+  --services shieldai-${ENVIRONMENT}-api,shieldai-${ENVIRONMENT}-darkwatch,shieldai-${ENVIRONMENT}-spamshield,shieldai-${ENVIRONMENT}-voiceprint
+
+# 2. Verify health endpoint
+curl -sf "https://shieldai-${ENVIRONMENT}-alb.us-east-1.elb.amazonaws.com/health" \
+  && echo "Health: OK" || echo "Health: FAIL"
+
+# 3. Check CloudWatch for recovery
+# Navigate to CloudWatch dashboard and verify metrics
+```
+
+### 11.4 Communication Template
+
+```
+## Rollback Notification
+
+**Environment:** production
+**Time:** $(date -u '+%Y-%m-%d %H:%M UTC')
+**Trigger:** [ALB 5xx alarm / manual / CI/CD health check]
+**Action:** Rolled back all services to previous deployment
+**Status:** [In Progress / Verified / Resolved]
+**Next steps:** [Post-mortem / monitoring / investigation]
+```
+
+### 11.5 Post-Incident
+
+1. Create incident ticket with timeline
+2. Document root cause
+3. Update runbook if procedure changed
+4. Schedule post-mortem within 48 hours
+5. Create follow-up issues for preventive measures
+
+---
+
+## Appendix A: Quick Reference
+
+| Resource | Command |
+|----------|---------|
+| Rollback script | `./infra/scripts/rollback.sh <env> <service\|all>` |
+| ECS service status | `aws ecs describe-services --cluster shieldai-<env> --services shieldai-<env>-<svc>` |
+| ALB health check | `curl -s -o /dev/null -w "%{http_code}" https://shieldai-<env>-alb.us-east-1.elb.amazonaws.com/health` |
+| RDS snapshots | `aws rds describe-db-snapshots --db-instance-identifier shieldai-<env>-db` |
+| CloudWatch dashboard | `https://us-east-1.console.aws.amazon.com/cloudwatch/home#dashboards/dashboard/shieldai-<env>-dashboard` |
+| ECS task logs | `aws logs filter-log-events --log-group-name /ecs/shieldai-<env>-<svc>` |
+
+## Appendix B: Environment Variables
+
+| Variable | Description | Required |
+|----------|-------------|----------|
+| `AWS_ACCESS_KEY_ID` | IAM user with ECS, RDS permissions | Yes |
+| `AWS_SECRET_ACCESS_KEY` | IAM secret key | Yes |
+| `AWS_DEFAULT_REGION` | AWS region (default: us-east-1) | Yes |
+| `GITHUB_REPOSITORY_OWNER` | GitHub org/user for container registry | Docker Compose only |
+| `DOCKER_TAG` | Container image tag to deploy | Docker Compose only |
+| `POSTGRES_PASSWORD` | Database password | Docker Compose only |
--- a/infra/scripts/rollback-compose.sh
+++ b/infra/scripts/rollback-compose.sh
@@ -0,0 +1,121 @@
+#!/bin/bash
+set -euo pipefail
+
+# ShieldAI Docker Compose Rollback Script
+# Usage: ./rollback-compose.sh <previous_tag> [--env prod|dev]
+#
+# Rolls back all services to a previous tagged image using docker-compose.prod.yml
+#
+# Examples:
+#   ./rollback-compose.sh v1.2.3              # Rollback to v1.2.3
+#   ./rollback-compose.sh v1.2.3 --env prod   # Explicit production compose
+
+PREVIOUS_TAG="${1:-}"
+ENV_MODE="${2:-prod}"
+
+# ─── Configuration ───────────────────────────────────────────────
+SERVICES="api darkwatch spamshield voiceprint"
+COMPOSE_FILE="docker-compose.prod.yml"
+REGISTRY_OWNER="${GITHUB_REPOSITORY_OWNER:-shieldai}"
+
+# ─── Helpers ─────────────────────────────────────────────────────
+log() {
+  local level="$1"
+  shift
+  echo "[$(date -u '+%H:%M:%S')] [$level] $*"
+}
+
+log_info()  { log "INFO" "$@"; }
+log_warn()  { log "WARN" "$@"; }
+log_error() { log "ERROR" "$@"; }
+
+# ─── Validation ──────────────────────────────────────────────────
+if [[ -z "$PREVIOUS_TAG" ]]; then
+  log_error "Usage: $0 <previous_tag> [--env prod|dev]"
+  log_error "Example: $0 v1.2.3"
+  exit 1
+fi
+
+if ! command -v docker &>/dev/null; then
+  log_error "Docker not found in PATH"
+  exit 1
+fi
+
+# ─── Rollback Logic ──────────────────────────────────────────────
+main() {
+  log_info "=== Docker Compose Rollback ==="
+  log_info "Target tag: $PREVIOUS_TAG"
+  log_info "Compose file: $COMPOSE_FILE"
+  log_info "Registry: ghcr.io/$REGISTRY_OWNER"
+
+  # 1. Pull previous images
+  log_info "Pulling previous images..."
+  local pull_failed=0
+  for svc in $SERVICES; do
+    local image="ghcr.io/${REGISTRY_OWNER}/shieldai-${svc}:${PREVIOUS_TAG}"
+    log_info "Pulling $image..."
+    if docker pull "$image" 2>/dev/null; then
+      log_info "Pulled: $image"
+    else
+      log_warn "Pull failed: $image (may not exist)"
+      pull_failed=1
+    fi
+  done
+
+  if [[ $pull_failed -eq 1 ]]; then
+    log_warn "Some images may not exist at tag $PREVIOUS_TAG"
+    log_info "Continuing with available images..."
+  fi
+
+  # 2. Stop current services gracefully
+  log_info "Stopping current services..."
+  DOCKER_TAG="$PREVIOUS_TAG" docker compose -f "$COMPOSE_FILE" down --timeout 30 2>/dev/null || true
+
+  # 3. Start with previous tag
+  log_info "Starting services with tag $PREVIOUS_TAG..."
+  DOCKER_TAG="$PREVIOUS_TAG" docker compose -f "$COMPOSE_FILE" up -d
+
+  # 4. Wait for services to be healthy
+  log_info "Waiting for services to become healthy..."
+  sleep 10
+
+  # 5. Verify health
+  local passed=0
+  local failed=0
+
+  for svc in $SERVICES; do
+    local port
+    port=$(case "$svc" in
+      api) echo 3000 ;;
+      darkwatch) echo 3001 ;;
+      spamshield) echo 3002 ;;
+      voiceprint) echo 3003 ;;
+    esac)
+
+    local http_code
+    http_code=$(curl -s -o /dev/null -w "%{http_code}" \
+      --connect-timeout 10 --max-time 30 \
+      "http://localhost:${port}/health" 2>/dev/null || echo "000")
+
+    if [[ "$http_code" == "200" ]]; then
+      log_info "Health OK: $svc (port $port, HTTP $http_code)"
+      ((passed++))
+    else
+      log_warn "Health FAIL: $svc (port $port, HTTP $http_code)"
+      ((failed++))
+    fi
+  done
+
+  log_info "=== Rollback Complete ==="
+  log_info "Passed: $passed, Failed: $failed"
+
+  if [[ $failed -gt 0 ]]; then
+    log_warn "Some services failed health check. Check logs: docker compose -f $COMPOSE_FILE logs"
+    exit 1
+  fi
+
+  log_info "All services healthy after rollback"
+  exit 0
+}
+
+main "$@"
--- a/infra/scripts/rollback-migration.sh
+++ b/infra/scripts/rollback-migration.sh
@@ -0,0 +1,164 @@
+#!/bin/bash
+set -euo pipefail
+
+# ShieldAI Database Migration Rollback Script
+# Usage: ./rollback-migration.sh <environment> [--migration <name>]
+#
+# Rolls back the most recent migration or a specific named migration
+# Uses AWS Secrets Manager for database credentials
+#
+# Examples:
+#   ./rollback-migration.sh staging                    # Rollback latest
+#   ./rollback-migration.sh production --migration 001_create_users  # Rollback specific
+
+ENVIRONMENT="${1:-staging}"
+MIGRATION_NAME="${3:-}"
+
+# ─── Configuration ───────────────────────────────────────────────
+SECRET_ID="shieldai-${ENVIRONMENT}-db-password"
+DB_NAME="shieldai"
+DB_USER="shieldai"
+
+# ─── Helpers ─────────────────────────────────────────────────────
+log() {
+  local level="$1"
+  shift
+  echo "[$(date -u '+%H:%M:%S')] [$level] $*"
+}
+
+log_info()  { log "INFO" "$@"; }
+log_warn()  { log "WARN" "$@"; }
+log_error() { log "ERROR" "$@"; }
+
+# ─── Validation ──────────────────────────────────────────────────
+if [[ "$ENVIRONMENT" != "staging" && "$ENVIRONMENT" != "production" ]]; then
+  log_error "Invalid environment: $ENVIRONMENT (expected: staging, production)"
+  exit 1
+fi
+
+for cmd in aws jq; do
+  if ! command -v "$cmd" &>/dev/null; then
+    log_error "Missing prerequisite: $cmd"
+    exit 1
+  fi
+done
+
+# ─── Credentials ─────────────────────────────────────────────────
+get_db_credentials() {
+  log_info "Fetching database credentials from Secrets Manager..."
+
+  local secret
+  secret=$(aws secretsmanager get-secret-value \
+    --secret-id "$SECRET_ID" \
+    --query 'SecretString' \
+    --output json 2>/dev/null)
+
+  if [[ -z "$secret" ]]; then
+    log_error "Failed to fetch secret: $SECRET_ID"
+    exit 1
+  fi
+
+  export DB_HOST=$(echo "$secret" | jq -r '.host')
+  export DB_PORT=$(echo "$secret" | jq -r '.port' // '5432')
+  export DB_PASS=$(echo "$secret" | jq -r '.password')
+  export DATABASE_URL="postgresql://${DB_USER}:${DB_PASS}@${DB_HOST}:${DB_PORT}/${DB_NAME}"
+
+  log_info "Database: ${DB_HOST}:${DB_PORT}/${DB_NAME}"
+}
+
+# ─── Migration Status ────────────────────────────────────────────
+show_migration_status() {
+  log_info "=== Current Migration Status ==="
+
+  if command -v npx &>/dev/null; then
+    npx drizzle-kit status --config=drizzle.config.ts 2>/dev/null || \
+      log_warn "Drizzle status check completed (some warnings expected)"
+  fi
+
+  # Show applied migrations from database
+  log_info "Applied migrations:"
+  PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d "$DB_NAME" \
+    -c "SELECT id, checksum, type FROM __drizzle_migrations_schema ORDER BY id DESC;" 2>/dev/null || \
+    log_warn "Could not query migration table (psql may not be installed)"
+}
+
+# ─── Rollback Logic ──────────────────────────────────────────────
+rollback_latest() {
+  log_info "=== Rolling Back Latest Migration ==="
+
+  # Get the latest applied migration
+  local latest_migration
+  latest_migration=$(PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -p "$DB_PORT" \
+    -U "$DB_USER" -d "$DB_NAME" -t -A \
+    -c "SELECT id FROM __drizzle_migrations_schema ORDER BY id DESC LIMIT 1;" 2>/dev/null)
+
+  if [[ -z "$latest_migration" ]]; then
+    log_warn "No applied migrations found"
+    return 0
+  fi
+
+  log_info "Latest migration: $latest_migration"
+
+  # Resolve the migration (marks it as not applied)
+  if command -v npx &>/dev/null; then
+    npx drizzle-kit migrate:resolve --migration "$latest_migration" --status applied 2>/dev/null || \
+      log_warn "Migration resolve completed (check output for details)"
+  fi
+
+  log_info "Migration $latest_migration marked as resolved"
+}
+
+rollback_specific() {
+  local target="$1"
+  log_info "=== Rolling Back Migration: $target ==="
+
+  if command -v npx &>/dev/null; then
+    npx drizzle-kit migrate:resolve --migration "$target" --status applied 2>/dev/null || \
+      log_warn "Migration resolve completed (check output for details)"
+  fi
+
+  log_info "Migration $target marked as resolved"
+}
+
+# ─── Verification ────────────────────────────────────────────────
+verify_connection() {
+  log_info "=== Verifying Database Connection ==="
+
+  local result
+  result=$(PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -p "$DB_PORT" \
+    -U "$DB_USER" -d "$DB_NAME" -t -A \
+    -c "SELECT version();" 2>/dev/null || echo "FAIL")
+
+  if [[ "$result" != "FAIL" ]]; then
+    log_info "Connection OK: PostgreSQL $result"
+  else
+    log_warn "Connection check failed"
+  fi
+}
+
+# ─── Main ────────────────────────────────────────────────────────
+main() {
+  log_info "=== ShieldAI Migration Rollback ==="
+  log_info "Environment: $ENVIRONMENT"
+  log_info "Secret: $SECRET_ID"
+
+  get_db_credentials
+  show_migration_status
+
+  if [[ -n "$MIGRATION_NAME" ]]; then
+    rollback_specific "$MIGRATION_NAME"
+  else
+    rollback_latest
+  fi
+
+  verify_connection
+  show_migration_status
+
+  log_info "=== Rollback Complete ==="
+  log_info "Next steps:"
+  log_info "1. Verify application schema compatibility"
+  log_info "2. Run application health checks"
+  log_info "3. If needed, redeploy ECS services: ./rollback.sh $ENVIRONMENT all"
+}
+
+main "$@"
--- a/infra/scripts/rollback.sh
+++ b/infra/scripts/rollback.sh
@@ -1,32 +1,255 @@
 #!/bin/bash
 set -euo pipefail

-ENVIRONMENT=${1:-staging}
-SERVICE=${2:-all}
+# ShieldAI ECS Rollback Script
+# Usage: ./rollback.sh <environment> <service|all> [--verify]
+#
+# Environments: staging, production
+# Services: api, darkwatch, spamshield, voiceprint, all
+#
+# Examples:
+#   ./rollback.sh staging api           # Rollback single service
+#   ./rollback.sh production all        # Rollback all services
+#   ./rollback.sh production all --verify  # Rollback with post-verification
+
+# ─── Configuration ───────────────────────────────────────────────
+ENVIRONMENT="${1:-staging}"
+SERVICE="${2:-all}"
+VERIFY="${3:-false}"

 CLUSTER="shieldai-${ENVIRONMENT}"
+SERVICES_LIST="api darkwatch spamshield voiceprint"
+EXIT_CODE=0
+TIMESTAMP=$(date -u '+%Y-%m-%d %H:%M:%S UTC')
+LOG_FILE="/tmp/shieldai-rollback-${ENVIRONMENT}-${TIMESTAMP//[: ]/_}.log"

-echo "Rolling back services in cluster: $CLUSTER"
+# ─── Helpers ─────────────────────────────────────────────────────
+log() {
+  local level="$1"
+  shift
+  local msg="$*"
+  echo "[$(date -u '+%H:%M:%S')] [$level] $msg" | tee -a "$LOG_FILE"
+}

-SERVICES="api darkwatch spamshield voiceprint"
-if [ "$SERVICE" != "all" ]; then
-  SERVICES="$SERVICE"
-fi
+log_info()  { log "INFO" "$@"; }
+log_warn()  { log "WARN" "$@"; }
+log_error() { log "ERROR" "$@"; }

-for svc in $SERVICES; do
-  echo "Rolling back $svc..."
-  aws ecs update-service \
+# ─── Validation ──────────────────────────────────────────────────
+validate_environment() {
+  if [[ "$ENVIRONMENT" != "staging" && "$ENVIRONMENT" != "production" ]]; then
+    log_error "Invalid environment: $ENVIRONMENT (expected: staging, production)"
+    exit 1
+  fi
+}
+
+validate_service() {
+  if [[ "$SERVICE" == "all" ]]; then
+    return 0
+  fi
+  if ! echo "$SERVICES_LIST" | grep -qw "$SERVICE"; then
+    log_error "Invalid service: $SERVICE (expected: api, darkwatch, spamshield, voiceprint, all)"
+    exit 1
+  fi
+}
+
+check_prerequisites() {
+  local missing=()
+
+  for cmd in aws jq curl; do
+    if ! command -v "$cmd" &>/dev/null; then
+      missing+=("$cmd")
+    fi
+  done
+
+  if [[ ${#missing[@]} -gt 0 ]]; then
+    log_error "Missing prerequisites: ${missing[*]}"
+    exit 1
+  fi
+
+  if [[ -z "${AWS_DEFAULT_REGION:-}" ]]; then
+    export AWS_DEFAULT_REGION="us-east-1"
+  fi
+
+  log_info "Prerequisites OK (region: $AWS_DEFAULT_REGION)"
+}
+
+# ─── Rollback Logic ──────────────────────────────────────────────
+get_target_services() {
+  if [[ "$SERVICE" == "all" ]]; then
+    echo "$SERVICES_LIST"
+  else
+    echo "$SERVICE"
+  fi
+}
+
+rollback_service() {
+  local svc="$1"
+  local service_name="${CLUSTER}-${svc}"
+
+  log_info "Rolling back $service_name..."
+
+  # Check current deployment status
+  local current_task_def
+  current_task_def=$(aws ecs describe-services \
    --cluster "$CLUSTER" \
-    --service "${CLUSTER}-${svc}" \
+    --services "$service_name" \
+    --query 'services[0].taskDefinition' \
+    --output text 2>/dev/null || echo "UNKNOWN")
+
+  log_info "Current task definition: $current_task_def"
+
+  # Execute rollback
+  if aws ecs update-service \
+    --cluster "$CLUSTER" \
+    --service "$service_name" \
    --rollback \
-    --no-cli-auto-prompt
+    --no-cli-auto-prompt 2>>"$LOG_FILE"; then
+    log_info "Rollback initiated for $service_name"
+  else
+    log_error "Rollback failed to initiate for $service_name"
+    EXIT_CODE=1
+    return 1
+  fi

-  echo "Waiting for $svc to stabilize..."
-  aws ecs wait services-stable \
+  # Wait for stabilization (max 5 minutes)
+  log_info "Waiting for $service_name to stabilize (timeout: 300s)..."
+  if aws ecs wait services-stable \
    --cluster "$CLUSTER" \
-    --services "${CLUSTER}-${svc}"
+    --services "$service_name" \
+    --timeout 300 2>>"$LOG_FILE"; then
+    log_info "$service_name stabilized successfully"
+  else
+    log_warn "$service_name stabilization timed out or failed"
+    EXIT_CODE=1
+    return 1
+  fi

-  echo "$svc rolled back successfully"
-done
+  # Get new task definition after rollback
+  local new_task_def
+  new_task_def=$(aws ecs describe-services \
+    --cluster "$CLUSTER" \
+    --services "$service_name" \
+    --query 'services[0].taskDefinition' \
+    --output text 2>/dev/null || echo "UNKNOWN")

-echo "Rollback complete for $SERVICES"
+  local running_count
+  running_count=$(aws ecs describe-services \
+    --cluster "$CLUSTER" \
+    --services "$service_name" \
+    --query 'services[0].runningCount' \
+    --output text 2>/dev/null || echo "0")
+
+  local desired_count
+  desired_count=$(aws ecs describe-services \
+    --cluster "$CLUSTER" \
+    --services "$service_name" \
+    --query 'services[0].desiredCount' \
+    --output text 2>/dev/null || echo "0")
+
+  log_info "Rollback complete: $service_name -> $new_task_def ($running_count/$desired_count running)"
+
+  return 0
+}
+
+# ─── Health Verification ─────────────────────────────────────────
+verify_health() {
+  local svc="$1"
+  local port
+  port=$(case "$svc" in
+    api) echo 3000 ;;
+    darkwatch) echo 3001 ;;
+    spamshield) echo 3002 ;;
+    voiceprint) echo 3003 ;;
+    *) echo 3000 ;;
+  esac)
+
+  local alb_dns="https://${CLUSTER}-alb.${AWS_DEFAULT_REGION}.elb.amazonaws.com"
+
+  log_info "Verifying health for $svc (ALB: $alb_dns)..."
+
+  local http_code
+  http_code=$(curl -s -o /dev/null -w "%{http_code}" \
+    --connect-timeout 10 \
+    --max-time 30 \
+    "$alb_dns/health" 2>/dev/null || echo "000")
+
+  if [[ "$http_code" == "200" ]]; then
+    log_info "Health check PASSED: $svc (HTTP $http_code)"
+    return 0
+  else
+    log_warn "Health check FAILED: $svc (HTTP $http_code)"
+    return 1
+  fi
+}
+
+verify_all_services() {
+  log_info "=== Post-Rollback Health Verification ==="
+  local passed=0
+  local failed=0
+
+  for svc in $(get_target_services); do
+    if verify_health "$svc"; then
+      ((passed++))
+    else
+      ((failed++))
+    fi
+  done
+
+  log_info "Verification complete: $passed passed, $failed failed"
+
+  if [[ $failed -gt 0 ]]; then
+    log_warn "Some services failed health verification"
+    EXIT_CODE=1
+  fi
+}
+
+# ─── Main Execution ──────────────────────────────────────────────
+main() {
+  log_info "=== ShieldAI Rollback ==="
+  log_info "Environment: $ENVIRONMENT"
+  log_info "Service(s): $SERVICE"
+  log_info "Cluster: $CLUSTER"
+  log_info "Verify: $VERIFY"
+  log_info "Timestamp: $TIMESTAMP"
+  log_info "Log file: $LOG_FILE"
+  log_info "=========================="
+
+  # Validate inputs
+  validate_environment
+  validate_service
+  check_prerequisites
+
+  # Execute rollback for each target service
+  local rolled_back=0
+  local failed=0
+
+  for svc in $(get_target_services); do
+    if rollback_service "$svc"; then
+      ((rolled_back++))
+    else
+      ((failed++))
+    fi
+  done
+
+  log_info "=== Rollback Summary ==="
+  log_info "Rolled back: $rolled_back services"
+  log_info "Failed: $failed services"
+
+  # Post-rollback verification
+  if [[ "$VERIFY" == "--verify" ]] || [[ "$VERIFY" == "true" ]]; then
+    verify_all_services
+  fi
+
+  if [[ $failed -gt 0 ]]; then
+    log_error "Rollback completed with $failed failure(s)"
+    log_info "Full log: $LOG_FILE"
+    exit "$EXIT_CODE"
+  fi
+
+  log_info "Rollback completed successfully"
+  log_info "Full log: $LOG_FILE"
+  exit 0
+}
+
+main "$@"
--- a/infra/scripts/test-rollback.sh
+++ b/infra/scripts/test-rollback.sh
@@ -0,0 +1,237 @@
+#!/bin/bash
+set -uo pipefail
+
+# ShieldAI Rollback Test Suite
+# Usage: ./test-rollback.sh [ecs|compose|migration|all]
+#
+# Validates rollback scripts and procedures without mutating production
+# Run against staging environment for integration tests
+
+TEST_SUITE="${1:-all}"
+PASS=0
+FAIL=0
+SKIP=0
+
+# ─── Helpers ─────────────────────────────────────────────────────
+log() {
+  echo "[$(date -u '+%H:%M:%S')] $*"
+}
+
+assert_eq() {
+  local desc="$1" expected="$2" actual="$3"
+  if [[ "$expected" == "$actual" ]]; then
+    log "  ✅ PASS: $desc"
+    ((PASS++))
+  else
+    log "  ❌ FAIL: $desc (expected: $expected, got: $actual)"
+    ((FAIL++))
+  fi
+}
+
+assert_file_exists() {
+  local desc="$1" path="$2"
+  if [[ -f "$path" ]]; then
+    log "  ✅ PASS: $desc"
+    ((PASS++))
+  else
+    log "  ❌ FAIL: $desc ($path not found)"
+    ((FAIL++))
+  fi
+}
+
+assert_executable() {
+  local desc="$1" path="$2"
+  if [[ -x "$path" ]]; then
+    log "  ✅ PASS: $desc"
+    ((PASS++))
+  else
+    log "  ❌ FAIL: $desc ($path not executable)"
+    ((FAIL++))
+  fi
+}
+
+assert_script_syntax() {
+  local desc="$1" path="$2"
+  if bash -n "$path" 2>/dev/null; then
+    log "  ✅ PASS: $desc (syntax OK)"
+    ((PASS++))
+  else
+    log "  ❌ FAIL: $desc (syntax error)"
+    ((FAIL++))
+  fi
+}
+
+assert_contains() {
+  local desc="$1" file="$2" pattern="$3"
+  if grep -q -- "$pattern" "$file" 2>/dev/null; then
+    log "  ✅ PASS: $desc"
+    ((PASS++))
+  else
+    log "  ❌ FAIL: $desc (pattern '$pattern' not found in $file)"
+    ((FAIL++))
+  fi
+}
+
+# ─── Test: File Structure ────────────────────────────────────────
+test_file_structure() {
+  log "=== Test: File Structure ==="
+
+  assert_file_exists "ROLLBACK.md exists" "infra/ROLLBACK.md"
+  assert_file_exists "rollback.sh exists" "infra/scripts/rollback.sh"
+  assert_file_exists "rollback-compose.sh exists" "infra/scripts/rollback-compose.sh"
+  assert_file_exists "rollback-migration.sh exists" "infra/scripts/rollback-migration.sh"
+  assert_executable "rollback.sh is executable" "infra/scripts/rollback.sh"
+  assert_executable "rollback-compose.sh is executable" "infra/scripts/rollback-compose.sh"
+  assert_executable "rollback-migration.sh is executable" "infra/scripts/rollback-migration.sh"
+}
+
+# ─── Test: Script Syntax ─────────────────────────────────────────
+test_script_syntax() {
+  log "=== Test: Script Syntax ==="
+
+  assert_script_syntax "rollback.sh syntax" "infra/scripts/rollback.sh"
+  assert_script_syntax "rollback-compose.sh syntax" "infra/scripts/rollback-compose.sh"
+  assert_script_syntax "rollback-migration.sh syntax" "infra/scripts/rollback-migration.sh"
+}
+
+# ─── Test: ROLLBACK.md Content ───────────────────────────────────
+test_documentation() {
+  log "=== Test: Documentation Content ==="
+
+  local doc="infra/ROLLBACK.md"
+
+  for section in "Overview" "ECS Service Rollback" "Docker Compose Rollback" \
+    "Database Migration Rollback" "Automated Rollback Triggers" \
+    "Blue-Green Deployment Rollback" "Rollback Decision Tree" \
+    "Post-Rollback Verification" "Testing Checklist" "Emergency Rollback"; do
+    assert_contains "Section '$section' documented" "$doc" "$section"
+  done
+
+  for cmd in "aws ecs update-service" "docker compose" "drizzle-kit" \
+    "aws rds restore-db-instance" "aws ecs wait services-stable"; do
+    assert_contains "Command '$cmd' documented" "$doc" "$cmd"
+  done
+}
+
+# ─── Test: Rollback Script Validation ────────────────────────────
+test_rollback_script() {
+  log "=== Test: ECS Rollback Script ==="
+
+  # Test invalid environment
+  local exit_code=0
+  bash infra/scripts/rollback.sh invalid_env api >/dev/null 2>&1 || exit_code=$?
+  assert_eq "Invalid environment returns exit code 1" "1" "$exit_code"
+
+  # Test invalid service
+  exit_code=0
+  bash infra/scripts/rollback.sh staging invalid_svc >/dev/null 2>&1 || exit_code=$?
+  assert_eq "Invalid service returns exit code 1" "1" "$exit_code"
+
+  # Verify script has required functions
+  for func in "validate_environment" "validate_service" "rollback_service" \
+    "verify_health" "check_prerequisites" "main"; do
+    assert_contains "Function '$func' defined" "infra/scripts/rollback.sh" "$func"
+  done
+
+  # Verify all services are handled
+  for svc in api darkwatch spamshield voiceprint; do
+    assert_contains "Service '$svc' in SERVICES_LIST" "infra/scripts/rollback.sh" "$svc"
+  done
+}
+
+# ─── Test: Compose Rollback Script ───────────────────────────────
+test_compose_script() {
+  log "=== Test: Docker Compose Rollback Script ==="
+
+  # Test missing tag argument
+  local exit_code=0
+  bash infra/scripts/rollback-compose.sh >/dev/null 2>&1 || exit_code=$?
+  assert_eq "Missing tag returns exit code 1" "1" "$exit_code"
+
+  # Verify compose file exists
+  assert_file_exists "docker-compose.prod.yml exists" "docker-compose.prod.yml"
+
+  # Verify all services are defined in compose
+  for svc in api darkwatch spamshield voiceprint; do
+    assert_contains "Service '$svc' in docker-compose.prod.yml" "docker-compose.prod.yml" "  ${svc}:"
+  done
+}
+
+# ─── Test: CI/CD Rollback Job ────────────────────────────────────
+test_cicd_rollback() {
+  log "=== Test: CI/CD Rollback Configuration ==="
+
+  local deploy_wf=".github/workflows/deploy.yml"
+
+  assert_contains "Rollback job defined" "$deploy_wf" "rollback:"
+  assert_contains "Health check triggers rollback" "$deploy_wf" "needs.health-check.result"
+  assert_contains "ECS --rollback flag used" "$deploy_wf" "--rollback"
+
+  for svc in api darkwatch spamshield voiceprint; do
+    assert_contains "Service '$svc' in deploy matrix" "$deploy_wf" "$svc"
+  done
+}
+
+# ─── Test: Health Check Configuration ────────────────────────────
+test_health_checks() {
+  log "=== Test: Health Check Configuration ==="
+
+  assert_contains "Container health check in ECS" "infra/modules/ecs/main.tf" "healthCheck"
+  assert_contains "ALB health check defined" "infra/modules/ecs/main.tf" "health_check"
+  assert_contains "ALB 5xx alarm configured" "infra/modules/cloudwatch/main.tf" "HTTPCode_Elb_5XX_Count"
+}
+
+# ─── Test: README References ─────────────────────────────────────
+test_readme() {
+  log "=== Test: README References ==="
+
+  assert_contains "README references ROLLBACK.md" "infra/README.md" "ROLLBACK.md"
+  assert_contains "README documents rollback.sh" "infra/README.md" "rollback.sh"
+  assert_contains "README documents rollback-compose.sh" "infra/README.md" "rollback-compose.sh"
+  assert_contains "README documents rollback-migration.sh" "infra/README.md" "rollback-migration.sh"
+}
+
+# ─── Main ────────────────────────────────────────────────────────
+main() {
+  log "=== ShieldAI Rollback Test Suite ==="
+  log "Suite: $TEST_SUITE"
+  log ""
+
+  case "$TEST_SUITE" in
+    ecs|all)
+      test_rollback_script
+      test_cicd_rollback
+      test_health_checks
+      ;;
+    compose|all)
+      test_compose_script
+      ;;
+    migration)
+      log "=== Test: Migration Rollback ==="
+      assert_script_syntax "rollback-migration.sh syntax" "infra/scripts/rollback-migration.sh"
+      assert_contains "Uses Secrets Manager" "infra/scripts/rollback-migration.sh" "secretsmanager"
+      assert_contains "Uses drizzle-kit" "infra/scripts/rollback-migration.sh" "drizzle-kit"
+      ;;
+  esac
+
+  test_file_structure
+  test_script_syntax
+  test_documentation
+  test_readme
+
+  log ""
+  log "=== Results ==="
+  log "Passed:  $PASS"
+  log "Failed:  $FAIL"
+  log ""
+
+  if [[ $FAIL -gt 0 ]]; then
+    log "❌ SOME TESTS FAILED"
+    return 1
+  fi
+
+  log "✅ ALL TESTS PASSED"
+  return 0
+}
+
+main "$@"