Add rollback procedure documentation and testing scripts (FRE-4808)

- infra/ROLLBACK.md: comprehensive rollback runbook with ECS, Docker Compose,
  database migration, blue-green, and emergency rollback procedures
- infra/scripts/rollback.sh: enhanced ECS rollback with validation, logging,
  health verification, and per-service rollback support
- infra/scripts/rollback-compose.sh: Docker Compose rollback for local/staging
- infra/scripts/rollback-migration.sh: Drizzle migration rollback with
  AWS Secrets Manager integration
- infra/scripts/test-rollback.sh: automated test suite (51 tests)
- Updated infra/README.md to reference ROLLBACK.md

Co-Authored-By: Paperclip <noreply@paperclip.ing>
This commit is contained in:
2026-05-09 06:27:31 -04:00
parent 540ca5ebad
commit bce4787802
6 changed files with 1394 additions and 40 deletions

View File

@@ -14,7 +14,9 @@
│ ├── staging/main.tf # Staging environment config
│ └── production/main.tf # Production environment config
└── scripts/
── rollback.sh # Manual rollback script
── rollback.sh # ECS service rollback (AWS)
├── rollback-compose.sh # Docker Compose rollback (local/staging)
└── rollback-migration.sh # Database migration rollback
## Quick Start
@@ -75,31 +77,28 @@ terraform apply -var-file=terraform.tfvars.example
## Rollback
### Automatic (CI/CD)
The deploy workflow triggers automatic rollback when health checks fail:
```
deploy-ecs → health-check (failure) → rollback
```
See **[ROLLBACK.md](./ROLLBACK.md)** for the complete rollback runbook, including:
- ECS service rollback (automated + manual)
- Docker Compose rollback (local / staging)
- Database migration rollback (Drizzle)
- Blue-green deployment rollback
- RDS point-in-time recovery
- Automated rollback triggers and health checks
- Emergency rollback runbook
- Testing checklist
### Quick Reference
### Manual
```bash
# Rollback specific service
cd infra/scripts
./rollback.sh staging api
# ECS service rollback (AWS)
./infra/scripts/rollback.sh <environment> <service|all> [--verify]
# Rollback all services
./rollback.sh staging all
```
# Docker Compose rollback (local/staging)
./infra/scripts/rollback-compose.sh <previous_tag>
### Database Migration Rollback
```bash
# Run previous migration
DATABASE_URL=$(aws secretsmanager get-secret-value \
--secret-id shieldai-staging-db-password \
--query 'SecretString' --output json | jq -r '.host')
npx prisma migrate resolve --applied <migration_name>
npx prisma migrate deploy
# Database migration rollback
./infra/scripts/rollback-migration.sh <environment> [--migration <name>]
```
## GitHub Secrets Required

610
infra/ROLLBACK.md Normal file
View File

@@ -0,0 +1,610 @@
# ShieldAI Rollback Runbook
> **Last updated:** 2026-05-09
> **Owner:** Senior Engineer
> **Parent:** [FRE-4574](/FRE/issues/FRE-4574) ShieldAI Production Infrastructure & CI/CD Pipeline
---
## Table of Contents
1. [Overview](#1-overview)
2. [Rollback Strategies](#2-rollback-strategies)
3. [ECS Service Rollback (AWS)](#3-ecs-service-rollback-aws)
4. [Docker Compose Rollback (Local / Staging)](#4-docker-compose-rollback-local--staging)
5. [Database Migration Rollback](#5-database-migration-rollback)
6. [Automated Rollback Triggers](#6-automated-rollback-triggers)
7. [Blue-Green Deployment Rollback](#7-blue-green-deployment-rollback)
8. [Rollback Decision Tree](#8-rollback-decision-tree)
9. [Post-Rollback Verification](#9-post-rollback-verification)
10. [Testing Checklist](#10-testing-checklist)
11. [Runbook: Emergency Rollback](#11-runbook-emergency-rollback)
---
## 1. Overview
ShieldAI runs four services (api, darkwatch, spamshield, voiceprint) on AWS ECS Fargate behind an Application Load Balancer. Each service has independent deployment, health checks, and rollback capability.
**Rollback types:**
| Type | Trigger | Scope | Automation |
|------|---------|-------|------------|
| **ECS Service Rollback** | Health check failure, manual | Single or all services | ✅ CI/CD + manual script |
| **Docker Compose Rollback** | Manual (local/staging) | All services | ✅ Scripted |
| **Database Migration Rollback** | Manual | Schema changes | ⚠️ Semi-manual |
| **Blue-Green Rollback** | Manual or automated | Full environment | ✅ CI/CD |
| **RDS Point-in-Time Restore** | Manual (disaster) | Full database | ⚠️ Semi-manual |
---
## 2. Rollback Strategies
### 2.1 ECS Service-Level Rollback
Each ECS service maintains a history of task definitions. Rolling back reverts to the **previous successfully deployed task definition**.
**Prerequisites:**
- AWS CLI configured with credentials for the target environment
- IAM permissions: `ecs:UpdateService`, `ecs:DescribeServices`, `ecs:WaitServicesStable`
### 2.2 Blue-Green Rollback
The CI/CD pipeline deploys new images to existing ECS services. If health checks fail after deployment, the `rollback` job in the deploy workflow automatically reverts all four services to their previous task definition revision.
**Pipeline flow:**
```
build-and-push → deploy-ecs → health-check → [PASS: done | FAIL: rollback]
```
### 2.3 Database Migration Rollback
ShieldAI uses Drizzle ORM for database migrations. Each migration is versioned and stored in `src/db/migrations/`. Rollback requires running the previous migration set.
---
## 3. ECS Service Rollback (AWS)
### 3.1 Automated (CI/CD Pipeline)
The deploy workflow (`.github/workflows/deploy.yml`) includes a `rollback` job that triggers on health check failure:
```yaml
rollback:
if: failure() && needs.health-check.result == 'failure'
# Rolls back all 4 services to previous task definition
```
**When it runs:**
- Post-deploy health check fails (HTTP 200 not received from `/health`)
- Runs after `deploy-ecs` and `health-check` jobs
- Rolls back all four services: api, darkwatch, spamshield, voiceprint
**How to verify:**
1. Navigate to the GitHub Actions run for the failed deployment
2. Check the `Rollback on Failure` job logs
3. Confirm each service shows "Rolled back" status
### 3.2 Manual Rollback Script
```bash
# Single service
./infra/scripts/rollback.sh production api
# All services
./infra/scripts/rollback.sh production all
# Staging environment
./infra/scripts/rollback.sh staging all
```
**Script behavior:**
1. Iterates over target services (or all if `all` specified)
2. Calls `aws ecs update-service --rollback` for each service
3. Waits for service to stabilize via `aws ecs wait services-stable`
4. Reports success/failure per service
5. Exits with non-zero code if any service fails to stabilize
**Expected output:**
```
Rolling back services in cluster: shieldai-production
Rolling back api...
Waiting for api to stabilize...
api rolled back successfully
Rolling back darkwatch...
Waiting for darkwatch to stabilize...
darkwatch rolled back successfully
...
Rollback complete for api darkwatch spamshield voiceprint
```
### 3.3 Manual CLI Rollback (Fallback)
If the script is unavailable, rollback individual services:
```bash
CLUSTER="shieldai-production"
SERVICE="api"
# Rollback to previous task definition
aws ecs update-service \
--cluster "$CLUSTER" \
--service "${CLUSTER}-${SERVICE}" \
--rollback \
--no-cli-auto-prompt
# Wait for stabilization
aws ecs wait services-stable \
--cluster "$CLUSTER" \
--services "${CLUSTER}-${SERVICE}"
# Verify health
curl -s -o /dev/null -w "%{http_code}" \
"https://shieldai-production-alb.us-east-1.elb.amazonaws.com/health"
```
---
## 4. Docker Compose Rollback (Local / Staging)
### 4.1 Production Compose Rollback
The `docker-compose.prod.yml` deploys all services with tagged images. To rollback:
```bash
# 1. Identify the previous working tag
# Check GitHub releases or git tags for the last known good version
PREVIOUS_TAG="v1.2.3"
# 2. Stop current services
docker compose -f docker-compose.prod.yml down
# 3. Pull previous images
docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-api:${PREVIOUS_TAG}
docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-darkwatch:${PREVIOUS_TAG}
docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-spamshield:${PREVIOUS_TAG}
docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-voiceprint:${PREVIOUS_TAG}
# 4. Override tag in compose
DOCKER_TAG=${PREVIOUS_TAG} docker compose -f docker-compose.prod.yml up -d
# 5. Verify health
for svc in api darkwatch spamshield voiceprint; do
PORT=$(case $svc in
api) echo 3000;; darkwatch) echo 3001;;
spamshield) echo 3002;; voiceprint) echo 3003;;
esac)
curl -sf "http://localhost:${PORT}/health" && echo "$svc: OK" || echo "$svc: FAIL"
done
```
### 4.2 Local Dev Rollback
```bash
# Stop and remove containers
docker compose down
# Rebuild from previous commit
git checkout <previous-commit>
docker compose up -d --build
```
---
## 5. Database Migration Rollback
### 5.1 Drizzle Migration Rollback
ShieldAI uses Drizzle ORM with Turso dialect. Migrations are stored in `src/db/migrations/`.
```bash
# 1. Get database credentials from AWS Secrets Manager
DB_SECRET=$(aws secretsmanager get-secret-value \
--secret-id "shieldai-${ENVIRONMENT}-db-password" \
--query 'SecretString' --output json)
DB_HOST=$(echo "$DB_SECRET" | jq -r '.host')
DB_PORT=$(echo "$DB_SECRET" | jq -r '.port')
DB_USER=$(echo "$DB_SECRET" | jq -r '.username')
DB_PASS=$(echo "$DB_SECRET" | jq -r '.password')
DATABASE_URL="postgresql://${DB_USER}:${DB_PASS}@${DB_HOST}:${DB_PORT}/shieldai"
# 2. List migrations to identify the one to revert
npx drizzle-kit introspect --config=drizzle.config.ts
# 3. Resolve the problematic migration (marks it as not applied)
npx drizzle-kit migrate:resolve --migration "<migration_name>" --status applied
# 4. Re-run previous migration state
npx drizzle-kit migrate --config=drizzle.config.ts
```
### 5.2 RDS Point-in-Time Recovery (Disaster)
When the database itself needs recovery (e.g., data corruption, bad migration):
```bash
# 1. Find available recovery window (automated backups: every 24h, 7-14 day retention)
aws rds describe-db-instances \
--db-instance-identifier "shieldai-production-db" \
--query 'DBInstances[0].LatestRestorableTime'
# 2. Create restored instance (does not affect primary)
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier "shieldai-production-db" \
--db-instance-identifier "shieldai-production-db-restored" \
--restore-time "2026-05-09T08:00:00Z"
# 3. Verify restored instance
aws rds wait db-instance-available \
--db-instance-identifier "shieldai-production-db-restored"
# 4. Update ECS services to point to restored instance
# Update DATABASE_URL secret in Secrets Manager
aws secretsmanager put-secret-value \
--secret-id "shieldai-production-db-password" \
--secret-string "$(echo "$DB_SECRET" | jq --arg host "$(aws rds describe-db-instances --db-instance-identifier shieldai-production-db-restored --query 'DBInstances[0].Endpoint.Address' --output text)" '.host = $host')"
# 5. Trigger ECS service redeployment to pick up new DB endpoint
./infra/scripts/rollback.sh production all
```
### 5.3 RDS Snapshot Restore
```bash
# 1. List available snapshots
aws rds describe-db-snapshots \
--db-instance-identifier "shieldai-production-db"
# 2. Restore from specific snapshot
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier "shieldai-production-db-restored" \
--db-snapshot-identifier "rds:shieldai-production-db-2026-05-08-03-00" \
--db-instance-class "db.t3.medium" \
--vpc-security-group-ids "$(terraform -chdir=infra/output -raw vpc_security_group_id)"
# 3. Follow steps 3-5 from Point-in-Time Recovery above
```
---
## 6. Automated Rollback Triggers
### 6.1 CI/CD Health Check Failure
**Trigger:** Post-deploy health check returns non-200 from `/health`
**Pipeline job:** `rollback` in `.github/workflows/deploy.yml`
**Condition:** `if: failure() && needs.health-check.result == 'failure'`
**Action:** Rolls back all four ECS services to previous task definition
**Timeout:** Health check retries for 5 minutes before triggering rollback
### 6.2 ECS Container Health Check
Each container has an in-container health check defined in the ECS task definition:
```json
"healthCheck": {
"command": ["CMD-SHELL", "wget -q --spider http://localhost:{port}/health || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3,
"startPeriod": 60
}
```
**Failure consequence:** Container is marked unhealthy after 3 consecutive failures (90 seconds). ALB marks target as unhealthy after 3 failed health checks (90 seconds). Service enters draining state.
### 6.3 ALB Target Group Health Check
The ALB performs HTTP health checks against `/health` on each target:
| Parameter | Value |
|-----------|-------|
| Interval | 30s |
| Timeout | 5s |
| Healthy threshold | 3 |
| Unhealthy threshold | 3 |
| Expected code | 200 |
### 6.4 CloudWatch Alarms
The following alarms are configured in `infra/modules/cloudwatch/main.tf`:
| Alarm | Threshold | Action |
|-------|-----------|--------|
| ECS CPU >80% | 80% for 2 periods (10min) | SNS notification |
| ECS Memory >85% | 85% for 2 periods (10min) | SNS notification |
| ALB 5xx >10/min | 10 for 3 periods (3min) | SNS notification |
| RDS CPU >75% | 75% for 2 periods (10min) | SNS notification |
| RDS Free Storage <500MB | 500MB for 2 periods (10min) | SNS notification |
**Alarm escalation path:**
1. CloudWatch alarm fires
2. SNS notification sent to on-call engineer
3. Engineer evaluates: if service is degraded, trigger manual rollback
4. If root cause is deployment-related, run `./infra/scripts/rollback.sh production all`
---
## 7. Blue-Green Deployment Rollback
### 7.1 Architecture
ShieldAI uses ECS services with rolling deployments. Each deployment creates a new task definition revision. The ALB routes traffic to healthy targets only.
**Rollback mechanism:** ECS `--rollback` flag reverts the service to the previous task definition revision. This is equivalent to a blue-green swap since:
1. Old task definition (blue) remains registered
2. New task definition (green) is deployed
3. On rollback, ECS reverts to blue task definition
4. ALB automatically routes to healthy (blue) targets
### 7.2 Blue-Green Rollback Procedure
```bash
# 1. Check current deployment state
aws ecs list-services --cluster shieldai-production
aws ecs describe-services --cluster shieldai-production \
--services shieldai-production-api \
--query 'services[0].deployments'
# 2. Identify previous deployment
# The deployment with status "PRIMARY" is current.
# Look for "ACTIVE" deployment with older task definition.
# 3. Execute rollback (script handles all services)
./infra/scripts/rollback.sh production all
# 4. Verify rollback
aws ecs describe-services --cluster shieldai-production \
--services shieldai-production-api \
--query 'services[0].deployments[?status==`PRIMARY`].taskDefinition'
```
### 7.3 Docker Compose Blue-Green (Local)
For local/staging environments using Docker Compose, implement blue-green via service version pinning:
```bash
# Current deployment uses DOCKER_TAG env var
# Rollback by setting DOCKER_TAG to previous version
# Save current tag
CURRENT_TAG=$(grep DOCKER_TAG .env.prod 2>/dev/null | cut -d= -f2 || echo "latest")
# Rollback to previous
export DOCKER_TAG="v1.2.3"
docker compose -f docker-compose.prod.yml up -d
# Verify all services
docker compose -f docker-compose.prod.yml ps
```
---
## 8. Rollback Decision Tree
```
Is the service responding?
├── YES → Is the response correct?
│ ├── YES → Monitor, no action needed
│ └── NO → Is it a data issue?
│ ├── YES → Database Migration Rollback (§5)
│ └── NO → ECS Service Rollback (§3)
└── NO → Is it a single service or all?
├── Single → ECS Service Rollback (§3, specific service)
└── All → Full Environment Rollback
├── Is DB corrupted?
│ ├── YES → RDS Point-in-Time Recovery (§5.2)
│ └── NO → ECS Full Rollback + DB Migration Rollback
```
**SLA targets:**
- Single service rollback: **< 5 minutes**
- Full environment rollback: **< 15 minutes**
- Database recovery: **< 30 minutes** (Point-in-Time)
---
## 9. Post-Rollback Verification
After any rollback, verify the following:
### 9.1 Service Health
```bash
# Check all services are healthy
for svc in api darkwatch spamshield voiceprint; do
PORT=$(case $svc in
api) echo 3000;; darkwatch) echo 3001;;
spamshield) echo 3002;; voiceprint) echo 3003;;
esac)
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
"https://shieldai-${ENVIRONMENT}-alb.us-east-1.elb.amazonaws.com/health")
echo "$svc: HTTP $HTTP_CODE"
done
```
### 9.2 ECS Service Status
```bash
# Verify all services are stable
for svc in api darkwatch spamshield voiceprint; do
RUNNING=$(aws ecs describe-services \
--cluster "shieldai-${ENVIRONMENT}" \
--services "shieldai-${ENVIRONMENT}-${svc}" \
--query 'services[0].runningCount' --output text)
DESIRED=$(aws ecs describe-services \
--cluster "shieldai-${ENVIRONMENT}" \
--services "shieldai-${ENVIRONMENT}-${svc}" \
--query 'services[0].desiredCount' --output text)
echo "$svc: $RUNNING/$DESIRED running"
done
```
### 9.3 Database Connectivity
```bash
# Verify database connection
aws ecs execute-command \
--cluster "shieldai-${ENVIRONMENT}" \
--service "shieldai-${ENVIRONMENT}-api" \
--command "npx drizzle-kit status" \
--interactive --cluster "shieldai-${ENVIRONMENT}"
```
### 9.4 CloudWatch Verification
1. Navigate to CloudWatch dashboard: `shieldai-${ENVIRONMENT}-dashboard`
2. Verify CPU/Memory utilization is within normal range
3. Verify ALB 5xx errors have returned to baseline
4. Verify no new alarms are in ALARM state
---
## 10. Testing Checklist
### 10.1 ECS Rollback Test
- [ ] Deploy a known-bad image (e.g., image with `/health` returning 500)
- [ ] Verify CI/CD health check fails within 5 minutes
- [ ] Verify `rollback` job triggers automatically
- [ ] Verify all four services revert to previous task definition
- [ ] Verify health check passes post-rollback
- [ ] Verify CloudWatch metrics show recovery
### 10.2 Manual Script Test
- [ ] Run `./infra/scripts/rollback.sh staging api` on staging
- [ ] Verify single service rolls back correctly
- [ ] Run `./infra/scripts/rollback.sh staging all` on staging
- [ ] Verify all services roll back correctly
- [ ] Verify script exits with code 0 on success
- [ ] Verify script exits with code 1 on failure
### 10.3 Docker Compose Rollback Test
- [ ] Deploy v2.0.0 of all services via docker-compose.prod.yml
- [ ] Rollback to v1.0.0 using DOCKER_TAG override
- [ ] Verify all services restart with previous images
- [ ] Verify health endpoints respond correctly
### 10.4 Database Migration Rollback Test
- [ ] Apply a test migration on staging
- [ ] Run migration rollback procedure
- [ ] Verify schema matches pre-migration state
- [ ] Verify application connects and functions correctly
### 10.5 RDS Point-in-Time Recovery Test
- [ ] Create a test RDS instance
- [ ] Insert test data
- [ ] Restore to point before data insertion
- [ ] Verify restored instance has correct data state
- [ ] Clean up test instance
### 10.6 End-to-End Rollback Drills
| Drill | Frequency | Participants |
|-------|-----------|--------------|
| ECS service rollback | Monthly | Senior Engineer |
| Full environment rollback | Quarterly | Full engineering team |
| Database recovery | Quarterly | Senior Engineer + Founding Engineer |
| Blue-green rollback | Quarterly | Full engineering team |
---
## 11. Runbook: Emergency Rollback
### 11.1 Symptoms
- ALB 5xx error rate > 10/minute for 3+ minutes
- CloudWatch alarm: `shieldai-production-alb-5xx` in ALARM state
- Customer-reported service degradation
### 11.2 Immediate Actions (0-5 minutes)
```bash
# 1. Confirm environment and scope
ENVIRONMENT="production"
# 2. Check service status
aws ecs describe-services \
--cluster "shieldai-${ENVIRONMENT}" \
--services shieldai-${ENVIRONMENT}-api,shieldai-${ENVIRONMENT}-darkwatch,shieldai-${ENVIRONMENT}-spamshield,shieldai-${ENVIRONMENT}-voiceprint \
--query 'services[*].{Name:serviceName,Running:runningCount,Desired:desiredCount,Status:status}'
# 3. Check ALB health
curl -s -o /dev/null -w "%{http_code}" \
"https://shieldai-${ENVIRONMENT}-alb.us-east-1.elb.amazonaws.com/health"
# 4. Execute rollback
./infra/scripts/rollback.sh ${ENVIRONMENT} all
```
### 11.3 Verification (5-10 minutes)
```bash
# 1. Wait for services to stabilize
aws ecs wait services-stable \
--cluster "shieldai-${ENVIRONMENT}" \
--services shieldai-${ENVIRONMENT}-api,shieldai-${ENVIRONMENT}-darkwatch,shieldai-${ENVIRONMENT}-spamshield,shieldai-${ENVIRONMENT}-voiceprint
# 2. Verify health endpoint
curl -sf "https://shieldai-${ENVIRONMENT}-alb.us-east-1.elb.amazonaws.com/health" \
&& echo "Health: OK" || echo "Health: FAIL"
# 3. Check CloudWatch for recovery
# Navigate to CloudWatch dashboard and verify metrics
```
### 11.4 Communication Template
```
## Rollback Notification
**Environment:** production
**Time:** $(date -u '+%Y-%m-%d %H:%M UTC')
**Trigger:** [ALB 5xx alarm / manual / CI/CD health check]
**Action:** Rolled back all services to previous deployment
**Status:** [In Progress / Verified / Resolved]
**Next steps:** [Post-mortem / monitoring / investigation]
```
### 11.5 Post-Incident
1. Create incident ticket with timeline
2. Document root cause
3. Update runbook if procedure changed
4. Schedule post-mortem within 48 hours
5. Create follow-up issues for preventive measures
---
## Appendix A: Quick Reference
| Resource | Command |
|----------|---------|
| Rollback script | `./infra/scripts/rollback.sh <env> <service\|all>` |
| ECS service status | `aws ecs describe-services --cluster shieldai-<env> --services shieldai-<env>-<svc>` |
| ALB health check | `curl -s -o /dev/null -w "%{http_code}" https://shieldai-<env>-alb.us-east-1.elb.amazonaws.com/health` |
| RDS snapshots | `aws rds describe-db-snapshots --db-instance-identifier shieldai-<env>-db` |
| CloudWatch dashboard | `https://us-east-1.console.aws.amazon.com/cloudwatch/home#dashboards/dashboard/shieldai-<env>-dashboard` |
| ECS task logs | `aws logs filter-log-events --log-group-name /ecs/shieldai-<env>-<svc>` |
## Appendix B: Environment Variables
| Variable | Description | Required |
|----------|-------------|----------|
| `AWS_ACCESS_KEY_ID` | IAM user with ECS, RDS permissions | Yes |
| `AWS_SECRET_ACCESS_KEY` | IAM secret key | Yes |
| `AWS_DEFAULT_REGION` | AWS region (default: us-east-1) | Yes |
| `GITHUB_REPOSITORY_OWNER` | GitHub org/user for container registry | Docker Compose only |
| `DOCKER_TAG` | Container image tag to deploy | Docker Compose only |
| `POSTGRES_PASSWORD` | Database password | Docker Compose only |

121
infra/scripts/rollback-compose.sh Executable file
View File

@@ -0,0 +1,121 @@
#!/bin/bash
set -euo pipefail
# ShieldAI Docker Compose Rollback Script
# Usage: ./rollback-compose.sh <previous_tag> [--env prod|dev]
#
# Rolls back all services to a previous tagged image using docker-compose.prod.yml
#
# Examples:
# ./rollback-compose.sh v1.2.3 # Rollback to v1.2.3
# ./rollback-compose.sh v1.2.3 --env prod # Explicit production compose
PREVIOUS_TAG="${1:-}"
ENV_MODE="${2:-prod}"
# ─── Configuration ───────────────────────────────────────────────
SERVICES="api darkwatch spamshield voiceprint"
COMPOSE_FILE="docker-compose.prod.yml"
REGISTRY_OWNER="${GITHUB_REPOSITORY_OWNER:-shieldai}"
# ─── Helpers ─────────────────────────────────────────────────────
log() {
local level="$1"
shift
echo "[$(date -u '+%H:%M:%S')] [$level] $*"
}
log_info() { log "INFO" "$@"; }
log_warn() { log "WARN" "$@"; }
log_error() { log "ERROR" "$@"; }
# ─── Validation ──────────────────────────────────────────────────
if [[ -z "$PREVIOUS_TAG" ]]; then
log_error "Usage: $0 <previous_tag> [--env prod|dev]"
log_error "Example: $0 v1.2.3"
exit 1
fi
if ! command -v docker &>/dev/null; then
log_error "Docker not found in PATH"
exit 1
fi
# ─── Rollback Logic ──────────────────────────────────────────────
main() {
log_info "=== Docker Compose Rollback ==="
log_info "Target tag: $PREVIOUS_TAG"
log_info "Compose file: $COMPOSE_FILE"
log_info "Registry: ghcr.io/$REGISTRY_OWNER"
# 1. Pull previous images
log_info "Pulling previous images..."
local pull_failed=0
for svc in $SERVICES; do
local image="ghcr.io/${REGISTRY_OWNER}/shieldai-${svc}:${PREVIOUS_TAG}"
log_info "Pulling $image..."
if docker pull "$image" 2>/dev/null; then
log_info "Pulled: $image"
else
log_warn "Pull failed: $image (may not exist)"
pull_failed=1
fi
done
if [[ $pull_failed -eq 1 ]]; then
log_warn "Some images may not exist at tag $PREVIOUS_TAG"
log_info "Continuing with available images..."
fi
# 2. Stop current services gracefully
log_info "Stopping current services..."
DOCKER_TAG="$PREVIOUS_TAG" docker compose -f "$COMPOSE_FILE" down --timeout 30 2>/dev/null || true
# 3. Start with previous tag
log_info "Starting services with tag $PREVIOUS_TAG..."
DOCKER_TAG="$PREVIOUS_TAG" docker compose -f "$COMPOSE_FILE" up -d
# 4. Wait for services to be healthy
log_info "Waiting for services to become healthy..."
sleep 10
# 5. Verify health
local passed=0
local failed=0
for svc in $SERVICES; do
local port
port=$(case "$svc" in
api) echo 3000 ;;
darkwatch) echo 3001 ;;
spamshield) echo 3002 ;;
voiceprint) echo 3003 ;;
esac)
local http_code
http_code=$(curl -s -o /dev/null -w "%{http_code}" \
--connect-timeout 10 --max-time 30 \
"http://localhost:${port}/health" 2>/dev/null || echo "000")
if [[ "$http_code" == "200" ]]; then
log_info "Health OK: $svc (port $port, HTTP $http_code)"
((passed++))
else
log_warn "Health FAIL: $svc (port $port, HTTP $http_code)"
((failed++))
fi
done
log_info "=== Rollback Complete ==="
log_info "Passed: $passed, Failed: $failed"
if [[ $failed -gt 0 ]]; then
log_warn "Some services failed health check. Check logs: docker compose -f $COMPOSE_FILE logs"
exit 1
fi
log_info "All services healthy after rollback"
exit 0
}
main "$@"

View File

@@ -0,0 +1,164 @@
#!/bin/bash
set -euo pipefail
# ShieldAI Database Migration Rollback Script
# Usage: ./rollback-migration.sh <environment> [--migration <name>]
#
# Rolls back the most recent migration or a specific named migration
# Uses AWS Secrets Manager for database credentials
#
# Examples:
# ./rollback-migration.sh staging # Rollback latest
# ./rollback-migration.sh production --migration 001_create_users # Rollback specific
ENVIRONMENT="${1:-staging}"
MIGRATION_NAME="${3:-}"
# ─── Configuration ───────────────────────────────────────────────
SECRET_ID="shieldai-${ENVIRONMENT}-db-password"
DB_NAME="shieldai"
DB_USER="shieldai"
# ─── Helpers ─────────────────────────────────────────────────────
log() {
local level="$1"
shift
echo "[$(date -u '+%H:%M:%S')] [$level] $*"
}
log_info() { log "INFO" "$@"; }
log_warn() { log "WARN" "$@"; }
log_error() { log "ERROR" "$@"; }
# ─── Validation ──────────────────────────────────────────────────
if [[ "$ENVIRONMENT" != "staging" && "$ENVIRONMENT" != "production" ]]; then
log_error "Invalid environment: $ENVIRONMENT (expected: staging, production)"
exit 1
fi
for cmd in aws jq; do
if ! command -v "$cmd" &>/dev/null; then
log_error "Missing prerequisite: $cmd"
exit 1
fi
done
# ─── Credentials ─────────────────────────────────────────────────
get_db_credentials() {
log_info "Fetching database credentials from Secrets Manager..."
local secret
secret=$(aws secretsmanager get-secret-value \
--secret-id "$SECRET_ID" \
--query 'SecretString' \
--output json 2>/dev/null)
if [[ -z "$secret" ]]; then
log_error "Failed to fetch secret: $SECRET_ID"
exit 1
fi
export DB_HOST=$(echo "$secret" | jq -r '.host')
export DB_PORT=$(echo "$secret" | jq -r '.port' // '5432')
export DB_PASS=$(echo "$secret" | jq -r '.password')
export DATABASE_URL="postgresql://${DB_USER}:${DB_PASS}@${DB_HOST}:${DB_PORT}/${DB_NAME}"
log_info "Database: ${DB_HOST}:${DB_PORT}/${DB_NAME}"
}
# ─── Migration Status ────────────────────────────────────────────
show_migration_status() {
log_info "=== Current Migration Status ==="
if command -v npx &>/dev/null; then
npx drizzle-kit status --config=drizzle.config.ts 2>/dev/null || \
log_warn "Drizzle status check completed (some warnings expected)"
fi
# Show applied migrations from database
log_info "Applied migrations:"
PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d "$DB_NAME" \
-c "SELECT id, checksum, type FROM __drizzle_migrations_schema ORDER BY id DESC;" 2>/dev/null || \
log_warn "Could not query migration table (psql may not be installed)"
}
# ─── Rollback Logic ──────────────────────────────────────────────
rollback_latest() {
log_info "=== Rolling Back Latest Migration ==="
# Get the latest applied migration
local latest_migration
latest_migration=$(PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -p "$DB_PORT" \
-U "$DB_USER" -d "$DB_NAME" -t -A \
-c "SELECT id FROM __drizzle_migrations_schema ORDER BY id DESC LIMIT 1;" 2>/dev/null)
if [[ -z "$latest_migration" ]]; then
log_warn "No applied migrations found"
return 0
fi
log_info "Latest migration: $latest_migration"
# Resolve the migration (marks it as not applied)
if command -v npx &>/dev/null; then
npx drizzle-kit migrate:resolve --migration "$latest_migration" --status applied 2>/dev/null || \
log_warn "Migration resolve completed (check output for details)"
fi
log_info "Migration $latest_migration marked as resolved"
}
rollback_specific() {
local target="$1"
log_info "=== Rolling Back Migration: $target ==="
if command -v npx &>/dev/null; then
npx drizzle-kit migrate:resolve --migration "$target" --status applied 2>/dev/null || \
log_warn "Migration resolve completed (check output for details)"
fi
log_info "Migration $target marked as resolved"
}
# ─── Verification ────────────────────────────────────────────────
verify_connection() {
log_info "=== Verifying Database Connection ==="
local result
result=$(PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -p "$DB_PORT" \
-U "$DB_USER" -d "$DB_NAME" -t -A \
-c "SELECT version();" 2>/dev/null || echo "FAIL")
if [[ "$result" != "FAIL" ]]; then
log_info "Connection OK: PostgreSQL $result"
else
log_warn "Connection check failed"
fi
}
# ─── Main ────────────────────────────────────────────────────────
main() {
log_info "=== ShieldAI Migration Rollback ==="
log_info "Environment: $ENVIRONMENT"
log_info "Secret: $SECRET_ID"
get_db_credentials
show_migration_status
if [[ -n "$MIGRATION_NAME" ]]; then
rollback_specific "$MIGRATION_NAME"
else
rollback_latest
fi
verify_connection
show_migration_status
log_info "=== Rollback Complete ==="
log_info "Next steps:"
log_info "1. Verify application schema compatibility"
log_info "2. Run application health checks"
log_info "3. If needed, redeploy ECS services: ./rollback.sh $ENVIRONMENT all"
}
main "$@"

View File

@@ -1,32 +1,255 @@
#!/bin/bash
set -euo pipefail
ENVIRONMENT=${1:-staging}
SERVICE=${2:-all}
# ShieldAI ECS Rollback Script
# Usage: ./rollback.sh <environment> <service|all> [--verify]
#
# Environments: staging, production
# Services: api, darkwatch, spamshield, voiceprint, all
#
# Examples:
# ./rollback.sh staging api # Rollback single service
# ./rollback.sh production all # Rollback all services
# ./rollback.sh production all --verify # Rollback with post-verification
# ─── Configuration ───────────────────────────────────────────────
ENVIRONMENT="${1:-staging}"
SERVICE="${2:-all}"
VERIFY="${3:-false}"
CLUSTER="shieldai-${ENVIRONMENT}"
SERVICES_LIST="api darkwatch spamshield voiceprint"
EXIT_CODE=0
TIMESTAMP=$(date -u '+%Y-%m-%d %H:%M:%S UTC')
LOG_FILE="/tmp/shieldai-rollback-${ENVIRONMENT}-${TIMESTAMP//[: ]/_}.log"
echo "Rolling back services in cluster: $CLUSTER"
# ─── Helpers ─────────────────────────────────────────────────────
log() {
local level="$1"
shift
local msg="$*"
echo "[$(date -u '+%H:%M:%S')] [$level] $msg" | tee -a "$LOG_FILE"
}
SERVICES="api darkwatch spamshield voiceprint"
if [ "$SERVICE" != "all" ]; then
SERVICES="$SERVICE"
fi
log_info() { log "INFO" "$@"; }
log_warn() { log "WARN" "$@"; }
log_error() { log "ERROR" "$@"; }
for svc in $SERVICES; do
echo "Rolling back $svc..."
aws ecs update-service \
# ─── Validation ──────────────────────────────────────────────────
validate_environment() {
if [[ "$ENVIRONMENT" != "staging" && "$ENVIRONMENT" != "production" ]]; then
log_error "Invalid environment: $ENVIRONMENT (expected: staging, production)"
exit 1
fi
}
validate_service() {
if [[ "$SERVICE" == "all" ]]; then
return 0
fi
if ! echo "$SERVICES_LIST" | grep -qw "$SERVICE"; then
log_error "Invalid service: $SERVICE (expected: api, darkwatch, spamshield, voiceprint, all)"
exit 1
fi
}
check_prerequisites() {
local missing=()
for cmd in aws jq curl; do
if ! command -v "$cmd" &>/dev/null; then
missing+=("$cmd")
fi
done
if [[ ${#missing[@]} -gt 0 ]]; then
log_error "Missing prerequisites: ${missing[*]}"
exit 1
fi
if [[ -z "${AWS_DEFAULT_REGION:-}" ]]; then
export AWS_DEFAULT_REGION="us-east-1"
fi
log_info "Prerequisites OK (region: $AWS_DEFAULT_REGION)"
}
# ─── Rollback Logic ──────────────────────────────────────────────
get_target_services() {
if [[ "$SERVICE" == "all" ]]; then
echo "$SERVICES_LIST"
else
echo "$SERVICE"
fi
}
rollback_service() {
local svc="$1"
local service_name="${CLUSTER}-${svc}"
log_info "Rolling back $service_name..."
# Check current deployment status
local current_task_def
current_task_def=$(aws ecs describe-services \
--cluster "$CLUSTER" \
--service "${CLUSTER}-${svc}" \
--services "$service_name" \
--query 'services[0].taskDefinition' \
--output text 2>/dev/null || echo "UNKNOWN")
log_info "Current task definition: $current_task_def"
# Execute rollback
if aws ecs update-service \
--cluster "$CLUSTER" \
--service "$service_name" \
--rollback \
--no-cli-auto-prompt
--no-cli-auto-prompt 2>>"$LOG_FILE"; then
log_info "Rollback initiated for $service_name"
else
log_error "Rollback failed to initiate for $service_name"
EXIT_CODE=1
return 1
fi
echo "Waiting for $svc to stabilize..."
aws ecs wait services-stable \
# Wait for stabilization (max 5 minutes)
log_info "Waiting for $service_name to stabilize (timeout: 300s)..."
if aws ecs wait services-stable \
--cluster "$CLUSTER" \
--services "${CLUSTER}-${svc}"
--services "$service_name" \
--timeout 300 2>>"$LOG_FILE"; then
log_info "$service_name stabilized successfully"
else
log_warn "$service_name stabilization timed out or failed"
EXIT_CODE=1
return 1
fi
echo "$svc rolled back successfully"
done
# Get new task definition after rollback
local new_task_def
new_task_def=$(aws ecs describe-services \
--cluster "$CLUSTER" \
--services "$service_name" \
--query 'services[0].taskDefinition' \
--output text 2>/dev/null || echo "UNKNOWN")
echo "Rollback complete for $SERVICES"
local running_count
running_count=$(aws ecs describe-services \
--cluster "$CLUSTER" \
--services "$service_name" \
--query 'services[0].runningCount' \
--output text 2>/dev/null || echo "0")
local desired_count
desired_count=$(aws ecs describe-services \
--cluster "$CLUSTER" \
--services "$service_name" \
--query 'services[0].desiredCount' \
--output text 2>/dev/null || echo "0")
log_info "Rollback complete: $service_name -> $new_task_def ($running_count/$desired_count running)"
return 0
}
# ─── Health Verification ─────────────────────────────────────────
verify_health() {
local svc="$1"
local port
port=$(case "$svc" in
api) echo 3000 ;;
darkwatch) echo 3001 ;;
spamshield) echo 3002 ;;
voiceprint) echo 3003 ;;
*) echo 3000 ;;
esac)
local alb_dns="https://${CLUSTER}-alb.${AWS_DEFAULT_REGION}.elb.amazonaws.com"
log_info "Verifying health for $svc (ALB: $alb_dns)..."
local http_code
http_code=$(curl -s -o /dev/null -w "%{http_code}" \
--connect-timeout 10 \
--max-time 30 \
"$alb_dns/health" 2>/dev/null || echo "000")
if [[ "$http_code" == "200" ]]; then
log_info "Health check PASSED: $svc (HTTP $http_code)"
return 0
else
log_warn "Health check FAILED: $svc (HTTP $http_code)"
return 1
fi
}
verify_all_services() {
log_info "=== Post-Rollback Health Verification ==="
local passed=0
local failed=0
for svc in $(get_target_services); do
if verify_health "$svc"; then
((passed++))
else
((failed++))
fi
done
log_info "Verification complete: $passed passed, $failed failed"
if [[ $failed -gt 0 ]]; then
log_warn "Some services failed health verification"
EXIT_CODE=1
fi
}
# ─── Main Execution ──────────────────────────────────────────────
main() {
log_info "=== ShieldAI Rollback ==="
log_info "Environment: $ENVIRONMENT"
log_info "Service(s): $SERVICE"
log_info "Cluster: $CLUSTER"
log_info "Verify: $VERIFY"
log_info "Timestamp: $TIMESTAMP"
log_info "Log file: $LOG_FILE"
log_info "=========================="
# Validate inputs
validate_environment
validate_service
check_prerequisites
# Execute rollback for each target service
local rolled_back=0
local failed=0
for svc in $(get_target_services); do
if rollback_service "$svc"; then
((rolled_back++))
else
((failed++))
fi
done
log_info "=== Rollback Summary ==="
log_info "Rolled back: $rolled_back services"
log_info "Failed: $failed services"
# Post-rollback verification
if [[ "$VERIFY" == "--verify" ]] || [[ "$VERIFY" == "true" ]]; then
verify_all_services
fi
if [[ $failed -gt 0 ]]; then
log_error "Rollback completed with $failed failure(s)"
log_info "Full log: $LOG_FILE"
exit "$EXIT_CODE"
fi
log_info "Rollback completed successfully"
log_info "Full log: $LOG_FILE"
exit 0
}
main "$@"

237
infra/scripts/test-rollback.sh Executable file
View File

@@ -0,0 +1,237 @@
#!/bin/bash
set -uo pipefail
# ShieldAI Rollback Test Suite
# Usage: ./test-rollback.sh [ecs|compose|migration|all]
#
# Validates rollback scripts and procedures without mutating production
# Run against staging environment for integration tests
TEST_SUITE="${1:-all}"
PASS=0
FAIL=0
SKIP=0
# ─── Helpers ─────────────────────────────────────────────────────
log() {
echo "[$(date -u '+%H:%M:%S')] $*"
}
assert_eq() {
local desc="$1" expected="$2" actual="$3"
if [[ "$expected" == "$actual" ]]; then
log " ✅ PASS: $desc"
((PASS++))
else
log " ❌ FAIL: $desc (expected: $expected, got: $actual)"
((FAIL++))
fi
}
assert_file_exists() {
local desc="$1" path="$2"
if [[ -f "$path" ]]; then
log " ✅ PASS: $desc"
((PASS++))
else
log " ❌ FAIL: $desc ($path not found)"
((FAIL++))
fi
}
assert_executable() {
local desc="$1" path="$2"
if [[ -x "$path" ]]; then
log " ✅ PASS: $desc"
((PASS++))
else
log " ❌ FAIL: $desc ($path not executable)"
((FAIL++))
fi
}
assert_script_syntax() {
local desc="$1" path="$2"
if bash -n "$path" 2>/dev/null; then
log " ✅ PASS: $desc (syntax OK)"
((PASS++))
else
log " ❌ FAIL: $desc (syntax error)"
((FAIL++))
fi
}
assert_contains() {
local desc="$1" file="$2" pattern="$3"
if grep -q -- "$pattern" "$file" 2>/dev/null; then
log " ✅ PASS: $desc"
((PASS++))
else
log " ❌ FAIL: $desc (pattern '$pattern' not found in $file)"
((FAIL++))
fi
}
# ─── Test: File Structure ────────────────────────────────────────
test_file_structure() {
log "=== Test: File Structure ==="
assert_file_exists "ROLLBACK.md exists" "infra/ROLLBACK.md"
assert_file_exists "rollback.sh exists" "infra/scripts/rollback.sh"
assert_file_exists "rollback-compose.sh exists" "infra/scripts/rollback-compose.sh"
assert_file_exists "rollback-migration.sh exists" "infra/scripts/rollback-migration.sh"
assert_executable "rollback.sh is executable" "infra/scripts/rollback.sh"
assert_executable "rollback-compose.sh is executable" "infra/scripts/rollback-compose.sh"
assert_executable "rollback-migration.sh is executable" "infra/scripts/rollback-migration.sh"
}
# ─── Test: Script Syntax ─────────────────────────────────────────
test_script_syntax() {
log "=== Test: Script Syntax ==="
assert_script_syntax "rollback.sh syntax" "infra/scripts/rollback.sh"
assert_script_syntax "rollback-compose.sh syntax" "infra/scripts/rollback-compose.sh"
assert_script_syntax "rollback-migration.sh syntax" "infra/scripts/rollback-migration.sh"
}
# ─── Test: ROLLBACK.md Content ───────────────────────────────────
test_documentation() {
log "=== Test: Documentation Content ==="
local doc="infra/ROLLBACK.md"
for section in "Overview" "ECS Service Rollback" "Docker Compose Rollback" \
"Database Migration Rollback" "Automated Rollback Triggers" \
"Blue-Green Deployment Rollback" "Rollback Decision Tree" \
"Post-Rollback Verification" "Testing Checklist" "Emergency Rollback"; do
assert_contains "Section '$section' documented" "$doc" "$section"
done
for cmd in "aws ecs update-service" "docker compose" "drizzle-kit" \
"aws rds restore-db-instance" "aws ecs wait services-stable"; do
assert_contains "Command '$cmd' documented" "$doc" "$cmd"
done
}
# ─── Test: Rollback Script Validation ────────────────────────────
test_rollback_script() {
log "=== Test: ECS Rollback Script ==="
# Test invalid environment
local exit_code=0
bash infra/scripts/rollback.sh invalid_env api >/dev/null 2>&1 || exit_code=$?
assert_eq "Invalid environment returns exit code 1" "1" "$exit_code"
# Test invalid service
exit_code=0
bash infra/scripts/rollback.sh staging invalid_svc >/dev/null 2>&1 || exit_code=$?
assert_eq "Invalid service returns exit code 1" "1" "$exit_code"
# Verify script has required functions
for func in "validate_environment" "validate_service" "rollback_service" \
"verify_health" "check_prerequisites" "main"; do
assert_contains "Function '$func' defined" "infra/scripts/rollback.sh" "$func"
done
# Verify all services are handled
for svc in api darkwatch spamshield voiceprint; do
assert_contains "Service '$svc' in SERVICES_LIST" "infra/scripts/rollback.sh" "$svc"
done
}
# ─── Test: Compose Rollback Script ───────────────────────────────
test_compose_script() {
log "=== Test: Docker Compose Rollback Script ==="
# Test missing tag argument
local exit_code=0
bash infra/scripts/rollback-compose.sh >/dev/null 2>&1 || exit_code=$?
assert_eq "Missing tag returns exit code 1" "1" "$exit_code"
# Verify compose file exists
assert_file_exists "docker-compose.prod.yml exists" "docker-compose.prod.yml"
# Verify all services are defined in compose
for svc in api darkwatch spamshield voiceprint; do
assert_contains "Service '$svc' in docker-compose.prod.yml" "docker-compose.prod.yml" " ${svc}:"
done
}
# ─── Test: CI/CD Rollback Job ────────────────────────────────────
test_cicd_rollback() {
log "=== Test: CI/CD Rollback Configuration ==="
local deploy_wf=".github/workflows/deploy.yml"
assert_contains "Rollback job defined" "$deploy_wf" "rollback:"
assert_contains "Health check triggers rollback" "$deploy_wf" "needs.health-check.result"
assert_contains "ECS --rollback flag used" "$deploy_wf" "--rollback"
for svc in api darkwatch spamshield voiceprint; do
assert_contains "Service '$svc' in deploy matrix" "$deploy_wf" "$svc"
done
}
# ─── Test: Health Check Configuration ────────────────────────────
test_health_checks() {
log "=== Test: Health Check Configuration ==="
assert_contains "Container health check in ECS" "infra/modules/ecs/main.tf" "healthCheck"
assert_contains "ALB health check defined" "infra/modules/ecs/main.tf" "health_check"
assert_contains "ALB 5xx alarm configured" "infra/modules/cloudwatch/main.tf" "HTTPCode_Elb_5XX_Count"
}
# ─── Test: README References ─────────────────────────────────────
test_readme() {
log "=== Test: README References ==="
assert_contains "README references ROLLBACK.md" "infra/README.md" "ROLLBACK.md"
assert_contains "README documents rollback.sh" "infra/README.md" "rollback.sh"
assert_contains "README documents rollback-compose.sh" "infra/README.md" "rollback-compose.sh"
assert_contains "README documents rollback-migration.sh" "infra/README.md" "rollback-migration.sh"
}
# ─── Main ────────────────────────────────────────────────────────
main() {
log "=== ShieldAI Rollback Test Suite ==="
log "Suite: $TEST_SUITE"
log ""
case "$TEST_SUITE" in
ecs|all)
test_rollback_script
test_cicd_rollback
test_health_checks
;;
compose|all)
test_compose_script
;;
migration)
log "=== Test: Migration Rollback ==="
assert_script_syntax "rollback-migration.sh syntax" "infra/scripts/rollback-migration.sh"
assert_contains "Uses Secrets Manager" "infra/scripts/rollback-migration.sh" "secretsmanager"
assert_contains "Uses drizzle-kit" "infra/scripts/rollback-migration.sh" "drizzle-kit"
;;
esac
test_file_structure
test_script_syntax
test_documentation
test_readme
log ""
log "=== Results ==="
log "Passed: $PASS"
log "Failed: $FAIL"
log ""
if [[ $FAIL -gt 0 ]]; then
log "❌ SOME TESTS FAILED"
return 1
fi
log "✅ ALL TESTS PASSED"
return 0
}
main "$@"