# ShieldAI Rollback Runbook

> **Last updated:** 2026-05-12
> **Owner:** Senior Engineer
> **Parent:** [FRE-4574](/FRE/issues/FRE-4574) ShieldAI Production Infrastructure & CI/CD Pipeline
> **Reviewed by:** Code Reviewer (FRE-4808) on 2026-05-12

---

## Table of Contents

1. [Overview](#1-overview)
2. [Rollback Strategies](#2-rollback-strategies)
3. [ECS Service Rollback (AWS)](#3-ecs-service-rollback-aws)
4. [Docker Compose Rollback (Local / Staging)](#4-docker-compose-rollback-local--staging)
5. [Database Migration Rollback](#5-database-migration-rollback)
6. [Automated Rollback Triggers](#6-automated-rollback-triggers)
7. [Blue-Green Deployment Rollback](#7-blue-green-deployment-rollback)
8. [Rollback Decision Tree](#8-rollback-decision-tree)
9. [Post-Rollback Verification](#9-post-rollback-verification)
10. [Testing Checklist](#10-testing-checklist)
11. [Runbook: Emergency Rollback](#11-runbook-emergency-rollback)

---

## 1. Overview

ShieldAI runs four services (api, darkwatch, spamshield, voiceprint) on AWS ECS Fargate behind an Application Load Balancer. Each service has independent deployment, health checks, and rollback capability.

**Rollback types:**

| Type | Trigger | Scope | Automation |
|------|---------|-------|------------|
| **ECS Service Rollback** | Health check failure, manual | Single or all services | ✅ CI/CD + manual script |
| **Docker Compose Rollback** | Manual (local/staging) | All services | ✅ Scripted |
| **Database Migration Rollback** | Manual | Schema changes | ⚠️ Semi-manual |
| **Blue-Green Rollback** | Manual or automated | Full environment | ✅ CI/CD |
| **RDS Point-in-Time Restore** | Manual (disaster) | Full database | ⚠️ Semi-manual |

---

## 2. Rollback Strategies

### 2.1 ECS Service-Level Rollback

Each ECS service's task definition family retains its earlier revisions. Rolling back reverts the service to the **previous successfully deployed task definition**.

**Prerequisites:**
- AWS CLI configured with credentials for the target environment
- IAM permissions: `ecs:UpdateService` and `ecs:DescribeServices` (the `aws ecs wait services-stable` waiter polls `DescribeServices`, so it needs no additional action)
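
Reverting to the previous task definition comes down to decrementing the revision number of the one currently deployed. The helper below is a minimal sketch of that step; the function name is hypothetical, and it assumes revisions are contiguous and that the previous revision has not been deregistered.

```bash
#!/usr/bin/env bash
# Sketch: derive the previous task definition revision from the current one.
# Accepts either "family:revision" or a full task definition ARN.
previous_task_definition() {
  local current="$1"
  local family="${current%:*}"     # everything before the last colon
  local revision="${current##*:}"  # everything after the last colon
  if ! [[ "$revision" =~ ^[0-9]+$ ]] || (( revision <= 1 )); then
    echo "no previous revision for: $current" >&2
    return 1
  fi
  echo "${family}:$(( revision - 1 ))"
}
```

The result can be passed to `aws ecs update-service --task-definition`, which accepts both `family:revision` and full ARNs.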

### 2.2 Blue-Green Rollback

The CI/CD pipeline deploys new images to the existing ECS services. If health checks fail after deployment, the `rollback` job in the deploy workflow automatically reverts all four services to their previous task definition revision.

**Pipeline flow:**
```
build-and-push → deploy-ecs → health-check → [PASS: done | FAIL: rollback]
```

### 2.3 Database Migration Rollback

ShieldAI uses Drizzle ORM for database migrations. Each migration is versioned and stored in `src/db/migrations/`. Drizzle Kit does not generate down migrations, so rolling back means manually reversing the schema change and then restoring the migration journal to its previous state (see Section 5).

---

## 3. ECS Service Rollback (AWS)

### 3.1 Automated (CI/CD Pipeline)

The deploy workflow (`.github/workflows/deploy.yml`) includes a `rollback` job that triggers on health check failure:

```yaml
rollback:
  needs: [deploy-ecs, health-check]  # required for needs.health-check.result to resolve
  if: failure() && needs.health-check.result == 'failure'
  # Rolls back all 4 services to previous task definition
```

**When it runs:**
- Post-deploy health check fails (HTTP 200 not received from `/health`)
- Runs after `deploy-ecs` and `health-check` jobs
- Rolls back all four services: api, darkwatch, spamshield, voiceprint

**How to verify:**
1. Navigate to the GitHub Actions run for the failed deployment
2. Check the `Rollback on Failure` job logs
3. Confirm each service shows "Rolled back" status

### 3.2 Manual Rollback Script

```bash
# Single service
./infra/scripts/rollback.sh production api

# All services
./infra/scripts/rollback.sh production all

# Staging environment
./infra/scripts/rollback.sh staging all
```

**Script behavior:**
1. Iterates over the target services (all four if `all` is specified)
2. Calls `aws ecs update-service` for each service, pinning it to its previous task definition revision (the AWS CLI has no `--rollback` option for this command)
3. Waits for the service to stabilize via `aws ecs wait services-stable`
4. Reports success or failure per service
5. Exits with a non-zero code if any service fails to stabilize
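
The control flow in the steps above can be sketched as follows. This is a hypothetical illustration, not the contents of `infra/scripts/rollback.sh`: `ROLLBACK_CMD` and `WAIT_CMD` are assumed stand-ins for the AWS CLI calls (each receives the environment and service name) so that the loop can be exercised without AWS access.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the rollback loop; the real script may differ.
ALL_SERVICES=(api darkwatch spamshield voiceprint)

rollback_services() {
  local env="$1" target="$2" failed=0 svc
  local -a services
  if [[ "$target" == "all" ]]; then
    services=("${ALL_SERVICES[@]}")
  else
    services=("$target")
  fi
  echo "Rolling back services in cluster: shieldai-${env}"
  for svc in "${services[@]}"; do
    echo "Rolling back ${svc}..."
    if ! "${ROLLBACK_CMD:-true}" "$env" "$svc"; then
      echo "${svc} rollback call failed" >&2
      failed=1
      continue
    fi
    echo "Waiting for ${svc} to stabilize..."
    if "${WAIT_CMD:-true}" "$env" "$svc"; then
      echo "${svc} rolled back successfully"
    else
      echo "${svc} failed to stabilize" >&2
      failed=1
    fi
  done
  return "$failed"   # non-zero if any service failed
}
```

Continuing past a failed service, rather than aborting on the first error, keeps one stuck service from blocking the rollback of the others; the failure is still surfaced through the exit status.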

**Expected output:**
```
Rolling back services in cluster: shieldai-production
Rolling back api...
Waiting for api to stabilize...
api rolled back successfully
Rolling back darkwatch...
Waiting for darkwatch to stabilize...
darkwatch rolled back successfully
...
Rollback complete for api darkwatch spamshield voiceprint
```

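### 3.3 Manual CLI Rollback (Fallback)

If the script is unavailable, roll back individual services:

```bash
CLUSTER="shieldai-production"
SERVICE="api"

# Look up the task definition the service is currently running
CURRENT_TD=$(aws ecs describe-services \
  --cluster "$CLUSTER" \
  --services "${CLUSTER}-${SERVICE}" \
  --query 'services[0].taskDefinition' --output text)

# `aws ecs update-service` has no --rollback option, so compute the previous
# revision explicitly (assumes that revision is still registered)
PREVIOUS_TD="${CURRENT_TD%:*}:$(( ${CURRENT_TD##*:} - 1 ))"

# Rollback to previous task definition
aws ecs update-service \
  --cluster "$CLUSTER" \
  --service "${CLUSTER}-${SERVICE}" \
  --task-definition "$PREVIOUS_TD" \
  --no-cli-auto-prompt

# Wait for stabilization
aws ecs wait services-stable \
  --cluster "$CLUSTER" \
  --services "${CLUSTER}-${SERVICE}"

# Verify health
curl -s -o /dev/null -w "%{http_code}" \
  "https://shieldai-production-alb.us-east-1.elb.amazonaws.com/health"
```

---
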
## 4. Docker Compose Rollback (Local / Staging)

### 4.1 Production Compose Rollback

The `docker-compose.prod.yml` deploys all services with tagged images. To roll back:

```bash
# 1. Identify the previous working tag
#    Check GitHub releases or git tags for the last known good version
PREVIOUS_TAG="v1.2.3"

# 2. Stop current services
docker compose -f docker-compose.prod.yml down

# 3. Pull previous images
docker pull "ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-api:${PREVIOUS_TAG}"
docker pull "ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-darkwatch:${PREVIOUS_TAG}"
docker pull "ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-spamshield:${PREVIOUS_TAG}"
docker pull "ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-voiceprint:${PREVIOUS_TAG}"

# 4. Override tag in compose
DOCKER_TAG="${PREVIOUS_TAG}" docker compose -f docker-compose.prod.yml up -d

# 5. Verify health on each service's local port
for svc in api darkwatch spamshield voiceprint; do
  case "$svc" in
    api)        PORT=3000 ;;
    darkwatch)  PORT=3001 ;;
    spamshield) PORT=3002 ;;
    voiceprint) PORT=3003 ;;
  esac
  if curl -sf "http://localhost:${PORT}/health" >/dev/null; then
    echo "$svc: OK"
  else
    echo "$svc: FAIL"
  fi
done
```

### 4.2 Local Dev Rollback

```bash
# Stop and remove containers
docker compose down

# Rebuild from previous commit
git checkout <previous-commit>
docker compose up -d --build
```

---

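## 5. Database Migration Rollback

### 5.1 Drizzle Migration Rollback

ShieldAI uses Drizzle ORM with the PostgreSQL dialect. Migrations are stored in `src/db/migrations/`. Drizzle Kit has no rollback or `migrate:resolve` command, so the revert is done with SQL and the migration journal is corrected by hand.

```bash
# 1. Get database credentials from AWS Secrets Manager
#    (--output text returns the raw JSON document so jq can parse it)
DB_SECRET=$(aws secretsmanager get-secret-value \
  --secret-id "shieldai-${ENVIRONMENT}-db-password" \
  --query 'SecretString' --output text)

DB_HOST=$(echo "$DB_SECRET" | jq -r '.host')
DB_PORT=$(echo "$DB_SECRET" | jq -r '.port')
DB_USER=$(echo "$DB_SECRET" | jq -r '.username')
DB_PASS=$(echo "$DB_SECRET" | jq -r '.password')

export DATABASE_URL="postgresql://${DB_USER}:${DB_PASS}@${DB_HOST}:${DB_PORT}/shieldai"

# 2. List applied migrations to identify the one to revert
#    (drizzle.__drizzle_migrations is the drizzle-orm default journal table
#    for PostgreSQL; adjust if drizzle.config.ts overrides it)
psql "$DATABASE_URL" -c \
  'SELECT id, hash, created_at FROM drizzle.__drizzle_migrations ORDER BY created_at DESC LIMIT 5;'

# 3. Reverse the schema change with a hand-written SQL file
psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -f "<down_migration>.sql"

# 4. Remove the journal row so the migration is no longer recorded as applied
psql "$DATABASE_URL" -c \
  "DELETE FROM drizzle.__drizzle_migrations WHERE hash = '<migration_hash>';"

# 5. Once the faulty migration file is removed or corrected in the repo,
#    re-apply the remaining migration state
npx drizzle-kit migrate --config=drizzle.config.ts
```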
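
Credentials pulled from Secrets Manager may contain characters such as `@`, `:` or `/` that break a connection URL when interpolated directly. The helper below is a minimal sketch of percent-encoding them first; `urlencode` is a hypothetical function name, not part of the existing scripts.

```bash
#!/usr/bin/env bash
# Sketch: percent-encode a value before embedding it in a connection URL.
urlencode() {
  local input="$1" output="" char i
  local LC_ALL=C   # treat the input as bytes and keep the ranges below ASCII-only
  for (( i = 0; i < ${#input}; i++ )); do
    char="${input:i:1}"
    case "$char" in
      [a-zA-Z0-9.~_-]) output+="$char" ;;                      # unreserved characters
      *) printf -v char '%%%02X' "'$char"; output+="$char" ;;  # everything else
    esac
  done
  printf '%s\n' "$output"
}

# Usage (DB_* variables as read from the secret):
# DATABASE_URL="postgresql://$(urlencode "$DB_USER"):$(urlencode "$DB_PASS")@${DB_HOST}:${DB_PORT}/shieldai"
```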

### 5.2 RDS Point-in-Time Recovery (Disaster)

When the database itself needs recovery (e.g., data corruption, bad migration):

```bash
# 1. Find available recovery window (automated backups: every 24h, 7-14 day retention)
aws rds describe-db-instances \
  --db-instance-identifier "shieldai-production-db" \
  --query 'DBInstances[0].LatestRestorableTime'

# 2. Create restored instance (does not affect primary)
#    Note: this command names the new instance with --target-db-instance-identifier
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier "shieldai-production-db" \
  --target-db-instance-identifier "shieldai-production-db-restored" \
  --restore-time "2026-05-09T08:00:00Z"

# 3. Wait until the restored instance is available
aws rds wait db-instance-available \
  --db-instance-identifier "shieldai-production-db-restored"

# 4. Point ECS services at the restored instance by updating the host
#    in the database secret in Secrets Manager
RESTORED_HOST=$(aws rds describe-db-instances \
  --db-instance-identifier "shieldai-production-db-restored" \
  --query 'DBInstances[0].Endpoint.Address' --output text)
DB_SECRET=$(aws secretsmanager get-secret-value \
  --secret-id "shieldai-production-db-password" \
  --query 'SecretString' --output text)
aws secretsmanager put-secret-value \
  --secret-id "shieldai-production-db-password" \
  --secret-string "$(echo "$DB_SECRET" | jq --arg host "$RESTORED_HOST" '.host = $host')"

# 5. Trigger ECS service redeployment; the replacement tasks pick up the new DB endpoint
./infra/scripts/rollback.sh production all
```

### 5.3 RDS Snapshot Restore

```bash
# 1. List available snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier "shieldai-production-db"

# 2. Restore from specific snapshot
#    (security group ID is read from the Terraform outputs of the infra/ root)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "shieldai-production-db-restored" \
  --db-snapshot-identifier "rds:shieldai-production-db-2026-05-08-03-00" \
  --db-instance-class "db.t3.medium" \
  --vpc-security-group-ids "$(terraform -chdir=infra output -raw vpc_security_group_id)"

# 3. Follow steps 3-5 from Point-in-Time Recovery above
```

---

## 6. Automated Rollback Triggers

### 6.1 CI/CD Health Check Failure

**Trigger:** Post-deploy health check returns non-200 from `/health`

**Pipeline job:** `rollback` in `.github/workflows/deploy.yml`

**Condition:** `if: failure() && needs.health-check.result == 'failure'`

**Action:** Rolls back all four ECS services to previous task definition

**Timeout:** Health check retries for 5 minutes before triggering rollback
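
A retry window like this one can be sketched as a polling loop with a deadline. The function names, the 10-second default interval, and the `PROBE_CMD` override are assumptions made for illustration; the workflow's actual health-check step is not shown in this runbook.

```bash
#!/usr/bin/env bash
# Sketch: poll a health probe until it succeeds or the timeout elapses.
# The 300s default mirrors the 5-minute window; the interval is assumed.
http_probe() {
  [[ "$(curl -s -o /dev/null -w '%{http_code}' "$1")" == "200" ]]
}

wait_for_healthy() {
  local url="$1" timeout="${2:-300}" interval="${3:-10}"
  local deadline=$(( SECONDS + timeout ))
  while (( SECONDS < deadline )); do
    if "${PROBE_CMD:-http_probe}" "$url"; then
      return 0   # healthy: no rollback needed
    fi
    sleep "$interval"
  done
  return 1       # still unhealthy at the deadline: caller triggers rollback
}
```

A CI step could call `wait_for_healthy "$HEALTH_URL" || exit 1`, which makes the job fail and lets the `rollback` job's condition match.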

### 6.2 ECS Container Health Check

Each container has an in-container health check defined in the ECS task definition:

```json
"healthCheck": {
  "command": ["CMD-SHELL", "wget -q --spider http://localhost:{port}/health || exit 1"],
  "interval": 30,
  "timeout": 5,
  "retries": 3,
  "startPeriod": 60
}
```

**Failure consequence:** The container is marked unhealthy after 3 consecutive failed checks (about 90 seconds at a 30-second interval). The ALB independently marks the target unhealthy after 3 failed health checks (also about 90 seconds). ECS then stops the unhealthy task, the ALB drains its target, and the service scheduler starts a replacement.
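
The 90-second figure follows directly from `interval × retries`. The sketch below works through that arithmetic and adds a rough estimate for a container that never becomes healthy after launch, where failures during `startPeriod` are not counted toward `retries`. Both function names are hypothetical, and the results ignore the per-check `timeout`, so treat them as approximations.

```bash
#!/usr/bin/env bash
# Sketch: approximate time until ECS marks a failing container unhealthy.
unhealthy_after() {
  local interval="$1" retries="$2"
  echo $(( interval * retries ))
}

# A container that is broken from launch: the start period elapses first,
# then `retries` counted failures follow.
unhealthy_after_startup() {
  local start_period="$1" interval="$2" retries="$3"
  echo $(( start_period + interval * retries ))
}

unhealthy_after 30 3             # running container starts failing: 90
unhealthy_after_startup 60 30 3  # container never becomes healthy: 150
```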

### 6.3 ALB Target Group Health Check

The ALB performs HTTP health checks against `/health` on each target:

| Parameter | Value |
|-----------|-------|
| Interval | 30s |
| Timeout | 5s |
| Healthy threshold | 3 |
| Unhealthy threshold | 3 |
| Expected code | 200 |

### 6.4 CloudWatch Alarms

The following alarms are configured in `infra/modules/cloudwatch/main.tf`:

| Alarm | Threshold | Action |
|-------|-----------|--------|
| ECS CPU >80% | 80% for 2 periods (10min) | SNS notification |
| ECS Memory >85% | 85% for 2 periods (10min) | SNS notification |
| ALB 5xx >10/min | 10 for 3 periods (3min) | SNS notification |
| RDS CPU >75% | 75% for 2 periods (10min) | SNS notification |
| RDS Free Storage <500MB | 500MB for 2 periods (10min) | SNS notification |

**Alarm escalation path:**
1. CloudWatch alarm fires
2. SNS notification sent to on-call engineer
3. Engineer evaluates: if service is degraded, trigger manual rollback
4. If root cause is deployment-related, run `./infra/scripts/rollback.sh production all`

---

## 7. Blue-Green Deployment Rollback

### 7.1 Architecture

ShieldAI uses ECS services with rolling deployments. Each deployment creates a new task definition revision. The ALB routes traffic to healthy targets only.

**Rollback mechanism:** The service is updated back to its previous task definition revision. There is no `--rollback` flag on `aws ecs update-service`; the earlier revision is selected with `--task-definition`. This is not a true blue-green environment swap, but it behaves like one because:

1. Old task definition (blue) remains registered
2. New task definition (green) is deployed
3. On rollback, ECS reverts to the blue task definition
4. ALB automatically routes to healthy (blue) targets

### 7.2 Blue-Green Rollback Procedure

```bash
# 1. Check current deployment state
aws ecs list-services --cluster shieldai-production
aws ecs describe-services --cluster shieldai-production \
  --services shieldai-production-api \
  --query 'services[0].deployments'

# 2. Identify previous deployment
#    The deployment with status "PRIMARY" is current.
#    Look for "ACTIVE" deployment with older task definition.

# 3. Execute rollback (script handles all services)
./infra/scripts/rollback.sh production all

# 4. Verify rollback
aws ecs describe-services --cluster shieldai-production \
  --services shieldai-production-api \
  --query 'services[0].deployments[?status==`PRIMARY`].taskDefinition'
```

### 7.3 Docker Compose Blue-Green (Local)

For local/staging environments using Docker Compose, implement blue-green via service version pinning:

```bash
# Current deployment uses DOCKER_TAG env var
# Rollback by setting DOCKER_TAG to previous version

# Save current tag (falls back to "latest" when .env.prod has no DOCKER_TAG entry;
# the fallback is applied separately because `cut` succeeds even if `grep` finds nothing)
CURRENT_TAG=$(grep -E '^DOCKER_TAG=' .env.prod 2>/dev/null | cut -d= -f2)
CURRENT_TAG="${CURRENT_TAG:-latest}"
echo "Rolling back from ${CURRENT_TAG}"

# Rollback to previous
export DOCKER_TAG="v1.2.3"
docker compose -f docker-compose.prod.yml up -d

# Verify all services
docker compose -f docker-compose.prod.yml ps
```

---

## 8. Rollback Decision Tree

```
Is the service responding?
├── YES → Is the response correct?
│   ├── YES → Monitor, no action needed
│   └── NO → Is it a data issue?
│       ├── YES → Database Migration Rollback (§5)
│       └── NO → ECS Service Rollback (§3)
└── NO → Is it a single service or all?
    ├── Single → ECS Service Rollback (§3, specific service)
    └── All → Full Environment Rollback
        └── Is DB corrupted?
            ├── YES → RDS Point-in-Time Recovery (§5.2)
            └── NO → ECS Full Rollback + DB Migration Rollback
```

**SLA targets:**
- Single service rollback: **< 5 minutes**
- Full environment rollback: **< 15 minutes**
- Database recovery: **< 30 minutes** (Point-in-Time)
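
For on-call tooling, the tree above can be encoded as a small function. The sketch below is one possible encoding; the function name, the argument order, and the `yes`/`no` and `single`/`all` values are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Sketch: the decision tree as a function.
# Args: responding(yes|no) correct(yes|no) data_issue(yes|no) scope(single|all) db_corrupted(yes|no)
choose_rollback() {
  local responding="$1" correct="$2" data_issue="$3" scope="$4" db_corrupted="$5"
  if [[ "$responding" == "yes" ]]; then
    if [[ "$correct" == "yes" ]]; then
      echo "Monitor, no action needed"
    elif [[ "$data_issue" == "yes" ]]; then
      echo "Database Migration Rollback (§5)"
    else
      echo "ECS Service Rollback (§3)"
    fi
  elif [[ "$scope" == "single" ]]; then
    echo "ECS Service Rollback (§3, specific service)"
  elif [[ "$db_corrupted" == "yes" ]]; then
    echo "RDS Point-in-Time Recovery (§5.2)"
  else
    echo "ECS Full Rollback + DB Migration Rollback"
  fi
}

# Example: all services down, database intact
choose_rollback no no no all no
```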

---

## 9. Post-Rollback Verification

After any rollback, verify the following:

### 9.1 Service Health

```bash
# Check the public health endpoint through the ALB.
# All four services sit behind the same ALB URL, so one request is enough here;
# per-service task counts are checked in 9.2.
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
  "https://shieldai-${ENVIRONMENT}-alb.us-east-1.elb.amazonaws.com/health")
echo "ALB /health: HTTP $HTTP_CODE"
```

### 9.2 ECS Service Status

```bash
# Verify all services are stable
for svc in api darkwatch spamshield voiceprint; do
  RUNNING=$(aws ecs describe-services \
    --cluster "shieldai-${ENVIRONMENT}" \
    --services "shieldai-${ENVIRONMENT}-${svc}" \
    --query 'services[0].runningCount' --output text)
  DESIRED=$(aws ecs describe-services \
    --cluster "shieldai-${ENVIRONMENT}" \
    --services "shieldai-${ENVIRONMENT}-${svc}" \
    --query 'services[0].desiredCount' --output text)
  echo "$svc: $RUNNING/$DESIRED running"
done
```
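
The loop above only prints counts. To turn its output into a pass/fail result for scripting, the lines can be piped through a small checker. This is a sketch: `all_services_stable` is a hypothetical helper, and it assumes the exact `svc: RUNNING/DESIRED running` line format shown above.

```bash
#!/usr/bin/env bash
# Sketch: read "svc: RUNNING/DESIRED running" lines from stdin and return
# non-zero if any service has fewer (or more) running tasks than desired.
all_services_stable() {
  local line counts running desired unstable=0
  while IFS= read -r line; do
    counts="${line#*: }"       # "2/2 running"
    running="${counts%%/*}"    # "2"
    desired="${counts#*/}"     # "2 running"
    desired="${desired%% *}"   # "2"
    if [[ "$running" != "$desired" ]]; then
      echo "UNSTABLE: $line" >&2
      unstable=1
    fi
  done
  return "$unstable"
}
```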
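
### 9.3 Database Connectivity

```bash
# Verify database connectivity from inside a running api task.
# execute-command targets a task (not a service), so resolve one first.
TASK_ARN=$(aws ecs list-tasks \
  --cluster "shieldai-${ENVIRONMENT}" \
  --service-name "shieldai-${ENVIRONMENT}-api" \
  --query 'taskArns[0]' --output text)

# drizzle-kit has no `status` command, so open a TCP connection to the
# database host instead. Assumes DATABASE_URL is set in the container and
# that the container is named "api".
aws ecs execute-command \
  --cluster "shieldai-${ENVIRONMENT}" \
  --task "$TASK_ARN" \
  --container api \
  --interactive \
  --command "node -e \"const u=new URL(process.env.DATABASE_URL);require('net').connect(u.port||5432,u.hostname,()=>{console.log('db reachable');process.exit(0)}).on('error',e=>{console.error(e.message);process.exit(1)})\""
```
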
### 9.4 CloudWatch Verification

1. Navigate to CloudWatch dashboard: `shieldai-${ENVIRONMENT}-dashboard`
2. Verify CPU/Memory utilization is within normal range
3. Verify ALB 5xx errors have returned to baseline
4. Verify no new alarms are in ALARM state

---

## 10. Testing Checklist

### 10.1 ECS Rollback Test

- [ ] Deploy a known-bad image (e.g., image with `/health` returning 500)
- [ ] Verify CI/CD health check fails within 5 minutes
- [ ] Verify `rollback` job triggers automatically
- [ ] Verify all four services revert to previous task definition
- [ ] Verify health check passes post-rollback
- [ ] Verify CloudWatch metrics show recovery

### 10.2 Manual Script Test

- [ ] Run `./infra/scripts/rollback.sh staging api` on staging
- [ ] Verify single service rolls back correctly
- [ ] Run `./infra/scripts/rollback.sh staging all` on staging
- [ ] Verify all services roll back correctly
- [ ] Verify script exits with code 0 on success
- [ ] Verify script exits with code 1 on failure

### 10.3 Docker Compose Rollback Test

- [ ] Deploy v2.0.0 of all services via docker-compose.prod.yml
- [ ] Rollback to v1.0.0 using DOCKER_TAG override
- [ ] Verify all services restart with previous images
- [ ] Verify health endpoints respond correctly

### 10.4 Database Migration Rollback Test

- [ ] Apply a test migration on staging
- [ ] Run migration rollback procedure
- [ ] Verify schema matches pre-migration state
- [ ] Verify application connects and functions correctly

### 10.5 RDS Point-in-Time Recovery Test

- [ ] Create a test RDS instance
- [ ] Insert test data
- [ ] Restore to point before data insertion
- [ ] Verify restored instance has correct data state
- [ ] Clean up test instance

### 10.6 End-to-End Rollback Drills

| Drill | Frequency | Participants |
|-------|-----------|--------------|
| ECS service rollback | Monthly | Senior Engineer |
| Full environment rollback | Quarterly | Full engineering team |
| Database recovery | Quarterly | Senior Engineer + Founding Engineer |
| Blue-green rollback | Quarterly | Full engineering team |

---

## 11. Runbook: Emergency Rollback

### 11.1 Symptoms

- ALB 5xx error rate > 10/minute for 3+ minutes
- CloudWatch alarm: `shieldai-production-alb-5xx` in ALARM state
- Customer-reported service degradation

### 11.2 Immediate Actions (0-5 minutes)

```bash
# 1. Confirm environment and scope
ENVIRONMENT="production"

# 2. Check service status (the AWS CLI takes service names as a space-separated list)
aws ecs describe-services \
  --cluster "shieldai-${ENVIRONMENT}" \
  --services \
    "shieldai-${ENVIRONMENT}-api" \
    "shieldai-${ENVIRONMENT}-darkwatch" \
    "shieldai-${ENVIRONMENT}-spamshield" \
    "shieldai-${ENVIRONMENT}-voiceprint" \
  --query 'services[*].{Name:serviceName,Running:runningCount,Desired:desiredCount,Status:status}'

# 3. Check ALB health
curl -s -o /dev/null -w "%{http_code}" \
  "https://shieldai-${ENVIRONMENT}-alb.us-east-1.elb.amazonaws.com/health"

# 4. Execute rollback
./infra/scripts/rollback.sh "${ENVIRONMENT}" all
```

### 11.3 Verification (5-10 minutes)

```bash
# 1. Wait for services to stabilize (service names are space-separated)
aws ecs wait services-stable \
  --cluster "shieldai-${ENVIRONMENT}" \
  --services \
    "shieldai-${ENVIRONMENT}-api" \
    "shieldai-${ENVIRONMENT}-darkwatch" \
    "shieldai-${ENVIRONMENT}-spamshield" \
    "shieldai-${ENVIRONMENT}-voiceprint"

# 2. Verify health endpoint
curl -sf "https://shieldai-${ENVIRONMENT}-alb.us-east-1.elb.amazonaws.com/health" \
  && echo "Health: OK" || echo "Health: FAIL"

# 3. Check CloudWatch for recovery
#    Navigate to CloudWatch dashboard and verify metrics
```

### 11.4 Communication Template

```
## Rollback Notification

**Environment:** production
**Time:** $(date -u '+%Y-%m-%d %H:%M UTC')
**Trigger:** [ALB 5xx alarm / manual / CI/CD health check]
**Action:** Rolled back all services to previous deployment
**Status:** [In Progress / Verified / Resolved]
**Next steps:** [Post-mortem / monitoring / investigation]
```
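
The `$(date ...)` placeholder in the template is only expanded if the text passes through a shell. A minimal sketch of a helper that renders the notification with a live UTC timestamp is shown below; the function name and its four arguments are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: render the rollback notification with the current UTC time.
# Args: environment, trigger, status, next steps
render_rollback_notification() {
  local environment="$1" trigger="$2" status="$3" next_steps="$4"
  cat <<EOF
## Rollback Notification

**Environment:** ${environment}
**Time:** $(date -u '+%Y-%m-%d %H:%M UTC')
**Trigger:** ${trigger}
**Action:** Rolled back all services to previous deployment
**Status:** ${status}
**Next steps:** ${next_steps}
EOF
}

# Example:
# render_rollback_notification production "ALB 5xx alarm" "In Progress" "monitoring"
```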

### 11.5 Post-Incident

1. Create incident ticket with timeline
2. Document root cause
3. Update runbook if procedure changed
4. Schedule post-mortem within 48 hours
5. Create follow-up issues for preventive measures

---

## Appendix A: Quick Reference

| Resource | Command |
|----------|---------|
| Rollback script | `./infra/scripts/rollback.sh <env> <service\|all>` |
| ECS service status | `aws ecs describe-services --cluster shieldai-<env> --services shieldai-<env>-<svc>` |
| ALB health check | `curl -s -o /dev/null -w "%{http_code}" https://shieldai-<env>-alb.us-east-1.elb.amazonaws.com/health` |
| RDS snapshots | `aws rds describe-db-snapshots --db-instance-identifier shieldai-<env>-db` |
| CloudWatch dashboard | `https://us-east-1.console.aws.amazon.com/cloudwatch/home#dashboards/dashboard/shieldai-<env>-dashboard` |
| ECS task logs | `aws logs filter-log-events --log-group-name /ecs/shieldai-<env>-<svc>` |

## Appendix B: Environment Variables

| Variable | Description | Required |
|----------|-------------|----------|
| `AWS_ACCESS_KEY_ID` | IAM user with ECS, RDS permissions | Yes |
| `AWS_SECRET_ACCESS_KEY` | IAM secret key | Yes |
| `AWS_DEFAULT_REGION` | AWS region (default: us-east-1) | Yes |
| `GITHUB_REPOSITORY_OWNER` | GitHub org/user for container registry | Docker Compose only |
| `DOCKER_TAG` | Container image tag to deploy | Docker Compose only |
| `POSTGRES_PASSWORD` | Database password | Docker Compose only |