feat: establish unified project foundation with root config cleanup

- Archive legacy packages/, services/, server/ directories
- Update pnpm workspace to web + browser-ext
- Simplify root package.json scripts to delegate to web/
- Update turbo.json for new workspace structure
- Remove obsolete root config files (vite, tsconfig, etc.)
- Add .nvmrc, .editorconfig for consistent dev environment
- Update CI workflow to remove references to deleted packages
- Add missing dependencies (@tailwindcss/vite, tailwindcss) to web
- Add test and lint scripts to web package
- Verify pnpm install, build, and dev work correctly
2026-05-25 12:31:43 -04:00
parent 59fcc31483
commit f627033665
500 changed files with 622 additions and 99592 deletions

infra/.gitignore

@@ -1,9 +0,0 @@
.terraform/
*.tfstate
*.tfstate.backup
*.tfvars
.terraform.lock.hcl
override.tf
override.tf.json
*_override.tf
*_override.tf.json


@@ -1,113 +0,0 @@
/infra/
├── main.tf                    # Root module: VPC, ECS, RDS, ElastiCache, S3, Secrets, CloudWatch
├── variables.tf               # Input variables with validation
├── outputs.tf                 # Output values (endpoints, ARNs, URLs)
├── modules/
│   ├── vpc/main.tf            # VPC, subnets, IGW, NAT GW, security groups
│   ├── ecs/main.tf            # ECS cluster, task definitions, services, ALB, auto-scaling
│   ├── rds/main.tf            # RDS PostgreSQL with automated backups
│   ├── elasticache/main.tf    # ElastiCache Redis with replication
│   ├── s3/main.tf             # S3 buckets: state, artifacts, logs
│   ├── secrets/main.tf        # AWS Secrets Manager
│   └── cloudwatch/main.tf     # Dashboards, alarms, notifications
├── environments/
│   ├── staging/main.tf        # Staging environment config
│   └── production/main.tf     # Production environment config
└── scripts/
    ├── rollback.sh            # ECS service rollback (AWS)
    ├── rollback-compose.sh    # Docker Compose rollback (local/staging)
    └── rollback-migration.sh  # Database migration rollback
## Quick Start
### Prerequisites
- Terraform >= 1.5.0
- AWS CLI configured with appropriate credentials
- AWS account with ECS, RDS, ElastiCache permissions
### Initialize
```bash
cd infra/environments/staging
terraform init
terraform plan -var-file=terraform.tfvars.example
terraform apply -var-file=terraform.tfvars.example
```
### Deploy via CI/CD
- Push to `main` → deploys to staging
- Create a release → deploys to production
- Health check failure → automatic rollback
## Architecture
### Networking
- VPC with public/private subnets across multiple AZs
- NAT Gateway for outbound traffic from private subnets
- Security groups: ECS → RDS (5432), ECS → ElastiCache (6379)
### Compute
- ECS Fargate for serverless container orchestration
- Application Load Balancer with health checks
- Auto-scaling: CPU-based scaling (70% target)
- Production: 3 replicas per service, min 2, max 10
### Data
- RDS PostgreSQL 16.2 with Multi-AZ (production)
- Automated daily backups; retention is 14 days in production and 3 days in staging
- ElastiCache Redis 7.0 with replication
- S3 with versioning and lifecycle policies
### Secrets
- AWS Secrets Manager for all credentials
- ECS task execution role with SecretsManagerReadOnly
- DB credentials auto-rotated via RDS integration
### Monitoring
- CloudWatch dashboards: CPU, memory, ALB metrics
- Alarms: CPU >80%, memory >85%, 5xx >10/min, RDS storage <500MB
- Container Insights enabled for ECS
- Logs: 30-day retention (production), 7-day (staging)
### Backup Strategy
- RDS: automated snapshots every 24h; retention is 14 days in production and 3 days in staging
- RDS: Multi-AZ for automatic failover (production)
- ElastiCache: daily snapshots, 1-7 day retention
- S3: versioning enabled, non-current versions expire after 30 days
- Terraform state: S3 with versioning + DynamoDB locking
## Rollback
See **[ROLLBACK.md](./ROLLBACK.md)** for the complete rollback runbook, including:
- ECS service rollback (automated + manual)
- Docker Compose rollback (local / staging)
- Database migration rollback (Drizzle)
- Blue-green deployment rollback
- RDS point-in-time recovery
- Automated rollback triggers and health checks
- Emergency rollback runbook
- Testing checklist
### Quick Reference
```bash
# ECS service rollback (AWS)
./infra/scripts/rollback.sh <environment> <service|all> [--verify]
# Docker Compose rollback (local/staging)
./infra/scripts/rollback-compose.sh <previous_tag>
# Database migration rollback
./infra/scripts/rollback-migration.sh <environment> [--migration <name>]
```
## GitHub Secrets Required
| Secret | Description |
|--------|-------------|
| AWS_ACCESS_KEY_ID | IAM user with ECS, RDS, ElastiCache permissions |
| AWS_SECRET_ACCESS_KEY | IAM secret key |
| HIBP_API_KEY | Have I Been Pwned API key |
| RESEND_API_KEY | Resend email API key |
| SENTRY_DSN | Sentry error tracking DSN |
| DATADOG_API_KEY | Datadog monitoring API key |
| GITHUB_TOKEN | Auto-provided; the workflow must grant it the `packages: write` permission |


@@ -1,611 +0,0 @@
# ShieldAI Rollback Runbook
> **Last updated:** 2026-05-12
> **Owner:** Senior Engineer
> **Parent:** [FRE-4574](/FRE/issues/FRE-4574) ShieldAI Production Infrastructure & CI/CD Pipeline
> **Reviewed by:** Code Reviewer (FRE-4808) on 2026-05-12
---
## Table of Contents
1. [Overview](#1-overview)
2. [Rollback Strategies](#2-rollback-strategies)
3. [ECS Service Rollback (AWS)](#3-ecs-service-rollback-aws)
4. [Docker Compose Rollback (Local / Staging)](#4-docker-compose-rollback-local--staging)
5. [Database Migration Rollback](#5-database-migration-rollback)
6. [Automated Rollback Triggers](#6-automated-rollback-triggers)
7. [Blue-Green Deployment Rollback](#7-blue-green-deployment-rollback)
8. [Rollback Decision Tree](#8-rollback-decision-tree)
9. [Post-Rollback Verification](#9-post-rollback-verification)
10. [Testing Checklist](#10-testing-checklist)
11. [Runbook: Emergency Rollback](#11-runbook-emergency-rollback)
---
## 1. Overview
ShieldAI runs four services (api, darkwatch, spamshield, voiceprint) on AWS ECS Fargate behind an Application Load Balancer. Each service has independent deployment, health checks, and rollback capability.
**Rollback types:**
| Type | Trigger | Scope | Automation |
|------|---------|-------|------------|
| **ECS Service Rollback** | Health check failure, manual | Single or all services | ✅ CI/CD + manual script |
| **Docker Compose Rollback** | Manual (local/staging) | All services | ✅ Scripted |
| **Database Migration Rollback** | Manual | Schema changes | ⚠️ Semi-manual |
| **Blue-Green Rollback** | Manual or automated | Full environment | ✅ CI/CD |
| **RDS Point-in-Time Restore** | Manual (disaster) | Full database | ⚠️ Semi-manual |
---
## 2. Rollback Strategies
### 2.1 ECS Service-Level Rollback
Each ECS service maintains a history of task definitions. Rolling back reverts to the **previous successfully deployed task definition**.
**Prerequisites:**
- AWS CLI configured with credentials for the target environment
- IAM permissions: `ecs:UpdateService`, `ecs:DescribeServices` (the `services-stable` waiter polls `DescribeServices`, so it needs no separate permission)
### 2.2 Blue-Green Rollback
The CI/CD pipeline deploys new images to existing ECS services. If health checks fail after deployment, the `rollback` job in the deploy workflow automatically reverts all four services to their previous task definition revision.
**Pipeline flow:**
```
build-and-push → deploy-ecs → health-check → [PASS: done | FAIL: rollback]
```
### 2.3 Database Migration Rollback
ShieldAI uses Drizzle ORM for database migrations. Each migration is versioned and stored in `src/db/migrations/`. Rollback requires running the previous migration set.
---
## 3. ECS Service Rollback (AWS)
### 3.1 Automated (CI/CD Pipeline)
The deploy workflow (`.github/workflows/deploy.yml`) includes a `rollback` job that triggers on health check failure:
```yaml
rollback:
  needs: [deploy-ecs, health-check]
  if: failure() && needs.health-check.result == 'failure'
  # Rolls back all 4 services to previous task definition
```
**When it runs:**
- Post-deploy health check fails (HTTP 200 not received from `/health`)
- Runs after `deploy-ecs` and `health-check` jobs
- Rolls back all four services: api, darkwatch, spamshield, voiceprint
**How to verify:**
1. Navigate to the GitHub Actions run for the failed deployment
2. Check the `Rollback on Failure` job logs
3. Confirm each service shows "Rolled back" status
### 3.2 Manual Rollback Script
```bash
# Single service
./infra/scripts/rollback.sh production api
# All services
./infra/scripts/rollback.sh production all
# Staging environment
./infra/scripts/rollback.sh staging all
```
**Script behavior:**
1. Iterates over target services (or all if `all` specified)
2. Calls `aws ecs update-service` for each service to redeploy its previous task definition revision
3. Waits for service to stabilize via `aws ecs wait services-stable`
4. Reports success/failure per service
5. Exits with non-zero code if any service fails to stabilize
**Expected output:**
```
Rolling back services in cluster: shieldai-production
Rolling back api...
Waiting for api to stabilize...
api rolled back successfully
Rolling back darkwatch...
Waiting for darkwatch to stabilize...
darkwatch rolled back successfully
...
Rollback complete for api darkwatch spamshield voiceprint
```
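The script source is not included in this runbook. As a rough sketch of the behavior described in this section, the core logic might look like the following. It assumes services are named `<cluster>-<service>` (as in the manual CLI fallback), that the revert is done by passing the previous revision explicitly via `--task-definition`, and that the previous revision is the current revision minus one. The function names `expand_services`, `previous_task_def`, and `rollback_service` are hypothetical.

```bash
#!/bin/sh
# Hypothetical sketch of rollback.sh; not the actual script.
ALL_SERVICES="api darkwatch spamshield voiceprint"

# Expand the "all" keyword into the four service names.
expand_services() {
  if [ "$1" = "all" ]; then echo "$ALL_SERVICES"; else echo "$1"; fi
}

# Derive "family:(revision - 1)" from a task definition ARN or "family:revision".
# Assumption: that revision is still registered and was the last good deploy.
previous_task_def() {
  td="${1##*/}"        # drop everything up to the last "/"
  family="${td%:*}"    # text before the final ":"
  revision="${td##*:}" # number after the final ":"
  echo "${family}:$((revision - 1))"
}

# Revert one service and wait for it to stabilize.
rollback_service() {
  cluster="$1"; service="$2"
  current=$(aws ecs describe-services --cluster "$cluster" \
    --services "${cluster}-${service}" \
    --query 'services[0].taskDefinition' --output text) || return 1
  aws ecs update-service --cluster "$cluster" \
    --service "${cluster}-${service}" \
    --task-definition "$(previous_task_def "$current")" >/dev/null || return 1
  aws ecs wait services-stable --cluster "$cluster" \
    --services "${cluster}-${service}"
}

main() {
  cluster="shieldai-$1"; failed=0
  for svc in $(expand_services "$2"); do
    echo "Rolling back $svc..."
    if rollback_service "$cluster" "$svc"; then
      echo "$svc rolled back successfully"
    else
      echo "$svc failed to stabilize" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Run only when an environment and a service (or "all") are given.
if [ "$#" -ge 2 ]; then main "$@"; fi
```

Keeping the name and revision handling in small pure functions makes them easy to check without touching AWS; only `rollback_service` needs credentials.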
### 3.3 Manual CLI Rollback (Fallback)
If the script is unavailable, roll back individual services:
```bash
CLUSTER="shieldai-production"
SERVICE="api"
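# Roll back to the previous task definition revision.
# `aws ecs update-service` has no `--rollback` flag, so pass the revision explicitly.
# Assumes the previous revision (current minus one) is the last known good deploy.
CURRENT_TD=$(aws ecs describe-services \
  --cluster "$CLUSTER" \
  --services "${CLUSTER}-${SERVICE}" \
  --query 'services[0].taskDefinition' --output text)
FAMILY_REV="${CURRENT_TD##*/}"
aws ecs update-service \
  --cluster "$CLUSTER" \
  --service "${CLUSTER}-${SERVICE}" \
  --task-definition "${FAMILY_REV%:*}:$(( ${FAMILY_REV##*:} - 1 ))" \
  --no-cli-auto-prompt
# Wait for stabilization
aws ecs wait services-stable \
  --cluster "$CLUSTER" \
  --services "${CLUSTER}-${SERVICE}"
# Verify health
curl -s -o /dev/null -w "%{http_code}" \
  "https://shieldai-production-alb.us-east-1.elb.amazonaws.com/health"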
```
---
## 4. Docker Compose Rollback (Local / Staging)
### 4.1 Production Compose Rollback
The `docker-compose.prod.yml` file deploys all services with tagged images. To roll back:
```bash
# 1. Identify the previous working tag
# Check GitHub releases or git tags for the last known good version
PREVIOUS_TAG="v1.2.3"
# 2. Stop current services
docker compose -f docker-compose.prod.yml down
# 3. Pull previous images
docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-api:${PREVIOUS_TAG}
docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-darkwatch:${PREVIOUS_TAG}
docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-spamshield:${PREVIOUS_TAG}
docker pull ghcr.io/${GITHUB_REPOSITORY_OWNER}/shieldai-voiceprint:${PREVIOUS_TAG}
# 4. Override tag in compose
DOCKER_TAG=${PREVIOUS_TAG} docker compose -f docker-compose.prod.yml up -d
# 5. Verify health
for svc in api darkwatch spamshield voiceprint; do
  PORT=$(case $svc in
    api) echo 3000;; darkwatch) echo 3001;;
    spamshield) echo 3002;; voiceprint) echo 3003;;
  esac)
  curl -sf "http://localhost:${PORT}/health" && echo "$svc: OK" || echo "$svc: FAIL"
done
```
### 4.2 Local Dev Rollback
```bash
# Stop and remove containers
docker compose down
# Rebuild from previous commit
git checkout <previous-commit>
docker compose up -d --build
```
---
## 5. Database Migration Rollback
### 5.1 Drizzle Migration Rollback
ShieldAI uses Drizzle ORM with the PostgreSQL dialect (the database is RDS PostgreSQL). Migrations are stored in `src/db/migrations/`.
```bash
# 1. Get database credentials from AWS Secrets Manager
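# --output text returns the raw JSON secret; --output json would wrap it in a quoted string that jq cannot index
DB_SECRET=$(aws secretsmanager get-secret-value \
  --secret-id "shieldai-${ENVIRONMENT}-db-password" \
  --query 'SecretString' --output text)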
DB_HOST=$(echo "$DB_SECRET" | jq -r '.host')
DB_PORT=$(echo "$DB_SECRET" | jq -r '.port')
DB_USER=$(echo "$DB_SECRET" | jq -r '.username')
DB_PASS=$(echo "$DB_SECRET" | jq -r '.password')
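export DATABASE_URL="postgresql://${DB_USER}:${DB_PASS}@${DB_HOST}:${DB_PORT}/shieldai"
# 2. List applied migrations to identify the one to revert
#    (assumes Drizzle's default PostgreSQL migrations table, drizzle.__drizzle_migrations)
psql "$DATABASE_URL" -c 'SELECT id, hash, created_at FROM drizzle.__drizzle_migrations ORDER BY created_at;'
# 3. Revert the problematic migration. drizzle-kit has no down-migration or resolve
#    command, so apply a hand-written reverse SQL script (placeholder path) and
#    delete that migration's row from the table above.
psql "$DATABASE_URL" -f "<revert_migration>.sql"
# 4. With the problematic migration file removed (or the previous commit checked out),
#    apply the remaining migration set; nothing should be pending
npx drizzle-kit migrate --config=drizzle.config.ts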
```
### 5.2 RDS Point-in-Time Recovery (Disaster)
When the database itself needs recovery (e.g., data corruption, bad migration):
```bash
# 1. Find the latest restorable time (production automated backups: daily, 14-day retention)
aws rds describe-db-instances \
  --db-instance-identifier "shieldai-production-db" \
  --query 'DBInstances[0].LatestRestorableTime'
# 2. Create restored instance (does not affect primary)
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier "shieldai-production-db" \
  --target-db-instance-identifier "shieldai-production-db-restored" \
  --restore-time "2026-05-09T08:00:00Z"
# 3. Wait until the restored instance is available
aws rds wait db-instance-available \
  --db-instance-identifier "shieldai-production-db-restored"
# 4. Point the database secret in Secrets Manager at the restored instance
RESTORED_HOST=$(aws rds describe-db-instances \
  --db-instance-identifier "shieldai-production-db-restored" \
  --query 'DBInstances[0].Endpoint.Address' --output text)
DB_SECRET=$(aws secretsmanager get-secret-value \
  --secret-id "shieldai-production-db-password" \
  --query 'SecretString' --output text)
aws secretsmanager put-secret-value \
  --secret-id "shieldai-production-db-password" \
  --secret-string "$(echo "$DB_SECRET" | jq --arg host "$RESTORED_HOST" '.host = $host')"
# 5. Roll back the ECS services; the replacement tasks read the new DB endpoint on start
./infra/scripts/rollback.sh production all
```
### 5.3 RDS Snapshot Restore
```bash
# 1. List available snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier "shieldai-production-db"
# 2. Restore from specific snapshot
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "shieldai-production-db-restored" \
  --db-snapshot-identifier "rds:shieldai-production-db-2026-05-08-03-00" \
  --db-instance-class "db.t3.medium" \
  --vpc-security-group-ids "$(terraform -chdir=infra output -raw vpc_security_group_id)"
# 3. Follow steps 3-5 from Point-in-Time Recovery above
```
---
## 6. Automated Rollback Triggers
### 6.1 CI/CD Health Check Failure
**Trigger:** Post-deploy health check returns non-200 from `/health`
**Pipeline job:** `rollback` in `.github/workflows/deploy.yml`
**Condition:** `if: failure() && needs.health-check.result == 'failure'`
**Action:** Rolls back all four ECS services to previous task definition
**Timeout:** Health check retries for 5 minutes before triggering rollback
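The workflow's health check step is not shown here. A minimal sketch of a retry loop with a 5-minute budget, assuming a 10-second polling interval and a hypothetical helper name `wait_for_health`, could look like this:

```bash
# Hypothetical helper: poll a URL until it returns HTTP 200 or the timeout elapses.
# Usage: wait_for_health <url> [timeout_seconds] [interval_seconds]
wait_for_health() {
  wfh_url="$1"
  wfh_timeout="${2:-300}"  # 5 minutes, matching the timeout stated above
  wfh_interval="${3:-10}"  # assumed polling interval
  wfh_deadline=$(( $(date +%s) + wfh_timeout ))
  while :; do
    # curl prints 000 when the connection itself fails
    wfh_code=$(curl -s -o /dev/null -w "%{http_code}" "$wfh_url" || true)
    [ "$wfh_code" = "200" ] && return 0
    [ "$(date +%s)" -ge "$wfh_deadline" ] && return 1
    sleep "$wfh_interval"
  done
}
```

A non-zero return from `wait_for_health` would fail the `health-check` job, which is the condition the `rollback` job keys on.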
### 6.2 ECS Container Health Check
Each container has an in-container health check defined in the ECS task definition:
```json
"healthCheck": {
  "command": ["CMD-SHELL", "wget -q --spider http://localhost:{port}/health || exit 1"],
  "interval": 30,
  "timeout": 5,
  "retries": 3,
  "startPeriod": 60
}
```
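**Failure consequence:** A container is marked unhealthy after 3 consecutive failed checks (about 90 seconds at a 30-second interval, once the 60-second start period has passed). The ALB independently marks the target unhealthy after 3 failed health checks (about 90 seconds). ECS then stops the unhealthy task, the ALB drains its target, and the service scheduler starts a replacement.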
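The 90-second figure is simply the check interval multiplied by the number of consecutive failures. As a small worked sketch (a simplification that ignores the per-check timeout; `seconds_to_unhealthy` is a hypothetical helper name):

```bash
# Seconds until a container is declared unhealthy.
# Usage: seconds_to_unhealthy <interval> <retries> [start_period]
# Failures during the start period are not counted, so it is added on top.
seconds_to_unhealthy() {
  echo $(( ${3:-0} + $1 * $2 ))
}

seconds_to_unhealthy 30 3      # running container: 30s x 3 failures
seconds_to_unhealthy 30 3 60   # freshly started container, including startPeriod
```

The first call gives the steady-state case quoted above; the second shows why a bad deployment can take noticeably longer to surface than a failure in a long-running task.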
### 6.3 ALB Target Group Health Check
The ALB performs HTTP health checks against `/health` on each target:
| Parameter | Value |
|-----------|-------|
| Interval | 30s |
| Timeout | 5s |
| Healthy threshold | 3 |
| Unhealthy threshold | 3 |
| Expected code | 200 |
### 6.4 CloudWatch Alarms
The following alarms are configured in `infra/modules/cloudwatch/main.tf`:
| Alarm | Threshold | Action |
|-------|-----------|--------|
| ECS CPU >80% | 80% for 2 periods (10min) | SNS notification |
| ECS Memory >85% | 85% for 2 periods (10min) | SNS notification |
| ALB 5xx >10/min | 10 for 3 periods (3min) | SNS notification |
| RDS CPU >75% | 75% for 2 periods (10min) | SNS notification |
| RDS Free Storage <500MB | 500MB for 2 periods (10min) | SNS notification |
**Alarm escalation path:**
1. CloudWatch alarm fires
2. SNS notification sent to on-call engineer
3. Engineer evaluates: if service is degraded, trigger manual rollback
4. If root cause is deployment-related, run `./infra/scripts/rollback.sh production all`
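To make the threshold column in this section concrete, the sketch below evaluates a series of datapoints the way an "N periods" alarm is read here: the alarm fires only when each of the last N datapoints exceeds the threshold. This assumes every evaluation period must breach, and `alarm_breached` is a hypothetical helper for reasoning about the table, not part of the Terraform module.

```bash
# Returns 0 (alarm) if the last N datapoints all exceed the threshold.
# Usage: alarm_breached <threshold> <periods> <datapoint>...
# Integer datapoints only.
alarm_breached() {
  ab_threshold="$1"; ab_periods="$2"; shift 2
  [ "$#" -ge "$ab_periods" ] || return 1   # not enough data yet
  # Drop older datapoints so only the last N remain
  while [ "$#" -gt "$ab_periods" ]; do shift; done
  for ab_value in "$@"; do
    [ "$ab_value" -gt "$ab_threshold" ] || return 1
  done
  return 0
}

# ALB 5xx: threshold 10 per minute, 3 periods
if alarm_breached 10 3 2 12 15 11; then echo "ALARM"; else echo "OK"; fi
```

A single spike followed by recovery (for example `12 15 9`) does not trip the alarm, which is why the escalation path starts only after sustained degradation.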
---
## 7. Blue-Green Deployment Rollback
### 7.1 Architecture
ShieldAI uses ECS services with rolling deployments. Each deployment creates a new task definition revision. The ALB routes traffic to healthy targets only.
**Rollback mechanism:** The service is updated back to its previous task definition revision. With rolling deployments this behaves much like a blue-green swap because:
1. Old task definition (blue) remains registered
2. New task definition (green) is deployed
3. On rollback, ECS reverts to blue task definition
4. ALB automatically routes to healthy (blue) targets
### 7.2 Blue-Green Rollback Procedure
```bash
# 1. Check current deployment state
aws ecs list-services --cluster shieldai-production
aws ecs describe-services --cluster shieldai-production \
--services shieldai-production-api \
--query 'services[0].deployments'
# 2. Identify previous deployment
# The deployment with status "PRIMARY" is current.
# Look for "ACTIVE" deployment with older task definition.
# 3. Execute rollback (script handles all services)
./infra/scripts/rollback.sh production all
# 4. Verify rollback
aws ecs describe-services --cluster shieldai-production \
--services shieldai-production-api \
--query 'services[0].deployments[?status==`PRIMARY`].taskDefinition'
```
### 7.3 Docker Compose Blue-Green (Local)
For local/staging environments using Docker Compose, implement blue-green via service version pinning:
```bash
# Current deployment uses DOCKER_TAG env var
# Rollback by setting DOCKER_TAG to previous version
# Save current tag
CURRENT_TAG=$(grep -m1 '^DOCKER_TAG=' .env.prod 2>/dev/null | cut -d= -f2)
CURRENT_TAG=${CURRENT_TAG:-latest}  # fall back when the file or key is missing
# Rollback to previous
export DOCKER_TAG="v1.2.3"
docker compose -f docker-compose.prod.yml up -d
# Verify all services
docker compose -f docker-compose.prod.yml ps
```
---
## 8. Rollback Decision Tree
```
Is the service responding?
├── YES → Is the response correct?
│   ├── YES → Monitor, no action needed
│   └── NO → Is it a data issue?
│       ├── YES → Database Migration Rollback (§5)
│       └── NO → ECS Service Rollback (§3)
└── NO → Is it a single service or all?
    ├── Single → ECS Service Rollback (§3, specific service)
    └── All → Full Environment Rollback
        └── Is DB corrupted?
            ├── YES → RDS Point-in-Time Recovery (§5.2)
            └── NO → ECS Full Rollback + DB Migration Rollback
```
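The same tree can be written as a small function, which may be handy in an on-call helper script. This is only an illustrative encoding; the function name `rollback_action` and the output labels are hypothetical.

```bash
# Usage: rollback_action <responding> <correct> <data_issue> <scope> <db_corrupted>
# Yes/no arguments take "yes" or "no"; scope is "single" or "all".
rollback_action() {
  if [ "$1" = "yes" ]; then
    # Service responds: decide based on response correctness
    if [ "$2" = "yes" ]; then echo "monitor"
    elif [ "$3" = "yes" ]; then echo "db-migration-rollback"
    else echo "ecs-service-rollback"
    fi
  elif [ "$4" = "single" ]; then echo "ecs-service-rollback"
  elif [ "$5" = "yes" ]; then echo "rds-point-in-time-recovery"
  else echo "ecs-full-rollback+db-migration-rollback"
  fi
}

# Example: nothing responds, all services affected, database intact
rollback_action no no no all no
```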
**SLA targets:**
- Single service rollback: **< 5 minutes**
- Full environment rollback: **< 15 minutes**
- Database recovery: **< 30 minutes** (Point-in-Time)
---
## 9. Post-Rollback Verification
After any rollback, verify the following:
### 9.1 Service Health
```bash
# Check the ALB health endpoint (all services share this URL, so one request is enough)
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
  "https://shieldai-${ENVIRONMENT}-alb.us-east-1.elb.amazonaws.com/health")
echo "ALB /health: HTTP $HTTP_CODE"
```
### 9.2 ECS Service Status
```bash
# Verify all services are stable
for svc in api darkwatch spamshield voiceprint; do
  RUNNING=$(aws ecs describe-services \
    --cluster "shieldai-${ENVIRONMENT}" \
    --services "shieldai-${ENVIRONMENT}-${svc}" \
    --query 'services[0].runningCount' --output text)
  DESIRED=$(aws ecs describe-services \
    --cluster "shieldai-${ENVIRONMENT}" \
    --services "shieldai-${ENVIRONMENT}-${svc}" \
    --query 'services[0].desiredCount' --output text)
  echo "$svc: $RUNNING/$DESIRED running"
done
```
### 9.3 Database Connectivity
```bash
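# Verify database connection (ECS Exec targets a task, not a service)
TASK_ARN=$(aws ecs list-tasks \
  --cluster "shieldai-${ENVIRONMENT}" \
  --service-name "shieldai-${ENVIRONMENT}-api" \
  --query 'taskArns[0]' --output text)
aws ecs execute-command \
  --cluster "shieldai-${ENVIRONMENT}" \
  --task "$TASK_ARN" \
  --command "npx drizzle-kit status" \
  --interactive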
```
### 9.4 CloudWatch Verification
1. Navigate to CloudWatch dashboard: `shieldai-${ENVIRONMENT}-dashboard`
2. Verify CPU/Memory utilization is within normal range
3. Verify ALB 5xx errors have returned to baseline
4. Verify no new alarms are in ALARM state
---
## 10. Testing Checklist
### 10.1 ECS Rollback Test
- [ ] Deploy a known-bad image (e.g., image with `/health` returning 500)
- [ ] Verify CI/CD health check fails within 5 minutes
- [ ] Verify `rollback` job triggers automatically
- [ ] Verify all four services revert to previous task definition
- [ ] Verify health check passes post-rollback
- [ ] Verify CloudWatch metrics show recovery
### 10.2 Manual Script Test
- [ ] Run `./infra/scripts/rollback.sh staging api` on staging
- [ ] Verify single service rolls back correctly
- [ ] Run `./infra/scripts/rollback.sh staging all` on staging
- [ ] Verify all services roll back correctly
- [ ] Verify script exits with code 0 on success
- [ ] Verify script exits with code 1 on failure
### 10.3 Docker Compose Rollback Test
- [ ] Deploy v2.0.0 of all services via docker-compose.prod.yml
- [ ] Rollback to v1.0.0 using DOCKER_TAG override
- [ ] Verify all services restart with previous images
- [ ] Verify health endpoints respond correctly
### 10.4 Database Migration Rollback Test
- [ ] Apply a test migration on staging
- [ ] Run migration rollback procedure
- [ ] Verify schema matches pre-migration state
- [ ] Verify application connects and functions correctly
### 10.5 RDS Point-in-Time Recovery Test
- [ ] Create a test RDS instance
- [ ] Insert test data
- [ ] Restore to point before data insertion
- [ ] Verify restored instance has correct data state
- [ ] Clean up test instance
### 10.6 End-to-End Rollback Drills
| Drill | Frequency | Participants |
|-------|-----------|--------------|
| ECS service rollback | Monthly | Senior Engineer |
| Full environment rollback | Quarterly | Full engineering team |
| Database recovery | Quarterly | Senior Engineer + Founding Engineer |
| Blue-green rollback | Quarterly | Full engineering team |
---
## 11. Runbook: Emergency Rollback
### 11.1 Symptoms
- ALB 5xx error rate > 10/minute for 3+ minutes
- CloudWatch alarm: `shieldai-production-alb-5xx` in ALARM state
- Customer-reported service degradation
### 11.2 Immediate Actions (0-5 minutes)
```bash
# 1. Confirm environment and scope
ENVIRONMENT="production"
# 2. Check service status
aws ecs describe-services \
  --cluster "shieldai-${ENVIRONMENT}" \
  --services \
    "shieldai-${ENVIRONMENT}-api" \
    "shieldai-${ENVIRONMENT}-darkwatch" \
    "shieldai-${ENVIRONMENT}-spamshield" \
    "shieldai-${ENVIRONMENT}-voiceprint" \
  --query 'services[*].{Name:serviceName,Running:runningCount,Desired:desiredCount,Status:status}'
# 3. Check ALB health
curl -s -o /dev/null -w "%{http_code}" \
"https://shieldai-${ENVIRONMENT}-alb.us-east-1.elb.amazonaws.com/health"
# 4. Execute rollback
./infra/scripts/rollback.sh ${ENVIRONMENT} all
```
### 11.3 Verification (5-10 minutes)
```bash
# 1. Wait for services to stabilize
aws ecs wait services-stable \
  --cluster "shieldai-${ENVIRONMENT}" \
  --services \
    "shieldai-${ENVIRONMENT}-api" \
    "shieldai-${ENVIRONMENT}-darkwatch" \
    "shieldai-${ENVIRONMENT}-spamshield" \
    "shieldai-${ENVIRONMENT}-voiceprint"
# 2. Verify health endpoint
curl -sf "https://shieldai-${ENVIRONMENT}-alb.us-east-1.elb.amazonaws.com/health" \
&& echo "Health: OK" || echo "Health: FAIL"
# 3. Check CloudWatch for recovery
# Navigate to CloudWatch dashboard and verify metrics
```
### 11.4 Communication Template
```
## Rollback Notification
**Environment:** production
**Time:** $(date -u '+%Y-%m-%d %H:%M UTC')
**Trigger:** [ALB 5xx alarm / manual / CI/CD health check]
**Action:** Rolled back all services to previous deployment
**Status:** [In Progress / Verified / Resolved]
**Next steps:** [Post-mortem / monitoring / investigation]
```
### 11.5 Post-Incident
1. Create incident ticket with timeline
2. Document root cause
3. Update runbook if procedure changed
4. Schedule post-mortem within 48 hours
5. Create follow-up issues for preventive measures
---
## Appendix A: Quick Reference
| Resource | Command |
|----------|---------|
| Rollback script | `./infra/scripts/rollback.sh <env> <service\|all>` |
| ECS service status | `aws ecs describe-services --cluster shieldai-<env> --services shieldai-<env>-<svc>` |
| ALB health check | `curl -s -o /dev/null -w "%{http_code}" https://shieldai-<env>-alb.us-east-1.elb.amazonaws.com/health` |
| RDS snapshots | `aws rds describe-db-snapshots --db-instance-identifier shieldai-<env>-db` |
| CloudWatch dashboard | `https://us-east-1.console.aws.amazon.com/cloudwatch/home#dashboards/dashboard/shieldai-<env>-dashboard` |
| ECS task logs | `aws logs filter-log-events --log-group-name /ecs/shieldai-<env>-<svc>` |
## Appendix B: Environment Variables
| Variable | Description | Required |
|----------|-------------|----------|
| `AWS_ACCESS_KEY_ID` | IAM user with ECS, RDS permissions | Yes |
| `AWS_SECRET_ACCESS_KEY` | IAM secret key | Yes |
| `AWS_DEFAULT_REGION` | AWS region (default: us-east-1) | Yes |
| `GITHUB_REPOSITORY_OWNER` | GitHub org/user for container registry | Docker Compose only |
| `DOCKER_TAG` | Container image tag to deploy | Docker Compose only |
| `POSTGRES_PASSWORD` | Database password | Docker Compose only |


@@ -1,57 +0,0 @@
terraform {
  backend "s3" {
    bucket         = "shieldai-production-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "shieldai-terraform-locks"
  }
}

module "shieldai" {
  source                = "../.."
  environment           = "production"
  aws_region            = "us-east-1"
  project_name          = "shieldai"
  vpc_cidr              = "10.1.0.0/16"
  az_count              = 3
  db_instance_class     = "db.r6g.large"
  db_multi_az           = true
  db_backup_retention   = 14
  elasticache_node_type = "cache.r6g.large"
  elasticache_num_nodes = 3
  secrets = {
    HIBP_API_KEY    = var.hibp_api_key
    RESEND_API_KEY  = var.resend_api_key
    SENTRY_DSN      = var.sentry_dsn
    DATADOG_API_KEY = var.datadog_api_key
  }
}

variable "hibp_api_key" {
  description = "Have I Been Pwned API key"
  type        = string
  sensitive   = true
}

variable "resend_api_key" {
  description = "Resend API key"
  type        = string
  sensitive   = true
}

variable "sentry_dsn" {
  description = "Sentry DSN"
  type        = string
  sensitive   = true
}

variable "datadog_api_key" {
  description = "Datadog API key"
  type        = string
  sensitive   = true
}


@@ -1,4 +0,0 @@
hibp_api_key = "YOUR_HIBP_API_KEY"
resend_api_key = "YOUR_RESEND_API_KEY"
sentry_dsn = "YOUR_SENTRY_DSN"
datadog_api_key = "YOUR_DATADOG_API_KEY"


@@ -1,57 +0,0 @@
terraform {
  backend "s3" {
    bucket         = "shieldai-staging-terraform-state"
    key            = "staging/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "shieldai-terraform-locks"
  }
}

module "shieldai" {
  source                = "../.."
  environment           = "staging"
  aws_region            = "us-east-1"
  project_name          = "shieldai"
  vpc_cidr              = "10.0.0.0/16"
  az_count              = 2
  db_instance_class     = "db.t3.medium"
  db_multi_az           = false
  db_backup_retention   = 3
  elasticache_node_type = "cache.t3.small"
  elasticache_num_nodes = 1
  secrets = {
    HIBP_API_KEY    = var.hibp_api_key
    RESEND_API_KEY  = var.resend_api_key
    SENTRY_DSN      = var.sentry_dsn
    DATADOG_API_KEY = var.datadog_api_key
  }
}

variable "hibp_api_key" {
  description = "Have I Been Pwned API key"
  type        = string
  sensitive   = true
}

variable "resend_api_key" {
  description = "Resend API key"
  type        = string
  sensitive   = true
}

variable "sentry_dsn" {
  description = "Sentry DSN"
  type        = string
  sensitive   = true
}

variable "datadog_api_key" {
  description = "Datadog API key"
  type        = string
  sensitive   = true
}


@@ -1,4 +0,0 @@
hibp_api_key = "YOUR_HIBP_API_KEY"
resend_api_key = "YOUR_RESEND_API_KEY"
sentry_dsn = "YOUR_SENTRY_DSN"
datadog_api_key = "YOUR_DATADOG_API_KEY"


@@ -1,61 +0,0 @@
# ShieldAI Load Tests
k6 load testing suite for ShieldAI services.
## Prerequisites
- k6 v0.45+ installed
- Target services running on staging environment
- Authentication tokens for API access
## Running Tests
### Local Execution
```bash
# Run against local development environment
k6 run --env BASE_URL=http://localhost:3000 --env AUTH_TOKEN=dev-token src/darkwatch.js
# Run with results output
k6 run --out json=results.json src/darkwatch.js
```
### CI/CD Execution
```bash
# Run on staging environment
k6 run --env BASE_URL=https://staging-api.freno.me --env AUTH_TOKEN=$STAGING_AUTH_TOKEN src/darkwatch.js
```
## Test Configuration
Each test script includes:
- **Stages**: Ramp-up, sustained load, ramp-down
- **Thresholds**: P99 latency and error rate limits
- **Metrics**: Custom metrics for error tracking
### Current Thresholds
| Service | P99 Latency | Error Rate |
|---------|-------------|------------|
| Darkwatch | < 200ms | < 1% |
## Metrics Collection
Run with output options:
```bash
# JSON output for analysis
k6 run --out json=darkwatch-results.json src/darkwatch.js
# InfluxDB for visualization
k6 run --out influxdb=http://influxdb:8086/k6 src/darkwatch.js
```
## Next Steps
1. Create load test scripts for Spamshield and Voiceprint
2. Integrate with GitHub Actions CI pipeline
3. Set up metrics visualization dashboard
4. Configure alerting on threshold breaches


@@ -1,99 +0,0 @@
import http from 'k6/http';
import { check, group } from 'k6';
import { Rate } from 'k6/metrics';

// Custom metric backing the `errors` threshold below
const errorRate = new Rate('errors');

// Test configuration
export const options = {
  stages: [
    { duration: '30s', target: 100 }, // Ramp up to 100 virtual users
    { duration: '2m', target: 500 },  // Ramp to 500 virtual users
    { duration: '3m', target: 500 },  // Stay at 500 virtual users for 3 minutes
    { duration: '30s', target: 0 },   // Ramp down to 0
  ],
  thresholds: {
    http_req_duration: ['p(99)<200'], // P99 latency < 200ms
    errors: ['rate<0.01'],            // Error rate < 1%
  },
};

const BASE_URL = __ENV.BASE_URL || 'http://localhost:3000';

// Run the checks and record any failure in the `errors` metric
function track(res, checks) {
  errorRate.add(!check(res, checks));
}

export default function () {
  group('Watchlist Operations', function () {
    const authHeaders = { 'Authorization': `Bearer ${getAuthToken()}` };

    // GET /watchlist
    const watchlistRes = http.get(`${BASE_URL}/watchlist`, { headers: authHeaders });
    track(watchlistRes, {
      'watchlist GET status is 200': (r) => r.status === 200,
      'watchlist GET duration < 100ms': (r) => r.timings.duration < 100,
    });

    // POST /watchlist (VU id plus timestamp keeps the address unique and valid)
    const newItemRes = http.post(
      `${BASE_URL}/watchlist`,
      JSON.stringify({ type: 'email', value: `test${__VU}-${Date.now()}@example.com` }),
      {
        headers: {
          'Authorization': `Bearer ${getAuthToken()}`,
          'Content-Type': 'application/json',
        },
      }
    );
    track(newItemRes, {
      'watchlist POST status is 201': (r) => r.status === 201,
      'watchlist POST duration < 200ms': (r) => r.timings.duration < 200,
    });

    // POST /scan
    const scanRes = http.post(`${BASE_URL}/scan`, {}, { headers: authHeaders });
    track(scanRes, {
      'scan POST status is 200': (r) => r.status === 200,
      'scan POST duration < 150ms': (r) => r.timings.duration < 150,
    });

    // GET /scan/schedule
    const scheduleRes = http.get(`${BASE_URL}/scan/schedule`, { headers: authHeaders });
    track(scheduleRes, {
      'schedule GET status is 200': (r) => r.status === 200,
      'schedule GET duration < 100ms': (r) => r.timings.duration < 100,
    });

    // GET /exposures
    const exposuresRes = http.get(`${BASE_URL}/exposures`, { headers: authHeaders });
    track(exposuresRes, {
      'exposures GET status is 200': (r) => r.status === 200,
      'exposures GET duration < 150ms': (r) => r.timings.duration < 150,
    });

    // GET /alerts
    const alertsRes = http.get(`${BASE_URL}/alerts`, { headers: authHeaders });
    track(alertsRes, {
      'alerts GET status is 200': (r) => r.status === 200,
      'alerts GET duration < 150ms': (r) => r.timings.duration < 150,
    });
  });
}

// Helper function to get auth token (replace with actual token retrieval)
function getAuthToken() {
  return __ENV.AUTH_TOKEN || 'test-token';
}


@@ -1,113 +0,0 @@
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.30"
    }
  }
  backend "s3" {
    bucket         = "shieldai-terraform-state"
    key            = "global/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "shieldai-terraform-locks"
  }
}

provider "aws" {
  region = var.aws_region
  default_tags {
    tags = {
      Project     = "ShieldAI"
      ManagedBy   = "terraform"
      Environment = var.environment
    }
  }
}

module "vpc" {
  source       = "./modules/vpc"
  environment  = var.environment
  vpc_cidr     = var.vpc_cidr
  az_count     = var.az_count
  project_name = var.project_name
  kms_key_arn  = module.ecs.kms_key_arn
}

module "ecs" {
  source                = "./modules/ecs"
  environment           = var.environment
  cluster_name          = "${var.project_name}-${var.environment}"
  vpc_id                = module.vpc.vpc_id
  subnet_ids            = module.vpc.private_subnet_ids
  public_subnet_ids     = module.vpc.public_subnet_ids
  security_group_ids    = [module.vpc.ecs_security_group_id]
  alb_security_group_id = module.vpc.alb_security_group_id
  services              = var.services
  container_images      = var.container_images
  secrets_arn           = module.secrets.secrets_manager_arn
  cache_cluster_arn     = module.elasticache.replication_group_arn
  domain_name           = var.domain_name
}

module "rds" {
  source            = "./modules/rds"
  environment       = var.environment
  vpc_id            = module.vpc.vpc_id
  subnet_ids        = module.vpc.private_subnet_ids
  security_group_id = module.vpc.rds_security_group_id
  db_name           = var.db_name
  db_instance_class = var.db_instance_class
  multi_az          = var.db_multi_az
  backup_retention  = var.db_backup_retention
  project_name      = var.project_name
}

module "elasticache" {
  source            = "./modules/elasticache"
  environment       = var.environment
  vpc_id            = module.vpc.vpc_id
  subnet_ids        = module.vpc.private_subnet_ids
  security_group_id = module.vpc.elasticache_security_group_id
  node_type         = var.elasticache_node_type
  num_nodes         = var.elasticache_num_nodes
  project_name      = var.project_name
}

module "s3" {
  source       = "./modules/s3"
  environment  = var.environment
  project_name = var.project_name
}

module "secrets" {
  source               = "./modules/secrets"
  environment          = var.environment
  project_name         = var.project_name
  rds_endpoint         = module.rds.db_endpoint
  db_password          = module.rds.db_password
  elasticache_endpoint = module.elasticache.cache_endpoint
  redis_auth_token     = module.elasticache.auth_token
  secrets              = var.secrets
}

module "cloudwatch" {
  source         = "./modules/cloudwatch"
  environment    = var.environment
  cluster_name   = "${var.project_name}-${var.environment}"
  project_name   = var.project_name
  rds_identifier = module.rds.db_instance_identifier
  cache_endpoint = module.elasticache.cache_endpoint
}


@@ -1,464 +0,0 @@
variable "environment" {
description = "Deployment environment"
type = string
}
variable "cluster_name" {
description = "ECS cluster name"
type = string
}
variable "project_name" {
description = "Project name"
type = string
}
variable "rds_identifier" {
description = "RDS instance identifier"
type = string
}
variable "cache_endpoint" {
description = "ElastiCache endpoint"
type = string
}
variable "alert_email" {
description = "Email address for alert notifications"
type = string
default = "ops@shieldai.com"
}
resource "aws_sns_topic" "alerts" {
name = "${var.project_name}-${var.environment}-alerts"
tags = {
Environment = var.environment
Project = var.project_name
}
}
resource "aws_sns_topic_subscription" "alerts_email" {
topic_arn = aws_sns_topic.alerts.arn
protocol = "email"
endpoint = var.alert_email
}
resource "aws_cloudwatch_dashboard" "main" {
dashboard_name = "${var.project_name}-${var.environment}-dashboard"
dashboard_body = jsonencode({
widgets = [
{
type = "metric"
properties = {
title = "ECS CPU Utilization"
metrics = [
["AWS/ECS", "CPUUtilization", "ClusterName", var.cluster_name]
]
view = "timeSeries"
stacked = false
region = "us-east-1"
period = 300
}
},
{
type = "metric"
properties = {
title = "ECS Memory Utilization"
metrics = [
["AWS/ECS", "MemoryUtilization", "ClusterName", var.cluster_name]
]
view = "timeSeries"
stacked = false
region = "us-east-1"
period = 300
}
},
{
type = "metric"
properties = {
title = "RDS CPU Utilization"
metrics = [
["AWS/RDS", "CPUUtilization", "DBInstanceIdentifier", var.rds_identifier]
]
view = "timeSeries"
stacked = false
region = "us-east-1"
period = 300
}
},
{
type = "metric"
properties = {
title = "ALB Request Count"
metrics = [
["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "${var.cluster_name}-alb"]
]
view = "timeSeries"
stacked = false
region = "us-east-1"
period = 60
}
},
{
type = "metric"
properties = {
title = "ALB 5xx Errors"
metrics = [
["AWS/ApplicationELB", "HTTPCode_ELB_5XX_Count", "LoadBalancer", "${var.cluster_name}-alb"]
]
view = "timeSeries"
stacked = false
region = "us-east-1"
period = 60
}
},
{
type = "metric"
properties = {
title = "P99 Latency (Target Group)"
metrics = [
# Percentiles are selected with a rendering-options object, not a "Statistic" dimension.
["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "${var.cluster_name}-alb", { stat = "p99" }],
["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "${var.cluster_name}-alb", { stat = "p95" }]
]
view = "timeSeries"
stacked = false
region = "us-east-1"
period = 60
}
},
{
type = "metric"
properties = {
title = "ALB Error Counts (5xx and 4xx)"
metrics = [
["AWS/ApplicationELB", "HTTPCode_ELB_5XX_Count", "LoadBalancer", "${var.cluster_name}-alb"],
["AWS/ApplicationELB", "HTTPCode_ELB_4XX_Count", "LoadBalancer", "${var.cluster_name}-alb"]
]
view = "timeSeries"
stacked = false
region = "us-east-1"
period = 60
}
},
{
type = "metric"
properties = {
title = "Throughput (Request Count)"
metrics = [
["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "${var.cluster_name}-alb"]
]
view = "timeSeries"
stacked = false
region = "us-east-1"
period = 60
yAxis = {
left = {
label = "Requests/sec"
}
}
}
},
{
type = "metric"
properties = {
title = "API Latency Percentiles"
metrics = [
["ShieldAI", "api_latency", "service", "api", "percentile", "p99", "statistic", "Average"],
["ShieldAI", "api_latency", "service", "api", "percentile", "p95", "statistic", "Average"],
["ShieldAI", "api_latency", "service", "api", "percentile", "p50", "statistic", "Average"]
]
view = "timeSeries"
stacked = false
region = "us-east-1"
period = 60
}
},
{
type = "metric"
properties = {
title = "API Error Rate"
metrics = [
["ShieldAI", "api_errors", "service", "api", "statistic", "Sum"]
]
view = "timeSeries"
stacked = false
region = "us-east-1"
period = 60
}
},
{
type = "metric"
properties = {
title = "API Throughput"
metrics = [
["ShieldAI", "api_requests", "service", "api", "statistic", "Sum"]
]
view = "timeSeries"
stacked = false
region = "us-east-1"
period = 60
}
},
{
type = "metric"
properties = {
title = "ECS Running Tasks"
metrics = [
# RunningTaskCount is published by Container Insights, not the AWS/ECS namespace.
["ECS/ContainerInsights", "RunningTaskCount", "ClusterName", var.cluster_name]
]
view = "timeSeries"
stacked = false
region = "us-east-1"
period = 60
}
},
{
type = "metric"
properties = {
title = "RDS Read/Write IOPS"
metrics = [
["AWS/RDS", "ReadIOPS", "DBInstanceIdentifier", var.rds_identifier],
["AWS/RDS", "WriteIOPS", "DBInstanceIdentifier", var.rds_identifier]
]
view = "timeSeries"
stacked = false
region = "us-east-1"
period = 60
}
}
]
})
}
resource "aws_cloudwatch_metric_alarm" "ecs_cpu_high" {
alarm_name = "${var.project_name}-${var.environment}-ecs-cpu-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "CPUUtilization"
namespace = "AWS/ECS"
period = 300
statistic = "Average"
threshold = 80
alarm_description = "ECS CPU utilization above 80%"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
ClusterName = var.cluster_name
}
}
resource "aws_cloudwatch_metric_alarm" "ecs_memory_high" {
alarm_name = "${var.project_name}-${var.environment}-ecs-memory-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "MemoryUtilization"
namespace = "AWS/ECS"
period = 300
statistic = "Average"
threshold = 85
alarm_description = "ECS memory utilization above 85%"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
ClusterName = var.cluster_name
}
}
resource "aws_cloudwatch_metric_alarm" "alb_5xx" {
alarm_name = "${var.project_name}-${var.environment}-alb-5xx"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "HTTPCode_ELB_5XX_Count"
namespace = "AWS/ApplicationELB"
period = 60
statistic = "Sum"
threshold = 10
alarm_description = "ALB 5xx errors above 10 per minute"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
LoadBalancer = "${var.cluster_name}-alb"
}
}
resource "aws_cloudwatch_metric_alarm" "rds_cpu_high" {
alarm_name = "${var.project_name}-${var.environment}-rds-cpu-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "CPUUtilization"
namespace = "AWS/RDS"
period = 300
statistic = "Average"
threshold = 75
alarm_description = "RDS CPU utilization above 75%"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
DBInstanceIdentifier = var.rds_identifier
}
}
resource "aws_cloudwatch_metric_alarm" "rds_free_storage" {
alarm_name = "${var.project_name}-${var.environment}-rds-free-storage"
comparison_operator = "LessThanThreshold"
evaluation_periods = 2
metric_name = "FreeStorageSpace"
namespace = "AWS/RDS"
period = 300
statistic = "Average"
threshold = 524288000
alarm_description = "RDS free storage below 500MB"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
DBInstanceIdentifier = var.rds_identifier
}
}
resource "aws_cloudwatch_metric_alarm" "p99_latency_high" {
alarm_name = "${var.project_name}-${var.environment}-p99-latency-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "TargetResponseTime"
namespace = "AWS/ApplicationELB"
period = 60
extended_statistic = "p99"
threshold = 2
alarm_description = "P99 latency above 2 seconds"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
LoadBalancer = "${var.cluster_name}-alb"
}
}
resource "aws_cloudwatch_metric_alarm" "error_rate_high" {
alarm_name = "${var.project_name}-${var.environment}-error-rate-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "HTTPCode_ELB_5XX_Count"
namespace = "AWS/ApplicationELB"
period = 60
statistic = "Sum"
threshold = 5
alarm_description = "Error rate above 5 errors per minute"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
LoadBalancer = "${var.cluster_name}-alb"
}
}
resource "aws_cloudwatch_metric_alarm" "throughput_low" {
alarm_name = "${var.project_name}-${var.environment}-throughput-low"
comparison_operator = "LessThanThreshold"
evaluation_periods = 5
metric_name = "RequestCount"
namespace = "AWS/ApplicationELB"
period = 60
statistic = "Sum"
threshold = 10
alarm_description = "Throughput below 10 requests per minute"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
LoadBalancer = "${var.cluster_name}-alb"
}
}
resource "aws_cloudwatch_log_group" "api" {
name = "/${var.project_name}/${var.environment}/api"
retention_in_days = 30
tags = {
Environment = var.environment
Project = var.project_name
Service = "api"
}
}
resource "aws_cloudwatch_log_group" "datadog" {
name = "/${var.project_name}/${var.environment}/datadog"
retention_in_days = 30
tags = {
Environment = var.environment
Project = var.project_name
Service = "datadog"
}
}
resource "aws_cloudwatch_log_group" "sentry" {
name = "/${var.project_name}/${var.environment}/sentry"
retention_in_days = 30
tags = {
Environment = var.environment
Project = var.project_name
Service = "sentry"
}
}
resource "aws_cloudwatch_metric_alarm" "app_p99_latency_high" {
alarm_name = "${var.project_name}-${var.environment}-app-p99-latency-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "api_latency"
namespace = "ShieldAI"
period = 60
statistic = "Average"
threshold = 2000
alarm_description = "Application P99 latency above 2000ms"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
service = "api"
percentile = "p99"
}
}
resource "aws_cloudwatch_metric_alarm" "app_error_rate_high" {
alarm_name = "${var.project_name}-${var.environment}-app-error-rate-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "api_errors"
namespace = "ShieldAI"
period = 60
statistic = "Sum"
threshold = 10
alarm_description = "Application error count above 10 per minute"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
service = "api"
}
}
resource "aws_cloudwatch_metric_alarm" "app_throughput_low" {
alarm_name = "${var.project_name}-${var.environment}-app-throughput-low"
comparison_operator = "LessThanThreshold"
evaluation_periods = 5
metric_name = "api_requests"
namespace = "ShieldAI"
period = 60
statistic = "Sum"
threshold = 10
alarm_description = "Application throughput below 10 requests per minute"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
service = "api"
}
}
output "dashboard_url" {
description = "CloudWatch dashboard URL"
value = "https://us-east-1.console.aws.amazon.com/cloudwatch/home#dashboards/dashboard/${var.project_name}-${var.environment}-dashboard"
}
output "sns_topic_arn" {
description = "SNS topic ARN for alerts"
value = aws_sns_topic.alerts.arn
}
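# Illustrative sketch, not part of the original module. The alarms above
# compare raw 5xx counts against fixed thresholds; a ratio of 5xx responses to
# total requests needs CloudWatch metric math, which could look roughly like this.
# Assumptions: var.alb_arn_suffix is a hypothetical input holding the ALB's ARN
# suffix ("app/<name>/<id>"), the value CloudWatch expects for the LoadBalancer
# dimension, and the 5 percent threshold is a placeholder rather than a tuned value.
variable "alb_arn_suffix" {
  description = "ARN suffix of the ALB (hypothetical input for this sketch)"
  type        = string
}
resource "aws_cloudwatch_metric_alarm" "alb_5xx_ratio_sketch" {
  alarm_name          = "${var.project_name}-${var.environment}-alb-5xx-ratio"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  threshold           = 5
  alarm_description   = "ALB 5xx responses above 5% of requests"
  alarm_actions       = [aws_sns_topic.alerts.arn]
  # Only the expression returns data; the two raw metrics feed it.
  metric_query {
    id          = "error_ratio"
    expression  = "IF(requests > 0, 100 * errors / requests, 0)"
    label       = "5xx ratio (%)"
    return_data = true
  }
  metric_query {
    id = "errors"
    metric {
      metric_name = "HTTPCode_ELB_5XX_Count"
      namespace   = "AWS/ApplicationELB"
      period      = 60
      stat        = "Sum"
      dimensions = {
        LoadBalancer = var.alb_arn_suffix
      }
    }
  }
  metric_query {
    id = "requests"
    metric {
      metric_name = "RequestCount"
      namespace   = "AWS/ApplicationELB"
      period      = 60
      stat        = "Sum"
      dimensions = {
        LoadBalancer = var.alb_arn_suffix
      }
    }
  }
}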


@@ -1,519 +0,0 @@
variable "environment" {
description = "Deployment environment"
type = string
}
variable "cluster_name" {
description = "ECS cluster name"
type = string
}
variable "vpc_id" {
description = "VPC ID"
type = string
}
variable "subnet_ids" {
description = "Private subnet IDs for ECS tasks"
type = list(string)
}
variable "public_subnet_ids" {
description = "Public subnet IDs for ALB"
type = list(string)
}
variable "security_group_ids" {
description = "Security group IDs"
type = list(string)
}
variable "alb_security_group_id" {
description = "ALB security group ID"
type = string
}
variable "services" {
description = "ECS services to deploy"
type = map(object({
cpu = number
memory = number
port = number
}))
}
variable "container_images" {
description = "Container image tags"
type = map(string)
}
variable "secrets_arn" {
description = "Secrets Manager ARN"
type = string
}
variable "cache_cluster_arn" {
description = "ElastiCache replication group ARN"
type = string
}
variable "domain_name" {
description = "Route53 hosted zone domain for ACM cert validation"
type = string
default = "shieldai.app"
}
resource "aws_ecs_cluster" "main" {
name = var.cluster_name
settings {
name = "containerInsights"
value = "enabled"
}
tags = {
Name = var.cluster_name
}
}
resource "aws_ecs_cluster_capacity_providers" "main" {
cluster_name = aws_ecs_cluster.main.name
capacity_providers = ["FARGATE"]
default_capacity_provider_strategy {
base = 1
weight = 100
capacity_provider = "FARGATE"
}
}
resource "aws_ecs_task_definition" "services" {
for_each = var.services
family = "${var.cluster_name}-${each.key}"
container_definitions = jsonencode([
{
name = each.key
image = "ghcr.io/shieldai/shieldai-${each.key}:${var.container_images[each.key]}"
cpu = each.value.cpu
memory = each.value.memory
essential = true
portMappings = [
{
containerPort = each.value.port
hostPort = each.value.port
protocol = "tcp"
}
]
environment = [
{
name = "NODE_ENV"
value = var.environment
},
{
name = "PORT"
value = tostring(each.value.port)
},
{
name = "DD_ENV"
value = var.environment
},
{
name = "DD_SERVICE"
value = "${var.cluster_name}-${each.key}"
},
{
name = "DD_VERSION"
value = var.container_images[each.key]
},
{
name = "DD_TRACE_ENABLED"
value = "true"
},
{
name = "DD_LOGS_INJECTION"
value = "true"
},
{
name = "DD_AGENT_HOST"
value = "localhost"
},
{
name = "DD_AGENT_PORT"
value = "8126"
},
{
name = "SENTRY_ENVIRONMENT"
value = var.environment
},
{
name = "SENTRY_RELEASE"
value = var.container_images[each.key]
},
{
name = "AWS_REGION"
value = "us-east-1"
},
{
name = "DD_SITE"
value = "datadoghq.com"
}
]
secrets = [
{
name = "DATABASE_URL"
valueFrom = "${var.secrets_arn}:DATABASE_URL::"
},
{
name = "REDIS_URL"
valueFrom = "${var.secrets_arn}:REDIS_URL::"
},
{
name = "HIBP_API_KEY"
valueFrom = "${var.secrets_arn}:HIBP_API_KEY::"
},
{
name = "RESEND_API_KEY"
valueFrom = "${var.secrets_arn}:RESEND_API_KEY::"
},
{
name = "SENTRY_DSN"
valueFrom = "${var.secrets_arn}:SENTRY_DSN::"
},
{
name = "DD_API_KEY"
valueFrom = "${var.secrets_arn}:DD_API_KEY::"
}
]
logConfiguration = {
logDriver = "awslogs"
options = {
"awslogs-group" = "/ecs/${var.cluster_name}-${each.key}"
"awslogs-region" = "us-east-1"
"awslogs-stream-prefix" = each.key
}
}
healthCheck = {
command = ["CMD-SHELL", "curl -f http://localhost:${each.value.port}/health || exit 1"]
interval = 30
timeout = 5
retries = 3
startPeriod = 60
}
}
])
network_mode = "awsvpc"
memory = each.value.memory
cpu = each.value.cpu
requires_compatibilities = ["FARGATE"]
execution_role_arn = aws_iam_role.execution[each.key].arn
task_role_arn = aws_iam_role.task[each.key].arn
tags = {
Name = "${var.cluster_name}-${each.key}"
}
}
resource "aws_iam_role" "execution" {
for_each = var.services
name = "${var.cluster_name}-${each.key}-execution"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "ecs-tasks.amazonaws.com"
}
}
]
})
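managed_policy_arns = [
"arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
]
# The execution role (not the task role) resolves the container "secrets"
# entries at launch, so it needs read access to the secret.
inline_policy {
name = "secrets-manager-read"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = ["secretsmanager:GetSecretValue"]
Resource = var.secrets_arn
}
]
})
}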
}
resource "aws_iam_role" "task" {
for_each = var.services
name = "${var.cluster_name}-${each.key}-task"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "ecs-tasks.amazonaws.com"
}
}
]
})
inline_policy {
name = "secrets-manager-access"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"secretsmanager:GetSecretValue",
"secretsmanager:DescribeSecret"
]
Resource = var.secrets_arn
}
]
})
}
inline_policy {
name = "elasticache-access"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"elasticache:DescribeCacheClusters",
"elasticache:DescribeCacheSubnetGroups"
]
Resource = var.cache_cluster_arn
}
]
})
}
}
resource "aws_ecs_service" "services" {
for_each = var.services
name = "${var.cluster_name}-${each.key}"
cluster = aws_ecs_cluster.main.id
task_definition = aws_ecs_task_definition.services[each.key].arn
desired_count = var.environment == "production" ? 3 : 1
launch_type = "FARGATE"
network_configuration {
subnets = var.subnet_ids
security_groups = var.security_group_ids
assign_public_ip = false
}
load_balancer {
target_group_arn = aws_lb_target_group.services[each.key].arn
container_name = each.key
container_port = each.value.port
}
# Scaling bounds are set by aws_appautoscaling_target.services;
# aws_ecs_service has no auto_scaling block.
tags = {
Name = "${var.cluster_name}-${each.key}"
Service = each.key
}
depends_on = [
aws_lb_listener.https
]
}
resource "aws_lb" "main" {
name = "${var.cluster_name}-alb"
internal = false
load_balancer_type = "application"
security_groups = [var.alb_security_group_id]
subnets = var.public_subnet_ids
tags = {
Name = "${var.cluster_name}-alb"
}
}
resource "aws_acm_certificate" "main" {
domain_name = "${var.cluster_name}.${var.environment}.${var.domain_name}"
validation_method = "DNS"
tags = {
Name = "${var.cluster_name}-cert"
}
}
data "aws_route53_zone" "main" {
name = var.domain_name
}
resource "aws_route53_record" "acm_validation" {
for_each = {
for rv in aws_acm_certificate.main.domain_validation_options : rv.domain_name => rv
if rv.resource_record_name != null
}
zone_id = data.aws_route53_zone.main.zone_id
name = each.value.resource_record_name
type = each.value.resource_record_type
ttl = 60
records = [each.value.resource_record_value]
}
resource "aws_acm_certificate_validation" "main" {
certificate_arn = aws_acm_certificate.main.arn
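# The validation records use for_each, so a for expression is needed instead of a splat.
validation_record_fqdns = [for record in aws_route53_record.acm_validation : record.fqdn]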
}
resource "aws_lb_target_group" "services" {
for_each = var.services
name = "${var.cluster_name}-${each.key}-tg"
port = each.value.port
protocol = "HTTP"
vpc_id = var.vpc_id
health_check {
enabled = true
healthy_threshold = 3
interval = 30
matcher = "200"
path = "/health"
port = "traffic-port"
protocol = "HTTP"
timeout = 5
unhealthy_threshold = 3
}
stickiness {
type = "lb_cookie"
cookie_duration = 86400
}
}
resource "aws_lb_listener" "https" {
load_balancer_arn = aws_lb.main.arn
port = 443
protocol = "HTTPS"
certificate_arn = aws_acm_certificate_validation.main.certificate_arn
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.services["api"].arn
}
}
resource "aws_lb_listener_rule" "services" {
for_each = { for k, v in var.services : k => v if k != "api" }
listener_arn = aws_lb_listener.https.arn
action {
type = "forward"
target_group_arn = aws_lb_target_group.services[each.key].arn
}
condition {
path_pattern {
values = ["/${each.key}/*", "/${each.key}"]
}
}
}
resource "aws_lb_listener" "http_redirect" {
load_balancer_arn = aws_lb.main.arn
port = 80
protocol = "HTTP"
default_action {
type = "redirect"
redirect {
port = "443"
protocol = "HTTPS"
status_code = "HTTP_301"
}
}
}
resource "aws_appautoscaling_target" "services" {
for_each = var.services
service_namespace = "ecs"
resource_id = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.services[each.key].name}"
scalable_dimension = "ecs:service:DesiredCount"
min_capacity = var.environment == "production" ? 2 : 1
max_capacity = var.environment == "production" ? 10 : 3
}
resource "aws_appautoscaling_policy" "cpu" {
for_each = var.services
name = "${var.cluster_name}-${each.key}-cpu-scaling"
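# Target tracking must be requested explicitly; the default policy type is StepScaling.
policy_type = "TargetTrackingScaling"
# Referencing the scalable target makes the policy wait for it to be registered.
service_namespace = aws_appautoscaling_target.services[each.key].service_namespace
resource_id = aws_appautoscaling_target.services[each.key].resource_id
scalable_dimension = aws_appautoscaling_target.services[each.key].scalable_dimension
target_tracking_scaling_policy_configuration {
target_value = 70.0
scale_in_cooldown = 60
scale_out_cooldown = 30
customized_metric_specification {
metric_name = "CPUUtilization"
namespace = "AWS/ECS"
statistic = "Average"
# dimensions is a repeatable block, not a list attribute.
dimensions {
name = "ClusterName"
value = aws_ecs_cluster.main.name
}
# ServiceName scopes the metric to this service rather than the cluster average.
dimensions {
name = "ServiceName"
value = aws_ecs_service.services[each.key].name
}
}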
}
}
resource "aws_kms_key" "logs" {
description = "${var.cluster_name} logs encryption key"
deletion_window_in_days = 7
enable_key_rotation = true
tags = {
Name = "${var.cluster_name}-logs-kms"
}
}
resource "aws_cloudwatch_log_group" "services" {
for_each = var.services
name = "/ecs/${var.cluster_name}-${each.key}"
retention_in_days = var.environment == "production" ? 30 : 7
kms_key_id = aws_kms_key.logs.arn
tags = {
Name = "${var.cluster_name}-${each.key}-logs"
}
}
output "cluster_arn" {
description = "ECS cluster ARN"
value = aws_ecs_cluster.main.arn
}
output "alb_dns_name" {
description = "ALB DNS name"
value = aws_lb.main.dns_name
}
output "kms_key_arn" {
description = "KMS key ARN for log encryption"
value = aws_kms_key.logs.arn
}
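# Illustrative sketch, not part of the original module. CloudWatch identifies
# an ALB by its ARN suffix ("app/<name>/<id>") in the LoadBalancer dimension,
# not by its name, so a monitoring module would likely need that value.
# Assumption: the output name alb_arn_suffix is hypothetical; arn_suffix itself
# is an exported attribute of aws_lb.
output "alb_arn_suffix" {
  description = "ALB ARN suffix for use as the CloudWatch LoadBalancer dimension"
  value       = aws_lb.main.arn_suffix
}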


@@ -1,102 +0,0 @@
variable "environment" {
description = "Deployment environment"
type = string
}
variable "vpc_id" {
description = "VPC ID"
type = string
}
variable "subnet_ids" {
description = "Private subnet IDs"
type = list(string)
}
variable "security_group_id" {
description = "ElastiCache security group ID"
type = string
}
variable "node_type" {
description = "Cache node type"
type = string
}
variable "num_nodes" {
description = "Number of cache nodes"
type = number
}
variable "project_name" {
description = "Project name"
type = string
}
resource "aws_elasticache_subnet_group" "main" {
name = "${var.project_name}-${var.environment}-redis-subnet"
subnet_ids = var.subnet_ids
tags = {
Name = "${var.project_name}-${var.environment}-redis-subnet"
}
}
resource "random_password" "redis_auth" {
length = 32
special = false
keepers = {
environment = var.environment
}
}
resource "aws_elasticache_replication_group" "main" {
replication_group_id = "${var.project_name}-${var.environment}-redis"
description = "${var.project_name} Redis cluster (${var.environment})"
node_type = var.node_type
num_cache_clusters = var.num_nodes
engine = "redis"
engine_version = "7.0"
auth_token = random_password.redis_auth.result
transit_encryption_enabled = true
at_rest_encryption_enabled = true
port = 6379
subnet_group_name = aws_elasticache_subnet_group.main.name
security_group_ids = [var.security_group_id]
automatic_failover_enabled = var.environment == "production"
snapshot_retention_limit = var.environment == "production" ? 7 : 1
snapshot_window = "03:00-04:00"
tags = {
Name = "${var.project_name}-${var.environment}-redis"
}
}
output "cache_endpoint" {
description = "ElastiCache primary endpoint"
value = aws_elasticache_replication_group.main.primary_endpoint_address
}
output "reader_endpoint" {
description = "ElastiCache reader endpoint"
value = aws_elasticache_replication_group.main.reader_endpoint_address
}
output "auth_token" {
description = "Redis auth token"
value = random_password.redis_auth.result
sensitive = true
}
output "replication_group_arn" {
description = "ElastiCache replication group ARN"
value = aws_elasticache_replication_group.main.arn
}


@@ -1,138 +0,0 @@
variable "environment" {
description = "Deployment environment"
type = string
}
variable "vpc_id" {
description = "VPC ID"
type = string
}
variable "subnet_ids" {
description = "Private subnet IDs"
type = list(string)
}
variable "security_group_id" {
description = "RDS security group ID"
type = string
}
variable "db_name" {
description = "Database name"
type = string
}
variable "db_instance_class" {
description = "RDS instance class"
type = string
}
variable "multi_az" {
description = "Multi-AZ deployment"
type = bool
}
variable "backup_retention" {
description = "Backup retention days"
type = number
}
variable "project_name" {
description = "Project name"
type = string
}
resource "aws_db_subnet_group" "main" {
name = "${var.project_name}-${var.environment}-db-subnet"
subnet_ids = var.subnet_ids
tags = {
Name = "${var.project_name}-${var.environment}-db-subnet"
}
}
resource "aws_db_instance" "main" {
identifier = "${var.project_name}-${var.environment}-db"
engine = "postgres"
engine_version = "16.2"
instance_class = var.db_instance_class
allocated_storage = var.environment == "production" ? 100 : 20
db_name = var.db_name
username = "shieldai"
password = random_password.db_password.result
multi_az = var.multi_az
db_subnet_group_name = aws_db_subnet_group.main.name
vpc_security_group_ids = [var.security_group_id]
backup_retention_period = var.backup_retention
backup_window = "03:00-04:00"
maintenance_window = "sun:04:00-sun:05:00"
skip_final_snapshot = var.environment != "production"
final_snapshot_identifier = "${var.project_name}-${var.environment}-final"
storage_encrypted = true
storage_type = "gp3"
# iops is left unset: gp3 volumes below the 400 GiB threshold for PostgreSQL
# use the fixed 3000 IOPS baseline and reject a custom value.
deletion_protection = var.environment == "production"
copy_tags_to_snapshot = true
tags = {
Name = "${var.project_name}-${var.environment}-db"
}
}
resource "random_password" "db_password" {
length = 16
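special = true
# RDS rejects "/", "@", double quotes and spaces in the master password.
override_special = "!#$%&*()-_=+[]{}<>:?"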
keepers = {
environment = var.environment
}
}
resource "aws_secretsmanager_secret_version" "db_password" {
secret_id = aws_secretsmanager_secret.db_password.id
secret_string = jsonencode({
username = "shieldai"
password = random_password.db_password.result
engine = "postgres"
host = aws_db_instance.main.address
port = aws_db_instance.main.port
})
}
resource "aws_secretsmanager_secret" "db_password" {
name = "${var.project_name}-${var.environment}-db-password"
tags = {
Name = "${var.project_name}-${var.environment}-db-password"
}
}
output "db_endpoint" {
description = "RDS endpoint"
value = aws_db_instance.main.endpoint
sensitive = true
}
output "db_instance_identifier" {
description = "RDS instance identifier"
value = aws_db_instance.main.identifier
}
output "db_password_secret_arn" {
description = "DB password secret ARN"
value = aws_secretsmanager_secret.db_password.arn
}
output "db_password" {
description = "Generated DB password"
value = random_password.db_password.result
sensitive = true
}


@@ -1,145 +0,0 @@
variable "environment" {
description = "Deployment environment"
type = string
}
variable "project_name" {
description = "Project name"
type = string
}
resource "aws_s3_bucket" "terraform_state" {
bucket = "${var.project_name}-${var.environment}-terraform-state"
tags = {
Name = "${var.project_name}-${var.environment}-terraform-state"
}
}
resource "aws_s3_bucket_public_access_block" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
resource "aws_s3_bucket_versioning" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
}
}
}
resource "aws_s3_bucket_lifecycle_configuration" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
rule {
id = "expire-noncurrent"
status = "Enabled"
noncurrent_version_expiration {
noncurrent_days = 30
}
}
}
resource "aws_s3_bucket" "artifacts" {
bucket = "${var.project_name}-${var.environment}-artifacts"
tags = {
Name = "${var.project_name}-${var.environment}-artifacts"
}
}
resource "aws_s3_bucket_public_access_block" "artifacts" {
bucket = aws_s3_bucket.artifacts.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
resource "aws_s3_bucket_versioning" "artifacts" {
bucket = aws_s3_bucket.artifacts.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_s3_bucket_server_side_encryption_configuration" "artifacts" {
bucket = aws_s3_bucket.artifacts.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
}
}
}
resource "aws_s3_bucket" "logs" {
bucket = "${var.project_name}-${var.environment}-logs"
tags = {
Name = "${var.project_name}-${var.environment}-logs"
}
}
resource "aws_s3_bucket_public_access_block" "logs" {
bucket = aws_s3_bucket.logs.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
resource "aws_s3_bucket_server_side_encryption_configuration" "logs" {
bucket = aws_s3_bucket.logs.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
}
}
}
resource "aws_s3_bucket_lifecycle_configuration" "logs" {
bucket = aws_s3_bucket.logs.id
rule {
id = "expire-old-logs"
status = "Enabled"
expiration {
days = 90
}
}
}
output "bucket_name" {
description = "Terraform state S3 bucket name"
value = aws_s3_bucket.terraform_state.id
}
output "artifacts_bucket_name" {
description = "Artifacts S3 bucket name"
value = aws_s3_bucket.artifacts.id
}
output "logs_bucket_name" {
description = "Logs S3 bucket name"
value = aws_s3_bucket.logs.id
}


@@ -1,69 +0,0 @@
variable "environment" {
description = "Deployment environment"
type = string
}
variable "project_name" {
description = "Project name"
type = string
}
variable "rds_endpoint" {
description = "RDS instance endpoint"
type = string
}
variable "db_password" {
description = "Generated RDS password"
type = string
sensitive = true
}
variable "elasticache_endpoint" {
description = "ElastiCache primary endpoint"
type = string
}
variable "redis_auth_token" {
description = "ElastiCache auth token"
type = string
sensitive = true
}
variable "secrets" {
description = "Secrets to store"
type = map(string)
default = {}
}
resource "aws_secretsmanager_secret" "main" {
name = "${var.project_name}-${var.environment}-app-secrets"
description = "Application secrets for ${var.project_name} (${var.environment})"
tags = {
Name = "${var.project_name}-${var.environment}-app-secrets"
Environment = var.environment
}
}
resource "aws_secretsmanager_secret_version" "main" {
secret_id = aws_secretsmanager_secret.main.id
secret_string = jsonencode(merge({
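# rds_endpoint comes from aws_db_instance.endpoint, which already has the form
# "address:port"; the password is URL-encoded because it may contain special characters.
DATABASE_URL = "postgresql://shieldai:${urlencode(var.db_password)}@${var.rds_endpoint}/shieldai"
# Transit encryption is enabled on the replication group, so the TLS scheme is required.
REDIS_URL = "rediss://:${var.redis_auth_token}@${var.elasticache_endpoint}:6379"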
NODE_ENV = var.environment
LOG_LEVEL = var.environment == "production" ? "info" : "debug"
}, var.secrets))
}
output "secrets_manager_arn" {
description = "Secrets Manager ARN"
value = aws_secretsmanager_secret.main.arn
}
output "secrets_manager_name" {
description = "Secrets Manager secret name"
value = aws_secretsmanager_secret.main.name
}


@@ -1,338 +0,0 @@
variable "environment" {
description = "Deployment environment"
type = string
}
variable "vpc_cidr" {
description = "CIDR block for VPC"
type = string
}
variable "az_count" {
description = "Number of availability zones"
type = number
}
variable "project_name" {
description = "Project name"
type = string
}
variable "kms_key_arn" {
description = "KMS key ARN for log encryption"
type = string
default = ""
}
resource "aws_vpc" "main" {
cidr_block = var.vpc_cidr
enable_dns_support = true
enable_dns_hostnames = true
tags = {
Name = "${var.project_name}-${var.environment}-vpc"
}
}
data "aws_availability_zones" "available" {
state = "available"
}
resource "aws_subnet" "public" {
count = var.az_count
vpc_id = aws_vpc.main.id
cidr_block = cidrsubnet(var.vpc_cidr, 8, count.index)
availability_zone = data.aws_availability_zones.available.names[count.index]
map_public_ip_on_launch = false
tags = {
Name = "${var.project_name}-${var.environment}-public-${data.aws_availability_zones.available.names[count.index]}"
"kubernetes.io/role/elb" = "1"
}
}
resource "aws_subnet" "private" {
count = var.az_count
vpc_id = aws_vpc.main.id
cidr_block = cidrsubnet(var.vpc_cidr, 8, var.az_count + count.index)
availability_zone = data.aws_availability_zones.available.names[count.index]
tags = {
Name = "${var.project_name}-${var.environment}-private-${data.aws_availability_zones.available.names[count.index]}"
"kubernetes.io/role/internal-elb" = "1"
}
}
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
tags = {
Name = "${var.project_name}-${var.environment}-igw"
}
}
resource "aws_eip" "nat" {
count = var.az_count
domain = "vpc"
tags = {
Name = "${var.project_name}-${var.environment}-nat-${count.index}"
}
}
resource "aws_nat_gateway" "main" {
count = var.az_count
allocation_id = aws_eip.nat[count.index].id
subnet_id = aws_subnet.public[count.index].id
tags = {
Name = "${var.project_name}-${var.environment}-nat-${count.index}"
}
depends_on = [aws_internet_gateway.main]
}
resource "aws_route_table" "public" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.main.id
}
tags = {
Name = "${var.project_name}-${var.environment}-public-rt"
}
}
resource "aws_route_table" "private" {
count = var.az_count
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
nat_gateway_id = aws_nat_gateway.main[count.index].id
}
tags = {
Name = "${var.project_name}-${var.environment}-private-rt-${count.index}"
}
}
resource "aws_route_table_association" "public" {
count = var.az_count
subnet_id = aws_subnet.public[count.index].id
route_table_id = aws_route_table.public.id
}
resource "aws_route_table_association" "private" {
count = var.az_count
subnet_id = aws_subnet.private[count.index].id
route_table_id = aws_route_table.private[count.index].id
}
resource "aws_security_group" "alb" {
name_prefix = "${var.project_name}-${var.environment}-alb"
vpc_id = aws_vpc.main.id
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
description = "HTTPS from internet"
}
ingress {
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
description = "HTTP from internet (redirect)"
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = {
Name = "${var.project_name}-${var.environment}-alb-sg"
}
}
resource "aws_security_group" "ecs" {
name_prefix = "${var.project_name}-${var.environment}-ecs"
vpc_id = aws_vpc.main.id
ingress {
from_port = 3000
to_port = 3003
protocol = "tcp"
security_groups = [aws_security_group.alb.id]
description = "Service ports from ALB only"
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = {
Name = "${var.project_name}-${var.environment}-ecs-sg"
}
}
resource "aws_security_group" "rds" {
name_prefix = "${var.project_name}-${var.environment}-rds"
vpc_id = aws_vpc.main.id
ingress {
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = [aws_security_group.ecs.id]
description = "PostgreSQL from ECS"
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = {
Name = "${var.project_name}-${var.environment}-rds-sg"
}
}
resource "aws_security_group" "elasticache" {
name_prefix = "${var.project_name}-${var.environment}-elasticache"
vpc_id = aws_vpc.main.id
ingress {
from_port = 6379
to_port = 6379
protocol = "tcp"
security_groups = [aws_security_group.ecs.id]
description = "Redis from ECS"
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = {
Name = "${var.project_name}-${var.environment}-elasticache-sg"
}
}
resource "aws_flow_log" "main" {
iam_role_arn = aws_iam_role.flow_log.arn
log_destination = aws_cloudwatch_log_group.flow_log.arn
vpc_id = aws_vpc.main.id
traffic_type = "ALL"
tags = {
Name = "${var.project_name}-${var.environment}-flow-log"
}
}
resource "aws_iam_role" "flow_log" {
name = "${var.project_name}-${var.environment}-flow-log-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "vpc-flow-logs.amazonaws.com"
}
}
]
})
}
resource "aws_iam_role_policy" "flow_log" {
name = "${var.project_name}-${var.environment}-flow-log-policy"
role = aws_iam_role.flow_log.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"logs:DescribeLogGroups",
"logs:DescribeLogStreams"
]
Effect = "Allow"
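# Include the ":*" form so log streams inside the group are covered as well
Resource = [aws_cloudwatch_log_group.flow_log.arn, "${aws_cloudwatch_log_group.flow_log.arn}:*"]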
}
]
})
}
resource "aws_cloudwatch_log_group" "flow_log" {
name = "/${var.project_name}/${var.environment}/vpc-flow-log"
retention_in_days = var.environment == "production" ? 30 : 7
kms_key_id = var.kms_key_arn != "" ? var.kms_key_arn : null
tags = {
Name = "${var.project_name}-${var.environment}-flow-log"
}
}
output "vpc_id" {
description = "VPC ID"
value = aws_vpc.main.id
}
output "private_subnet_ids" {
description = "Private subnet IDs"
value = aws_subnet.private[*].id
}
output "public_subnet_ids" {
description = "Public subnet IDs"
value = aws_subnet.public[*].id
}
output "alb_security_group_id" {
description = "ALB security group ID"
value = aws_security_group.alb.id
}
output "ecs_security_group_id" {
description = "ECS security group ID"
value = aws_security_group.ecs.id
}
output "rds_security_group_id" {
description = "RDS security group ID"
value = aws_security_group.rds.id
}
output "elasticache_security_group_id" {
description = "ElastiCache security group ID"
value = aws_security_group.elasticache.id
}
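
The subnet layout above comes from `cidrsubnet(var.vpc_cidr, 8, index)`: public subnets use indices `0..az_count-1` and private subnets use `az_count..2*az_count-1`. A rough shell sketch of the same arithmetic for IPv4, assuming a VPC CIDR of `10.0.0.0/16` and `az_count = 2`; the shell function named `cidrsubnet` here is a hypothetical stand-in for the Terraform built-in:

```shell
#!/usr/bin/env bash
# Hypothetical shell re-implementation of Terraform's cidrsubnet() for IPv4 only.
cidrsubnet() {
  local base="$1" newbits="$2" netnum="$3"
  local ip="${base%/*}" prefix="${base#*/}"
  local a b c d
  IFS=. read -r a b c d <<<"$ip"
  local new_prefix=$((prefix + newbits))
  # Pack the dotted quad into one integer, then add the subnet number
  # shifted into the newly allocated network bits.
  local addr=$(( (a << 24) | (b << 16) | (c << 8) | d ))
  addr=$(( addr + (netnum << (32 - new_prefix)) ))
  printf '%d.%d.%d.%d/%d\n' \
    $(( (addr >> 24) & 255 )) $(( (addr >> 16) & 255 )) \
    $(( (addr >> 8) & 255 )) $(( addr & 255 )) "$new_prefix"
}

az_count=2
for i in 0 1; do
  echo "public[$i]  = $(cidrsubnet 10.0.0.0/16 8 "$i")"
  echo "private[$i] = $(cidrsubnet 10.0.0.0/16 8 $((az_count + i)))"
done
```

With these inputs the public subnets are `10.0.0.0/24` and `10.0.1.0/24`, and the private subnets are `10.0.2.0/24` and `10.0.3.0/24`, so the two ranges do not overlap.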


@@ -1,35 +0,0 @@
output "vpc_id" {
description = "VPC ID"
value = module.vpc.vpc_id
}
output "cluster_name" {
description = "ECS cluster name"
value = "${var.project_name}-${var.environment}"
}
output "rds_endpoint" {
description = "RDS endpoint"
value = module.rds.db_endpoint
sensitive = true
}
output "elasticache_endpoint" {
description = "ElastiCache primary endpoint"
value = module.elasticache.cache_endpoint
}
output "s3_bucket_name" {
description = "S3 bucket name"
value = module.s3.bucket_name
}
output "secrets_manager_arn" {
description = "Secrets Manager ARN"
value = module.secrets.secrets_manager_arn
}
output "cloudwatch_dashboard_url" {
description = "CloudWatch dashboard URL"
value = module.cloudwatch.dashboard_url
}


@@ -1,121 +0,0 @@
#!/bin/bash
set -euo pipefail
# ShieldAI Docker Compose Rollback Script
# Usage: ./rollback-compose.sh <previous_tag> [--env prod|dev]
#
# Rolls back all services to a previous tagged image using docker-compose.prod.yml
#
# Examples:
# ./rollback-compose.sh v1.2.3 # Rollback to v1.2.3
# ./rollback-compose.sh v1.2.3 --env prod # Explicit production compose
PREVIOUS_TAG="${1:-}"
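# Accept the documented `--env prod|dev` form; default to prod when omitted
ENV_MODE="prod"
if [[ "${2:-}" == "--env" ]]; then
ENV_MODE="${3:-prod}"
fi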
# ─── Configuration ───────────────────────────────────────────────
SERVICES="api darkwatch spamshield voiceprint"
COMPOSE_FILE="docker-compose.prod.yml"
REGISTRY_OWNER="${GITHUB_REPOSITORY_OWNER:-shieldai}"
# ─── Helpers ─────────────────────────────────────────────────────
log() {
local level="$1"
shift
echo "[$(date -u '+%H:%M:%S')] [$level] $*"
}
log_info() { log "INFO" "$@"; }
log_warn() { log "WARN" "$@"; }
log_error() { log "ERROR" "$@"; }
# ─── Validation ──────────────────────────────────────────────────
if [[ -z "$PREVIOUS_TAG" ]]; then
log_error "Usage: $0 <previous_tag> [--env prod|dev]"
log_error "Example: $0 v1.2.3"
exit 1
fi
if ! command -v docker &>/dev/null; then
log_error "Docker not found in PATH"
exit 1
fi
# ─── Rollback Logic ──────────────────────────────────────────────
main() {
log_info "=== Docker Compose Rollback ==="
log_info "Target tag: $PREVIOUS_TAG"
log_info "Compose file: $COMPOSE_FILE"
log_info "Registry: ghcr.io/$REGISTRY_OWNER"
# 1. Pull previous images
log_info "Pulling previous images..."
local pull_failed=0
for svc in $SERVICES; do
local image="ghcr.io/${REGISTRY_OWNER}/shieldai-${svc}:${PREVIOUS_TAG}"
log_info "Pulling $image..."
if docker pull "$image" 2>/dev/null; then
log_info "Pulled: $image"
else
log_warn "Pull failed: $image (may not exist)"
pull_failed=1
fi
done
if [[ $pull_failed -eq 1 ]]; then
log_warn "Some images may not exist at tag $PREVIOUS_TAG"
log_info "Continuing with available images..."
fi
# 2. Stop current services gracefully
log_info "Stopping current services..."
DOCKER_TAG="$PREVIOUS_TAG" docker compose -f "$COMPOSE_FILE" down --timeout 30 2>/dev/null || true
# 3. Start with previous tag
log_info "Starting services with tag $PREVIOUS_TAG..."
DOCKER_TAG="$PREVIOUS_TAG" docker compose -f "$COMPOSE_FILE" up -d
# 4. Wait for services to be healthy
log_info "Waiting for services to become healthy..."
sleep 10
# 5. Verify health
local passed=0
local failed=0
for svc in $SERVICES; do
local port
port=$(case "$svc" in
api) echo 3000 ;;
darkwatch) echo 3001 ;;
spamshield) echo 3002 ;;
voiceprint) echo 3003 ;;
esac)
local http_code
http_code=$(curl -s -o /dev/null -w "%{http_code}" \
--connect-timeout 10 --max-time 30 \
"http://localhost:${port}/health" 2>/dev/null || echo "000")
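if [[ "$http_code" == "200" ]]; then
log_info "Health OK: $svc (port $port, HTTP $http_code)"
# Avoid ((passed++)): it returns status 1 when the old value is 0, which aborts under `set -e`
passed=$((passed + 1))
else
log_warn "Health FAIL: $svc (port $port, HTTP $http_code)"
failed=$((failed + 1))
fi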
done
log_info "=== Rollback Complete ==="
log_info "Passed: $passed, Failed: $failed"
if [[ $failed -gt 0 ]]; then
log_warn "Some services failed health check. Check logs: docker compose -f $COMPOSE_FILE logs"
exit 1
fi
log_info "All services healthy after rollback"
exit 0
}
main "$@"
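
Counter increments are a common pitfall in scripts that run under `set -e`, as this one does: `((count++))` evaluates to the old value, and an arithmetic command whose value is 0 returns exit status 1. A small sketch that shows the difference between the two increment forms; `run_snippet` is an illustrative helper, and each snippet runs in a fresh `bash -c` process so that `set -e` behaves as it would in a standalone script:

```shell
#!/usr/bin/env bash
# Two ways to increment a counter that starts at 0, both under `set -e`.
increment_unsafe='set -e; count=0; ((count++)); echo "count=$count"'
increment_safe='set -e; count=0; count=$((count + 1)); echo "count=$count"'

run_snippet() {
  # Run the snippet in a separate bash process and report its exit status
  # together with whatever it printed.
  local output status=0
  output=$(bash -c "$1" 2>&1) || status=$?
  echo "status=${status} output=${output}"
}

run_snippet "$increment_unsafe"   # aborts before the echo
run_snippet "$increment_safe"     # reaches the echo
```

The first snippet exits with status 1 before printing anything; the second prints `count=1` and exits with status 0.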


@@ -1,164 +0,0 @@
#!/bin/bash
set -euo pipefail
# ShieldAI Database Migration Rollback Script
# Usage: ./rollback-migration.sh <environment> [--migration <name>]
#
# Rolls back the most recent migration or a specific named migration
# Uses AWS Secrets Manager for database credentials
#
# Examples:
# ./rollback-migration.sh staging # Rollback latest
# ./rollback-migration.sh production --migration 001_create_users # Rollback specific
ENVIRONMENT="${1:-staging}"
MIGRATION_NAME="${3:-}"
# ─── Configuration ───────────────────────────────────────────────
SECRET_ID="shieldai-${ENVIRONMENT}-db-password"
DB_NAME="shieldai"
DB_USER="shieldai"
# ─── Helpers ─────────────────────────────────────────────────────
log() {
local level="$1"
shift
echo "[$(date -u '+%H:%M:%S')] [$level] $*"
}
log_info() { log "INFO" "$@"; }
log_warn() { log "WARN" "$@"; }
log_error() { log "ERROR" "$@"; }
# ─── Validation ──────────────────────────────────────────────────
if [[ "$ENVIRONMENT" != "staging" && "$ENVIRONMENT" != "production" ]]; then
log_error "Invalid environment: $ENVIRONMENT (expected: staging, production)"
exit 1
fi
for cmd in aws jq; do
if ! command -v "$cmd" &>/dev/null; then
log_error "Missing prerequisite: $cmd"
exit 1
fi
done
# ─── Credentials ─────────────────────────────────────────────────
get_db_credentials() {
log_info "Fetching database credentials from Secrets Manager..."
local secret
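# Use text output so SecretString arrives as raw JSON rather than a quoted JSON string;
# `|| true` lets the empty check below report the failure instead of `set -e` aborting silently
secret=$(aws secretsmanager get-secret-value \
--secret-id "$SECRET_ID" \
--query 'SecretString' \
--output text 2>/dev/null || true)
if [[ -z "$secret" ]]; then
log_error "Failed to fetch secret: $SECRET_ID"
exit 1
fi
DB_HOST=$(echo "$secret" | jq -r '.host')
# The default belongs inside the jq filter
DB_PORT=$(echo "$secret" | jq -r '.port // 5432')
DB_PASS=$(echo "$secret" | jq -r '.password')
export DB_HOST DB_PORT DB_PASS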
export DATABASE_URL="postgresql://${DB_USER}:${DB_PASS}@${DB_HOST}:${DB_PORT}/${DB_NAME}"
log_info "Database: ${DB_HOST}:${DB_PORT}/${DB_NAME}"
}
# ─── Migration Status ────────────────────────────────────────────
show_migration_status() {
log_info "=== Current Migration Status ==="
if command -v npx &>/dev/null; then
npx drizzle-kit status --config=drizzle.config.ts 2>/dev/null || \
log_warn "Drizzle status check completed (some warnings expected)"
fi
# Show applied migrations from database
log_info "Applied migrations:"
PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d "$DB_NAME" \
-c "SELECT id, checksum, type FROM __drizzle_migrations_schema ORDER BY id DESC;" 2>/dev/null || \
log_warn "Could not query migration table (psql may not be installed)"
}
# ─── Rollback Logic ──────────────────────────────────────────────
rollback_latest() {
log_info "=== Rolling Back Latest Migration ==="
# Get the latest applied migration
local latest_migration
latest_migration=$(PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -p "$DB_PORT" \
-U "$DB_USER" -d "$DB_NAME" -t -A \
-c "SELECT id FROM __drizzle_migrations_schema ORDER BY id DESC LIMIT 1;" 2>/dev/null)
if [[ -z "$latest_migration" ]]; then
log_warn "No applied migrations found"
return 0
fi
log_info "Latest migration: $latest_migration"
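# Resolve the migration. NOTE: the command below passes `--status applied`, which conflicts with
# the goal of marking the migration as not applied; confirm the subcommand and flag against the
# installed drizzle-kit version before relying on this step.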
if command -v npx &>/dev/null; then
npx drizzle-kit migrate:resolve --migration "$latest_migration" --status applied 2>/dev/null || \
log_warn "Migration resolve completed (check output for details)"
fi
log_info "Migration $latest_migration marked as resolved"
}
rollback_specific() {
local target="$1"
log_info "=== Rolling Back Migration: $target ==="
if command -v npx &>/dev/null; then
npx drizzle-kit migrate:resolve --migration "$target" --status applied 2>/dev/null || \
log_warn "Migration resolve completed (check output for details)"
fi
log_info "Migration $target marked as resolved"
}
# ─── Verification ────────────────────────────────────────────────
verify_connection() {
log_info "=== Verifying Database Connection ==="
local result
result=$(PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -p "$DB_PORT" \
-U "$DB_USER" -d "$DB_NAME" -t -A \
-c "SELECT version();" 2>/dev/null || echo "FAIL")
if [[ "$result" != "FAIL" ]]; then
log_info "Connection OK: PostgreSQL $result"
else
log_warn "Connection check failed"
fi
}
# ─── Main ────────────────────────────────────────────────────────
main() {
log_info "=== ShieldAI Migration Rollback ==="
log_info "Environment: $ENVIRONMENT"
log_info "Secret: $SECRET_ID"
get_db_credentials
show_migration_status
if [[ -n "$MIGRATION_NAME" ]]; then
rollback_specific "$MIGRATION_NAME"
else
rollback_latest
fi
verify_connection
show_migration_status
log_info "=== Rollback Complete ==="
log_info "Next steps:"
log_info "1. Verify application schema compatibility"
log_info "2. Run application health checks"
log_info "3. If needed, redeploy ECS services: ./rollback.sh $ENVIRONMENT all"
}
main "$@"
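
The usage line documents `<environment> [--migration <name>]`, while the script reads the migration name from a fixed position (`$3`). A minimal sketch of order-independent argument parsing for the same interface; `parse_args` is a hypothetical helper, not something this script defines:

```shell
#!/usr/bin/env bash
# Hypothetical parser for: <environment> [--migration <name>]
parse_args() {
  ENVIRONMENT="staging"
  MIGRATION_NAME=""
  while [[ $# -gt 0 ]]; do
    case "$1" in
      --migration)
        if [[ $# -lt 2 ]]; then
          echo "error: --migration requires a name" >&2
          return 2
        fi
        MIGRATION_NAME="$2"
        shift 2
        ;;
      --*)
        echo "error: unknown option: $1" >&2
        return 2
        ;;
      *)
        # Any bare word is treated as the environment name
        ENVIRONMENT="$1"
        shift
        ;;
    esac
  done
}

parse_args production --migration 001_create_users
echo "environment=$ENVIRONMENT migration=$MIGRATION_NAME"
```

Parsing in a loop also lets the script reject a dangling `--migration` flag instead of silently rolling back the latest migration.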


@@ -1,255 +0,0 @@
#!/bin/bash
set -euo pipefail
# ShieldAI ECS Rollback Script
# Usage: ./rollback.sh <environment> <service|all> [--verify]
#
# Environments: staging, production
# Services: api, darkwatch, spamshield, voiceprint, all
#
# Examples:
# ./rollback.sh staging api # Rollback single service
# ./rollback.sh production all # Rollback all services
# ./rollback.sh production all --verify # Rollback with post-verification
# ─── Configuration ───────────────────────────────────────────────
ENVIRONMENT="${1:-staging}"
SERVICE="${2:-all}"
VERIFY="${3:-false}"
CLUSTER="shieldai-${ENVIRONMENT}"
SERVICES_LIST="api darkwatch spamshield voiceprint"
EXIT_CODE=0
TIMESTAMP=$(date -u '+%Y-%m-%d %H:%M:%S UTC')
LOG_FILE="/tmp/shieldai-rollback-${ENVIRONMENT}-${TIMESTAMP//[: ]/_}.log"
# ─── Helpers ─────────────────────────────────────────────────────
log() {
local level="$1"
shift
local msg="$*"
echo "[$(date -u '+%H:%M:%S')] [$level] $msg" | tee -a "$LOG_FILE"
}
log_info() { log "INFO" "$@"; }
log_warn() { log "WARN" "$@"; }
log_error() { log "ERROR" "$@"; }
# ─── Validation ──────────────────────────────────────────────────
validate_environment() {
if [[ "$ENVIRONMENT" != "staging" && "$ENVIRONMENT" != "production" ]]; then
log_error "Invalid environment: $ENVIRONMENT (expected: staging, production)"
exit 1
fi
}
validate_service() {
if [[ "$SERVICE" == "all" ]]; then
return 0
fi
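# Exact fixed-string match, so regex metacharacters in the argument cannot match a real service name
if ! tr ' ' '\n' <<<"$SERVICES_LIST" | grep -qxF -- "$SERVICE"; then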
log_error "Invalid service: $SERVICE (expected: api, darkwatch, spamshield, voiceprint, all)"
exit 1
fi
}
check_prerequisites() {
local missing=()
for cmd in aws jq curl; do
if ! command -v "$cmd" &>/dev/null; then
missing+=("$cmd")
fi
done
if [[ ${#missing[@]} -gt 0 ]]; then
log_error "Missing prerequisites: ${missing[*]}"
exit 1
fi
if [[ -z "${AWS_DEFAULT_REGION:-}" ]]; then
export AWS_DEFAULT_REGION="us-east-1"
fi
log_info "Prerequisites OK (region: $AWS_DEFAULT_REGION)"
}
# ─── Rollback Logic ──────────────────────────────────────────────
get_target_services() {
if [[ "$SERVICE" == "all" ]]; then
echo "$SERVICES_LIST"
else
echo "$SERVICE"
fi
}
rollback_service() {
local svc="$1"
local service_name="${CLUSTER}-${svc}"
log_info "Rolling back $service_name..."
# Check current deployment status
local current_task_def
current_task_def=$(aws ecs describe-services \
--cluster "$CLUSTER" \
--services "$service_name" \
--query 'services[0].taskDefinition' \
--output text 2>/dev/null || echo "UNKNOWN")
log_info "Current task definition: $current_task_def"
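# Execute rollback by redeploying the previous task definition revision
# (family:revision-1, assumed to still be active)
local family_rev="${current_task_def##*/}"
local family="${family_rev%:*}"
local revision="${family_rev##*:}"
if [[ ! "$revision" =~ ^[0-9]+$ ]] || (( revision <= 1 )); then
log_error "No previous task definition revision found for $service_name"
EXIT_CODE=1
return 1
fi
if aws ecs update-service \
--cluster "$CLUSTER" \
--service "$service_name" \
--task-definition "${family}:$((revision - 1))" \
--no-cli-auto-prompt 2>>"$LOG_FILE"; then
log_info "Rollback initiated for $service_name"
else
log_error "Rollback failed to initiate for $service_name"
EXIT_CODE=1
return 1
fi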
# Wait for stabilization (max 5 minutes)
log_info "Waiting for $service_name to stabilize (timeout: 300s)..."
# The ECS waiter takes no timeout option, so bound it with coreutils `timeout`
if timeout 300 aws ecs wait services-stable \
--cluster "$CLUSTER" \
--services "$service_name" 2>>"$LOG_FILE"; then
log_info "$service_name stabilized successfully"
else
log_warn "$service_name stabilization timed out or failed"
EXIT_CODE=1
return 1
fi
# Get new task definition after rollback
local new_task_def
new_task_def=$(aws ecs describe-services \
--cluster "$CLUSTER" \
--services "$service_name" \
--query 'services[0].taskDefinition' \
--output text 2>/dev/null || echo "UNKNOWN")
local running_count
running_count=$(aws ecs describe-services \
--cluster "$CLUSTER" \
--services "$service_name" \
--query 'services[0].runningCount' \
--output text 2>/dev/null || echo "0")
local desired_count
desired_count=$(aws ecs describe-services \
--cluster "$CLUSTER" \
--services "$service_name" \
--query 'services[0].desiredCount' \
--output text 2>/dev/null || echo "0")
log_info "Rollback complete: $service_name -> $new_task_def ($running_count/$desired_count running)"
return 0
}
# ─── Health Verification ─────────────────────────────────────────
verify_health() {
local svc="$1"
local port
port=$(case "$svc" in
api) echo 3000 ;;
darkwatch) echo 3001 ;;
spamshield) echo 3002 ;;
voiceprint) echo 3003 ;;
*) echo 3000 ;;
esac)
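# Look up the real ALB DNS name (assumes the load balancer is named "${CLUSTER}-alb")
local alb_host
alb_host=$(aws elbv2 describe-load-balancers \
--names "${CLUSTER}-alb" \
--query 'LoadBalancers[0].DNSName' \
--output text 2>/dev/null || echo "")
local alb_dns="https://${alb_host}"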
log_info "Verifying health for $svc (ALB: $alb_dns)..."
local http_code
http_code=$(curl -s -o /dev/null -w "%{http_code}" \
--connect-timeout 10 \
--max-time 30 \
"$alb_dns/health" 2>/dev/null || echo "000")
if [[ "$http_code" == "200" ]]; then
log_info "Health check PASSED: $svc (HTTP $http_code)"
return 0
else
log_warn "Health check FAILED: $svc (HTTP $http_code)"
return 1
fi
}
verify_all_services() {
log_info "=== Post-Rollback Health Verification ==="
local passed=0
local failed=0
for svc in $(get_target_services); do
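if verify_health "$svc"; then
# Arithmetic expansion instead of ((passed++)), which returns status 1 at 0 and aborts under `set -e`
passed=$((passed + 1))
else
failed=$((failed + 1))
fi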
done
log_info "Verification complete: $passed passed, $failed failed"
if [[ $failed -gt 0 ]]; then
log_warn "Some services failed health verification"
EXIT_CODE=1
fi
}
# ─── Main Execution ──────────────────────────────────────────────
main() {
log_info "=== ShieldAI Rollback ==="
log_info "Environment: $ENVIRONMENT"
log_info "Service(s): $SERVICE"
log_info "Cluster: $CLUSTER"
log_info "Verify: $VERIFY"
log_info "Timestamp: $TIMESTAMP"
log_info "Log file: $LOG_FILE"
log_info "=========================="
# Validate inputs
validate_environment
validate_service
check_prerequisites
# Execute rollback for each target service
local rolled_back=0
local failed=0
for svc in $(get_target_services); do
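if rollback_service "$svc"; then
# Arithmetic expansion instead of ((rolled_back++)), which returns status 1 at 0 and aborts under `set -e`
rolled_back=$((rolled_back + 1))
else
failed=$((failed + 1))
fi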
done
log_info "=== Rollback Summary ==="
log_info "Rolled back: $rolled_back services"
log_info "Failed: $failed services"
# Post-rollback verification
if [[ "$VERIFY" == "--verify" ]] || [[ "$VERIFY" == "true" ]]; then
verify_all_services
fi
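if [[ $failed -gt 0 ]]; then
log_error "Rollback completed with $failed failure(s)"
log_info "Full log: $LOG_FILE"
exit "$EXIT_CODE"
fi
# A failed health verification sets EXIT_CODE without incrementing $failed
if [[ $EXIT_CODE -ne 0 ]]; then
log_error "Rollback succeeded but post-rollback verification failed"
log_info "Full log: $LOG_FILE"
exit "$EXIT_CODE"
fi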
log_info "Rollback completed successfully"
log_info "Full log: $LOG_FILE"
exit 0
}
main "$@"
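
Both rollback scripts map service names to ports with an inline `case`, and this one validates the service name with `grep -qw`, which interprets its argument as a regular expression (so `a.i` would match `api`). A small sketch, assuming the four services and ports listed above, that keeps the mapping in one place and reuses it for exact-match validation; `service_port` and `is_valid_service` are illustrative names:

```shell
#!/usr/bin/env bash
# Hypothetical single source of truth for the service-to-port mapping.
service_port() {
  case "$1" in
    api)        echo 3000 ;;
    darkwatch)  echo 3001 ;;
    spamshield) echo 3002 ;;
    voiceprint) echo 3003 ;;
    *)
      echo "unknown service: $1" >&2
      return 1
      ;;
  esac
}

# A service is valid exactly when it has a port; case patterns here are
# literal strings, so no regex or substring matching is involved.
is_valid_service() {
  service_port "$1" >/dev/null 2>&1
}

for svc in api voiceprint "a.i"; do
  if is_valid_service "$svc"; then
    echo "$svc -> port $(service_port "$svc")"
  else
    echo "$svc -> rejected"
  fi
done
```

Unlike the inline mapping in `verify_health`, an unknown name fails loudly here instead of falling back to port 3000.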


@@ -1,237 +0,0 @@
#!/bin/bash
set -uo pipefail
# ShieldAI Rollback Test Suite
# Usage: ./test-rollback.sh [ecs|compose|migration|all]
#
# Validates rollback scripts and procedures without mutating production
# Run against staging environment for integration tests
TEST_SUITE="${1:-all}"
PASS=0
FAIL=0
SKIP=0
# ─── Helpers ─────────────────────────────────────────────────────
log() {
echo "[$(date -u '+%H:%M:%S')] $*"
}
assert_eq() {
local desc="$1" expected="$2" actual="$3"
if [[ "$expected" == "$actual" ]]; then
log " ✅ PASS: $desc"
((PASS++))
else
log " ❌ FAIL: $desc (expected: $expected, got: $actual)"
((FAIL++))
fi
}
assert_file_exists() {
local desc="$1" path="$2"
if [[ -f "$path" ]]; then
log " ✅ PASS: $desc"
((PASS++))
else
log " ❌ FAIL: $desc ($path not found)"
((FAIL++))
fi
}
assert_executable() {
local desc="$1" path="$2"
if [[ -x "$path" ]]; then
log " ✅ PASS: $desc"
((PASS++))
else
log " ❌ FAIL: $desc ($path not executable)"
((FAIL++))
fi
}
assert_script_syntax() {
local desc="$1" path="$2"
if bash -n "$path" 2>/dev/null; then
log " ✅ PASS: $desc (syntax OK)"
((PASS++))
else
log " ❌ FAIL: $desc (syntax error)"
((FAIL++))
fi
}
assert_contains() {
local desc="$1" file="$2" pattern="$3"
if grep -q -- "$pattern" "$file" 2>/dev/null; then
log " ✅ PASS: $desc"
((PASS++))
else
log " ❌ FAIL: $desc (pattern '$pattern' not found in $file)"
((FAIL++))
fi
}
# ─── Test: File Structure ────────────────────────────────────────
test_file_structure() {
log "=== Test: File Structure ==="
assert_file_exists "ROLLBACK.md exists" "infra/ROLLBACK.md"
assert_file_exists "rollback.sh exists" "infra/scripts/rollback.sh"
assert_file_exists "rollback-compose.sh exists" "infra/scripts/rollback-compose.sh"
assert_file_exists "rollback-migration.sh exists" "infra/scripts/rollback-migration.sh"
assert_executable "rollback.sh is executable" "infra/scripts/rollback.sh"
assert_executable "rollback-compose.sh is executable" "infra/scripts/rollback-compose.sh"
assert_executable "rollback-migration.sh is executable" "infra/scripts/rollback-migration.sh"
}
# ─── Test: Script Syntax ─────────────────────────────────────────
test_script_syntax() {
log "=== Test: Script Syntax ==="
assert_script_syntax "rollback.sh syntax" "infra/scripts/rollback.sh"
assert_script_syntax "rollback-compose.sh syntax" "infra/scripts/rollback-compose.sh"
assert_script_syntax "rollback-migration.sh syntax" "infra/scripts/rollback-migration.sh"
}
# ─── Test: ROLLBACK.md Content ───────────────────────────────────
test_documentation() {
log "=== Test: Documentation Content ==="
local doc="infra/ROLLBACK.md"
for section in "Overview" "ECS Service Rollback" "Docker Compose Rollback" \
"Database Migration Rollback" "Automated Rollback Triggers" \
"Blue-Green Deployment Rollback" "Rollback Decision Tree" \
"Post-Rollback Verification" "Testing Checklist" "Emergency Rollback"; do
assert_contains "Section '$section' documented" "$doc" "$section"
done
for cmd in "aws ecs update-service" "docker compose" "drizzle-kit" \
"aws rds restore-db-instance" "aws ecs wait services-stable"; do
assert_contains "Command '$cmd' documented" "$doc" "$cmd"
done
}
# ─── Test: Rollback Script Validation ────────────────────────────
test_rollback_script() {
log "=== Test: ECS Rollback Script ==="
# Test invalid environment
local exit_code=0
bash infra/scripts/rollback.sh invalid_env api >/dev/null 2>&1 || exit_code=$?
assert_eq "Invalid environment returns exit code 1" "1" "$exit_code"
# Test invalid service
exit_code=0
bash infra/scripts/rollback.sh staging invalid_svc >/dev/null 2>&1 || exit_code=$?
assert_eq "Invalid service returns exit code 1" "1" "$exit_code"
# Verify script has required functions
for func in "validate_environment" "validate_service" "rollback_service" \
"verify_health" "check_prerequisites" "main"; do
assert_contains "Function '$func' defined" "infra/scripts/rollback.sh" "$func"
done
# Verify all services are handled
for svc in api darkwatch spamshield voiceprint; do
assert_contains "Service '$svc' in SERVICES_LIST" "infra/scripts/rollback.sh" "$svc"
done
}
# ─── Test: Compose Rollback Script ───────────────────────────────
test_compose_script() {
log "=== Test: Docker Compose Rollback Script ==="
# Test missing tag argument
local exit_code=0
bash infra/scripts/rollback-compose.sh >/dev/null 2>&1 || exit_code=$?
assert_eq "Missing tag returns exit code 1" "1" "$exit_code"
# Verify compose file exists
assert_file_exists "docker-compose.prod.yml exists" "docker-compose.prod.yml"
# Verify all services are defined in compose
for svc in api darkwatch spamshield voiceprint; do
assert_contains "Service '$svc' in docker-compose.prod.yml" "docker-compose.prod.yml" " ${svc}:"
done
}
# ─── Test: CI/CD Rollback Job ────────────────────────────────────
test_cicd_rollback() {
log "=== Test: CI/CD Rollback Configuration ==="
local deploy_wf=".github/workflows/deploy.yml"
assert_contains "Rollback job defined" "$deploy_wf" "rollback:"
assert_contains "Health check triggers rollback" "$deploy_wf" "needs.health-check.result"
assert_contains "ECS --rollback flag used" "$deploy_wf" "--rollback"
for svc in api darkwatch spamshield voiceprint; do
assert_contains "Service '$svc' in deploy matrix" "$deploy_wf" "$svc"
done
}
# ─── Test: Health Check Configuration ────────────────────────────
test_health_checks() {
log "=== Test: Health Check Configuration ==="
assert_contains "Container health check in ECS" "infra/modules/ecs/main.tf" "healthCheck"
assert_contains "ALB health check defined" "infra/modules/ecs/main.tf" "health_check"
assert_contains "ALB 5xx alarm configured" "infra/modules/cloudwatch/main.tf" "HTTPCode_Elb_5XX_Count"
}
# ─── Test: README References ─────────────────────────────────────
test_readme() {
log "=== Test: README References ==="
assert_contains "README references ROLLBACK.md" "infra/README.md" "ROLLBACK.md"
assert_contains "README documents rollback.sh" "infra/README.md" "rollback.sh"
assert_contains "README documents rollback-compose.sh" "infra/README.md" "rollback-compose.sh"
assert_contains "README documents rollback-migration.sh" "infra/README.md" "rollback-migration.sh"
}
# ─── Main ────────────────────────────────────────────────────────
main() {
log "=== ShieldAI Rollback Test Suite ==="
log "Suite: $TEST_SUITE"
log ""
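# Separate checks so that "all" runs every suite; a `case` stops at its first matching branch
if [[ "$TEST_SUITE" == "ecs" || "$TEST_SUITE" == "all" ]]; then
test_rollback_script
test_cicd_rollback
test_health_checks
fi
if [[ "$TEST_SUITE" == "compose" || "$TEST_SUITE" == "all" ]]; then
test_compose_script
fi
if [[ "$TEST_SUITE" == "migration" || "$TEST_SUITE" == "all" ]]; then
log "=== Test: Migration Rollback ==="
assert_script_syntax "rollback-migration.sh syntax" "infra/scripts/rollback-migration.sh"
assert_contains "Uses Secrets Manager" "infra/scripts/rollback-migration.sh" "secretsmanager"
assert_contains "Uses drizzle-kit" "infra/scripts/rollback-migration.sh" "drizzle-kit"
fi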
test_file_structure
test_script_syntax
test_documentation
test_readme
log ""
log "=== Results ==="
log "Passed: $PASS"
log "Failed: $FAIL"
log ""
if [[ $FAIL -gt 0 ]]; then
log "❌ SOME TESTS FAILED"
return 1
fi
log "✅ ALL TESTS PASSED"
return 0
}
main "$@"
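
In a bash `case` statement, `;;` stops at the first matching pattern, so a value such as `all` that appears in several branches only triggers the first of them. A short sketch contrasting `;;` with `;;&` (bash 4 and later), which keeps testing the remaining patterns; the function names are illustrative only:

```shell
#!/usr/bin/env bash
# `;;` ends the case statement after the first branch that matches.
run_first_match() {
  case "$1" in
    ecs|all)       echo "ecs" ;;
    compose|all)   echo "compose" ;;
    migration|all) echo "migration" ;;
  esac
}

# `;;&` continues with the next pattern test instead of leaving the case.
run_all_matches() {
  case "$1" in
    ecs|all)       echo "ecs" ;;&
    compose|all)   echo "compose" ;;&
    migration|all) echo "migration" ;;
  esac
}

echo "first match for 'all': $(echo $(run_first_match all))"
echo "all matches for 'all': $(echo $(run_all_matches all))"
```

Here `run_first_match all` prints only `ecs`, while `run_all_matches all` prints `ecs`, `compose`, and `migration`.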


@@ -1,122 +0,0 @@
variable "aws_region" {
description = "AWS region"
type = string
default = "us-east-1"
}
variable "environment" {
description = "Deployment environment"
type = string
validation {
condition = contains(["dev", "staging", "production"], var.environment)
error_message = "Environment must be one of: dev, staging, production."
}
}
variable "project_name" {
description = "Project name for resource naming"
type = string
default = "shieldai"
}
variable "vpc_cidr" {
description = "CIDR block for VPC"
type = string
default = "10.0.0.0/16"
}
variable "az_count" {
description = "Number of availability zones"
type = number
default = 2
}
variable "db_name" {
description = "RDS database name"
type = string
default = "shieldai"
}
variable "db_instance_class" {
description = "RDS instance class"
type = string
default = "db.t3.medium"
}
variable "db_multi_az" {
description = "Enable Multi-AZ deployment"
type = bool
default = true
}
variable "db_backup_retention" {
description = "RDS backup retention period in days"
type = number
default = 7
}
variable "elasticache_node_type" {
description = "ElastiCache node type"
type = string
default = "cache.t3.medium"
}
variable "elasticache_num_nodes" {
description = "Number of ElastiCache nodes"
type = number
default = 2
}
variable "services" {
description = "ECS services to deploy"
type = map(object({
cpu = number
memory = number
port = number
}))
default = {
api = {
cpu = 512
memory = 1024
port = 3000
}
darkwatch = {
cpu = 256
memory = 512
port = 3001
}
spamshield = {
cpu = 256
memory = 512
port = 3002
}
voiceprint = {
cpu = 512
memory = 1024
port = 3003
}
}
}
variable "container_images" {
description = "Container image tags per service"
type = map(string)
default = {
api = "latest"
darkwatch = "latest"
spamshield = "latest"
voiceprint = "latest"
}
}
variable "secrets" {
description = "Secrets to store in AWS Secrets Manager"
type = map(string)
default = {}
}
variable "domain_name" {
description = "Route53 hosted zone domain for ACM cert validation"
type = string
default = "shieldai.app"
}