- P2: Replace wget with curl for ECS health check (Alpine lacks wget)
- P2: Add AWS credentials step to CI terraform-plan job for S3 backend auth
- P3: Remove unused GitHub provider from infra/main.tf

Co-Authored-By: Paperclip <noreply@paperclip.ing>
/infra/
├── main.tf                     # Root module: VPC, ECS, RDS, ElastiCache, S3, Secrets, CloudWatch
├── variables.tf                # Input variables with validation
├── outputs.tf                  # Output values (endpoints, ARNs, URLs)
├── modules/
│   ├── vpc/main.tf             # VPC, subnets, IGW, NAT GW, security groups
│   ├── ecs/main.tf             # ECS cluster, task definitions, services, ALB, auto-scaling
│   ├── rds/main.tf             # RDS PostgreSQL with automated backups
│   ├── elasticache/main.tf     # ElastiCache Redis with replication
│   ├── s3/main.tf              # S3 buckets: state, artifacts, logs
│   ├── secrets/main.tf         # AWS Secrets Manager
│   └── cloudwatch/main.tf      # Dashboards, alarms, notifications
├── environments/
│   ├── staging/main.tf         # Staging environment config
│   └── production/main.tf      # Production environment config
└── scripts/
    ├── rollback.sh             # ECS service rollback (AWS)
    ├── rollback-compose.sh     # Docker Compose rollback (local/staging)
    └── rollback-migration.sh   # Database migration rollback
Quick Start
Prerequisites
- Terraform >= 1.5.0
- AWS CLI configured with appropriate credentials
- AWS account with ECS, RDS, ElastiCache permissions
Initialize
cd infra/environments/staging
terraform init
terraform plan -var-file=terraform.tfvars.example
terraform apply -var-file=terraform.tfvars.example
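The same steps can be wrapped so that only the two environments in the directory tree are accepted. This is a minimal sketch, not part of the repository: `env_dir`, `tf_plan`, and the `TFVARS` override are hypothetical names, and it assumes it is run from the repository root with a Terraform version that supports the global `-chdir` option (0.14 or later, which the >= 1.5.0 prerequisite satisfies).

```shell
# Map an environment name to its Terraform directory.
# Only the two environments listed in the directory tree are accepted.
env_dir() {
  case "$1" in
    staging|production)
      printf 'infra/environments/%s\n' "$1"
      ;;
    *)
      echo "unknown environment: $1" >&2
      return 1
      ;;
  esac
}

# Run init and plan for one environment without changing the caller's cwd.
# TFVARS is a hypothetical override; it defaults to the example file used above
# and is resolved relative to the environment directory because of -chdir.
tf_plan() {
  dir=$(env_dir "$1") || return 1
  terraform -chdir="$dir" init -input=false &&
    terraform -chdir="$dir" plan -input=false \
      -var-file="${TFVARS:-terraform.tfvars.example}"
}
```

For example, `tf_plan staging` plans the staging environment, while `tf_plan qa` fails before Terraform is invoked.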
Deploy via CI/CD
- Push to main → deploys to staging
- Create a release → deploys to production
- Health check failure → automatic rollback
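The health-check gate in the last bullet could look roughly like the sketch below. It is an illustration under assumptions, not the pipeline's actual implementation: `probe_url`, `wait_healthy`, and `deploy_gate` are hypothetical names, the retry count and delay are arbitrary defaults, and the health URL is whatever endpoint the service exposes. The probe uses curl, and the rollback call follows the `rollback.sh` usage shown in the Quick Reference.

```shell
# Probe one URL; succeeds only on a 2xx/3xx response within 5 seconds.
probe_url() {
  curl --fail --silent --show-error --max-time 5 --output /dev/null "$1"
}

# wait_healthy URL [ATTEMPTS] [DELAY_SECONDS] [PROBE_COMMAND]
# Returns 0 as soon as one probe succeeds, 1 after all attempts fail.
# PROBE_COMMAND is injectable so the loop can be exercised without network.
wait_healthy() {
  url=$1
  attempts=${2:-5}
  delay=${3:-10}
  probe=${4:-probe_url}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$probe" "$url"; then
      return 0
    fi
    # Sleep between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# deploy_gate HEALTH_URL ENVIRONMENT SERVICE
# On failure, hand off to the repository's ECS rollback script.
deploy_gate() {
  if wait_healthy "$1"; then
    echo "healthy"
  else
    echo "unhealthy: rolling back $3 in $2" >&2
    ./infra/scripts/rollback.sh "$2" "$3" --verify
  fi
}
```

Keeping the probe separate from the retry loop makes the gate easy to test with a stub command in place of curl.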
Architecture
Networking
- VPC with public/private subnets across multiple AZs
- NAT Gateway for outbound traffic from private subnets
- Security groups: ECS → RDS (5432), ECS → ElastiCache (6379)
Compute
- ECS Fargate for serverless container orchestration
- Application Load Balancer with health checks
- Auto-scaling: CPU-based scaling (70% target)
- Production: 3 replicas per service, min 2, max 10
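To make the scaling numbers concrete, the sketch below estimates a task count from current CPU utilisation. It is a simplified proportional model (scale the count by observed CPU over target CPU, round up, then clamp to the min/max bounds) and only approximates how ECS target tracking behaves; AWS does not commit to this exact formula. The function name `desired_count` is hypothetical, and the defaults are the 70% target and the 2 to 10 production bounds listed above.

```shell
# desired_count CURRENT_TASKS CPU_PERCENT [TARGET_PERCENT] [MIN] [MAX]
# Prints the task count a proportional model would aim for.
desired_count() {
  current=$1
  cpu=$2
  target=${3:-70}
  min=${4:-2}
  max=${5:-10}
  # Integer ceiling of current * cpu / target.
  want=$(( (current * cpu + target - 1) / target ))
  # Clamp to the configured bounds.
  if [ "$want" -lt "$min" ]; then want=$min; fi
  if [ "$want" -gt "$max" ]; then want=$max; fi
  echo "$want"
}
```

With 3 tasks at 90% CPU, `desired_count 3 90` yields 4; at 20% CPU the result is held at the minimum of 2.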
Data
- RDS PostgreSQL 16.2 with Multi-AZ (production)
- Automated daily backups, 7-14 day retention
- ElastiCache Redis 7.0 with replication
- S3 with versioning and lifecycle policies
Secrets
- AWS Secrets Manager for all credentials
- ECS task execution role with SecretsManagerReadOnly
- DB credentials auto-rotated via RDS integration
Monitoring
- CloudWatch dashboards: CPU, memory, ALB metrics
- Alarms: CPU >80%, memory >85%, 5xx >10/min, RDS storage <500MB
- Container Insights enabled for ECS
- Logs: 30-day retention (production), 7-day (staging)
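The alarm thresholds above can be restated as a small check, which is handy for sanity-testing dashboards or runbooks against sample values. This is only an illustration of the threshold logic: the function name and the alarm labels (`cpu`, `memory`, `alb_5xx`, `rds_storage`) are hypothetical and do not correspond to the CloudWatch alarm names defined in the Terraform modules. Inputs are whole numbers.

```shell
# breached_alarms CPU_PERCENT MEMORY_PERCENT ERRORS_5XX_PER_MIN RDS_FREE_MB
# Prints a space-separated list of thresholds that are breached.
breached_alarms() {
  out=""
  if [ "$1" -gt 80 ]; then out="$out cpu"; fi
  if [ "$2" -gt 85 ]; then out="$out memory"; fi
  if [ "$3" -gt 10 ]; then out="$out alb_5xx"; fi
  if [ "$4" -lt 500 ]; then out="$out rds_storage"; fi
  # Strip the leading space left by the first append.
  echo "${out# }"
}
```

Values exactly at a threshold (for example 80% CPU) do not count as breached, matching the strict comparisons in the list.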
Backup Strategy
- RDS: automated snapshots every 24h, 7-14 day retention
- RDS: Multi-AZ for automatic failover (production)
- ElastiCache: daily snapshots, 1-7 day retention
- S3: versioning enabled, non-current versions expire after 30 days
- Terraform state: S3 with versioning + DynamoDB locking
Rollback
See ROLLBACK.md for the complete rollback runbook, including:
- ECS service rollback (automated + manual)
- Docker Compose rollback (local / staging)
- Database migration rollback (Drizzle)
- Blue-green deployment rollback
- RDS point-in-time recovery
- Automated rollback triggers and health checks
- Emergency rollback runbook
- Testing checklist
Quick Reference
# ECS service rollback (AWS)
./infra/scripts/rollback.sh <environment> <service|all> [--verify]
# Docker Compose rollback (local/staging)
./infra/scripts/rollback-compose.sh <previous_tag>
# Database migration rollback
./infra/scripts/rollback-migration.sh <environment> [--migration <name>]
GitHub Secrets Required
| Secret | Description |
|---|---|
| AWS_ACCESS_KEY_ID | IAM user with ECS, RDS, ElastiCache permissions |
| AWS_SECRET_ACCESS_KEY | IAM secret key |
| HIBP_API_KEY | Have I Been Pwned API key |
| RESEND_API_KEY | Resend email API key |
| SENTRY_DSN | Sentry error tracking DSN |
| DATADOG_API_KEY | Datadog monitoring API key |
| GITHUB_TOKEN | Auto-provided by GitHub Actions; the workflow needs packages: write permission |
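A pre-flight check for the secrets in this table might look like the following sketch. It assumes the CI job exposes each secret as an environment variable of the same name, which depends on how the workflow maps them; `required_secrets` and `missing_secrets` are hypothetical names.

```shell
# Names taken from the table above.
required_secrets="AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY HIBP_API_KEY RESEND_API_KEY SENTRY_DSN DATADOG_API_KEY GITHUB_TOKEN"

# Prints the names of required variables that are unset or empty.
missing_secrets() {
  missing=""
  for name in $required_secrets; do
    # Indirect lookup of the variable called $name (POSIX sh has no ${!name}).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  echo "${missing# }"
}
```

A job could fail fast with `[ -z "$(missing_secrets)" ] || exit 1` before running Terraform or a deployment step.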