Add Terraform AWS infrastructure and enhanced CI/CD pipeline (FRE-4574)

- Terraform modules: VPC, ECS Fargate, RDS PostgreSQL, ElastiCache Redis, S3, Secrets Manager, CloudWatch
- Multi-environment support: staging and production configs
- ECS auto-scaling: CPU-based scaling with configurable min/max
- CI/CD: pnpm caching, Docker Buildx, Trivy security scanning, Terraform plan on PR
- Deploy: ECS service updates with automatic rollback on health check failure
- Backup: automated RDS snapshots, S3 versioning, ElastiCache snapshots
- Monitoring: CloudWatch dashboards, CPU/memory/5xx alarms
- Rollback script for manual service rollback
- Infrastructure documentation with architecture overview
This commit is contained in:
Senior Engineer
2026-05-08 02:54:39 -04:00
committed by Michael Freno
parent baa216d62c
commit a0799c0647
19 changed files with 1902 additions and 45 deletions

114
infra/README.md Normal file
View File

@@ -0,0 +1,114 @@
/infra/
├── main.tf # Root module: VPC, ECS, RDS, ElastiCache, S3, Secrets, CloudWatch
├── variables.tf # Input variables with validation
├── outputs.tf # Output values (endpoints, ARNs, URLs)
├── modules/
│ ├── vpc/main.tf # VPC, subnets, IGW, NAT GW, security groups
│ ├── ecs/main.tf # ECS cluster, task definitions, services, ALB, auto-scaling
│ ├── rds/main.tf # RDS PostgreSQL with automated backups
│ ├── elasticache/main.tf # ElastiCache Redis with replication
│ ├── s3/main.tf # S3 buckets: state, artifacts, logs
│ ├── secrets/main.tf # AWS Secrets Manager
│ └── cloudwatch/main.tf # Dashboards, alarms, notifications
├── environments/
│ ├── staging/main.tf # Staging environment config
│ └── production/main.tf # Production environment config
└── scripts/
└── rollback.sh # Manual rollback script
## Quick Start
### Prerequisites
- Terraform >= 1.5.0
- AWS CLI configured with appropriate credentials
- AWS account with ECS, RDS, ElastiCache permissions
### Initialize
```bash
cd infra/environments/staging
terraform init
terraform plan -var-file=terraform.tfvars.example
terraform apply -var-file=terraform.tfvars.example
```
### Deploy via CI/CD
- Push to `main` → deploys to staging
- Create a release → deploys to production
- Health check failure → automatic rollback
## Architecture
### Networking
- VPC with public/private subnets across multiple AZs
- NAT Gateway for outbound traffic from private subnets
- Security groups: ECS → RDS (5432), ECS → ElastiCache (6379)
### Compute
- ECS Fargate for serverless container orchestration
- Application Load Balancer with health checks
- Auto-scaling: CPU-based scaling (70% target)
- Production: 3 replicas per service, min 2, max 10
### Data
- RDS PostgreSQL 16.2 with Multi-AZ (production)
- Automated daily backups, 7-14 day retention
- ElastiCache Redis 7.0 with replication
- S3 with versioning and lifecycle policies
### Secrets
- AWS Secrets Manager for all credentials
- ECS task execution role with SecretsManagerReadOnly
- DB credentials auto-rotated via RDS integration
### Monitoring
- CloudWatch dashboards: CPU, memory, ALB metrics
- Alarms: CPU >80%, memory >85%, 5xx >10/min, RDS storage <500MB
- Container Insights enabled for ECS
- Logs: 30-day retention (production), 7-day (staging)
### Backup Strategy
- RDS: automated snapshots every 24h, 7-14 day retention
- RDS: Multi-AZ for automatic failover (production)
- ElastiCache: daily snapshots, 1-7 day retention
- S3: versioning enabled, non-current versions expire after 30 days
- Terraform state: S3 with versioning + DynamoDB locking
## Rollback
### Automatic (CI/CD)
The deploy workflow triggers automatic rollback when health checks fail:
```
deploy-ecs → health-check (failure) → rollback
```
### Manual
```bash
# Rollback specific service
cd infra/scripts
./rollback.sh staging api
# Rollback all services
./rollback.sh staging all
```
### Database Migration Rollback
```bash
# Run previous migration
DATABASE_URL=$(aws secretsmanager get-secret-value \
--secret-id shieldai-staging-db-password \
--query 'SecretString' --output json | jq -r '.host')
npx prisma migrate resolve --applied <migration_name>
npx prisma migrate deploy
```
## GitHub Secrets Required
| Secret | Description |
|--------|-------------|
| AWS_ACCESS_KEY_ID | IAM user with ECS, RDS, ElastiCache permissions |
| AWS_SECRET_ACCESS_KEY | IAM secret key |
| HIBP_API_KEY | Have I Been Pwned API key |
| RESEND_API_KEY | Resend email API key |
| SENTRY_DSN | Sentry error tracking DSN |
| DATADOG_API_KEY | Datadog monitoring API key |
| GITHUB_TOKEN | Auto-provided, needs write:packages scope |