Add Terraform AWS infrastructure and enhanced CI/CD pipeline (FRE-4574)
- Terraform modules: VPC, ECS Fargate, RDS PostgreSQL, ElastiCache Redis, S3, Secrets Manager, CloudWatch - Multi-environment support: staging and production configs - ECS auto-scaling: CPU-based scaling with configurable min/max - CI/CD: pnpm caching, Docker Buildx, Trivy security scanning, Terraform plan on PR - Deploy: ECS service updates with automatic rollback on health check failure - Backup: automated RDS snapshots, S3 versioning, ElastiCache snapshots - Monitoring: CloudWatch dashboards, CPU/memory/5xx alarms - Rollback script for manual service rollback - Infrastructure documentation with architecture overview
This commit is contained in:
114
infra/README.md
Normal file
114
infra/README.md
Normal file
@@ -0,0 +1,114 @@
|
||||
/infra/
|
||||
├── main.tf # Root module: VPC, ECS, RDS, ElastiCache, S3, Secrets, CloudWatch
|
||||
├── variables.tf # Input variables with validation
|
||||
├── outputs.tf # Output values (endpoints, ARNs, URLs)
|
||||
├── modules/
|
||||
│ ├── vpc/main.tf # VPC, subnets, IGW, NAT GW, security groups
|
||||
│ ├── ecs/main.tf # ECS cluster, task definitions, services, ALB, auto-scaling
|
||||
│ ├── rds/main.tf # RDS PostgreSQL with automated backups
|
||||
│ ├── elasticache/main.tf # ElastiCache Redis with replication
|
||||
│ ├── s3/main.tf # S3 buckets: state, artifacts, logs
|
||||
│ ├── secrets/main.tf # AWS Secrets Manager
|
||||
│ └── cloudwatch/main.tf # Dashboards, alarms, notifications
|
||||
├── environments/
|
||||
│ ├── staging/main.tf # Staging environment config
|
||||
│ └── production/main.tf # Production environment config
|
||||
└── scripts/
|
||||
└── rollback.sh # Manual rollback script
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Prerequisites
|
||||
- Terraform >= 1.5.0
|
||||
- AWS CLI configured with appropriate credentials
|
||||
- AWS account with ECS, RDS, ElastiCache permissions
|
||||
|
||||
### Initialize
|
||||
```bash
|
||||
cd infra/environments/staging
|
||||
terraform init
|
||||
terraform plan -var-file=terraform.tfvars.example
|
||||
terraform apply -var-file=terraform.tfvars.example
|
||||
```
|
||||
|
||||
### Deploy via CI/CD
|
||||
- Push to `main` → deploys to staging
|
||||
- Create a release → deploys to production
|
||||
- Health check failure → automatic rollback
|
||||
|
||||
## Architecture
|
||||
|
||||
### Networking
|
||||
- VPC with public/private subnets across multiple AZs
|
||||
- NAT Gateway for outbound traffic from private subnets
|
||||
- Security groups: ECS → RDS (5432), ECS → ElastiCache (6379)
|
||||
|
||||
### Compute
|
||||
- ECS Fargate for serverless container orchestration
|
||||
- Application Load Balancer with health checks
|
||||
- Auto-scaling: CPU-based scaling (70% target)
|
||||
- Production: 3 replicas per service, min 2, max 10
|
||||
|
||||
### Data
|
||||
- RDS PostgreSQL 16.2 with Multi-AZ (production)
|
||||
- Automated daily backups, 7-14 day retention
|
||||
- ElastiCache Redis 7.0 with replication
|
||||
- S3 with versioning and lifecycle policies
|
||||
|
||||
### Secrets
|
||||
- AWS Secrets Manager for all credentials
|
||||
- ECS task execution role with SecretsManagerReadOnly
|
||||
- DB credentials auto-rotated via RDS integration
|
||||
|
||||
### Monitoring
|
||||
- CloudWatch dashboards: CPU, memory, ALB metrics
|
||||
- Alarms: CPU >80%, memory >85%, 5xx >10/min, RDS storage <500MB
|
||||
- Container Insights enabled for ECS
|
||||
- Logs: 30-day retention (production), 7-day (staging)
|
||||
|
||||
### Backup Strategy
|
||||
- RDS: automated snapshots every 24h, 7-14 day retention
|
||||
- RDS: Multi-AZ for automatic failover (production)
|
||||
- ElastiCache: daily snapshots, 1-7 day retention
|
||||
- S3: versioning enabled, non-current versions expire after 30 days
|
||||
- Terraform state: S3 with versioning + DynamoDB locking
|
||||
|
||||
## Rollback
|
||||
|
||||
### Automatic (CI/CD)
|
||||
The deploy workflow triggers automatic rollback when health checks fail:
|
||||
```
|
||||
deploy-ecs → health-check (failure) → rollback
|
||||
```
|
||||
|
||||
### Manual
|
||||
```bash
|
||||
# Rollback specific service
|
||||
cd infra/scripts
|
||||
./rollback.sh staging api
|
||||
|
||||
# Rollback all services
|
||||
./rollback.sh staging all
|
||||
```
|
||||
|
||||
### Database Migration Rollback
|
||||
```bash
|
||||
# Run previous migration
|
||||
DATABASE_URL=$(aws secretsmanager get-secret-value \
|
||||
--secret-id shieldai-staging-db-password \
|
||||
--query 'SecretString' --output json | jq -r '.host')
|
||||
|
||||
npx prisma migrate resolve --applied <migration_name>
|
||||
npx prisma migrate deploy
|
||||
```
|
||||
|
||||
## GitHub Secrets Required
|
||||
| Secret | Description |
|
||||
|--------|-------------|
|
||||
| AWS_ACCESS_KEY_ID | IAM user with ECS, RDS, ElastiCache permissions |
|
||||
| AWS_SECRET_ACCESS_KEY | IAM secret key |
|
||||
| HIBP_API_KEY | Have I Been Pwned API key |
|
||||
| RESEND_API_KEY | Resend email API key |
|
||||
| SENTRY_DSN | Sentry error tracking DSN |
|
||||
| DATADOG_API_KEY | Datadog monitoring API key |
|
||||
| GITHUB_TOKEN | Auto-provided, needs write:packages scope |
|
||||
Reference in New Issue
Block a user