Kordant/tasks/web-production/18-load-testing.md

# 18. Load & Stress Testing

meta:
  id: web-production-18
  feature: web-production
  priority: P2
  depends_on: []
  tags: [testing, performance, production]

objective:
- Validate application performance under production-like load and identify bottlenecks

deliverables:
- Load test suite with k6 or Artillery
- Performance baseline documentation
- Bottleneck identification report
- Scaling recommendations

steps:
1. Set up load testing tool:
   - Install k6 or Artillery
   - Create tests/ directory for load tests
   - Configure test environment (staging)
2. Write load tests for critical endpoints:
   - GET / (landing page)
   - POST /api/trpc/user.login
   - GET /api/trpc/user.me (authenticated)
   - GET /api/trpc/darkwatch.getExposures
   - GET /api/trpc/alerts.getAlerts
   - WebSocket connection and alert subscription
3. Define load scenarios:
   - Baseline: 100 concurrent users, 5 minutes
   - Target: 1000 concurrent users, 10 minutes
   - Stress: 5000 concurrent users, 5 minutes
   - Spike: 0 to 2000 users in 10 seconds
4. Measure and record:
   - Response time percentiles (p50, p95, p99)
   - Error rate
   - Requests per second (throughput)
   - CPU and memory usage on server
   - Database connection pool utilization
   - Redis memory usage
5. Identify bottlenecks:
   - Slow database queries
   - Memory leaks
   - Connection pool exhaustion
   - CPU-bound operations
6. Document scaling recommendations:
   - Horizontal scaling (more instances)
   - Vertical scaling (bigger instances)
   - Caching improvements
   - Query optimization
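
The scenarios in step 3 could be expressed as k6 `ramping-vus` scenarios. The sketch below assumes k6 is the chosen tool. The helper names (`steadyLoad`, `spike`, `buildOptions`), the ramp-up and ramp-down durations, and the spike hold time are illustrative assumptions that this task does not specify. It only builds the plain options object, so it needs no k6 imports.

```typescript
// One stage of a k6 ramping-vus scenario: ramp to `target` VUs over `duration`.
interface Stage {
  duration: string;
  target: number;
}

interface RampingScenario {
  executor: "ramping-vus";
  startVUs: number;
  stages: Stage[];
}

// Ramp up, hold at `users` for `hold`, then ramp down.
// rampUp and rampDown are assumed values; the task fixes only users and hold time.
function steadyLoad(users: number, hold: string, rampUp = "1m", rampDown = "30s"): RampingScenario {
  return {
    executor: "ramping-vus",
    startVUs: 0,
    stages: [
      { duration: rampUp, target: users },
      { duration: hold, target: users },
      { duration: rampDown, target: 0 },
    ],
  };
}

// Jump from 0 to `users` within `rampUp`, hold briefly (assumed), then drop to 0.
function spike(users: number, rampUp: string, hold = "1m"): RampingScenario {
  return {
    executor: "ramping-vus",
    startVUs: 0,
    stages: [
      { duration: rampUp, target: users },
      { duration: hold, target: users },
      { duration: "10s", target: 0 },
    ],
  };
}

const scenarios: Record<string, RampingScenario> = {
  baseline: steadyLoad(100, "5m"),
  target: steadyLoad(1000, "10m"),
  stress: steadyLoad(5000, "5m"),
  spike: spike(2000, "10s"),
};

// Return k6 options containing a single named scenario, so runs do not overlap.
function buildOptions(name: string): { scenarios: Record<string, RampingScenario> } {
  const scenario = scenarios[name];
  if (scenario === undefined) {
    throw new Error(`Unknown scenario: ${name}`);
  }
  return { scenarios: { [name]: scenario } };
}
```

In a k6 script this might be wired up as `export const options = buildOptions(__ENV.SCENARIO || "baseline")` and selected per run with `k6 run -e SCENARIO=spike`, one scenario at a time.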

tests:
- Load: Baseline test passes with <200ms p95
- Stress: App remains functional (no crashes, <5% errors) at 5000 concurrent users, 5x the target load
- Spike: App recovers within 30 seconds after spike
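
Recovery time after a spike can be computed from periodic metric samples. The sketch below is one possible reading, assuming "recovered" means p95 latency and error rate are back within given limits and stay there for the rest of the run. The `Sample` shape and the `recoverySeconds` name are hypothetical.

```typescript
// One periodic measurement taken during or after the spike.
interface Sample {
  tSec: number;      // seconds since test start
  p95Ms: number;     // p95 response time over the sampling window
  errorRate: number; // fraction of failed requests, 0 to 1
}

// Seconds from the end of the spike until metrics return to healthy levels
// and remain healthy in every later sample. Returns null if that never happens.
function recoverySeconds(
  samples: Sample[],
  spikeEndSec: number,
  maxP95Ms: number,
  maxErrorRate: number,
): number | null {
  const after = samples
    .filter((s) => s.tSec >= spikeEndSec)
    .sort((a, b) => a.tSec - b.tSec);

  let recoveredAt: number | null = null;
  for (const s of after) {
    const healthy = s.p95Ms < maxP95Ms && s.errorRate <= maxErrorRate;
    if (!healthy) {
      recoveredAt = null; // a relapse resets the recovery point
    } else if (recoveredAt === null) {
      recoveredAt = s.tSec;
    }
  }
  return recoveredAt === null ? null : recoveredAt - spikeEndSec;
}
```

The spike test would then pass when the returned value is not null and is at most 30.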

acceptance_criteria:
- Baseline load (100 concurrent) → p95 < 200ms, 0% errors
- Target load (1000 concurrent) → p95 < 500ms, <1% errors
- Stress load (5000 concurrent) → no crashes, <5% errors
- Spike test → p95 and error rate back to baseline levels within 30 seconds of the spike ending
- Performance baseline documented with metrics
- Bottleneck report with actionable recommendations
- Scaling plan documented
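
If k6 is used, the numeric criteria above could be encoded as k6 thresholds so a run fails automatically when they are missed. This is a sketch under that assumption: `http_req_duration` and `http_req_failed` are k6 built-in metrics and `scenario` is a k6 system tag, while the `Criterion` type, the `criteria` table, and `buildThresholds` are hypothetical names. The scenario keys are assumed to match the scenario names used in the k6 options.

```typescript
// Numeric limits for one scenario; a missing field means no limit is defined.
interface Criterion {
  p95Ms?: number;        // upper bound on p95 response time, in milliseconds
  maxErrorRate?: number; // upper bound on failed-request fraction, 0 to 1
}

// Limits taken from the acceptance criteria for each load level.
const criteria: Record<string, Criterion> = {
  baseline: { p95Ms: 200, maxErrorRate: 0 },
  target: { p95Ms: 500, maxErrorRate: 0.01 },
  stress: { maxErrorRate: 0.05 }, // "no crashes" must be checked separately
};

// Build a k6 `thresholds` object with one entry per scenario-tagged metric.
function buildThresholds(table: Record<string, Criterion>): Record<string, string[]> {
  const thresholds: Record<string, string[]> = {};
  for (const [scenario, limit] of Object.entries(table)) {
    if (limit.p95Ms !== undefined) {
      thresholds[`http_req_duration{scenario:${scenario}}`] = [`p(95)<${limit.p95Ms}`];
    }
    if (limit.maxErrorRate !== undefined) {
      // A limit of 0 means no failed requests are tolerated at all.
      const expr = limit.maxErrorRate === 0 ? "rate==0" : `rate<${limit.maxErrorRate}`;
      thresholds[`http_req_failed{scenario:${scenario}}`] = [expr];
    }
  }
  return thresholds;
}
```

The result would be placed under the `thresholds` key of the k6 options object. Thresholds cannot express "no crashes", so that criterion still needs a check against server logs or process uptime.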

validation:
- Run the load suite (k6 or Artillery) against staging → results within the acceptance-criteria thresholds
- Check server metrics during test → CPU <80%, memory <80%
- Database connections → pool not exhausted
- Review report → at least 3 bottlenecks identified, each with a proposed fix
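
The server-side checks above can be automated against metrics sampled during a run. The following is a minimal sketch, assuming the samples have already been exported from whatever monitoring is in place. The `MetricSample` shape and `findViolations` name are hypothetical.

```typescript
// One sample of server-side metrics captured while the load test runs.
interface MetricSample {
  cpuPct: number;    // server CPU utilization, 0 to 100
  memPct: number;    // server memory utilization, 0 to 100
  poolInUse: number; // database connections currently checked out
  poolSize: number;  // configured maximum size of the connection pool
}

// List every sample that breaches the CPU or memory ceiling or exhausts the pool.
function findViolations(samples: MetricSample[], maxPct = 80): string[] {
  const violations: string[] = [];
  samples.forEach((s, i) => {
    if (s.cpuPct >= maxPct) {
      violations.push(`sample ${i}: CPU ${s.cpuPct}% >= ${maxPct}%`);
    }
    if (s.memPct >= maxPct) {
      violations.push(`sample ${i}: memory ${s.memPct}% >= ${maxPct}%`);
    }
    if (s.poolInUse >= s.poolSize) {
      violations.push(`sample ${i}: pool exhausted (${s.poolInUse}/${s.poolSize})`);
    }
  });
  return violations;
}
```

An empty result means the CPU, memory, and connection pool checks all passed for the sampled period.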

notes:
- Always test against staging, never production
- Schedule load tests during low-traffic periods if staging shares any infrastructure (database, Redis, network) with production
- Use k6 Cloud for distributed load testing if needed
- Consider using Vercel Analytics for real-user monitoring (RUM)