get to prod tasks

2026-05-26 16:06:34 -04:00
parent 04e839640f
commit 5214412fff
105 changed files with 7447 additions and 38 deletions
--- a/tasks/web-production/18-load-testing.md
+++ b/tasks/web-production/18-load-testing.md
@@ -0,0 +1,78 @@
+# 18. Load & Stress Testing
+
+meta:
+  id: web-production-18
+  feature: web-production
+  priority: P2
+  depends_on: []
+  tags: [testing, performance, production]
+
+objective:
+- Validate application performance under production-like load and identify bottlenecks
+
+deliverables:
+- Load test suite with k6 or Artillery
+- Performance baseline documentation
+- Bottleneck identification report
+- Scaling recommendations
+
+steps:
+1. Set up load testing tool:
+   - Install k6 or Artillery
+   - Create tests/ directory for load tests
+   - Configure test environment (staging)
+2. Write load tests for critical endpoints:
+   - GET / (landing page)
+   - POST /api/trpc/user.login
+   - GET /api/trpc/user.me (authenticated)
+   - GET /api/trpc/darkwatch.getExposures
+   - GET /api/trpc/alerts.getAlerts
+   - WebSocket connection and alert subscription
+3. Define load scenarios:
+   - Baseline: 100 concurrent users, 5 minutes
+   - Target: 1000 concurrent users, 10 minutes
+   - Stress: 5000 concurrent users, 5 minutes
+   - Spike: 0 to 2000 users in 10 seconds
+4. Measure and record:
+   - Response time percentiles (p50, p95, p99)
+   - Error rate
+   - Requests per second (throughput)
+   - CPU and memory usage on server
+   - Database connection pool utilization
+   - Redis memory usage
+5. Identify bottlenecks:
+   - Slow database queries
+   - Memory leaks
+   - Connection pool exhaustion
+   - CPU-bound operations
+6. Document scaling recommendations:
+   - Horizontal scaling (more instances)
+   - Vertical scaling (bigger instances)
+   - Caching improvements
+   - Query optimization
+
+tests:
+- Load: Baseline test passes with <200ms p95
+- Stress: App remains functional at 5000 concurrent users (5x the target load)
+- Spike: p95 latency and error rate return to pre-spike levels within 30 seconds after the spike
+
+acceptance_criteria:
+- Baseline load (100 concurrent) → p95 < 200ms, 0% errors
+- Target load (1000 concurrent) → p95 < 500ms, <1% errors
+- Stress load (5000 concurrent) → no crashes, <5% errors
+- Spike test → p95 latency and error rate return to pre-spike levels within 30 seconds
+- Performance baseline documented with metrics
+- Bottleneck report with actionable recommendations
+- Scaling plan documented
+
+validation:
+- Run the load test suite (k6 or Artillery) against staging → results meet the acceptance criteria
+- Check server metrics during test → CPU <80%, memory <80%
+- Database connections → pool not exhausted
+- Review report → identifies at least 3 bottlenecks, each with a proposed fix
+
+notes:
+- Always test against staging, never production
+- Schedule load tests during low-traffic periods
+- Use Grafana Cloud k6 (formerly k6 Cloud) for distributed load testing if needed
+- Consider using Vercel Analytics for real-user monitoring (RUM)
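
The acceptance criteria in `18-load-testing.md` (p95 and error-rate limits per scenario) could be checked mechanically once raw request samples are exported from a test run. The following is a minimal sketch under stated assumptions, not part of the committed file: the `Sample` shape, the `CRITERIA` table, and the function names are all hypothetical, and the percentile uses the nearest-rank method, which may differ slightly from what k6 or Artillery report.

```typescript
// Hypothetical sketch: the types and names below are illustrative only.
// Limits are copied from the acceptance_criteria section of the task file.

type ScenarioName = "baseline" | "target" | "stress";

interface Sample {
  durationMs: number; // response time of one request
  ok: boolean;        // false if the request errored
}

interface Criteria {
  maxP95Ms: number | null; // null: the task defines no latency limit (stress)
  maxErrorRate: number;    // fraction, e.g. 0.01 for 1%
}

const CRITERIA: Record<ScenarioName, Criteria> = {
  baseline: { maxP95Ms: 200, maxErrorRate: 0 },    // p95 < 200ms, 0% errors
  target:   { maxP95Ms: 500, maxErrorRate: 0.01 }, // p95 < 500ms, <1% errors
  stress:   { maxP95Ms: null, maxErrorRate: 0.05 } // <5% errors
};

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

interface Summary {
  p50: number;
  p95: number;
  p99: number;
  errorRate: number;
}

function summarize(samples: Sample[]): Summary {
  const durations = samples.map((s) => s.durationMs);
  const failed = samples.filter((s) => !s.ok).length;
  return {
    p50: percentile(durations, 50),
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
    errorRate: failed / samples.length,
  };
}

function meetsCriteria(scenario: ScenarioName, summary: Summary): boolean {
  const c = CRITERIA[scenario];
  // Limits are strict ("<"); a run with zero errors always passes the error check.
  const latencyOk = c.maxP95Ms === null || summary.p95 < c.maxP95Ms;
  const errorsOk = summary.errorRate === 0 || summary.errorRate < c.maxErrorRate;
  return latencyOk && errorsOk;
}
```

If k6 is the chosen tool, its built-in `thresholds` option (for example on `http_req_duration` and `http_req_failed`) can enforce the same limits during the run itself, so a helper like this would mainly be useful for post-processing exported results or for Artillery output.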