get to prod tasks
tasks/web-production/18-load-testing.md
@@ -0,0 +1,78 @@
# 18. Load & Stress Testing

meta:
  id: web-production-18
  feature: web-production
  priority: P2
  depends_on: []
  tags: [testing, performance, production]

objective:
- Validate application performance under production-like load and identify bottlenecks

deliverables:
- Load test suite with k6 or Artillery
- Performance baseline documentation
- Bottleneck identification report
- Scaling recommendations

steps:
1. Set up load testing tool:
   - Install k6 or Artillery
   - Create tests/ directory for load tests
   - Configure test environment (staging)
2. Write load tests for critical endpoints:
   - GET / (landing page)
   - POST /api/trpc/user.login
   - GET /api/trpc/user.me (authenticated)
   - GET /api/trpc/darkwatch.getExposures
   - GET /api/trpc/alerts.getAlerts
   - WebSocket connection and alert subscription
3. Define load scenarios:
   - Baseline: 100 concurrent users, 5 minutes
   - Target: 1000 concurrent users, 10 minutes
   - Stress: 5000 concurrent users, 5 minutes
   - Spike: 0 to 2000 users in 10 seconds
4. Measure and record:
   - Response time percentiles (p50, p95, p99)
   - Error rate
   - Requests per second (throughput)
   - CPU and memory usage on server
   - Database connection pool utilization
   - Redis memory usage
5. Identify bottlenecks:
   - Slow database queries
   - Memory leaks
   - Connection pool exhaustion
   - CPU-bound operations
6. Document scaling recommendations:
   - Horizontal scaling (more instances)
   - Vertical scaling (bigger instances)
   - Caching improvements
   - Query optimization
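
The scenarios in step 3 could be written down roughly as follows. This is a minimal sketch, assuming k6 is the tool chosen in step 1: `constant-vus`, `ramping-vus`, `http_req_duration`, and `http_req_failed` are k6's own executor and metric names, while the type names, the `peakVUs` helper, and the spike hold and ramp-down stages are illustrative assumptions that do not come from this task.

```typescript
// Plain-data sketch of k6 load profiles for step 3. No k6 imports are used, so
// this type-checks on its own; in a real k6 script the chosen profile would be
// exported as `options`.

type Stage = { duration: string; target: number };

type Scenario =
  | { executor: "constant-vus"; vus: number; duration: string }
  | { executor: "ramping-vus"; startVUs: number; stages: Stage[] };

type LoadProfile = {
  scenarios: { [name: string]: Scenario };
  thresholds: { [metric: string]: string[] };
};

const baselineProfile: LoadProfile = {
  scenarios: { steady: { executor: "constant-vus", vus: 100, duration: "5m" } },
  thresholds: { http_req_duration: ["p(95)<200"], http_req_failed: ["rate==0"] },
};

const targetProfile: LoadProfile = {
  scenarios: { steady: { executor: "constant-vus", vus: 1000, duration: "10m" } },
  thresholds: { http_req_duration: ["p(95)<500"], http_req_failed: ["rate<0.01"] },
};

const stressProfile: LoadProfile = {
  scenarios: { steady: { executor: "constant-vus", vus: 5000, duration: "5m" } },
  // "No crashes" cannot be expressed as a threshold; only the error budget is.
  thresholds: { http_req_failed: ["rate<0.05"] },
};

const spikeProfile: LoadProfile = {
  scenarios: {
    burst: {
      executor: "ramping-vus",
      startVUs: 0,
      stages: [
        { duration: "10s", target: 2000 }, // from the task: 0 to 2000 users in 10 seconds
        { duration: "1m", target: 2000 }, // assumption: hold at peak
        { duration: "10s", target: 0 }, // assumption: ramp down so recovery can be observed
      ],
    },
  },
  thresholds: {},
};

// Hypothetical helper: highest virtual-user count a profile will reach.
function peakVUs(profile: LoadProfile): number {
  let peak = 0;
  for (const name of Object.keys(profile.scenarios)) {
    const s = profile.scenarios[name];
    if (!s) continue;
    const vus =
      s.executor === "constant-vus"
        ? s.vus
        : s.stages.reduce((max, st) => Math.max(max, st.target), s.startVUs);
    peak = Math.max(peak, vus);
  }
  return peak;
}
```

A helper like `peakVUs` is useful mainly as a sanity check before a run, for example to confirm that the staging environment and the load generator are sized for the profile about to be executed.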

tests:
- Load: Baseline test passes with p95 < 200ms
- Stress: App remains functional under 5x target load (5000 concurrent users)
- Spike: App recovers within 30 seconds after spike
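
"Recovers within 30 seconds" needs a working definition before it can be checked. The sketch below treats the app as recovered at the first post-spike sample after which every later sample is healthy. The `Sample` shape, the `recoverySeconds` function, and the choice of the target-load limits (p95 < 500ms, <1% errors) as the definition of "healthy" are all assumptions made for illustration.

```typescript
// One aggregated measurement window taken during or after the spike.
type Sample = { tSec: number; p95Ms: number; errorRate: number };

type HealthLimits = { maxP95Ms: number; maxErrorRate: number };

// Returns the number of seconds between the end of the spike and the first
// sample from which all later samples stay healthy, or null if the app never
// settles within the recorded window.
function recoverySeconds(
  samples: Sample[],
  spikeEndSec: number,
  limits: HealthLimits
): number | null {
  // filter() copies, so sorting does not mutate the caller's array.
  const after = samples
    .filter((s) => s.tSec >= spikeEndSec)
    .sort((a, b) => a.tSec - b.tSec);

  let recoveredAt: number | null = null;
  for (const s of after) {
    const healthy = s.p95Ms < limits.maxP95Ms && s.errorRate < limits.maxErrorRate;
    if (!healthy) {
      recoveredAt = null; // a relapse resets the clock
    } else if (recoveredAt === null) {
      recoveredAt = s.tSec;
    }
  }
  return recoveredAt === null ? null : recoveredAt - spikeEndSec;
}

// Assumed definition of "healthy": the target-load acceptance limits.
const healthy: HealthLimits = { maxP95Ms: 500, maxErrorRate: 0.01 };

// Made-up sample data, with the spike ending at t = 10s.
const exampleSamples: Sample[] = [
  { tSec: 10, p95Ms: 1800, errorRate: 0.2 },
  { tSec: 20, p95Ms: 900, errorRate: 0.04 },
  { tSec: 30, p95Ms: 450, errorRate: 0.005 },
  { tSec: 40, p95Ms: 300, errorRate: 0 },
];

const exampleRecovery = recoverySeconds(exampleSamples, 10, healthy);
```

With the made-up samples above, the first healthy window starts at t = 30s and no later window relapses, so `exampleRecovery` is 20 and the spike test would pass its 30-second limit.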

acceptance_criteria:
- Baseline load (100 concurrent) → p95 < 200ms, 0% errors
- Target load (1000 concurrent) → p95 < 500ms, <1% errors
- Stress load (5000 concurrent) → no crashes, <5% errors
- Spike test → recovery within 30 seconds
- Performance baseline documented with metrics
- Bottleneck report with actionable recommendations
- Scaling plan documented
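
The three numeric criteria above can be checked mechanically once a run's raw results are available. The following is a sketch under stated assumptions: `RunSummary`, `criteria`, and `evaluate` are hypothetical names, `durationsMs` is assumed to hold one entry per request (failed requests included), and the percentile uses the nearest-rank method, which may differ slightly from the interpolation a given load testing tool reports.

```typescript
// Nearest-rank percentile over an ascending-sorted array of durations.
function percentile(sortedMs: number[], p: number): number {
  if (sortedMs.length === 0) throw new Error("no samples");
  const rank = Math.max(1, Math.ceil((p * sortedMs.length) / 100));
  return sortedMs[rank - 1] as number;
}

type RunSummary = {
  durationsMs: number[]; // one entry per request, failed requests included
  failedRequests: number;
  crashed: boolean; // whether the app process crashed during the run
};

type LoadLevel = "baseline" | "target" | "stress";

type Check = (p95Ms: number, errorRate: number, crashed: boolean) => boolean;

// Limits copied from the acceptance criteria.
const criteria: Record<LoadLevel, Check> = {
  baseline: (p95Ms, errorRate) => p95Ms < 200 && errorRate === 0,
  target: (p95Ms, errorRate) => p95Ms < 500 && errorRate < 0.01,
  stress: (_p95Ms, errorRate, crashed) => !crashed && errorRate < 0.05,
};

function evaluate(level: LoadLevel, run: RunSummary) {
  // slice() copies, so sorting does not mutate the caller's array.
  const sorted = run.durationsMs.slice().sort((a, b) => a - b);
  const p95Ms = percentile(sorted, 95);
  const errorRate = run.failedRequests / run.durationsMs.length;
  return { p95Ms, errorRate, pass: criteria[level](p95Ms, errorRate, run.crashed) };
}
```

Keeping the limits in one table makes it harder for the test suite and this task file to drift apart: if a criterion changes here, only `criteria` needs to follow.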

validation:
- Run k6 against staging → results within acceptable thresholds
- Check server metrics during test → CPU <80%, memory <80%
- Database connections → pool not exhausted
- Review report → identified 3+ bottlenecks with fixes

notes:
- Always test against staging, never production
- Schedule load tests during low-traffic periods
- Use k6 Cloud for distributed load testing if needed
- Consider using Vercel Analytics for real-user monitoring (RUM)