# 18. Load & Stress Testing

meta:
  id: web-production-18
  feature: web-production
  priority: P2
  depends_on: []
  tags: [testing, performance, production]

objective:
  - Validate application performance under production-like load and identify bottlenecks

deliverables:
  - Load test suite with k6 or Artillery
  - Performance baseline documentation
  - Bottleneck identification report
  - Scaling recommendations

steps:
  1. Set up the load testing tool:
     - Install k6 or Artillery
     - Create a tests/ directory for the load tests
     - Configure the test environment (staging)
  2. Write load tests for critical endpoints:
     - GET / (landing page)
     - POST /api/trpc/user.login
     - GET /api/trpc/user.me (authenticated)
     - GET /api/trpc/darkwatch.getExposures
     - GET /api/trpc/alerts.getAlerts
     - WebSocket connection and alert subscription
  3. Define load scenarios:
     - Baseline: 100 concurrent users, 5 minutes
     - Target: 1000 concurrent users, 10 minutes
     - Stress: 5000 concurrent users, 5 minutes
     - Spike: 0 to 2000 users in 10 seconds
  4. Measure and record:
     - Response time percentiles (p50, p95, p99)
     - Error rate
     - Requests per second (throughput)
     - CPU and memory usage on the server
     - Database connection pool utilization
     - Redis memory usage
  5. Identify bottlenecks:
     - Slow database queries
     - Memory leaks
     - Connection pool exhaustion
     - CPU-bound operations
  6. Document scaling recommendations:
     - Horizontal scaling (more instances)
     - Vertical scaling (bigger instances)
     - Caching improvements
     - Query optimization
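
The scenarios in step 3 could be expressed as data in roughly the following way. This is a minimal sketch, assuming k6 is the chosen tool and that one k6 virtual user stands in for one concurrent user. The executor names (`constant-vus`, `ramping-vus`) and their fields are k6's; the names `loadScenarios` and `buildOptions` are hypothetical, and the spike hold and ramp-down durations are assumptions, since the spec only gives the ramp-up. In a k6 script the result would typically be exported as `options`, with the scenario name read from something like `__ENV.SCENARIO`.

```typescript
// Sketch only: numbers mirror the scenario list above.
type Stage = { duration: string; target: number };

type Scenario =
  | { executor: "constant-vus"; vus: number; duration: string }
  | { executor: "ramping-vus"; startVUs: number; stages: Stage[] };

type ScenarioName = "baseline" | "target" | "stress" | "spike";

const loadScenarios: Record<ScenarioName, Scenario> = {
  baseline: { executor: "constant-vus", vus: 100, duration: "5m" },
  target: { executor: "constant-vus", vus: 1000, duration: "10m" },
  stress: { executor: "constant-vus", vus: 5000, duration: "5m" },
  spike: {
    executor: "ramping-vus",
    startVUs: 0,
    stages: [
      { duration: "10s", target: 2000 }, // from the spec: 0 to 2000 users in 10 seconds
      { duration: "1m", target: 2000 }, // ASSUMPTION: hold time is not specified
      { duration: "10s", target: 0 }, // ASSUMPTION: ramp-down is not specified
    ],
  },
};

// Build an options object containing only the requested scenario, so that
// each scenario runs on its own rather than in parallel with the others.
function buildOptions(name: ScenarioName): { scenarios: Record<string, Scenario> } {
  return { scenarios: { [name]: loadScenarios[name] } };
}

console.log(JSON.stringify(buildOptions("spike"), null, 2));
```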

tests:
  - Load: Baseline test passes with p95 < 200ms
  - Stress: App remains functional under 5x target load (5000 concurrent users)
  - Spike: App recovers within 30 seconds after spike
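
The spike test needs a working definition of "recovers". One possible definition, sketched below, is the time from the end of the spike until the first of several consecutive healthy samples. Everything here is an assumption rather than part of the spec: the `Sample` shape, the `recoverySeconds` helper, the rule that a healthy sample has p95 under a chosen limit and no errors, and the default of five consecutive samples.

```typescript
// Hypothetical per-interval measurement exported from the load tool or monitoring.
type Sample = { tSec: number; p95Ms: number; errorRate: number };

// Returns the seconds between spikeEndSec and the start of the first run of
// `stableSamples` consecutive healthy samples, or null if no such run exists.
function recoverySeconds(
  samples: Sample[],
  spikeEndSec: number,
  p95LimitMs: number,
  stableSamples = 5,
): number | null {
  const after = samples
    .filter((s) => s.tSec >= spikeEndSec)
    .sort((a, b) => a.tSec - b.tSec);
  let runStart: number | null = null;
  let runLength = 0;
  for (const s of after) {
    const healthy = s.p95Ms < p95LimitMs && s.errorRate === 0;
    if (!healthy) {
      // Any unhealthy sample resets the run.
      runStart = null;
      runLength = 0;
      continue;
    }
    if (runStart === null) runStart = s.tSec;
    runLength += 1;
    if (runLength >= stableSamples) return runStart - spikeEndSec;
  }
  return null;
}

// Synthetic data: latency stays high for 12 seconds after the spike, then settles.
const demo: Sample[] = Array.from({ length: 40 }, (_, t) => ({
  tSec: t,
  p95Ms: t < 12 ? 900 : 150,
  errorRate: 0,
}));
const recovered = recoverySeconds(demo, 0, 200);
console.log(recovered !== null && recovered <= 30 ? "recovered in time" : "too slow");
```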

acceptance_criteria:
  - Baseline load (100 concurrent) → p95 < 200ms, 0% errors
  - Target load (1000 concurrent) → p95 < 500ms, <1% errors
  - Stress load (5000 concurrent) → no crashes, <5% errors
  - Spike test → recovery within 30 seconds
  - Performance baseline documented with metrics
  - Bottleneck report with actionable recommendations
  - Scaling plan documented
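
The numeric criteria above can be checked mechanically once raw request timings are available. The sketch below is one way to do that, assuming response times and failure counts have been exported from the load tool. The `RunResult` shape and the `percentile` and `evaluate` helpers are hypothetical, and the nearest-rank percentile used here may differ slightly from what a given tool reports, since tools often interpolate. "No crashes" is not covered and still needs a separate check.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

type Criterion = {
  p95LimitMs?: number; // omitted where the criteria set no latency bound
  errorRateOk: (rate: number) => boolean;
};

type LoadLevel = "baseline" | "target" | "stress";

// Limits copied from the acceptance criteria.
const criteria: Record<LoadLevel, Criterion> = {
  baseline: { p95LimitMs: 200, errorRateOk: (r) => r === 0 },
  target: { p95LimitMs: 500, errorRateOk: (r) => r < 0.01 },
  stress: { errorRateOk: (r) => r < 0.05 },
};

// Hypothetical export format: one duration per request plus failure counts.
type RunResult = { durationsMs: number[]; failed: number; total: number };

function evaluate(level: LoadLevel, run: RunResult) {
  const c = criteria[level];
  const p95 = percentile(run.durationsMs, 95);
  const errorRate = run.total === 0 ? 0 : run.failed / run.total;
  const latencyOk = c.p95LimitMs === undefined || p95 < c.p95LimitMs;
  return { p95, errorRate, pass: latencyOk && c.errorRateOk(errorRate) };
}

// Synthetic example: 100 requests taking 1..100 ms, none failed.
const durations = Array.from({ length: 100 }, (_, i) => i + 1);
console.log(evaluate("baseline", { durationsMs: durations, failed: 0, total: 100 }));
```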

validation:
  - Run k6 against staging → results within acceptable thresholds
  - Check server metrics during test → CPU <80%, memory <80%
  - Database connections → pool not exhausted
  - Review report → at least 3 bottlenecks identified, each with a proposed fix

notes:
  - Always test against staging, never production
  - Schedule load tests during low-traffic periods
  - Use k6 Cloud for distributed load testing if needed
  - Consider using Vercel Analytics for real-user monitoring (RUM)