Kordant/tasks/web-production/18-load-testing.md
2026-05-26 16:06:34 -04:00

18. Load & Stress Testing

meta:

  • id: web-production-18
  • feature: web-production
  • priority: P2
  • depends_on: []
  • tags: [testing, performance, production]

objective:

  • Validate application performance under production-like load and identify bottlenecks

deliverables:

  • Load test suite with k6 or Artillery
  • Performance baseline documentation
  • Bottleneck identification report
  • Scaling recommendations

steps:

  1. Set up load testing tool:
    • Install k6 or Artillery
    • Create tests/ directory for load tests
    • Configure test environment (staging)
  2. Write load tests for critical endpoints:
    • GET / (landing page)
    • POST /api/trpc/user.login
    • GET /api/trpc/user.me (authenticated)
    • GET /api/trpc/darkwatch.getExposures
    • GET /api/trpc/alerts.getAlerts
    • WebSocket connection and alert subscription
  3. Define load scenarios:
    • Baseline: 100 concurrent users, 5 minutes
    • Target: 1000 concurrent users, 10 minutes
    • Stress: 5000 concurrent users, 5 minutes
    • Spike: 0 to 2000 users in 10 seconds
  4. Measure and record:
    • Response time percentiles (p50, p95, p99)
    • Error rate
    • Requests per second (throughput)
    • CPU and memory usage on server
    • Database connection pool utilization
    • Redis memory usage
  5. Identify bottlenecks:
    • Slow database queries
    • Memory leaks
    • Connection pool exhaustion
    • CPU-bound operations
  6. Document scaling recommendations:
    • Horizontal scaling (more instances)
    • Vertical scaling (bigger instances)
    • Caching improvements
    • Query optimization
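
The four scenarios in step 3 could be written down as data before any request logic exists. The following is a minimal sketch in the shape of k6 scenario definitions (`constant-vus` and `ramping-vus` executors). The scenario keys, the sequential `startTime` offsets, and the spike's hold and ramp-down stages are assumptions; the task only specifies user counts, durations, and the 10-second ramp.

```typescript
// Sketch of the step 3 load scenarios as k6-style scenario definitions.
// Assumed (not in the task): scenario keys, startTime offsets,
// and the spike's hold and ramp-down stages.
type Stage = { duration: string; target: number };

type Scenario =
  | { executor: "constant-vus"; vus: number; duration: string; startTime: string }
  | { executor: "ramping-vus"; startVUs: number; stages: Stage[]; startTime: string };

type ScenarioName = "baseline" | "target" | "stress" | "spike";

const scenarios: Record<ScenarioName, Scenario> = {
  baseline: { executor: "constant-vus", vus: 100, duration: "5m", startTime: "0s" },
  target: { executor: "constant-vus", vus: 1000, duration: "10m", startTime: "5m" },
  stress: { executor: "constant-vus", vus: 5000, duration: "5m", startTime: "15m" },
  spike: {
    executor: "ramping-vus",
    startVUs: 0,
    startTime: "20m",
    stages: [
      { duration: "10s", target: 2000 }, // 0 to 2000 users in 10 seconds
      { duration: "1m", target: 2000 }, // hold at peak (assumed)
      { duration: "10s", target: 0 }, // ramp down (assumed)
    ],
  },
};

// Peak number of virtual users a scenario reaches.
function peakVus(s: Scenario): number {
  return s.executor === "constant-vus"
    ? s.vus
    : Math.max(s.startVUs, ...s.stages.map((stage) => stage.target));
}

// In a k6 script this object would typically be exported as:
//   export const options = { scenarios };
```

Running each scenario as a separate test run, instead of chaining them with `startTime`, is an equally reasonable choice and keeps the results for each tier isolated.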

tests:

  • Load: Baseline test (100 concurrent users) passes with p95 < 200ms
  • Stress: App remains functional under 5x target load (5000 concurrent users)
  • Spike: App recovers within 30 seconds after the spike ends
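
The spike test needs a concrete definition of "recovers". One possible reading, sketched below, is the first moment after the spike from which every later sample is healthy again. The `HealthSample` shape, the function name, and the choice of the baseline thresholds (p95 < 200ms, no errors) as the definition of healthy are all assumptions made for illustration.

```typescript
// One health measurement taken after the spike ends (hypothetical shape).
interface HealthSample {
  secondsAfterSpike: number;
  p95Ms: number;
  errorRate: number; // fraction of failed requests, 0..1
}

// Returns the time of the first sample from which all later samples are
// healthy, or null if the app never settles within the observed window.
// Assumes samples are sorted by secondsAfterSpike.
function recoveryTimeSeconds(
  samples: HealthSample[],
  maxP95Ms: number,
  maxErrorRate: number,
): number | null {
  const isHealthy = (s: HealthSample): boolean =>
    s.p95Ms < maxP95Ms && s.errorRate <= maxErrorRate;

  // Index of the last unhealthy sample; -1 if every sample is healthy.
  let lastUnhealthy = -1;
  samples.forEach((s, i) => {
    if (!isHealthy(s)) lastUnhealthy = i;
  });

  const firstStable = samples[lastUnhealthy + 1];
  return firstStable ? firstStable.secondsAfterSpike : null;
}

// Illustrative data only, not measured results.
const afterSpike: HealthSample[] = [
  { secondsAfterSpike: 0, p95Ms: 900, errorRate: 0.04 },
  { secondsAfterSpike: 5, p95Ms: 600, errorRate: 0.02 },
  { secondsAfterSpike: 10, p95Ms: 450, errorRate: 0 },
  { secondsAfterSpike: 15, p95Ms: 180, errorRate: 0 },
  { secondsAfterSpike: 20, p95Ms: 150, errorRate: 0 },
];

const recovery = recoveryTimeSeconds(afterSpike, 200, 0);
const recoveredInTime = recovery !== null && recovery <= 30;
```

Requiring all later samples to be healthy avoids counting a brief dip below the threshold as recovery when latency climbs again afterwards.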

acceptance_criteria:

  • Baseline load (100 concurrent) → p95 < 200ms, 0% errors
  • Target load (1000 concurrent) → p95 < 500ms, <1% errors
  • Stress load (5000 concurrent) → no crashes, <5% errors
  • Spike test (0 to 2000 users in 10 seconds) → recovery within 30 seconds
  • Performance baseline documented with metrics
  • Bottleneck report with actionable recommendations
  • Scaling plan documented
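
The numeric criteria above can be checked mechanically against raw results. The sketch below assumes per-request latencies and a failed-request count are available for each run; the names `criteria`, `percentile`, and `meetsCriteria` are illustrative. It uses the nearest-rank percentile method, which may differ slightly from the interpolation a given load testing tool applies.

```typescript
type Criteria = { maxP95Ms: number; maxErrorRate: number };

// Thresholds taken from the acceptance criteria. The stress tier has no
// latency bound in the task, so Infinity stands in as "no limit" (assumption).
const criteria: Record<"baseline" | "target" | "stress", Criteria> = {
  baseline: { maxP95Ms: 200, maxErrorRate: 0 },
  target: { maxP95Ms: 500, maxErrorRate: 0.01 },
  stress: { maxP95Ms: Infinity, maxErrorRate: 0.05 },
};

// Nearest-rank percentile: the smallest value with at least p% of the
// samples at or below it. p is a whole number such as 50, 95, or 99.
function percentile(latenciesMs: number[], p: number): number {
  if (latenciesMs.length === 0) throw new Error("no latency samples");
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// True when the run satisfies both the p95 bound and the error-rate bound.
// A bound of 0 means "no errors at all"; other bounds are strict (<).
function meetsCriteria(latenciesMs: number[], failedCount: number, c: Criteria): boolean {
  const errorRate = failedCount / latenciesMs.length;
  const errorsOk = c.maxErrorRate === 0 ? errorRate === 0 : errorRate < c.maxErrorRate;
  return percentile(latenciesMs, 95) < c.maxP95Ms && errorsOk;
}

// Illustrative data only: 100 requests with latencies 1ms..100ms.
const sampleLatencies = Array.from({ length: 100 }, (_, i) => i + 1);
const baselinePasses = meetsCriteria(sampleLatencies, 0, criteria.baseline);
```

k6 can express similar checks natively through its `thresholds` option (for example `http_req_duration: ['p(95)<200']`), which fails the run when a threshold is crossed.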

validation:

  • Run k6 against staging → results meet the acceptance criteria thresholds
  • Check server metrics during test → CPU <80%, memory <80%
  • Database connections → pool not exhausted
  • Review report → identified 3+ bottlenecks with fixes
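
The server-side checks can be applied to whatever metrics the monitoring stack exports during a run. This is a rough sketch, assuming periodic samples of CPU, memory, and in-use database connections; the `ResourceSample` shape, the function name, and the `poolSize` parameter are hypothetical.

```typescript
// One server metrics reading taken during the test (hypothetical shape).
interface ResourceSample {
  cpuPercent: number;
  memoryPercent: number;
  dbConnectionsInUse: number;
}

// Returns a list of human-readable violations; an empty list means the
// run stayed under the CPU and memory limits and the pool had headroom.
function validateResources(
  samples: ResourceSample[],
  poolSize: number,
  maxPercent = 80,
): string[] {
  if (samples.length === 0) return ["no resource samples collected"];

  const peak = (pick: (s: ResourceSample) => number): number =>
    Math.max(...samples.map(pick));

  const violations: string[] = [];
  const peakCpu = peak((s) => s.cpuPercent);
  const peakMemory = peak((s) => s.memoryPercent);
  const peakConnections = peak((s) => s.dbConnectionsInUse);

  if (peakCpu >= maxPercent) violations.push(`CPU peaked at ${peakCpu}%`);
  if (peakMemory >= maxPercent) violations.push(`memory peaked at ${peakMemory}%`);
  if (peakConnections >= poolSize) {
    violations.push(`connection pool exhausted (${peakConnections}/${poolSize})`);
  }
  return violations;
}

// Illustrative data only, not measured results.
const healthyRun = validateResources(
  [
    { cpuPercent: 45, memoryPercent: 60, dbConnectionsInUse: 8 },
    { cpuPercent: 72, memoryPercent: 65, dbConnectionsInUse: 18 },
  ],
  20,
);
```

Checking peak values rather than averages matches the intent of the validation list, since a short period at full CPU or an exhausted pool is enough to cause errors.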

notes:

  • Always test against staging, never production
  • Schedule load tests during low-traffic periods
  • Use k6 Cloud for distributed load testing if needed
  • Consider using Vercel Analytics for real-user monitoring (RUM)