Kordant/tasks/web-production/11-metrics-dashboards.md
2026-05-26 16:06:34 -04:00

11. Application Metrics & Dashboards

meta:

  • id: web-production-11
  • feature: web-production
  • priority: P2
  • depends_on: []
  • tags: [observability, metrics, production]

objective:

  • Collect and visualize application metrics for performance monitoring and capacity planning

deliverables:

  • Prometheus metrics endpoint
  • Custom business metrics
  • Grafana or Datadog dashboards
  • Alerting on metric thresholds

steps:

  1. Add metrics collection:
    • Install prom-client for Node.js metrics
    • Create web/src/server/lib/metrics.ts
    • Expose /metrics endpoint for Prometheus scraping
  2. Collect standard metrics:
    • HTTP request duration (histogram)
    • HTTP request count (counter, labeled by status code and route template rather than raw URL, to keep label cardinality bounded)
    • Active connections (gauge)
    • Memory usage (gauge)
    • Event loop lag (gauge)
  3. Collect business metrics:
    • Signup rate (counter)
    • Login success/failure rate (counter)
    • Subscription conversions (counter)
    • DarkWatch scan completions (counter)
    • Alert generation rate (counter)
    • Average threat score (gauge)
  4. Set up dashboards:
    • Grafana dashboard or Datadog dashboard
    • Request latency percentiles (p50, p95, p99)
    • Error rate over time
    • Business funnel (landing → signup → subscribe)
    • Infrastructure health (CPU, memory, DB connections)
  5. Configure alerts:
    • p99 latency > 500ms for 5 minutes
    • Error rate > 1% for 2 minutes
    • Memory usage > 80% for 10 minutes
    • DB connection pool > 90% for 5 minutes
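
The collection steps above could be sketched roughly as follows for web/src/server/lib/metrics.ts. This is a minimal sketch, assuming prom-client is installed. The metric names, the label names, and the helpers `recordRequest` and `renderMetrics` are hypothetical choices made for illustration and are not taken from the task.

```typescript
// Sketch of web/src/server/lib/metrics.ts (assumes the prom-client package).
import { Counter, Gauge, Histogram, Registry, collectDefaultMetrics } from "prom-client";

// A dedicated registry keeps these metrics separate from prom-client's global one.
export const registry = new Registry();

// Default metrics cover process memory and Node.js event loop lag, among others.
collectDefaultMetrics({ register: registry });

// Standard HTTP metrics. Names and labels are illustrative assumptions.
export const httpRequestDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.1, 0.5, 1, 2, 5], // bucket bounds from the acceptance criteria
  registers: [registry],
});

export const httpRequestsTotal = new Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "route", "status_code"],
  registers: [registry],
});

export const activeConnections = new Gauge({
  name: "http_active_connections",
  help: "Number of currently open HTTP connections",
  registers: [registry],
});

// Business metrics. These names are hypothetical.
export const signupsTotal = new Counter({
  name: "signups_total",
  help: "Completed signups",
  registers: [registry],
});

export const loginAttemptsTotal = new Counter({
  name: "login_attempts_total",
  help: "Login attempts by outcome (success or failure)",
  labelNames: ["outcome"],
  registers: [registry],
});

// Record one finished request in both the histogram and the counter.
export function recordRequest(
  method: string,
  route: string,
  statusCode: number,
  seconds: number,
): void {
  const labels = { method, route, status_code: String(statusCode) };
  httpRequestDuration.observe(labels, seconds);
  httpRequestsTotal.inc(labels);
}

// Produce the payload a /metrics route handler would send back to Prometheus.
export async function renderMetrics(): Promise<{ contentType: string; body: string }> {
  return { contentType: registry.contentType, body: await registry.metrics() };
}
```

A /metrics handler would call `renderMetrics()` and return the body with the given content type. Request middleware would call `recordRequest` when a response finishes, and business code would call, for example, `signupsTotal.inc()` or `loginAttemptsTotal.inc({ outcome: "failure" })`.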

tests:

  • Unit: Test metrics increment correctly
  • Integration: Verify /metrics endpoint returns valid Prometheus format
  • Dashboard: Confirm all panels show data
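
For the integration test, one possible way to check that a response body looks like Prometheus text exposition output is a line-level check such as the one below. It is a rough sketch, not a full parser: it accepts comment lines and sample lines of the form `name{labels} value [timestamp]`, and the function name `findInvalidLines` is hypothetical.

```typescript
// Rough line-level sanity check for the Prometheus text exposition format.
const NAME = /[a-zA-Z_:][a-zA-Z0-9_:]*/;
const LABELS = /\{(?:[a-zA-Z_][a-zA-Z0-9_]*="(?:[^"\\]|\\.)*",?)*\}/;
const VALUE = /(?:NaN|[-+]?Inf|[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)/;

// name, optional label set, value, optional integer timestamp
const SAMPLE = new RegExp(
  `^${NAME.source}(?:${LABELS.source})?\\s+${VALUE.source}(?:\\s+-?\\d+)?$`,
);

// Returns every non-empty, non-comment line that does not look like a sample.
export function findInvalidLines(body: string): string[] {
  return body
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith("#"))
    .filter((line) => !SAMPLE.test(line));
}
```

An integration test could fetch /metrics, pass the body to `findInvalidLines`, and assert that the returned array is empty.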

acceptance_criteria:

  • /metrics endpoint serving valid Prometheus exposition format
  • Request duration histogram with 0.1, 0.5, 1, 2, 5 second buckets
  • Business metrics visible in dashboard
  • Alert fires when p99 latency exceeds 500ms for 5 minutes
  • Dashboard refreshes every 10 seconds with live data (requires a Prometheus scrape interval of 10 seconds or less)
  • Metrics retention for 30 days
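
The bucket bounds matter for the latency criteria because percentiles are estimated from cumulative bucket counts, not measured directly. The sketch below approximates that estimate with linear interpolation inside the bucket that contains the target rank, in the spirit of PromQL's `histogram_quantile`. The `Bucket` type and the `estimateQuantile` function are hypothetical names used only for illustration.

```typescript
// One cumulative histogram bucket: number of observations <= le.
interface Bucket {
  le: number; // upper bound in seconds; Infinity for the +Inf bucket
  cumulativeCount: number;
}

// Estimate quantile q (0..1) by interpolating linearly within the matching bucket.
export function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1 || buckets.length < 2) return NaN;
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].cumulativeCount;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = sorted.findIndex((b) => b.cumulativeCount >= rank);

  // Rank falls in the +Inf bucket: report the highest finite bound.
  if (i === sorted.length - 1) return sorted[sorted.length - 2].le;

  const lower = i === 0 ? 0 : sorted[i - 1].le;
  const below = i === 0 ? 0 : sorted[i - 1].cumulativeCount;
  const inBucket = sorted[i].cumulativeCount - below;
  if (inBucket === 0) return lower;
  return lower + (sorted[i].le - lower) * ((rank - below) / inBucket);
}

// Made-up counts for 100 requests, using the buckets from the criteria.
export const sampleBuckets: Bucket[] = [
  { le: 0.1, cumulativeCount: 90 },
  { le: 0.5, cumulativeCount: 97 },
  { le: 1, cumulativeCount: 99 },
  { le: 2, cumulativeCount: 100 },
  { le: 5, cumulativeCount: 100 },
  { le: Infinity, cumulativeCount: 100 },
];
```

With those made-up counts, `estimateQuantile(0.99, sampleBuckets)` lands at about 1 second, because the 99th request sits at the top of the 0.5 to 1 second bucket. Having 0.5 as a bucket bound is what lets a 500ms threshold be evaluated without interpolating across it.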

validation:

  • curl /metrics → valid Prometheus output
  • Grafana dashboard shows request latency graph
  • Trigger slow endpoint → alert fires once p99 has stayed above 500ms for 5 minutes
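
Each alert in step 5 pairs a threshold with a hold duration, so a single slow request should not fire it. The sketch below models that behaviour in a simplified form, loosely following how a Prometheus alerting rule with a `for` clause moves from pending to firing. The `HoldAlert` class is a hypothetical illustration, not part of any library.

```typescript
type AlertState = "inactive" | "pending" | "firing";

// Fires only after the value has stayed above the threshold for holdMs.
export class HoldAlert {
  private readonly threshold: number;
  private readonly holdMs: number;
  private pendingSince: number | null = null;

  constructor(threshold: number, holdMs: number) {
    this.threshold = threshold;
    this.holdMs = holdMs;
  }

  // Evaluate one sample taken at nowMs (milliseconds on any monotonic clock).
  evaluate(value: number, nowMs: number): AlertState {
    if (value <= this.threshold) {
      // Condition cleared: reset the hold timer.
      this.pendingSince = null;
      return "inactive";
    }
    if (this.pendingSince === null) {
      this.pendingSince = nowMs;
    }
    return nowMs - this.pendingSince >= this.holdMs ? "firing" : "pending";
  }
}
```

For the p99 latency alert this would be `new HoldAlert(0.5, 5 * 60 * 1000)`, evaluated once per scrape. The validation step then needs the slow endpoint to stay slow for the whole window before the alert is expected.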

notes:

  • Prometheus + Grafana is open source and cost-effective
  • Datadog is easier but more expensive
  • Consider using Vercel Analytics if deployed on Vercel