get to prod tasks

2026-05-26 16:06:34 -04:00
parent 04e839640f
commit 5214412fff
105 changed files with 7447 additions and 38 deletions
--- a/tasks/web-production/11-metrics-dashboards.md
+++ b/tasks/web-production/11-metrics-dashboards.md
@@ -0,0 +1,70 @@
+# 11. Application Metrics & Dashboards
+
+meta:
+  id: web-production-11
+  feature: web-production
+  priority: P2
+  depends_on: []
+  tags: [observability, metrics, production]
+
+objective:
+- Collect and visualize application metrics for performance monitoring and capacity planning
+
+deliverables:
+- Prometheus metrics endpoint
+- Custom business metrics
+- Grafana or Datadog dashboards
+- Alerting on metric thresholds
+
+steps:
+1. Add metrics collection:
+   - Install prom-client for Node.js metrics
+   - Create web/src/server/lib/metrics.ts
+   - Expose /metrics endpoint for Prometheus scraping
+2. Collect standard metrics:
+   - HTTP request duration (histogram)
+   - HTTP request count (counter, labeled by status code and endpoint)
+   - Active connections (gauge)
+   - Memory usage (gauge)
+   - Event loop lag (gauge)
+3. Collect business metrics:
+   - Signups (counter; rate derived at query time)
+   - Login successes and failures (counter, labeled by outcome)
+   - Subscription conversions (counter)
+   - DarkWatch scan completions (counter)
+   - Alerts generated (counter; rate derived at query time)
+   - Average threat score (gauge)
+4. Set up dashboards:
+   - Grafana or Datadog dashboard
+   - Request latency percentiles (p50, p95, p99)
+   - Error rate over time
+   - Business funnel (landing → signup → subscribe)
+   - Infrastructure health (CPU, memory, DB connections)
+5. Configure alerts:
+   - p99 latency > 500ms for 5 minutes
+   - Error rate > 1% for 2 minutes
+   - Memory usage > 80% for 10 minutes
+   - DB connection pool utilization > 90% for 5 minutes
+
+tests:
+- Unit: Verify that metrics increment correctly
+- Integration: Verify /metrics endpoint returns valid Prometheus format
+- Dashboard: Confirm all panels show data
+
+acceptance_criteria:
+- /metrics endpoint serves valid Prometheus exposition format
+- Request duration histogram with 0.1, 0.5, 1, 2, 5 second buckets
+- Business metrics visible in dashboard
+- Alert fires when p99 latency exceeds 500ms for 5 minutes
+- Dashboard refreshes every 10 seconds with live data
+- Metrics retained for 30 days
+
+validation:
+- `curl http://<host>/metrics` → valid Prometheus output
+- Grafana dashboard shows request latency graph
+- Trigger slow endpoint → alert fires once p99 latency has stayed above 500ms for 5 minutes
+
+notes:
+- Prometheus + Grafana is open source and cost-effective
+- Datadog is easier to set up but more expensive
+- Consider using Vercel Analytics if deployed on Vercel
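
Steps 1 and 2 and the histogram bucket criterion in the task above can be illustrated without any dependencies. The following TypeScript is a minimal sketch, not the contents of web/src/server/lib/metrics.ts: the real task would presumably use prom-client, and the class names, metric names, and label names here are all assumptions made for illustration. It shows how a labeled counter and a cumulative-bucket histogram become the text a /metrics endpoint returns.

```typescript
// Dependency-free sketch. Counter, Histogram, and all metric and label names
// are hypothetical; prom-client would normally supply equivalents.
type Labels = Record<string, string>;

// Render labels as {a="x",b="y"}, with keys sorted so each series has one identity.
// Assumes label values contain no quotes, backslashes, or newlines (no escaping).
function formatLabels(labels: Labels): string {
  const keys = Object.keys(labels).sort();
  if (keys.length === 0) return "";
  return "{" + keys.map((k) => `${k}="${labels[k]}"`).join(",") + "}";
}

class Counter {
  private readonly values = new Map<string, number>();
  private readonly name: string;
  private readonly help: string;

  constructor(name: string, help: string) {
    this.name = name;
    this.help = help;
  }

  // Counters only go up; rates are computed later by the query layer.
  inc(labels: Labels = {}, by = 1): void {
    const key = formatLabels(labels);
    this.values.set(key, (this.values.get(key) ?? 0) + by);
  }

  render(): string {
    const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} counter`];
    this.values.forEach((value, key) => lines.push(`${this.name}${key} ${value}`));
    return lines.join("\n");
  }
}

class Histogram {
  private readonly name: string;
  private readonly help: string;
  private readonly buckets: number[];
  private readonly counts: number[];
  private sum = 0;
  private total = 0;

  constructor(name: string, help: string, buckets: number[]) {
    this.name = name;
    this.help = help;
    this.buckets = buckets;
    this.counts = buckets.map(() => 0);
  }

  observe(seconds: number): void {
    this.sum += seconds;
    this.total += 1;
    // Buckets are cumulative: an observation counts toward every bound >= its value.
    this.buckets.forEach((le, i) => {
      if (seconds <= le) this.counts[i] = (this.counts[i] ?? 0) + 1;
    });
  }

  render(): string {
    const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} histogram`];
    this.buckets.forEach((le, i) =>
      lines.push(`${this.name}_bucket{le="${le}"} ${this.counts[i]}`),
    );
    lines.push(`${this.name}_bucket{le="+Inf"} ${this.total}`);
    lines.push(`${this.name}_sum ${this.sum}`);
    lines.push(`${this.name}_count ${this.total}`);
    return lines.join("\n");
  }
}

const httpRequests = new Counter("http_requests_total", "Total HTTP requests");
const httpDuration = new Histogram(
  "http_request_duration_seconds",
  "HTTP request duration in seconds",
  [0.1, 0.5, 1, 2, 5], // bucket bounds from the acceptance criteria
);

httpRequests.inc({ status: "200", route: "/login" });
httpRequests.inc({ status: "200", route: "/login" });
httpRequests.inc({ status: "500", route: "/login" });
[0.25, 0.5, 3].forEach((s) => httpDuration.observe(s));

// Body a /metrics handler would send with Content-Type "text/plain; version=0.0.4".
const exposition = [httpRequests.render(), httpDuration.render()].join("\n") + "\n";
console.log(exposition);
```

One design point the sketch makes visible: the 500ms alert threshold coincides with the 0.5 second bucket bound, so the share of requests slower than the threshold can be read directly from the bucket counts rather than estimated by interpolation inside a wider bucket.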