get to prod tasks

2026-05-26 16:06:34 -04:00
parent 04e839640f
commit 5214412fff
105 changed files with 7447 additions and 38 deletions

@@ -0,0 +1,70 @@
# 11. Application Metrics & Dashboards
meta:
id: web-production-11
feature: web-production
priority: P2
depends_on: []
tags: [observability, metrics, production]
objective:
- Collect and visualize application metrics for performance monitoring and capacity planning
deliverables:
- Prometheus metrics endpoint
- Custom business metrics
- Grafana or Datadog dashboards
- Alerting on metric thresholds
steps:
1. Add metrics collection:
- Install prom-client for Node.js metrics
- Create web/src/server/lib/metrics.ts
- Expose /metrics endpoint for Prometheus scraping
2. Collect standard metrics:
- HTTP request duration (histogram)
- HTTP request count (counter, labeled by status code and endpoint)
- Active connections (gauge)
- Memory usage (gauge)
- Event loop lag (gauge)
3. Collect business metrics:
- Signups (counter; rate derived at query time)
- Login attempts (counter, labeled by outcome: success or failure)
- Subscription conversions (counter)
- DarkWatch scan completions (counter)
- Alerts generated (counter; rate derived at query time)
- Average threat score (gauge)
4. Set up dashboards:
- Grafana or Datadog dashboard
- Request latency percentiles (p50, p95, p99)
- Error rate over time
- Business funnel (landing → signup → subscribe)
- Infrastructure health (CPU, memory, DB connections)
5. Configure alerts:
- p99 latency > 500ms for 5 minutes
- Error rate > 1% for 2 minutes
- Memory usage > 80% for 10 minutes
- DB connection pool > 90% for 5 minutes
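As an illustration of steps 1 to 3, the sketch below shows what a labeled counter and a cumulative-bucket histogram record, and how they serialize for a `/metrics` response. It is a dependency-free approximation, not prom-client's API; in the actual task prom-client would supply these types. The class names (`LabeledCounter`, `DurationHistogram`), the metric names, and the `/login` route are hypothetical.

```typescript
// Dependency-free sketch; all class and metric names here are hypothetical.
type Labels = Record<string, string>;

// Escape backslash, double quote, and newline inside label values.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render labels as {k="v",...}, sorted by key so the same label set
// always maps to the same series.
function formatLabels(labels: Labels): string {
  const keys = Object.keys(labels).sort();
  if (keys.length === 0) return "";
  const pairs = keys.map((k) => `${k}="${escapeLabelValue(labels[k])}"`);
  return `{${pairs.join(",")}}`;
}

class LabeledCounter {
  name: string;
  help: string;
  private series: Record<string, number> = {};

  constructor(name: string, help: string) {
    this.name = name;
    this.help = help;
  }

  // Counters only go up; each distinct label set is its own series.
  inc(labels: Labels = {}, amount = 1): void {
    const key = formatLabels(labels);
    this.series[key] = (this.series[key] || 0) + amount;
  }

  render(): string {
    const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} counter`];
    for (const key of Object.keys(this.series)) {
      lines.push(`${this.name}${key} ${this.series[key]}`);
    }
    return lines.join("\n");
  }
}

class DurationHistogram {
  name: string;
  help: string;
  upperBounds: number[];
  private bucketCounts: number[];
  private sum = 0;
  private count = 0;

  constructor(name: string, help: string, upperBounds: number[]) {
    this.name = name;
    this.help = help;
    this.upperBounds = upperBounds.slice().sort((a, b) => a - b);
    this.bucketCounts = this.upperBounds.map(() => 0);
  }

  // Buckets are cumulative: an observation counts toward every bucket
  // whose upper bound is at least the observed value.
  observe(seconds: number): void {
    this.sum += seconds;
    this.count += 1;
    this.upperBounds.forEach((le, i) => {
      if (seconds <= le) this.bucketCounts[i] += 1;
    });
  }

  render(): string {
    const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} histogram`];
    this.upperBounds.forEach((le, i) => {
      lines.push(`${this.name}_bucket{le="${le}"} ${this.bucketCounts[i]}`);
    });
    lines.push(`${this.name}_bucket{le="+Inf"} ${this.count}`);
    lines.push(`${this.name}_sum ${this.sum}`);
    lines.push(`${this.name}_count ${this.count}`);
    return lines.join("\n");
  }
}

// Usage: record three requests, then build the body a /metrics handler would return.
const httpRequests = new LabeledCounter("http_requests_total", "Total HTTP requests");
const httpDuration = new DurationHistogram(
  "http_request_duration_seconds",
  "HTTP request duration in seconds",
  [0.1, 0.5, 1, 2, 5],
);
httpRequests.inc({ code: "200", route: "/login" });
httpRequests.inc({ code: "200", route: "/login" });
httpRequests.inc({ code: "500", route: "/login" });
httpDuration.observe(0.25);
httpDuration.observe(0.5);
httpDuration.observe(1.5);
const metricsBody = httpRequests.render() + "\n" + httpDuration.render() + "\n";
console.log(metricsBody);
```

Storing counts rather than rates keeps the application side simple; rates such as signups per minute can then be derived in the dashboard query.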
tests:
- Unit: Test metrics increment correctly
- Integration: Verify /metrics endpoint returns valid Prometheus format
- Dashboard: Confirm all panels show data
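For the integration test, one possible approach is a loose line-by-line check of the response body. The sketch below is not a full parser of the Prometheus text format (for example, it does not handle `}` inside label values); the function name `findInvalidExpositionLines` and the sample metric names are hypothetical.

```typescript
// Loose structural check of a Prometheus text-format body.
// Returns the lines that look malformed; an empty array means nothing was flagged.

const METRIC_NAME = "[a-zA-Z_:][a-zA-Z0-9_:]*";
const SAMPLE_VALUE = "(NaN|[+-]?Inf|[+-]?(\\d+\\.?\\d*|\\.\\d+)([eE][+-]?\\d+)?)";

// A sample line: name, optional {labels}, value, optional integer timestamp.
const SAMPLE_LINE = new RegExp(
  "^" + METRIC_NAME + "(\\{[^}]*\\})? " + SAMPLE_VALUE + "( -?\\d+)?$",
);

// A TYPE comment must name one of the metric types defined by the text format.
const TYPE_LINE = new RegExp(
  "^# TYPE " + METRIC_NAME + " (counter|gauge|histogram|summary|untyped)$",
);

function findInvalidExpositionLines(body: string): string[] {
  const invalid: string[] = [];
  for (const line of body.split("\n")) {
    if (line.trim() === "") continue; // blank lines are ignored
    if (line.indexOf("# TYPE ") === 0) {
      if (!TYPE_LINE.test(line)) invalid.push(line);
    } else if (line.charAt(0) === "#") {
      continue; // HELP lines and other comments are accepted as-is
    } else if (!SAMPLE_LINE.test(line)) {
      invalid.push(line);
    }
  }
  return invalid;
}

// Usage: in a real test the body would come from an HTTP GET of /metrics.
const sampleBody = [
  "# HELP http_requests_total Total HTTP requests",
  "# TYPE http_requests_total counter",
  'http_requests_total{code="200",route="/login"} 2',
  'http_request_duration_seconds_bucket{le="+Inf"} 3',
  "http_request_duration_seconds_sum 2.25",
].join("\n");
console.log(findInvalidExpositionLines(sampleBody)); // prints []
```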
acceptance_criteria:
- /metrics endpoint serving valid Prometheus exposition format
- Request duration histogram with bucket upper bounds of 0.1, 0.5, 1, 2, and 5 seconds
- Business metrics visible in dashboard
- Alert fires when p99 latency exceeds 500ms for 5 minutes
- Dashboard refreshes every 10 seconds with live data
- Metrics retention for 30 days
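To show how the bucket boundaries above interact with the 500ms alert threshold, here is a minimal sketch of estimating a percentile from cumulative bucket counts by linear interpolation, in the style of Prometheus's `histogram_quantile`. The function name `estimateQuantile` and the request counts are made up for illustration.

```typescript
// Estimate a quantile from cumulative histogram buckets by linear
// interpolation inside the bucket that contains the target rank.

interface CumulativeBucket {
  le: number;    // upper bound in seconds; Infinity for the +Inf bucket
  count: number; // observations at or below `le`
}

function estimateQuantile(q: number, buckets: CumulativeBucket[]): number {
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  if (sorted.length === 0) return NaN;
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  // Find the first bucket whose cumulative count reaches the target rank.
  const rank = q * total;
  let i = 0;
  while (i < sorted.length - 1 && sorted[i].count < rank) i++;

  // The +Inf bucket has no finite upper bound, so fall back to the
  // highest finite bound.
  if (sorted[i].le === Infinity) {
    return sorted.length > 1 ? sorted[i - 1].le : NaN;
  }

  const lowerBound = i === 0 ? 0 : sorted[i - 1].le;
  const countBelow = i === 0 ? 0 : sorted[i - 1].count;
  const inBucket = sorted[i].count - countBelow;
  if (inBucket === 0) return lowerBound;
  return lowerBound + (sorted[i].le - lowerBound) * ((rank - countBelow) / inBucket);
}

// Usage with made-up counts for 1000 requests.
const exampleBuckets: CumulativeBucket[] = [
  { le: 0.1, count: 900 },
  { le: 0.5, count: 985 },
  { le: 1, count: 996 },
  { le: 2, count: 1000 },
  { le: 5, count: 1000 },
  { le: Infinity, count: 1000 },
];
const p99Estimate = estimateQuantile(0.99, exampleBuckets);
console.log(p99Estimate > 0.5); // prints true: only 98.5% of requests finished within 0.5s
```

Because 0.5 is itself a bucket boundary, the check "p99 above 500ms" reduces to "fewer than 99% of requests landed in the 0.5 second bucket or below". Estimates that fall between boundaries are only as precise as the bucket width allows.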
validation:
- `curl http://<app-host>/metrics` → valid Prometheus output
- Grafana dashboard shows request latency graph
- Trigger slow endpoint → alert fires after p99 latency stays above 500ms for 5 minutes
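The timing of the alert check depends on how a "for N minutes" condition is evaluated. The sketch below models that hold period in the way Prometheus's `for` clause behaves: the condition must be true at every evaluation for the whole duration, and a single healthy evaluation resets it. The function name `firstFiringTime`, the 60-second evaluation interval, and the sample values are assumptions.

```typescript
// Model of a threshold alert with a hold period.

interface Evaluation {
  t: number;     // evaluation time in seconds
  value: number; // metric value at that time, e.g. p99 latency in seconds
}

// Returns the time of the first evaluation at which the alert fires,
// or null if the condition never holds for long enough.
function firstFiringTime(
  evaluations: Evaluation[],
  threshold: number,
  forSeconds: number,
): number | null {
  let pendingSince: number | null = null;
  for (const e of evaluations) {
    if (e.value > threshold) {
      if (pendingSince === null) pendingSince = e.t; // condition starts holding
      if (e.t - pendingSince >= forSeconds) return e.t; // held long enough
    } else {
      pendingSince = null; // one healthy evaluation resets the hold period
    }
  }
  return null;
}

// Usage: p99 latency evaluated every 60 seconds (assumed interval).
const p99Series: Evaluation[] = [0.3, 0.7, 0.8, 0.2, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9].map(
  (value, i) => ({ t: i * 60, value }),
);
const firedAt = firstFiringTime(p99Series, 0.5, 300);
console.log(firedAt); // prints 540
```

In this series the breach at 60s to 120s is cleared by the healthy evaluation at 180s, so the hold period restarts at 240s and the alert fires at 540s. With a 5-minute hold, the earliest possible firing is 5 minutes after the first breaching evaluation, plus any scrape and evaluation delay.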
notes:
- Prometheus + Grafana is open source and cost-effective
- Datadog is easier but more expensive
- Consider using Vercel Analytics if deployed on Vercel