get to prod tasks
70
tasks/web-production/11-metrics-dashboards.md
Normal file
@@ -0,0 +1,70 @@
# 11. Application Metrics & Dashboards

meta:
  id: web-production-11
  feature: web-production
  priority: P2
  depends_on: []
  tags: [observability, metrics, production]

objective:
- Collect and visualize application metrics for performance monitoring and capacity planning

deliverables:
- Prometheus metrics endpoint
- Custom business metrics
- Grafana or Datadog dashboards
- Alerting on metric thresholds

steps:
1. Add metrics collection:
   - Install prom-client for Node.js metrics
   - Create web/src/server/lib/metrics.ts
   - Expose /metrics endpoint for Prometheus scraping
2. Collect standard metrics:
   - HTTP request duration (histogram)
   - HTTP request count (counter, by status code, endpoint)
   - Active connections (gauge)
   - Memory usage (gauge)
   - Event loop lag (gauge)
3. Collect business metrics:
   - Signup rate (counter)
   - Login success/failure rate (counter)
   - Subscription conversions (counter)
   - DarkWatch scan completions (counter)
   - Alert generation rate (counter)
   - Average threat score (gauge)
4. Set up dashboards:
   - Grafana dashboard or Datadog dashboard
   - Request latency percentiles (p50, p95, p99)
   - Error rate over time
   - Business funnel (landing → signup → subscribe)
   - Infrastructure health (CPU, memory, DB connections)
5. Configure alerts:
   - p99 latency > 500ms for 5 minutes
   - Error rate > 1% for 2 minutes
   - Memory usage > 80% for 10 minutes
   - DB connection pool > 90% for 5 minutes
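Steps 1 to 3 could start from something like the sketch below. It assumes prom-client is installed and shows one possible shape for web/src/server/lib/metrics.ts. The metric names (`http_request_duration_seconds`, `http_requests_total`, `signups_total`, `average_threat_score`) and the helpers `observeRequest` and `renderMetrics` are hypothetical and not taken from the codebase. The bucket boundaries are the ones listed under acceptance_criteria.

```typescript
// Hypothetical sketch of web/src/server/lib/metrics.ts; all metric names are illustrative.
import { Counter, Gauge, Histogram, Registry, collectDefaultMetrics } from "prom-client";

// A dedicated registry keeps these metrics apart from prom-client's global default registry.
export const registry = new Registry();

// Default Node.js process metrics (memory, event loop lag, and similar).
collectDefaultMetrics({ register: registry });

export const httpRequestDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.1, 0.5, 1, 2, 5],
  registers: [registry],
});

export const httpRequestsTotal = new Counter({
  name: "http_requests_total",
  help: "HTTP requests by method, route and status code",
  labelNames: ["method", "route", "status_code"],
  registers: [registry],
});

// One business counter and one business gauge as examples; the others follow the same pattern.
export const signupsTotal = new Counter({
  name: "signups_total",
  help: "Completed signups",
  registers: [registry],
});

export const averageThreatScore = new Gauge({
  name: "average_threat_score",
  help: "Most recently computed average threat score",
  registers: [registry],
});

// Record one finished HTTP request in both the histogram and the counter.
export function observeRequest(method: string, route: string, statusCode: number, seconds: number): void {
  const labels = { method, route, status_code: String(statusCode) };
  httpRequestDuration.observe(labels, seconds);
  httpRequestsTotal.inc(labels);
}

// Content type and body that a GET /metrics handler would send back.
export async function renderMetrics(): Promise<{ contentType: string; body: string }> {
  return { contentType: registry.contentType, body: await registry.metrics() };
}
```

A route handler would call `observeRequest` when a response finishes and return the result of `renderMetrics()` from `/metrics`. How that is wired depends on the web framework in use, which this task does not specify.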

tests:
- Unit: Test metrics increment correctly
- Integration: Verify /metrics endpoint returns valid Prometheus format
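For the integration test, one rough way to check "valid Prometheus format" is a line-level check of the response body. The sketch below is a simplification of the text exposition format, not a full parser: it assumes label values never contain `}` and that timestamps, if present, are integers. The function name `findInvalidLines` is hypothetical.

```typescript
// Simplified sample-line pattern: metric name, optional {labels}, numeric value, optional timestamp.
// Assumption: label values contain no "}" characters.
const SAMPLE_LINE =
  /^[a-zA-Z_:][a-zA-Z0-9_:]*(\{[^}]*\})?\s+(-?\d+(\.\d+)?([eE][+-]?\d+)?|[+-]?Inf|NaN)(\s+-?\d+)?$/;

// Return every non-empty line that is neither a comment (# ...) nor a well-formed sample.
export function findInvalidLines(body: string): string[] {
  return body
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith("#"))
    .filter((line) => !SAMPLE_LINE.test(line));
}

// Usage: feed it the text returned by GET /metrics and assert the result is empty.
const exampleBody = [
  "# HELP http_requests_total Total HTTP requests",
  "# TYPE http_requests_total counter",
  'http_requests_total{method="GET",status_code="200"} 42',
  "not a metric line!",
].join("\n");

console.log(findInvalidLines(exampleBody)); // reports only the last line
```

An integration test could fetch `/metrics` from a running instance and assert that `findInvalidLines` returns an empty array.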

acceptance_criteria:
- /metrics endpoint serving valid Prometheus exposition format
- Request duration histogram with 0.1, 0.5, 1, 2, 5 second buckets
- Business metrics visible in dashboard
- Alert fires when p99 latency exceeds 500ms
- Dashboard refreshes every 10 seconds with live data
- Metrics retention for 30 days
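The histogram buckets and the p99 criterion interact, because percentiles computed from a histogram are estimates interpolated between bucket boundaries. The sketch below is a simplified re-implementation of that kind of bucket-based estimate (linear interpolation inside the bucket that contains the target rank). It is not the dashboard's actual query, and the request counts are made up for illustration.

```typescript
// Cumulative bucket: number of observations with duration <= le (in seconds).
interface Bucket {
  le: number;
  count: number;
}

// Estimate quantile q (0 < q <= 1) from cumulative buckets by linear interpolation.
export function estimateQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted.length > 0 ? sorted[sorted.length - 1].count : 0;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = sorted.findIndex((b) => b.count >= rank);

  // Rank falls in the +Inf bucket: report the highest finite boundary instead.
  if (sorted[i].le === Infinity) return i > 0 ? sorted[i - 1].le : NaN;

  const lower = i === 0 ? 0 : sorted[i - 1].le;
  const below = i === 0 ? 0 : sorted[i - 1].count;
  const inBucket = sorted[i].count - below;
  if (inBucket === 0) return lower;
  return lower + (sorted[i].le - lower) * ((rank - below) / inBucket);
}

// Made-up example: 1000 requests spread over the buckets from the acceptance criteria.
const exampleBuckets: Bucket[] = [
  { le: 0.1, count: 900 },
  { le: 0.5, count: 985 },
  { le: 1, count: 998 },
  { le: 2, count: 1000 },
  { le: 5, count: 1000 },
  { le: Infinity, count: 1000 },
];

const p99Seconds = estimateQuantile(0.99, exampleBuckets);
console.log(p99Seconds > 0.5); // the 500ms threshold is exceeded for this sample data
```

Under this model, 0.5 is itself a bucket boundary, so the estimated p99 exceeds 500ms exactly when fewer than 99% of requests land in the `le=0.5` bucket. Between boundaries the estimate is coarse, which may be worth keeping in mind when reading the p50 and p95 panels.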

validation:
- `curl /metrics` → valid Prometheus output
- Grafana dashboard shows request latency graph
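When validating the alert, it helps to be clear about what "for 5 minutes" means. The sketch below is a simplified model of a sustained-threshold rule, assuming the condition must hold continuously for the whole window before the alert fires. The types and the `evaluateAlert` function are hypothetical and do not reproduce any particular alerting engine.

```typescript
type AlertState = "inactive" | "pending" | "firing";

interface Sample {
  timestampMs: number;
  value: number;
}

// Walk samples in time order. The alert fires only once the value has stayed
// above the threshold for at least forMs without interruption.
export function evaluateAlert(samples: Sample[], threshold: number, forMs: number): AlertState {
  let breachStart: number | null = null;
  let state: AlertState = "inactive";
  for (const s of samples) {
    if (s.value > threshold) {
      if (breachStart === null) breachStart = s.timestampMs;
      state = s.timestampMs - breachStart >= forMs ? "firing" : "pending";
    } else {
      // Any sample at or below the threshold resets the window.
      breachStart = null;
      state = "inactive";
    }
  }
  return state;
}

// Made-up p99 readings (seconds), one per minute, all above the 0.5s threshold.
const minute = 60_000;
const slowSamples: Sample[] = [0, 1, 2, 3, 4, 5].map((m) => ({ timestampMs: m * minute, value: 0.7 }));

console.log(evaluateAlert(slowSamples.slice(0, 5), 0.5, 5 * minute)); // still pending after 4 minutes
console.log(evaluateAlert(slowSamples, 0.5, 5 * minute)); // firing once 5 minutes have elapsed
```

Under this model the alert cannot fire before the slow endpoint has been slow for a full 5 minutes, so the validation step may need a small allowance for scrape and evaluation intervals on top of that.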

notes:
- Prometheus + Grafana is open source and cost-effective
- Datadog is easier but more expensive
- Consider using Vercel Analytics if deployed on Vercel