get to prod tasks

2026-05-26 16:06:34 -04:00
parent 04e839640f
commit 5214412fff
105 changed files with 7447 additions and 38 deletions
--- a/tasks/web-production/12-uptime-monitoring.md
+++ b/tasks/web-production/12-uptime-monitoring.md
@@ -0,0 +1,69 @@
+# 12. Uptime & Performance Monitoring
+
+meta:
+  id: web-production-12
+  feature: web-production
+  priority: P2
+  depends_on: []
+  tags: [observability, uptime, production]
+
+objective:
+- Monitor application uptime and performance from external vantage points to ensure reliability
+
+deliverables:
+- External uptime monitoring (Pingdom, UptimeRobot, or Datadog Synthetics)
+- Synthetic monitoring for critical user journeys
+- Performance budget enforcement
+- Status page for incident communication
+
+steps:
+1. Set up uptime monitoring:
+   - Configure checks for homepage, API health, dashboard
+   - Check from multiple regions (US East, US West, EU)
+   - 1-minute interval checks
+   - Alert on 2 consecutive failures
+2. Implement synthetic monitoring:
+   - Signup flow: homepage → signup → verify email
+   - Login flow: login → dashboard → view alerts
+   - Billing flow: dashboard → pricing → checkout (test mode)
+   - DarkWatch flow: dashboard → darkwatch → add watchlist item
+3. Set performance budgets:
+   - LCP (Largest Contentful Paint) < 2.5s mobile, < 1.5s desktop
+   - INP (Interaction to Next Paint, the Core Web Vital that replaced FID) < 200ms
+   - CLS (Cumulative Layout Shift) < 0.1
+   - TTFB (Time to First Byte) < 200ms
+   - API response p95 < 200ms
+4. Configure alerting:
+   - Downtime alert via Slack/SMS
+   - Performance degradation alert (LCP > 2.5s, the mobile budget)
+   - SSL certificate expiry alert (30 days before)
+   - Domain expiry alert (30 days before)
+5. Set up status page:
+   - Use statuspage.io or instatus.com
+   - Auto-update from monitoring checks
+   - Subscribe users for incident notifications
+   - Post incident updates and post-mortems
+
+tests:
+- Integration: Verify monitoring catches simulated outage
+- Performance: Confirm synthetic tests complete successfully
+- Alert: Test alert channels with deliberate failure
+
+acceptance_criteria:
+- Uptime monitoring checking every 60 seconds from 3+ regions
+- 99.9% uptime SLA measured over 30 days (at most ~43 minutes of downtime)
+- Synthetic tests covering signup, login, and core flows
+- Performance budget alerts for LCP > 2.5s
+- Status page accessible and auto-updating
+- SSL certificate expiry alert 30 days in advance
+
+validation:
+- Simulate outage → alert received within 2 minutes
+- Check status page → shows incident with timeline
+- Run synthetic test → completes in <30 seconds
+- Lighthouse CI shows all lab-measurable metrics within budget
+
+notes:
+- UptimeRobot free tier: 50 monitors, 5-minute intervals (the 1-minute checks above require a paid plan)
+- Pingdom is more reliable but paid
+- Consider Checkly for synthetic monitoring scripted in JavaScript
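
The numbers in the task file (1-minute checks, alert on 2 consecutive failures, 99.9% uptime over 30 days) imply a downtime budget and a worst-case detection time. The following is a minimal Python sketch of that arithmetic and of the consecutive-failure rule. All function, class, and constant names are hypothetical and not taken from any monitoring vendor's API, and the detection estimate assumes the outage starts just after a successful check and ignores request timeouts and notification delivery latency.

```python
from dataclasses import dataclass

# Values copied from the task file; the names themselves are hypothetical.
CHECK_INTERVAL_S = 60      # "1-minute interval checks"
FAILURES_TO_ALERT = 2      # "Alert on 2 consecutive failures"
UPTIME_TARGET = 0.999      # "99.9% uptime SLA"
WINDOW_DAYS = 30           # "measured over 30 days"


def downtime_budget_minutes(target: float, window_days: int) -> float:
    """Minutes of downtime allowed in the window at the given uptime target."""
    return (1.0 - target) * window_days * 24 * 60


def worst_case_detection_s(interval_s: int, failures_to_alert: int) -> int:
    """Longest time from outage start to the check that triggers the alert.

    Assumes the outage begins just after a successful check, so the first
    failing check arrives one full interval later.
    """
    return interval_s * failures_to_alert


@dataclass
class ConsecutiveFailureAlert:
    """Fires once when the failure streak reaches the threshold."""
    threshold: int
    streak: int = 0

    def record(self, check_ok: bool) -> bool:
        """Record one check result; return True when the alert should fire."""
        if check_ok:
            self.streak = 0
            return False
        self.streak += 1
        # Compare with == so a long outage does not re-alert on every check.
        return self.streak == self.threshold


if __name__ == "__main__":
    budget = downtime_budget_minutes(UPTIME_TARGET, WINDOW_DAYS)
    print(f"Downtime budget: {budget:.1f} minutes per {WINDOW_DAYS} days")
    detect = worst_case_detection_s(CHECK_INTERVAL_S, FAILURES_TO_ALERT)
    print(f"Worst-case detection: {detect} seconds")
```

Under these assumptions the worst-case detection time is already 120 seconds, so the "alert received within 2 minutes" validation leaves no headroom for check timeouts or Slack/SMS delivery. The outage simulation should be timed with that in mind.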