# 12. Uptime & Performance Monitoring

meta:
  id: web-production-12
  feature: web-production
  priority: P2
  depends_on: []
  tags: [observability, uptime, production]

objective:
  - Monitor application uptime and performance from external vantage points to ensure reliability

deliverables:
  - External uptime monitoring (Pingdom, UptimeRobot, or Datadog Synthetics)
  - Synthetic monitoring for critical user journeys
  - Performance budget enforcement
  - Status page for incident communication

steps:
  1. Set up uptime monitoring:
     - Configure checks for homepage, API health, dashboard
     - Check from multiple regions (US East, US West, EU)
     - 1-minute interval checks
     - Alert on 2 consecutive failures
  2. Implement synthetic monitoring:
     - Signup flow: homepage → signup → verify email
     - Login flow: login → dashboard → view alerts
     - Billing flow: dashboard → pricing → checkout (test mode)
     - DarkWatch flow: dashboard → darkwatch → add watchlist item
  3. Set performance budgets:
     - LCP (Largest Contentful Paint) < 2.5s mobile, < 1.5s desktop
     - FID (First Input Delay) < 100ms
     - CLS (Cumulative Layout Shift) < 0.1
     - TTFB (Time to First Byte) < 200ms
     - API response p95 < 200ms
  4. Configure alerting:
     - Downtime alert via Slack/SMS
     - Performance degradation alert (LCP > 3s)
     - SSL certificate expiry alert (30 days before)
     - Domain expiry alert (30 days before)
  5. Set up status page:
     - Use statuspage.io or instatus.com
     - Auto-update from monitoring checks
     - Subscribe users for incident notifications
     - Post incident updates and post-mortems

tests:
  - Integration: Verify monitoring catches simulated outage
  - Performance: Confirm synthetic tests complete successfully
  - Alert: Test alert channels with deliberate failure

acceptance_criteria:
  - Uptime monitoring checking every 60 seconds from 3+ regions
  - 99.9% uptime SLA measured over 30 days
  - Synthetic tests covering signup, login, and core flows
  - Performance budget alerts for LCP > 2.5s
  - Status page accessible and auto-updating
  - SSL certificate expiry alert 30 days in advance

validation:
  - Simulate outage → alert received within 2 minutes
  - Check status page → shows incident with timeline
  - Run synthetic test → completes in <30 seconds
  - Lighthouse CI shows all metrics within budget

notes:
  - UptimeRobot free tier: 50 monitors, 5-minute intervals, so it does not meet the 60-second check interval required above
  - Pingdom is more reliable but paid
  - Consider using Checkly for synthetic monitoring with JS
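
The "alert on 2 consecutive failures" rule from step 1 can be sketched as a small counter. This is only an illustration of the rule, not how any particular monitoring vendor implements it; `FailureTracker` and `record` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class FailureTracker:
    """Hypothetical helper: tracks consecutive failed checks for one monitor."""

    threshold: int = 2  # step 1: alert on 2 consecutive failures
    consecutive_failures: int = 0

    def record(self, check_passed: bool) -> bool:
        """Record one check result; return True when an alert should fire."""
        if check_passed:
            # Any successful check resets the streak.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Fire once, at the moment the streak reaches the threshold.
        return self.consecutive_failures == self.threshold


# One isolated failure does not alert; the second failure in a row does.
tracker = FailureTracker()
results = [True, False, True, False, False, False, True]
alerts = [tracker.record(ok) for ok in results]
```

With 1-minute checks, the second failure in a row arrives about one interval after the first, which is what makes the "alert received within 2 minutes" validation target plausible.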
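
The performance budgets in step 3 can be expressed as data and checked mechanically. The sketch below assumes metrics arrive as a plain dictionary; the key names (`lcp_s`, `fid_ms`, and so on) and the functions `budget_for` and `violations` are hypothetical and not tied to Lighthouse CI or any other tool.

```python
def budget_for(device):
    """Return the step 3 budgets for 'mobile' or 'desktop' (hypothetical key names)."""
    return {
        "lcp_s": 2.5 if device == "mobile" else 1.5,  # LCP < 2.5s mobile, < 1.5s desktop
        "fid_ms": 100,       # FID < 100ms
        "cls": 0.1,          # CLS < 0.1
        "ttfb_ms": 200,      # TTFB < 200ms
        "api_p95_ms": 200,   # API response p95 < 200ms
    }


def violations(metrics, device):
    """List the metrics that break their budget, sorted by name.

    Budgets are strict upper bounds, so a value equal to the limit counts
    as a violation. Metrics missing from the input are skipped.
    """
    budget = budget_for(device)
    return sorted(
        name
        for name, limit in budget.items()
        if name in metrics and metrics[name] >= limit
    )


sample = {"lcp_s": 2.0, "fid_ms": 80, "cls": 0.05, "ttfb_ms": 250, "api_p95_ms": 150}
mobile_violations = violations(sample, "mobile")
desktop_violations = violations(sample, "desktop")
```

The same sample passes the mobile LCP budget but fails the desktop one, which is why the device is an explicit argument rather than a global setting.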
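
For the SSL certificate expiry alert (30 days before), a minimal sketch using only the Python standard library might look like the following. The function names are hypothetical, and a hosted monitoring service would normally perform this check itself; the sketch only shows the date arithmetic behind the 30-day threshold.

```python
import socket
import ssl


def fetch_not_after(host, port=443):
    """Connect over TLS and return the certificate's 'notAfter' string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now_epoch):
    """Days between now_epoch (seconds since epoch) and the expiry time."""
    return (ssl.cert_time_to_seconds(not_after) - now_epoch) / 86400


def should_alert(not_after, now_epoch, warn_days=30):
    """True once the certificate is within warn_days of expiring."""
    return days_until_expiry(not_after, now_epoch) <= warn_days
```

A scheduled job could call `should_alert(fetch_not_after("example.com"), time.time())`, where `example.com` stands in for the real hostname, and route a `True` result to the same Slack/SMS channel as the downtime alerts.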
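
To make the "99.9% uptime SLA measured over 30 days" criterion concrete, the downtime budget can be computed directly: a 30-day window contains 43,200 minutes, so 99.9% leaves roughly 43.2 minutes of allowed downtime. The helper names below are hypothetical, and the uptime calculation assumes each failed 1-minute check represents one minute of downtime.

```python
def allowed_downtime_minutes(sla_percent, window_days):
    """Minutes of downtime permitted by an SLA over the given window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - sla_percent / 100)


def measured_uptime_percent(total_checks, failed_checks):
    """Uptime as the share of successful checks (assumes evenly spaced checks)."""
    if total_checks == 0:
        raise ValueError("no checks recorded")
    return 100 * (total_checks - failed_checks) / total_checks


budget = allowed_downtime_minutes(99.9, 30)    # roughly 43.2 minutes
uptime = measured_uptime_percent(43200, 40)    # 40 failed 1-minute checks in 30 days
meets_sla = uptime >= 99.9
```

Under these assumptions, 40 failed checks in a month still meets the SLA while 44 would not.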