12. Uptime & Performance Monitoring
meta:
  id: web-production-12
  feature: web-production
  priority: P2
  depends_on: []
  tags: [observability, uptime, production]
objective:
- Monitor application uptime and performance from external vantage points to ensure reliability
deliverables:
- External uptime monitoring (Pingdom, UptimeRobot, or Datadog Synthetics)
- Synthetic monitoring for critical user journeys
- Performance budget enforcement
- Status page for incident communication
steps:
- Set up uptime monitoring:
  - Configure checks for the homepage, API health endpoint, and dashboard
  - Check from multiple regions (US East, US West, EU)
  - 1-minute interval checks
  - Alert on 2 consecutive failures
- Implement synthetic monitoring:
  - Signup flow: homepage → signup → verify email
  - Login flow: login → dashboard → view alerts
  - Billing flow: dashboard → pricing → checkout (test mode)
  - DarkWatch flow: dashboard → darkwatch → add watchlist item
- Set performance budgets:
  - LCP (Largest Contentful Paint) < 2.5s mobile, < 1.5s desktop
  - FID (First Input Delay) < 100ms; also track INP (Interaction to Next Paint) < 200ms, which replaced FID as a Core Web Vital in March 2024
  - CLS (Cumulative Layout Shift) < 0.1
  - TTFB (Time to First Byte) < 200ms
  - API response p95 < 200ms
- Configure alerting:
  - Downtime alert via Slack/SMS
  - Performance degradation alert (LCP > 3s)
  - SSL certificate expiry alert (30 days before)
  - Domain expiry alert (30 days before)
- Set up status page:
  - Use statuspage.io or instatus.com
  - Auto-update from monitoring checks
  - Subscribe users for incident notifications
  - Post incident updates and post-mortems
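
The "alert on 2 consecutive failures" rule in the uptime step can be sketched as a small state machine. This is a minimal illustration, not any vendor's implementation: `CheckState`, `record_result`, and `worst_case_detection_seconds` are hypothetical names, and multi-region confirmation is not modeled.

```python
from dataclasses import dataclass
from typing import Optional

FAILURE_THRESHOLD = 2        # "Alert on 2 consecutive failures"
CHECK_INTERVAL_SECONDS = 60  # "1-minute interval checks"


@dataclass
class CheckState:
    """Per-target alerting state (hypothetical helper, not a vendor API)."""
    consecutive_failures: int = 0
    alerting: bool = False


def record_result(state: CheckState, ok: bool) -> Optional[str]:
    """Apply one check result; return "alert", "recover", or None."""
    if ok:
        recovered = state.alerting
        state.consecutive_failures = 0
        state.alerting = False
        return "recover" if recovered else None
    state.consecutive_failures += 1
    if state.consecutive_failures >= FAILURE_THRESHOLD and not state.alerting:
        state.alerting = True
        return "alert"  # fire once per incident, not on every failed check
    return None


def worst_case_detection_seconds(interval: int = CHECK_INTERVAL_SECONDS,
                                 threshold: int = FAILURE_THRESHOLD) -> int:
    """Upper bound on outage-to-alert time, ignoring timeouts and delivery delay."""
    return interval * threshold
```

With a 60-second interval and a threshold of 2, the bound is 120 seconds before request timeouts and notification delivery are counted, so it is worth confirming that the chosen tool still fits the 2-minute target in the validation section.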
tests:
- Integration: Verify monitoring catches simulated outage
- Performance: Confirm synthetic tests complete successfully
- Alert: Test alert channels with deliberate failure
acceptance_criteria:
- Uptime monitoring checking every 60 seconds from 3+ regions
- 99.9% uptime SLA measured over 30 days
- Synthetic tests covering signup, login, and core flows
- Performance budget alerts for LCP > 2.5s
- Status page accessible and auto-updating
- SSL certificate expiry alert 30 days in advance
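
Two of the criteria above reduce to arithmetic, sketched here under stated assumptions. The function names are hypothetical, each failed 60-second check is assumed to count as one minute of downtime, and the expiry string is assumed to be in the format Python's `ssl` module reports as a certificate's `notAfter` field.

```python
import ssl
from datetime import datetime, timezone


def allowed_downtime_minutes(sla_percent: float, window_days: int) -> float:
    """Downtime budget implied by an uptime SLA over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


def uptime_percent(total_checks: int, failed_checks: int) -> float:
    """Measured uptime, assuming evenly spaced checks."""
    return 100 * (total_checks - failed_checks) / total_checks


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days left before a certificate's notAfter timestamp."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def cert_needs_alert(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True once the certificate is within the warning window."""
    return days_until_expiry(not_after, now) <= warn_days
```

For 99.9% over 30 days the budget is about 43.2 minutes, which at one check per minute is 43,200 checks with at most 43 failures.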
validation:
- Simulate outage → alert received within 2 minutes
- Check status page → shows incident with timeline
- Run synthetic test → completes in <30 seconds
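- Lighthouse CI shows all lab-measurable metrics (LCP, CLS) within budget; FID needs real-user field data, so use Total Blocking Time as its lab proxy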
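
The budget check can be sketched as a comparison of measured values against the thresholds from the "Set performance budgets" step. The metric key names and both helper functions are hypothetical, and `p95` uses the nearest-rank method, which is one of several common percentile definitions.

```python
from typing import Dict, Sequence, Tuple

# Budgets from the steps section; milliseconds except CLS, which is unitless.
BUDGETS: Dict[str, float] = {
    "lcp_mobile_ms": 2500,
    "lcp_desktop_ms": 1500,
    "fid_ms": 100,
    "cls": 0.1,
    "ttfb_ms": 200,
    "api_p95_ms": 200,
}


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def over_budget(measured: Dict[str, float]) -> Dict[str, Tuple[float, float]]:
    """Return {metric: (measured, budget)} for metrics not strictly under budget."""
    return {
        name: (value, BUDGETS[name])
        for name, value in measured.items()
        if name in BUDGETS and value >= BUDGETS[name]
    }
```

A CI job could fail the build whenever `over_budget` returns a non-empty result; metrics missing from `BUDGETS` are ignored rather than treated as failures.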
notes:
- UptimeRobot free tier: 50 monitors, 5-minute intervals, so it cannot meet the 1-minute check interval required above
- Pingdom is more reliable but paid
- Consider Checkly for synthetic monitoring scripted in JavaScript