Kordant/tasks/web-production/12-uptime-monitoring.md
2026-05-26 16:06:34 -04:00

12. Uptime & Performance Monitoring

meta:

  • id: web-production-12
  • feature: web-production
  • priority: P2
  • depends_on: []
  • tags: [observability, uptime, production]

objective:

  • Monitor application uptime and performance from external vantage points to ensure reliability

deliverables:

  • External uptime monitoring (Pingdom, UptimeRobot, or Datadog Synthetics)
  • Synthetic monitoring for critical user journeys
  • Performance budget enforcement
  • Status page for incident communication

steps:

  1. Set up uptime monitoring:
    • Configure checks for homepage, API health, dashboard
    • Check from multiple regions (US East, US West, EU)
    • 1-minute interval checks
    • Alert on 2 consecutive failures
  2. Implement synthetic monitoring:
    • Signup flow: homepage → signup → verify email
    • Login flow: login → dashboard → view alerts
    • Billing flow: dashboard → pricing → checkout (test mode)
    • DarkWatch flow: dashboard → darkwatch → add watchlist item
  3. Set performance budgets:
    • LCP (Largest Contentful Paint) < 2.5s mobile, < 1.5s desktop
    • INP (Interaction to Next Paint) < 200ms (INP replaced FID, First Input Delay, as a Core Web Vital in 2024)
    • CLS (Cumulative Layout Shift) < 0.1
    • TTFB (Time to First Byte) < 200ms
    • API response p95 < 200ms
  4. Configure alerting:
    • Downtime alert via Slack/SMS
    • Performance degradation alert (LCP > 3s)
    • SSL certificate expiry alert (30 days before)
    • Domain expiry alert (30 days before)
  5. Set up status page:
    • Use statuspage.io or instatus.com
    • Auto-update from monitoring checks
    • Subscribe users for incident notifications
    • Post incident updates and post-mortems
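
The "alert on 2 consecutive failures" rule from step 1 can be sketched as a small state tracker. This is a minimal illustration, not the behavior of any particular monitoring vendor: the class name `ConsecutiveFailureTracker`, the `record` method, and the target and region labels are all hypothetical, and it assumes each region's streak is counted independently.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

FAILURE_THRESHOLD = 2  # from step 1: alert on 2 consecutive failures


@dataclass
class ConsecutiveFailureTracker:
    """Hypothetical helper: counts consecutive failed checks per (target, region)."""

    threshold: int = FAILURE_THRESHOLD
    _streaks: Dict[Tuple[str, str], int] = field(default_factory=dict)

    def record(self, target: str, region: str, ok: bool) -> bool:
        """Record one check result; return True only when an alert should fire."""
        key = (target, region)
        if ok:
            # Any successful check resets the streak.
            self._streaks[key] = 0
            return False
        self._streaks[key] = self._streaks.get(key, 0) + 1
        # Fire once, at the moment the streak first reaches the threshold,
        # so a long outage does not re-alert on every check.
        return self._streaks[key] == self.threshold


tracker = ConsecutiveFailureTracker()
results = [
    tracker.record("api-health", "us-east", ok)
    for ok in (True, False, False, False, True)
]
print(results)  # [False, False, True, False, False]
```

With 1-minute checks, an outage that begins just after a check is only confirmed by the second failed check, up to about two minutes later, so the 2-minute alert target in the validation section leaves little room for delivery delay.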

tests:

  • Integration: Verify monitoring catches simulated outage
  • Performance: Confirm synthetic tests complete successfully
  • Alert: Test alert channels with deliberate failure

acceptance_criteria:

  • Uptime monitoring checking every 60 seconds from 3+ regions
  • 99.9% uptime SLA measured over 30 days
  • Synthetic tests covering signup, login, and core flows
  • Performance budget alerts for LCP > 2.5s
  • Status page accessible and auto-updating
  • SSL certificate expiry alert 30 days in advance
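
To make the 99.9% criterion concrete, the arithmetic below converts it into a downtime budget and a failed-check budget. It is a sketch under stated assumptions: the function names are hypothetical, and uptime is approximated as the share of successful 60-second checks from a single region, which may differ from how a given monitoring provider computes it.

```python
def allowed_downtime_minutes(uptime_percent: float, window_days: int) -> float:
    """Downtime budget, in minutes, for an uptime target over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


def measured_uptime_percent(failed_checks: int, total_checks: int) -> float:
    """Uptime approximated as the percentage of checks that succeeded."""
    return 100 * (total_checks - failed_checks) / total_checks


# 30 days of 1-minute checks is 43,200 checks per region.
checks_per_window = 30 * 24 * 60

print(round(allowed_downtime_minutes(99.9, 30), 1))                # 43.2
print(measured_uptime_percent(43, checks_per_window) >= 99.9)      # True
print(measured_uptime_percent(44, checks_per_window) >= 99.9)      # False
```

Under these assumptions the target tolerates roughly 43 minutes of downtime, or 43 failed checks per region, in any 30-day window.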

validation:

  • Simulate outage → alert received within 2 minutes
  • Check status page → shows incident with timeline
  • Run synthetic test → completes in <30 seconds
  • Lighthouse CI shows all metrics within budget
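
The budget check can be sketched independently of any tool as a comparison of measured values against the limits from step 3. The sketch below is illustrative only: the metric keys (`lcp_ms`, `cls`, `ttfb_ms`), the `budget_violations` function, and the sample measurements are assumptions, not Lighthouse CI's configuration format. The interaction-latency budget is left out because it depends on real user input, which a lab run does not provide.

```python
from typing import Dict, List

# Budgets copied from step 3; times are in milliseconds, CLS is unitless.
BUDGETS: Dict[str, Dict[str, float]] = {
    "mobile": {"lcp_ms": 2500, "cls": 0.1, "ttfb_ms": 200},
    "desktop": {"lcp_ms": 1500, "cls": 0.1, "ttfb_ms": 200},
}


def budget_violations(form_factor: str, measured: Dict[str, float]) -> List[str]:
    """Return one message per metric that is missing or not under its budget."""
    violations = []
    for metric, limit in BUDGETS[form_factor].items():
        value = measured.get(metric)
        if value is None:
            violations.append(f"{metric}: missing measurement")
        elif value >= limit:
            # Budgets are strict upper bounds ("< 2.5s"), so equality fails too.
            violations.append(f"{metric}: {value} is not under budget {limit}")
    return violations


sample = {"lcp_ms": 1800, "cls": 0.05, "ttfb_ms": 150}  # made-up measurements
print(budget_violations("mobile", sample))   # []
print(budget_violations("desktop", sample))  # ['lcp_ms: 1800 is not under budget 1500']
```

An empty list means the run is within budget for that form factor; a CI job could fail the build whenever the list is non-empty.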

notes:

  • UptimeRobot free tier: 50 monitors, 5-minute intervals (does not meet the 1-minute interval requirement above)
  • Pingdom is more reliable but is a paid service
  • Consider Checkly for synthetic monitoring scripted in JavaScript