70 lines
2.4 KiB
Markdown
70 lines
2.4 KiB
Markdown
# 12. Uptime & Performance Monitoring
|
|
|
|
meta:
|
|
id: web-production-12
|
|
feature: web-production
|
|
priority: P2
|
|
depends_on: []
|
|
tags: [observability, uptime, production]
|
|
|
|
objective:
|
|
- Monitor application uptime and performance from external vantage points to ensure reliability
|
|
|
|
deliverables:
|
|
- External uptime monitoring (Pingdom, UptimeRobot, or Datadog Synthetics)
|
|
- Synthetic monitoring for critical user journeys
|
|
- Performance budget enforcement
|
|
- Status page for incident communication
|
|
|
|
steps:
|
|
1. Set up uptime monitoring:
|
|
- Configure checks for homepage, API health, dashboard
|
|
- Check from multiple regions (US East, US West, EU)
|
|
- 1-minute interval checks
|
|
- Alert on 2 consecutive failures
|
|
2. Implement synthetic monitoring:
|
|
- Signup flow: homepage → signup → verify email
|
|
- Login flow: login → dashboard → view alerts
|
|
- Billing flow: dashboard → pricing → checkout (test mode)
|
|
- DarkWatch flow: dashboard → darkwatch → add watchlist item
|
|
3. Set performance budgets:
|
|
- LCP (Largest Contentful Paint) < 2.5s mobile, < 1.5s desktop
|
|
- FID (First Input Delay) < 100ms
|
|
- CLS (Cumulative Layout Shift) < 0.1
|
|
- TTFB (Time to First Byte) < 200ms
|
|
- API response p95 < 200ms
|
|
4. Configure alerting:
|
|
- Downtime alert via Slack/SMS
|
|
- Performance degradation alert (LCP > 3s)
|
|
- SSL certificate expiry alert (30 days before)
|
|
- Domain expiry alert (30 days before)
|
|
5. Set up status page:
|
|
- Use statuspage.io or instatus.com
|
|
- Auto-update from monitoring checks
|
|
- Subscribe users for incident notifications
|
|
- Post incident updates and post-mortems
|
|
|
|
tests:
|
|
- Integration: Verify monitoring catches simulated outage
|
|
- Performance: Confirm synthetic tests complete successfully
|
|
- Alert: Test alert channels with deliberate failure
|
|
|
|
acceptance_criteria:
|
|
- Uptime monitoring checking every 60 seconds from 3+ regions
|
|
- 99.9% uptime SLA measured over 30 days
|
|
- Synthetic tests covering signup, login, and core flows
|
|
- Performance budget alerts for LCP > 2.5s
|
|
- Status page accessible and auto-updating
|
|
- SSL certificate expiry alert 30 days in advance
|
|
|
|
validation:
|
|
- Simulate outage → alert received within 2 minutes
|
|
- Check status page → shows incident with timeline
|
|
- Run synthetic test → completes in <30 seconds
|
|
- Lighthouse CI shows all metrics within budget
|
|
|
|
notes:
|
|
- UptimeRobot free tier: 50 monitors, 5-minute intervals
|
|
- Pingdom more reliable but paid
|
|
- Consider using Checkly for synthetic monitoring with JS
|