get to prod tasks
tasks/web-production/12-uptime-monitoring.md
@@ -0,0 +1,69 @@
# 12. Uptime & Performance Monitoring

meta:
  id: web-production-12
  feature: web-production
  priority: P2
  depends_on: []
  tags: [observability, uptime, production]

objective:
- Monitor application uptime and performance from external vantage points to ensure reliability

deliverables:
- External uptime monitoring (Pingdom, UptimeRobot, or Datadog Synthetics)
- Synthetic monitoring for critical user journeys
- Performance budget enforcement
- Status page for incident communication

steps:
1. Set up uptime monitoring:
   - Configure checks for homepage, API health, dashboard
   - Check from multiple regions (US East, US West, EU)
   - 1-minute interval checks
   - Alert on 2 consecutive failures
2. Implement synthetic monitoring:
   - Signup flow: homepage → signup → verify email
   - Login flow: login → dashboard → view alerts
   - Billing flow: dashboard → pricing → checkout (test mode)
   - DarkWatch flow: dashboard → darkwatch → add watchlist item
3. Set performance budgets:
   - LCP (Largest Contentful Paint) < 2.5s mobile, < 1.5s desktop
   - FID (First Input Delay) < 100ms (field data only; INP replaced FID as a Core Web Vital in March 2024)
   - CLS (Cumulative Layout Shift) < 0.1
   - TTFB (Time to First Byte) < 200ms
   - API response p95 < 200ms
4. Configure alerting:
   - Downtime alert via Slack/SMS
   - Performance degradation alert (LCP > 3s)
   - SSL certificate expiry alert (30 days before)
   - Domain expiry alert (30 days before)
5. Set up status page:
   - Use statuspage.io or instatus.com
   - Auto-update from monitoring checks
   - Subscribe users for incident notifications
   - Post incident updates and post-mortems

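The SSL certificate expiry alert from step 4 could be sketched as below. This is a minimal illustration using only the Python standard library, not the behaviour of any of the monitoring services named above; `fetch_not_after`, `days_until_expiry`, `should_warn`, and `WARN_DAYS` are hypothetical names introduced for this example.

```python
import socket
import ssl
from datetime import datetime, timezone

WARN_DAYS = 30  # assumption: mirrors the "30 days before" rule in step 4


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the certificate's notAfter string, e.g. 'Jan 21 00:00:00 2025 GMT'."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` (timezone-aware) and the certificate expiry time."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def should_warn(not_after: str, now: datetime, warn_days: int = WARN_DAYS) -> bool:
    """True once the certificate is within `warn_days` of expiring."""
    return days_until_expiry(not_after, now) <= warn_days
```

A scheduled job could call `should_warn(fetch_not_after("example.com"), datetime.now(timezone.utc))` once a day and forward a `True` result to the Slack/SMS channel; the hostname here is a placeholder.
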
tests:
- Integration: Verify monitoring catches simulated outage
- Performance: Confirm synthetic tests complete successfully
- Alert: Test alert channels with deliberate failure

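The simulated-outage test can be rehearsed without taking anything down by feeding a scripted sequence of check results into the alert rule. The sketch below assumes the "alert on 2 consecutive failures" rule from step 1; `ConsecutiveFailureAlerter` is a hypothetical class written for this illustration and does not correspond to any vendor's API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ConsecutiveFailureAlerter:
    """Raise one alert after `threshold` failed checks in a row."""

    threshold: int = 2
    _streak: int = field(default=0, init=False)
    _alerting: bool = field(default=False, init=False)

    def record(self, check_ok: bool) -> Optional[str]:
        """Feed one check result; return "alert", "recovered", or None."""
        if check_ok:
            was_alerting = self._alerting
            self._streak = 0
            self._alerting = False
            return "recovered" if was_alerting else None
        self._streak += 1
        if self._streak >= self.threshold and not self._alerting:
            self._alerting = True  # alert once per outage, not once per check
            return "alert"
        return None


# Simulated outage: one healthy check, three failures, then recovery.
alerter = ConsecutiveFailureAlerter()
events = [alerter.record(ok) for ok in (True, False, False, False, True)]
```

With 1-minute checks, the "alert" event on the second failed check is what makes the 2-minute detection target reachable; a single failed check alone produces no event.
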
acceptance_criteria:
- Uptime checks running every 60 seconds from 3+ regions
- 99.9% uptime SLA measured over 30 days
- Synthetic tests covering signup, login, and core flows
- Performance budget alerts for LCP > 2.5s
- Status page accessible and auto-updating
- SSL certificate expiry alert 30 days in advance

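To make the 99.9% criterion concrete, the arithmetic can be written out as below. This is a rough sketch that assumes uptime is computed from the share of successful 60-second checks at a single vantage point; the function names are hypothetical, and a real monitoring service may aggregate regions differently.

```python
def allowed_downtime_minutes(sla_percent: float, window_days: int = 30) -> float:
    """Downtime budget, in minutes, for an SLA over the given window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def measured_uptime_percent(total_checks: int, failed_checks: int) -> float:
    """Uptime as the percentage of checks that succeeded."""
    return 100 * (total_checks - failed_checks) / total_checks


# 30 days of 1-minute checks is 43,200 checks; 99.9% leaves roughly
# 43.2 minutes of downtime, so 43 failed checks pass and 44 do not.
budget = allowed_downtime_minutes(99.9)
passes = measured_uptime_percent(43_200, 43) >= 99.9
fails = measured_uptime_percent(43_200, 44) >= 99.9
```
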
validation:
- Simulate outage → alert received within 2 minutes
- Check status page → shows incident with timeline
- Run synthetic test → completes in <30 seconds
- Lighthouse CI shows all metrics within budget

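The "within budget" check can be expressed as a simple comparison against the thresholds from step 3. The sketch below is independent of Lighthouse CI's own assertion configuration; the metric keys (`lcp_ms`, `fid_ms`, `cls`, `ttfb_ms`) and the `budget_violations` helper are hypothetical names chosen for this example.

```python
# Budgets copied from step 3; times are in milliseconds, CLS is unitless.
BUDGETS = {
    "mobile": {"lcp_ms": 2500, "fid_ms": 100, "cls": 0.1, "ttfb_ms": 200},
    "desktop": {"lcp_ms": 1500, "fid_ms": 100, "cls": 0.1, "ttfb_ms": 200},
}


def budget_violations(measured: dict, form_factor: str) -> dict:
    """Return {metric: (measured, budget)} for each metric not under budget.

    Step 3 states every budget as a strict "<", so a value equal to the
    budget counts as a violation. Metrics without a budget are ignored.
    """
    budget = BUDGETS[form_factor]
    return {
        name: (value, budget[name])
        for name, value in measured.items()
        if name in budget and value >= budget[name]
    }


mobile_run = {"lcp_ms": 2700, "cls": 0.05, "ttfb_ms": 180}
violations = budget_violations(mobile_run, "mobile")
```

An empty result means the run is within budget; here the mobile run reports only LCP, measured at 2700 ms against a 2500 ms budget.
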
notes:
- UptimeRobot free tier: 50 monitors, 5-minute intervals (too coarse for the 1-minute checks in step 1)
- Pingdom is more reliable but paid
- Consider Checkly for JavaScript-based synthetic monitoring