get to prod tasks

2026-05-26 16:06:34 -04:00
parent 04e839640f
commit 5214412fff
105 changed files with 7447 additions and 38 deletions

@@ -0,0 +1,69 @@
# 12. Uptime & Performance Monitoring
meta:
id: web-production-12
feature: web-production
priority: P2
depends_on: []
tags: [observability, uptime, production]
objective:
- Monitor application uptime and performance from external vantage points to ensure reliability
deliverables:
- External uptime monitoring (Pingdom, UptimeRobot, or Datadog Synthetics)
- Synthetic monitoring for critical user journeys
- Performance budget enforcement
- Status page for incident communication
steps:
1. Set up uptime monitoring:
- Configure checks for homepage, API health, dashboard
- Check from multiple regions (US East, US West, EU)
- 1-minute interval checks
- Alert on 2 consecutive failures
2. Implement synthetic monitoring:
- Signup flow: homepage → signup → verify email
- Login flow: login → dashboard → view alerts
- Billing flow: dashboard → pricing → checkout (test mode)
- DarkWatch flow: dashboard → darkwatch → add watchlist item
3. Set performance budgets:
- LCP (Largest Contentful Paint) < 2.5s mobile, < 1.5s desktop
- INP (Interaction to Next Paint) < 200ms (the Core Web Vital that replaced FID, First Input Delay)
- CLS (Cumulative Layout Shift) < 0.1
- TTFB (Time to First Byte) < 200ms
- API response p95 < 200ms
4. Configure alerting:
- Downtime alert via Slack/SMS
- Performance degradation alert (LCP > 2.5s, matching the mobile budget)
- SSL certificate expiry alert (30 days before)
- Domain expiry alert (30 days before)
5. Set up status page:
- Use statuspage.io or instatus.com
- Auto-update from monitoring checks
- Subscribe users for incident notifications
- Post incident updates and post-mortems
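The certificate expiry alert in step 4 can be sketched as below. This is a minimal illustration, not a prescribed implementation: the function names and the `ALERT_WINDOW_DAYS` constant are hypothetical, and the input is assumed to be the `notAfter` string in the format returned by Python's `ssl.SSLSocket.getpeercert()`. In practice the chosen monitoring provider will likely handle this check itself.

```python
import ssl
from datetime import datetime, timezone

# Assumption: mirrors the "30 days before" window from step 4.
ALERT_WINDOW_DAYS = 30


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days between `now` and the certificate expiry.

    `not_after` is a certificate timestamp such as "Jun  1 12:00:00 2030 GMT".
    `now` must be a timezone-aware datetime.
    """
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    expires_at = datetime.fromtimestamp(expiry_epoch, tz=timezone.utc)
    return (expires_at - now).total_seconds() / 86400


def should_alert(not_after: str, now: datetime) -> bool:
    """True once the certificate is within the alert window (or already expired)."""
    return days_until_expiry(not_after, now) <= ALERT_WINDOW_DAYS
```

Passing `now` explicitly keeps the check deterministic, which makes it easy to test the 30-day boundary without waiting for a real certificate to approach expiry.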
tests:
- Integration: Verify monitoring catches simulated outage
- Performance: Confirm synthetic tests complete successfully
- Alert: Test alert channels with deliberate failure
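To make the simulated-outage test concrete, the "alert on 2 consecutive failures" rule from step 1 can be modeled as a small state machine and fed a scripted sequence of check results. This is a sketch under assumptions: the class name, the `"alert"` / `"recover"` event strings, and the `simulate` helper are all hypothetical and do not correspond to any monitoring provider's API.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: threshold taken from step 1 ("Alert on 2 consecutive failures").
FAILURE_THRESHOLD = 2


@dataclass
class ConsecutiveFailureRule:
    threshold: int = FAILURE_THRESHOLD
    streak: int = 0         # current run of failed checks
    alerting: bool = False  # True while an alert is open

    def record(self, check_passed: bool) -> Optional[str]:
        """Feed one check result; return "alert", "recover", or None."""
        if check_passed:
            self.streak = 0
            if self.alerting:
                self.alerting = False
                return "recover"
            return None
        self.streak += 1
        # Fire once when the threshold is reached, not on every later failure.
        if self.streak >= self.threshold and not self.alerting:
            self.alerting = True
            return "alert"
        return None


def simulate(results: List[bool]) -> List[Optional[str]]:
    """Run a scripted sequence of check results through a fresh rule."""
    rule = ConsecutiveFailureRule()
    return [rule.record(passed) for passed in results]
```

A single failed check surrounded by passes produces no event, which is the point of requiring two consecutive failures: it filters out one-off network blips while still opening an alert on the second failed check of a real outage.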
acceptance_criteria:
- Uptime monitoring checking every 60 seconds from 3+ regions
- 99.9% uptime SLA measured over 30 days
- Synthetic tests covering signup, login, and core flows
- Performance budget alerts for LCP > 2.5s
- Status page accessible and auto-updating
- SSL certificate expiry alert 30 days in advance
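The 99.9% target above translates into a concrete downtime allowance, which is useful when deciding how aggressive alerting needs to be. The arithmetic is sketched below; the function name is hypothetical.

```python
def allowed_downtime_minutes(sla_percent: float, window_days: int) -> float:
    """Minutes of downtime permitted by an uptime target over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


# 30 days is 43,200 minutes; 0.1% of that is about 43.2 minutes.
print(f"{allowed_downtime_minutes(99.9, 30):.1f} minutes")  # prints "43.2 minutes"
```

With 60-second checks, that allowance corresponds to roughly 43 failed checks per 30-day window, so a single prolonged incident can consume the entire budget.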
validation:
- Simulate outage → alert received within 2 minutes of the first failed check
- Check status page → shows incident with timeline
- Run synthetic test → completes in <30 seconds
- Lighthouse CI shows all metrics within budget
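The budget check in the last validation item can be illustrated with a small comparison function. This is only a sketch: the `BUDGETS` table covers a subset of the metrics from step 3 (LCP, CLS, TTFB), the metric keys and the `measured` dictionary shape are hypothetical, and none of this reflects Lighthouse CI's actual report or assertion format.

```python
from typing import Dict, List

# Assumption: limits copied from step 3; LCP differs by device profile.
BUDGETS: Dict[str, Dict[str, float]] = {
    "mobile": {"lcp_ms": 2500, "cls": 0.1, "ttfb_ms": 200},
    "desktop": {"lcp_ms": 1500, "cls": 0.1, "ttfb_ms": 200},
}


def budget_violations(measured: Dict[str, float], profile: str) -> List[str]:
    """List every measured metric that fails its budget for the given profile.

    Budgets are strict upper bounds ("< limit"), so a value equal to the
    limit counts as a violation. Metrics missing from `measured` are skipped.
    """
    budget = BUDGETS[profile]
    return [
        f"{name}: {measured[name]} >= {limit}"
        for name, limit in budget.items()
        if name in measured and measured[name] >= limit
    ]
```

For example, a page with an LCP of 2100 ms passes the mobile budget but fails the desktop one, so `budget_violations({"lcp_ms": 2100}, "desktop")` returns a single entry while the same call with `"mobile"` returns an empty list.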
notes:
- UptimeRobot free tier: 50 monitors, 5-minute intervals, so it cannot meet the 1-minute check interval in step 1
- Pingdom is more reliable but is a paid service
- Consider Checkly for synthetic monitoring scripted in JavaScript