Kordant/tasks/core-services-implementation/05-darkwatch-scheduler.md
2026-05-31 22:03:18 -04:00

# 05. Periodic Scan Scheduling, WebSocket Progress, and Alert Deduplication
meta:
id: core-services-05
feature: core-services-implementation
priority: P1
depends_on: [core-services-03, core-services-04]
tags: [darkwatch, scheduler, websocket, real-time, deduplication, alerts]
objective:
- Make DarkWatch continuously useful by scheduling periodic scans, reporting real-time scan progress over WebSocket, and reducing alert fatigue through cooldown-based deduplication and digest batching.
deliverables:
- Cron-based scan scheduler with configurable frequency per tier
- Real-time scan progress updates over WebSocket (building on the existing `websocket.ts`)
- Alert cooldown periods to prevent duplicate notifications
- Digest mode: batch low-priority alerts into daily/weekly summaries
- Scan history and metrics dashboard data
steps:
1. Implement cron job scheduler in `jobs/handlers/darkwatch.scan.ts`:
- Daily scans for active subscriptions
- Respects tier limits (Shield = HIBP only daily, Guard+ = full suite weekly)
2. Add `scanFrequency` field to subscription schema (daily, weekly, monthly)
3. Wire WebSocket push from existing `websocket.ts` into scan engine:
- Emit `scan:started`, `scan:progress` (`completedSources`/`totalSources`), and `scan:completed` events
- Client dashboard subscribes to user-specific scan events
4. Enhance alert deduplication beyond existing exposure dedup:
- Add `alertCooldownHours` per alert type (e.g., 24h for same breach, 72h for property changes)
- Track `lastAlertSentAt` per `(userId, alertType, source)` tuple
- Suppress new alerts during the cooldown window unless the severity increases
5. Implement digest mode:
- Low-priority alerts (info) batched into daily digest email
- Warning/critical alerts sent immediately via push + email
- User preference: immediate vs. digest per severity level
6. Add scan metrics:
- Store scan duration, sources checked, exposures found, alerts generated
- Aggregate for dashboard "threat score" calculation
7. Implement scan failure recovery:
- Partial scan results saved even if one source fails
- Failed sources retried individually in next scan window
8. Add a per-user concurrency limit: at most one running scan per user; queue subsequent requests
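
The cooldown rule in step 4 can be sketched as follows. This is a minimal in-memory illustration, not the existing DarkWatch code: `AlertCooldownTracker`, `AlertCandidate`, and the `breach` / `property_change` type names are hypothetical, and only `alertCooldownHours`, `lastAlertSentAt`, and the 24h/72h examples come from the task. A real implementation would presumably persist the entries rather than hold them in a `Map`.

```typescript
type Severity = "info" | "warning" | "critical";

const SEVERITY_RANK: Record<Severity, number> = { info: 0, warning: 1, critical: 2 };

// Hypothetical alert shape; the key fields mirror the (userId, alertType, source) tuple.
interface AlertCandidate {
  userId: string;
  alertType: string;
  source: string;
  severity: Severity;
}

interface CooldownEntry {
  lastAlertSentAt: number; // epoch milliseconds
  lastSeverity: Severity;
}

// Example values from the task: 24h for the same breach, 72h for property changes.
const alertCooldownHours: Record<string, number> = { breach: 24, property_change: 72 };

const HOUR_MS = 60 * 60 * 1000;

class AlertCooldownTracker {
  private entries = new Map<string, CooldownEntry>();

  private key(c: AlertCandidate): string {
    return JSON.stringify([c.userId, c.alertType, c.source]);
  }

  // True when no alert was sent for this tuple, the cooldown has elapsed,
  // or the new alert is more severe than the last one sent.
  shouldSend(c: AlertCandidate, now: number): boolean {
    const entry = this.entries.get(this.key(c));
    if (!entry) return true;
    // Alert types without a configured cooldown are never suppressed (assumption).
    const cooldownMs = (alertCooldownHours[c.alertType] ?? 0) * HOUR_MS;
    if (now - entry.lastAlertSentAt >= cooldownMs) return true;
    return SEVERITY_RANK[c.severity] > SEVERITY_RANK[entry.lastSeverity];
  }

  recordSent(c: AlertCandidate, now: number): void {
    this.entries.set(this.key(c), { lastAlertSentAt: now, lastSeverity: c.severity });
  }
}
```

Passing `now` explicitly, instead of calling `Date.now()` inside the tracker, keeps the cooldown logic deterministic for the unit tests listed below.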
tests:
- Unit: Verify cron expression parsing, cooldown logic, digest batching
- Integration: Trigger scheduled scan, verify WebSocket events emitted in correct order
- E2E: Start scan from dashboard → watch progress bar → receive completion notification
acceptance_criteria:
- [ ] Scans run automatically on schedule without manual trigger (cron job)
- [ ] WebSocket pushes real-time progress: `scan:progress` events carry `completedSources`/`totalSources`, from which the client derives percentage complete
- [ ] Only one scan runs per user at a time; additional requests are queued
- [ ] Duplicate alerts are suppressed during cooldown period (configurable per type)
- [ ] Info-level alerts are batched into daily digest; warning/critical sent immediately
- [ ] Scan history is persisted and visible in dashboard (last scan date, sources checked, findings)
- [ ] Failed sources don't fail entire scan — partial results are saved
- [ ] Dashboard threat score updates automatically after each scan completion
- [ ] Free tier gets weekly scans; paid tiers get daily scans
- [ ] No duplicate notifications for same exposure across multiple scans
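
The digest criterion above (info batched, warning/critical immediate, with a per-severity user preference) could be routed roughly like this. The sketch is illustrative only: `routeAlerts`, `DeliveryPreferences`, and the `Alert` shape are hypothetical names, and the defaults simply encode the behavior the criteria describe.

```typescript
type Severity = "info" | "warning" | "critical";
type DeliveryMode = "immediate" | "digest";

// Hypothetical preference shape: one delivery mode per severity level.
type DeliveryPreferences = Record<Severity, DeliveryMode>;

// Defaults taken from the task: info goes to the digest, the rest is sent immediately.
const DEFAULT_PREFERENCES: DeliveryPreferences = {
  info: "digest",
  warning: "immediate",
  critical: "immediate",
};

interface Alert {
  id: string;
  userId: string;
  severity: Severity;
}

interface RoutedAlerts {
  immediate: Alert[];           // send now via push + email
  digest: Map<string, Alert[]>; // userId -> alerts held for the next digest
}

function routeAlerts(
  alerts: Alert[],
  prefsByUser: Map<string, Partial<DeliveryPreferences>>,
): RoutedAlerts {
  const routed: RoutedAlerts = { immediate: [], digest: new Map() };
  for (const alert of alerts) {
    // User overrides win over the defaults; missing entries fall back to them.
    const prefs = { ...DEFAULT_PREFERENCES, ...prefsByUser.get(alert.userId) };
    if (prefs[alert.severity] === "immediate") {
      routed.immediate.push(alert);
    } else {
      const batch = routed.digest.get(alert.userId) ?? [];
      batch.push(alert);
      routed.digest.set(alert.userId, batch);
    }
  }
  return routed;
}
```

Grouping the digest side by `userId` means the daily digest job can send one email per map entry without re-querying which alerts belong together.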
validation:
- Run cron job manually: `bun run job:darkwatch:scan`, verify the scan completes and exposures are created
- Connect to WebSocket: `wscat -c ws://localhost:3000/ws`, subscribe to scan events
- Check dashboard: Scan progress bar animates during active scan, threat score updates after
- Test cooldown: Trigger the same scan twice in quick succession and verify the second scan does not create duplicate alerts
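
When validating the WebSocket step, it helps to know what event sequence to expect. The sketch below shows one plausible shape for the three events and how a percentage could be derived from `completedSources`/`totalSources`. The event names come from the task; `ScanProgressTracker`, `percentComplete`, and every payload field other than `completedSources` and `totalSources` are assumptions, and `emit` stands in for whatever push function `websocket.ts` actually exposes.

```typescript
type ScanEvent =
  | { type: "scan:started"; scanId: string; userId: string; totalSources: number }
  | {
      type: "scan:progress";
      scanId: string;
      userId: string;
      completedSources: number;
      totalSources: number;
      percent: number;
    }
  | { type: "scan:completed"; scanId: string; userId: string; failedSources: string[] };

function percentComplete(completedSources: number, totalSources: number): number {
  if (totalSources <= 0) return 100; // nothing to scan counts as done
  return Math.min(100, Math.round((completedSources / totalSources) * 100));
}

class ScanProgressTracker {
  private completedSources = 0;
  private readonly failedSources: string[] = [];
  private readonly scanId: string;
  private readonly userId: string;
  private readonly totalSources: number;
  private readonly emit: (event: ScanEvent) => void;

  constructor(scanId: string, userId: string, totalSources: number, emit: (event: ScanEvent) => void) {
    this.scanId = scanId;
    this.userId = userId;
    this.totalSources = totalSources;
    this.emit = emit;
    this.emit({ type: "scan:started", scanId, userId, totalSources });
  }

  // Call once per source, whether it succeeded or failed, so a failed
  // source still advances progress and the scan can finish with partial results.
  sourceFinished(source: string, ok: boolean): void {
    if (!ok) this.failedSources.push(source);
    this.completedSources += 1;
    this.emit({
      type: "scan:progress",
      scanId: this.scanId,
      userId: this.userId,
      completedSources: this.completedSources,
      totalSources: this.totalSources,
      percent: percentComplete(this.completedSources, this.totalSources),
    });
    if (this.completedSources === this.totalSources) {
      this.emit({
        type: "scan:completed",
        scanId: this.scanId,
        userId: this.userId,
        failedSources: [...this.failedSources],
      });
    }
  }
}
```

Carrying `failedSources` on the completion event is one way to let the next scan window know which sources to retry individually; the task does not specify where that list should live.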
notes:
- The existing `scanStates` Map in `darkwatch.service.ts` is in-memory — move to Redis for multi-instance safety
- WebSocket infrastructure exists at `websocket.ts` — extend it for scan-specific events
- The scheduler directory (`scheduler/`) currently only has Dockerfiles — this task creates actual job logic
- Consider using Honker (Rust queue) for scan job distribution once it's production-ready
- Alert fatigue is a real churn driver — aggressive deduplication is a competitive advantage
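
The one-scan-per-user rule from step 8 interacts with the note about `scanStates`: whatever tracks the running scan has to be shared across instances. The sketch below shows only the bookkeeping, in memory, under the assumption that scans are identified by a string id; `UserScanQueue` and its method names are hypothetical. For multi-instance safety the same two maps would need to live in Redis (or another shared store), as the note suggests.

```typescript
class UserScanQueue {
  private active = new Map<string, string>();    // userId -> running scanId
  private pending = new Map<string, string[]>(); // userId -> queued scanIds, FIFO

  // Starts the scan if the user has none running, otherwise queues it.
  request(userId: string, scanId: string): "started" | "queued" {
    if (!this.active.has(userId)) {
      this.active.set(userId, scanId);
      return "started";
    }
    const queue = this.pending.get(userId) ?? [];
    queue.push(scanId);
    this.pending.set(userId, queue);
    return "queued";
  }

  // Marks the user's running scan as finished and promotes the next queued
  // scan, if any. Returns the scanId that should start now, or undefined.
  complete(userId: string): string | undefined {
    const next = this.pending.get(userId)?.shift();
    if (next === undefined) {
      this.active.delete(userId);
      this.pending.delete(userId);
      return undefined;
    }
    this.active.set(userId, next);
    return next;
  }

  isRunning(userId: string): boolean {
    return this.active.has(userId);
  }
}
```

A caller would invoke `request` when a scan is triggered (by cron or from the dashboard) and `complete` when `scan:completed` fires, starting whatever scan id `complete` returns.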