Kordant/tasks/core-services-implementation/05-darkwatch-scheduler.md

# 05. Periodic Scan Scheduling, WebSocket Progress, and Alert Deduplication

meta:
  id: core-services-05
  feature: core-services-implementation
  priority: P1
  depends_on: [core-services-03, core-services-04]
  tags: [darkwatch, scheduler, websocket, real-time, deduplication, alerts]

objective:
- Make DarkWatch continuously useful by scheduling periodic scans, providing real-time progress via WebSocket, and eliminating alert fatigue through intelligent deduplication.

deliverables:
- Cron-based scan scheduler with configurable frequency per tier
- Real-time scan progress updates over WebSocket (extending the existing `websocket.ts`)
- Alert cooldown periods to prevent duplicate notifications
- Digest mode: batch low-priority alerts into daily/weekly summaries
- Scan history and metrics dashboard data

steps:
1. Implement cron job scheduler in `jobs/handlers/darkwatch.scan.ts`:
   - Daily scans for active subscriptions
   - Respect tier limits (Shield: HIBP only, daily; Guard+: full suite, weekly)
2. Add a `scanFrequency` field to the subscription schema (`daily`, `weekly`, or `monthly`)
3. Wire WebSocket push from existing `websocket.ts` into scan engine:
   - Emit `scan:started`, `scan:progress` (completedSources/totalSources), `scan:completed` events
   - Client dashboard subscribes to user-specific scan events
4. Enhance alert deduplication beyond existing exposure dedup:
   - Add `alertCooldownHours` per alert type (e.g., 24h for same breach, 72h for property changes)
   - Track lastAlertSentAt per (userId, alertType, source) tuple
   - Don't create new alerts during cooldown unless severity increases
5. Implement digest mode:
   - Low-priority alerts (info) batched into daily digest email
   - Warning/critical alerts sent immediately via push + email
   - User preference: immediate vs. digest per severity level
6. Add scan metrics:
   - Store scan duration, sources checked, exposures found, alerts generated
   - Aggregate for dashboard "threat score" calculation
7. Implement scan failure recovery:
   - Partial scan results saved even if one source fails
   - Failed sources retried individually in next scan window
8. Add a per-user concurrency limit: at most 1 scan runs at a time, and subsequent requests are queued
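
The one-scan-per-user rule in step 8 can be sketched as a per-user promise chain. This is a minimal in-process illustration, not the project's implementation: `PerUserScanQueue` and `enqueue` are hypothetical names, and a multi-instance deployment would need shared state (for example in Redis) instead of a local `Map`.

```typescript
// Hypothetical helper: serializes scans per user so at most one runs at a time.
class PerUserScanQueue {
  // Tail of each user's promise chain; the next scan waits on it.
  private tails = new Map<string, Promise<unknown>>();

  enqueue<T>(userId: string, task: () => Promise<T>): Promise<T> {
    const previous: Promise<unknown> =
      this.tails.get(userId) ?? Promise.resolve();

    // Start this scan only after the user's previous scan has settled.
    const run = previous.then(() => task());

    // The stored tail never rejects, so a failed scan does not block later ones.
    const tail = run.catch(() => undefined);
    this.tails.set(userId, tail);

    // Drop the entry once this user's queue drains, so the Map does not grow forever.
    void tail.then(() => {
      if (this.tails.get(userId) === tail) this.tails.delete(userId);
    });

    // The caller still sees the real result (or error) of its own scan.
    return run;
  }
}
```

Scans for different users still run concurrently; only requests that share a `userId` wait on each other.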

tests:
- Unit: Verify cron expression parsing, cooldown logic, digest batching
- Integration: Trigger scheduled scan, verify WebSocket events emitted in correct order
- E2E: Start scan from dashboard → watch progress bar → receive completion notification
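
For the integration test, an ordering check over captured events might look like the sketch below. The event names come from step 3; the payload shapes and the helper names `progressPercent` and `findOrderViolation` are assumptions made for illustration.

```typescript
// Event names are from the task; payload fields are assumed, not confirmed.
type ScanEvent =
  | { type: "scan:started"; scanId: string; totalSources: number }
  | { type: "scan:progress"; scanId: string; completedSources: number; totalSources: number }
  | { type: "scan:completed"; scanId: string };

// Percentage derived from the completedSources/totalSources counters.
function progressPercent(completedSources: number, totalSources: number): number {
  if (totalSources <= 0) return 0;
  return Math.min(100, Math.round((completedSources / totalSources) * 100));
}

// Returns null when the sequence is well ordered, otherwise a description
// of the first violation found.
function findOrderViolation(events: ScanEvent[]): string | null {
  if (events.length === 0) return "no events received";
  if (events[0].type !== "scan:started") return "first event must be scan:started";
  if (events[events.length - 1].type !== "scan:completed") {
    return "last event must be scan:completed";
  }
  let previousCompleted = 0;
  // Everything between the first and last event must be non-decreasing progress.
  for (let i = 1; i < events.length - 1; i++) {
    const event = events[i];
    if (event.type !== "scan:progress") return `unexpected ${event.type} at index ${i}`;
    if (event.completedSources < previousCompleted) {
      return `progress went backwards at index ${i}`;
    }
    previousCompleted = event.completedSources;
  }
  return null;
}
```

A test would collect the messages received on the socket during one scheduled scan and assert that `findOrderViolation` returns `null`.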

acceptance_criteria:
- [ ] Scans run automatically on schedule without manual trigger (cron job)
- [ ] WebSocket pushes real-time progress: `scan:progress` events carry completedSources/totalSources, from which the client derives percentage complete
- [ ] Only one scan runs per user at a time; additional requests are queued
- [ ] Duplicate alerts are suppressed during cooldown period (configurable per type)
- [ ] Info-level alerts are batched into daily digest; warning/critical sent immediately
- [ ] Scan history is persisted and visible in dashboard (last scan date, sources checked, findings)
- [ ] Failed sources don't fail entire scan — partial results are saved
- [ ] Dashboard threat score updates automatically after each scan completion
- [ ] Free tier gets weekly scans; paid tiers get daily scans
- [ ] No duplicate notifications for same exposure across multiple scans
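
The partial-results criterion above can be illustrated with a scan loop that isolates each source's failure. This is a sketch under assumptions: `ScanSource`, `ScanOutcome`, and `runScan` are hypothetical names, and exposures are reduced to plain strings for brevity.

```typescript
// Hypothetical source interface; the real scan engine's types may differ.
interface ScanSource {
  name: string;
  check(userId: string): Promise<string[]>; // resolves to exposure identifiers
}

interface ScanOutcome {
  exposures: string[];
  succeededSources: string[];
  failedSources: string[]; // candidates for individual retry in the next scan window
}

async function runScan(userId: string, sources: ScanSource[]): Promise<ScanOutcome> {
  // Each source is wrapped in its own try/catch, so one failure cannot
  // reject the whole batch.
  const results = await Promise.all(
    sources.map(async (source) => {
      try {
        return { name: source.name, ok: true, exposures: await source.check(userId) };
      } catch {
        return { name: source.name, ok: false, exposures: [] as string[] };
      }
    })
  );

  const outcome: ScanOutcome = { exposures: [], succeededSources: [], failedSources: [] };
  for (const result of results) {
    if (result.ok) {
      outcome.exposures.push(...result.exposures);
      outcome.succeededSources.push(result.name);
    } else {
      outcome.failedSources.push(result.name);
    }
  }
  return outcome;
}
```

The caller can persist `exposures` immediately and carry `failedSources` forward to the next scan window.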

validation:
- Run the cron job manually: `bun run job:darkwatch:scan`, then verify that the scan completes and exposures are created
- Connect to WebSocket: `wscat -c ws://localhost:3000/ws`, subscribe to scan events
- Check dashboard: Scan progress bar animates during active scan, threat score updates after
- Test cooldown: Trigger the same scan twice in quick succession and verify that the second scan creates no duplicate alerts
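
The behaviour the cooldown check exercises (step 4) can be sketched as follows. The class name, the alert type keys `breach` and `property_change`, and the in-memory `Map` are assumptions for illustration; the 24h and 72h values are the examples given in step 4, and a real implementation would persist `lastAlertSentAt` rather than hold it in process memory.

```typescript
type Severity = "info" | "warning" | "critical";

const SEVERITY_RANK: Record<Severity, number> = { info: 0, warning: 1, critical: 2 };

// Example cooldowns from step 4; the alert type keys are assumed names.
const ALERT_COOLDOWN_HOURS: Record<string, number> = {
  breach: 24,
  property_change: 72,
};

interface LastAlert {
  sentAt: number; // epoch milliseconds
  severity: Severity;
}

class AlertCooldownTracker {
  // Keyed by the (userId, alertType, source) tuple.
  private last = new Map<string, LastAlert>();

  shouldSend(
    userId: string,
    alertType: string,
    source: string,
    severity: Severity,
    now: number = Date.now()
  ): boolean {
    const key = JSON.stringify([userId, alertType, source]);
    const previous = this.last.get(key);
    // Alert types without a configured cooldown are never suppressed.
    const cooldownMs = (ALERT_COOLDOWN_HOURS[alertType] ?? 0) * 60 * 60 * 1000;

    const inCooldown = previous !== undefined && now - previous.sentAt < cooldownMs;
    const escalated =
      previous !== undefined && SEVERITY_RANK[severity] > SEVERITY_RANK[previous.severity];

    // Suppress repeats during cooldown unless the severity has increased.
    if (inCooldown && !escalated) return false;

    this.last.set(key, { sentAt: now, severity });
    return true;
  }
}
```

Passing `now` explicitly keeps the logic testable without waiting for real time to pass.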

notes:
- The existing `scanStates` Map in `darkwatch.service.ts` is in-memory — move to Redis for multi-instance safety
- WebSocket infrastructure exists at `websocket.ts` — extend it for scan-specific events
- The scheduler directory (`scheduler/`) currently only has Dockerfiles — this task creates actual job logic
- Consider using Honker (Rust queue) for scan job distribution once it's production-ready
- Alert fatigue is a real churn driver — aggressive deduplication is a competitive advantage
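
One way to prepare the `scanStates` move to Redis is to hide the Map behind an async store interface first, so that a Redis-backed class can replace it without touching callers. The sketch below is an assumption throughout: the fields actually held in `scanStates` are not shown in this task, and `ScanState`, `ScanStateStore`, and `InMemoryScanStateStore` are hypothetical names.

```typescript
// Hypothetical state shape; the real fields in `scanStates` may differ.
interface ScanState {
  scanId: string;
  status: "running" | "completed" | "failed";
  completedSources: number;
  totalSources: number;
}

// Async on purpose: a Redis-backed implementation needs network round trips.
interface ScanStateStore {
  get(userId: string): Promise<ScanState | undefined>;
  set(userId: string, state: ScanState): Promise<void>;
  delete(userId: string): Promise<void>;
}

// Drop-in equivalent of the current in-memory Map, useful for tests
// and single-instance development.
class InMemoryScanStateStore implements ScanStateStore {
  private states = new Map<string, ScanState>();

  async get(userId: string): Promise<ScanState | undefined> {
    return this.states.get(userId);
  }

  async set(userId: string, state: ScanState): Promise<void> {
    // Store a copy so later mutation by the caller does not leak into the store,
    // matching what a serialized (e.g. JSON) backend would do.
    this.states.set(userId, { ...state });
  }

  async delete(userId: string): Promise<void> {
    this.states.delete(userId);
  }
}
```

A Redis-backed implementation would satisfy the same interface by serializing `ScanState` (for example as JSON) under a per-user key.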