Kordant/tasks/core-services-implementation/05-darkwatch-scheduler.md
2026-05-31 22:03:18 -04:00

05. Periodic Scan Scheduling, WebSocket Progress, and Alert Deduplication

meta:

  • id: core-services-05
  • feature: core-services-implementation
  • priority: P1
  • depends_on: [core-services-03, core-services-04]
  • tags: [darkwatch, scheduler, websocket, real-time, deduplication, alerts]

objective:

  • Make DarkWatch continuously useful by scheduling periodic scans, providing real-time progress via WebSocket, and eliminating alert fatigue through intelligent deduplication.

deliverables:

  • Cron-based scan scheduler with configurable frequency per tier
  • Real-time scan progress updates over WebSocket (extending the existing websocket.ts)
  • Alert cooldown periods to prevent duplicate notifications
  • Digest mode: batch low-priority alerts into daily/weekly summaries
  • Scan history and metrics dashboard data
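
One way the scheduler deliverable could decide which subscriptions to scan on each tick is sketched below. It is a minimal illustration, not the project's implementation: the `Subscription` shape, the `selectDueSubscriptions` helper, and the 30-day approximation for `monthly` are all assumptions; only the `scanFrequency` values (daily, weekly, monthly) come from this task.

```typescript
// Hypothetical shapes for illustration; only the scanFrequency values come from the task.
type ScanFrequency = "daily" | "weekly" | "monthly";

interface Subscription {
  userId: string;
  active: boolean;
  scanFrequency: ScanFrequency;
  lastScanAt: number | null; // epoch milliseconds, null if never scanned
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Minimum gap between scans. "monthly" is approximated as 30 days (assumption).
const MIN_INTERVAL_MS: Record<ScanFrequency, number> = {
  daily: DAY_MS,
  weekly: 7 * DAY_MS,
  monthly: 30 * DAY_MS,
};

// Returns the active subscriptions whose last scan is old enough to scan again.
function selectDueSubscriptions(subs: Subscription[], now: number): Subscription[] {
  return subs.filter((sub) => {
    if (!sub.active) return false;
    if (sub.lastScanAt === null) return true; // never scanned: always due
    return now - sub.lastScanAt >= MIN_INTERVAL_MS[sub.scanFrequency];
  });
}
```

With this shape, a single frequent cron trigger can call the helper and enqueue one scan per returned subscription, so per-tier frequency stays a data concern rather than a separate cron entry per tier.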

steps:

  1. Implement cron job scheduler in jobs/handlers/darkwatch.scan.ts:
    • Daily scans for active subscriptions
    • Respects tier limits (Shield = HIBP only daily, Guard+ = full suite weekly)
  2. Add scanFrequency field to subscription schema (daily, weekly, monthly)
  3. Wire WebSocket push from existing websocket.ts into scan engine:
    • Emit scan:started, scan:progress (completedSources/totalSources), scan:completed events
    • Client dashboard subscribes to user-specific scan events
  4. Enhance alert deduplication beyond existing exposure dedup:
    • Add alertCooldownHours per alert type (e.g., 24h for same breach, 72h for property changes)
    • Track lastAlertSentAt per (userId, alertType, source) tuple
    • Don't create new alerts during cooldown unless severity increases
  5. Implement digest mode:
    • Low-priority alerts (info) batched into daily digest email
    • Warning/critical alerts sent immediately via push + email
    • User preference: immediate vs. digest per severity level
  6. Add scan metrics:
    • Store scan duration, sources checked, exposures found, alerts generated
    • Aggregate for dashboard "threat score" calculation
  7. Implement scan failure recovery:
    • Partial scan results saved even if one source fails
    • Failed sources retried individually in next scan window
  8. Add a per-user concurrency limit: at most one scan running per user at a time, with subsequent requests queued
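
The cooldown rule in step 4 can be sketched as a single decision function. This is a minimal sketch under stated assumptions: the type names, the `shouldSendAlert` helper, the alert type keys (`breach`, `property_change`), and the 24-hour fallback are hypothetical; the cooldown values, the (userId, alertType, source) key, and the severity-escalation exception come from the step itself.

```typescript
// Hypothetical types; names are illustrative and not taken from the codebase.
type Severity = "info" | "warning" | "critical";

interface AlertCandidate {
  userId: string;
  alertType: string;
  source: string;
  severity: Severity;
}

interface CooldownEntry {
  lastAlertSentAt: number; // epoch milliseconds
  severity: Severity;      // severity of the alert that was last sent
}

const SEVERITY_RANK: Record<Severity, number> = { info: 0, warning: 1, critical: 2 };
const HOUR_MS = 60 * 60 * 1000;

// Example windows from step 4; the alert type keys are assumed names.
const alertCooldownHours = new Map<string, number>([
  ["breach", 24],
  ["property_change", 72],
]);
const DEFAULT_COOLDOWN_HOURS = 24; // assumed fallback for unlisted types

// One entry per (userId, alertType, source) tuple.
function cooldownKey(a: AlertCandidate): string {
  return JSON.stringify([a.userId, a.alertType, a.source]);
}

// Returns true and records the send when the alert should go out;
// returns false when it falls inside the cooldown without a severity increase.
function shouldSendAlert(
  store: Map<string, CooldownEntry>,
  alert: AlertCandidate,
  now: number,
): boolean {
  const key = cooldownKey(alert);
  const previous = store.get(key);
  if (previous !== undefined) {
    const hours = alertCooldownHours.get(alert.alertType) ?? DEFAULT_COOLDOWN_HOURS;
    const inCooldown = now - previous.lastAlertSentAt < hours * HOUR_MS;
    const escalated = SEVERITY_RANK[alert.severity] > SEVERITY_RANK[previous.severity];
    if (inCooldown && !escalated) return false;
  }
  store.set(key, { lastAlertSentAt: now, severity: alert.severity });
  return true;
}
```

The in-memory `Map` stands in for whatever persistent store ends up holding `lastAlertSentAt`; the decision logic does not depend on where the entries live.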

tests:

  • Unit: Verify cron expression parsing, cooldown logic, digest batching
  • Integration: Trigger scheduled scan, verify WebSocket events emitted in correct order
  • E2E: Start scan from dashboard → watch progress bar → receive completion notification
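
The digest batching named in the unit tests is easiest to verify if it is a pure function. The sketch below shows one possible shape; `routeAlerts`, `Alert`, and the preference map are hypothetical names, while the defaults (info batched, warning and critical sent immediately, user override per severity) follow step 5.

```typescript
// Hypothetical types for illustration only.
type Severity = "info" | "warning" | "critical";
type DeliveryMode = "immediate" | "digest";

interface Alert {
  id: string;
  userId: string;
  severity: Severity;
}

type UserPreferences = Partial<Record<Severity, DeliveryMode>>;

// Defaults from step 5: info goes to the digest, warning and critical go out immediately.
const DEFAULT_MODE: Record<Severity, DeliveryMode> = {
  info: "digest",
  warning: "immediate",
  critical: "immediate",
};

interface RoutedAlerts {
  immediate: Alert[];                 // send now via push + email
  digestByUser: Map<string, Alert[]>; // hold for the next digest email
}

// Splits alerts into immediate sends and per-user digest batches,
// honoring any per-severity override the user has set.
function routeAlerts(
  alerts: Alert[],
  preferences: Map<string, UserPreferences>,
): RoutedAlerts {
  const immediate: Alert[] = [];
  const digestByUser = new Map<string, Alert[]>();
  for (const alert of alerts) {
    const override = preferences.get(alert.userId)?.[alert.severity];
    const mode = override ?? DEFAULT_MODE[alert.severity];
    if (mode === "immediate") {
      immediate.push(alert);
      continue;
    }
    const batch = digestByUser.get(alert.userId) ?? [];
    batch.push(alert);
    digestByUser.set(alert.userId, batch);
  }
  return { immediate, digestByUser };
}
```

A unit test can then feed in a mixed list of alerts and assert on the two outputs without touching email or push delivery.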

acceptance_criteria:

  • Scans run automatically on schedule without manual trigger (cron job)
  • WebSocket pushes real-time progress: scan:progress events carry completedSources/totalSources, from which the client derives percentage complete
  • Only one scan runs per user at a time; additional requests are queued
  • Duplicate alerts are suppressed during cooldown period (configurable per type)
  • Info-level alerts are batched into daily digest; warning/critical sent immediately
  • Scan history is persisted and visible in dashboard (last scan date, sources checked, findings)
  • Failed sources don't fail entire scan — partial results are saved
  • Dashboard threat score updates automatically after each scan completion
  • Free tier gets weekly scans; paid tiers get daily scans
  • No duplicate notifications for same exposure across multiple scans
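
The "one scan per user, queue the rest" criterion can be illustrated with a small bookkeeping class. This is a sketch only: `UserScanGate` and its methods are hypothetical, and the in-process `Set` and `Map` would need a shared store to hold across multiple instances.

```typescript
// Hypothetical helper: tracks which users have a scan running and how many requests wait.
class UserScanGate {
  private readonly active = new Set<string>();
  private readonly pending = new Map<string, number>();

  // Called when a scan is requested. "started" means the caller may run it now;
  // "queued" means a scan is already running for this user.
  request(userId: string): "started" | "queued" {
    if (!this.active.has(userId)) {
      this.active.add(userId);
      return "started";
    }
    this.pending.set(userId, (this.pending.get(userId) ?? 0) + 1);
    return "queued";
  }

  // Called when a scan finishes. Returns true if a queued request should start next,
  // in which case the user stays marked as active.
  complete(userId: string): boolean {
    const waiting = this.pending.get(userId) ?? 0;
    if (waiting > 0) {
      if (waiting === 1) this.pending.delete(userId);
      else this.pending.set(userId, waiting - 1);
      return true;
    }
    this.active.delete(userId);
    return false;
  }
}
```

For example, two back-to-back `request("u1")` calls return `"started"` then `"queued"`, and the following `complete("u1")` returns `true` to signal that the queued scan should begin.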

validation:

  • Run cron job manually: bun run job:darkwatch:scan, verify scan completes and exposures created
  • Connect to WebSocket: wscat -c ws://localhost:3000/ws, subscribe to scan events
  • Check dashboard: Scan progress bar animates during active scan, threat score updates after
  • Test cooldown: Trigger the same scan twice in quick succession, verify the second scan doesn't create duplicate alerts
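
When validating over WebSocket, it helps to know the event sequence to expect. The sketch below models it with an `emit` callback in place of the real socket: the event names and the completedSources/totalSources fields come from step 3, while the payload shapes, the `percent` field, `runScan`, and `ScanSource` are assumptions. Sources are synchronous here only to keep the sketch short; real lookups would be asynchronous.

```typescript
// Assumed payload shapes; only the event names and source counters come from the task.
type ScanEvent =
  | { type: "scan:started"; scanId: string; totalSources: number }
  | { type: "scan:progress"; scanId: string; completedSources: number; totalSources: number; percent: number }
  | { type: "scan:completed"; scanId: string; exposuresFound: number; failedSources: string[] };

interface ScanSource {
  name: string;
  check: () => number; // returns the number of exposures found; throws on failure
}

// Runs every source, emitting progress after each one. A failing source is recorded
// and skipped so that results from the other sources are still kept.
function runScan(scanId: string, sources: ScanSource[], emit: (e: ScanEvent) => void): void {
  const totalSources = sources.length;
  emit({ type: "scan:started", scanId, totalSources });

  let exposuresFound = 0;
  const failedSources: string[] = [];
  sources.forEach((source, index) => {
    try {
      exposuresFound += source.check();
    } catch {
      failedSources.push(source.name); // candidate for retry in the next scan window
    }
    const completedSources = index + 1; // failed sources still count toward progress
    const percent = Math.round((completedSources / totalSources) * 100);
    emit({ type: "scan:progress", scanId, completedSources, totalSources, percent });
  });

  emit({ type: "scan:completed", scanId, exposuresFound, failedSources });
}
```

Collecting the emitted events into an array gives an integration test something concrete to assert on: one scan:started, one scan:progress per source, then a single scan:completed.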

notes:

  • The existing scanStates Map in darkwatch.service.ts is in-memory — move to Redis for multi-instance safety
  • WebSocket infrastructure exists at websocket.ts — extend it for scan-specific events
  • The scheduler directory (scheduler/) currently only has Dockerfiles — this task creates actual job logic
  • Consider using Honker (Rust queue) for scan job distribution once it's production-ready
  • Alert fatigue is a real churn driver — aggressive deduplication is a competitive advantage