get to prod tasks

2026-05-26 16:06:34 -04:00
parent 04e839640f
commit 5214412fff
105 changed files with 7447 additions and 38 deletions
--- a/tasks/web-production/01-security-headers-cors.md
+++ b/tasks/web-production/01-security-headers-cors.md
@@ -0,0 +1,61 @@
+# 01. Security Headers & CORS Configuration
+
+meta:
+  id: web-production-01
+  feature: web-production
+  priority: P1
+  depends_on: []
+  tags: [security, infrastructure, production]
+
+objective:
+- Implement comprehensive security headers and CORS configuration to protect against common web vulnerabilities
+
+deliverables:
+- Security headers middleware in web/src/middleware.ts or Nitro config
+- CORS configuration for API endpoints
+- Content Security Policy (CSP) headers
+- Remove X-Powered-By and other identifying headers
+
+steps:
+1. Add helmet-like security headers via Nitro hooks or Vite plugin:
+   - Strict-Transport-Security (HSTS)
+   - X-Content-Type-Options: nosniff
+   - X-Frame-Options: DENY
+   - X-XSS-Protection: 0 (legacy header; modern browsers removed the XSS auditor, so rely on CSP instead)
+   - Referrer-Policy: strict-origin-when-cross-origin
+   - Permissions-Policy for camera, microphone, geolocation
+2. Implement CSP header allowing only necessary sources:
+   - script-src: 'self', stripe.com, clerk.dev
+   - style-src: 'self', 'unsafe-inline' (needed for Tailwind)
+   - img-src: 'self', data:, blob:, gravatar.com
+   - connect-src: 'self', api endpoints, websocket URL
+   - frame-src: 'self', stripe.com (for Checkout)
+3. Configure CORS for /api/trpc endpoints:
+   - Allow origins: production domain, mobile app origins
+   - Allow methods: GET, POST
+   - Allow headers: Content-Type, Authorization, x-api-key
+   - Credentials: true
+4. Remove server-identifying headers (X-Powered-By, Server)
+5. Add tests verifying headers are present on all responses
+
+tests:
+- Unit: Test each header is present and correct value
+- Integration: Test API endpoints return correct CORS headers
+- Security scan: Use securityheaders.com or similar to verify A+ rating
+
+acceptance_criteria:
+- All security headers from steps 1 and 2 present on every HTTP response
+- CSP blocks inline scripts except those allowed by nonce or hash
+- CORS preflight requests handled correctly for API endpoints
+- SecurityHeaders.com scan returns A+ rating
+- No server version information leaked in headers
+
+validation:
+- Run `curl -I https://localhost:3000` and verify headers
+- Run automated security header scanner
+- Check browser dev tools Network tab for all response headers
+
+notes:
+- SolidStart/Nitro may require custom plugin for headers
+- CSP 'unsafe-inline' for styles is acceptable with Tailwind v4 but document the trade-off
+- Consider using nonce-based CSP once Tailwind supports it fully
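A minimal sketch of how steps 1 and 2 of task 01 might be expressed as a pure function that a Nitro hook or middleware could call. The names `CspDirectives`, `buildCsp`, and `buildSecurityHeaders` are hypothetical, the HSTS `max-age` is an assumed value, and `connect-src` is left out because the task does not name the API or WebSocket origins.

```typescript
// Hypothetical helper types and names; not taken from the repository.
type CspDirectives = Record<string, string[]>;

// Join directives into a single Content-Security-Policy header value.
function buildCsp(directives: CspDirectives): string {
  return Object.entries(directives)
    .map(([name, sources]) => `${name} ${sources.join(" ")}`)
    .join("; ");
}

// Build the header set listed in steps 1 and 2.
function buildSecurityHeaders(csp: CspDirectives): Record<string, string> {
  return {
    // Assumption: one-year max-age; the task does not specify a value.
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
    // Assumption: "0" disables the deprecated browser XSS filter,
    // which current OWASP guidance prefers over enabling it.
    "X-XSS-Protection": "0",
    "Referrer-Policy": "strict-origin-when-cross-origin",
    "Permissions-Policy": "camera=(), microphone=(), geolocation=()",
    "Content-Security-Policy": buildCsp(csp),
  };
}

// Source hosts copied from the task text; verify the exact script origins
// that Stripe and Clerk require before shipping.
const exampleCsp: CspDirectives = {
  "script-src": ["'self'", "stripe.com", "clerk.dev"],
  "style-src": ["'self'", "'unsafe-inline'"],
  "img-src": ["'self'", "data:", "blob:", "gravatar.com"],
  "frame-src": ["'self'", "stripe.com"],
};

const securityHeaders = buildSecurityHeaders(exampleCsp);
```

Keeping the header table in one pure function makes the unit test in the `tests:` section a plain equality check, independent of how Nitro or SolidStart ends up attaching the headers.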
--- a/tasks/web-production/02-rate-limiting-ddos.md
+++ b/tasks/web-production/02-rate-limiting-ddos.md
@@ -0,0 +1,58 @@
+# 02. Rate Limiting & DDoS Protection
+
+meta:
+  id: web-production-02
+  feature: web-production
+  priority: P1
+  depends_on: []
+  tags: [security, infrastructure, production]
+
+objective:
+- Implement robust rate limiting and DDoS protection beyond the basic in-memory tRPC middleware
+
+deliverables:
+- Redis-backed rate limiting for distributed deployment
+- Per-endpoint rate limit tiers
+- IP-based and user-based limiting
+- DDoS protection via Cloudflare or similar
+
+steps:
+1. Replace in-memory rate limit map with Redis-backed solution:
+   - Use ioredis or @upstash/ratelimit for distributed rate limiting
+   - Create web/src/server/lib/ratelimit.ts with configurable tiers
+2. Define rate limit tiers:
+   - Public endpoints (login, signup): 5 req/min per IP
+   - Authenticated API: 100 req/min per user
+   - Sensitive operations (password reset): 3 req/hour per email
+   - WebSocket connections: 1 per user, reconnect max 5/min
+   - Admin endpoints: 50 req/min per admin
+3. Add IP-based rate limiting at edge/Nitro level for anonymous traffic
+4. Configure Cloudflare (or alternative) for:
+   - DDoS protection
+   - Bot management
+   - Challenge pages for suspicious traffic
+5. Add rate limit response headers (X-RateLimit-Remaining, X-RateLimit-Reset)
+6. Implement sliding window algorithm for fairer limiting
+
+tests:
+- Unit: Test rate limiter correctly counts and resets
+- Integration: Flood endpoint with requests, verify 429 responses
+- Load: Use k6 or artillery to test limits under load
+
+acceptance_criteria:
+- Redis-backed rate limiting active on all endpoints
+- 429 responses include Retry-After header
+- Rate limits enforced per-IP, per-user, and per-endpoint
+- DDoS protection layer active at edge
+- No single IP can exceed 1000 req/min to any endpoint
+- Rate limit headers present on all API responses
+
+validation:
+- `ab -n 1000 -c 10` against login endpoint → 429s after limit
+- Verify Redis keys exist for rate limit counters
+- Check Cloudflare dashboard for blocked threats
+
+notes:
+- Current in-memory rate limit in web/src/server/api/utils.ts will not work across multiple server instances
+- Upstash Redis recommended for serverless deployments
+- Consider implementing token bucket for burst tolerance
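The sliding window algorithm from step 6 of task 02 can be sketched in memory as below. This is an illustration only: `SlidingWindowLimiter` is a hypothetical name, the clock is passed in so the logic is testable, and a production version would keep the timestamps in Redis (for example in a sorted set) instead of a `Map`, since the notes point out that in-memory state does not work across instances.

```typescript
interface LimitResult {
  allowed: boolean;
  remaining: number;     // candidate value for X-RateLimit-Remaining
  retryAfterSec: number; // candidate value for Retry-After on a 429
}

// Hypothetical in-memory sliding window log; one timestamp list per key.
class SlidingWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private hits = new Map<string, number[]>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  check(key: string, nowMs: number): LimitResult {
    const cutoff = nowMs - this.windowMs;
    // Keep only the requests that still fall inside the window.
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);

    if (recent.length >= this.limit) {
      this.hits.set(key, recent);
      const oldest = recent[0] ?? nowMs;
      // The caller may retry once the oldest request leaves the window.
      const retryAfterSec = Math.ceil((oldest + this.windowMs - nowMs) / 1000);
      return { allowed: false, remaining: 0, retryAfterSec };
    }

    recent.push(nowMs);
    this.hits.set(key, recent);
    return { allowed: true, remaining: this.limit - recent.length, retryAfterSec: 0 };
  }
}

// Tier from step 2: public endpoints, 5 requests per minute per IP.
const publicTier = new SlidingWindowLimiter(5, 60 * 1000);
```

A key such as `login:<ip>` or `api:<userId>` would select the tier's counter; the key format here is an assumption, not something the task defines.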
--- a/tasks/web-production/03-input-validation-xss.md
+++ b/tasks/web-production/03-input-validation-xss.md
@@ -0,0 +1,62 @@
+# 03. Input Validation & XSS Prevention Audit
+
+meta:
+  id: web-production-03
+  feature: web-production
+  priority: P1
+  depends_on: []
+  tags: [security, validation, production]
+
+objective:
+- Audit and harden all input validation to prevent XSS, injection attacks, and malformed data
+
+deliverables:
+- XSS prevention audit report
+- Input sanitization layer
+- HTML escaping on all user-generated content
+- SQL injection protection verification
+
+steps:
+1. Audit all tRPC routers for input validation gaps:
+   - Check web/src/server/api/routers/*.ts for missing valibot schemas
+   - Ensure all user inputs have strict type validation
+   - Add maxLength constraints to all string inputs
+2. Implement output escaping for user-generated content:
+   - Blog posts, user names, alert messages
+   - Use DOMPurify or similar on client-side rendering
+   - Escape HTML entities server-side before DB storage
+3. Audit database queries for SQL injection:
+   - Verify all queries use Drizzle parameterized queries
+   - Check raw SQL usage in jobs and services
+   - Ensure no string concatenation in SQL
+4. Add content validation for file uploads (if any):
+   - MIME type verification
+   - File size limits
+   - Scan for malware
+5. Implement request body size limits:
+   - 1MB max for JSON payloads
+   - 10MB max for file uploads
+6. Add tests for malformed input handling
+
+tests:
+- Unit: Test each router with XSS payloads, SQL injection attempts
+- Integration: Submit malicious inputs via API, verify safe handling
+- Security: Run OWASP ZAP or Burp Suite against app
+
+acceptance_criteria:
+- All tRPC inputs have strict valibot validation with bounds
+- User-generated content escaped before rendering
+- No SQL injection vectors in any query
+- XSS payloads rendered as plain text, not executed
+- Request body size limits enforced
+- OWASP ZAP scan shows no high/critical vulnerabilities
+
+validation:
+- Submit `<script>alert('xss')</script>` in all text fields → rendered safely
+- Submit SQL injection in search fields → no database errors
+- Run `pnpm audit` and address all high severity issues
+
+notes:
+- Valibot schemas already in use — expand them with stricter bounds
+- Consider using zod for more complex validation if valibot is limiting
+- Sanitize inputs at API boundary, not just client-side
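As an illustration of the escaping behaviour that the validation step in task 03 expects, a minimal HTML entity escaper could look like the sketch below. `escapeHtml` is a hypothetical helper, not an existing function in the codebase, and it covers only text placed in HTML element content or quoted attributes; rich HTML would still need a sanitizer such as DOMPurify, as the steps suggest.

```typescript
// Characters that must be encoded before text is placed into HTML.
const HTML_ESCAPES: Record<string, string> = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Replace each special character with its entity so payloads render as text.
function escapeHtml(input: string): string {
  return input.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch] ?? ch);
}
```

With this helper, the payload from the validation list comes out as `&lt;script&gt;alert(&#39;xss&#39;)&lt;/script&gt;`, which a browser displays literally instead of executing.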
--- a/tasks/web-production/04-auth-session-hardening.md
+++ b/tasks/web-production/04-auth-session-hardening.md
@@ -0,0 +1,71 @@
+# 04. Authentication & Session Security Hardening
+
+meta:
+  id: web-production-04
+  feature: web-production
+  priority: P1
+  depends_on: []
+  tags: [security, auth, production]
+
+objective:
+- Harden authentication and session management to prevent session hijacking, fixation, and brute force attacks
+
+deliverables:
+- Secure session configuration
+- JWT hardening
+- Brute force protection
+- Session invalidation on logout
+- Multi-factor authentication foundation
+
+steps:
+1. Harden JWT implementation in web/src/server/auth/jwt.ts:
+   - Remove fallback secret (currently uses dev secret if env missing)
+   - Add JWT issuer and audience claims
+   - Implement token blacklisting for logout
+   - Add refresh token rotation
+2. Harden session management in web/src/server/auth/session.ts:
+   - Use httpOnly, secure, sameSite=strict cookies
+   - Add session fingerprinting (user agent hash)
+   - Implement concurrent session limits (max 5 per user)
+   - Add automatic session expiry refresh on activity
+3. Add brute force protection:
+   - Track failed login attempts per IP/email
+   - Progressive delays: 1s, 2s, 4s, 8s, 16s
+   - Lock account after 10 failed attempts (1 hour)
+4. Implement secure logout:
+   - Invalidate session in database
+   - Clear all cookies
+   - Blacklist JWT token
+   - Revoke refresh token
+5. Add MFA foundation:
+   - TOTP secret generation
+   - QR code for authenticator apps
+   - Backup codes
+6. Audit Clerk integration for security:
+   - Verify webhook signature validation
+   - Check Clerk session sync with custom sessions
+
+tests:
+- Unit: Test JWT signing/verification with invalid tokens
+- Integration: Test brute force lockout, session expiry
+- Security: Test session hijacking resistance
+
+acceptance_criteria:
+- No hardcoded or fallback secrets in auth code
+- All cookies have httpOnly, secure, sameSite=strict
+- Brute force protection active on login endpoints
+- Logout invalidates session completely
+- JWT tokens include iss, aud, iat, exp claims
+- Session fingerprinting rejects a stolen cookie replayed from a different user agent
+- MFA TOTP generation working with Google Authenticator
+
+validation:
+- Attempt 10 failed logins → account locked
+- Steal session cookie from one browser → invalid in another (fingerprinting)
+- Logout → session token rejected on subsequent requests
+- Check JWT with jwt.io → valid iss and aud claims
+
+notes:
+- Current JWT has fallback secret — this is critical to fix before production
+- Clerk handles frontend auth but backend needs its own hardening
+- Consider using Lucia Auth or NextAuth patterns for session management
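The progressive delay and lockout rules in step 3 of task 04 reduce to a small pure function. The sketch below is one possible reading: `loginPenalty` is a hypothetical name, and holding the delay at 16 seconds for attempts 6 through 9 is an assumption, since the task lists delays only up to 16s.

```typescript
const MAX_FAILED_ATTEMPTS = 10;      // lock threshold from step 3
const LOCK_DURATION_MS = 60 * 60 * 1000; // 1 hour
const MAX_DELAY_SEC = 16;            // last value in the 1s, 2s, 4s, 8s, 16s series

interface LoginPenalty {
  delayMs: number;  // wait to impose before processing the next attempt
  lockedMs: number; // non-zero when the account should be locked
}

// Map the number of consecutive failed logins to a delay or a lock.
function loginPenalty(failedAttempts: number): LoginPenalty {
  if (failedAttempts >= MAX_FAILED_ATTEMPTS) {
    return { delayMs: 0, lockedMs: LOCK_DURATION_MS };
  }
  if (failedAttempts <= 0) {
    return { delayMs: 0, lockedMs: 0 };
  }
  // Doubling delay: 1 failure -> 1s, 2 -> 2s, 3 -> 4s, capped at 16s (assumption).
  const delaySec = Math.min(2 ** (failedAttempts - 1), MAX_DELAY_SEC);
  return { delayMs: delaySec * 1000, lockedMs: 0 };
}
```

The failure counter itself would live in the database or Redis, keyed per IP and per email as the step describes; only the policy is shown here.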
--- a/tasks/web-production/05-cdn-asset-optimization.md
+++ b/tasks/web-production/05-cdn-asset-optimization.md
@@ -0,0 +1,61 @@
+# 05. CDN & Asset Optimization
+
+meta:
+  id: web-production-05
+  feature: web-production
+  priority: P2
+  depends_on: []
+  tags: [performance, infrastructure, production]
+
+objective:
+- Configure CDN for static assets and optimize frontend bundle delivery
+
+deliverables:
+- CDN configuration (Cloudflare, Vercel Edge, or AWS CloudFront)
+- Asset optimization (images, fonts, JS/CSS)
+- Brotli/Gzip compression
+- Cache-Control headers for static assets
+
+steps:
+1. Configure CDN for static assets:
+   - Set up Cloudflare or Vercel Edge Network
+   - Point CDN to web/dist/client or .output/public
+   - Configure cache rules for static files
+2. Optimize image delivery:
+   - Convert landing page SVGs to optimized formats where appropriate
+   - Add responsive image srcset for photos
+   - Implement lazy loading for below-fold images
+3. Configure compression:
+   - Enable Brotli compression (typically smaller output than gzip for text assets), with gzip as fallback
+   - Ensure Nitro/Vite build outputs compressed assets
+4. Set Cache-Control headers:
+   - Immutable assets (hashed filenames): 1 year
+   - HTML pages: no-cache (for SSR)
+   - API responses: no-store or short cache
+5. Implement resource hints:
+   - Preconnect to API domain, Stripe, Clerk
+   - Prefetch critical routes
+6. Add tests verifying asset optimization
+
+tests:
+- Unit: Test asset hashing and cache headers
+- Integration: Test CDN cache hit rates
+- Performance: Lighthouse performance audit >90
+
+acceptance_criteria:
+- Static assets served from CDN with <50ms TTFB
+- Brotli compression active on all text assets
+- Cache-Control headers correct per asset type
+- Image optimization reducing total page weight by >30%
+- Lighthouse Performance score ≥ 90
+- Preconnect hints present on critical pages
+
+validation:
+- `curl -I https://cdn.example.com/assets/main.js` → Cache-Control: public, max-age=31536000, immutable
+- Lighthouse CI run shows Performance ≥ 90
+- PageSpeed Insights shows <2s LCP on mobile
+
+notes:
+- SolidStart with Nitro should handle asset hashing automatically
+- Vercel deployment may include CDN automatically
+- Consider using @solidjs/start image optimization if available
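Step 4 of task 05 amounts to choosing a `Cache-Control` value per request path. A minimal sketch, assuming that every file under a known prefix is content-hashed by the build: `cacheControlFor` and the `/assets/` default are hypothetical, and the real prefix depends on the SolidStart/Nitro output layout.

```typescript
// Pick a Cache-Control value following the three rules in step 4.
function cacheControlFor(path: string, hashedPrefix: string = "/assets/"): string {
  // API responses: never stored by shared or browser caches.
  if (path.startsWith("/api/")) return "no-store";
  // Hashed build assets: safe to cache for one year because the name changes with the content.
  if (path.startsWith(hashedPrefix)) return "public, max-age=31536000, immutable";
  // Everything else is treated as SSR HTML: revalidate on every request.
  return "no-cache";
}
```

The immutable value matches the header that the validation step expects from `curl -I` on a hashed asset.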
--- a/tasks/web-production/06-db-connection-pooling.md
+++ b/tasks/web-production/06-db-connection-pooling.md
@@ -0,0 +1,62 @@
+# 06. Database Connection Pooling & Query Optimization
+
+meta:
+  id: web-production-06
+  feature: web-production
+  priority: P1
+  depends_on: []
+  tags: [performance, database, production]
+
+objective:
+- Optimize database connections and queries for production load
+
+deliverables:
+- Connection pooling configuration
+- Query performance audit
+- Index optimization
+- Slow query logging
+
+steps:
+1. Configure connection pooling:
+   - If using PostgreSQL: configure PgBouncer or the driver's built-in pool (@libsql/client applies only to libSQL/SQLite backends)
+   - Set max connections based on server instances (e.g., 20 per instance)
+   - Add connection timeout and idle timeout settings
+2. Audit all Drizzle queries for performance:
+   - Check web/src/server/db/schema/*.ts for missing indexes
+   - Review web/src/server/api/routers/*.ts for N+1 queries
+   - Add pagination to all list endpoints (default 50, max 100)
+3. Add database indexes:
+   - createdAt indexes for time-range queries (alerts, exposures)
+   - Composite indexes for common filter combinations
+   - userId indexes on all user-scoped tables
+4. Implement query result caching:
+   - Cache user profile lookups (5 min TTL)
+   - Cache subscription status (1 min TTL)
+   - Cache dashboard summary (30 sec TTL)
+5. Add slow query logging:
+   - Log queries taking >500ms
+   - Alert on >1s queries
+6. Set up database performance monitoring
+
+tests:
+- Unit: Test query execution plans for major endpoints
+- Load: Run 1000 concurrent dashboard loads, verify <200ms p95
+- Integration: Test pagination boundaries
+
+acceptance_criteria:
+- Database connection pool configured with max 20 connections per instance
+- No N+1 queries in any API endpoint
+- All list endpoints paginated with cursor or offset
+- Slow query logging active
+- Dashboard load query <100ms p95
+- Alert endpoint query <50ms p95
+
+validation:
+- EXPLAIN ANALYZE on major queries shows index usage
+- Load test with k6: 1000 concurrent users, p95 < 200ms
+- Database CPU <50% under normal load
+
+notes:
+- Current schema has some indexes but may need more for production scale
+- Drizzle ORM doesn't automatically handle connection pooling — configure at driver level
+- Consider read replicas if dashboard load is heavy
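The pagination rule in step 2 of task 06 (default 50, max 100) could be enforced with a small normalizer applied before the query is built. The sketch below uses hypothetical names (`resolvePage`, `clampInt`) and assumes offset pagination; a cursor-based variant would clamp only the limit.

```typescript
const DEFAULT_PAGE_SIZE = 50;
const MAX_PAGE_SIZE = 100;

// Clamp an optional numeric input to [min, max], falling back when it is missing or not finite.
function clampInt(value: number | undefined, fallback: number, min: number, max: number): number {
  if (value === undefined || !Number.isFinite(value)) return fallback;
  return Math.min(Math.max(Math.trunc(value), min), max);
}

// Normalize client-supplied paging parameters into safe limit/offset values.
function resolvePage(input: { limit?: number; offset?: number }): { limit: number; offset: number } {
  return {
    limit: clampInt(input.limit, DEFAULT_PAGE_SIZE, 1, MAX_PAGE_SIZE),
    offset: clampInt(input.offset, 0, 0, Number.MAX_SAFE_INTEGER),
  };
}
```

Clamping instead of rejecting is a design choice made for this sketch; a strict API could return a validation error for out-of-range values instead.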
--- a/tasks/web-production/07-caching-strategy.md
+++ b/tasks/web-production/07-caching-strategy.md
@@ -0,0 +1,61 @@
+# 07. Caching Strategy (Redis + HTTP Cache)
+
+meta:
+  id: web-production-07
+  feature: web-production
+  priority: P2
+  depends_on: []
+  tags: [performance, caching, production]
+
+objective:
+- Implement multi-layer caching to reduce database load and improve response times
+
+deliverables:
+- Redis caching layer for API responses
+- HTTP cache headers for client-side caching
+- Cache invalidation strategy
+- Stale-while-revalidate pattern
+
+steps:
+1. Implement Redis caching for API responses:
+   - Create web/src/server/lib/cache.ts with Redis-backed cache
+   - Cache user profile: key `user:{id}`, TTL 5 minutes
+   - Cache subscription: key `sub:{userId}`, TTL 1 minute
+   - Cache dashboard summary: key `dash:{userId}`, TTL 30 seconds
+   - Cache blog posts: key `blog:{slug}`, TTL 1 hour
+2. Add cache decorators/procedures:
+   - Create cachedProcedure wrapper for tRPC
+   - Support cache tags for invalidation
+3. Implement HTTP caching headers:
+   - Static assets: Cache-Control: public, max-age=31536000, immutable
+   - API responses: Cache-Control: private, max-age=30
+   - HTML pages: Cache-Control: no-cache (SSR)
+4. Add cache invalidation:
+   - Invalidate user cache on profile update
+   - Invalidate subscription cache on billing event
+   - Invalidate blog cache on publish/edit
+5. Implement stale-while-revalidate for dashboard data
+6. Add cache hit/miss metrics
+
+tests:
+- Unit: Test cache set/get/delete operations
+- Integration: Test cache invalidation on mutations
+- Performance: Compare cached vs uncached response times
+
+acceptance_criteria:
+- Redis cache layer active on all read-heavy endpoints
+- Cache hit rate >80% for user profile and subscription endpoints
+- Cache invalidation working on all mutations
+- HTTP cache headers correct per endpoint type
+- Stale-while-revalidate pattern on dashboard widgets
+- Cache metrics visible in monitoring dashboard
+
+validation:
+- Load test: cached endpoint p95 < 20ms
+- Verify Redis keys created for cached data
+- Update profile → cache invalidated, next request hits DB
+
+notes:
+- Redis already used for BullMQ jobs — share connection or use separate DB index
+- Be careful caching authenticated data — always include userId in key
+- Consider using Vercel KV or Upstash Redis for serverless
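The key and TTL scheme in step 1 of task 07 can be illustrated with an in-memory stand-in for the Redis layer. `TtlCache`, `cacheKeys`, and `cacheTtlMs` are hypothetical names, and the explicit `nowMs` argument is there only to make expiry testable; with Redis, the TTL would be passed to the store and expiry handled server-side.

```typescript
interface CacheEntry {
  value: unknown;
  expiresAt: number; // epoch milliseconds
}

// In-memory stand-in for the Redis-backed cache described in step 1.
class TtlCache {
  private entries = new Map<string, CacheEntry>();

  set(key: string, value: unknown, ttlMs: number, nowMs: number): void {
    this.entries.set(key, { value, expiresAt: nowMs + ttlMs });
  }

  get(key: string, nowMs: number): unknown {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= nowMs) {
      // Expired: drop the entry so the caller falls through to the database.
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  // Called from mutations (profile update, billing event, blog publish).
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}

// Key builders and TTLs taken from step 1; user-scoped keys always embed the user ID.
const cacheKeys = {
  user: (id: string) => `user:${id}`,
  subscription: (userId: string) => `sub:${userId}`,
  dashboard: (userId: string) => `dash:${userId}`,
  blog: (slug: string) => `blog:${slug}`,
};

const cacheTtlMs = {
  user: 5 * 60 * 1000,
  subscription: 60 * 1000,
  dashboard: 30 * 1000,
  blog: 60 * 60 * 1000,
};
```

Centralizing the key builders keeps the note about including `userId` in every authenticated key enforceable in one place.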
--- a/tasks/web-production/08-health-checks-shutdown.md
+++ b/tasks/web-production/08-health-checks-shutdown.md
@@ -0,0 +1,67 @@
+# 08. Graceful Shutdown & Health Check Endpoints
+
+meta:
+  id: web-production-08
+  feature: web-production
+  priority: P1
+  depends_on: []
+  tags: [reliability, infrastructure, production]
+
+objective:
+- Implement health checks and graceful shutdown to ensure zero-downtime deployments and reliable operations
+
+deliverables:
+- Health check endpoint (/health)
+- Readiness probe endpoint (/ready)
+- Graceful shutdown handler
+- Dependency health checks (DB, Redis, Stripe)
+
+steps:
+1. Create health check endpoints:
+   - GET /health → basic liveness (HTTP 200 if process running)
+   - GET /ready → readiness check (DB, Redis, Stripe connectivity)
+   - GET /health/deep → comprehensive check with dependency status
+2. Implement dependency health checks:
+   - Database: simple SELECT 1 query
+   - Redis: PING command
+   - Stripe: retrieve account info (cached)
+   - WebSocket server: connection count
+3. Add graceful shutdown:
+   - Handle SIGTERM/SIGINT signals
+   - Stop accepting new connections
+   - Wait for active requests to complete (30s timeout)
+   - Close database connections
+   - Close Redis connections
+   - Exit process cleanly
+4. Add startup probe:
+   - Delay readiness until all services initialized
+   - Retry logic for DB connection on startup
+5. Add metrics endpoint (/metrics) for Prometheus:
+   - Request count and duration
+   - Error rates
+   - Active connections
+   - Dependency health status
+
+tests:
+- Unit: Test health check responses
+- Integration: Test graceful shutdown with active requests
+- Load: Verify zero failed requests during rolling restart
+
+acceptance_criteria:
+- /health returns 200 within 100ms
+- /ready returns 200 only when all dependencies healthy
+- /ready returns 503 with detailed error when dependency down
+- Graceful shutdown completes within 30 seconds
+- Zero failed requests during rolling deployment
+- Prometheus metrics endpoint available
+
+validation:
+- `curl /health` → {"status":"ok"}
+- `curl /ready` → {"status":"ok","dependencies":{"db":"ok","redis":"ok","stripe":"ok"}}
+- Stop container with active requests → all complete before exit
+- Block DB port → /ready returns 503
+
+notes:
+- Nitro/SolidStart may need custom server plugin for signal handling
+- Use node-graceful-shutdown or similar library
+- Kubernetes/Docker health checks rely on these endpoints
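The `/ready` behaviour described in task 08 can be split into running the dependency probes and turning their results into a response. The sketch below assumes each probe is an async function that throws on failure; `runChecks`, `summarizeReadiness`, and the `"error"` status string are hypothetical, and per-probe timeouts are omitted for brevity.

```typescript
type DepState = "ok" | "down";

interface ReadinessResult {
  httpStatus: 200 | 503;
  body: { status: "ok" | "error"; dependencies: Record<string, DepState> };
}

// Pure step: turn per-dependency booleans into the /ready response.
function summarizeReadiness(results: Record<string, boolean>): ReadinessResult {
  const dependencies: Record<string, DepState> = {};
  let allOk = true;
  for (const [name, ok] of Object.entries(results)) {
    dependencies[name] = ok ? "ok" : "down";
    if (!ok) allOk = false;
  }
  return {
    httpStatus: allOk ? 200 : 503,
    body: { status: allOk ? "ok" : "error", dependencies },
  };
}

// Run every probe concurrently; a thrown error marks that dependency as down.
async function runChecks(
  checks: Record<string, () => Promise<void>>,
): Promise<Record<string, boolean>> {
  const entries = await Promise.all(
    Object.entries(checks).map(async ([name, check]): Promise<[string, boolean]> => {
      try {
        await check();
        return [name, true];
      } catch {
        return [name, false];
      }
    }),
  );
  const results: Record<string, boolean> = {};
  for (const [name, ok] of entries) results[name] = ok;
  return results;
}
```

A handler would call `runChecks` with probes such as a `SELECT 1` query and a Redis `PING`, then return `summarizeReadiness(...)`; when all three dependencies pass, the body serializes to the JSON shown in the validation list.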
--- a/tasks/web-production/09-structured-logging.md
+++ b/tasks/web-production/09-structured-logging.md
@@ -0,0 +1,66 @@
+# 09. Structured Logging & Log Aggregation
+
+meta:
+  id: web-production-09
+  feature: web-production
+  priority: P2
+  depends_on: []
+  tags: [observability, logging, production]
+
+objective:
+- Replace ad-hoc logging with structured, aggregated logging for production debugging and auditing
+
+deliverables:
+- Structured logging library integration (Pino or Winston)
+- Log aggregation pipeline (Datadog, Logtail, or CloudWatch)
+- Request ID propagation across all logs
+- Log rotation and retention policy
+
+steps:
+1. Add structured logging library:
+   - Install pino or winston in web/package.json
+   - Create web/src/server/lib/logger.ts with configured logger
+   - Replace all console.log/console.error with logger
+2. Implement request context logging:
+   - Generate request ID for each incoming request
+   - Attach user ID, session ID to log context
+   - Propagate request ID through tRPC context
+3. Configure log levels:
+   - ERROR: unhandled exceptions, auth failures, DB errors
+   - WARN: rate limit hits, slow queries, deprecated API usage
+   - INFO: requests, logins, signups, billing events
+   - DEBUG: query details, cache hits/misses (dev only)
+4. Set up log aggregation:
+   - Configure log shipping to aggregation service
+   - Set up log parsing and indexing
+   - Create saved searches for common issues
+5. Implement log rotation:
+   - 100MB max per file
+   - 7 days retention for production
+   - 30 days retention for audit logs
+6. Add sensitive data redaction:
+   - Mask credit card numbers, SSNs, passwords in logs
+   - Redact JWT tokens (show only first 10 chars)
+
+tests:
+- Unit: Test logger outputs valid JSON
+- Integration: Test request ID propagation
+- Security: Verify no sensitive data in logs
+
+acceptance_criteria:
+- All logs output as structured JSON
+- Request ID present on every log line for a given request
+- Log aggregation service receiving logs in real-time
+- Sensitive data redacted from all log output
+- Log rotation preventing disk fill
+- Searchable logs by user ID, request ID, endpoint
+
+validation:
+- Trigger error → log appears in aggregation with stack trace, request ID, user ID
+- Search logs by request ID → all related logs returned
+- Check log files → no credit card numbers, passwords, full JWTs
+
+notes:
+- Pino is among the fastest Node.js loggers and is the recommended choice here
+- Use pino-pretty for local development, JSON for production
+- Consider OpenTelemetry for unified tracing + logging
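Step 6 of task 09 (sensitive data redaction) might start from a string-level filter like the sketch below. The function name and both patterns are assumptions: the JWT pattern matches three base64url segments starting with `eyJ`, and the card pattern matches 13 to 19 digits with optional spaces or hyphens, so it can also catch long numeric IDs. Passwords are better handled by field name through the logger's own redaction settings (Pino has a `redact` option for key paths) than by pattern matching.

```typescript
// Three dot-separated base64url segments; JWT headers start with "eyJ".
const JWT_PATTERN = /\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/g;
// 13-19 digits, optionally separated by single spaces or hyphens.
const CARD_PATTERN = /\b\d(?:[ -]?\d){12,18}\b/g;

// Mask tokens and card-like numbers in a log message.
function redactLogMessage(message: string): string {
  return message
    // Keep only the first 10 characters of a JWT, as step 6 specifies.
    .replace(JWT_PATTERN, (token) => `${token.slice(0, 10)}...[REDACTED]`)
    .replace(CARD_PATTERN, "[REDACTED_CARD]");
}
```

Running the filter before the message reaches the transport keeps raw values out of both local files and the aggregation service.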
--- a/tasks/web-production/10-error-tracking.md
+++ b/tasks/web-production/10-error-tracking.md
@@ -0,0 +1,69 @@
+# 10. Error Tracking & Alerting (Sentry Integration)
+
+meta:
+  id: web-production-10
+  feature: web-production
+  priority: P1
+  depends_on: []
+  tags: [observability, error-tracking, production]
+
+objective:
+- Implement comprehensive error tracking with Sentry to catch and alert on production errors in real-time
+
+deliverables:
+- Sentry integration for backend and frontend
+- Error alerting rules
+- Source maps upload for production builds
+- Breadcrumbs for error context
+
+steps:
+1. Add Sentry SDK:
+   - Install @sentry/node for backend
+   - Install @sentry/solid or @sentry/browser for frontend
+   - Configure DSN from environment variable
+2. Initialize Sentry in backend:
+   - Add to web/src/entry-server.tsx or Nitro plugin
+   - Capture unhandled exceptions
+   - Capture unhandled promise rejections
+   - Attach user context (ID, email) when available
+3. Initialize Sentry in frontend:
+   - Add to web/src/entry-client.tsx
+   - Capture JavaScript errors
+   - Capture SolidJS component errors via ErrorBoundary
+   - Attach release version and environment
+4. Configure error alerting:
+   - Slack/Discord/PagerDuty integration for P1 errors
+   - Email alerts for new error types
+   - Digest emails for recurring errors
+   - Alert thresholds: >10 errors/minute or >1 unhandled exception
+5. Upload source maps:
+   - Configure Vite plugin for source map generation
+   - Upload maps to Sentry during build
+   - Verify error stack traces show original source
+6. Add breadcrumbs:
+   - Log navigation changes
+   - Log API calls with response status
+   - Log user actions (clicks, form submissions)
+
+tests:
+- Unit: Test Sentry capture in error scenarios
+- Integration: Trigger error, verify appears in Sentry
+- Alert: Verify alert fires within 1 minute of error
+
+acceptance_criteria:
+- 100% of unhandled exceptions captured in Sentry
+- All errors include user context, request URL, and environment
+- Source maps working → stack traces show original TypeScript
+- Alert fired within 60 seconds of first occurrence
+- No duplicate alerts for same error (grouping working)
+- Error rate dashboard showing trends over time
+
+validation:
+- Deploy with intentional bug → error appears in Sentry within 30s
+- Check alert channel → notification received
+- View error detail → correct file, line number, user context
+
+notes:
+- Sentry free tier: 5k errors/month — may need paid plan for scale
+- Use Sentry releases to track which deploy introduced errors
+- Consider integrating with GitHub for suspect commits
--- a/tasks/web-production/11-metrics-dashboards.md
+++ b/tasks/web-production/11-metrics-dashboards.md
@@ -0,0 +1,70 @@
+# 11. Application Metrics & Dashboards
+
+meta:
+  id: web-production-11
+  feature: web-production
+  priority: P2
+  depends_on: []
+  tags: [observability, metrics, production]
+
+objective:
+- Collect and visualize application metrics for performance monitoring and capacity planning
+
+deliverables:
+- Prometheus metrics endpoint
+- Custom business metrics
+- Grafana or Datadog dashboards
+- Alerting on metric thresholds
+
+steps:
+1. Add metrics collection:
+   - Install prom-client for Node.js metrics
+   - Create web/src/server/lib/metrics.ts
+   - Expose /metrics endpoint for Prometheus scraping
+2. Collect standard metrics:
+   - HTTP request duration (histogram)
+   - HTTP request count (counter, by status code, endpoint)
+   - Active connections (gauge)
+   - Memory usage (gauge)
+   - Event loop lag (gauge)
+3. Collect business metrics:
+   - Signup rate (counter)
+   - Login success/failure rate (counter)
+   - Subscription conversions (counter)
+   - DarkWatch scan completions (counter)
+   - Alert generation rate (counter)
+   - Average threat score (gauge)
+4. Set up dashboards:
+   - Grafana dashboard or Datadog dashboard
+   - Request latency percentiles (p50, p95, p99)
+   - Error rate over time
+   - Business funnel (landing → signup → subscribe)
+   - Infrastructure health (CPU, memory, DB connections)
+5. Configure alerts:
+   - p99 latency > 500ms for 5 minutes
+   - Error rate > 1% for 2 minutes
+   - Memory usage > 80% for 10 minutes
+   - DB connection pool > 90% for 5 minutes
+
+tests:
+- Unit: Test metrics increment correctly
+- Integration: Verify /metrics endpoint returns valid Prometheus format
+- Dashboard: Confirm all panels show data
+
+acceptance_criteria:
+- /metrics endpoint serving valid Prometheus exposition format
+- Request duration histogram with 0.1, 0.5, 1, 2, 5 second buckets
+- Business metrics visible in dashboard
+- Alert fires when p99 latency exceeds 500ms
+- Dashboard refreshes every 10 seconds with live data
+- Metrics retention for 30 days
+
+validation:
+- `curl /metrics` → valid Prometheus output
+- Grafana dashboard shows request latency graph
+- Trigger slow endpoint → alert fires within 5 minutes
+
+notes:
+- Prometheus + Grafana is open source and cost-effective
+- Datadog is easier but more expensive
+- Consider using Vercel Analytics if deployed on Vercel
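To show what the request duration histogram in the acceptance criteria of task 11 looks like in Prometheus exposition format, here is a hand-rolled sketch with the 0.1, 0.5, 1, 2, 5 second buckets. `DurationHistogram` is hypothetical and exists only to illustrate the cumulative `le` semantics; in the real implementation the `prom-client` histogram named in step 1 would do this bookkeeping.

```typescript
// Upper bounds in seconds, from the acceptance criteria.
const DURATION_BUCKETS = [0.1, 0.5, 1, 2, 5];

class DurationHistogram {
  private bucketCounts: number[] = DURATION_BUCKETS.map(() => 0);
  private count = 0;
  private sum = 0;

  // Buckets are cumulative: an observation increments every bucket whose bound it fits under.
  observe(seconds: number): void {
    this.count += 1;
    this.sum += seconds;
    DURATION_BUCKETS.forEach((upperBound, i) => {
      if (seconds <= upperBound) {
        this.bucketCounts[i] = (this.bucketCounts[i] ?? 0) + 1;
      }
    });
  }

  // Render the histogram series in Prometheus text exposition format.
  render(name: string): string {
    const lines = DURATION_BUCKETS.map(
      (upperBound, i) => `${name}_bucket{le="${upperBound}"} ${this.bucketCounts[i] ?? 0}`,
    );
    lines.push(`${name}_bucket{le="+Inf"} ${this.count}`);
    lines.push(`${name}_sum ${this.sum}`);
    lines.push(`${name}_count ${this.count}`);
    return lines.join("\n");
  }
}
```

Because the largest finite bucket is 5 seconds, any slower request is visible only in the `+Inf` bucket, which is worth keeping in mind when setting the p99 alert at 500ms.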
--- a/tasks/web-production/12-uptime-monitoring.md
+++ b/tasks/web-production/12-uptime-monitoring.md
@@ -0,0 +1,69 @@
+# 12. Uptime & Performance Monitoring
+
+meta:
+  id: web-production-12
+  feature: web-production
+  priority: P2
+  depends_on: []
+  tags: [observability, uptime, production]
+
+objective:
+- Monitor application uptime and performance from external vantage points to ensure reliability
+
+deliverables:
+- External uptime monitoring (Pingdom, UptimeRobot, or Datadog Synthetics)
+- Synthetic monitoring for critical user journeys
+- Performance budget enforcement
+- Status page for incident communication
+
+steps:
+1. Set up uptime monitoring:
+   - Configure checks for homepage, API health, dashboard
+   - Check from multiple regions (US East, US West, EU)
+   - 1-minute interval checks
+   - Alert on 2 consecutive failures
+2. Implement synthetic monitoring:
+   - Signup flow: homepage → signup → verify email
+   - Login flow: login → dashboard → view alerts
+   - Billing flow: dashboard → pricing → checkout (test mode)
+   - DarkWatch flow: dashboard → darkwatch → add watchlist item
+3. Set performance budgets:
+   - LCP (Largest Contentful Paint) < 2.5s mobile, < 1.5s desktop
+   - INP (Interaction to Next Paint) < 200ms (INP replaced FID as a Core Web Vital in March 2024)
+   - CLS (Cumulative Layout Shift) < 0.1
+   - TTFB (Time to First Byte) < 200ms
+   - API response p95 < 200ms
+4. Configure alerting:
+   - Downtime alert via Slack/SMS
+   - Performance degradation alert (LCP > 3s)
+   - SSL certificate expiry alert (30 days before)
+   - Domain expiry alert (30 days before)
+5. Set up status page:
+   - Use statuspage.io or instatus.com
+   - Auto-update from monitoring checks
+   - Subscribe users for incident notifications
+   - Post incident updates and post-mortems
+
+tests:
+- Integration: Verify monitoring catches simulated outage
+- Performance: Confirm synthetic tests complete successfully
+- Alert: Test alert channels with deliberate failure
+
+acceptance_criteria:
+- Uptime monitoring checking every 60 seconds from 3+ regions
+- 99.9% uptime SLA measured over 30 days
+- Synthetic tests covering signup, login, and core flows
+- Performance budget alerts for LCP > 2.5s
+- Status page accessible and auto-updating
+- SSL certificate expiry alert 30 days in advance
+
+validation:
+- Simulate outage → alert received within 2 minutes
+- Check status page → shows incident with timeline
+- Run synthetic test → completes in <30 seconds
+- Lighthouse CI shows all metrics within budget
+
+notes:
+- UptimeRobot free tier: 50 monitors, 5-minute intervals
+- Pingdom is paid but supports 1-minute check intervals
+- Consider using Checkly for synthetic monitoring with JS
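The "alert on 2 consecutive failures" rule in step 1 of task 12 is usually a setting in the monitoring service, but the logic is simple enough to sketch. `FailureTracker` is a hypothetical name, and firing only once per outage (not on every further failure) is an assumption made here to avoid repeated notifications.

```typescript
// Tracks consecutive failed checks for one monitored endpoint.
class FailureTracker {
  private readonly threshold: number;
  private streak = 0;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  // Record one check result; returns true exactly when the alert should fire.
  record(checkPassed: boolean): boolean {
    if (checkPassed) {
      this.streak = 0; // any success ends the outage and re-arms the alert
      return false;
    }
    this.streak += 1;
    return this.streak === this.threshold;
  }
}

// Threshold from step 1: alert on 2 consecutive failures.
const homepageTracker = new FailureTracker(2);
```

With 1-minute checks, a threshold of 2 means an outage is reported roughly two minutes after it starts, which lines up with the "alert received within 2 minutes" validation item.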
--- a/tasks/web-production/13-github-actions-ci.md
+++ b/tasks/web-production/13-github-actions-ci.md
@@ -0,0 +1,72 @@
+# 13. GitHub Actions CI Pipeline
+
+meta:
+  id: web-production-13
+  feature: web-production
+  priority: P1
+  depends_on: [web-production-17, web-production-18, web-production-19, web-production-20]
+  tags: [cicd, automation, production]
+
+objective:
+- Build a comprehensive CI pipeline that runs tests, linting, type checking, and security scans on every pull request
+
+deliverables:
+- GitHub Actions workflow files
+- PR checks for web and browser-ext
+- Test reporting and coverage
+- Dependency vulnerability scanning
+
+steps:
+1. Create .github/workflows/ci.yml:
+   - Trigger on pull_request and push to main
+   - Set up Node.js 22 with pnpm
+   - Install dependencies with frozen lockfile
+2. Add job: lint-and-typecheck:
+   - Run `pnpm lint` (tsc --noEmit)
+   - Run `pnpm lint:ext`
+   - Fail on any TypeScript errors
+3. Add job: test:
+   - Run `pnpm test` (vitest for web)
+   - Run `pnpm test:ext` (vitest for browser-ext)
+   - Generate coverage reports with @vitest/coverage-v8
+   - Upload coverage to Codecov or similar
+4. Add job: build:
+   - Run `pnpm build` for web
+   - Run `pnpm build:ext` for browser-ext
+   - Verify build artifacts exist
+5. Add job: security-scan:
+   - Run `pnpm audit` with --audit-level=high
+   - Post suggested fixes (`pnpm audit --fix`) as a PR comment
+   - Add OWASP dependency check
+6. Add job: docker-build:
+   - Build scheduler Dockerfile
+   - Verify Docker image builds successfully
+7. Configure branch protection:
+   - Require all checks to pass before merge
+   - Require 1 reviewer approval
+   - Require up-to-date branch before merge
+
+tests:
+- Integration: Create test PR, verify all checks run
+- Security: Introduce vulnerable dependency, verify scan catches it
+- Build: Verify build artifacts are created
+
+acceptance_criteria:
+- All PRs trigger CI pipeline automatically
+- Lint, typecheck, test, build, and security jobs run in parallel
+- Failing tests block PR merge
+- Coverage report uploaded for every PR
+- Security vulnerabilities (high+) block PR merge
+- Docker build verified on every PR
+- Pipeline completes in <10 minutes
+
+validation:
+- Open test PR → all checks green
+- Introduce TypeScript error → lint job fails
+- Add vulnerable package → security scan fails
+- Check Codecov → coverage diff visible in PR
+
+notes:
+- Use pnpm/action-setup for proper pnpm installation
+- Cache the pnpm store between runs for speed (e.g. `cache: pnpm` in actions/setup-node)
+- Consider using GitHub Actions matrix for multiple Node versions
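A minimal sketch of the decision behind the security-scan job in task 13 ("high+ blocks PR merge"). `pnpm audit --audit-level=high` already exits non-zero on its own; the point here is only to make the rule explicit so the same result can feed the PR comment. The `SeverityCounts` shape is an assumption about what the audit summary is mapped into, and `blockingSeverities` is a hypothetical helper name.

```typescript
// Severity levels in ascending order.
const SEVERITIES = ["info", "low", "moderate", "high", "critical"] as const;
type Severity = (typeof SEVERITIES)[number];

// Assumed shape: number of findings per severity, taken from the audit summary.
type SeverityCounts = Record<Severity, number>;

// Returns the severities at or above `threshold` that have at least one finding.
// An empty result means the gate passes.
function blockingSeverities(counts: SeverityCounts, threshold: Severity): Severity[] {
  const start = SEVERITIES.indexOf(threshold);
  const blocking: Severity[] = [];
  for (const severity of SEVERITIES.slice(start)) {
    if (counts[severity] > 0) blocking.push(severity);
  }
  return blocking;
}

// Human-readable line for a PR comment.
function summarize(counts: SeverityCounts, threshold: Severity): string {
  const blocking = blockingSeverities(counts, threshold);
  return blocking.length === 0
    ? `No findings at or above "${threshold}".`
    : `Blocking findings: ${blocking.map((s) => `${counts[s]} ${s}`).join(", ")}.`;
}
```

A CI step could call `summarize` for the comment body and set a non-zero exit code whenever `blockingSeverities` returns a non-empty list.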
--- a/tasks/web-production/14-deployment-pipeline.md
+++ b/tasks/web-production/14-deployment-pipeline.md
@@ -0,0 +1,75 @@
+# 14. Automated Deployment Pipeline
+
+meta:
+  id: web-production-14
+  feature: web-production
+  priority: P1
+  depends_on: [web-production-13, web-production-15, web-production-16]
+  tags: [cicd, deployment, production]
+
+objective:
+- Build automated deployment pipelines for staging and production environments with rollback capability
+
+deliverables:
+- Staging deployment on merge to main
+- Production deployment with manual approval
+- Database migration automation
+- Rollback strategy
+
+steps:
+1. Create .github/workflows/deploy-staging.yml:
+   - Trigger on push to main
+   - Build web application
+   - Run database migrations (`drizzle-kit migrate`; `drizzle-kit push` skips migration files and suits prototyping only)
+   - Deploy to staging environment (Vercel, Railway, or VPS)
+   - Run smoke tests against staging
+2. Create .github/workflows/deploy-production.yml:
+   - Trigger on release published or manual dispatch
+   - Require manual approval from 1 team member
+   - Build and tag Docker image
+   - Run database migrations in dry-run first
+   - Deploy to production with blue-green or rolling strategy
+   - Run post-deploy smoke tests
+3. Implement database migration safety:
+   - Migrations run before app deployment
+   - Backward-compatible migrations only (add columns, don't drop)
+   - Migration rollback script for each migration
+   - Database backup before production migration
+4. Add deployment notifications:
+   - Slack notification on deploy start, success, failure
+   - Include commit SHA, author, and changelog
+5. Implement rollback:
+   - One-click rollback to previous release
+   - Database migration rollback (if safe)
+   - CDN cache purge on rollback
+6. Add smoke tests:
+   - Test homepage loads
+   - Test login API responds
+   - Test health endpoint
+   - Test critical user journey with Playwright
+
+tests:
+- Integration: Deploy to staging, verify app functional
+- Rollback: Trigger rollback, verify previous version restored
+- Migration: Verify a failed migration aborts the deploy and leaves the running version untouched
+
+acceptance_criteria:
+- Every merge to main auto-deploys to staging
+- Production deploy requires manual approval
+- Database migrations run automatically before app start
+- Rollback completes in <5 minutes
+- Smoke tests pass before marking deploy successful
+- Deployment notifications sent to Slack
+- Zero-downtime deployment for web app
+
+validation:
+- Merge PR → staging deploys automatically within 5 minutes
+- Trigger production deploy → approval gate shown
+- Approve → production deploys, smoke tests pass
+- Introduce bug → rollback to previous version in <5 minutes
+
+notes:
+- Vercel offers automatic preview deployments per PR
+- For VPS deployment, use Docker Compose with rolling restart
+- Consider using GitHub Environments for approval gates
+- Database migrations should be additive-only in production
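The "additive-only" migration rule in task 14 can be checked mechanically before a deploy. The sketch below is a heuristic under stated assumptions: it treats a migration as plain SQL text, splits naively on `;`, and flags a small, illustrative list of destructive statements. `findDestructiveStatements` and the pattern list are hypothetical and would need tuning for the real migration files.

```typescript
// Heuristic patterns for statements that break backward compatibility.
// The list is illustrative, not exhaustive.
const DESTRUCTIVE_PATTERNS: { reason: string; pattern: RegExp }[] = [
  { reason: "DROP TABLE", pattern: /\bDROP\s+TABLE\b/i },
  { reason: "DROP COLUMN", pattern: /\bDROP\s+COLUMN\b/i },
  { reason: "RENAME", pattern: /\bRENAME\s+(TO|COLUMN)\b/i },
  { reason: "TRUNCATE", pattern: /\bTRUNCATE\b/i },
];

interface Finding {
  statement: string;
  reason: string;
}

// Splits the migration on ";" and reports every statement that matches a
// destructive pattern. A ";" inside a string literal would confuse the split.
function findDestructiveStatements(sql: string): Finding[] {
  const findings: Finding[] = [];
  for (const raw of sql.split(";")) {
    const statement = raw.trim();
    if (statement.length === 0) continue;
    for (const { reason, pattern } of DESTRUCTIVE_PATTERNS) {
      if (pattern.test(statement)) findings.push({ statement, reason });
    }
  }
  return findings;
}
```

Running this over new migration files in the production workflow, and failing when the result is non-empty, turns the note above into an enforced check rather than a convention.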
--- a/tasks/web-production/15-docker-infra.md
+++ b/tasks/web-production/15-docker-infra.md
@@ -0,0 +1,75 @@
+# 15. Docker & Infrastructure Optimization
+
+meta:
+  id: web-production-15
+  feature: web-production
+  priority: P2
+  depends_on: []
+  tags: [infrastructure, docker, production]
+
+objective:
+- Optimize Docker images and infrastructure for production deployment with security and efficiency
+
+deliverables:
+- Multi-stage optimized Dockerfile for web app
+- Docker Compose for local production simulation
+- Infrastructure as Code (Terraform or Pulumi)
+- Security scanning for Docker images
+
+steps:
+1. Create optimized Dockerfile for web app:
+   - Multi-stage build (deps → build → runtime)
+   - Use node:22-alpine for minimal image size
+   - Run as non-root user
+   - Copy only necessary files to runtime stage
+   - Health check in Dockerfile
+2. Optimize scheduler Dockerfile:
+   - Reduce image size (currently copies many files)
+   - Use .dockerignore to exclude unnecessary files
+   - Pin base image versions
+3. Create docker-compose.prod.yml:
+   - Web app service with replicas
+   - Redis service with persistence
+   - PostgreSQL service (or external)
+   - Nginx reverse proxy with SSL termination
+   - Watchtower for automatic updates
+4. Add security scanning:
+   - Trivy or Snyk scan in CI pipeline
+   - Fail build on CRITICAL vulnerabilities
+   - Weekly automated scan of production images
+5. Implement Infrastructure as Code:
+   - Terraform configuration for AWS/GCP/Vultr
+   - VPC, subnets, security groups
+   - ECS/Fargate or Kubernetes deployment
+   - Load balancer with SSL
+   - RDS/Cloud SQL for PostgreSQL
+   - ElastiCache/Memorystore for Redis
+6. Add environment-specific configs:
+   - Production nginx.conf with rate limiting
+   - SSL certificate management (Let's Encrypt)
+   - Firewall rules
+
+tests:
+- Integration: Build image, verify size <200MB
+- Security: Trivy scan shows no CRITICAL vulnerabilities
+- Deploy: Terraform apply creates infrastructure
+
+acceptance_criteria:
+- Web Docker image <200MB compressed
+- Scheduler Docker image <150MB compressed
+- No CRITICAL vulnerabilities in image scans
+- docker-compose.prod.yml runs full stack locally
+- Terraform creates reproducible infrastructure
+- Nginx reverse proxy with SSL and rate limiting
+- Non-root user running containers
+
+validation:
+- `docker images` → web image <200MB (reports uncompressed size, so passing here implies the compressed target is met)
+- `trivy image kordant-web` → no CRITICAL
+- `docker compose -f docker-compose.prod.yml up` → full stack running
+- `terraform plan` → no unexpected changes
+
+notes:
+- Current scheduler/Dockerfile copies many source files — optimize with .dockerignore
+- Consider using distroless images for even smaller footprint
+- Use AWS Fargate or Google Cloud Run for serverless containers
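The Dockerfile health check in task 15 needs an endpoint to probe. One possible shape for the logic behind it is sketched below, assuming the app checks its dependencies (for example PostgreSQL and Redis) and reports a single aggregate status. `CheckResult` and `summarizeHealth` are hypothetical names; how each check is performed is left out.

```typescript
interface CheckResult {
  name: string; // e.g. "postgres" or "redis" (illustrative)
  ok: boolean;
}

interface HealthSummary {
  httpStatus: 200 | 503;
  body: { status: "ok" | "degraded"; failing: string[] };
}

// Aggregates individual dependency checks into one response.
// Any failing check yields 503 so a container health check marks the
// instance unhealthy.
function summarizeHealth(checks: CheckResult[]): HealthSummary {
  const failing = checks.filter((c) => !c.ok).map((c) => c.name);
  if (failing.length === 0) {
    return { httpStatus: 200, body: { status: "ok", failing } };
  }
  return { httpStatus: 503, body: { status: "degraded", failing } };
}
```

The same endpoint could also serve the "Test health endpoint" smoke test in task 14.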
--- a/tasks/web-production/16-env-secrets.md
+++ b/tasks/web-production/16-env-secrets.md
@@ -0,0 +1,75 @@
+# 16. Environment Management & Secrets Rotation
+
+meta:
+  id: web-production-16
+  feature: web-production
+  priority: P1
+  depends_on: []
+  tags: [security, infrastructure, production]
+
+objective:
+- Implement secure environment variable management and automated secrets rotation
+
+deliverables:
+- Environment variable validation on startup
+- Secrets manager integration (AWS Secrets Manager, Doppler, or 1Password)
+- Automated secrets rotation
+- Environment documentation
+
+steps:
+1. Create environment validation:
+   - Create web/src/server/lib/env.ts with Zod/Valibot schema
+   - Validate all required env vars on server startup
+   - Fail fast with clear error messages for missing vars
+   - Type-safe env access throughout codebase
+2. Migrate to secrets manager:
+   - Set up Doppler or AWS Secrets Manager
+   - Move DATABASE_URL, JWT_SECRET, STRIPE_SECRET_KEY, CLERK_SECRET_KEY to secrets manager
+   - Remove secrets from .env files in production
+   - Use short-lived tokens where possible
+3. Implement secrets rotation:
+   - JWT secret: rotate quarterly
+   - Database credentials: rotate monthly
+   - Stripe keys: rotate after any suspected leak
+   - API keys: rotate every 6 months
+   - Automated rotation scripts
+4. Add environment documentation:
+   - Document all environment variables in docs/ENVIRONMENT.md
+   - Mark required vs optional
+   - Include examples and validation rules
+   - Document secrets rotation schedule
+5. Secure local development:
+   - .env.example with dummy values
+   - .env.local in .gitignore
+   - Pre-commit hook to prevent secret commits
+   - Use 1Password CLI or Doppler CLI for local secrets
+6. Audit existing secrets:
+   - Scan git history for leaked secrets (git-secrets, truffleHog)
+   - Rotate any potentially leaked secrets
+   - Enable GitHub secret scanning
+
+tests:
+- Unit: Test env validation catches missing vars
+- Security: Verify no secrets in codebase with scanner
+- Integration: Test secrets manager integration
+
+acceptance_criteria:
+- Server fails to start with clear error if required env var missing
+- Zero secrets in codebase or git history
+- All production secrets stored in secrets manager
+- Rotation schedule documented and automated
+- Environment documentation complete and accurate
+- GitHub secret scanning enabled
+- Pre-commit hooks preventing secret commits
+
+validation:
+- Remove DATABASE_URL → server exits with clear error
+- Run truffleHog → no secrets found in history
+- Check secrets manager → all production secrets stored
+- Run rotation script → new JWT secret generated, app continues working
+
+notes:
+- Doppler is excellent for team secret management
+- AWS Secrets Manager integrates well with ECS/Fargate
+- Never commit .env files — use .env.example only
+- Consider using sealed secrets for Kubernetes
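The fail-fast validation in step 1 of task 16 can be illustrated without any library. This is a dependency-free stand-in for the Zod/Valibot schema the task proposes, not a replacement for it; the variable names come from step 2, and `validateEnv` is a hypothetical helper.

```typescript
// Required variables, taken from the secrets listed in step 2.
const REQUIRED_VARS = ["DATABASE_URL", "JWT_SECRET", "STRIPE_SECRET_KEY", "CLERK_SECRET_KEY"] as const;
type RequiredVar = (typeof REQUIRED_VARS)[number];
type Env = Record<RequiredVar, string>;

// Throws one error naming every missing or blank variable, so a
// misconfigured deploy fails at startup instead of at first use.
function validateEnv(source: Record<string, string | undefined>): Env {
  const missing = REQUIRED_VARS.filter((name) => {
    const value = source[name];
    return value === undefined || value.trim() === "";
  });
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  const env = {} as Env;
  for (const name of REQUIRED_VARS) {
    env[name] = source[name] as string;
  }
  return env;
}
```

On the server this would be called once at startup with `process.env`, and the typed result exported so the rest of the code never reads `process.env` directly.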
--- a/tasks/web-production/17-e2e-testing.md
+++ b/tasks/web-production/17-e2e-testing.md
@@ -0,0 +1,73 @@
+# 17. End-to-End Testing (Playwright)
+
+meta:
+  id: web-production-17
+  feature: web-production
+  priority: P1
+  depends_on: []
+  tags: [testing, e2e, quality]
+
+objective:
+- Implement comprehensive end-to-end tests covering critical user journeys using Playwright
+
+deliverables:
+- Playwright test suite for critical flows
+- Test database seeding and cleanup
+- Visual regression testing setup
+- CI integration for E2E tests
+
+steps:
+1. Install and configure Playwright:
+   - Install @playwright/test in web/package.json
+   - Create playwright.config.ts with project settings
+   - Configure test database (separate from dev)
+2. Create test utilities:
+   - Test user creation helper
+   - Database reset between tests
+   - Authentication state management
+   - API mocking helpers
+3. Write critical path tests:
+   - Landing page → Signup → Onboarding → Dashboard
+   - Login → Dashboard → DarkWatch → Add watchlist item
+   - Login → Settings → Update profile
+   - Login → Billing → View pricing → Checkout (test mode)
+   - Admin login → Blog → Create post → Publish
+   - Real-time alerts: WebSocket connection and alert display
+4. Add visual regression tests:
+   - Screenshot comparison for landing page
+   - Screenshot comparison for dashboard
+   - Screenshot comparison for mobile responsive layout
+5. Configure test data:
+   - Seed test database with known data
+   - Use test Stripe keys for billing tests
+   - Mock external APIs (Twilio, FCM) in tests
+6. Add CI integration:
+   - Run E2E tests on PR (not blocking initially)
+   - Upload test artifacts (screenshots, videos)
+   - Parallel test execution across browsers
+
+tests:
+- E2E: All critical paths pass in CI
+- Visual: Screenshot diffs reviewed and approved
+- Cross-browser: Tests pass on Chromium, Firefox, WebKit
+
+acceptance_criteria:
+- 10+ E2E tests covering critical user journeys
+- Tests run in <5 minutes with parallel execution
+- Tests pass on Chromium, Firefox, and WebKit
+- Visual regression catching UI changes
+- Test artifacts (screenshots, videos) uploaded on failure
+- Tests use isolated test database
+- Mobile viewport tests included
+
+validation:
+- `npx playwright test` → all tests pass
+- CI pipeline runs E2E tests on PR
+- Change button color → visual regression test fails
+- Check test report → screenshots and traces available
+
+notes:
+- Playwright runs Chromium, Firefox, and WebKit out of the box with built-in parallel execution, which is why it is preferred over Cypress here
+- Use test database to avoid polluting dev data
+- Start with 5 critical paths, expand over time
+- Consider using MSW for API mocking in tests
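A possible `playwright.config.ts` covering the cross-browser, mobile-viewport, and artifact requirements of task 17. This is a configuration sketch, not the project's actual config: the `testDir` value, the `E2E_BASE_URL` variable, and the local port are assumptions.

```typescript
import { defineConfig, devices } from "@playwright/test";

export default defineConfig({
  testDir: "./e2e", // hypothetical location of the E2E specs
  fullyParallel: true, // run tests within a file in parallel too
  retries: process.env.CI ? 2 : 0, // retry only in CI to surface flakiness locally
  reporter: [["html", { open: "never" }]],
  use: {
    // Hypothetical variable name and port; point this at the test deployment.
    baseURL: process.env.E2E_BASE_URL ?? "http://localhost:3000",
    trace: "on-first-retry",
    screenshot: "only-on-failure",
    video: "retain-on-failure",
  },
  projects: [
    { name: "chromium", use: { ...devices["Desktop Chrome"] } },
    { name: "firefox", use: { ...devices["Desktop Firefox"] } },
    { name: "webkit", use: { ...devices["Desktop Safari"] } },
    // Mobile viewport project for the responsive-layout checks.
    { name: "mobile-chrome", use: { ...devices["Pixel 5"] } },
  ],
});
```

Keeping screenshots, videos, and traces limited to failures and retries keeps the uploaded artifacts small while still satisfying the "artifacts uploaded on failure" criterion.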
--- a/tasks/web-production/18-load-testing.md
+++ b/tasks/web-production/18-load-testing.md
@@ -0,0 +1,78 @@
+# 18. Load & Stress Testing
+
+meta:
+  id: web-production-18
+  feature: web-production
+  priority: P2
+  depends_on: []
+  tags: [testing, performance, production]
+
+objective:
+- Validate application performance under production-like load and identify bottlenecks
+
+deliverables:
+- Load test suite with k6 or Artillery
+- Performance baseline documentation
+- Bottleneck identification report
+- Scaling recommendations
+
+steps:
+1. Set up load testing tool:
+   - Install k6 or Artillery
+   - Create tests/ directory for load tests
+   - Configure test environment (staging)
+2. Write load tests for critical endpoints:
+   - GET / (landing page)
+   - POST /api/trpc/user.login
+   - GET /api/trpc/user.me (authenticated)
+   - GET /api/trpc/darkwatch.getExposures
+   - GET /api/trpc/alerts.getAlerts
+   - WebSocket connection and alert subscription
+3. Define load scenarios:
+   - Baseline: 100 concurrent users, 5 minutes
+   - Target: 1000 concurrent users, 10 minutes
+   - Stress: 5000 concurrent users, 5 minutes
+   - Spike: 0 to 2000 users in 10 seconds
+4. Measure and record:
+   - Response time percentiles (p50, p95, p99)
+   - Error rate
+   - Requests per second (throughput)
+   - CPU and memory usage on server
+   - Database connection pool utilization
+   - Redis memory usage
+5. Identify bottlenecks:
+   - Slow queries from database
+   - Memory leaks
+   - Connection pool exhaustion
+   - CPU-bound operations
+6. Document scaling recommendations:
+   - Horizontal scaling (more instances)
+   - Vertical scaling (bigger instances)
+   - Caching improvements
+   - Query optimization
+
+tests:
+- Load: Baseline test passes with <200ms p95
+- Stress: App remains functional under 5x target load (5000 concurrent users)
+- Spike: App recovers within 30 seconds after spike
+
+acceptance_criteria:
+- Baseline load (100 concurrent) → p95 < 200ms, 0% errors
+- Target load (1000 concurrent) → p95 < 500ms, <1% errors
+- Stress load (5000 concurrent) → no crashes, <5% errors
+- Spike test → recovery within 30 seconds
+- Performance baseline documented with metrics
+- Bottleneck report with actionable recommendations
+- Scaling plan documented
+
+validation:
+- Run k6 against staging → results within acceptable thresholds
+- Check server metrics during test → CPU <80%, memory <80%
+- Database connections → pool not exhausted
+- Review report → identified 3+ bottlenecks with fixes
+
+notes:
+- Always test against staging, never production
+- Schedule load tests during low-traffic periods
+- Use k6 Cloud for distributed load testing if needed
+- Consider using Vercel Analytics for real-user monitoring (RUM)
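The acceptance criteria in task 18 reduce to a p95 latency bound and an error-rate bound per scenario. k6 and Artillery compute percentiles themselves; the sketch below only spells out the pass/fail logic, using the nearest-rank percentile method (an assumption, since tools differ slightly in interpolation). `percentile`, `withinBudget`, and `BUDGETS` are hypothetical names.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(Math.ceil((p / 100) * sorted.length), 1);
  return sorted[rank - 1];
}

interface Budget {
  p95Ms: number; // p95 must be strictly below this
  maxErrorRate: number; // error rate must be 0, or strictly below this
}

// Budgets copied from the acceptance criteria above.
const BUDGETS: Record<"baseline" | "target", Budget> = {
  baseline: { p95Ms: 200, maxErrorRate: 0 },
  target: { p95Ms: 500, maxErrorRate: 0.01 },
};

function withinBudget(latenciesMs: number[], errors: number, total: number, budget: Budget): boolean {
  const errorRate = total === 0 ? 1 : errors / total;
  const latencyOk = percentile(latenciesMs, 95) < budget.p95Ms;
  const errorsOk = errorRate === 0 || errorRate < budget.maxErrorRate;
  return latencyOk && errorsOk;
}
```

Recording the raw latency samples alongside the verdict makes it easier to compare later runs against the documented baseline.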
--- a/tasks/web-production/19-accessibility-audit.md
+++ b/tasks/web-production/19-accessibility-audit.md
@@ -0,0 +1,78 @@
+# 19. Accessibility Audit & WCAG Compliance
+
+meta:
+  id: web-production-19
+  feature: web-production
+  priority: P2
+  depends_on: []
+  tags: [testing, accessibility, compliance]
+
+objective:
+- Ensure the web application meets WCAG 2.1 AA standards and is usable by people with disabilities
+
+deliverables:
+- Automated accessibility testing with axe-core
+- Manual keyboard navigation audit
+- Screen reader testing
+- Accessibility statement page
+
+steps:
+1. Set up automated accessibility testing:
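+   - Install axe-core with a Vitest integration such as vitest-axe (@axe-core/react is React-only and does not fit the SolidJS components)
+   - Add accessibility tests to component test suite
+   - Integrate @axe-core/playwright with Playwright E2E tests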
+   - Fail build on critical accessibility violations
+2. Run automated audit:
+   - Test all pages: landing, auth, dashboard, settings
+   - Check for: missing alt text, low contrast, missing labels, focus issues
+   - Generate report with violation severity
+3. Manual keyboard navigation audit:
+   - Navigate entire app using only Tab, Enter, Space, Escape
+   - Verify focus indicators visible on all interactive elements
+   - Test skip links and logical tab order
+   - Verify no keyboard traps
+4. Screen reader testing:
+   - Test with NVDA (Windows) or VoiceOver (macOS)
+   - Verify all interactive elements have accessible names
+   - Test live regions for dynamic content (alerts, toasts)
+   - Verify form error messages announced
+5. Fix critical issues:
+   - Add missing aria-labels and aria-describedby
+   - Fix color contrast ratios (minimum 4.5:1 for normal text)
+   - Ensure all images have alt text
+   - Add proper heading hierarchy (h1 → h2 → h3)
+6. Create accessibility statement:
+   - Page at /accessibility
+   - Commitment to WCAG 2.1 AA
+   - Known limitations
+   - Contact for accessibility feedback
+7. Add accessibility CI check:
+   - Lighthouse accessibility audit ≥ 95
+   - axe-core scan in CI pipeline
+
+tests:
+- Automated: axe-core scan passes with 0 critical or serious violations
+- Manual: Keyboard navigation completes all flows
+- Screen reader: All critical paths navigable
+
+acceptance_criteria:
+- WCAG 2.1 AA compliance on all pages
+- Lighthouse accessibility score ≥ 95
+- 0 critical or serious axe-core violations
+- All interactive elements keyboard accessible
+- Focus indicators visible and logical
+- All images have descriptive alt text
+- Color contrast ratios ≥ 4.5:1 for normal text
+- Accessibility statement page live
+
+validation:
+- Run axe-core → 0 critical/serious violations
+- Lighthouse CI → Accessibility score ≥ 95
+- Navigate with keyboard only → complete signup flow
+- Screen reader test → all elements announced correctly
+
+notes:
+- Current app has some accessibility features (skip link, aria-live) but needs audit
+- SolidJS components need proper aria attributes
+- Consider using Kobalte (a SolidJS counterpart to Radix UI primitives, which target React) for built-in accessibility
+- Test with actual assistive technology, not just automated tools
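The 4.5:1 contrast requirement in task 19 follows the WCAG 2.1 relative-luminance formula, which is short enough to check in a unit test against design tokens. The sketch below assumes colors are given as `#rrggbb` strings; the function names are hypothetical.

```typescript
// Converts one 8-bit sRGB channel to its linear value (WCAG 2.1 definition).
function linearize(channel: number): number {
  const c = channel / 255;
  return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Relative luminance of a #rrggbb color.
function relativeLuminance(hex: string): number {
  const m = /^#([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$/i.exec(hex);
  if (!m) throw new Error(`Expected a #rrggbb color, got: ${hex}`);
  const r = linearize(parseInt(m[1], 16));
  const g = linearize(parseInt(m[2], 16));
  const b = linearize(parseInt(m[3], 16));
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// Contrast ratio between two colors, from 1 (identical) to 21 (black on white).
function contrastRatio(foreground: string, background: string): number {
  const a = relativeLuminance(foreground);
  const b = relativeLuminance(background);
  const lighter = Math.max(a, b);
  const darker = Math.min(a, b);
  return (lighter + 0.05) / (darker + 0.05);
}

// WCAG AA threshold for normal-size text.
function meetsAANormalText(foreground: string, background: string): boolean {
  return contrastRatio(foreground, background) >= 4.5;
}
```

This only covers solid colors; text over gradients, images, or semi-transparent layers still needs the axe-core scan and manual review described above.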
--- a/tasks/web-production/20-dependency-scanning.md
+++ b/tasks/web-production/20-dependency-scanning.md
@@ -0,0 +1,71 @@
+# 20. Dependency Vulnerability Scanning
+
+meta:
+  id: web-production-20
+  feature: web-production
+  priority: P1
+  depends_on: []
+  tags: [security, dependencies, production]
+
+objective:
+- Implement continuous dependency vulnerability scanning and automated updates
+
+deliverables:
+- pnpm audit integration in CI
+- Snyk or Dependabot monitoring
+- Automated security patch PRs
+- SBOM (Software Bill of Materials) generation
+
+steps:
+1. Set up automated scanning:
+   - Enable Dependabot alerts in GitHub repository settings
+   - Configure Dependabot version updates (weekly)
+   - Add Snyk integration for deeper analysis
+   - Configure Snyk to fail builds on high+ severity
+2. Add CI scanning:
+   - `pnpm audit --audit-level=high` in GitHub Actions
+   - `snyk test` in CI pipeline
+   - Block PR merge on high/critical vulnerabilities
+3. Implement automated patching:
+   - Dependabot auto-PR for patch updates
+   - Snyk auto-fix PRs for fixable vulnerabilities
+   - Manual review required for major version updates
+4. Generate SBOM:
+   - Use cyclonedx or spdx-sbom-generator
+   - Generate on every release
+   - Store with release artifacts
+5. Audit current dependencies:
+   - Run `pnpm audit` and fix all high/critical issues
+   - Check for unmaintained packages
+   - Review direct dependencies for necessity
+   - Remove unused dependencies
+6. Set up alerting:
+   - Slack notification for new vulnerabilities
+   - Weekly vulnerability report
+   - Emergency alert for critical CVEs
+
+tests:
+- Security: Introduce vulnerable package → CI blocks merge
+- Integration: Verify Dependabot creates PR for outdated package
+- Audit: SBOM generated and contains all dependencies
+
+acceptance_criteria:
+- Zero high or critical vulnerabilities in dependencies
+- Dependabot monitoring all dependencies
+- CI fails on high+ severity vulnerabilities
+- SBOM generated for every release
+- Automated PRs for security patches within 24 hours
+- Weekly dependency update report
+- All unused dependencies removed
+
+validation:
+- `pnpm audit` → 0 high/critical findings
+- Check GitHub Security tab → no open alerts
+- Open PR with vulnerable package → CI fails and merge is blocked
+- Create release → SBOM artifact attached
+
+notes:
+- Some vulnerabilities may be in devDependencies — these are lower priority
+- Focus on production dependencies first
+- Consider using pnpm overrides for emergency patches
+- Review major version updates carefully for breaking changes
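Step 3 of task 20 distinguishes patch updates (automated PRs) from major updates (manual review). A sketch of that classification is shown below, assuming plain `MAJOR.MINOR.PATCH` version strings; pre-release tags and ranges are ignored, and `classifyBump` and `needsManualReview` are hypothetical helpers rather than part of Dependabot or Snyk.

```typescript
type Bump = "major" | "minor" | "patch" | "none";

// Parses the leading MAJOR.MINOR.PATCH of a version string (optional "v" prefix).
function parseVersion(version: string): [number, number, number] {
  const m = /^v?(\d+)\.(\d+)\.(\d+)/.exec(version);
  if (!m) throw new Error(`Not a semver version: ${version}`);
  return [Number(m[1]), Number(m[2]), Number(m[3])];
}

// Reports which component differs first between two versions.
function classifyBump(from: string, to: string): Bump {
  const a = parseVersion(from);
  const b = parseVersion(to);
  if (a[0] !== b[0]) return "major";
  if (a[1] !== b[1]) return "minor";
  if (a[2] !== b[2]) return "patch";
  return "none";
}

// Policy from step 3: major updates always need a human reviewer.
function needsManualReview(from: string, to: string): boolean {
  return classifyBump(from, to) === "major";
}
```

One caveat worth encoding if this is used for auto-merge: under semver, `0.x` packages may ship breaking changes in minor releases, so treating `0.x` minor bumps as needing review is the safer default.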
--- a/tasks/web-production/21-legal-pages.md
+++ b/tasks/web-production/21-legal-pages.md
@@ -0,0 +1,78 @@
+# 21. Privacy Policy, TOS & Legal Pages
+
+meta:
+  id: web-production-21
+  feature: web-production
+  priority: P2
+  depends_on: []
+  tags: [compliance, legal, production]
+
+objective:
+- Create and deploy all required legal pages for production operation
+
+deliverables:
+- Privacy Policy page (/privacy)
+- Terms of Service page (/terms)
+- Cookie Policy page (/cookies)
+- Data Processing Agreement (DPA) page
+- Legal pages linked in footer
+
+steps:
+1. Create Privacy Policy:
+   - Data collection practices (what, why, how long)
+   - Third-party services (Stripe, Clerk, Twilio, Firebase)
+   - User rights (access, rectification, deletion, portability)
+   - Contact information for privacy inquiries
+   - Last updated date
+2. Create Terms of Service:
+   - Service description and limitations
+   - User responsibilities and prohibited conduct
+   - Subscription terms and billing
+   - Termination clauses
+   - Limitation of liability
+   - Dispute resolution
+3. Create Cookie Policy:
+   - Types of cookies used (essential, analytics, marketing)
+   - Purpose of each cookie
+   - How to manage cookies
+   - Third-party cookies
+4. Create Data Processing Agreement:
+   - Roles and responsibilities
+   - Data security measures
+   - Subprocessor list
+   - Breach notification procedures
+5. Add legal pages to app:
+   - Create routes: /privacy, /terms, /cookies, /dpa
+   - Add links in Footer component
+   - Ensure pages are server-rendered for SEO
+6. Review with legal counsel:
+   - Have privacy policy reviewed by attorney
+   - Ensure compliance with applicable jurisdictions
+   - Update based on feedback
+
+tests:
+- Unit: Test routes render correctly
+- Integration: Verify links in footer navigate correctly
+- Compliance: Review with legal counsel
+
+acceptance_criteria:
+- Privacy Policy live at /privacy
+- Terms of Service live at /terms
+- Cookie Policy live at /cookies
+- DPA live at /dpa
+- All pages linked in site footer
+- Pages reviewed and approved by legal counsel
+- Last updated date within 30 days of launch
+- Contact email for privacy inquiries functional
+
+validation:
+- Navigate to /privacy → complete policy displayed
+- Click footer links → correct pages load
+- Legal counsel approval documented
+- Email to privacy@kordant.com → received
+
+notes:
+- Consider using Termly or iubenda for generated policies
+- Ensure policies cover all data processors (Stripe, Clerk, etc.)
+- Update policies when adding new third-party services
+- Keep records of user consent to terms
--- a/tasks/web-production/22-cookie-gdpr.md
+++ b/tasks/web-production/22-cookie-gdpr.md
@@ -0,0 +1,80 @@
+# 22. Cookie Consent & GDPR Compliance
+
+meta:
+  id: web-production-22
+  feature: web-production
+  priority: P2
+  depends_on: []
+  tags: [compliance, gdpr, cookies, production]
+
+objective:
+- Implement GDPR-compliant cookie consent with granular controls and data processing transparency
+
+deliverables:
+- Cookie consent banner component
+- Granular cookie preference management
+- Consent storage and enforcement
+- GDPR compliance verification
+
+steps:
+1. Create cookie consent banner:
+   - Banner appears on first visit
+   - Accept all, reject non-essential, customize options
+   - Links to cookie policy
+   - Dismissible but persistent until choice made
+   - Mobile-responsive design
+2. Implement granular controls:
+   - Essential cookies (always on): auth, security
+   - Analytics cookies (opt-in): PostHog, Plausible
+   - Marketing cookies (opt-in): retargeting, ads
+   - Preference cookies (opt-in): theme, language
+3. Create preference modal:
+   - Toggle switches for each category
+   - Description of each cookie type
+   - Save preferences button
+   - Re-openable from footer link
+4. Implement consent enforcement:
+   - Store consent in cookie/localStorage
+   - Block analytics scripts until consent given
+   - Block marketing scripts until consent given
+   - Respect "Do Not Track" browser setting
+5. Add GDPR-specific features:
+   - Data processing notice in signup flow
+   - Right to access data (export tool)
+   - Right to erasure (delete account)
+   - Right to portability (data export)
+   - Data retention periods documented
+6. Add consent logging:
+   - Log consent choices with timestamp
+   - Store for compliance audit trail
+   - Allow users to view their consent history
+
+tests:
+- Unit: Test consent banner rendering and interaction
+- Integration: Test analytics blocked until consent
+- Compliance: Verify DNT respected
+
+acceptance_criteria:
+- Cookie banner appears on first visit to all users
+- Users can accept, reject, or customize cookie preferences
+- Analytics scripts load only after opt-in consent
+- Marketing scripts load only after opt-in consent
+- Essential cookies function without consent
+- Consent preferences persist across sessions
+- "Do Not Track" browser setting respected
+- Consent choice logged with timestamp
+- GDPR rights accessible from settings page
+- Cookie policy linked from banner and footer
+
+validation:
+- Clear cookies → visit site → banner appears
+- Click "Reject" → analytics network requests blocked
+- Click "Customize" → toggle analytics on → requests allowed
+- Enable DNT in browser → banner shows "DNT detected"
+- Check localStorage → consent object stored
+
+notes:
+- Use CookieConsent by Orestbida or build custom with SolidJS
+- Must comply with both GDPR (EU) and CCPA (California)
+- Analytics must be completely blocked, not just paused
+- Retain consent records long enough to demonstrate consent (GDPR Art. 7(1)); 2 years is a working default to confirm with legal counsel, not a statutory figure
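The consent enforcement in step 4 of task 22 comes down to two rules: essential cookies never need consent, and everything else stays blocked until an explicit opt-in, with "Do Not Track" overriding analytics and marketing. A sketch of that logic follows; the `ConsentRecord` shape and the function names are assumptions, and persistence (cookie or localStorage) is left out.

```typescript
type Category = "essential" | "analytics" | "marketing" | "preferences";

interface ConsentRecord {
  decidedAt: string; // ISO timestamp, kept for the audit trail
  choices: Record<Category, boolean>;
}

// Builds the stored record from the user's toggles. Anything not explicitly
// enabled is treated as declined; DNT forces analytics and marketing off.
function buildConsent(
  requested: Partial<Record<Category, boolean>>,
  doNotTrack: boolean,
  now: Date,
): ConsentRecord {
  return {
    decidedAt: now.toISOString(),
    choices: {
      essential: true,
      analytics: !doNotTrack && requested.analytics === true,
      marketing: !doNotTrack && requested.marketing === true,
      preferences: requested.preferences === true,
    },
  };
}

// Gate to call before injecting any third-party script.
// With no stored record (first visit), only essential scripts may load.
function mayLoad(category: Category, consent: ConsentRecord | null): boolean {
  if (category === "essential") return true;
  return consent !== null && consent.choices[category];
}
```

Calling `mayLoad` at the point where a script tag would be injected, rather than loading the script and disabling it afterwards, is what satisfies the "completely blocked, not just paused" note.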
--- a/tasks/web-production/23-data-export-deletion.md
+++ b/tasks/web-production/23-data-export-deletion.md
@@ -0,0 +1,76 @@
+# 23. Data Export & Deletion Tools
+
+meta:
+  id: web-production-23
+  feature: web-production
+  priority: P2
+  depends_on: []
+  tags: [compliance, gdpr, privacy, production]
+
+objective:
+- Implement user-facing data export and account deletion tools to comply with GDPR and CCPA requirements
+
+deliverables:
+- Data export API and UI (/settings/data-export)
+- Account deletion API and UI (/settings/delete-account)
+- Data retention policy enforcement
+- Deletion confirmation and grace period
+
+steps:
+1. Create data export functionality:
+   - API endpoint: POST /api/trpc/user.exportData
+   - Collect all user data: profile, alerts, exposures, subscriptions, family members
+   - Format as JSON or machine-readable format
+   - Include metadata: export date, data categories
+   - Email download link or provide direct download
+   - Complete within one month of the request (GDPR Art. 12(3))
+2. Create account deletion:
+   - UI in settings page with warning and confirmation
+   - Require password re-entry for confirmation
+   - API endpoint: POST /api/trpc/user.delete
+   - Soft delete first (mark deletedAt; defer irreversible anonymization until hard delete so the account stays restorable)
+   - Hard delete after 30-day grace period
+   - Cancel active subscriptions via Stripe
+   - Remove from email lists
+3. Implement family data handling:
+   - If family group owner: transfer ownership or delete group
+   - If family member: remove from group
+   - Notify family members of account deletion
+4. Add data retention policy:
+   - Define retention periods per data type
+   - Automated cleanup of deleted accounts after 30 days
+   - Audit logs retained for 1 year
+   - Backup deletion after retention period
+5. Add admin tools:
+   - Admin endpoint to fulfill data export requests
+   - Admin endpoint to process deletion requests
+   - Audit log of all export/deletion actions
+
+tests:
+- Unit: Test export includes all user data
+- Integration: Test deletion flow end-to-end
+- Compliance: Verify grace period and hard delete
+
+acceptance_criteria:
+- Users can export all personal data from settings
+- Export includes: profile, alerts, exposures, watchlist, subscriptions, family data
+- Export for typical accounts delivered within 30 seconds; large exports handled asynchronously
+- Account deletion requires password confirmation
+- Deleted accounts soft-deleted immediately, hard-deleted after 30 days
+- Active subscriptions cancelled on deletion
+- Family group handled correctly (ownership transfer)
+- Deletion audit log maintained
+- Data retention policy documented and enforced
+
+validation:
+- Export data → JSON file contains all user data
+- Delete account → user marked deleted, can log in to restore within 30 days
+- After 30 days → user data completely removed from DB
+- Check Stripe → subscription cancelled
+- Check audit log → deletion action recorded
+
+notes:
+- Soft delete preserves referential integrity for family groups
+- Hard delete must cascade through all related tables
+- Consider GDPR Article 17 exceptions (legal obligations)
+- Backup restoration may temporarily restore deleted data
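The soft-delete and 30-day grace period in task 23 can be expressed as a small state function that both the login path (restore allowed?) and the cleanup job (purge due?) share. The sketch assumes a nullable `deletedAt` timestamp on the user record; `accountState` and the state names are hypothetical.

```typescript
const GRACE_PERIOD_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

type AccountState = "active" | "restorable" | "purge_due";

// Derives the account state from the soft-delete timestamp.
// "restorable": inside the grace period, logging in may restore the account.
// "purge_due": grace period elapsed, the cleanup job should hard-delete.
function accountState(deletedAt: Date | null, now: Date): AccountState {
  if (deletedAt === null) return "active";
  const elapsedMs = now.getTime() - deletedAt.getTime();
  return elapsedMs >= GRACE_PERIOD_DAYS * MS_PER_DAY ? "purge_due" : "restorable";
}
```

Deriving the state from a single timestamp avoids a separate status column that could drift out of sync with `deletedAt`.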
--- a/tasks/web-production/24-security-txt.md
+++ b/tasks/web-production/24-security-txt.md
@@ -0,0 +1,79 @@
+# 24. Security.txt & Responsible Disclosure
+
+meta:
+  id: web-production-24
+  feature: web-production
+  priority: P2
+  depends_on: []
+  tags: [security, compliance, production]
+
+objective:
+- Implement security.txt and responsible disclosure process for security researchers
+
+deliverables:
+- security.txt file at /.well-known/security.txt
+- security@kordant.com email address
+- Responsible disclosure policy page
+- Bug bounty program foundation
+
+steps:
+1. Create security.txt:
+   - Contact: mailto:security@kordant.com
+   - Expires: ISO 8601 timestamp less than 1 year in the future (per RFC 9116 recommendation)
+   - Encryption: link to PGP key (optional)
+   - Acknowledgments: link to hall of fame page
+   - Policy: link to disclosure policy
+   - Hiring: link to security jobs (if applicable)
+2. Create responsible disclosure policy:
+   - Page at /security/disclosure
+   - Scope of testing (what's in scope, what's out)
+   - Rules of engagement (no DDoS, no data exfiltration)
+   - Safe harbor promise (won't prosecute good faith research)
+   - Reporting process and expected response time
+   - Reward/recognition program details
+3. Set up security email:
+   - Create security@kordant.com alias
+   - Forward to engineering team
+   - Set up auto-responder with acknowledgment
+   - Create internal triage process
+4. Create vulnerability response process:
+   - Internal SLA: acknowledge within 48 hours
+   - Triage within 72 hours
+   - Fix critical vulnerabilities within 7 days
+   - Fix high severity within 30 days
+   - Public disclosure after fix deployed
+5. Add hall of fame page:
+   - Page at /security/hall-of-fame
+   - List researchers who reported valid vulnerabilities
+   - Include date, severity, and researcher name (with permission)
+6. Add security page to footer:
+   - Link to disclosure policy
+   - Link to security.txt
+   - Link to hall of fame
+
+tests:
+- Integration: Verify security.txt accessible
+- Process: Test email auto-responder
+- Content: Review policy with security team
+
+acceptance_criteria:
+- security.txt accessible at /.well-known/security.txt
+- Disclosure policy live at /security/disclosure
+- security@kordant.com email active with auto-responder
+- Hall of fame page live at /security/hall-of-fame
+- Safe harbor promise clearly stated
+- Response SLA documented and followed
+- Security links in site footer
+- PGP key available for encrypted communication (optional)
+
+validation:
+- `curl https://kordant.com/.well-known/security.txt` → valid security.txt
+- Email security@kordant.com → auto-responder received
+- Navigate to /security/disclosure → complete policy visible
+- Check footer → security links present
+
+notes:
+- security.txt standard defined by RFC 9116
+- Safe harbor is critical for encouraging responsible disclosure
+- Consider joining HackerOne or Bugcrowd for managed bug bounty
+- Document vulnerability severity classification (CVSS)
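Reviewer note on task 24, step 1: a minimal sketch of how the security.txt body could be generated rather than hand-edited, so the Expires date can be refreshed on each deploy. The field names come from RFC 9116; `SecurityTxtFields` and `buildSecurityTxt` are hypothetical names, not existing code in this repository.

```typescript
// Hypothetical helper; field names follow RFC 9116, everything else is illustrative.
interface SecurityTxtFields {
  contact: string[];        // required: at least one contact URI
  expires: Date;            // required: when this file should be considered stale
  encryption?: string;
  acknowledgments?: string;
  policy?: string;
  hiring?: string;
}

function buildSecurityTxt(fields: SecurityTxtFields, now: Date): string {
  if (fields.contact.length === 0) {
    throw new Error("security.txt requires at least one Contact field");
  }
  if (fields.expires.getTime() <= now.getTime()) {
    throw new Error("Expires must be in the future");
  }
  const lines: string[] = [];
  fields.contact.forEach(function (uri) {
    lines.push("Contact: " + uri);
  });
  // toISOString() yields an ISO 8601 UTC timestamp.
  lines.push("Expires: " + fields.expires.toISOString());
  if (fields.encryption) lines.push("Encryption: " + fields.encryption);
  if (fields.acknowledgments) lines.push("Acknowledgments: " + fields.acknowledgments);
  if (fields.policy) lines.push("Policy: " + fields.policy);
  if (fields.hiring) lines.push("Hiring: " + fields.hiring);
  return lines.join("\n") + "\n";
}
```

The result would be served as `text/plain` at `/.well-known/security.txt`; how the route is wired up depends on the app's server setup and is not shown here.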
--- a/tasks/web-production/25-seo-meta.md
+++ b/tasks/web-production/25-seo-meta.md
@@ -0,0 +1,83 @@
+# 25. Sitemap, Robots.txt & Open Graph
+
+meta:
+  id: web-production-25
+  feature: web-production
+  priority: P2
+  depends_on: []
+  tags: [seo, marketing, production]
+
+objective:
+- Implement SEO fundamentals including sitemap, robots.txt, and Open Graph meta tags for all pages
+
+deliverables:
+- Dynamic sitemap.xml generation
+- robots.txt configuration
+- Open Graph meta tags on all pages
+- Twitter Card meta tags
+- Canonical URLs
+
+steps:
+1. Create dynamic sitemap:
+   - Route: /sitemap.xml
+   - Include all public pages: /, /about, /features, /pricing, /blog/*
+   - Include auth pages: /login, /signup
+   - Exclude admin pages and user-specific pages
+   - Set priorities and change frequencies
+   - Auto-update when blog posts published
+2. Create robots.txt:
+   - Allow: all public pages
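+   - Disallow: admin routes, /api/*, /billing/*, /auth/* (list admin pages by public URL path; a `(admin)` route-group folder name does not appear in URLs)
+   - Sitemap reference
+   - Crawl-delay for respectful crawling (non-standard; Googlebot ignores it)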
+3. Add Open Graph tags to all pages:
+   - og:title matching page title
+   - og:description from meta description
+   - og:image with branded preview image (1200x630)
+   - og:url with canonical URL
+   - og:type (website, article for blog)
+   - og:site_name: Kordant
+4. Add Twitter Card tags:
+   - twitter:card: summary_large_image
+   - twitter:title, twitter:description, twitter:image
+5. Add canonical URLs:
+   - Prevent duplicate content issues
+   - Use absolute URLs with https
+   - Handle query parameters correctly
+6. Create branded OG image:
+   - Design 1200x630px image with Kordant branding
+   - Include logo, tagline, and shield icon
+   - Generate dynamically for blog posts (optional)
+7. Add structured data:
+   - Organization schema on homepage
+   - WebSite schema with SearchAction
+   - Article schema for blog posts
+   - SoftwareApplication schema for app
+
+tests:
+- Unit: Test sitemap XML generation
+- Integration: Verify meta tags on all pages
+- SEO: Test with Facebook Sharing Debugger and Twitter Card Validator
+
+acceptance_criteria:
+- Sitemap accessible at /sitemap.xml with all public pages
+- robots.txt accessible at /robots.txt with correct directives
+- Open Graph tags present on all public pages
+- Twitter Card tags present on all public pages
+- Canonical URL on every page
+- Branded OG image displaying correctly in social shares
+- Structured data valid per schema.org (test with Google Rich Results)
+- Blog posts have Article schema
+
+validation:
+- `curl https://kordant.com/sitemap.xml` → valid XML with all routes
+- `curl https://kordant.com/robots.txt` → correct allow/disallow directives
+- Facebook Sharing Debugger → OG image and title display correctly
+- Google Rich Results Test → structured data valid
+- View page source → all meta tags present
+
+notes:
+- SolidJS MetaProvider already in use — extend with OG tags
+- Use @solidjs/meta for dynamic meta tags per route
+- Consider using @vercel/og or similar for dynamic OG images
+- Blog sitemap should update automatically on publish
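Reviewer note on task 25, step 1: a minimal sketch of the sitemap XML generation that the unit test in this task would exercise. `SitemapEntry`, `escapeXml`, and `buildSitemap` are hypothetical names; the route list would come from the app's public pages and published blog posts, which is assumed here rather than shown.

```typescript
// Hypothetical sitemap builder; the entry shape is an assumption for illustration.
interface SitemapEntry {
  path: string;          // site-relative path, e.g. "/pricing"
  lastModified?: Date;   // optional; emitted as a YYYY-MM-DD <lastmod>
}

// Escape the five characters that are special in XML text and attributes.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

function buildSitemap(baseUrl: string, entries: SitemapEntry[]): string {
  // Drop trailing slashes so "https://host/" + "/about" does not double up.
  const siteRoot = baseUrl.replace(/\/+$/, "");
  const urls = entries.map(function (entry) {
    const loc = "<loc>" + escapeXml(siteRoot + entry.path) + "</loc>";
    const lastmod = entry.lastModified
      ? "<lastmod>" + entry.lastModified.toISOString().slice(0, 10) + "</lastmod>"
      : "";
    return "  <url>" + loc + lastmod + "</url>";
  });
  return ['<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    .concat(urls)
    .concat(["</urlset>"])
    .join("\n");
}
```

Passing absolute `https` URLs through `baseUrl` keeps the sitemap consistent with the canonical-URL rule in step 5.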
--- a/tasks/web-production/26-analytics.md
+++ b/tasks/web-production/26-analytics.md
@@ -0,0 +1,83 @@
+# 26. Analytics Integration (Plausible/PostHog)
+
+meta:
+  id: web-production-26
+  feature: web-production
+  priority: P2
+  depends_on: []
+  tags: [analytics, marketing, production]
+
+objective:
+- Implement privacy-respecting analytics to understand user behavior and measure conversion funnels
+
+deliverables:
+- Analytics tracking setup
+- Custom event tracking for key actions
+- Conversion funnel measurement
+- Dashboard for key metrics
+
+steps:
+1. Set up analytics platform:
+   - Choose: Plausible (privacy-first, simple) or PostHog (powerful, self-hostable)
+   - Create account and add tracking script
+   - Configure domain and goals
+2. Add tracking to app:
+   - Add script to web/src/entry-client.tsx or layout
+   - Respect cookie consent (load only after opt-in)
+   - Respect Do Not Track
+   - Exclude admin traffic
+3. Track page views:
+   - All public pages
+   - Dashboard pages (anonymized)
+   - Blog post reads
+4. Track custom events:
+   - signup_started, signup_completed
+   - login, logout
+   - subscription_started, subscription_completed
+   - darkwatch_scan_initiated
+   - alert_viewed, alert_resolved
+   - feature_page_viewed (voiceprint, spamshield, etc.)
+5. Create conversion funnels:
+   - Landing → Signup → Onboarding → Dashboard
+   - Dashboard → Pricing → Checkout → Subscription
+   - Blog → Signup (content marketing ROI)
+6. Set up dashboards:
+   - Daily/weekly active users
+   - Signup conversion rate
+   - Subscription conversion rate
+   - Feature adoption (DarkWatch, VoicePrint, etc.)
+   - Churn rate
+   - Revenue metrics (via Stripe integration)
+7. Add A/B testing foundation:
+   - PostHog feature flags or Split.io
+   - Test landing page variants
+   - Test pricing page variants
+
+tests:
+- Integration: Verify events fire correctly
+- Privacy: Confirm no PII in analytics payload
+- Consent: Test analytics blocked until cookie consent
+
+acceptance_criteria:
+- Analytics tracking active on all public pages
+- Custom events firing for signup, login, subscription, key features
+- Conversion funnels visible in dashboard
+- No PII (names, emails, IDs) sent to analytics
+- Analytics loads only after cookie consent (if required)
+- Admin pages excluded from tracking
+- Daily active users metric available
+- Subscription conversion rate tracked
+- A/B testing framework ready for use
+
+validation:
+- Visit landing page → pageview event in analytics
+- Sign up → signup_completed event with funnel progression
+- Check analytics dashboard → conversion rates visible
+- Inspect network tab → no email addresses in payload
+- Reject cookies → analytics script not loaded
+
+notes:
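+- Plausible is cookieless, so it generally does not need a cookie consent banner (confirm with legal for GDPR/ePrivacy)
+- PostHog offers more features, but its default cookie-based tracking generally requires consent in the EU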
+- Consider self-hosting Plausible for complete data control
+- Stripe can send revenue data to analytics automatically
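Reviewer note on task 26, steps 2 and 4: a minimal sketch of a consent-gated tracker that strips likely PII before an event leaves the browser. It is vendor-neutral: `send` stands in for whatever Plausible or PostHog call is chosen, and `scrubProps`, `createTracker`, `ConsentState`, and the blocked-key list are all assumptions for illustration.

```typescript
type EventProps = Record<string, string | number | boolean>;

// Assumed deny-list of property names; extend to match the real event schema.
const BLOCKED_KEYS = ["email", "name", "userid", "user_id", "phone"];
// Loose email detector (no "g" flag, so test() keeps no state between calls).
const EMAIL_PATTERN = /[^\s@]+@[^\s@]+\.[^\s@]+/;

// Return a copy of props without blocked keys or email-looking string values.
function scrubProps(props: EventProps): EventProps {
  const clean: EventProps = {};
  Object.keys(props).forEach(function (key) {
    const value = props[key];
    if (value === undefined) return;
    if (BLOCKED_KEYS.indexOf(key.toLowerCase()) !== -1) return;
    if (typeof value === "string" && EMAIL_PATTERN.test(value)) return;
    clean[key] = value;
  });
  return clean;
}

interface ConsentState {
  analytics: boolean;   // user opted in to analytics
  doNotTrack: boolean;  // browser Do Not Track signal
}

// Build a track() function that sends nothing unless consent allows it.
function createTracker(
  consent: ConsentState,
  send: (eventName: string, props: EventProps) => void
): (eventName: string, props: EventProps) => boolean {
  return function track(eventName: string, props: EventProps): boolean {
    if (!consent.analytics || consent.doNotTrack) return false;
    send(eventName, scrubProps(props));
    return true;
  };
}
```

Routing every event through one wrapper like this gives the "no PII in analytics payload" test a single function to assert against.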
--- a/tasks/web-production/27-structured-data.md
+++ b/tasks/web-production/27-structured-data.md
@@ -0,0 +1,82 @@
+# 27. Structured Data & Rich Snippets
+
+meta:
+  id: web-production-27
+  feature: web-production
+  priority: P2
+  depends_on: []
+  tags: [seo, marketing, production]
+
+objective:
+- Implement schema.org structured data to enable rich snippets in search results and improve SEO
+
+deliverables:
+- JSON-LD structured data on all relevant pages
+- Organization schema
+- WebSite schema with search
+- Article schema for blog posts
+- SoftwareApplication schema
+- BreadcrumbList schema
+
+steps:
+1. Add Organization schema to homepage:
+   - @type: Organization
+   - name: Kordant
+   - url: https://kordant.com
+   - logo: URL to logo image
+   - sameAs: social media profiles
+   - description: AI-powered identity protection
+2. Add WebSite schema:
+   - @type: WebSite
+   - url: https://kordant.com
+   - potentialAction: SearchAction with search URL template
+3. Add SoftwareApplication schema:
+   - @type: SoftwareApplication
+   - name: Kordant
+   - applicationCategory: SecurityApplication
+   - operatingSystem: Web, iOS, Android
+   - offers: Free tier, Plus ($12/mo), Premium ($29/mo)
+   - aggregateRating (once reviews collected)
+   - featureList: DarkWatch, VoicePrint, SpamShield, HomeTitle, RemoveBrokers
+4. Add Article schema for blog posts:
+   - @type: Article
+   - headline, author, datePublished, dateModified
+   - image, articleBody, keywords
+   - publisher (Organization reference)
+5. Add BreadcrumbList schema:
+   - Dynamic breadcrumbs based on current route
+   - Include in all non-home pages
+6. Add FAQPage schema (optional):
+   - For /about or /features pages
+   - Common questions and answers
+7. Validate all structured data:
+   - Test with Google Rich Results Test
+   - Test with Schema Markup Validator
+   - Fix any warnings or errors
+
+tests:
+- Unit: Test JSON-LD generation for each schema type
+- Integration: Verify schema present in page source
+- SEO: Validate with Google's tools
+
+acceptance_criteria:
+- Organization schema on homepage
+- WebSite schema with SearchAction on homepage
+- SoftwareApplication schema with pricing and features
+- Article schema on all blog posts
+- BreadcrumbList on all non-home pages
+- All schemas pass Google Rich Results Test
+- No errors or warnings in Schema Markup Validator
+- Schemas dynamically generated based on page data
+
+validation:
+- View homepage source → Organization and WebSite JSON-LD present
+- View blog post source → Article JSON-LD with correct dates
+- Google Rich Results Test → all schemas valid
+- Search console → rich results reported
+
+notes:
+- Use @solidjs/meta or script tags in JSX for JSON-LD
+- Keep JSON-LD in <head> for optimal crawler discovery
+- Update SoftwareApplication schema when pricing changes
+- Consider adding Review schema once user reviews available
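Reviewer note on task 27, step 1: a minimal sketch of generating the Organization JSON-LD and serializing it safely for a script tag. The property names are schema.org's; `OrganizationInput`, `organizationJsonLd`, and `toJsonLdScript` are hypothetical helpers, and how the string is injected into the head (for example via @solidjs/meta) is left out.

```typescript
// Hypothetical input shape mirroring the fields listed in step 1.
interface OrganizationInput {
  name: string;
  url: string;
  logo: string;
  sameAs: string[];
  description: string;
}

function organizationJsonLd(org: OrganizationInput): Record<string, unknown> {
  return {
    "@context": "https://schema.org",
    "@type": "Organization",
    name: org.name,
    url: org.url,
    logo: org.logo,
    sameAs: org.sameAs,
    description: org.description,
  };
}

// Serialize for embedding in <script type="application/ld+json">.
// Escaping "<" stops a value containing "</script>" from closing the tag early;
// JSON parsers decode \u003c back to "<", so the data is unchanged.
function toJsonLdScript(data: Record<string, unknown>): string {
  const json = JSON.stringify(data).replace(/</g, "\\u003c");
  return '<script type="application/ld+json">' + json + "</script>";
}
```

The same `toJsonLdScript` helper could wrap the WebSite, Article, and BreadcrumbList objects, which keeps the escaping in one place.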
--- a/tasks/web-production/28-api-versioning.md
+++ b/tasks/web-production/28-api-versioning.md
@@ -0,0 +1,73 @@
+# 28. API Versioning & Deprecation Strategy
+
+meta:
+  id: web-production-28
+  feature: web-production
+  priority: P2
+  depends_on: []
+  tags: [api, stability, mobile]
+
+objective:
+- Establish API versioning and deprecation strategy to support mobile app updates without breaking existing clients
+
+deliverables:
+- API versioning scheme
+- Deprecation policy documentation
+- Backward compatibility testing
+- Mobile client version tracking
+
+steps:
+1. Implement API versioning:
+   - Current: tRPC v10 (consider upgrade to v11)
+   - Add version header or URL prefix for breaking changes
+   - Version format: v1, v2, etc.
+   - Mobile apps send X-API-Version header
+2. Create deprecation policy:
+   - Document in docs/API_VERSIONING.md
+   - Breaking changes only in major versions
+   - Support previous version for minimum 6 months
+   - Announce deprecations 3 months in advance
+   - Sunset dates for old versions
+3. Add version negotiation:
+   - Backend supports multiple tRPC router versions
+   - Route to correct router based on version header
+   - Default to latest for web clients
+4. Track client versions:
+   - Log app version from User-Agent or X-Client-Version
+   - Dashboard showing active client versions
+   - Alert when old versions still in use near sunset
+5. Add compatibility tests:
+   - Test all mobile app versions against current API
+   - Automated compatibility matrix
+   - Breaking change detection in CI
+6. Document API changes:
+   - Changelog for all API modifications
+   - Migration guides for major versions
+   - Breaking vs non-breaking classification
+
+tests:
+- Unit: Test version routing
+- Integration: Test old client with new API
+- Compatibility: Verify mobile app versions work
+
+acceptance_criteria:
+- API versioning scheme documented and implemented
+- Mobile apps send version header in all requests
+- Backend supports at least 2 API versions simultaneously
+- Deprecation policy published and followed
+- 6-month support window for old versions
+- Client version tracking dashboard active
+- Compatibility tests passing for all supported versions
+- Changelog maintained for all API changes
+
+validation:
+- Mobile app sends X-API-Version: 1 → receives v1 responses
+- Deploy v2 changes → v1 clients continue working
+- Check dashboard → active client versions visible
+- Review changelog → all changes documented
+
+notes:
+- tRPC v10 to v11 is a breaking change — plan migration carefully
+- Mobile apps may take weeks to update — long support windows needed
+- Consider using feature flags instead of versioning for minor changes
+- Track iOS and Android app versions separately
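Reviewer note on task 28, steps 1 and 3: a minimal sketch of version negotiation from the X-API-Version header. It is independent of tRPC: the router type is generic, and `parseApiVersion`, `resolveRouter`, and the result shapes are hypothetical. A missing header defaults to the latest version, as step 3 specifies for web clients.

```typescript
// Accepts "1" or "v1"; a missing/blank header means "use latest".
// Returns null when the header is present but malformed.
function parseApiVersion(header: string | undefined, latest: number): number | null {
  if (header === undefined || header.trim() === "") return latest;
  const match = /^v?(\d+)$/.exec(header.trim());
  if (!match) return null;
  const digits = match[1];
  if (digits === undefined) return null;
  return parseInt(digits, 10);
}

interface Resolution<R> {
  ok: true;
  version: number;
  router: R;
}

interface Rejection {
  ok: false;
  reason: string;
}

// Pick the router registered for the requested version, or explain the refusal.
function resolveRouter<R>(
  routers: Record<number, R>,
  latest: number,
  header: string | undefined
): Resolution<R> | Rejection {
  const version = parseApiVersion(header, latest);
  if (version === null) {
    return { ok: false, reason: "malformed X-API-Version header" };
  }
  const router = routers[version];
  if (router === undefined) {
    return { ok: false, reason: "unsupported API version " + version };
  }
  return { ok: true, version: version, router: router };
}
```

Returning a rejection instead of silently falling back makes sunset versions visible to clients, which fits the deprecation policy in step 2.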
--- a/tasks/web-production/29-api-documentation.md
+++ b/tasks/web-production/29-api-documentation.md
@@ -0,0 +1,82 @@
+# 29. API Documentation (OpenAPI/tRPC Docs)
+
+meta:
+  id: web-production-29
+  feature: web-production
+  priority: P2
+  depends_on: []
+  tags: [api, documentation, production]
+
+objective:
+- Generate and publish comprehensive API documentation for internal and external developers
+
+deliverables:
+- Auto-generated API documentation
+- Interactive API explorer
+- Authentication documentation
+- Error code reference
+
+steps:
+1. Set up tRPC documentation generation:
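+   - Use trpc-openapi to generate the OpenAPI spec (check its maintenance status and tRPC version support first)
+   - Alternatives named here (@trpc/openapi-v3, trpc-docs, @trpc/doc-generator) are unverified; confirm they exist on npm before adopting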
+   - Export spec as JSON/YAML
+2. Create documentation site:
+   - Use Swagger UI or Scalar for interactive docs
+   - Host at /api/docs or separate docs subdomain
+   - Include request/response examples
+   - Include authentication requirements
+3. Document all routers:
+   - User router: login, signup, profile, family
+   - Billing router: subscription, checkout, webhooks
+   - DarkWatch router: watchlist, exposures, scan
+   - VoicePrint router: enrollments, analysis
+   - SpamShield router: rules, phone check
+   - HomeTitle router: properties, monitoring
+   - RemoveBrokers router: listings, removals
+   - Alerts router: list, resolve, correlation
+   - Admin router: user management, blog
+4. Add authentication docs:
+   - Session cookie authentication
+   - JWT bearer token authentication
+   - API key authentication (for extensions)
+   - Clerk webhook handling
+5. Add error documentation:
+   - Standard error codes (400, 401, 403, 404, 429, 500)
+   - tRPC error codes and meanings
+   - Rate limit headers explanation
+6. Add webhook documentation:
+   - Stripe webhook events
+   - Clerk webhook events
+   - Payload schemas and verification
+7. Keep docs in sync:
+   - Auto-generate on build
+   - CI check for doc changes
+   - Version docs with API versions
+
+tests:
+- Unit: Test OpenAPI spec generation
+- Integration: Verify docs site loads and examples work
+- Review: Team review for accuracy
+
+acceptance_criteria:
+- API docs accessible at /api/docs
+- All tRPC routers documented with input/output schemas
+- Interactive explorer allowing test requests
+- Authentication methods documented with examples
+- All error codes explained with examples
+- Webhook payloads documented with verification steps
+- Docs auto-generated from code (single source of truth)
+- Examples use realistic test data
+
+validation:
+- Navigate to /api/docs → interactive explorer loads
+- Try user.me endpoint → returns example response
+- Check auth section → all methods documented
+- Review webhook docs → verification steps clear
+
+notes:
+- trpc-openapi requires adding meta tags to procedures
+- Consider using Scalar (modern alternative to Swagger UI)
+- Docs should be public but sensitive endpoints marked as auth-required
+- Keep examples updated when schemas change
--- a/tasks/web-production/30-websocket-production.md
+++ b/tasks/web-production/30-websocket-production.md
@@ -0,0 +1,82 @@
+# 30. WebSocket Production Hardening
+
+meta:
+  id: web-production-30
+  feature: web-production
+  priority: P1
+  depends_on: []
+  tags: [security, websockets, production]
+
+objective:
+- Harden WebSocket server for production with authentication, rate limiting, and connection management
+
+deliverables:
+- Authenticated WebSocket connections
+- Connection rate limiting
+- Connection cleanup on logout
+- Horizontal scaling support (Redis pub/sub or adapter)
+
+steps:
+1. Harden WebSocket authentication:
+   - Validate JWT token in connection query param
+   - Reject unauthenticated connections immediately
+   - Re-authenticate periodically (every 15 minutes)
+   - Close connection on token expiry
+2. Implement connection rate limiting:
+   - Max 1 WebSocket connection per user
+   - Max 5 reconnection attempts per minute
+   - IP-based connection limits (100 per IP)
+3. Add connection management:
+   - Track active connections per user
+   - Close duplicate connections
+   - Heartbeat with timeout (current implementation good)
+   - Graceful close on server shutdown
+4. Implement horizontal scaling:
+   - If staying on ws: use Redis pub/sub to broadcast across instances (ws has no Redis adapter)
+   - If moving to Socket.IO: use @socket.io/redis-adapter (the successor to socket.io-redis)
+   - Ensure alerts reach all connected clients regardless of instance
+5. Add message validation:
+   - Validate all incoming message schemas
+   - Reject malformed messages
+   - Limit message size (max 10KB)
+   - Sanitize message content
+6. Add monitoring:
+   - Track active connection count
+   - Track messages per second
+   - Track connection duration
+   - Alert on connection spikes (possible DDoS)
+7. Secure WebSocket server:
+   - Run on separate port or path
+   - TLS encryption (wss://)
+   - No mixed content (never open ws:// from an https page)
+
+tests:
+- Unit: Test authentication rejection
+- Integration: Test duplicate connection handling
+- Load: Test 1000 concurrent WebSocket connections
+- Security: Test unauthenticated connection rejection
+
+acceptance_criteria:
+- All WebSocket connections authenticated with valid JWT
+- Unauthenticated connections rejected immediately
+- Max 1 connection per user (duplicates closed)
+- Heartbeat/ping-pong working with 30s interval
+- Redis-backed broadcast (pub/sub or Socket.IO adapter) active for multi-instance deployment
+- Message size limited to 10KB
+- TLS encryption (wss://) in production
+- Connection metrics visible in monitoring
+- Graceful shutdown closes all connections cleanly
+
+validation:
+- Connect without token → connection rejected
+- Connect with valid token → connection accepted
+- Open second connection → first connection closed
+- Send 20KB message → connection closed with error
+- Scale to 2 server instances → alerts broadcast to all clients
+- Check metrics → active connections, message rate visible
+
+notes:
+- Current WebSocket in web/src/lib/websocket.ts and web/src/server/websocket.ts
+- The ws library has no built-in Redis adapter; cross-instance broadcast needs Redis pub/sub
+- Consider using Socket.io for more robust connection management
+- WebSocket auth via query params is common, but tokens in URLs can leak into logs; consider cookie-based auth
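Reviewer note on task 30, steps 2, 3, and 5: a minimal sketch of the per-user connection tracking and message size limit, written against a tiny `Closable` interface instead of a specific WebSocket library. `ConnectionRegistry` and the close code 4001 are assumptions (4000-4999 is the range reserved for application use); 1009 is the standard "message too big" close code.

```typescript
// Anything with a WebSocket-style close(code, reason) can be tracked.
interface Closable {
  close(code: number, reason: string): void;
}

const MAX_MESSAGE_BYTES = 10 * 1024; // 10KB limit from step 5

class ConnectionRegistry {
  // Prototype-free map so user ids like "constructor" cannot collide with Object members.
  private byUser: { [userId: string]: Closable } = Object.create(null);

  // Enforce one connection per user: the newest socket wins, the older one is closed.
  register(userId: string, socket: Closable): void {
    const existing = this.byUser[userId];
    if (existing !== undefined && existing !== socket) {
      existing.close(4001, "replaced by newer connection");
    }
    this.byUser[userId] = socket;
  }

  // Only remove the entry if it still points at this socket, so a late
  // "close" event from a replaced socket does not evict its replacement.
  unregister(userId: string, socket: Closable): void {
    if (this.byUser[userId] === socket) {
      delete this.byUser[userId];
    }
  }

  // Returns false (and closes the socket) when a message exceeds the size limit.
  acceptMessage(socket: Closable, byteLength: number): boolean {
    if (byteLength > MAX_MESSAGE_BYTES) {
      socket.close(1009, "message too big");
      return false;
    }
    return true;
  }

  activeCount(): number {
    return Object.keys(this.byUser).length;
  }
}
```

This registry is per process; with several server instances, enforcing the one-connection rule globally would need shared state (for example in Redis), which this sketch does not attempt.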
--- a/tasks/web-production/31-db-backup.md
+++ b/tasks/web-production/31-db-backup.md
@@ -0,0 +1,77 @@
+# 31. Backup Strategy & Point-in-Time Recovery
+
+meta:
+  id: web-production-31
+  feature: web-production
+  priority: P1
+  depends_on: []
+  tags: [database, reliability, production]
+
+objective:
+- Implement automated database backups with point-in-time recovery capability
+
+deliverables:
+- Automated daily backups
+- Point-in-time recovery setup
+- Backup testing and verification
+- Retention policy
+
+steps:
+1. Set up automated backups:
+   - If PostgreSQL: configure pg_dump cron job or managed backups (RDS, Cloud SQL)
+   - If SQLite/Turso: configure Turso database branching/backups
+   - Daily full backups at off-peak hours (3 AM UTC)
+   - Hourly incremental backups (WAL archiving for Postgres)
+2. Configure backup storage:
+   - Store in separate region/cloud provider (S3, GCS, R2)
+   - Encrypt backups at rest
+   - Versioning enabled (protect against deletion)
+3. Implement point-in-time recovery:
+   - WAL archiving for PostgreSQL
+   - Transaction log backups every 15 minutes
+   - Test recovery to specific timestamp
+4. Add backup monitoring:
+   - Alert on backup failure
+   - Track backup size and duration
+   - Verify backup integrity (checksum)
+5. Test restore procedures:
+   - Monthly restore test to staging environment
+   - Document step-by-step restore process
+   - Measure RTO (Recovery Time Objective) and RPO (Recovery Point Objective)
+   - Target: RTO < 1 hour, RPO < 15 minutes
+6. Document retention:
+   - Daily backups: 7 days
+   - Weekly backups: 4 weeks
+   - Monthly backups: 12 months
+   - Annual backups: 7 years (compliance)
+7. Add Redis backup:
+   - RDB snapshots every 6 hours
+   - AOF persistence to limit data loss between snapshots
+   - Backup to S3/GCS
+
+tests:
+- Integration: Test backup creation
+- Recovery: Test restore to staging
+- Monitoring: Verify backup alerts
+
+acceptance_criteria:
+- Daily automated backups running successfully
+- Backups stored in separate region with encryption
+- Point-in-time recovery tested and working
+- Backup failures trigger alerts within 5 minutes
+- Monthly restore test completed and documented
+- RTO < 1 hour, RPO < 15 minutes
+- Retention policy enforced automatically
+- Redis backups included in strategy
+
+validation:
+- Check backup storage → daily backups present
+- Trigger restore test → staging database restored successfully
+- Simulate backup failure → alert received
+- Check retention → old backups purged per policy
+
+notes:
+- Turso offers automatic backups for SQLite — verify configuration
+- RDS automated backups are easiest for PostgreSQL
+- Test restores are critical — untested backups are useless
+- Document restore process for on-call engineers
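Reviewer note on task 31, step 6: a minimal sketch of the retention tiers as a pure function that a purge job could call per backup. The tier lengths are from the task; which backups count as weekly, monthly, or annual is an assumption here (Sunday, the 1st of the month, and January 1st, all in UTC), and 12 months and 7 years are approximated as 365 and 7 × 365 days. `shouldRetain` is a hypothetical name.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Decide whether a backup taken at `takenAt` should still exist at `now`.
function shouldRetain(takenAt: Date, now: Date): boolean {
  const ageDays = Math.floor((now.getTime() - takenAt.getTime()) / DAY_MS);

  // Clock skew guard: never delete a backup that appears to be from the future.
  if (ageDays < 0) return true;

  // Daily tier: keep everything for 7 days.
  if (ageDays <= 7) return true;

  // Weekly tier (assumed: Sunday backups), kept for 4 weeks.
  if (takenAt.getUTCDay() === 0 && ageDays <= 28) return true;

  // Monthly tier (assumed: backups from the 1st), kept for about 12 months.
  if (takenAt.getUTCDate() === 1 && ageDays <= 365) return true;

  // Annual tier (assumed: January 1st backups), kept for about 7 years.
  if (takenAt.getUTCMonth() === 0 && takenAt.getUTCDate() === 1 && ageDays <= 7 * 365) {
    return true;
  }

  return false;
}
```

Keeping the rule free of storage calls makes it easy to unit-test before it is allowed to delete anything; object-store lifecycle rules are an alternative if the provider supports the same tiers.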
--- a/tasks/web-production/32-migration-safety.md
+++ b/tasks/web-production/32-migration-safety.md
@@ -0,0 +1,79 @@
+# 32. Migration Safety & Rollback Procedures
+
+meta:
+  id: web-production-32
+  feature: web-production
+  priority: P1
+  depends_on: []
+  tags: [database, reliability, production]
+
+objective:
+- Ensure database migrations are safe, reversible, and won't cause downtime or data loss in production
+
+deliverables:
+- Migration safety guidelines
+- Backward-compatible migration policy
+- Rollback scripts for each migration
+- Migration testing in staging
+
+steps:
+1. Create migration safety guidelines:
+   - Document in docs/MIGRATIONS.md
+   - Additive changes only in production (add columns, create tables)
+   - No destructive changes during deployment (no DROP COLUMN)
+   - Two-phase migrations for destructive changes:
+     - Phase 1: Add new column/table, deploy code to use it
+     - Phase 2: Remove old column/table after code stable
+2. Audit existing migrations:
+   - Review all drizzle migrations in web/src/server/db/
+   - Check for any destructive operations
+   - Add rollback scripts where missing
+3. Implement migration testing:
+   - Run migrations against staging database copy
+   - Verify app works after migration
+   - Test rollback script
+   - Measure migration duration (must be <30 seconds)
+4. Add migration safety checks:
+   - CI check: verify no destructive migrations in PR
+   - Pre-deploy: dry-run migration in production
+   - Post-deploy: verify migration applied successfully
+5. Document rollback procedures:
+   - Step-by-step rollback for each migration
+   - Database backup before migration
+   - Code rollback procedure
+   - Data recovery steps if needed
+6. Add migration monitoring:
+   - Log migration start, duration, success/failure
+   - Alert on migration failure
+   - Track migration duration trends
+7. Set up migration automation:
+   - GitHub Action to run migrations on staging deploy
+   - Manual approval for production migrations
+   - Automated rollback on migration failure
+
+tests:
+- Unit: Test migration scripts in isolation
+- Integration: Test migration on staging database
+- Rollback: Test rollback procedure
+
+acceptance_criteria:
+- All production migrations are additive-only at deploy time (destructive changes follow the two-phase process)
+- Two-phase migration process documented for destructive changes
+- Rollback script exists for every migration
+- Migrations tested on staging before production
+- Migration duration <30 seconds
+- Automated CI check preventing destructive migrations
+- Backup taken before every production migration
+- Migration failure triggers automatic alert and rollback
+
+validation:
+- Review migration history → no destructive changes in production
+- Test rollback → database restored to previous state
+- Run destructive migration in PR → CI blocks merge
+- Check migration logs → all migrations completed successfully
+
+notes:
+- Drizzle migrations are generally safe but review generated SQL
+- Use drizzle-kit generate with --custom for complex migrations
+- gh-ost and pt-online-schema-change are MySQL-only; for large PostgreSQL tables consider CREATE INDEX CONCURRENTLY and batched backfills
+- Always have a database backup before running production migrations
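Reviewer note on task 32, step 4: a minimal sketch of the CI check that flags destructive statements in a migration's SQL. It is a plain pattern scan, not a SQL parser, so it is an assumption that this is accurate enough for generated migration files; `findDestructiveStatements` and the pattern list are hypothetical and would need extending (for example for renames or type changes).

```typescript
// Statement kinds treated as destructive; no "g" flag so test() keeps no state.
const DESTRUCTIVE_PATTERNS: Array<{ label: string; pattern: RegExp }> = [
  { label: "DROP TABLE", pattern: /\bDROP\s+TABLE\b/i },
  { label: "DROP COLUMN", pattern: /\bDROP\s+COLUMN\b/i },
  { label: "TRUNCATE", pattern: /\bTRUNCATE\b/i },
];

// Remove "-- line" and "/* block */" comments so commented-out SQL is ignored.
// Limitation: comment markers inside string literals are stripped too.
function stripSqlComments(sql: string): string {
  return sql.replace(/--[^\n]*/g, "").replace(/\/\*[\s\S]*?\*\//g, "");
}

// Return the labels of every destructive pattern found in the migration SQL.
function findDestructiveStatements(sql: string): string[] {
  const body = stripSqlComments(sql);
  return DESTRUCTIVE_PATTERNS
    .filter((entry) => entry.pattern.test(body))
    .map((entry) => entry.label);
}
```

A CI step could read each changed migration file, call `findDestructiveStatements`, and fail the job when the result is non-empty unless the PR is explicitly marked as phase 2 of a two-phase migration.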
--- a/tasks/web-production/README.md
+++ b/tasks/web-production/README.md
@@ -0,0 +1,93 @@
+# Web Production Readiness
+
+Objective: Harden, optimize, and operationalize the SolidStart web application for production deployment with enterprise-grade security, performance, monitoring, and compliance.
+
+Status legend: [ ] todo, [~] in-progress, [x] done
+
+## Tasks
+
+### Security & Hardening
+- [ ] 01 — Security Headers & CORS Configuration → `01-security-headers-cors.md`
+- [ ] 02 — Rate Limiting & DDoS Protection → `02-rate-limiting-ddos.md`
+- [ ] 03 — Input Validation & XSS Prevention Audit → `03-input-validation-xss.md`
+- [ ] 04 — Authentication & Session Security Hardening → `04-auth-session-hardening.md`
+
+### Performance & Reliability
+- [ ] 05 — CDN & Asset Optimization → `05-cdn-asset-optimization.md`
+- [ ] 06 — Database Connection Pooling & Query Optimization → `06-db-connection-pooling.md`
+- [ ] 07 — Caching Strategy (Redis + HTTP Cache) → `07-caching-strategy.md`
+- [ ] 08 — Graceful Shutdown & Health Check Endpoints → `08-health-checks-shutdown.md`
+
+### Monitoring & Observability
+- [ ] 09 — Structured Logging & Log Aggregation → `09-structured-logging.md`
+- [ ] 10 — Error Tracking & Alerting (Sentry Integration) → `10-error-tracking.md`
+- [ ] 11 — Application Metrics & Dashboards → `11-metrics-dashboards.md`
+- [ ] 12 — Uptime & Performance Monitoring → `12-uptime-monitoring.md`
+
+### CI/CD & DevOps
+- [ ] 13 — GitHub Actions CI Pipeline → `13-github-actions-ci.md`
+- [ ] 14 — Automated Deployment Pipeline → `14-deployment-pipeline.md`
+- [ ] 15 — Docker & Infrastructure Optimization → `15-docker-infra.md`
+- [ ] 16 — Environment Management & Secrets Rotation → `16-env-secrets.md`
+
+### Testing & Quality Assurance
+- [ ] 17 — End-to-End Testing (Playwright) → `17-e2e-testing.md`
+- [ ] 18 — Load & Stress Testing → `18-load-testing.md`
+- [ ] 19 — Accessibility Audit & WCAG Compliance → `19-accessibility-audit.md`
+- [ ] 20 — Dependency Vulnerability Scanning → `20-dependency-scanning.md`
+
+### Compliance & Legal
+- [ ] 21 — Privacy Policy, TOS & Legal Pages → `21-legal-pages.md`
+- [ ] 22 — Cookie Consent & GDPR Compliance → `22-cookie-gdpr.md`
+- [ ] 23 — Data Export & Deletion Tools → `23-data-export-deletion.md`
+- [ ] 24 — Security.txt & Responsible Disclosure → `24-security-txt.md`
+
+### SEO & Marketing
+- [ ] 25 — Sitemap, Robots.txt & Open Graph → `25-seo-meta.md`
+- [ ] 26 — Analytics Integration (Plausible/PostHog) → `26-analytics.md`
+- [ ] 27 — Structured Data & Rich Snippets → `27-structured-data.md`
+
+### API & Backend Stability
+- [ ] 28 — API Versioning & Deprecation Strategy → `28-api-versioning.md`
+- [ ] 29 — API Documentation (OpenAPI/tRPC Docs) → `29-api-documentation.md`
+- [ ] 30 — WebSocket Production Hardening → `30-websocket-production.md`
+
+### Database Production Readiness
+- [ ] 31 — Backup Strategy & Point-in-Time Recovery → `31-db-backup.md`
+- [ ] 32 — Migration Safety & Rollback Procedures → `32-migration-safety.md`
+
+## Dependencies
+- 01, 02, 03, 04 can be done in parallel (security foundation)
+- 05, 06, 07, 08 can be done in parallel (performance foundation)
+- 09, 10, 11, 12 can be done in parallel (observability)
+- 13 depends on 17, 18, 19, 20 (test suites must exist before CI can run them)
+- 14 depends on 13, 15, 16 (CI + infra + env)
+- 21, 22, 23, 24 can be done in parallel (compliance)
+- 25, 26, 27 can be done in parallel (SEO)
+- 28, 29, 30 can be done in parallel (API stability)
+- 31, 32 can be done in parallel (DB ops)
+- Apart from the dependencies listed above, groups can proceed independently
+
+## Exit Criteria
+- All security headers present and scoring A+ on Security Headers scan
+- Rate limiting active on all public endpoints (100 req/min)
+- Database queries optimized with connection pooling (PgBouncer or equivalent)
+- Redis caching layer active for hot paths
+- Health check endpoint responding with 200 and dependency status
+- Structured logging shipping to aggregation service
+- Error tracking capturing 100% of unhandled exceptions
+- CI pipeline running tests, lint, typecheck, and build on every PR
+- Automated deployment to staging on merge to main
+- E2E tests covering critical user journeys (signup → dashboard → billing)
+- Load tests confirming 1000 concurrent users with <200ms p95 latency
+- Accessibility audit passing WCAG 2.1 AA
+- All production dependencies free of known vulnerabilities
+- Legal pages live and linked in footer
+- Cookie consent banner functional with granular controls
+- GDPR data export and deletion APIs operational
+- SEO meta tags, sitemap, and robots.txt serving correctly
+- Analytics tracking page views and conversion events
+- API documentation publicly accessible and up-to-date
+- WebSocket connections stable with reconnection logic tested
+- Database backups automated with 7-day retention for daily backups
+- Migration rollback tested and documented