shortcomings

2026-05-31 22:03:18 -04:00
parent 3b29de3234
commit c159f07322
17 changed files with 1535 additions and 4 deletions
--- a/tasks/core-services-implementation/11-voiceprint-azure-api.md
+++ b/tasks/core-services-implementation/11-voiceprint-azure-api.md
@@ -0,0 +1,84 @@
+# 11. Azure Voice Live API for Synthetic Voice Detection
+
+meta:
+  id: core-services-11
+  feature: core-services-implementation
+  priority: P2
+  depends_on: [core-services-01]
+  tags: [voiceprint, azure, voice-clone-detection, liveness, api-integration]
+
+objective:
+- Replace the stub `detectSynthetic()` that returns `{ isSynthetic: false, confidence: 1.0 }` with a real Azure Voice Live API integration, enabling consumer-facing voice clone detection via uploaded call recordings or live microphone capture.
+
+deliverables:
+- Azure Speech Services client with Voice Live API endpoint
+- Audio preprocessing pipeline (resampling, normalization, VAD)
+- Voice enrollment system for trusted contacts (family member voice templates)
+- Synthetic detection endpoint that returns real confidence scores
+- Call recording upload and analysis workflow
+
+steps:
+1. Sign up for Azure Speech Services at https://azure.microsoft.com/services/cognitive-services/speech-services/
+2. Add `AZURE_SPEECH_KEY` and `AZURE_SPEECH_REGION` to `.env.example`
+3. Create `voiceprint/azure.client.ts`:
+   - `detectLiveness(audioBuffer, referenceText?)` — Voice Live API for challenge-response liveness
+   - `verifySpeaker(audioBuffer, enrollmentId)` — speaker verification against enrolled voice
+   - `enrollSpeaker(audioSamples): Promise<enrollmentId>` — create voice template from samples
+4. Implement audio preprocessing:
+   - Convert to 16 kHz mono PCM (Azure requirement)
+   - Normalize amplitude to -3 dBFS
+   - Trim silence using VAD (WebRTC or Silero)
+   - Max duration: 30 seconds per analysis
+5. Implement enrollment flow:
+   - User records 3–5 samples of a family member saying set phrases
+   - Store enrollment in database with `voiceEnrollments` schema (already exists)
+   - Generate enrollment ID, link to user account
+6. Implement detection flow:
+   - User uploads suspicious call recording or captures live audio
+   - Preprocess audio → Azure Voice Live API → get liveness score
+   - If enrollment exists, also run speaker verification → similarity score
+   - Combine scores: synthetic = low liveness AND low speaker match
+7. Implement `detectSynthetic()` to return real analysis:
+   - Score: 0.0–1.0 (synthetic likelihood)
+   - Confidence: based on audio quality and API response certainty
+   - Decision: synthetic if score > 0.7, suspicious if 0.4–0.7, genuine if < 0.4
+8. Add analysis history:
+   - Store every analysis in database (audio hash, score, decision)
+   - Dashboard shows history of analyzed calls
+   - User can report false positive/negative for model improvement
+9. Implement tier limits:
+   - Fortress+: VoicePrint included
+   - Lower tiers: not available or limited to 5 analyses/month
+
+tests:
+- Unit: Mock Azure API responses, verify score calculation and decision logic
+- Integration: Test with real Azure Voice Live API using synthetic and genuine audio samples
+- E2E: Upload suspicious call recording → receive analysis result with confidence score
+
+acceptance_criteria:
+- [ ] `detectSynthetic()` calls real Azure Voice Live API (not returning hardcoded `isSynthetic: false`)
+- [ ] Audio preprocessing converts to 16 kHz mono PCM and normalizes amplitude
+- [ ] Voice enrollment creates usable template from 3–5 user-provided samples
+- [ ] Speaker verification returns similarity score between 0.0 and 1.0
+- [ ] Liveness detection returns pass/fail with confidence for challenge-response mode
+- [ ] Combined score correctly flags known synthetic voice samples (>0.7 threshold)
+- [ ] Analysis results are stored in database with audio hash and metadata
+- [ ] Dashboard shows analysis history with play button for uploaded audio
+- [ ] Tier enforcement: VoicePrint included on Fortress+ plans; lower tiers blocked or capped as decided in step 9
+- [ ] Graceful fallback: if Azure API fails, return "analysis unavailable" (not false negative)
+- [ ] False positive rate < 5% on genuine voice samples (tested with 100+ samples)
+
+validation:
+- Run `vitest run voiceprint.test.ts` — all tests pass with Azure mock
+- Manual: Upload genuine voice sample, verify `isSynthetic: false` with confidence > 0.9
+- Manual: Upload synthetic voice (e.g., from ElevenLabs), verify `isSynthetic: true` with score > 0.7
+- Check enrollment: Database `voiceEnrollments` table has real templates with Azure enrollment IDs
+
+notes:
+- Azure Voice Live API costs ~$0.016/minute of audio analyzed
+- At 50 analyses/user/month (1–2 min each), cost is ~$0.80–$1.60/user/month
+- This is the ONLY practical path for a startup — building in-house costs $840K–$1.25M Year 1
+- The differentiator isn't the detection tech (everyone uses Azure/Daon/Pindrop) — it's the consumer UX and integration
+- Consider adding forensic analysis mode: detailed spectrogram visualization for user education
+- Mobile integration (iOS CallKit, Android Telecom) is Phase 4 (task 12) — this task is server-side only
+- Store audio samples securely (encrypted at rest) and allow user deletion (privacy compliance)
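Reviewer sketch (not part of the commit): the decision logic in steps 6 and 7, together with the "analysis unavailable" acceptance criterion, could look roughly like the following. Only `detectSynthetic` and the 0.4 / 0.7 thresholds come from the task file. The names `ProviderScores`, `ScoreProvider`, `AnalysisResult`, `syntheticScore`, and `classify` are hypothetical. Using the minimum as the "low liveness AND low speaker match" rule is an assumption, since the task does not say how the two scores are combined. The provider is injected as a plain function because the task does not specify the Azure request or response shape, and the `confidence` field is left out because step 7 does not define how to compute it.

```typescript
// Hypothetical types for illustration; not taken from the commit.
type Decision = "synthetic" | "suspicious" | "genuine";

type AnalysisResult =
  | { status: "ok"; score: number; decision: Decision; isSynthetic: boolean }
  | { status: "unavailable"; reason: string };

interface ProviderScores {
  liveness: number;      // assumed range 0.0-1.0, higher = more likely a live human
  speakerMatch?: number; // assumed range 0.0-1.0, present only when an enrollment exists
}

// Stand-in for the real client call; the actual Azure API shape is not specified in the task.
type ScoreProvider = (audio: Uint8Array) => Promise<ProviderScores>;

function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

// Assumed combination rule: the score is high only when liveness AND speaker match are both low.
// Taking the minimum of the two "lowness" values acts as a fuzzy AND.
function syntheticScore(s: ProviderScores): number {
  const lowLiveness = 1 - clamp01(s.liveness);
  if (s.speakerMatch === undefined) return lowLiveness; // no enrollment: liveness only
  const lowMatch = 1 - clamp01(s.speakerMatch);
  return Math.min(lowLiveness, lowMatch);
}

// Thresholds from step 7: > 0.7 synthetic, 0.4-0.7 suspicious, < 0.4 genuine.
function classify(score: number): Decision {
  if (score > 0.7) return "synthetic";
  if (score >= 0.4) return "suspicious";
  return "genuine";
}

async function detectSynthetic(
  audio: Uint8Array,
  fetchScores: ScoreProvider,
): Promise<AnalysisResult> {
  try {
    const score = syntheticScore(await fetchScores(audio));
    if (!Number.isFinite(score)) {
      throw new Error("provider returned a non-numeric score");
    }
    const decision = classify(score);
    return { status: "ok", score, decision, isSynthetic: decision === "synthetic" };
  } catch (err) {
    // Graceful fallback: report that analysis is unavailable instead of a false negative.
    const reason = err instanceof Error ? err.message : String(err);
    return { status: "unavailable", reason };
  }
}
```

Returning a tagged union instead of `{ isSynthetic: false }` on failure forces callers to handle the unavailable case explicitly, which is what the fallback criterion asks for. In a unit test, `fetchScores` can be replaced with a mock that resolves fixed scores or rejects.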