Kordant/tasks/core-services-implementation/11-voiceprint-azure-api.md
2026-05-31 22:03:18 -04:00

# 11. Azure Voice Live API for Synthetic Voice Detection
meta:
id: core-services-11
feature: core-services-implementation
priority: P2
depends_on: [core-services-01]
tags: [voiceprint, azure, voice-clone-detection, liveness, api-integration]
objective:
- Replace the stub `detectSynthetic()` that returns `{ isSynthetic: false, confidence: 1.0 }` with a real Azure Voice Live API integration, enabling consumer-facing voice clone detection via uploaded call recordings or live microphone capture.
deliverables:
- Azure Speech Services client with Voice Live API endpoint
- Audio preprocessing pipeline (resampling, normalization, VAD)
- Voice enrollment system for trusted contacts (family member voice templates)
- Synthetic detection endpoint that returns real confidence scores
- Call recording upload and analysis workflow
steps:
1. Sign up for Azure Speech Services at https://azure.microsoft.com/services/cognitive-services/speech-services/
2. Add `AZURE_SPEECH_KEY` and `AZURE_SPEECH_REGION` to `.env.example`
3. Create `voiceprint/azure.client.ts`:
- `detectLiveness(audioBuffer, referenceText?)` — Voice Live API for challenge-response liveness
- `verifySpeaker(audioBuffer, enrollmentId)` — speaker verification against enrolled voice
- `enrollSpeaker(audioSamples): Promise<enrollmentId>` — create voice template from samples
4. Implement audio preprocessing:
- Convert to 16 kHz mono PCM (Azure requirement)
- Normalize amplitude to -3 dBFS
- Trim silence using VAD (WebRTC or Silero)
- Max duration: 30 seconds per analysis
5. Implement enrollment flow:
- User records 3 to 5 samples of the family member speaking set phrases
- Store enrollment in database with `voiceEnrollments` schema (already exists)
- Generate enrollment ID, link to user account
6. Implement detection flow:
- User uploads suspicious call recording or captures live audio
- Preprocess audio → Azure Voice Live API → get liveness score
- If enrollment exists, also run speaker verification → similarity score
- Combine scores: synthetic = low liveness AND low speaker match
7. Implement `detectSynthetic()` to return real analysis:
- Score: 0.0 to 1.0 (synthetic likelihood)
- Confidence: based on audio quality and API response certainty
- Decision: synthetic if score > 0.7, suspicious if 0.4 to 0.7, genuine if < 0.4
8. Add analysis history:
- Store every analysis in database (audio hash, score, decision)
- Dashboard shows history of analyzed calls
- User can report false positive/negative for model improvement
9. Implement tier limits:
- Fortress+: VoicePrint included
- Lower tiers: not available or limited to 5 analyses/month
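
The preprocessing in step 4 can be sketched in plain TypeScript as below. This is a minimal illustration, not the project's pipeline: all function names are hypothetical, "normalize to -3 dBFS" is assumed to mean peak normalization, the resampler is simple linear interpolation with no anti-aliasing filter, and VAD silence trimming is left out because it needs an external library (WebRTC or Silero).

```typescript
const TARGET_RATE = 16000;   // Hz, from step 4
const TARGET_PEAK_DBFS = -3; // from step 4 (assumed to be a peak level)
const MAX_SECONDS = 30;      // from step 4

// Average interleaved channels into a single mono channel.
function downmixToMono(interleaved: Float32Array, channels: number): Float32Array {
  const frames = Math.floor(interleaved.length / channels);
  const mono = new Float32Array(frames);
  for (let i = 0; i < frames; i++) {
    let sum = 0;
    for (let c = 0; c < channels; c++) sum += interleaved[i * channels + c];
    mono[i] = sum / channels;
  }
  return mono;
}

// Linear-interpolation resampler (sketch only, no anti-aliasing).
function resampleLinear(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  if (fromRate === toRate) return input;
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}

// Scale so the loudest sample sits at the target peak level.
function normalizePeak(samples: Float32Array, targetDbfs: number): Float32Array {
  let peak = 0;
  for (let i = 0; i < samples.length; i++) peak = Math.max(peak, Math.abs(samples[i]));
  const out = new Float32Array(samples.length);
  if (peak === 0) return out; // pure silence: nothing to scale
  const gain = Math.pow(10, targetDbfs / 20) / peak;
  for (let i = 0; i < samples.length; i++) out[i] = samples[i] * gain;
  return out;
}

// Convert floats in [-1, 1] to 16-bit signed PCM.
function toPcm16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    out[i] = Math.round(s * 32767);
  }
  return out;
}

// Downmix, resample, cap at 30 s, normalize, then quantize.
function preprocess(interleaved: Float32Array, sampleRate: number, channels: number): Int16Array {
  const mono = downmixToMono(interleaved, channels);
  const resampled = resampleLinear(mono, sampleRate, TARGET_RATE);
  const capped = resampled.subarray(0, TARGET_RATE * MAX_SECONDS);
  return toPcm16(normalizePeak(capped, TARGET_PEAK_DBFS));
}
```

In a fuller pipeline the VAD trim would probably run before the 30-second cap, so that leading silence does not use up the analysis window.
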
tests:
- Unit: Mock Azure API responses, verify score calculation and decision logic
- Integration: Test with real Azure Voice Live API using synthetic and genuine audio samples
- E2E: Upload suspicious call recording → receive analysis result with confidence score
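
The score calculation and decision logic that the unit tests are meant to cover (steps 6 and 7) might look like the following sketch. The type and function names are hypothetical. Reading "low liveness AND low speaker match" as the minimum of the two complements is an assumption, not something this task specifies.

```typescript
// Hypothetical result shape; the task does not define it.
type Decision = "synthetic" | "suspicious" | "genuine";

interface SyntheticAnalysis {
  score: number; // synthetic likelihood, 0.0 to 1.0
  decision: Decision;
}

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

// Assumption: liveness and similarity are both in [0, 1], where higher
// means "more likely a live, matching speaker". The AND of "low liveness"
// and "low match" is modelled as a fuzzy AND: the minimum of the complements.
function combineScores(liveness: number, similarity?: number): number {
  const notLive = 1 - clamp01(liveness);
  if (similarity === undefined) return notLive; // no enrollment to compare against
  const notMatch = 1 - clamp01(similarity);
  return Math.min(notLive, notMatch);
}

// Thresholds from step 7: > 0.7 synthetic, 0.4 to 0.7 suspicious, < 0.4 genuine.
function classify(score: number): Decision {
  if (score > 0.7) return "synthetic";
  if (score >= 0.4) return "suspicious";
  return "genuine";
}

function analyze(liveness: number, similarity?: number): SyntheticAnalysis {
  const score = combineScores(liveness, similarity);
  return { score, decision: classify(score) };
}
```

For example, `analyze(0.1, 0.2)` is classified as `synthetic`, and `analyze(0.5)` with no enrollment is `suspicious`. One consequence of a strict AND is that a clone with low liveness but a high speaker match scores as `genuine`, so the combination rule should be validated against real clone samples.
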
acceptance_criteria:
- [ ] `detectSynthetic()` calls real Azure Voice Live API (not returning hardcoded `isSynthetic: false`)
- [ ] Audio preprocessing converts to 16 kHz mono PCM and normalizes amplitude
- [ ] Voice enrollment creates usable template from 3 to 5 user-provided samples
- [ ] Speaker verification returns similarity score between 0.0 and 1.0
- [ ] Liveness detection returns pass/fail with confidence for challenge-response mode
- [ ] Combined score correctly flags known synthetic voice samples (>0.7 threshold)
- [ ] Analysis results are stored in database with audio hash and metadata
- [ ] Dashboard shows analysis history with play button for uploaded audio
- [ ] Tier enforcement: VoicePrint only available on Fortress+ plans
- [ ] Graceful fallback: if Azure API fails, return "analysis unavailable" (not false negative)
- [ ] False positive rate < 5% on genuine voice samples (tested with 100+ samples)
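
The graceful-fallback criterion above can be illustrated with an injected client and a hand-written fake, which also suits the mocked unit tests. This is a sketch under assumptions: the `LivenessClient` interface, the `DetectionResult` shape, and the fake clients are hypothetical and do not reflect the real Azure SDK or the eventual `azure.client.ts`.

```typescript
// Hypothetical client interface; the real azure.client.ts may differ.
interface LivenessClient {
  detectLiveness(audio: Uint8Array): Promise<{ liveness: number }>;
}

// A discriminated union keeps "unavailable" distinct from a genuine verdict,
// so an outage can never be read as `isSynthetic: false`.
type DetectionResult =
  | { status: "ok"; isSynthetic: boolean; score: number }
  | { status: "unavailable"; reason: string };

async function detectSynthetic(
  client: LivenessClient,
  audio: Uint8Array,
): Promise<DetectionResult> {
  try {
    const { liveness } = await client.detectLiveness(audio);
    const score = 1 - liveness; // assumption: synthetic likelihood is the complement
    return { status: "ok", isSynthetic: score > 0.7, score };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { status: "unavailable", reason };
  }
}

// Hand-written fakes standing in for mocked Azure responses.
const fakeOk = (liveness: number): LivenessClient => ({
  detectLiveness: async () => ({ liveness }),
});

const fakeDown: LivenessClient = {
  detectLiveness: async () => {
    throw new Error("upstream unavailable");
  },
};
```

With `fakeDown`, `detectSynthetic` resolves to `{ status: "unavailable", ... }` instead of throwing or reporting a false negative. With `fakeOk(0.1)` it resolves to an `ok` result flagged as synthetic.
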
validation:
- Run `vitest run voiceprint.test.ts` — all tests pass with Azure mock
- Manual: Upload genuine voice sample, verify `isSynthetic: false` with confidence > 0.9
- Manual: Upload synthetic voice (e.g., from ElevenLabs), verify `isSynthetic: true` with confidence > 0.7
- Check enrollment: Database `voiceEnrollments` table has real templates with Azure enrollment IDs
notes:
- Azure Voice Live API costs ~$0.016/minute of audio analyzed
- At 50 analyses/user/month (1 to 2 min each), cost is ~$0.80 to $1.60/user/month
- This is the ONLY practical path for a startup: building in-house costs $840K to $1.25M in Year 1
- The differentiator isn't the detection tech (everyone uses Azure/Daon/Pindrop) — it's the consumer UX and integration
- Consider adding forensic analysis mode: detailed spectrogram visualization for user education
- Mobile integration (iOS CallKit, Android Telecom) is Phase 4 (task 12) — this task is server-side only
- Store audio samples securely (encrypted at rest) and allow user deletion (privacy compliance)
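
The per-user cost estimate in the notes can be reproduced with a few lines of arithmetic. The sketch assumes each analysis covers 1 to 2 minutes of audio, which is the reading consistent with the quoted $0.80 to $1.60 range. The function name is hypothetical.

```typescript
const COST_PER_MINUTE_USD = 0.016; // approximate figure quoted in the notes

// Monthly cost for one user, given how many analyses they run and how long each is.
function monthlyCostPerUser(analysesPerMonth: number, minutesPerAnalysis: number): number {
  return analysesPerMonth * minutesPerAnalysis * COST_PER_MINUTE_USD;
}

// 50 analyses at 1 minute and at 2 minutes each.
const lowEstimate = monthlyCostPerUser(50, 1).toFixed(2);  // "0.80"
const highEstimate = monthlyCostPerUser(50, 2).toFixed(2); // "1.60"
```

The same function gives the cost of a capped lower tier: 5 analyses of 2 minutes each come to about $0.16 per user per month.
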