11. Azure Voice Live API for Synthetic Voice Detection
meta:
  id: core-services-11
  feature: core-services-implementation
  priority: P2
  depends_on: [core-services-01]
  tags: [voiceprint, azure, voice-clone-detection, liveness, api-integration]
objective:
- Replace the stub detectSynthetic() that returns { isSynthetic: false, confidence: 1.0 } with a real Azure Voice Live API integration, enabling consumer-facing voice clone detection via uploaded call recordings or live microphone capture.
deliverables:
- Azure Speech Services client with Voice Live API endpoint
- Audio preprocessing pipeline (resampling, normalization, VAD)
- Voice enrollment system for trusted contacts (family member voice templates)
- Synthetic detection endpoint that returns real confidence scores
- Call recording upload and analysis workflow
steps:
- Sign up for Azure Speech Services at https://azure.microsoft.com/services/cognitive-services/speech-services/
- Add AZURE_SPEECH_KEY and AZURE_SPEECH_REGION to .env.example
- Create voiceprint/azure.client.ts:
- detectLiveness(audioBuffer, referenceText?) - Voice Live API for challenge-response liveness
- verifySpeaker(audioBuffer, enrollmentId) - speaker verification against enrolled voice
- enrollSpeaker(audioSamples): Promise<enrollmentId> - create voice template from samples
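The three client functions listed above could be typed roughly as follows. This is a minimal sketch, not the Azure SDK surface: the result shapes (LivenessResult, VerificationResult), the VoiceClient interface name, and the createFakeVoiceClient helper are all hypothetical names introduced here for illustration, and the fake makes no network calls. It is meant as a seam for the unit tests that mock Azure responses.

```typescript
// Hypothetical result shapes. The real Azure response fields are not
// specified in this task, so treat these as placeholders.
interface LivenessResult {
  live: boolean;      // pass/fail for challenge-response mode
  confidence: number; // 0.0–1.0
}

interface VerificationResult {
  similarity: number; // 0.0–1.0 match against the enrolled voice
}

// Mirrors the three functions named in the step above.
interface VoiceClient {
  detectLiveness(audio: Uint8Array, referenceText?: string): Promise<LivenessResult>;
  verifySpeaker(audio: Uint8Array, enrollmentId: string): Promise<VerificationResult>;
  enrollSpeaker(audioSamples: Uint8Array[]): Promise<string>; // resolves to an enrollment ID
}

// In-memory fake for unit tests. It returns canned values and never
// contacts Azure.
function createFakeVoiceClient(canned: {
  liveness: LivenessResult;
  similarity: number;
}): VoiceClient {
  const enrollments = new Set<string>();
  let nextId = 1;
  return {
    async detectLiveness() {
      return canned.liveness;
    },
    async verifySpeaker(_audio, enrollmentId) {
      if (!enrollments.has(enrollmentId)) {
        throw new Error(`unknown enrollment: ${enrollmentId}`);
      }
      return { similarity: canned.similarity };
    },
    async enrollSpeaker(audioSamples) {
      // The enrollment flow in this task asks for 3–5 samples.
      if (audioSamples.length < 3 || audioSamples.length > 5) {
        throw new Error("enrollment needs 3 to 5 samples");
      }
      const id = `fake-enrollment-${nextId++}`;
      enrollments.add(id);
      return id;
    },
  };
}
```

Coding the rest of the pipeline against the interface rather than a concrete client keeps the score and decision logic testable without credentials.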
- Implement audio preprocessing:
- Convert to 16kHz mono PCM (Azure requirement)
- Normalize amplitude to -3 dBFS
- Trim silence using VAD (WebRTC or Silero)
- Max duration: 30 seconds per analysis
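The preprocessing steps above can be sketched on raw float samples as shown below. Several things here are assumptions rather than requirements from this task: the input is taken to be interleaved Float32 samples in [-1, 1], "-3 dBFS" is read as a peak target, the resampler is plain linear interpolation with no anti-aliasing filter, and the silence trimmer is a simple energy gate standing in for WebRTC or Silero VAD. All function names are hypothetical.

```typescript
const TARGET_RATE = 16000;                  // Hz
const TARGET_PEAK = Math.pow(10, -3 / 20);  // -3 dBFS as linear amplitude, about 0.708
const MAX_SECONDS = 30;

// Average interleaved channels down to mono.
function toMono(interleaved: Float32Array, channels: number): Float32Array {
  const frameCount = Math.floor(interleaved.length / channels);
  const mono = new Float32Array(frameCount);
  for (let i = 0; i < frameCount; i++) {
    let sum = 0;
    for (let c = 0; c < channels; c++) sum += interleaved[i * channels + c];
    mono[i] = sum / channels;
  }
  return mono;
}

// Linear-interpolation resampler. There is no low-pass filter, so a real
// pipeline would swap in a proper resampler when downsampling.
function resample(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  if (fromRate === toRate || input.length === 0) return input.slice();
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}

// Scale so the loudest sample sits at TARGET_PEAK. Silent input is returned unchanged.
function normalizePeak(samples: Float32Array): Float32Array {
  let peak = 0;
  for (let i = 0; i < samples.length; i++) peak = Math.max(peak, Math.abs(samples[i]));
  if (peak === 0) return samples.slice();
  const gain = TARGET_PEAK / peak;
  return samples.map((s) => s * gain);
}

// Energy gate (not a real VAD): drop leading and trailing 20 ms frames
// whose RMS is below `threshold`.
function trimSilence(samples: Float32Array, rate: number, threshold = 0.01): Float32Array {
  const frame = Math.max(1, Math.floor(rate * 0.02));
  const isLoud = (start: number): boolean => {
    const end = Math.min(start + frame, samples.length);
    let energy = 0;
    for (let i = start; i < end; i++) energy += samples[i] * samples[i];
    return Math.sqrt(energy / (end - start)) >= threshold;
  };
  let first = 0;
  while (first < samples.length && !isLoud(first)) first += frame;
  if (first >= samples.length) return new Float32Array(0);
  let last = first;
  for (let s = first; s < samples.length; s += frame) if (isLoud(s)) last = s;
  return samples.slice(first, Math.min(last + frame, samples.length));
}

// Convert [-1, 1] floats to 16-bit signed PCM.
function toPcm16(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    pcm[i] = Math.round(clamped * 32767);
  }
  return pcm;
}

// Full pipeline in the order listed above: mono, 16 kHz, normalize, trim, cap at 30 s.
function preprocess(interleaved: Float32Array, sampleRate: number, channels: number): Int16Array {
  const mono = toMono(interleaved, channels);
  const resampled = resample(mono, sampleRate, TARGET_RATE);
  const normalized = normalizePeak(resampled);
  const trimmed = trimSilence(normalized, TARGET_RATE);
  const capped = trimmed.subarray(0, TARGET_RATE * MAX_SECONDS);
  return toPcm16(capped);
}
```

Decoding uploaded containers (WAV, MP3, M4A) into float samples is outside this sketch and would need a decoder in front of preprocess().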
- Implement enrollment flow:
- User records 3–5 samples of family member saying phrases
- Store enrollment in database with voiceEnrollments schema (already exists)
- Generate enrollment ID, link to user account
- Implement detection flow:
- User uploads suspicious call recording or captures live audio
- Preprocess audio → Azure Voice Live API → get liveness score
- If enrollment exists, also run speaker verification → similarity score
- Combine scores: synthetic = low liveness AND low speaker match
- Implement detectSynthetic() to return real analysis:
- Score: 0.0–1.0 (synthetic likelihood)
- Confidence: based on audio quality and API response certainty
- Decision: synthetic if score > 0.7, suspicious if 0.4–0.7, genuine if < 0.4
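The combination rule and decision thresholds above can be sketched as two small pure functions. The thresholds come from this task; the combination formula does not. Here "low liveness AND low speaker match" is modeled as a fuzzy AND (the minimum of the two "lowness" values), and when no enrollment exists the score falls back to liveness alone. Both choices are assumptions, as are the names syntheticScore and decide.

```typescript
type Decision = "synthetic" | "suspicious" | "genuine";

interface DetectionInput {
  liveness: number;    // 0.0–1.0, higher means more likely a live human
  similarity?: number; // 0.0–1.0, present only when an enrollment exists
}

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

// Assumption: fuzzy AND via min(). A product or weighted sum would be an
// equally defensible choice and should be tuned on labeled samples.
function syntheticScore(input: DetectionInput): number {
  const lowLiveness = 1 - clamp01(input.liveness);
  if (input.similarity === undefined) return lowLiveness;
  const lowMatch = 1 - clamp01(input.similarity);
  return Math.min(lowLiveness, lowMatch);
}

// Thresholds from the step above: > 0.7 synthetic, 0.4–0.7 suspicious, < 0.4 genuine.
function decide(score: number): Decision {
  if (score > 0.7) return "synthetic";
  if (score >= 0.4) return "suspicious";
  return "genuine";
}

const score = syntheticScore({ liveness: 0.25, similarity: 0.125 });
console.log(score, decide(score)); // 0.75 synthetic
```

Under this rule a recording with low liveness but a strong match to the enrolled voice scores as genuine, which follows the AND in the step above and is worth revisiting once real cloned-voice samples are available.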
- Add analysis history:
- Store every analysis in database (audio hash, score, decision)
- Dashboard shows history of analyzed calls
- User can report false positive/negative for model improvement
- Implement tier limits:
- Fortress+: VoicePrint included
- Lower tiers: not available or limited to 5 analyses/month
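A tier check along the lines above might look like the following. This sketch picks the "limited to 5 analyses/month" option for lower tiers, which the task leaves open, and it represents the plan as a single boolean because the names of the other tiers are not given here. UsageContext, QuotaResult, and checkQuota are hypothetical names.

```typescript
interface UsageContext {
  fortressOrAbove: boolean;  // true for Fortress and any higher plan
  analysesThisMonth: number; // analyses already run in the current month
}

type QuotaResult =
  | { allowed: true; remaining: number | null } // null means unlimited
  | { allowed: false; reason: string };

// Assumption: lower tiers get the 5/month allowance rather than no access.
const LOWER_TIER_MONTHLY_LIMIT = 5;

function checkQuota(ctx: UsageContext): QuotaResult {
  if (ctx.fortressOrAbove) return { allowed: true, remaining: null };
  const remaining = LOWER_TIER_MONTHLY_LIMIT - ctx.analysesThisMonth;
  if (remaining <= 0) {
    return { allowed: false, reason: "monthly VoicePrint limit reached" };
  }
  // `remaining` counts the analyses still available, including the one being requested.
  return { allowed: true, remaining };
}
```

Running this check before preprocessing avoids spending Azure minutes on requests that will be rejected anyway.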
tests:
- Unit: Mock Azure API responses, verify score calculation and decision logic
- Integration: Test with real Azure Voice Live API using synthetic and genuine audio samples
- E2E: Upload suspicious call recording → receive analysis result with confidence score
acceptance_criteria:
- detectSynthetic() calls real Azure Voice Live API (not returning hardcoded isSynthetic: false)
- Audio preprocessing converts to 16kHz mono PCM and normalizes amplitude
- Voice enrollment creates usable template from 3–5 user-provided samples
- Speaker verification returns similarity score between 0.0 and 1.0
- Liveness detection returns pass/fail with confidence for challenge-response mode
- Combined score correctly flags known synthetic voice samples (>0.7 threshold)
- Analysis results are stored in database with audio hash and metadata
- Dashboard shows analysis history with play button for uploaded audio
- Tier enforcement: VoicePrint only available on Fortress+ plans
- Graceful fallback: if Azure API fails, return "analysis unavailable" (not false negative)
- False positive rate < 5% on genuine voice samples (tested with 100+ samples)
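The graceful-fallback criterion above is easiest to enforce in the type system: if "analysis unavailable" is its own variant, a failed call cannot be mistaken for isSynthetic: false. The sketch below assumes the upstream call resolves to a synthetic-likelihood score in 0.0–1.0. The analyze callback is a hypothetical stand-in for the real client call, and AnalysisOutcome and detectWithFallback are illustrative names.

```typescript
type AnalysisOutcome =
  | { status: "ok"; isSynthetic: boolean; score: number }
  | { status: "unavailable"; reason: string };

// `analyze` is a hypothetical stand-in for the remote call.
async function detectWithFallback(
  analyze: () => Promise<number>,
): Promise<AnalysisOutcome> {
  try {
    const score = await analyze();
    // Treat out-of-range or NaN scores as an upstream failure, not as a verdict.
    if (!Number.isFinite(score) || score < 0 || score > 1) {
      return { status: "unavailable", reason: "malformed score from upstream" };
    }
    return { status: "ok", isSynthetic: score > 0.7, score };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { status: "unavailable", reason };
  }
}
```

Callers then have to branch on status before reading isSynthetic, so the dashboard can show "analysis unavailable" instead of a reassuring but unfounded result.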
validation:
- Run vitest run voiceprint.test.ts (all tests pass with Azure mock)
- Manual: Upload genuine voice sample, verify isSynthetic: false with confidence > 0.9
- Manual: Upload synthetic voice (e.g., from ElevenLabs), verify isSynthetic: true with confidence > 0.7
- Check enrollment: Database voiceEnrollments table has real templates with Azure enrollment IDs
notes:
- Azure Voice Live API costs ~$0.016/minute of audio analyzed
- At 50 analyses/user/month (1–2 min each), cost is ~$0.80–$1.60/user/month
- This is the ONLY practical path for a startup — building in-house costs $840K–$1.25M Year 1
- The differentiator isn't the detection tech (everyone uses Azure/Daon/Pindrop) — it's the consumer UX and integration
- Consider adding forensic analysis mode: detailed spectrogram visualization for user education
- Mobile integration (iOS CallKit, Android Telecom) is Phase 4 (task 12) — this task is server-side only
- Store audio samples securely (encrypted at rest) and allow user deletion (privacy compliance)
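The per-user cost range in the notes follows directly from the quoted rate. A short sketch of the arithmetic, using the approximate $0.016/minute figure from the notes (the function name is illustrative):

```typescript
const PRICE_PER_MINUTE_USD = 0.016; // approximate rate quoted in the notes

// Monthly cost per user, formatted to two decimal places.
function monthlyCostUsd(analysesPerMonth: number, minutesPerAnalysis: number): string {
  const cost = analysesPerMonth * minutesPerAnalysis * PRICE_PER_MINUTE_USD;
  return cost.toFixed(2);
}

console.log(monthlyCostUsd(50, 1)); // "0.80"
console.log(monthlyCostUsd(50, 2)); // "1.60"
```

Because preprocessing caps each analysis at 30 seconds, the 1–2 minute assumption is an upper bound for a single call unless a recording is split into several analyses.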