11. Azure Voice Live API for Synthetic Voice Detection

meta:

id: core-services-11
feature: core-services-implementation
priority: P2
depends_on: [core-services-01]
tags: [voiceprint, azure, voice-clone-detection, liveness, api-integration]

objective:

Replace the stub detectSynthetic() that returns { isSynthetic: false, confidence: 1.0 } with a real Azure Voice Live API integration, enabling consumer-facing voice clone detection via uploaded call recordings or live microphone capture.

deliverables:

Azure Speech Services client with Voice Live API endpoint
Audio preprocessing pipeline (resampling, normalization, VAD)
Voice enrollment system for trusted contacts (family member voice templates)
Synthetic detection endpoint that returns real confidence scores
Call recording upload and analysis workflow

steps:

Sign up for Azure Speech Services at https://azure.microsoft.com/services/cognitive-services/speech-services/
Add AZURE_SPEECH_KEY and AZURE_SPEECH_REGION to .env.example
Create voiceprint/azure.client.ts:
- detectLiveness(audioBuffer, referenceText?) — Voice Live API for challenge-response liveness
- verifySpeaker(audioBuffer, enrollmentId) — speaker verification against enrolled voice
- enrollSpeaker(audioSamples): Promise<enrollmentId> — create voice template from samples
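
One way to keep the three client methods above testable is to code against an interface and swap in a fake for unit tests. The sketch below is only an assumption about shape: the method names come from this task, but the result types, the `VoiceClient` and `FakeVoiceClient` names, and the ID format are hypothetical, and nothing here reflects Azure's actual request or response format.

```typescript
// Hypothetical result shape: this task does not specify Azure's response fields.
export interface LivenessResult {
  live: boolean; // pass/fail for the challenge-response check
  confidence: number; // 0.0 to 1.0
}

// Method names follow the task description; parameter and return types are assumptions.
export interface VoiceClient {
  detectLiveness(audioBuffer: Uint8Array, referenceText?: string): Promise<LivenessResult>;
  verifySpeaker(audioBuffer: Uint8Array, enrollmentId: string): Promise<number>; // similarity, 0.0 to 1.0
  enrollSpeaker(audioSamples: Uint8Array[]): Promise<string>; // resolves to the enrollment ID
}

// In-memory fake for unit tests: returns canned values and never touches the network.
export class FakeVoiceClient implements VoiceClient {
  private readonly liveness: LivenessResult;
  private readonly similarity: number;
  private readonly enrolled = new Set<string>();
  private nextId = 1;

  constructor(liveness: LivenessResult, similarity: number) {
    this.liveness = liveness;
    this.similarity = similarity;
  }

  async detectLiveness(_audioBuffer: Uint8Array, _referenceText?: string): Promise<LivenessResult> {
    return this.liveness;
  }

  async verifySpeaker(_audioBuffer: Uint8Array, enrollmentId: string): Promise<number> {
    if (!this.enrolled.has(enrollmentId)) {
      throw new Error(`unknown enrollment: ${enrollmentId}`);
    }
    return this.similarity;
  }

  async enrollSpeaker(audioSamples: Uint8Array[]): Promise<string> {
    // The enrollment flow in this task asks for 3 to 5 samples.
    if (audioSamples.length < 3 || audioSamples.length > 5) {
      throw new Error("enrollment expects 3 to 5 samples");
    }
    const id = `enroll-${this.nextId++}`;
    this.enrolled.add(id);
    return id;
  }
}
```

The real `voiceprint/azure.client.ts` would implement the same interface, so the detection flow can be exercised in vitest without network access.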
Implement audio preprocessing:
- Convert to 16kHz mono PCM (Azure requirement)
- Normalize amplitude to -3 dBFS
- Trim silence using VAD (WebRTC or Silero)
- Max duration: 30 seconds per analysis
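
The sample-level parts of the preprocessing step can be sketched as pure functions over float samples in the range -1 to 1. This is a minimal sketch under stated assumptions: "normalize to -3 dBFS" is interpreted as peak normalization (the task does not say peak or RMS), all function names are hypothetical, and resampling and VAD are left to a dedicated library rather than shown here.

```typescript
const TARGET_PEAK_DBFS = -3; // from the task; interpreted as a peak target (assumption)
const MAX_SECONDS = 30; // maximum duration per analysis, from the task

// Average interleaved channels into one mono channel.
export function downmixToMono(interleaved: Float32Array, channels: number): Float32Array {
  const frames = Math.floor(interleaved.length / channels);
  const mono = new Float32Array(frames);
  for (let i = 0; i < frames; i++) {
    let sum = 0;
    for (let c = 0; c < channels; c++) sum += interleaved[i * channels + c];
    mono[i] = sum / channels;
  }
  return mono;
}

// Scale the signal so its loudest sample sits at targetDbfs.
export function normalizePeak(samples: Float32Array, targetDbfs: number = TARGET_PEAK_DBFS): Float32Array {
  let peak = 0;
  for (let i = 0; i < samples.length; i++) peak = Math.max(peak, Math.abs(samples[i]));
  if (peak === 0) return samples.slice(); // pure silence: nothing to scale
  const gain = Math.pow(10, targetDbfs / 20) / peak;
  return samples.map((s) => s * gain);
}

// Keep at most maxSeconds of audio from the start of the clip.
export function clampDuration(samples: Float32Array, sampleRate: number, maxSeconds: number = MAX_SECONDS): Float32Array {
  const maxSamples = Math.floor(sampleRate * maxSeconds);
  return samples.subarray(0, Math.min(samples.length, maxSamples));
}

// Convert float samples to 16-bit signed PCM, clipping anything outside -1 to 1.
export function toPcm16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const clipped = Math.max(-1, Math.min(1, samples[i]));
    out[i] = Math.round(clipped * 32767);
  }
  return out;
}
```

A typical order would be downmix, resample to 16 kHz, trim silence, clamp to 30 seconds, normalize, then convert to PCM. Normalizing after trimming keeps leading or trailing noise from influencing the gain.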
Implement enrollment flow:
- User records 3–5 samples of a family member saying phrases
- Store enrollment in database with voiceEnrollments schema (already exists)
- Generate enrollment ID, link to user account
Implement detection flow:
- User uploads suspicious call recording or captures live audio
- Preprocess audio → Azure Voice Live API → get liveness score
- If enrollment exists, also run speaker verification → similarity score
- Combine scores: synthetic = low liveness AND low speaker match
Implement detectSynthetic() to return real analysis:
- Score: 0.0–1.0 (synthetic likelihood)
- Confidence: based on audio quality and API response certainty
- Decision: synthetic if score > 0.7, suspicious if 0.4–0.7, genuine if < 0.4
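
The scoring and decision rules above can be sketched as two small pure functions. The thresholds come from this task. The combination formula does not: the task only says "low liveness AND low speaker match", so taking the minimum of the two "lowness" values as a fuzzy AND is an assumption, as are the function names.

```typescript
export type Decision = "synthetic" | "suspicious" | "genuine";

function clamp01(x: number): number {
  return Math.max(0, Math.min(1, x));
}

// Returns a synthetic-likelihood score in 0.0 to 1.0.
// liveness and similarity are both expected in 0.0 to 1.0, higher meaning "more genuine".
// Assumption: "low liveness AND low speaker match" is modeled as min(1 - liveness, 1 - similarity).
export function combineScores(liveness: number, similarity?: number): number {
  const notLive = 1 - clamp01(liveness);
  if (similarity === undefined) return notLive; // no enrollment: liveness only
  return Math.min(notLive, 1 - clamp01(similarity));
}

// Thresholds from the task: synthetic above 0.7, suspicious from 0.4 to 0.7, genuine below 0.4.
export function classify(score: number): Decision {
  if (score > 0.7) return "synthetic";
  if (score >= 0.4) return "suspicious";
  return "genuine";
}
```

Keeping these functions free of I/O means the unit tests can cover the decision logic without any Azure mock at all.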
Add analysis history:
- Store every analysis in database (audio hash, score, decision)
- Dashboard shows history of analyzed calls
- User can report false positive/negative for model improvement
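
A history row for the step above could look like the following sketch. The field names and the `AnalysisRecord` type are hypothetical (the actual database schema is not shown in this task), and SHA-256 is an assumed choice for the audio hash. `createHash` is Node's built-in `node:crypto` API.

```typescript
import { createHash } from "node:crypto";

export type Decision = "synthetic" | "suspicious" | "genuine";

// Hypothetical row shape: audio hash, score, and decision come from the task; the rest is illustrative.
export interface AnalysisRecord {
  userId: string;
  audioSha256: string;
  score: number;
  decision: Decision;
  analyzedAt: string; // ISO 8601 timestamp
  userFeedback?: "false_positive" | "false_negative";
}

// Hex-encoded SHA-256 of the raw audio bytes.
export function hashAudio(audio: Uint8Array): string {
  return createHash("sha256").update(audio).digest("hex");
}

export function buildAnalysisRecord(
  userId: string,
  audio: Uint8Array,
  score: number,
  decision: Decision,
  now: Date = new Date(),
): AnalysisRecord {
  return {
    userId,
    audioSha256: hashAudio(audio),
    score,
    decision,
    analyzedAt: now.toISOString(),
  };
}
```

Storing a hash rather than relying on the file name lets the dashboard detect repeat uploads of the same recording, and the record remains meaningful after the user deletes the audio itself.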
Implement tier limits:
- Fortress+: VoicePrint included
- Lower tiers: not available or limited to 5 analyses/month
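
The tier rule above leaves one decision open (lower tiers blocked, or capped at 5 analyses per month), so a sketch can take that cap as a parameter. The function and type names are hypothetical, and treating "included" on Fortress+ as having no monthly cap is an assumption.

```typescript
export interface QuotaDecision {
  allowed: boolean;
  remaining: number | null; // null means no monthly cap (assumed for Fortress+)
}

// lowerTierMonthlyLimit: 0 models "not available", 5 models "limited to 5 analyses/month".
export function checkVoicePrintQuota(
  isFortressOrAbove: boolean,
  usedThisMonth: number,
  lowerTierMonthlyLimit: number = 5,
): QuotaDecision {
  if (isFortressOrAbove) return { allowed: true, remaining: null };
  const remaining = Math.max(0, lowerTierMonthlyLimit - usedThisMonth);
  return { allowed: remaining > 0, remaining };
}
```

Running this check before any audio is sent for analysis keeps over-quota requests from incurring API cost.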

tests:

Unit: Mock Azure API responses, verify score calculation and decision logic
Integration: Test with real Azure Voice Live API using synthetic and genuine audio samples
E2E: Upload suspicious call recording → receive analysis result with confidence score

acceptance_criteria:

detectSynthetic() calls real Azure Voice Live API (not returning hardcoded isSynthetic: false)
Audio preprocessing converts to 16kHz mono PCM and normalizes amplitude
Voice enrollment creates usable template from 3–5 user-provided samples
Speaker verification returns similarity score between 0.0 and 1.0
Liveness detection returns pass/fail with confidence for challenge-response mode
Combined score correctly flags known synthetic voice samples (>0.7 threshold)
Analysis results are stored in database with audio hash and metadata
Dashboard shows analysis history with play button for uploaded audio
Tier enforcement: VoicePrint only available on Fortress+ plans
Graceful fallback: if Azure API fails, return "analysis unavailable" (not false negative)
False positive rate < 5% on genuine voice samples (tested with 100+ samples)
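
The graceful-fallback criterion above can be made hard to violate by giving the result type an explicit "unavailable" variant, so that an API failure can never be read as `isSynthetic: false`. This is a minimal sketch; the `DetectionOutcome` type and `detectWithFallback` name are hypothetical, and `analyze` stands in for whatever function performs the real API call and returns a synthetic-likelihood score.

```typescript
export type DetectionOutcome =
  | { status: "ok"; isSynthetic: boolean; score: number }
  | { status: "unavailable"; reason: string };

// Runs the supplied analysis and converts any failure into an explicit "unavailable" outcome.
export async function detectWithFallback(
  analyze: () => Promise<number>,
  syntheticThreshold: number = 0.7, // threshold from this task
): Promise<DetectionOutcome> {
  try {
    const score = await analyze();
    if (!Number.isFinite(score)) {
      return { status: "unavailable", reason: "analysis returned a non-numeric score" };
    }
    return { status: "ok", isSynthetic: score > syntheticThreshold, score };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { status: "unavailable", reason };
  }
}
```

Callers must check `status` before reading `isSynthetic`, which the TypeScript compiler enforces through the discriminated union.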

validation:

Run vitest run voiceprint.test.ts — all tests pass with Azure mock
Manual: Upload genuine voice sample, verify isSynthetic: false with confidence > 0.9
Manual: Upload synthetic voice (e.g., from ElevenLabs), verify isSynthetic: true with score > 0.7
Check enrollment: Database voiceEnrollments table has real templates with Azure enrollment IDs

notes:

Azure Voice Live API costs ~$0.016/minute of audio analyzed
At 50 analyses/user/month (1–2 min each), cost is ~$0.80–$1.60/user/month
This is the ONLY practical path for a startup: building in-house would cost an estimated $840K–$1.25M in Year 1
The differentiator isn't the detection tech (everyone uses Azure/Daon/Pindrop) — it's the consumer UX and integration
Consider adding forensic analysis mode: detailed spectrogram visualization for user education
Mobile integration (iOS CallKit, Android Telecom) is Phase 4 (task 12) — this task is server-side only
Store audio samples securely (encrypted at rest) and allow user deletion (privacy compliance)
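
The per-user cost figures in the notes above follow from a single multiplication, sketched here so the estimate can be re-run with different usage assumptions. The rate is the approximate figure quoted in these notes, not a verified price, and the function name is hypothetical.

```typescript
const COST_PER_MINUTE_USD = 0.016; // approximate rate quoted in the notes

// Estimated monthly API cost for one user, in USD.
export function monthlyCostUsd(
  analysesPerMonth: number,
  minutesPerAnalysis: number,
  costPerMinuteUsd: number = COST_PER_MINUTE_USD,
): number {
  return analysesPerMonth * minutesPerAnalysis * costPerMinuteUsd;
}

// 50 analyses at 1 and 2 minutes each bracket the range given in the notes.
const low = monthlyCostUsd(50, 1);
const high = monthlyCostUsd(50, 2);
console.log(`$${low.toFixed(2)} to $${high.toFixed(2)} per user per month`);
```

If the 30-second cap from the preprocessing step applies to every analysis, the same function gives about $0.40 per user per month at 50 analyses.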
