Kordant/tasks/core-services-implementation/11-voiceprint-azure-api.md
2026-05-31 22:03:18 -04:00

11. Azure Voice Live API for Synthetic Voice Detection

meta:

  • id: core-services-11
  • feature: core-services-implementation
  • priority: P2
  • depends_on: [core-services-01]
  • tags: [voiceprint, azure, voice-clone-detection, liveness, api-integration]

objective:

  • Replace the stub detectSynthetic() that returns { isSynthetic: false, confidence: 1.0 } with a real Azure Voice Live API integration, enabling consumer-facing voice clone detection via uploaded call recordings or live microphone capture.

deliverables:

  • Azure Speech Services client with Voice Live API endpoint
  • Audio preprocessing pipeline (resampling, normalization, VAD)
  • Voice enrollment system for trusted contacts (family member voice templates)
  • Synthetic detection endpoint that returns real confidence scores
  • Call recording upload and analysis workflow

steps:

  1. Sign up for Azure Speech Services at https://azure.microsoft.com/services/cognitive-services/speech-services/
  2. Add AZURE_SPEECH_KEY and AZURE_SPEECH_REGION to .env.example
  3. Create voiceprint/azure.client.ts:
    • detectLiveness(audioBuffer, referenceText?) — Voice Live API for challenge-response liveness
    • verifySpeaker(audioBuffer, enrollmentId) — speaker verification against enrolled voice
    • enrollSpeaker(audioSamples): Promise<enrollmentId> — create voice template from samples
  4. Implement audio preprocessing:
    • Convert to 16kHz mono PCM (Azure requirement)
    • Normalize amplitude to -3 dBFS
    • Trim silence using VAD (WebRTC or Silero)
    • Max duration: 30 seconds per analysis
  5. Implement enrollment flow:
    • User records 3–5 samples of a family member saying set phrases
    • Store enrollment in database with voiceEnrollments schema (already exists)
    • Generate enrollment ID, link to user account
  6. Implement detection flow:
    • User uploads suspicious call recording or captures live audio
    • Preprocess audio → Azure Voice Live API → get liveness score
    • If enrollment exists, also run speaker verification → similarity score
    • Combine scores: synthetic = low liveness AND low speaker match
  7. Implement detectSynthetic() to return real analysis:
    • Score: 0.0–1.0 (synthetic likelihood)
    • Confidence: based on audio quality and API response certainty
    • Decision: synthetic if score > 0.7, suspicious if 0.4–0.7, genuine if < 0.4
  8. Add analysis history:
    • Store every analysis in database (audio hash, score, decision)
    • Dashboard shows history of analyzed calls
    • User can report false positive/negative for model improvement
  9. Implement tier limits:
    • Fortress+: VoicePrint included
    • Lower tiers: not available or limited to 5 analyses/month
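
The amplitude normalization and PCM conversion in step 4 could be sketched roughly as follows. This is a minimal illustration, not the task's implementation: the helper names `normalizePeak` and `toPcm16` are hypothetical, the input is assumed to be mono float samples in the range -1 to 1 that have already been resampled to 16 kHz, and "-3 dBFS" is assumed to mean peak level rather than RMS or loudness.

```typescript
// Hypothetical helper: scale samples so the loudest one sits at targetDbfs.
// Assumption: the "-3 dBFS" target in step 4 refers to peak level.
function normalizePeak(samples: Float32Array, targetDbfs = -3): Float32Array {
  // Convert decibels relative to full scale into a linear amplitude (-3 dBFS is about 0.708).
  const targetPeak = Math.pow(10, targetDbfs / 20);

  let peak = 0;
  for (let i = 0; i < samples.length; i++) {
    const magnitude = Math.abs(samples[i]);
    if (magnitude > peak) peak = magnitude;
  }

  // Pure silence cannot be scaled; return an unchanged copy.
  if (peak === 0) return samples.slice();

  const gain = targetPeak / peak;
  return samples.map((s) => s * gain);
}

// Hypothetical helper: convert float samples to 16-bit signed PCM.
function toPcm16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    out[i] = Math.round(clamped * 32767);
  }
  return out;
}
```

Resampling and VAD-based silence trimming are left out here because they depend on whichever audio library the project adopts.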

tests:

  • Unit: Mock Azure API responses, verify score calculation and decision logic
  • Integration: Test with real Azure Voice Live API using synthetic and genuine audio samples
  • E2E: Upload suspicious call recording → receive analysis result with confidence score

acceptance_criteria:

  • detectSynthetic() calls real Azure Voice Live API (not returning hardcoded isSynthetic: false)
  • Audio preprocessing converts to 16kHz mono PCM and normalizes amplitude
  • Voice enrollment creates usable template from 3–5 user-provided samples
  • Speaker verification returns similarity score between 0.0 and 1.0
  • Liveness detection returns pass/fail with confidence for challenge-response mode
  • Combined score correctly flags known synthetic voice samples (>0.7 threshold)
  • Analysis results are stored in database with audio hash and metadata
  • Dashboard shows analysis history with play button for uploaded audio
  • Tier enforcement: VoicePrint only available on Fortress+ plans
  • Graceful fallback: if Azure API fails, return "analysis unavailable" (not false negative)
  • False positive rate < 5% on genuine voice samples (tested with 100+ samples)
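
The scoring, threshold, and graceful-fallback criteria above might fit together as in the sketch below. Everything here is an assumption made for illustration: the `VoiceClient` interface, the result shape, and the rule that multiplies "not live" by "not matching" are hypothetical, and the real Azure response format is not specified in this task. The thresholds (0.7 and 0.4) are the ones listed in the steps.

```typescript
type Decision = "synthetic" | "suspicious" | "genuine";

// "unavailable" is a distinct outcome so an API failure is never reported as genuine.
type AnalysisResult =
  | { status: "ok"; score: number; decision: Decision }
  | { status: "unavailable"; reason: string };

// Hypothetical client interface; both methods are assumed to resolve to a 0.0-1.0 value.
interface VoiceClient {
  detectLiveness(audio: Uint8Array): Promise<number>; // higher = more likely live
  verifySpeaker(audio: Uint8Array, enrollmentId: string): Promise<number>; // similarity
}

// Thresholds from the task: > 0.7 synthetic, 0.4 to 0.7 suspicious, < 0.4 genuine.
function decide(score: number): Decision {
  if (score > 0.7) return "synthetic";
  if (score >= 0.4) return "suspicious";
  return "genuine";
}

// Assumed combination rule: the score is high only when liveness AND speaker match are both low.
// Without an enrollment, the score falls back to liveness alone.
function combineScores(liveness: number, similarity?: number): number {
  const notLive = 1 - liveness;
  return similarity === undefined ? notLive : notLive * (1 - similarity);
}

async function analyze(
  client: VoiceClient,
  audio: Uint8Array,
  enrollmentId?: string,
): Promise<AnalysisResult> {
  try {
    const liveness = await client.detectLiveness(audio);
    const similarity = enrollmentId
      ? await client.verifySpeaker(audio, enrollmentId)
      : undefined;
    const score = combineScores(liveness, similarity);
    return { status: "ok", score, decision: decide(score) };
  } catch (err) {
    // Any client failure maps to "analysis unavailable" instead of a false negative.
    const reason = err instanceof Error ? err.message : String(err);
    return { status: "unavailable", reason };
  }
}
```

Injecting the client as a parameter keeps the decision logic testable with a mock, which matches the unit-test plan of mocking Azure responses.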

validation:

  • Run vitest run voiceprint.test.ts — all tests pass with Azure mock
  • Manual: Upload genuine voice sample, verify isSynthetic: false with confidence > 0.9
  • Manual: Upload synthetic voice (e.g., from ElevenLabs), verify isSynthetic: true with confidence > 0.7
  • Check enrollment: Database voiceEnrollments table has real templates with Azure enrollment IDs

notes:

  • Azure Voice Live API costs ~$0.016/minute of audio analyzed
  • At 50 analyses/user/month (1–2 min each), cost is ~$0.80–$1.60/user/month
  • This is the ONLY practical path for a startup: building in-house costs $840K–$1.25M in Year 1
  • The differentiator isn't the detection tech (everyone uses Azure/Daon/Pindrop) — it's the consumer UX and integration
  • Consider adding forensic analysis mode: detailed spectrogram visualization for user education
  • Mobile integration (iOS CallKit, Android Telecom) is Phase 4 (task 12) — this task is server-side only
  • Store audio samples securely (encrypted at rest) and allow user deletion (privacy compliance)
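
The per-user cost estimate in the notes can be reproduced with a short calculation. This sketch assumes 1 to 2 minutes of audio per analysis, which is what the $0.80 to $1.60 range implies at 50 analyses per month. The rate is the task's own estimate, not a quoted Azure price, and the constant and function names are hypothetical.

```typescript
// Figures taken from the notes; treat the rate as an estimate.
const RATE_USD_PER_MINUTE = 0.016;
const ANALYSES_PER_USER_PER_MONTH = 50;

// Monthly cost per user = rate x minutes per analysis x analyses per month.
function monthlyCostUsd(minutesPerAnalysis: number): number {
  return RATE_USD_PER_MINUTE * minutesPerAnalysis * ANALYSES_PER_USER_PER_MONTH;
}

console.log(monthlyCostUsd(1).toFixed(2)); // 0.80
console.log(monthlyCostUsd(2).toFixed(2)); // 1.60
```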