
Technical Architecture Document

Date: 2026-03-08
Version: 1.0
Author: CTO (13842aab)
Status: Draft


Executive Summary

AudiobookPipeline is a text-to-speech (TTS) audiobook generation system built on the Qwen3-TTS 1.7B models. The architecture prioritizes high-quality narration with per-character voice differentiation while keeping GPU requirements within reach of indie authors.


System Architecture

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Client App    │────▶│  API Gateway     │────▶│  Worker Pool    │
│  (CLI/Web)      │     │  (FastAPI)       │     │  (GPU Workers)  │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                             │                         │
                             ▼                         ▼
                      ┌──────────────┐         ┌──────────────┐
                      │   Queue      │         │   Models     │
                      │  (Redis)     │         │ (Qwen3-TTS)  │
                      └──────────────┘         └──────────────┘

Core Components

1. Input Processing Layer

Parsers Module

  • EPUB parser (primary format - 80% of indie books)
  • PDF parser (secondary, OCR-dependent)
  • HTML parser (for web-published books)
  • MOBI parser (legacy support)

Features:

  • Text normalization and whitespace cleanup
  • Chapter/section detection
  • Dialogue annotation (confidence threshold: 0.7)
  • Character identification from dialogue tags
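A rough sketch of the normalization and chapter-detection steps, using only the standard library; the heading regex and function names are illustrative, not the real parsers module API:

```python
import re

# Hypothetical sketch of the normalization + chapter-detection pass.
CHAPTER_RE = re.compile(r"^\s*(?:chapter\s+\d+|prologue|epilogue)\b.*$",
                        re.IGNORECASE | re.MULTILINE)

def normalize(text: str) -> str:
    """Collapse runs of spaces/tabs and excess blank lines."""
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

def split_chapters(text: str) -> list[dict]:
    """Split normalized text on chapter-style headings."""
    matches = list(CHAPTER_RE.finditer(text))
    if not matches:
        return [{"title": "Untitled", "body": text}]
    chapters = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chapters.append({"title": m.group().strip(),
                         "body": text[m.end():end].strip()})
    return chapters

book = normalize("Chapter 1\nIt   began.\n\n\n\nChapter 2\nIt ended.")
chapters = split_chapters(book)
```

In practice the chapter boundaries would come from the EPUB spine where available, with this heading heuristic as a fallback for flat text.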

2. Analysis Layer

Analyzer Module

  • Genre detection (optional ML-based, currently heuristic)
  • Tone/style analysis for voice selection
  • Length estimation for batching

Annotator Module

  • Dialogue confidence scoring
  • Speaker attribution
  • Pacing markers
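Dialogue confidence scoring and speaker attribution might be sketched like this, reusing the 0.7 threshold from the parser spec; the regexes and the 0.9/0.5 scores are assumptions, not the real annotator:

```python
import re

# Illustrative dialogue annotation: quoted spans plus a speaker guess.
QUOTE_RE = re.compile(r'"([^"]+)"')
TAG_RE = re.compile(r'"\s*,?\s*(?:said|asked|replied|whispered)\s+([A-Z]\w+)')

def annotate_dialogue(paragraph: str, threshold: float = 0.7) -> list[dict]:
    """Return quoted spans whose confidence clears the threshold."""
    spans = []
    for m in QUOTE_RE.finditer(paragraph):
        tag = TAG_RE.match(paragraph, m.end() - 1)   # anchor at closing quote
        if tag:
            speaker, confidence = tag.group(1), 0.9  # explicit dialogue tag
        else:
            speaker, confidence = None, 0.5          # untagged quote
        if confidence >= threshold:
            spans.append({"text": m.group(1), "speaker": speaker,
                          "confidence": confidence})
    return spans

spans = annotate_dialogue('"We leave at dawn," said Mara. "Fine."')
```

With the 0.7 threshold, the untagged quote is dropped rather than risk a wrong voice assignment; lowering the threshold trades attribution accuracy for coverage.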

3. Voice Generation Layer

Generation Module

  • Qwen3-TTS 1.7B Base model (primary)
  • Qwen3-TTS 1.7B VoiceDesign model (custom voices)
  • Batch processing optimization
  • Retry logic with exponential backoff (5s, 15s, 45s)
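As a sketch, the retry policy above (5 s, 15 s, 45 s) could look like the following; `with_retries`, its signature, and the `RuntimeError` trigger are illustrative, not the real generation API:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Retry with the fixed backoff schedule from the spec: 5 s, 15 s, 45 s.
def with_retries(fn: Callable[[], T],
                 delays: tuple[float, ...] = (5, 15, 45)) -> T:
    """Run fn, sleeping after each failure; re-raise once delays run out."""
    for delay in [*delays, None]:
        try:
            return fn()
        except RuntimeError:
            if delay is None:          # final attempt also failed
                raise
            time.sleep(delay)

# Demo with tiny delays: fails twice, then succeeds on the third attempt.
calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient GPU OOM")
    return "ok"

result = with_retries(flaky, delays=(0.001, 0.001, 0.001))
```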

Voice Management:

  • Narrator voice (auto-inferred or user-selected)
  • Character voices (diverse defaults to avoid similarity)
  • Voice cloning via prompt extraction

4. Assembly Layer

Assembly Module

  • Audio segment stitching
  • Speaker transition padding: 0.4s
  • Paragraph padding: 0.2s
  • Loudness normalization to -23 LUFS
  • Output format generation (WAV, MP3 at 128 kbps)
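A minimal sketch of the stitch-and-normalize pass, assuming float samples at an illustrative 24 kHz rate; plain RMS dBFS stands in for true LUFS, which would need a loudness meter such as pyloudnorm, and the real module would work with pydub/soundfile rather than Python lists:

```python
import math

SAMPLE_RATE = 24_000  # Hz; assumed model output rate

def silence(seconds: float) -> list[float]:
    return [0.0] * int(seconds * SAMPLE_RATE)

def stitch(segments: list[dict]) -> list[float]:
    """Concatenate segments with the padding from the spec:
    0.4 s on speaker transitions, 0.2 s between paragraphs."""
    out: list[float] = []
    for i, seg in enumerate(segments):
        if i > 0:
            gap = 0.4 if seg["speaker"] != segments[i - 1]["speaker"] else 0.2
            out += silence(gap)
        out += seg["samples"]
    return out

def normalize_rms(samples: list[float], target_db: float = -23.0) -> list[float]:
    """Apply a flat gain so the RMS level lands on target_db."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    gain = 10 ** ((target_db - 20 * math.log10(rms)) / 20)
    return [s * gain for s in samples]

a = {"speaker": "narrator", "samples": [0.1] * 100}
b = {"speaker": "mara", "samples": [0.1] * 100}
audio = normalize_rms(stitch([a, b]))
```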

5. Validation Layer

Validation Module

  • Audio energy threshold: -60dB
  • Loudness tolerance: ±3 LUFS
  • Strict mode flag for CI/CD
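The thresholds above could be checked roughly as follows; `rms_db` is a stand-in for an integrated LUFS measurement, and the strict-mode behavior (warnings become failures) is an assumption:

```python
import math

ENERGY_FLOOR_DB = -60.0   # below this, the segment is treated as silent
TARGET_LUFS = -23.0
LUFS_TOLERANCE = 3.0

def rms_db(samples: list[float]) -> float:
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")

def validate(samples: list[float], strict: bool = False) -> list[str]:
    """Return a list of problems; an empty list means the audio passes."""
    problems = []
    level = rms_db(samples)
    if level < ENERGY_FLOOR_DB:
        problems.append("silent segment: energy below -60 dB")
    if abs(level - TARGET_LUFS) > LUFS_TOLERANCE:
        msg = "loudness outside ±3 LUFS of -23"
        problems.append(msg if strict else "warning: " + msg)
    return problems

ok = validate([0.07] * 1000)       # roughly -23 dBFS
problems = validate([0.0] * 1000)  # digital silence
```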

Technology Stack

Core Framework

  • Language: Python 3.11+
  • ML Framework: PyTorch 2.0+
  • Audio Processing: SoundFile, librosa
  • Web API: FastAPI + Uvicorn
  • Queue: Redis (for async processing)

Infrastructure

  • GPU Requirements: RTX 3060 12GB minimum, RTX 4090 recommended
  • Memory: 32GB RAM minimum
  • Storage: 50GB SSD for model weights and cache

Dependencies

torch: ">=2.0.0"
soundfile: ">=0.12.0"
librosa: ">=0.10.0"
fastapi: ">=0.104.0"
uvicorn: ">=0.24.0"
redis: ">=5.0.0"
pydub: ">=0.25.0"
ebooklib: ">=0.18"
pypdf: ">=3.0.0"

Data Flow

  1. Upload: User uploads EPUB via CLI or web UI
  2. Parse: Text extraction with dialogue annotation
  3. Analyze: Genre detection, character identification
  4. Queue: Job added to Redis queue
  5. Process: GPU worker pulls job, generates audio segments
  6. Assemble: Stitch segments with padding, normalize loudness
  7. Validate: Check audio quality thresholds
  8. Deliver: MP3/WAV file to user
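The worker side of steps 4-8 can be walked through with an in-memory stand-in for the queue; the stage functions below are stubs for the real modules, present only to show the ordering:

```python
from collections import deque

# Stub stages standing in for the parser, generation, assembly,
# and validation modules described above.
def parse(job):    return {**job, "chapters": ["ch1"]}
def generate(job): return {**job, "segments": ["audio-ch1"]}
def assemble(job): return {**job, "audio": "stitched"}
def validate(job): return {**job, "valid": True}

PIPELINE = [parse, generate, assemble, validate]

queue = deque([{"book": "novel.epub"}])   # step 4: job queued
processed = []
while queue:
    job = queue.popleft()                 # step 5: worker pulls job
    for stage in PIPELINE:                # steps 5-7 run in order
        job = stage(job)
    processed.append(job)                 # step 8: ready to deliver
```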

Performance Targets

| Metric | Target | Notes |
|---|---|---|
| Generation speed | 0.5x real-time | RTX 4090, batch=4 |
| Quality | -23 LUFS ±1 dB | Audiobook standard |
| Latency | <5 min per chapter | For 20k words |
| Concurrent users | 10 | With 4 GPU workers |

Scalability Considerations

Phase 1 (MVP - Week 1-4)

  • Single-machine deployment
  • CLI-only interface
  • Local queue (in-memory)
  • Manual GPU provisioning

Phase 2 (Beta - Week 5-8)

  • FastAPI web interface
  • Redis queue for async jobs
  • Docker containerization
  • Cloud GPU option (RunPod, Lambda Labs)

Phase 3 (Production - Quarter 2)

  • Kubernetes cluster
  • Auto-scaling GPU workers
  • Multi-region deployment
  • CDN for file delivery

Security Considerations

  • User audio files stored encrypted at rest
  • API authentication via API keys
  • Rate limiting: 100 requests/hour per tier
  • No third-party data sharing

Risks & Mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| GPU availability | High | Cloud GPU partnerships, queue-based scaling |
| Model quality variance | Medium | Human review workflow for premium tier |
| Format parsing edge cases | Low | Extensive test suite, graceful degradation |
| Competition from big players | Medium | Focus on indie author niche, character voices |

Next Steps

  1. Week 1: Set up development environment, create ADRs for key decisions
  2. Week 2-3: Implement MVP features (single-narrator, epub, MP3)
  3. Week 4: Beta testing with 5-10 indie authors
  4. Week 5+: Character voice refinement, web UI

Document lives at project root for cross-agent access. Update with ADRs as decisions evolve.