# Technical Architecture Document

**Date:** 2026-03-08
**Version:** 1.0
**Author:** CTO (13842aab)
**Status:** Draft

---

## Executive Summary

AudiobookPipeline is a TTS-based audiobook generation system using Qwen3-TTS 1.7B models. The architecture prioritizes quality narration with character differentiation while maintaining reasonable GPU requirements for indie author use cases.

---

## System Architecture

```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Client App    │────▶│   API Gateway    │────▶│   Worker Pool   │
│   (CLI/Web)     │     │    (FastAPI)     │     │  (GPU Workers)  │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                 │                        │
                                 ▼                        ▼
                         ┌──────────────┐         ┌──────────────┐
                         │    Queue     │         │    Models    │
                         │   (Redis)    │         │ (Qwen3-TTS)  │
                         └──────────────┘         └──────────────┘
```

---

## Core Components

### 1. Input Processing Layer

**Parsers Module**
- epub parser (primary format - 80% of indie books)
- pdf parser (secondary, OCR-dependent)
- html parser (for web-published books)
- mobi parser (legacy support)

**Features:**
- Text normalization and whitespace cleanup
- Chapter/section detection
- Dialogue annotation (confidence threshold: 0.7)
- Character identification from dialogue tags

### 2. Analysis Layer

**Analyzer Module**
- Genre detection (optionally ML-based; currently heuristic)
- Tone/style analysis for voice selection
- Length estimation for batching

**Annotator Module**
- Dialogue confidence scoring
- Speaker attribution
- Pacing markers

### 3. Voice Generation Layer

**Generation Module**
- Qwen3-TTS 1.7B Base model (primary)
- Qwen3-TTS 1.7B VoiceDesign model (custom voices)
- Batch processing optimization
- Retry logic with exponential backoff (5s, 15s, 45s)

**Voice Management:**
- Narrator voice (auto-inferred or user-selected)
- Character voices (diverse defaults to avoid similarity)
- Voice cloning via prompt extraction
### 4. Assembly Layer

**Assembly Module**
- Audio segment stitching
- Speaker transition padding: 0.4s
- Paragraph padding: 0.2s
- Loudness normalization to -23 LUFS
- Output format generation (WAV, MP3 @ 128 kbps)

### 5. Validation Layer

**Validation Module**
- Audio energy threshold: -60 dB
- Loudness tolerance: ±3 LUFS
- Strict mode flag for CI/CD

---

## Technology Stack

### Core Framework
- **Language:** Python 3.11+
- **ML Framework:** PyTorch 2.0+
- **Audio Processing:** SoundFile, librosa
- **Web API:** FastAPI + Uvicorn
- **Queue:** Redis (for async processing)

### Infrastructure
- **GPU Requirements:** RTX 3060 12GB minimum, RTX 4090 recommended
- **Memory:** 32GB RAM minimum
- **Storage:** 50GB SSD for model weights and cache

### Dependencies

```yaml
torch: ">=2.0.0"
soundfile: ">=0.12.0"
librosa: ">=0.10.0"
fastapi: ">=0.104.0"
uvicorn: ">=0.24.0"
redis: ">=5.0.0"
pydub: ">=0.25.0"
ebooklib: ">=0.18"
pypdf: ">=3.0.0"
```

---

## Data Flow

1. **Upload:** User uploads an epub via CLI or web UI
2. **Parse:** Text extraction with dialogue annotation
3. **Analyze:** Genre detection, character identification
4. **Queue:** Job added to Redis queue
5. **Process:** GPU worker pulls job, generates audio segments
6. **Assemble:** Stitch segments with padding, normalize loudness
7. **Validate:** Check audio quality thresholds
8. **Deliver:** MP3/WAV file to user

---

## Performance Targets

| Metric | Target | Notes |
|--------|--------|-------|
| Gen speed | 0.5x real-time | RTX 4090, batch=4 |
| Quality | -23 LUFS ±1dB | Audiobook standard |
| Latency | <5 min per chapter | For 20k words |
| Concurrent users | 10 | With 4 GPU workers |

---

## Scalability Considerations

### Phase 1 (MVP - Weeks 1-4)
- Single-machine deployment
- CLI-only interface
- Local queue (in-memory)
- Manual GPU provisioning

### Phase 2 (Beta - Weeks 5-8)
- FastAPI web interface
- Redis queue for async jobs
- Docker containerization
- Cloud GPU option (RunPod, Lambda Labs)

### Phase 3 (Production - Quarter 2)
- Kubernetes cluster
- Auto-scaling GPU workers
- Multi-region deployment
- CDN for file delivery

---

## Security Considerations

- User audio files stored encrypted at rest
- API authentication via API keys
- Rate limiting: 100 requests/hour per tier
- No third-party data sharing

---

## Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| GPU availability | High | Cloud GPU partnerships, queue-based scaling |
| Model quality variance | Medium | Human review workflow for premium tier |
| Format parsing edge cases | Low | Extensive test suite, graceful degradation |
| Competition from big players | Medium | Focus on indie author niche, character voices |

---

## Next Steps

1. **Week 1:** Set up development environment, create ADRs for key decisions
2. **Weeks 2-3:** Implement MVP features (single-narrator, epub, MP3)
3. **Week 4:** Beta testing with 5-10 indie authors
4. **Week 5+:** Character voice refinement, web UI

---

*Document lives at the project root for cross-agent access. Update with ADRs as decisions evolve.*
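---

## Appendix: Retry Backoff Sketch

The generation layer's retry logic with its 5s/15s/45s backoff schedule can be sketched as a small helper. This is a minimal illustration, not the project's actual API: the function name, the injectable `sleep` parameter (added here so the schedule is testable without real waiting), and the catch-all exception handling are all assumptions of this sketch.

```python
import time


def retry_with_backoff(fn, delays=(5, 15, 45), sleep=time.sleep):
    """Call fn(); on failure, wait per the backoff schedule and retry.

    Illustrative helper: `delays` mirrors the document's 5s/15s/45s
    schedule, and `sleep` is injectable so tests can record waits
    instead of actually sleeping.
    """
    for delay in delays:
        try:
            return fn()
        except Exception:
            # Transient failure (e.g. GPU OOM, model load hiccup):
            # wait, then fall through to the next attempt.
            sleep(delay)
    # Final attempt: let any exception propagate so the worker can
    # mark the queued job as failed instead of retrying forever.
    return fn()
```

In a worker, the TTS call for a segment would be wrapped as `retry_with_backoff(lambda: synthesize(segment))`, where `synthesize` stands in for whatever the generation module actually exposes.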