Technical Architecture Document
Date: 2026-03-08
Version: 1.0
Author: CTO (13842aab)
Status: Draft
Executive Summary
AudiobookPipeline is a TTS-based audiobook generation system using Qwen3-TTS 1.7B models. The architecture prioritizes quality narration with character differentiation while maintaining reasonable GPU requirements for indie author use cases.
System Architecture
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Client App    │────▶│   API Gateway    │────▶│   Worker Pool   │
│   (CLI/Web)     │     │    (FastAPI)     │     │  (GPU Workers)  │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                 │                        │
                                 ▼                        ▼
                          ┌──────────────┐         ┌──────────────┐
                          │    Queue     │         │    Models    │
                          │   (Redis)    │         │ (Qwen3-TTS)  │
                          └──────────────┘         └──────────────┘
Core Components
1. Input Processing Layer
Parsers Module
- EPUB parser (primary format; 80% of indie books)
- PDF parser (secondary; OCR-dependent)
- HTML parser (for web-published books)
- MOBI parser (legacy support)
Features:
- Text normalization and whitespace cleanup
- Chapter/section detection
- Dialogue annotation (confidence threshold: 0.7)
- Character identification from dialogue tags
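A minimal sketch of how the 0.7 confidence threshold might be applied, assuming a regex-based quote detector with a heuristic speech-verb boost. The function name, verb list, and confidence values are illustrative, not the shipped parser:

```python
import re
from dataclasses import dataclass

DIALOGUE_THRESHOLD = 0.7  # confidence threshold from the spec


@dataclass
class Span:
    text: str
    confidence: float
    is_dialogue: bool


def annotate_dialogue(paragraph: str, threshold: float = DIALOGUE_THRESHOLD) -> list:
    """Split a paragraph into quoted/unquoted spans with a heuristic confidence."""
    spans = []
    for match in re.finditer(r'"([^"]+)"|([^"]+)', paragraph):
        quoted, plain = match.group(1), match.group(2)
        if quoted is not None:
            # Quoted text is likely dialogue; boost confidence if a
            # speech verb appears immediately after the closing quote.
            tail = paragraph[match.end():match.end() + 30]
            conf = 0.9 if re.match(r'\s*(said|asked|replied|whispered)\b', tail) else 0.75
            spans.append(Span(quoted, conf, conf >= threshold))
        elif plain.strip():
            # Narration: high confidence that it is NOT dialogue.
            spans.append(Span(plain.strip(), 0.95, False))
    return spans
```

Anything scoring below the threshold would fall back to the narrator voice rather than a character voice.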
2. Analysis Layer
Analyzer Module
- Genre detection (currently heuristic; optional ML-based upgrade)
- Tone/style analysis for voice selection
- Length estimation for batching
Annotator Module
- Dialogue confidence scoring
- Speaker attribution
- Pacing markers
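Speaker attribution can be sketched as dialogue-tag matching in both orders ("said Alice" and "Alice said"). This is a hedged stand-in; the verb list and name pattern are assumptions, and the real annotator would need pronoun resolution for untagged dialogue:

```python
import re

# Illustrative verb list, not exhaustive.
SPEECH_VERBS = r"(?:said|asked|replied|shouted|whispered|muttered)"


def attribute_speaker(sentence: str):
    """Pull a speaker name out of a dialogue tag such as
    '"...," said Alice.' or '"...," Alice said.' Returns None if no tag found."""
    # Pattern 1: speech verb followed by a capitalized name ("said Alice")
    m = re.search(rf'{SPEECH_VERBS}\s+([A-Z][a-z]+)', sentence)
    if m:
        return m.group(1)
    # Pattern 2: capitalized name followed by a speech verb ("Alice said")
    m = re.search(rf'([A-Z][a-z]+)\s+{SPEECH_VERBS}', sentence)
    if m:
        return m.group(1)
    return None
```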
3. Voice Generation Layer
Generation Module
- Qwen3-TTS 1.7B Base model (primary)
- Qwen3-TTS 1.7B VoiceDesign model (custom voices)
- Batch processing optimization
- Retry logic with exponential backoff (5s, 15s, 45s)
Voice Management:
- Narrator voice (auto-inferred or user-selected)
- Character voices (diverse defaults to avoid similarity)
- Voice cloning via prompt extraction
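The retry schedule above (5 s, 15 s, 45 s) can be sketched as follows. `generate` stands in for the actual TTS call, and `sleep` is injectable so the schedule is testable without waiting:

```python
import time

BACKOFF_SCHEDULE = [5, 15, 45]  # seconds, per the spec


def generate_with_retry(generate, segment, sleep=time.sleep):
    """Call generate(segment); on failure retry after 5 s, 15 s, 45 s.

    `generate` is a stand-in for the TTS model call."""
    last_error = None
    for delay in [0] + BACKOFF_SCHEDULE:
        if delay:
            sleep(delay)
        try:
            return generate(segment)
        except Exception as exc:  # production code would catch the model's specific errors
            last_error = exc
    raise RuntimeError("TTS generation failed after retries") from last_error
```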
4. Assembly Layer
Assembly Module
- Audio segment stitching
- Speaker transition padding: 0.4s
- Paragraph padding: 0.2s
- Loudness normalization to -23 LUFS
- Output format generation (WAV, MP3 at 128 kbps)
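The padding rules can be sketched on raw sample lists. The 24 kHz rate is an assumption about the TTS output; real assembly would work on pydub/soundfile buffers, and -23 LUFS normalization needs a proper BS.1770 meter (e.g. pyloudnorm) rather than the plain concatenation shown here:

```python
SAMPLE_RATE = 24_000   # assumed TTS output rate
SPEAKER_PAD_S = 0.4    # padding at speaker transitions (from the spec)
PARAGRAPH_PAD_S = 0.2  # padding between same-speaker paragraphs


def silence(seconds: float) -> list:
    return [0.0] * int(seconds * SAMPLE_RATE)


def stitch(segments: list, speaker_changes: list) -> list:
    """Concatenate audio segments, inserting 0.4 s of silence where the
    speaker changes and 0.2 s between same-speaker paragraphs.

    speaker_changes[i] is True if segments[i+1] has a different speaker
    than segments[i]."""
    out = list(segments[0])
    for seg, changed in zip(segments[1:], speaker_changes):
        out += silence(SPEAKER_PAD_S if changed else PARAGRAPH_PAD_S)
        out += seg
    return out
```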
5. Validation Layer
Validation Module
- Audio energy threshold: -60 dB
- Loudness tolerance: ±3 LUFS
- Strict mode flag for CI/CD
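A hedged sketch of these checks, using plain RMS dBFS for the energy test and taking the measured loudness as an input (a real implementation would measure LUFS with a BS.1770-compliant meter). Function names are illustrative:

```python
import math

ENERGY_FLOOR_DB = -60.0  # segments quieter than this are flagged as silent
LUFS_TARGET = -23.0
LUFS_TOLERANCE = 3.0


def rms_db(samples: list) -> float:
    """RMS level in dBFS (0 dB = full scale)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return -math.inf if rms == 0 else 20 * math.log10(rms)


def validate_segment(samples: list, loudness_lufs: float, strict: bool = False) -> list:
    """Return a list of quality issues; in strict mode (CI/CD), raise instead."""
    issues = []
    if rms_db(samples) < ENERGY_FLOOR_DB:
        issues.append("audio below energy threshold (possibly silent)")
    if abs(loudness_lufs - LUFS_TARGET) > LUFS_TOLERANCE:
        issues.append("loudness outside ±3 LUFS of the -23 LUFS target")
    if strict and issues:
        raise ValueError("; ".join(issues))
    return issues
```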
Technology Stack
Core Framework
- Language: Python 3.11+
- ML Framework: PyTorch 2.0+
- Audio Processing: SoundFile, librosa
- Web API: FastAPI + Uvicorn
- Queue: Redis (for async processing)
Infrastructure
- GPU Requirements: RTX 3060 12 GB minimum, RTX 4090 recommended
- Memory: 32 GB RAM minimum
- Storage: 50 GB SSD for model weights and cache
Dependencies
torch: ">=2.0.0"
soundfile: ">=0.12.0"
librosa: ">=0.10.0"
fastapi: ">=0.104.0"
uvicorn: ">=0.24.0"
redis: ">=5.0.0"
pydub: ">=0.25.0"
ebooklib: ">=0.18"
pypdf: ">=3.0.0"
Data Flow
- Upload: User uploads epub via CLI or web UI
- Parse: Text extraction with dialogue annotation
- Analyze: Genre detection, character identification
- Queue: Job added to Redis queue
- Process: GPU worker pulls job, generates audio segments
- Assemble: Stitch segments with padding, normalize loudness
- Validate: Check audio quality thresholds
- Deliver: MP3/WAV file to user
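The queue/process handoff can be sketched with the Phase 1 in-memory queue; Phase 2 would swap in Redis (e.g. LPUSH on the API side, BRPOP on the worker side) behind the same enqueue/claim interface. The job schema below is an illustrative assumption:

```python
import json
import queue

# Phase 1: in-memory queue. Phase 2 replaces this with Redis while
# keeping the same enqueue/claim interface.
job_queue = queue.Queue()


def enqueue_job(book_id: str, chapters: list) -> None:
    """API side: serialize the job and push it onto the queue."""
    job_queue.put(json.dumps({"book_id": book_id, "chapters": chapters}))


def claim_job() -> dict:
    """Worker side: block until a job is available, then decode it."""
    return json.loads(job_queue.get())
```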
Performance Targets
| Metric | Target | Notes |
|---|---|---|
| Generation speed | 0.5× real-time | RTX 4090, batch=4 |
| Quality | -23 LUFS ±1dB | Audiobook standard |
| Latency | <5 min per chapter | For 20k words |
| Concurrent users | 10 | With 4 GPU workers |
Scalability Considerations
Phase 1 (MVP - Week 1-4)
- Single-machine deployment
- CLI-only interface
- Local queue (in-memory)
- Manual GPU provisioning
Phase 2 (Beta - Week 5-8)
- FastAPI web interface
- Redis queue for async jobs
- Docker containerization
- Cloud GPU option (RunPod, Lambda Labs)
Phase 3 (Production - Quarter 2)
- Kubernetes cluster
- Auto-scaling GPU workers
- Multi-region deployment
- CDN for file delivery
Security Considerations
- User audio files stored encrypted at rest
- API authentication via API keys
- Rate limiting: 100 requests/hour per tier
- No third-party data sharing
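The 100 requests/hour limit can be sketched as a per-key fixed window. This is illustrative only; a multi-worker deployment would keep the counters in Redis (INCR plus EXPIRE) so all API instances share state:

```python
import time
from collections import defaultdict

RATE_LIMIT = 100  # requests per window, per the spec
WINDOW_S = 3600   # one hour


class FixedWindowLimiter:
    """Minimal per-API-key fixed-window rate limiter."""

    def __init__(self, limit=RATE_LIMIT, window=WINDOW_S, clock=time.monotonic):
        self.limit, self.window, self.clock = limit, window, clock
        self.counts = defaultdict(int)
        self.window_start = {}

    def allow(self, api_key: str) -> bool:
        now = self.clock()
        start = self.window_start.get(api_key)
        if start is None or now - start >= self.window:
            # New window for this key: reset the counter.
            self.window_start[api_key] = now
            self.counts[api_key] = 0
        if self.counts[api_key] >= self.limit:
            return False
        self.counts[api_key] += 1
        return True
```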
Risks & Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| GPU availability | High | Cloud GPU partnerships, queue-based scaling |
| Model quality variance | Medium | Human review workflow for premium tier |
| Format parsing edge cases | Low | Extensive test suite, graceful degradation |
| Competition from big players | Medium | Focus on indie author niche, character voices |
Next Steps
- Week 1: Set up development environment, create ADRs for key decisions
- Week 2-3: Implement MVP features (single-narrator, epub, MP3)
- Week 4: Beta testing with 5-10 indie authors
- Week 5+: Character voice refinement, web UI
Document lives at project root for cross-agent access. Update with ADRs as decisions evolve.