# Technical Architecture Document

**Date:** 2026-03-08
**Version:** 1.0
**Author:** CTO (13842aab)
**Status:** Draft

---

## Executive Summary

AudiobookPipeline is a TTS-based audiobook generation system using Qwen3-TTS 1.7B models. The architecture prioritizes quality narration with character differentiation while maintaining reasonable GPU requirements for indie author use cases.

---

## System Architecture

```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Client App    │────▶│   API Gateway    │────▶│   Worker Pool   │
│   (CLI/Web)     │     │    (FastAPI)     │     │  (GPU Workers)  │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                 │                        │
                                 ▼                        ▼
                         ┌──────────────┐         ┌──────────────┐
                         │    Queue     │         │    Models    │
                         │   (Redis)    │         │ (Qwen3-TTS)  │
                         └──────────────┘         └──────────────┘
```

---

## Core Components

### 1. Input Processing Layer

**Parsers Module**
- epub parser (primary format - 80% of indie books)
- pdf parser (secondary, OCR-dependent)
- html parser (for web-published books)
- mobi parser (legacy support)

**Features:**
- Text normalization and whitespace cleanup
- Chapter/section detection
- Dialogue annotation (confidence threshold: 0.7)
- Character identification from dialogue tags

### 2. Analysis Layer

**Analyzer Module**
- Genre detection (optionally ML-based; currently heuristic)
- Tone/style analysis for voice selection
- Length estimation for batching

**Annotator Module**
- Dialogue confidence scoring
- Speaker attribution
- Pacing markers

### 3. Voice Generation Layer

**Generation Module**
- Qwen3-TTS 1.7B Base model (primary)
- Qwen3-TTS 1.7B VoiceDesign model (custom voices)
- Batch processing optimization
- Retry logic with exponential backoff (5s, 15s, 45s)

**Voice Management:**
- Narrator voice (auto-inferred or user-selected)
- Character voices (diverse defaults to avoid similarity)
- Voice cloning via prompt extraction
### 4. Assembly Layer

**Assembly Module**
- Audio segment stitching
- Speaker transition padding: 0.4s
- Paragraph padding: 0.2s
- Loudness normalization to -23 LUFS
- Output format generation (WAV, MP3 @ 128 kbps)

### 5. Validation Layer

**Validation Module**
- Audio energy threshold: -60 dB
- Loudness tolerance: ±3 LUFS
- Strict mode flag for CI/CD

---

## Technology Stack

### Core Framework
- **Language:** Python 3.11+
- **ML Framework:** PyTorch 2.0+
- **Audio Processing:** SoundFile, librosa
- **Web API:** FastAPI + Uvicorn
- **Queue:** Redis (for async processing)

### Infrastructure
- **GPU Requirements:** RTX 3060 12GB minimum, RTX 4090 recommended
- **Memory:** 32GB RAM minimum
- **Storage:** 50GB SSD for model weights and cache

### Dependencies

```yaml
torch: ">=2.0.0"
soundfile: ">=0.12.0"
librosa: ">=0.10.0"
fastapi: ">=0.104.0"
uvicorn: ">=0.24.0"
redis: ">=5.0.0"
pydub: ">=0.25.0"
ebooklib: ">=0.18"
pypdf: ">=3.0.0"
```

---

## Data Flow

1. **Upload:** User uploads an epub via CLI or web UI
2. **Parse:** Text extraction with dialogue annotation
3. **Analyze:** Genre detection, character identification
4. **Queue:** Job added to Redis queue
5. **Process:** GPU worker pulls job, generates audio segments
6. **Assemble:** Stitch segments with padding, normalize loudness
7. **Validate:** Check audio quality thresholds
8. **Deliver:** MP3/WAV file to user

---

## Performance Targets

| Metric | Target | Notes |
|--------|--------|-------|
| Gen speed | 0.5x real-time | RTX 4090, batch=4 |
| Quality | -23 LUFS ±1dB | Audiobook standard |
| Latency | <5 min per chapter | For 20k words |
| Concurrent users | 10 | With 4 GPU workers |

---

## Scalability Considerations

### Phase 1 (MVP - Weeks 1-4)
- Single-machine deployment
- CLI-only interface
- Local queue (in-memory)
- Manual GPU provisioning

### Phase 2 (Beta - Weeks 5-8)
- FastAPI web interface
- Redis queue for async jobs
- Docker containerization
- Cloud GPU option (RunPod, Lambda Labs)

### Phase 3 (Production - Quarter 2)
- Kubernetes cluster
- Auto-scaling GPU workers
- Multi-region deployment
- CDN for file delivery

---

## Security Considerations

- User audio files stored encrypted at rest
- API authentication via API keys
- Rate limiting: 100 requests/hour per tier
- No third-party data sharing

---

## Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| GPU availability | High | Cloud GPU partnerships, queue-based scaling |
| Model quality variance | Medium | Human review workflow for premium tier |
| Format parsing edge cases | Low | Extensive test suite, graceful degradation |
| Competition from big players | Medium | Focus on indie author niche, character voices |

---

## Next Steps

1. **Week 1:** Set up development environment, create ADRs for key decisions
2. **Weeks 2-3:** Implement MVP features (single-narrator, epub, MP3)
3. **Week 4:** Beta testing with 5-10 indie authors
4. **Week 5+:** Character voice refinement, web UI

---

*Document lives at the project root for cross-agent access. Update with ADRs as decisions evolve.*
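---

## Appendix: Retry Backoff Sketch

The generation layer's retry logic with its 5s/15s/45s backoff schedule can be sketched as a small helper. This is a minimal illustration, not the project's actual API: the function name, the injectable `sleep` parameter (added here so the schedule is testable without real waiting), and the catch-all exception handling are all assumptions of this sketch.

```python
import time


def retry_with_backoff(fn, delays=(5, 15, 45), sleep=time.sleep):
    """Call fn(); on failure, wait per the backoff schedule and retry.

    Illustrative helper: `delays` mirrors the document's 5s/15s/45s
    schedule, and `sleep` is injectable so tests can record waits
    instead of actually sleeping.
    """
    for delay in delays:
        try:
            return fn()
        except Exception:
            # Transient failure (e.g. GPU OOM, model load hiccup):
            # wait, then fall through to the next attempt.
            sleep(delay)
    # Final attempt: let any exception propagate so the worker can
    # mark the queued job as failed instead of retrying forever.
    return fn()
```

In a worker, the TTS call for a segment would be wrapped as `retry_with_backoff(lambda: synthesize(segment))`, where `synthesize` stands in for whatever the generation module actually exposes.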