# Technical Architecture Document

**Date:** 2026-03-08
**Version:** 1.0
**Author:** CTO (13842aab)
**Status:** Draft

---

## Executive Summary

AudiobookPipeline is a TTS-based audiobook generation system built on Qwen3-TTS 1.7B models. The architecture prioritizes high-quality narration with distinct character voices while keeping GPU requirements within reach of indie authors.

---

## System Architecture

```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Client App    │────▶│   API Gateway    │────▶│  Worker Pool    │
│   (CLI/Web)     │     │   (FastAPI)      │     │  (GPU Workers)  │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                 │                        │
                                 ▼                        ▼
                          ┌──────────────┐        ┌──────────────┐
                          │    Queue     │        │   Models     │
                          │   (Redis)    │        │ (Qwen3-TTS)  │
                          └──────────────┘        └──────────────┘
```

---

## Core Components

### 1. Input Processing Layer

**Parsers Module**
- epub parser (primary format - 80% of indie books)
- pdf parser (secondary, OCR-dependent)
- html parser (for web-published books)
- mobi parser (legacy support)

**Features:**
- Text normalization and whitespace cleanup
- Chapter/section detection
- Dialogue annotation (confidence threshold: 0.7)
- Character identification from dialogue tags

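As an illustration of dialogue annotation with a confidence threshold, the following is a minimal sketch rather than the project's parser. The names `DialogueSpan`, `annotate_dialogue`, and `DIALOGUE_CONFIDENCE_THRESHOLD`, the list of tag verbs, and the 0.9/0.6 scores are all assumptions made for the example; only the 0.7 threshold comes from the list above.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical constant name; the value 0.7 is the threshold listed above.
DIALOGUE_CONFIDENCE_THRESHOLD = 0.7

# A double-quoted span, optionally followed by a tag such as: said Alice
_QUOTE_WITH_TAG = re.compile(
    r'"(?P<quote>[^"]+)"'
    r'(?:\s*(?:said|asked|replied|whispered|shouted)\s+(?P<speaker>[A-Z][a-z]+))?'
)


@dataclass
class DialogueSpan:
    text: str
    speaker: Optional[str]
    confidence: float


def annotate_dialogue(paragraph: str) -> list[DialogueSpan]:
    """Return quoted spans whose confidence meets the threshold."""
    spans = []
    for match in _QUOTE_WITH_TAG.finditer(paragraph):
        speaker = match.group("speaker")
        # Illustrative scoring only: an explicit dialogue tag raises confidence.
        confidence = 0.9 if speaker else 0.6
        if confidence >= DIALOGUE_CONFIDENCE_THRESHOLD:
            spans.append(DialogueSpan(match.group("quote"), speaker, confidence))
    return spans
```

In this sketch, quotes without an explicit tag fall below the threshold and are dropped, so they would be read by the narrator voice.
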
### 2. Analysis Layer

**Analyzer Module**
- Genre detection (optional ML-based, currently heuristic)
- Tone/style analysis for voice selection
- Length estimation for batching

**Annotator Module**
- Dialogue confidence scoring
- Speaker attribution
- Pacing markers

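Length estimation for batching could work along the following lines. This is a sketch under assumptions: the function name `batch_by_length` and the 2000-character budget are hypothetical, and character count stands in for whatever length measure the analyzer actually uses.

```python
from typing import Iterable


def batch_by_length(segments: Iterable[str], max_chars: int = 2000) -> list[list[str]]:
    """Greedily group consecutive segments so each batch stays within max_chars.

    A single segment longer than max_chars is placed in a batch of its own.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    current_len = 0
    for segment in segments:
        # Start a new batch when adding this segment would exceed the budget.
        if current and current_len + len(segment) > max_chars:
            batches.append(current)
            current, current_len = [], 0
        current.append(segment)
        current_len += len(segment)
    if current:
        batches.append(current)
    return batches
```
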
### 3. Voice Generation Layer

**Generation Module**
- Qwen3-TTS 1.7B Base model (primary)
- Qwen3-TTS 1.7B VoiceDesign model (custom voices)
- Batch processing optimization
- Retry logic with exponential backoff (5s, 15s, 45s)

**Voice Management:**
- Narrator voice (auto-inferred or user-selected)
- Character voices (diverse defaults to avoid similarity)
- Voice cloning via prompt extraction

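The retry schedule above (5s, 15s, 45s, each delay three times the previous one) can be sketched as a small wrapper. The names `call_with_backoff` and `BACKOFF_DELAYS_S` are assumptions, as is the choice to make one final attempt after the last delay; the `sleep` parameter exists only so the behaviour can be exercised without waiting.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

# Delays from the Generation Module list: 5s, 15s, 45s.
BACKOFF_DELAYS_S: tuple[float, ...] = (5.0, 15.0, 45.0)


def call_with_backoff(
    fn: Callable[[], T],
    delays: Sequence[float] = BACKOFF_DELAYS_S,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying after each delay; re-raise if the final attempt fails."""
    for delay in delays:
        try:
            return fn()
        except Exception:
            # A real worker would likely catch a narrower error type and log it.
            sleep(delay)
    return fn()  # final attempt; any exception propagates to the caller
```

With the default schedule this makes at most four attempts in total: the initial call plus one retry after each of the three delays.
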
### 4. Assembly Layer

**Assembly Module**
- Audio segment stitching
- Speaker transition padding: 0.4s
- Paragraph padding: 0.2s
- Loudness normalization to -23 LUFS
- Output format generation (WAV, MP3 @ 128kbps)

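A minimal sketch of the stitching step, assuming each segment corresponds to one paragraph of mono audio and that a speaker change takes the longer pause. `Segment` and `stitch` are hypothetical names, and plain Python lists stand in for the audio arrays a real implementation would use; loudness normalization and encoding are out of scope here.

```python
from dataclasses import dataclass
from typing import Optional

SPEAKER_TRANSITION_PAD_S = 0.4  # from the Assembly Module list
PARAGRAPH_PAD_S = 0.2           # from the Assembly Module list


@dataclass
class Segment:
    samples: list[float]  # mono PCM samples in [-1.0, 1.0]
    speaker: str


def stitch(segments: list[Segment], sample_rate: int) -> list[float]:
    """Concatenate segments, inserting silence between neighbours."""
    out: list[float] = []
    previous: Optional[Segment] = None
    for segment in segments:
        if previous is not None:
            # Longer pause on a speaker change, shorter one otherwise.
            if segment.speaker != previous.speaker:
                pad_s = SPEAKER_TRANSITION_PAD_S
            else:
                pad_s = PARAGRAPH_PAD_S
            out.extend([0.0] * round(pad_s * sample_rate))
        out.extend(segment.samples)
        previous = segment
    return out
```
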
### 5. Validation Layer

**Validation Module**
- Audio energy threshold: -60 dB
- Loudness tolerance: ±3 LU
- Strict mode flag for CI/CD

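The checks above might be combined as in the following sketch. It assumes the energy threshold is an RMS level in dB relative to full scale, that the integrated loudness has already been measured elsewhere and is passed in as `measured_lufs`, and that strict mode turns any failure into an exception; `rms_dbfs` and `validate` are hypothetical names.

```python
import math

ENERGY_THRESHOLD_DB = -60.0  # from the Validation Module list
TARGET_LUFS = -23.0          # normalization target from the Assembly Module
LOUDNESS_TOLERANCE = 3.0     # from the Validation Module list


def rms_dbfs(samples: list[float]) -> float:
    """RMS level in dB relative to full scale; -inf for silence or empty input."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_square) if mean_square > 0 else float("-inf")


def validate(samples: list[float], measured_lufs: float, strict: bool = False) -> list[str]:
    """Return the list of problems found; in strict mode, raise if there are any."""
    problems = []
    if rms_dbfs(samples) < ENERGY_THRESHOLD_DB:
        problems.append("audio energy below threshold")
    if abs(measured_lufs - TARGET_LUFS) > LOUDNESS_TOLERANCE:
        problems.append("loudness outside tolerance")
    if strict and problems:
        raise ValueError("; ".join(problems))
    return problems
```
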
---

## Technology Stack

### Core Framework
- **Language:** Python 3.11+
- **ML Framework:** PyTorch 2.0+
- **Audio Processing:** SoundFile, librosa
- **Web API:** FastAPI + Uvicorn
- **Queue:** Redis (for async processing)

### Infrastructure
- **GPU Requirements:** RTX 3060 12GB minimum, RTX 4090 recommended
- **Memory:** 32GB RAM minimum
- **Storage:** 50GB SSD for model weights and cache

### Dependencies
```yaml
torch: ">=2.0.0"
soundfile: ">=0.12.0"
librosa: ">=0.10.0"
fastapi: ">=0.104.0"
uvicorn: ">=0.24.0"
redis: ">=5.0.0"
pydub: ">=0.25.0"
ebooklib: ">=0.18"
pypdf: ">=3.0.0"
```

---

## Data Flow

1. **Upload:** User uploads epub via CLI or web UI
2. **Parse:** Text extraction with dialogue annotation
3. **Analyze:** Genre detection, character identification
4. **Queue:** Job added to Redis queue
5. **Process:** GPU worker pulls job, generates audio segments
6. **Assemble:** Stitch segments with padding, normalize loudness
7. **Validate:** Check audio quality thresholds
8. **Deliver:** MP3/WAV file to user

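One way to track a job through the eight steps above is an ordered stage enum on the job record. This is a sketch only: `Stage` and `Job` are hypothetical names, and how job state is actually stored (for example, in Redis) is not specified here.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    UPLOAD = 1
    PARSE = 2
    ANALYZE = 3
    QUEUE = 4
    PROCESS = 5
    ASSEMBLE = 6
    VALIDATE = 7
    DELIVER = 8


@dataclass
class Job:
    job_id: str
    stage: Stage = Stage.UPLOAD
    history: list[Stage] = field(default_factory=list)

    def advance(self) -> Stage:
        """Move to the next stage in order; raise once delivery is reached."""
        if self.stage is Stage.DELIVER:
            raise ValueError("job already delivered")
        self.history.append(self.stage)
        self.stage = Stage(self.stage.value + 1)
        return self.stage
```
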
---

## Performance Targets

| Metric | Target | Notes |
|--------|--------|-------|
| Generation speed | 0.5x real-time | RTX 4090, batch=4 |
| Quality | -23 LUFS ±1 LU | Audiobook standard |
| Latency | <5 min per chapter | For 20k words |
| Concurrent users | 10 | With 4 GPU workers |

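To relate the latency target to generation throughput, a rough calculation helps. The sketch below assumes a narration rate of 150 words per minute, which is an assumption and not a figure from this document; `audio_minutes` and `required_speedup` are hypothetical helper names. Under that assumption a 20k-word chapter is about 133 minutes of audio, so a 5-minute budget implies producing roughly 27 minutes of audio per wall-clock minute across all workers, a figure worth comparing against the per-GPU generation speed target.

```python
def audio_minutes(words: int, words_per_minute: float = 150.0) -> float:
    """Estimated narration length in minutes; 150 wpm is an assumed speaking rate."""
    return words / words_per_minute


def required_speedup(words: int, latency_budget_min: float,
                     words_per_minute: float = 150.0) -> float:
    """Minutes of audio that must be produced per wall-clock minute to meet the budget."""
    return audio_minutes(words, words_per_minute) / latency_budget_min
```
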
---

## Scalability Considerations

### Phase 1 (MVP - Week 1-4)
- Single-machine deployment
- CLI-only interface
- Local queue (in-memory)
- Manual GPU provisioning

### Phase 2 (Beta - Week 5-8)
- FastAPI web interface
- Redis queue for async jobs
- Docker containerization
- Cloud GPU option (RunPod, Lambda Labs)

### Phase 3 (Production - Quarter 2)
- Kubernetes cluster
- Auto-scaling GPU workers
- Multi-region deployment
- CDN for file delivery

---

## Security Considerations

- User audio files stored encrypted at rest
- API authentication via API keys
- Rate limiting: 100 requests/hour per tier
- No third-party data sharing

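The rate limit above could be enforced with a sliding-window counter, sketched here. Several things are assumptions: limits are tracked per API key (the list above says "per tier", and how tiers map to keys is not specified), state is held in process memory instead of a shared store, and `SlidingWindowLimiter` is a hypothetical name. The `clock` parameter is injectable so the window can be exercised without waiting an hour.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_s` seconds for each API key."""

    def __init__(self, limit: int = 100, window_s: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self._hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, api_key: str) -> bool:
        """Record the request and return True if it is within the limit."""
        now = self.clock()
        hits = self._hits[api_key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```
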
---

## Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| GPU availability | High | Cloud GPU partnerships, queue-based scaling |
| Model quality variance | Medium | Human review workflow for premium tier |
| Format parsing edge cases | Low | Extensive test suite, graceful degradation |
| Competition from big players | Medium | Focus on indie author niche, character voices |

---

## Next Steps

1. **Week 1:** Set up development environment, create ADRs for key decisions
2. **Week 2-3:** Implement MVP features (single-narrator, epub, MP3)
3. **Week 4:** Beta testing with 5-10 indie authors
4. **Week 5+:** Character voice refinement, web UI

---

*Document lives at project root for cross-agent access. Update with ADRs as decisions evolve.*