2026-03-09 09:21:48 -04:00
commit 22e4864b8e
82 changed files with 4587 additions and 0 deletions
--- a/technical_architecture.md
+++ b/technical_architecture.md
@@ -0,0 +1,196 @@
+# Technical Architecture Document
+
+**Date:** 2026-03-08  
+**Version:** 1.0  
+**Author:** CTO (13842aab)  
+**Status:** Draft
+
+---
+
+## Executive Summary
+
+AudiobookPipeline is a TTS-based audiobook generation system built on Qwen3-TTS 1.7B models. The architecture prioritizes high-quality narration with distinct character voices while keeping GPU requirements within reach of indie authors (RTX 3060 12GB minimum).
+
+---
+
+## System Architecture
+
+```
+┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
+│   Client App    │────▶│  API Gateway     │────▶│  Worker Pool    │
+│  (CLI/Web)      │     │  (FastAPI)       │     │  (GPU Workers)  │
+└─────────────────┘     └──────────────────┘     └─────────────────┘
+                             │                         │
+                             ▼                         ▼
+                      ┌──────────────┐         ┌──────────────┐
+                      │   Queue      │         │   Models     │
+                      │  (Redis)     │         │ (Qwen3-TTS)  │
+                      └──────────────┘         └──────────────┘
+```
+
+---
+
+## Core Components
+
+### 1. Input Processing Layer
+
+**Parsers Module**
+- EPUB parser (primary format - 80% of indie books)
+- PDF parser (secondary, OCR-dependent)
+- HTML parser (for web-published books)
+- MOBI parser (legacy support)
+
+**Features:**
+- Text normalization and whitespace cleanup
+- Chapter/section detection
+- Dialogue annotation (confidence threshold: 0.7)
+- Character identification from dialogue tags
+
+### 2. Analysis Layer
+
+**Analyzer Module**
+- Genre detection (currently heuristic, with an optional ML-based mode)
+- Tone/style analysis for voice selection
+- Length estimation for batching
+
+**Annotator Module**
+- Dialogue confidence scoring
+- Speaker attribution
+- Pacing markers
+
+### 3. Voice Generation Layer
+
+**Generation Module**
+- Qwen3-TTS 1.7B Base model (primary)
+- Qwen3-TTS 1.7B VoiceDesign model (custom voices)
+- Batch processing optimization
+- Retry logic with exponential backoff (5s, 15s, 45s)
+
+**Voice Management:**
+- Narrator voice (auto-inferred or user-selected)
+- Character voices (diverse defaults to avoid similarity)
+- Voice cloning via prompt extraction
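
The retry schedule listed under the Generation Module (5s, 15s, 45s) can be sketched as below. The function name `with_retries` and the injectable `sleep` parameter are assumptions made for illustration and testing; the document does not say which exceptions are retried, so this sketch retries on any `Exception`.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

BACKOFF_DELAYS = (5, 15, 45)  # seconds; each delay is 3x the previous one


def with_retries(
    task: Callable[[], T],
    delays: Sequence[float] = BACKOFF_DELAYS,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, waiting delays[0], delays[1], ... between failed attempts."""
    for delay in delays:
        try:
            return task()
        except Exception:
            # Assumption: every failure is retryable. A real worker would
            # probably distinguish transient errors from permanent ones.
            sleep(delay)
    # Final attempt after the last delay: let any exception propagate.
    return task()
```

With three delays this makes at most four attempts in total, spending up to 65 seconds waiting before the final failure reaches the caller.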
+
+### 4. Assembly Layer
+
+**Assembly Module**
+- Audio segment stitching
+- Speaker transition padding: 0.4s
+- Paragraph padding: 0.2s
+- Loudness normalization to -23 LUFS
+- Output format generation (WAV, MP3 at 128 kbps)
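
A rough sketch of how the two padding values could determine where each segment starts in the stitched output. The `Segment` type is hypothetical, and the rule that a speaker change takes precedence over a paragraph break (instead of the two pads being summed) is an assumption; the document does not say how the pads combine.

```python
from dataclasses import dataclass

SPEAKER_TRANSITION_PAD = 0.4  # seconds, from the Assembly Module list
PARAGRAPH_PAD = 0.2           # seconds, from the Assembly Module list


@dataclass
class Segment:
    speaker: str
    duration: float  # seconds of generated audio
    paragraph: int   # index of the source paragraph


def gap_before(prev: Segment, cur: Segment) -> float:
    """Silence to insert between two consecutive segments."""
    if prev.speaker != cur.speaker:
        return SPEAKER_TRANSITION_PAD  # assumption: overrides paragraph pad
    if prev.paragraph != cur.paragraph:
        return PARAGRAPH_PAD
    return 0.0


def start_offsets(segments: list[Segment]) -> list[float]:
    """Start time of each segment in the assembled audio, in seconds."""
    offsets = []
    cursor = 0.0
    prev = None
    for seg in segments:
        if prev is not None:
            cursor += gap_before(prev, seg)
        offsets.append(cursor)
        cursor += seg.duration
        prev = seg
    return offsets
```

Computing offsets first keeps the padding policy separate from the audio I/O, so the same timeline could drive either WAV or MP3 output.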
+
+### 5. Validation Layer
+
+**Validation Module**
+- Audio energy threshold: -60 dB
+- Loudness tolerance: ±3 LUFS
+- Strict mode flag for CI/CD
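
One way the validation thresholds might be applied, as a sketch only. It assumes the energy threshold is an RMS level in dB relative to full scale, that the integrated loudness has already been measured by a separate LUFS meter (not implemented here), and that strict mode turns findings into an exception. None of these details are stated in the document.

```python
import math
from typing import Sequence

ENERGY_THRESHOLD_DB = -60.0  # from the Validation Module list
TARGET_LUFS = -23.0          # normalization target from the Assembly Layer
LOUDNESS_TOLERANCE = 3.0     # from the Validation Module list


def rms_dbfs(samples: Sequence[float]) -> float:
    """RMS level in dB relative to full scale, for samples in [-1.0, 1.0]."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def validate(
    samples: Sequence[float], measured_lufs: float, strict: bool = False
) -> list[str]:
    """Return a list of problems; in strict mode, raise if any are found."""
    problems = []
    if rms_dbfs(samples) < ENERGY_THRESHOLD_DB:
        problems.append("segment is effectively silent")
    if abs(measured_lufs - TARGET_LUFS) > LOUDNESS_TOLERANCE:
        problems.append("loudness outside tolerance")
    if strict and problems:
        # Assumption: strict mode fails the run, e.g. to break a CI build.
        raise ValueError("; ".join(problems))
    return problems
```

In non-strict mode the caller receives the findings and can decide whether to regenerate the segment or merely log a warning.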
+
+---
+
+## Technology Stack
+
+### Core Framework
+- **Language:** Python 3.11+
+- **ML Framework:** PyTorch 2.0+
+- **Audio Processing:** SoundFile, librosa, pydub
+- **Web API:** FastAPI + Uvicorn
+- **Queue:** Redis (for async processing)
+
+### Infrastructure
+- **GPU Requirements:** RTX 3060 12GB minimum, RTX 4090 recommended
+- **Memory:** 32GB RAM minimum
+- **Storage:** 50GB SSD for model weights and cache
+
+### Dependencies
+```yaml
+torch: ">=2.0.0"
+soundfile: ">=0.12.0"
+librosa: ">=0.10.0"
+fastapi: ">=0.104.0"
+uvicorn: ">=0.24.0"
+redis: ">=5.0.0"
+pydub: ">=0.25.0"
+ebooklib: ">=0.18"
+pypdf: ">=3.0.0"
+```
+
+---
+
+## Data Flow
+
+1. **Upload:** User uploads an EPUB via the CLI or web UI
+2. **Parse:** Text extraction with dialogue annotation
+3. **Analyze:** Genre detection, character identification
+4. **Queue:** Job added to the Redis queue (a local in-memory queue in Phase 1)
+5. **Process:** GPU worker pulls job, generates audio segments
+6. **Assemble:** Stitch segments with padding, normalize loudness
+7. **Validate:** Check audio quality thresholds
+8. **Deliver:** MP3/WAV file to user
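
The eight steps above can be sketched as a producer that enqueues a job and a worker that drains the queue. Here `queue.Queue` from the standard library stands in for the queue, and every stage is a placeholder that only records its name; `Job`, `submit`, and `work_once` are hypothetical names, not part of the project.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class Job:
    source_path: str
    history: list = field(default_factory=list)  # stages completed so far


def make_stage(name):
    """Build a placeholder stage that just records that it ran."""
    def stage(job: Job) -> Job:
        job.history.append(name)
        return job
    return stage


# Steps 2-3 run before the job is queued; steps 5-8 run on a GPU worker.
PRE_QUEUE_STAGES = [make_stage(n) for n in ("parse", "analyze")]
WORKER_STAGES = [make_stage(n) for n in ("process", "assemble", "validate", "deliver")]


def submit(job: Job, jobs: queue.Queue) -> None:
    """Steps 1-4: accept the upload, parse, analyze, then enqueue."""
    for stage in PRE_QUEUE_STAGES:
        job = stage(job)
    jobs.put(job)


def work_once(jobs: queue.Queue) -> Job:
    """Steps 5-8: pull one job and run it through the worker stages."""
    job = jobs.get()
    for stage in WORKER_STAGES:
        job = stage(job)
    jobs.task_done()
    return job
```

Splitting the stages at the queue boundary means the CPU-bound parsing work never occupies a GPU worker.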
+
+---
+
+## Performance Targets
+
+| Metric | Target | Notes |
+|--------|--------|-------|
+| Gen speed | 0.5x real-time | RTX 4090, batch=4 |
+| Loudness | -23 LUFS ±1 LU | Same level as the EBU R 128 target; tighter than the ±3 LUFS validation tolerance |
+| Latency | <5 min per chapter | For 20k words |
+| Concurrent users | 10 | With 4 GPU workers |
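
A back-of-the-envelope check of how the generation speed and latency targets relate. Two assumptions are not in the document: a narration pace of 150 words per minute, and reading "0.5x real-time" as compute time equal to half the audio duration. The worker count assumes a chapter can be split evenly across workers.

```python
WORDS_PER_CHAPTER = 20_000  # from the Latency row
WORDS_PER_MINUTE = 150      # assumption: typical narration pace
REAL_TIME_FACTOR = 0.5      # assumption: compute time / audio duration
GPU_WORKERS = 4             # from the Concurrent users row
LATENCY_TARGET_MIN = 5      # from the Latency row


def audio_minutes(words: int, wpm: float = WORDS_PER_MINUTE) -> float:
    """Duration of the narrated audio in minutes."""
    return words / wpm


def compute_minutes(words: int, rtf: float = REAL_TIME_FACTOR, workers: int = 1) -> float:
    """Wall-clock generation time, assuming perfectly even parallelism."""
    return audio_minutes(words) * rtf / workers


def required_rtf(words: int, target_min: float, workers: int = 1) -> float:
    """Real-time factor each worker needs to meet the latency target."""
    return target_min * workers / audio_minutes(words)


if __name__ == "__main__":
    print(f"audio length:        {audio_minutes(WORDS_PER_CHAPTER):.1f} min")
    print(f"1 worker at RTF 0.5: {compute_minutes(WORDS_PER_CHAPTER):.1f} min")
    print(f"4 workers:           {compute_minutes(WORDS_PER_CHAPTER, workers=GPU_WORKERS):.1f} min")
    print(f"RTF needed, 4 workers: {required_rtf(WORDS_PER_CHAPTER, LATENCY_TARGET_MIN, GPU_WORKERS):.2f}")
```

Under these assumptions a 20k-word chapter is about 133 minutes of audio, which takes roughly 67 minutes on one worker and 17 minutes on four. Meeting the 5-minute target would need a real-time factor near 0.15 per worker across four workers, so the two rows may rest on different assumptions about chapter length or parallelism.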
+
+---
+
+## Scalability Considerations
+
+### Phase 1 (MVP - Weeks 1-4)
+- Single-machine deployment
+- CLI-only interface
+- Local queue (in-memory)
+- Manual GPU provisioning
+
+### Phase 2 (Beta - Weeks 5-8)
+- FastAPI web interface
+- Redis queue for async jobs
+- Docker containerization
+- Cloud GPU option (RunPod, Lambda Labs)
+
+### Phase 3 (Production - Quarter 2)
+- Kubernetes cluster
+- Auto-scaling GPU workers
+- Multi-region deployment
+- CDN for file delivery
+
+---
+
+## Security Considerations
+
+- User audio files stored encrypted at rest
+- API authentication via API keys
+- Rate limiting: 100 requests/hour per tier
+- No third-party data sharing
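
The 100 requests/hour limit could be enforced with a sliding-window counter along these lines. Keying the limiter by API key is an assumption (the document says "per tier" without defining the key), and `SlidingWindowLimiter` is a hypothetical in-process class; a multi-worker deployment would need shared state.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key within any `window_seconds` span."""

    def __init__(self, limit=100, window_seconds=3600.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so tests can control time
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, api_key: str) -> bool:
        now = self.clock()
        hits = self._hits[api_key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

A sliding window avoids the burst at window boundaries that a fixed hourly counter permits, at the cost of storing one timestamp per accepted request.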
+
+---
+
+## Risks & Mitigations
+
+| Risk | Impact | Mitigation |
+|------|--------|------------|
+| GPU availability | High | Cloud GPU partnerships, queue-based scaling |
+| Model quality variance | Medium | Human review workflow for premium tier |
+| Format parsing edge cases | Low | Extensive test suite, graceful degradation |
+| Competition from big players | Medium | Focus on indie author niche, character voices |
+
+---
+
+## Next Steps
+
+1. **Week 1:** Set up development environment, create ADRs for key decisions
+2. **Weeks 2-3:** Implement MVP features (single-narrator, EPUB, MP3)
+3. **Week 4:** Beta testing with 5-10 indie authors
+4. **Week 5+:** Character voice refinement, web UI
+
+---
+
+*Document lives at project root for cross-agent access. Update with ADRs as decisions evolve.*