---
date: 2026-03-08
day_of_week: Sunday
task_id: FRE-17
title: Add Memory-Efficient Model Loading
status: done
completed_date: 2026-03-15
company_id: FrenoCorp
objective: Implement gradient checkpointing and mixed precision for lower VRAM usage
context: |
  - Qwen3-TTS 1.7B may not fit on low-end GPUs
  - Gradient checkpointing trades compute for memory
  - Mixed precision (FP16) reduces memory by half
issue_type: enhancement
priority: medium
assignee: Atlas
parent_task: FRE-32
goal_id: MVP_Pipeline_Working
blocking_tasks: []
expected_outcome: |
  - Model runs on GPUs with <8GB VRAM
  - Configurable precision (FP32/FP16/BF16)
  - Graceful degradation when memory insufficient
acceptance_criteria:
  - FP16 mode reduces memory usage by ~50%
  - Gradient checkpointing option available
  - Clear error when memory still insufficient

notes:
  - Use torch.cuda.amp for mixed precision
  - Set gradient_checkpointing=True in model config
  - "COMPLETED: Added memory-efficient model loading with auto-detection"

completion_notes: |
  Completed 2026-03-15. Deliverables:

  **New Parameters:**
  - `memory_efficient` (bool, default=True): Enable all memory-saving features
  - `use_gradient_checkpointing` (bool, default=False): Trade compute for memory
  - Enhanced `dtype` support with auto-selection based on available GPU memory

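  The parameters above can be captured as a small config sketch (the
  dataclass is illustrative only, not the actual constructor signature):

  ```python
  from dataclasses import dataclass
  from typing import Optional

  @dataclass
  class LoadConfig:
      memory_efficient: bool = True             # enable all memory-saving features
      use_gradient_checkpointing: bool = False  # trade compute for memory
      dtype: Optional[str] = None               # None -> auto-select from GPU memory
  ```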
  **New Methods:**
  - `_check_gpu_memory()`: Returns (total_gb, available_gb)
  - `_select_optimal_dtype(available_gb)`: Auto-selects fp32/bf16/fp16
  - `get_memory_stats()`: Returns dict with current GPU memory usage
  - `estimate_model_memory()`: Returns estimated memory for different precisions

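  A sketch of the auto-selection logic; the thresholds and the standalone
  function shape here are illustrative assumptions (the real method also
  reads live GPU state via `_check_gpu_memory()`):

  ```python
  def select_optimal_dtype(available_gb: float, bf16_supported: bool) -> str:
      """Pick the widest precision that fits: fp32 -> bf16 -> fp16."""
      if available_gb >= 6.8:   # full fp32 weights fit
          return "fp32"
      if available_gb >= 3.9:   # half precision fits
          # bf16 preferred on Ampere+ GPUs (compute capability >= 8.0)
          return "bf16" if bf16_supported else "fp16"
      return "fp16"             # last resort; may still OOM with a clear error
  ```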
  **Features:**
  - Auto-detects GPU memory and selects optimal dtype (bf16 for Ampere+, fp16 otherwise)
  - Graceful degradation: fp32 → bf16 → fp16 based on available memory
  - Enhanced OOM error messages with actionable suggestions
  - Memory stats reported on load/unload
  - Gradient checkpointing support for training scenarios

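  The degradation path can be sketched as follows; `load_with_fallback` and
  its signature are hypothetical (the real code lives in `tts_model.py`), and
  the model class is assumed to follow the Hugging Face `from_pretrained`
  convention:

  ```python
  import torch

  def load_with_fallback(model_cls, model_id: str,
                         dtypes=(torch.float32, torch.bfloat16, torch.float16)):
      """Try fp32 -> bf16 -> fp16, raising an actionable error if all fail."""
      last_err = None
      for dtype in dtypes:
          try:
              model = model_cls.from_pretrained(model_id, torch_dtype=dtype)
              model.gradient_checkpointing_enable()  # trade compute for memory
              return model
          except torch.cuda.OutOfMemoryError as err:
              last_err = err
              torch.cuda.empty_cache()  # release the failed allocation
      raise RuntimeError(
          "Model does not fit in GPU memory even at fp16; "
          "try CPU offload or a smaller model"
      ) from last_err
  ```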
  **Memory Estimates:**
  - FP32: ~6.8GB (1.7B params × 4 bytes + overhead)
  - FP16/BF16: ~3.9GB (50% reduction)
  - Minimum recommended: 4GB VRAM

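  The estimates follow directly from parameter count × bytes per element; a
  quick arithmetic check (helper name and the absence of an overhead term are
  illustrative, not the actual `estimate_model_memory()` implementation):

  ```python
  BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}

  def estimate_weight_memory_gb(n_params: float, dtype: str) -> float:
      """Weight memory in GB, before activation/runtime overhead."""
      return n_params * BYTES_PER_PARAM[dtype] / 1e9

  # 1.7B params: fp32 -> 6.8 GB; fp16/bf16 -> 3.4 GB (~3.9 GB with overhead)
  print(estimate_weight_memory_gb(1.7e9, "fp32"))  # 6.8
  print(estimate_weight_memory_gb(1.7e9, "fp16"))  # 3.4
  ```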
links:
  tts_model: /home/mike/code/AudiobookPipeline/src/generation/tts_model.py
---