task to get this here done

2026-06-12 13:20:33 -04:00
parent 6379860123
commit 34855eff55
7 changed files with 1307 additions and 85 deletions
--- a/tasks/hierarchical-model-upgrade/README.md
+++ b/tasks/hierarchical-model-upgrade/README.md
@@ -0,0 +1,63 @@
+# Hierarchical Model Architecture Upgrade
+
+**Scale**: 1.47M images across 11,499 disease-plant classes
+**Goal**: Replace flat MobileNetV2 (38-class PlantVillage) with hierarchical Swin-Tiny (species → disease)
+**Deployment**: Hybrid — lightweight browser model (TF.js) + full server model (ONNX Runtime)
+
+## Hardware
+
+| Machine        | Role                            | Specs                                    |
+| -------------- | ------------------------------- | ---------------------------------------- |
+| **Strix Halo** | Primary training + inference    | Ryzen AI Max+ 395 (ROCm), 128GB unified memory |
+| **RTX 3090**   | Secondary training / CUDA path  | 24GB VRAM                                |
+| **M3 Pro**     | Development only (work machine) | —                                        |
+
+**Key advantage**: Strix Halo's 128GB unified memory is shared between CPU and GPU, so the 1.47M-image dataset can be cached in RAM (provided its encoded size fits) and training can use very large effective batch sizes. The GPU draws from the same pool rather than from a separate, fixed VRAM budget.
+
+## Status Legend
+
+```
+[ ] not started    [~] in progress    [x] done    [-] skipped
+```
+
+## Task Map
+
+```
+Phase 1 ──→ Phase 2 ──→ Phase 3 ──→ Phase 4 ──→ Phase 5
+Dataset     Model       Model       Server      Integration
+Reorg       Training    Export      Inference   + Testing
+                        & Quant.    Pipeline
+```
+
+## Phases
+
+- [ ] [Phase 1 — Dataset Reorganization](01-dataset-reorganization.md)
+      Parse 11,499 flat directories into a hierarchical species→disease structure, create train/val splits, build species index.
+- [ ] [Phase 2 — Hierarchical Model Training](02-hierarchical-training.md)
+      Train Swin-Tiny backbone + species head + disease heads using PyTorch + ROCm on Strix Halo.
+- [ ] [Phase 3 — ONNX Export & Quantization](03-export-quantization.md)
+      Export trained models to ONNX, apply INT8 quantization, verify accuracy.
+- [ ] [Phase 4 — Server Inference Pipeline](04-server-inference.md)
+      Build server-side inference API with ONNX Runtime, OOD detection, species routing.
+- [ ] [Phase 5 — Browser Model & Hybrid Integration](05-browser-hybrid.md)
+      Lightweight TF.js model for client, hybrid confidence-based routing, full integration.
+
+## Dependencies
+
+```
+01 (dataset) ──→ 02 (training) ──→ 03 (export) ──→ 04 (server)
+                                                      │
+                                                      └──→ 05 (browser + hybrid)
+```
+
+## Exit Criteria
+
+- [ ] Species classifier achieves ≥95% top-1 accuracy on held-out val set
+- [ ] Disease classifiers achieve ≥90% top-3 accuracy per species
+- [ ] ONNX INT8 models run single-image inference in <200ms on CPU, <50ms on GPU
+- [ ] Browser TF.js model loads and runs in <100ms on mid-range devices
+- [ ] Hybrid routing works: high-confidence results served instantly from browser
+- [ ] Server fallback fires automatically when browser confidence is low
+- [ ] OOD detection rejects non-plant images with ≥99% precision
+- [ ] Full integration: upload → result in <500ms (browser) or <1s (server)
+- [ ] Existing app functionality preserved (all routes, pages, API endpoints)
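
The hybrid-routing criteria above (serve high-confidence browser results instantly, fall back to the server otherwise) could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `Prediction` type, the `route` function, the `server_predict` callback, and the 0.85 threshold are all hypothetical names and values chosen for the example, and the real browser-side logic would presumably live in the TF.js client rather than in Python.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Prediction:
    label: str
    confidence: float
    source: str  # "browser" or "server"


def route(
    browser_probs: Sequence[float],
    labels: Sequence[str],
    server_predict: Callable[[], Prediction],
    threshold: float = 0.85,  # hypothetical cutoff; would need tuning on validation data
) -> Prediction:
    """Return the browser result if it is confident enough, else ask the server."""
    if not labels or len(browser_probs) != len(labels):
        raise ValueError("browser_probs and labels must be non-empty and equal length")

    # Index of the most probable class according to the lightweight browser model.
    best = max(range(len(labels)), key=lambda i: browser_probs[i])

    if browser_probs[best] >= threshold:
        return Prediction(labels[best], browser_probs[best], "browser")

    # Low confidence: defer to the full server model (called lazily, only when needed).
    return server_predict()


if __name__ == "__main__":
    labels = ["tomato/early_blight", "tomato/late_blight", "tomato/healthy"]

    def fake_server() -> Prediction:
        # Stand-in for a request to the server inference API.
        return Prediction("tomato/late_blight", 0.97, "server")

    print(route([0.92, 0.05, 0.03], labels, fake_server))
    print(route([0.40, 0.35, 0.25], labels, fake_server))
```

Passing the server call as a callback keeps the fast path free of network work, which matters for the <500ms browser budget: the server is only contacted when the browser model's top probability falls below the cutoff.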