task to get this here done

2026-06-12 13:20:33 -04:00
parent 6379860123
commit 34855eff55
7 changed files with 1307 additions and 85 deletions
--- a/tasks/hierarchical-model-upgrade/01-dataset-reorganization.md
+++ b/tasks/hierarchical-model-upgrade/01-dataset-reorganization.md
@@ -0,0 +1,169 @@
+# Phase 1 — Dataset Reorganization
+
+**Blocked by**: Nothing
+**Blocks**: Phase 2 (training)
+**Est. time**: 2-3 days
+**Machine**: Strix Halo (fast I/O, 128GB RAM for in-memory processing)
+
+## Objective
+
+Reorganize the flat directory structure (`data/dataset/plant-disease-name/`) into a proper hierarchical layout (`data/organized/species/disease/`) with train/val splits and metadata files.
+
+## Current State
+
+- `data/dataset/` — 11,499 flat directories, each named `{plant}-{disease}`
+- Files mixed: `.jpg`, `.jpeg`, `.png`, `.webp` per directory
+- Total: ~1.47M images, 64-244 images per class (well-balanced)
+- **Total size: ~450 GB**
+- **SSD available**: 8TB NVMe (7,300 MB/s read, 6,300 MB/s write — PCIe 5.0)
+
+## Deliverables
+
+```
+data/
+├── organized/
+│   ├── train/                          # 85% of images
+│   │   ├── {species_1}/
+│   │   │   ├── healthy/
+│   │   │   ├── {disease_a}/
+│   │   │   └── {disease_b}/
+│   │   ├── {species_2}/
+│   │   └── ...
+│   ├── val/                            # 15% of images
+│   │   └── ... (mirrors train structure)
+│   ├── species_index.json              # Maps species → [disease IDs]
+│   ├── class_hierarchy.json            # Full mapping + metadata
+│   └── dataset_stats.json              # Counts per class, splits
+```
+
+## Steps
+
+### 1.1 Parse directory names → (species, disease) pairs
+
+**Problem**: Directory names like `acorn-squash-powdery-mildew` use the hyphen both inside multi-word names and as the plant/disease separator, so the split point is ambiguous. We need to reliably split the plant name from the disease name.
+
+**Approach**: Use `src/data/diseases.json` as ground truth. Try matching each directory name against known disease ID suffixes (sorted longest-first). The remainder is the plant name.
+
+**Fallback**: For unmatched dirs, build a plant prefix list from `src/data/plants.json` and try prefix matching (longest prefix first); the remainder is then the disease name. Log any truly unmatched dirs for manual review.
+
+**Script**: `scripts/organize-dataset.py`
+
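+```python
+# Sketch of the matching algorithm; `diseases`, `plants`, and `dataset_dirs`
+# are assumed to be loaded from the JSON files and a directory listing.
+from collections import defaultdict
+
+disease_ids = sorted([d["id"] for d in diseases], key=len, reverse=True)
+plant_names = [p["id"] for p in plants]  # or extract from dir prefixes
+hierarchy, unmatched = defaultdict(list), []
+
+for dir_name in dataset_dirs:
+    # Require the separating hyphen so an ID cannot match the tail of a longer word
+    matched_disease = next((d for d in disease_ids if dir_name.endswith("-" + d)), None)
+    if matched_disease is None:
+        unmatched.append(dir_name)  # left for the plant-prefix fallback / manual review
+        continue
+    plant = dir_name[:-(len(matched_disease) + 1)]  # +1 for hyphen
+    hierarchy[plant].append(matched_disease)
+```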
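Review note: a minimal sketch of how the suffix match and the plant-prefix fallback could be combined into one helper. The function name `split_dir_name` and the example IDs in the usage note are hypothetical; the real IDs would come from `src/data/diseases.json` and `src/data/plants.json`.

```python
from typing import Iterable, Optional, Tuple


def split_dir_name(
    dir_name: str, disease_ids: Iterable[str], plant_ids: Iterable[str]
) -> Optional[Tuple[str, str]]:
    """Return (plant, disease), or None if the name cannot be split."""
    # Primary rule: the longest known disease ID that ends the name wins.
    for disease in sorted(disease_ids, key=len, reverse=True):
        suffix = "-" + disease
        if dir_name.endswith(suffix) and len(dir_name) > len(suffix):
            return dir_name[: -len(suffix)], disease
    # Fallback: the longest known plant ID that starts the name wins;
    # whatever follows the hyphen is treated as the disease.
    for plant in sorted(plant_ids, key=len, reverse=True):
        prefix = plant + "-"
        if dir_name.startswith(prefix) and len(dir_name) > len(prefix):
            return plant, dir_name[len(prefix):]
    return None  # caller logs this directory for manual review
```

For example, with disease IDs `["mildew", "powdery-mildew", "healthy"]` and plant IDs `["acorn-squash"]`, the name `acorn-squash-powdery-mildew` splits into `("acorn-squash", "powdery-mildew")` because the longer suffix is tried first.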
+
+### 1.2 Split into train/val (85/15)
+
+Use stratified splitting per class to preserve class distribution.
+
+- For each disease-plant class, randomly assign 85% to train, 15% to val
+- Copy files (or symlink) to new directory structure
+- Verify no data leakage (no image, including an exact duplicate stored under a different file name, appears in both splits)
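Review note: the per-class 85/15 assignment above can be sketched as a seeded shuffle. The helper name `split_class`, the default seed, and the rounding rule for the validation count are assumptions, not something the plan specifies.

```python
import random


def split_class(class_name, filenames, val_fraction=0.15, seed=42):
    """Return (train, val) file lists for one plant-disease class."""
    # Sort first so the result does not depend on directory listing order.
    files = sorted(filenames)
    # Seed per class so each class gets its own reproducible shuffle.
    random.Random(f"{seed}:{class_name}").shuffle(files)
    n_val = round(len(files) * val_fraction)
    return files[n_val:], files[:n_val]
```

With this rounding, the smallest class (64 images) yields 54 train and 10 val images, and the largest (244 images) yields 207 and 37, so both stay close to the 85/15 target.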
+
+### 1.3 Build metadata files
+
+```json
+// species_index.json
+{
+  "tomato": ["healthy", "early-blight", "late-blight", "bacterial-spot", ...],
+  "acorn-squash": ["healthy", "powdery-mildew", "downy-mildew", ...],
+  ...
+}
+
+// dataset_stats.json
+{
+  "total_images": 1465818,
+  "total_species": 320,
+  "total_classes": 11499,
+  "images_per_class": { "min": 64, "max": 244, "mean": 127 },
+  "train_images": 1245945,
+  "val_images": 219873,
+  "species_disease_counts": {
+    "tomato": { "early-blight": 156, "late-blight": 142, ... }
+  }
+}
+```
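Review note: a minimal sketch of deriving both files from per-class split counts. The input shape (`{(species, disease): {"train": n, "val": m}}`) and the helper name `build_metadata` are assumptions; listing `healthy` first within each species is also an assumption, chosen to mirror the examples above.

```python
import json
from collections import defaultdict


def build_metadata(class_counts):
    """Build (species_index, dataset_stats) from per-class train/val counts."""
    species_index = defaultdict(list)
    per_species = defaultdict(dict)
    totals = []
    # Order by species, with "healthy" first and other diseases alphabetical.
    ordered = sorted(class_counts, key=lambda k: (k[0], k[1] != "healthy", k[1]))
    for species, disease in ordered:
        counts = class_counts[(species, disease)]
        n = counts["train"] + counts["val"]
        species_index[species].append(disease)
        per_species[species][disease] = n
        totals.append(n)
    stats = {
        "total_images": sum(totals),
        "total_species": len(species_index),
        "total_classes": len(totals),
        "images_per_class": {
            "min": min(totals),
            "max": max(totals),
            "mean": round(sum(totals) / len(totals)),
        },
        "train_images": sum(c["train"] for c in class_counts.values()),
        "val_images": sum(c["val"] for c in class_counts.values()),
        "species_disease_counts": dict(per_species),
    }
    return dict(species_index), stats


def write_json(obj, path):
    """Write one metadata file with stable, readable formatting."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, indent=2)
```

Computing the totals from the same counts that drive the split keeps `train_images + val_images` equal to `total_images` by construction.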
+
+### 1.4 Image normalization & compression (before splitting)
+
+**450GB is unnecessarily large for 224px training.** Many source images are high-resolution (e.g., 4000×3000 from phone cameras), but the model only sees 224×224 crops. Resizing to a reasonable max dimension BEFORE training cuts the bytes read and decoded per epoch, which enables faster epochs.
+
+**Strategy**: Resize all images to **max dimension of 512px** (preserving aspect ratio), convert to **JPEG quality 90**.
+
+| Approach                        | Est. Size     | Pros                                               | Cons                               |
+| ------------------------------- | ------------- | -------------------------------------------------- | ---------------------------------- |
+| **Keep originals**              | 450 GB        | No quality loss                                    | Slow loading, huge storage         |
+| **Resize 1024px max, JPEG 90**  | ~120 GB       | Good for future higher-res models                  | Still somewhat large               |
+| **Resize 512px max, JPEG 90 ✓** | **~60-80 GB** | **Fast loading, enough detail for 224px training** | Can't go back to full res          |
+| **Resize 256px max, JPEG 95**   | ~30 GB        | Fastest loading                                    | Too small if retrain at higher res |
+
+**Recommendation**: Resize to 512px max, JPEG q90. This:
+
+- Reduces storage from 450GB → ~70GB (fits in RAM for caching)
+- Preserves enough detail for 224×224 RandomResizedCrop augmentation
+- Uses JPEG, whose decoding is SIMD-accelerated by libjpeg-turbo (a fast decode path)
+- Yields a single format (no more mixed .png/.webp loading)
+
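+```python
+# resize_and_convert.py
+import os
+
+from joblib import Parallel, delayed
+from PIL import Image, ImageOps
+
+MAX_SIZE = 512
+QUALITY = 90
+
+def process_image(src_path, dst_path):
+    with Image.open(src_path) as img:
+        # Apply EXIF orientation so rotated phone photos are not saved sideways
+        img = ImageOps.exif_transpose(img)
+        # Convert to RGB first (handles RGBA PNGs; palette images would
+        # otherwise be resized with nearest-neighbour instead of LANCZOS)
+        if img.mode != 'RGB':
+            img = img.convert('RGB')
+        # Resize so max dimension = MAX_SIZE, preserving aspect ratio
+        w, h = img.size
+        if max(w, h) > MAX_SIZE:
+            ratio = MAX_SIZE / max(w, h)
+            new_size = (max(1, round(w * ratio)), max(1, round(h * ratio)))
+            img = img.resize(new_size, Image.LANCZOS)
+        # Save as JPEG
+        os.makedirs(os.path.dirname(dst_path), exist_ok=True)
+        img.save(dst_path, 'JPEG', quality=QUALITY, optimize=True)
+
+# image_pairs: list of (src_path, dst_path) tuples built from the parsed
+# hierarchy (not shown here); each dst_path should end in .jpg
+# Run in parallel (Strix Halo has many cores)
+Parallel(n_jobs=16)(
+    delayed(process_image)(src, dst)
+    for src, dst in image_pairs
+)
+```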
+
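+**Time estimate on Strix Halo**: At ~5-10ms per image with PIL+LANCZOS, 1.47M images amount to roughly 2-4 hours of single-core work, so 16 parallel workers should finish in well under an hour if throughput scales. Treat 2-3 hours as a conservative upper bound, since decoding large source images (e.g., 4000×3000) can take considerably longer than 10ms each.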
+
+### 1.5 Data quality checks
+
+- **Label noise**: Run confident learning (Cleanlab) on a sample to estimate the mislabel rate. Web-scraped datasets often have label noise on the order of 8-15%.
+- **Duplicate detection**: Check for near-duplicate images (perceptual hashing + Hamming distance) within each class.
+- **Format consistency**: Ensure all images decode successfully; remove corrupted files.
+- **Background bias**: Verify that no single background dominates a class (sample a random grid of images per class and inspect it visually).
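Review note: a minimal sketch of the Hamming-distance comparison behind the duplicate check. It assumes each image already has a 64-bit perceptual hash stored as an integer (computing the hash itself is left to whichever hashing library is chosen); the helper names and the default threshold of 5 bits are assumptions to be tuned on real data.

```python
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def near_duplicates(hashes, max_distance=5):
    """Return sorted (file_a, file_b) pairs whose hashes differ by <= max_distance bits.

    hashes: {filename: integer perceptual hash} for a single class.
    """
    return [
        (name_a, name_b)
        for (name_a, hash_a), (name_b, hash_b) in combinations(sorted(hashes.items()), 2)
        if hamming(hash_a, hash_b) <= max_distance
    ]
```

Comparing all pairs is affordable within a class: at the largest class size of 244 images that is 29,646 comparisons.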
+
+## Edge Cases & Gotchas
+
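+- **Multi-word plant names**: "acorn-squash", "fiddle-leaf-fig", "chili-pepper" — the disease suffix must match the end of the string, not a substring in the plant name. Anchoring the match at the end of the string handles this; sorting disease IDs by length (longest first) additionally ensures that a longer ID wins over a shorter ID that is itself a suffix of it.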
+- **Disease-less "healthy" dirs**: Need to ensure "healthy" is in the disease list as a valid class (index 0 in current model). Some dirs may be `{plant}-healthy`.
+- **Cross-platform path length**: Some species+disease combos produce long paths. Use relative symlinks or shorten names if needed on Windows.
+- **Original files preserved**: The existing `data/dataset/` structure stays untouched; `data/organized/` is a copy.
+
+## Verification
+
+- [ ] `data/organized/train/` and `data/organized/val/` together have the same total image count as the original (minus any corrupted files removed)
+- [ ] Every class has at least 50 training images
+- [ ] `species_index.json` covers all 11,499 classes
+- [ ] No files in both train/ and val/ (no overlap)
+- [ ] All images readable (no corrupted files)
+- [ ] Train/val split ratios consistent across all classes (±2%)
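Review note: several of the checklist items above can be automated. The sketch below covers the overlap, minimum-count, and split-ratio checks; the helper names are hypothetical, and the thresholds (50 training images, 15% val, ±2%) are taken from the checklist. Overlap is detected by file name only, so catching identical images stored under different names would additionally need a content hash.

```python
from pathlib import Path


def class_files(split_dir):
    """Map 'species/disease' -> set of file names under one split directory."""
    found = {}
    for species in Path(split_dir).iterdir():
        if not species.is_dir():
            continue
        for disease in species.iterdir():
            if disease.is_dir():
                key = f"{species.name}/{disease.name}"
                found[key] = {p.name for p in disease.iterdir() if p.is_file()}
    return found


def verify_splits(train, val, min_train=50, val_fraction=0.15, tolerance=0.02):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    for cls in sorted(set(train) | set(val)):
        t, v = train.get(cls, set()), val.get(cls, set())
        if t & v:
            problems.append(f"{cls}: {len(t & v)} file(s) in both splits")
        if len(t) < min_train:
            problems.append(f"{cls}: only {len(t)} training images")
        total = len(t) + len(v)
        if total and abs(len(v) / total - val_fraction) > tolerance:
            problems.append(f"{cls}: val share {len(v) / total:.1%} outside tolerance")
    return problems
```

A typical call would be `verify_splits(class_files("data/organized/train"), class_files("data/organized/val"))`, failing the phase if the returned list is non-empty.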