# Phase 1: Dataset Reorganization

**Blocked by**: Nothing
**Blocks**: Phase 2 (training)
**Est. time**: 2-3 days
**Machine**: Strix Halo (fast I/O, 128GB RAM for in-memory processing)

## Objective

Reorganize the flat directory structure (`data/dataset/plant-disease-name/`) into a hierarchical layout (`data/organized/species/disease/`) with train/val splits and metadata files.

## Current State

- `data/dataset/`: 11,499 flat directories, each named `{plant}-{disease}`
- Mixed file formats per directory: `.jpg`, `.jpeg`, `.png`, `.webp`
- Total: ~1.47M images, 64-244 images per class (well-balanced)
- **Total size: ~450 GB**
- **SSD available**: 8TB NVMe (7,300 MB/s read, 6,300 MB/s write, PCIe 5.0)

## Deliverables

```
data/
├── organized/
│   ├── train/                 # 85% of images
│   │   ├── {species_1}/
│   │   │   ├── healthy/
│   │   │   ├── {disease_a}/
│   │   │   └── {disease_b}/
│   │   ├── {species_2}/
│   │   └── ...
│   ├── val/                   # 15% of images
│   │   └── ... (mirrors train structure)
│   ├── species_index.json     # Maps species → [disease IDs]
│   ├── class_hierarchy.json   # Full mapping + metadata
│   └── dataset_stats.json     # Counts per class, splits
```

## Steps

### 1.1 Parse directory names → (species, disease) pairs

**Problem**: In directory names like `acorn-squash-powdery-mildew`, the hyphen is both the plant/disease separator and the word separator inside each name, so the split point is ambiguous. We need a reliable way to separate the plant name from the disease name.

**Approach**: Use `src/data/diseases.json` as ground truth. Match each directory name against the known disease IDs as suffixes (sorted longest-first). The remainder is the plant name.

**Fallback**: For unmatched dirs, build a plant prefix list from `src/data/plants.json` and try prefix matching. Log any dirs that still fail to match for manual review.

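The fallback described above could look roughly like the sketch below. It assumes plant IDs in `src/data/plants.json` are hyphenated the same way as the directory prefixes; `split_by_plant_prefix` is a hypothetical helper name, not something that exists in the repo.

```python
def split_by_plant_prefix(dir_name, plant_ids):
    """Split '{plant}-{disease}' using known plant IDs; return None if no plant matches."""
    # Longest-first so 'acorn-squash' wins over a shorter ID such as 'acorn'.
    for plant in sorted(plant_ids, key=len, reverse=True):
        # Require the trailing hyphen so a plant ID only matches a whole name.
        if dir_name.startswith(plant + "-"):
            return plant, dir_name[len(plant) + 1:]
    return None
```

A `None` result marks a directory that neither the disease-suffix pass nor this prefix pass could resolve, which is the set to log for manual review.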
**Script**: `scripts/organize-dataset.py`

```python
# Matching algorithm; assumes diseases.json is a list of objects with an "id" field.
import json
import os
from collections import defaultdict

with open("src/data/diseases.json") as f:
    disease_ids = sorted((d["id"] for d in json.load(f)), key=len, reverse=True)

hierarchy, unmatched = defaultdict(list), []
for dir_name in sorted(os.listdir("data/dataset")):
    # Require the leading hyphen so an ID only matches on a name boundary
    matched = next((d for d in disease_ids if dir_name.endswith("-" + d)), None)
    if matched is None:
        unmatched.append(dir_name)  # goes to the plant-prefix fallback, then manual review
        continue
    plant = dir_name[: -(len(matched) + 1)]  # +1 for the separating hyphen
    hierarchy[plant].append(matched)
```

### 1.2 Split into train/val (85/15)

Use stratified splitting per class to preserve the class distribution. Run this on the normalized images produced by step 1.4.

- For each plant-disease class, randomly assign 85% of its images to train and 15% to val
- Copy (or symlink) files into the new directory structure
- Verify there is no data leakage (the same image appearing in both splits)

### 1.3 Build metadata files

```json
// species_index.json
{
  "tomato": ["healthy", "early-blight", "late-blight", "bacterial-spot", ...],
  "acorn-squash": ["healthy", "powdery-mildew", "downy-mildew", ...],
  ...
}

// dataset_stats.json
{
  "total_images": 1465818,
  "total_species": 320,
  "total_classes": 11499,
  "images_per_class": { "min": 64, "max": 244, "mean": 127 },
  "train_images": 1245945,
  "val_images": 219873,
  "species_disease_counts": {
    "tomato": { "early-blight": 156, "late-blight": 142, ... }
  }
}
```

### 1.4 Image normalization & compression (before splitting)

**450GB is unnecessarily large for 224px training.** Many source images are high-resolution (e.g., 4000×3000 from phone cameras), but the model only sees 224×224 crops. Resizing to a reasonable max dimension BEFORE training cuts I/O substantially and enables faster epochs.

**Strategy**: Resize all images to a **max dimension of 512px** (preserving aspect ratio) and convert to **JPEG quality 90**.

| Approach                        | Est. Size     | Pros                                               | Cons                                     |
| ------------------------------- | ------------- | -------------------------------------------------- | ---------------------------------------- |
| **Keep originals**              | 450 GB        | No quality loss                                    | Slow loading, huge storage               |
| **Resize 1024px max, JPEG 90**  | ~120 GB       | Good for future higher-res models                  | Still somewhat large                     |
| **Resize 512px max, JPEG 90 ✓** | **~60-80 GB** | **Fast loading, enough detail for 224px training** | Can't go back to full res                |
| **Resize 256px max, JPEG 95**   | ~30 GB        | Fastest loading                                    | Too small if retraining at higher res    |

**Recommendation**: Resize to 512px max, JPEG q90. This:

- Reduces storage from 450GB → ~70GB, small enough to cache in the 128GB of RAM
- Preserves enough detail for 224×224 RandomResizedCrop augmentation
- Uses JPEG, whose decoding is SIMD-accelerated via libjpeg-turbo and is typically the fastest decode path
- Leaves a single format (no more mixed .png/.webp loading)

```python
# resize_and_convert.py
import os

from joblib import Parallel, delayed
from PIL import Image, ImageOps

MAX_SIZE = 512
QUALITY = 90


def process_image(src_path, dst_path):
    """Resize one image to at most MAX_SIZE px on its longer side and save it as JPEG."""
    with Image.open(src_path) as img:
        # Apply the EXIF orientation so phone photos are not saved rotated
        img = ImageOps.exif_transpose(img)
        # Downscale only; preserve aspect ratio and never round a side down to 0
        w, h = img.size
        if max(w, h) > MAX_SIZE:
            ratio = MAX_SIZE / max(w, h)
            new_size = (max(1, round(w * ratio)), max(1, round(h * ratio)))
            img = img.resize(new_size, Image.LANCZOS)
        # JPEG has no alpha channel or palette, so convert everything to RGB
        if img.mode != "RGB":
            img = img.convert("RGB")
        os.makedirs(os.path.dirname(dst_path), exist_ok=True)
        img.save(dst_path, "JPEG", quality=QUALITY, optimize=True)


# image_pairs: iterable of (source path, destination .jpg path) tuples,
# to be built from the directory mapping produced in step 1.1
Parallel(n_jobs=16)(
    delayed(process_image)(src, dst) for src, dst in image_pairs
)
```

**Time estimate on Strix Halo**: ~2-3 hours to resize + convert 1.47M images with 16 parallel workers. That budget corresponds to roughly 80-120 ms of CPU time per image (decoding a large source image, LANCZOS resize, JPEG encode). If the per-image cost turns out closer to 5-10 ms with PIL+LANCZOS, the whole run finishes in about 8-15 minutes instead, so benchmark on a few thousand images before planning around either figure.

### 1.5 Data quality checks

- **Label noise**: Run confident learning (cleanlab) on a sample to estimate the mislabel rate. Web-scraped datasets typically have 8-15% label noise.
- **Duplicate detection**: Check for near-duplicate images (perceptual hashing + Hamming distance) within each class.
- **Format consistency**: Ensure all images decode successfully; remove corrupted files.
- **Background bias**: Verify that no single background dominates a class (sample a random grid of images per class and inspect it visually).

## Edge Cases & Gotchas

- **Multi-word plant names**: "acorn-squash", "fiddle-leaf-fig", "chili-pepper". The disease ID must match the end of the directory name, not a substring inside the plant name. Anchoring the match at the end of the string handles this, and sorting disease IDs by length (longest first) ensures the most specific ID wins when one ID is a suffix of another.
- **Disease-less "healthy" dirs**: Ensure "healthy" is in the disease list as a valid class (index 0 in the current model). Some dirs may be named `{plant}-healthy`.
- **Cross-platform path length**: Some species+disease combos produce long paths. Use relative symlinks or shorten names if needed on Windows.
- **Original files preserved**: The existing `data/dataset/` structure stays untouched; `data/organized/` is a copy.

## Verification

- [ ] `data/organized/train/` plus `data/organized/val/` contain the same total image count as the original (minus any files removed as corrupted)
- [ ] Every class has at least 50 training images
- [ ] `species_index.json` covers all 11,499 classes
- [ ] No file appears in both train/ and val/ (no overlap)
- [ ] All images are readable (no corrupted files)
- [ ] Train/val split ratio is consistent across all classes (15% val, ±2%)
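
Most of the checklist above can be automated. The sketch below is one possible way to do it, assuming the `data/organized/train/{species}/{disease}/` and `data/organized/val/{species}/{disease}/` layout from the Deliverables section. `scan_split` and `check_splits` are hypothetical helper names, and overlap is detected by file name only, so images duplicated under different names would still need a content hash.

```python
from pathlib import Path


def scan_split(split_root):
    """Map 'species/disease' -> set of image file names under one split directory."""
    classes = {}
    for path in Path(split_root).glob("*/*/*"):
        if path.is_file():
            key = f"{path.parent.parent.name}/{path.parent.name}"
            classes.setdefault(key, set()).add(path.name)
    return classes


def check_splits(train, val, min_train=50, target_val=0.15, tolerance=0.02):
    """Return a list of human-readable problems; an empty list means every check passed."""
    problems = []
    for cls in sorted(set(train) | set(val)):
        t, v = train.get(cls, set()), val.get(cls, set())
        # Leakage check: the same file name must not appear in both splits
        overlap = t & v
        if overlap:
            problems.append(f"{cls}: {len(overlap)} file(s) in both splits")
        # Minimum training set size per class
        if len(t) < min_train:
            problems.append(f"{cls}: only {len(t)} training images")
        # Per-class val ratio must stay within the tolerance of the target
        total = len(t) + len(v)
        ratio = len(v) / total if total else 0.0
        if abs(ratio - target_val) > tolerance:
            problems.append(f"{cls}: val ratio {ratio:.3f} outside {target_val} ± {tolerance}")
    return problems
```

Typical use would be `check_splits(scan_split("data/organized/train"), scan_split("data/organized/val"))`, printing the returned list and treating a non-empty result as a failed verification.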