Phase 1 — Dataset Reorganization

Blocked by: Nothing
Blocks: Phase 2 (training)
Est. time: 2-3 days
Machine: Strix Halo (fast I/O, 128GB RAM for in-memory processing)

Objective

Reorganize the flat directory structure (data/dataset/plant-disease-name/) into a proper hierarchical layout (data/organized/species/disease/) with train/val splits and metadata files.

Current State

data/dataset/ — 11,499 flat directories, each named {plant}-{disease}
Mixed file formats within each directory: .jpg, .jpeg, .png, .webp
Total: ~1.47M images, 64-244 images per class (well-balanced)
Total size: ~450 GB
SSD available: 8TB NVMe (7,300 MB/s read, 6,300 MB/s write — PCIe 5.0)

Deliverables

data/
├── organized/
│   ├── train/                          # 85% of images
│   │   ├── {species_1}/
│   │   │   ├── healthy/
│   │   │   ├── {disease_a}/
│   │   │   └── {disease_b}/
│   │   ├── {species_2}/
│   │   └── ...
│   ├── val/                            # 15% of images
│   │   └── ... (mirrors train structure)
│   ├── species_index.json              # Maps species → [disease IDs]
│   ├── class_hierarchy.json            # Full mapping + metadata
│   └── dataset_stats.json              # Counts per class, splits

Steps

1.1 Parse directory names → (species, disease) pairs

Problem: Directory names like acorn-squash-powdery-mildew use the hyphen both inside names and as the plant/disease separator, so the split point is ambiguous. The plant name must be split from the disease name reliably.

Approach: Use src/data/diseases.json as ground truth. Try matching each directory name against known disease ID suffixes (sorted longest-first). The remainder is the plant name.

Fallback: For unmatched dirs, build a plant prefix list from src/data/plants.json and try prefix matching. Log any truly unmatched dirs for manual review.

Script: scripts/organize-dataset.py

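# Matching algorithm (diseases, plants, and dataset_dirs are loaded elsewhere):
from collections import defaultdict

# Longest IDs first, so "powdery-mildew" is tried before "mildew"
disease_ids = sorted((d["id"] for d in diseases), key=len, reverse=True)
plant_names = [p["id"] for p in plants]  # or extract from dir prefixes
hierarchy, unmatched = defaultdict(list), []

for dir_name in dataset_dirs:
    # Require the leading hyphen so a suffix cannot match mid-word
    matched_disease = next((d for d in disease_ids if dir_name.endswith("-" + d)), None)
    if matched_disease is None:
        unmatched.append(dir_name)  # left for the fallback / manual review
        continue
    plant = dir_name[:-(len(matched_disease) + 1)]  # +1 for hyphen
    hierarchy[plant].append(matched_disease)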
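
The suffix match and the plant-prefix fallback can be combined in one helper. This is a minimal sketch, not the contents of scripts/organize-dataset.py: the function name split_dir_name and its arguments are hypothetical, and it assumes disease and plant IDs are plain hyphenated strings.

```python
import logging

def split_dir_name(dir_name, disease_ids, plant_ids):
    """Return (plant, disease), or None if the name cannot be split."""
    # Primary pass: longest disease ID that ends the name after a hyphen.
    for disease in sorted(disease_ids, key=len, reverse=True):
        if dir_name.endswith("-" + disease):
            return dir_name[: -(len(disease) + 1)], disease
    # Fallback: longest known plant ID that starts the name before a hyphen.
    for plant in sorted(plant_ids, key=len, reverse=True):
        if dir_name.startswith(plant + "-"):
            return plant, dir_name[len(plant) + 1 :]
    # Neither list matched: log it for manual review.
    logging.warning("unmatched directory: %s", dir_name)
    return None
```

For example, split_dir_name("acorn-squash-powdery-mildew", ["mildew", "powdery-mildew"], ["acorn-squash"]) returns ("acorn-squash", "powdery-mildew"), because the longer disease ID is tried first.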

1.2 Split into train/val (85/15)

Use stratified splitting per class to preserve class distribution.

For each disease-plant class, randomly assign 85% to train, 15% to val
Copy files (or symlink) to new directory structure
Verify no data leakage (same image in both splits)
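
The per-class split might look like the sketch below. The function names, the fixed seed, and the "at least one val image" rule are assumptions for illustration. The disjointness check only catches the same path landing in both splits; near-duplicate content needs a separate check.

```python
import random

def split_class(files, val_fraction=0.15, seed=0):
    """Shuffle one class's files deterministically and return (train, val)."""
    ordered = sorted(files)               # independent of directory listing order
    random.Random(seed).shuffle(ordered)  # seeded, so reruns give the same split
    n_val = max(1, round(len(ordered) * val_fraction))
    return ordered[n_val:], ordered[:n_val]

def split_dataset(files_by_class, val_fraction=0.15, seed=0):
    """Split every class independently and check the two splits are disjoint."""
    train, val = {}, {}
    for cls, files in files_by_class.items():
        train[cls], val[cls] = split_class(files, val_fraction, seed)
        assert not set(train[cls]) & set(val[cls]), f"leakage in {cls}"
    return train, val
```

Splitting each class on its own is what makes the split stratified: every class ends up with roughly the same 85/15 ratio regardless of its size.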

1.3 Build metadata files

// species_index.json
{
  "tomato": ["healthy", "early-blight", "late-blight", "bacterial-spot", ...],
  "acorn-squash": ["healthy", "powdery-mildew", "downy-mildew", ...],
  ...
}

// dataset_stats.json
{
  "total_images": 1465818,
  "total_species": 320,
  "total_classes": 11499,
  "images_per_class": { "min": 64, "max": 244, "mean": 127 },
  "train_images": 1245945,
  "val_images": 219873,
  "species_disease_counts": {
    "tomato": { "early-blight": 156, "late-blight": 142, ... }
  }
}
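
Both files can be derived from a single nested count mapping. The sketch below assumes a hypothetical counts dict of the form species → {disease: image count}; build_metadata is an illustrative name, and sorting disease IDs alphabetically is an arbitrary choice. The train_images and val_images fields are left out because they come from the split step.

```python
def build_metadata(counts):
    """counts maps species -> {disease: image_count}; returns (species_index, stats)."""
    species_index = {species: sorted(diseases) for species, diseases in counts.items()}
    per_class = [n for diseases in counts.values() for n in diseases.values()]
    stats = {
        "total_images": sum(per_class),
        "total_species": len(counts),
        "total_classes": len(per_class),
        "images_per_class": {
            "min": min(per_class),
            "max": max(per_class),
            "mean": round(sum(per_class) / len(per_class)),
        },
        "species_disease_counts": counts,
    }
    return species_index, stats
```

Each returned dict can be written out with json.dump(obj, f, indent=2).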

1.4 Image normalization & compression (before splitting)

450GB is unnecessarily large for 224px training. Many source images are high-resolution (e.g., 4000×3000 from phone cameras), but the model only sees 224×224 crops. Resizing to a reasonable max dimension BEFORE training saves massive I/O and enables faster epochs.

Strategy: Resize all images to max dimension of 512px (preserving aspect ratio), convert to JPEG quality 90.

| Approach | Est. Size | Pros | Cons |
|---|---|---|---|
| Keep originals | 450 GB | No quality loss | Slow loading, huge storage |
| Resize 1024px max, JPEG 90 | ~120 GB | Good for future higher-res models | Still somewhat large |
| Resize 512px max, JPEG 90 ✓ | ~60-80 GB | Fast loading, enough detail for 224px training | Can't go back to full res |
| Resize 256px max, JPEG 95 | ~30 GB | Fastest loading | Too small if retrain at higher res |

Recommendation: Resize to 512px max, JPEG q90. This:

Reduces storage from 450GB → ~70GB (fits in RAM for caching)
Preserves enough detail for 224×224 RandomResizedCrop augmentation
Uses JPEG, whose decode path is the fastest available thanks to SIMD-accelerated libjpeg-turbo
Leaves a single format (no more mixed .png/.webp loading)

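# resize_and_convert.py
import os

from joblib import Parallel, delayed
from PIL import Image, ImageOps

MAX_SIZE = 512
QUALITY = 90

def process_image(src_path, dst_path):
    """Resize one image to MAX_SIZE on its longest side and save it as JPEG."""
    try:
        with Image.open(src_path) as img:
            # Apply the EXIF orientation so phone photos are not saved rotated
            img = ImageOps.exif_transpose(img)
            # Convert to RGB first (handles RGBA PNGs and palette images)
            if img.mode != 'RGB':
                img = img.convert('RGB')
            # Shrink so max dimension <= MAX_SIZE, preserving aspect ratio;
            # thumbnail() never enlarges and never rounds a side down to 0
            img.thumbnail((MAX_SIZE, MAX_SIZE), Image.LANCZOS)
            # Save as JPEG
            os.makedirs(os.path.dirname(dst_path), exist_ok=True)
            img.save(dst_path, 'JPEG', quality=QUALITY, optimize=True)
        return None
    except OSError as exc:
        # Corrupted or unreadable file: report it instead of aborting the run
        return (src_path, str(exc))

# image_pairs is assumed to be a list of (src_path, dst_path) tuples built elsewhere
# Run in parallel (Strix Halo has many cores)
results = Parallel(n_jobs=16)(
    delayed(process_image)(src, dst)
    for src, dst in image_pairs
)
failures = [r for r in results if r is not None]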

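Time estimate on Strix Halo: ~2-3 hours to resize + convert 1.47M images with 16 parallel workers. That budget corresponds to roughly 80-120 ms of worker time per image, which leaves headroom for decoding large source files. If the per-image cost with PIL+LANCZOS is closer to 5-10 ms, the whole pass finishes in roughly 8-15 minutes instead.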
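
The arithmetic behind the estimate is easy to rerun with measured numbers. This sketch assumes perfect scaling across workers and peak sequential read speed, both of which are optimistic; the function names are illustrative.

```python
def estimate_hours(n_images, ms_per_image, workers):
    """Wall-clock hours for the resize pass, assuming perfect parallel scaling."""
    return n_images * ms_per_image / 1000 / workers / 3600

def sequential_read_seconds(size_gb, mb_per_s):
    """Seconds to read the whole dataset once at the drive's sequential speed."""
    return size_gb * 1000 / mb_per_s
```

With the figures from this document, estimate_hours(1_470_000, 10, 16) is about 0.26 hours and estimate_hours(1_470_000, 100, 16) is about 2.55 hours, while sequential_read_seconds(450, 7300) is about 62 seconds. At those rates the pass is bound by CPU decode and resize time rather than by disk throughput.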

1.5 Data quality checks

Label noise: Run confident learning (cleanlab) on a sample to estimate the mislabel rate. Web-scraped datasets typically have 8-15% label noise.
Duplicate detection: Check for near-duplicate images (perceptual hashing + Hamming distance) within each class.
Format consistency: Ensure all images decode successfully; remove corrupted files.
Background bias: Verify that no single background dominates a class (subset and eyeball a random grid per class).
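
The duplicate check can be illustrated with a simple average hash. The sketch below swaps in a plain-Python hash over a small grayscale grid rather than a perceptual-hashing library; in practice the grid would come from something like Pillow's img.convert("L").resize((8, 8)), which is assumed here and not shown. The max_distance threshold of 4 bits is a placeholder to tune.

```python
def average_hash(pixels):
    """pixels: 2D list of grayscale values (e.g. 8x8). Returns the hash as an int."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        # One bit per pixel: 1 if brighter than the grid's mean, else 0.
        bits = (bits << 1) | (value > mean)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def near_duplicates(hashes, max_distance=4):
    """hashes: {filename: hash}. Returns filename pairs within max_distance bits."""
    names = sorted(hashes)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1 :]
        if hamming(hashes[a], hashes[b]) <= max_distance
    ]
```

The pairwise comparison is quadratic, which is acceptable when run per class, since no class has more than 244 images.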

Edge Cases & Gotchas

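Multi-word plant names: "acorn-squash", "fiddle-leaf-fig", "chili-pepper". The disease suffix must match the end of the string, not a substring in the plant name; anchoring the match at the end handles this. Sorting disease IDs by length (longest first) additionally ensures that a longer ID such as "powdery-mildew" wins over a shorter ID that is also a suffix.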
Disease-less "healthy" dirs: Need to ensure "healthy" is in the disease list as a valid class (index 0 in current model). Some dirs may be {plant}-healthy.
Cross-platform path length: Some species+disease combos produce long paths. Use relative symlinks or shorten names if needed on Windows.
Original files preserved: The existing data/dataset/ structure stays untouched; data/organized/ is a copy.

Verification

data/organized/train/ and data/organized/val/ together contain the same number of images as the original dataset (excluding any files removed as corrupted)
Every class has at least 50 training images
species_index.json covers all 11,499 classes
No files in both train/ and val/ (no overlap)
All images readable (no corrupted files)
Train/val split ratios consistent across all classes (±2%)
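
Several of these checks can be automated once the file lists per split are known. This is a sketch under assumptions: verify_splits is a hypothetical helper, and train and val are dicts mapping each class to its list of file names. The thresholds mirror the checklist above.

```python
def verify_splits(train, val, min_train=50, target_val=0.15, tolerance=0.02):
    """train/val map class -> list of file names. Returns a list of problems."""
    problems = []
    for cls in sorted(set(train) | set(val)):
        t, v = train.get(cls, []), val.get(cls, [])
        if set(t) & set(v):
            problems.append(f"{cls}: files present in both splits")
        if len(t) < min_train:
            problems.append(f"{cls}: only {len(t)} training images")
        total = len(t) + len(v)
        if total and abs(len(v) / total - target_val) > tolerance:
            problems.append(f"{cls}: val ratio {len(v) / total:.3f}")
    return problems
```

An empty return value means the overlap, minimum-count, and ratio checks all pass; readability of the images still has to be checked by decoding them.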
