Phase 1 — Dataset Reorganization

  • Blocked by: Nothing
  • Blocks: Phase 2 (training)
  • Est. time: 2-3 days
  • Machine: Strix Halo (fast I/O, 128GB RAM for in-memory processing)

Objective

Reorganize the flat directory structure (data/dataset/plant-disease-name/) into a proper hierarchical layout (data/organized/species/disease/) with train/val splits and metadata files.

Current State

  • data/dataset/ — 11,499 flat directories, each named {plant}-{disease}
  • Files mixed: .jpg, .jpeg, .png, .webp per directory
  • Total: ~1.47M images, 64-244 images per class (well-balanced)
  • Total size: ~450 GB
  • SSD available: 8TB NVMe (7,300 MB/s read, 6,300 MB/s write; PCIe 4.0-class speeds)

Deliverables

data/
├── organized/
│   ├── train/                          # 85% of images
│   │   ├── {species_1}/
│   │   │   ├── healthy/
│   │   │   ├── {disease_a}/
│   │   │   └── {disease_b}/
│   │   ├── {species_2}/
│   │   └── ...
│   ├── val/                            # 15% of images
│   │   └── ... (mirrors train structure)
│   ├── species_index.json              # Maps species → [disease IDs]
│   ├── class_hierarchy.json            # Full mapping + metadata
│   └── dataset_stats.json              # Counts per class, splits

Steps

1.1 Parse directory names → (species, disease) pairs

Problem: Directory names like acorn-squash-powdery-mildew use the hyphen both inside multi-word names and between the plant and the disease, so the split point is ambiguous. The plant name must be separated from the disease name reliably.

Approach: Use src/data/diseases.json as ground truth. Try matching each directory name against known disease ID suffixes (sorted longest-first). The remainder is the plant name.

Fallback: For unmatched dirs, build a plant prefix list from src/data/plants.json (also sorted longest-first) and try prefix matching; the remainder is then the disease name. Log any dirs that still do not match for manual review.

Script: scripts/organize-dataset.py

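# Matching algorithm; diseases is the list loaded from src/data/diseases.json,
# dataset_dirs the directory names under data/dataset/
from collections import defaultdict

disease_ids = sorted([d["id"] for d in diseases], key=len, reverse=True)
hierarchy, unmatched = defaultdict(list), []

for dir_name in dataset_dirs:
    # Require the separating hyphen so a disease ID cannot match mid-word
    matched_disease = next((d for d in disease_ids if dir_name.endswith("-" + d)), None)
    if matched_disease is None:
        unmatched.append(dir_name)  # try the plant-prefix fallback, else manual review
        continue
    plant = dir_name[:-(len(matched_disease) + 1)]  # +1 for hyphen
    hierarchy[plant].append(matched_disease)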
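
The suffix match and the plant-prefix fallback can be combined in one helper. This is a minimal sketch, assuming IDs are plain hyphenated strings like the directory names; `split_dir_name` and the sample IDs are hypothetical and not taken from scripts/organize-dataset.py.

```python
from typing import List, Optional, Tuple

def split_dir_name(dir_name: str, disease_ids: List[str],
                   plant_ids: List[str]) -> Optional[Tuple[str, str]]:
    """Return (plant, disease), or None if neither strategy matches."""
    # Primary: the longest known disease ID that ends the name after a hyphen.
    for disease in sorted(disease_ids, key=len, reverse=True):
        if dir_name.endswith("-" + disease):
            return dir_name[: -(len(disease) + 1)], disease
    # Fallback: the longest known plant ID that starts the name before a hyphen.
    for plant in sorted(plant_ids, key=len, reverse=True):
        if dir_name.startswith(plant + "-"):
            return plant, dir_name[len(plant) + 1:]
    return None  # caller logs the name for manual review

# Illustrative IDs only
diseases = ["mildew", "powdery-mildew", "healthy"]
plants = ["squash", "acorn-squash"]
print(split_dir_name("acorn-squash-powdery-mildew", diseases, plants))
print(split_dir_name("acorn-squash-leaf-curl", diseases, plants))
```

The first call resolves through the disease suffix, the second only through the plant prefix. Pairs recovered by the fallback introduce a disease name that is absent from diseases.json, so they probably deserve a review before being accepted.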

1.2 Split into train/val (85/15)

Use stratified splitting per class to preserve class distribution.

  • For each plant-disease class, randomly assign 85% of its images to train and 15% to val
  • Copy (or symlink) the files into the new directory structure
  • Verify there is no data leakage (no image appears in both splits)
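
The per-class split can be sketched as follows. This is a minimal sketch under stated assumptions: the fixed seed, the `split_class` name and the rounding rule for the val count are illustrative choices, not part of the plan above.

```python
import random
from typing import List, Tuple

def split_class(filenames: List[str], val_fraction: float = 0.15,
                seed: int = 42) -> Tuple[List[str], List[str]]:
    """Deterministically split one class's files into (train, val)."""
    files = sorted(filenames)        # stable order regardless of listing order
    rng = random.Random(seed)        # fixed seed (assumed) for reproducibility
    rng.shuffle(files)
    n_val = max(1, round(len(files) * val_fraction))
    return files[n_val:], files[:n_val]

# Smallest class in the dataset: 64 images
train, val = split_class([f"img_{i:03d}.jpg" for i in range(64)])
assert not set(train) & set(val)     # leakage check: no file in both splits
print(len(train), len(val))          # prints "54 10"
```

Splitting each class separately is what makes the split stratified. The assertion only catches identical file names; near-duplicate images that land in different splits would need the perceptual-hash check from step 1.5.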

1.3 Build metadata files

// species_index.json
{
  "tomato": ["healthy", "early-blight", "late-blight", "bacterial-spot", ...],
  "acorn-squash": ["healthy", "powdery-mildew", "downy-mildew", ...],
  ...
}

// dataset_stats.json
{
  "total_images": 1465818,
  "total_species": 320,
  "total_classes": 11499,
  "images_per_class": { "min": 64, "max": 244, "mean": 127 },
  "train_images": 1245945,
  "val_images": 219873,
  "species_disease_counts": {
    "tomato": { "early-blight": 156, "late-blight": 142, ... }
  }
}
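
Both files can be derived from a single species → disease → count mapping. The sketch below assumes such a mapping has already been collected while copying files; `build_metadata` is a hypothetical helper, and the train_images and val_images fields are left out because they depend on the split step.

```python
import json
from statistics import mean
from typing import Dict, Tuple

def build_metadata(counts: Dict[str, Dict[str, int]]) -> Tuple[dict, dict]:
    """counts maps species -> disease -> image count."""
    # List "healthy" first, then the remaining diseases alphabetically.
    species_index = {
        sp: sorted(ds, key=lambda d: (d != "healthy", d))
        for sp, ds in sorted(counts.items())
    }
    per_class = [n for ds in counts.values() for n in ds.values()]
    stats = {
        "total_images": sum(per_class),
        "total_species": len(counts),
        "total_classes": len(per_class),
        "images_per_class": {
            "min": min(per_class),
            "max": max(per_class),
            "mean": round(mean(per_class)),
        },
        "species_disease_counts": counts,
    }
    return species_index, stats

# Illustrative counts only
species_index, stats = build_metadata(
    {"tomato": {"late-blight": 142, "healthy": 100, "early-blight": 156}}
)
print(json.dumps(species_index))
print(json.dumps(stats, indent=2))
```

Writing the results with `json.dump(..., indent=2, sort_keys=True)` would keep diffs between runs small, at the cost of losing the healthy-first ordering inside nested objects.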

1.4 Image normalization & compression (before splitting)

450GB is unnecessarily large for 224px training. Many source images are high-resolution (e.g., 4000×3000 from phone cameras), but the model only sees 224×224 crops. Resizing to a reasonable max dimension BEFORE training saves massive I/O and enables faster epochs.

Strategy: Resize all images to max dimension of 512px (preserving aspect ratio), convert to JPEG quality 90.

| Approach | Est. size | Pros | Cons |
|---|---|---|---|
| Keep originals | 450 GB | No quality loss | Slow loading, huge storage |
| Resize 1024px max, JPEG 90 | ~120 GB | Good for future higher-res models | Still somewhat large |
| Resize 512px max, JPEG 90 ✓ | ~60-80 GB | Fast loading, enough detail for 224px training | Can't go back to full res |
| Resize 256px max, JPEG 95 | ~30 GB | Fastest loading | Too small if retrain at higher res |

Recommendation: Resize to 512px max, JPEG q90. This:

  • Reduces storage from 450 GB to ~70 GB (fits in the 128GB RAM for caching)
  • Preserves enough detail for 224×224 RandomResizedCrop augmentation
  • JPEG decodes quickly through SIMD-accelerated libjpeg-turbo, typically the fastest decode path
  • Single format (no more mixed .jpg/.png/.webp loading)
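
# resize_and_convert.py
import os

from joblib import Parallel, delayed
from PIL import Image, ImageOps

MAX_SIZE = 512
QUALITY = 90

def process_image(src_path, dst_path):
    """Resize one image to fit within MAX_SIZE and save it as an RGB JPEG."""
    with Image.open(src_path) as img:
        # Apply the EXIF orientation so phone photos are not saved rotated
        img = ImageOps.exif_transpose(img)
        # Resize so max dimension = MAX_SIZE, preserving aspect ratio
        w, h = img.size
        if max(w, h) > MAX_SIZE:
            ratio = MAX_SIZE / max(w, h)
            # max(1, ...) guards against a zero-sized side on extreme aspect ratios
            new_size = (max(1, round(w * ratio)), max(1, round(h * ratio)))
            img = img.resize(new_size, Image.LANCZOS)
        # Convert to RGB (handles RGBA PNGs, palette and grayscale images)
        if img.mode != 'RGB':
            img = img.convert('RGB')
        # Save as JPEG; dst_path is expected to end in .jpg
        os.makedirs(os.path.dirname(dst_path), exist_ok=True)
        img.save(dst_path, 'JPEG', quality=QUALITY, optimize=True)

if __name__ == '__main__':
    image_pairs = []  # fill with (src_path, dst_path) tuples from the step 1.1 mapping
    # Run in parallel (Strix Halo has many cores)
    Parallel(n_jobs=16)(
        delayed(process_image)(src, dst)
        for src, dst in image_pairs
    )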

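Time estimate on Strix Halo: ~2-3 hours to resize + convert 1.47M images with 16 parallel workers. That budget corresponds to roughly 80-120 ms per image per worker, which leaves room for decoding full-resolution phone photos. The ~5-10 ms per image figure for PIL+LANCZOS is likely optimistic for 4000×3000 sources; if it held, the whole job would finish in about 8-15 minutes. Benchmark on a few thousand images before committing to a schedule.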
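
The relation between per-image cost and wall-clock time is simple arithmetic and worth checking before scheduling the job. The sketch below assumes perfect scaling across workers and ignores disk I/O, both of which are simplifications; `wall_clock_hours` is a hypothetical helper.

```python
def wall_clock_hours(n_images: int, ms_per_image: float, workers: int) -> float:
    """Ideal wall-clock time in hours, assuming perfect scaling across workers."""
    return n_images * ms_per_image / 1000 / workers / 3600

for ms in (10, 80, 120):
    hours = wall_clock_hours(1_470_000, ms, 16)
    print(f"{ms:>4} ms/image -> {hours:.2f} h")
```

With 16 workers this gives about 0.26 h at 10 ms per image, 2.04 h at 80 ms and 3.06 h at 120 ms, so the per-image cost measured on a sample largely determines which end of the estimate applies.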

1.5 Data quality checks

  • Label noise: Run confident learning (Cleanlab) on a sample to estimate the mislabel rate. Web-scraped datasets typically have 8-15% label noise.
  • Duplicate detection: Check for near-duplicate images (perceptual hashing + Hamming distance) within each class.
  • Format consistency: Ensure all images decode successfully; remove corrupted files.
  • Background bias: Verify that no single background dominates a class (subset and eyeball a random grid per class).
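
The duplicate check above can be sketched with a difference hash (dHash), one common perceptual hash. This is a minimal sketch, assuming each image has already been reduced to a small grayscale grid (in practice a 9×8 downscale); the function names and the distance threshold are illustrative, not prescribed by this plan.

```python
from typing import Dict, List, Tuple

def dhash(gray: List[List[int]]) -> int:
    """Difference hash of a grayscale grid whose rows have equal width.

    Each bit records whether a pixel is brighter than its right neighbour.
    """
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | int(left > right)
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def near_duplicates(hashes: Dict[str, int],
                    max_distance: int = 4) -> List[Tuple[str, str]]:
    """All pairs within one class whose hashes differ by at most max_distance bits."""
    names = sorted(hashes)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if hamming(hashes[a], hashes[b]) <= max_distance]

# Tiny illustrative grids: b is a slightly perturbed copy of a, c is different
a = [[10, 20, 30], [30, 20, 10]]
b = [[11, 21, 29], [30, 20, 10]]
c = [[30, 20, 10], [10, 20, 30]]
hashes = {"a.jpg": dhash(a), "b.jpg": dhash(b), "c.jpg": dhash(c)}
print(near_duplicates(hashes, max_distance=1))  # prints [('a.jpg', 'b.jpg')]
```

The pairwise comparison is quadratic, which stays cheap here because each class holds at most 244 images.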

Edge Cases & Gotchas

  • Multi-word plant names: "acorn-squash", "fiddle-leaf-fig", "chili-pepper". The disease suffix must match the end of the string, not a substring inside the plant name. Anchoring the match at the end handles this, and sorting disease IDs by length (longest first) ensures a short ID never wins over a longer ID that ends with it.
  • Disease-less "healthy" dirs: Need to ensure "healthy" is in the disease list as a valid class (index 0 in current model). Some dirs may be {plant}-healthy.
  • Cross-platform path length: Some species+disease combos produce long paths. Use relative symlinks or shorten names if needed on Windows.
  • Original files preserved: The existing data/dataset/ structure stays untouched; data/organized/ is a copy.
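
The path-length concern can be checked before any files are copied. The sketch below assumes the legacy Windows limit of 260 characters (including the terminating NUL) and a hypothetical target root; `overlong_paths` is an illustrative helper.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

WINDOWS_MAX_PATH = 260  # legacy Win32 limit, counting the terminating NUL

def overlong_paths(rel_paths: Iterable[str], root: str,
                   limit: int = WINDOWS_MAX_PATH) -> List[str]:
    """Return the relative paths whose full length under root reaches the limit."""
    return [p for p in rel_paths
            if len(str(PurePosixPath(root) / p)) >= limit]

# Hypothetical root and file names, for illustration only
root = "C:/projects/plant-disease-id/data/organized"
candidates = [
    "train/acorn-squash/powdery-mildew/img_0001.jpg",
    "train/" + "x" * 250 + "/img_0001.jpg",
]
print(overlong_paths(candidates, root))
```

Only the second candidate is reported. Separators are counted the same way whether they are written as `/` or `\`, so the check is valid when run on a non-Windows machine.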

Verification

  • data/organized/train/ and data/organized/val/ together contain the same number of images as the original (minus any files removed by the quality checks)
  • Every class has at least 50 training images
  • species_index.json covers all 11,499 classes
  • No files in both train/ and val/ (no overlap)
  • All images readable (no corrupted files)
  • Train/val split ratios consistent across all classes (±2%)
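
Several of these checks can be automated with a short script. This is a minimal sketch, assuming the train/{species}/{disease}/ layout from the Deliverables section and that file names are preserved from the source, so an identical name in both splits indicates overlap; `verify_split` and its defaults are illustrative.

```python
from pathlib import Path
from typing import List

def verify_split(root: Path, min_train: int = 50,
                 target_val: float = 0.15, tol: float = 0.02) -> List[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    for class_dir in sorted((root / "train").glob("*/*")):
        if not class_dir.is_dir():
            continue
        rel = class_dir.relative_to(root / "train")
        train = {p.name for p in class_dir.iterdir() if p.is_file()}
        val_dir = root / "val" / rel
        val = ({p.name for p in val_dir.iterdir() if p.is_file()}
               if val_dir.is_dir() else set())
        overlap = train & val
        if overlap:
            problems.append(f"{rel}: {len(overlap)} file(s) in both splits")
        if len(train) < min_train:
            problems.append(f"{rel}: only {len(train)} training images")
        total = len(train) + len(val)
        if total and abs(len(val) / total - target_val) > tol:
            problems.append(f"{rel}: val ratio {len(val) / total:.3f}")
    return problems

if __name__ == "__main__":
    for line in verify_split(Path("data/organized")):
        print(line)
```

Checking that every image decodes and that species_index.json lists all classes is not covered here and would need separate passes.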