Phase 1 — Dataset Reorganization
- Blocked by: Nothing
- Blocks: Phase 2 (training)
- Est. time: 2-3 days
- Machine: Strix Halo (fast I/O, 128GB RAM for in-memory processing)
Objective
Reorganize the flat directory structure (data/dataset/plant-disease-name/) into a proper hierarchical layout (data/organized/species/disease/) with train/val splits and metadata files.
Current State
- data/dataset/: 11,499 flat directories, each named {plant}-{disease}
- Files mixed: .jpg, .jpeg, .png, .webp per directory
- Total: ~1.47M images, 64-244 images per class (well-balanced)
- Total size: ~450 GB
- SSD available: 8TB NVMe (7,300 MB/s read, 6,300 MB/s write — PCIe 5.0)
Deliverables
data/
├── organized/
│ ├── train/ # 85% of images
│ │ ├── {species_1}/
│ │ │ ├── healthy/
│ │ │ ├── {disease_a}/
│ │ │ └── {disease_b}/
│ │ ├── {species_2}/
│ │ └── ...
│ ├── val/ # 15% of images
│ │ └── ... (mirrors train structure)
│ ├── species_index.json # Maps species → [disease IDs]
│ ├── class_hierarchy.json # Full mapping + metadata
│ └── dataset_stats.json # Counts per class, splits
Steps
1.1 Parse directory names → (species, disease) pairs
Problem: Directory names like acorn-squash-powdery-mildew use the hyphen both inside names and as the plant/disease separator, so the split point is ambiguous. The plant name must be reliably separated from the disease name.
Approach: Use src/data/diseases.json as ground truth. Try matching each directory name against known disease ID suffixes (sorted longest-first). The remainder is the plant name.
Fallback: For unmatched dirs, build a plant prefix list from src/data/plants.json and try prefix matching (longest plant ID first). Log any truly unmatched dirs for manual review.
Script: scripts/organize-dataset.py
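# Sketch of the matching algorithm (diseases, plants, dataset_dirs are loaded elsewhere).
# The match requires a hyphen boundary so a disease ID cannot match mid-word.
from collections import defaultdict
disease_ids = sorted((d["id"] for d in diseases), key=len, reverse=True)
plant_names = [p["id"] for p in plants]  # or extract from dir prefixes
hierarchy = defaultdict(list)
unmatched = []  # handed to the plant-prefix fallback, then manual review
for dir_name in dataset_dirs:
    matched_disease = next((d for d in disease_ids if dir_name.endswith("-" + d)), None)
    if matched_disease is None:
        unmatched.append(dir_name)
        continue
    plant = dir_name[: -(len(matched_disease) + 1)]  # +1 for the hyphen
    hierarchy[plant].append(matched_disease)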
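The fallback for unmatched directories could be sketched as follows. This is a minimal illustration, assuming plant IDs use the same hyphenated form as the directory prefix; `split_by_plant_prefix` is a hypothetical helper name and the IDs in the example are made up, not taken from plants.json.

```python
def split_by_plant_prefix(dir_name, plant_ids):
    """Return (plant, disease) using the longest plant ID that prefixes dir_name.

    Returns None when no plant ID matches or nothing is left for the disease.
    """
    # Longest first so "chili-pepper" wins over a shorter ID such as "chili"
    for plant in sorted(plant_ids, key=len, reverse=True):
        prefix = plant + "-"
        if dir_name.startswith(prefix) and len(dir_name) > len(prefix):
            return plant, dir_name[len(prefix):]
    return None


# Illustrative IDs only
plant_ids = ["chili", "chili-pepper", "acorn-squash"]
print(split_by_plant_prefix("chili-pepper-leaf-curl", plant_ids))  # ('chili-pepper', 'leaf-curl')
print(split_by_plant_prefix("unknown-plant-rust", plant_ids))      # None
```

Directories for which this also returns None are the ones to log for manual review.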
1.2 Split into train/val (85/15)
Use stratified splitting per class to preserve class distribution.
- For each disease-plant class, randomly assign 85% to train, 15% to val
- Copy files (or symlink) to new directory structure
- Verify no data leakage (same image in both splits)
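A per-class split along these lines might look like the sketch below. It assumes a fixed seed (42 here, an arbitrary choice) and that each class's files are passed in as a list of names; `split_class` is a hypothetical helper, not part of the existing script.

```python
import random


def split_class(filenames, val_fraction=0.15, seed=42):
    """Deterministically split one class's files into (train, val) lists."""
    files = sorted(filenames)  # fixed order so the same seed gives the same split
    rng = random.Random(seed)
    rng.shuffle(files)
    # Keep at least one validation image even for very small classes
    n_val = max(1, round(len(files) * val_fraction))
    return files[n_val:], files[:n_val]


files = [f"img_{i:03d}.jpg" for i in range(100)]
train, val = split_class(files)
print(len(train), len(val))  # 85 15
```

Because each file lands in exactly one of the two returned slices, train and val cannot overlap within a class. Sorting before shuffling makes the split reproducible regardless of directory listing order.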
1.3 Build metadata files
// species_index.json
{
"tomato": ["healthy", "early-blight", "late-blight", "bacterial-spot", ...],
"acorn-squash": ["healthy", "powdery-mildew", "downy-mildew", ...],
...
}
// dataset_stats.json
{
"total_images": 1465818,
"total_species": 320,
"total_classes": 11499,
"images_per_class": { "min": 64, "max": 244, "mean": 127 },
"train_images": 1245945,
"val_images": 219873,
"species_disease_counts": {
"tomato": { "early-blight": 156, "late-blight": 142, ... }
}
}
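Both files can be derived from a single pass over the per-class image counts. The sketch below assumes the counts are already collected as a mapping from (species, disease) to an integer; `build_metadata` is a hypothetical helper, and the acorn-squash count in the example is illustrative.

```python
import json
from collections import defaultdict


def build_metadata(class_counts):
    """Build (species_index, stats) from {(species, disease): image_count}."""
    species_index = defaultdict(list)
    for species, disease in sorted(class_counts):
        species_index[species].append(disease)
    counts = list(class_counts.values())
    stats = {
        "total_images": sum(counts),
        "total_species": len(species_index),
        "total_classes": len(counts),
        "images_per_class": {
            "min": min(counts),
            "max": max(counts),
            "mean": round(sum(counts) / len(counts)),
        },
    }
    return dict(species_index), stats


example = {
    ("tomato", "early-blight"): 156,
    ("tomato", "late-blight"): 142,
    ("acorn-squash", "powdery-mildew"): 100,  # illustrative count
}
index, stats = build_metadata(example)
print(json.dumps(index, indent=2))
print(json.dumps(stats, indent=2))
```

Deriving the statistics from the same counts that produce the index keeps total_classes and species_index.json consistent by construction.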
1.4 Image normalization & compression (run before the 1.2 split)
450GB is unnecessarily large for 224px training. Many source images are high-resolution (e.g., 4000×3000 from phone cameras), but the model only sees 224×224 crops. Resizing to a reasonable max dimension BEFORE training shrinks the bytes read and decoded per image, which shortens every epoch.
Strategy: Resize all images to max dimension of 512px (preserving aspect ratio), convert to JPEG quality 90.
| Approach | Est. Size | Pros | Cons |
|---|---|---|---|
| Keep originals | 450 GB | No quality loss | Slow loading, huge storage |
| Resize 1024px max, JPEG 90 | ~120 GB | Good for future higher-res models | Still somewhat large |
| Resize 512px max, JPEG 90 ✓ | ~60-80 GB | Fast loading, enough detail for 224px training | Can't go back to full res |
| Resize 256px max, JPEG 95 | ~30 GB | Fastest loading | Too small if retrain at higher res |
Recommendation: Resize to 512px max, JPEG q90. This:
- Reduces storage from 450GB → ~70GB (fits in RAM for caching)
- Preserves enough detail for 224×224 RandomResizedCrop augmentation
- JPEG decodes quickly via SIMD-accelerated libjpeg-turbo, typically the fastest decode path on CPU
- Single format (no more .png/.webp mixed loading)
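# resize_and_convert.py
import os
from joblib import Parallel, delayed
from PIL import Image, ImageOps
MAX_SIZE = 512
QUALITY = 90
def process_image(src_path, dst_path):
    """Resize one image to at most MAX_SIZE px on its long side and save it as JPEG."""
    with Image.open(src_path) as img:
        # Bake in EXIF rotation, since the saved JPEG carries no orientation tag
        img = ImageOps.exif_transpose(img)
        # Convert to RGB first (handles RGBA PNGs and palette images, which
        # would otherwise be resized with nearest-neighbour filtering)
        if img.mode != 'RGB':
            img = img.convert('RGB')
        # Resize so max dimension = MAX_SIZE, preserving aspect ratio
        w, h = img.size
        if max(w, h) > MAX_SIZE:
            ratio = MAX_SIZE / max(w, h)
            new_size = (max(1, round(w * ratio)), max(1, round(h * ratio)))
            img = img.resize(new_size, Image.LANCZOS)
        # Save as JPEG (dst_path is expected to end in .jpg)
        os.makedirs(os.path.dirname(dst_path), exist_ok=True)
        img.save(dst_path, 'JPEG', quality=QUALITY, optimize=True)
# image_pairs: iterable of (src_path, dst_path) tuples, built by the caller
# Run in parallel (Strix Halo has many cores)
Parallel(n_jobs=16)(
    delayed(process_image)(src, dst)
    for src, dst in image_pairs
)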
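Time estimate on Strix Halo: budget ~2-3 hours to resize + convert 1.47M images with 16 parallel workers. The ~5-10ms per-image figure for PIL+LANCZOS likely only holds for small inputs: at that rate 1.47M images is about 2-4 core-hours, or roughly 8-15 minutes across 16 workers. The larger budget allows for decoding full-resolution sources (e.g., 4000×3000), which takes considerably longer per image. Benchmark a few thousand files before committing to a schedule.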
1.5 Data quality checks
- Label noise: Run confident learning (cleanlab) on a sample to estimate mislabel rate. Web-scraped datasets typically have 8-15% label noise.
- Duplicate detection: Check for near-duplicate images (perceptual hashing + Hamming distance) within each class.
- Format consistency: Ensure all images decode successfully; remove corrupted files.
- Background bias: Verify that no single background dominates a class (subset and eyeball a random grid per class).
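The duplicate check can be illustrated with a difference hash (dHash) compared by Hamming distance. This is a pure-Python sketch of the idea, not the project's implementation: it assumes each image has already been reduced to a small grayscale grid (for a 64-bit hash, 8 rows of 9 values, for instance from a grayscale image downscaled with Pillow), and the distance threshold of 4 bits is an assumption to tune on real data.

```python
def dhash_bits(gray_rows):
    """Difference hash: one bit per horizontally adjacent pixel pair.

    gray_rows is a list of rows of grayscale values. 8 rows of 9 values
    yield a 64-bit hash.
    """
    bits = 0
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def near_duplicates(hashes, max_distance=4):
    """Return name pairs whose hashes differ by at most max_distance bits."""
    names = sorted(hashes)
    return [
        (x, y)
        for i, x in enumerate(names)
        for y in names[i + 1:]
        if hamming(hashes[x], hashes[y]) <= max_distance
    ]


# Tiny 2x3 grids for illustration; "b" is "a" with slight brightness noise
grids = {
    "a": [[10, 20, 30], [30, 20, 10]],
    "b": [[12, 21, 33], [29, 22, 9]],
    "c": [[30, 20, 10], [10, 20, 30]],
}
hashes = {name: dhash_bits(grid) for name, grid in grids.items()}
print(near_duplicates(hashes, max_distance=1))  # [('a', 'b')]
```

The pairwise comparison is quadratic, which is acceptable when run per class (at most 244 images each) but would need bucketing or an index to run across the whole dataset.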
Edge Cases & Gotchas
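- Multi-word plant names: "acorn-squash", "fiddle-leaf-fig", "chili-pepper". The disease ID must match the end of the directory name at a hyphen boundary, not a substring inside the plant name. Sorting disease IDs by length (longest first) additionally ensures that when one ID is a suffix of another (for example, a hypothetical "mildew" alongside "powdery-mildew"), the more specific one wins.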
- Disease-less "healthy" dirs: Need to ensure "healthy" is in the disease list as a valid class (index 0 in current model). Some dirs may be {plant}-healthy.
- Cross-platform path length: Some species+disease combos produce long paths. Use relative symlinks or shorten names if needed on Windows.
- Original files preserved: The existing data/dataset/ structure stays untouched; data/organized/ is a copy.
Verification
- data/organized/train/ has same total image count as original (minus val split)
- Every class has at least 50 training images
- species_index.json covers all 11,499 classes
- No files in both train/ and val/ (no overlap)
- All images readable (no corrupted files)
- Train/val split ratios consistent across all classes (±2%)
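Several of these checks can be automated. The sketch below is one possible shape, assuming the train/{species}/{disease}/ layout from the Deliverables section and that files keep their source filenames, so overlap can be detected by name; `verify_split` is a hypothetical helper and covers only the count, overlap, and ratio checks.

```python
from pathlib import Path


def verify_split(root, min_train=50, target_val=0.15, tolerance=0.02):
    """Return a list of human-readable problems found under root/train and root/val."""
    root = Path(root)
    train_root = root / "train"
    problems = []
    # Class directories sit two levels down: train/{species}/{disease}/
    for class_dir in sorted(p for p in train_root.glob("*/*") if p.is_dir()):
        rel = class_dir.relative_to(train_root)
        train = {f.name for f in class_dir.iterdir() if f.is_file()}
        val_dir = root / "val" / rel
        val = set()
        if val_dir.is_dir():
            val = {f.name for f in val_dir.iterdir() if f.is_file()}
        if len(train) < min_train:
            problems.append(f"{rel}: only {len(train)} training images")
        overlap = train & val
        if overlap:
            problems.append(f"{rel}: {len(overlap)} files in both splits")
        total = len(train) + len(val)
        if total and abs(len(val) / total - target_val) > tolerance:
            problems.append(f"{rel}: val ratio {len(val) / total:.3f}")
    return problems


# Example call (path from the Deliverables section):
# for problem in verify_split("data/organized"):
#     print(problem)
```

An empty return value means every class passed. Name-based overlap detection would miss the same image saved under two different names, which is what the perceptual-hash check in step 1.5 is for.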