# 02. Label Mapping Layer Implementation

meta:
  id: production-ml-pipeline-02
  feature: production-ml-pipeline
  priority: P0
  depends_on: [production-ml-pipeline-01]
  tags: [implementation, knowledge-base, tests-required]

objective:

- Expand the knowledge base to cover all PlantVillage plants and diseases
- Rewrite `src/lib/ml/labels.ts` to use the PlantVillage class mapping from task 01
- Ensure every model output index resolves to a valid KB disease or the "healthy" sentinel
- The label layer must be the single source of truth for model-index → disease mapping

deliverables:

- Updated `src/data/plants.json`: 10 new PlantVillage plants added (apple, blueberry, cherry, corn, grape, orange, peach, potato, raspberry, soybean)
- Updated `src/data/diseases.json`: 19 new disease entries added for PlantVillage diseases not yet in KB
- `src/lib/ml/labels.ts`: fully rewritten to use PlantVillage class mapping
- `src/lib/ml/labels.test.ts`: updated to validate against new mapping
- `scripts/seed-plantvillage-kb.ts`: DB migration script to insert new plants and diseases into Turso

steps:

1.
   **Add 10 new plants to `src/data/plants.json`**, each with proper metadata:

   ```typescript
   // New plants needed (PlantVillage coverage). `newPlants` is an illustrative
   // name; each object becomes one entry in src/data/plants.json.
   const newPlants = [
     { id: "apple", commonName: "Apple", scientificName: "Malus domestica", family: "Rosaceae", category: "fruit" },
     { id: "cherry", commonName: "Cherry", scientificName: "Prunus avium", family: "Rosaceae", category: "fruit" },
     { id: "corn", commonName: "Corn (Maize)", scientificName: "Zea mays", family: "Poaceae", category: "vegetable" },
     { id: "grape", commonName: "Grape", scientificName: "Vitis vinifera", family: "Vitaceae", category: "fruit" },
     { id: "orange", commonName: "Orange", scientificName: "Citrus sinensis", family: "Rutaceae", category: "fruit" },
     { id: "peach", commonName: "Peach", scientificName: "Prunus persica", family: "Rosaceae", category: "fruit" },
     { id: "potato", commonName: "Potato", scientificName: "Solanum tuberosum", family: "Solanaceae", category: "vegetable" },
     { id: "blueberry", commonName: "Blueberry", scientificName: "Vaccinium corymbosum", family: "Ericaceae", category: "fruit" },
     { id: "raspberry", commonName: "Raspberry", scientificName: "Rubus idaeus", family: "Rosaceae", category: "fruit" },
     { id: "soybean", commonName: "Soybean", scientificName: "Glycine max", family: "Fabaceae", category: "vegetable" },
   ];
   ```

   - Add `imageUrl` for each (use Wikipedia pageimages, same pattern as `fill-plant-images.ts`)
   - Add `careSummary` for each

2.
   **Add 19 new diseases to `src/data/diseases.json`**, each with full structured data:

   - Use the template-based approach from `scripts/disease-templates.ts` where possible
   - Source disease details from:
     - UW-Madison PDDC factsheets (pddc.wisc.edu)
     - Cornell Plant Clinic (plantclinic.cornell.edu)
     - University extension publications
   - Each disease must have: `id`, `plantId`, `name`, `scientificName`, `causalAgentType`, `description`, `symptoms` (≥3), `causes` (≥2), `treatment` (≥3), `prevention` (≥2), `lookalikeDiseaseIds`, `severity`, `prevalence`
   - New disease entries needed:
     - apple-scab, apple-black-rot, apple-cedar-apple-rust (plant: apple)
     - cherry-powdery-mildew (plant: cherry)
     - corn-gray-leaf-spot, corn-common-rust, corn-northern-leaf-blight (plant: corn)
     - grape-black-rot, grape-esca, grape-leaf-blight (plant: grape)
     - orange-citrus-greening (plant: orange)
     - peach-bacterial-spot (plant: peach)
     - potato-early-blight, potato-late-blight (plant: potato)
     - tomato-leaf-mold, tomato-spider-mites, tomato-target-spot, tomato-yellow-leaf-curl-virus, tomato-mosaic-virus (plant: tomato)
   - Use a programmatic approach: write a generator script that pulls from UW-Madison PDDC / Cornell factsheets and Wikipedia, following the same pattern as `scripts/generate-full-kb.ts`

3. **Update `lookalikeDiseaseIds`**: cross-reference within new diseases:
   - Apple scab ↔ Apple black rot (both cause leaf spots on apple)
   - Potato early blight ↔ Potato late blight (both affect potato foliage)
   - Grape black rot ↔ Grape esca (both cause fruit rot)
   - Tomato early blight ↔ Tomato septoria leaf spot ↔ Tomato target spot (all cause leaf lesions)
   - Tomato leaf mold ↔ Tomato septoria leaf spot (both cause leaf spots in humid conditions)

4.
   **Rewrite `src/lib/ml/labels.ts`** to use the PlantVillage mapping:

   ```typescript
   import { PLANTVILLAGE_CLASSES } from "./plantvillage-classes";

   // Total number of output classes produced by the model
   export const NUM_CLASSES = 38;

   // Model output index (0–37) → KB disease ID.
   // Healthy classes and out-of-range indices both resolve to the "healthy" sentinel.
   export function getDiseaseIdForIndex(index: number): string {
     const entry = PLANTVILLAGE_CLASSES[index];
     if (!entry || entry.isHealthy) return "healthy";
     return entry.diseaseId;
   }

   // Model output index → plant ID ("unknown" if the index is out of range)
   export function getPlantIdForIndex(index: number): string {
     return PLANTVILLAGE_CLASSES[index]?.plantId ?? "unknown";
   }

   // True only for PlantVillage "healthy" classes; false for out-of-range indices
   export function isHealthyClass(index: number): boolean {
     return PLANTVILLAGE_CLASSES[index]?.isHealthy ?? false;
   }

   // Disease ID → model output index (reverse lookup); returns -1 if not found
   export function getIndexForDiseaseId(diseaseId: string): number {
     const entry = PLANTVILLAGE_CLASSES.find((c) => c.diseaseId === diseaseId.toLowerCase());
     return entry?.index ?? -1;
   }
   ```

5. **Remove old assumptions**: the old `labels.ts` assumed 95 classes (93 diseases + healthy + unknown). Delete all references to `diseases.json` index ordering from `labels.ts`. The mapping is now defined by `plantvillage-classes.ts`, not by JSON file order.

6. **Create DB migration script** `scripts/seed-plantvillage-kb.ts`:
   - Read updated `src/data/plants.json` and `src/data/diseases.json`
   - Insert new plants and diseases into Turso DB using Drizzle ORM
   - Use UPSERT (INSERT OR REPLACE) to be idempotent
   - Log what was inserted/updated

7. **Run the migration** to populate the DB with new data.
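The lookalike links in step 3 are written as bidirectional pairs (↔), which is easy to get wrong when editing JSON by hand. The following is a minimal sketch of one way to enforce that symmetry in a generator script. `DiseaseLike`, `LookalikeGroup`, and `linkLookalikes` are hypothetical names, not part of the existing codebase, and the sketch assumes only that each disease record carries an `id` and a `lookalikeDiseaseIds` array as listed in step 2.

```typescript
// Hypothetical minimal shape: only the fields this sketch touches.
interface DiseaseLike {
  id: string;
  lookalikeDiseaseIds: string[];
}

// A group lists disease IDs that should all reference one another.
type LookalikeGroup = readonly string[];

/**
 * Returns a copy of `diseases` in which every member of a group lists all
 * other members of that group as lookalikes. Existing links are kept,
 * duplicates and self-references are dropped, and an unknown ID in a group
 * raises an error. The input array and its records are not mutated.
 */
function linkLookalikes<T extends DiseaseLike>(
  diseases: readonly T[],
  groups: readonly LookalikeGroup[],
): T[] {
  // Start from the links each record already has.
  const links = new Map<string, Set<string>>();
  for (const d of diseases) {
    links.set(d.id, new Set(d.lookalikeDiseaseIds));
  }

  // Add every other group member to each member's link set.
  for (const group of groups) {
    for (const id of group) {
      const own = links.get(id);
      if (!own) {
        throw new Error(`Unknown disease id in lookalike group: ${id}`);
      }
      for (const other of group) {
        if (other !== id) own.add(other);
      }
    }
  }

  // Emit new records with sorted, de-duplicated link lists.
  return diseases.map((d) => {
    const own = links.get(d.id) ?? new Set<string>();
    own.delete(d.id); // a disease is never its own lookalike
    return { ...d, lookalikeDiseaseIds: Array.from(own).sort() };
  });
}
```

For example, passing the group `["apple-scab", "apple-black-rot"]` leaves each of the two records listing the other. Treating a chained entry such as the three tomato leaf-lesion diseases as a single group links all three to each other, which is one possible reading of the chained ↔ in step 3.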
tests:

- Unit: `labels.test.ts` validates all 38 indices map correctly
- Unit: `getDiseaseIdForIndex(29)` returns `"early-blight"`
- Unit: `getDiseaseIdForIndex(3)` returns `"healthy"` (Apple healthy class)
- Unit: `getIndexForDiseaseId("early-blight")` returns `29`
- Unit: `isHealthyClass(37)` returns `true` (Tomato healthy)
- Unit: `isHealthyClass(29)` returns `false` (Tomato Early_blight)
- Unit: `getPlantIdForIndex(0)` returns `"apple"`
- Unit: All 26 non-healthy diseaseIds resolve to real DB entries via `getDiseaseById()`
- Integration: `scripts/seed-plantvillage-kb.ts` runs without errors, inserts all 10 plants and 19 diseases
- Integration: After seeding, DB query for each new disease returns a complete record

acceptance_criteria:

- `PLANTVILLAGE_CLASSES` (imported by `labels.ts` from `plantvillage-classes.ts`) has exactly 38 entries matching model output order
- 12 healthy indices correctly return "healthy" from `getDiseaseIdForIndex()`
- 26 disease indices correctly return valid diseaseIds
- All 10 new plants exist in `src/data/plants.json` with valid metadata and imageUrl
- All 19 new diseases exist in `src/data/diseases.json` with full structured data (symptoms, treatment, prevention, etc.)
- DB migration script runs successfully, all new data queryable from Turso
- Old `diseases.json` ordering assumption is completely removed from labels.ts
- All existing tests still pass (no regressions in browse, search, detail pages)

validation:

- `npx vitest run src/lib/ml/labels.test.ts`
- `npx vitest run src/lib/ml/plantvillage-classes.test.ts`
- `npx tsx scripts/seed-plantvillage-kb.ts`: verify output shows correct inserts
- `npx vitest run`: full test suite passes
- Manual: query DB for each new plant/disease and verify complete data

notes:

- Disease data must come from authoritative sources (university extension services), not hand-written
- Use the same template-based generation approach from `scripts/generate-full-kb.ts` for consistency
- The `pepper-bacterial-wilt` disease already exists: map `Pepper___Bacterial_spot` to it even though it's not a perfect match (it's the closest available)
- Blueberry, Raspberry, and Soybean only have "healthy" classes in PlantVillage: add plant entries but no disease entries for these (they don't need new disease IDs since they always map to "healthy")
- Total disease count after this task: 93 (existing) + 19 (new) = 112 diseases
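
The acceptance criteria on the class mapping lend themselves to a mechanical check. The sketch below is one possible validator, assuming each entry in `PLANTVILLAGE_CLASSES` has `index`, `plantId`, `diseaseId`, and `isHealthy` fields (the fields used by the `labels.ts` code in step 4); the real shape is defined in task 01 and may differ. `ClassEntry` and `validateClassMapping` are hypothetical names.

```typescript
// Assumed entry shape; the authoritative definition lives in plantvillage-classes.ts.
interface ClassEntry {
  index: number;
  plantId: string;
  diseaseId: string; // ignored by this check when isHealthy is true
  isHealthy: boolean;
}

/**
 * Returns a list of human-readable problems; an empty list means the mapping
 * passed every check:
 *   1. the number of entries equals `expectedCount`
 *   2. each entry's `index` equals its array position (model output order)
 *   3. every non-healthy `diseaseId` exists in the KB
 *   4. no non-healthy `diseaseId` is used twice (keeps reverse lookup unambiguous)
 */
function validateClassMapping(
  classes: readonly ClassEntry[],
  kbDiseaseIds: ReadonlySet<string>,
  expectedCount: number,
): string[] {
  const problems: string[] = [];

  if (classes.length !== expectedCount) {
    problems.push(`expected ${expectedCount} classes, found ${classes.length}`);
  }

  const firstIndexById = new Map<string, number>();
  classes.forEach((entry, position) => {
    if (entry.index !== position) {
      problems.push(`entry at position ${position} has index ${entry.index}`);
    }
    if (entry.isHealthy) return; // healthy classes resolve to the sentinel

    if (!kbDiseaseIds.has(entry.diseaseId)) {
      problems.push(`index ${entry.index}: unknown diseaseId "${entry.diseaseId}"`);
    }
    const firstIndex = firstIndexById.get(entry.diseaseId);
    if (firstIndex !== undefined) {
      problems.push(
        `diseaseId "${entry.diseaseId}" is used by indices ${firstIndex} and ${entry.index}`,
      );
    } else {
      firstIndexById.set(entry.diseaseId, entry.index);
    }
  });

  return problems;
}
```

In `labels.test.ts`, this could be called with the imported `PLANTVILLAGE_CLASSES`, the set of disease IDs loaded from the KB, and `NUM_CLASSES`, asserting that the returned list is empty. Returning all problems rather than throwing on the first one makes a failing test report every broken index at once.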