plant-disease-id/apps/web/tasks/production-ml-pipeline/02-label-mapping-implementation.md
2026-06-06 15:09:46 -04:00

# 02. Label Mapping Layer Implementation
meta:
id: production-ml-pipeline-02
feature: production-ml-pipeline
priority: P0
depends_on: [production-ml-pipeline-01]
tags: [implementation, knowledge-base, tests-required]
objective:
- Expand the knowledge base to cover all PlantVillage plants and diseases
- Rewrite `src/lib/ml/labels.ts` to use the PlantVillage class mapping from task 01
- Ensure every model output index resolves to a valid KB disease or the "healthy" sentinel
- The label layer must be the single source of truth for model-index → disease mapping
deliverables:
- Updated `src/data/plants.json` — 10 new PlantVillage plants added (apple, blueberry, cherry, corn, grape, orange, peach, potato, raspberry, soybean)
- Updated `src/data/diseases.json` — 19 new disease entries added for PlantVillage diseases not yet in KB
- `src/lib/ml/labels.ts` — fully rewritten to use PlantVillage class mapping
- `src/lib/ml/labels.test.ts` — updated to validate against new mapping
- `scripts/seed-plantvillage-kb.ts` — DB migration script to insert new plants and diseases into Turso
steps:
1. **Add 10 new plants to `src/data/plants.json`** — each with proper metadata:
```typescript
// New plants needed for PlantVillage coverage.
// NEW_PLANTS is an illustrative name; the entries themselves go into src/data/plants.json.
const NEW_PLANTS = [
  { id: "apple", commonName: "Apple", scientificName: "Malus domestica", family: "Rosaceae", category: "fruit" },
  { id: "cherry", commonName: "Cherry", scientificName: "Prunus avium", family: "Rosaceae", category: "fruit" },
  { id: "corn", commonName: "Corn (Maize)", scientificName: "Zea mays", family: "Poaceae", category: "vegetable" },
  { id: "grape", commonName: "Grape", scientificName: "Vitis vinifera", family: "Vitaceae", category: "fruit" },
  { id: "orange", commonName: "Orange", scientificName: "Citrus sinensis", family: "Rutaceae", category: "fruit" },
  { id: "peach", commonName: "Peach", scientificName: "Prunus persica", family: "Rosaceae", category: "fruit" },
  { id: "potato", commonName: "Potato", scientificName: "Solanum tuberosum", family: "Solanaceae", category: "vegetable" },
  { id: "blueberry", commonName: "Blueberry", scientificName: "Vaccinium corymbosum", family: "Ericaceae", category: "fruit" },
  { id: "raspberry", commonName: "Raspberry", scientificName: "Rubus idaeus", family: "Rosaceae", category: "fruit" },
  { id: "soybean", commonName: "Soybean", scientificName: "Glycine max", family: "Fabaceae", category: "vegetable" },
];
```
- Add `imageUrl` for each (use Wikipedia pageimages, same pattern as `fill-plant-images.ts`)
- Add `careSummary` for each
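A small validator can catch incomplete plant entries before they reach `plants.json`. The sketch below is illustrative only: the `PlantEntry` interface and `findPlantEntryProblems` helper are hypothetical names, and the kebab-case rule is an assumption drawn from the IDs used in this task.

```typescript
// Hypothetical shape for an entry in src/data/plants.json. Field names follow
// the listing above plus the imageUrl and careSummary fields this step asks for.
interface PlantEntry {
  id: string;
  commonName: string;
  scientificName: string;
  family: string;
  category: string;
  imageUrl?: string;
  careSummary?: string;
}

const REQUIRED_PLANT_FIELDS: (keyof PlantEntry)[] = [
  "id", "commonName", "scientificName", "family", "category", "imageUrl", "careSummary",
];

// Returns a list of human-readable problems; an empty list means the entry passes.
function findPlantEntryProblems(plant: PlantEntry): string[] {
  const problems: string[] = [];
  const label = plant.id || "(no id)";
  for (const field of REQUIRED_PLANT_FIELDS) {
    const value = plant[field];
    if (typeof value !== "string" || value.trim() === "") {
      problems.push(`${label}: missing ${field}`);
    }
  }
  // Assumption: IDs are lowercase kebab-case, as in "apple" or "apple-scab".
  if (plant.id && !/^[a-z]+(-[a-z]+)*$/.test(plant.id)) {
    problems.push(`${label}: id is not lowercase kebab-case`);
  }
  return problems;
}
```

Running this over every new entry and failing on a non-empty result keeps the `imageUrl` and `careSummary` requirements from being silently skipped.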
2. **Add 19 new diseases to `src/data/diseases.json`** — each with full structured data:
- Use the template-based approach from `scripts/disease-templates.ts` where possible
- Source disease details from:
  - UW-Madison PDDC factsheets (pddc.wisc.edu)
  - Cornell Plant Clinic (plantclinic.cornell.edu)
  - University extension publications
- Each disease must have: `id`, `plantId`, `name`, `scientificName`, `causalAgentType`, `description`, `symptoms` (≥3), `causes` (≥2), `treatment` (≥3), `prevention` (≥2), `lookalikeDiseaseIds`, `severity`, `prevalence`
- New disease entries needed:
  - apple-scab, apple-black-rot, apple-cedar-apple-rust (plant: apple)
  - cherry-powdery-mildew (plant: cherry)
  - corn-gray-leaf-spot, corn-common-rust, corn-northern-leaf-blight (plant: corn)
  - grape-black-rot, grape-esca, grape-leaf-blight (plant: grape)
  - orange-citrus-greening (plant: orange)
  - peach-bacterial-spot (plant: peach)
  - potato-early-blight, potato-late-blight (plant: potato)
  - tomato-leaf-mold, tomato-spider-mites, tomato-target-spot, tomato-yellow-leaf-curl-virus, tomato-mosaic-virus (plant: tomato)
- Use a programmatic approach: write a generator script that pulls from the UW-Madison PDDC and Cornell factsheets and from Wikipedia, following the same pattern as `scripts/generate-full-kb.ts`
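The per-field minimums above are easy to enforce mechanically in the generator. This is a minimal sketch, not the project's actual schema: `DiseaseEntry`, `MIN_ITEMS`, and `findDiseaseEntryProblems` are hypothetical names, and `severity`, `prevalence`, and `causalAgentType` are typed as plain strings because their allowed values are not given here.

```typescript
// Hypothetical shape for an entry in src/data/diseases.json, using the
// required field names listed in this step.
interface DiseaseEntry {
  id: string;
  plantId: string;
  name: string;
  scientificName: string;
  causalAgentType: string;
  description: string;
  symptoms: string[];
  causes: string[];
  treatment: string[];
  prevention: string[];
  lookalikeDiseaseIds: string[];
  severity: string;
  prevalence: string;
}

// Minimum list lengths taken from the requirements above.
const MIN_ITEMS = { symptoms: 3, causes: 2, treatment: 3, prevention: 2 } as const;
type ListField = keyof typeof MIN_ITEMS;

// Returns a list of problems; an empty list means the entry passes.
function findDiseaseEntryProblems(d: DiseaseEntry, knownPlantIds: Set<string>): string[] {
  const problems: string[] = [];
  if (!knownPlantIds.has(d.plantId)) {
    problems.push(`${d.id}: unknown plantId "${d.plantId}"`);
  }
  for (const field of Object.keys(MIN_ITEMS) as ListField[]) {
    if (d[field].length < MIN_ITEMS[field]) {
      problems.push(`${d.id}: ${field} needs at least ${MIN_ITEMS[field]} items, has ${d[field].length}`);
    }
  }
  if (d.lookalikeDiseaseIds.indexOf(d.id) !== -1) {
    problems.push(`${d.id}: lists itself as a lookalike`);
  }
  return problems;
}
```

Passing the set of plant IDs in explicitly keeps the check independent of how `plants.json` is loaded.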
3. **Update lookalikeDiseaseIds** — cross-reference within new diseases:
- Apple scab ↔ Apple black rot (both cause leaf spots on apple)
- Potato early blight ↔ Potato late blight (both affect potato foliage)
- Grape black rot ↔ Grape esca (both cause fruit rot)
- Tomato early blight ↔ Tomato septoria leaf spot ↔ Tomato target spot (all cause leaf lesions)
- Tomato leaf mold ↔ Tomato septoria leaf spot (both cause leaf spots in humid conditions)
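Because each `↔` above is a two-way relationship, both diseases in a pair should list each other. One way to guarantee that is to derive `lookalikeDiseaseIds` from a list of pairs, as in this sketch. `buildLookalikeMap` is a hypothetical helper, and the pairs shown use only IDs that appear in this task; the existing septoria leaf spot ID is not given here, so it is left out.

```typescript
type LookalikePair = [string, string];

// Builds a symmetric lookup: if A is linked to B, B is linked to A.
// Duplicate pairs and self-links are ignored.
function buildLookalikeMap(pairs: LookalikePair[]): Map<string, string[]> {
  const links = new Map<string, Set<string>>();
  const link = (from: string, to: string): void => {
    let targets = links.get(from);
    if (!targets) {
      targets = new Set<string>();
      links.set(from, targets);
    }
    targets.add(to);
  };
  for (const [a, b] of pairs) {
    if (a === b) continue;
    link(a, b);
    link(b, a);
  }
  // Sort each list so the generated JSON is stable between runs.
  const result = new Map<string, string[]>();
  links.forEach((targets, id) => result.set(id, Array.from(targets).sort()));
  return result;
}

const LOOKALIKE_PAIRS: LookalikePair[] = [
  ["apple-scab", "apple-black-rot"],
  ["potato-early-blight", "potato-late-blight"],
  ["grape-black-rot", "grape-esca"],
  ["early-blight", "tomato-target-spot"],
];

const lookalikeMap = buildLookalikeMap(LOOKALIKE_PAIRS);
```

Each disease's `lookalikeDiseaseIds` can then be read from `lookalikeMap.get(id) ?? []`, which avoids one-sided links.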
4. **Rewrite `src/lib/ml/labels.ts`** to use the PlantVillage mapping:
```typescript
import { PLANTVILLAGE_CLASSES } from "./plantvillage-classes";

// Total number of output classes produced by the model.
export const NUM_CLASSES = 38;

// Index 0-37 → disease ID. Healthy classes, and any index with no entry,
// resolve to the "healthy" sentinel.
export function getDiseaseIdForIndex(index: number): string {
  const entry = PLANTVILLAGE_CLASSES[index];
  if (!entry || entry.isHealthy) return "healthy";
  return entry.diseaseId;
}

// Index → plant ID, or "unknown" when the index has no entry.
export function getPlantIdForIndex(index: number): string {
  return PLANTVILLAGE_CLASSES[index]?.plantId ?? "unknown";
}

// True only for indices that map to a healthy class.
export function isHealthyClass(index: number): boolean {
  return PLANTVILLAGE_CLASSES[index]?.isHealthy ?? false;
}

// Disease ID → index (reverse lookup); returns -1 when no class matches.
export function getIndexForDiseaseId(diseaseId: string): number {
  const entry = PLANTVILLAGE_CLASSES.find((c) => c.diseaseId === diseaseId.toLowerCase());
  return entry?.index ?? -1;
}
```
5. **Remove old assumptions** — the old labels.ts assumed 95 classes (93 diseases + healthy + unknown). Delete all references to `diseases.json` index ordering from labels.ts. The mapping is now defined by `plantvillage-classes.ts`, not by JSON file order.
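Since the mapping now lives entirely in `plantvillage-classes.ts`, its invariants are worth checking directly. The sketch below assumes an entry shape inferred from how `labels.ts` reads the table (`index`, `plantId`, `diseaseId`, `isHealthy`); the real type comes from task 01 and may differ. `PlantVillageClass` and `findClassTableProblems` are hypothetical names.

```typescript
// Assumed entry shape, inferred from the fields labels.ts reads.
interface PlantVillageClass {
  index: number;
  plantId: string;
  diseaseId: string; // value used for healthy classes is not specified in this task
  isHealthy: boolean;
}

// Checks the invariants the label layer relies on:
//  - the table has the expected number of entries,
//  - each entry's index equals its array position (so classes[i] is class i),
//  - every non-healthy entry has a diseaseId, and no diseaseId is reused
//    (otherwise the reverse lookup would be ambiguous).
function findClassTableProblems(classes: PlantVillageClass[], expectedCount: number): string[] {
  const problems: string[] = [];
  if (classes.length !== expectedCount) {
    problems.push(`expected ${expectedCount} classes, found ${classes.length}`);
  }
  const firstIndexById = new Map<string, number>();
  classes.forEach((entry, position) => {
    if (entry.index !== position) {
      problems.push(`entry at position ${position} has index ${entry.index}`);
    }
    if (entry.isHealthy) return;
    if (!entry.diseaseId) {
      problems.push(`index ${entry.index}: non-healthy class has no diseaseId`);
      return;
    }
    const firstIndex = firstIndexById.get(entry.diseaseId);
    if (firstIndex !== undefined) {
      problems.push(`index ${entry.index}: diseaseId "${entry.diseaseId}" already used by index ${firstIndex}`);
    } else {
      firstIndexById.set(entry.diseaseId, entry.index);
    }
  });
  return problems;
}
```

In a test this would be called as `findClassTableProblems(PLANTVILLAGE_CLASSES, NUM_CLASSES)` and expected to return an empty list.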
6. **Create DB migration script** `scripts/seed-plantvillage-kb.ts`:
- Read updated `src/data/plants.json` and `src/data/diseases.json`
- Insert new plants and diseases into Turso DB using Drizzle ORM
- Use an upsert (e.g. `INSERT OR REPLACE`) so the script is idempotent and safe to re-run
- Log what was inserted/updated
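The idempotency and logging requirements can be illustrated without a database. In this sketch an in-memory `Map` stands in for a Turso table, and `upsertAll` is a hypothetical helper; the real script would issue the equivalent upsert through Drizzle (for example with its `onConflictDoUpdate` builder) rather than writing to a `Map`.

```typescript
interface Row {
  id: string;
}

interface UpsertReport {
  inserted: string[];
  updated: string[];
}

// Writes every row keyed by id and records whether each id was new or already
// present. Running it twice with the same rows leaves the store unchanged.
function upsertAll<T extends Row>(store: Map<string, T>, rows: T[]): UpsertReport {
  const report: UpsertReport = { inserted: [], updated: [] };
  for (const row of rows) {
    if (store.has(row.id)) {
      report.updated.push(row.id);
    } else {
      report.inserted.push(row.id);
    }
    store.set(row.id, row);
  }
  return report;
}

// Stand-in for the plants table; the rows are trimmed to two fields for brevity.
const plantTable = new Map<string, { id: string; commonName: string }>();
const newPlantRows = [
  { id: "apple", commonName: "Apple" },
  { id: "cherry", commonName: "Cherry" },
];

const firstRun = upsertAll(plantTable, newPlantRows);
const secondRun = upsertAll(plantTable, newPlantRows);
console.log(`first run: ${firstRun.inserted.length} inserted, ${firstRun.updated.length} updated`);
console.log(`second run: ${secondRun.inserted.length} inserted, ${secondRun.updated.length} updated`);
```

The second run reports only updates and leaves the row count unchanged, which is the behavior the migration log should show when the script is re-run.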
7. **Run the migration** to populate the DB with new data.
tests:
- Unit: `labels.test.ts` validates all 38 indices map correctly
- Unit: `getDiseaseIdForIndex(29)` returns `"early-blight"`
- Unit: `getDiseaseIdForIndex(3)` returns `"healthy"` (Apple healthy class)
- Unit: `getIndexForDiseaseId("early-blight")` returns `29`
- Unit: `isHealthyClass(37)` returns `true` (Tomato healthy)
- Unit: `isHealthyClass(29)` returns `false` (Tomato Early_blight)
- Unit: `getPlantIdForIndex(0)` returns `"apple"`
- Unit: All 26 non-healthy diseaseIds resolve to real DB entries via `getDiseaseById()`
- Integration: `scripts/seed-plantvillage-kb.ts` runs without errors, inserts all 10 plants and 19 diseases
- Integration: After seeding, DB query for each new disease returns a complete record
acceptance_criteria:
- `PLANTVILLAGE_CLASSES` (imported by labels.ts from `plantvillage-classes.ts`) has exactly 38 entries matching model output order
- 12 healthy indices correctly return "healthy" from `getDiseaseIdForIndex()`
- 26 disease indices correctly return valid diseaseIds
- All 10 new plants exist in `src/data/plants.json` with valid metadata and imageUrl
- All 19 new diseases exist in `src/data/diseases.json` with full structured data (symptoms, treatment, prevention, etc.)
- DB migration script runs successfully, all new data queryable from Turso
- Old `diseases.json` ordering assumption is completely removed from labels.ts
- All existing tests still pass (no regressions in browse, search, detail pages)
validation:
- `npx vitest run src/lib/ml/labels.test.ts`
- `npx vitest run src/lib/ml/plantvillage-classes.test.ts`
- `npx tsx scripts/seed-plantvillage-kb.ts` — verify output shows correct inserts
- `npx vitest run` — full test suite passes
- Manual: query DB for each new plant/disease and verify complete data
notes:
- Disease data must come from authoritative sources (university extension services), not hand-written
- Use the same template-based generation approach from `scripts/generate-full-kb.ts` for consistency
- The `pepper-bacterial-wilt` disease already exists; map `Pepper,_bell___Bacterial_spot` to it even though it is not an exact match (bacterial spot and bacterial wilt are different diseases, but it is the closest available KB entry)
- Blueberry, Raspberry, and Soybean only have "healthy" classes in PlantVillage — add plant entries but no disease entries for these (they don't need new disease IDs since they always map to "healthy")
- Total disease count after this task: 93 (existing) + 19 (new) = 112 diseases
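The totals above can be cross-checked against the list of new disease IDs from step 2. The sketch below restates that list grouped by plant; `NEW_DISEASE_IDS` and `EXISTING_DISEASE_COUNT` are illustrative names, not identifiers from the codebase.

```typescript
// New disease IDs from step 2, grouped by plant.
const NEW_DISEASE_IDS: Record<string, string[]> = {
  apple: ["apple-scab", "apple-black-rot", "apple-cedar-apple-rust"],
  cherry: ["cherry-powdery-mildew"],
  corn: ["corn-gray-leaf-spot", "corn-common-rust", "corn-northern-leaf-blight"],
  grape: ["grape-black-rot", "grape-esca", "grape-leaf-blight"],
  orange: ["orange-citrus-greening"],
  peach: ["peach-bacterial-spot"],
  potato: ["potato-early-blight", "potato-late-blight"],
  tomato: [
    "tomato-leaf-mold",
    "tomato-spider-mites",
    "tomato-target-spot",
    "tomato-yellow-leaf-curl-virus",
    "tomato-mosaic-virus",
  ],
};

// Disease count in the KB before this task, as stated in the notes.
const EXISTING_DISEASE_COUNT = 93;

const newDiseaseCount = Object.keys(NEW_DISEASE_IDS).reduce(
  (total, plantId) => total + NEW_DISEASE_IDS[plantId].length,
  0,
);
const totalDiseaseCount = EXISTING_DISEASE_COUNT + newDiseaseCount;

console.log(newDiseaseCount, totalDiseaseCount); // prints: 19 112
```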