# 10. County Recorder Web Scrapers for Top 100 US Counties meta: id: core-services-10 feature: core-services-implementation priority: P2 depends_on: [core-services-09] tags: [hometitle, scraping, county-records, fallback, coverage] objective: - Build Playwright-based web scrapers for county recorder websites in the top 100 US counties by population, providing a fallback for counties not covered by Attom API and reducing API costs. deliverables: - Scrapers for 100 US county recorder websites (starting with top 50) - Unified property record parser that normalizes disparate HTML formats - Fallback logic: Attom API → county scraper → manual request (in order) - scraper health monitoring and breakage detection steps: 1. Identify top 100 US counties by population (start with top 50): - Los Angeles County, CA; Cook County, IL; Harris County, TX; Maricopa County, AZ; etc. 2. Research each county's recorder website: - Search URL pattern (usually `https://{county}.gov/recorder` or similar) - Record search interface (by owner name, parcel ID, or address) - Result format (HTML table, PDF, JSON API, proprietary system) 3. Create `hometitle/county-scrapers/` directory with one module per county 4. Implement base scraper interface: - `searchByAddress(address): Promise` - `searchByParcelId(parcelId): Promise` - `parseResults(html): CountyRecord[]` 5. Implement scrapers for each county using Playwright: - Navigate to recorder website - Fill search form (address or parcel ID) - Submit and wait for results - Parse HTML table or detail page - Extract: owner name, deed date, tax info, lien status 6. Implement unified `parseDeedRecords(html)` that handles common formats: - HTML tables with standard columns - Detail pages with labeled fields - PDF records (download + text extraction) 7. Add fallback chain in `scanner.ts`: - Try Attom API first (fastest, most reliable) - If Attom returns null/empty, try county scraper - If scraper fails, queue for manual request (email to user) 8. Add scraper monitoring: - Track success/failure rate per county - Alert when >20% of scrapers fail in 24h (site changes) - Auto-disable broken scrapers, fall back to Attom/manual 9. Handle rate limiting: - Throttle requests to county sites (max 1 req/5 sec per county) - Use residential proxies if county blocks datacenter IPs - Respect robots.txt and terms of service tests: - Unit: Mock HTML responses for common county formats, verify parser normalization - Integration: Test 5 representative county scrapers against live sites - E2E: Property in county without Attom coverage → scraper fetches real data → snapshot created acceptance_criteria: - [ ] 50+ county recorder scrapers implemented and tested against live sites - [ ] `parseDeedRecords()` parses real HTML and returns structured CountyRecord objects - [ ] Fallback chain works: Attom → county scraper → manual request - [ ] Each scraper handles the county's specific search interface and result format - [ ] Rate limiting respects county sites (max 1 request per 5 seconds) - [ ] Broken scrapers are auto-detected within 24 hours and disabled - [ ] Scraper success rate > 70% across all implemented counties - [ ] Property records from scrapers match Attom data quality (owner name, deed date, liens) - [ ] Failed scraper attempts fall back to manual queue with user notification - [ ] No county site is overwhelmed by scraping (responsible rate limits) validation: - Run `vitest run hometitle.test.ts` — extended tests for county scrapers - Manual: Search property in Cook County IL, verify scraper returns real owner data - Check fallback: Disable Attom API key, trigger scan, verify county scraper activates - Monitor health: Dashboard shows per-county scraper success rate notes: - County recorder sites are notoriously fragile — expect 30–40% of scrapers to break per quarter - Many counties use proprietary systems (e.g., Tyler Technologies, Fidlar Technologies) with complex JavaScript - Some counties require payment per record ($1–$5) — flag these for manual processing - Consider partnering with Attom for counties they don't cover rather than building scrapers - Legal: Ensure scraping complies with each county's terms of service and state public records laws - The existing `parseDeedRecords()` currently logs "not yet implemented" — replace with real parsing