Files
Kordant/tasks/core-services-implementation/10-hometitle-county-scrapers.md
2026-05-31 22:03:18 -04:00

84 lines
4.4 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# 10. County Recorder Web Scrapers for Top 100 US Counties
meta:
id: core-services-10
feature: core-services-implementation
priority: P2
depends_on: [core-services-09]
tags: [hometitle, scraping, county-records, fallback, coverage]
objective:
- Build Playwright-based web scrapers for county recorder websites in the top 100 US counties by population, providing a fallback for counties not covered by Attom API and reducing API costs.
deliverables:
- Scrapers for 100 US county recorder websites (starting with top 50)
- Unified property record parser that normalizes disparate HTML formats
- Fallback logic: Attom API → county scraper → manual request (in order)
- scraper health monitoring and breakage detection
steps:
1. Identify top 100 US counties by population (start with top 50):
- Los Angeles County, CA; Cook County, IL; Harris County, TX; Maricopa County, AZ; etc.
2. Research each county's recorder website:
- Search URL pattern (usually `https://{county}.gov/recorder` or similar)
- Record search interface (by owner name, parcel ID, or address)
- Result format (HTML table, PDF, JSON API, proprietary system)
3. Create `hometitle/county-scrapers/` directory with one module per county
4. Implement base scraper interface:
- `searchByAddress(address): Promise<CountyRecord[]>`
- `searchByParcelId(parcelId): Promise<CountyRecord | null>`
- `parseResults(html): CountyRecord[]`
5. Implement scrapers for each county using Playwright:
- Navigate to recorder website
- Fill search form (address or parcel ID)
- Submit and wait for results
- Parse HTML table or detail page
- Extract: owner name, deed date, tax info, lien status
6. Implement unified `parseDeedRecords(html)` that handles common formats:
- HTML tables with standard columns
- Detail pages with labeled fields
- PDF records (download + text extraction)
7. Add fallback chain in `scanner.ts`:
- Try Attom API first (fastest, most reliable)
- If Attom returns null/empty, try county scraper
- If scraper fails, queue for manual request (email to user)
8. Add scraper monitoring:
- Track success/failure rate per county
- Alert when >20% of scrapers fail in 24h (site changes)
- Auto-disable broken scrapers, fall back to Attom/manual
9. Handle rate limiting:
- Throttle requests to county sites (max 1 req/5 sec per county)
- Use residential proxies if county blocks datacenter IPs
- Respect robots.txt and terms of service
tests:
- Unit: Mock HTML responses for common county formats, verify parser normalization
- Integration: Test 5 representative county scrapers against live sites
- E2E: Property in county without Attom coverage → scraper fetches real data → snapshot created
acceptance_criteria:
- [ ] 50+ county recorder scrapers implemented and tested against live sites
- [ ] `parseDeedRecords()` parses real HTML and returns structured CountyRecord objects
- [ ] Fallback chain works: Attom → county scraper → manual request
- [ ] Each scraper handles the county's specific search interface and result format
- [ ] Rate limiting respects county sites (max 1 request per 5 seconds)
- [ ] Broken scrapers are auto-detected within 24 hours and disabled
- [ ] Scraper success rate > 70% across all implemented counties
- [ ] Property records from scrapers match Attom data quality (owner name, deed date, liens)
- [ ] Failed scraper attempts fall back to manual queue with user notification
- [ ] No county site is overwhelmed by scraping (responsible rate limits)
validation:
- Run `vitest run hometitle.test.ts` — extended tests for county scrapers
- Manual: Search property in Cook County IL, verify scraper returns real owner data
- Check fallback: Disable Attom API key, trigger scan, verify county scraper activates
- Monitor health: Dashboard shows per-county scraper success rate
notes:
- County recorder sites are notoriously fragile — expect 3040% of scrapers to break per quarter
- Many counties use proprietary systems (e.g., Tyler Technologies, Fidlar Technologies) with complex JavaScript
- Some counties require payment per record ($1$5) — flag these for manual processing
- Consider partnering with Attom for counties they don't cover rather than building scrapers
- Legal: Ensure scraping complies with each county's terms of service and state public records laws
- The existing `parseDeedRecords()` currently logs "not yet implemented" — replace with real parsing