# 10. County Recorder Web Scrapers for Top 100 US Counties

meta:
  id: core-services-10
  feature: core-services-implementation
  priority: P2
  depends_on: [core-services-09]
  tags: [hometitle, scraping, county-records, fallback, coverage]

objective:
  - Build Playwright-based web scrapers for county recorder websites in the top 100 US counties by population, providing a fallback for counties not covered by the Attom API and reducing API costs.

deliverables:
  - Scrapers for 100 US county recorder websites (starting with the top 50)
  - Unified property record parser that normalizes disparate HTML formats
  - Fallback logic: Attom API → county scraper → manual request (in order)
  - Scraper health monitoring and breakage detection
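
The fallback order in the deliverables can be sketched as a loop over lookup sources. This is a minimal sketch, not the contents of `scanner.ts`: the names `Lookup`, `ScanResult`, `lookupWithFallback`, and `queueManualRequest`, and the `CountyRecord` fields, are all hypothetical. It assumes that an empty result and a thrown error should both move on to the next source.

```typescript
// Hypothetical record shape; the field names are assumptions for illustration.
interface CountyRecord {
  ownerName: string;
  deedDate: string; // ISO 8601 date
}

// A lookup source (Attom client, county scraper, ...) reduced to one function.
type Lookup = (address: string) => Promise<CountyRecord[] | null>;

type ScanResult =
  | { status: "found"; records: CountyRecord[] }
  | { status: "manual"; address: string };

// Try each source in order. A null/empty result or a thrown error
// falls through to the next source; if every source comes up empty,
// the address is queued for a manual request.
async function lookupWithFallback(
  address: string,
  sources: Lookup[],
  queueManualRequest: (address: string) => Promise<void>,
): Promise<ScanResult> {
  for (const source of sources) {
    try {
      const records = await source(address);
      if (records && records.length > 0) {
        return { status: "found", records };
      }
    } catch {
      // A failing source is treated like an empty one in this sketch.
    }
  }
  await queueManualRequest(address);
  return { status: "manual", address };
}
```

Passing the sources as an ordered array keeps the order (Attom first, then the county scraper) in one place, so a disabled scraper can simply be left out of the array.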

steps:
  1. Identify the top 100 US counties by population (start with the top 50):
     - Los Angeles County, CA; Cook County, IL; Harris County, TX; Maricopa County, AZ; etc.
  2. Research each county's recorder website:
     - Search URL pattern (for example `https://{county}.gov/recorder`; the actual pattern varies by county)
     - Record search interface (by owner name, parcel ID, or address)
     - Result format (HTML table, PDF, JSON API, proprietary system)
  3. Create a `hometitle/county-scrapers/` directory with one module per county
  4. Implement the base scraper interface:
     - `searchByAddress(address): Promise<CountyRecord[]>`
     - `searchByParcelId(parcelId): Promise<CountyRecord | null>`
     - `parseResults(html): CountyRecord[]`
  5. Implement scrapers for each county using Playwright:
     - Navigate to the recorder website
     - Fill in the search form (address or parcel ID)
     - Submit and wait for results
     - Parse the HTML table or detail page
     - Extract: owner name, deed date, tax info, lien status
  6. Implement a unified `parseDeedRecords(html)` that handles common formats:
     - HTML tables with standard columns
     - Detail pages with labeled fields
     - PDF records (download + text extraction)
  7. Add the fallback chain in `scanner.ts`:
     - Try the Attom API first (fastest, most reliable)
     - If Attom returns null/empty, try the county scraper
     - If the scraper fails, queue a manual request (email to user)
  8. Add scraper monitoring:
     - Track success/failure rate per county
     - Alert when more than 20% of scrapers fail within 24 hours (a likely sign of site changes)
     - Auto-disable broken scrapers and fall back to Attom or a manual request
  9. Handle rate limiting:
     - Throttle requests to county sites (at most 1 request per 5 seconds per county)
     - Use residential proxies if a county blocks datacenter IPs
     - Respect robots.txt and terms of service
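
Steps 4 and 9 can be sketched together as a base interface plus a per-county throttle. The three method names come from step 4; everything else here is an assumption for illustration: the `CountyRecord` fields, the `countyId` property, the `CountyThrottle` class, and the `throttledSearch` helper. The 5000 ms default mirrors the limit of 1 request per 5 seconds per county.

```typescript
// Hypothetical record shape based on the fields listed in step 5.
interface CountyRecord {
  ownerName: string;
  deedDate: string | null;
  parcelId: string | null;
  lienStatus: "none" | "active" | "unknown";
}

// Method names follow step 4; `countyId` is an assumed identifier.
interface CountyScraper {
  readonly countyId: string;
  searchByAddress(address: string): Promise<CountyRecord[]>;
  searchByParcelId(parcelId: string): Promise<CountyRecord | null>;
  parseResults(html: string): CountyRecord[];
}

// Spaces out requests so that two requests to the same county are
// at least `minIntervalMs` apart. Counties are throttled independently.
class CountyThrottle {
  private readonly nextAllowed = new Map<string, number>();
  private readonly minIntervalMs: number;

  constructor(minIntervalMs: number = 5000) {
    this.minIntervalMs = minIntervalMs;
  }

  // Reserve the next free slot for a county and return how many
  // milliseconds the caller must wait before sending the request.
  reserve(countyId: string, now: number = Date.now()): number {
    const earliest = this.nextAllowed.get(countyId) ?? now;
    const start = Math.max(earliest, now);
    this.nextAllowed.set(countyId, start + this.minIntervalMs);
    return start - now;
  }

  // Sleep until the reserved slot arrives.
  async wait(countyId: string): Promise<void> {
    const delay = this.reserve(countyId);
    if (delay > 0) {
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Wrap any scraper call so it honors the per-county limit.
async function throttledSearch(
  scraper: CountyScraper,
  throttle: CountyThrottle,
  address: string,
): Promise<CountyRecord[]> {
  await throttle.wait(scraper.countyId);
  return scraper.searchByAddress(address);
}
```

Because `reserve` takes the current time as a parameter, the slot arithmetic can be unit-tested without real delays. This throttle is in-process only; if scrapers run in several workers, the reservation state would need to live in shared storage.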

tests:
  - Unit: Mock HTML responses for common county formats, verify parser normalization
  - Integration: Test 5 representative county scrapers against live sites
  - E2E: Property in county without Attom coverage → scraper fetches real data → snapshot created
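
The unit test above could look roughly like the following: two mock HTML tables with different column names and date formats are normalized to the same record. This is only a sketch. `parseTable`, `HEADER_ALIASES`, the two-field `CountyRecord`, and both fixtures are made up for illustration and do not reflect any real county's markup. The regex-based cell extraction is adequate for small fixtures but is not a general HTML parser.

```typescript
// Reduced, hypothetical record shape for this example.
interface CountyRecord {
  ownerName: string;
  deedDate: string; // normalized to YYYY-MM-DD where possible
}

// Assumed column-name aliases; real county sites will need their own.
const HEADER_ALIASES: Record<string, keyof CountyRecord> = {
  "owner": "ownerName",
  "owner name": "ownerName",
  "grantee": "ownerName",
  "deed date": "deedDate",
  "recording date": "deedDate",
};

// Return the trimmed text of every cell in every <tr>.
function tableRows(html: string): string[][] {
  const rows = html.match(/<tr\b[^>]*>[\s\S]*?<\/tr>/gi) ?? [];
  return rows.map((row) =>
    (row.match(/<t[hd]\b[^>]*>[\s\S]*?<\/t[hd]>/gi) ?? []).map((cell) =>
      cell.replace(/<[^>]+>/g, "").replace(/\s+/g, " ").trim(),
    ),
  );
}

// Convert M/D/YYYY to YYYY-MM-DD; any other format is returned unchanged.
function toIsoDate(value: string): string {
  const m = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(value);
  if (!m) return value;
  const month = ("0" + m[1]).slice(-2);
  const day = ("0" + m[2]).slice(-2);
  return `${m[3]}-${month}-${day}`;
}

// Treat the first row as the header and map known columns onto record fields.
function parseTable(html: string): CountyRecord[] {
  const [header, ...body] = tableRows(html);
  if (!header) return [];
  const fields = header.map((h) => HEADER_ALIASES[h.toLowerCase()]);
  return body.map((cells) => {
    const record: CountyRecord = { ownerName: "", deedDate: "" };
    cells.forEach((cell, i) => {
      const field = fields[i];
      if (field === "ownerName") record.ownerName = cell;
      if (field === "deedDate") record.deedDate = toIsoDate(cell);
    });
    return record;
  });
}

// Two made-up fixtures that should normalize to the same record.
const formatA = `<table><thead><tr><th>Grantee</th><th>Recording Date</th></tr></thead>
<tbody><tr><td> Jane  Doe </td><td>3/7/2021</td></tr></tbody></table>`;
const formatB = `<table><tr><th>Owner Name</th><th>Deed Date</th></tr>
<tr><td>Jane Doe</td><td>03/07/2021</td></tr></table>`;

const parsedA = parseTable(formatA);
const parsedB = parseTable(formatB);
```

In a Vitest suite, the comparison would typically be written as `expect(parsedA).toEqual(parsedB)`, with one fixture file per county format.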

acceptance_criteria:
  - [ ] 50+ county recorder scrapers implemented and tested against live sites
  - [ ] `parseDeedRecords()` parses real HTML and returns structured CountyRecord objects
  - [ ] Fallback chain works: Attom → county scraper → manual request
  - [ ] Each scraper handles the county's specific search interface and result format
  - [ ] Rate limiting respects county sites (max 1 request per 5 seconds)
  - [ ] Broken scrapers are auto-detected within 24 hours and disabled
  - [ ] Scraper success rate > 70% across all implemented counties
  - [ ] Property records from scrapers match Attom data quality (owner name, deed date, liens)
  - [ ] Failed scraper attempts fall back to manual queue with user notification
  - [ ] No county site is overwhelmed by scraping (responsible rate limits)
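
The breakage-detection criteria can be sketched as a rolling 24-hour success tracker. The class and method names are hypothetical, and the thresholds are assumptions. The spec states 70% as an aggregate target, so using it as the per-county disable threshold is a choice made for this sketch. `minSamples` is an added guard so that a single failure does not disable a scraper.

```typescript
interface Attempt {
  at: number; // epoch milliseconds
  ok: boolean;
}

const DAY_MS = 24 * 60 * 60 * 1000;

class ScraperHealth {
  private readonly attempts = new Map<string, Attempt[]>();
  private readonly disabled = new Set<string>();
  private readonly minSuccessRate: number;
  private readonly minSamples: number;

  constructor(minSuccessRate: number = 0.7, minSamples: number = 5) {
    this.minSuccessRate = minSuccessRate;
    this.minSamples = minSamples;
  }

  // Attempts for one county that fall inside the last 24 hours.
  private recent(countyId: string, now: number): Attempt[] {
    return (this.attempts.get(countyId) ?? []).filter((a) => now - a.at < DAY_MS);
  }

  // Record one attempt and disable the scraper if its 24h success rate
  // drops below the threshold once enough samples exist.
  record(countyId: string, ok: boolean, now: number = Date.now()): void {
    const kept = this.recent(countyId, now);
    kept.push({ at: now, ok });
    this.attempts.set(countyId, kept);
    const rate = kept.filter((a) => a.ok).length / kept.length;
    if (kept.length >= this.minSamples && rate < this.minSuccessRate) {
      this.disabled.add(countyId);
    }
  }

  // Success rate over the last 24 hours; 1 when there is no data.
  successRate(countyId: string, now: number = Date.now()): number {
    const kept = this.recent(countyId, now);
    if (kept.length === 0) return 1;
    return kept.filter((a) => a.ok).length / kept.length;
  }

  isEnabled(countyId: string): boolean {
    return !this.disabled.has(countyId);
  }

  // Share of tracked scrapers that are currently disabled; an alert
  // could fire when this exceeds 0.2 (the 20% figure in step 8).
  disabledFraction(): number {
    if (this.attempts.size === 0) return 0;
    return this.disabled.size / this.attempts.size;
  }
}
```

Re-enabling a scraper after a fix is deliberately left out; a real implementation would need either a manual reset or a periodic probe request.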

validation:
  - Run `vitest run hometitle.test.ts` (extended tests for county scrapers)
  - Manual: Search for a property in Cook County, IL and verify the scraper returns real owner data
  - Check fallback: Disable the Attom API key, trigger a scan, and verify the county scraper activates
  - Monitor health: Dashboard shows per-county scraper success rate

notes:
  - County recorder sites are notoriously fragile; expect 30–40% of scrapers to break per quarter
  - Many counties use proprietary systems (e.g., Tyler Technologies, Fidlar Technologies) with complex JavaScript
  - Some counties require payment per record ($1–$5); flag these for manual processing
  - Consider partnering with Attom for counties they don't cover rather than building scrapers
  - Legal: Ensure scraping complies with each county's terms of service and state public records laws
  - The existing `parseDeedRecords()` currently logs "not yet implemented"; replace it with real parsing