Kordant/tasks/core-services-implementation/10-hometitle-county-scrapers.md
2026-05-31 22:03:18 -04:00


10. County Recorder Web Scrapers for Top 100 US Counties

meta:

  • id: core-services-10
  • feature: core-services-implementation
  • priority: P2
  • depends_on: [core-services-09]
  • tags: [hometitle, scraping, county-records, fallback, coverage]

objective:

  • Build Playwright-based web scrapers for county recorder websites in the top 100 US counties by population, providing a fallback for counties not covered by Attom API and reducing API costs.

deliverables:

  • Scrapers for 100 US county recorder websites (starting with top 50)
  • Unified property record parser that normalizes disparate HTML formats
  • Fallback logic: Attom API → county scraper → manual request (in order)
  • Scraper health monitoring and breakage detection

steps:

  1. Identify top 100 US counties by population (start with top 50):
    • Los Angeles County, CA; Cook County, IL; Harris County, TX; Maricopa County, AZ; etc.
  2. Research each county's recorder website:
    • Search URL pattern (usually https://{county}.gov/recorder or similar)
    • Record search interface (by owner name, parcel ID, or address)
    • Result format (HTML table, PDF, JSON API, proprietary system)
  3. Create hometitle/county-scrapers/ directory with one module per county
  4. Implement base scraper interface:
    • searchByAddress(address): Promise<CountyRecord[]>
    • searchByParcelId(parcelId): Promise<CountyRecord | null>
    • parseResults(html): CountyRecord[]
  5. Implement scrapers for each county using Playwright:
    • Navigate to recorder website
    • Fill search form (address or parcel ID)
    • Submit and wait for results
    • Parse HTML table or detail page
    • Extract: owner name, deed date, tax info, lien status
  6. Implement unified parseDeedRecords(html) that handles common formats:
    • HTML tables with standard columns
    • Detail pages with labeled fields
    • PDF records (download + text extraction)
  7. Add fallback chain in scanner.ts:
    • Try Attom API first (fastest, most reliable)
    • If Attom returns null/empty, try county scraper
    • If scraper fails, queue for manual request (email to user)
  8. Add scraper monitoring:
    • Track success/failure rate per county
    • Alert when more than 20% of scrapers fail within 24 hours (a likely sign of site changes)
    • Auto-disable broken scrapers; affected counties then rely on Attom and the manual request queue
  9. Handle rate limiting:
    • Throttle requests to county sites (max 1 req/5 sec per county)
    • Use residential proxies if county blocks datacenter IPs
    • Respect robots.txt and terms of service
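
The base interface from step 4 and the fallback chain from step 7 can be sketched together as follows. This is a minimal illustration, not the contents of scanner.ts: the `CountyRecord` field names, the `FallbackDeps` shape, and the `lookupProperty` function are assumptions introduced here, and the choice to treat an Attom error or an empty scraper result the same as "no data" is also an assumption.

```typescript
// Hypothetical record shape; field names are guesses based on the
// "Extract" bullet in step 5.
interface CountyRecord {
  ownerName: string;
  deedDate: string; // ISO 8601 date
  parcelId?: string;
  taxInfo?: string;
  lienStatus?: string;
}

// Base scraper interface as listed in step 4.
interface CountyScraper {
  searchByAddress(address: string): Promise<CountyRecord[]>;
  searchByParcelId(parcelId: string): Promise<CountyRecord | null>;
  parseResults(html: string): CountyRecord[];
}

interface LookupResult {
  source: "attom" | "scraper" | "manual";
  records: CountyRecord[];
}

// Assumed dependencies, injected so the chain can run without network access.
interface FallbackDeps {
  fetchFromAttom(address: string): Promise<CountyRecord[] | null>;
  scraper: CountyScraper | undefined; // undefined: no enabled scraper for this county
  queueManualRequest(address: string): Promise<void>;
}

async function lookupProperty(
  address: string,
  deps: FallbackDeps,
): Promise<LookupResult> {
  // 1. Attom API first; a rejected call is treated like a null result.
  const attom = await deps.fetchFromAttom(address).catch(() => null);
  if (attom && attom.length > 0) {
    return { source: "attom", records: attom };
  }

  // 2. County scraper, if one exists and is enabled.
  if (deps.scraper) {
    try {
      const scraped = await deps.scraper.searchByAddress(address);
      if (scraped.length > 0) {
        return { source: "scraper", records: scraped };
      }
    } catch {
      // Scraper failure falls through to the manual queue.
    }
  }

  // 3. Last resort: queue a manual request.
  await deps.queueManualRequest(address);
  return { source: "manual", records: [] };
}
```

Passing the Attom client, the scraper, and the manual queue in as dependencies keeps the ordering logic testable with stubs, which matters because live county sites are too unstable to rely on in unit tests.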

tests:

  • Unit: Mock HTML responses for common county formats, verify parser normalization
  • Integration: Test 5 representative county scrapers against live sites
  • E2E: Property in county without Attom coverage → scraper fetches real data → snapshot created
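
As a starting point for the unit-test bullet, a mocked HTML table can be run through a small normalizer like the one below. This sketch covers only the "HTML tables with standard columns" case. The function name `parseDeedTable`, the header aliases, and the `MM/DD/YYYY` input date format are assumptions, and the regex-based extraction is used only to keep the example dependency-free; a real parser would more likely use a DOM library or Playwright locators.

```typescript
// Hypothetical subset of the record shape.
interface CountyRecord {
  ownerName: string;
  deedDate: string; // normalized to YYYY-MM-DD
  lienStatus?: string;
}

// Assumed mapping from county-specific column headers to normalized fields.
const HEADER_ALIASES: Record<string, keyof CountyRecord> = {
  "owner": "ownerName",
  "owner name": "ownerName",
  "grantee": "ownerName",
  "deed date": "deedDate",
  "recorded date": "deedDate",
  "lien status": "lienStatus",
};

// Returns the trimmed text content of every <th>/<td> cell in one row.
function cellTexts(rowHtml: string): string[] {
  const cells: string[] = [];
  const cellPattern = /<t[hd]\b[^>]*>([\s\S]*?)<\/t[hd]>/gi;
  let match: RegExpExecArray | null;
  while ((match = cellPattern.exec(rowHtml)) !== null) {
    const inner = match[1] ?? "";
    cells.push(inner.replace(/<[^>]+>/g, "").replace(/\s+/g, " ").trim());
  }
  return cells;
}

// Converts M/D/YYYY to YYYY-MM-DD; other formats are returned unchanged.
function toIsoDate(raw: string): string {
  const m = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(raw);
  if (!m) return raw;
  const month = (m[1] ?? "").padStart(2, "0");
  const day = (m[2] ?? "").padStart(2, "0");
  return `${m[3] ?? ""}-${month}-${day}`;
}

function parseDeedTable(html: string): CountyRecord[] {
  const rows = html.match(/<tr\b[^>]*>[\s\S]*?<\/tr>/gi) ?? [];
  if (rows.length < 2) return [];
  // The first row is assumed to be the header row.
  const fields = cellTexts(rows[0] ?? "").map((h) => HEADER_ALIASES[h.toLowerCase()]);
  const records: CountyRecord[] = [];
  for (const row of rows.slice(1)) {
    const cells = cellTexts(row);
    const partial: Partial<CountyRecord> = {};
    fields.forEach((field, i) => {
      const value = cells[i];
      if (field && value) partial[field] = value;
    });
    // Rows missing an owner or a date are skipped rather than guessed at.
    if (partial.ownerName && partial.deedDate) {
      records.push({
        ...partial,
        ownerName: partial.ownerName,
        deedDate: toIsoDate(partial.deedDate),
      });
    }
  }
  return records;
}

// Mock fixture: one valid row and one row with a missing owner.
const sampleHtml = `<table>
<tr><th>Owner Name</th><th>Recorded Date</th><th>Lien Status</th></tr>
<tr><td> <b>DOE, JANE</b> </td><td>3/7/2021</td><td>None</td></tr>
<tr><td></td><td>1/1/2020</td><td>None</td></tr>
</table>`;
const parsed = parseDeedTable(sampleHtml);
```

With this fixture, `parsed` holds a single record for "DOE, JANE" with the date normalized to `2021-03-07`; the second data row is dropped because its owner cell is empty.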

acceptance_criteria:

  • 50+ county recorder scrapers implemented and tested against live sites
  • parseDeedRecords() parses real HTML and returns structured CountyRecord objects
  • Fallback chain works: Attom → county scraper → manual request
  • Each scraper handles the county's specific search interface and result format
  • Rate limiting respects county sites (max 1 request per 5 seconds)
  • Broken scrapers are auto-detected within 24 hours and disabled
  • Scraper success rate > 70% across all implemented counties
  • Property records from scrapers match Attom data quality (owner name, deed date, liens)
  • Failed scraper attempts fall back to manual queue with user notification
  • No county site is overwhelmed by scraping (responsible rate limits)
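
The "max 1 request per 5 seconds" criterion can be enforced with a small per-county throttle. The sketch below is one possible approach under stated assumptions: the class name `CountyThrottle`, the county ID strings, and the injectable clock are all hypothetical and exist only to illustrate the idea.

```typescript
class CountyThrottle {
  // Earliest time (ms since epoch) at which each county may be hit again.
  private nextAllowedAt = new Map<string, number>();
  private minIntervalMs: number;
  private now: () => number;

  constructor(minIntervalMs: number = 5000, now: () => number = () => Date.now()) {
    this.minIntervalMs = minIntervalMs;
    this.now = now;
  }

  // Reserves the next free slot for a county and returns how many
  // milliseconds the caller must wait before sending the request.
  reserve(countyId: string): number {
    const current = this.now();
    const slot = Math.max(current, this.nextAllowedAt.get(countyId) ?? 0);
    this.nextAllowedAt.set(countyId, slot + this.minIntervalMs);
    return slot - current;
  }

  // Waits for the reserved slot, then runs the task.
  async run<T>(countyId: string, task: () => Promise<T>): Promise<T> {
    const waitMs = this.reserve(countyId);
    if (waitMs > 0) {
      await new Promise<void>((resolve) => setTimeout(resolve, waitMs));
    }
    return task();
  }
}
```

Because slots are reserved per county, a burst of lookups for one county is spread out at 5-second intervals while requests to other counties proceed immediately. A scraper would wrap each page navigation in `throttle.run(countyId, ...)`.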

validation:

  • Run vitest run hometitle.test.ts — extended tests for county scrapers
  • Manual: Search property in Cook County IL, verify scraper returns real owner data
  • Check fallback: Disable Attom API key, trigger scan, verify county scraper activates
  • Monitor health: Dashboard shows per-county scraper success rate
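
The per-county success rate shown on the dashboard, and the 24-hour breakage detection, could be backed by a rolling-window tracker like the sketch below. The 24-hour window and the 20% alert level come from the steps in this task; the `ScraperHealth` class, the minimum-attempts guard, and the 50% per-county "broken" cutoff are assumptions added for illustration. The alert is read here as "more than 20% of tracked scrapers are broken", which is one possible interpretation.

```typescript
const WINDOW_MS = 24 * 60 * 60 * 1000; // 24-hour window from the task
const FLEET_ALERT_FRACTION = 0.2;      // ">20% of scrapers fail" from the task
const MIN_ATTEMPTS = 5;                // assumption: avoid judging on tiny samples
const BROKEN_BELOW = 0.5;              // assumption: per-county cutoff for "broken"

interface Attempt {
  at: number; // ms since epoch
  ok: boolean;
}

class ScraperHealth {
  private attempts = new Map<string, Attempt[]>();

  // Attempts for one county that fall inside the rolling window.
  private recent(countyId: string, now: number): Attempt[] {
    return (this.attempts.get(countyId) ?? []).filter((a) => a.at > now - WINDOW_MS);
  }

  // Records one scrape outcome and drops attempts older than the window.
  record(countyId: string, ok: boolean, at: number = Date.now()): void {
    const kept = this.recent(countyId, at);
    kept.push({ at, ok });
    this.attempts.set(countyId, kept);
  }

  // Success rate over the window, or null when there is no recent data.
  successRate(countyId: string, now: number = Date.now()): number | null {
    const recent = this.recent(countyId, now);
    if (recent.length === 0) return null;
    return recent.filter((a) => a.ok).length / recent.length;
  }

  // A scraper counts as broken only with enough recent attempts and a low rate.
  isBroken(countyId: string, now: number = Date.now()): boolean {
    const rate = this.successRate(countyId, now);
    const count = this.recent(countyId, now).length;
    return rate !== null && count >= MIN_ATTEMPTS && rate < BROKEN_BELOW;
  }

  // True when the share of broken scrapers exceeds the fleet alert level.
  shouldAlert(now: number = Date.now()): boolean {
    const counties = Array.from(this.attempts.keys());
    if (counties.length === 0) return false;
    const broken = counties.filter((c) => this.isBroken(c, now)).length;
    return broken / counties.length > FLEET_ALERT_FRACTION;
  }
}
```

The fallback chain can consult `isBroken(countyId)` before invoking a scraper, so a disabled scraper is skipped without any code change, and the dashboard can read `successRate(countyId)` directly.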

notes:

  • County recorder sites are notoriously fragile; expect 30–40% of scrapers to break per quarter
  • Many counties use proprietary systems (e.g., Tyler Technologies, Fidlar Technologies) with complex JavaScript
  • Some counties require payment per record ($1–$5); flag these for manual processing
  • Consider partnering with Attom for counties they don't cover rather than building scrapers
  • Legal: Ensure scraping complies with each county's terms of service and state public records laws
  • The existing parseDeedRecords() currently logs "not yet implemented" — replace with real parsing