PodTui/tasks/rss-content-parsing/04-html-content-extraction.md
2026-02-05 22:55:24 -05:00

04. Implement HTML Content Extraction

meta:

  • id: rss-content-parsing-04
  • feature: rss-content-parsing
  • priority: P2
  • depends_on: [rss-content-parsing-03]
  • tags: [rss, parsing, html]

objective:

  • Parse HTML content from RSS feed descriptions
  • Extract and sanitize text content
  • Convert HTML to plain text for display

deliverables:

  • HTML to text conversion utility
  • Sanitization function for XSS prevention
  • Updated RSS parser integration

steps:

  1. Create src/utils/html-to-text.ts
  2. Implement HTML-to-text conversion algorithm
  3. Add XSS sanitization for extracted content
  4. Handle common HTML elements (paragraphs, lists, links)
  5. Update parseRSSFeed() to use new HTML parser
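
The steps above can be sketched roughly as follows. This is a minimal, regex-based illustration and not the project's implementation: `htmlToText` and `decodeEntitiesSketch` are hypothetical names, the latter standing in for the existing decodeEntities() whose signature is not shown here, and the "label (url)" link format is an assumption. Regex stripping is best effort (for example, a `>` inside an attribute value will confuse it), so a real HTML parser is the more robust choice.

```typescript
// Stand-in for decodeEntities() from rss-parser.ts (assumed; the real
// function's signature and entity coverage are not shown in this task).
const NAMED_ENTITIES: Record<string, string> = {
  amp: "&", lt: "<", gt: ">", quot: '"', apos: "'", nbsp: " ",
};

function decodeEntitiesSketch(input: string): string {
  // Single pass, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
  return input.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (match, body: string) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return Number.isFinite(code) && code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    return NAMED_ENTITIES[body.toLowerCase()] ?? match;
  });
}

// Hypothetical entry point for src/utils/html-to-text.ts.
function htmlToText(html: string): string {
  const stripped = html
    // Source whitespace is not significant in HTML; collapse it first.
    .replace(/\s+/g, " ")
    // Drop comments and elements whose content is never display text.
    .replace(/<!--[\s\S]*?-->/g, "")
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, "")
    // Links: keep the label; append the target only for http(s) URLs.
    .replace(
      /<a\b[^>]*\bhref\s*=\s*["']([^"']*)["'][^>]*>([\s\S]*?)<\/a\s*>/gi,
      (_m, href: string, label: string) => {
        const text = label.replace(/<[^>]+>/g, "").trim();
        if (!/^https?:\/\//i.test(href)) return text;
        if (text === "" || text === href) return href;
        return `${text} (${href})`;
      },
    )
    // List items become "- " lines, <br> a line break,
    // and block-level closing tags a paragraph break.
    .replace(/<li\b[^>]*>/gi, "\n- ")
    .replace(/<br\s*\/?>/gi, "\n")
    .replace(/<\/(p|div|ul|ol|h[1-6]|blockquote)\s*>/gi, "\n\n")
    // Remove every remaining tag.
    .replace(/<\/?[a-z][^>]*>/gi, "");

  // Decode entities after stripping, so encoded markup such as
  // "&lt;b&gt;" ends up as literal text rather than being treated as a tag.
  return decodeEntitiesSketch(stripped)
    .replace(/[ \t]*\n[ \t]*/g, "\n")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}
```

Under these rules, `htmlToText('<p>Hi &amp; bye</p><ul><li>One</li></ul>')` returns `"Hi & bye\n\n- One"`. Dropping non-http(s) link targets is one way to keep `javascript:` URLs out of the displayed text.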

tests:

  • Unit: Test HTML to text conversion accuracy
  • Integration: Test with HTML-rich RSS feeds
  • Security: Test XSS sanitization with malicious HTML

acceptance_criteria:

  • HTML content is converted to readable plain text
  • No HTML tags remain in output
  • Sanitization prevents XSS attacks
  • Links are properly converted to text format

validation:

  • Test with podcast descriptions containing HTML
  • Verify text is readable and properly formatted
  • Check for any HTML tag remnants
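
The tag-remnant check could be automated with a small helper along these lines. `findTagRemnants` is a hypothetical name, and the pattern is a heuristic: it flags anything shaped like an opening, closing, or self-closing tag, so legitimate text such as "x<y>z" would be reported as a false positive.

```typescript
// Hypothetical validation helper: returns every tag-like fragment still
// present in converted text, or an empty array when none is found.
function findTagRemnants(text: string): string[] {
  // "<" or "</", a tag name, optional attributes, optional "/", then ">".
  const tagLike = /<\/?[a-z][a-z0-9-]*(\s[^<>]*)?\/?>/gi;
  return text.match(tagLike) ?? [];
}

// Plain comparisons are not flagged because "<" must be followed by a letter.
console.log(findTagRemnants("a < b and c > d"));           // prints []
console.log(findTagRemnants("Line one\n<p>Line two</p>")); // prints [ '<p>', '</p>' ]
```

A unit test can then assert that the returned array is empty for each converted podcast description.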

notes:

  • Use existing decodeEntities() function from rss-parser.ts
  • Preserve line breaks and paragraph structure
  • Convert URLs to text format (e.g., "Visit example.com")
  • Consider using an HTML parser such as cheerio (html-escaper only escapes and unescapes entities; it does not parse HTML)