Mike/PodTui

Michael Freno 168e6d5a61 final feature set

2026-02-05 22:55:24 -05:00

04. Implement HTML Content Extraction

meta:

id: rss-content-parsing-04
feature: rss-content-parsing
priority: P2
depends_on: [rss-content-parsing-03]
tags: [rss, parsing, html]

objective:

Parse HTML content from RSS feed descriptions
Extract and sanitize text content
Convert HTML to plain text for display

deliverables:

HTML to text conversion utility
Sanitization function for XSS prevention
Updated RSS parser integration

steps:

Create src/utils/html-to-text.ts
Implement HTML-to-text conversion algorithm
Add XSS sanitization for extracted content
Handle common HTML elements (paragraphs, lists, links)
Update parseRSSFeed() to use new HTML parser
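
The steps above could be sketched roughly as follows. This is a minimal, dependency-free illustration, not the project's implementation: the function name htmlToText, the link format "label (url)", the "- " bullet style, and the http/https-only rule for link targets are all assumptions. decodeBasicEntities is a hypothetical stand-in for the existing decodeEntities(), whose signature is not shown here. Regex-based stripping is not a full HTML parser and will mishandle unusual markup.

```typescript
// Hypothetical sketch of src/utils/html-to-text.ts; names and output format are assumptions.

// Stand-in for the project's decodeEntities(); covers only a few common entities.
const NAMED_ENTITIES: Record<string, string> = {
  amp: "&",
  lt: "<",
  gt: ">",
  quot: '"',
  apos: "'",
  nbsp: " ",
};

function decodeBasicEntities(text: string): string {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (match, body: string) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    return NAMED_ENTITIES[body.toLowerCase()] ?? match;
  });
}

function htmlToText(html: string): string {
  const text = html
    // Source whitespace is insignificant in HTML; structure comes from tags.
    .replace(/\s+/g, " ")
    // Drop comments and elements whose content is never display text.
    .replace(/<!--[\s\S]*?-->/g, "")
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, "")
    // Links: keep the label and append the target only for http(s) URLs.
    .replace(
      /<a\b[^>]*\bhref\s*=\s*["']([^"']*)["'][^>]*>([\s\S]*?)<\/a\s*>/gi,
      (_m, href: string, label: string) => {
        const plainLabel = label.replace(/<[^>]*>/g, "").trim();
        if (!/^https?:\/\//i.test(href)) return plainLabel;
        if (!plainLabel || plainLabel === href) return href;
        return `${plainLabel} (${href})`;
      },
    )
    // List items become "- " bullets on their own line.
    .replace(/<li\b[^>]*>/gi, "\n- ")
    // Line breaks and block boundaries become newlines.
    .replace(/<br\s*\/?>/gi, "\n")
    .replace(/<\/(p|div|h[1-6]|ul|ol|blockquote)\s*>/gi, "\n\n")
    // Remove every remaining tag, including ones carrying event handlers.
    .replace(/<[^>]*>/g, "");

  // Entities are decoded last, so escaped markup such as "&lt;b&gt;" ends up
  // as literal text and is never re-interpreted as a tag.
  return decodeBasicEntities(text)
    .replace(/[ \t]+/g, " ")
    .replace(/ ?\n ?/g, "\n")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}
```

With this sketch, htmlToText("<ul><li>One</li><li>Two</li></ul>") returns "- One\n- Two", and an input such as '<img src=x onerror="alert(1)">Hi' is reduced to "Hi". Decoding entities after tag removal is a deliberate choice here: it keeps escaped markup visible as text, at the cost that such text can still look like a tag to a naive remnant check.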

tests:

Unit: Test HTML to text conversion accuracy
Integration: Test with HTML-rich RSS feeds
Security: Test XSS sanitization with malicious HTML

acceptance_criteria:

HTML content is converted to readable plain text
No HTML tags remain in output
Sanitization prevents XSS attacks
Links are properly converted to text format

validation:

Test with podcast descriptions containing HTML
Verify text is readable and properly formatted
Check for any HTML tag remnants
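
The remnant check could be automated with a small helper along these lines. This is a sketch under assumptions: the name findTagRemnants and the pattern for what counts as "tag-like" are hypothetical, and the pattern deliberately ignores a bare "<" or ">" used as a comparison sign.

```typescript
// Hypothetical validation helper; the name and the pattern are assumptions.
// Matches text that looks like an opening tag, a closing tag, or a comment.
const TAG_LIKE = /<\/?[a-zA-Z][^<>]*>|<!--[\s\S]*?-->/g;

function findTagRemnants(text: string): string[] {
  // match() with a global regex returns all matches, or null when there are none.
  return text.match(TAG_LIKE) ?? [];
}
```

An empty result for every converted podcast description would support the "No HTML tags remain in output" criterion. For instance, findTagRemnants("Hello <b>world</b>") returns ["<b>", "</b>"], while findTagRemnants("1 < 2 and 3 > 2") returns an empty array.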

notes:

Use existing decodeEntities() function from rss-parser.ts
Preserve line breaks and paragraph structure
Convert URLs to text format (e.g., "Visit example.com")
Consider using a library instead of hand-written parsing: cheerio is an HTML parser, whereas html-escaper only escapes and unescapes entities and does not parse markup