PodTui/tasks/rss-content-parsing/04-html-content-extraction.md
2026-02-05 22:55:24 -05:00

# 04. Implement HTML Content Extraction
meta:
id: rss-content-parsing-04
feature: rss-content-parsing
priority: P2
depends_on: [rss-content-parsing-03]
tags: [rss, parsing, html]
objective:
- Parse HTML content from RSS feed descriptions
- Extract and sanitize text content
- Convert HTML to plain text for display
deliverables:
- HTML to text conversion utility
- Sanitization function for XSS prevention
- Updated RSS parser integration
steps:
1. Create `src/utils/html-to-text.ts`
2. Implement HTML-to-text conversion algorithm
3. Add XSS sanitization for extracted content
4. Handle common HTML elements (paragraphs, lists, links)
5. Update `parseRSSFeed()` to use the new HTML-to-text utility
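
The steps above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the project's implementation: `htmlToText`, `decodeBasicEntities`, and `fromCodePointSafe` are hypothetical names, and `decodeBasicEntities` is only a stand-in for the existing `decodeEntities()` in rss-parser.ts, whose signature is not shown here. It assumes reasonably well-formed HTML; a regex pass like this will mishandle edge cases such as `>` inside attribute values or a bare `<` in prose.

```typescript
// Hypothetical sketch for src/utils/html-to-text.ts (names are assumptions).

// Small stand-in for the project's decodeEntities(); covers only a few entities.
const NAMED_ENTITIES: Record<string, string> = {
  amp: "&", lt: "<", gt: ">", quot: '"', apos: "'", nbsp: " ",
};

// Convert a numeric code point to a string, keeping the original text if it is out of range.
function fromCodePointSafe(code: number, fallback: string): string {
  return Number.isInteger(code) && code >= 0 && code <= 0x10ffff
    ? String.fromCodePoint(code)
    : fallback;
}

function decodeBasicEntities(text: string): string {
  return text.replace(/&(#[xX][0-9a-fA-F]+|#\d+|[a-zA-Z]+);/g, (match, body: string) => {
    if (/^#[xX]/.test(body)) return fromCodePointSafe(parseInt(body.slice(2), 16), match);
    if (body.startsWith("#")) return fromCodePointSafe(parseInt(body.slice(1), 10), match);
    return NAMED_ENTITIES[body.toLowerCase()] ?? match;
  });
}

function htmlToText(html: string): string {
  const stripped = html
    // Drop script/style elements together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, "")
    // Drop HTML comments.
    .replace(/<!--[\s\S]*?-->/g, "")
    // Turn links into "label (url)"; only http(s) URLs are kept.
    .replace(
      /<a\b[^>]*\bhref\s*=\s*["']([^"']*)["'][^>]*>([\s\S]*?)<\/a\s*>/gi,
      (_match, href: string, label: string) => {
        const plainLabel = label.replace(/<[^>]*>/g, "").trim();
        const safeHref = /^https?:\/\//i.test(href) ? href : "";
        if (!safeHref) return plainLabel;
        return plainLabel && plainLabel !== safeHref
          ? `${plainLabel} (${safeHref})`
          : safeHref;
      },
    )
    // List items become dash bullets on their own lines.
    .replace(/<li\b[^>]*>/gi, "\n- ")
    // Explicit line breaks.
    .replace(/<br\s*\/?>/gi, "\n")
    // Closing block-level tags become paragraph breaks.
    .replace(/<\/(p|div|h[1-6]|ul|ol|blockquote)\s*>/gi, "\n\n")
    // Remove every remaining tag.
    .replace(/<[^>]*>/g, "");

  // Decode entities after tag removal so escaped text is not treated as markup.
  return decodeBasicEntities(stripped)
    .replace(/[ \t]+/g, " ") // collapse runs of spaces and tabs
    .replace(/ *\n */g, "\n") // trim spaces around newlines
    .replace(/\n{3,}/g, "\n\n") // at most one blank line between blocks
    .trim();
}
```

For example, `htmlToText('<p>Tom &amp; Jerry</p><ul><li>One</li><li>Two</li></ul>')` yields `Tom & Jerry`, a blank line, then `- One` and `- Two` on separate lines. Because entities are decoded last, escaped markup such as `&lt;b&gt;` comes out as the literal characters `<b>`; whether that counts as a tag remnant is a decision for the acceptance criteria.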
tests:
- Unit: Test HTML to text conversion accuracy
- Integration: Test with HTML-rich RSS feeds
- Security: Test XSS sanitization with malicious HTML
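
One possible shape for the security test is a small fixture list plus a tag-remnant check that accepts any converter function. The payloads and the names `XSS_PAYLOADS`, `findTagRemnants`, and `runSanitizationChecks` are illustrative assumptions; the real suite would use the project's test runner and converter.

```typescript
// Hypothetical fixtures for the security test; extend with feed-specific cases.
const XSS_PAYLOADS: string[] = [
  "<script>alert(1)</script>",
  "<img src=x onerror=alert(1)>",
  '<a href="javascript:alert(1)">click</a>',
  "<svg onload=alert(1)></svg>",
  '<iframe src="https://evil.example"></iframe>',
];

// Returns anything in the text that still looks like an opening or closing tag.
function findTagRemnants(text: string): string[] {
  return text.match(/<\/?[a-zA-Z][^<>]*>/g) ?? [];
}

interface SanitizationFailure {
  input: string;
  output: string;
  remnants: string[];
}

// Runs every payload through the given converter and collects outputs
// that still contain tag-like text.
function runSanitizationChecks(
  convert: (html: string) => string,
): SanitizationFailure[] {
  const failures: SanitizationFailure[] = [];
  for (const input of XSS_PAYLOADS) {
    const output = convert(input);
    const remnants = findTagRemnants(output);
    if (remnants.length > 0) {
      failures.push({ input, output, remnants });
    }
  }
  return failures;
}
```

A test would pass the converter under test to `runSanitizationChecks` and assert that the returned list is empty. The same `findTagRemnants` helper could also back the "no HTML tags remain" check on real feed descriptions.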
acceptance_criteria:
- HTML content is converted to readable plain text
- No HTML tags remain in output
- Sanitization prevents XSS attacks
- Links are properly converted to text format
validation:
- Test with podcast descriptions containing HTML
- Verify text is readable and properly formatted
- Check for any HTML tag remnants
notes:
- Use existing `decodeEntities()` function from rss-parser.ts
- Preserve line breaks and paragraph structure
- Convert links to plain text that keeps the destination readable (e.g., "Visit example.com")
- Consider using an HTML parser such as `htmlparser2` or `cheerio`; note that `html-escaper` only escapes and unescapes entities and does not parse HTML