final feature set

2026-02-05 22:55:24 -05:00
parent 6b00871c32
commit 168e6d5a61
115 changed files with 2401 additions and 4468 deletions

@@ -0,0 +1,45 @@
# 03. Add RSS Content Type Detection
meta:
id: rss-content-parsing-03
feature: rss-content-parsing
priority: P2
depends_on: []
tags: [rss, parsing, utilities]
objective:
- Create utility to detect if RSS feed content is HTML or plain text
- Analyze content type in description and other text fields
- Return appropriate parsing strategy
deliverables:
- Content type detection function
- Type classification utility
- Integration points for different parsers
steps:
1. Create `src/utils/rss-content-detector.ts`
2. Implement content type detection based on HTML tags
3. Add detection for common HTML entities and tags
4. Return type enum (HTML, PLAIN_TEXT, UNKNOWN)
5. Add unit tests for detection accuracy
tests:
- Unit: Test HTML detection with various HTML snippets
- Unit: Test plain text detection with text-only content
- Unit: Test edge cases (mixed content, malformed HTML)
acceptance_criteria:
- Function correctly identifies HTML vs plain text content
- Handles common HTML patterns and entities
- Returns UNKNOWN for unclassifiable content
validation:
- Test with HTML description from real RSS feeds
- Test with plain text descriptions
- Verify UNKNOWN cases are handled gracefully
notes:
- Look for common HTML tags: <div>, <p>, <br>, <a>, <b>, <i>
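- Check for HTML entities: &lt;, &gt;, &amp;, &quot;, &apos; (these five are also the predefined XML entities, so entities alone do not prove the content is HTML)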
- Consider content length threshold for HTML detection
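
A minimal sketch of the detector outlined in the steps above. The names `ContentType` and `detectContentType`, the exact tag list, and the rule that entity-escaped tags count as HTML are all assumptions made for illustration; a const object stands in for the type enum so the snippet runs as a single file. The regex approach is a heuristic and will misclassify some inputs (for example, prose that happens to contain `<b ...>`).

```typescript
// Hypothetical sketch for src/utils/rss-content-detector.ts.
// A const object plus union type plays the role of the HTML / PLAIN_TEXT / UNKNOWN enum.
const ContentType = {
  HTML: "HTML",
  PLAIN_TEXT: "PLAIN_TEXT",
  UNKNOWN: "UNKNOWN",
} as const;
type ContentType = (typeof ContentType)[keyof typeof ContentType];

// Tags from the notes plus a few that are common in feed descriptions (assumed list).
const TAG_NAMES = "div|p|br|a|b|i|ul|ol|li|span|strong|em|img";

// A literal opening or closing tag, e.g. <p>, </div>, <br />.
const HTML_TAG_PATTERN = new RegExp(`<\\/?(?:${TAG_NAMES})\\b[^>]*>`, "i");

// The same tags after entity escaping, e.g. &lt;p&gt;.
const ESCAPED_TAG_PATTERN = new RegExp(`&lt;\\/?(?:${TAG_NAMES})\\b[\\s\\S]*?&gt;`, "i");

function detectContentType(content: string | null | undefined): ContentType {
  // Missing or whitespace-only content cannot be classified.
  if (content == null || content.trim().length === 0) {
    return ContentType.UNKNOWN;
  }
  if (HTML_TAG_PATTERN.test(content) || ESCAPED_TAG_PATTERN.test(content)) {
    return ContentType.HTML;
  }
  // Entities such as &amp; without any tags are treated as plain text:
  // they are valid in XML too and only need entity decoding.
  return ContentType.PLAIN_TEXT;
}
```

In this sketch `detectContentType("Tom &amp; Jerry")` yields `PLAIN_TEXT`, while `detectContentType("<p>Hi</p>")` yields `HTML`. Whether a length threshold should gate the HTML check is left open, as in the notes.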

@@ -0,0 +1,47 @@
# 04. Implement HTML Content Extraction
meta:
id: rss-content-parsing-04
feature: rss-content-parsing
priority: P2
depends_on: [rss-content-parsing-03]
tags: [rss, parsing, html]
objective:
- Parse HTML content from RSS feed descriptions
- Extract and sanitize text content
- Convert HTML to plain text for display
deliverables:
- HTML to text conversion utility
- Sanitization function for XSS prevention
- Updated RSS parser integration
steps:
1. Create `src/utils/html-to-text.ts`
2. Implement HTML-to-text conversion algorithm
3. Add XSS sanitization for extracted content
4. Handle common HTML elements (paragraphs, lists, links)
5. Update `parseRSSFeed()` to use new HTML parser
tests:
- Unit: Test HTML to text conversion accuracy
- Integration: Test with HTML-rich RSS feeds
- Security: Test XSS sanitization with malicious HTML
acceptance_criteria:
- HTML content is converted to readable plain text
- No HTML tags remain in output
- Sanitization prevents XSS attacks
- Links are properly converted to text format
validation:
- Test with podcast descriptions containing HTML
- Verify text is readable and properly formatted
- Check for any HTML tag remnants
notes:
- Use existing `decodeEntities()` function from rss-parser.ts
- Preserve line breaks and paragraph structure
- Convert URLs to text format (e.g., "Visit example.com")
- Consider using an HTML parsing library such as `cheerio`; note that `html-escaper` only escapes and unescapes entities and does not parse HTML
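
A dependency-free sketch of the conversion described above, for illustration only. `htmlToText` and `basicDecode` are hypothetical names; `basicDecode` is a stand-in for the existing `decodeEntities()`, whose real signature is assumed here to be `(text: string) => string`. The link format ("label (url)") and the choice to keep only http(s) targets are also assumptions. Regex-based stripping is fragile on malformed markup, so a real parser may be the safer choice for production use.

```typescript
// Hypothetical sketch for src/utils/html-to-text.ts.

// Minimal stand-in for the project's decodeEntities(); decodes in a single pass
// so that "&amp;lt;" becomes "&lt;" rather than "<".
function basicDecode(text: string): string {
  const named: Record<string, string> = { lt: "<", gt: ">", quot: '"', apos: "'", amp: "&" };
  return text.replace(/&(lt|gt|quot|apos|amp);/g, (_match, name: string) => named[name]);
}

function htmlToText(html: string, decode: (text: string) => string = basicDecode): string {
  const stripped = html
    // Drop script/style elements together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, "")
    // Drop comments.
    .replace(/<!--[\s\S]*?-->/g, "")
    // Links: keep the label; append the target only for http(s) URLs.
    .replace(
      /<a\b[^>]*\bhref\s*=\s*["']([^"']*)["'][^>]*>([\s\S]*?)<\/a\s*>/gi,
      (_match, href: string, label: string) =>
        /^https?:\/\//i.test(href) ? `${label} (${href})` : label,
    )
    // List items become dash-prefixed lines.
    .replace(/<li\b[^>]*>/gi, "\n- ")
    // Line breaks and block boundaries.
    .replace(/<br\s*\/?>/gi, "\n")
    .replace(/<\/p\s*>/gi, "\n\n")
    .replace(/<\/(div|ul|ol|h[1-6])\s*>/gi, "\n")
    // Remove every remaining tag, including any attributes such as onerror.
    .replace(/<\/?[a-z][^>]*>/gi, "");

  // Entities are decoded after tag removal, so the result may contain literal
  // "<" or ">" characters. It must be rendered as text, never injected as HTML.
  return decode(stripped)
    .replace(/[ \t]+/g, " ")
    .replace(/ *\n */g, "\n")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}
```

Passing the decoder as a parameter keeps the sketch independent of how `decodeEntities()` is actually exported from rss-parser.ts.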

@@ -0,0 +1,45 @@
# 05. Maintain Plain Text Fallback Handling
meta:
id: rss-content-parsing-05
feature: rss-content-parsing
priority: P2
depends_on: [rss-content-parsing-03]
tags: [rss, parsing, fallback]
objective:
- Ensure plain text RSS feeds continue to work correctly
- Maintain backward compatibility with existing functionality
- Handle mixed content scenarios
deliverables:
- Updated `parseRSSFeed()` for HTML support
- Plain text handling path remains unchanged
- Error handling for parsing failures
steps:
1. Update `parseRSSFeed()` to use content type detection
2. Route to HTML parser or plain text path based on type
3. Add error handling for parsing failures
4. Test with both HTML and plain text feeds
5. Verify backward compatibility
tests:
- Integration: Test with plain text RSS feeds
- Integration: Test with HTML RSS feeds
- Regression: Verify existing functionality still works
acceptance_criteria:
- Plain text feeds parse without errors
- HTML feeds parse correctly with sanitization
- No regression in existing functionality
validation:
- Test with various podcast RSS feeds
- Verify descriptions display correctly
- Check for any parsing errors
notes:
- Plain text path uses existing `decodeEntities()` logic
- Keep existing `parseRSSFeed()` structure for plain text
- Add logging for parsing strategy selection
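
The routing in the steps above could look roughly like the sketch below. `parseDescription` and the `DescriptionParsers` interface are hypothetical; the detector, HTML converter, and entity decoder are injected so the sketch does not assume how `parseRSSFeed()` or `decodeEntities()` are actually structured. Treating UNKNOWN like plain text, and falling back to the plain text path when HTML conversion throws, are assumptions rather than stated requirements.

```typescript
// Hypothetical routing helper that parseRSSFeed() could call per text field.
type ContentKind = "HTML" | "PLAIN_TEXT" | "UNKNOWN";

interface DescriptionParsers {
  detect: (raw: string) => ContentKind;
  htmlToText: (raw: string) => string;
  decodeEntities: (raw: string) => string;
  // Optional logger for recording which strategy was selected.
  log?: (message: string) => void;
}

function parseDescription(raw: string, parsers: DescriptionParsers): string {
  const log = parsers.log ?? (() => {});
  const kind = parsers.detect(raw);
  log(`description strategy: ${kind}`);

  if (kind !== "HTML") {
    // PLAIN_TEXT and UNKNOWN both take the existing, unchanged path.
    return parsers.decodeEntities(raw);
  }

  try {
    return parsers.htmlToText(raw);
  } catch (error) {
    // A conversion failure should not break feed parsing; fall back instead.
    // The fallback output may still contain markup and must be rendered as text.
    log(`HTML parsing failed, falling back to plain text: ${String(error)}`);
    return parsers.decodeEntities(raw);
  }
}
```

Because the plain text branch only calls the injected decoder, existing behavior for plain text feeds stays the same as long as the detector does not classify them as HTML.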

@@ -0,0 +1,18 @@
# HTML vs Plain Text RSS Parsing
Objective: Detect and handle both HTML and plain text content in RSS feeds
Status legend: [ ] todo, [~] in-progress, [x] done
Tasks
- [ ] 03 — Add content type detection utility → `03-rss-content-detection.md`
- [ ] 04 — Implement HTML content parsing → `04-html-content-extraction.md`
- [ ] 05 — Maintain plain text fallback handling → `05-plain-text-content-handling.md`
Dependencies
- 03 -> 04
- 03 -> 05
Exit criteria
- RSS feeds with HTML content are properly parsed and sanitized
- Plain text feeds continue to work as before