final feature set

2026-02-05 22:55:24 -05:00
parent 6b00871c32
commit 168e6d5a61
115 changed files with 2401 additions and 4468 deletions
--- a/tasks/rss-content-parsing/03-rss-content-detection.md
+++ b/tasks/rss-content-parsing/03-rss-content-detection.md
@@ -0,0 +1,45 @@
+# 03. Add RSS Content Type Detection
+
+meta:
+  id: rss-content-parsing-03
+  feature: rss-content-parsing
+  priority: P2
+  depends_on: []
+  tags: [rss, parsing, utilities]
+
+objective:
+- Create utility to detect if RSS feed content is HTML or plain text
+- Analyze content type in description and other text fields
+- Return appropriate parsing strategy
+
+deliverables:
+- Content type detection function
+- Type classification utility
+- Integration points for different parsers
+
+steps:
+1. Create `src/utils/rss-content-detector.ts`
+2. Implement content type detection based on HTML tags
+3. Add detection for common HTML entities and tags
+4. Return type enum (HTML, PLAIN_TEXT, UNKNOWN)
+5. Add unit tests for detection accuracy
+
+tests:
+- Unit: Test HTML detection with various HTML snippets
+- Unit: Test plain text detection with text-only content
+- Unit: Test edge cases (mixed content, malformed HTML)
+
+acceptance_criteria:
+- Function correctly identifies HTML vs plain text content
+- Handles common HTML patterns and entities
+- Returns UNKNOWN for unclassifiable content
+
+validation:
+- Test with HTML description from real RSS feeds
+- Test with plain text descriptions
+- Verify UNKNOWN cases are handled gracefully
+
+notes:
+- Look for common HTML tags: <div>, <p>, <br>, <a>, <b>, <i>
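+- Check for HTML entities: &lt;, &gt;, &amp;, &quot;, &apos; (plain text feeds also escape characters such as &amp;, so entities alone do not prove the content is HTML)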
+- Consider content length threshold for HTML detection
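The detection steps in task 03 could be sketched roughly as follows. This is a minimal illustration, not the committed implementation: the names `ContentType` and `detectContentType` are hypothetical, the tag list is the one from the notes, and the choice to ignore bare entities (since plain text feeds also contain `&amp;`) is an assumption.

```typescript
// Hypothetical sketch for src/utils/rss-content-detector.ts.
// All names here are illustrative assumptions, not taken from the codebase.
export enum ContentType {
  HTML = "HTML",
  PLAIN_TEXT = "PLAIN_TEXT",
  UNKNOWN = "UNKNOWN",
}

// Matches an opening, closing, or self-closing tag from the list in the notes:
// <div>, <p>, <br>, <a>, <b>, <i>. Tags outside that list are not matched.
const HTML_TAG_PATTERN = /<\/?(div|p|br|a|b|i)(\s[^<>]*)?\/?>/i;

export function detectContentType(
  content: string | null | undefined,
): ContentType {
  // Empty or missing content carries no signal either way.
  if (content == null || content.trim() === "") {
    return ContentType.UNKNOWN;
  }
  // Assumption: only real tags count as evidence of HTML. Entities such as
  // "&amp;" also appear in plain text feeds, so they are ignored here.
  if (HTML_TAG_PATTERN.test(content)) {
    return ContentType.HTML;
  }
  return ContentType.PLAIN_TEXT;
}
```

In this sketch a stray `<` or `>` in prose (for example `a < b`) is not mistaken for markup, because a match requires a known tag name directly after the bracket.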
--- a/tasks/rss-content-parsing/04-html-content-extraction.md
+++ b/tasks/rss-content-parsing/04-html-content-extraction.md
@@ -0,0 +1,47 @@
+# 04. Implement HTML Content Extraction
+
+meta:
+  id: rss-content-parsing-04
+  feature: rss-content-parsing
+  priority: P2
+  depends_on: [rss-content-parsing-03]
+  tags: [rss, parsing, html]
+
+objective:
+- Parse HTML content from RSS feed descriptions
+- Extract and sanitize text content
+- Convert HTML to plain text for display
+
+deliverables:
+- HTML to text conversion utility
+- Sanitization function for XSS prevention
+- Updated RSS parser integration
+
+steps:
+1. Create `src/utils/html-to-text.ts`
+2. Implement HTML-to-text conversion algorithm
+3. Add XSS sanitization for extracted content
+4. Handle common HTML elements (paragraphs, lists, links)
+5. Update `parseRSSFeed()` to use new HTML parser
+
+tests:
+- Unit: Test HTML to text conversion accuracy
+- Integration: Test with HTML-rich RSS feeds
+- Security: Test XSS sanitization with malicious HTML
+
+acceptance_criteria:
+- HTML content is converted to readable plain text
+- No HTML tags remain in output
+- Sanitization prevents XSS: output contains no executable markup (`<script>` elements, inline event handlers, `javascript:` URLs)
+- Links are properly converted to text format
+
+validation:
+- Test with podcast descriptions containing HTML
+- Verify text is readable and properly formatted
+- Check for any HTML tag remnants
+
+notes:
+- Use existing `decodeEntities()` function from rss-parser.ts
+- Preserve line breaks and paragraph structure
+- Convert URLs to text format (e.g., "Visit example.com")
+- Consider using a dedicated HTML parser such as `htmlparser2` or `cheerio`; note that `html-escaper` only escapes and unescapes entities and does not parse HTML
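The conversion steps in task 04 might look roughly like the sketch below. It is regex-based only to stay dependency-free and is not a substitute for a real parser or sanitizer. Several things are assumptions: the function name `htmlToText`, the `text (url)` link format, and the local `decodeEntities()` stand-in (the existing function in rss-parser.ts is not shown in the task, so its behavior is guessed).

```typescript
// Stand-in for the existing decodeEntities() in rss-parser.ts. Its real
// behavior is unknown here; this version handles five named entities only.
function decodeEntities(text: string): string {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, "&"); // decoded last so "&amp;lt;" becomes "&lt;"
}

// Hypothetical sketch for src/utils/html-to-text.ts.
export function htmlToText(html: string): string {
  const stripped = html
    // Drop script and style elements together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, "")
    // Rewrite links as "text (url)"; only double-quoted href values are handled.
    .replace(
      /<a\b[^>]*\bhref\s*=\s*"([^"]*)"[^>]*>([\s\S]*?)<\/a\s*>/gi,
      "$2 ($1)",
    )
    // Preserve line breaks, list items, and block boundaries as newlines.
    .replace(/<br\s*\/?>/gi, "\n")
    .replace(/<li\b[^>]*>/gi, "\n- ")
    .replace(/<\/(p|div|ul|ol)\s*>/gi, "\n\n")
    // Remove every remaining tag.
    .replace(/<[^>]+>/g, "");

  // Entities are decoded after tag removal, so escaped markup such as
  // "&lt;b&gt;" ends up as literal text. The result must be rendered as
  // text and never re-inserted as HTML.
  return decodeEntities(stripped)
    .replace(/[ \t]+\n/g, "\n")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}
```

For example, `htmlToText("<ul><li>One</li><li>Two</li></ul>")` yields two lines, `- One` and `- Two`. A parser-backed version would be more robust against malformed markup than these patterns.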
--- a/tasks/rss-content-parsing/05-plain-text-content-handling.md
+++ b/tasks/rss-content-parsing/05-plain-text-content-handling.md
@@ -0,0 +1,45 @@
+# 05. Maintain Plain Text Fallback Handling
+
+meta:
+  id: rss-content-parsing-05
+  feature: rss-content-parsing
+  priority: P2
+  depends_on: [rss-content-parsing-03]
+  tags: [rss, parsing, fallback]
+
+objective:
+- Ensure plain text RSS feeds continue to work correctly
+- Maintain backward compatibility with existing functionality
+- Handle mixed content scenarios
+
+deliverables:
+- Updated `parseRSSFeed()` for HTML support
+- Plain text handling path remains unchanged
+- Error handling for parsing failures
+
+steps:
+1. Update `parseRSSFeed()` to use content type detection
+2. Route to HTML parser or plain text path based on type
+3. Add error handling for parsing failures
+4. Test with both HTML and plain text feeds
+5. Verify backward compatibility
+
+tests:
+- Integration: Test with plain text RSS feeds
+- Integration: Test with HTML RSS feeds
+- Regression: Verify existing functionality still works
+
+acceptance_criteria:
+- Plain text feeds parse without errors
+- HTML feeds parse correctly with sanitization
+- No regression in existing functionality
+
+validation:
+- Test with various podcast RSS feeds
+- Verify descriptions display correctly
+- Check for any parsing errors
+
+notes:
+- Plain text path uses existing `decodeEntities()` logic
+- Keep existing `parseRSSFeed()` structure for plain text
+- Add logging for parsing strategy selection
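The routing and fallback described in task 05 could be sketched as below. The helper name `normalizeDescription` and the `ContentHandlers` interface are hypothetical; the detector, HTML converter, and `decodeEntities()` are passed in as parameters so the sketch makes no claim about how `parseRSSFeed()` is actually structured.

```typescript
// Hypothetical types; the real detector from task 03 may use an enum instead.
type ContentType = "HTML" | "PLAIN_TEXT" | "UNKNOWN";

interface ContentHandlers {
  detect: (raw: string) => ContentType;
  htmlToText: (html: string) => string;
  decodeEntities: (text: string) => string;
  log?: (message: string) => void;
}

// Picks a parsing strategy for one description field and falls back to the
// plain text path whenever HTML conversion throws.
export function normalizeDescription(
  raw: string,
  handlers: ContentHandlers,
): string {
  const log = handlers.log ?? (() => {});
  const type = handlers.detect(raw);
  log(`rss description strategy: ${type}`);

  if (type === "HTML") {
    try {
      return handlers.htmlToText(raw);
    } catch (error) {
      log(`HTML conversion failed, using plain text path: ${String(error)}`);
    }
  }
  // PLAIN_TEXT, UNKNOWN, and failed HTML conversions all take the existing
  // decodeEntities() path, which keeps current behavior for plain text feeds.
  return handlers.decodeEntities(raw);
}
```

Sending UNKNOWN content down the plain text path is one possible reading of "handled gracefully"; a stricter design could return an empty string or surface an error instead.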
--- a/tasks/rss-content-parsing/README.md
+++ b/tasks/rss-content-parsing/README.md
@@ -0,0 +1,18 @@
+# HTML vs Plain Text RSS Parsing
+
+Objective: Detect and handle both HTML and plain text content in RSS feeds
+
+Status legend: [ ] todo, [~] in-progress, [x] done
+
+Tasks
+- [ ] 03 — Add content type detection utility → `03-rss-content-detection.md`
+- [ ] 04 — Implement HTML content extraction → `04-html-content-extraction.md`
+- [ ] 05 — Maintain plain text fallback handling → `05-plain-text-content-handling.md`
+
+Dependencies
+- 03 -> 04
+- 03 -> 05
+
+Exit criteria
+- RSS feeds with HTML content are properly parsed and sanitized
+- Plain text feeds continue to work as before