final feature set
45
tasks/rss-content-parsing/03-rss-content-detection.md
Normal file
@@ -0,0 +1,45 @@
# 03. Add RSS Content Type Detection

meta:
id: rss-content-parsing-03
feature: rss-content-parsing
priority: P2
depends_on: []
tags: [rss, parsing, utilities]

objective:
- Create utility to detect if RSS feed content is HTML or plain text
- Analyze content type in description and other text fields
- Return appropriate parsing strategy

deliverables:
- Content type detection function
- Type classification utility
- Integration points for different parsers

steps:
1. Create `src/utils/rss-content-detector.ts`
2. Implement content type detection based on HTML tags
3. Add detection for common HTML entities and tags
4. Return type enum (HTML, PLAIN_TEXT, UNKNOWN)
5. Add unit tests for detection accuracy

tests:
- Unit: Test HTML detection with various HTML snippets
- Unit: Test plain text detection with text-only content
- Unit: Test edge cases (mixed content, malformed HTML)

acceptance_criteria:
- Function correctly identifies HTML vs plain text content
- Handles common HTML patterns and entities
- Returns UNKNOWN for unclassifiable content

validation:
- Test with HTML description from real RSS feeds
- Test with plain text descriptions
- Verify UNKNOWN cases are handled gracefully

notes:
- Look for common HTML tags: `<div>`, `<p>`, `<br>`, `<a>`, `<b>`, `<i>`
- Check for HTML entities: `&lt;`, `&gt;`, `&amp;`, `&quot;`, `&apos;`
- Consider content length threshold for HTML detection
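
A minimal sketch of the detector described in the steps and notes above. The names `ContentType` and `detectContentType` are hypothetical, the tag list is limited to the tags named in the notes, and treating entity-only content as HTML is an assumption; the length threshold is not modelled here.

```typescript
// Hypothetical sketch for src/utils/rss-content-detector.ts.
// Names and classification rules are assumptions, not the final design.
export enum ContentType {
  HTML = "HTML",
  PLAIN_TEXT = "PLAIN_TEXT",
  UNKNOWN = "UNKNOWN",
}

// Opening, closing, or self-closing forms of the tags listed in the notes.
const HTML_TAG_PATTERN = /<\/?(div|p|br|a|b|i)(\s[^<>]*)?\/?>/i;

// Named entities from the notes, plus numeric forms such as &#39; or &#x27;.
const HTML_ENTITY_PATTERN = /&(lt|gt|amp|quot|apos|#\d+|#x[0-9a-f]+);/i;

export function detectContentType(
  content: string | null | undefined,
): ContentType {
  // Missing or whitespace-only content cannot be classified.
  if (content == null) return ContentType.UNKNOWN;
  const text = content.trim();
  if (text.length === 0) return ContentType.UNKNOWN;

  // A known tag is the strongest signal.
  if (HTML_TAG_PATTERN.test(text)) return ContentType.HTML;

  // Assumption: escaped entities alone are enough to route to the HTML path.
  if (HTML_ENTITY_PATTERN.test(text)) return ContentType.HTML;

  return ContentType.PLAIN_TEXT;
}
```

Bare comparison operators such as `a < b` do not match the tag pattern, so prose containing them stays on the plain text path. Whether entity-only content should count as HTML is a judgment call that the edge-case tests can settle.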
47
tasks/rss-content-parsing/04-html-content-extraction.md
Normal file
@@ -0,0 +1,47 @@
# 04. Implement HTML Content Extraction

meta:
id: rss-content-parsing-04
feature: rss-content-parsing
priority: P2
depends_on: [rss-content-parsing-03]
tags: [rss, parsing, html]

objective:
- Parse HTML content from RSS feed descriptions
- Extract and sanitize text content
- Convert HTML to plain text for display

deliverables:
- HTML to text conversion utility
- Sanitization function for XSS prevention
- Updated RSS parser integration

steps:
1. Create `src/utils/html-to-text.ts`
2. Implement HTML-to-text conversion algorithm
3. Add XSS sanitization for extracted content
4. Handle common HTML elements (paragraphs, lists, links)
5. Update `parseRSSFeed()` to use new HTML parser

tests:
- Unit: Test HTML to text conversion accuracy
- Integration: Test with HTML-rich RSS feeds
- Security: Test XSS sanitization with malicious HTML

acceptance_criteria:
- HTML content is converted to readable plain text
- No HTML tags remain in output
- Sanitization prevents XSS attacks
- Links are properly converted to text format

validation:
- Test with podcast descriptions containing HTML
- Verify text is readable and properly formatted
- Check for any HTML tag remnants

notes:
- Use existing `decodeEntities()` function from rss-parser.ts
- Preserve line breaks and paragraph structure
- Convert URLs to text format (e.g., "Visit example.com")
- Consider using a lightweight HTML parser like `cheerio`; note that `html-escaper` only escapes and unescapes entities and does not parse HTML
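
A rough, dependency-free sketch of the conversion described in the steps above. It uses regular expressions rather than a real HTML parser, so it is an illustration of the intended behaviour and not a vetted sanitizer. `htmlToText` and `decodeEntitiesStandIn` are hypothetical names; the stand-in assumes the existing `decodeEntities()` maps a string to a string, and the `label (URL)` link format is an assumption.

```typescript
// Hypothetical sketch for src/utils/html-to-text.ts (regex-based, not a parser).

// Stand-in for the existing decodeEntities() in rss-parser.ts, whose real
// signature is not shown in the task; assumed here to be string -> string.
function decodeEntitiesStandIn(text: string): string {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&(apos|#39);/g, "'")
    .replace(/&nbsp;/g, " ")
    .replace(/&amp;/g, "&"); // last, so "&amp;lt;" becomes "&lt;" and not "<"
}

// Anything shaped like a tag: "<" or "</" followed directly by a letter.
const TAG_PATTERN = /<\/?[a-z][^>]*>/gi;
const LINK_PATTERN =
  /<a\b[^>]*\bhref\s*=\s*["']([^"']*)["'][^>]*>([\s\S]*?)<\/a\s*>/gi;

export function htmlToText(html: string): string {
  const converted = html
    // Drop script/style elements together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, "")
    // Drop HTML comments.
    .replace(/<!--[\s\S]*?-->/g, "")
    // Links become "label (URL)"; non-http(s) URLs keep only the label.
    .replace(LINK_PATTERN, (_match: string, url: string, label: string) =>
      /^https?:\/\//i.test(url) ? `${label} (${url})` : label,
    )
    // List items start on their own line with a dash.
    .replace(/<li\b[^>]*>/gi, "\n- ")
    // <br> becomes a line break; closing block tags become a blank line.
    .replace(/<br\s*\/?>/gi, "\n")
    .replace(/<\/(p|div|ul|ol|h[1-6])\s*>/gi, "\n\n")
    // Remove every remaining tag.
    .replace(TAG_PATTERN, "");

  // Decode entities, then strip again in case decoding revealed escaped tags.
  const decoded = decodeEntitiesStandIn(converted).replace(TAG_PATTERN, "");

  // Normalise whitespace while keeping paragraph breaks.
  return decoded
    .replace(/[ \t]+/g, " ")
    .replace(/ *\n */g, "\n")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}
```

The second tag-stripping pass trades fidelity for the "no HTML tags remain" criterion: escaped markup such as `&lt;b&gt;` is removed rather than shown literally. A parser-based implementation may make a different choice.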
45
tasks/rss-content-parsing/05-plain-text-content-handling.md
Normal file
@@ -0,0 +1,45 @@
# 05. Maintain Plain Text Fallback Handling

meta:
id: rss-content-parsing-05
feature: rss-content-parsing
priority: P2
depends_on: [rss-content-parsing-03]
tags: [rss, parsing, fallback]

objective:
- Ensure plain text RSS feeds continue to work correctly
- Maintain backward compatibility with existing functionality
- Handle mixed content scenarios

deliverables:
- Updated `parseRSSFeed()` for HTML support
- Plain text handling path remains unchanged
- Error handling for parsing failures

steps:
1. Update `parseRSSFeed()` to use content type detection
2. Route to HTML parser or plain text path based on type
3. Add error handling for parsing failures
4. Test with both HTML and plain text feeds
5. Verify backward compatibility

tests:
- Integration: Test with plain text RSS feeds
- Integration: Test with HTML RSS feeds
- Regression: Verify existing functionality still works

acceptance_criteria:
- Plain text feeds parse without errors
- HTML feeds parse correctly with sanitization
- No regression in existing functionality

validation:
- Test with various podcast RSS feeds
- Verify descriptions display correctly
- Check for any parsing errors

notes:
- Plain text path uses existing `decodeEntities()` logic
- Keep existing `parseRSSFeed()` structure for plain text
- Add logging for parsing strategy selection
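
The routing in the steps above could look roughly like the sketch below. `parseDescription`, `DescriptionParsers`, and `ContentKind` are hypothetical names; the detector, HTML converter, and `decodeEntities()` are passed in because their real signatures are not shown here. Sending UNKNOWN content and failed HTML conversions down the plain text path is an assumption about what "fallback" should mean.

```typescript
// Hypothetical routing helper that parseRSSFeed() could call per description.
type ContentKind = "HTML" | "PLAIN_TEXT" | "UNKNOWN";

interface DescriptionParsers {
  detect: (raw: string) => ContentKind; // detector from task 03
  htmlToText: (html: string) => string; // converter from task 04
  decodeEntities: (text: string) => string; // existing plain text logic
  log?: (message: string) => void; // optional strategy logging
}

export function parseDescription(
  raw: string,
  parsers: DescriptionParsers,
): string {
  const log = parsers.log ?? (() => {});
  const kind = parsers.detect(raw);
  log(`description parsing strategy: ${kind}`);

  if (kind === "HTML") {
    try {
      return parsers.htmlToText(raw);
    } catch (error) {
      // Assumption: a failed HTML conversion falls back to the plain text path.
      log(`HTML parsing failed, using plain text path: ${String(error)}`);
    }
  }

  // PLAIN_TEXT, UNKNOWN, and failed HTML all use the unchanged existing path.
  return parsers.decodeEntities(raw);
}
```

Injecting the three functions keeps the plain text path untouched and makes the regression tests straightforward: each test supplies stubs and checks which path was taken. Note that a failed HTML conversion would pass raw markup to the plain text path, so the caller may prefer a different fallback for display.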
18
tasks/rss-content-parsing/README.md
Normal file
@@ -0,0 +1,18 @@
# HTML vs Plain Text RSS Parsing

Objective: Detect and handle both HTML and plain text content in RSS feeds

Status legend: [ ] todo, [~] in-progress, [x] done

Tasks
- [ ] 03 — Add content type detection utility → `03-rss-content-detection.md`
- [ ] 04 — Implement HTML content parsing → `04-html-content-extraction.md`
- [ ] 05 — Maintain plain text fallback handling → `05-plain-text-content-handling.md`

Dependencies
- 03 -> 04
- 03 -> 05

Exit criteria
- RSS feeds with HTML content are properly parsed and sanitized
- Plain text feeds continue to work as before