🐛 FIX: filter_pure_conversation lets hook content through #26

## @FIX_PATHWAY - Hook Content Leaking into Exports

**Current Behavior**: `filter_pure_conversation()` returns user messages containing `<session-start-hook>` and `<user-prompt-submit-hook>` content

**Impact**: Pollutes semantic search with hook instructions instead of pure conversation

**Root Cause**: 
- File: `claude_parser/filtering/filters.py:35`
- Only checks `isVisibleInTranscriptOnly` flag
- Doesn't check for hook content embedded in regular user messages
- Hook tags appear in user messages WITHOUT the flag set
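The leak described above can be reproduced with a minimal sketch. `flag_only_filter` is a hypothetical stand-in for a filter that consults only `isVisibleInTranscriptOnly`; it is not the code in `filters.py`, and the sample messages are invented for illustration.

```python
from typing import Any, Dict, Iterator, List

Message = Dict[str, Any]


def flag_only_filter(messages: List[Message]) -> Iterator[Message]:
    """Hypothetical stand-in: drops a message only when the flag is truthy."""
    return (m for m in messages if not m.get("isVisibleInTranscriptOnly"))


# Invented sample data: one flagged entry, one unflagged hook, one real question
messages = [
    {"type": "user", "isVisibleInTranscriptOnly": True,
     "text": "flagged transcript-only entry"},
    {"type": "user", "isVisibleInTranscriptOnly": None,
     "text": "<session-start-hook>FEATURE_IMPLEMENTATION:..."},
    {"type": "user", "isVisibleInTranscriptOnly": None,
     "text": "How do I parse a JSONL file?"},
]

kept = list(flag_only_filter(messages))
leaked = [m for m in kept if "<session-start-hook>" in m["text"]]
print(len(kept), len(leaked))  # 2 1
```

The unflagged hook message survives because nothing in the flag check looks at the text itself.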

**Evidence**:
```text
# Message gets through filter but contains hooks:
type=user
isVisibleInTranscriptOnly=None  # Not marked!
text="<session-start-hook>FEATURE_IMPLEMENTATION:..."
```

**Test Results**:
```
Total messages: 424
After filter_pure_conversation: 202
❌ First message has hook content: "<session-start-hook>..."
```

## @PATTERN_SMELL_GATE - Why Manual Checks?

**SMELL DETECTED**: Manual string checking when hooks have structured markers

**QUESTION**: Do hook messages have consistent patterns we can detect?
- `<session-start-hook>`
- `<user-prompt-submit-hook>`
- `<post-tool-use-hook>`
- `<system-reminder>`
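If the tags listed above are indeed consistent, one way to detect them is a single compiled regular expression instead of several separate substring checks. This is only a sketch: `HOOK_TAGS`, `HOOK_TAG_RE`, and `contains_hook_tag` are hypothetical names, not part of `claude_parser`.

```python
import re
from typing import Optional

# Tag names taken from the list above (assumed to be the complete set)
HOOK_TAGS = (
    "session-start-hook",
    "user-prompt-submit-hook",
    "post-tool-use-hook",
    "system-reminder",
)

# Matches any opening tag such as <session-start-hook>
HOOK_TAG_RE = re.compile(r"<(?:%s)>" % "|".join(re.escape(t) for t in HOOK_TAGS))


def contains_hook_tag(text: Optional[str]) -> bool:
    """Return True when the text contains one of the known opening tags."""
    return bool(text) and HOOK_TAG_RE.search(text) is not None


print(contains_hook_tag("<system-reminder>be concise"))  # True
print(contains_hook_tag("tell me about git hooks"))      # False
```

A plain `any(pattern in text ...)` check behaves the same for a handful of tags; the regex form mainly keeps the tag list in one place.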

## @FRAMEWORK_GATE - Required Fix

**Add content-based filtering to `filter_pure_conversation`**:

```python
from typing import Iterator, List


def filter_pure_conversation(messages: List) -> Iterator:
    """Filter pure conversation - exclude tool operations, system messages, and hook content"""
    # Imported inside the function, as in the original; both helpers come from the same module
    from ..messages.utils import get_text, is_hook_message

    # Hook content patterns to exclude
    HOOK_PATTERNS = (
        '<session-start-hook>',
        '<user-prompt-submit-hook>',
        '<post-tool-use-hook>',
        '<system-reminder>',
        '<command-name>',  # CLI commands
        '<local-command-stdout>',  # CLI output
    )

    def is_pure_conversation(msg):
        # Must be user or assistant
        if msg.get('type') not in ('user', 'assistant'):
            return False
        # Skip meta messages
        if msg.get('is_meta', False):
            return False
        # Skip compact summaries
        if msg.get('isCompactSummary', False):
            return False
        # Skip hook messages using util
        if is_hook_message(msg):
            return False

        # NEW: Check content for hook patterns.
        # The `or ''` guards against get_text returning None (an assumption about its behavior).
        text = get_text(msg) or ''
        if any(pattern in text for pattern in HOOK_PATTERNS):
            return False

        return True

    return filter(is_pure_conversation, messages)
```

## @TDD_GATE - Test Requirements

**Add test to verify hook content excluded**:
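```python
from claude_parser.filtering.filters import filter_pure_conversation
from claude_parser.messages.utils import get_text


def test_filter_excludes_hook_content():
    """Hook content in regular messages should be filtered"""
    messages = [
        {'type': 'user', 'text': 'normal user message'},
        {'type': 'user', 'text': '<session-start-hook>hook content here'},
        {'type': 'user', 'text': '<user-prompt-submit-hook>more hooks'},
        {'type': 'assistant', 'text': 'normal assistant response'},
    ]

    filtered = list(filter_pure_conversation(messages))

    assert len(filtered) == 2
    # Check the real tags; the literal '<hook>' never occurs, so testing for it would always pass
    hook_tags = ('<session-start-hook>', '<user-prompt-submit-hook>')
    assert not any(tag in get_text(msg) for msg in filtered for tag in hook_tags)
```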

## @VERIFICATION_GATE - Acceptance Criteria

- [ ] Hook content patterns detected in message text
- [ ] Messages with hook tags excluded from filter
- [ ] Test verifies no hook content passes through
- [ ] Export test: `export_for_llamaindex` returns zero hook content
- [ ] Regression test: Normal conversation still passes through
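The regression criterion could be exercised with cases like the ones below. This is a self-contained sketch: `has_hook_tag` is a hypothetical stand-in for the content check, not the library function, and the sample strings are invented.

```python
# The six tags proposed in the fix
HOOK_PATTERNS = (
    "<session-start-hook>",
    "<user-prompt-submit-hook>",
    "<post-tool-use-hook>",
    "<system-reminder>",
    "<command-name>",
    "<local-command-stdout>",
)


def has_hook_tag(text: str) -> bool:
    """Hypothetical stand-in for the content check: plain substring match."""
    return any(pattern in text for pattern in HOOK_PATTERNS)


# Normal conversation that mentions hooks or angle brackets must still pass
must_pass = [
    "How do git hooks work?",
    "Wrap the banner in a <div> at session start",
    "Is a < b and b > c always transitive?",
]

# Text carrying an actual hook tag must be rejected, wherever the tag sits
must_fail = [
    "<session-start-hook>FEATURE_IMPLEMENTATION:...",
    "thanks! <system-reminder>keep answers short",
]

assert not any(has_hook_tag(t) for t in must_pass)
assert all(has_hook_tag(t) for t in must_fail)
```

Because the check is a substring match, a user message that quotes one of these tags verbatim would also be dropped; whether that is acceptable is a judgment call for the export use case.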

## @DECISION_GATE - Pattern Detection vs Flags

**DECISION**: Use content pattern detection + flags
**RATIONALE**:
- Flags alone insufficient (hooks appear in unflagged messages)
- Content patterns are consistent and well-defined
- Small set of patterns (6 tags)
- Fast string contains check

**ALTERNATIVE REJECTED**: Only use flags
- Incomplete: Hook content appears without flags set
- Leaks hook instructions into exports

## Integration Test Command

```bash
# Verify no hooks in export:
python3 -c "
from claude_parser.export import export_for_llamaindex
from claude_parser.discovery import discover_claude_files

docs = export_for_llamaindex(str(list(discover_claude_files())[0]))
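# Match the real tags; the literal '<hook>' never occurs, so that check was always False
hook_found = any('-hook>' in doc.text or '<system-reminder>' in doc.text for doc in docs)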
print(f'Hook content found: {hook_found}')  # Should be False
"
```

---

**Priority**: High - Polluting semantic search indexes
**Framework**: LNCA v4.1
**Pathway**: @FIX_PATHWAY
**Estimated LOC**: <15 lines changed
