🐛 FIX: filter_pure_conversation lets hook content through #26

@alicoding

@FIX_PATHWAY - Hook Content Leaking into Exports

Current Behavior: filter_pure_conversation() returns user messages containing <session-start-hook> and <user-prompt-submit-hook> content

Impact: Pollutes semantic search with hook instructions instead of pure conversation

Root Cause:

  • File: claude_parser/filtering/filters.py:35
  • Only checks isVisibleInTranscriptOnly flag
  • Doesn't check for hook content embedded in regular user messages
  • Hook tags appear in user messages WITHOUT the flag set
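
The leak can be reproduced in isolation with a minimal sketch. `flag_only_filter` below is a hypothetical stand-in for the check at filters.py:35, not the actual code, and the message dicts are simplified to a flat `text` field:

```python
def flag_only_filter(messages):
    """Hypothetical stand-in: drops a message only when the flag is truthy."""
    return [m for m in messages if not m.get("isVisibleInTranscriptOnly")]


messages = [
    # Flagged hook message: caught by the flag check
    {"type": "user", "isVisibleInTranscriptOnly": True,
     "text": "<session-start-hook>flagged"},
    # Unflagged hook message: same tag, flag is None, so it slips through
    {"type": "user", "isVisibleInTranscriptOnly": None,
     "text": "<session-start-hook>FEATURE_IMPLEMENTATION:..."},
    # Ordinary conversation
    {"type": "user", "text": "normal user message"},
]

kept = flag_only_filter(messages)
leaked = [m for m in kept if "<session-start-hook>" in m["text"]]
print(f"kept={len(kept)} leaked={len(leaked)}")
```

Under these assumptions one hook message survives the flag check, which matches the behavior reported in the evidence.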

Evidence:

# Message gets through filter but contains hooks:
type=user
isVisibleInTranscriptOnly=None  # Not marked!
text="<session-start-hook>FEATURE_IMPLEMENTATION:..."

Test Results:

Total messages: 424
After filter_pure_conversation: 202
❌ First message has hook content: "<session-start-hook>..."

@PATTERN_SMELL_GATE - Why Manual Checks?

SMELL DETECTED: Manual string checking when hooks have structured markers

QUESTION: Do hook messages have consistent patterns we can detect? Yes, they carry these tags:

  • <session-start-hook>
  • <user-prompt-submit-hook>
  • <post-tool-use-hook>
  • <system-reminder>
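
As one possible way to detect these markers, the tags can be matched with a single compiled regex that also reports which tag was found. This is only a sketch: `HOOK_TAG_RE` and `find_hook_tags` are hypothetical names, and it assumes the tags appear as literal opening tags exactly as listed above:

```python
import re

# Assumption: the four tags listed above, matched as literal opening tags.
HOOK_TAG_RE = re.compile(
    r"<(session-start-hook|user-prompt-submit-hook|post-tool-use-hook|system-reminder)>"
)


def find_hook_tags(text):
    """Return the distinct hook tag names found in text, sorted alphabetically."""
    # `text or ""` guards against None so callers need no pre-check
    return sorted(set(HOOK_TAG_RE.findall(text or "")))


print(find_hook_tags("<session-start-hook>FEATURE_IMPLEMENTATION:..."))
print(find_hook_tags("normal user message"))
```

Returning the tag names rather than a boolean makes it easier to see which hook is leaking when a test fails.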

@FRAMEWORK_GATE - Required Fix

Add content-based filtering to filter_pure_conversation:

from typing import Iterator, List


def filter_pure_conversation(messages: List) -> Iterator:
    """Filter pure conversation - exclude tool operations, system messages, and hook content"""
    # Imported once per call instead of once per message
    from ..messages.utils import get_text, is_hook_message

    # Hook content patterns to exclude
    HOOK_PATTERNS = (
        '<session-start-hook>',
        '<user-prompt-submit-hook>',
        '<post-tool-use-hook>',
        '<system-reminder>',
        '<command-name>',  # CLI commands
        '<local-command-stdout>',  # CLI output
    )

    def is_pure_conversation(msg):
        # Must be user or assistant
        if msg.get('type') not in ['user', 'assistant']:
            return False
        # Skip meta messages
        if msg.get('is_meta', False):
            return False
        # Skip compact summaries
        if msg.get('isCompactSummary', False):
            return False
        # Skip hook messages using util
        if is_hook_message(msg):
            return False

        # NEW: Check content for hook patterns
        # `or ''` guards against get_text returning None for messages without text
        text = get_text(msg) or ''
        if any(pattern in text for pattern in HOOK_PATTERNS):
            return False

        return True

    return filter(is_pure_conversation, messages)

@TDD_GATE - Test Requirements

Add test to verify hook content excluded:

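# Import paths inferred from claude_parser/filtering/filters.py and its
# relative import of ..messages.utils; adjust if the package layout differs
from claude_parser.filtering.filters import filter_pure_conversation
from claude_parser.messages.utils import get_text


def test_filter_excludes_hook_content():
    """Hook content in regular messages should be filtered"""
    messages = [
        {'type': 'user', 'text': 'normal user message'},
        {'type': 'user', 'text': '<session-start-hook>hook content here'},
        {'type': 'user', 'text': '<user-prompt-submit-hook>more hooks'},
        {'type': 'assistant', 'text': 'normal assistant response'},
    ]

    filtered = list(filter_pure_conversation(messages))

    # Only the two normal messages should survive
    assert len(filtered) == 2
    # '-hook>' matches the tail of every *-hook tag; the literal '<hook>'
    # never occurs in real messages, so checking for it would always pass
    assert all('-hook>' not in get_text(msg) for msg in filtered)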

@VERIFICATION_GATE - Acceptance Criteria

  • Hook content patterns detected in message text
  • Messages with hook tags excluded from the filter output
  • Test verifies no hook content passes through
  • Export test: export_for_llamaindex returns zero hook content
  • Regression test: Normal conversation still passes through
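
The exclusion and regression criteria can be checked together in one self-contained sketch. `keep_message` is a simplified, hypothetical stand-in for `is_pure_conversation` (type check plus tag check only, reading a flat `text` field), so it illustrates the shape of the test rather than exercising the real filter:

```python
HOOK_PATTERNS = (
    "<session-start-hook>",
    "<user-prompt-submit-hook>",
    "<post-tool-use-hook>",
    "<system-reminder>",
    "<command-name>",
    "<local-command-stdout>",
)


def keep_message(msg):
    """Hypothetical stand-in for is_pure_conversation: type check plus tag check."""
    if msg.get("type") not in ("user", "assistant"):
        return False
    text = msg.get("text") or ""
    return not any(pattern in text for pattern in HOOK_PATTERNS)


def run_acceptance_check():
    """Return (kept, normal): messages that pass, and the messages expected to pass."""
    normal = [
        {"type": "user", "text": "normal user message"},
        {"type": "assistant", "text": "normal assistant response"},
    ]
    # One tagged message per pattern, so every tag is exercised
    tagged = [{"type": "user", "text": f"{p}injected"} for p in HOOK_PATTERNS]
    kept = [m for m in normal + tagged if keep_message(m)]
    return kept, normal


kept, normal = run_acceptance_check()
print(kept == normal)
```

Generating one tagged message per pattern keeps the test in step with the pattern list if tags are added later.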

@DECISION_GATE - Pattern Detection vs Flags

DECISION: Use content pattern detection + flags
RATIONALE:

  • Flags alone insufficient (hooks appear in unflagged messages)
  • Content patterns are consistent and well-defined
  • Small set of patterns (6 tags)
  • Fast string contains check

ALTERNATIVE REJECTED: Only use flags

  • Incomplete: Hook content appears without flags set
  • Leaks hook instructions into exports

Integration Test Command

# Verify no hooks in export:
python3 -c "
from claude_parser.export import export_for_llamaindex
from claude_parser.discovery import discover_claude_files

docs = export_for_llamaindex(str(list(discover_claude_files())[0]))
hook_found = any(tag in doc.text for doc in docs for tag in ('-hook>', '<system-reminder>'))
print(f'Hook content found: {hook_found}')  # Should be False
"

Priority: High - Polluting semantic search indexes
Framework: LNCA v4.1
Pathway: @FIX_PATHWAY
Estimated LOC: <15 lines changed
