🔧 FIX: export_for_llamaindex returns dicts instead of Document objects #25

@alicoding

Description

@FIX_PATHWAY - API Contract Violation

Current Behavior: export_for_llamaindex() returns plain dicts (Dict[str, Any]), but callers expect LlamaIndex Document objects

Impact: Breaks integration with semantic-search-service and forces callers to add a manual conversion step

Root Cause:

  • File: claude_parser/export/llamaindex.py:36
  • Function returns dicts with {'text': str, 'metadata': dict}
  • Callers use VectorStoreIndex.from_documents() which expects Document objects

Evidence from LlamaIndex Docs:

# Correct pattern (from llama_index docs):
from llama_index.core import Document, VectorStoreIndex

documents = [
    Document(text="content", metadata={"key": "value"}),
    Document(text="more", metadata={"key": "value"})
]
index = VectorStoreIndex.from_documents(documents)  # ONE LINE

@FRAMEWORK_GATE Required Fix

Change _extract_document to return Document objects:

from typing import Any, Dict

def _extract_document(msg: Dict[str, Any]) -> "Document":
    """Transform message to LlamaIndex Document object

    @FRAMEWORK_FIRST: Return framework types, not dicts
    """
    from llama_index.core import Document  # Optional dependency
    
    return Document(
        text=get_text(msg),
        metadata={
            'speaker': msg.get('type', 'unknown'),
            'uuid': msg.get('uuid', ''),
            'timestamp': msg.get('timestamp', ''),
            'session_id': msg.get('sessionId', '')
        }
    )
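A minimal runnable sketch of the fixed extraction logic. A small dataclass stands in for llama_index.core.Document so the example runs without llama-index installed, and get_text is a hypothetical helper (the real one lives in claude_parser) that pulls text out of a raw message dict:

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class Document:
    """Stand-in for llama_index.core.Document, for illustration only."""
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)

def get_text(msg: Dict[str, Any]) -> str:
    """Hypothetical helper: extract text from a raw Claude message dict."""
    content = msg.get("message", {}).get("content", "")
    if isinstance(content, list):  # messages may carry a list of content blocks
        return " ".join(
            block.get("text", "") for block in content if isinstance(block, dict)
        )
    return content if isinstance(content, str) else ""

def _extract_document(msg: Dict[str, Any]) -> Document:
    return Document(
        text=get_text(msg),
        metadata={
            "speaker": msg.get("type", "unknown"),
            "uuid": msg.get("uuid", ""),
            "timestamp": msg.get("timestamp", ""),
            "session_id": msg.get("sessionId", ""),
        },
    )

msg = {
    "type": "user",
    "uuid": "u-1",
    "timestamp": "2024-01-01T00:00:00Z",
    "sessionId": "s-1",
    "message": {"content": "hello"},
}
doc = _extract_document(msg)
```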

Update return type annotation:

def export_for_llamaindex(
    jsonl_path: str,
    batch_size: Optional[int] = None
) -> Union[List["Document"], Iterator[List["Document"]]]:  # Not dicts!

(Quote the Document annotation, or guard the import behind typing.TYPE_CHECKING, so llama-index stays an optional dependency at import time.)
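One way the dual return shape could be wired up, sketched end to end. Plain dicts stand in for Document objects so the example runs without llama-index, and the JSONL reading is deliberately simplified; the real implementation in claude_parser/export/llamaindex.py may differ:

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterator, List, Optional, Union

def _extract_document(msg: Dict[str, Any]) -> Dict[str, Any]:
    # Placeholder: the real function returns llama_index Document objects.
    return {"text": msg.get("text", ""), "metadata": {"uuid": msg.get("uuid", "")}}

def export_for_llamaindex(
    jsonl_path: str,
    batch_size: Optional[int] = None,
) -> Union[List[Dict[str, Any]], Iterator[List[Dict[str, Any]]]]:
    def read_messages() -> Iterator[Dict[str, Any]]:
        with open(jsonl_path, encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    yield json.loads(line)

    if batch_size is None:
        # Eager mode: one flat list, ready for from_documents()
        return [_extract_document(m) for m in read_messages()]

    def batches() -> Iterator[List[Dict[str, Any]]]:
        batch: List[Dict[str, Any]] = []
        for m in read_messages():
            batch.append(_extract_document(m))
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:  # final partial batch
            yield batch

    return batches()

# Demo: five messages with batch_size=2 yield chunks of 2, 2, 1
path = os.path.join(tempfile.mkdtemp(), "conversation.jsonl")
with open(path, "w", encoding="utf-8") as fh:
    for i in range(5):
        fh.write(json.dumps({"text": f"msg {i}", "uuid": str(i)}) + "\n")

docs = export_for_llamaindex(path)
chunk_sizes = [len(chunk) for chunk in export_for_llamaindex(path, batch_size=2)]
```

The key point for the acceptance criteria: batch mode keeps yielding lists, so only the element type changes from dict to Document.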

Update docstring with TRUE one-liner example:

"""Export conversation as LlamaIndex Document objects
    
Returns:
    List[Document] or Iterator[List[Document]] - Ready for VectorStoreIndex.from_documents()
    
Example - TRUE ONE-LINER for long-term memory:
    from llama_index.core import VectorStoreIndex
    from claude_parser.export import export_for_llamaindex
    
    # ONE line to create searchable memory:
    index = VectorStoreIndex.from_documents(
        export_for_llamaindex("conversation.jsonl")
    )
"""

@TDD_GATE - Test Requirements

Update test to verify Document objects:

def test_export_returns_document_objects():
    from llama_index.core import Document
    
    result = export_for_llamaindex("test.jsonl")
    
    assert isinstance(result, list)
    assert all(isinstance(doc, Document) for doc in result)
    assert hasattr(result[0], 'text')
    assert hasattr(result[0], 'metadata')

@VERIFICATION_GATE - Acceptance Criteria

  • _extract_document returns Document objects (not dicts)
  • Return type annotation updated to List[Document]
  • Docstring shows one-liner example
  • Tests verify Document objects returned
  • Integration test: VectorStoreIndex.from_documents(result) works
  • No breaking changes to batch mode

@DECISION_GATE - Why Document Objects?

DECISION: Export framework types, not primitives
RATIONALE:

  • Enables true one-liner: VectorStoreIndex.from_documents(export_for_llamaindex(path))
  • Type-safe integration
  • Follows LlamaIndex patterns
  • Optional dependency (import inside function)
  • Zero boilerplate for callers

ALTERNATIVE REJECTED: Keep dicts and document a manual conversion step

  • Requires manual conversion: [Document(**d) for d in result]
  • Not a "one-liner" anymore
  • Violates @FRAMEWORK_FIRST principle

@MEMORY_UPDATE_GATE

After fix, update project navigator:

LlamaIndex | EXISTS | Document-export-API | search:"export_for_llamaindex"
Integration | FIXED | One-liner-memory-creation | search:"VectorStoreIndex.from_documents"

Priority: High - Blocking semantic-search integration
Framework: LNCA v4.1
Pathway: @FIX_PATHWAY
Estimated LOC: <10 lines changed
