feat(core): implement safety check for embedding dimension mismatch #2473

safishamsi · 2025-12-03T00:27:04Z

Currently, switching embedding models (e.g., from OpenAI to BGE-M3) without clearing the working_dir either fails silently or surfaces as a dimension mismatch error deep in the vector store logic.

This PR introduces a _check_embedding_config method that acts as a handshake protocol during initialization:

Persists Metadata: Saves a lightrag_meta.json file upon the first successful initialization.

Validates Configuration: Compares the currently configured embedding_dim against the persisted metadata on subsequent runs.

Prevents Corruption: Raises a clear, actionable ValueError if a mismatch is detected, halting execution before any write operations occur.
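For reviewers who want the handshake in concrete terms, here is a minimal standalone sketch of the check described above. It is not the code from this PR: the free function `check_embedding_config`, its parameters, and the JSON keys `embedding_model` and `embedding_dim` are assumptions made for illustration. Only the file name lightrag_meta.json, the ValueError, and the error wording come from this description.

```python
import json
from pathlib import Path

META_FILENAME = "lightrag_meta.json"  # file name taken from the PR description


def check_embedding_config(working_dir: str, embedding_dim: int,
                           model_name: str = "unknown") -> None:
    """Persist embedding metadata on the first run; validate it on later runs.

    Hypothetical helper: the JSON keys and this signature are assumptions,
    not the method added by the PR.
    """
    meta_path = Path(working_dir) / META_FILENAME

    if not meta_path.exists():
        # First run: record the current configuration for future comparisons.
        meta_path.parent.mkdir(parents=True, exist_ok=True)
        meta_path.write_text(
            json.dumps({"embedding_model": model_name,
                        "embedding_dim": embedding_dim})
        )
        return

    # Subsequent run: compare the configured dimension with the persisted one.
    saved_dim = json.loads(meta_path.read_text()).get("embedding_dim")
    if saved_dim != embedding_dim:
        # Fail before any storage is touched, so existing vectors stay intact.
        raise ValueError(
            f"Embedding dimension mismatch! Existing data uses dimension "
            f"{saved_dim}, but current configuration uses {embedding_dim}. "
            f"Please clear the directory."
        )
```

In this sketch the check has to run before any storage backend is initialized; otherwise a mismatched run could already have written vectors of the wrong size.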

Architecture Logic

The following diagram illustrates the new initialization flow introduced in this PR:

graph TD
    A[Start: LightRAG Initialization] --> B{Metadata File Exists?}
    
    %% Path 1: First Run
    B -- No --> C[Create lightrag_meta.json]
    C --> D[Save Current Model Name & Dimension]
    D --> E[Proceed to Initialize Storage]

    %% Path 2: Subsequent Run
    B -- Yes --> F[Load Saved Metadata]
    F --> G{Compare Dimensions}
    
    %% Logic Branch
    G -- Match --> E
    G -- Mismatch --> H[Raise ValueError]
    H --> I[Stop Execution]
    
    style H fill:#f9f,stroke:#333,stroke-width:2px
    style I fill:#bbf,stroke:#333,stroke-width:2px

I created a reproduction script using mock embeddings to simulate a user switching models between runs.

Phase 1 (Initial Setup): System correctly initializes and saves metadata.

Phase 2 (The Crash Test): System correctly BLOCKS an invalid model switch (configured dimension 768 vs. persisted dimension 1536) and raises the new error message.

Phase 3 (Regression Test): System correctly loads when the valid model is restored.

Test Logs:

--- PHASE 1: Initial Setup (Model A) ---
Initializing LightRAG with 'OpenAI_Mock' (Dimension: 1536)...
SUCCESS: 'lightrag_meta.json' created successfully.

--- PHASE 2: The Safety Check (Model B) ---
Attempting to re-initialize with 'BGE_Mock' (Dimension: 768)...
SUCCESS: The system caught the error.
Captured Error Message:
--> Embedding dimension mismatch! Existing data uses dimension 1536, but current configuration uses 768. Please clear the directory.

--- PHASE 3: Regression Test (Model A again) ---
Re-initializing with correct model (Dimension: 1536)...
SUCCESS: The system allowed the correct model to proceed.
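The three phases can be approximated with a script along the following lines. This is a rough sketch, not the actual reproduction script: `MockEmbedding`, `init_with_check`, and `run_phases` are hypothetical names, and `init_with_check` is only a stand-in for LightRAG initialization that applies the same check-then-persist idea against a temporary directory.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class MockEmbedding:
    """Hypothetical stand-in for an embedding function; only its metadata matters."""
    name: str
    embedding_dim: int


def init_with_check(working_dir: Path, embedding: MockEmbedding) -> None:
    """Stand-in for initialization (an assumption, not LightRAG's code)."""
    meta_path = working_dir / "lightrag_meta.json"
    if meta_path.exists():
        saved_dim = json.loads(meta_path.read_text())["embedding_dim"]
        if saved_dim != embedding.embedding_dim:
            raise ValueError(
                f"Embedding dimension mismatch! Existing data uses dimension "
                f"{saved_dim}, but current configuration uses "
                f"{embedding.embedding_dim}. Please clear the directory."
            )
        return
    # First run: persist the model name and dimension.
    meta_path.write_text(
        json.dumps({"model": embedding.name,
                    "embedding_dim": embedding.embedding_dim})
    )


def run_phases() -> list[str]:
    """Run the three phases and return one outcome label per phase."""
    model_a = MockEmbedding("OpenAI_Mock", 1536)
    model_b = MockEmbedding("BGE_Mock", 768)
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        working_dir = Path(tmp)

        # Phase 1: first initialization should create the metadata file.
        init_with_check(working_dir, model_a)
        created = (working_dir / "lightrag_meta.json").exists()
        results.append("created" if created else "missing")

        # Phase 2: a different dimension should be rejected.
        try:
            init_with_check(working_dir, model_b)
            results.append("not blocked")
        except ValueError:
            results.append("blocked")

        # Phase 3: restoring the original model should pass the check again.
        init_with_check(working_dir, model_a)
        results.append("allowed")
    return results


if __name__ == "__main__":
    print(run_phases())
```

Using a temporary directory keeps each run isolated, so a leftover lightrag_meta.json from an earlier attempt cannot mask a failure in Phase 1.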

captainmirk added 2 commits December 3, 2025 00:20

feat(core): implement safety check for embedding dimension mismatch

896e203

style: fix linting and trailing whitespace

c806694
