
Commit ac6de74 (merge of 2 parents: 22b1434 + 1feed33)

89 files changed: +4115 −1872 lines


.gitignore

Lines changed: 2 additions & 1 deletion

````diff
@@ -166,4 +166,5 @@ google-cloud-cli-469.0.0-linux-x86_64.tar.gz
 /backend/src/merged_files
 /backend/src/chunks
 /backend/merged_files
-google-cloud-cli-476.0.0-linux-x86_64.tar.gz
+/backend/chunks
+google-cloud-cli-479.0.0-linux-x86_64.tar.gz
````

README.md

Lines changed: 88 additions & 48 deletions

````diff
@@ -5,41 +5,105 @@ Files can be uploaded from local machine or S3 bucket and then LLM model can be
 
 ### Getting started
 
-:warning:
-For the backend, if you want to run the LLM KG Builder locally, and don't need the GCP/VertexAI integration, make sure to have the following set in your ENV file :
+:warning: You will need to have a Neo4j Database V5.15 or later with [APOC installed](https://neo4j.com/docs/apoc/current/installation/) to use this Knowledge Graph Builder.
+You can use any [Neo4j Aura database](https://neo4j.com/aura/) (including the free database)
+If you are using Neo4j Desktop, you will not be able to use the docker-compose but will have to follow the [separate deployment of backend and frontend section](#running-backend-and-frontend-separately-dev-environment). :warning:
 
+### Deploy locally
+#### Running through docker-compose
+By default only OpenAI and Diffbot are enabled since Gemini requires extra GCP configurations.
+
+In your root folder, create a .env file with your OPENAI and DIFFBOT keys (if you want to use both):
+```env
+OPENAI_API_KEY="your-openai-key"
+DIFFBOT_API_KEY="your-diffbot-key"
+```
+
+if you only want OpenAI:
+```env
+LLM_MODELS="OpenAI GPT 3.5,OpenAI GPT 4o"
+OPENAI_API_KEY="your-openai-key"
+```
+
+if you only want Diffbot:
+```env
+LLM_MODELS="Diffbot"
+DIFFBOT_API_KEY="your-diffbot-key"
+```
+
+You can then run Docker Compose to build and start all components:
+```bash
+docker-compose up --build
+```
+
+##### Additional configs
+
+By default, the input sources will be: Local files, Youtube, Wikipedia and AWS S3. As this default config is applied:
 ```env
-GEMINI_ENABLED = False
-GCP_LOG_METRICS_ENABLED = False
+REACT_APP_SOURCES="local,youtube,wiki,s3"
 ```
 
-And for the frontend, make sure to export your local backend URL before running docker-compose by having the BACKEND_API_URL set in your ENV file :
+If however you want the Google GCS integration, add `gcs` and your Google client ID:
 ```env
-BACKEND_API_URL="http://localhost:8000"
+REACT_APP_SOURCES="local,youtube,wiki,s3,gcs"
+GOOGLE_CLIENT_ID="xxxx"
 ```
 
-1. Run Docker Compose to build and start all components:
+You can of course combine all (local, youtube, wikipedia, s3 and gcs) or remove any you don't want/need.
+
+
+#### Running Backend and Frontend separately (dev environment)
+Alternatively, you can run the backend and frontend separately:
+
+- For the frontend:
+1. Create the frontend/.env file by copy/pasting the frontend/example.env.
+2. Change values as needed
+3.
 ```bash
-docker-compose up --build
+cd frontend
+yarn
+yarn run dev
 ```
 
-2. Alternatively, you can run specific directories separately:
-
-- For the frontend:
-```bash
-cd frontend
-yarn
-yarn run dev
-```
+- For the backend:
+1. Create the backend/.env file by copy/pasting the backend/example.env.
+2. Change values as needed
+3.
+```bash
+cd backend
+python -m venv envName
+source envName/bin/activate
+pip install -r requirements.txt
+uvicorn score:app --reload
+```
+### ENV
+| Env Variable Name | Mandatory/Optional | Default Value | Description |
+|-------------------------|--------------------|---------------|--------------------------------------------------------------------------------------------------|
+| OPENAI_API_KEY | Mandatory | | API key for OpenAI |
+| DIFFBOT_API_KEY | Mandatory | | API key for Diffbot |
+| EMBEDDING_MODEL | Optional | all-MiniLM-L6-v2 | Model for generating the text embedding (all-MiniLM-L6-v2 , openai , vertexai) |
+| IS_EMBEDDING | Optional | true | Flag to enable text embedding |
+| KNN_MIN_SCORE | Optional | 0.94 | Minimum score for KNN algorithm |
+| GEMINI_ENABLED | Optional | False | Flag to enable Gemini |
+| GCP_LOG_METRICS_ENABLED | Optional | False | Flag to enable Google Cloud logs |
+| NUMBER_OF_CHUNKS_TO_COMBINE | Optional | 6 | Number of chunks to combine when processing embeddings |
+| UPDATE_GRAPH_CHUNKS_PROCESSED | Optional | 20 | Number of chunks processed before updating progress |
+| NEO4J_URI | Optional | neo4j://database:7687 | URI for Neo4j database |
+| NEO4J_USERNAME | Optional | neo4j | Username for Neo4j database |
+| NEO4J_PASSWORD | Optional | password | Password for Neo4j database |
+| LANGCHAIN_API_KEY | Optional | | API key for Langchain |
+| LANGCHAIN_PROJECT | Optional | | Project for Langchain |
+| LANGCHAIN_TRACING_V2 | Optional | true | Flag to enable Langchain tracing |
+| LANGCHAIN_ENDPOINT | Optional | https://api.smith.langchain.com | Endpoint for Langchain API |
+| BACKEND_API_URL | Optional | http://localhost:8000 | URL for backend API |
+| BLOOM_URL | Optional | https://workspace-preview.neo4j.io/workspace/explore?connectURL={CONNECT_URL}&search=Show+me+a+graph&featureGenAISuggestions=true&featureGenAISuggestionsInternal=true | URL for Bloom visualization |
+| REACT_APP_SOURCES | Optional | local,youtube,wiki,s3 | List of input sources that will be available |
+| LLM_MODELS | Optional | Diffbot,OpenAI GPT 3.5,OpenAI GPT 4o | Models available for selection on the frontend, used for entities extraction and Q&A Chatbot |
+| ENV | Optional | DEV | Environment variable for the app |
+| TIME_PER_CHUNK | Optional | 4 | Time per chunk for processing |
+| CHUNK_SIZE | Optional | 5242880 | Size of each chunk for processing |
+| GOOGLE_CLIENT_ID | Optional | | Client ID for Google authentication |
 
-- For the backend:
-```bash
-cd backend
-python -m venv envName
-source envName/bin/activate
-pip install -r requirements.txt
-uvicorn score:app --reload
-```
 
 ###
 To deploy the app and packages on Google Cloud Platform, run the following command on google cloud run:
@@ -64,29 +128,6 @@ Allow unauthenticated request : Yes
 - **Neo4j Integration**: The extracted nodes and relationships are stored in a Neo4j database for easy visualization and querying.
 - **Grid View of source node files with** : Name,Type,Size,Nodes,Relations,Duration,Status,Source,Model
 
-## Setting up Environment Variables
-Create .env file and update the following env variables.\
-OPENAI_API_KEY = ""\
-DIFFBOT_API_KEY = ""\
-NEO4J_URI = ""\
-NEO4J_USERNAME = ""\
-NEO4J_PASSWORD = ""\
-AWS_ACCESS_KEY_ID = ""\
-AWS_SECRET_ACCESS_KEY = ""\
-EMBEDDING_MODEL = ""\
-IS_EMBEDDING = "TRUE"\
-KNN_MIN_SCORE = ""
-
-## Setting up Enviournment Variables For Frontend Configuration
-
-Create .env file in the frontend root folder and update the following env variables.\
-BACKEND_API_URL=""\
-BLOOM_URL=""\
-REACT_APP_SOURCES=""\
-LLM_MODELS=""\
-ENV=""\
-TIME_PER_CHUNK=
-
 ## Functions/Modules
 
 #### extract_graph_from_file(uri, userName, password, file_path, model):
@@ -140,4 +181,3 @@ https://github.com/neo4j-labs/llm-graph-builder/assets/121786590/b725a503-6ade-4
 The Public [ Google cloud Run URL](https://devfrontend-dcavk67s4a-uc.a.run.app).
 [Workspace URL](https://workspace-preview.neo4j.io/workspace)
 
-
````

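Putting the README's snippets together, a complete root `.env` that enables every input source and both LLM providers might look like this (all key values here are placeholders, not real credentials):

```env
OPENAI_API_KEY="your-openai-key"
DIFFBOT_API_KEY="your-diffbot-key"
LLM_MODELS="Diffbot,OpenAI GPT 3.5,OpenAI GPT 4o"
REACT_APP_SOURCES="local,youtube,wiki,s3,gcs"
GOOGLE_CLIENT_ID="xxxx"
```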
backend/example.env

Lines changed: 10 additions & 12 deletions

````diff
@@ -1,24 +1,22 @@
 OPENAI_API_KEY = ""
 DIFFBOT_API_KEY = ""
+GROQ_API_KEY = ""
+EMBEDDING_MODEL = "all-MiniLM-L6-v2"
+IS_EMBEDDING = "true"
+KNN_MIN_SCORE = "0.94"
+# Enable Gemini (default is False) | Can be False or True
+GEMINI_ENABLED = False
+# Enable Google Cloud logs (default is False) | Can be False or True
+GCP_LOG_METRICS_ENABLED = False
+NUMBER_OF_CHUNKS_TO_COMBINE = 6
+UPDATE_GRAPH_CHUNKS_PROCESSED = 20
 NEO4J_URI = ""
 NEO4J_USERNAME = ""
 NEO4J_PASSWORD = ""
 NEO4J_DATABASE = ""
 AWS_ACCESS_KEY_ID = ""
 AWS_SECRET_ACCESS_KEY = ""
-EMBEDDING_MODEL = ""
-IS_EMBEDDING = "TRUE"
-KNN_MIN_SCORE = ""
 LANGCHAIN_API_KEY = ""
 LANGCHAIN_PROJECT = ""
 LANGCHAIN_TRACING_V2 = ""
 LANGCHAIN_ENDPOINT = ""
-NUMBER_OF_CHUNKS_TO_COMBINE = ""
-# NUMBER_OF_CHUNKS_ALLOWED = ""
-# Enable Gemini (default is True)
-GEMINI_ENABLED = True|False
-# Enable Google Cloud logs (default is True)
-GCP_LOG_METRICS_ENABLED = True|False
-UPDATE_GRAPH_CHUNKS_PROCESSED = 20
-NEO4J_USER_AGENT = ""
-UPDATE_GRAPH_CHUNKS_PROCESSED = 20
````

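The example.env above now defaults `GEMINI_ENABLED` and `GCP_LOG_METRICS_ENABLED` to False. A minimal sketch of how such a flag can be read, mirroring the truthy set `true`/`1`/`yes` used in the `backend/score.py` diff; the helper name `env_flag` is ours, not the project's:

```python
import os

def env_flag(name: str, default: str = "False") -> bool:
    """Read an environment variable as a boolean; 'true', '1' and 'yes' enable it."""
    return os.environ.get(name, default).lower() in ("true", "1", "yes")

os.environ["GEMINI_ENABLED"] = "False"
print(env_flag("GEMINI_ENABLED"))   # False: Gemini routes stay disabled
os.environ["GEMINI_ENABLED"] = "yes"
print(env_flag("GEMINI_ENABLED"))   # True
```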
backend/requirements.txt

Lines changed: 4 additions & 0 deletions

````diff
@@ -74,6 +74,7 @@ langchain-community
 langchain-core
 langchain-experimental
 langchain-google-vertexai
+langchain-groq
 langchain-openai
 langchain-text-splitters==0.0.1
 langdetect==1.0.9
@@ -117,6 +118,7 @@ pydantic==2.6.4
 pydantic_core==2.16.3
 pyparsing==3.0.9
 pypdf==4.0.1
+PyPDF2
 pypdfium2==4.27.0
 pytesseract==0.3.10
 python-dateutil==2.8.2
@@ -158,6 +160,7 @@ unstructured
 unstructured-client
 unstructured-inference
 unstructured.pytesseract
+unstructured[all-docs]
 urllib3
 uvicorn
 gunicorn
@@ -168,3 +171,4 @@ youtube-transcript-api==0.6.2
 zipp==3.17.0
 sentence-transformers
 google-cloud-logging==3.10.0
+PyMuPDF==1.24.5
````

backend/score.py

Lines changed: 36 additions & 16 deletions

````diff
@@ -56,7 +56,7 @@ def sick():
     allow_headers=["*"],
 )
 
-is_gemini_enabled = os.environ.get("GEMINI_ENABLED", "True").lower() in ("true", "1", "yes")
+is_gemini_enabled = os.environ.get("GEMINI_ENABLED", "False").lower() in ("true", "1", "yes")
 if is_gemini_enabled:
     add_routes(app,ChatVertexAI(), path="/vertexai")
 
@@ -138,7 +138,8 @@ async def extract_knowledge_graph_from_file(
     file_name=Form(None),
     allowedNodes=Form(None),
     allowedRelationship=Form(None),
-    language=Form(None)
+    language=Form(None),
+    access_token=Form(None)
 ):
     """
     Calls 'extract_graph_from_file' in a new thread to create Neo4jGraph from a
@@ -177,7 +178,7 @@ async def extract_knowledge_graph_from_file(
 
     elif source_type == 'gcs bucket' and gcs_bucket_name:
         result = await asyncio.to_thread(
-            extract_graph_from_file_gcs, graph, model, gcs_project_id, gcs_bucket_name, gcs_bucket_folder, gcs_blob_filename, allowedNodes, allowedRelationship)
+            extract_graph_from_file_gcs, graph, model, gcs_project_id, gcs_bucket_name, gcs_bucket_folder, gcs_blob_filename, access_token, allowedNodes, allowedRelationship)
     else:
         return create_api_response('Failed',message='source_type is other than accepted source')
     if result is not None:
@@ -201,16 +202,16 @@ async def extract_knowledge_graph_from_file(
 @app.get("/sources_list")
 async def get_source_list(uri:str, userName:str, password:str, database:str=None):
     """
-    Calls 'get_source_list_from_graph' which returns list of sources which alreday exist in databse
+    Calls 'get_source_list_from_graph' which returns list of sources which already exist in databse
     """
     try:
         decoded_password = decode_password(password)
         if " " in uri:
-            uri= uri.replace(" ","+")
-            result = await asyncio.to_thread(get_source_list_from_graph,uri,userName,decoded_password,database)
-            josn_obj = {'api_name':'sources_list','db_url':uri}
-            logger.log_struct(josn_obj)
-            return create_api_response("Success",data=result)
+            uri = uri.replace(" ","+")
+        result = await asyncio.to_thread(get_source_list_from_graph,uri,userName,decoded_password,database)
+        josn_obj = {'api_name':'sources_list','db_url':uri}
+        logger.log_struct(josn_obj)
+        return create_api_response("Success",data=result)
     except Exception as e:
         job_status = "Failed"
         message="Unable to fetch source list"
@@ -225,11 +226,11 @@ async def update_similarity_graph(uri=Form(None), userName=Form(None), password=
     """
     try:
         graph = create_graph_database_connection(uri, userName, password, database)
-        result = await asyncio.to_thread(update_graph, graph)
-        logging.info(f"result : {result}")
+        await asyncio.to_thread(update_graph, graph)
+
         josn_obj = {'api_name':'update_similarity_graph','db_url':uri}
         logger.log_struct(josn_obj)
-        return create_api_response('Success',message='Updated KNN Graph',data=result)
+        return create_api_response('Success',message='Updated KNN Graph')
     except Exception as e:
         job_status = "Failed"
         message="Unable to update KNN Graph"
@@ -344,9 +345,12 @@ async def upload_large_file_into_chunks(file:UploadFile = File(...), chunkNumber
         result = await asyncio.to_thread(upload_file, graph, model, file, chunkNumber, totalChunks, originalname, CHUNK_DIR, MERGED_DIR)
         josn_obj = {'api_name':'upload','db_url':uri}
         logger.log_struct(josn_obj)
-        return create_api_response('Success', message=result)
+        if int(chunkNumber) == int(totalChunks):
+            return create_api_response('Success',data=result, message='Source Node Created Successfully')
+        else:
+            return create_api_response('Success', message=result)
     except Exception as e:
-        job_status = "Failed"
+        # job_status = "Failed"
         message="Unable to upload large file into chunks or saving the chunks"
         error_message = str(e)
         logging.info(message)
@@ -359,6 +363,7 @@ async def get_structured_schema(uri=Form(None), userName=Form(None), password=Fo
     try:
         graph = create_graph_database_connection(uri, userName, password, database)
         result = await asyncio.to_thread(get_labels_and_relationtypes, graph)
+        logging.info(f'Schema result from DB: {result}')
         josn_obj = {'api_name':'schema','db_url':uri}
         logger.log_struct(josn_obj)
         return create_api_response('Success', data=result)
@@ -381,6 +386,7 @@ async def update_extract_status(request:Request, file_name, url, userName, passw
     async def generate():
         status = ''
         decoded_password = decode_password(password)
+        uri = url
         if " " in url:
             uri= url.replace(" ","+")
         while True:
@@ -401,7 +407,8 @@ async def generate():
                 'total_chunks':result[0]['total_chunks'],
                 'total_pages':result[0]['total_pages'],
                 'fileSize':result[0]['fileSize'],
-                'processed_chunk':result[0]['processed_chunk']
+                'processed_chunk':result[0]['processed_chunk'],
+                'fileSource':result[0]['fileSource']
             })
         else:
             status = json.dumps({'fileName':file_name, 'status':'Failed'})
@@ -457,7 +464,8 @@ async def get_document_status(file_name, url, userName, password, database):
                 'total_chunks':result[0]['total_chunks'],
                 'total_pages':result[0]['total_pages'],
                 'fileSize':result[0]['fileSize'],
-                'processed_chunk':result[0]['processed_chunk']
+                'processed_chunk':result[0]['processed_chunk'],
+                'fileSource':result[0]['fileSource']
             }
         else:
             status = {'fileName':file_name, 'status':'Failed'}
@@ -484,5 +492,17 @@ async def cancelled_job(uri=Form(None), userName=Form(None), password=Form(None)
     finally:
         close_db_connection(graph, 'cancelled_job')
 
+@app.post("/populate_graph_schema")
+async def populate_graph_schema(input_text=Form(None), model=Form(None), is_schema_description_checked=Form(None)):
+    try:
+        result = populate_graph_schema_from_text(input_text, model, is_schema_description_checked)
+        return create_api_response('Success',data=result)
+    except Exception as e:
+        job_status = "Failed"
+        message="Unable to get the schema from text"
+        error_message = str(e)
+        logging.exception(f'Exception in getting the schema from text:{error_message}')
+        return create_api_response(job_status, message=message, error=error_message)
+
 if __name__ == "__main__":
     uvicorn.run(app)
````
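The `/upload` change in backend/score.py makes the final chunk return a distinct "Source Node Created Successfully" response. A standalone sketch of that branch, where the `chunk_upload_response` helper is a hypothetical stand-in for the real handler:

```python
def chunk_upload_response(chunk_number: int, total_chunks: int, result: str) -> dict:
    """Mirror the /upload endpoint's branching: only the last chunk reports source-node creation."""
    if int(chunk_number) == int(total_chunks):
        return {"status": "Success", "data": result,
                "message": "Source Node Created Successfully"}
    return {"status": "Success", "message": result}

# Intermediate chunks just echo the upload result; the last chunk signals completion.
print(chunk_upload_response(1, 3, "chunk stored"))
print(chunk_upload_response(3, 3, "report.pdf"))
```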
