
Commit ac6de74 (merge of 2 parents: 22b1434 + 1feed33)

89 files changed: +4115 −1872 lines


.gitignore

Lines changed: 2 additions & 1 deletion

````diff
@@ -166,4 +166,5 @@ google-cloud-cli-469.0.0-linux-x86_64.tar.gz
 /backend/src/merged_files
 /backend/src/chunks
 /backend/merged_files
-google-cloud-cli-476.0.0-linux-x86_64.tar.gz
+/backend/chunks
+google-cloud-cli-479.0.0-linux-x86_64.tar.gz
````

README.md

Lines changed: 88 additions & 48 deletions

````diff
@@ -5,41 +5,105 @@ Files can be uploaded from local machine or S3 bucket and then LLM model can be
 
 ### Getting started
 
-:warning:
-For the backend, if you want to run the LLM KG Builder locally, and don't need the GCP/VertexAI integration, make sure to have the following set in your ENV file :
+:warning: You will need to have a Neo4j Database V5.15 or later with [APOC installed](https://neo4j.com/docs/apoc/current/installation/) to use this Knowledge Graph Builder.
+You can use any [Neo4j Aura database](https://neo4j.com/aura/) (including the free database)
+If you are using Neo4j Desktop, you will not be able to use the docker-compose but will have to follow the [separate deployment of backend and frontend section](#running-backend-and-frontend-separately-dev-environment). :warning:
 
+### Deploy locally
+#### Running through docker-compose
+By default only OpenAI and Diffbot are enabled since Gemini requires extra GCP configurations.
+
+In your root folder, create a .env file with your OPENAI and DIFFBOT keys (if you want to use both):
+```env
+OPENAI_API_KEY="your-openai-key"
+DIFFBOT_API_KEY="your-diffbot-key"
+```
+
+if you only want OpenAI:
+```env
+LLM_MODELS="OpenAI GPT 3.5,OpenAI GPT 4o"
+OPENAI_API_KEY="your-openai-key"
+```
+
+if you only want Diffbot:
+```env
+LLM_MODELS="Diffbot"
+DIFFBOT_API_KEY="your-diffbot-key"
+```
+
+You can then run Docker Compose to build and start all components:
+```bash
+docker-compose up --build
+```
+
+##### Additional configs
+
+By default, the input sources will be: Local files, Youtube, Wikipedia and AWS S3. As this default config is applied:
 ```env
-GEMINI_ENABLED = False
-GCP_LOG_METRICS_ENABLED = False
+REACT_APP_SOURCES="local,youtube,wiki,s3"
 ```
 
-And for the frontend, make sure to export your local backend URL before running docker-compose by having the BACKEND_API_URL set in your ENV file :
+If however you want the Google GCS integration, add `gcs` and your Google client ID:
 ```env
-BACKEND_API_URL="http://localhost:8000"
+REACT_APP_SOURCES="local,youtube,wiki,s3,gcs"
+GOOGLE_CLIENT_ID="xxxx"
 ```
 
-1. Run Docker Compose to build and start all components:
+You can of course combine all (local, youtube, wikipedia, s3 and gcs) or remove any you don't want/need.
+
+
+#### Running Backend and Frontend separately (dev environment)
+Alternatively, you can run the backend and frontend separately:
+
+- For the frontend:
+1. Create the frontend/.env file by copy/pasting the frontend/example.env.
+2. Change values as needed
+3.
 ```bash
-docker-compose up --build
+cd frontend
+yarn
+yarn run dev
 ```
 
-2. Alternatively, you can run specific directories separately:
-
-- For the frontend:
-```bash
-cd frontend
-yarn
-yarn run dev
-```
+- For the backend:
+1. Create the backend/.env file by copy/pasting the backend/example.env.
+2. Change values as needed
+3.
+```bash
+cd backend
+python -m venv envName
+source envName/bin/activate
+pip install -r requirements.txt
+uvicorn score:app --reload
+```
+### ENV
+| Env Variable Name | Mandatory/Optional | Default Value | Description |
+|-------------------------|--------------------|---------------|--------------------------------------------------------------------------------------------------|
+| OPENAI_API_KEY | Mandatory | | API key for OpenAI |
+| DIFFBOT_API_KEY | Mandatory | | API key for Diffbot |
+| EMBEDDING_MODEL | Optional | all-MiniLM-L6-v2 | Model for generating the text embedding (all-MiniLM-L6-v2 , openai , vertexai) |
+| IS_EMBEDDING | Optional | true | Flag to enable text embedding |
+| KNN_MIN_SCORE | Optional | 0.94 | Minimum score for KNN algorithm |
+| GEMINI_ENABLED | Optional | False | Flag to enable Gemini |
+| GCP_LOG_METRICS_ENABLED | Optional | False | Flag to enable Google Cloud logs |
+| NUMBER_OF_CHUNKS_TO_COMBINE | Optional | 6 | Number of chunks to combine when processing embeddings |
+| UPDATE_GRAPH_CHUNKS_PROCESSED | Optional | 20 | Number of chunks processed before updating progress |
+| NEO4J_URI | Optional | neo4j://database:7687 | URI for Neo4j database |
+| NEO4J_USERNAME | Optional | neo4j | Username for Neo4j database |
+| NEO4J_PASSWORD | Optional | password | Password for Neo4j database |
+| LANGCHAIN_API_KEY | Optional | | API key for Langchain |
+| LANGCHAIN_PROJECT | Optional | | Project for Langchain |
+| LANGCHAIN_TRACING_V2 | Optional | true | Flag to enable Langchain tracing |
+| LANGCHAIN_ENDPOINT | Optional | https://api.smith.langchain.com | Endpoint for Langchain API |
+| BACKEND_API_URL | Optional | http://localhost:8000 | URL for backend API |
+| BLOOM_URL | Optional | https://workspace-preview.neo4j.io/workspace/explore?connectURL={CONNECT_URL}&search=Show+me+a+graph&featureGenAISuggestions=true&featureGenAISuggestionsInternal=true | URL for Bloom visualization |
+| REACT_APP_SOURCES | Optional | local,youtube,wiki,s3 | List of input sources that will be available |
+| LLM_MODELS | Optional | Diffbot,OpenAI GPT 3.5,OpenAI GPT 4o | Models available for selection on the frontend, used for entities extraction and Q&A Chatbot |
+| ENV | Optional | DEV | Environment variable for the app |
+| TIME_PER_CHUNK | Optional | 4 | Time per chunk for processing |
+| CHUNK_SIZE | Optional | 5242880 | Size of each chunk for processing |
+| GOOGLE_CLIENT_ID | Optional | | Client ID for Google authentication |
 
-- For the backend:
-```bash
-cd backend
-python -m venv envName
-source envName/bin/activate
-pip install -r requirements.txt
-uvicorn score:app --reload
-```
 
 ###
 To deploy the app and packages on Google Cloud Platform, run the following command on google cloud run:
@@ -64,29 +128,6 @@ Allow unauthenticated request : Yes
 - **Neo4j Integration**: The extracted nodes and relationships are stored in a Neo4j database for easy visualization and querying.
 - **Grid View of source node files with** : Name,Type,Size,Nodes,Relations,Duration,Status,Source,Model
 
-## Setting up Environment Variables
-Create .env file and update the following env variables.\
-OPENAI_API_KEY = ""\
-DIFFBOT_API_KEY = ""\
-NEO4J_URI = ""\
-NEO4J_USERNAME = ""\
-NEO4J_PASSWORD = ""\
-AWS_ACCESS_KEY_ID = ""\
-AWS_SECRET_ACCESS_KEY = ""\
-EMBEDDING_MODEL = ""\
-IS_EMBEDDING = "TRUE"\
-KNN_MIN_SCORE = ""
-
-## Setting up Enviournment Variables For Frontend Configuration
-
-Create .env file in the frontend root folder and update the following env variables.\
-BACKEND_API_URL=""\
-BLOOM_URL=""\
-REACT_APP_SOURCES=""\
-LLM_MODELS=""\
-ENV=""\
-TIME_PER_CHUNK=
-
 ## Functions/Modules
 
 #### extract_graph_from_file(uri, userName, password, file_path, model):
@@ -140,4 +181,3 @@ https://github.com/neo4j-labs/llm-graph-builder/assets/121786590/b725a503-6ade-4
 The Public [ Google cloud Run URL](https://devfrontend-dcavk67s4a-uc.a.run.app).
 [Workspace URL](https://workspace-preview.neo4j.io/workspace)
 
-
````

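Putting the README's snippets together, a complete root `.env` that enables every input source and both LLM providers might look like this (all key values here are placeholders, not real credentials):

```env
OPENAI_API_KEY="your-openai-key"
DIFFBOT_API_KEY="your-diffbot-key"
LLM_MODELS="Diffbot,OpenAI GPT 3.5,OpenAI GPT 4o"
REACT_APP_SOURCES="local,youtube,wiki,s3,gcs"
GOOGLE_CLIENT_ID="xxxx"
```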
backend/example.env

Lines changed: 10 additions & 12 deletions

````diff
@@ -1,24 +1,22 @@
 OPENAI_API_KEY = ""
 DIFFBOT_API_KEY = ""
+GROQ_API_KEY = ""
+EMBEDDING_MODEL = "all-MiniLM-L6-v2"
+IS_EMBEDDING = "true"
+KNN_MIN_SCORE = "0.94"
+# Enable Gemini (default is False) | Can be False or True
+GEMINI_ENABLED = False
+# Enable Google Cloud logs (default is False) | Can be False or True
+GCP_LOG_METRICS_ENABLED = False
+NUMBER_OF_CHUNKS_TO_COMBINE = 6
+UPDATE_GRAPH_CHUNKS_PROCESSED = 20
 NEO4J_URI = ""
 NEO4J_USERNAME = ""
 NEO4J_PASSWORD = ""
 NEO4J_DATABASE = ""
 AWS_ACCESS_KEY_ID = ""
 AWS_SECRET_ACCESS_KEY = ""
-EMBEDDING_MODEL = ""
-IS_EMBEDDING = "TRUE"
-KNN_MIN_SCORE = ""
 LANGCHAIN_API_KEY = ""
 LANGCHAIN_PROJECT = ""
 LANGCHAIN_TRACING_V2 = ""
 LANGCHAIN_ENDPOINT = ""
-NUMBER_OF_CHUNKS_TO_COMBINE = ""
-# NUMBER_OF_CHUNKS_ALLOWED = ""
-# Enable Gemini (default is True)
-GEMINI_ENABLED = True|False
-# Enable Google Cloud logs (default is True)
-GCP_LOG_METRICS_ENABLED = True|False
-UPDATE_GRAPH_CHUNKS_PROCESSED = 20
-NEO4J_USER_AGENT = ""
-UPDATE_GRAPH_CHUNKS_PROCESSED = 20
````

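The example.env above now defaults `GEMINI_ENABLED` and `GCP_LOG_METRICS_ENABLED` to False. A minimal sketch of how such a flag can be read, mirroring the truthy set `true`/`1`/`yes` used in the `backend/score.py` diff; the helper name `env_flag` is ours, not the project's:

```python
import os

def env_flag(name: str, default: str = "False") -> bool:
    """Read an environment variable as a boolean; 'true', '1' and 'yes' enable it."""
    return os.environ.get(name, default).lower() in ("true", "1", "yes")

os.environ["GEMINI_ENABLED"] = "False"
print(env_flag("GEMINI_ENABLED"))   # False: Gemini routes stay disabled
os.environ["GEMINI_ENABLED"] = "yes"
print(env_flag("GEMINI_ENABLED"))   # True
```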
backend/requirements.txt

Lines changed: 4 additions & 0 deletions

````diff
@@ -74,6 +74,7 @@ langchain-community
 langchain-core
 langchain-experimental
 langchain-google-vertexai
+langchain-groq
 langchain-openai
 langchain-text-splitters==0.0.1
 langdetect==1.0.9
@@ -117,6 +118,7 @@ pydantic==2.6.4
 pydantic_core==2.16.3
 pyparsing==3.0.9
 pypdf==4.0.1
+PyPDF2
 pypdfium2==4.27.0
 pytesseract==0.3.10
 python-dateutil==2.8.2
@@ -158,6 +160,7 @@ unstructured
 unstructured-client
 unstructured-inference
 unstructured.pytesseract
+unstructured[all-docs]
 urllib3
 uvicorn
 gunicorn
@@ -168,3 +171,4 @@ youtube-transcript-api==0.6.2
 zipp==3.17.0
 sentence-transformers
 google-cloud-logging==3.10.0
+PyMuPDF==1.24.5
````

backend/score.py

Lines changed: 36 additions & 16 deletions

````diff
@@ -56,7 +56,7 @@ def sick():
     allow_headers=["*"],
 )
 
-is_gemini_enabled = os.environ.get("GEMINI_ENABLED", "True").lower() in ("true", "1", "yes")
+is_gemini_enabled = os.environ.get("GEMINI_ENABLED", "False").lower() in ("true", "1", "yes")
 if is_gemini_enabled:
     add_routes(app,ChatVertexAI(), path="/vertexai")
 
@@ -138,7 +138,8 @@ async def extract_knowledge_graph_from_file(
     file_name=Form(None),
     allowedNodes=Form(None),
     allowedRelationship=Form(None),
-    language=Form(None)
+    language=Form(None),
+    access_token=Form(None)
 ):
     """
     Calls 'extract_graph_from_file' in a new thread to create Neo4jGraph from a
@@ -177,7 +178,7 @@ async def extract_knowledge_graph_from_file(
 
     elif source_type == 'gcs bucket' and gcs_bucket_name:
         result = await asyncio.to_thread(
-            extract_graph_from_file_gcs, graph, model, gcs_project_id, gcs_bucket_name, gcs_bucket_folder, gcs_blob_filename, allowedNodes, allowedRelationship)
+            extract_graph_from_file_gcs, graph, model, gcs_project_id, gcs_bucket_name, gcs_bucket_folder, gcs_blob_filename, access_token, allowedNodes, allowedRelationship)
     else:
         return create_api_response('Failed',message='source_type is other than accepted source')
     if result is not None:
@@ -201,16 +202,16 @@ async def extract_knowledge_graph_from_file(
 @app.get("/sources_list")
 async def get_source_list(uri:str, userName:str, password:str, database:str=None):
     """
-    Calls 'get_source_list_from_graph' which returns list of sources which alreday exist in databse
+    Calls 'get_source_list_from_graph' which returns list of sources which already exist in databse
     """
     try:
         decoded_password = decode_password(password)
         if " " in uri:
-            uri= uri.replace(" ","+")
-            result = await asyncio.to_thread(get_source_list_from_graph,uri,userName,decoded_password,database)
-            josn_obj = {'api_name':'sources_list','db_url':uri}
-            logger.log_struct(josn_obj)
-            return create_api_response("Success",data=result)
+            uri = uri.replace(" ","+")
+        result = await asyncio.to_thread(get_source_list_from_graph,uri,userName,decoded_password,database)
+        josn_obj = {'api_name':'sources_list','db_url':uri}
+        logger.log_struct(josn_obj)
+        return create_api_response("Success",data=result)
     except Exception as e:
         job_status = "Failed"
         message="Unable to fetch source list"
@@ -225,11 +226,11 @@ async def update_similarity_graph(uri=Form(None), userName=Form(None), password=
     """
     try:
         graph = create_graph_database_connection(uri, userName, password, database)
-        result = await asyncio.to_thread(update_graph, graph)
-        logging.info(f"result : {result}")
+        await asyncio.to_thread(update_graph, graph)
+
         josn_obj = {'api_name':'update_similarity_graph','db_url':uri}
         logger.log_struct(josn_obj)
-        return create_api_response('Success',message='Updated KNN Graph',data=result)
+        return create_api_response('Success',message='Updated KNN Graph')
     except Exception as e:
         job_status = "Failed"
         message="Unable to update KNN Graph"
@@ -344,9 +345,12 @@ async def upload_large_file_into_chunks(file:UploadFile = File(...), chunkNumber
         result = await asyncio.to_thread(upload_file, graph, model, file, chunkNumber, totalChunks, originalname, CHUNK_DIR, MERGED_DIR)
         josn_obj = {'api_name':'upload','db_url':uri}
         logger.log_struct(josn_obj)
-        return create_api_response('Success', message=result)
+        if int(chunkNumber) == int(totalChunks):
+            return create_api_response('Success',data=result, message='Source Node Created Successfully')
+        else:
+            return create_api_response('Success', message=result)
     except Exception as e:
-        job_status = "Failed"
+        # job_status = "Failed"
         message="Unable to upload large file into chunks or saving the chunks"
         error_message = str(e)
         logging.info(message)
@@ -359,6 +363,7 @@ async def get_structured_schema(uri=Form(None), userName=Form(None), password=Fo
     try:
         graph = create_graph_database_connection(uri, userName, password, database)
         result = await asyncio.to_thread(get_labels_and_relationtypes, graph)
+        logging.info(f'Schema result from DB: {result}')
         josn_obj = {'api_name':'schema','db_url':uri}
         logger.log_struct(josn_obj)
         return create_api_response('Success', data=result)
@@ -381,6 +386,7 @@ async def update_extract_status(request:Request, file_name, url, userName, passw
     async def generate():
         status = ''
         decoded_password = decode_password(password)
+        uri = url
         if " " in url:
             uri= url.replace(" ","+")
         while True:
@@ -401,7 +407,8 @@ async def generate():
                 'total_chunks':result[0]['total_chunks'],
                 'total_pages':result[0]['total_pages'],
                 'fileSize':result[0]['fileSize'],
-                'processed_chunk':result[0]['processed_chunk']
+                'processed_chunk':result[0]['processed_chunk'],
+                'fileSource':result[0]['fileSource']
             })
         else:
             status = json.dumps({'fileName':file_name, 'status':'Failed'})
@@ -457,7 +464,8 @@ async def get_document_status(file_name, url, userName, password, database):
                 'total_chunks':result[0]['total_chunks'],
                 'total_pages':result[0]['total_pages'],
                 'fileSize':result[0]['fileSize'],
-                'processed_chunk':result[0]['processed_chunk']
+                'processed_chunk':result[0]['processed_chunk'],
+                'fileSource':result[0]['fileSource']
             }
         else:
             status = {'fileName':file_name, 'status':'Failed'}
@@ -484,5 +492,17 @@ async def cancelled_job(uri=Form(None), userName=Form(None), password=Form(None)
     finally:
         close_db_connection(graph, 'cancelled_job')
 
+@app.post("/populate_graph_schema")
+async def populate_graph_schema(input_text=Form(None), model=Form(None), is_schema_description_checked=Form(None)):
+    try:
+        result = populate_graph_schema_from_text(input_text, model, is_schema_description_checked)
+        return create_api_response('Success',data=result)
+    except Exception as e:
+        job_status = "Failed"
+        message="Unable to get the schema from text"
+        error_message = str(e)
+        logging.exception(f'Exception in getting the schema from text:{error_message}')
+        return create_api_response(job_status, message=message, error=error_message)
+
 if __name__ == "__main__":
     uvicorn.run(app)
````
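The `/upload` change in backend/score.py makes the final chunk return a distinct "Source Node Created Successfully" response. A standalone sketch of that branch, where the `chunk_upload_response` helper is a hypothetical stand-in for the real handler:

```python
def chunk_upload_response(chunk_number: int, total_chunks: int, result: str) -> dict:
    """Mirror the /upload endpoint's branching: only the last chunk reports source-node creation."""
    if int(chunk_number) == int(total_chunks):
        return {"status": "Success", "data": result,
                "message": "Source Node Created Successfully"}
    return {"status": "Success", "message": result}

# Intermediate chunks just echo the upload result; the last chunk signals completion.
print(chunk_upload_response(1, 3, "chunk stored"))
print(chunk_upload_response(3, 3, "report.pdf"))
```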
