 **PyMuPDF4LLM** is based on `PyMuPDF <https://pymupdf.readthedocs.io>`_ - the fastest **PDF** extraction tool for **Python**.

-This documentation explains how to use the **Python PDF4LLM** package as well as providing links to other related **RAG** & **LLM** resources for **PyMuPDF**.
+This documentation explains how to use the **PyMuPDF4LLM** package as well as providing links to other related **RAG** & **LLM** resources for **PyMuPDF**.

+Features
+-------------------------------

+- Support for multi-column pages
+- Support for image and vector graphics extraction (and inclusion of references in the MD text)
+- Support for page chunking output.
+- Direct support for output as LlamaIndex Documents.

 - This package converts the pages of a **PDF** to text in **Markdown** format using **PyMuPDF**.

@@ -21,10 +28,8 @@ This documentation explains how to use the **Python PDF4LLM** package as well as
 - By default, all document pages are processed. If desired, a subset of pages can be specified by providing a list of `0`-based page numbers.
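
The page-subset behaviour described in the line above can be sketched as follows. This is a minimal illustration, assuming the package's `to_markdown` function and its `pages` keyword; the wrapper name `markdown_for_pages` and the file name `input.pdf` are hypothetical and not part of this commit.

```python
def markdown_for_pages(pdf_path, pages=None):
    """Convert a PDF to Markdown text, optionally restricted to some pages.

    `pages` is a list of 0-based page numbers; None means every page.
    Hypothetical helper: only the to_markdown call belongs to the package.
    """
    # Imported lazily so this sketch can be loaded without the package installed.
    import pymupdf4llm

    if pages is None:
        # Default behaviour: all document pages are processed.
        return pymupdf4llm.to_markdown(pdf_path)
    # Restrict the conversion to the requested 0-based page numbers.
    return pymupdf4llm.to_markdown(pdf_path, pages=pages)
```

Calling `markdown_for_pages("input.pdf", pages=[0, 1])` would then request only the first two pages, while omitting `pages` keeps the default of processing the whole document.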
**PyMuPDF4LLM** aims to make it easier to extract **PDF** content in the format you need for **LLM** & **RAG** environments. It supports :ref:`Markdown extraction <extracting_as_md>` as well as :ref:`LlamaIndex document output <extracting_as_llamaindex>`.
**PyMuPDF4LLM** supports direct conversion to a **LlamaIndex** document. A document is first converted into **Markdown** format and then a **LlamaIndex** document is returned as follows:
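
A minimal sketch of that conversion, assuming the package's `LlamaMarkdownReader` class and its `load_data` method, and assuming `llama_index` is installed alongside it. The wrapper name `pdf_to_llama_documents` and the file name `input.pdf` are hypothetical.

```python
def pdf_to_llama_documents(pdf_path):
    """Return the LlamaIndex documents produced for a PDF file.

    Hypothetical helper: it only wraps the reader provided by the package.
    """
    # Imported lazily so this sketch can be loaded without the package installed.
    import pymupdf4llm

    # The reader converts the PDF to Markdown internally and wraps the
    # result in LlamaIndex document objects.
    reader = pymupdf4llm.LlamaMarkdownReader()
    return reader.load_data(pdf_path)
```

A call such as `pdf_to_llama_documents("input.pdf")` would hand back the reader's documents, ready to be passed to a LlamaIndex index or pipeline.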