
Commit b2d5531

Merge pull request #24 from pymupdf/more-docs
Adds more info to homepage for latest version and starts API doc.
2 parents 29d11c0 + 898ac4b commit b2d5531

2 files changed, +82 -5 lines changed

docs/src/api.rst

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+.. include:: header.rst
+
+
+API
+===========================================================================
+
+
+
+.. property:: version
+
+Prints the version of the library.
+
+.. method:: to_markdown(doc: fitz.Document | str, *, pages: list | range | None = None, hdr_info: IdentifyHeaders | None = None, write_images: bool = False, page_chunks: bool = False) -> str | list[dict]
+
+
+
+
+----
+
+
+.. class:: LlamaMarkdownReader
+
+.. method:: load_data(file_path: Union[Path, str], extra_info: Optional[Dict] = None, **load_kwargs: Any) -> List[LlamaIndexDocument]
+
+
+
+
+.. include:: footer.rst
+
+
+
+
+
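As a rough illustration of the `to_markdown` signature documented in `docs/src/api.rst` above, the sketch below passes `pages` and `page_chunks` and collapses either return shape (`str` or `list[dict]`) into a single Markdown string. The helper names `markdown_from_result` and `pdf_to_markdown` are hypothetical, and the `"text"` key on each page-chunk dict is an assumption that this commit does not document.

```python
from typing import Any, Dict, List, Optional, Sequence, Union

# The two return shapes named in the documented signature: str | list[dict].
MarkdownResult = Union[str, List[Dict[str, Any]]]


def markdown_from_result(result: MarkdownResult) -> str:
    """Collapse either return shape of to_markdown into one Markdown string."""
    if isinstance(result, str):
        # page_chunks=False: the document is already a single string.
        return result
    # page_chunks=True: one dict per page. ASSUMPTION: the page's Markdown
    # is stored under a "text" key; missing keys contribute an empty string.
    return "\n\n".join(chunk.get("text", "") for chunk in result)


def pdf_to_markdown(
    path: str,
    pages: Optional[Sequence[int]] = None,
    page_chunks: bool = False,
) -> str:
    """Hypothetical wrapper: convert a PDF and always return one string."""
    # Imported lazily so markdown_from_result stays usable without the package.
    import pymupdf4llm

    # `pages` takes 0-based page numbers; None means all pages.
    result = pymupdf4llm.to_markdown(path, pages=pages, page_chunks=page_chunks)
    return markdown_from_result(result)
```

With the package installed, a call such as `pdf_to_markdown("input.pdf", pages=range(0, 3), page_chunks=True)` would process only the first three pages and join their chunks.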

docs/src/index.rst

Lines changed: 49 additions & 5 deletions
@@ -1,14 +1,21 @@
 .. include:: header.rst
 
 
+
 PyMuPDF4LLM
 ===========================================================================
 
 **PyMuPDF4LLM** is based on `PyMuPDF <https://pymupdf.readthedocs.io>`_ - the fastest **PDF** extraction tool for **Python**.
 
-This documentation explains how to use the **Python PDF4LLM** package as well as providing links to other related **RAG** & **LLM** resources for **PyMuPDF**.
+This documentation explains how to use the **PyMuPDF4LLM** package as well as providing links to other related **RAG** & **LLM** resources for **PyMuPDF**.
 
+Features
+-------------------------------
 
+- Support for multi-column pages
+- Support for image and vector graphics extraction (and inclusion of references in the MD text)
+- Support for page chunking output.
+- Direct support for output as LlamaIndex Documents.
 
 - This package converts the pages of a **PDF** to text in **Markdown** format using **PyMuPDF**.
 
@@ -21,10 +28,8 @@ This documentation explains how to use the **Python PDF4LLM** package as well as
 - By default, all document pages are processed. If desired, a subset of pages can be specified by providing a list of `0`-based page numbers.
 
 
-Using in LLM / RAG Applications
---------------------------------------------------------------
-
-To retrieve your document content in **Markdown** simply install the package and then use a couple of lines of **Python** code to get results.
+Installation
+----------------
 
 
 Install the package via **pip** with:
@@ -35,6 +40,23 @@ Install the package via **pip** with:
 pip install pymupdf4llm
 
 
+
+Using in LLM / RAG Applications
+--------------------------------------------------------------
+
+**PyMuPDF4LLM** is aimed to make it easier to extract **PDF** content in the format you need for **LLM** & **RAG** environments. It supports :ref:`Markdown extraction <extracting_as_md>` as well as :ref:`LlamaIndex document output <extracting_as_llamaindex>`.
+
+
+
+.. _extracting_as_md:
+
+Extracting a file as **Markdown**
+--------------------------------------------------------------
+
+To retrieve your document content in **Markdown** simply install the package and then use a couple of lines of **Python** code to get results.
+
+
+
 Then in your **Python** script do:
 
 
@@ -58,6 +80,28 @@ If you want to store your **Markdown** file, e.g. store as a UTF8-encoded file,
 pathlib.Path("output.md").write_bytes(md_text.encode())
 
 
+
+.. _extracting_as_llamaindex:
+
+Extracting a file as a **LlamaIndex** document
+--------------------------------------------------------------
+
+**PyMuPDF4LLM** supports direct conversion to a **LLamaIndex** document. A document is first converted into **Markdown** format and then a **LlamaIndex** document is returned as follows:
+
+
+
+.. code-block:: python
+
+import pymupdf4llm
+llama_reader = pymupdf4llm.LlamaMarkdownReader()
+llama_docs = llama_reader.load_data("input.pdf")
+
+
+API
+-------
+
+See :doc:`api`.
+
 Further Resources
 -------------------
 
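The LlamaIndex section added in this diff stops at `load_data`. As a minimal sketch of what might come next, the helper below previews whatever the reader returns. It assumes each returned object exposes `text` and `metadata` attributes, as LlamaIndex `Document` objects are understood to do. The names `summarize_docs` and `load_and_summarize` are hypothetical and not part of PyMuPDF4LLM.

```python
from typing import Any, Dict, Iterable, List, Tuple


def summarize_docs(
    docs: Iterable[Any], preview_chars: int = 80
) -> List[Tuple[Dict[str, Any], str]]:
    """Return (metadata, text preview) pairs for a sequence of documents."""
    summaries = []
    for doc in docs:
        # ASSUMPTION: documents carry `.text` and `.metadata`; fall back to
        # empty values so a missing attribute does not raise.
        text = getattr(doc, "text", "") or ""
        metadata = getattr(doc, "metadata", None) or {}
        summaries.append((dict(metadata), text[:preview_chars]))
    return summaries


def load_and_summarize(path: str) -> List[Tuple[Dict[str, Any], str]]:
    """Hypothetical wrapper around the reader shown in the diff."""
    # Imported lazily so summarize_docs can be used without the package.
    import pymupdf4llm

    reader = pymupdf4llm.LlamaMarkdownReader()
    return summarize_docs(reader.load_data(path))
```

With the package and LlamaIndex installed, `load_and_summarize("input.pdf")` would give a quick view of what each returned document contains before it is passed to an index.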
