This introduces support for multiple text columns on document pages.
Extraction of images and vector graphics is now also supported.
Instead of one large markdown string for the whole document, "chunks" per page can be requested.
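As a rough illustration of the per-page chunks mentioned above, the following Python sketch requests chunks with `page_chunks=True` and collects the Markdown of each page. The helper names `convert_to_chunks` and `chunk_texts` are hypothetical, and the assumption that each page dictionary keeps its Markdown under a `"text"` key is not stated in this change, so check the actual dictionary keys before relying on it.

```python
from __future__ import annotations

from typing import Any, Optional


def convert_to_chunks(path: str, pages: Optional[list[int]] = None) -> list[dict[str, Any]]:
    """Convert a document into one dictionary per page (hypothetical wrapper)."""
    import pymupdf4llm  # imported lazily so the helper below works without the package

    kwargs: dict[str, Any] = {"page_chunks": True}
    if pages is not None:
        # Zero-based page numbers, as described in the README change.
        kwargs["pages"] = pages
    return pymupdf4llm.to_markdown(path, **kwargs)


def chunk_texts(chunks: list[dict[str, Any]], key: str = "text") -> list[str]:
    """Return the Markdown of every non-empty page chunk.

    ASSUMPTION: the Markdown string is stored under `key` ("text" by default).
    """
    return [chunk[key] for chunk in chunks if str(chunk.get(key, "")).strip()]
```

A call such as `chunk_texts(convert_to_chunks("input.pdf", pages=[0, 1]))` would then yield one Markdown string per non-empty page, which is usually a more convenient unit for embedding than one string for the whole document.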
README.md: 9 additions & 5 deletions
@@ -14,9 +14,7 @@ This will generally be implemented as one or more Python functions called by any
 
 # Installation
 
-As a specialty, folder "helpers" contains a script that is capable to convert PDF pages into **_text strings in Markdown format_** (GitHub compatible), which includes **standard text** as well as **table-based text** in a consistent and integrated view. This is particularly important in RAG environments.
-
-There is a Python package on PyPI [pymupdf4llm](https://pypi.org/project/pymupdf4llm/) (there also is an alias [pdf4llm](https://pypi.org/project/pdf4llm/)) which provides convenient access to this script:
+The Python package on PyPI [pymupdf4llm](https://pypi.org/project/pymupdf4llm/) (there also is an alias [pdf4llm](https://pypi.org/project/pdf4llm/)) is capable of converting PDF pages into **_text strings in Markdown format_** (GitHub compatible). This includes **standard text** as well as **table-based text** in a consistent and integrated view - a feature particularly important in RAG settings.
-Instead of the filename string as above, one can also provide a PyMuPDF `Document`. By default, all pages in the PDF will be processed. If desired, a list of zero-based page numbers to consider can be provided.
+Instead of the filename string as above, one can also provide a PyMuPDF `Document`. By default, all pages in the PDF will be processed. If desired, the parameter `pages=[...]` can be used to provide a list of zero-based page numbers to consider.
+
+Markdown text creation now also processes **multi-column pages**.
+
+To create small **chunks of text** - as opposed to generating one large string for the whole document - the new (v0.0.2) option `page_chunks=True` can be used. The result of `.to_markdown("input.pdf", page_chunks=True)` will be a list of Python dictionaries, one for each page.
+
+Also new in version 0.0.2 is the optional **extraction of images** and vector graphics: use of parameter `write_images=True`. This will store PNG images in the document's folder, and the Markdown text will appropriately refer to them. The images are named like `"input.pdf-page_number-index.png"`.
 
 # Document Support
 
-While PDF is certainly the most important document format worldwide by far, it is worthwhile mentioning that all examples and helper scripts work in the same way and **_without change_** for [all supported file types](https://pymupdf.readthedocs.io/en/latest/how-to-open-a-file.html#supported-file-types).
+While PDF is by far the most important document format worldwide, it is worthwhile mentioning that all examples and helper scripts work in the same way and **_without change_** for [all supported file types](https://pymupdf.readthedocs.io/en/latest/how-to-open-a-file.html#supported-file-types).
 
 So for an XPS document or an eBook, simply provide the filename for instance as `"input.mobi"` and everything else will work as before.
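
The image naming scheme described in the README change (`"input.pdf-page_number-index.png"`) can be read back with a small parser. The following Python sketch is only an illustration: the names `PageImage` and `parse_image_name` are hypothetical, the pattern assumes the separators are literal hyphens and that both numbers are plain decimal digits, and the README text does not say whether the page number in the file name is zero-based.

```python
from __future__ import annotations

import re
from typing import NamedTuple, Optional


class PageImage(NamedTuple):
    """Parts of an extracted image file name (hypothetical helper type)."""

    document: str
    page: int
    index: int


# ASSUMPTION: names follow "<document>-<page_number>-<index>.png" with literal hyphens.
_IMAGE_NAME = re.compile(r"^(?P<document>.+)-(?P<page>\d+)-(?P<index>\d+)\.png$")


def parse_image_name(name: str) -> Optional[PageImage]:
    """Split an image file name into document name, page number and image index.

    Returns None when the name does not follow the assumed pattern.
    """
    match = _IMAGE_NAME.match(name)
    if match is None:
        return None
    return PageImage(match["document"], int(match["page"]), int(match["index"]))
```

Because the document part is matched greedily, file names that themselves contain hyphens (for example `"my-file.pdf-12-4.png"`) are still split at the last two numeric fields.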