Commit 3a34b40

Multiple page columns support
This introduces support for multiple text columns on document pages. Extraction of images and vector graphics is now also supported. Instead of one large Markdown string for the whole document, "chunks" per page can be requested.
1 parent 6d55dd4 commit 3a34b40

15 files changed, +1067 -655 lines changed

README.md

Lines changed: 9 additions & 5 deletions
@@ -14,9 +14,7 @@ This will generally be implemented as one or more Python functions called by any

 # Installation

-As a specialty, folder "helpers" contains a script that is capable to convert PDF pages into **_text strings in Markdown format_** (GitHub compatible), which includes **standard text** as well as **table-based text** in a consistent and integrated view. This is particularly important in RAG environments.
-
-There is a Python package on PyPI [pymupdf4llm](https://pypi.org/project/pymupdf4llm/) (there also is an alias [pdf4llm](https://pypi.org/project/pdf4llm/)) which provides convenient access to this script:
+The Python package on PyPI [pymupdf4llm](https://pypi.org/project/pymupdf4llm/) (there also is an alias [pdf4llm](https://pypi.org/project/pdf4llm/)) is capable of converting PDF pages into **_text strings in Markdown format_** (GitHub compatible). This includes **standard text** as well as **table-based text** in a consistent and integrated view - a feature particularly important in RAG settings.

 ```bash
 $ pip install -U pymupdf4llm
@@ -36,11 +34,17 @@ import pathlib
 pathlib.Path("output.md").write_bytes(md_text.encode())
 ```

-Instead of the filename string as above, one can also provide a PyMuPDF `Document`. By default, all pages in the PDF will be processed. If desired, a list of zero-based page numbers to consider can be provided.
+Instead of the filename string as above, one can also provide a PyMuPDF `Document`. By default, all pages in the PDF will be processed. If desired, the parameter `pages=[...]` can be used to provide a list of zero-based page numbers to consider.
+
+Markdown text creation now also processes **multi-column pages**.
+
+To create small **chunks of text** - as opposed to generating one large string for the whole document - the new (v0.0.2) option `page_chunks=True` can be used. The result of `.to_markdown("input.pdf", page_chunks=True)` will be a list of Python dictionaries, one for each page.
+
+Also new in version 0.0.2 is the optional **extraction of images** and vector graphics: use of parameter `write_images=True`. This will store PNG images in the document's folder, and the Markdown text will appropriately refer to them. The images are named like `"input.pdf-page_number-index.png"`.

 # Document Support

-While PDF is certainly the most important document format worldwide by far, it is worthwhile mentioning that all examples and helper scripts work in the same way and **_without change_** for [all supported file types](https://pymupdf.readthedocs.io/en/latest/how-to-open-a-file.html#supported-file-types).
+While PDF is by far the most important document format worldwide, it is worthwhile mentioning that all examples and helper scripts work in the same way and **_without change_** for [all supported file types](https://pymupdf.readthedocs.io/en/latest/how-to-open-a-file.html#supported-file-types).

 So for an XPS document or an eBook, simply provide the filename for instance as `"input.mobi"` and everything else will work as before.

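The README changes above describe the `pages`, `page_chunks` and `write_images` parameters in prose only, so a minimal Python sketch of how they might be combined follows. The `to_markdown` call and its three keyword arguments come from the README text. The helper names `convert` and `save_page_chunks`, the output file naming, and the `"text"` dictionary key are assumptions made for illustration: the README only states that each page yields a dictionary, not which keys it holds.

```python
from pathlib import Path


def convert(source, pages=None, page_chunks=False, write_images=False):
    """Hypothetical thin wrapper around pymupdf4llm.to_markdown.

    Requires `pip install -U pymupdf4llm`. The import is done lazily so
    that the helper below can be used and tested without the package.
    """
    import pymupdf4llm

    return pymupdf4llm.to_markdown(
        source,
        pages=pages,                # list of zero-based page numbers, or None for all pages
        page_chunks=page_chunks,    # True: list of per-page dictionaries instead of one string
        write_images=write_images,  # True: store PNG images and reference them in the Markdown
    )


def save_page_chunks(chunks, out_dir, text_key="text"):
    """Write the Markdown of each page chunk to its own file.

    ASSUMPTION: every chunk dictionary holds its Markdown under `text_key`.
    Chunks without that key produce an empty file. Files are numbered by
    position in `chunks`, which is not necessarily the PDF page number when
    a `pages=[...]` selection was used.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for position, chunk in enumerate(chunks, start=1):
        path = out_dir / f"page-{position:04d}.md"
        path.write_bytes(chunk.get(text_key, "").encode())
        paths.append(path)
    return paths
```

For example, `save_page_chunks(convert("input.pdf", page_chunks=True), "chunks")` would write one Markdown file per page, provided the chunk dictionaries carry their Markdown under the assumed key. Passing `write_images=True` to `convert` should additionally leave the PNG files next to the document, as described in the README text.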
helpers/README.md

Lines changed: 0 additions & 40 deletions
This file was deleted.

helpers/input.pdf

-103 KB
Binary file not shown.

helpers/input2.pdf

-105 KB
Binary file not shown.
