Commit e6fa1c3

Changes for version 0.0.6
Documented in file changes.rst.
1 parent fa4330a commit e6fa1c3

9 files changed

+343
-108
lines changed

docs/src/api.rst

Lines changed: 23 additions & 9 deletions
Original file line number | Diff line number | Diff line change
@@ -10,25 +10,39 @@ API
1010

1111
Prints the version of the library.
1212

13-
.. method:: to_markdown(doc: pymupdf.Document | str, *, pages: list | range | None = None, hdr_info: Any = None, write_images: bool = False, margins=(0, 50, 0, 50), page_chunks: bool = False) -> str | list[dict]
13+
.. method:: to_markdown(doc: pymupdf.Document | str, *, pages: list | range | None = None, hdr_info: Any = None, write_images: bool = False, dpi: int = 150, margins=(0, 50, 0, 50), page_chunks: bool = False, page_width: float = 612, page_height: float = None, table_strategy="lines_strict", graphics_limit: int = None) -> str | list[dict]
1414

15-
Read the pages of the file and outputs the text of its pages in |Markdown| format. How this should happen in detail can be influenced by a number of parameters. Please note that there exists support for building page chunks from the |Markdown| text.
15+
Read the pages of the file and outputs the text of its pages in |Markdown| format. How this should happen in detail can be influenced by a number of parameters. Please note that there exists **support for building page chunks** from the |Markdown| text.
1616

17-
:arg Document,str doc: the file, to be specified either as a file path string, or as a |PyMuPDF| Document (created via `pymupdf.open`).
17+
:arg Document,str doc: the file, to be specified either as a file path string, or as a |PyMuPDF| Document (created via `pymupdf.open`). In order to use `pathlib.Path` specifications, Python file-like objects, documents in memory etc. you **must** use a |PyMuPDF| Document.
1818

19-
:arg list,range pages: optional, the pages to consider for output. If omitted all pages are processed.
19+
:arg list pages: optional, the pages to consider for output (caution: specify 0-based page numbers). If omitted all pages are processed.
2020

21-
:arg hdr_info: optional, a callable (or an object having a method named `hdr_info`) which accepts a text span and delivers a string of 0 up to 6 "#" characters which should be used to identify headers in the markdown text. If omitted, a full document scan will be performed to find the most popular font sizes and derive header levels based on this. For instance, to avoid generating any lines tagged as headers specify `hdr_info=lambda s: ""`.
21+
:arg hdr_info: optional. Use this if you want to provide your own header detection logic. This may be a callable or an object having a method named `get_header_id`. It must accept a text span (a span dictionary as contained in `extractDict <https://pymupdf.readthedocs.io/en/latest/textpage.html#span-dictionary>`_) and has optional access to the owning `Page <https://pymupdf.readthedocs.io/en/latest/page.html>`_ object. It must return a string "" or up to 6 "#" characters followed by 1 space. If omitted, a full document scan will be performed to find the most popular font sizes and derive header levels based on this. For instance, avoid any headers by specifying `hdr_info=lambda s: ""`.
2222

23-
:arg bool write_images: when encountering images or vector graphics, PNG images will be generated from the respective page area. Markdown references will be generated pointing to these images. Any text contained in these areas will not be included in the output. Therefore, if your document has text written on full page images, make sure to set this parameter to `False`.
23+
:arg bool write_images: when encountering images or vector graphics, PNG images will be created from the respective page area and stored in the folder of the document. Markdown references will be generated pointing to these images. Any text contained in these areas will not be included in the text output (but appear as part of the images). Therefore, if your document has text written on full page images, make sure to set this parameter to `False`.
2424

25-
:arg float,list margins: a float or a list of up to 4 floats specifying page borders. If 4 floats are provided, they are assumed to be the values left, top, right, bottom, in this sequence. Only content below top and above bottom, etc. will be considered for processing. If a single float value is provided, it will be taken as the value for all 4 border values. A pair of numbers is assumed to specify top and bottom.
25+
:arg int dpi: specify the desired image resolution in dots per inch. Relevant only if `write_images=True`. Default value is 150.
26+
27+
:arg float,list margins: a float or a sequence of 2 or 4 floats specifying page borders. Only objects inside the margins will be considered for output.
28+
29+
* `margin=f` yields `(f, f, f, f)` for `(left, top, right, bottom)`.
30+
* `(top, bottom)` yields `(0, top, 0, bottom)`.
31+
* To always read full pages, use `margins=0`.
32+
33+
:arg float page_width: specify a desired page width. This is ignored for documents with a fixed page width like PDF, XPS etc. **Reflowable** documents however, like e-books, office or text files have no fixed page dimensions and by default are assumed to have Letter format width (612) and an **"infinite"** page height. This means that the full document is treated as one large page.
34+
35+
:arg float page_height: specify a desired page height. For relevance see the `page_width` parameter. If using the default `None`, the document will appear as one large page with a width of `page_width`. Consequently in this case, no markdown page separators will occur (except the final one), respectively only one page chunk will be returned.
2636

37+
:arg str table_strategy: table detection strategy. Default is `"lines_strict"` which ignores background colors. In some occasions, other strategies may be more successful, for example `"lines"` which uses all vector graphics objects for detection.
38+
39+
:arg int graphics_limit: use this to limit dealing with excess amounts of vector graphics elements. Typically, scientific documents or pages simulating text using graphics commands may contain tens of thousands of these objects. As vector graphics are used for table detection mainly, analyzing pages of this kind may result in excessive runtimes. You can exclude problematic pages via `graphics_limit=5000`. The respective pages will then be ignored and be represented by one message line in the output text.
40+
2741
:arg bool page_chunks: if `True` the output will be a list of `Document.page_count` dictionaries (one per page). Each dictionary has the following structure:
2842

2943
- **"metadata"** - a dictionary consisting of the document's metadata `Document.metadata <https://pymupdf.readthedocs.io/en/latest/document.html#Document.metadata>`_, enriched with additional keys **"file_path"** (the file name), **"page_count"** (number of pages in document), and **"page_number"** (1-based page number).
3044

31-
- **"toc_items"** - a list of Table of Contents items pointing to this page. Each item of this list has the format `[lvl, title, pagenumber]`, where `lvl` is the hierachy level, `title` a string and `pagenumber` the 12-based page number.
45+
- **"toc_items"** - a list of Table of Contents items pointing to this page. Each item of this list has the format `[lvl, title, pagenumber]`, where `lvl` is the hierachy level, `title` a string and `pagenumber` the 1-based page number.
3246

3347
- **"tables"** - a list of tables on this page. Each item is a dictionary with keys "bbox", "row_count" and "col_count". Key "bbox" is a `pymupdf.Rect` in tuple format of the table's position on the page.
3448

@@ -42,7 +56,7 @@ API
4256

4357
.. method:: LlamaMarkdownReader(*args, **kwargs)
4458

45-
Create a `pdf_markdown_reader.PDFMarkdownReader` using the `LlamaIndex <https://pypi.org/project/llama-index/>`_ package. Please note that this package will **not automatically be installed** when installing **pymupdf4llm**.
59+
Create a `pdf_markdown_reader.PDFMarkdownReader` using the `LlamaIndex <https://pypi.org/project/llama-index/>`_ package. Please note that this package will **not be installed** when installing **pymupdf4llm**.
4660

4761
For details on the possible arguments, please consult the LlamaIndex documentation [#f1]_.
4862
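
The `margins` and `hdr_info` descriptions in the `docs/src/api.rst` diff above can be illustrated with a minimal sketch. This is not the library's implementation: `normalize_margins` and `header_by_size` are hypothetical names, the font-size thresholds are made up for illustration, and the only facts taken from the documentation are the margin expansion rules and the requirement that a header callable returns `""` or up to 6 `#` characters followed by one space.

```python
from collections.abc import Sequence


def normalize_margins(margins):
    """Expand a margins value to (left, top, right, bottom).

    Hypothetical helper that mirrors the documented rules only:
    a single number f yields (f, f, f, f), a pair (top, bottom)
    yields (0, top, 0, bottom), and 4 values are taken as given.
    """
    if isinstance(margins, (int, float)):
        return (float(margins),) * 4
    if isinstance(margins, Sequence) and len(margins) == 2:
        top, bottom = margins
        return (0.0, float(top), 0.0, float(bottom))
    if isinstance(margins, Sequence) and len(margins) == 4:
        return tuple(float(m) for m in margins)
    raise ValueError("margins must be a float or a sequence of 2 or 4 floats")


def header_by_size(span, page=None):
    """Hypothetical header detector for use as a callable.

    Accepts a span dictionary (only its "size" key is read here) and
    an optional page object, which this sketch ignores. The thresholds
    18 and 14 are assumptions, not values used by the library.
    """
    size = span["size"]
    if size >= 18:
        return "# "
    if size >= 14:
        return "## "
    return ""  # body text: no header prefix
```

Under these assumptions, a callable such as `header_by_size` is the kind of object the documentation describes for `hdr_info`, and `margins=0` corresponds to reading full pages.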

docs/src/changes.rst

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
1+
.. include:: header.rst
2+
3+
4+
Change Log
5+
===========================================================================
6+
7+
Changes in version 0.0.6
8+
--------------------------
9+
10+
Fixes:
11+
~~~~~~~
12+
13+
* `55 <https://github.com/pymupdf/RAG/issues/55>`_ "Bug in helpers/multi_column.py - IndexError: list index out of range"
14+
* `54 <https://github.com/pymupdf/RAG/issues/54>`_ "Mistakes in orchestrating sentences"
15+
* `52 <https://github.com/pymupdf/RAG/issues/52>`_ "Chunking of text files"
16+
* Partial fix for `41 <https://github.com/pymupdf/RAG/issues/41>`_ / `40 <https://github.com/pymupdf/RAG/issues/40>`_. Improved page column detection, but still no silver bullet for overly complex page layouts.
17+
18+
Improvements:
19+
~~~~~~~~~~~~~~~~
20+
21+
* New parameter `dpi` to specify the resolution of images.
22+
* New parameters `page_width` / `page_height` for easily processing reflowable documents (Text, Office, e-books).
23+
* New parameter `graphics_limit` to avoid spending runtimes for value-less content.
24+
* New parameter `table_strategy` to directly control the table detection strategy.
25+
26+
.. include:: footer.rst
27+

docs/src/index.rst

Lines changed: 7 additions & 0 deletions
@@ -154,6 +154,13 @@ API
154154

155155
See :doc:`api`.
156156

157+
158+
Change Log
159+
------------
160+
161+
See :doc:`changes`.
162+
163+
157164
Further Resources
158165
-------------------
159166

pymupdf4llm/pymupdf4llm/__init__.py

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
1-
from .helpers.pymupdf_rag import to_markdown, IdentifyHeaders
1+
from .helpers.pymupdf_rag import IdentifyHeaders, to_markdown
22

3-
__version__ = "0.0.5"
3+
__version__ = "0.0.6"
44
version = __version__
55
version_tuple = tuple(map(int, version.split(".")))
66

pymupdf4llm/pymupdf4llm/helpers/get_text_lines.py

Lines changed: 20 additions & 8 deletions
@@ -64,15 +64,20 @@ def get_raw_lines(textpage, clip=None, tolerance=3):
6464
]
6565
spans = [] # all spans in TextPage here
6666
for bno, b in enumerate(blocks):
67-
for lno, l in enumerate(b["lines"]):
68-
for s in l["spans"]:
67+
for lno, line in enumerate(b["lines"]):
68+
lbbox = fitz.Rect(line["bbox"])
69+
for sno, s in enumerate(line["spans"]):
6970
sbbox = fitz.Rect(s["bbox"]) # turn to a Rect
70-
if (
71-
abs(sbbox & clip) < abs(sbbox) * 0.8
72-
): # must be inside parameter rectangle
71+
mpoint = (sbbox.tl + sbbox.br) / 2 # middle point
72+
if mpoint not in clip:
7373
continue
7474
if is_white(s["text"]): # ignore white text
7575
continue
76+
if s["flags"] & 1 == 1: # if a superscript, modify
77+
i = 1 if sno == 0 else sno - 1
78+
neighbor = line["spans"][i]
79+
sbbox.y1 = neighbor["bbox"][3]
80+
s["text"] = f"[{s['text']}]"
7681
s["bbox"] = sbbox # update with the Rect version
7782
# include line identifier to facilitate separator insertion
7883
s["line"] = lno
@@ -82,7 +87,9 @@ def get_raw_lines(textpage, clip=None, tolerance=3):
8287
if not spans: # we may have no text at all
8388
return []
8489

85-
spans.sort(key=lambda s: s["bbox"].y1) # sort spans by assending bottom coord
90+
spans.sort(
91+
key=lambda s: s["bbox"].y1
92+
) # sort spans by assending bottom coord
8693
nlines = [] # final result
8794
line = [spans[0]] # collects spans with fitting vertical coordinate
8895
lrect = spans[0]["bbox"] # rectangle joined from span rectangles
@@ -91,7 +98,10 @@ def get_raw_lines(textpage, clip=None, tolerance=3):
9198
sbbox = s["bbox"]
9299
sbbox0 = line[-1]["bbox"]
93100
# if any of top or bottom coordinates are close enough, join...
94-
if abs(sbbox.y1 - sbbox0.y1) <= y_delta or abs(sbbox.y0 - sbbox0.y0) <= y_delta:
101+
if (
102+
abs(sbbox.y1 - sbbox0.y1) <= y_delta
103+
or abs(sbbox.y0 - sbbox0.y0) <= y_delta
104+
):
95105
line.append(s) # append to this line
96106
lrect |= sbbox # extend line rectangle
97107
continue
@@ -112,7 +122,9 @@ def get_raw_lines(textpage, clip=None, tolerance=3):
112122
return nlines
113123

114124

115-
def get_text_lines(page, *, textpage=None, clip=None, sep="\t", tolerance=3, ocr=False):
125+
def get_text_lines(
126+
page, *, textpage=None, clip=None, sep="\t", tolerance=3, ocr=False
127+
):
116128
"""Extract text by line keeping natural reading sequence.
117129
118130
Notes:
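
The two behavioral changes in the `get_raw_lines` hunk above (a span is kept when its middle point lies inside `clip`, and a superscript span is bracketed and stretched down to its neighbor's bottom edge) can be sketched without PyMuPDF. This is a simplified illustration with plain tuples, not the code from the commit: `midpoint_in_clip` and `adjust_superscript` are hypothetical names, the inclusive boundary comparison is an assumption, and the guard for a line containing only one span is an addition that the diff itself does not show.

```python
SUPERSCRIPT = 1  # bit 0 of a span's "flags" value, as tested in the diff


def midpoint_in_clip(bbox, clip):
    """Return True if the middle point of bbox lies inside clip.

    Both arguments are (x0, y0, x1, y1) tuples. Treating the clip
    borders as inclusive is an assumption of this sketch.
    """
    x0, y0, x1, y1 = bbox
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    cx0, cy0, cx1, cy1 = clip
    return cx0 <= mx <= cx1 and cy0 <= my <= cy1


def adjust_superscript(spans, sno):
    """Return a copy of spans[sno], modified if it is a superscript.

    A superscript gets its text wrapped in brackets and its bottom
    coordinate replaced by that of an adjacent span: the next span if
    it is first in the line, otherwise the previous one.
    """
    span = dict(spans[sno])
    if not span["flags"] & SUPERSCRIPT:
        return span
    x0, y0, x1, y1 = span["bbox"]
    if len(spans) > 1:  # guard added in this sketch: no neighbor to copy from
        i = 1 if sno == 0 else sno - 1
        y1 = spans[i]["bbox"][3]
    span["bbox"] = (x0, y0, x1, y1)
    span["text"] = f"[{span['text']}]"
    return span
```

One consequence of the midpoint rule, under these assumptions: a span that overlaps the clip by only 75% of its area was skipped by the previous 80% overlap test, but is kept now if its center is inside the clip.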
