Commit 42ffb69

Merge pull request #25 from pymupdf/api-doc
Update api.rst
2 parents c7dff3c + ca2e01f commit 42ffb69

3 files changed, with 64 additions and 3 deletions

docs/src/_static/custom.css

Lines changed: 5 additions & 0 deletions
@@ -1,4 +1,9 @@
 /* add custom CSS as required */
 .footer-item .theme-version {
     display: none;
+}
+
+cite {
+    font-weight: bold;
+    font-style: normal;
 }

docs/src/api.rst

Lines changed: 46 additions & 3 deletions
@@ -10,19 +10,62 @@ API

     Prints the version of the library.

-.. method:: to_markdown(doc: fitz.Document | str, *, pages: list | range | None = None, hdr_info: IdentifyHeaders | None = None, write_images: bool = False, page_chunks: bool = False) -> str | list[dict]
+.. method:: to_markdown(doc: pymupdf.Document | str, *, pages: list | range | None = None, hdr_info: Any = None, write_images: bool = False, margins=(0, 50, 0, 50), page_chunks: bool = False) -> str | list[dict]

+    Reads the pages of the file and outputs their text in |Markdown| format. A number of parameters control how the conversion is done. Page chunks can also be built from the |Markdown| text.

+    :arg Document,str doc: the file, specified either as a file path string or as a |PyMuPDF| Document (created via `pymupdf.open`).

+    :arg list,range pages: optional, the pages to consider for output. If omitted, all pages are processed.

-----
+    :arg hdr_info: optional, a callable (or an object having a method named `hdr_info`) which accepts a text span and returns a string of 0 to 6 "#" characters used to identify headers in the Markdown text. If omitted, a full document scan is performed to find the most popular font sizes and derive header levels from them. For instance, to avoid generating any lines tagged as headers, specify `hdr_info=lambda s: ""`.
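
As an illustration of the `hdr_info` callable described above, here is a minimal sketch. It assumes the text span is a dictionary with a numeric `"size"` key, as in PyMuPDF span dictionaries. The name `size_based_header` and the size thresholds are hypothetical and would need tuning per document.

```python
def size_based_header(span: dict) -> str:
    """Map a span's font size to a string of 0 to 6 '#' characters.

    Assumption: `span` is a dict with a numeric "size" key (as in PyMuPDF
    span dictionaries). The thresholds below are made up for illustration.
    """
    size = span.get("size", 0)
    if size >= 20:
        return "#"
    if size >= 16:
        return "##"
    if size >= 13:
        return "###"
    return ""  # body text: no header prefix


# A callable that never tags any line as a header, as mentioned in the text.
no_headers = lambda span: ""
```

Either callable could then be passed as `hdr_info=size_based_header` or `hdr_info=no_headers`. Whether a trailing space is expected after the "#" characters is not stated here, so this is worth checking against the installed version.
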
+
+    :arg bool write_images: when images or vector graphics are encountered, PNG images are generated from the respective page areas, and Markdown references pointing to these images are written to the output. Any text contained in these areas is not included in the output. Therefore, if your document has text written on full-page images, make sure to set this parameter to `False`.
+
+    :arg float,list margins: a float or a list of up to 4 floats specifying page borders. If 4 floats are provided, they are taken as the values left, top, right, bottom, in this sequence. Only content below top and above bottom, etc. is considered for processing. A single float value is used for all 4 borders. A pair of numbers specifies top and bottom.
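
The `margins` rules above can be sketched as a small normalization function. This is illustrative only and is not the library's own code: the name `normalize_margins` is hypothetical, left and right are assumed to be 0 when only a pair is given, and lists of 1 or 3 values are rejected because the text does not say how they are treated.

```python
def normalize_margins(margins):
    """Expand a margins value to (left, top, right, bottom).

    Illustrative only: this mirrors the rules described in the text and is
    not the library's own implementation.
    """
    if isinstance(margins, (int, float)):
        return (margins,) * 4          # one value applies to all four borders
    values = tuple(margins)
    if len(values) == 2:
        top, bottom = values
        return (0, top, 0, bottom)     # a pair means top and bottom (assumed 0 elsewhere)
    if len(values) == 4:
        return values                  # already left, top, right, bottom
    raise ValueError("margins must be a float, a pair, or four floats")


print(normalize_margins((20, 30)))
```

Under these assumptions, the default `(0, 50, 0, 50)` from the signature passes through unchanged.
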
+
+    :arg bool page_chunks: if `True`, the output is a list of `Document.page_count` dictionaries (one per page). Each dictionary has the following structure:
+
+        - **"metadata"** - a dictionary consisting of the document's metadata `Document.metadata <https://pymupdf.readthedocs.io/en/latest/document.html#Document.metadata>`_, enriched with additional keys **"file_path"** (the file name), **"page_count"** (number of pages in document), and **"page_number"** (1-based page number).
+
+        - **"toc_items"** - a list of Table of Contents items pointing to this page. Each item of this list has the format `[lvl, title, pagenumber]`, where `lvl` is the hierarchy level, `title` a string and `pagenumber` the 1-based page number.
+
+        - **"tables"** - a list of tables on this page. Each item is a dictionary with keys "bbox", "row_count" and "col_count". Key "bbox" is a `pymupdf.Rect` in tuple format of the table's position on the page.
+
+        - **"images"** - a list of images on the page. This is a copy of the output of page method `get_image_info <https://pymupdf.readthedocs.io/en/latest/page.html#Page.get_image_info>`_. Please see there for a full description of items.
+
+        - **"graphics"** - a list of vector graphics rectangles on the page. This is a list of boundary boxes of clustered vector graphics as delivered by method `cluster_drawings <https://pymupdf.readthedocs.io/en/latest/page.html#Page.cluster_drawings>`_.
+
+        - **"text"** - page content as Markdown text.
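
To make the page chunk structure above concrete, the following sketch walks over one hand-written dictionary with the listed keys. All values in `sample_chunk` are made up, and `summarize_chunk` is a hypothetical helper, not part of the library.

```python
# Hand-written sample with the keys described above; all values are made up.
sample_chunk = {
    "metadata": {"file_path": "example.pdf", "page_count": 3, "page_number": 1},
    "toc_items": [[1, "Introduction", 1]],
    "tables": [{"bbox": (72.0, 100.0, 520.0, 300.0), "row_count": 4, "col_count": 3}],
    "images": [],
    "graphics": [],
    "text": "# Introduction\n\nSome page text.",
}


def summarize_chunk(chunk: dict) -> str:
    """Return a one-line summary of a page chunk dictionary."""
    meta = chunk["metadata"]
    # Each TOC item has the format [lvl, title, pagenumber].
    titles = [title for _lvl, title, _page in chunk["toc_items"]]
    return (
        f"{meta['file_path']} page {meta['page_number']}/{meta['page_count']}: "
        f"{len(chunk['tables'])} table(s), {len(chunk['images'])} image(s), "
        f"TOC entries {titles}, {len(chunk['text'])} characters of Markdown"
    )


print(summarize_chunk(sample_chunk))
```

With real output, the same function could be applied to each element of the list returned when `page_chunks=True`.
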
+
+    :returns: Either a string of the combined text of all selected document pages or a list of dictionaries.
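
A minimal usage sketch for the string-returning case follows. Only `pymupdf4llm.to_markdown` is taken from the documentation above; the helper name `pdf_to_markdown_file` and the file names in the usage note are hypothetical, and the package must be installed for the call to succeed.

```python
from pathlib import Path


def pdf_to_markdown_file(pdf_path, out_path, **kwargs):
    """Convert a PDF to Markdown and save it as UTF-8 text.

    Hypothetical helper: only `pymupdf4llm.to_markdown` comes from the
    documentation above. Keyword arguments are passed through unchanged.
    """
    import pymupdf4llm  # imported here so this module loads without the package

    md_text = pymupdf4llm.to_markdown(str(pdf_path), **kwargs)
    if not isinstance(md_text, str):
        # With page_chunks=True the result is a list of dictionaries.
        raise TypeError("expected a string; use the page chunks directly instead")
    Path(out_path).write_text(md_text, encoding="utf-8")
    return md_text
```

A call such as `pdf_to_markdown_file("input.pdf", "output.md", write_images=False)` would then write the combined Markdown of all pages to `output.md`; both file names are placeholders.
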

+.. method:: LlamaMarkdownReader(*args, **kwargs)

-.. class:: LlamaMarkdownReader
+    Create a `pdf_markdown_reader.PDFMarkdownReader` using the `LlamaIndex <https://pypi.org/project/llama-index/>`_ package. Please note that this package is **not installed automatically** when installing **pymupdf4llm**.

+    For details on the possible arguments, please consult the LlamaIndex documentation [#f1]_.
+
+    :raises: NotImplementedError: Please install required 'llama_index'.
+    :returns: a `pdf_markdown_reader.PDFMarkdownReader`, and issues the message "Successfully imported LlamaIndex". Please note that this method needs several seconds to execute. For details on using the markdown reader, please see below.
+
+----
+
+
+.. class:: pdf_markdown_reader.PDFMarkdownReader
+
     .. method:: load_data(file_path: Union[Path, str], extra_info: Optional[Dict] = None, **load_kwargs: Any) -> List[LlamaIndexDocument]

+        This is the only method of the markdown reader you should currently use to extract markdown data. In any case, please ignore methods `aload_data()` and `lazy_load_data()`. Other methods like `use_doc_meta()` may or may not make sense. For more information, please consult the documentation of LlamaIndex [#f1]_.
+
+        Under the hood, the method executes `to_markdown()`.
+
+        :returns: a list of `LlamaIndexDocument` documents, one for each page.
+
+
+.. rubric:: Footnotes

+.. [#f1] `LlamaIndex documentation <https://docs.llamaindex.ai/en/stable/>`_


 .. include:: footer.rst

docs/src/header.rst

Lines changed: 13 additions & 0 deletions
@@ -1,3 +1,16 @@
+.. |PyMuPDF| raw:: html
+
+    <cite>PyMuPDF</cite>
+
+.. |PDF| raw:: html
+
+    <cite>PDF</cite>
+
+.. |Markdown| raw:: html
+
+    <cite>Markdown</cite>
+
+
 .. raw:: html

     <style>
