|
10 | 10 |
|
11 | 11 | Prints the version of the library. |
12 | 12 |
|
13 | | -.. method:: to_markdown(doc: fitz.Document | str, *, pages: list | range | None = None, hdr_info: Any = None, write_images: bool = False, margins=(0, 50, 0, 50), page_chunks: bool = False) -> str | list[dict] |
| 13 | +.. method:: to_markdown(doc: pymupdf.Document | str, *, pages: list | range | None = None, hdr_info: Any = None, write_images: bool = False, margins=(0, 50, 0, 50), page_chunks: bool = False) -> str | list[dict] |
14 | 14 |
|
15 | 15 | Reads the pages of the file and outputs their text in Markdown format. A number of parameters control how this happens in detail. Please note that there is support for building page chunks from the Markdown text. |
16 | 16 |
|
17 | | - :arg Document,str doc: the file, to be specified either as a file path string, or as a PyMuPDF Document (created via pymupdf.open). |
| 17 | + :arg Document,str doc: the file, to be specified either as a file path string, or as a PyMuPDF Document (created via `pymupdf.open`). |
18 | 18 |
|
19 | 19 | :arg list,range pages: optional, the pages to consider for output. If omitted all pages are processed. |
20 | 20 |
|
|
26 | 26 |
|
27 | 27 | :arg bool page_chunks: if `True` the output will be a list of `Document.page_count` dictionaries (one per page). Each dictionary has the following structure: |
28 | 28 |
|
29 | | - - **"metadata"** - a dictionary consisting of the document's metadata `Document.metadata <https://pymupdf.readthedocs.io/en/latest/document.html#Document.metadata>'_, enriched with additional keys **"file_path"** (the file name), **"page_count"** (number of pages in document), and **"page_number"** (1-based page number). |
| 29 | + - **"metadata"** - a dictionary consisting of the document's metadata `Document.metadata <https://pymupdf.readthedocs.io/en/latest/document.html#Document.metadata>`_, enriched with additional keys **"file_path"** (the file name), **"page_count"** (number of pages in document), and **"page_number"** (1-based page number). |
30 | 30 |
|
31 | | - - **"toc_items"** - a list of Table of Contents items pointing to this page. Ech item of this list has the format `[lvl, title, pagenumber]`, where "lvl" is the hierachie level, "title" a string and "pagenumber" the 12-based page number. |
| 31 | + - **"toc_items"** - a list of Table of Contents items pointing to this page. Each item of this list has the format `[lvl, title, pagenumber]`, where "lvl" is the hierarchy level, "title" a string and "pagenumber" the 1-based page number. |
32 | 32 |
|
33 | | - - **"tables"** - a list of tables on this page. Each item is a dictionary with keys "bbox", "row_count" and "col_count". Key "bbox" is a fitz.Rect in tuple format of the table's position on the page. |
| 33 | + - **"tables"** - a list of tables on this page. Each item is a dictionary with keys "bbox", "row_count" and "col_count". Key "bbox" is a `pymupdf.Rect` in tuple format of the table's position on the page. |
34 | 34 |
|
35 | 35 | - **"images"** - a list of images on the page. This is a copy of the output of page method `get_image_info <https://pymupdf.readthedocs.io/en/latest/page.html#Page.get_image_info>`_. Please see there for a full description of items. |
36 | 36 |
|
|
42 | 42 |
|
43 | 43 | .. method:: LlamaMarkdownReader(*args, **kwargs) |
44 | 44 |
|
45 | | - Create a `pdf_markdown_reader.PDFMarkdownReader` using the `LlamaIndex <https://pypi.org/project/llama-index/>`_ package. Please note that this package will **not automatically be installed** when installing pymupdf4llm. |
| 45 | + Create a `pdf_markdown_reader.PDFMarkdownReader` using the `LlamaIndex <https://pypi.org/project/llama-index/>`_ package. Please note that this package will **not automatically be installed** when installing **pymupdf4llm**. |
46 | 46 |
|
47 | | - For details on the possible arguments, please consult LlamaIndex documentation. |
| 47 | + For details on the possible arguments, please consult the LlamaIndex documentation [#f1]_. |
48 | 48 |
|
49 | 49 | :raises: NotImplementedError: Please install required 'llama_index'. |
50 | 50 | :returns: a `pdf_markdown_reader.PDFMarkdownReader` and issues message "Successfully imported LlamaIndex". Please note that this method needs several seconds to execute. For details on using the markdown reader please see below. |
|
56 | 56 |
|
57 | 57 | .. method:: load_data(file_path: Union[Path, str], extra_info: Optional[Dict] = None, **load_kwargs: Any) -> List[LlamaIndexDocument] |
58 | 58 |
|
59 | | - This is the only method of the markdown reader you should currently use to extract markdown data. Please in any case ignore methods `aload_data()` and `lazy_load_data()`. Other methods like `use_doc_meta()` may or may not make sense. For more information, please consult the documentation of LlamaIndex. |
| 59 | + This is the only method of the markdown reader you should currently use to extract markdown data. Please in any case ignore methods `aload_data()` and `lazy_load_data()`. Other methods like `use_doc_meta()` may or may not make sense. For more information, please consult the documentation of LlamaIndex [#f1]_. |
60 | 60 |
|
61 | 61 | Under the hood the method will execute `to_markdown()`. |
62 | 62 |
|
63 | 63 | :returns: a list of `LlamaIndexDocument` documents - one for each page. |
64 | 64 |
|
65 | 65 |
|
| 66 | +.. rubric:: Footnotes |
| 67 | + |
| 68 | +.. [#f1] `LlamaIndex documentation <https://docs.llamaindex.ai/en/stable/>`_ |
| 69 | +
|
| 70 | +
|
66 | 71 | .. include:: footer.rst |
67 | 72 |
|
68 | 73 |
|
|
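The page chunk structure described for `page_chunks=True` can be explored with a small sketch like the one below. It is an illustration under stated assumptions, not part of the documented API: the helpers `summarize_chunks` and `load_chunks` are hypothetical names, and the sample chunk is hand-written with invented values that only mirror the documented keys ("metadata", "toc_items", "tables", "images"). Only `load_chunks` touches the library, using the `to_markdown(doc, page_chunks=True)` call from the signature above, and it is not executed here.

```python
from typing import Any


def summarize_chunks(chunks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Condense page chunk dictionaries into one small summary per page."""
    summary = []
    for chunk in chunks:
        meta = chunk.get("metadata", {})
        summary.append(
            {
                # 1-based page number added to the document metadata
                "page": meta.get("page_number"),
                # each TOC item has the format [lvl, title, pagenumber]
                "toc_titles": [title for _lvl, title, _pno in chunk.get("toc_items", [])],
                # each table item carries "bbox", "row_count" and "col_count"
                "table_shapes": [
                    (t["row_count"], t["col_count"]) for t in chunk.get("tables", [])
                ],
                "image_count": len(chunk.get("images", [])),
            }
        )
    return summary


def load_chunks(path: str) -> list[dict[str, Any]]:
    """Hypothetical helper: extract page chunks from a real file.

    Requires pymupdf4llm to be installed; imported lazily so the rest of
    this sketch runs without it.
    """
    import pymupdf4llm

    return pymupdf4llm.to_markdown(path, page_chunks=True)


# Hand-written sample with invented values, shaped like the documented keys.
sample_chunks = [
    {
        "metadata": {"file_path": "example.pdf", "page_count": 1, "page_number": 1},
        "toc_items": [[1, "Introduction", 1]],
        "tables": [{"bbox": (50.0, 100.0, 300.0, 200.0), "row_count": 3, "col_count": 2}],
        "images": [{"bbox": (10.0, 10.0, 60.0, 60.0)}],  # placeholder item
    }
]

result = summarize_chunks(sample_chunks)
print(result)
```

With a real document, `summarize_chunks(load_chunks("some.pdf"))` would produce one summary per page, assuming the returned dictionaries follow the structure described above.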