|
| 1 | +# Change Log |
| 2 | + |
| 3 | + |
| 4 | +## Changes in version 0.0.15 |
| 5 | + |
| 6 | + |
| 7 | + |
| 8 | +### Fixes |
| 9 | + |
| 10 | + |
| 11 | +* [138](https://github.com/pymupdf/RAG/issues/138) - Table is not extracted and some text order was wrong. |
| 12 | +* [135](https://github.com/pymupdf/RAG/issues/135) - Problem with multiple columns in simple text. |
| 13 | +* [134](https://github.com/pymupdf/RAG/issues/134) - Exclude images based on size threshold parameter. |
| 14 | +* [132](https://github.com/pymupdf/RAG/issues/132) - Optionally embed images as base64 string. |
| 15 | +* [128](https://github.com/pymupdf/RAG/issues/128) - Enhanced image embedding format. |
| 16 | + |
| 17 | + |
| 18 | +### Improvements |
| 19 | + |
| 20 | +* New parameter `embed_images` (bool) **embeds** images and vector graphics in the markdown text as base64-encoded strings. When set, the `write_images` and `image_path` parameters are ignored. |
| 21 | +* New parameter `image_size_limit`: a float between 0 and 1, default 0.05 (5%). Images are ignored if their width or height is smaller than this fraction of the page's width or height, respectively. |
| 22 | +* Improved the algorithm that determines the sequence of the text rectangles on multi-column pages. |
| 23 | +* Change of the header identification algorithm: If more than six header levels are required for a document, then all text with a font size larger than body text is assumed to be a header of level 6 (i.e. HTML "h6" = "###### "). |
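To illustrate what embedding an image as a base64 string in markdown can look like, here is a minimal sketch using only the standard library. The function name `embed_image_markdown` and its parameters are hypothetical, and the data URI layout is the common convention, not necessarily the exact markup that `embed_images` produces.

```python
import base64


def embed_image_markdown(image_bytes, image_format="png", alt_text=""):
    """Return a markdown image reference carrying the image inline.

    Hypothetical helper: the image bytes are base64-encoded and placed
    in a data URI, so no separate image file is needed.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"![{alt_text}](data:image/{image_format};base64,{encoded})"


# Usage with placeholder bytes (not a real image):
print(embed_image_markdown(b"abc"))
```

Because the image data travels inside the text itself, a path or a write-to-disk switch has nothing to act on, which is consistent with `write_images` and `image_path` being ignored in this mode.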
| 24 | + |
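The header rule above can be read as follows: font sizes larger than body text are ranked from largest to smallest, the first six get levels 1 to 6, and any remaining larger-than-body sizes fall back to level 6. The sketch below implements that reading. The function `header_prefixes` and the parameter `max_levels` are hypothetical and do not reflect the library's internal code.

```python
def header_prefixes(font_sizes, body_size, max_levels=6):
    """Map each header font size to a markdown header prefix.

    Assumption: larger font size means higher-ranking header.
    Sizes beyond the sixth level are capped at level 6 ("###### ").
    """
    # Only sizes strictly larger than body text count as headers.
    sizes = sorted({s for s in font_sizes if s > body_size}, reverse=True)
    return {
        size: "#" * min(level, max_levels) + " "
        for level, size in enumerate(sizes, start=1)
    }


# Usage: eight distinct header sizes over a body size of 10.
prefixes = header_prefixes([30, 24, 20, 18, 16, 14, 13, 12, 10], body_size=10)
for size, prefix in prefixes.items():
    print(size, repr(prefix))
```

In this example the sizes 13 and 12 would need levels 7 and 8, so both receive the level 6 prefix.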
| 25 | + |
| 26 | +## Changes in version 0.0.13 |
| 27 | + |
| 28 | + |
| 29 | +### Fixes |
| 30 | + |
| 31 | +* [112](https://github.com/pymupdf/RAG/issues/112) - Invalid bandwriter header dimensions/setup. |
| 32 | + |
| 33 | + |
| 34 | +### Improvements |
| 35 | + |
| 36 | +* New parameter `ignore_code` suppresses special formatting of text in mono-spaced fonts. |
| 37 | +* New parameter `extract_words` enforces `page_chunks=True` and adds a "words" list to each page dictionary. |
| 38 | + |
| 39 | + |
| 40 | +## Changes in version 0.0.11 |
| 41 | + |
| 42 | + |
| 43 | +### Fixes |
| 44 | + |
| 45 | +* [90](https://github.com/pymupdf/RAG/issues/90) - 'Quad' object has no attribute 'tl'. |
| 46 | +* [88](https://github.com/pymupdf/RAG/issues/88) - Bug in `is_significant` function. |
| 47 | + |
| 48 | + |
| 49 | +### Improvements |
| 50 | + |
| 51 | +* Extended the list of known bullet point characters. |
| 52 | + |
| 53 | + |
| 54 | +## Changes in version 0.0.10 |
| 55 | + |
| 56 | + |
| 57 | +### Fixes |
| 58 | + |
| 59 | +* [73](https://github.com/pymupdf/RAG/issues/73) - bug in `to_markdown` internal function. |
| 60 | +* [74](https://github.com/pymupdf/RAG/issues/74) - minimum area for images & vector graphics. |
| 61 | +* [75](https://github.com/pymupdf/RAG/issues/75) - Poor Markdown Generation for Particular PDF. |
| 62 | +* [76](https://github.com/pymupdf/RAG/issues/76) - suggestion on useful api parameters. |
| 63 | + |
| 64 | + |
| 65 | +### Improvements |
| 66 | + |
| 67 | +* Improved recognition of "insignificant" vector graphics. Graphics like text highlights or borders will be ignored. |
| 68 | +* The format of saved images can now be controlled via new parameter `image_format`. |
| 69 | +* Images can be stored in a specific folder via the new parameter `image_path`. |
| 70 | +* Images are **not stored if contained** in another image on the same page. |
| 71 | +* Images are **not stored if too small:** if width or height is less than 5% of the corresponding page dimension. |
| 72 | +* All text is always written. If `write_images=True`, text on images / graphics can be suppressed by setting `force_text=False`. |
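The two image-filtering rules in this list (skip images that are too small, skip images contained in another image) can be sketched as below. This is an illustration under stated assumptions, not the library's implementation: rectangles are plain `(x0, y0, x1, y1)` tuples, and the names `is_too_small`, `contains` and `images_to_store` are hypothetical.

```python
def is_too_small(rect, page_width, page_height, size_limit=0.05):
    """True if width or height is below the given fraction of the page."""
    x0, y0, x1, y1 = rect
    return (x1 - x0) < size_limit * page_width or (y1 - y0) < size_limit * page_height


def contains(outer, inner):
    """True if rectangle 'inner' lies completely inside 'outer'."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and outer[2] >= inner[2]
        and outer[3] >= inner[3]
    )


def images_to_store(rects, page_width, page_height, size_limit=0.05):
    """Drop too-small rectangles, then those contained in another one."""
    candidates = [
        r for r in rects if not is_too_small(r, page_width, page_height, size_limit)
    ]
    kept = []
    for i, rect in enumerate(candidates):
        # Identical rectangles contain each other: keep only the first one.
        covered = any(
            contains(other, rect) and (other != rect or j < i)
            for j, other in enumerate(candidates)
            if j != i
        )
        if not covered:
            kept.append(rect)
    return kept


# Usage on a 600 x 800 page: one nested image and one narrow image are dropped.
rects = [(0, 0, 300, 300), (50, 50, 150, 150), (400, 400, 420, 600), (350, 100, 500, 250)]
print(images_to_store(rects, page_width=600, page_height=800))
```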
| 73 | + |
| 74 | + |
| 75 | +## Changes in version 0.0.9 |
| 76 | + |
| 77 | + |
| 78 | +### Fixes |
| 79 | + |
| 80 | +* [71](https://github.com/pymupdf/RAG/issues/71) - Unexpected results in pymupdf4llm but pymupdf works. |
| 81 | +* [68](https://github.com/pymupdf/RAG/issues/68) - Issue with text extraction near footer of page. |
| 82 | + |
| 83 | + |
| 84 | +### Improvements |
| 85 | + |
| 86 | +* Improved identification of scattered text span particles. This should address most issues with out-of-sequence situations. |
| 87 | +* We now correctly process rotated pages (see [issue 68](https://github.com/pymupdf/RAG/issues/68)). |
| 88 | + |
| 89 | + |
| 90 | +## Changes in version 0.0.8 |
| 91 | + |
| 92 | + |
| 93 | +### Fixes |
| 94 | + |
| 95 | + |
| 96 | +* [65](https://github.com/pymupdf/RAG/issues/65) - Fix typo in `pymupdf_rag.py`. |
| 97 | + |
| 98 | + |
| 99 | +## Changes in version 0.0.7 |
| 100 | + |
| 101 | + |
| 102 | +### Fixes |
| 103 | + |
| 104 | + |
| 105 | +* [54](https://github.com/pymupdf/RAG/issues/54) - Mistakes in orchestrating sentences. Additional fix: text extraction no longer uses the `TEXT_DEHYPHENATE` flag bit. |
| 106 | + |
| 107 | +### Improvements |
| 108 | + |
| 109 | +* Improved the algorithm dealing with vector graphics. Vector graphics are now more reliably classified as irrelevant: we now detect when "strokes" only exist near the border of the graphic's bounding box. This is quite often the case for code snippets. |
| 110 | + |
| 111 | +## Changes in version 0.0.6 |
| 112 | + |
| 113 | + |
| 114 | +### Fixes |
| 115 | + |
| 116 | + |
| 117 | +* [55](https://github.com/pymupdf/RAG/issues/55) - Bug in helpers/multi_column.py - IndexError: list index out of range. |
| 118 | +* [54](https://github.com/pymupdf/RAG/issues/54) - Mistakes in orchestrating sentences. |
| 119 | +* [52](https://github.com/pymupdf/RAG/issues/52) - Chunking of text files. |
| 120 | +* Partial fix for [41](https://github.com/pymupdf/RAG/issues/41) / [40](https://github.com/pymupdf/RAG/issues/40) - Improved page column detection, but still no silver bullet for overly complex page layouts. |
| 121 | + |
| 122 | +### Improvements |
| 123 | + |
| 124 | +* New parameter `dpi` to specify the resolution of images. |
| 125 | +* New parameters `page_width` / `page_height` for easily processing reflowable documents (Text, Office, e-books). |
| 126 | +* New parameter `graphics_limit` to avoid spending runtime on content of little value. |
| 127 | +* New parameter `table_strategy` to directly control the table detection strategy. |