Commit 69a5861

Merge pull request #142 from pymupdf/removes-docs
Removes the documentation folder and moves changes to top level MD file.
2 parents b1ae40c + c2bf44f commit 69a5861

31 files changed, +127 -900 lines changed

.readthedocs.yaml

Lines changed: 0 additions & 31 deletions
This file was deleted.

CHANGES.md

Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
# Change Log

## Changes in version 0.0.15

### Fixes

* [138](https://github.com/pymupdf/RAG/issues/138) - Table is not extracted and some text order was wrong.
* [135](https://github.com/pymupdf/RAG/issues/135) - Problem with multiple columns in simple text.
* [134](https://github.com/pymupdf/RAG/issues/134) - Exclude images based on size threshold parameter.
* [132](https://github.com/pymupdf/RAG/issues/132) - Optionally embed images as base64 string.
* [128](https://github.com/pymupdf/RAG/issues/128) - Enhanced image embedding format.

### Improvements

* New parameter `embed_images` (bool) **embeds** images and vector graphics in the markdown text as base64-encoded strings. When set, the `write_images` and `image_path` parameters are ignored.
* New parameter `image_size_limit`: a float between 0 and 1, default 0.05 (5%). Images are ignored if their width or height is smaller than that fraction of the page's width or height.
* Improved the algorithm that determines the sequence of the text rectangles on multi-column pages.
* Changed the header identification algorithm: if a document requires more than six header levels, all text with a font size larger than body text is assumed to be a header of level 6 (i.e. HTML "h6" = "###### ").
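
The header-level rule in the last item can be illustrated with a minimal sketch. It assumes header levels are assigned by ranking the distinct font sizes larger than body text, largest first; the function name `header_prefixes` and its arguments are hypothetical and not part of the package.

```python
def header_prefixes(font_sizes, body_size, max_levels=6):
    """Map each font size larger than body text to a Markdown header prefix.

    Hypothetical helper: sizes are ranked largest first, and any rank
    beyond max_levels collapses to the deepest level ("###### ").
    """
    # Distinct header candidates, largest font size first.
    larger = sorted({s for s in font_sizes if s > body_size}, reverse=True)
    prefixes = {}
    for rank, size in enumerate(larger, start=1):
        level = min(rank, max_levels)  # overflow sizes become level 6
        prefixes[size] = "#" * level + " "
    return prefixes


# Eight distinct sizes above an 11 pt body: the two smallest overflow.
sizes = [30, 26, 22, 20, 18, 16, 14, 13, 11]
mapping = header_prefixes(sizes, body_size=11)
print(repr(mapping[30]))
print(repr(mapping[13]))
```

In this sketch, body-sized text (11 pt) receives no prefix at all, while the 14 pt and 13 pt sizes share level 6 with the 16 pt size.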

## Changes in version 0.0.13

### Fixes

* [112](https://github.com/pymupdf/RAG/issues/112) - Invalid bandwriter header dimensions/setup.

### Improvements

* New parameter `ignore_code` suppresses special formatting of text in mono-spaced fonts.
* New parameter `extract_words` enforces `page_chunks=True` and adds a "words" list to each page dictionary.
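
As a rough illustration of consuming the per-page "words" list, the sketch below works on stand-in data rather than real output. Only the "words" key comes from the entry above; the "text" key, the string placeholders for word entries, and the helper `words_per_page` are assumptions made for the example.

```python
def words_per_page(chunks):
    """Return the number of entries in each page dictionary's "words" list."""
    return [len(chunk.get("words", [])) for chunk in chunks]


# Stand-in page dictionaries. In real use they would come from a call such as
# pymupdf4llm.to_markdown("input.pdf", extract_words=True), and the shape of
# each word entry may differ from the plain strings used here.
sample_chunks = [
    {"text": "Hello world", "words": ["Hello", "world"]},
    {"text": "", "words": []},
]
counts = words_per_page(sample_chunks)
print(counts)
```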

## Changes in version 0.0.11

### Fixes

* [90](https://github.com/pymupdf/RAG/issues/90) - 'Quad' object has no attribute 'tl'.
* [88](https://github.com/pymupdf/RAG/issues/88) - Bug in `is_significant` function.

### Improvements

* Extended the list of known bullet point characters.

## Changes in version 0.0.10

### Fixes

* [73](https://github.com/pymupdf/RAG/issues/73) - bug in `to_markdown` internal function.
* [74](https://github.com/pymupdf/RAG/issues/74) - minimum area for images & vector graphics.
* [75](https://github.com/pymupdf/RAG/issues/75) - Poor Markdown Generation for Particular PDF.
* [76](https://github.com/pymupdf/RAG/issues/76) - suggestion on useful api parameters.

### Improvements

* Improved recognition of "insignificant" vector graphics. Graphics like text highlights or borders will be ignored.
* The format of saved images can now be controlled via the new parameter `image_format`.
* Images can be stored in a specific folder via the new parameter `image_path`.
* Images are **not stored if contained** in another image on the same page.
* Images are **not stored if too small:** if width or height is less than 5% of the corresponding page dimension.
* All text is always written. If `write_images=True`, text on images / graphics can be suppressed by setting `force_text=False`.
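
The two image-storage rules above (containment and minimum size) can be sketched as plain rectangle tests. This is an illustration under stated assumptions, not the package's implementation: `Rect`, `too_small`, `contains`, and `images_to_store` are hypothetical names, and the tie-break for identical rectangles is a choice made for this example.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle (hypothetical stand-in for an image bbox)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return self.x1 - self.x0

    @property
    def height(self):
        return self.y1 - self.y0


def too_small(img, page, limit=0.05):
    """True if width or height is below `limit` times the page dimension."""
    return img.width < limit * page.width or img.height < limit * page.height


def contains(outer, inner):
    """True if `inner` lies completely inside `outer` (borders may touch)."""
    return (outer.x0 <= inner.x0 and outer.y0 <= inner.y0
            and outer.x1 >= inner.x1 and outer.y1 >= inner.y1)


def images_to_store(images, page, limit=0.05):
    """Drop images that are too small or contained in another image."""
    kept = []
    for i, img in enumerate(images):
        if too_small(img, page, limit):
            continue
        # For identical rectangles, keep only the first occurrence.
        if any(j != i and contains(other, img) and (other != img or j < i)
               for j, other in enumerate(images)):
            continue
        kept.append(img)
    return kept


page = Rect(0, 0, 612, 792)
big = Rect(50, 50, 400, 500)
inner = Rect(100, 100, 200, 200)      # inside `big`
narrow = Rect(10, 10, 30, 200)        # width 20 is under 5% of 612
separate = Rect(420, 600, 600, 700)
kept = images_to_store([big, inner, narrow, separate], page)
print(kept)
```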

## Changes in version 0.0.9

### Fixes

* [71](https://github.com/pymupdf/RAG/issues/71) - Unexpected results in pymupdf4llm but pymupdf works.
* [68](https://github.com/pymupdf/RAG/issues/68) - Issue with text extraction near footer of page.

### Improvements

* Improved identification of scattered text span particles. This should address most issues with out-of-sequence situations.
* We now correctly process rotated pages (see [issue 68](https://github.com/pymupdf/RAG/issues/68)).

## Changes in version 0.0.8

### Fixes

* [65](https://github.com/pymupdf/RAG/issues/65) - Fix typo in `pymupdf_rag.py`.

## Changes in version 0.0.7

### Fixes

* [54](https://github.com/pymupdf/RAG/issues/54) - Mistakes in orchestrating sentences. Additional fix: text extraction no longer uses the `TEXT_DEHYPHENATE` flag bit.

### Improvements

* Improved the algorithm dealing with vector graphics. Vector graphics are now more reliably classified as irrelevant: we now detect when "strokes" exist only near the border of the graphic's bounding box. This is quite often the case for code snippets.

## Changes in version 0.0.6

### Fixes

* [55](https://github.com/pymupdf/RAG/issues/55) - Bug in helpers/multi_column.py - IndexError: list index out of range.
* [54](https://github.com/pymupdf/RAG/issues/54) - Mistakes in orchestrating sentences.
* [52](https://github.com/pymupdf/RAG/issues/52) - Chunking of text files.
* Partial fix for [41](https://github.com/pymupdf/RAG/issues/41) / [40](https://github.com/pymupdf/RAG/issues/40) - Improved page column detection, but still no silver bullet for overly complex page layouts.

### Improvements

* New parameter `dpi` to specify the resolution of images.
* New parameters `page_width` / `page_height` for easily processing reflowable documents (Text, Office, e-books).
* New parameter `graphics_limit` to avoid spending run time on content of little value.
* New parameter `table_strategy` to directly control the table detection strategy.
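
A possible way to pass the four parameters above is sketched here. The parameter names come from the list; the values in `EXAMPLE_OPTIONS` are illustrative assumptions rather than documented defaults, and `convert_reflowable` is a hypothetical wrapper around `pymupdf4llm.to_markdown`.

```python
# Illustrative values only; check the package documentation for the defaults
# and accepted values of each parameter.
EXAMPLE_OPTIONS = {
    "dpi": 150,                       # resolution of written images
    "page_width": 612,                # layout width for reflowable documents
    "page_height": 792,               # layout height for reflowable documents
    "graphics_limit": 5000,           # assumed cap on vector graphics per page
    "table_strategy": "lines_strict", # assumed table detection strategy name
}


def convert_reflowable(path, **overrides):
    """Convert a document to Markdown with the example options applied.

    Hypothetical wrapper: keyword overrides replace the example values.
    """
    # Imported lazily so this sketch loads even without the package installed.
    import pymupdf4llm

    options = {**EXAMPLE_OPTIONS, **overrides}
    return pymupdf4llm.to_markdown(path, **options)
```

A call such as `convert_reflowable("notes.txt", page_width=400)` would then keep the other example values and narrow only the layout width; the file name is a placeholder.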

docs/README.md

Lines changed: 0 additions & 68 deletions
This file was deleted.

docs/src/_static/custom.css

Lines changed: 0 additions & 9 deletions
This file was deleted.

docs/src/_static/favicon.ico

-15 KB
Binary file not shown.
-25.5 KB
Binary file not shown.
-25.5 KB
Binary file not shown.

docs/src/changes.rst

Lines changed: 0 additions & 139 deletions
This file was deleted.
