Parsed output contains duplicated text on PDF pages with dense, small fonts #4134
Replies: 1 comment
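Duplicated text on PDF pages with dense, small fonts is a known MinerU issue, especially when the VLM backend is used. The main causes are the model itself and incomplete chunking and deduplication logic. Several approaches can effectively reduce the duplication, even if they cost some accuracy.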
If the only goal is "no duplication", even when some content is lost or recognized inaccurately, forcing OCR and deduplicating aggressively are the most direct options. If you need more detailed code changes or a merge workflow, I can elaborate.
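As a rough illustration of the "aggressive deduplication" idea, here is a minimal post-processing sketch in Python. It does not use any MinerU API and is not MinerU's internal logic: it assumes you already have the parsed Markdown as a string and simply filters it afterwards. The function names `dedupe_lines` and `collapse_repeated_runs` and their thresholds are hypothetical and would need tuning on real output. Being aggressive, it can also remove legitimately repeated text.

```python
import re


def _normalize(text: str) -> str:
    """Strip all whitespace and lowercase, so near-identical lines compare equal."""
    return re.sub(r"\s+", "", text).lower()


def dedupe_lines(markdown: str, min_len: int = 8) -> str:
    """Drop every line whose normalized form already appeared earlier.

    Lines shorter than min_len (after normalization) are always kept,
    so blank lines and very short headings or markers survive.
    """
    seen = set()
    kept = []
    for line in markdown.splitlines():
        key = _normalize(line)
        if len(key) >= min_len:
            if key in seen:
                continue  # duplicate of an earlier line: skip it
            seen.add(key)
        kept.append(line)
    return "\n".join(kept)


def collapse_repeated_runs(text: str, min_unit: int = 4, max_unit: int = 200) -> str:
    """Collapse a substring repeated back-to-back into a single copy.

    Example: 'abcdabcdabcd' becomes 'abcd'. The unit length is bounded
    by min_unit and max_unit (hypothetical defaults).
    """
    pattern = re.compile(r"(.{%d,%d}?)\1+" % (min_unit, max_unit), re.DOTALL)
    return pattern.sub(r"\1", text)


def aggressive_dedupe(markdown: str) -> str:
    """Apply both passes: in-line repeated runs first, then repeated lines."""
    return dedupe_lines(collapse_repeated_runs(markdown))
```

A typical use would be to read the generated `.md` file, pass its content through `aggressive_dedupe`, and write the result back. Raising `min_len` or `min_unit` makes the filter less aggressive and reduces the risk of deleting text that is repeated on purpose, such as table cells with identical values.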
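I would like to know how to avoid the duplication. I would accept less accurate parsing as long as the output contains no repeated text.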