extra space in the result pdf when the input pdf is in Chinese #715
Description
Hi.
First, sorry for my poor English.
I recently upgraded my Tesseract engine from v4.0.0.20181030 to v5.0.0-alpha.20201127, and two things changed.
First, when I OCR a PDF containing only English text, the words are now separated by spaces, which is good. With v4.0 the spaces were missing: I used to get text like "thereisnospacebetweenwords", and now I get "there is no space between words". Second, with the v5.0 engine the output is wrong when the input PDF is in Chinese: there is an extra space between every character. The result now looks like 每 个 字 之 间 都 有 多 余 的 空 格 。 ("there is an extra space between every character"). FYI, I did not get these extra spaces when using ocrmypdf to OCR Chinese PDFs with Tesseract v4.0.
To Reproduce
My Tesseract engines are the following, downloaded from https://digi.bib.uni-mannheim.de/tesseract/:
tesseract-ocr-w64-setup-v4.0.0.20181030.exe
tesseract-ocr-w64-setup-v5.0.0-alpha.20201127.exe
I ran ocrmypdf input_name.pdf OCR-output_name.pdf -l chi_sim on the command line.
Expected behavior
I would like to keep the spaces in English text while omitting the extra spaces in Chinese text.
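To make the expected behavior concrete, here is a minimal Python sketch of a post-processing step that removes whitespace only when it sits between two CJK characters and leaves spaces between Latin words alone. This is an illustration of the desired output, not how OCRmyPDF or Tesseract handle spacing internally. The helper name strip_cjk_spaces and the chosen Unicode ranges are assumptions made for this example.

```python
import re

# Assumed ranges for this sketch: CJK symbols and punctuation (U+3000-U+303F),
# CJK Unified Ideographs (U+4E00-U+9FFF), and fullwidth forms (U+FF00-U+FFEF).
_CJK = r"\u3000-\u303f\u4e00-\u9fff\uff00-\uffef"

# Match runs of spaces or tabs that have a CJK character on both sides.
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}])[ \t]+(?=[{_CJK}])")


def strip_cjk_spaces(text: str) -> str:
    """Remove whitespace between adjacent CJK characters (hypothetical helper)."""
    return _SPACE_BETWEEN_CJK.sub("", text)


if __name__ == "__main__":
    print(strip_cjk_spaces("每 个 字 之 间 都 有 多 余 的 空 格 。"))
    print(strip_cjk_spaces("there is no space between words"))
```

Spaces at a boundary between Chinese and English text are kept, because only one side of such a space is a CJK character.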
System (please complete the following information):
- OS: Windows
- Python version: 3.8.2
- OCRmyPDF version: 11.5.0. I upgraded from 11.2.1 today, and another problem occurred:
[WinError 2] The system can not find the file specified.
The warning appeared twice, as shown in the picture.
I still get a result, though, and it is the same as the one I got before upgrading ocrmypdf.
Additional context
I tested the Tesseract v5.0 engine directly, and the output text is fine when I use the parameter --psm 6. However, the parameter does not seem to work as well in ocrmypdf. (It does help a little: the text layer changed from 每 个 字 都 有 多 余 的 空 格 to 每 个 字 都有 多余 的 空格.)
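For anyone trying to reproduce the comparison, the following Python sketch builds the ocrmypdf command with and without a page segmentation mode. It assumes the mode is forwarded to Tesseract through OCRmyPDF's --tesseract-pagesegmode option and that ocrmypdf is on PATH. The helper build_ocrmypdf_command is hypothetical, and the file names are the ones from the reproduction command above.

```python
import shlex
from typing import List, Optional


def build_ocrmypdf_command(
    input_pdf: str,
    output_pdf: str,
    language: str = "chi_sim",
    psm: Optional[int] = None,
) -> List[str]:
    """Build an ocrmypdf argument list (hypothetical helper for this sketch)."""
    cmd = ["ocrmypdf", "-l", language]
    if psm is not None:
        # Assumption: the PSM is passed via --tesseract-pagesegmode.
        cmd += ["--tesseract-pagesegmode", str(psm)]
    cmd += [input_pdf, output_pdf]
    return cmd


# Print both variants so the two runs can be compared side by side.
for mode in (None, 6):
    print(shlex.join(build_ocrmypdf_command("input_name.pdf", "OCR-output_name.pdf", psm=mode)))
```

Each list can be handed to subprocess.run(cmd, check=True) to perform the actual OCR run, and the text layers of the two outputs can then be compared.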
FYI: a solution for the extra spaces in CJK text in Tesseract is discussed in tesseract-ocr/tesseract#991.
Please let me know if you need any further information. Thanks!
