Issue with Japanese and numbers (digits) #973

Hi there,

Following up on #948 and #962, I tested a couple of Japanese documents and the whole process went flawlessly.

To my surprise, the only problem I had was with numbers.
For some reason, ordinary Arabic numerals (0-9) are converted to circled numbers.
Besides being incorrect, this breaks date recognition, because 2016年10月25日 is recognized as ⑳①⑥ 年 ①0  月  ②⑤  日 (weird, right?)
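
In case it helps anyone reproduce or work around this, here is a rough Python sketch of a post-processing step that maps circled numbers back to plain digits. It is only a workaround applied to the extracted text, not a fix in tesseract or docspell, and the function names (`uncircle`, `tidy_date_spacing`) are made up for this example. It assumes only the code points U+2460 to U+2473 (① to ⑳) and U+24EA (⓪) need handling, and it leaves every other character alone so the Japanese text is not altered.

```python
import re
import unicodedata


def _is_circled_number(ch: str) -> bool:
    """True for the circled numbers 1-20 (U+2460..U+2473) and circled zero (U+24EA)."""
    cp = ord(ch)
    return 0x2460 <= cp <= 0x2473 or cp == 0x24EA


def uncircle(text: str) -> str:
    """Replace circled numbers with plain digits, e.g. the circled 20 becomes '20'.

    NFKC normalization is applied per character and only to circled numbers,
    so half-width kana, full-width letters, etc. stay exactly as OCR'd.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch) if _is_circled_number(ch) else ch
        for ch in text
    )


def tidy_date_spacing(text: str) -> str:
    """Drop the stray spaces OCR puts around 年/月/日 when they follow a digit."""
    return re.sub(r"(\d)\s*([年月日])\s*", r"\1\2", text)


if __name__ == "__main__":
    ocr_output = "⑳①⑥ 年 ①0 月 ②⑤ 日"
    print(tidy_date_spacing(uncircle(ocr_output)))
```

Note that this would also rewrite circled numbers that are genuinely in the document (for example list markers like ①, ②), so it is probably only reasonable for text where plain digits are expected, such as dates.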

I tried to find a solution and came across this issue in the tessdata repo, which explains the problem in more detail and has a potential solution (I'm not sure what exactly they were talking about, though).
https://github.com/tesseract-ocr/tessdata/issues/119

I wanted to share a "good" sample document for people to test with, but all the "good" documents I have contain personal information, so I just used the back of a movie ticket that I had lying around. You'll notice that all the numbers at the bottom become circled after OCR.

[z-20210731-145217.pdf](https://github.com/eikek/docspell/files/6911129/z-20210731-145217.pdf)

Thanks!
