Issue with Japanese and numbers (digits) #973

@wallace11

Hi there,

Following up on #948 and #962, I tested a couple of Japanese documents and the whole process went flawlessly.

To my surprise, the only problem I had was with numbers.
For some reason, Arabic numerals (the plain digits 0-9) are converted to circled numbers.
Besides being incorrect, this breaks date recognition, because 2016年10月25日 is recognized as ⑳①⑥ 年 ①0 月 ②⑤ 日 (weird, right?)
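
To make the mismatch concrete, here is a minimal Python sketch of a post-processing workaround. It is not a fix for the OCR model itself, and it is not something this project does; the helper name `uncircle` is hypothetical. It assumes the only affected characters are the circled numbers ① to ⑳ (U+2460 to U+2473) and ⓪ (U+24EA), and maps each back to plain digits via Unicode compatibility normalization:

```python
import unicodedata

# Circled numbers ① (U+2460) through ⑳ (U+2473), plus ⓪ (U+24EA).
# Assumption: these are the only characters the OCR substitutes for digits.
CIRCLED = {chr(cp) for cp in range(0x2460, 0x2474)} | {"\u24ea"}


def uncircle(text: str) -> str:
    """Replace circled numbers with their plain-digit compatibility form.

    Only characters in CIRCLED are normalized, so other text
    (kanji, kana, full-width characters) is left untouched.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch) if ch in CIRCLED else ch
        for ch in text
    )


ocr_output = "⑳①⑥ 年 ①0 月 ②⑤ 日"
print(uncircle(ocr_output))  # prints: 2016 年 10 月 25 日
```

Note that ⑳ expands to the two digits "20", which happens to be right for this date, and that the extra spaces around 年, 月 and 日 are left as they were, so date parsing may still need to tolerate them.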

I tried to find a solution and came across this issue in the tessdata repo, which explains the problem in more detail and has a potential solution (I'm not sure what exactly they were talking about).
tesseract-ocr/tessdata#119

I wanted to find a "good" paper for sharing here for people to test on, but all the "good" documents I've got contain personal information, etc. so I just used the back of a movie ticket that I had lying around. You'll notice all the numbers at the bottom become circled after OCR.

z-20210731-145217.pdf

Thanks!

Labels: bug, docker
