Issue with Japanese and numbers (digits) #973
Description
Hi there,
Following up on #948 and #962, I tested a couple of Japanese documents and the whole process went flawlessly.
To my surprise, the only problem I had is with numbers.
For some reason, Arabic numerals (ordinary digits) are converted to circled numbers.
Besides being incorrect, this breaks date recognition: 2016年10月25日 is recognized as ⑳①⑥ 年 ①0 月 ②⑤ 日 (weird, right?)
I tried to find a solution and came across this issue in the tessdata repo, which explains the problem in more detail and has a potential solution (I'm not sure what they were talking about, exactly).
tesseract-ocr/tessdata#119
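As a possible stopgap while the recognition itself is unresolved, the circled characters could be mapped back to plain digits after OCR. The following is only a minimal sketch of that idea, not something provided by OCRmyPDF or Tesseract: the function name `normalize_circled_digits` is hypothetical, and it assumes you have the recognized text available as a Python string. It relies on the fact that Unicode ① (U+2460) through ⑳ (U+2473) stand for the values 1 to 20 and ⓪ (U+24EA) stands for 0.

```python
# Hypothetical post-processing helper; not part of OCRmyPDF or Tesseract.
# Circled numbers ① (U+2460) through ⑳ (U+2473) encode the values 1 to 20,
# and ⓪ (U+24EA) encodes 0.
CIRCLED_TO_ASCII = {chr(0x2460 + i): str(i + 1) for i in range(20)}
CIRCLED_TO_ASCII["\u24ea"] = "0"

# str.maketrans accepts a dict of single characters mapped to replacement strings.
_TRANSLATION_TABLE = str.maketrans(CIRCLED_TO_ASCII)


def normalize_circled_digits(text: str) -> str:
    """Replace circled numbers in OCR output with their plain decimal digits."""
    return text.translate(_TRANSLATION_TABLE)


if __name__ == "__main__":
    # The misrecognized date from this report.
    print(normalize_circled_digits("⑳①⑥ 年 ①0 月 ②⑤ 日"))
```

This only cleans up extracted text. It would also rewrite circled numbers that genuinely appear in the source document, so it is a workaround rather than a fix for the underlying recognition problem.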
I wanted to find a "good" document to share here for people to test with, but all the "good" documents I've got contain personal information, so I just used the back of a movie ticket that I had lying around. You'll notice that all the numbers at the bottom become circled after OCR.
Thanks!