Japanese and CJK Testing & Optimizations #2667
Hi @eikek, I took the dive and overwrote my local joex and restserver configurations with the nightly maintainer versions on a new VM, so I could test everything for Japanese and Japanese (Vertical).
Following the work we did in #2445, I think the following scope should be added before the full release of the next major version:
☑️ Adding Japanese (Vertical) as a default language.
This has been addressed and solved in #2505
☑️ Setting the primary output type for Japanese/Japanese (Vertical) to output-type=pdf
While PDF/A is in every way a better format, we saw in #2445 that when PDF/A is the encoded format, it prevents ocrmypdf from reading the file. So the design question is: do we want to preserve the original file and improve the integrity of the ocrmypdf-converted file, or make it more readable for optical character recognition? I would favor character recognition, as I'm sure most users would find more use in that, and the recognized data can stay in the database, which is more important than the converted document. However, non-CJK languages may not share this need, so I will add a special mapping for regular Japanese. This should be a simple change that I can make this weekend.
Edit: All set on #2668
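To illustrate the idea of a per-language mapping, here is a minimal Python sketch. It is not the code from #2668. The names `OUTPUT_TYPE_BY_LANG` and `ocrmypdf_args` are hypothetical; `jpn` and `jpn_vert` are the Tesseract language codes, and `-l` and `--output-type` are standard ocrmypdf options.

```python
from typing import Dict, List

# Hypothetical per-language overrides, keyed by Tesseract language code.
# Japanese gets plain PDF output; every other language keeps the default.
OUTPUT_TYPE_BY_LANG: Dict[str, str] = {
    "jpn": "pdf",
    "jpn_vert": "pdf",
}

# ocrmypdf itself defaults to PDF/A output.
DEFAULT_OUTPUT_TYPE = "pdfa"


def ocrmypdf_args(lang: str, infile: str, outfile: str) -> List[str]:
    """Build an ocrmypdf command line with a language-specific output type."""
    output_type = OUTPUT_TYPE_BY_LANG.get(lang, DEFAULT_OUTPUT_TYPE)
    return ["ocrmypdf", "-l", lang, "--output-type", output_type, infile, outfile]
```

The resulting list could be passed to `subprocess.run`. Keeping the override in a lookup table means non-CJK languages are unaffected unless someone adds an entry for them.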
☑️ Add additional documentation for Japanese/vertical languages.
I am planning to make a pull request against the current documentation branch to note that output-type=pdf is set as the default for Japanese, as described above. Because the functionality is similar for Chinese and Japanese (and is likely of some use for Korean), I intend to add this as "Vertical Languages (CJK)" under configuration. I will include a simple bash script for converting vertical text to horizontal output, as described in this post.
Edit: All set on #2669
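For readers who want a feel for what such a conversion might involve, here is a rough Python sketch. It is not the bash script from #2669. It assumes that the OCR text for a vertical page has stray spaces and line breaks between CJK characters, and that joining them gives usable horizontal text. The function name `to_horizontal` and the chosen character ranges are illustrative only.

```python
import re

# Assumed character ranges: CJK punctuation, Hiragana, Katakana,
# CJK Unified Ideographs, and fullwidth forms. Not exhaustive.
_CJK = r"\u3001-\u303f\u3040-\u30ff\u4e00-\u9fff\uff01-\uff60"

# Whitespace (including newlines) sitting directly between two CJK characters.
_GAP = re.compile(rf"(?<=[{_CJK}])\s+(?=[{_CJK}])")


def to_horizontal(text: str) -> str:
    """Join CJK characters split by OCR whitespace, keeping blank-line paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(_GAP.sub("", p) for p in paragraphs)


if __name__ == "__main__":
    print(to_horizontal("日 本 語\nの 文 章"))
```

Whitespace next to Latin text is left alone, so mixed lines such as `OCR 結 果` keep the space after `OCR`.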
In the future, it would be nice if we could run this conversion automatically on the extracted metadata for Japanese (Vertical). That way the user receives horizontal output, which is much more useful in almost all circumstances. Alternatively, we could add a button for the ocrmypdf "sidecar" output, which can also generate this data, as previously discussed in #2505.
This will provide the basis for anyone who wants to add Chinese, Korean, or other vertical languages in the future. If all goes well with my volunteer time, I aim to resolve the points above with two commits in the coming week, in time for the release.
Edit: The original scope should be complete and good to go for release once a second pair of eyes has looked at it.
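As a sketch of the sidecar idea, the following Python helper builds an ocrmypdf command line that writes the recognized text to a `.txt` file next to the output PDF. `--sidecar` is an existing ocrmypdf option; the helper name `sidecar_args` and the choice to derive the sidecar path from the output filename are assumptions made for illustration.

```python
from pathlib import Path
from typing import List


def sidecar_args(infile: str, outfile: str, lang: str = "jpn_vert") -> List[str]:
    """Build an ocrmypdf command that also writes the OCR text to a sidecar file."""
    # Assumption: the sidecar sits next to the output PDF with a .txt suffix.
    sidecar = str(Path(outfile).with_suffix(".txt"))
    return [
        "ocrmypdf",
        "-l", lang,
        "--output-type", "pdf",
        "--sidecar", sidecar,
        infile,
        outfile,
    ]
```

For example, `sidecar_args("in.pdf", "out.pdf")` yields a command whose sidecar path is `out.txt`. The text file could then be post-processed or stored separately from the converted PDF.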