Per Eva’s tweet, I started looking into whether we have issues with Arabic script.
Not sure if this was a bug in the older tabula-extractor.
Anyway, given this file (or any other PDF with Arabic script) and just trying to pull out any run of text, you’ll get output from Tabula and tabula-java that’s mirrored:
(Note question mark position in first line.)
After looking into it some, here’s what I’ve dug up:
- The PDFBox site here mentions at the bottom that:

  Extracting text in languages whose text goes from right to left (such as Arabic and Hebrew) in PDF files can result in text that is backwards. PDFBox can normalize and reverse the text if the ICU4J jar file has been placed on the classpath (it is an optional dependency). Note that you should also enable sorting with either org.apache.pdfbox.util.PDFTextStripper or org.apache.pdfbox.ExtractText to ensure accurate output.

- Our TextElement class rips some bits from that very PDFTextStripper class. Here's ours, noting that it’s "ported from PDFBox's PDFTextStripper.writePage, with modifications" (lol "Here be dragons").

- So that upstream writePage function has a bunch of extra bits, starting around L629, regarding normalizing RTL scripts. We're missing those bits. Here's the block comment from that part:

      /* Before we can display the text, we need to do some normalizing.
       * Arabic and Hebrew text is right to left and is typically stored
       * in its logical format, which means that the rightmost character is
       * stored first, followed by the second character from the right etc.
       * However, PDF stores the text in presentation form, which is left to
       * right. We need to do some normalization to convert the PDF data to
       * the proper logical output format.
       *
       * Note that if we did not sort the text, then the output of reversing the
       * text is undefined and can sometimes produce worse output then not trying
       * to reverse the order. Sorting should be done for these languages.
       * */
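To make the missing step concrete, here's a rough sketch of the kind of normalization that comment describes. This is not PDFBox's code and not what tabula-java does today. The class `RtlNormalizeSketch` and its methods are hypothetical names made up for illustration, and it uses only the JDK. It assumes the line has already been sorted by position and is purely RTL: if any character has a strong right-to-left directionality, the whole line is flipped from visual order back to logical order.

```java
public class RtlNormalizeSketch {

    // True if the code point has a strong right-to-left directionality
    // (Hebrew-style R or Arabic-style AL in Unicode terms).
    public static boolean isRtl(int codePoint) {
        byte d = Character.getDirectionality(codePoint);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // True if any code point in the string is strongly RTL.
    public static boolean containsRtl(String s) {
        return s.codePoints().anyMatch(RtlNormalizeSketch::isRtl);
    }

    // Assumption: visualLine holds characters in left-to-right page order
    // (i.e. already sorted by position). If it contains RTL characters,
    // reverse the whole line; otherwise return it unchanged.
    public static String toLogicalOrder(String visualLine) {
        if (!containsRtl(visualLine)) {
            return visualLine;
        }
        // StringBuilder.reverse keeps surrogate pairs intact.
        return new StringBuilder(visualLine).reverse().toString();
    }

    public static void main(String[] args) {
        // Arabic "marhaba" followed by the Arabic question mark, in logical order.
        String logical = "\u0645\u0631\u062D\u0628\u0627\u061F";
        // Simulate mirrored extraction: characters in left-to-right page order.
        String visual = new StringBuilder(logical).reverse().toString();
        System.out.println(toLogicalOrder(visual).equals(logical)); // prints "true"
    }
}
```

Reversing the whole line is deliberately naive: it would also flip any embedded Latin text or digits, and it does not handle combining marks. A real fix would presumably need proper bidi handling, which is why upstream leans on ICU4J.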

