RTL text is mirrored

Per [Eva’s tweet](https://twitter.com/EvaConstantaras/status/715580426182635520), I started looking into whether we have issues with Arabic script.

Not sure if this was a bug in the older `tabula-extractor`.

Anyway, if you take [this file](http://www.drugs.health.gov.au/internet/drugs/publishing.nsf/Content/languages/$FILE/NESB%20Arabic.pdf) (or any other PDF with Arabic script) and try to pull out any run of text, the output from both Tabula and tabula-java comes out mirrored:

![reversed-txt](https://cloud.githubusercontent.com/assets/53129/14190946/437e73ce-f763-11e5-9453-ef67a038f222.png)

Reference text:
![reference](https://cloud.githubusercontent.com/assets/53129/14190944/411e8bf0-f763-11e5-8560-a03492c2766d.png)

(Note the position of the question mark in the first line.)

---

After looking into it some, here’s what I’ve dug up:
- The PDFBox [text extraction cookbook](https://pdfbox.apache.org/1.8/cookbook/textextraction.html) mentions at the bottom that
  
  > Extracting text in languages whose text goes from right to left (such as Arabic and Hebrew) in PDF files can result in text that is backwards. PDFBox can normalize and reverse the text if the ICU4J jar file has been placed on the classpath (it is an optional dependency). Note that you should also enable sorting with either `org.apache.pdfbox.util.PDFTextStripper` or `org.apache.pdfbox.ExtractText` to ensure accurate output.
- Our `TextElement` class rips some bits from that very `PDFTextStripper` class. [Here's ours](https://github.com/tabulapdf/tabula-java/blob/7b56c46d3362299430f19c34657a692b6529ed98/src/main/java/technology/tabula/TextElement.java#L108), whose comment notes that it’s "ported from from PDFBox's `PDFTextStripper.writePage`, with modifications" (lol "Here be dragons").
- So that [upstream `writePage` function](https://github.com/apache/pdfbox/blob/1.8/pdfbox/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java#L594) has a bunch of extra bits, starting around [L629](https://github.com/apache/pdfbox/blob/1.8/pdfbox/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java#L629) regarding normalizing RTL scripts. We're missing those bits. Here's the block comment from that part:
  
  ```
  /* Before we can display the text, we need to do some normalizing.
   * Arabic and Hebrew text is right to left and is typically stored
   * in its logical format, which means that the rightmost character is
   * stored first, followed by the second character from the right etc.
   * However, PDF stores the text in presentation form, which is left to
   * right.  We need to do some normalization to convert the PDF data to
   * the proper logical output format.
   *
   * Note that if we did not sort the text, then the output of reversing the
   * text is undefined and can sometimes produce worse output then not trying
   * to reverse the order.  Sorting should be done for these languages.
   * */
  ```
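To make the visual-vs-logical distinction concrete, here is a minimal sketch of the kind of normalization we're missing. It is not PDFBox's implementation (which, per the cookbook note, relies on ICU4J when it's on the classpath), and the class and method names (`RtlNormalizeSketch`, `visualToLogical`) are made up for illustration. It assumes a single line of text that has already been sorted left to right by position, uses only `java.text.Bidi` and `Character.getDirectionality` from the JDK, and ignores harder cases such as mirrored brackets or a line whose base direction is LTR.

```java
import java.text.Bidi;

class RtlNormalizeSketch {

    /** Characters that read left to right even inside RTL text (Latin letters, digits). */
    static boolean isLtr(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_LEFT_TO_RIGHT
            || d == Character.DIRECTIONALITY_EUROPEAN_NUMBER
            || d == Character.DIRECTIONALITY_ARABIC_NUMBER;
    }

    /** Strongly right-to-left characters (Hebrew, Arabic letters). */
    static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    /** Converts one line from visual (left-to-right on the page) to logical (reading) order. */
    static String visualToLogical(String visual) {
        // Lines without any RTL characters are already in logical order.
        if (!Bidi.requiresBidi(visual.toCharArray(), 0, visual.length())) {
            return visual;
        }
        // Step 1: reverse the whole line so the rightmost character comes first.
        String reversed = new StringBuilder(visual).reverse().toString();

        // Step 2: step 1 also flipped embedded LTR runs (numbers, Latin words),
        // so flip those back. Neutrals between two LTR characters stay in the run.
        StringBuilder out = new StringBuilder(reversed.length());
        int n = reversed.length();
        int i = 0;
        while (i < n) {
            if (!isLtr(reversed.charAt(i))) {
                out.append(reversed.charAt(i++));
                continue;
            }
            int end = i;
            for (int j = i; j < n && !isRtl(reversed.charAt(j)); j++) {
                if (isLtr(reversed.charAt(j))) {
                    end = j + 1;
                }
            }
            out.append(new StringBuilder(reversed.substring(i, end)).reverse());
            i = end;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "shalom 42" as it would sit on the page, read left to right.
        String visual = "42 \u05DD\u05D5\u05DC\u05E9";
        System.out.println(visualToLogical(visual).equals("\u05E9\u05DC\u05D5\u05DD 42"));
    }
}
```

As the upstream comment warns, this only makes sense on sorted input: reversing an unsorted run can produce worse output than leaving it alone.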

