Arabic text is extracted in the wrong order #1296
Closed
Labels
Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
There is a serious problem with Arabic text extraction.
If a PDF contains the string (مرحبا هذه تجربة), the PyPDF2 extract_text function returns it as (ةبرجت هذه ابحرم), i.e. with the characters in reverse order.
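The relationship between the two strings can be checked with a few lines of Python. This is only a diagnostic sketch, not a fix in PyPDF2: the helper name `restore_logical_order` is hypothetical, and it assumes each extracted line is either entirely right-to-left or entirely left-to-right. Mixed lines (Arabic with Latin words or digits) would need a real bidirectional algorithm.

```python
import unicodedata

# Bidi classes of "strong" characters: L = left-to-right,
# R / AL = right-to-left (AL is used for Arabic letters).
STRONG_CLASSES = {"L", "R", "AL"}
RTL_CLASSES = {"R", "AL"}


def restore_logical_order(line: str) -> str:
    """Hypothetical helper: reverse a line whose strong characters are all RTL.

    Lines containing any strong left-to-right character are returned
    unchanged, because a plain reversal would scramble them.
    """
    strong = [
        unicodedata.bidirectional(ch)
        for ch in line
        if unicodedata.bidirectional(ch) in STRONG_CLASSES
    ]
    if strong and all(cls in RTL_CLASSES for cls in strong):
        return line[::-1]
    return line


expected = "مرحبا هذه تجربة"   # text as it should be extracted
reported = "ةبرجت هذه ابحرم"   # text as reported in this issue

# The reported output is the expected string reversed code point by code point.
print(reported == expected[::-1])
print(restore_logical_order(reported) == expected)
```

If both comparisons print `True` for a given PDF, the extracted text is in visual order rather than logical order, which narrows the problem down to ordering rather than character decoding.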
Environment
$ python -m platform
Windows-10-10.0.19044-SP0
$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.3
Code + PDF
This is a minimal, complete example that shows the issue with file.pdf:
It gives:
but it's partially reversed, e.g. the beginning
should be