Closed
Labels
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
Description
The T* operator does not produce a line wrap when extracting text from some documents. Per the PDF 1.7 spec, "This operator has the same effect as the code 0 -TL Td", but it appears that the pypdf T* implementation only updates tm_matrix[5], whereas Td also changes tm_matrix[4] if either tm_matrix[0] or tm_matrix[2] is nonzero (for T*, where the horizontal offset is 0, the entry that matters is tm_matrix[2]).
I will open a pull request shortly.
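To make the arithmetic concrete, here is a minimal sketch of the Td translation as the spec defines it, applied to a rotated text matrix like the second Tm in the content stream further down. The helper names `apply_td` and `apply_t_star` are hypothetical and are not pypdf functions. The leading of 1.135 is assumed from the `0 -1.135 TD` operation (TD sets the leading to the negated vertical offset), and the Tm values stand in for the text line matrix purely for illustration.

```python
def apply_td(tm, tx, ty):
    """Return the matrix after `tx ty Td`: translate(tx, ty) multiplied by tm.

    tm is [a, b, c, d, e, f], the same layout as a PDF Tm operand list.
    """
    a, b, c, d, e, f = tm
    return [a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f]


def apply_t_star(tm, leading):
    """T* is defined as `0 -leading Td`."""
    return apply_td(tm, 0, -leading)


# Text rotated by 90 degrees: a = d = 0, so a vertical move in text space
# becomes a horizontal move in user space.
tm = [0, 7.98, -7.98, 0, 42.9, 72]
leading = 1.135  # assumed, taken from `0 -1.135 TD`

new_tm = apply_t_star(tm, leading)
print(new_tm[4], new_tm[5])
```

With this matrix only index 4 changes (by `leading * 7.98`) while index 5 stays at 72, so an implementation that touches only index 5 would not register any movement for T*.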
Environment
$ python -m platform
Linux-5.10.228-219.884.amzn2.x86_64-x86_64-with-glibc2.35
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.3.1, crypt_provider=('cryptography', '3.4.8'), PIL=10.2.0
Code + PDF
This is a minimal, complete example that shows the issue:
The content stream shows a simple sequence of text strings, each followed by a T* operator that should start a new line.
import pypdf

reader = pypdf.PdfReader('link16-line-wrap.sanitized.pdf')
print(reader.pages[0].get_contents().operations)
[
([], b'BT'),
(['/GS0'], b'gs'),
(['/TT0', 1], b'Tf'),
([0, 10.02, -10.02, 0, 15.72, 72], b'Tm'),
([' '], b'Tj'),
([0, -58.509], b'TD'),
([' '], b'Tj'),
([0.0014], b'Tc'),
([0, 7.98, -7.98, 0, 42.9, 72], b'Tm'),
([' '], b'Tj'),
([0, -1.135], b'TD'),
([' '], b'Tj'),
([], b'T*'),
([' '], b'Tj'),
([], b'T*'),
([' '], b'Tj'),
([], b'T*'),
([' '], b'Tj'),
([], b'T*'),
([' '], b'Tj'),
([], b'T*'),
([' WORD DESCRIPTION '], b'Tj'),
([], b'T*'),
([' ---------------- '], b'Tj'),
([], b'T*'),
# ... irrelevant content trimmed
]
The extracted text, however, does not contain the expected line breaks:
print(reader.pages[0].extract_text())
WORD DESCRIPTION ----------------
# ... irrelevant content trimmed
A sample PDF that produces the error: link16-line-wrap.sanitized.pdf
The expected output: link16-line-wrap.sanitized.expected.txt