T* Operator Handled Incorrectly #3247

@hackowitz

Description

The T* operator does not produce a line break when extracting text from some documents. Per the PDF 1.7 spec, "This operator has the same effect as the code 0 -TL Td", but the pypdf T* implementation appears to update only tm_matrix[5], whereas Td also changes tm_matrix[4] whenever tm_matrix[0] or tm_matrix[2] is nonzero (for T*, where tx is 0, whenever tm_matrix[2] is nonzero).
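To make the difference concrete, here is a minimal sketch of the Td arithmetic from the spec, applied to the rotated matrix and the leading (1.135, set by the preceding TD) from the content stream in this report. The function names are hypothetical, and `t_star_y_only` only illustrates the behavior described above; it is not a copy of pypdf's code. For simplicity the sketch applies T* directly to the Tm values and ignores the earlier TD move.

```python
def apply_td(tm, tx, ty):
    """Apply 'tx ty Td' to a text matrix stored as [a, b, c, d, e, f].

    Per the PDF spec, the new matrix is [1 0 0; 0 1 0; tx ty 1] x Tm,
    so only the translation components e and f change.
    """
    a, b, c, d, e, f = tm
    return [a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f]


def t_star(tm, leading):
    """T* expressed as '0 -TL Td' (hypothetical helper name)."""
    return apply_td(tm, 0, -leading)


def t_star_y_only(tm, leading):
    """Illustration only: a T* that touches nothing but index 5."""
    a, b, c, d, e, f = tm
    return [a, b, c, d, e, f - leading]


# Values taken from the content stream in this issue.
tm = [0, 7.98, -7.98, 0, 42.9, 72]
leading = 1.135

print("spec   :", t_star(tm, leading))
print("y-only :", t_star_y_only(tm, leading))
```

With this rotated matrix, the spec-conformant version moves index 4 and leaves index 5 unchanged, while the y-only version does the opposite, so the two disagree about where the next line starts.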

I will open a pull request shortly.

Environment

$ python -m platform
Linux-5.10.228-219.884.amzn2.x86_64-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.3.1, crypt_provider=('cryptography', '3.4.8'), PIL=10.2.0

Code + PDF

This is a minimal, complete example that shows the issue:

The content stream shows a simple sequence of text, each on a new line.

import pypdf

reader = pypdf.PdfReader('link16-line-wrap.sanitized.pdf')
print(reader.pages[0].get_contents().operations)
[
    ([], b'BT'),
    (['/GS0'], b'gs'),
    (['/TT0', 1], b'Tf'),
    ([0, 10.02, -10.02, 0, 15.72, 72], b'Tm'),
    ([' '], b'Tj'),
    ([0, -58.509], b'TD'),
    ([' '], b'Tj'),
    ([0.0014], b'Tc'),
    ([0, 7.98, -7.98, 0, 42.9, 72], b'Tm'),
    (['               '], b'Tj'),
    ([0, -1.135], b'TD'),
    (['               '], b'Tj'),
    ([], b'T*'),
    (['               '], b'Tj'),
    ([], b'T*'),
    (['               '], b'Tj'),
    ([], b'T*'),
    (['               '], b'Tj'),
    ([], b'T*'),
    (['               '], b'Tj'),
    ([], b'T*'),
    (['                                                           WORD DESCRIPTION '], b'Tj'),
    ([], b'T*'),
    (['                                                           ---------------- '], b'Tj'),
    ([], b'T*'),
    # ... irrelevant content trimmed
]

The extracted text, however, does not contain the corresponding line breaks.

print(reader.pages[0].extract_text())
 
 
               
                                                                                                                                      WORD DESCRIPTION                                                            ----------------                

    # ... irrelevant content trimmed
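One rough way to quantify the mismatch is to count the T* operators in the operations list and compare that with the number of newlines in the extracted text. The sketch below assumes the (operands, operator) pair layout shown in the output above; `count_operator` is a hypothetical helper, not part of pypdf, and the sample list is made up for illustration.

```python
def count_operator(operations, operator=b"T*"):
    """Count occurrences of `operator` in a list of (operands, operator) pairs."""
    return sum(1 for _operands, op in operations if op == operator)


# Made-up sample in the same shape as the operations printed above.
sample = [
    ([], b"BT"),
    (["first"], b"Tj"),
    ([], b"T*"),
    (["second"], b"Tj"),
    ([], b"T*"),
    (["third"], b"Tj"),
]

print(count_operator(sample))
```

On a real page, the list from `reader.pages[0].get_contents().operations` could be passed in instead, and the result compared with `extract_text().count("\n")`. The two numbers need not match exactly, since other operators such as TD also start new lines.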

A sample PDF that produces the error: link16-line-wrap.sanitized.pdf

The expected output: link16-line-wrap.sanitized.expected.txt
