BUG: Incorrect text matrix passed to visitor_text in page.extract_text #2059

@troethe

Description

When page.extract_text is given a visitor_text function, the callback receives an incorrect tm_matrix (text matrix) argument.

I believe this is because, here, the visitor is called with the new tm_matrix even though, at this point, the text matrix may already have been changed by the operator currently being handled.
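The suspected ordering problem can be illustrated with a small toy model. This is not pypdf's code: `run`, the `snapshot_before_op` flag, and the operator tuples below are hypothetical names used only for illustration, and the text matrix is reduced to a plain (x, y) translation. The sketch assumes that buffered text is flushed to the visitor when a Td operator starts a new line.

```python
def run(ops, visitor, snapshot_before_op):
    """Toy content-stream loop: buffer text, flush it when Td starts a new line."""
    pos = (0.0, 0.0)  # stand-in for the translation part of the text matrix
    buffered = ""
    for operands, operator in ops:
        before = pos  # position in effect before this operator is applied
        if operator == "Td":
            pos = (pos[0] + operands[0], pos[1] + operands[1])
            if buffered:
                # Buggy variant reports the already-updated position;
                # fixed variant reports the position the text was drawn at.
                visitor(buffered, before if snapshot_before_op else pos)
                buffered = ""
        elif operator == "Tj":
            buffered += operands[0]
    if buffered:
        visitor(buffered, pos)


# Hypothetical operator list, not taken from testdoc.pdf.
ops = [
    ([10.0, 100.0], "Td"),
    (["first"], "Tj"),
    ([0.0, -20.0], "Td"),
    (["second"], "Tj"),
]

buggy, fixed = [], []
run(ops, lambda text, pos: buggy.append((text, pos)), snapshot_before_op=False)
run(ops, lambda text, pos: fixed.append((text, pos)), snapshot_before_op=True)
print("buggy:", buggy)
print("fixed:", fixed)
```

In the buggy variant, "first" is reported at the position of the following line; taking a snapshot before the operator is applied keeps the two positions distinct.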

Environment

Running the current main branch of pypdf on Debian.

$ python -m platform
Linux-5.10.0-23-amd64-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf.__version__)"
3.14.0

Code + PDF

Here is an example script that converts testdoc.pdf to SVG and logs both the operators in the page and the calls to the visitor with their x, y coordinates.

from pypdf import PdfReader
import svgwrite

file_name = "./testdoc.pdf"

reader = PdfReader(file_name)
page = reader.pages[0]

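# Dump the raw content-stream operators of the page.
print("####\nOps:\n####")
print(page.get_contents().operations)

# Upper-right corner of the media box, used as page width and height
# (this assumes the box starts at (0, 0)).
length, height = page.mediabox[2:]

dwg = svgwrite.Drawing(file_name[:-3] + "svg", size=(length, height), profile="full")


def visitor_svg_text(text, cm, tm, fontDict, fontSize):
    # tm[4] and tm[5] are the translation components of the text matrix.
    (x, y) = (tm[4], tm[5])
    print(x, y, text)
    # The SVG y axis points down while the PDF y axis points up, so flip y.
    dwg.add(dwg.text(text, insert=(x, height-y), fill="blue", style=f"font-size: {fontSize}px"))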


print("\n####\nParsed Lines:\n####")
page.extract_text(visitor_text=visitor_svg_text)
dwg.save()

Output:

####
Ops:
####
[([], b'BT'), (['/F29', 14.3462], b'Tf'), ([133.768, 707.125], b'Td'), ([['1', -1125, 'T', 94, 'estsection']], b'TJ'), (['/F19', 9.9626000000000001], b'Tf'), ([0, -21.821000000000002], b'Td'), ([['A', -333, 'B', -334, 'C']], b'TJ'), ([169.36500000000001, -546.04899999999998], b'Td'), ([['1']], b'TJ'), ([], b'ET')]

####
Parsed Lines:
####
0.0 0.0 
133.768 707.125 1 Testsection
133.768 685.304 

303.13300000000004 139.255 A B C

303.13300000000004 139.255 1

As can be seen (also in the resulting SVG), the coordinates reported for the line "A B C" are affected by the Td that follows it, because that is when crlf_space_check detects a new line. However, it then passes the already altered tm_matrix to the visitor instead of the one that was active before the operator was applied.
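For reference, the positions one would expect can be worked out by hand from the operator list in the output above. The following is a minimal sketch, assuming the text matrix starts as the identity and that Td is the only positioning operator involved (which matches the listed operators); the variable names are illustrative and not part of pypdf.

```python
# Td operands and the text shown after each Td, copied from the operator list.
td_operands = [(133.768, 707.125), (0, -21.821), (169.365, -546.049)]
texts = ["1 Testsection", "A B C", "1"]

x = y = 0.0
expected = []
for (tx, ty), text in zip(td_operands, texts):
    # Td moves the start of the line by (tx, ty) relative to the previous line start.
    x, y = x + tx, y + ty
    expected.append((round(x, 3), round(y, 3), text))

for entry in expected:
    print(*entry)
```

Under these assumptions, "A B C" starts at (133.768, 685.304), which is the position the output above reports with empty text, while (303.133, 139.255) belongs only to the final "1".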

Labels

is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-advanced-text-extraction: Getting coordinates, font weight, font type, ...