Description
When page.extract_text is given a visitor_text function, the callback receives an incorrect tm_matrix (text matrix) argument.
I believe this is because here, the visitor is called with the new tm_matrix, even though at this point the text matrix may already have been changed by the operator currently being handled.
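To illustrate the suspected cause, here is a minimal toy model. It is not pypdf's actual code: the `replay` function, its `snapshot_before_op` flag, and the simplified operator tuples are all hypothetical. It buffers text and flushes it to a visitor-like list when a Td starts a new line, either with the position after the operator was applied (the suspected current behaviour) or with a snapshot taken before it.

```python
def replay(ops, snapshot_before_op):
    """Toy text extractor (hypothetical, not pypdf internals).

    Buffers shown text and flushes it to `calls` when a Td starts a new line.
    Only the translation part of the text matrix (tm[4], tm[5]) is modelled.
    """
    x, y = 0.0, 0.0
    buffer = ""
    calls = []
    for operator, operands in ops:
        before = (x, y)  # position before the current operator is applied
        if operator == "Td":
            x, y = x + operands[0], y + operands[1]
            if buffer:
                # The buffered text belongs to the previous line, so the
                # position reported for it should be the pre-operator one.
                pos = before if snapshot_before_op else (x, y)
                calls.append((pos[0], pos[1], buffer))
                buffer = ""
        elif operator == "Tj":
            buffer += operands[0]
    if buffer:
        calls.append((x, y, buffer))
    return calls


ops = [
    ("Td", (10, 100)),
    ("Tj", ("first",)),
    ("Td", (0, -20)),
    ("Tj", ("second",)),
]
buggy = replay(ops, snapshot_before_op=False)
fixed = replay(ops, snapshot_before_op=True)
print(buggy)
print(fixed)
```

In this toy model, flushing with the post-operator position reports "first" at the coordinates of the following line, while the snapshot variant reports it at the line where it was actually shown.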
Environment
Running the current main branch of pypdf on Debian.
$ python -m platform
Linux-5.10.0-23-amd64-x86_64-with-glibc2.31
$ python -c "import pypdf;print(pypdf.__version__)"
3.14.0
Code + PDF
Here is an example script that converts a testdoc.pdf to SVG and logs both the operators on the page and the visitor calls with their x, y coordinates.
from pypdf import PdfReader
import svgwrite

file_name = "./testdoc.pdf"
reader = PdfReader(file_name)
page = reader.pages[0]

print("####\nOps:\n####")
print(page.get_contents().operations)

# mediabox is [x0, y0, x1, y1]; with the origin at (0, 0), the last two
# entries are the page width and height.
length, height = page.mediabox[2:]
dwg = svgwrite.Drawing(file_name[:-3] + "svg", size=(length, height), profile="full")

def visitor_svg_text(text, cm, tm, fontDict, fontSize):
    # tm[4] and tm[5] hold the translation part of the text matrix.
    (x, y) = (tm[4], tm[5])
    print(x, y, text)
    # SVG's y axis points down, so flip the PDF y coordinate.
    dwg.add(dwg.text(text, insert=(x, height-y), fill="blue", style=f"font-size: {fontSize}px"))

print("\n####\nParsed Lines:\n####")
page.extract_text(visitor_text=visitor_svg_text)
dwg.save()

Output:
####
Ops:
####
[([], b'BT'), (['/F29', 14.3462], b'Tf'), ([133.768, 707.125], b'Td'), ([['1', -1125, 'T', 94, 'estsection']], b'TJ'), (['/F19', 9.9626000000000001], b'Tf'), ([0, -21.821000000000002], b'Td'), ([['A', -333, 'B', -334, 'C']], b'TJ'), ([169.36500000000001, -546.04899999999998], b'Td'), ([['1']], b'TJ'), ([], b'ET')]
####
Parsed Lines:
####
0.0 0.0
133.768 707.125 1 Testsection
133.768 685.304
303.13300000000004 139.255 A B C
303.13300000000004 139.255 1
As can be seen (also in the resulting SVG), the coordinates of the line "A B C" are affected by the Td that follows it, because that is when crlf_space_check detects a new line. However, it then passes the already altered tm_matrix to the visitor, instead of the one that was active before the operator was applied.
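For comparison, the positions one would expect can be worked out by hand from the operator list in the output. The sketch below accumulates the Td offsets, assuming an identity text matrix at BT and no Tm or cm operators (none appear in the listing). The `expected_positions` helper is hypothetical, and the TJ arrays are simplified to the strings they render as.

```python
# Td/TJ operators copied from the "Ops" output, with TJ arrays flattened
# to their rendered text (a simplification for this sketch).
ops = [
    ("Td", [133.768, 707.125]),
    ("TJ", "1 Testsection"),
    ("Td", [0, -21.821]),
    ("TJ", "A B C"),
    ("Td", [169.365, -546.049]),
    ("TJ", "1"),
]


def expected_positions(ops):
    """Return (x, y, text) for each TJ, using the position active when it is shown."""
    x = y = 0.0
    out = []
    for operator, operand in ops:
        if operator == "Td":
            # Td moves the start of the next line by (tx, ty).
            x += operand[0]
            y += operand[1]
        elif operator == "TJ":
            # Round only to hide floating-point noise in the printed values.
            out.append((round(x, 3), round(y, 3), operand))
    return out


positions = expected_positions(ops)
for x, y, text in positions:
    print(x, y, text)
```

Under these assumptions, "A B C" belongs at roughly (133.768, 685.304), which is the position the visitor reports with empty text, rather than at (303.133, 139.255).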