
x values in the tm_matrix are wrong #2075

@rkiddy


I am trying to read what seems to be a not very complex pdf. Here is a bit from one page:

[Image: Screenshot from 2023-08-09 17-44-08]

In a visitor_text callback, I am pulling the y and x values out of the tm_matrix, along with the text. This is what I get:

[...]
{'text_matrix': [1, 0, 0, 1, 26, 541], 'text': 'ADOPTION \n'}
{'text_matrix': [1, 0, 0, 1, 123, 530], 'text': 'adults, adoption of,  AB 1756 '}
{'text_matrix': [1, 0, 0, 1, 273, 519], 'text': 'agencies, organizations, etc.: requirements, prohibitions, etc.,  SB 807 '}
{'text_matrix': [1, 0, 0, 1, 245, 508], 'text': 'assistance programs, adoption: nonminor dependents,  SB 9 '}
{'text_matrix': [1, 0, 0, 1, 114, 497], 'text': 'birth certificates,  AB 1302 '}
{'text_matrix': [1, 0, 0, 1, 35, 486], 'text': 'contact agreements, postadoption— \n'}
{'text_matrix': [1, 0, 0, 1, 110, 474], 'text': 'birth parents,  AB 1650 '}
{'text_matrix': [1, 0, 0, 1, 93, 463], 'text': 'siblings,  AB 20 '}
{'text_matrix': [1, 0, 0, 1, 130, 452], 'text': 'facilitators, adoption,  AB 120'}
{'text_matrix': [1, 0, 0, 1, 164, 452], 'text': ';  SB 120'}
{'text_matrix': [1, 0, 0, 1, 184, 452], 'text': ',  807 '}
{'text_matrix': [1, 0, 0, 1, 199, 441], 'text': 'failed adoptions: reproductive loss leave,  SB 848 '}
{'text_matrix': [1, 0, 0, 1, 300, 430], 'text': 'hearings, adoption finalization: remote proceedings, technology, etc.,  SB 21 '}
{'text_matrix': [1, 0, 0, 1, 135, 419], 'text': 'native american tribes,  AB 120'}
{'text_matrix': [1, 0, 0, 1, 168, 419], 'text': ';  SB 120 '}
{'text_matrix': [1, 0, 0, 1, 170, 408], 'text': 'parental rights, reinstatement of,  AB 20 '}
{'text_matrix': [1, 0, 0, 1, 265, 397], 'text': 'parents, prospective adoptive: criminal background checks,  SB 824 '}
{'text_matrix': [1, 0, 0, 1, 26, 386], 'text': 'ADULT EDUCATION \n'}
{'text_matrix': [1, 0, 0, 1, 150, 375], 'text': 'services, adult educational,  SB 877 '}
{'text_matrix': [1, 0, 0, 1, 140, 364], 'text': 'week, adult education,  ACR 31 '}
{'text_matrix': [1, 0, 0, 1, 26, 353], 'text': 'ADVERTISING. See also MARKETING; and particular subject matter (e.g., \n'}
{'text_matrix': [1, 0, 0, 1, 68, 342], 'text': 'ELECTIONS). \n'}
{'text_matrix': [1, 0, 0, 1, 211, 331], 'text': 'alcoholic beverages: tied-house restrictions,  AB 546'}
{'text_matrix': [1, 0, 0, 1, 231, 331], 'text': ',  840'}
{'text_matrix': [1, 0, 0, 1, 251, 331], 'text': ',  1294'}
{'text_matrix': [1, 0, 0, 1, 290, 331], 'text': ' ;  SB 392'}
{'text_matrix': [1, 0, 0, 1, 310, 331], 'text': ',  430 '}
{'text_matrix': [1, 0, 0, 1, 206, 320], 'text': 'campaign re social equity, civil rights, etc.,  SB 447 '}
{'text_matrix': [1, 0, 0, 1, 87, 309], 'text': 'cannabis,  AB 794'}
{'text_matrix': [1, 0, 0, 1, 107, 309], 'text': ',  1207 '}
{'text_matrix': [1, 0, 0, 1, 35, 298], 'text': 'elections. See ELECTIONS. \n'}
{'text_matrix': [1, 0, 0, 1, 35, 287], 'text': 'false, misleading, etc., advertising— \n'}
{'text_matrix': [1, 0, 0, 1, 155, 276], 'text': 'disgorgement, remedy of,  AB 1366 '}
{'text_matrix': [1, 0, 0, 1, 218, 265], 'text': 'master of divinity: prohibited title displays,  AB 1564 '}
{'text_matrix': [1, 0, 0, 1, 232, 254], 'text': 'pregnancy-related services: civil penalties, etc.,  AB 315'}
{'text_matrix': [1, 0, 0, 1, 253, 254], 'text': ',  602 '}
{'text_matrix': [1, 0, 0, 1, 172, 243], 'text': 'pricing for goods and services,  SB 478 '}
{'text_matrix': [1, 0, 0, 1, 321, 232], 'text': 'hotels, short-term rentals, etc., advertised rates: mandatory fee disclosures,  SB 683 '}
{'text_matrix': [1, 0, 0, 1, 247, 221], 'text': 'housing rental properties advertised rates: disclosures,  SB 611 '}
{'text_matrix': [1, 0, 0, 1, 25, 190], 'text': '*2023–24 First Extraordinary Session bills are designated (1X). '}
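
Several fragments in this output share a y value (for example the three at y = 452), so they presumably belong to one visual line. A minimal sketch of stitching them back together, using four rows copied from the output above. `merge_lines` is a hypothetical helper, not part of pypdf, and it assumes that fragments on the same line report exactly equal integer y values:

```python
from collections import defaultdict

# Rows copied from the output above, deliberately shuffled.
rows = [
    {"text_matrix": [1, 0, 0, 1, 184, 452], "text": ",  807 "},
    {"text_matrix": [1, 0, 0, 1, 130, 452], "text": "facilitators, adoption,  AB 120"},
    {"text_matrix": [1, 0, 0, 1, 164, 452], "text": ";  SB 120"},
    {"text_matrix": [1, 0, 0, 1, 93, 463], "text": "siblings,  AB 20 "},
]


def merge_lines(rows):
    """Group fragments by y, then join each group in ascending x order."""
    by_y = defaultdict(list)
    for row in rows:
        x, y = row["text_matrix"][4], row["text_matrix"][5]
        by_y[y].append((x, row["text"]))
    lines = []
    # Largest y first, i.e. from the top of the page down.
    for y in sorted(by_y, reverse=True):
        fragments = sorted(by_y[y])
        lines.append((y, "".join(text for _, text in fragments)))
    return lines


for y, text in merge_lines(rows):
    print(y, repr(text))
```

This relies only on the relative order of the x values within a line, which looks sane in this sample even though the absolute values do not.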

There are only 2 levels of indentation in the text, as you can see from the screenshot, yet the x values are all over the place.
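
To make "all over the place" concrete, here is a rough sketch that counts the distinct x values in a subset of the rows above. Splitting the rows by whether the text ends in a newline is my own choice for illustration, not something pypdf documents. In this subset the newline-terminated fragments land on just two x values, while the others each get a different one:

```python
# (x, text) pairs copied from a subset of the output above.
rows = [
    (26, "ADOPTION \n"),
    (123, "adults, adoption of,  AB 1756 "),
    (273, "agencies, organizations, etc.: requirements, prohibitions, etc.,  SB 807 "),
    (114, "birth certificates,  AB 1302 "),
    (26, "ADULT EDUCATION \n"),
    (150, "services, adult educational,  SB 877 "),
    (35, "elections. See ELECTIONS. \n"),
]


def distinct_x(rows, with_newline):
    """Sorted distinct x values of rows whose text does (or does not) end in a newline."""
    return sorted({x for x, text in rows if text.endswith("\n") == with_newline})


newline_rows = distinct_x(rows, with_newline=True)
other_rows = distinct_x(rows, with_newline=False)
print("ending in newline:", newline_rows)
print("not ending in newline:", other_rows)
```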

The amount of error in the x value seems to be somewhat proportional to the number of spaces that have been lost. I wonder if this is significant.
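
One way to probe this is to check how strongly the reported x tracks the amount of text in the fragment. The sketch below uses six single-fragment rows copied from the output above and takes the string length as a crude stand-in for rendered width (an assumption, since the font is not monospaced). It does not count lost spaces directly, so it tests a neighbouring idea rather than the exact hypothesis:

```python
from math import sqrt

# (x, text) pairs copied from the output above.
rows = [
    (93, "siblings,  AB 20 "),
    (110, "birth parents,  AB 1650 "),
    (114, "birth certificates,  AB 1302 "),
    (123, "adults, adoption of,  AB 1756 "),
    (140, "week, adult education,  ACR 31 "),
    (150, "services, adult educational,  SB 877 "),
]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / sqrt(var_x * var_y)


lengths = [len(text) for _, text in rows]
x_values = [x for x, _ in rows]
r = pearson(lengths, x_values)
print(f"correlation between text length and x: {r:.2f}")
```

For these six rows the correlation is strongly positive, which fits the impression that x grows with the text rather than marking where the line starts. It does not by itself say why.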

Environment

OS: Ubuntu 22.04.2 LTS

 % pip freeze | grep pdf
 pypdf==3.15.0

 $ python -m platform
 Linux-6.2.0-26-generic-x86_64-with-glibc2.35

 $ python -c "import pypdf;print(pypdf.__version__)"
 3.15.0

Code + PDF?

LegIndex-page6.pdf

from pypdf import PdfReader

# Fragments collected by the visitor, keyed by y value so they can be sorted.
strings = {}


def text_details(text, curr_trans_matrix, text_matrix, font_dict, font_size):
    # Skip empty, zero-width-space, and bare-newline fragments.
    if text in ("", "\u200b", "\n"):
        return
    y_val = text_matrix[5]
    strings.setdefault(y_val, []).append(
        {"text_matrix": [int(el) for el in text_matrix], "text": text}
    )


if __name__ == "__main__":
    path = "LegIndex-page6.pdf"

    pdf = PdfReader(path)
    pdf.pages[0].extract_text(visitor_text=text_details)
    # Print from the top of the page down (largest y first).
    for y_val in sorted(strings, reverse=True):
        for string in strings[y_val]:
            print(string)


Labels

is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-advanced-text-extraction: Getting coordinates, font weight, font type, ...
