Closed
Labels
is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-advanced-text-extraction: Getting coordinates, font weight, font type, ...
Description
I am trying to read what seems to be a fairly simple PDF. Here is an excerpt from one page:
I take the y and x values from the text matrix (tm_matrix) and the text that is passed to the visitor_text callback. This is what I get:
[...]
{'text_matrix': [1, 0, 0, 1, 26, 541], 'text': 'ADOPTION \n'}
{'text_matrix': [1, 0, 0, 1, 123, 530], 'text': 'adults, adoption of, AB 1756 '}
{'text_matrix': [1, 0, 0, 1, 273, 519], 'text': 'agencies, organizations, etc.: requirements, prohibitions, etc., SB 807 '}
{'text_matrix': [1, 0, 0, 1, 245, 508], 'text': 'assistance programs, adoption: nonminor dependents, SB 9 '}
{'text_matrix': [1, 0, 0, 1, 114, 497], 'text': 'birth certificates, AB 1302 '}
{'text_matrix': [1, 0, 0, 1, 35, 486], 'text': 'contact agreements, postadoption— \n'}
{'text_matrix': [1, 0, 0, 1, 110, 474], 'text': 'birth parents, AB 1650 '}
{'text_matrix': [1, 0, 0, 1, 93, 463], 'text': 'siblings, AB 20 '}
{'text_matrix': [1, 0, 0, 1, 130, 452], 'text': 'facilitators, adoption, AB 120'}
{'text_matrix': [1, 0, 0, 1, 164, 452], 'text': '; SB 120'}
{'text_matrix': [1, 0, 0, 1, 184, 452], 'text': ', 807 '}
{'text_matrix': [1, 0, 0, 1, 199, 441], 'text': 'failed adoptions: reproductive loss leave, SB 848 '}
{'text_matrix': [1, 0, 0, 1, 300, 430], 'text': 'hearings, adoption finalization: remote proceedings, technology, etc., SB 21 '}
{'text_matrix': [1, 0, 0, 1, 135, 419], 'text': 'native american tribes, AB 120'}
{'text_matrix': [1, 0, 0, 1, 168, 419], 'text': '; SB 120 '}
{'text_matrix': [1, 0, 0, 1, 170, 408], 'text': 'parental rights, reinstatement of, AB 20 '}
{'text_matrix': [1, 0, 0, 1, 265, 397], 'text': 'parents, prospective adoptive: criminal background checks, SB 824 '}
{'text_matrix': [1, 0, 0, 1, 26, 386], 'text': 'ADULT EDUCATION \n'}
{'text_matrix': [1, 0, 0, 1, 150, 375], 'text': 'services, adult educational, SB 877 '}
{'text_matrix': [1, 0, 0, 1, 140, 364], 'text': 'week, adult education, ACR 31 '}
{'text_matrix': [1, 0, 0, 1, 26, 353], 'text': 'ADVERTISING. See also MARKETING; and particular subject matter (e.g., \n'}
{'text_matrix': [1, 0, 0, 1, 68, 342], 'text': 'ELECTIONS). \n'}
{'text_matrix': [1, 0, 0, 1, 211, 331], 'text': 'alcoholic beverages: tied-house restrictions, AB 546'}
{'text_matrix': [1, 0, 0, 1, 231, 331], 'text': ', 840'}
{'text_matrix': [1, 0, 0, 1, 251, 331], 'text': ', 1294'}
{'text_matrix': [1, 0, 0, 1, 290, 331], 'text': ' ; SB 392'}
{'text_matrix': [1, 0, 0, 1, 310, 331], 'text': ', 430 '}
{'text_matrix': [1, 0, 0, 1, 206, 320], 'text': 'campaign re social equity, civil rights, etc., SB 447 '}
{'text_matrix': [1, 0, 0, 1, 87, 309], 'text': 'cannabis, AB 794'}
{'text_matrix': [1, 0, 0, 1, 107, 309], 'text': ', 1207 '}
{'text_matrix': [1, 0, 0, 1, 35, 298], 'text': 'elections. See ELECTIONS. \n'}
{'text_matrix': [1, 0, 0, 1, 35, 287], 'text': 'false, misleading, etc., advertising— \n'}
{'text_matrix': [1, 0, 0, 1, 155, 276], 'text': 'disgorgement, remedy of, AB 1366 '}
{'text_matrix': [1, 0, 0, 1, 218, 265], 'text': 'master of divinity: prohibited title displays, AB 1564 '}
{'text_matrix': [1, 0, 0, 1, 232, 254], 'text': 'pregnancy-related services: civil penalties, etc., AB 315'}
{'text_matrix': [1, 0, 0, 1, 253, 254], 'text': ', 602 '}
{'text_matrix': [1, 0, 0, 1, 172, 243], 'text': 'pricing for goods and services, SB 478 '}
{'text_matrix': [1, 0, 0, 1, 321, 232], 'text': 'hotels, short-term rentals, etc., advertised rates: mandatory fee disclosures, SB 683 '}
{'text_matrix': [1, 0, 0, 1, 247, 221], 'text': 'housing rental properties advertised rates: disclosures, SB 611 '}
{'text_matrix': [1, 0, 0, 1, 25, 190], 'text': '*2023–24 First Extraordinary Session bills are designated (1X). '}
There are only 2 levels of indentation in the text, as you can see from the screenshot, yet the x values are all over the place.
The amount of error in the x value seems to be somewhat proportional to the number of spaces that have been lost. I wonder if this is significant.
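One rough way to probe this is to check how closely the reported x value tracks the length of each string. The sketch below uses six (x, text) pairs copied from the output above. It measures string length, not lost spaces, so it tests a related idea rather than the exact hypothesis, and a high correlation would only be a hint, not an explanation. The helper name pearson is made up for this example.

```python
from math import sqrt

# (x from text_matrix[4], text) pairs copied from the output above.
# Only fragments that begin a printed line are included.
samples = [
    (123, "adults, adoption of, AB 1756 "),
    (273, "agencies, organizations, etc.: requirements, prohibitions, etc., SB 807 "),
    (245, "assistance programs, adoption: nonminor dependents, SB 9 "),
    (114, "birth certificates, AB 1302 "),
    (110, "birth parents, AB 1650 "),
    (93, "siblings, AB 20 "),
]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


x_values = [x for x, _ in samples]
lengths = [len(text) for _, text in samples]
r = pearson(x_values, lengths)
print(f"correlation between x and text length: {r:.2f}")
```

If x were a true indentation value, it should depend on the indent level only and not on how long the string is.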
Environment
OS: Ubuntu 22.04.2 LTS
% pip freeze | grep pdf
pypdf==3.15.0
$ python -m platform
Linux-6.2.0-26-generic-x86_64-with-glibc2.35
$ python -c "import pypdf;print(pypdf.__version__)"
3.15.0
Code + PDF?
from pypdf import PdfReader


def text_details(text, curr_trans_matrix, text_matrix, font_dict, font_size):
    info = {
        "text": text,
        "curr_trans_matrix": curr_trans_matrix,
        "text_matrix": text_matrix,
        "font_dict": font_dict,
        "font_size": font_size,
    }
    # put into a dictionary keyed by y value to enable sort.
    if info["text"] != "" and info["text"] != "\u200b" and info["text"] != "\n":
        global strings
        y_val = info["text_matrix"][5]
        if y_val not in strings:
            strings[y_val] = list()
        strings[y_val].append({"text_matrix": [int(el) for el in text_matrix], "text": text})


if __name__ == "__main__":
    path = "LegIndex-page6.pdf"
    strings = {}
    pdf = PdfReader(path)
    text_list = pdf.pages[0].extract_text().split("\n")
    pdf.pages[0].extract_text(visitor_text=text_details)
    y_vals = reversed(sorted(list(strings.keys())))
    for y_val in y_vals:
        for string in strings[y_val]:
            print(string)
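For anyone reproducing this, here is a minimal sketch of the same collection step without the global dictionary. It groups fragments by rounded y and joins the fragments of a line in x order, so that pieces such as ', 807 ' are reattached to the entry they belong to. The names make_collector and merged_lines are hypothetical, and rounding y (rather than truncating with int as in the code above) is an assumption of this sketch. The visitor takes the same five arguments as text_details.

```python
from collections import defaultdict


def make_collector():
    """Return (visitor, lines); lines maps rounded y -> list of (x, text)."""
    lines = defaultdict(list)

    def visitor(text, cm, tm, font_dict, font_size):
        # Skip empty strings, zero-width spaces and bare newlines,
        # mirroring the filter in the code above.
        if text in ("", "\u200b", "\n"):
            return
        x, y = tm[4], tm[5]
        lines[round(y)].append((x, text))

    return visitor, lines


def merged_lines(lines):
    """Yield (y, x of first fragment, joined text), top of page first."""
    for y in sorted(lines, reverse=True):
        fragments = sorted(lines[y], key=lambda fragment: fragment[0])
        yield y, fragments[0][0], "".join(text for _, text in fragments)


if __name__ == "__main__":
    # With pypdf this would be wired up as:
    #   reader.pages[0].extract_text(visitor_text=visitor)
    # Here the visitor is fed three fragments from the output above instead.
    visitor, lines = make_collector()
    visitor("; SB 120", None, [1, 0, 0, 1, 164, 452], None, None)
    visitor("facilitators, adoption, AB 120", None, [1, 0, 0, 1, 130, 452], None, None)
    visitor(", 807 ", None, [1, 0, 0, 1, 184, 452], None, None)
    for row in merged_lines(lines):
        print(row)
```

This only tidies up the grouping; it does not change the x values that are reported.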
