Closed
Labels
Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
Description
Extracting text from a single PDF page crashes during parsing with binascii.Error: Odd-length string.
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Darwin-22.6.0-x86_64-i386-64bit
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.16.2, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none
Code + PDF
This is a minimal, complete example that shows the issue:
https://github.com/vors/pypdf-text-parsing-repro (has pdf)
from pypdf import PdfReader
reader = PdfReader("input.pdf")
page = reader.pages[0]
page.extract_text()
Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!
You can use them in your tests.
Traceback
This is the complete Traceback I see:
Traceback (most recent call last):
  File "repro.py", line 5, in <module>
    page.extract_text()
  File "/Users/sergei.vorobev/src/pypdf-text-parsing-repro/venv/lib/python3.7/site-packages/pypdf/_page.py", line 2266, in extract_text
    visitor_text,
  File "/Users/sergei.vorobev/src/pypdf-text-parsing-repro/venv/lib/python3.7/site-packages/pypdf/_page.py", line 1901, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/Users/sergei.vorobev/src/pypdf-text-parsing-repro/venv/lib/python3.7/site-packages/pypdf/_cmap.py", line 30, in build_char_map
    space_width, ft
  File "/Users/sergei.vorobev/src/pypdf-text-parsing-repro/venv/lib/python3.7/site-packages/pypdf/_cmap.py", line 54, in build_char_map_from_dict
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "/Users/sergei.vorobev/src/pypdf-text-parsing-repro/venv/lib/python3.7/site-packages/pypdf/_cmap.py", line 240, in parse_to_unicode
    int_entry,
  File "/Users/sergei.vorobev/src/pypdf-text-parsing-repro/venv/lib/python3.7/site-packages/pypdf/_cmap.py", line 310, in process_cm_line
    multiline_rg = parse_bfrange(line, map_dict, int_entry, multiline_rg)
  File "/Users/sergei.vorobev/src/pypdf-text-parsing-repro/venv/lib/python3.7/site-packages/pypdf/_cmap.py", line 369, in parse_bfrange
    ] = unhexlify(fmt2 % c).decode("utf-16-be", "surrogatepass")
binascii.Error: Odd-length string
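For context on the last frame: binascii.unhexlify rejects any input with an odd number of hex digits, so the error can be reproduced without pypdf or the PDF. The snippet below is only a sketch of that failure mode, not pypdf's code. The format width of 5 and the value 0x41 are assumptions chosen to produce an odd-length string; the actual width used in parse_bfrange depends on the CMap data in the file.

```python
from binascii import Error, unhexlify

# Assumed values for illustration only: a 5-digit width yields an odd-length hex string.
odd_hex = b"%05X" % 0x41   # five hex digits
even_hex = b"%04X" % 0x41  # four hex digits

# An even number of digits decodes as expected.
decoded = unhexlify(even_hex).decode("utf-16-be", "surrogatepass")

# An odd number of digits raises the same exception as in the traceback.
try:
    unhexlify(odd_hex)
    message = None
except Error as exc:
    message = str(exc)

print(repr(decoded), message)
```

If the destination value in a bfrange line has an odd number of hex digits, a format string built from its length would plausibly lead to this error, but that is a guess from the traceback and has not been checked against the PDF.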