Closed
Description
I am trying to extract the text from a set of PDFs. pypdf fails on some of them: page.extract_text() raises a ValueError (see the traceback below).
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-5.15.0-82-generic-x86_64-with-glibc2.35
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.15.5, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none
Code + PDF
from pypdf import PdfReader

if __name__ == '__main__':
    pdf = PdfReader("bug.pdf")
    for page_number, page in enumerate(pdf.pages, start=1):
        print(f" {page_number}", end="")
        text = page.extract_text()

The failing page is the first page of this PDF from archive.org: https://archive.org/download/1979-Fall-compute-magazine/Compute_Issue_001_1979_Fall.pdf
Let us know if we may add them to our tests!
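Since only some PDFs in the set fail, a small helper that records which pages raise, instead of stopping at the first error, may help narrow things down. The following is only a sketch: extract_pages_safely is a hypothetical name, not part of pypdf, and it assumes nothing about the pages beyond an extract_text() method.

```python
from typing import Dict, Iterable, Tuple


def extract_pages_safely(pages: Iterable) -> Tuple[Dict[int, str], Dict[int, Exception]]:
    """Hypothetical helper (not a pypdf API): extract text page by page.

    Returns two dicts keyed by 1-based page number: the extracted text for
    pages that succeeded, and the raised exception for pages that failed.
    """
    texts: Dict[int, str] = {}
    failures: Dict[int, Exception] = {}
    for page_number, page in enumerate(pages, start=1):
        try:
            texts[page_number] = page.extract_text()
        except Exception as exc:  # broad on purpose: we only want to log and continue
            failures[page_number] = exc
    return texts, failures
```

With pypdf this would presumably be called as extract_pages_safely(PdfReader("bug.pdf").pages); the failures dict then lists the pages worth attaching to a report like this one.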
Traceback
This is the complete Traceback I see:
1Traceback (most recent call last):
  File "/home/fred/Development/extractpages/bug.py", line 7, in <module>
    text = page.extract_text()
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_page.py", line 2263, in extract_text
    return self._extract_text(
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_page.py", line 1908, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 29, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 54, in build_char_map_from_dict
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 234, in parse_to_unicode
    process_rg, process_char, multiline_rg = process_cm_line(
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 309, in process_cm_line
    multiline_rg = parse_bfrange(line, map_dict, int_entry, multiline_rg)
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 340, in parse_bfrange
    a = int(lst[0], 16)
ValueError: invalid literal for int() with base 16: b'\t\t'
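The final error can be reproduced in isolation. The sketch below is not pypdf's code. It assumes, as a guess from the traceback, that a tab-only line inside a bfrange section gets tokenized on single spaces, so the tabs survive as a "token" that int() cannot parse. The sample line and both function names are made up for illustration.

```python
from typing import List


def tokens_space_only(line: bytes) -> List[bytes]:
    # Assumed behaviour: split on single spaces and drop empty pieces.
    # Tabs are not separators here, so a tab-only line stays as one token.
    return [tok for tok in line.split(b" ") if tok]


def tokens_any_whitespace(line: bytes) -> List[bytes]:
    # bytes.split() with no argument splits on any ASCII whitespace
    # (spaces, tabs, newlines) and never yields empty tokens.
    return line.split()


line = b"\t\t"  # hypothetical CMap line containing only tabs

print(tokens_space_only(line))      # the tab-only token survives
print(tokens_any_whitespace(line))  # no tokens at all

try:
    int(tokens_space_only(line)[0], 16)
except ValueError as exc:
    print(exc)  # same kind of error as in the traceback above
```

If this guess is right, splitting on any whitespace (or skipping lines that yield no tokens) would avoid the crash, but whether that fits pypdf's parser is for the maintainers to judge.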