Closed
Description
Recently I ran into a particular kind of PDF file from which I cannot extract text: page.extract_text() raises an IndexError.
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Windows-10-10.0.22621-SP0
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none
Code + PDF
This is a minimal, complete example that shows the issue:
from pypdf import PdfReader
reader = PdfReader("example.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()
print(text)
Sample PDF file can be found here:
example.pdf
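As a stopgap while the exception is unresolved, the extraction could be guarded per page so that one problematic font does not abort the whole run. This is only a sketch: `extract_text_safely` is a hypothetical helper, not part of pypdf, and catching `IndexError` assumes the failure is the one described in this report.

```python
from typing import Iterable, List, Optional


def extract_text_safely(pages: Iterable) -> List[Optional[str]]:
    """Return the extracted text per page, or None where extraction failed.

    Hypothetical helper (not part of pypdf). `pages` is any iterable of
    objects exposing extract_text(), e.g. PdfReader("example.pdf").pages.
    """
    results: List[Optional[str]] = []
    for page in pages:
        try:
            results.append(page.extract_text())
        except IndexError:
            # Assumed to be the failure reported in this issue;
            # mark the page as unreadable instead of crashing.
            results.append(None)
    return results
```

With pypdf installed, it could be called as `extract_text_safely(PdfReader("example.pdf").pages)`. Pages that come back as `None` are the ones that hit the error.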
Traceback
This is the complete Traceback I see:
Traceback (most recent call last):
  File "...\prueba_pdf\test.py", line 6, in <module>
    text = page.extract_text()
  File "...\prueba_pdf\venv\lib\site-packages\pypdf\_page.py", line 2284, in extract_text
    return self._extract_text(
  File "...\prueba_pdf\venv\lib\site-packages\pypdf\_page.py", line 1903, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 29, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
  File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 54, in build_char_map_from_dict
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 224, in parse_to_unicode
    return type1_alternative(ft, map_dict, space_code, int_entry)
  File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 481, in type1_alternative
    if words[3] != b"put":
IndexError: list index out of range
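The last frame indexes `words[3]` without first checking how many tokens the line produced. The sketch below is a standalone illustration of that failure mode, not pypdf's actual code: it assumes `words` comes from splitting a font encoding line on spaces, and both the sample lines and the helper `is_put_entry` are hypothetical.

```python
def is_put_entry(line: bytes) -> bool:
    """Return True if `line` has the shape `dup <code> /<name> put`.

    Hypothetical helper. The length check avoids the IndexError that an
    unguarded words[3] raises on lines with fewer than four tokens.
    """
    words = [w for w in line.split(b" ") if w != b""]
    return len(words) > 3 and words[3] == b"put"


print(is_put_entry(b"dup 32 /space put"))  # True
print(is_put_entry(b"dup 32 /space"))      # False; an unguarded words[3] would raise here
```

Whether the font in example.pdf actually contains such a short line is an assumption; the font's encoding data would need to be inspected to confirm it.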