Closed
Labels
is-uncaught-exception: Use this label only for issues caused by broken PDF documents that cannot be recovered.
Description
Error extracting text from document
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Windows-11-10.0.22631-SP0
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=11.0.0
Code + PDF
This is a minimal, complete example that shows the issue:
for page_num in range_of_pages:
    page = pdf_reader.pages[page_num]
    page_text = page.extract_text()
    page_text = page_text.strip()
    if not page_text:
        page_num_without_text.append(page_num + 1)
    page_texts.append(page_text)
Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!
1af7d56a-5c8c-4914-85b3-b2536a5525cd.pdf
Traceback
This is the complete traceback I see:
  File "common\fast_pdf_util.py", line 138, in get_pdf_info
    page_text = page.extract_text()
                ^^^^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_page.py", line 2398, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_page.py", line 1868, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\git\pi-embedding\venv\Lib\site-packages\pypdf\_cmap.py", line 33, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_cmap.py", line 56, in build_char_map_from_dict
    encoding, map_dict = get_encoding(ft)
                         ^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_cmap.py", line 129, in get_encoding
    map_dict, int_entry = _parse_to_unicode(ft)
                          ^^^^^^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_cmap.py", line 220, in _parse_to_unicode
    cm = prepare_cm(ft)
         ^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_cmap.py", line 250, in prepare_cm
    cm = cast(DecodedStreamObject, ft["/ToUnicode"]).get_data()
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\git\pi-embedding\venv\Lib\site-packages\pypdf\generic\_data_structures.py", line 1113, in get_data
    decoded.set_data(decode_stream_data(self))
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\filters.py", line 638, in decode_stream_data
    data = ASCII85Decode.decode(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\filters.py", line 449, in decode
    return a85decode(data, adobe=True, ignorechars=WHITESPACES_AS_BYTES)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "Python312\Lib\base64.py", line 388, in a85decode
    raise ValueError(
ValueError: Ascii85 encoded byte sequences must end with b'~>'
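The last frame is in the standard library, so the failure can be reproduced without pypdf or the attached PDF. The following is only a sketch under an assumption: that the /ToUnicode stream in the file is missing its `~>` end marker, which is what the message suggests but which has not been checked against the file here. `try_decode` is a hypothetical helper written for this illustration.

```python
import base64

# Adobe-style Ascii85 wraps the payload in "<~" ... "~>".
encoded = base64.a85encode(b"hello world", adobe=True)

# Simulate a stream whose end marker was dropped (assumed failure mode).
truncated = encoded[:-2]


def try_decode(data: bytes) -> str:
    """Return the decoded text, or the ValueError message if decoding fails."""
    try:
        return base64.a85decode(data, adobe=True).decode("ascii")
    except ValueError as exc:
        return f"ValueError: {exc}"


print(try_decode(encoded))    # well-formed input decodes normally
print(try_decode(truncated))  # missing end marker is rejected by a85decode
```

With `adobe=True`, `a85decode` checks for the end marker before it decodes anything, so a stream that is otherwise intact is still rejected.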
Workaround to get past error
diff --git a/pypdf/filters.py b/pypdf/filters.py
index 517d6aa..5ea158d 100644
--- a/pypdf/filters.py
+++ b/pypdf/filters.py
@@ -635,7 +635,11 @@ def decode_stream_data(stream: Any) -> bytes: # utils.StreamObject
         elif filter_type in (FT.LZW_DECODE, FTA.LZW):
             data = LZWDecode._decodeb(data, params)
         elif filter_type in (FT.ASCII_85_DECODE, FTA.A85):
-            data = ASCII85Decode.decode(data)
+            try:
+                data = ASCII85Decode.decode(data)
+            except ValueError:
+                # ignore the error for now as workaround
+                pass
         elif filter_type == FT.DCT_DECODE:
             data = DCTDecode.decode(data)
         elif filter_type == FT.JPX_DECODE:
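One caveat with the workaround above: when the `ValueError` is swallowed, `data` stays Ascii85-encoded and is handed on in that state. A possible alternative, sketched here with the standard library only, is to repair the missing end marker and decode again. `decode_ascii85_lenient` is a hypothetical helper, not part of pypdf, and it assumes the only defect is the missing `~>`.

```python
import base64
import logging

logger = logging.getLogger(__name__)


def decode_ascii85_lenient(data: bytes) -> bytes:
    """Hypothetical helper: decode Adobe Ascii85, repairing a missing '~>'."""
    data = data.strip()  # surrounding whitespace would hide the end marker
    try:
        return base64.a85decode(data, adobe=True)
    except ValueError:
        if data.endswith(b"~>"):
            raise  # the marker is present, so the error has another cause
        logger.warning("Ascii85 stream lacks the '~>' end marker; appending it.")
        # Complete a half-written marker ("~") or append the whole one.
        repaired = data + (b">" if data.endswith(b"~") else b"~>")
        return base64.a85decode(repaired, adobe=True)


if __name__ == "__main__":
    good = base64.a85encode(b"hello world", adobe=True)
    print(decode_ascii85_lenient(good[:-2]))  # end marker removed entirely
```

Errors unrelated to the end marker are still raised, so genuinely corrupt streams are not silently accepted.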