ValueError: Ascii85 encoded byte sequences must end with b'~>' #2996

@neeraj9

Description

`page.extract_text()` raises a `ValueError` while extracting text from the attached document.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-11-10.0.22631-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=11.0.0

Code + PDF

This is a minimal, complete example that shows the issue:

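from pypdf import PdfReader

pdf_reader = PdfReader("1af7d56a-5c8c-4914-85b3-b2536a5525cd.pdf")
range_of_pages = range(len(pdf_reader.pages))  # assumption: all pages
page_texts = []
page_num_without_text = []  # 1-based numbers of pages with no text

for page_num in range_of_pages:
    page = pdf_reader.pages[page_num]
    page_text = page.extract_text()  # the ValueError is raised here
    page_text = page_text.strip()
    if not page_text:
        page_num_without_text.append(page_num + 1)
    page_texts.append(page_text)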
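
For callers who cannot patch pypdf, a caller-side guard is one option. The following is only a sketch: `extract_texts_tolerantly` and `SupportsExtractText` are hypothetical names, not part of pypdf, and treating a failed page as empty text is an assumption that may not suit every use case.

```python
from typing import Iterable, List, Protocol, Tuple


class SupportsExtractText(Protocol):
    """Anything with an extract_text() method, e.g. a pypdf page object."""

    def extract_text(self) -> str: ...


def extract_texts_tolerantly(
    pages: Iterable[SupportsExtractText],
) -> Tuple[List[str], List[int]]:
    """Return (texts, failed_page_numbers); failed pages get an empty string."""
    texts: List[str] = []
    failed: List[int] = []
    for number, page in enumerate(pages, start=1):
        try:
            texts.append(page.extract_text().strip())
        except ValueError:
            # Assumption: a decoding failure is recorded and treated as "no text".
            texts.append("")
            failed.append(number)
    return texts, failed
```

With the reader from the snippet above, this would be called as `texts, failed = extract_texts_tolerantly(pdf_reader.pages)`. The 1-based page numbers in `failed` keep the problem visible instead of silently dropping it.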

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

1af7d56a-5c8c-4914-85b3-b2536a5525cd.pdf

Traceback

This is the complete traceback I see:

  File "common\fast_pdf_util.py", line 138, in get_pdf_info
    page_text = page.extract_text()
                ^^^^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_page.py", line 2398, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_page.py", line 1868, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\git\pi-embedding\venv\Lib\site-packages\pypdf\_cmap.py", line 33, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_cmap.py", line 56, in build_char_map_from_dict
    encoding, map_dict = get_encoding(ft)
                         ^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_cmap.py", line 129, in get_encoding
    map_dict, int_entry = _parse_to_unicode(ft)
                          ^^^^^^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_cmap.py", line 220, in _parse_to_unicode
    cm = prepare_cm(ft)
         ^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_cmap.py", line 250, in prepare_cm
    cm = cast(DecodedStreamObject, ft["/ToUnicode"]).get_data()
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\git\pi-embedding\venv\Lib\site-packages\pypdf\generic\_data_structures.py", line 1113, in get_data
    decoded.set_data(decode_stream_data(self))
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\filters.py", line 638, in decode_stream_data
    data = ASCII85Decode.decode(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\filters.py", line 449, in decode
    return a85decode(data, adobe=True, ignorechars=WHITESPACES_AS_BYTES)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "Python312\Lib\base64.py", line 388, in a85decode
    raise ValueError(
ValueError: Ascii85 encoded byte sequences must end with b'~>'
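
The last frame is in the standard library, so the same error can be reproduced without pypdf or the PDF. This is a small sketch using only `base64`. The sample payload `b"hello"` is arbitrary, and whether the attached file really lacks the end marker (rather than, say, having extra bytes after it) is an assumption I have not verified.

```python
from base64 import a85decode, a85encode

encoded = a85encode(b"hello", adobe=True)  # Adobe framing: b"<~" ... b"~>"
truncated = encoded[:-2]                   # same data without the end marker

# The correctly framed data round-trips.
assert a85decode(encoded, adobe=True) == b"hello"

try:
    a85decode(truncated, adobe=True)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

With `adobe=True`, `a85decode` checks for the end marker before decoding anything, so a stream that is otherwise intact is still rejected outright.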

Workaround to get past error

diff --git a/pypdf/filters.py b/pypdf/filters.py
index 517d6aa..5ea158d 100644
--- a/pypdf/filters.py
+++ b/pypdf/filters.py
@@ -635,7 +635,11 @@ def decode_stream_data(stream: Any) -> bytes:  # utils.StreamObject
             elif filter_type in (FT.LZW_DECODE, FTA.LZW):
                 data = LZWDecode._decodeb(data, params)
             elif filter_type in (FT.ASCII_85_DECODE, FTA.A85):
-                data = ASCII85Decode.decode(data)
+                try:
+                    data = ASCII85Decode.decode(data)
+                except ValueError:
+                    # ignore the error for now as workaround
+                    pass
             elif filter_type == FT.DCT_DECODE:
                 data = DCTDecode.decode(data)
             elif filter_type == FT.JPX_DECODE:
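
Note that the `pass` in the diff above leaves `data` in its Ascii85-encoded form, so the `/ToUnicode` map is probably not parsed correctly afterwards. A possibly less lossy variant is to repair the framing and then decode. The following is a sketch only: `lenient_a85decode` and `IGNORED` are hypothetical names, not pypdf API, and appending the marker assumes the stream is merely missing `~>` rather than being truncated mid-data.

```python
from base64 import a85decode, a85encode

EOD = b"~>"                     # Adobe Ascii85 end-of-data marker
IGNORED = b" \t\n\r\x0b\x0c\x00"  # assumption: bytes to skip while decoding


def lenient_a85decode(data: bytes) -> bytes:
    """Decode Adobe-framed Ascii85, tolerating a missing end marker."""
    data = data.strip()          # drop whitespace around the framing
    if not data.endswith(EOD):
        data += EOD              # assumption: only the marker is missing
    return a85decode(data, adobe=True, ignorechars=IGNORED)


# Usage: the same payload with and without the end marker.
sample = a85encode(b"hello world", adobe=True)
print(lenient_a85decode(sample))
print(lenient_a85decode(sample[:-2]))
```

Stripping first also covers the case where whitespace follows the marker. A genuinely corrupt stream can still raise `ValueError`, which seems preferable to continuing with undecoded bytes.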

Metadata

    Labels

    is-uncaught-exception: Use this label only for issues caused by broken PDF documents that cannot be recovered.
