
binascii.Error: Odd-length string when parsing pdf #2216

@vors

I am trying to extract text from a single PDF page, and parsing crashes with binascii.Error: Odd-length string.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Darwin-22.6.0-x86_64-i386-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.16.2, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

https://github.com/vors/pypdf-text-parsing-repro (has pdf)

from pypdf import PdfReader

reader = PdfReader("input.pdf")
page = reader.pages[0]
page.extract_text()
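
If the document has more than one page, it can help to find out which pages trigger the error. The following is a minimal sketch, not part of the original report: `find_failing_pages` is a hypothetical helper name, and it only assumes that each page object exposes an `extract_text()` method, as pypdf pages do.

```python
import binascii
from typing import Iterable, List, Tuple


def find_failing_pages(pages: Iterable) -> List[Tuple[int, str]]:
    """Return (page_index, error_message) for every page whose text
    extraction raises binascii.Error.

    `pages` is any iterable of objects with an extract_text() method,
    for example PdfReader("input.pdf").pages (hypothetical helper,
    not a pypdf API).
    """
    failures = []
    for index, page in enumerate(pages):
        try:
            page.extract_text()
        except binascii.Error as exc:
            # Record the failure and keep going with the next page.
            failures.append((index, str(exc)))
    return failures
```

With the reproduction above, it could be called as `find_failing_pages(reader.pages)`.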

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

You can use them in your tests.

Traceback

This is the complete Traceback I see:

Traceback (most recent call last):
  File "repro.py", line 5, in <module>
    page.extract_text()
  File "/Users/sergei.vorobev/src/pypdf-text-parsing-repro/venv/lib/python3.7/site-packages/pypdf/_page.py", line 2266, in extract_text
    visitor_text,
  File "/Users/sergei.vorobev/src/pypdf-text-parsing-repro/venv/lib/python3.7/site-packages/pypdf/_page.py", line 1901, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/Users/sergei.vorobev/src/pypdf-text-parsing-repro/venv/lib/python3.7/site-packages/pypdf/_cmap.py", line 30, in build_char_map
    space_width, ft
  File "/Users/sergei.vorobev/src/pypdf-text-parsing-repro/venv/lib/python3.7/site-packages/pypdf/_cmap.py", line 54, in build_char_map_from_dict
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "/Users/sergei.vorobev/src/pypdf-text-parsing-repro/venv/lib/python3.7/site-packages/pypdf/_cmap.py", line 240, in parse_to_unicode
    int_entry,
  File "/Users/sergei.vorobev/src/pypdf-text-parsing-repro/venv/lib/python3.7/site-packages/pypdf/_cmap.py", line 310, in process_cm_line
    multiline_rg = parse_bfrange(line, map_dict, int_entry, multiline_rg)
  File "/Users/sergei.vorobev/src/pypdf-text-parsing-repro/venv/lib/python3.7/site-packages/pypdf/_cmap.py", line 369, in parse_bfrange
    ] = unhexlify(fmt2 % c).decode("utf-16-be", "surrogatepass")
binascii.Error: Odd-length string
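
The last frame shows the exception coming from `unhexlify(fmt2 % c)`. `binascii.unhexlify` only accepts an even number of hex digits, so the error would occur whenever the formatted value has an odd number of digits. The sketch below reproduces that behaviour in isolation. It assumes that `fmt2` is a zero-padded hex format whose width can end up odd; that is a guess based on the traceback, not a confirmed reading of the pypdf source. `hex_to_text` is a hypothetical name used only for illustration.

```python
from binascii import Error, unhexlify


def hex_to_text(value: int, width: int) -> str:
    """Format `value` as zero-padded uppercase hex with `width` digits,
    then decode the resulting bytes as UTF-16-BE.

    Hypothetical helper mirroring the pattern in the traceback;
    the way `width` is chosen in pypdf is an assumption here.
    """
    fmt = b"%%0%dX" % width  # e.g. width=4 gives b"%04X"
    return unhexlify(fmt % value).decode("utf-16-be", "surrogatepass")


# Even width: b"0041" is two bytes, which decodes to a single character.
print(hex_to_text(0x41, 4))

# Odd width: b"00041" has five hex digits, so unhexlify rejects it.
try:
    hex_to_text(0x41, 5)
except Error as exc:
    print(f"binascii.Error: {exc}")
```

If this assumption holds, a `bfrange` entry in the font's ToUnicode CMap with an odd number of hex digits in its destination value would be enough to trigger the crash.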


Labels: Has MCVE, workflow-text-extraction
