pypdf crashes when extracting text from pdf #2173

@fstark

Description

I am trying to extract the text from a set of PDFs. pypdf fails on some of them with a ValueError raised from page.extract_text().

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.15.0-82-generic-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.15.5, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

from pypdf import PdfReader

if __name__ == '__main__':
    pdf = PdfReader("bug.pdf")
    for page_number, page in enumerate(pdf.pages, start=1):
        print( f" {page_number}", end="" )
        text = page.extract_text()

bug.pdf

The page is the first page of this PDF from archive.org: https://archive.org/download/1979-Fall-compute-magazine/Compute_Issue_001_1979_Fall.pdf
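
Since I am processing a whole set of PDFs, a per-page guard would at least let the run continue past pages like this one. The following is only a sketch of that idea, not a fix for the underlying problem. The helper name `extract_text_tolerant` and the `_FakePage` stand-in are hypothetical and exist only so the example runs without pypdf or a PDF file; the only thing assumed of a page is that it exposes `extract_text()`.

```python
def extract_text_tolerant(pages):
    """Return (texts, failures) for an iterable of objects exposing extract_text().

    Both results are dicts keyed by 1-based page number.
    """
    texts = {}
    failures = {}
    for page_number, page in enumerate(pages, start=1):
        try:
            texts[page_number] = page.extract_text()
        except Exception as exc:
            # Broad on purpose: record the failure and keep going.
            failures[page_number] = repr(exc)
    return texts, failures


class _FakePage:
    """Hypothetical stand-in for a pypdf page, used only to exercise the helper."""

    def __init__(self, text=None, error=None):
        self._text = text
        self._error = error

    def extract_text(self):
        if self._error is not None:
            raise self._error
        return self._text


# The first page fails, the second one extracts normally.
texts, failures = extract_text_tolerant(
    [_FakePage(error=ValueError("bad token")), _FakePage(text="hello")]
)
print(texts)
print(failures)
```

With pypdf installed, the same helper could be called as `extract_text_tolerant(PdfReader("bug.pdf").pages)`. The recorded failures then show which pages need a closer look.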

Let us know if we may add them to our tests!

Traceback

This is the complete Traceback I see:

 1Traceback (most recent call last):
  File "/home/fred/Development/extractpages/bug.py", line 7, in <module>
    text = page.extract_text()
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_page.py", line 2263, in extract_text
    return self._extract_text(
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_page.py", line 1908, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 29, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 54, in build_char_map_from_dict
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 234, in parse_to_unicode
    process_rg, process_char, multiline_rg = process_cm_line(
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 309, in process_cm_line
    multiline_rg = parse_bfrange(line, map_dict, int_entry, multiline_rg)
  File "/home/fred/Development/extractpages/venv/lib/python3.10/site-packages/pypdf/_cmap.py", line 340, in parse_bfrange
    a = int(lst[0], 16)
ValueError: invalid literal for int() with base 16: b'\t\t'
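
The last frame can be reproduced without pypdf or the PDF. This is a minimal sketch that only mirrors the `int(lst[0], 16)` call from the traceback; the token value is copied from the error message, and the variable names are mine. It shows that the value being converted consisted of two tab characters rather than hexadecimal digits. It does not show why the parser ended up with that token.

```python
# Value copied from the ValueError message in the traceback.
token = b"\t\t"

try:
    int(token, 16)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# For comparison, a well-formed hexadecimal token converts without error.
valid_value = int(b"0041", 16)
print(valid_value)
```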
