possible bug with error TypeError: 'IndirectObject' object cannot be interpreted as an integer #2137

@rchen19

Description

See the details below. This looks like a bug to me. I worked around it by editing the function compute_space_width in _cmap.py: on line 19 of the code below (line 412 in the original file), I changed st = w[0] to st = w[0] if isinstance(w[0], int) else w[0].get_object(). Since I am not familiar with the low-level details of the PDF format, I am not sure whether this is really a bug or whether my fix makes sense:

def compute_space_width(
    ft: DictionaryObject, space_code: int, space_width: float
) -> float:
    sp_width: float = space_width * 2.0  # default value
    w = []
    w1 = {}
    st: int = 0
    if "/DescendantFonts" in ft:  # ft["/Subtype"].startswith("/CIDFontType"):
        ft1 = ft["/DescendantFonts"][0].get_object()  # type: ignore
        try:
            w1[-1] = cast(float, ft1["/DW"])
        except Exception:
            w1[-1] = 1000.0
        if "/W" in ft1:
            w = list(ft1["/W"])
        else:
            w = []
        while len(w) > 0:
            # st = w[0]
            # above commented out line is the original, below is my edit:
            st = w[0] if isinstance(w[0], int) else w[0].get_object()
            second = w[1].get_object()
            if isinstance(second, int):
                for x in range(st, second):
                    w1[x] = w[2]
                w = w[3:]
            elif isinstance(second, list):
                for y in second:
                    w1[st] = y
                    st += 1
                w = w[2:]
            else:
                logger_warning(
                    "unknown widths : \n" + (ft1["/W"]).__repr__(),
                    __name__,
                )
                break
        try:
            sp_width = w1[space_code]
        except Exception:
            sp_width = (
                w1[-1] / 2.0
            )  # if using default we consider space will be only half size
    elif "/Widths" in ft:
        w = list(ft["/Widths"])  # type: ignore
        try:
            st = cast(int, ft["/FirstChar"])
            en: int = cast(int, ft["/LastChar"])
            if st > space_code or en < space_code:
                raise Exception("Not in range")
            if w[space_code - st] == 0:
                raise Exception("null width")
            sp_width = w[space_code - st]
        except Exception:
            if "/FontDescriptor" in ft and "/MissingWidth" in cast(
                DictionaryObject, ft["/FontDescriptor"]
            ):
                sp_width = ft["/FontDescriptor"]["/MissingWidth"]  # type: ignore
            else:
                # will consider width of char as avg(width)/2
                m = 0
                cpt = 0
                for x in w:
                    if x > 0:
                        m += x
                        cpt += 1
                sp_width = m / max(1, cpt) / 2
    return sp_width
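
To make the idea behind the edit easier to discuss, here is a minimal sketch of the /W parsing loop in isolation, with every entry resolved before it is used. It is not pypdf's code or an official fix: FakeIndirect, resolve, and parse_widths are hypothetical names introduced only for illustration, and the only assumption about indirect references is that they expose get_object(), as in the snippet above.

```python
from typing import Any, Dict, List


class FakeIndirect:
    """Hypothetical stand-in for an indirect reference (not a pypdf class)."""

    def __init__(self, target: Any) -> None:
        self._target = target

    def get_object(self) -> Any:
        return self._target


def resolve(obj: Any) -> Any:
    """Return the referenced object if obj exposes get_object(), else obj itself."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def parse_widths(w: List[Any]) -> Dict[int, Any]:
    """Walk a /W-style array and map character codes to widths."""
    widths: Dict[int, Any] = {}
    entries = list(w)
    while len(entries) >= 2:
        start = resolve(entries[0])
        second = resolve(entries[1])
        if isinstance(second, int):
            # Form "first last width": needs a third entry.
            if len(entries) < 3:
                break
            width = resolve(entries[2])
            # Mirrors the quoted code; note that range() excludes `second`.
            for code in range(start, second):
                widths[code] = width
            entries = entries[3:]
        elif isinstance(second, list):
            # Form "first [w1 w2 ...]": consecutive codes starting at `start`.
            for offset, width in enumerate(second):
                widths[start + offset] = resolve(width)
            entries = entries[2:]
        else:
            # Unknown layout: stop, as the quoted code does after its warning.
            break
    return widths


example = [FakeIndirect(1), 4, FakeIndirect(500), 10, [FakeIndirect(250), 300]]
result = parse_widths(example)
print(result)
```

Compared with my one-line edit, this sketch also resolves the width values themselves, which may or may not be necessary for real files.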

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-148-generic-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.15.4, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

import pypdf

f_path = "data/Morris et al. - 2020 - TextAttack A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP.pdf"
with open(f_path, "rb") as pdf_file_obj:
    p = pypdf.PdfReader(pdf_file_obj).pages[0].extract_text()
    print(p)

The pdf file:
Morris et al. - 2020 - TextAttack A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP.pdf

Traceback

This is the complete Traceback I see:

Traceback (most recent call last):
  File "/export/home/***/try_parse_pdf.py", line 12, in <module>
    p = pypdf.PdfReader(pdf_file_obj).pages[0].extract_text()
  File "/export/home/***/***/mambaforge/envs/pypdf/lib/python3.9/site-packages/pypdf/_page.py", line 2263, in extract_text
    return self._extract_text(
  File "/export/home/***/***/mambaforge/envs/pypdf/lib/python3.9/site-packages/pypdf/_page.py", line 1908, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/export/home/***/***/mambaforge/envs/pypdf/lib/python3.9/site-packages/pypdf/_cmap.py", line 29, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
  File "/export/home/***/***/mambaforge/envs/pypdf/lib/python3.9/site-packages/pypdf/_cmap.py", line 89, in build_char_map_from_dict
    sp_width = compute_space_width(ft, sp, space_width)
  File "/export/home/cuda00042/***/***/envs/pypdf/lib/python3.9/site-packages/pypdf/_cmap.py", line 415, in compute_space_width
    for x in range(st, second):
TypeError: 'IndirectObject' object cannot be interpreted as an integer
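
The error type can be reproduced without pypdf: range() only accepts objects that can be interpreted as integers, so passing an unresolved reference wrapper fails in the same way. The snippet below is a small sketch using a hypothetical stand-in class (StandInIndirect is not a pypdf class); it only assumes that the wrapper exposes get_object().

```python
class StandInIndirect:
    """Hypothetical wrapper that points at a value but is not itself an int."""

    def __init__(self, target):
        self._target = target

    def get_object(self):
        return self._target


def try_range(start, stop):
    """Return the list built from range(), or the TypeError message on failure."""
    try:
        return list(range(start, stop))
    except TypeError as exc:
        return str(exc)


ref = StandInIndirect(3)
unresolved = try_range(ref, 6)             # wrapper passed directly to range()
resolved = try_range(ref.get_object(), 6)  # wrapper resolved first

print(unresolved)
print(resolved)
```

The first call fails with the same kind of message as in the traceback, and the second succeeds once the reference is resolved, which is why the edit to st made the error go away for me.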
