Exception on indirect object during text extraction #2966

@nsw42

The Python library pdfrw (which underpins the pdf_redactor library) sometimes generates files in which pypdf's font width calculations encounter an indirect object. This raises a TypeError (either when adding an IndirectObject to an int, or when dividing an IndirectObject by a float).

Environment

$ python -m platform
Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.39

(but I also reproduced it on macOS)

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

import random
import re
import string

import pdf_redactor
import pypdf


def test_repro(match_pattern, replace):
    options = pdf_redactor.RedactorOptions()
    options.content_filters = [
        (re.compile(match_pattern), lambda m: replace),
    ]
    with open('repro_in.pdf', 'rb') as inhandle:
        with open('repro_out.pdf', 'wb') as outhandle:
            options.input_stream = inhandle
            options.output_stream = outhandle
            pdf_redactor.redactor(options)

    # This will result in an unhandled exception if the file
    # successfully reproduces the error
    with open('repro_out.pdf', 'rb') as inhandle:
        r = pypdf.PdfReader(inhandle)
        p = r.pages[0]
        _ = p.extract_text()


test_repro('TEXT', '-')

It took a while to provoke pdf_redactor / pdfrw into using an indirect object with a PDF that I could share, and I'm still not 100% certain what causes it to do so, so this PDF (a Word document that has been exported to PDF) may include superfluous content.

repro_in.pdf

To save you installing pdf_redactor, here's the PDF that gets generated:

repro_out.pdf

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "/Users/neil/3rdpartysw/pypdf/repro.py", line 28, in <module>
    test_repro('TEXT', '-')
  File "/Users/neil/3rdpartysw/pypdf/repro.py", line 25, in test_repro
    _ = p.extract_text()
        ^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_page.py", line 2393, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_page.py", line 1868, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_cmap.py", line 33, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_cmap.py", line 60, in build_char_map_from_dict
    half_space_width = compute_space_width(font_width_map, space_key_char) / 2.0
                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~
TypeError: unsupported operand type(s) for /: 'IndirectObject' and 'float'

After fixing the issue in build_char_map_from_dict (the root cause is really in compute_space_width), a similar exception is thrown in _get_acutual_font_widths (sic), which stems from compute_font_width:

Traceback (most recent call last):
  File "/Users/neil/3rdpartysw/pypdf/repro.py", line 28, in <module>
    test_repro('TEXT', '-')
  File "/Users/neil/3rdpartysw/pypdf/repro.py", line 25, in test_repro
    _ = p.extract_text()
        ^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_page.py", line 2393, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_page.py", line 2109, in _extract_text
    process_operation(b"Tj", [op])
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_page.py", line 2052, in process_operation
    text, rtl_dir, _actual_str_size = self._handle_tj(
                                      ^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_page.py", line 1820, in _handle_tj
    self._get_acutual_font_widths(cmap, text_operands, font_size, space_width))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_page.py", line 1786, in _get_acutual_font_widths
    font_widths += compute_font_width(font_width_map, char)
TypeError: unsupported operand type(s) for +=: 'int' and 'IndirectObject'
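
Both tracebacks have the same shape: a width value that is still a reference gets used in arithmetic before it has been resolved. The sketch below illustrates that failure mode and one defensive way around it without needing pypdf or the attached PDFs. It is not the patch mentioned in this issue and not pypdf's actual code: `FakeIndirectObject`, `resolve_number`, and `total_width` are hypothetical names invented for illustration. The only thing borrowed from pypdf is the idea that an `IndirectObject` exposes `get_object()` to fetch the value it points at.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class FakeIndirectObject:
    """Hypothetical stand-in for pypdf's IndirectObject: a reference, not a number."""
    target: Any

    def get_object(self) -> Any:
        # The real class looks the value up in the PDF; this stand-in
        # simply returns the value it wraps.
        return self.target


def resolve_number(value: Any) -> float:
    """Follow a reference (if there is one) and return a plain float."""
    if hasattr(value, "get_object"):
        value = value.get_object()
    return float(value)


def total_width(width_map: Dict[str, Any], text: str, default: float = 0.0) -> float:
    """Sum per-character widths, resolving each entry before adding it."""
    return sum(resolve_number(width_map.get(ch, default)) for ch in text)


# A width map in which one entry is direct and one sits behind a reference.
widths = {"A": 600, " ": FakeIndirectObject(250)}

try:
    widths[" "] / 2.0  # same shape as the first traceback
except TypeError as exc:
    error_message = str(exc)

half_space = resolve_number(widths[" "]) / 2.0  # resolved first, so this works
line_width = total_width(widths, "A A")
```

Resolving at the point where the value is read keeps the arithmetic unchanged; whether the real fix should resolve there or earlier, when the width map is built, is a design choice this sketch does not settle.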

Patch

I've forked the repo, and have created a patch that solves this problem for this particular input file. I'll raise a PR for that in a minute. There may, of course, be more complicated scenarios that the patch doesn't handle correctly.

Labels: workflow-text-extraction (from a user's perspective, text extraction is the affected feature/workflow)
