Exception on indirect object during text extraction

pdfrw (the Python library that underpins pdf_redactor) sometimes generates files in which pypdf's font width calculations encounter an indirect object instead of a plain number. This raises a `TypeError` during text extraction, either when adding an `IndirectObject` to an int or when dividing an `IndirectObject` by a float.

## Environment

```bash
$ python -m platform
Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.39

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none
```

I also reproduced it on macOS.

## Code + PDF

This is a minimal, complete example that shows the issue:

```python
import random
import re
import string

import pdf_redactor
import pypdf


def test_repro(match_pattern, replace):
    options = pdf_redactor.RedactorOptions()
    options.content_filters = [
        (re.compile(match_pattern), lambda m: replace),
    ]
    with open('repro_in.pdf', 'rb') as inhandle:
        with open('repro_out.pdf', 'wb') as outhandle:
            options.input_stream = inhandle
            options.output_stream = outhandle
            pdf_redactor.redactor(options)

    # This will result in an unhandled exception if the file
    # successfully reproduces the error
    with open('repro_out.pdf', 'rb') as inhandle:
        r = pypdf.PdfReader(inhandle)
        p = r.pages[0]
        _ = p.extract_text()


test_repro('TEXT', '-')
```

It took a while to provoke pdf_redactor / pdfrw into using an indirect object with a PDF that I could share, and I'm still not 100% certain what causes it to do so, so this PDF (a Word document exported to PDF) may include superfluous content.

[repro_in.pdf](https://github.com/user-attachments/files/17903952/repro_in.pdf)

To save you installing pdf_redactor, here's the PDF that gets generated:

[repro_out.pdf](https://github.com/user-attachments/files/17904233/repro_out.pdf)

## Traceback

This is the complete traceback I see:

```
Traceback (most recent call last):
  File "/Users/neil/3rdpartysw/pypdf/repro.py", line 28, in <module>
    test_repro('TEXT', '-')
  File "/Users/neil/3rdpartysw/pypdf/repro.py", line 25, in test_repro
    _ = p.extract_text()
        ^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_page.py", line 2393, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_page.py", line 1868, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_cmap.py", line 33, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_cmap.py", line 60, in build_char_map_from_dict
    half_space_width = compute_space_width(font_width_map, space_key_char) / 2.0
                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~
TypeError: unsupported operand type(s) for /: 'IndirectObject' and 'float'
```

After fixing the issue in `build_char_map_from_dict` (the value actually comes from `compute_space_width`), a similar exception is raised in `_get_acutual_font_widths` (sic), which stems from `compute_font_width`:

```
Traceback (most recent call last):
  File "/Users/neil/3rdpartysw/pypdf/repro.py", line 28, in <module>
    test_repro('TEXT', '-')
  File "/Users/neil/3rdpartysw/pypdf/repro.py", line 25, in test_repro
    _ = p.extract_text()
        ^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_page.py", line 2393, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_page.py", line 2109, in _extract_text
    process_operation(b"Tj", [op])
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_page.py", line 2052, in process_operation
    text, rtl_dir, _actual_str_size = self._handle_tj(
                                      ^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_page.py", line 1820, in _handle_tj
    self._get_acutual_font_widths(cmap, text_operands, font_size, space_width))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_page.py", line 1786, in _get_acutual_font_widths
    font_widths += compute_font_width(font_width_map, char)
TypeError: unsupported operand type(s) for +=: 'int' and 'IndirectObject'
```

## Patch

I've forked the repo, and have created a patch that solves this problem for this particular input file. I'll raise a PR for that in a minute. There may, of course, be more complicated scenarios that the patch doesn't handle correctly.
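
To illustrate the general idea (this is not the patch itself, and not pypdf code), here is a minimal sketch of the failure pattern and one way to guard against it. `FakeIndirect`, `resolve_number`, and `total_width` are hypothetical names made up for this example; the only assumption borrowed from pypdf is that a reference object exposes a `get_object()` method that returns the object it points at.

```python
class FakeIndirect:
    """Hypothetical stand-in for an indirect reference (not pypdf's class)."""

    def __init__(self, target):
        self._target = target

    def get_object(self):
        # Dereference to the value this reference points at.
        return self._target


def resolve_number(value):
    """Dereference once via get_object() if available, then coerce to float."""
    getter = getattr(value, "get_object", None)
    if callable(getter):
        value = getter()
    return float(value)


def total_width(widths):
    # Widths may be a mix of plain numbers and references.
    return sum(resolve_number(w) for w in widths)


# Without resolving, arithmetic on the reference fails, much like the
# tracebacks above.
try:
    0 + FakeIndirect(500)
except TypeError as exc:
    print("unresolved:", exc)

# With resolving, both operations from the tracebacks succeed.
print(total_width([500, FakeIndirect(250)]))
print(resolve_number(FakeIndirect(600)) / 2.0)
```

Resolving at the point where the width is read, rather than at each arithmetic site, would keep the change small, but whether that is the right place in pypdf is for the maintainers to judge.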
