Description
The Python library pdfrw (which is what underpins the pdf_redactor library) sometimes generates files that result in the pypdf font size calculations encountering an indirect object. This results in an exception (either trying to add an int to an IndirectObject, or to divide an IndirectObject by an int).
Environment
$ python -m platform
Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.39
(but I also reproduced it on macOS)
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none
Code + PDF
This is a minimal, complete example that shows the issue:
import random
import re
import string

import pdf_redactor
import pypdf


def test_repro(match_pattern, replace):
    # Redact every match of match_pattern, replacing it with `replace`.
    options = pdf_redactor.RedactorOptions()
    options.content_filters = [
        (re.compile(match_pattern), lambda m: replace),
    ]
    with open('repro_in.pdf', 'rb') as inhandle:
        with open('repro_out.pdf', 'wb') as outhandle:
            options.input_stream = inhandle
            options.output_stream = outhandle
            pdf_redactor.redactor(options)
    # This will result in an unhandled exception if the file
    # successfully reproduces the error
    with open('repro_out.pdf', 'rb') as inhandle:
        r = pypdf.PdfReader(inhandle)
        p = r.pages[0]
        _ = p.extract_text()


test_repro('TEXT', '-')

It took a while to provoke pdf_redactor / pdfrw into using an indirect object with a PDF that I could share, and I'm still not 100% certain what causes it to do so, so this PDF (which is a Word document that has been exported to PDF) may include superfluous content.
To save you installing pdf_redactor, here's the PDF that gets generated:
Traceback
This is the complete traceback I see:
Traceback (most recent call last):
  File "/Users/neil/3rdpartysw/pypdf/repro.py", line 28, in <module>
    test_repro('TEXT', '-')
  File "/Users/neil/3rdpartysw/pypdf/repro.py", line 25, in test_repro
    _ = p.extract_text()
        ^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_page.py", line 2393, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_page.py", line 1868, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_cmap.py", line 33, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_cmap.py", line 60, in build_char_map_from_dict
    half_space_width = compute_space_width(font_width_map, space_key_char) / 2.0
                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~
TypeError: unsupported operand type(s) for /: 'IndirectObject' and 'float'
After fixing the issue in build_char_map_from_dict (which is really in compute_space_width), there's a similar exception thrown in _get_acutual_font_widths (sic), which stems from compute_font_width:
Traceback (most recent call last):
  File "/Users/neil/3rdpartysw/pypdf/repro.py", line 28, in <module>
    test_repro('TEXT', '-')
  File "/Users/neil/3rdpartysw/pypdf/repro.py", line 25, in test_repro
    _ = p.extract_text()
        ^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_page.py", line 2393, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_page.py", line 2109, in _extract_text
    process_operation(b"Tj", [op])
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_page.py", line 2052, in process_operation
    text, rtl_dir, _actual_str_size = self._handle_tj(
                                      ^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_page.py", line 1820, in _handle_tj
    self._get_acutual_font_widths(cmap, text_operands, font_size, space_width))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/neil/3rdpartysw/pypdf/pypdf/_page.py", line 1786, in _get_acutual_font_widths
    font_widths += compute_font_width(font_width_map, char)
TypeError: unsupported operand type(s) for +=: 'int' and 'IndirectObject'
Patch
I've forked the repo, and have created a patch that solves this problem for this particular input file. I'll raise a PR for that in a minute. There may, of course, be more complicated scenarios that the patch doesn't handle correctly.
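For illustration only, here is a rough sketch of the general idea: dereference a width value before doing arithmetic on it. This is not the patch from the PR and does not reproduce pypdf's internals. FakeIndirect, resolve and total_width are hypothetical names invented for this example; the only thing borrowed from pypdf is that its IndirectObject exposes a get_object() method.

```python
from typing import Any, Sequence


class FakeIndirect:
    """Hypothetical stand-in for pypdf's IndirectObject (a reference, not a number)."""

    def __init__(self, target: Any) -> None:
        self._target = target

    def get_object(self) -> Any:
        # Mirrors the one pypdf method this sketch relies on: dereferencing.
        return self._target


def resolve(value: Any) -> Any:
    """Return the referenced object if value can be dereferenced, else value itself."""
    get_object = getattr(value, "get_object", None)
    return get_object() if callable(get_object) else value


def total_width(widths: Sequence[Any], indexes: Sequence[int]) -> float:
    """Sum the selected widths, resolving each entry before adding it."""
    total = 0.0
    for i in indexes:
        total += resolve(widths[i])
    return total


# A widths array where one entry is stored indirectly (assumed shape).
widths = [500, FakeIndirect(250), 600]

try:
    _ = 0 + widths[1]  # adding the unresolved reference fails, as in the traceback
except TypeError as exc:
    print("unresolved:", exc)

print(total_width(widths, [0, 1, 2]))  # 1350.0
```

Resolving at the point of use keeps the arithmetic unchanged, at the cost of having to remember to resolve in every place a width is read.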