TypeError when extracting text from PDF: Unsupported operand type(s) for '+=' (int and 'DictionaryObject') #3153

@IvanOvchynnikov

Description

I tried to call extract_text() on a page:

self.reader.pages[page_num].extract_text()

Environment

With pypdf==4.3.1, text extraction works as expected.
With pypdf 5.3.0 on Python 3.11, I get the following error:

  File "/Users/ivanovcinnikov/PycharmProjects/teams-rag/jobs/readers/pypdf_section_reader.py", line 39, in extract_text
    text=self.reader.pages[page_num].extract_text(),
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ivanovcinnikov/PycharmProjects/teams-rag/.venv/lib/python3.11/site-packages/pypdf/_page.py", line 2378, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/ivanovcinnikov/PycharmProjects/teams-rag/.venv/lib/python3.11/site-packages/pypdf/_page.py", line 2091, in _extract_text
    process_operation(b"Tj", [op])
  File "/Users/ivanovcinnikov/PycharmProjects/teams-rag/.venv/lib/python3.11/site-packages/pypdf/_page.py", line 2035, in process_operation
    text, rtl_dir, _actual_str_size = self._handle_tj(
                                      ^^^^^^^^^^^^^^^^
  File "/Users/ivanovcinnikov/PycharmProjects/teams-rag/.venv/lib/python3.11/site-packages/pypdf/_page.py", line 1809, in _handle_tj
    self._get_actual_font_widths(cmap, text_operands, font_size, space_width))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ivanovcinnikov/PycharmProjects/teams-rag/.venv/lib/python3.11/site-packages/pypdf/_page.py", line 1775, in _get_actual_font_widths
    font_widths += compute_font_width(font_width_map, char)
TypeError: unsupported operand type(s) for +=: 'int' and 'DictionaryObject'

Marseille_pypdf_level_0 (2)_compressed.pdf
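
To narrow down which pages trigger the error, a per-page guard may help. The following is only a sketch: `extract_text_per_page` is a hypothetical helper, not part of pypdf, and it assumes nothing about the pages beyond an `extract_text()` method. It records the failing page indices instead of aborting on the first TypeError.

```python
from typing import Any, Iterable, List, Optional, Tuple


def extract_text_per_page(
    pages: Iterable[Any],
) -> Tuple[List[Optional[str]], List[Tuple[int, Exception]]]:
    """Hypothetical helper: call extract_text() on every page.

    Pages that raise TypeError get None as their text, and the page
    index plus the exception are collected so they can be reported.
    """
    texts: List[Optional[str]] = []
    failures: List[Tuple[int, Exception]] = []
    for index, page in enumerate(pages):
        try:
            texts.append(page.extract_text())
        except TypeError as exc:
            # Only the error type from the traceback is caught here;
            # any other exception still propagates to the caller.
            texts.append(None)
            failures.append((index, exc))
    return texts, failures
```

With pypdf installed, this could be called as `texts, failures = extract_text_per_page(PdfReader(path).pages)`, where `path` points to the attached PDF. It does not fix the underlying problem; it only lets the remaining pages be extracted and shows where the failure occurs.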

Labels

is-robustness-issue: From a users perspective, this is about robustness
workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow
