Loading large content streams consumes lots of memory #3167

I read a lot of PDFs, but this morning one of our nodes was OOMKilled in a loop, and PDF reading was the cause.
The problematic PDF has a large diagram on page 10, and that is where memory usage skyrockets.

## Environment

Which environment were you using when you encountered the problem?

Python version : 3.12
PyPDF version : 4.1.0
OS : Ubuntu

## Code + PDF

This is a minimal, complete example that shows the issue:

```python
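from typing import Dict, List, Tuple

from pypdf import PdfReader

# BodyInfo is the reporter's own result container; its definition is not shown here.


def get_body_info(reader: PdfReader) -> BodyInfo: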
    heights: Dict[float, int] = {}
    x_pos: Dict[float, int] = {}
    height_y_min_max_per_page: List[Dict[float, Tuple[float, float]]] = []

    number_of_pages = len(reader.pages)

    for i in range(number_of_pages):
        page_y_min_max_heights: Dict[float, Tuple[float, float]] = {}
        height_y_min_max_per_page.append({})

        def get_characters_heights(text, cm, tm, font_dict, font_size):
            height = round(font_size, 0)
            x = round(max([cm[4], tm[4]]), 0)
            y = round(max([cm[5], tm[5]]), 0)
            if text not in ["", " ", "\n"]:
                if height not in heights:
                    heights[height] = 0
                if x not in x_pos:
                    x_pos[x] = 0
                if height not in page_y_min_max_heights:
                    page_y_min_max_heights[height] = (y, y)

                heights[height] += 1
                x_pos[x] += 1
                if y < page_y_min_max_heights[height][0]:
                    page_y_min_max_heights[height] = (
                        y,
                        page_y_min_max_heights[height][1],
                    )
                if y > page_y_min_max_heights[height][1]:
                    page_y_min_max_heights[height] = (
                        page_y_min_max_heights[height][0],
                        y,
                    )
            height_y_min_max_per_page[i].update(page_y_min_max_heights)

        page = reader.pages[i]
        page.extract_text(
            orientations=0,
            visitor_text=get_characters_heights,
        )

    ret = BodyInfo(heights, x_pos, height_y_min_max_per_page)
    return ret
```

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

https://www.nature.com/articles/s41467-024-46625-w.pdf
Page 10
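
To narrow the problem down to a single page, it may help to record the peak memory of each `extract_text()` call separately. The following is only a sketch under a few assumptions: `measure_peak` and `peak_memory_per_page` are hypothetical helper names (not part of pypdf), the visitor callback from the example above is left out for brevity, and `tracemalloc` only sees allocations made through Python's allocators, so the figures are a lower bound on what the kernel counts towards the OOM limit.

```python
import tracemalloc
from typing import Any, Callable, Tuple


def measure_peak(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, int]:
    """Run func and return (result, peak bytes traced by tracemalloc during the call)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes since start().
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def peak_memory_per_page(pdf_path: str) -> None:
    """Print the traced peak memory of extract_text() for every page of a PDF."""
    # Imported lazily so that measure_peak can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    for index, page in enumerate(reader.pages):
        text, peak = measure_peak(page.extract_text)
        print(f"page {index + 1}: {len(text)} chars, peak {peak / 1e6:.1f} MB")
```

Calling `peak_memory_per_page("s41467-024-46625-w.pdf")` on a local copy of the file linked above (the local file name is an assumption) should show whether page 10 stands out from the other pages.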
