Loading large content streams consumes lots of memory #3167

@lpi-tn

Description

I'm currently reading a lot of PDFs, and this morning one of our nodes was OOMKilled in a loop. The cause turned out to be PDF reading.
The PDF causing the trouble has a large diagram on page 10, and that is where memory usage skyrockets.

Environment

Which environment were you using when you encountered the problem?

Python version : 3.12
PyPDF version : 4.1.0
OS : Ubuntu

Code + PDF

This is a minimal, complete example that shows the issue:

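from typing import Dict, List, NamedTuple, Tuple

from pypdf import PdfReader


# BodyInfo is not shown in the original report; this definition is an
# assumption inferred from the constructor call at the end of the function.
class BodyInfo(NamedTuple):
    heights: Dict[float, int]
    x_pos: Dict[float, int]
    height_y_min_max_per_page: List[Dict[float, Tuple[float, float]]]


def get_body_info(reader: PdfReader) -> BodyInfo: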
    heights: Dict[float, int] = {}
    x_pos: Dict[float, int] = {}
    height_y_min_max_per_page: List[Dict[float, Tuple[float, float]]] = []

    number_of_pages = len(reader.pages)

    for i in range(number_of_pages):
        page_y_min_max_heights: Dict[float, Tuple[float, float]] = {}
        height_y_min_max_per_page.append({})

        def get_characters_heights(text, cm, tm, font_dict, font_size):
            height = round(font_size, 0)
            x = round(max([cm[4], tm[4]]), 0)
            y = round(max([cm[5], tm[5]]), 0)
            if text not in ["", " ", "\n"]:
                if height not in heights:
                    heights[height] = 0
                if x not in x_pos:
                    x_pos[x] = 0
                if height not in page_y_min_max_heights:
                    page_y_min_max_heights[height] = (y, y)

                heights[height] += 1
                x_pos[x] += 1
                if y < page_y_min_max_heights[height][0]:
                    page_y_min_max_heights[height] = (
                        y,
                        page_y_min_max_heights[height][1],
                    )
                if y > page_y_min_max_heights[height][1]:
                    page_y_min_max_heights[height] = (
                        page_y_min_max_heights[height][0],
                        y,
                    )
            height_y_min_max_per_page[i].update(page_y_min_max_heights)

        page = reader.pages[i]
        page.extract_text(
            orientations=0,
            visitor_text=get_characters_heights,
        )

    ret = BodyInfo(heights, x_pos, height_y_min_max_per_page)
    return ret

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

https://www.nature.com/articles/s41467-024-46625-w.pdf
Page 10
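
To confirm which page is responsible, peak allocation can be measured page by page. The following is a minimal sketch, not part of the original report: `peak_memory_per_item` is a hypothetical helper name, and it relies only on the standard-library `tracemalloc` module, which tracks Python-level allocations (this should cover pypdf, assuming the memory is allocated from Python code).

```python
import tracemalloc
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def peak_memory_per_item(
    items: Iterable[T], work: Callable[[T], object]
) -> List[Tuple[int, int]]:
    """Run work(item) for each item and return (index, peak_bytes) pairs.

    peak_bytes is the peak Python-level allocation reached during the call,
    relative to the amount already allocated just before the call.
    """
    results: List[Tuple[int, int]] = []
    tracemalloc.start()
    try:
        for index, item in enumerate(items):
            # Reset the recorded peak so each item is measured on its own.
            tracemalloc.reset_peak()
            before, _ = tracemalloc.get_traced_memory()
            work(item)
            _, peak = tracemalloc.get_traced_memory()
            results.append((index, peak - before))
    finally:
        tracemalloc.stop()
    return results
```

With pypdf, this could be called as `peak_memory_per_item(reader.pages, lambda page: page.extract_text())`. Indexes are zero-based, so page 10 of the PDF would show up at index 9.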

Labels

needs-discussion: The PR/issue needs more discussion before we can continue
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
