Closed
Labels
needs-discussion: The PR/issue needs more discussion before we can continue
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
Description
I read a lot of PDFs, but this morning one of our nodes was OOMKilled in a loop, and the cause was PDF reading.
The PDF causing the trouble has a large diagram on page 10, and that is where memory skyrockets.
Environment
Which environment were you using when you encountered the problem?
Python version : 3.12
PyPDF version : 4.1.0
OS : Ubuntu
Code + PDF
This is a minimal, complete example that shows the issue:
    from typing import Dict, List, NamedTuple, Tuple

    from pypdf import PdfReader


    # Assumed stand-in: the report does not show the BodyInfo definition.
    # Field names are inferred from the constructor call at the end.
    class BodyInfo(NamedTuple):
        heights: Dict[float, int]
        x_pos: Dict[float, int]
        height_y_min_max_per_page: List[Dict[float, Tuple[float, float]]]


    def get_body_info(reader: PdfReader) -> BodyInfo:
        # Count how often each font height and each x position occurs.
        heights: Dict[float, int] = {}
        x_pos: Dict[float, int] = {}
        # Per page: font height -> (min y, max y) at which it was seen.
        height_y_min_max_per_page: List[Dict[float, Tuple[float, float]]] = []
        number_of_pages = len(reader.pages)
        for i in range(number_of_pages):
            page_y_min_max_heights: Dict[float, Tuple[float, float]] = {}
            height_y_min_max_per_page.append({})

            # Visitor called by extract_text for each text fragment on the page.
            def get_characters_heights(text, cm, tm, font_dict, font_size):
                height = round(font_size, 0)
                x = round(max([cm[4], tm[4]]), 0)
                y = round(max([cm[5], tm[5]]), 0)
                if text not in ["", " ", "\n"]:
                    if height not in heights:
                        heights[height] = 0
                    if x not in x_pos:
                        x_pos[x] = 0
                    if height not in page_y_min_max_heights:
                        page_y_min_max_heights[height] = (y, y)
                    heights[height] += 1
                    x_pos[x] += 1
                    if y < page_y_min_max_heights[height][0]:
                        page_y_min_max_heights[height] = (
                            y,
                            page_y_min_max_heights[height][1],
                        )
                    if y > page_y_min_max_heights[height][1]:
                        page_y_min_max_heights[height] = (
                            page_y_min_max_heights[height][0],
                            y,
                        )
                    height_y_min_max_per_page[i].update(page_y_min_max_heights)

            page = reader.pages[i]
            page.extract_text(
                orientations=0,
                visitor_text=get_characters_heights,
            )
        ret = BodyInfo(heights, x_pos, height_y_min_max_per_page)
        return ret

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!
https://www.nature.com/articles/s41467-024-46625-w.pdf
Page 10
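To help confirm that page 10 is where memory spikes, a minimal sketch along the following lines could record the peak Python allocation per page. It uses only the standard-library `tracemalloc` module (`reset_peak` needs Python 3.9+). The helper name `peak_memory_per_item` is hypothetical and not part of pypdf, and `tracemalloc` only sees allocations made through Python's allocator, so the numbers are an indication rather than the process's true RSS.

```python
import tracemalloc
from typing import Any, Callable, Iterable, List, Tuple


def peak_memory_per_item(
    items: Iterable[Any], work: Callable[[Any], Any]
) -> List[Tuple[int, int]]:
    """Return (index, peak_bytes) for each item processed by `work`.

    Hypothetical helper: `peak_bytes` is the highest traced allocation
    observed while `work(item)` ran, including memory still held from
    earlier items.
    """
    results: List[Tuple[int, int]] = []
    tracemalloc.start()
    try:
        for index, item in enumerate(items):
            # Reset the peak so each item gets its own measurement.
            tracemalloc.reset_peak()
            work(item)
            _, peak = tracemalloc.get_traced_memory()
            results.append((index, peak))
    finally:
        tracemalloc.stop()
    return results


# Possible usage with pypdf (assumes the PDF was saved locally as "paper.pdf"):
#
#     from pypdf import PdfReader
#     reader = PdfReader("paper.pdf")
#     for index, peak in peak_memory_per_item(
#         reader.pages, lambda page: page.extract_text(orientations=0)
#     ):
#         print(f"page {index + 1}: peak {peak / 1e6:.1f} MB")
```

A page whose peak is far above its neighbours would be the one to attach or crop for a smaller reproduction.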