Memory leak in image_backend (forgotten Image.open()) #3133

@asfoorial

Bug

In image_backend.py, line 150 opens a PIL image (img = Image.open(self.path_or_stream)) but never closes it. This leaks memory when processing a PNG file: memory usage grows by more than 50 MB on every converter.convert() call.

Steps to reproduce

...

Docling version

2.80.0

Python version

3.12.12

I hope someone will add the fix below to the docling source code.

The fix modifies ImageDocumentBackend.__init__ to call img.close() followed by "del img" once the frames have been copied. In addition, I unload the backend and call gc.collect() after each converter.convert() call:

import gc
for i in range(100):
    res = converter.convert('/data/sample.png')
    res.input._backend.unload()
    del res
    gc.collect()
    get_current_process_memory_usage()
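
The loop calls get_current_process_memory_usage(), which is not defined anywhere in this issue. The following is one possible stdlib-only sketch of such a helper, not the author's version. It assumes a Unix-like system: on Linux it reads the current resident set size from /proc/self/statm, and elsewhere it falls back to the peak (not current) RSS reported by resource.getrusage.

```python
import os
import resource  # Unix-only module; this sketch does not cover Windows
import sys


def get_current_process_memory_usage() -> float:
    """Hypothetical helper: resident memory of this process in MiB."""
    try:
        # Linux: the second field of /proc/self/statm is the resident page count.
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])
        return resident_pages * os.sysconf("SC_PAGE_SIZE") / 2**20
    except (OSError, ValueError, IndexError):
        # Fallback: peak RSS, reported in kilobytes on Linux and bytes on macOS.
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        divisor = 2**20 if sys.platform == "darwin" else 2**10
        return peak / divisor
```

Printing the returned value on each iteration makes growth across converter.convert() calls visible. With the fallback path the number can only go up, so a leak shows as steady growth rather than a plateau.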

Below is the full __init__ after modification.

def __init__(
    self,
    in_doc: InputDocument,
    path_or_stream: Union[BytesIO, Path],
    options: PdfBackendOptions = PdfBackendOptions(),
):
    # Bypass PdfDocumentBackend.__init__ to avoid image→PDF conversion
    AbstractDocumentBackend.__init__(self, in_doc, path_or_stream, options)
    self.options: PdfBackendOptions = options

    if self.input_format not in {InputFormat.IMAGE}:
        raise RuntimeError(
            f"Incompatible file format {self.input_format} was passed to ImageDocumentBackend."
        )

    # Load frames eagerly for thread-safety across pages
    self._frames: List[Image.Image] = []
    try:
        img = Image.open(self.path_or_stream)  # type: ignore[arg-type]

        # Handle multi-frame and single-frame images
        # - multiframe formats: TIFF, GIF, ICO
        # - singleframe formats: JPEG (.jpg, .jpeg), PNG (.png), BMP, WEBP (unless animated), HEIC
        frame_count = getattr(img, "n_frames", 1)

        if frame_count > 1:
            for i in range(frame_count):
                img.seek(i)
                self._frames.append(img.copy().convert("RGB"))
        else:
            self._frames.append(img.convert("RGB"))
        img.close()
        del img
    except Exception as e:
        raise RuntimeError(f"Could not load image for document {self.file}") from e
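
One caveat with the version above: if seek() or convert() raises, img.close() is never reached. A possible variant, sketched here as a standalone function rather than as docling code, uses the image as a context manager so it is released on every exit path. The name load_frames is hypothetical and the function assumes only that Pillow is installed.

```python
from io import BytesIO
from pathlib import Path
from typing import List, Union

from PIL import Image


def load_frames(path_or_stream: Union[BytesIO, Path]) -> List[Image.Image]:
    """Hypothetical helper: eagerly load every frame as an independent RGB image."""
    frames: List[Image.Image] = []
    # The with-block releases the opened image even if seek() or convert() raises.
    with Image.open(path_or_stream) as img:
        frame_count = getattr(img, "n_frames", 1)
        if frame_count > 1:
            for i in range(frame_count):
                img.seek(i)
                # convert() returns a new image, so the frame outlives img.
                frames.append(img.convert("RGB"))
        else:
            frames.append(img.convert("RGB"))
    return frames
```

Inside __init__ this would amount to wrapping the existing frame loop in "with Image.open(self.path_or_stream) as img:" and dropping the explicit img.close() and del img.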
