Bug
In image_backend.py, line 150 loads a PIL image (img = Image.open(self.path_or_stream)) but never releases it, which causes a memory leak when processing a PNG file (memory increases by more than 50 MB on every converter.convert() call).
Steps to reproduce
...
Docling version
2.80.0
Python version
3.12.12
I hope someone will add the below fix to the docling source code.
The fix modifies the ImageDocumentBackend.__init__ function by adding a "del img". In addition, I unload the backend and call gc.collect() after each converter.convert() call:
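import gc

# converter is a previously created docling converter (setup not shown)
for i in range(100):
    res = converter.convert('/data/sample.png')
    res.input._backend.unload()
    del res
    gc.collect()
    get_current_process_memory_usage()  # my own helper, not shown here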
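The helper get_current_process_memory_usage() is not shown above. For anyone trying to reproduce the numbers, here is a minimal sketch of one way to track resident memory across iterations using only the standard library. The names get_rss_mb and measure_growth are hypothetical, and the code assumes a Unix-like system (it reads /proc/self/statm on Linux and falls back to peak RSS elsewhere).

```python
import gc
import os
import resource
import sys
from typing import Callable, List


def get_rss_mb() -> float:
    """Return the resident set size of this process in MiB (hypothetical helper)."""
    try:
        # Linux: the second field of /proc/self/statm is resident pages.
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])
        return resident_pages * os.sysconf("SC_PAGE_SIZE") / (1024 * 1024)
    except (OSError, ValueError, IndexError):
        # Fallback: peak RSS, reported in bytes on macOS and kilobytes on Linux.
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
        return peak / divisor


def measure_growth(fn: Callable[[], None], iterations: int = 5) -> List[float]:
    """Call fn repeatedly and record RSS growth (MiB) relative to the start."""
    gc.collect()
    baseline = get_rss_mb()
    readings = []
    for _ in range(iterations):
        fn()
        gc.collect()
        readings.append(get_rss_mb() - baseline)
    return readings
```

Passing something like lambda: converter.convert('/data/sample.png') as fn should show whether the readings keep climbing from one call to the next.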
Below is the full __init__ after modification.
def __init__(
    self,
    in_doc: InputDocument,
    path_or_stream: Union[BytesIO, Path],
    options: PdfBackendOptions = PdfBackendOptions(),
):
    # Bypass PdfDocumentBackend.__init__ to avoid image→PDF conversion
    AbstractDocumentBackend.__init__(self, in_doc, path_or_stream, options)
    self.options: PdfBackendOptions = options

    if self.input_format not in {InputFormat.IMAGE}:
        raise RuntimeError(
            f"Incompatible file format {self.input_format} was passed to ImageDocumentBackend."
        )

    # Load frames eagerly for thread-safety across pages
    self._frames: List[Image.Image] = []
    try:
        img = Image.open(self.path_or_stream)  # type: ignore[arg-type]
        # Handle multi-frame and single-frame images
        # - multiframe formats: TIFF, GIF, ICO
        # - singleframe formats: JPEG (.jpg, .jpeg), PNG (.png), BMP, WEBP (unless animated), HEIC
        frame_count = getattr(img, "n_frames", 1)
        if frame_count > 1:
            for i in range(frame_count):
                img.seek(i)
                self._frames.append(img.copy().convert("RGB"))
        else:
            self._frames.append(img.convert("RGB"))
        img.close()
        del img
    except Exception as e:
        raise RuntimeError(f"Could not load image for document {self.file}") from e
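As a possible variation on the same idea, the frame loading could use Pillow's context manager so that the source image is closed even when an exception is raised partway through. This is only a sketch: load_frames is a hypothetical standalone function, not part of docling, and I have not verified that this alone resolves the memory growth described above.

```python
from io import BytesIO
from pathlib import Path
from typing import List, Union

from PIL import Image


def load_frames(path_or_stream: Union[BytesIO, Path]) -> List[Image.Image]:
    """Eagerly load every frame as an independent RGB image (hypothetical helper)."""
    frames: List[Image.Image] = []
    # The with-block closes the source image on exit, including on errors.
    with Image.open(path_or_stream) as img:
        frame_count = getattr(img, "n_frames", 1)
        if frame_count > 1:
            for i in range(frame_count):
                img.seek(i)
                # copy() detaches the frame from the source before converting
                frames.append(img.copy().convert("RGB"))
        else:
            # convert() returns a new image, so it stays valid after the close
            frames.append(img.convert("RGB"))
    return frames
```

The returned frames do not depend on the source image, so they remain usable after the with-block exits.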