Description
The documentation contains two examples regarding image extraction:
This works as expected for pages containing embedded raster graphics. For example:
from pypdf import PdfReader, PdfWriter

# PDF: https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf
reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
page = reader.pages[0]

# Save every image on the page, prefixing its name with a running index.
count = 0
for image_file_object in page.images:
    with open(str(count) + image_file_object.name, "wb") as fp:
        fp.write(image_file_object.data)
    count += 1

# Write a copy of the page with its images removed.
writer = PdfWriter()
writer.add_page(page)
writer.remove_images()
with open("out.pdf", "wb") as f:
    writer.write(f)

As a result, we get two image files and a new PDF document without these images. When we change the page from 0 to the last one:
page = reader.pages[18]

We get only the reduced PDF document. Some background images have disappeared (they look like vector graphics), and they are not stored in separate files.
This might look like a bug, but I suppose you are already aware of it (which is why I am requesting a new feature rather than reporting a bug). Nevertheless, it would be great to have a feature that extracts only raster graphics to separate files and leaves vector graphics untouched (or extracts them as well, together with their text, which I guess might be very hard).
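
To make the request more concrete, here is a rough sketch of the "extract only" half. The helper name `save_images` and the `out_dir` parameter are hypothetical and not part of pypdf. It only assumes that each item has a `name` and a `data` attribute, like the objects yielded by `page.images` in the example above. Whether `page.images` covers every raster graphic on a page is itself an assumption.

```python
from pathlib import Path
from typing import Iterable, List


def save_images(images: Iterable, out_dir: str = ".") -> List[Path]:
    """Write each image object to out_dir and return the paths written.

    Hypothetical helper: assumes every item exposes `name` (str) and
    `data` (bytes), as the items of `page.images` do in the example above.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for index, image in enumerate(images):
        # Prefix with the index so repeated names do not overwrite each other.
        path = target / f"{index}{image.name}"
        path.write_bytes(image.data)
        written.append(path)
    return written
```

With the reader from the example, this could be called as `save_images(reader.pages[18].images, "extracted")`. Because `writer.remove_images()` is never called, the page content, including anything drawn as vector graphics, is left as it is.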
