ENH: Remove only raster graphics #2208

@pbrus

The documentation contains two examples regarding image extraction:

This works as expected for pages containing embedded raster graphics. For example:

from pypdf import PdfReader, PdfWriter


# PDF: https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf
reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")

page = reader.pages[0]
# Save every embedded image found on the page to its own file.
for count, image_file_object in enumerate(page.images):
	with open(f"{count}{image_file_object.name}", "wb") as fp:
		fp.write(image_file_object.data)

writer = PdfWriter()
writer.add_page(page)
writer.remove_images()

with open("out.pdf", "wb") as f:
	writer.write(f)

As a result we get two images and a new PDF document without these images. When we change the page from 0 to the last one:

page = reader.pages[18]

We get only a reduced PDF document, and no image files are written.

Some background images have disappeared (they look like vector graphics), and they are not stored in separate files.

This might look like a bug, but I suppose you are already aware of it, which is why I am requesting a new feature rather than reporting a bug. Nevertheless, it would be great to have a feature that extracts only raster graphics to separate files and leaves vector graphics untouched (or extracts them as well, but together with their text, which I guess might be very hard).
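To make the raster/vector distinction concrete, here is a minimal sketch that counts the relevant operators in an already decoded page content stream. It is not how pypdf works internally. The helper name `classify_operators` and the whitespace tokenizer are assumptions made for illustration only. The operator names themselves (`Do`, `BI`, and the path construction and painting operators) come from the PDF content stream syntax.

```python
# Path construction and painting operators from the PDF content stream syntax.
PATH_OPS = {"m", "l", "c", "v", "y", "h", "re"}
PAINT_OPS = {"S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n"}


def classify_operators(content: str) -> dict:
    """Count raster-related and vector-related operators.

    Hypothetical helper: it splits on whitespace, so it would misread
    string literals and the binary payload of inline images.
    """
    counts = {"xobject_do": 0, "inline_image": 0, "vector": 0}
    for token in content.split():
        if token == "Do":
            # Paints an XObject; this may be an image or a form XObject.
            counts["xobject_do"] += 1
        elif token == "BI":
            # Start of an inline image.
            counts["inline_image"] += 1
        elif token in PATH_OPS or token in PAINT_OPS:
            # Vector drawing: path construction or painting.
            counts["vector"] += 1
    return counts


sample = "q 100 0 0 50 10 10 cm /Im0 Do Q 0 0 m 10 10 l S 5 5 20 20 re f"
print(classify_operators(sample))
```

A "remove only raster graphics" option would act on the `Do` image invocations and the `BI` inline images, and would leave the path operators alone.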

Labels

Has MCVE (a minimal, complete and verifiable example helps a lot to debug / understand feature requests), is-feature (a feature request)
