ENH: Remove only raster graphics

The documentation contains two examples that deal with images on a page:

- [image extraction](https://pypdf.readthedocs.io/en/stable/user/extract-images.html)
- [removing images](https://pypdf.readthedocs.io/en/stable/user/file-size.html?#removing-images)

Both work as expected for pages containing embedded raster graphics. For example:

```python
from pypdf import PdfReader, PdfWriter


# PDF: https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf
reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")

page = reader.pages[0]

# Save every image found on the page; the index prefix keeps file names unique.
for count, image_file_object in enumerate(page.images):
    with open(str(count) + image_file_object.name, "wb") as fp:
        fp.write(image_file_object.data)

# Copy the page into a new document and strip its images.
writer = PdfWriter()
writer.add_page(page)
writer.remove_images()

with open("out.pdf", "wb") as f:
    writer.write(f)
```
As a result we get two images and a new PDF document without these images. When we change the page from 0 to the last one:
```python
page = reader.pages[18]
```
we get only the reduced PDF document and no image files:

![image](https://github.com/py-pdf/pypdf/assets/19304263/e3fd66fe-b790-4939-9c3e-3867f2e6cc8f)

Some background images have disappeared (they look like vector graphics), and they were not stored in separate files.

This might look like a bug, but I suppose you are already aware of this behavior, which is why I am filing a feature request rather than a bug report. Nevertheless, it would be great to have a feature that extracts only raster graphics to separate files and leaves vector graphics untouched (or extracts the vector graphics too, together with their text, which I guess might be very hard).
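
To make the request more concrete, here is a minimal sketch of the filtering I have in mind. It is not pypdf API: `strip_raster_ops`, the `(operands, operator)` pair layout, and the `b"INLINE IMAGE"` pseudo-operator are assumptions for illustration only. The idea is that raster images are painted with the `Do` operator on an image XObject (or as inline images), while vector graphics are drawn with path operators such as `re` and `f`, so only the former would be dropped.

```python
# Hypothetical sketch: none of these names are part of pypdf.
def strip_raster_ops(operations, image_names):
    """Return the operations without the ones that paint raster images.

    operations:  list of (operands, operator) pairs, the layout assumed here
                 for a parsed page content stream.
    image_names: names of the XObjects whose /Subtype is /Image.
    """
    kept = []
    for operands, operator in operations:
        # "Do" paints an XObject; skip it only when it refers to an image.
        if operator == b"Do" and operands and operands[0] in image_names:
            continue
        # Assumed marker for inline images (BI ... ID ... EI in the raw stream).
        if operator == b"INLINE IMAGE":
            continue
        # Everything else (paths, text, form XObjects) is left untouched.
        kept.append((operands, operator))
    return kept


sample_ops = [
    ([], b"q"),
    ([10, 10, 100, 50], b"re"),  # vector: rectangle path
    ([], b"f"),                  # vector: fill the path
    (["/Im0"], b"Do"),           # raster: image XObject
    (["/Fm0"], b"Do"),           # form XObject, may contain vector drawing
    ([], b"Q"),
]
cleaned = strip_raster_ops(sample_ops, image_names={"/Im0"})
```

In this sample only the `/Im0` entry is dropped; the rectangle, the fill, and the form XObject stay in place. A real implementation would also have to look inside form XObjects, which this sketch does not attempt.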
