ENH: Accelerate image list keys generation#2014

Merged
MartinThoma merged 3 commits into py-pdf:main from pubpub-zz:iss1987
Jul 28, 2023

Conversation

@pubpub-zz
Collaborator

closes #1987

@pubpub-zz
Collaborator Author

@MartinThoma
I've got 2 mypy errors I do not understand. Can you have a look, please? 😘

@codecov

codecov bot commented Jul 26, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.02% ⚠️

Comparison is base (890c93a) 94.03% compared to head (88c8bb2) 94.01%.
Report is 6 commits behind head on main.

❗ Current head 88c8bb2 differs from pull request most recent head c756267. Consider uploading reports for the commit c756267 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2014      +/-   ##
==========================================
- Coverage   94.03%   94.01%   -0.02%     
==========================================
  Files          33       33              
  Lines        7076     7090      +14     
  Branches     1413     1418       +5     
==========================================
+ Hits         6654     6666      +12     
- Misses        263      264       +1     
- Partials      159      160       +1     

| Files Changed | Coverage Δ |
|---|---|
| pypdf/_page.py | 93.61% <100.00%> (-0.15%) ⬇️ |
| pypdf/_utils.py | 99.17% <100.00%> (+<0.01%) ⬆️ |

... and 1 file with indirect coverage changes

@pubpub-zz
Collaborator Author

@MartinThoma
I've found a nice fix. Now it's all yours 😀

@MartinThoma
Member

I've tested it with https://github.com/py-pdf/pypdf/files/12160419/table_redacted.pdf and now I get:

Traceback (most recent call last):
  File "/home/moose/Github/py-pdf/pypdf/sample-files/foo.py", line 20, in <module>
    run("table_redacted.pdf")
  File "/home/moose/Github/py-pdf/pypdf/sample-files/foo.py", line 13, in run
    for image in page.images:
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 2633, in __iter__
    yield self[i]
          ~~~~^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 2629, in __getitem__
    return self.get_function(lst[index])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 532, in _get_image
    return self.inline_images[id]
           ~~~~~~~~~~~~~~~~~~^^^^
KeyError: '~0~'

@MartinThoma changed the title from "ENH : accelerate image list keys generation" to "ENH: Accelerate image list keys generation" on Jul 28, 2023

@stefan6419846
Collaborator

Which code did you use for testing? Did you remove page.inline_images = dict()?

@MartinThoma
Member

Thank you - I forgot that 🙈

@MartinThoma
Member

Before (current main):

  • 4.24s: 009-pdflatex-geotopo/GeoTopo.pdf
  • 2.88s: 009-pdflatex-geotopo/GeoTopo-komprimiert.pdf

With this PR:

  • 2.01s: 009-pdflatex-geotopo/GeoTopo.pdf
  • 0.44s: 009-pdflatex-geotopo/GeoTopo-komprimiert.pdf

Good work 🎉
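
For readers who want to run a similar comparison, here is a minimal timing sketch. The script that produced the numbers above is not shown in this thread, so this assumes the measurement is simply the time needed to iterate over `page.images` on every page. The helper names `time_call`, `list_images`, and `speedup` are hypothetical; only `PdfReader`, `reader.pages`, and `page.images` come from pypdf.

```python
import time
from typing import Callable


def time_call(fn: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by a single call to fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def list_images(path: str) -> int:
    """Iterate over every image on every page and return the image count."""
    # Imported lazily so the timing helpers work without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    count = 0
    for page in reader.pages:
        for _image in page.images:
            count += 1
    return count


def speedup(before: float, after: float) -> float:
    """Ratio of the old runtime to the new runtime."""
    return before / after


# Timings quoted in the comment above, in seconds.
print(f"GeoTopo.pdf: {speedup(4.24, 2.01):.1f}x")
print(f"GeoTopo-komprimiert.pdf: {speedup(2.88, 0.44):.1f}x")
```

With a local copy of the file, `time_call(lambda: list_images("GeoTopo.pdf"))` would give a single measurement. On the quoted timings, the ratios work out to roughly 2.1x and 6.5x.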

@pubpub-zz
Collaborator Author

Can you clarify the code you are using? page.inline_images = dict() is normally not required.

@MartinThoma
Member

Yes, I was adding page.inline_images = dict(). That was leading to the error. However, I would not consider this a blocker as this is modifying pypdf behavior in an unexpected way.
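
One plausible reading of the traceback, sketched as a toy model: the list of image keys is derived from the page content, while the `inline_images` mapping is filled lazily and only when it has not been set yet. Assigning an empty dict beforehand would then skip the lazy fill, so the key `~0~` is listed but cannot be looked up. `ToyPage` and its methods are hypothetical and are not pypdf's actual implementation.

```python
from typing import Dict, List, Optional


class ToyPage:
    """Hypothetical stand-in for a page; not pypdf's real implementation."""

    def __init__(self, inline_count: int) -> None:
        self._inline_count = inline_count
        # None means "not parsed yet"; the cache is filled on first access.
        self.inline_images: Optional[Dict[str, bytes]] = None

    def image_keys(self) -> List[str]:
        # Keys are derived from the page content, independent of the cache.
        return [f"~{i}~" for i in range(self._inline_count)]

    def get_image(self, key: str) -> bytes:
        if self.inline_images is None:
            # Lazy population: only runs while the cache is still None.
            self.inline_images = {k: b"data" for k in self.image_keys()}
        return self.inline_images[key]


page = ToyPage(inline_count=1)
page.inline_images = {}  # the override from the test script
try:
    page.get_image("~0~")
except KeyError as exc:
    print("KeyError:", exc)
```

Without the override, `ToyPage(inline_count=1).get_image("~0~")` returns the placeholder bytes, which mirrors the observation that the error disappears once the extra assignment is removed.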

I want to have a final look after work, but so far it seems like a great improvement. I'll likely merge it as it is :-)

@MartinThoma merged commit 94f23f9 into py-pdf:main on Jul 28, 2023
@MartinThoma added a commit that referenced this pull request on Jul 29, 2023
## What's new

### New Features (ENH)
-  Accelerate image list keys generation (#2014)
-  Use `cryptography` for encryption/decryption as a fallback for PyCryptodome (#2000)
-  Extract LaTeX characters (#2016)
-  ASCIIHexDecode.decode now returns bytes instead of str (#1994)

### Bug Fixes (BUG)
-  Add RunLengthDecode filter (#2012)
-  Process /Separation ColorSpace (#2007)
-  Handle single element ColorSpace list (#2026)
-  Process lookup decoded as TextStringObjects (#2008)

### Robustness (ROB)
-  Cope with garbage collector during cloning (#1841)

### Maintenance (MAINT)
-  Cleanup of annotations (#1745)

[Full Changelog](3.13.0...3.14.0)
@pubpub-zz deleted the iss1987 branch on September 2, 2023 09:45

Development

Successfully merging this pull request may close these issues.

Provide public interface for skipping inline page images

3 participants