Avoid calling "DecodeName" when parsing dictionaries. #776
If no name with a "#" has been added before, it's not necessary to go through the expensive "Insert" call that will call "DecodeName" for every previously added key.
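The idea can be illustrated with a minimal, self-contained sketch. This is not the pdfcpu code: `dict`, `slowInsert`, `buildDict`, and this simplified `decodeName` are hypothetical stand-ins, and the sketch assumes that the slow path has to decode every stored key in order to detect duplicates.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// decodeName is a simplified stand-in for a PDF name decoder:
// it replaces each "#xx" hex escape with the byte it encodes.
func decodeName(s string) string {
	if !strings.Contains(s, "#") {
		return s
	}
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		if s[i] == '#' && i+3 <= len(s) {
			if v, err := strconv.ParseUint(s[i+1:i+3], 16, 8); err == nil {
				b.WriteByte(byte(v))
				i += 2 // skip the two hex digits
				continue
			}
		}
		b.WriteByte(s[i])
	}
	return b.String()
}

// dict is a hypothetical map-based dictionary type.
type dict map[string]string

// slowInsert decodes every stored key to detect duplicates,
// so each call costs O(n) and building a dict costs O(n^2).
func (d dict) slowInsert(key, val string) bool {
	want := decodeName(key)
	for k := range d {
		if decodeName(k) == want {
			return false
		}
	}
	d[key] = val
	return true
}

// buildDict uses a plain map lookup until the first key containing '#'
// shows up, and only then switches to the slow path.
func buildDict(keys []string) dict {
	d := dict{}
	hasNames := false // true once any key with '#' has been seen
	for _, k := range keys {
		if !hasNames && !strings.Contains(k, "#") {
			if _, ok := d[k]; !ok {
				d[k] = "v"
			}
			continue
		}
		hasNames = true
		d.slowInsert(k, "v")
	}
	return d
}

func main() {
	// "A#42" decodes to "AB", so the later "AB" is rejected as a duplicate.
	d := buildDict([]string{"A", "B", "A#42", "AB"})
	fmt.Println(len(d)) // prints 3
}
```

In this sketch, a dictionary whose keys never contain "#" never touches the decoder, so inserts stay at the cost of a map lookup.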
Timings from my test program:

Thanks for putting in the work!
Well, the fix was pretty straightforward so I didn't expect much discussion. But I'm happy if you come up with a better solution.

If this one works, wow, that's a lot faster. What are the side effects on optimization if any?
8382e49 to 7ecda67
Just updated the test to better match the contents of my problematic file. The file contains two dictionaries with about 200,000 entries each. The test generates a dictionary with 50,000 entries and now takes (on my machine) ~29 seconds on master and ~0.05 seconds with this PR. While this PR fixes this particular problem, you could still construct a file that takes a very long time to parse.

Excellent patch!

Thanks for pdfcpu, glad to be able to give something back!

Correcting myself. If there are dict keys using hex codes, then we really only need to support locating them by the original key or a normalized version that contains 2 bytes for each # sequence. So as per your idea we should be fine! 👍🏻 Also, I checked my local test corpus and there is only a very small number of files that actually use # within names. Still, I refactored the code around that location, and I fixed the case so that DecodeName also gets called on the very first occurrence of # in a dict key. Hopefully all in your spirit. So thanks again 🙏🏻 Please get the latest commit.

If no name with a "#" has been added before, it's not necessary to go through the expensive "Insert" call that will call "DecodeName" for every previously added key.
Fixes #775
Even better would be to keep the "hasNames" flag in the "Dict" object so other calls to "Find" could also benefit, but this is not possible with the aliased type.
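
For illustration only, here is a rough sketch of what carrying the flag on the dictionary itself could look like if the type were a struct instead of a map-based type, which in Go has no fields to hold such a flag. `flaggedDict`, `insert`, `find`, and this simplified `decodeName` are hypothetical names and are not part of pdfcpu.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// decodeName is a simplified stand-in that resolves "#xx" hex escapes.
func decodeName(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		if s[i] == '#' && i+3 <= len(s) {
			if v, err := strconv.ParseUint(s[i+1:i+3], 16, 8); err == nil {
				b.WriteByte(byte(v))
				i += 2 // skip the two hex digits
				continue
			}
		}
		b.WriteByte(s[i])
	}
	return b.String()
}

// flaggedDict is a hypothetical struct-based dictionary that
// remembers whether any stored key contains '#'.
type flaggedDict struct {
	entries  map[string]string
	hasNames bool
}

func newFlaggedDict() *flaggedDict {
	return &flaggedDict{entries: map[string]string{}}
}

// insert stores the entry and records whether escaped keys now exist.
func (d *flaggedDict) insert(key, val string) {
	d.entries[key] = val
	if strings.Contains(key, "#") {
		d.hasNames = true
	}
}

// find tries a direct lookup first and falls back to decoding every
// stored key only when an escape is involved on either side.
func (d *flaggedDict) find(key string) (string, bool) {
	if v, ok := d.entries[key]; ok {
		return v, true
	}
	if !d.hasNames && !strings.Contains(key, "#") {
		return "", false // no escapes anywhere: the miss is final
	}
	want := decodeName(key)
	for k, v := range d.entries {
		if decodeName(k) == want {
			return v, true
		}
	}
	return "", false
}

func main() {
	d := newFlaggedDict()
	d.insert("A#42", "x")
	v, ok := d.find("AB")
	fmt.Println(v, ok) // prints: x true
}
```

With a layout like this, every lookup on a dictionary without escaped keys would stay a single map access, at the cost of changing the type that callers currently treat as a map.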