Skip to content

Conversation

@fancycode
Copy link
Contributor

If no name with a "#" has been added before, it's not necessary to go through the expensive "Insert" call that will call "DecodeName" for every previously added key.

Fixes #775

Even better would be to keep the hasNames flag in the Dict object so other calls to Find could also benefit, but this is not possible with the aliased type.

If no name with a "#" has been added before, it's not necessary to go
through the expensive "Insert" call that will call "DecodeName" for
every previously added key.
@fancycode
Copy link
Contributor Author

Timings from my test program:

$ go run test.go 
2024/01/15 10:34:07.566998 Parsing ...
2024/01/15 10:34:10.899487 Done
2024/01/15 10:34:10.899501 Parsed 9 pages

@CLAassistant
Copy link

CLAassistant commented Jan 15, 2024

CLA assistant check
All committers have signed the CLA.

@hhrutter
Copy link
Collaborator

hhrutter commented Jan 17, 2024

Thanks for putting in the work!
I am evaluating this.

@fancycode
Copy link
Contributor Author

Thanks for putting in the work, although I'd rather discuss issue resolutions before pull requests get filed.

Well, the fix was pretty straightforward so I didn't expect much discussion. But I'm happy if you come up with a better solution.

@joel-rieke
Copy link

If this one works, wow, that's a lot faster. What are the side effects on optimization if any?

@fancycode
Copy link
Contributor Author

Just updated the test to better match the contents of my problematic file. In the file there are two dictionaries with about 200.000 entries each. The test generates a dictionary with 50.000 entries and now takes (on my machine) ~29 seconds on master and ~0.05 seconds with this PR.

While this PR fixes this particular problem, you could still construct a file that takes very long to parse.

@hhrutter hhrutter merged commit 04634d3 into pdfcpu:master Jan 25, 2024
@hhrutter
Copy link
Collaborator

Excellent patch!
Thanks for your contribution 💚

@fancycode fancycode deleted the speedup-parse-dict branch January 25, 2024 20:50
@fancycode
Copy link
Contributor Author

Thanks for pdfcpu, glad to be able to give something back!

@hhrutter
Copy link
Collaborator

I believe we need to walk back some of these changes:

parse

First of all in line 594 we need to check the return code.

This is all about recognizing existing dict keys and also processing any embedded 2 digit hexcodes in Name objects like in:
/A#42 which (bytewise) is the same as /AB.

Parsing both << /A#42 (A) /AB (B) >> and << /AB (B) /A#42 (A) />> should produce a Duplicate Key error which is not the case.

There is also a bug in the existing dict.Find(key) method where the key should also be decoded before comparing
since it's a name. That's on me :)

I have to take a step back and think about how to best fit in these pieces - processDictKeys is already way too complex.
I need to think more about this but this defnitely can't stay like this.
So consider this a headsup.

PS: I still think we can make some shortcuts here like per your proposal but it's gotta be different.

@hhrutter
Copy link
Collaborator

Correcting myself.

If there are dict keys using hex codes, then we really only need to support locating them by the original key or a normalized version that contains 2 bytes for each # sequence.

So as per your idea we should be fine! 👍🏻

Also, I checked my local test corpus and there is only a very small number of files that actually using # within names.
So the excellent news is, parsing gets a major performance boost. 🚀

Still, I refactored the code around that location plus I fixed the case so that DecodeName gets also called on the very first occurence of # in a dict key.

Hopefully all in your spirit.

So thanks again 🙏🏻

Please get the latest commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parsing file with lots of dictionaries is extremely slow

4 participants