Avoid calling "DecodeName" when parsing dictionaries. #776

fancycode · 2024-01-15T09:38:03Z

If no name with a "#" has been added before, it's not necessary to go through the expensive "Insert" call that will call "DecodeName" for every previously added key.

Fixes #775

Even better would be to keep the hasNames flag in the Dict object so other calls to Find could also benefit, but this is not possible with the aliased type.
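To illustrate the idea, here is a minimal, self-contained sketch of the fast path. It does not use pdfcpu's actual types: `dict`, `parser`, `decode`, `slowInsert` and `add` are hypothetical stand-ins, and only the `hasNames` flag name is taken from the description above. It assumes the slow path compares decoded names against every existing key, which is what makes it expensive.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// dict is a simplified stand-in for a parsed dictionary (not pdfcpu's type).
type dict map[string]string

var errDuplicateKey = errors.New("duplicate key")

// decodeCalls counts how often the expensive decoding step runs.
var decodeCalls int

// decode resolves "#xx" hex escapes; a hypothetical substitute for DecodeName.
func decode(name string) string {
	decodeCalls++
	var b strings.Builder
	for i := 0; i < len(name); i++ {
		if name[i] == '#' && i+2 < len(name) {
			if v, err := strconv.ParseUint(name[i+1:i+3], 16, 8); err == nil {
				b.WriteByte(byte(v))
				i += 2
				continue
			}
		}
		b.WriteByte(name[i])
	}
	return b.String()
}

// slowInsert decodes every existing key, so n inserts cost O(n^2) decodes.
func (d dict) slowInsert(key, val string) error {
	dk := decode(key)
	for k := range d {
		if decode(k) == dk {
			return errDuplicateKey
		}
	}
	d[key] = val
	return nil
}

type parser struct {
	d        dict
	hasNames bool // true once any key containing '#' has been seen
}

// add uses a plain map lookup until the first key containing '#' shows up.
func (p *parser) add(key, val string) error {
	if !p.hasNames && !strings.Contains(key, "#") {
		if _, ok := p.d[key]; ok {
			return errDuplicateKey
		}
		p.d[key] = val
		return nil
	}
	p.hasNames = true
	return p.d.slowInsert(key, val)
}

func main() {
	p := &parser{d: dict{}}
	for i := 0; i < 1000; i++ {
		_ = p.add(fmt.Sprintf("Key%d", i), "v")
	}
	fmt.Println(len(p.d), decodeCalls) // prints "1000 0"
}
```

With no `#` in any key, the decode step is never reached, so the cost per key is a single map lookup.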


fancycode · 2024-01-15T09:39:09Z

Timings from my test program:

$ go run test.go 
2024/01/15 10:34:07.566998 Parsing ...
2024/01/15 10:34:10.899487 Done
2024/01/15 10:34:10.899501 Parsed 9 pages

CLAassistant · 2024-01-15T12:11:56Z

All committers have signed the CLA.

hhrutter · 2024-01-17T08:45:31Z

Thanks for putting in the work!
I am evaluating this.

fancycode · 2024-01-17T08:54:32Z

> Thanks for putting in the work, although I'd rather discuss issue resolutions before pull requests get filed.

Well, the fix was pretty straightforward so I didn't expect much discussion. But I'm happy if you come up with a better solution.

joel-rieke · 2024-01-18T23:01:17Z

If this one works, wow, that's a lot faster. What are the side effects on optimization if any?

fancycode · 2024-01-23T07:46:19Z

Just updated the test to better match the contents of my problematic file. In the file there are two dictionaries with about 200,000 entries each. The test generates a dictionary with 50,000 entries and now takes (on my machine) ~29 seconds on master and ~0.05 seconds with this PR.

While this PR fixes this particular problem, you could still construct a file that takes very long to parse.

hhrutter · 2024-01-25T20:47:25Z

Excellent patch!
Thanks for your contribution 💚

fancycode · 2024-01-25T20:50:34Z

Thanks for pdfcpu, glad to be able to give something back!

hhrutter · 2024-01-26T00:20:09Z

I believe we need to walk back some of these changes:

First of all in line 594 we need to check the return code.

This is all about recognizing existing dict keys and also processing any embedded 2 digit hexcodes in Name objects like in:
/A#42 which (bytewise) is the same as /AB.

Parsing both << /A#42 (A) /AB (B) >> and << /AB (B) /A#42 (A) >> should produce a Duplicate Key error, which is currently not the case.
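A small sketch of the expected behavior, under the assumption that keys are compared by their decoded form. `decodeName` and `firstDuplicate` are hypothetical helpers written for this example, not pdfcpu's functions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// decodeName resolves "#xx" hex escapes in a name, e.g. "A#42" -> "AB".
// Hypothetical helper for illustration; not pdfcpu's DecodeName.
func decodeName(s string) (string, error) {
	if !strings.Contains(s, "#") {
		return s, nil
	}
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		if s[i] != '#' {
			b.WriteByte(s[i])
			continue
		}
		if i+2 >= len(s) {
			return "", fmt.Errorf("truncated escape in %q", s)
		}
		v, err := strconv.ParseUint(s[i+1:i+3], 16, 8)
		if err != nil {
			return "", fmt.Errorf("invalid escape in %q", s)
		}
		b.WriteByte(byte(v))
		i += 2
	}
	return b.String(), nil
}

// firstDuplicate returns the first key whose decoded form was already seen.
func firstDuplicate(keys []string) (string, bool) {
	seen := map[string]bool{}
	for _, k := range keys {
		dk, err := decodeName(k)
		if err != nil {
			continue // malformed names are skipped in this sketch
		}
		if seen[dk] {
			return k, true
		}
		seen[dk] = true
	}
	return "", false
}

func main() {
	fmt.Println(firstDuplicate([]string{"A#42", "AB"})) // prints "AB true"
	fmt.Println(firstDuplicate([]string{"AB", "A#42"})) // prints "A#42 true"
}
```

Both key orders are reported as duplicates because the comparison happens on the decoded names rather than on the raw key text.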

There is also a bug in the existing dict.Find(key) method where the key should also be decoded before comparing
since it's a name. That's on me :)

I have to take a step back and think about how to best fit in these pieces - processDictKeys is already way too complex.
I need to think more about this, but this definitely can't stay like this.
So consider this a heads-up.

PS: I still think we can make some shortcuts here like per your proposal but it's gotta be different.

hhrutter · 2024-01-27T18:29:45Z

Correcting myself.

If there are dict keys using hex codes, then we really only need to support locating them by the original key or a normalized version that contains 2 bytes for each # sequence.

So as per your idea we should be fine! 👍🏻

Also, I checked my local test corpus and there is only a very small number of files that actually use # within names.
So the excellent news is, parsing gets a major performance boost. 🚀

Still, I refactored the code around that location, and I fixed the case so that DecodeName also gets called on the very first occurrence of # in a dict key.

Hopefully all in your spirit.

So thanks again 🙏🏻

Please get the latest commit.

Avoid calling "DecodeName" when parsing dictionaries.

dd5b470


fancycode mentioned this pull request Jan 15, 2024

Parsing file with lots of dictionaries is extremely slow #775

Closed

Add testcase that parses a large dictionary.

7ecda67

fancycode force-pushed the speedup-parse-dict branch from 8382e49 to 7ecda67 Compare January 23, 2024 07:41

hhrutter merged commit 04634d3 into pdfcpu:master Jan 25, 2024

fancycode deleted the speedup-parse-dict branch January 25, 2024 20:50

fancycode mentioned this pull request Jan 29, 2024

Further improve parsing of dictionaries / names #795

Merged
