Improve performance, especially in data with many CR-LF#137
Merged
Kludex merged 4 commits intoKludex:masterfrom Sep 28, 2024
Merged
Improve performance, especially in data with many CR-LF#137Kludex merged 4 commits intoKludex:masterfrom
Kludex merged 4 commits intoKludex:masterfrom
Conversation
Drops the look-behind buffer since the content is always the boundary.
The Boyer-Moore-Horspool algorithm was removed and replaced with Python's built-in `find` method. This appears to be faster, sometimes by an order of magnitude.
Kludex
approved these changes
Sep 28, 2024
Owner
Kludex
left a comment
There was a problem hiding this comment.
I can make a pull request with just this change if you are interested.
All good.
Thanks for the great description, helped a lot on the review.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The code changes here started as an attempt to improve performance with data loaded with many CRLF pairs (#67). In the current code these cause performance issues because CRLF matches the start of a boundary pattern. This makes advancing through the data very slow due to the number of call-backs.
By far the greatest optmization was to remove the partial Boyer-Moore-Horspool (BMH) implementation and replace it with
bytes.findand a bit of logic to ensure partial matches at the end weren't missed.bytes.findappears to be significantly faster (around 3 times) on representative data. I can make a pull request with just this change if you are interested.In the update I also attempted to reduce the number of times the on_part_data callback was called to a minimum. Whereas the old code would call it every time a partial boundary match was found (i.e. CRLF), now it only calls it when necessary. The conditions for calling the
on_part_dataare now:The significant complication is what happens when a partial boundary match overlaps the end of the loaded data. This was addressed with a look-behind buffer before, but the buffer is mostly unnecessary: since we are always matching boundary bytes, the look-behind buffer is always just a copy of the boundary. Only the last few bytes may vary (CRLF vs -- depending on whether it is a part or end boundary). However hitting this condition should be very very rare, and is addressed in the code.