Speed improvement for gp_levenshtein().#1408
Merged
Merged
Conversation
…improvements On longer strings (~500char) splitting these into an array before processing it character-by-character results in a 2x speed increase. Takes the average run in my testing from an average of 705ms to 341ms on my real-world random data set.
…at's not required. Might be slightly faster.
… loop to a foreach, which again is slightly faster. 341ms => 310ms
Contributor
Author
|
Just noting that there's limited unit tests for this code branch currently: https://github.com/GlotPress/GlotPress/blob/develop/tests/phpunit/testcases/test_strings.php No logical changes were made here though, and before/after return values appear to be the same still, so I haven't dug into adding proper unit tests, mostly due to personal time constraints. |
…'s function declaration.
ocean90
reviewed
Apr 14, 2022
dd32
commented
Apr 19, 2022
Co-authored-by: Dion Hulse <dion@wordpress.org>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
gp_levenshtein():
mb_substr()for significant speed improvementsOn longer strings (~500char) splitting these into an array before processing it character-by-character results in a 2x speed increase.
On my random WordPress.org page dataset, this results in a speed improvement of ~700ms ~= 310ms per call of
gp_string_similarity(), which is basically justgp_levenshtein().Non-scientific measurement though, re-running a script with ~200 calls to gp_string_similarity() multiple times with changing the code back and forth.
Some of these changes are not strictly likely to be faster in all scenario's, but at least here seems that
foreachis more efficient thanforwith an additional assignment here.This was found by a
.poimport taking 100% CPU for several minutes, caused by a project having a significant number of obsoleted originals, most of which were long strings, so the C implementation wasn't being used.Note: I also went looking for more optimized variants of this algorithm, there's probably something more that can be done by removing
gp_string_similarity()'s percentage calculation and passing through the raw distance to future calls to abort early in the string if it's a long-way-away from the source string and a closer match has already been found.The algorithm in use here is this: https://en.wikipedia.org/wiki/Levenshtein_distance#Iterative_with_two_matrix_rows