add Unicode.isequal_normalized function #42493
Merged
Conversation
JeffBezanson
approved these changes
Oct 4, 2021
Member
|
It's not really clear to me what |
Member
Nitpick: indentation should be 4 spaces, I guess.
Member
Author
Indentation should be fixed now. (For some reason, vscode was detecting Unicode.jl as using 2-space indentation and adjusted my code accordingly.)
Member
Author
|
Maybe Alternatively, we could just call it |
Member
That sounds good to me. I don't want to hold this up any further though if others don't mind the name. |
Member
Author
|
renamed to |
LilithHafner
pushed a commit
to LilithHafner/julia
that referenced
this pull request
Feb 22, 2022
This adds a function `isequal_normalized` to the Unicode stdlib to check whether two strings are canonically equivalent (optionally casefolding and/or stripping combining marks). Previously, the only way to do this was to call `Unicode.normalize` on the two strings, to construct normalized versions, but this seemed a bit wasteful — the new `isequal_normalized` function calls lower-level functions in utf8proc to accomplish the same task while only allocating 4-codepoint (16-byte) temporary arrays. It seems to be about 2x faster than calling `normalize` in the expensive case where the strings are equivalent, and is potentially much faster for inequivalent strings for which the loop can break early. (If we could stack-allocate small arrays it might get faster.) (In the future, we might also want to add `Unicode.isless_normalized` and `Unicode.cmp_normalized` functions for comparing Unicode strings, but `isequal_normalized` seemed like a good start.)
LilithHafner
pushed a commit
to LilithHafner/julia
that referenced
this pull request
Mar 8, 2022
StefanKarpinski
pushed a commit
that referenced
this pull request
Dec 19, 2023
Fixes #52408. (Note that this function was added in Julia 1.8, in #42493.) In the future it would be good to further optimize this function by adding a fast path for the common case of strings that are mostly ASCII characters. Perhaps simply skip ahead to the first byte that doesn't match before we begin doing decomposition etcetera.
KristofferC
pushed a commit
that referenced
this pull request
Dec 23, 2023
Fixes #52408. (Note that this function was added in Julia 1.8, in #42493.) In the future it would be good to further optimize this function by adding a fast path for the common case of strings that are mostly ASCII characters. Perhaps simply skip ahead to the first byte that doesn't match before we begin doing decomposition etcetera. (cherry picked from commit 3b250c7)
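The ASCII fast path suggested in the commit message above could look roughly like the following. This is only a sketch, not what the stdlib does: `isequal_normalized_asciifast` is a hypothetical name, and it supports only the `casefold` and `stripmark` keywords. It relies on the assumption that ASCII characters are unaffected by canonical decomposition and reordering, so an identical ASCII prefix can be skipped before the full comparison starts.

```julia
using Unicode

# Hypothetical helper, not part of the Unicode stdlib: skip the common ASCII
# prefix of two strings, then hand the remainder to isequal_normalized.
function isequal_normalized_asciifast(s1::String, s2::String;
                                      casefold::Bool=false, stripmark::Bool=false)
    b1, b2 = codeunits(s1), codeunits(s2)
    n = min(length(b1), length(b2))
    i = 1
    # Advance while both strings hold the same ASCII byte at position i.
    while i <= n && b1[i] == b2[i] && b1[i] < 0x80
        i += 1
    end
    # Both strings were consumed entirely, so they are identical ASCII.
    i > length(b1) && i > length(b2) && return true
    # Conservatively back up one ASCII character so that a following combining
    # mark is still compared together with its base character.
    start = max(i - 1, 1)
    return isequal_normalized(SubString(s1, start), SubString(s2, start);
                              casefold, stripmark)
end

isequal_normalized_asciifast("hello w\u00F6rld", "hello wo\u0308rld")
```

Whether this is actually a win would need benchmarking; the byte loop only helps when the strings share a long ASCII prefix.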
This PR adds a function `isequal_normalized` (originally proposed as `isequivalent`) to the Unicode stdlib to check whether two strings are canonically equivalent (optionally casefolding and/or stripping combining marks).

Previously, the only way to do this was to call `Unicode.normalize` on the two strings to construct normalized versions, but this seemed a bit wasteful. The new `isequal_normalized` function calls lower-level functions in utf8proc to accomplish the same task while only allocating 4-codepoint (16-byte) temporary arrays. It seems to be about 2x faster than calling `normalize` in the expensive case where the strings are equivalent, and is potentially much faster for inequivalent strings for which the loop can break early. (If we could stack-allocate small arrays it might get faster.)

(In the future, we might also want to add `Unicode.isless_normalized` and `Unicode.cmp_normalized` functions for comparing Unicode strings, but `isequal_normalized` seemed like a good start.)
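For illustration, here is a short usage sketch comparing the old `normalize`-based approach with the new function. The variable names `composed` and `decomposed` are just for this example, and it assumes Julia 1.8 or later, where `isequal_normalized` is available.

```julia
using Unicode

composed   = "no\u00EBl"    # "ë" as a single precomposed codepoint (U+00EB)
decomposed = "noe\u0308l"   # "e" followed by a combining diaeresis (U+0308)

composed == decomposed      # false: the codepoint sequences differ

# Old approach: build two normalized strings, then compare them.
Unicode.normalize(composed, :NFC) == Unicode.normalize(decomposed, :NFC)  # true

# New approach: compare without constructing the normalized strings.
isequal_normalized(composed, decomposed)                   # true
isequal_normalized(composed, "noel")                       # false
isequal_normalized(composed, "noel"; stripmark=true)       # true
isequal_normalized(composed, "NO\u00CBL"; casefold=true)   # true
```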