Improve case insensitivity consistency#10884
Conversation
unicase is already a dependency via mime_guess in nu-command. This replaces as much ad-hoc lowercase/uppercase comparison as possible with case folding provided by unicase. In actuality a lot of code still does to_lowercase, because unicase only provides immediate comparison, and doesn't expose a to_folded_case yet. And since we do a lot of contains/starts_with/ends_with, it's not enough to just have eq_ignore_case. But if we get access, this makes us ready to use it with a change in one place. Plus, it's clearer what the purpose is at the call site to call to_folded_case instead of to_lowercase if it's just for the purpose of case insensitive comparison, even if it just does to_lowercase still.
| let val = o.parse::<f64>(); | ||
| match val { | ||
| Ok(f) => Ok(f.abs() >= f64::EPSILON), | ||
| Ok(f) => Ok(f != 0.0), |
There was a problem hiding this comment.
Comparing to f64::EPSILON is almost never what you actually want to do. Yes, when comparing floats, you want to do so with an epsilon because of floating point imprecision, but the machine epsilon is an astoundingly bad choice for a comparison epsilon. There is no "perfect" comparison epsilon, and the comparison epsilon really wants to be relative to the scale of the compared values, due to the floating nature of floating point arithmetic.
A comparison to machine epsilon is in most cases no different to a direct comparison with 0.0, except for a false appearance of satisfying a rule of "no float ==" just for the purpose of satisfying the rule. Equality with 0 is both the value it makes most sense to directly compare to ("did they write a literal zero") and more honest than comparing with machine epsilon.
Yes, the clippy lint suggested comparing with f64::EPSILON for a long time. This was unfortunately bad advice that took an unfortunately long period to replace, due to there not being any straightforward advice to replace it with. You essentially just can't check floats for "effective equality" without domain specific knowledge.
There was a problem hiding this comment.
Some recent rustlang discussion: rust-lang/rust#116916
There was a problem hiding this comment.
And on clippy: rust-lang/rust-clippy#6816, rust-lang/rust-clippy#7725
There was a problem hiding this comment.
Would you mind moving the float comparison into a separate PR? It is an issue on its own, it would be better addressed separately, IMO.
Accidentally used an exclusive range instead of inclusive, whoops
|
Some of the feedback I'm getting from the nushell core-team is that lowercasing only ascii characters may be a mistake. Thoughts? |
|
It seems strange to me that some |
|
For comparisons against known ASCII literals like in the config that is fine and gives a known capacity and is cheaper. |
|
That's accurate. Every case I switched to It's useful to use
I can add comments to each instance of Side note: unicase merged the PR for exposing its Side note: I'm considering the value of a follow-up PR using icu_collator for proper collation (sort) order (e.g. Footnotes
|
|
Thanks for the explanation @CAD97. We appreciate your help and support.
Thanks for offering. I think this would be a good idea and helpful because I, for one, wouldn't have understood what you explained above. |
|
@CAD97 Thanks for the explanation, it makes sense to use ASCII lowercasing in cases where one side is known ASCII. Adding comments before each The Unicode string comparisons are tricky, so I think moving it into one place (the trait you introduced) is a good move IMO. Remaining questions, like “maß” vs. “mass”, would also be easier addressed within the trait. |
sholderbach
left a comment
There was a problem hiding this comment.
Gave it a top down reading, the places where you narrowed it to ascii casing look good to me.
I tried to note what breaking changes to the behavior may result, so far I can only note sort and sort-by that may sort differently when operating in case-insensitive mode.
| fn eq_ignore_case(&self, other: &str) -> bool { | ||
| UniCase::new(self) == UniCase::new(other) | ||
| } |
There was a problem hiding this comment.
Would it make sense to have a newtype for which we implement IgnoreCaseExt so we can convert/fold our needles once before going through the haystacks? Else we can always go through the to_folded_case and the standard comparisons but that allocates for both the needle and each straw in the haystack.
There was a problem hiding this comment.
The main reason I didn't do so is that most of those cases are potentially using contains or starts_with or otherwise using operations which expect to have &str access (such as for natural sort). Plus, as discussed on the trait docs, for repeated comparisons it can often be preferable to fold once up front, since the folding process isn't exactly free either. ICU4X doesn't even offer case-folded equality comparison. (It does offer one-off collation, but that might still allocate internally to create sort keys.)
In short, the extra ceremony required to thread through a newtype isn't really worth it. A Cow<str> version which doesn't allocate if the string is already case folded might be beneficial, but no case folding provider exposes that possibility currently.
There was a problem hiding this comment.
OK makes sense, thanks for the detailed explanation.
Let's go with the definite correctness improvements this all already provides. (If profiling tells us anything egregious we can still ponder particular implementations in the future anyways)
| /// Returns a [case folded] equivalent of this string, as a new String. | ||
| /// | ||
| /// Case folding is primarily based on lowercase mapping, but includes | ||
| /// additional changes to the source text to help make case folding | ||
| /// language-invariant and consistent. Case folded text should be used | ||
| /// solely for processing and generally should not be stored or displayed. | ||
| /// | ||
| /// Note: this method might only do [`str::to_lowercase`] instead of a | ||
| /// full case fold, depending on how Nu is compiled. You should still | ||
| /// prefer using this method for generating case-insensitive strings, | ||
| /// though, as it expresses intent much better than `to_lowercase`. | ||
| /// | ||
| /// [case folded]: <https://unicode.org/faq/casemap_charprop.html#2> |
There was a problem hiding this comment.
Love the professional doccomments!
| // Fold case if case-insensitive | ||
| let left = if insensitive { | ||
| left_res.to_ascii_lowercase() | ||
| left_res.to_folded_case() |
There was a problem hiding this comment.
So this will be a breaking change for the sort command
| let span_b = b.span(); | ||
| let lowercase_left = match a { | ||
| Value::String { val, .. } => { | ||
| Value::string(val.to_ascii_lowercase(), span_a) |
There was a problem hiding this comment.
Also anything depending on crates/nu-command/src/sort_utils.rs may incur a breaking change.
Co-authored-by: Christopher Durham <cad97@cad97.com>
sholderbach
left a comment
There was a problem hiding this comment.
I think we should give this one a go! Thanks for the meticulous work @CAD97
# Description Add an extension trait `IgnoreCaseExt` to nu_utils which adds some case insensitivity helpers, and use them throughout nu to improve the handling of case insensitivity. Proper case folding is done via unicase, which is already a dependency via mime_guess from nu-command. In actuality a lot of code still does `to_lowercase`, because unicase only provides immediate comparison and doesn't expose a `to_folded_case` yet. And since we do a lot of `contains`/`starts_with`/`ends_with`, it's not sufficient to just have `eq_ignore_case`. But if we get access in the future, this makes us ready to use it with a change in one place. Plus, it's clearer what the purpose is at the call site to call `to_folded_case` instead of `to_lowercase` if it's exclusively for the purpose of case insensitive comparison, even if it just does `to_lowercase` still. # User-Facing Changes - Some commands that were supposed to be case insensitive remained only insensitive to ASCII case (a-z), and now are case insensitive w.r.t. non-ASCII characters as well. # Tests + Formatting - 🟢 `toolkit fmt` - 🟢 `toolkit clippy` - 🟢 `toolkit test` - 🟢 `toolkit test stdlib` --------- Co-authored-by: Stefan Holderbach <sholderbach@users.noreply.github.com>
# Description Add an extension trait `IgnoreCaseExt` to nu_utils which adds some case insensitivity helpers, and use them throughout nu to improve the handling of case insensitivity. Proper case folding is done via unicase, which is already a dependency via mime_guess from nu-command. In actuality a lot of code still does `to_lowercase`, because unicase only provides immediate comparison and doesn't expose a `to_folded_case` yet. And since we do a lot of `contains`/`starts_with`/`ends_with`, it's not sufficient to just have `eq_ignore_case`. But if we get access in the future, this makes us ready to use it with a change in one place. Plus, it's clearer what the purpose is at the call site to call `to_folded_case` instead of `to_lowercase` if it's exclusively for the purpose of case insensitive comparison, even if it just does `to_lowercase` still. # User-Facing Changes - Some commands that were supposed to be case insensitive remained only insensitive to ASCII case (a-z), and now are case insensitive w.r.t. non-ASCII characters as well. # Tests + Formatting - 🟢 `toolkit fmt` - 🟢 `toolkit clippy` - 🟢 `toolkit test` - 🟢 `toolkit test stdlib` --------- Co-authored-by: Stefan Holderbach <sholderbach@users.noreply.github.com>
# Description Switch `to_folded_case` to a proper case fold instead of `str::to_lowercase` now that unicase exposes its `to_folded_case` method. Rel: #10884, seanmonstar/unicase#61 # User-Facing Changes Case insensitive sorts now do proper case folding. Old behavior: ```nushell [dreißig DREISSIG] | sort -i # => ╭───┬──────────╮ # => │ 0 │ DREISSIG │ # => │ 1 │ dreißig │ # => ╰───┴──────────╯ ``` New behavior: ```nushell [dreißig DREISSIG] | sort -i # => ╭───┬──────────╮ # => │ 0 │ dreißig │ # => │ 1 │ DREISSIG │ # => ╰───┴──────────╯ ```
# Description Switch `to_folded_case` to a proper case fold instead of `str::to_lowercase` now that unicase exposes its `to_folded_case` method. Rel: nushell#10884, seanmonstar/unicase#61 # User-Facing Changes Case insensitive sorts now do proper case folding. Old behavior: ```nushell [dreißig DREISSIG] | sort -i # => ╭───┬──────────╮ # => │ 0 │ DREISSIG │ # => │ 1 │ dreißig │ # => ╰───┴──────────╯ ``` New behavior: ```nushell [dreißig DREISSIG] | sort -i # => ╭───┬──────────╮ # => │ 0 │ dreißig │ # => │ 1 │ DREISSIG │ # => ╰───┴──────────╯ ```
Description
Add an extension trait
IgnoreCaseExtto nu_utils which adds some case insensitivity helpers, and use them throughout nu to improve the handling of case insensitivity. Proper case folding is done via unicase, which is already a dependency via mime_guess from nu-command.In actuality a lot of code still does
to_lowercase, because unicase only provides immediate comparison and doesn't expose ato_folded_caseyet. And since we do a lot ofcontains/starts_with/ends_with, it's not sufficient to just haveeq_ignore_case. But if we get access in the future, this makes us ready to use it with a change in one place.Plus, it's clearer what the purpose is at the call site to call
to_folded_caseinstead ofto_lowercaseif it's exclusively for the purpose of case insensitive comparison, even if it just doesto_lowercasestill.User-Facing Changes
Tests + Formatting
toolkit fmttoolkit clippytoolkit testtoolkit test stdlib