Improve case insensitivity consistency by CAD97 · Pull Request #10884 · nushell/nushell

CAD97 · 2023-10-30T01:20:45Z

Description

Add an extension trait IgnoreCaseExt to nu_utils which adds some case insensitivity helpers, and use them throughout nu to improve the handling of case insensitivity. Proper case folding is done via unicase, which is already a dependency via mime_guess from nu-command.

In actuality a lot of code still does to_lowercase, because unicase only provides immediate comparison and doesn't expose a to_folded_case yet. And since we do a lot of contains/starts_with/ends_with, it's not sufficient to just have eq_ignore_case. But if we get access in the future, this makes us ready to use it with a change in one place.

Plus, it's clearer what the purpose is at the call site to call to_folded_case instead of to_lowercase if it's exclusively for the purpose of case insensitive comparison, even if it just does to_lowercase still.
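To make the shape of this concrete, here is a minimal std-only sketch of such an extension trait. The name `IgnoreCaseSketch` and the helper `contains_ignore_case` are hypothetical, and the real `eq_ignore_case` in the PR compares via unicase rather than by folding both sides; this is an illustration under those assumptions, not the PR's code.

```rust
/// Hypothetical stand-in for the PR's `IgnoreCaseExt`. The real trait lives in
/// `nu_utils` and compares via `unicase`, which this std-only sketch omits.
pub trait IgnoreCaseSketch {
    /// Normalizes a string for case-insensitive processing. Here it is plain
    /// `str::to_lowercase`; a true case fold could be swapped in at this one spot.
    fn to_folded_case(&self) -> String;

    /// One-off comparison, approximated here by folding both sides.
    fn eq_ignore_case(&self, other: &str) -> bool;
}

impl IgnoreCaseSketch for str {
    fn to_folded_case(&self) -> String {
        self.to_lowercase()
    }

    fn eq_ignore_case(&self, other: &str) -> bool {
        self.to_folded_case() == other.to_folded_case()
    }
}

/// Substring search needs real string data on both sides, so an equality-only
/// helper is not enough; both sides are folded first.
pub fn contains_ignore_case(haystack: &str, needle: &str) -> bool {
    haystack.to_folded_case().contains(&needle.to_folded_case())
}
```

Call sites then read as `name.to_folded_case().starts_with(&prefix.to_folded_case())`, which states the intent even while the body is still a lowercase mapping.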

User-Facing Changes

Some commands that were supposed to be case insensitive remained only insensitive to ASCII case (a-z), and now are case insensitive w.r.t. non-ASCII characters as well.

Tests + Formatting

🟢 toolkit fmt
🟢 toolkit clippy
🟢 toolkit test
🟢 toolkit test stdlib

CAD97 · 2023-10-30T01:32:54Z

crates/nu-command/src/conversions/into/bool.rs

            let val = o.parse::<f64>();
            match val {
-                Ok(f) => Ok(f.abs() >= f64::EPSILON),
+                Ok(f) => Ok(f != 0.0),


Comparing to f64::EPSILON is almost never what you actually want to do. Yes, when comparing floats, you want to do so with an epsilon because of floating point imprecision, but the machine epsilon is an astoundingly bad choice for a comparison epsilon. There is no "perfect" comparison epsilon, and the comparison epsilon really wants to be relative to the scale of the compared values, due to the floating nature of floating point arithmetic.

A comparison to machine epsilon is in most cases no different to a direct comparison with 0.0, except for a false appearance of satisfying a rule of "no float ==" just for the purpose of satisfying the rule. Equality with 0 is both the value it makes most sense to directly compare to ("did they write a literal zero") and more honest than comparing with machine epsilon.

Yes, the clippy lint suggested comparing with f64::EPSILON for a long time. This was unfortunately bad advice that took an unfortunately long period to replace, due to there not being any straightforward advice to replace it with. You essentially just can't check floats for "effective equality" without domain specific knowledge.

Some recent rustlang discussion: rust-lang/rust#116916

And on clippy: rust-lang/rust-clippy#6816, rust-lang/rust-clippy#7725
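A small sketch of the difference argued above. The two `truthy_*` functions mirror the old and new lines of the diff; `approx_eq_relative` and its `rel_tol` parameter are hypothetical and only illustrate what a scale-relative tolerance could look like when a caller does have domain knowledge.

```rust
/// Old check from the diff: "truthy" only if the magnitude reaches machine epsilon.
fn truthy_epsilon(f: f64) -> bool {
    f.abs() >= f64::EPSILON
}

/// New check from the diff: "truthy" unless the value is exactly zero.
fn truthy_nonzero(f: f64) -> bool {
    f != 0.0
}

/// Hypothetical helper: equality with a tolerance relative to the operands' scale.
/// `rel_tol` is domain specific; no single value is right for every caller.
fn approx_eq_relative(a: f64, b: f64, rel_tol: f64) -> bool {
    let scale = a.abs().max(b.abs());
    (a - b).abs() <= rel_tol * scale
}
```

With these definitions, an input such as `1e-20` is nonzero but still converts to `false` under the epsilon check, while `1e10` and `1e10 + 1.0` differ by far more than `f64::EPSILON` yet agree to within a relative tolerance of `1e-9`.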

Would you mind moving the float comparison into a separate PR? It is an issue on its own, it would be better addressed separately, IMO.

Accidentally used an exclusive range instead of inclusive, whoops

fdncred · 2023-11-03T17:08:29Z

Some of the feedback I'm getting from the nushell core-team is that lowercasing only ascii characters may be a mistake. Thoughts?

kubouch · 2023-11-03T18:19:43Z

It seems strange to me that some to_lowercase() calls are replaced with to_ascii_lowercase(). Shouldn't these be kept as to_lowercase(), or replaced with the new to_folded_case()?

sholderbach · 2023-11-03T19:56:59Z

For comparisons against known ASCII literals, like in the config, that is fine; it gives a known capacity and is cheaper.

CAD97 · 2023-11-03T23:10:03Z

That's accurate. Every case I switched to to_ascii_lowercase should have one side of the comparison known to be an ascii-only constant. If it isn't, that's an accidentally introduced bug. Someone scanning through to validate this assertion would be a useful review.

It's useful to use to_ascii_lowercase instead of to_lowercase or to_folded_case for multiple reasons:

- Performance: ASCII lowercasing is cheaper than full lowercasing, which in turn is cheaper than case folding.
- Intent: When comparing against a case constant, only normalizing for ASCII case more directly corresponds to the actual comparison being done. (I would have used eq_ignore_ascii_case more if not for usually matching or passing to another function.)
- Correctness: While case folding is an aggressive lower casing, it shouldn't be necessary to know that to see that the code is doing the right thing. (Also, should maß compare equal to mass? It might, given case folding.)
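The three options in the list above can be contrasted with std alone. The function names below are hypothetical; std has no full case fold, so `str::to_lowercase` stands in for it and leaves `ß` unchanged, whereas Unicode's full case folding maps `ß` to `ss`.

```rust
/// Matches user input against an ASCII-only keyword (a hypothetical config value).
/// Only ASCII case can matter here, so an ASCII-only comparison is sufficient.
fn is_true_keyword(input: &str) -> bool {
    input.eq_ignore_ascii_case("true")
}

/// Returns the (ASCII-lowercased, fully lowercased) forms of the input.
fn lowercase_forms(input: &str) -> (String, String) {
    (input.to_ascii_lowercase(), input.to_lowercase())
}
```

For example, `lowercase_forms("ÄB")` yields `("Äb", "äb")`: the ASCII variant leaves `Ä` untouched, which is harmless only when the other side of the comparison cannot contain it.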

I can add comments to each instance of to_ascii_Xcase categorizing why that's sufficient if that's considered beneficial. Ask and it'll get done (sooner; I'll likely eventually end up doing it anyway to help make the PR easier to accept if it doesn't get merged first).

Side note: unicase merged the PR for exposing its to_folded_case, so when/if that gets published we can change ours to do a case fold instead of just lower case.

Side note: I'm considering the value of a follow-up PR using icu_collator for proper collation (sort) order (e.g. á sorting next to a), but that's a much bigger question to ask since 1. it's a new dependency, unlike std or unicase, 2. it requires deciding how to handle ICU4X data¹, and 3. collation is locale sensitive (but you can use und for the most locale neutral option) (but it does natively provide numeric/natural sort order).

¹ The "best" approach is probably some combination of minimally sufficient always available compiled in data (the und locale) and then configuring paths to check for locale (or updated) data before using the fallback. Interesting would be communicating when the configured data is used and when built-in data is used, since there's essentially no way we get the entire Rust strings software stack to be generic over an ICU4X provider, so there'll always be something in the tree that uses baked Unicode data; it's just massively more convenient, to the point that ICU4X has it on by default.

fdncred · 2023-11-04T00:49:34Z

Thanks for the explanation @CAD97. We appreciate your help and support.

> I can add comments to each instance of to_ascii_Xcase categorizing why that's sufficient if that's considered beneficial.

Thanks for offering. I think this would be a good idea and helpful because I, for one, wouldn't have understood what you explained above.

kubouch · 2023-11-04T09:37:41Z

@CAD97 Thanks for the explanation, it makes sense to use ASCII lowercasing in cases where one side is known ASCII. Adding comments before each to_ascii_lowercase() would be helpful for the case when someone like me comes along and tries to “fix” it 😆. Alternatively, instead of adding comments, you could consider adding a to_known_ascii_lowercase() (which would just call to_ascii_lowercase()) method to your trait and add a doc comment there explaining when it is supposed to be used. The latter would be more future-proof.

The Unicode string comparisons are tricky, so I think moving it into one place (the trait you introduced) is a good move IMO. Remaining questions, like “maß” vs. “mass”, would also be easier addressed within the trait.
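A helper along the lines suggested here might look like the following sketch. The trait name `KnownAsciiCaseExt`, the method `to_known_ascii_lowercase`, and the `parse_flag` call site are all hypothetical; the thread does not show such a method being added.

```rust
/// Hypothetical extension trait illustrating the suggestion above.
pub trait KnownAsciiCaseExt {
    /// Lowercases ASCII letters only.
    ///
    /// Intended for comparisons where the other side is an ASCII-only
    /// constant. Non-ASCII characters are left as-is, so input containing
    /// them cannot equal such a constant. For anything else, prefer a
    /// full case fold.
    fn to_known_ascii_lowercase(&self) -> String;
}

impl KnownAsciiCaseExt for str {
    fn to_known_ascii_lowercase(&self) -> String {
        self.to_ascii_lowercase()
    }
}

/// Example call site: matching against ASCII-only literals.
pub fn parse_flag(input: &str) -> Option<bool> {
    match input.to_known_ascii_lowercase().as_str() {
        "true" | "yes" => Some(true),
        "false" | "no" => Some(false),
        _ => None,
    }
}
```

The doc comment carries the explanation once, so individual call sites do not each need a justification.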

sholderbach

Gave it a top down reading, the places where you narrowed it to ascii casing look good to me.

I tried to note what breaking changes to the behavior may result. So far I can only note sort and sort-by, which may sort differently when operating in case-insensitive mode.

sholderbach · 2023-11-07T22:03:32Z

crates/nu-utils/src/casing.rs

+    fn eq_ignore_case(&self, other: &str) -> bool {
+        UniCase::new(self) == UniCase::new(other)
+    }


Would it make sense to have a newtype for which we implement IgnoreCaseExt so we can convert/fold our needles once before going through the haystacks? Else we can always go through the to_folded_case and the standard comparisons but that allocates for both the needle and each straw in the haystack.

The main reason I didn't do so is that most of those cases are potentially using contains or starts_with or otherwise using operations which expect to have &str access (such as for natural sort). Plus, as discussed on the trait docs, for repeated comparisons it can often be preferable to fold once up front, since the folding process isn't exactly free either. ICU4X doesn't even offer case-folded equality comparison. (It does offer one-off collation, but that might still allocate internally to create sort keys.)

In short, the extra ceremony required to thread through a newtype isn't really worth it. A Cow<str> version which doesn't allocate if the string is already case folded might be beneficial, but no case folding provider exposes that possibility currently.
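As a rough illustration of folding the needle once and of the `Cow<str>` idea, here is a std-only sketch. Both `fold_cow` and `filter_contains_ignore_case` are hypothetical names, and `str::to_lowercase` stands in for a real case fold; the no-allocation fast path only covers ASCII input that has no uppercase letters.

```rust
use std::borrow::Cow;

/// Hypothetical helper: lowercases without allocating when the input is ASCII
/// and already has no uppercase letters. Stands in for a real case fold.
fn fold_cow(s: &str) -> Cow<'_, str> {
    if s.is_ascii() && !s.bytes().any(|b| b.is_ascii_uppercase()) {
        Cow::Borrowed(s)
    } else {
        Cow::Owned(s.to_lowercase())
    }
}

/// Folds the needle once up front, then each straw once while scanning.
fn filter_contains_ignore_case<'a>(haystack: &[&'a str], needle: &str) -> Vec<&'a str> {
    let needle = fold_cow(needle);
    haystack
        .iter()
        .copied()
        .filter(|straw| fold_cow(straw).contains(&*needle))
        .collect()
}
```

Because the folded values are ordinary string slices, `contains`, `starts_with`, and similar operations keep working without threading a newtype through the call sites.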

OK makes sense, thanks for the detailed explanation.

Let's go with the definite correctness improvements this all already provides. (If profiling tells us anything egregious we can still ponder particular implementations in the future anyways)

crates/nu-cli/src/completions/variable_completions.rs

crates/nu-command/src/filters/sort.rs

sholderbach · 2023-11-07T22:12:48Z

crates/nu-utils/src/casing.rs

+    /// Returns a [case folded] equivalent of this string, as a new String.
+    ///
+    /// Case folding is primarily based on lowercase mapping, but includes
+    /// additional changes to the source text to help make case folding
+    /// language-invariant and consistent. Case folded text should be used
+    /// solely for processing and generally should not be stored or displayed.
+    ///
+    /// Note: this method might only do [`str::to_lowercase`] instead of a
+    /// full case fold, depending on how Nu is compiled. You should still
+    /// prefer using this method for generating case-insensitive strings,
+    /// though, as it expresses intent much better than `to_lowercase`.
+    ///
+    /// [case folded]: <https://unicode.org/faq/casemap_charprop.html#2>


Love the professional doccomments!

sholderbach · 2023-11-07T22:14:21Z

crates/nu-command/src/filters/sort.rs

+        // Fold case if case-insensitive
        let left = if insensitive {
-            left_res.to_ascii_lowercase()
+            left_res.to_folded_case()


So this will be a breaking change for the sort command
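A minimal sketch of why the ordering can change, assuming the sort key is simply the normalized string. The function names are hypothetical, and `str::to_lowercase` stands in for `to_folded_case`; the actual `sort` implementation in nu-command is more involved.

```rust
/// Sorts with ASCII-only lowercasing as the key (the old behavior in the diff).
fn sort_ascii_insensitive(mut items: Vec<&str>) -> Vec<&str> {
    items.sort_by_key(|s| s.to_ascii_lowercase());
    items
}

/// Sorts with full lowercasing as the key, standing in for `to_folded_case`.
fn sort_folded_insensitive(mut items: Vec<&str>) -> Vec<&str> {
    items.sort_by_key(|s| s.to_lowercase());
    items
}
```

For the input `["äther", "Öl"]`, the ASCII-keyed sort puts `Öl` first because `Ö` (U+00D6) orders before `ä` (U+00E4), while the fully lowercased keys `äther` and `öl` put `äther` first.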

sholderbach · 2023-11-07T22:15:58Z

crates/nu-command/src/sort_utils.rs

                    let span_b = b.span();
-                    let lowercase_left = match a {
-                        Value::String { val, .. } => {
-                            Value::string(val.to_ascii_lowercase(), span_a)


Also anything depending on crates/nu-command/src/sort_utils.rs may incur a breaking change.

crates/nu-glob/src/lib.rs

Co-authored-by: Christopher Durham <cad97@cad97.com>

sholderbach

I think we should give this one a go! Thanks for the meticulous work @CAD97

Co-authored-by: Stefan Holderbach <sholderbach@users.noreply.github.com>

# Description

Switch `to_folded_case` to a proper case fold instead of `str::to_lowercase` now that unicase exposes its `to_folded_case` method.

Rel: #10884, seanmonstar/unicase#61

# User-Facing Changes

Case insensitive sorts now do proper case folding.

Old behavior:

```nushell
[dreißig DREISSIG] | sort -i
# => ╭───┬──────────╮
# => │ 0 │ DREISSIG │
# => │ 1 │ dreißig  │
# => ╰───┴──────────╯
```

New behavior:

```nushell
[dreißig DREISSIG] | sort -i
# => ╭───┬──────────╮
# => │ 0 │ dreißig  │
# => │ 1 │ DREISSIG │
# => ╰───┴──────────╯
```

Unbreak range patterns

a9d11ce

sholderbach reviewed Nov 7, 2023

sholderbach added the notes:breaking-changes This PR implies a change affecting users and has to be noted in the release notes label Nov 7, 2023

sholderbach and others added 3 commits November 8, 2023 18:51

Update crates/nu-glob/src/lib.rs

6eeca97

Co-authored-by: Christopher Durham <cad97@cad97.com>

Merge branch 'main' into ignore-case

4d1d23f

Merge branch 'main' into ignore-case

4a2063d

sholderbach approved these changes Nov 8, 2023

sholderbach merged commit 0f600bc into nushell:main Nov 8, 2023

CAD97 deleted the ignore-case branch November 9, 2023 00:40

132ikl mentioned this pull request Nov 5, 2024

Switch to unicase's to_folded_case #14255

Merged
