Skip to content

Improve case insensitivity consistency#10884

Merged
sholderbach merged 5 commits intonushell:mainfrom
CAD97:ignore-case
Nov 8, 2023
Merged

Improve case insensitivity consistency#10884
sholderbach merged 5 commits intonushell:mainfrom
CAD97:ignore-case

Conversation

@CAD97
Copy link
Copy Markdown
Contributor

@CAD97 CAD97 commented Oct 30, 2023

Description

Add an extension trait IgnoreCaseExt to nu_utils which adds some case insensitivity helpers, and use them throughout nu to improve the handling of case insensitivity. Proper case folding is done via unicase, which is already a dependency via mime_guess from nu-command.

In actuality a lot of code still does to_lowercase, because unicase only provides immediate comparison and doesn't expose a to_folded_case yet. And since we do a lot of contains/starts_with/ends_with, it's not sufficient to just have eq_ignore_case. But if we get access in the future, this makes us ready to use it with a change in one place.

Plus, it's clearer what the purpose is at the call site to call to_folded_case instead of to_lowercase if it's exclusively for the purpose of case insensitive comparison, even if it just does to_lowercase still.

User-Facing Changes

  • Some commands that were supposed to be case insensitive remained only insensitive to ASCII case (a-z), and now are case insensitive w.r.t. non-ASCII characters as well.

Tests + Formatting

  • 🟢 toolkit fmt
  • 🟢 toolkit clippy
  • 🟢 toolkit test
  • 🟢 toolkit test stdlib

unicase is already a dependency via mime_guess in nu-command. This
replaces as much ad-hoc lowercase/uppercase comparison as possible with
case folding provided by unicase.

In actuality a lot of code still does to_lowercase, because unicase only
provides immediate comparison, and doesn't expose a to_folded_case yet.
And since we do a lot of contains/starts_with/ends_with, it's not enough
to just have eq_ignore_case. But if we get access, this makes us ready
to use it with a change in one place.

Plus, it's clearer what the purpose is at the call site to call
to_folded_case instead of to_lowercase if it's just for the purpose of
case insensitive comparison, even if it just does to_lowercase still.
let val = o.parse::<f64>();
match val {
Ok(f) => Ok(f.abs() >= f64::EPSILON),
Ok(f) => Ok(f != 0.0),
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparing to f64::EPSILON is almost never what you actually want to do. Yes, when comparing floats, you want to do so with an epsilon because of floating point imprecision, but the machine epsilon is an astoundingly bad choice for a comparison epsilon. There is no "perfect" comparison epsilon, and the comparison epsilon really wants to be relative to the scale of the compared values, due to the floating nature of floating point arithmetic.

A comparison to machine epsilon is in most cases no different to a direct comparison with 0.0, except for a false appearance of satisfying a rule of "no float ==" just for the purpose of satisfying the rule. Equality with 0 is both the value it makes most sense to directly compare to ("did they write a literal zero") and more honest than comparing with machine epsilon.

Yes, the clippy lint suggested comparing with f64::EPSILON for a long time. This was unfortunately bad advice that took an unfortunately long period to replace, due to there not being any straightforward advice to replace it with. You essentially just can't check floats for "effective equality" without domain specific knowledge.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some recent rustlang discussion: rust-lang/rust#116916

Copy link
Copy Markdown
Contributor Author

@CAD97 CAD97 Nov 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind moving the float comparison into a separate PR? It is an issue on its own, it would be better addressed separately, IMO.

Accidentally used an exclusive range instead of inclusive, whoops
@fdncred
Copy link
Copy Markdown
Contributor

fdncred commented Nov 3, 2023

Some of the feedback I'm getting from the nushell core-team is that lowercasing only ascii characters may be a mistake. Thoughts?

@kubouch
Copy link
Copy Markdown
Contributor

kubouch commented Nov 3, 2023

It seems strange to me that some to_lowercase() calls are replaced with to_ascii_lowercase(). Shouldn't these be kept as to_lowercase(), or replaced with the new to_folded_case()?

@sholderbach
Copy link
Copy Markdown
Member

For comparisons against known ASCII literals like in the config that is fine and gives a known capacity and is cheaper.

@CAD97
Copy link
Copy Markdown
Contributor Author

CAD97 commented Nov 3, 2023

That's accurate. Every case I switched to to_ascii_lowercase should have one side of the comparison known to be an ascii-only constant. If it isn't, that's an accidentally introduced bug. Someone scanning through to validate this assertion would be a useful review.

It's useful to use to_ascii_lowercase instead of to_lowercase or to_folded_case for multiple reasons:

  • Performance: ASCII lowercase is better than lowercase is better than case folding.
  • Intent: When comparing against a case constant, only normalizing for ASCII case more directly corresponds to the actual comparison being done. (I would have used eq_ignore_ascii_case more if not for usually matching or passing to another function.)
  • Correctness: While case folding is an aggressive lower casing, it shouldn't be necessary to know that to see that the code is doing the right thing. (Also, should maß compare equal to mass? It might, given case folding).

I can add comments to each instance of to_ascii_Xcase categorizing why that's sufficient if that's considered beneficial. Ask and it'll get done (sooner; I'll likely eventually end up doing it anyway to help make the PR easier to accept if it doesn't get merged first).

Side note: unicase merged the PR for exposing its to_folded_case, so when/if that gets published we can change ours to do a case fold instead of just lower case.

Side note: I'm considering the value of a follow-up PR using icu_collator for proper collation (sort) order (e.g. á sorting next to a), but that's a much bigger question to ask since 1. it's a new dependency, unlike std or unicase, 2. it requires deciding how to handle ICU4X data1, and 3. collation is locale sensitive (but you can use und for the most locale neutral option) (but it does natively provide numeric/natural sort order).

Footnotes

  1. The "best" approach is probably some combination of minimally sufficient always available compiled in data (the und locale) and then configuring paths to check for locale (or updated) data before using the fallback. Interesting would be communicating when the configured data is used and when built-in data is used, since there's essentially no way we get the entire Rust strings software stack to be generic over an ICU4X provider, so there'll always be something in the tree that uses baked Unicode data; it's just massively more convenient, to the point that ICU4X has it on by default.

@fdncred
Copy link
Copy Markdown
Contributor

fdncred commented Nov 4, 2023

Thanks for the explanation @CAD97. We appreciate your help and support.

I can add comments to each instance of to_ascii_Xcase categorizing why that's sufficient if that's considered beneficial.

Thanks for offering. I think this would be a good idea and helpful because I, for one, wouldn't have understood what you explained above.

@kubouch
Copy link
Copy Markdown
Contributor

kubouch commented Nov 4, 2023

@CAD97 Thanks for the explanation, it makes sense to use ASCII lowercasing in cases where one side is known ASCII. Adding comments before each to_ascii_lowercase() would be helpful for the case when someone like me comes along and tries to “fix” it 😆 . Alternatively,i nstead of adding comments, you could consider adding a to_kown_ascii_lowercase() (which would just call to_ascii_lowercase()) method to your trait and add a doc comment there explaining when it is supposed to be used. The latter would be more future-proof.

The Unicode string comparisons are tricky, so I think moving it into one place (the trait you introduced) is a good move IMO. Remaining questions, like “maß” vs. “mass”, would also be easier addressed within the trait.

Copy link
Copy Markdown
Member

@sholderbach sholderbach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave it a top down reading, the places where you narrowed it to ascii casing look good to me.

I tried to note what breaking changes to the behavior may result, so far I can only note sort and sort-by that may sort differently when operating in case-insensitive mode.

Comment on lines +48 to +50
fn eq_ignore_case(&self, other: &str) -> bool {
UniCase::new(self) == UniCase::new(other)
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to have a newtype for which we implement IgnoreCaseExt so we can convert/fold our needles once before going through the haystacks? Else we can always go through the to_folded_case and the standard comparisons but that allocates for both the needle and each straw in the haystack.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason I didn't do so is that most of those cases are potentially using contains or starts_with or otherwise using operations which expect to have &str access (such as for natural sort). Plus, as discussed on the trait docs, for repeated comparisons it can often be preferable to fold once up front, since the folding process isn't exactly free either. ICU4X doesn't even offer case-folded equality comparison. (It does offer one-off collation, but that might still allocate internally to create sort keys.)

In short, the extra ceremony required to thread through a newtype isn't really worth it. A Cow<str> version which doesn't allocate if the string is already case folded might be beneficial, but no case folding provider exposes that possibility currently.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK makes sense, thanks for the detailed explanation.

Let's go with the definite correctness improvements this all already provides. (If profiling tells us anything egregious we can still ponder particular implementations in the future anyways)

Comment on lines +5 to +17
/// Returns a [case folded] equivalent of this string, as a new String.
///
/// Case folding is primarily based on lowercase mapping, but includes
/// additional changes to the source text to help make case folding
/// language-invariant and consistent. Case folded text should be used
/// solely for processing and generally should not be stored or displayed.
///
/// Note: this method might only do [`str::to_lowercase`] instead of a
/// full case fold, depending on how Nu is compiled. You should still
/// prefer using this method for generating case-insensitive strings,
/// though, as it expresses intent much better than `to_lowercase`.
///
/// [case folded]: <https://unicode.org/faq/casemap_charprop.html#2>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the professional doccomments!

// Fold case if case-insensitive
let left = if insensitive {
left_res.to_ascii_lowercase()
left_res.to_folded_case()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this will be a breaking change for the sort command

let span_b = b.span();
let lowercase_left = match a {
Value::String { val, .. } => {
Value::string(val.to_ascii_lowercase(), span_a)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also anything depending on crates/nu-command/src/sort_utils.rs may incur a breaking change.

@sholderbach sholderbach added the notes:breaking-changes This PR implies a change affecting users and has to be noted in the release notes label Nov 7, 2023
Copy link
Copy Markdown
Member

@sholderbach sholderbach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should give this one a go! Thanks for the meticulous work @CAD97

@sholderbach sholderbach merged commit 0f600bc into nushell:main Nov 8, 2023
@CAD97 CAD97 deleted the ignore-case branch November 9, 2023 00:40
hardfau1t pushed a commit to hardfau1t/nushell that referenced this pull request Dec 14, 2023
# Description

Add an extension trait `IgnoreCaseExt` to nu_utils which adds some case
insensitivity helpers, and use them throughout nu to improve the
handling of case insensitivity. Proper case folding is done via unicase,
which is already a dependency via mime_guess from nu-command.

In actuality a lot of code still does `to_lowercase`, because unicase
only provides immediate comparison and doesn't expose a `to_folded_case`
yet. And since we do a lot of `contains`/`starts_with`/`ends_with`, it's
not sufficient to just have `eq_ignore_case`. But if we get access in
the future, this makes us ready to use it with a change in one place.

Plus, it's clearer what the purpose is at the call site to call
`to_folded_case` instead of `to_lowercase` if it's exclusively for the
purpose of case insensitive comparison, even if it just does
`to_lowercase` still.

# User-Facing Changes

- Some commands that were supposed to be case insensitive remained only
insensitive to ASCII case (a-z), and now are case insensitive w.r.t.
non-ASCII characters as well.

# Tests + Formatting

- 🟢 `toolkit fmt`
- 🟢 `toolkit clippy`
- 🟢 `toolkit test`
- 🟢 `toolkit test stdlib`

---------

Co-authored-by: Stefan Holderbach <sholderbach@users.noreply.github.com>
dmatos2012 pushed a commit to dmatos2012/nushell that referenced this pull request Feb 20, 2024
# Description

Add an extension trait `IgnoreCaseExt` to nu_utils which adds some case
insensitivity helpers, and use them throughout nu to improve the
handling of case insensitivity. Proper case folding is done via unicase,
which is already a dependency via mime_guess from nu-command.

In actuality a lot of code still does `to_lowercase`, because unicase
only provides immediate comparison and doesn't expose a `to_folded_case`
yet. And since we do a lot of `contains`/`starts_with`/`ends_with`, it's
not sufficient to just have `eq_ignore_case`. But if we get access in
the future, this makes us ready to use it with a change in one place.

Plus, it's clearer what the purpose is at the call site to call
`to_folded_case` instead of `to_lowercase` if it's exclusively for the
purpose of case insensitive comparison, even if it just does
`to_lowercase` still.

# User-Facing Changes

- Some commands that were supposed to be case insensitive remained only
insensitive to ASCII case (a-z), and now are case insensitive w.r.t.
non-ASCII characters as well.

# Tests + Formatting

- 🟢 `toolkit fmt`
- 🟢 `toolkit clippy`
- 🟢 `toolkit test`
- 🟢 `toolkit test stdlib`

---------

Co-authored-by: Stefan Holderbach <sholderbach@users.noreply.github.com>
sholderbach pushed a commit that referenced this pull request Nov 5, 2024
# Description
Switch `to_folded_case` to a proper case fold instead of
`str::to_lowercase` now that unicase exposes its `to_folded_case`
method.

Rel: #10884, seanmonstar/unicase#61

# User-Facing Changes

Case insensitive sorts now do proper case folding.

Old behavior:

```nushell
[dreißig DREISSIG] | sort -i
# => ╭───┬──────────╮
# => │ 0 │ DREISSIG │
# => │ 1 │ dreißig  │
# => ╰───┴──────────╯
```

New behavior:

```nushell
[dreißig DREISSIG] | sort -i
# => ╭───┬──────────╮
# => │ 0 │ dreißig  │
# => │ 1 │ DREISSIG │
# => ╰───┴──────────╯
```
NotTheDr01ds pushed a commit to NotTheDr01ds/nushell that referenced this pull request Nov 8, 2024
# Description
Switch `to_folded_case` to a proper case fold instead of
`str::to_lowercase` now that unicase exposes its `to_folded_case`
method.

Rel: nushell#10884, seanmonstar/unicase#61

# User-Facing Changes

Case insensitive sorts now do proper case folding.

Old behavior:

```nushell
[dreißig DREISSIG] | sort -i
# => ╭───┬──────────╮
# => │ 0 │ DREISSIG │
# => │ 1 │ dreißig  │
# => ╰───┴──────────╯
```

New behavior:

```nushell
[dreißig DREISSIG] | sort -i
# => ╭───┬──────────╮
# => │ 0 │ dreißig  │
# => │ 1 │ DREISSIG │
# => ╰───┴──────────╯
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

notes:breaking-changes This PR implies a change affecting users and has to be noted in the release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants