Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Remove String IsASCII flag from the object header#24400

Closed
EgorBo wants to merge 1 commit intodotnet:masterfrom
EgorBo:remove-string-state
Closed

Remove String IsASCII flag from the object header#24400
EgorBo wants to merge 1 commit intodotnet:masterfrom
EgorBo:remove-string-state

Conversation

@EgorBo
Copy link
Member

@EgorBo EgorBo commented May 5, 2019

Not sure if this PR should be merged but at least it's a nice diff to understand how this feature works.
I even made a schema (click to zoom in):

image

This control bit in the object header signals if the string is checked that it doesn't contain any high char (>0x80), see this comment from syncblk.cpp:

    //the highest 2 bits have reloaded meaning
    //for string objects:	
    //         BIT_SBLK_STRING_HAS_NO_HIGH_CHARS   0x80000000	
    //         BIT_SBLK_STRING_HIGH_CHARS_KNOWN    0x40000000	
    //         BIT_SBLK_STRING_HAS_SPECIAL_SORT    0xC0000000	
    //for other objects:	
    //         BIT_SBLK_FINALIZER_RUN              0x40000000

Currently this flag works only on .Unix systems and can be checked via String.IsFastSort internal call used by internal IndexOf for the invariant cultures and for example String.Replace(string,string,Invariant) uses it.

so when we do:

string str2 = str.Replace("a", "b", StringComparison.InvariantCulture);

if we call Replace for the first time it scans str for the "high chars" and sets the bit to 0x80 (no high chars) or to 0x40 (has high chars) - see IsFastSort (IsFastSort is called from here)
so next time we call Replace for this string the call will be a little bit faster (since IsFastSort will be O(1) instead of O(N)).

BTW, if we want to keep this flag why string literals don't have it set by default?
BTW2, if a string is checked that it doesn't contain any high char and has this flag set to 0x80 and we then modify the string via e.g.:

fixed (char* c = str) *c = 'Й';

The IsAscii flag will remain and the comparison logic will be broken (I know, modifying strings this way is not a good idea but anyway).

cc: @jkotas @filipnavara

@benaadams
Copy link
Member

/cc @GrabYourPitchforks

@danmoseley
Copy link
Member

@EgorBo what is the motivation for this change?

@EgorBo
Copy link
Member Author

EgorBo commented May 5, 2019

@danmosemsft dotnet/corefx#37358 (comment) and #24373 so while this change can make some benchmarks a bit slower (e.g. if they repeatedly call Replace for the same string over and over - in the real world strings tend to be more unique) it also makes code more portable (as you can see I removed #if CORECLR from the code that is "shared" with other runtimes including Mono).

@EgorBo
Copy link
Member Author

EgorBo commented May 5, 2019

Also as I said I am not sure it should be removed at all, I did it mostly as an investigation 🙂 and if we want it to stay I wonder why we don't use it on Windows - on Windows it uses pinvoke into kernel32 FastStringOrdinal (https://docs.microsoft.com/en-us/windows/desktop/api/libloaderapi/nf-libloaderapi-findstringordinal) so currently this logic is useless on Windows if I am not mistaken


#if CORECLR
if (_isAsciiEqualityOrdinal && CanUseAsciiOrdinalForOptions(options) && source.IsFastSort() && target.IsFastSort())
if (_isAsciiEqualityOrdinal && CanUseAsciiOrdinalForOptions(options) && source.IsAscii() && target.IsAscii())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this code can be optimized - we can omit IsAscii and just run IndexOf and figure out if it's not "IsAscii" on the fly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks Windows implementation don't have this optimization. we can try just remove the whole optimization and compare the perf with Windows.

checking IsAscii only here is incorrect in Linux as you can run into issues when having characters like hyphens.


In reply to: 281011011 [](ancestors = 281011011)

@filipnavara
Copy link
Member

filipnavara commented May 5, 2019

I'm mostly worried about removing the IndexOf optimization resulting in performance regressions.

If the flag is to be kept then there's a lot of places where it can be propagated in various operations (eg. Substring of FastSort string is guaranteed to be FastSort).

@jkotas
Copy link
Member

jkotas commented May 5, 2019

this change can make some benchmarks a bit slower

And make some benchmarks faster. This change needs perf numbers.

I guess this code can be optimized - we can omit IsAscii and just run IndexOf and figure out if it's not "IsAscii" on the fly.

Correct. Whether or not this is worth doing should be supported by perf measurements. This optimization was used when removing use of this flag in other places, but not all of them.

there's a lot of places where it can be propagated in various operations

The culture specific operations are not used that frequently, and they are a lot slower than a ordinal operations by design. I do not think it is a good idea to spread peanut butter overhead and complexity everywhere for something that is slow already.

@tarekgh
Copy link
Member

tarekgh commented Aug 30, 2019

We should try to remove the ASCII case optimization and have the perf numbers. Using IsAscii only is wrong especially when string has some characters like hyphen or single quotes.

@jkotas
Copy link
Member

jkotas commented Sep 18, 2019

Superseded by #26759

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants