-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Update PlatformAgnostic case conversion functions to allow expanding-length strings #4832
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| while (i < inStrLim) | ||
| // Fast path for one character strings | ||
| char16 inChar = pThis->GetSz()[0]; | ||
| char16 oneCharAttempt[2] = { inChar, 0 }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the point of this buffer? Does the TryChangeStringLinguisticCaseInPlace function expect a null terminated string rather than a length-specified string?
| *pErrorOut = TranslateUErrorCode(errorCode); | ||
| return -1; | ||
| } | ||
| AssertMsg(resultStringLength > 0, "u_strToCase must return required destString length"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this assert hit if you try to change the case of the empty string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh nevermind, I see there's a check earlier on that the source length is greater than 0; I'm not sure I see where the caller enforces that though
| char16 *outStr = builder.DangerousGetWritableBuffer(); | ||
| ApiError error = ApiError::NoError; | ||
| // pre-flight to get the length required, as it may be longer than the original string | ||
| charcount_t requiredStringLength = ChangeStringLinguisticCase<toUpper, useInvariant>(pThis->GetSz(), pThis->GetLength(), nullptr, 0, &error); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So some API implementation details, here... ICU's version of this API allows you to pass in a non-null buffer for pre-flighting, in case you happen to provide one long enough. Intl takes advantage of this, where we will basically try a function once with a buffer of a guessed length, and if that fails, allocate a new buffer using the actual length returned from the first attempt. We could do that here, where our initial guess could be to assume that the cased string will be as long as the non-cased string, which will be true in lots of cases. However, the Windows/no-ICU version of this function doesn't support that functionality, because if the buffer isn't long enough, it returns a required length of 0 to indicate failure. So, to be safe, we always pass nullptr here. Its one thing to implement this functionality here separately for ICU and no-ICU, but does that make the ChangeStringLinguisticCase API too complicated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm tempted to say open another issue to track this as a possible optimization but for now it's fine.
|
|
||
| return ToLocaleCaseHelper(args[0], false, scriptContext); | ||
| JavascriptString * pThis = nullptr; | ||
| GetThisStringArgument(args, scriptContext, _u("String.prototype.toLocaleUpperCase"), &pThis); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toLocaleUpperCase [](start = 72, length = 17)
These labels are mixed up between the two functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch
| return ToCaseCore<toUpper, false>(pThis); | ||
| } | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: extra blank line below this function
|
|
||
| int32 ChangeStringLinguisticCase(CaseFlags caseFlags, const char16* sourceString, uint32 sourceLength, char16* destString, uint32 destLength, ApiError* pErrorOut) | ||
| template<bool toUpper, bool useInvariant> | ||
| charcount_t ChangeStringLinguisticCase(const char16* sourceString, charcount_t sourceLength, char16* destString, charcount_t destLength, ApiError* pErrorOut) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Linguistic [](start = 32, length = 10)
how do we handle the langtag for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
POSIX doesn't have any idea of locale for this implementation, but this is only used in the --no-icu case, which as far as I know is anything goes in terms of spec compliance. This didn't handle system vs root locale before, either.
| } | ||
| } | ||
|
|
||
| uint32 ChangeStringCaseInPlace(CaseFlags caseFlags, char16* stringToChange, uint32 bufferLength) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ChangeStringCaseInPlace [](start = 15, length = 23)
is this not needed anymore? I think this was an optimization for ASCII strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed it because we couldn't guarantee in-place conversion for any arbitrary string (any ICU implementation takes a src and dest buffer, and if characters can expand from source to dest, doing conversions in place using the same buffer for both args would produce incorrect results). I suppose I can add this back and fail for non-ASCII?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I am still against this function because in reality, this is only called when converting JavascriptStrings, which are immutable and can't be changed in place regardless. So, unless we can come up with a case where we are casing a string from native code that isn't a JavascriptString, I think this function just adds more complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. If it regresses perf in a scenario we are tracking we can optimize later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth submitting an IB run with this change?
| assert.areEqual("\u03B1\u0345\u03C2", "\u0391\u0345\u03A3".toLowerCase(), "Expecting sigma preceded by Greek upper-case alpha (0391), combining Greek ypogegrammeni (0345)"); | ||
| assert.areEqual("a\u03C2\u0345", "A\u03A3\u0345".toLowerCase(), "Expecting sigma followed by combining Greek ypogegrammeni (0345)"); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} [](start = 8, length = 1)
Should add a test for the proper error thrown for the String.prototype.toLocale{Upper|Lower}Case.call(null) case
|
|
||
| if (WScript.Platform.INTL_LIBRARY === "icu") { | ||
| assert.areEqual("\u0069\u0307", "\u0130".toLowerCase(), "Expecting Latin upper-case i with dot above"); | ||
| assert.areEqual("a\u03C2", "A\u03A3".toLowerCase(), "Expecting sigma preceded by Latin upper-case a"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
upper [](start = 103, length = 5)
lower-case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These were added previously. I could try to go through and add them all again, but I imagine they were auto-generated?
| if (WScript.Platform.INTL_LIBRARY === "icu") { | ||
| assert.areEqual("\u0069\u0307", "\u0130".toLowerCase(), "Expecting Latin upper-case i with dot above"); | ||
| assert.areEqual("a\u03C2", "A\u03A3".toLowerCase(), "Expecting sigma preceded by Latin upper-case a"); | ||
| assert.areEqual("\uD835\uDCA2\u03C2", "\uD835\uDCA2\u03A3".toLowerCase(), "Expecting sigma preceded by mathematical script capital g (d835 dca2 = 1d4a2)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
capital [](start = 139, length = 7)
again lower case because you're using .toLowerCase?
Probably just check over these comments again.
| // Win32's version of toUpperCase does not support growing strings in the uppercase operation (see CharUpperBuff) | ||
| return; | ||
| } | ||
| assert.areEqual("\u0053\u0053", "\u00DF".toUpperCase(), "Expecting Latin lower-case sharp s"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lower-case [](start = 85, length = 10)
you used toUpperCase so this comment should reflect that
| assert.areEqual("\u1F0D\u0399", "\u1F85".toUpperCase(), "Expecting Greek lower-case alpha with dasia and oxia and ypogegrammeni"); | ||
| assert.areEqual("\u1F0E\u0399", "\u1F86".toUpperCase(), "Expecting Greek lower-case alpha with psili and perispomeni and ypogegrammeni"); | ||
| assert.areEqual("\u1F0F\u0399", "\u1F87".toUpperCase(), "Expecting Greek lower-case alpha with dasia and perispomeni and ypogegrammeni"); | ||
| assert.areEqual("\u1F08\u0399", "\u1F88".toUpperCase(), "Expecting Greek upper-case alpha with psili and prosgegrammeni"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some mixing of lowercase and uppercase but all operations are toUpperCase
| { | ||
| name: "Edge cases", | ||
| body() { | ||
| assert.areEqual("", "".toUpperCase(), "Empty string should return itself (toUpperCase)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toUpperCase [](start = 35, length = 11)
Did you add any tests for toLocale{Upper|Lower}Case ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that should probably go in a different file.
dilijev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nits and test comment consistency issues. Otherwise LGTM
sethbrenith
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
![]()
|
@dilijev Unless you feel particularly strongly about it, I feel like I should leave the message issues in the Unicode test as is and open a bug to regenerate that test for Unicode 10 soon. |
|
@jackhorton that's fine if there's a task to track. Keeping in mind the test is a bit misleading at the moment (especially if something fails could cause confusion). We both have Unicode 10 update work to do :P |
Fixes #421
Fixes #4526
Also fixes a previously untracked? bug where String.prototype.toLocale{Upper|Lower}Case.call(null) would print "null" rather than throwing a TypeError.