Update PlatformAgnostic case conversion functions to allow expanding-length strings #4832

jackhorton · 2018-03-16T06:44:58Z

Also fixes a bug, apparently untracked until now, where String.prototype.toLocale{Upper|Lower}Case.call(null) would produce "null" rather than throwing a TypeError.

… length in ICU

MSLaguana · 2018-03-16T15:49:44Z

lib/Runtime/Library/JavascriptString.cpp

-            while (i < inStrLim)
+            // Fast path for one character strings
+            char16 inChar = pThis->GetSz()[0];
+            char16 oneCharAttempt[2] = { inChar, 0 };


What is the point of this buffer? Does the TryChangeStringLinguisticCaseInPlace function expect a null-terminated string rather than a length-specified string?

MSLaguana · 2018-03-16T16:04:26Z

lib/Runtime/PlatformAgnostic/Platform/Common/UnicodeText.ICU.cpp

-                *pErrorOut = TranslateUErrorCode(errorCode);
-                return -1;
-            }
+            AssertMsg(resultStringLength > 0, "u_strToCase must return required destString length");


Is this assert hit if you try to change the case of the empty string?

Oh, never mind: I see there's a check earlier on that the source length is greater than 0. I'm not sure I see where the caller enforces that, though.

jackhorton · 2018-03-16T23:43:48Z

lib/Runtime/Library/JavascriptString.cpp

-        char16 *outStr = builder.DangerousGetWritableBuffer();
+        ApiError error = ApiError::NoError;
+        // pre-flight to get the length required, as it may be longer than the original string
+        charcount_t requiredStringLength = ChangeStringLinguisticCase<toUpper, useInvariant>(pThis->GetSz(), pThis->GetLength(), nullptr, 0, &error);


Some API implementation details here: ICU's version of this API allows you to pass a non-null buffer when pre-flighting, in case the buffer you provide happens to be long enough. Intl takes advantage of this: we try a function once with a buffer of a guessed length, and if that fails, we allocate a new buffer using the actual length returned from the first attempt. We could do that here, with an initial guess that the cased string is as long as the non-cased string, which will be true in lots of cases. However, the Windows/no-ICU version of this function doesn't support that, because if the buffer isn't long enough, it returns a required length of 0 to indicate failure. So, to be safe, we always pass nullptr here. It's one thing to implement this separately for ICU and no-ICU, but does that make the ChangeStringLinguisticCase API too complicated?

I'm tempted to say open another issue to track this as a possible optimization but for now it's fine.
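
For reference, the guess-then-retry idea discussed above could look roughly like the sketch below. It is not the ChakraCore or ICU code: `ToUpperSketch` and `ToUpperGuessFirst` are hypothetical names, and the only expansion modelled is U+00DF to "SS" on top of ASCII upper-casing. The one property it assumes from the discussion is that the casing primitive always reports the required length, even when the buffer is too small.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// U+00DF (sharp s), the single expanding character this sketch models.
constexpr char16_t kSharpS = 0x00DF;

// Hypothetical stand-in for a casing primitive with an ICU-style pre-flight
// contract: it always returns the length the full result needs, and writes
// output only when the destination buffer is large enough.
std::size_t ToUpperSketch(const char16_t* src, std::size_t srcLen,
                          char16_t* dest, std::size_t destLen)
{
    std::size_t required = 0;
    for (std::size_t i = 0; i < srcLen; ++i)
    {
        required += (src[i] == kSharpS) ? 2 : 1;
    }
    if (dest == nullptr || destLen < required)
    {
        return required; // pre-flight or undersized buffer: report length only
    }
    std::size_t out = 0;
    for (std::size_t i = 0; i < srcLen; ++i)
    {
        const char16_t c = src[i];
        if (c == kSharpS)
        {
            dest[out++] = u'S';
            dest[out++] = u'S';
        }
        else if (c >= u'a' && c <= u'z')
        {
            dest[out++] = static_cast<char16_t>(c - (u'a' - u'A'));
        }
        else
        {
            dest[out++] = c;
        }
    }
    return required;
}

// Guess-then-retry: first try a buffer as long as the source, and only
// reallocate when the reported length says the result did not fit.
std::u16string ToUpperGuessFirst(const std::u16string& input)
{
    std::u16string result(input.size(), u'\0');
    const std::size_t required =
        ToUpperSketch(input.data(), input.size(), &result[0], result.size());
    if (required > result.size())
    {
        result.resize(required);
        const std::size_t written =
            ToUpperSketch(input.data(), input.size(), &result[0], result.size());
        assert(written == required);
        (void)written; // silence unused-variable warnings when asserts are off
    }
    return result;
}
```

With this shape, strings whose cased form keeps its length cost one call, and only expanding strings pay for a second pass. A primitive that reports 0 on an undersized buffer, as described for the Windows/no-ICU path, could not drive the retry this way.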

dilijev · 2018-03-17T01:17:49Z

lib/Runtime/Library/JavascriptString.cpp


-        return ToLocaleCaseHelper(args[0], false, scriptContext);
+        JavascriptString * pThis = nullptr;
+        GetThisStringArgument(args, scriptContext, _u("String.prototype.toLocaleUpperCase"), &pThis);


toLocaleUpperCase

These labels are mixed up between the two functions.

dilijev · 2018-03-17T01:18:26Z

lib/Runtime/Library/JavascriptString.cpp

+        return ToCaseCore<toUpper, false>(pThis);
+    }
+
+


nit: extra blank line below this function

dilijev · 2018-03-17T01:24:53Z

lib/Runtime/PlatformAgnostic/Platform/POSIX/UnicodeText.cpp


-        int32 ChangeStringLinguisticCase(CaseFlags caseFlags, const char16* sourceString, uint32 sourceLength, char16* destString, uint32 destLength, ApiError* pErrorOut)
+        template<bool toUpper, bool useInvariant>
+        charcount_t ChangeStringLinguisticCase(const char16* sourceString, charcount_t sourceLength, char16* destString, charcount_t destLength, ApiError* pErrorOut)


Linguistic

How do we handle the langtag for this?

POSIX doesn't have any idea of locale for this implementation, but this is only used in the --no-icu case, where, as far as I know, anything goes in terms of spec compliance. This didn't handle system vs root locale before, either.

dilijev · 2018-03-17T01:25:24Z

lib/Runtime/PlatformAgnostic/Platform/POSIX/UnicodeText.cpp

            }
        }

-        uint32 ChangeStringCaseInPlace(CaseFlags caseFlags, char16* stringToChange, uint32 bufferLength)


ChangeStringCaseInPlace

Is this no longer needed? I think this was an optimization for ASCII strings.

I removed it because we couldn't guarantee in-place conversion for any arbitrary string (any ICU implementation takes a src and dest buffer, and if characters can expand from source to dest, doing conversions in place using the same buffer for both args would produce incorrect results). I suppose I can add this back and fail for non-ASCII?

I think I am still against this function because in reality, this is only called when converting JavascriptStrings, which are immutable and can't be changed in place regardless. So, unless we can come up with a case where we are casing a string from native code that isn't a JavascriptString, I think this function just adds more complexity.

Sounds good to me. If it regresses perf in a scenario we are tracking we can optimize later.

Is it worth submitting an IB run with this change?
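
To make the in-place hazard from this thread concrete, here is a small sketch of what can go wrong when the source and destination of an expanding conversion share one buffer. It is an illustration only, not the removed ChangeStringCaseInPlace code: `ExpandUpper`, `CaseWithSeparateBuffers`, and `CaseInPlace` are hypothetical names, and the only expansion modelled is U+00DF to "SS".

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// U+00DF (sharp s), the single expanding character this sketch models.
constexpr char16_t kEszett = 0x00DF;

// Hypothetical converter: reads srcLen code units from src and writes the
// upper-cased form to dest, which must have room for 2 * srcLen code units.
// Returns the number of code units written.
std::size_t ExpandUpper(const char16_t* src, std::size_t srcLen, char16_t* dest)
{
    std::size_t out = 0;
    for (std::size_t i = 0; i < srcLen; ++i)
    {
        const char16_t c = src[i];
        if (c == kEszett)
        {
            dest[out++] = u'S'; // one input unit becomes two output units
            dest[out++] = u'S';
        }
        else if (c >= u'a' && c <= u'z')
        {
            dest[out++] = static_cast<char16_t>(c - (u'a' - u'A'));
        }
        else
        {
            dest[out++] = c;
        }
    }
    return out;
}

// Separate source and destination buffers: the input is never disturbed.
std::u16string CaseWithSeparateBuffers(const std::u16string& input)
{
    std::u16string dest(input.size() * 2, u'\0');
    const std::size_t written = ExpandUpper(input.data(), input.size(), &dest[0]);
    assert(written <= dest.size());
    dest.resize(written);
    return dest;
}

// Same buffer for both arguments: once a character expands, the write
// position runs ahead of the read position and clobbers unread input.
std::u16string CaseInPlace(const std::u16string& input)
{
    std::u16string buffer = input;
    buffer.resize(input.size() * 2); // enough capacity, so no overflow occurs
    const std::size_t written = ExpandUpper(buffer.data(), input.size(), &buffer[0]);
    buffer.resize(written);
    return buffer;
}
```

For an input of U+00DF followed by "ab", the separate-buffer version yields "SSAB", while the shared-buffer version overwrites the 'a' and 'b' before reading them. ASCII-only input never expands, which is why an in-place fast path would only be safe if it rejected anything else.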

dilijev · 2018-03-17T01:26:44Z