Description
The Native SDK uses narrow UTF-8 encoding throughout the code base as the canonical string encoding.
The only exception to this rule is strings representing paths on Windows. These use wide characters, whose rendering depends on the system or application code-page settings once they reach our output boundary (i.e., when writing to files or the console). Windows paths were handled separately because the Win32 APIs use this encoding natively.
By making wide chars the canonical path encoding on Windows, we avoid converting back and forth. However, for this to work we have to use the %ls or %S format specifiers wherever we reach an output boundary. Those specifiers require the application to configure the CRT locale or the console code page (to match the system settings); otherwise, non-ASCII characters in paths produce encoding errors.
This became evident in a recent issue (#1388). At its core, that issue was about a different topic related to path encoding on Windows, but it also showed that our logging fails to render paths containing Cyrillic characters correctly when the application does not maintain correct locale and console code-page settings. The problem, however, goes beyond logging, because we use the same mechanism for serialization.
Let's make narrow UTF-8 the canonical encoding on Windows too, to eliminate any platform-specific issues in the output. This would mean:
- eliminate all uses of %S/%ls format specifiers in the code base
- use a UTF-8 char* as the internal path representation, like we do on all other platforms
- introduce wide-char conversion where it is necessary at the boundary
- provide a cached accessor for the Windows code base so that we do not have to convert back to wide char on every Win32 or public interface boundary
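
To make the last three points more concrete, here is a minimal sketch of what a UTF-8 canonical path with a lazily cached wide copy could look like. Everything in it is hypothetical and not taken from the SDK: the `demo_path_t` type, the `demo_path_*` functions, and the hand-written `utf8_to_utf16` decoder, which assumes well-formed input and only exists to keep the example portable. On Windows the conversion would presumably go through `MultiByteToWideChar` with `CP_UTF8` and yield a `wchar_t*`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical path type: UTF-8 is canonical, UTF-16 is a lazily built cache. */
typedef struct {
    char *utf8;     /* canonical representation, owned */
    uint16_t *wide; /* cached UTF-16 copy, NULL until first requested */
} demo_path_t;

/* Portable stand-in for MultiByteToWideChar(CP_UTF8, ...).
 * Assumes well-formed UTF-8; returns a NUL-terminated UTF-16 buffer. */
static uint16_t *
utf8_to_utf16(const char *s)
{
    /* UTF-16 never needs more code units than UTF-8 needs bytes. */
    uint16_t *out = malloc((strlen(s) + 1) * sizeof(uint16_t));
    if (!out) {
        return NULL;
    }
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;
    while (*p) {
        uint32_t cp;
        int extra; /* number of continuation bytes that follow the lead byte */
        if (*p < 0x80) { cp = *p; extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else { cp = *p & 0x07; extra = 3; }
        p++;
        while (extra-- > 0 && *p) {
            cp = (cp << 6) | (*p++ & 0x3F);
        }
        if (cp >= 0x10000) { /* outside the BMP: emit a surrogate pair */
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            out[n++] = (uint16_t)cp;
        }
    }
    out[n] = 0;
    return out;
}

static demo_path_t *
demo_path_new(const char *utf8)
{
    demo_path_t *path = calloc(1, sizeof(*path));
    if (!path) {
        return NULL;
    }
    size_t len = strlen(utf8);
    path->utf8 = malloc(len + 1);
    if (!path->utf8) {
        free(path);
        return NULL;
    }
    memcpy(path->utf8, utf8, len + 1);
    return path;
}

/* Cached accessor: converts on first use, then returns the same buffer. */
static const uint16_t *
demo_path_wide(demo_path_t *path)
{
    assert(path != NULL);
    if (!path->wide) {
        path->wide = utf8_to_utf16(path->utf8);
    }
    return path->wide;
}

/* Output boundary: the UTF-8 bytes pass through plain %s, no %ls or %S. */
static int
demo_path_describe(const demo_path_t *path, char *buf, size_t size)
{
    return snprintf(buf, size, "path: %s", path->utf8);
}

static void
demo_path_free(demo_path_t *path)
{
    if (!path) {
        return;
    }
    free(path->wide);
    free(path->utf8);
    free(path);
}
```

In this sketch, logging and serialization only ever see `path->utf8`, so their output does not depend on the CRT locale or console code page. Only code that calls into Win32 asks for the wide form, and repeated calls reuse the cached buffer. A real implementation would also need to invalidate the cache whenever the UTF-8 string is modified.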