Fix handling of Windows (WTF-16) and WASI (UTF-8) paths, etc #19005
Motivation
- On Windows, paths/environment variables/command line arguments are arbitrary sequences of u16 (known as WTF-16), which means that unpaired surrogate codepoints (U+D800 to U+DFFF) are allowed. Unpaired surrogate codepoints cannot be encoded as valid UTF-8/UTF-16, meaning that UTF-8/UTF-16 cannot represent all possible paths/environment variables/command line arguments on Windows.
- On other platforms (but not WASI), paths/environment variables/command line arguments are arbitrary sequences of u8 with no particular encoding. Therefore, invalid UTF-8 sequences are allowed, which in turn means that valid UTF-8 cannot represent all possible paths/environment variables/command line arguments.
- On WASI, paths/environment variables/command line arguments are specified to be sequences of Unicode scalar values, meaning that they must be encodable as valid UTF-8/UTF-16. This means that WASI cannot handle all paths/environment variables/command line arguments regardless of the host platform.
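The first two cases can be made concrete with a small sketch. It is written in Python rather than Zig purely for illustration, the helper names are hypothetical, and it assumes Python's strict built-in codecs are a fair stand-in for any conforming UTF-8/UTF-16 encoder or decoder:

```python
# A lone high surrogate: a legal u16 on Windows, but not a Unicode scalar value.
lone_surrogate = "\ud83d"

# A byte string that is a legal POSIX file name but not valid UTF-8
# (0xE9 starts a 3-byte sequence, yet '.' is not a continuation byte).
non_utf8_name = b"caf\xe9.txt"


def encodes_strictly(text: str, encoding: str) -> bool:
    """Return True if `text` can be encoded by the strict codec `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def decodes_strictly(raw: bytes, encoding: str) -> bool:
    """Return True if `raw` is valid input for the strict codec `encoding`."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# The Windows case: the unpaired surrogate is rejected by both strict encoders.
print(encodes_strictly(lone_surrogate, "utf-8"))
print(encodes_strictly(lone_surrogate, "utf-16-le"))

# The "other platforms" case: the byte string is rejected by the UTF-8 decoder.
print(decodes_strictly(non_utf8_name, "utf-8"))
```

All three checks report a failure, which is the gap that any u8-based cross-platform API has to bridge somehow.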
Because Zig has cross-platform APIs that deal with slices of u8, some normalization/conversion has to be done for certain platforms. Up to this point, the status quo of Zig has been:
- error.Unexpected if invalid UTF-8 was attempted to be used (the underlying error is ILSEQ or invalid byte sequence)
Possible solutions
- []u8 and force APIs to always deal with WTF-16 directly on Windows
What is WTF-8?
WTF-8 is a superset of UTF-8 that allows the codepoints U+D800 to U+DFFF (surrogate codepoints) to be encoded using the normal UTF-8 encoding algorithm. Since U+D800 to U+DFFF are the only WTF-16 code units that are normally unrepresentable in UTF-8, this alone is sufficient to be able to losslessly roundtrip from WTF-8 to WTF-16.
Some notes:
- U+D83D U+DCA9 (a high surrogate followed by a low surrogate) was encoded as WTF-8, then when converted to WTF-16 and back to WTF-8 it'd be interpreted as a surrogate pair that encodes the codepoint U+1F4A9, so the final WTF-8 would have the byte sequence for U+1F4A9 rather than U+D83D U+DCA9. As long as all surrogate codepoints in WTF-8 are unpaired, though, WTF-8 <-> WTF-16 roundtripping is guaranteed.
The changes
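As background for the changes listed below, the WTF-8 roundtripping behavior described in the previous section can be sketched in Python. This is an illustration under assumptions only: it uses Python's surrogatepass error handler as a stand-in for a WTF-8 codec, the two helper names are hypothetical, and it is not the std.unicode implementation from this PR.

```python
def wtf16le_to_wtf8(units: bytes) -> bytes:
    """Convert WTF-16LE bytes to WTF-8 (hypothetical helper).

    Valid surrogate pairs are decoded to one codepoint; unpaired surrogates
    are passed through and encoded with the ordinary 3-byte UTF-8 pattern.
    """
    text = units.decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogatepass")


def wtf8_to_wtf16le(wtf8: bytes) -> bytes:
    """Convert WTF-8 bytes to WTF-16LE (hypothetical helper).

    Each encoded surrogate becomes a single u16 code unit.
    """
    text = wtf8.decode("utf-8", "surrogatepass")
    return text.encode("utf-16-le", "surrogatepass")


# Unpaired U+D83D: the WTF-8 <-> WTF-16 roundtrip is lossless.
unpaired = wtf16le_to_wtf8(b"\x3d\xd8")
print(unpaired.hex())  # eda0bd
print(wtf16le_to_wtf8(wtf8_to_wtf16le(unpaired)) == unpaired)  # True

# U+D83D U+DCA9 each encoded separately (not well-formed WTF-8):
# in WTF-16 they sit next to each other and form a surrogate pair,
# so the trip back yields the 4-byte encoding of U+1F4A9 instead.
paired = b"\xed\xa0\xbd" + b"\xed\xb2\xa9"
print(wtf16le_to_wtf8(wtf8_to_wtf16le(paired)).hex())  # f09f92a9
```

The last line is the case the note above warns about: the bytes change across the roundtrip even though the resulting WTF-16 is the same.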
This PR was initially focused solely on handling WTF-16 via WTF-8, but now has a few interconnected changes:
- std.unicode was refactored a bit and function names were made more consistent (e.g. lowercase le changed to the more common Le)
- std.unicode
- ILSEQ errors and returns error.InvalidUtf8 (now a WASI-only error) in that case
- error.InvalidWtf8 (a Windows-only error) if any user-supplied inputs are invalid WTF-8
- NativeUtf8ComponentIterator was previously incorrectly named [by me])
The std.unicode changes in detail
This same information is in one of the commit messages, but:
std.unicode changes
Renamed functions for consistent Le capitalization and conventions:
New UTF related functions:
(the ArrayList functions are mostly to allow the Alloc and AllocZ functions to share an implementation)
New WTF related functions/structs:
Notes/concerns
- InvalidUtf8 has gone from a Windows-only error to a WASI-only error in many places. This may lead to bugs at existing callsites since it won't appear as a breaking change.
- std.unicode implementation. This means it is up to the user to be aware of WTF-8 well-formedness and maintain that property themselves (see the spec section on concatenation for what this means in practice) if they care about the roundtripping property. Note, however, that when converting to WTF-16, paired surrogates in WTF-8 are interpreted as a surrogate pair, so non-well-formed WTF-8 will get interpreted as if it were concatenated according to the spec in the process of being converted to WTF-16.
- u8 encoding of WTF-16, well-formedness of the WTF-8 doesn't matter too much since it has to be mapped to WTF-16 before it can be used in syscalls.
- std.fs.path.fmtAsUtf8Lossy and std.fs.path.fmtWtf16LeAsUtf8Lossy for any use cases where the paths being printed should definitely be represented as valid UTF-8, with unrepresentable sequences replaced by �.
Closes #18694
Closes #1774
Closes #2565