Description
tl;dr
The Encoding.UTF8 singleton currently says "please emit a BOM when writing." This is an anachronism. Nowadays, it should say "please do not emit a BOM when writing."
The Encoding.UTF8 singleton should continue to perform U+FFFD substitution on invalid subsequences, just as it does today.
Discussion
More information: dotnet/standard#260, #7779, with further discussion at #28218
Historically, the Encoding.UTF8 singleton has been equivalent to new UTF8Encoding(encoderShouldEmitUTF8Identifier: true, throwOnInvalidBytes: false). These defaults date from a period when multiple different encodings were commonplace and the world hadn't yet settled on UTF-8 as the de facto standard. Now, 20 years later, UTF-8 has cemented its place as the true winner, and many tools across Unix and Windows natively operate on UTF-8. However, as mentioned in the issues linked above, these tools can fail if they encounter a BOM at the start of the data.
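To make the difference concrete, here is a minimal sketch comparing the singleton as it behaves at the time of this proposal with an instance constructed without the identifier. The class name Utf8SingletonDemo and the sample bytes are illustrative only; the outputs shown reflect the behavior described in this issue, not the behavior after any change.

```csharp
using System;
using System.Text;

class Utf8SingletonDemo
{
    static void Main()
    {
        // The singleton, as described in this issue, advertises the
        // three-byte UTF-8 BOM as its preamble.
        byte[] current = Encoding.UTF8.GetPreamble();
        Console.WriteLine(BitConverter.ToString(current)); // EF-BB-BF

        // An instance constructed without the identifier advertises no preamble.
        var noBom = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: false);
        Console.WriteLine(noBom.GetPreamble().Length); // 0

        // Both instances replace an ill-formed byte with U+FFFD instead of throwing.
        byte[] illFormed = { 0x41, 0xFF, 0x42 };
        Console.WriteLine(Encoding.UTF8.GetString(illFormed) == "A\uFFFDB"); // True
        Console.WriteLine(noBom.GetString(illFormed) == "A\uFFFDB");         // True
    }
}
```

Under the proposal, only the first of these results would change; the U+FFFD substitution would stay as it is.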
The Unicode maintainers have also discussed recommending against the use of BOMs by default unless explicitly required by the protocol or file format.
- https://www.unicode.org/L2/L2021/21038-bom-guidance.pdf (guidance being drafted, but not yet adopted)
- https://corp.unicode.org/pipermail/unicode/2020-October/009070.html (previous discussion on this issue which led to above draft guidance)
- https://corp.unicode.org/pipermail/unicode/2020-June/008713.html (earlier discussion on this issue)
This would be a breaking change. However, it should be an overall net positive for the ecosystem because it would prevent our writers from emitting bytes which many tools do not properly discard upon read. We have a history of making breaking changes in this area for .NET Core to assist with interoperability. For example, we changed Encoding.Default to be UTF-8 without a BOM across all OSes. We also changed UTF8Encoding to be more standards-compliant when it comes to replacing ill-formed input sequences with U+FFFD characters.
Parsers can still opt to honor BOMs at the beginning of files opened for read. Nothing in this proposal discourages readers from parsing the first few bytes and selecting an appropriate Encoding based on that data.
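As a rough illustration of a reader that honors a BOM, the sketch below uses StreamReader with detectEncodingFromByteOrderMarks set to true and a BOM-less UTF-8 fallback. The BomAwareReader class and its ReadAll helper are hypothetical names introduced for this example, not part of the proposal.

```csharp
using System;
using System.IO;
using System.Text;

static class BomAwareReader
{
    // Reads all text from the stream. A leading BOM, if present, selects the
    // encoding and is consumed; otherwise the data is decoded as UTF-8.
    public static string ReadAll(Stream stream)
    {
        using var reader = new StreamReader(
            stream,
            new UTF8Encoding(encoderShouldEmitUTF8Identifier: false),
            detectEncodingFromByteOrderMarks: true);
        return reader.ReadToEnd();
    }

    static void Main()
    {
        byte[] utf8WithBom = { 0xEF, 0xBB, 0xBF, (byte)'h', (byte)'i' };
        byte[] utf8NoBom = { (byte)'h', (byte)'i' };
        byte[] utf16WithBom = { 0xFF, 0xFE, (byte)'h', 0x00, (byte)'i', 0x00 };

        Console.WriteLine(ReadAll(new MemoryStream(utf8WithBom)));  // hi
        Console.WriteLine(ReadAll(new MemoryStream(utf8NoBom)));    // hi
        Console.WriteLine(ReadAll(new MemoryStream(utf16WithBom))); // hi
    }
}
```

All three inputs decode to the same text, which is the point: the reader's BOM handling does not depend on what the Encoding.UTF8 singleton advertises to writers.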
This proposal does not suggest changing the BOM behavior for Encoding.UTF32, Encoding.Unicode, or other built-in singletons. It is useful for writers which query the preamble before writing text to continue emitting a "this data is not UTF-8!" marker at the start of the byte stream. This should help preserve compatibility in the less-common scenarios where people want to continue writing XML files as UTF-16.
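A writer that queries the preamble might look roughly like the sketch below. PreambleWriter and Encode are hypothetical names for illustration; real writers such as StreamWriter have their own logic. The UTF-16 and UTF-32 singletons keep their markers, while a BOM-less UTF-8 instance contributes nothing ahead of the payload.

```csharp
using System;
using System.IO;
using System.Text;

static class PreambleWriter
{
    // Hypothetical writer: emits whatever preamble the encoding advertises,
    // followed by the encoded text.
    public static byte[] Encode(string text, Encoding encoding)
    {
        using var buffer = new MemoryStream();
        byte[] preamble = encoding.GetPreamble();
        buffer.Write(preamble, 0, preamble.Length);
        byte[] payload = encoding.GetBytes(text);
        buffer.Write(payload, 0, payload.Length);
        return buffer.ToArray();
    }

    static void Main()
    {
        // UTF-16 LE: BOM FF FE, then 'A' as 41 00.
        Console.WriteLine(BitConverter.ToString(Encode("A", Encoding.Unicode)));
        // FF-FE-41-00

        // UTF-32 LE: BOM FF FE 00 00, then 'A' as 41 00 00 00.
        Console.WriteLine(BitConverter.ToString(Encode("A", Encoding.UTF32)));
        // FF-FE-00-00-41-00-00-00

        // UTF-8 without the identifier: no marker, just the payload.
        Console.WriteLine(BitConverter.ToString(Encode("A", new UTF8Encoding(false))));
        // 41
    }
}
```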