Breaking change proposal: Encoding.UTF8 singleton should not have a BOM

## tl;dr

The `Encoding.UTF8` singleton currently says "please emit a BOM when writing." This is an anachronism. Nowadays, it should say "please _do not_ emit a BOM when writing."

The `Encoding.UTF8` singleton should continue to perform U+FFFD substitution on invalid subsequences, just as it does today.
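For readers less familiar with the substitution behavior, here is a minimal sketch of lenient versus strict decoding. The `ReplacementDemo` class and its method names are hypothetical and exist only for illustration; the lenient path uses the same `throwOnInvalidBytes: false` setting the singleton has today.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of the proposal.
public static class ReplacementDemo
{
    // Lenient decoding: each ill-formed subsequence becomes U+FFFD.
    public static string DecodeLenient(byte[] bytes)
    {
        var lenient = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: false);
        return lenient.GetString(bytes);
    }

    // Strict decoding: returns true if the input is rejected as ill-formed.
    public static bool StrictDecodingThrows(byte[] bytes)
    {
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
        try
        {
            strict.GetString(bytes);
            return false;
        }
        catch (DecoderFallbackException)
        {
            return true;
        }
    }
}
```

With this sketch, `ReplacementDemo.DecodeLenient(new byte[] { 0x41, 0xFF, 0x42 })` yields `"A\uFFFDB"`, since `0xFF` can never appear in well-formed UTF-8.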

## Discussion

More information: https://github.com/dotnet/standard/issues/260, https://github.com/dotnet/runtime/issues/7779, with further discussion at https://github.com/dotnet/runtime/issues/28218

The `Encoding.UTF8` singleton has always been equivalent to `new UTF8Encoding(encoderShouldEmitUTF8Identifier: true, throwOnInvalidBytes: false)`. The BOM default dates from when these types were introduced, a period when multiple different encodings were commonplace and the world hadn't yet settled on UTF-8 as the de facto standard. Now, 20 years later, UTF-8 has cemented its place as the winner, and many tools across Unix and Windows natively operate on UTF-8. But as mentioned in the issues linked above, these tools can fail if they encounter a BOM at the start of the data.
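As a rough illustration of where the BOM comes from in practice, the sketch below writes text through a `StreamWriter` and returns the raw bytes. `PreambleDemo` and `WriteWith` are hypothetical names for this example. Note that `GetBytes` on its own never includes the preamble; it is writers such as `StreamWriter` that query it and emit it at the start of a stream.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration only.
public static class PreambleDemo
{
    // Writes `text` through a StreamWriter and returns the bytes that reached the stream.
    public static byte[] WriteWith(Encoding encoding, string text)
    {
        using var stream = new MemoryStream();
        using (var writer = new StreamWriter(stream, encoding, bufferSize: 1024, leaveOpen: true))
        {
            // StreamWriter emits encoding.GetPreamble() first when the stream is at position 0.
            writer.Write(text);
        }
        return stream.ToArray();
    }
}
```

With a BOM-emitting encoding such as `new UTF8Encoding(true, false)` (what `Encoding.UTF8` is equivalent to today), `PreambleDemo.WriteWith(encoding, "hi")` produces `EF BB BF 68 69`. With `new UTF8Encoding(false)` it produces just `68 69`, which is what this proposal would make the singleton's behavior.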

The Unicode maintainers have also discussed recommending _against_ the use of BOMs by default unless explicitly required by the protocol or file format.

* https://www.unicode.org/L2/L2021/21038-bom-guidance.pdf (guidance being drafted, but not yet adopted)
* https://corp.unicode.org/pipermail/unicode/2020-October/009070.html (previous discussion on this issue which led to above draft guidance)
* https://corp.unicode.org/pipermail/unicode/2020-June/008713.html (earlier discussion on this issue)

This would be a breaking change. However, this breaking change should be an overall net positive for the ecosystem because it would prevent our writers from emitting bytes which many tools do not properly discard upon read. We have a history of making breaking changes in this area for .NET Core to assist with interoperability. For example, we changed [`Encoding.Default` to be UTF-8 w/o BOM](https://github.com/dotnet/runtime/issues/7779) across all OSes. We also changed `UTF8Encoding` [to be more standards-compliant](https://github.com/dotnet/docs/issues/13547) when it comes to replacing ill-formed input sequences with U+FFFD chars.

Parsers can still opt to honor BOMs at the beginning of files opened for read. Nothing in this proposal discourages readers from parsing the first few bytes and selecting an appropriate `Encoding` based on that data.
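A reader that wants to keep honoring BOMs could do something along these lines. This is only a sketch: `BomSniffer` and its members are hypothetical, and falling back to BOM-less UTF-8 when no BOM is found is an assumption of this example, not something the proposal mandates.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class BomSniffer
{
    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    // Picks an encoding from a leading BOM and reports how many bytes the BOM occupies.
    public static Encoding Detect(byte[] data, out int bomLength)
    {
        // UTF-32 LE is checked before UTF-16 LE because its BOM also starts with FF FE.
        // (UTF-16 LE text whose first character is U+0000 is ambiguous with this check.)
        if (StartsWith(data, 0xFF, 0xFE, 0x00, 0x00)) { bomLength = 4; return Encoding.UTF32; }
        if (StartsWith(data, 0x00, 0x00, 0xFE, 0xFF)) { bomLength = 4; return new UTF32Encoding(bigEndian: true, byteOrderMark: true); }
        if (StartsWith(data, 0xEF, 0xBB, 0xBF)) { bomLength = 3; return new UTF8Encoding(false); }
        if (StartsWith(data, 0xFF, 0xFE)) { bomLength = 2; return Encoding.Unicode; }
        if (StartsWith(data, 0xFE, 0xFF)) { bomLength = 2; return Encoding.BigEndianUnicode; }

        // Assumption of this sketch: no BOM means UTF-8.
        bomLength = 0;
        return new UTF8Encoding(false);
    }

    // Decodes the payload after skipping any BOM.
    public static string Decode(byte[] data)
    {
        Encoding encoding = Detect(data, out int bomLength);
        return encoding.GetString(data, bomLength, data.Length - bomLength);
    }
}
```

For stream-based code, the existing `StreamReader` constructors that take `detectEncodingFromByteOrderMarks` already provide similar detection, so a hand-rolled sniffer like this is mainly useful when working with raw byte buffers.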

This proposal _does not_ suggest changing the BOM behavior for `Encoding.UTF32`, `Encoding.Unicode`, or other built-in singletons. It is useful for writers which query the preamble before writing text to continue to emit a "this data is not UTF-8!" marker at the start of the bytestream. This should help preserve compatibility in the less-common scenarios where people want to continue writing XML files as UTF-16.
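To show what "query the preamble before writing" means, here is a minimal sketch of such a writer. `PreambleWriter` and `EncodeWithPreamble` are hypothetical names; the only real APIs used are `Encoding.GetPreamble` and `Encoding.GetBytes`.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class PreambleWriter
{
    // Returns the encoding's preamble (possibly empty) followed by the encoded text.
    public static byte[] EncodeWithPreamble(Encoding encoding, string text)
    {
        byte[] preamble = encoding.GetPreamble();
        byte[] payload = encoding.GetBytes(text);

        var result = new byte[preamble.Length + payload.Length];
        preamble.CopyTo(result, 0);
        payload.CopyTo(result, preamble.Length);
        return result;
    }
}
```

Under this proposal, `PreambleWriter.EncodeWithPreamble(Encoding.Unicode, "hi")` would keep producing `FF FE 68 00 69 00`, so a UTF-16 file still announces itself, while a BOM-less UTF-8 encoding contributes an empty preamble and the output is just the encoded text.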
