.NET Core 3.0 follows Unicode best practices when replacing ill-formed UTF-8 byte sequences #13547

@GrabYourPitchforks

See .NET Core 3.0 follows Unicode best practices when replacing ill-formed UTF-8 byte sequences for updated documentation for this change.

When the UTF8Encoding class encounters an ill-formed UTF-8 byte sequence during a bytes-to-chars transcoding operation, it will replace that sequence with a '�' (U+FFFD REPLACEMENT CHARACTER) character in the output string. .NET Core 3.0 differs from previous versions of .NET Core and the .NET Framework in that .NET Core 3.0 follows the Unicode best practice for performing this replacement during the transcoding operation.

Version introduced

3.0

Change description

When transcoding bytes to chars, the UTF8Encoding class now performs character substitution based on Unicode best practices. The substitution mechanism used is described by The Unicode Standard, Version 12.0, Sec. 3.9 (PDF) in the heading titled U+FFFD Substitution of Maximal Subparts.

This behavior applies only when the input byte sequence contains ill-formed UTF-8 data. Additionally, if the UTF8Encoding instance was constructed with throwOnInvalidBytes: true (see the constructor documentation), it continues to throw on invalid input rather than perform U+FFFD replacement.
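The throwing configuration mentioned above can be sketched as follows. This is a minimal illustration, not part of the original report: the class name `Program` and the variable names are placeholders, and the byte sequence is borrowed from the example later in this issue.

```csharp
using System;
using System.Text;

class Program
{
    static void Main()
    {
        // Ill-formed UTF-8: ED A0 90 would encode a UTF-16 surrogate, which UTF-8 forbids.
        byte[] illFormed = { 0xED, 0xA0, 0x90 };

        // throwOnInvalidBytes: true selects exception fallback instead of U+FFFD replacement.
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

        try
        {
            strict.GetString(illFormed);
            Console.WriteLine("Decoded without error (not expected for ill-formed input).");
        }
        catch (DecoderFallbackException ex)
        {
            // DecoderFallbackException derives from ArgumentException.
            Console.WriteLine($"Rejected ill-formed input: {ex.Message}");
        }
    }
}
```

Code that needs to detect bad input rather than silently substitute U+FFFD can use an instance configured this way; according to the description above, its behavior is unaffected by this change.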

Old behavior

Input (3 bytes, ill-formed): [ ED A0 90 ]
Output (2 chars): [ FFFD FFFD ]

New behavior

Input (3 bytes, ill-formed): [ ED A0 90 ]
Output (3 chars): [ FFFD FFFD FFFD ]

(This 3-char output is the preferred output per Table 3-9 of the previously linked Unicode Standard PDF.)
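To check which behavior a given runtime exhibits, the example above can be reproduced with a short console program. This is a hedged sketch; the class name `Program` and the variable names are illustrative only.

```csharp
using System;
using System.Linq;
using System.Text;

class Program
{
    static void Main()
    {
        // The ill-formed 3-byte sequence from the example above.
        byte[] input = { 0xED, 0xA0, 0x90 };

        // Encoding.UTF8 uses replacement fallback, so ill-formed input becomes U+FFFD
        // instead of raising an exception.
        string decoded = Encoding.UTF8.GetString(input);

        // Print each UTF-16 code unit as four hex digits, followed by the char count.
        Console.WriteLine(string.Join(" ", decoded.Select(c => ((int)c).ToString("X4"))));
        Console.WriteLine($"Char count: {decoded.Length}");
    }
}
```

Per the description in this issue, .NET Core 3.0 should print three FFFD code units (one for each maximal subpart: ED, A0, and 90), while earlier versions of .NET Core and the .NET Framework should print two.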

Reason for change

This is part of a larger effort to improve UTF-8 handling throughout .NET, which also includes the new System.Text.Unicode.Utf8 and System.Text.Rune types. The UTF8Encoding type was given improved error-handling mechanics so that it produces output consistent with the newly introduced types.

Recommended action

No action is required on the part of the developer.

Category

Core

Affected APIs


Issue metadata

  • Issue type: breaking-change
