Description
TL;DR
In .NET Core 3.0.100-preview5-011568, during UTF-8 incremental decoding, when an invalid byte is located in one increment but can only be detected during decoding of the subsequent increment, the reported index position of the invalid byte refers to a different byte.
Full story: consider the following input to be decoded:
var bytes = new byte[] { (byte)'x', 0xED, (byte)'y', (byte)'z' };
It contains one invalid byte in position 2 (zero-based index 1) surrounded by ASCII-range bytes. When this input is decoded in a single pass by UTF8Encoding, the encoding provides the correct value to its decoder fallback:
var encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
try
{
    encoding.GetChars(bytes);
}
catch (DecoderFallbackException ex)
{
    Console.WriteLine($"ex.Index: {ex.Index} (should be 1)");
    Console.WriteLine($"first byte in ex.BytesUnknown: {ex.BytesUnknown[0]:X2} (should be ED)");
}
Output:
ex.Index: 1 (should be 1)
first byte in ex.BytesUnknown: ED (should be ED)
Now the same input is decoded in two increments, each two bytes long:
var chars = new char[4];
var decoder = encoding.GetDecoder();
int count;
// Increment 1: bytes { 'x', 0xED }
count = decoder.GetCharCount(bytes, index: 0, count: 2, flush: false);
Console.WriteLine($"char count: {count}");
count = decoder.GetChars(bytes, byteIndex: 0, byteCount: 2, chars, charIndex: 0, flush: false);
Console.WriteLine($"num decoded chars: {count}");
The output is:
char count: 1
num decoded chars: 1
which is correct: although the invalid byte is within the first increment, no fallback can be triggered yet because it may be the start of a valid UTF-8 sequence (byte ED is the lead byte of the three-byte sequences that encode U+D000 through U+D7FF). Therefore, only the character 'x' is decoded in this step.
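To illustrate why the decoder has to hold the ED byte back, here is a minimal sketch of the lead-byte check. `ExpectedSequenceLength` is a hypothetical helper written for this example, not the runtime's implementation, and it ignores the extra range restrictions that apply to the second byte of some sequences.

```csharp
using System;

// Hypothetical helper (an assumption for illustration, not the runtime's code):
// returns how many bytes a well-formed UTF-8 sequence occupies for a given
// lead byte, or 0 if the byte can never start a sequence.
static int ExpectedSequenceLength(byte lead)
{
    if (lead <= 0x7F) return 1;                  // ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  // two-byte lead
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  // three-byte lead (includes ED)
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  // four-byte lead
    return 0;                                    // continuation byte or invalid lead
}

var input = new byte[] { (byte)'x', 0xED, (byte)'y', (byte)'z' };

// Increment 1 covers input[0..2); ED is its last byte.
int needed = ExpectedSequenceLength(input[1]);
int available = 2 - 1; // bytes from the ED byte to the end of increment 1
bool mustWait = needed > 0 && available < needed;

Console.WriteLine($"needed: {needed}, available: {available}, must wait: {mustWait}");
// prints "needed: 3, available: 1, must wait: True"
```

Only when the next increment delivers 'y', which is not a continuation byte, does it become certain that ED was invalid.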
The issue is with Increment 2:
// Increment 2: bytes { 'y', 'z' }
try
{
    count = decoder.GetCharCount(bytes, index: 2, count: 2, flush: true);
}
catch (DecoderFallbackException ex)
{
    Console.WriteLine($"ex.Index: {ex.Index} (should be -1)");
    Console.WriteLine($"first byte in ex.BytesUnknown: {ex.BytesUnknown[0]:X2} (should be ED)");
}
The output is:
ex.Index: 0 (should be -1)
first byte in ex.BytesUnknown: ED (should be ED)
BytesUnknown correctly carries the value of the invalid byte from the previous increment; however, the index value, which is an offset relative to the beginning of the current decoding buffer, incorrectly points to byte 'y'. The correct value of -1 indicates that the error position lies within the previous increment, one byte before the start of the current buffer (i.e. the last byte of the previous increment). This was the behaviour of .NET Core 3.0.0-preview4-27615-11. All other .NET frameworks also report a negative index in such a case.
Consistent index reporting is essential for stateful decoder fallback implementations to be able to distinguish cases like { 'x', 0xED } + { 'y', 'z' } and { 'x', 'y' } + { 0xED, 'z' }.
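As a rough sketch of what such a stateful consumer might do, the buffer-relative index can be added to the stream offset at which the current increment starts. `ToAbsolutePosition` is a hypothetical helper introduced for this example; the index values passed in are the expected ones described above, not values read from a live decoder.

```csharp
using System;

// Hypothetical helper (an assumption for illustration): converts the
// buffer-relative index reported to the fallback into an absolute position
// in the overall byte stream.
static long ToAbsolutePosition(long incrementStart, int relativeIndex)
    => incrementStart + relativeIndex;

// Split A: { 'x', 0xED } + { 'y', 'z' }
// The second increment starts at stream offset 2; the expected index is -1.
long splitA = ToAbsolutePosition(incrementStart: 2, relativeIndex: -1);

// Split B: { 'x', 'y' } + { 0xED, 'z' }
// The second increment also starts at offset 2; the expected index is 0.
long splitB = ToAbsolutePosition(incrementStart: 2, relativeIndex: 0);

Console.WriteLine($"split A: invalid byte at {splitA}"); // prints "split A: invalid byte at 1"
Console.WriteLine($"split B: invalid byte at {splitB}"); // prints "split B: invalid byte at 2"
```

If index 0 is reported for split A as well, both splits map to position 2 and the fallback can no longer tell them apart.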