UTF8Encoding reports incorrect error position during incremental decoding

TL;DR
In .NET Core 3.0.100-preview5-011568, during UTF-8 incremental decoding, when an invalid byte is located in one increment but can only be detected during decoding of the subsequent increment, the reported index position of the invalid byte refers to a different byte.

Full story: consider the following input to be decoded:

```csharp
var bytes = new byte[] { (byte)'x', 0xED, (byte)'y', (byte)'z'};
```

It contains one invalid byte in position 2 (zero-based index 1), surrounded by ASCII-range bytes. When this input is decoded in a single pass by `UTF8Encoding`, the encoding provides the correct values to its decoder fallback:

```csharp
var encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

try
{
    encoding.GetChars(bytes);
}
catch (DecoderFallbackException ex)
{
    Console.WriteLine($"ex.Index: {ex.Index} (should be 1)");
    Console.WriteLine($"first byte in ex.BytesUnknown: {ex.BytesUnknown[0]:X2} (should be ED)");
}

```

Output:

```
ex.Index: 1 (should be 1)
first byte in ex.BytesUnknown: ED (should be ED)
```

Now decode the same input in two increments, each two bytes long:

```csharp
var chars = new char[4];
var decoder = encoding.GetDecoder();
int count;

// Increment 1: bytes { 'x', 0xED }
count = decoder.GetCharCount(bytes, index: 0, count: 2, flush: false);
Console.WriteLine($"char count: {count}");
count = decoder.GetChars(bytes, byteIndex: 0, byteCount: 2, chars, charIndex: 0, flush: false);
Console.WriteLine($"num decoded chars: {count}");
```

The output is:

```
char count: 1
num decoded chars: 1
```

which is correct: although the invalid byte is within the first increment, no fallback can be triggered yet because it may still turn out to be the start of a valid UTF-8 sequence (byte ED is the lead byte of a three-byte sequence, valid when followed by a continuation byte in the range 80-9F and one more continuation byte). Therefore, only the character 'x' is decoded in this step.
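To illustrate why the decoder has to wait, here is a minimal sketch in which the same first increment is followed by two continuation bytes instead of `'y', 'z'`. The byte values `0x9F, 0xBF` are chosen for illustration and are not part of the original repro; together with `0xED` they form the valid encoding of U+D7FF.

```csharp
using System;
using System.Text;

var encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
var decoder = encoding.GetDecoder();
var chars = new char[2];

// Same first increment as in the repro: 'x' followed by the lead byte 0xED.
var first = new byte[] { (byte)'x', 0xED };
// Illustrative second increment: two continuation bytes that complete the sequence.
var second = new byte[] { 0x9F, 0xBF };

int n = decoder.GetChars(first, 0, first.Length, chars, 0, flush: false);
n += decoder.GetChars(second, 0, second.Length, chars, n, flush: true);

// No exception is thrown; ED 9F BF decodes to a single char.
Console.WriteLine($"{n} chars, last is U+{(int)chars[1]:X4}"); // 2 chars, last is U+D7FF
```

So after the first increment the decoder cannot yet know whether `0xED` is invalid; that only becomes clear once it sees the first byte of the next increment.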

The issue is with Increment 2:

```csharp
// Increment 2: bytes { 'y', 'z' }
try
{
    count = decoder.GetCharCount(bytes, index: 2, count: 2, flush: true);
}
catch (DecoderFallbackException ex)
{
    Console.WriteLine($"ex.Index: {ex.Index} (should be -1)");
    Console.WriteLine($"first byte in ex.BytesUnknown: {ex.BytesUnknown[0]:X2} (should be ED)");
}
```

The output is:

```
ex.Index: 0 (should be -1)
first byte in ex.BytesUnknown: ED (should be ED)
```
`BytesUnknown` correctly carries the value of the invalid byte from the previous increment. However, the index value, which is an offset relative to the beginning of the decoding buffer, incorrectly points to byte 'y'. The correct value of -1 indicates that the error position lies within the previous increment, one byte back from its end (i.e. the last byte of the previous increment). This was the behaviour of .NET Core 3.0.0-preview4-27615-11. All other .NET frameworks also report a negative index in this case.

Consistent index reporting is essential for stateful decoder fallback implementations to be able to distinguish cases like `{ 'x', 0xED } + { 'y', 'z' }` and `{ 'x', 'y' } + { 0xED, 'z' }`.
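As a rough sketch of the kind of stateful fallback meant here, the following converts the buffer-relative index into an absolute stream offset. `PositionTrackingFallback`, `IncrementStart`, `ErrorPositions` and `IncrementalDecodingDemo.FindErrors` are hypothetical names invented for this example; only `DecoderFallback`, `DecoderFallbackBuffer`, `Encoding.GetEncoding` and `Decoder.GetChars` are framework APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical fallback: records the absolute offset of each invalid byte sequence.
public sealed class PositionTrackingFallback : DecoderFallback
{
    // Absolute offset of the first byte of the increment being decoded (set by the caller).
    public long IncrementStart { get; set; }
    public List<long> ErrorPositions { get; } = new List<long>();

    public override int MaxCharCount => 1;
    public override DecoderFallbackBuffer CreateFallbackBuffer() => new TrackingBuffer(this);

    private sealed class TrackingBuffer : DecoderFallbackBuffer
    {
        private readonly PositionTrackingFallback owner;
        private int pos = -1; // -1: idle, 0: replacement char pending, 1: already returned

        public TrackingBuffer(PositionTrackingFallback owner) => this.owner = owner;

        public override bool Fallback(byte[] bytesUnknown, int index)
        {
            // index is relative to the current buffer; a negative value
            // refers to bytes carried over from the previous increment.
            owner.ErrorPositions.Add(owner.IncrementStart + index);
            pos = 0;
            return true;
        }

        public override char GetNextChar()
        {
            if (pos != 0) return '\0';
            pos = 1;
            return '\uFFFD';
        }

        public override bool MovePrevious()
        {
            if (pos != 1) return false;
            pos = 0;
            return true;
        }

        public override int Remaining => pos == 0 ? 1 : 0;
        public override void Reset() => pos = -1;
    }
}

public static class IncrementalDecodingDemo
{
    // Decodes the increments in order and returns the recorded error offsets.
    // Only GetChars is called, so each error is recorded once.
    public static List<long> FindErrors(params byte[][] increments)
    {
        var fallback = new PositionTrackingFallback();
        var encoding = Encoding.GetEncoding("utf-8", EncoderFallback.ExceptionFallback, fallback);
        var decoder = encoding.GetDecoder();
        for (int i = 0; i < increments.Length; i++)
        {
            byte[] chunk = increments[i];
            var chars = new char[encoding.GetMaxCharCount(chunk.Length)];
            decoder.GetChars(chunk, 0, chunk.Length, chars, 0, flush: i == increments.Length - 1);
            fallback.IncrementStart += chunk.Length;
        }
        return fallback.ErrorPositions;
    }
}
```

Calling `IncrementalDecodingDemo.FindErrors(new byte[] { (byte)'x', 0xED }, new byte[] { (byte)'y', (byte)'z' })` should record offset 1 on a runtime that reports index -1, while the split `{ 'x', 'y' } + { 0xED, 'z' }` records offset 2. With the index of 0 described in this report, the first call would also record 2, so the two inputs could no longer be told apart.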

cc @GrabYourPitchforks

UTF8Encoding reports incorrect error position during incremental decoding #29674