Data corruption in BinaryReader.ReadChars #30637

@carlossanlop

Description

This issue was first reported internally as a bug in .NET Framework, but it is also reproducible in .NET Core.

UTF-16

When using a 2-byte encoding (e.g. UTF-16) and _stream.Read() returns an odd number of bytes (as can happen with a network stream), then on the next iteration of the loop numBytes can be off by one, and the next read can go past the end of the encoded character data in the stream. This can cause an unexpected end of stream and/or corruption of any data read afterward.
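The hanging-byte behavior can be observed directly with a standalone Decoder, independent of BinaryReader (a minimal sketch; the class and variable names are illustrative):

```csharp
using System;
using System.Text;

class HangingByteDemo
{
    static void Main()
    {
        // "ab" in UTF-16LE is 4 bytes: 61 00 62 00.
        byte[] encoded = Encoding.Unicode.GetBytes("ab");
        Decoder decoder = Encoding.Unicode.GetDecoder();
        char[] chars = new char[2];

        // Simulate a short read: only 3 of the 4 bytes arrive.
        int produced = decoder.GetChars(encoded, 0, 3, chars, 0);
        Console.WriteLine(produced); // 1 -- the decoder buffers the odd third byte

        // The remaining char now needs only ONE more byte, but
        // BinaryReader.ReadChars still computes numBytes = charsRemaining * 2,
        // so its next read can consume a byte that belongs to later data.
        produced += decoder.GetChars(encoded, 3, 1, chars, produced);
        Console.WriteLine(produced == 2 && chars[1] == 'b'); // True
    }
}
```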

The repro below is expected to print True for every check, as it does when a plain MemoryStream is used. However, when the mock NetStream below is used to simulate short reads, data corruption occurs.

Update: It seems that this issue has always been there, so it's not a new regression.

using System;
using System.IO;
using System.Text;

class Program
{
    class NetStream : MemoryStream
    {
        public override int Read(byte[] buffer, int offset, int count)
        {
            return base.Read(buffer, offset, count > 10 ? count - 3 : count);
        }

        public override int Read(Span<byte> destination)
        {
            return base.Read(destination.Length > 10 ? destination.Slice(0, destination.Length - 3) : destination);
        }
    }

    static void Main(string[] args)
    {
        char[] data1 = "hello world".ToCharArray();
        uint data2 = 0xABCDEF01;
        uint data3 = 0;

        using (Stream stream = new NetStream())
        {
            using (BinaryWriter writer = new BinaryWriter(stream, Encoding.Unicode, leaveOpen: true))
            {
                writer.Write(data1);
                writer.Write(data2);
                writer.Write(data3);
            }

            Console.WriteLine(stream.Length == data1.Length * sizeof(char) + 2 * sizeof(uint));

            stream.Seek(0, SeekOrigin.Begin);
            using (BinaryReader reader = new BinaryReader(stream, encoding: Encoding.Unicode, leaveOpen: true))
            {
                char[] data1b = reader.ReadChars(data1.Length);
                Console.WriteLine(data1.AsSpan().SequenceEqual(data1b));
                Console.WriteLine(stream.Position == data1.Length * sizeof(char));

                uint data2b = reader.ReadUInt32();
                Console.WriteLine(data2 == data2b);
                Console.WriteLine(stream.Position == data1.Length * sizeof(char) + sizeof(uint));
            }
        }
    }
}

UTF-8

The code below passes on .NET Framework and .NET Core 2.2, but fails on 3.0, so it's a new regression:

using System;
using System.IO;
using System.Text;

class Program
{
    // version for UTF-8: over-reads if 2 bytes, 3 chars left
    // Fails on .NET Core 3.0, passes on 2.2.
    static void Main(string[] args)
    {
        char[] data1 = "hello world 😃x".ToCharArray(); // 14 code points, 15 chars in UTF-16, 17 bytes in UTF-8
        uint data2 = 0xABCDEF01;
        uint data3 = 0;

        using (Stream stream = new MemoryStream())
        {
            using (BinaryWriter writer = new BinaryWriter(stream, Encoding.UTF8, leaveOpen: true))
            {
                writer.Write(data1);
                writer.Write(data2);
                writer.Write(data3);
            }

            Console.WriteLine(stream.Length == (data1.Length - 1) + 3 + 2 * sizeof(uint));

            stream.Seek(0, SeekOrigin.Begin);
            using (BinaryReader reader = new BinaryReader(stream, encoding: Encoding.UTF8, leaveOpen: true))
            {
                char[] data1b = reader.ReadChars(data1.Length);
                Console.WriteLine(data1.AsSpan().SequenceEqual(data1b));
                Console.WriteLine(stream.Position == (data1.Length - 1) + 3);

                uint data2b = reader.ReadUInt32();
                Console.WriteLine(data2 == data2b);
                Console.WriteLine(stream.Position == (data1.Length - 1) + 3 + sizeof(uint));
            }
        }
    }
}

Remarks

  • Both the 2.2 and the 3.0 versions of BinaryReader fail the UTF-16 repro.
  • The UTF-8 case does not require a short read (i.e. a network stream); it fails deterministically on any kind of underlying stream.
  • It seems that the reason the UTF-8 code fails in 3.0 but passes in 2.2 is that the 2.2 runtime uses a version of BinaryReader that has a special case for handling this edge case for single-byte encodings:

The special case in 2.2: dotnet/coreclr/blob/ce1d090d33b400a25620c0145046471495067cc7/src/mscorlib/src/System/IO/BinaryReader.cs#L377-L386

...
numBytes = charsRemaining;

// special case for DecoderNLS subclasses when there is a hanging byte from the previous loop
DecoderNLS decoder = _decoder as DecoderNLS;
if (decoder != null && decoder.HasState && numBytes > 1)
{
    numBytes -= 1;
}

if (_2BytesPerChar)
...

No special case in 3.0 and 5.0: dotnet/coreclr/blob/9642de76d4f5e563150a90f5923b304d87d7a8b1/src/System.Private.CoreLib/shared/System/IO/BinaryReader.cs#L387-L389

...
numBytes = charsRemaining;

if (_2BytesPerChar)
...
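The byte accounting behind the UTF-8 failure can be reproduced with a bare Decoder (a sketch under the assumptions in the 3.0 code above; the comments describe what ReadChars requests at each step):

```csharp
using System;
using System.Text;

class Utf8OverReadDemo
{
    static void Main()
    {
        // 17 bytes: 12 for "hello world ", 4 for the emoji, 1 for 'x'.
        byte[] payload = Encoding.UTF8.GetBytes("hello world 😃x");
        Decoder decoder = Encoding.UTF8.GetDecoder();
        char[] chars = new char[15];

        // First pass: ReadChars(15) requests numBytes = charsRemaining = 15.
        // Only 12 chars come out; 3 bytes of the emoji stay buffered in the decoder.
        int got = decoder.GetChars(payload, 0, 15, chars, 0);
        Console.WriteLine(got); // 12

        // Second pass: charsRemaining = 3, so 3.0 requests 3 more bytes, but only
        // 2 bytes of character data remain (the last emoji byte plus 'x'); the
        // third byte read would be the first byte of data2 -- the over-read.
        // The 2.2 special case subtracts 1 because the decoder has state, so it
        // requests 2 bytes, which happens to be exactly right here:
        got += decoder.GetChars(payload, 15, 2, chars, got);
        Console.WriteLine(got); // 15
    }
}
```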
