Data corruption in BinaryReader.ReadChars #30637

@carlossanlop

Description

This issue was first reported internally as a bug in .NET Framework, but it is also reproducible in .NET Core.

UTF-16

When using a 2-byte encoding (e.g. UTF-16) and _stream.Read() returns an odd number of bytes (as can happen with a network stream), then on the next iteration of the loop numBytes can be off by one, and the next read can go past the end of the encoded character data in the stream. This can cause an unexpected end of stream and/or corruption of any data read afterward.
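The hanging-byte behavior can be observed directly with a standalone Decoder, independent of BinaryReader (a minimal sketch; the class and variable names are illustrative):

```csharp
using System;
using System.Text;

class HangingByteDemo
{
    static void Main()
    {
        // "ab" in UTF-16LE is 4 bytes: 61 00 62 00.
        byte[] encoded = Encoding.Unicode.GetBytes("ab");
        Decoder decoder = Encoding.Unicode.GetDecoder();
        char[] chars = new char[2];

        // Simulate a short read: only 3 of the 4 bytes arrive.
        int produced = decoder.GetChars(encoded, 0, 3, chars, 0);
        Console.WriteLine(produced); // 1 -- the decoder buffers the odd third byte

        // The remaining char now needs only ONE more byte, but
        // BinaryReader.ReadChars still computes numBytes = charsRemaining * 2,
        // so its next read can consume a byte that belongs to later data.
        produced += decoder.GetChars(encoded, 3, 1, chars, produced);
        Console.WriteLine(produced == 2 && chars[1] == 'b'); // True
    }
}
```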

The repro below is expected to print True for every check, as it does when a plain MemoryStream is used. However, when the mock NetStream below is used to simulate short reads, data corruption occurs.

Update: It seems that this issue has always been there, so it's not a new regression.

using System;
using System.IO;
using System.Text;

class Program
{
    class NetStream : MemoryStream
    {
        public override int Read(byte[] buffer, int offset, int count)
        {
            return base.Read(buffer, offset, count > 10 ? count - 3 : count);
        }

        public override int Read(Span<byte> destination)
        {
            return base.Read(destination.Length > 10 ? destination.Slice(0, destination.Length - 3) : destination);
        }
    }

    static void Main(string[] args)
    {
        char[] data1 = "hello world".ToCharArray();
        uint data2 = 0xABCDEF01;
        uint data3 = 0;

        using (Stream stream = new NetStream())
        {
            using (BinaryWriter writer = new BinaryWriter(stream, Encoding.Unicode, leaveOpen: true))
            {
                writer.Write(data1);
                writer.Write(data2);
                writer.Write(data3);
            }

            Console.WriteLine(stream.Length == data1.Length * sizeof(char) + 2 * sizeof(uint));

            stream.Seek(0, SeekOrigin.Begin);
            using (BinaryReader reader = new BinaryReader(stream, encoding: Encoding.Unicode, leaveOpen: true))
            {
                char[] data1b = reader.ReadChars(data1.Length);
                Console.WriteLine(data1.AsSpan().SequenceEqual(data1b));
                Console.WriteLine(stream.Position == data1.Length * sizeof(char));

                uint data2b = reader.ReadUInt32();
                Console.WriteLine(data2 == data2b);
                Console.WriteLine(stream.Position == data1.Length * sizeof(char) + sizeof(uint));
            }
        }
    }
}

UTF-8

The code below passes on .NET Framework and .NET Core 2.2, but fails on 3.0, so it's a new regression:

using System;
using System.IO;
using System.Text;

class Program
{
    // version for UTF-8: over-reads if 2 bytes, 3 chars left
    // Fails on .NET Core 3.0, passes on 2.2.
    static void Main(string[] args)
    {
        char[] data1 = "hello world 😃x".ToCharArray(); // 14 code points, 15 chars in UTF-16, 17 bytes in UTF-8
        uint data2 = 0xABCDEF01;
        uint data3 = 0;

        using (Stream stream = new MemoryStream())
        {
            using (BinaryWriter writer = new BinaryWriter(stream, Encoding.UTF8, leaveOpen: true))
            {
                writer.Write(data1);
                writer.Write(data2);
                writer.Write(data3);
            }

            Console.WriteLine(stream.Length == (data1.Length - 1) + 3 + 2 * sizeof(uint));

            stream.Seek(0, SeekOrigin.Begin);
            using (BinaryReader reader = new BinaryReader(stream, encoding: Encoding.UTF8, leaveOpen: true))
            {
                char[] data1b = reader.ReadChars(data1.Length);
                Console.WriteLine(data1.AsSpan().SequenceEqual(data1b));
                Console.WriteLine(stream.Position == (data1.Length - 1) + 3);

                uint data2b = reader.ReadUInt32();
                Console.WriteLine(data2 == data2b);
                Console.WriteLine(stream.Position == (data1.Length - 1) + 3 + sizeof(uint));
            }
        }
    }
}

Remarks

  • Both the 2.2 and the 3.0 versions of BinaryReader fail the UTF-16 repro.
  • The UTF-8 case does not require a short read (i.e. a network stream); it fails deterministically on any kind of underlying stream.
  • It seems that the reason the UTF-8 code fails in 3.0 but passes in 2.2 is that the 2.2 runtime uses a version of BinaryReader that has a special case for handling this edge case for single-byte encodings:

The special case in 2.2: dotnet/coreclr/blob/ce1d090d33b400a25620c0145046471495067cc7/src/mscorlib/src/System/IO/BinaryReader.cs#L377-L386

...
numBytes = charsRemaining;

// special case for DecoderNLS subclasses when there is a hanging byte from the previous loop
DecoderNLS decoder = _decoder as DecoderNLS;
if (decoder != null && decoder.HasState && numBytes > 1)
{
    numBytes -= 1;
}

if (_2BytesPerChar)
...

No special case in 3.0 and 5.0: dotnet/coreclr/blob/9642de76d4f5e563150a90f5923b304d87d7a8b1/src/System.Private.CoreLib/shared/System/IO/BinaryReader.cs#L387-L389

...
numBytes = charsRemaining;

if (_2BytesPerChar)
...
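The byte accounting behind the UTF-8 failure can be reproduced with a bare Decoder (a sketch under the assumptions in the 3.0 code above; the comments describe what ReadChars requests at each step):

```csharp
using System;
using System.Text;

class Utf8OverReadDemo
{
    static void Main()
    {
        // 17 bytes: 12 for "hello world ", 4 for the emoji, 1 for 'x'.
        byte[] payload = Encoding.UTF8.GetBytes("hello world 😃x");
        Decoder decoder = Encoding.UTF8.GetDecoder();
        char[] chars = new char[15];

        // First pass: ReadChars(15) requests numBytes = charsRemaining = 15.
        // Only 12 chars come out; 3 bytes of the emoji stay buffered in the decoder.
        int got = decoder.GetChars(payload, 0, 15, chars, 0);
        Console.WriteLine(got); // 12

        // Second pass: charsRemaining = 3, so 3.0 requests 3 more bytes, but only
        // 2 bytes of character data remain (the last emoji byte plus 'x'); the
        // third byte read would be the first byte of data2 -- the over-read.
        // The 2.2 special case subtracts 1 because the decoder has state, so it
        // requests 2 bytes, which happens to be exactly right here:
        got += decoder.GetChars(payload, 15, 2, chars, got);
        Console.WriteLine(got); // 15
    }
}
```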
