Fix XmlReader ArgumentOutOfRangeException for malformed UTF-8 in XML … by lufen · Pull Request #124428 · dotnet/runtime

lufen · 2026-02-14T19:38:01Z

Fix XmlReader ArgumentOutOfRangeException for malformed UTF-8 in XML declaration
When SafeAsciiDecoder is used during initial XML declaration parsing, it maps
each byte to a char 1:1, including bytes >0x7F. UnDecodeChars() previously
used Encoding.GetByteCount() on these chars to compute bytePos, but non-ASCII
chars like U+00BF (from byte 0xBF) produce 2 bytes in UTF-8, inflating bytePos
beyond bytesUsed and causing ArgumentOutOfRangeException in BlockCopy.

Fix: detect SafeAsciiDecoder and use charPos directly (correct 1:1 mapping),
then validate that the target encoding byte count matches. If mismatched,
throw XmlException with Xml_InvalidCharInThisEncoding, detecting invalid bytes
before they corrupt ParsingState.

Fixes #113061

…declaration When SafeAsciiDecoder is used during initial XML declaration parsing, it maps each byte to a char 1:1, including bytes >0x7F. UnDecodeChars() previously used Encoding.GetByteCount() on these chars to compute bytePos, but non-ASCII chars like U+00BF (from byte 0xBF) produce 2 bytes in UTF-8, inflating bytePos beyond bytesUsed and causing ArgumentOutOfRangeException in BlockCopy. Fix: detect SafeAsciiDecoder and use charPos directly (correct 1:1 mapping), then validate that the target encoding byte count matches. If mismatched, throw XmlException with Xml_InvalidCharInThisEncoding, detecting invalid bytes before they corrupt ParsingState. Fixes dotnet#113061 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

dotnet-policy-service · 2026-02-14T19:39:10Z

Tagging subscribers to this area: @dotnet/area-system-xml
See info in area-owners.md if you want to be subscribed.

Copilot

Pull request overview

This PR fixes a bug where malformed UTF-8 bytes in XML declarations caused ArgumentOutOfRangeException instead of the expected XmlException. The issue occurred because SafeAsciiDecoder (used during initial XML parsing) maps each byte to a char 1:1, including bytes >0x7F. When UnDecodeChars() used Encoding.GetByteCount() to compute byte positions, non-ASCII chars like U+00BF (from byte 0xBF) would produce 2 bytes in UTF-8, inflating the byte position beyond valid bounds.

Changes:

Added detection for SafeAsciiDecoder in UnDecodeChars() to use the correct 1:1 byte-to-char mapping
Added validation that target encoding byte count matches character count, throwing XmlException for invalid bytes
Added regression test to verify malformed UTF-8 in XML declarations throws XmlException instead of ArgumentOutOfRangeException

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
src/libraries/System.Private.Xml/src/System/Xml/Core/XmlTextReaderImpl.cs	Added SafeAsciiDecoder-specific handling in UnDecodeChars() to correctly compute byte positions and validate encoding compatibility
src/libraries/System.Private.Xml/tests/XmlReader/Tests/ReaderEncodingTests.cs	Added regression test for #113061 verifying malformed UTF-8 byte 0xBF in XML declaration throws XmlException

stephentoub · 2026-03-04T14:36:18Z

🤖 Copilot Code Review — PR #124428

Holistic Assessment

Motivation: Legitimate bug fix for a confirmed regression (#113061) found by fuzzing. When XmlReader.Create(stream) encounters non-ASCII bytes (0x80–0xFF) inside an XML declaration on a UTF-8 stream, UnDecodeChars() computes an inflated bytePos that exceeds bytesUsed, causing ArgumentOutOfRangeException in BlockCopy instead of the expected XmlException. The issue is real and the root cause analysis is correct.

Approach: The fix correctly identifies the mismatch between SafeAsciiDecoder's 1:1 byte-to-char mapping and Encoding.UTF8.GetByteCount()'s multi-byte count for chars > U+007F. It uses charPos directly for the byte offset (correct for SafeAsciiDecoder) and then validates that the target encoding agrees. This is the right layer and the right approach — minimal, targeted, and addresses root cause.

Summary: ⚠️ Needs Human Review. The core fix is correct and well-motivated. I have two observations worth human attention: (1) the bytePos mutation happens before the validation throw, which is safe because Throw is [DoesNotReturn] but is worth a human confirming is intentional; (2) a minor question about whether the issue's original repro (valid multi-byte UTF-8 0xC2 0xBF, not bare 0xBF) should also have a test case. Neither observation is blocking.

Detailed Findings

✅ Core Fix — Correct byte position computation

The central fix is sound:

SafeAsciiDecoder always maps 1 byte → 1 char (confirmed in XmlEncoding.cs:175-183), so _ps.bytePos += _ps.charPos is the correct byte offset when this decoder is active.
_ps.encoding is always UTF-8 when SafeAsciiDecoder is used (confirmed in SetupEncoding — _ps.encoding = Encoding.UTF8 paired with new SafeAsciiDecoder()), so the GetByteCount validation targets the right encoding.
The check encodingByteCount != _ps.charPos correctly detects any non-ASCII byte: UTF-8 encodes U+0080–U+00FF as 2 bytes, so GetByteCount > charPos whenever a byte > 0x7F was decoded by SafeAsciiDecoder.
False positives are impossible: if all bytes are 0x00–0x7F, UTF-8 encodes each in 1 byte, so encodingByteCount == charPos. Only truly invalid-for-UTF-8 inputs trigger the error.

✅ Async path coverage

The async code (XmlTextReaderImplAsync.cs:1004) calls the same UnDecodeChars() method. No separate async version exists. Both sync and async paths are covered by this single fix.

✅ Exception handling — `Throw` is `[DoesNotReturn]`

Throw(string res) at line 2674 is annotated [DoesNotReturn] and unconditionally throws XmlException. The fact that _ps.bytePos is mutated before the throw is harmless — the exception prevents any subsequent use of the corrupted state, and XmlReader does not support recovery after XmlException.

✅ Error message — Appropriate and consistent

SR.Xml_InvalidCharInThisEncoding ("Invalid character in the given encoding.") is the same error used throughout the codebase for encoding-related invalid input (e.g., lines 3661, 3879, and multiple test assertions in ReaderEncodingTests.cs). This is the correct exception type (XmlException) and message for this scenario.

✅ Test — Correctly verifies the fix

The test constructs a byte stream with 0xBF (an invalid standalone UTF-8 continuation byte) embedded in the version attribute value of an XML declaration. Key observations:

Assert.Throws<XmlException> confirms the correct exception type (was ArgumentOutOfRangeException before the fix).
Assert.Contains(_invalidCharInThisEncoding, ex.Message) confirms the correct error message.
The while (reader.Read()) loop ensures the exception is caught regardless of which Read() call triggers it.
The byte 0xBF cannot produce a passing test accidentally — it's the only non-ASCII byte and is what triggers the mismatch.

💡 Test coverage — Consider adding the original issue's repro scenario

The original issue's repro uses System.Text.Encoding.UTF8.GetBytes("<?xml version=\"1.0\xbf\"?>") which produces valid multi-byte UTF-8 0xC2 0xBF (not bare 0xBF). SafeAsciiDecoder would decode these as two separate chars (U+00C2, U+00BF), both > 0x7F, inflating GetByteCount by 2 extra bytes. This is a slightly different input than the test's single 0xBF. Consider adding an [InlineData]-style test (or [Theory]) covering both scenarios to match the original report.

💡 Test style — Regression comment citing issue number

The test comment references dotnet/runtime#113061. The repo conventions recommend avoiding regression comments citing GitHub issue numbers unless explicitly asked. This matches the existing pattern in the file (line 10 references an issue URL), so it's not inconsistent — but worth noting for the area maintainer's preference.

Add ReadWithMultiByteUtf8CharInXmlDeclaration test covering the original issue repro where Encoding.UTF8 produces 0xC2 0xBF for \xbf. SafeAsciiDecoder maps these two bytes to two >0x7F chars, inflating GetByteCount by 2 extra bytes. Also tighten the existing test's comment. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

You can also share your feedback on Copilot code review. Take the survey.

src/libraries/System.Private.Xml/src/System/Xml/Core/XmlTextReaderImpl.cs

src/libraries/System.Private.Xml/tests/XmlReader/Tests/ReaderEncodingTests.cs

danmoseley

LGTM thanks

Copilot AI review requested due to automatic review settings February 14, 2026 19:38

github-actions bot added the area-System.Xml label Feb 14, 2026

dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Feb 14, 2026

Copilot started reviewing on behalf of lufen February 14, 2026 19:38 View session

Copilot AI reviewed Feb 14, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings March 8, 2026 08:30

lufen force-pushed the fix/xmlreader-malformed-utf8 branch from 3ce4a70 to d12bd8c Compare March 8, 2026 08:30

Copilot started reviewing on behalf of lufen March 8, 2026 08:31 View session

Copilot AI reviewed Mar 8, 2026

View reviewed changes

src/libraries/System.Private.Xml/src/System/Xml/Core/XmlTextReaderImpl.cs Show resolved Hide resolved

danmoseley reviewed Mar 16, 2026

View reviewed changes

src/libraries/System.Private.Xml/tests/XmlReader/Tests/ReaderEncodingTests.cs Show resolved Hide resolved

danmoseley approved these changes Mar 16, 2026

View reviewed changes

danmoseley merged commit da46c68 into dotnet:main Mar 16, 2026
93 of 94 checks passed

dotnet-maestro bot mentioned this pull request Mar 16, 2026

[main] Source code updates from dotnet/runtime dotnet/dotnet#5506

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix XmlReader ArgumentOutOfRangeException for malformed UTF-8 in XML …#124428

Fix XmlReader ArgumentOutOfRangeException for malformed UTF-8 in XML …#124428
danmoseley merged 2 commits intodotnet:mainfrom
lufen:fix/xmlreader-malformed-utf8

lufen commented Feb 14, 2026 •

edited

Loading

Uh oh!

dotnet-policy-service bot commented Feb 14, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

stephentoub commented Mar 4, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

danmoseley left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

lufen commented Feb 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dotnet-policy-service bot commented Feb 14, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

stephentoub commented Mar 4, 2026

🤖 Copilot Code Review — PR #124428

Holistic Assessment

Detailed Findings

✅ Core Fix — Correct byte position computation

✅ Async path coverage

✅ Exception handling — Throw is [DoesNotReturn]

✅ Error message — Appropriate and consistent

✅ Test — Correctly verifies the fix

💡 Test coverage — Consider adding the original issue's repro scenario

💡 Test style — Regression comment citing issue number

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

danmoseley left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

lufen commented Feb 14, 2026 •

edited

Loading

✅ Exception handling — `Throw` is `[DoesNotReturn]`