Skip to content

Fix XmlReader ArgumentOutOfRangeException for malformed UTF-8 in XML …#124428

Merged
danmoseley merged 2 commits intodotnet:mainfrom
lufen:fix/xmlreader-malformed-utf8
Mar 16, 2026
Merged

Fix XmlReader ArgumentOutOfRangeException for malformed UTF-8 in XML …#124428
danmoseley merged 2 commits intodotnet:mainfrom
lufen:fix/xmlreader-malformed-utf8

Conversation

@lufen
Copy link
Copy Markdown
Contributor

@lufen lufen commented Feb 14, 2026

Fix XmlReader ArgumentOutOfRangeException for malformed UTF-8 in XML declaration
When SafeAsciiDecoder is used during initial XML declaration parsing, it maps
each byte to a char 1:1, including bytes >0x7F. UnDecodeChars() previously
used Encoding.GetByteCount() on these chars to compute bytePos, but non-ASCII
chars like U+00BF (from byte 0xBF) produce 2 bytes in UTF-8, inflating bytePos
beyond bytesUsed and causing ArgumentOutOfRangeException in BlockCopy.

Fix: detect SafeAsciiDecoder and use charPos directly (correct 1:1 mapping),
then validate that the target encoding byte count matches. If mismatched,
throw XmlException with Xml_InvalidCharInThisEncoding, detecting invalid bytes
before they corrupt ParsingState.

Fixes #113061

…declaration

When SafeAsciiDecoder is used during initial XML declaration parsing, it maps
each byte to a char 1:1, including bytes >0x7F. UnDecodeChars() previously
used Encoding.GetByteCount() on these chars to compute bytePos, but non-ASCII
chars like U+00BF (from byte 0xBF) produce 2 bytes in UTF-8, inflating bytePos
beyond bytesUsed and causing ArgumentOutOfRangeException in BlockCopy.

Fix: detect SafeAsciiDecoder and use charPos directly (correct 1:1 mapping),
then validate that the target encoding byte count matches. If mismatched,
throw XmlException with Xml_InvalidCharInThisEncoding, detecting invalid bytes
before they corrupt ParsingState.

Fixes dotnet#113061

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings February 14, 2026 19:38
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Feb 14, 2026
@dotnet-policy-service
Copy link
Copy Markdown
Contributor

Tagging subscribers to this area: @dotnet/area-system-xml
See info in area-owners.md if you want to be subscribed.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a bug where malformed UTF-8 bytes in XML declarations caused ArgumentOutOfRangeException instead of the expected XmlException. The issue occurred because SafeAsciiDecoder (used during initial XML parsing) maps each byte to a char 1:1, including bytes >0x7F. When UnDecodeChars() used Encoding.GetByteCount() to compute byte positions, non-ASCII chars like U+00BF (from byte 0xBF) would produce 2 bytes in UTF-8, inflating the byte position beyond valid bounds.

Changes:

  • Added detection for SafeAsciiDecoder in UnDecodeChars() to use the correct 1:1 byte-to-char mapping
  • Added validation that target encoding byte count matches character count, throwing XmlException for invalid bytes
  • Added regression test to verify malformed UTF-8 in XML declarations throws XmlException instead of ArgumentOutOfRangeException

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
src/libraries/System.Private.Xml/src/System/Xml/Core/XmlTextReaderImpl.cs Added SafeAsciiDecoder-specific handling in UnDecodeChars() to correctly compute byte positions and validate encoding compatibility
src/libraries/System.Private.Xml/tests/XmlReader/Tests/ReaderEncodingTests.cs Added regression test for #113061 verifying malformed UTF-8 byte 0xBF in XML declaration throws XmlException

@stephentoub
Copy link
Copy Markdown
Member

🤖 Copilot Code Review — PR #124428

Holistic Assessment

Motivation: Legitimate bug fix for a confirmed regression (#113061) found by fuzzing. When XmlReader.Create(stream) encounters non-ASCII bytes (0x80–0xFF) inside an XML declaration on a UTF-8 stream, UnDecodeChars() computes an inflated bytePos that exceeds bytesUsed, causing ArgumentOutOfRangeException in BlockCopy instead of the expected XmlException. The issue is real and the root cause analysis is correct.

Approach: The fix correctly identifies the mismatch between SafeAsciiDecoder's 1:1 byte-to-char mapping and Encoding.UTF8.GetByteCount()'s multi-byte count for chars > U+007F. It uses charPos directly for the byte offset (correct for SafeAsciiDecoder) and then validates that the target encoding agrees. This is the right layer and the right approach — minimal, targeted, and addresses root cause.

Summary: ⚠️ Needs Human Review. The core fix is correct and well-motivated. I have two observations worth human attention: (1) the bytePos mutation happens before the validation throw, which is safe because Throw is [DoesNotReturn] but is worth a human confirming is intentional; (2) a minor question about whether the issue's original repro (valid multi-byte UTF-8 0xC2 0xBF, not bare 0xBF) should also have a test case. Neither observation is blocking.


Detailed Findings

✅ Core Fix — Correct byte position computation

The central fix is sound:

  • SafeAsciiDecoder always maps 1 byte → 1 char (confirmed in XmlEncoding.cs:175-183), so _ps.bytePos += _ps.charPos is the correct byte offset when this decoder is active.
  • _ps.encoding is always UTF-8 when SafeAsciiDecoder is used (confirmed in SetupEncoding_ps.encoding = Encoding.UTF8 paired with new SafeAsciiDecoder()), so the GetByteCount validation targets the right encoding.
  • The check encodingByteCount != _ps.charPos correctly detects any non-ASCII byte: UTF-8 encodes U+0080–U+00FF as 2 bytes, so GetByteCount > charPos whenever a byte > 0x7F was decoded by SafeAsciiDecoder.
  • False positives are impossible: if all bytes are 0x00–0x7F, UTF-8 encodes each in 1 byte, so encodingByteCount == charPos. Only truly invalid-for-UTF-8 inputs trigger the error.

✅ Async path coverage

The async code (XmlTextReaderImplAsync.cs:1004) calls the same UnDecodeChars() method. No separate async version exists. Both sync and async paths are covered by this single fix.

✅ Exception handling — Throw is [DoesNotReturn]

Throw(string res) at line 2674 is annotated [DoesNotReturn] and unconditionally throws XmlException. The fact that _ps.bytePos is mutated before the throw is harmless — the exception prevents any subsequent use of the corrupted state, and XmlReader does not support recovery after XmlException.

✅ Error message — Appropriate and consistent

SR.Xml_InvalidCharInThisEncoding ("Invalid character in the given encoding.") is the same error used throughout the codebase for encoding-related invalid input (e.g., lines 3661, 3879, and multiple test assertions in ReaderEncodingTests.cs). This is the correct exception type (XmlException) and message for this scenario.

✅ Test — Correctly verifies the fix

The test constructs a byte stream with 0xBF (an invalid standalone UTF-8 continuation byte) embedded in the version attribute value of an XML declaration. Key observations:

  • Assert.Throws<XmlException> confirms the correct exception type (was ArgumentOutOfRangeException before the fix).
  • Assert.Contains(_invalidCharInThisEncoding, ex.Message) confirms the correct error message.
  • The while (reader.Read()) loop ensures the exception is caught regardless of which Read() call triggers it.
  • The byte 0xBF cannot produce a passing test accidentally — it's the only non-ASCII byte and is what triggers the mismatch.

💡 Test coverage — Consider adding the original issue's repro scenario

The original issue's repro uses System.Text.Encoding.UTF8.GetBytes("<?xml version=\"1.0\xbf\"?>") which produces valid multi-byte UTF-8 0xC2 0xBF (not bare 0xBF). SafeAsciiDecoder would decode these as two separate chars (U+00C2, U+00BF), both > 0x7F, inflating GetByteCount by 2 extra bytes. This is a slightly different input than the test's single 0xBF. Consider adding an [InlineData]-style test (or [Theory]) covering both scenarios to match the original report.

💡 Test style — Regression comment citing issue number

The test comment references dotnet/runtime#113061. The repo conventions recommend avoiding regression comments citing GitHub issue numbers unless explicitly asked. This matches the existing pattern in the file (line 10 references an issue URL), so it's not inconsistent — but worth noting for the area maintainer's preference.

Add ReadWithMultiByteUtf8CharInXmlDeclaration test covering the original
issue repro where Encoding.UTF8 produces 0xC2 0xBF for \xbf. SafeAsciiDecoder
maps these two bytes to two >0x7F chars, inflating GetByteCount by 2 extra
bytes. Also tighten the existing test's comment.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 8, 2026 08:30
@lufen lufen force-pushed the fix/xmlreader-malformed-utf8 branch from 3ce4a70 to d12bd8c Compare March 8, 2026 08:30
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.


You can also share your feedback on Copilot code review. Take the survey.

Copy link
Copy Markdown
Member

@danmoseley danmoseley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks

@danmoseley danmoseley merged commit da46c68 into dotnet:main Mar 16, 2026
93 of 94 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-System.Xml community-contribution Indicates that the PR has been added by a community member

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Undocumented ArgumentOutOfRangeException in XmlReader created from MemoryStream

4 participants