Skip to content

Utf16.IsValid and Utf8/16.IndexOfInvalidSubsequence#120326

Open
lilinus wants to merge 20 commits intodotnet:mainfrom
lilinus:utf8-utf16-validation
Open

Utf16.IsValid and Utf8/16.IndexOfInvalidSubsequence#120326
lilinus wants to merge 20 commits intodotnet:mainfrom
lilinus:utf8-utf16-validation

Conversation

@lilinus
Copy link
Contributor

@lilinus lilinus commented Oct 2, 2025

Fixes #118018
Fixes #118113

Copilot AI review requested due to automatic review settings October 2, 2025 14:24
@github-actions github-actions bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Oct 2, 2025
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Oct 2, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds new UTF-16 validation APIs to System.Text.Unicode, introducing Utf16.IsValid and Utf16.IndexOfInvalidSubsequence methods alongside similar functionality for UTF-8. It also refactors existing code to use these new standardized validation APIs instead of custom validation logic.

  • Adds new Utf16 class with validation methods IsValid and IndexOfInvalidSubsequence
  • Adds Utf8.IndexOfInvalidSubsequence method to complement existing Utf8.IsValid
  • Refactors multiple codebases to use the new standardized validation APIs

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
System.Private.CoreLib.Shared.projitems Adds the new Utf16.cs file to the build
Utf16.cs New class implementing UTF-16 validation methods
Utf8.cs Adds IndexOfInvalidSubsequence method and updates class documentation
System.Runtime.cs Adds public API surface for new UTF-16 and UTF-8 validation methods
HttpEncoder.cs Refactors surrogate validation to use new Utf16.IndexOfInvalidSubsequence
Normalization.Icu.cs Replaces custom validation logic with Utf16.IsValid
StringSearchValues.cs Simplifies surrogate validation using Utf16.IsValid
Utf8UtilityTests.ValidateBytes.cs Adds test coverage for new Utf8.IndexOfInvalidSubsequence method
Utf16UtilityTests.ValidateChars.cs Adds test coverage for new Utf16 validation methods

@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

@jeffhandley jeffhandley removed the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jan 14, 2026
Copilot AI review requested due to automatic review settings February 1, 2026 18:58
@jeffhandley
Copy link
Member

@lilinus Thanks for your patience on this; we are planning to review it this month. I've updated the branch and CI will re-run, but there was an ApiCompat error in the build previously indicating Utf8.IndexOfInvalidSubsequence wasn't available in the net8.0 target and it wasn't obvious to me why that was the case.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Update XML doc comments and remove unneeded using statements.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings February 1, 2026 19:32
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Copy link
Member

@jozkee jozkee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the unsafe comment, this LGTM.

Is worth mentioning that

private static bool HasInvalidUnicodeSequence(ReadOnlySpan<char> s)
from #118018 can't adopt the new APIs because it would make it more expensive by requiring two passes instead of one.

Copilot AI review requested due to automatic review settings February 27, 2026 09:32
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings February 27, 2026 09:45
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.

@lilinus
Copy link
Contributor Author

lilinus commented Feb 28, 2026

I'm getting some compat errors in Microsoft.Bcl.Memory.

Not sure how to approach those unfortunately. Should the new methods be excluded or included in that package?

Copilot AI review requested due to automatic review settings March 2, 2026 13:10
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8.cs:830

  • The XML docs for Utf8.IsValid immediately below this new API still describe the input as a "ReadOnlySpan string". Now that IndexOfInvalidSubsequence is added, please update Utf8.IsValid's <param name="value"> documentation to describe validating a UTF-8 byte sequence / input text, for consistency and correctness.
            Utf8Utility.GetIndexOfFirstInvalidUtf8Sequence(value, out _);
#endif

        /// <summary>
        /// Validates that the value is well-formed UTF-8.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 2, 2026 14:00
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8.cs:834

  • The XML docs for Utf8.IsValid still describe the input parameter as a generic "ReadOnlySpan string", but this API validates a UTF-8 byte sequence (ReadOnlySpan). Please update the text so public docs accurately describe the expected input, ideally matching the wording used for IndexOfInvalidSubsequence just above.
        /// <summary>
        /// Validates that the value is well-formed UTF-8.
        /// </summary>
        /// <param name="value">The <see cref="ReadOnlySpan{T}"/> string.</param>
        /// <returns><c>true</c> if value is well-formed UTF-8, <c>false</c> otherwise.</returns>
        public static bool IsValid(ReadOnlySpan<byte> value) =>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-System.Runtime community-contribution Indicates that the PR has been added by a community member

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[API Proposal]: Indexed version of Utf8.IsValid [API Proposal]: UTF-16 version of Utf8.IsValid

6 participants