
API proposal: Complex Rune enumeration over spans of UTF-16 and UTF-8 text #28507

@GrabYourPitchforks

(This is the follow-up to dotnet/apireviews#76.)

These Rune enumeration APIs are similar to existing APIs like String.EnumerateRunes(), but instead of returning a simple enumerator of Rune, they return an enumerator of a data structure that allows deeper inspection of the underlying data: where each Rune starts, how many code units it spans, and whether it was substituted for invalid input.

API proposal

namespace System.Text
{
   // Proposed NEW type
   public readonly struct RunePosition : IEquatable<RunePosition>
   {
      private readonly int _dummy;

      // NOTE: No public ctor other than the default parameterless ctor.
      // Should we wait to add this later when we see demand?

      public static bool operator==(RunePosition left, RunePosition right);
      public static bool operator!=(RunePosition left, RunePosition right);

      public Rune Rune { get; }
      public int StartIndex { get; }
      public int Length { get; }
      public bool WasReplaced { get; }

      [EditorBrowsable(EditorBrowsableState.Never)]
      public void Deconstruct(out Rune rune, out int startIndex);

      [EditorBrowsable(EditorBrowsableState.Never)]
      public void Deconstruct(out Rune rune, out int startIndex, out int length);

      public bool Equals(RunePosition other);
      public override bool Equals(object other);
      public override int GetHashCode();
   }
}

And the factories that give you enumerators over these things:

namespace System.Text
{
   // EXISTING type
   public readonly struct Rune
   {
      // Proposed NEW methods

      public static Utf16RunePositionEnumerator EnumerateRunePositions(ReadOnlySpan<char> span);
      public static Utf8RunePositionEnumerator EnumerateRunePositions(ReadOnlySpan<Utf8Char> span);

      // Proposed NEW nested types under Rune

      public ref struct Utf8RunePositionEnumerator
      {
         private readonly object _dummy;
         private readonly int _dummyPrimitive;

         public RunePosition Current { get; }
         public Utf8RunePositionEnumerator GetEnumerator();
         public bool MoveNext();
      }

      public ref struct Utf16RunePositionEnumerator
      {
         private readonly object _dummy;
         private readonly int _dummyPrimitive;

         public RunePosition Current { get; }
         public Utf16RunePositionEnumerator GetEnumerator();
         public bool MoveNext();
      }
   }
}
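
To make the enumerator shape concrete, here is a minimal sketch of a UTF-16 enumerator built on the existing Rune.DecodeFromUtf16 API. The type name Utf16PositionSketch is hypothetical, and a tuple stands in for the proposed RunePosition; this is not the proposed implementation, only an illustration of why a ref struct with GetEnumerator / MoveNext / Current is enough for foreach to work over a span.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for the proposed Utf16RunePositionEnumerator.
// It is a ref struct because it holds a ReadOnlySpan<char>.
ref struct Utf16PositionSketch
{
    private readonly ReadOnlySpan<char> _text;
    private int _nextIndex;
    private (Rune Rune, int StartIndex, int Length) _current;

    public Utf16PositionSketch(ReadOnlySpan<char> text)
    {
        _text = text;
        _nextIndex = 0;
        _current = default;
    }

    // A tuple stands in for the proposed RunePosition struct.
    public (Rune Rune, int StartIndex, int Length) Current => _current;

    // Returning a copy of itself lets foreach consume the struct directly.
    public Utf16PositionSketch GetEnumerator() => this;

    public bool MoveNext()
    {
        if (_nextIndex >= _text.Length)
        {
            return false;
        }

        // Existing API: decodes one scalar value and reports the chars consumed.
        Rune.DecodeFromUtf16(_text.Slice(_nextIndex), out Rune rune, out int consumed);
        _current = (rune, _nextIndex, consumed);
        _nextIndex += consumed;
        return true;
    }
}

static class EnumeratorShapeDemo
{
    static void Main()
    {
        // 'a' followed by U+1F600, which occupies two UTF-16 code units.
        foreach (var (rune, startIndex, length) in new Utf16PositionSketch("a\U0001F600"))
        {
            Console.WriteLine($"U+{rune.Value:X4} start={startIndex} length={length}");
        }
    }
}
```

The sketch ignores the decode status entirely; how invalid input is surfaced is covered under Behaviors.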

Behaviors

If an invalid or incomplete sequence is encountered, the enumerator silently replaces the bad subsequence with U+FFFD and sets WasReplaced = true for that element. This also means that the RunePosition.Length property value can differ from the Rune.Utf{8|16}SequenceLength property value when an invalid subsequence is encountered. RunePosition.Length will always return the number of code units consumed as part of discovering that this was an invalid subsequence, and Rune.Utf{8|16}SequenceLength will always return the code unit count of U+FFFD (3 in UTF-8, 1 in UTF-16).
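
The difference between the two lengths can be observed today with the existing Rune.DecodeFromUtf8 API. The sketch below is an approximation of the described behavior, not the proposed enumerator: the DecodeAt helper and the class name are hypothetical, and it assumes the proposed Length and WasReplaced would correspond to the bytes-consumed count and a non-Done status from the decoder.

```csharp
using System;
using System.Buffers;
using System.Text;

static class ReplacementLengthSketch
{
    // Hypothetical helper (not part of the proposal): decodes the scalar that starts
    // at `index` and returns the values a RunePosition would presumably carry.
    public static (Rune Rune, int Length, bool WasReplaced) DecodeAt(ReadOnlySpan<byte> utf8, int index)
    {
        // Existing API. On ill-formed or truncated input it returns U+FFFD and
        // reports how many bytes the bad subsequence occupied.
        OperationStatus status = Rune.DecodeFromUtf8(utf8.Slice(index), out Rune rune, out int consumed);
        return (rune, consumed, status != OperationStatus.Done);
    }

    static void Main()
    {
        // 'A', then E6 97 (the first two bytes of the three-byte encoding of U+65E5), then 'B'.
        byte[] utf8 = { 0x41, 0xE6, 0x97, 0x42 };

        for (int index = 0; index < utf8.Length; )
        {
            var (rune, length, wasReplaced) = DecodeAt(utf8, index);
            Console.WriteLine($"U+{rune.Value:X4} start={index} length={length} " +
                              $"encodedLength={rune.Utf8SequenceLength} replaced={wasReplaced}");
            index += length;
        }
    }
}
```

For the truncated sequence in the middle, the consumed length is expected to be 2 while the encoded length of the substituted U+FFFD is 3, which is exactly the mismatch described above. Advancing by the consumed length, not the encoded length, is what keeps the start index aligned with the input.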

Discussion

I don't believe we have any code in corefx which would benefit directly from being able to call into this. Instead, it's meant for developers migrating from Go who are looking for similar capabilities in our own systems.

https://blog.golang.org/strings gives the following sample in its tutorial.

const nihongo = "日本語"
for index, runeValue := range nihongo {
    fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
}

With this API proposal, the C# equivalent (using System.String) would look like this.

const string nihongo = "日本語";
foreach (var (runeValue, startIndex) in Rune.EnumerateRunePositions(nihongo))
{
   Console.WriteLine($"U+{runeValue.Value:X4} '{runeValue}' starts at char position {startIndex}");
}
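
For comparison, here is a sketch of roughly what the same loop takes with the existing String.EnumerateRunes() API, where the position has to be tracked by hand. The class and method names are hypothetical; only EnumerateRunes() and Rune.Utf16SequenceLength are existing APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class ExistingApiSketch
{
    // Hypothetical helper: pairs each Rune with a manually tracked start index.
    public static List<(Rune Rune, int StartIndex)> WithPositions(string text)
    {
        var result = new List<(Rune Rune, int StartIndex)>();
        int startIndex = 0; // EnumerateRunes() does not report positions itself

        foreach (Rune rune in text.EnumerateRunes())
        {
            result.Add((rune, startIndex));
            startIndex += rune.Utf16SequenceLength;
        }

        return result;
    }

    static void Main()
    {
        const string nihongo = "日本語";
        foreach (var (runeValue, startIndex) in WithPositions(nihongo))
        {
            Console.WriteLine($"U+{runeValue.Value:X4} '{runeValue}' starts at char position {startIndex}");
        }
    }
}
```

Note that the positions here are UTF-16 char offsets (0, 1, 2), whereas the Go sample reports UTF-8 byte offsets (0, 3, 6). The manual bookkeeping also assumes each Rune's encoded length equals the number of code units it consumed in the input.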

Open issues and questions

  • Is it OK to put the enumeration APIs on, and the nested enumerator types under, the existing Rune type?

This isn't the standard way of doing things. Normally we nest enumerators under the collection type, but in this case our collection type is ReadOnlySpan<T>, so that won't help much. We can put them as top-level types under the System.Text namespace, but I don't want to pollute default Intellisense.

I had shied away from creating an instance method String.EnumerateRunePositions or an extension method ReadOnlySpan<char>.EnumerateRunePositions because I didn't want this to conflict with the existing methods named simply EnumerateRunes(). I'd prefer to drive most developers to that simple API over this one.

Labels

api-approved: API was approved in API review, it can be implemented
area-System.Runtime
in-pr: There is an active PR which will close this issue when it is merged
