Skip to content

Remove redundant check of 128-bit bitmap of zeros in regex #72312

@scrappyCoco

Description

@scrappyCoco

I'm very glad for RegexGeneratorAttribute.

For an example, I have to find all consonant in russian language. I write:

[RegexGenerator(@"[а-я-[аeиоуыэюя]]", RegexOptions.None, -1)]
private static partial Regex ConsonantRegex();

This Regex emits the following generated code:

/// ...
/// I display only interesting part. 

/// <summary>Search <paramref name="inputSpan"/> starting from base.runtextpos for the next location a match could possibly start.</summary>
/// <param name="inputSpan">The text being scanned by the regular expression.</param>
/// <returns>true if a possible match was found; false if no more matches are possible.</returns>
private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
{
    int pos = base.runtextpos;
    char ch;
    
    // Empty matches aren't possible.
    if ((uint)pos < (uint)inputSpan.Length)
    {
        // The pattern begins with a character in the set [\u0430-\u044F-[e\u0430\u0438\u043E\u0443\u044B\u044D-\u044F]].
        // Find the next occurrence. If it can't be found, there's no match.
        ReadOnlySpan<char> span = inputSpan.Slice(pos);
        for (int i = 0; i < span.Length; i++)
        {
            if (((ch = span[i]) < 128 ? ("\0\0\0\0\0\0\0\0"[ch >> 4] & (1 << (ch & 0xF))) != 0 : RegexRunner.CharInClass((char)ch, "\0\u0002\0аѐ\0\u000e\0efабийопуфыьэѐ")))
            {
                base.runtextpos = pos + i;
                return true;
            }
        }
    }
}

There we can see the condition that will be always false: ("\0\0\0\0\0\0\0\0"[ch >> 4] & (1 << (ch & 0xF))) != 0

Is it possible to replace this line with: if (((ch = span[i]) >= 128 && RegexRunner.CharInClass((char)ch, "\0\u0002\0аѐ\0\u000e\0efабийопуфыьэѐ")))?

I've tried to compare generated regex with suggested change:

public class DotnetRegexBenchmark
{
    // This is the pure copy from generated regex. 
    private readonly Regex _originalRegex = OriginalRegex.Instance;
    
    // This is the copy from generated regex with suggested change. 
    private readonly Regex _suggestedRegex = SuggestedRegex.Instance;
    
    // Text with Latin symbols.
    private readonly string? _text;

    public DotnetRegexBenchmark()
    {
        _text = new string(Enumerable
            .Range(1, 5000)
            .Select(i => (char)('a' + i % ('z' - 'a')))
            .ToArray());
    }

    [Benchmark(Baseline = true)]
    public int Original() => CountMatches(_originalRegex);

    [Benchmark]
    public int Suggested() => CountMatches(_suggestedRegex);
    
    private int CountMatches(Regex regex)
    {
        int matchesCount = 0;
        Match match = regex.Match(_text!);
        while (match.Success)
        {
            ++matchesCount;
            match = match.NextMatch();
        }
        return matchesCount;
    }
}

With configuration:

BenchmarkDotNet=v0.13.1.20220709-develop, OS=Windows 10 (10.0.19044.1766/21H2/November2021Update)  
Intel Xeon CPU E3-1225 V2 3.20GHz, 1 CPU, 4 logical and 4 physical cores  
.NET SDK=7.0.100-preview.5.22307.18  
[Host]     : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT  
DefaultJob : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT

I've got:

Method Mean Error StdDev Ratio
Original 5.711 us 0.0049 us 0.0038 us 1.00
Suggested 2.879 us 0.0034 us 0.0029 us 0.50

As we can notice, it could be faster for a regex, that is looking for non ASCII symbols and for a text, that contains at most ASCII symbols.

Thanks!

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions