Remove redundant check of 128-bit bitmap of zeros in regex

I'm very glad for `RegexGeneratorAttribute`.

For an example, I have to find all consonant in russian language. I write:
```cs
[RegexGenerator(@"[а-я-[аeиоуыэюя]]", RegexOptions.None, -1)]
private static partial Regex ConsonantRegex();
```

This Regex emits the following generated code:
```cs
/// ...
/// I display only interesting part. 

/// <summary>Search <paramref name="inputSpan"/> starting from base.runtextpos for the next location a match could possibly start.</summary>
/// <param name="inputSpan">The text being scanned by the regular expression.</param>
/// <returns>true if a possible match was found; false if no more matches are possible.</returns>
private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
{
    int pos = base.runtextpos;
    char ch;
    
    // Empty matches aren't possible.
    if ((uint)pos < (uint)inputSpan.Length)
    {
        // The pattern begins with a character in the set [\u0430-\u044F-[e\u0430\u0438\u043E\u0443\u044B\u044D-\u044F]].
        // Find the next occurrence. If it can't be found, there's no match.
        ReadOnlySpan<char> span = inputSpan.Slice(pos);
        for (int i = 0; i < span.Length; i++)
        {
            if (((ch = span[i]) < 128 ? ("\0\0\0\0\0\0\0\0"[ch >> 4] & (1 << (ch & 0xF))) != 0 : RegexRunner.CharInClass((char)ch, "\0\u0002\0аѐ\0\u000e\0efабийопуфыьэѐ")))
            {
                base.runtextpos = pos + i;
                return true;
            }
        }
    }
}
```

There we can see the condition that will be always `false`: `("\0\0\0\0\0\0\0\0"[ch >> 4] & (1 << (ch & 0xF))) != 0`

Is it possible to replace this line with: `if (((ch = span[i]) >= 128 && RegexRunner.CharInClass((char)ch, "\0\u0002\0аѐ\0\u000e\0efабийопуфыьэѐ")))`?

I've tried to compare generated regex with suggested change:

```cs
public class DotnetRegexBenchmark
{
    // This is the pure copy from generated regex. 
    private readonly Regex _originalRegex = OriginalRegex.Instance;
    
    // This is the copy from generated regex with suggested change. 
    private readonly Regex _suggestedRegex = SuggestedRegex.Instance;
    
    // Text with Latin symbols.
    private readonly string? _text;

    public DotnetRegexBenchmark()
    {
        _text = new string(Enumerable
            .Range(1, 5000)
            .Select(i => (char)('a' + i % ('z' - 'a')))
            .ToArray());
    }

    [Benchmark(Baseline = true)]
    public int Original() => CountMatches(_originalRegex);

    [Benchmark]
    public int Suggested() => CountMatches(_suggestedRegex);
    
    private int CountMatches(Regex regex)
    {
        int matchesCount = 0;
        Match match = regex.Match(_text!);
        while (match.Success)
        {
            ++matchesCount;
            match = match.NextMatch();
        }
        return matchesCount;
    }
}
```

With configuration:
```
BenchmarkDotNet=v0.13.1.20220709-develop, OS=Windows 10 (10.0.19044.1766/21H2/November2021Update)  
Intel Xeon CPU E3-1225 V2 3.20GHz, 1 CPU, 4 logical and 4 physical cores  
.NET SDK=7.0.100-preview.5.22307.18  
[Host]     : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT  
DefaultJob : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT
``` 

I've  got:

|    Method |     Mean |     Error |    StdDev | Ratio |
|---------- |---------:|----------:|----------:|------:|
|  Original | 5.711 us | 0.0049 us | 0.0038 us |  1.00 |
| Suggested | 2.879 us | 0.0034 us | 0.0029 us |  0.50 |


As we can notice, it could be faster for a regex, that is looking for non ASCII symbols and for a text, that contains at most ASCII symbols.

Thanks!


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Remove redundant check of 128-bit bitmap of zeros in regex #72312

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Method	Mean	Error	StdDev	Ratio
Original	5.711 us	0.0049 us	0.0038 us	1.00
Suggested	2.879 us	0.0034 us	0.0029 us	0.50

Remove redundant check of 128-bit bitmap of zeros in regex #72312

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions