-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Closed
Labels
area-System.Text.RegularExpressionstenet-performancePerformance related issuePerformance related issue
Description
I'm very glad for RegexGeneratorAttribute.
For an example, I have to find all consonant in russian language. I write:
[RegexGenerator(@"[а-я-[аeиоуыэюя]]", RegexOptions.None, -1)]
private static partial Regex ConsonantRegex();This Regex emits the following generated code:
/// ...
/// I display only interesting part.
/// <summary>Search <paramref name="inputSpan"/> starting from base.runtextpos for the next location a match could possibly start.</summary>
/// <param name="inputSpan">The text being scanned by the regular expression.</param>
/// <returns>true if a possible match was found; false if no more matches are possible.</returns>
private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
{
int pos = base.runtextpos;
char ch;
// Empty matches aren't possible.
if ((uint)pos < (uint)inputSpan.Length)
{
// The pattern begins with a character in the set [\u0430-\u044F-[e\u0430\u0438\u043E\u0443\u044B\u044D-\u044F]].
// Find the next occurrence. If it can't be found, there's no match.
ReadOnlySpan<char> span = inputSpan.Slice(pos);
for (int i = 0; i < span.Length; i++)
{
if (((ch = span[i]) < 128 ? ("\0\0\0\0\0\0\0\0"[ch >> 4] & (1 << (ch & 0xF))) != 0 : RegexRunner.CharInClass((char)ch, "\0\u0002\0аѐ\0\u000e\0efабийопуфыьэѐ")))
{
base.runtextpos = pos + i;
return true;
}
}
}
}There we can see the condition that will be always false: ("\0\0\0\0\0\0\0\0"[ch >> 4] & (1 << (ch & 0xF))) != 0
Is it possible to replace this line with: if (((ch = span[i]) >= 128 && RegexRunner.CharInClass((char)ch, "\0\u0002\0аѐ\0\u000e\0efабийопуфыьэѐ")))?
I've tried to compare generated regex with suggested change:
public class DotnetRegexBenchmark
{
// This is the pure copy from generated regex.
private readonly Regex _originalRegex = OriginalRegex.Instance;
// This is the copy from generated regex with suggested change.
private readonly Regex _suggestedRegex = SuggestedRegex.Instance;
// Text with Latin symbols.
private readonly string? _text;
public DotnetRegexBenchmark()
{
_text = new string(Enumerable
.Range(1, 5000)
.Select(i => (char)('a' + i % ('z' - 'a')))
.ToArray());
}
[Benchmark(Baseline = true)]
public int Original() => CountMatches(_originalRegex);
[Benchmark]
public int Suggested() => CountMatches(_suggestedRegex);
private int CountMatches(Regex regex)
{
int matchesCount = 0;
Match match = regex.Match(_text!);
while (match.Success)
{
++matchesCount;
match = match.NextMatch();
}
return matchesCount;
}
}With configuration:
BenchmarkDotNet=v0.13.1.20220709-develop, OS=Windows 10 (10.0.19044.1766/21H2/November2021Update)
Intel Xeon CPU E3-1225 V2 3.20GHz, 1 CPU, 4 logical and 4 physical cores
.NET SDK=7.0.100-preview.5.22307.18
[Host] : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT
DefaultJob : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT
I've got:
| Method | Mean | Error | StdDev | Ratio |
|---|---|---|---|---|
| Original | 5.711 us | 0.0049 us | 0.0038 us | 1.00 |
| Suggested | 2.879 us | 0.0034 us | 0.0029 us | 0.50 |
As we can notice, it could be faster for a regex, that is looking for non ASCII symbols and for a text, that contains at most ASCII symbols.
Thanks!
Metadata
Metadata
Assignees
Labels
area-System.Text.RegularExpressionstenet-performancePerformance related issuePerformance related issue