Skip to content

Avoid unnecessary ToLower calls in RegexCompiler generated code#35185

Merged
stephentoub merged 3 commits intodotnet:masterfrom
stephentoub:regexcasing
Apr 21, 2020
Merged

Avoid unnecessary ToLower calls in RegexCompiler generated code#35185
stephentoub merged 3 commits intodotnet:masterfrom
stephentoub:regexcasing

Conversation

@stephentoub
Copy link
Member

@stephentoub stephentoub commented Apr 20, 2020

When RegexOptions.IgnoreCase is used, we can skip calling ToLower{Invariant} if the only character that could possibly lower-case to the character we're comparing against is that character itself. This then also lets us employ optimizations like using IndexOf when searching for \n as part of expressions like .*.

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;

public class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssemblies(new[] { typeof(Program).Assembly }).Run(args);

    private readonly Regex _many = new Regex("hello.*world", RegexOptions.Compiled | RegexOptions.IgnoreCase);
    private readonly Regex _single = new Regex("h.l", RegexOptions.Compiled | RegexOptions.IgnoreCase);
    private readonly string _input = "abcdHELLO" + new string('a', 128) + "WORLD123";

    [Benchmark] public bool Many() => _many.IsMatch(_input);
    [Benchmark] public bool Single() => _single.IsMatch(_input);
}
Method Toolchain Mean Error StdDev Ratio
Many \master\corerun.exe 817.32 ns 2.488 ns 2.327 ns 1.00
Many \pr\corerun.exe 170.75 ns 0.399 ns 0.354 ns 0.21
Single \master\corerun.exe 84.47 ns 0.278 ns 0.232 ns 1.00
Single \pr\corerun.exe 81.94 ns 0.220 ns 0.195 ns 0.97

cc: @GrabYourPitchforks, @tarekgh, @eerhardt, @pgovind

@stephentoub stephentoub added this to the 5.0 milestone Apr 20, 2020
@ghost
Copy link

ghost commented Apr 20, 2020

Tagging subscribers to this area: @eerhardt
Notify danmosemsft if you want to be subscribed.

@GrabYourPitchforks
Copy link
Member

The logic in the PR seems sound, but it's not immediately obvious to me how it improved this benchmark. In the "hello.*world" test, it seems that you'd still need to query each of the 128 'a' characters, asking "do you lowercase to 'w'?" What am I missing?

@stephentoub
Copy link
Member Author

stephentoub commented Apr 20, 2020

What am I missing?

That * is greedy.

With the expression hello.*world and the input abc hello world world world 123, it will match hello world world world, not hello world. So, .* doesn't mean "skip over everything until you see the first "world'", it actually means "consume as many .s as possible, and then if the rest of the expression can't be matched at that point, backtrack until it can".

By default, . means "anything other than \n" (it's the same as the character class [^\n]), so .* means "skip over everything that's not a \n"... and rather than implementing that as "are you \n? are you \n? are you \n?" as we walk along the characters (which benefits from this change because we don't need to ToLower each such char), more importantly it let's us use the existing (new to .NET 5) optimization that employs IndexOf('\n').

This is why I was asking offline (thanks, btw) about what could lowercase to \n. It shows up a lot because of ..

Copy link
Member

@tarekgh tarekgh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with the current change. will be fine too if we do the check on the control characters only as this was the case we were trying to fix in the first place.

Copy link

@pgovind pgovind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice find

We can skip calling ToLower{Invariant} if the only character that could possibly lower-case to the character we're comparing against is that character itself.  This then also lets us employ optimizations like using IndexOf when searching for `\n` as part of expressions like `.*`.
@stephentoub stephentoub merged commit 6bb37e1 into dotnet:master Apr 21, 2020
@stephentoub stephentoub deleted the regexcasing branch April 21, 2020 13:37
@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants