Skip to content

Regex pattern '^' is slower than manually lowered equivalent #125277

@danmoseley

Description

@danmoseley

(I thought this was only a curiosity but it turns out patterns like ^ are common in the real world corpus!)

When ^ is the only interesting construct (e.g., bare ^ with Multiline), the engine checks every position for "preceded by \n or start of string". The semantically equivalent hand-lowered lookbehind (?<=\n|\A) is ~40% faster on this case, likely because FindFirstChar can use vectorized IndexOf('\n') to jump to candidate positions.

In practice, most patterns have content after ^ (e.g., ^\w+, ^Line\d+) that gives FindFirstChar a better skip target, which largely masks the Bol overhead. The realistic patterns below show only a ~10% gap or none at all. However, the real-world regex patterns dataset has several multiline ^ patterns that could be affected:

Packages Pattern Notes
5,871 ^ Bare ^ with Multiline -- fully affected
823 ^ +$ ^ + spaces -- weak skip
823 ^ {4} ^ + spaces -- weak skip
823 ^ *> ? ^ + optional spaces -- weak skip
599 ^(\d+)\.(\d+)\.(\d+) ^ + digit -- depends on whether FindFirstChar can use \d
460 ^\s*GO\s*$ ^ + whitespace -- weak skip
279 ^(?!$) ^ + lookahead only -- no skip target

Patterns with a literal after ^ (e.g., ^#include at 2,606 packages, ^-+BEGIN at 1,964) are largely unaffected since FindFirstChar uses the literal to skip.

Results (.NET 11.0.0-preview.1, BenchmarkDotNet, 30 iterations, InProcess, stock runtime, \n text with 1000 lines of ~40 chars each):

Pattern Native ^ Lowered (?<=\n|\A) Delta Notes
^ (bare) 197 us 119 us -40% Lowered wins big
^\w+ 59 us 257 us +330% Native wins -- FindFirstChar uses \w to skip
^.+$ 51 us 45 us -12% Small gap, content dominates
^Line\d+ 52 us 47 us -10% Small gap, literal prefix dominates

For comparison, $ (Eol) does not show this gap -- native $ is already as fast as (?=\n\|\z):

Pattern Native $ Lowered (?=\n|\z) Delta
$ on \n text 125 us 122 us -2%
$ on \r\n text 122 us 127 us +4%

This suggests $ already has an efficient scan optimization that ^ lacks.

Discovered while investigating AnyNewLine anchor lowering performance (#124701).

Repro code (BenchmarkDotNet, .NET 11)
using System;
using System.Linq;
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Columns;
using BenchmarkDotNet.Reports;
using BenchmarkDotNet.Toolchains.InProcess.Emit;

var config = DefaultConfig.Instance
    .WithSummaryStyle(SummaryStyle.Default.WithRatioStyle(RatioStyle.Percentage))
    .AddJob(Job.Default
        .WithToolchain(new InProcessEmitToolchain(TimeSpan.FromMinutes(10), true))
        .WithIterationCount(30)
        .WithWarmupCount(5));

BenchmarkSwitcher.FromAssembly(typeof(BolBench).Assembly).Run(args, config);

[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "RatioSD", "Alloc Ratio")]
public class BolBench
{
    // 1000 lines, ~40 chars each. ~41K chars. Newlines are ~2.4% of input.
    private static readonly string LfText = string.Join("\n",
        Enumerable.Range(0, 1000).Select(i => $"Line{i} with some words and content here"));

    private static readonly Regex NativeBol = new(@"^", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex LoweredBol = new(@"(?<=\n|\A)", RegexOptions.Compiled);
    private static readonly Regex NativeBolWord = new(@"^\w+", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex LoweredBolWord = new(@"(?<=\n|\A)\w+", RegexOptions.Compiled);
    private static readonly Regex NativeBolDot = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex LoweredBolDot = new(@"(?<=\n|\A).+(?=\n|\z)", RegexOptions.Compiled);
    private static readonly Regex NativeBolLiteral = new(@"^Line\d+", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex LoweredBolLiteral = new(@"(?<=\n|\A)Line\d+", RegexOptions.Compiled);

    [Benchmark(Baseline = true)]
    public int Bare_Native() => NativeBol.Matches(LfText).Count;

    [Benchmark]
    public int Bare_Lowered() => LoweredBol.Matches(LfText).Count;

    [Benchmark]
    public int Word_Native() => NativeBolWord.Matches(LfText).Count;

    [Benchmark]
    public int Word_Lowered() => LoweredBolWord.Matches(LfText).Count;

    [Benchmark]
    public int DotLine_Native() => NativeBolDot.Matches(LfText).Count;

    [Benchmark]
    public int DotLine_Lowered() => LoweredBolDot.Matches(LfText).Count;

    [Benchmark]
    public int Literal_Native() => NativeBolLiteral.Matches(LfText).Count;

    [Benchmark]
    public int Literal_Lowered() => LoweredBolLiteral.Matches(LfText).Count;
}

Metadata

Metadata

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions