-
Notifications
You must be signed in to change notification settings - Fork 5.4k
Description
(I thought this was only a curiosity but it turns out patterns like ^ are common in the real world corpus!)
When ^ is the only interesting construct (e.g., bare ^ with Multiline), the engine checks every position for "preceded by \n or start of string". The semantically equivalent hand-lowered lookbehind (?<=\n|\A) is ~40% faster on this case, likely because FindFirstChar can use vectorized IndexOf('\n') to jump to candidate positions.
In practice, most patterns have content after ^ (e.g., ^\w+, ^Line\d+) that gives FindFirstChar a better skip target, which largely masks the Bol overhead. The realistic patterns below show only a ~10% gap or none at all. However, the real-world regex patterns dataset has several multiline ^ patterns that could be affected:
| Packages | Pattern | Notes |
|---|---|---|
| 5,871 | ^ |
Bare ^ with Multiline -- fully affected |
| 823 | ^ +$ |
^ + spaces -- weak skip |
| 823 | ^ {4} |
^ + spaces -- weak skip |
| 823 | ^ *> ? |
^ + optional spaces -- weak skip |
| 599 | ^(\d+)\.(\d+)\.(\d+) |
^ + digit -- depends on whether FindFirstChar can use \d |
| 460 | ^\s*GO\s*$ |
^ + whitespace -- weak skip |
| 279 | ^(?!$) |
^ + lookahead only -- no skip target |
Patterns with a literal after ^ (e.g., ^#include at 2,606 packages, ^-+BEGIN at 1,964) are largely unaffected since FindFirstChar uses the literal to skip.
Results (.NET 11.0.0-preview.1, BenchmarkDotNet, 30 iterations, InProcess, stock runtime, \n text with 1000 lines of ~40 chars each):
| Pattern | Native ^ |
Lowered (?<=\n|\A) |
Delta | Notes |
|---|---|---|---|---|
^ (bare) |
197 us | 119 us | -40% | Lowered wins big |
^\w+ |
59 us | 257 us | +330% | Native wins -- FindFirstChar uses \w to skip |
^.+$ |
51 us | 45 us | -12% | Small gap, content dominates |
^Line\d+ |
52 us | 47 us | -10% | Small gap, literal prefix dominates |
For comparison, $ (Eol) does not show this gap -- native $ is already as fast as (?=\n\|\z):
| Pattern | Native $ |
Lowered (?=\n|\z) |
Delta |
|---|---|---|---|
$ on \n text |
125 us | 122 us | -2% |
$ on \r\n text |
122 us | 127 us | +4% |
This suggests $ already has an efficient scan optimization that ^ lacks.
Discovered while investigating AnyNewLine anchor lowering performance (#124701).
Repro code (BenchmarkDotNet, .NET 11)
using System;
using System.Linq;
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Columns;
using BenchmarkDotNet.Reports;
using BenchmarkDotNet.Toolchains.InProcess.Emit;
var config = DefaultConfig.Instance
.WithSummaryStyle(SummaryStyle.Default.WithRatioStyle(RatioStyle.Percentage))
.AddJob(Job.Default
.WithToolchain(new InProcessEmitToolchain(TimeSpan.FromMinutes(10), true))
.WithIterationCount(30)
.WithWarmupCount(5));
BenchmarkSwitcher.FromAssembly(typeof(BolBench).Assembly).Run(args, config);
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "RatioSD", "Alloc Ratio")]
public class BolBench
{
// 1000 lines, ~40 chars each. ~41K chars. Newlines are ~2.4% of input.
private static readonly string LfText = string.Join("\n",
Enumerable.Range(0, 1000).Select(i => $"Line{i} with some words and content here"));
private static readonly Regex NativeBol = new(@"^", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex LoweredBol = new(@"(?<=\n|\A)", RegexOptions.Compiled);
private static readonly Regex NativeBolWord = new(@"^\w+", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex LoweredBolWord = new(@"(?<=\n|\A)\w+", RegexOptions.Compiled);
private static readonly Regex NativeBolDot = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex LoweredBolDot = new(@"(?<=\n|\A).+(?=\n|\z)", RegexOptions.Compiled);
private static readonly Regex NativeBolLiteral = new(@"^Line\d+", RegexOptions.Compiled | RegexOptions.Multiline);
private static readonly Regex LoweredBolLiteral = new(@"(?<=\n|\A)Line\d+", RegexOptions.Compiled);
[Benchmark(Baseline = true)]
public int Bare_Native() => NativeBol.Matches(LfText).Count;
[Benchmark]
public int Bare_Lowered() => LoweredBol.Matches(LfText).Count;
[Benchmark]
public int Word_Native() => NativeBolWord.Matches(LfText).Count;
[Benchmark]
public int Word_Lowered() => LoweredBolWord.Matches(LfText).Count;
[Benchmark]
public int DotLine_Native() => NativeBolDot.Matches(LfText).Count;
[Benchmark]
public int DotLine_Lowered() => LoweredBolDot.Matches(LfText).Count;
[Benchmark]
public int Literal_Native() => NativeBolLiteral.Matches(LfText).Count;
[Benchmark]
public int Literal_Lowered() => LoweredBolLiteral.Matches(LfText).Count;
}