Add RegexOptions.AnyNewLine via parser lowering #124701

Open
danmoseley wants to merge 26 commits into dotnet:main from danmoseley:anynewline-lower-v2


danmoseley (Member) commented Feb 21, 2026

Motivation

.NET's Regex class hardcodes \n as the only newline character. With RegexOptions.Multiline, $ matches before \n but not before \r, \r\n, or Unicode line breaks. This is "by far one of the biggest gotchas" with System.Text.RegularExpressions:

// BUG: on a file with Windows \r\n line endings, .*$ captures the trailing \r
var match = Regex.Match("foo\r\nbar", ".*$", RegexOptions.Multiline);
// match.Value == "foo\r" -- not "foo"!

Users are forced into fragile workarounds like \r?$ or (\r\n|\n) to handle mixed line endings. Real-world NuGet packages show how common this is -- from the real-world regex patterns dataset:

  • (\r\n|\n) (18,474 packages) -- CSV parser manually matching both line endings
  • \r?\n in PEM key parsing (1,964 packages) -- \r?\n sprinkled throughout with Multiline
  • $(\r?\n)? in assembly attribute matching (2,108 packages) -- using Multiline with manual newline handling
  • [\r\n]+ (2,422 packages) -- matching any newline character

These workarounds are error-prone, don't compose well with ^ and $ anchors, and miss Unicode newlines (\u0085, \u2028, \u2029).

Summary

Implements RegexOptions.AnyNewLine (api-approved) which makes $, ^, \Z, and . recognize all Unicode line boundaries: \r, \r\n, \n, \u0085 (NEL), \u2028 (LS), \u2029 (PS) -- consistent with Unicode TR18 RL1.6 and PCRE2's (*ANY) behavior.

With AnyNewLine, the example above just works:

var match = Regex.Match("foo\r\nbar", ".*$", RegexOptions.Multiline | RegexOptions.AnyNewLine);
// match.Value == "foo"

Approach: Parser Lowering

All logic lives in RegexParser.cs -- no changes to the interpreter, compiler, or source generator engines. Each affected construct is lowered into an equivalent RegexNode sub-tree:

Construct                 Lowered to
$ (no Multiline) / \Z     (?=\r\n\z|\r?\z)|(?<!\r)(?=\n\z)|(?=[\u0085\u2028\u2029]\z)
$ (Multiline)             (?=\r\n|\r|[\u0085\u2028\u2029]|\z)|(?<!\r)(?=\n)
^ (Multiline)             (?<=\A|\r\n|\n|[\u0085\u2028\u2029])|(?<=\r)(?!\n)
.                         [^\r\n\u0085\u2028\u2029] (but Singleline takes precedence)
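
The lowered forms in the table can be tried on a build that does not have AnyNewLine by writing them as ordinary lookarounds. The following is only a sketch under that assumption: the names `AnyNewLineEmulation`, `Bol`, `Eol`, `Dot`, `FirstLine` and `NonEmptyLines` are made up for illustration, the patterns are copied from the table, and the PR itself builds RegexNode sub-trees in the parser rather than pattern strings.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Illustrative emulation only: reproduces the table's Multiline lowerings
// with lookarounds that the existing engines already support.
public static class AnyNewLineEmulation
{
    // ^ (Multiline): after start of string or a newline, never between \r and \n.
    public const string Bol = @"(?:(?<=\A|\r\n|\n|[\u0085\u2028\u2029])|(?<=\r)(?!\n))";

    // $ (Multiline): before a newline or end of string, never between \r and \n.
    public const string Eol = @"(?:(?=\r\n|\r|[\u0085\u2028\u2029]|\z)|(?<!\r)(?=\n))";

    // . : any character except the newline characters listed in the table.
    public const string Dot = @"[^\r\n\u0085\u2028\u2029]";

    // Stand-in for Regex.Match(input, ".*$", Multiline | AnyNewLine).Value
    public static string FirstLine(string input) =>
        Regex.Match(input, Dot + "*" + Eol).Value;

    // Stand-in for Regex.Matches(input, "^.+$", Multiline | AnyNewLine)
    public static string[] NonEmptyLines(string input) =>
        Regex.Matches(input, Bol + Dot + "+" + Eol)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();
}
```

With these helpers, `FirstLine("foo\r\nbar")` returns `"foo"` rather than `"foo\r"`, and `NonEmptyLines("a\r\nb\rc\nd\u2028e")` yields the five lines `a`, `b`, `c`, `d`, `e`.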

Key design choices:

  • \r\n is atomic: $ never matches between \r and \n. This is enforced with lookbehind/lookahead guards.
  • Singleline takes precedence: . with Singleline | AnyNewLine matches everything (including newlines), consistent with Singleline's documented behavior.
  • \A and \z are unaffected: absolute start/end anchors don't change.
  • Incompatible with NonBacktracking and ECMAScript: throws ArgumentOutOfRangeException (lowered patterns use lookaround).
  • Zero perf impact on existing patterns: the lowering is gated on the AnyNewLine flag, so patterns that don't use it take the same code paths as before. The only new cost is a flag check ((_options & RegexOptions.AnyNewLine) != 0) in the parser for $, ^, \Z, and ., which is negligible.
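
The "\r\n is atomic" rule can be checked by asking where the lowered $ matches. This is a minimal sketch that assumes the Multiline lowering is exactly the one listed in the table above; `CrLfAtomicityDemo` and `EolPositions` are hypothetical names, not part of the PR.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CrLfAtomicityDemo
{
    // Lowered form of $ under Multiline | AnyNewLine, copied from the PR description.
    // The (?<!\r) guard on the second branch is what rejects the position
    // between \r and \n.
    private const string Eol = @"(?=\r\n|\r|[\u0085\u2028\u2029]|\z)|(?<!\r)(?=\n)";

    // Returns every index at which the lowered $ produces a (zero-width) match.
    public static int[] EolPositions(string input) =>
        Regex.Matches(input, Eol)
             .Cast<Match>()
             .Select(m => m.Index)
             .ToArray();
}
```

For `"a\r\nb"` this returns the indices 1 and 4: the position before the \r\n pair and the end of the string, but not index 2, which lies between \r and \n. For `"a\nb"` it returns 1 and 3, so a bare \n is still a line boundary.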

Out of scope: \R

Unicode TR18 RL1.6 also recommends a meta-character \R for matching any newline sequence (consuming the characters), equivalent to (?:\r\n|[\n\v\f\r\u0085\u2028\u2029]). This is distinct from what AnyNewLine does: AnyNewLine modifies the behavior of existing zero-width anchors (^, $, \Z) and the character class ., while \R would be a new consuming pattern element. Adding \R could be done independently as a separate feature.
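
To make the distinction concrete: the \R-equivalent pattern consumes the newline characters, so it suits Split or Replace, whereas the anchors only assert a position. A small sketch using the equivalent pattern quoted above with the existing API (`NewlineSequenceDemo` and `SplitLines` are illustrative names, not something this PR adds):

```csharp
using System;
using System.Text.RegularExpressions;

public static class NewlineSequenceDemo
{
    // The \R-equivalent pattern from the paragraph above. \r\n is listed first
    // so that a CRLF pair is consumed as a single separator.
    private const string AnyNewlineSequence = @"(?:\r\n|[\n\v\f\r\u0085\u2028\u2029])";

    // Splits text on any newline sequence; the separators are consumed.
    public static string[] SplitLines(string input) =>
        Regex.Split(input, AnyNewlineSequence);
}
```

`SplitLines("a\r\nb\nc\u2028d")` returns `a`, `b`, `c`, `d`, and `SplitLines("a\r\n\r\nb")` returns `a`, an empty string, and `b`, because each \r\n is treated as one separator.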

Changes

Production code

  • RegexOptions.cs -- add AnyNewLine = 0x0800
  • RegexParser.cs -- lowering methods AnyNewLineEndZNode(), AnyNewLineEolNode(), AnyNewLineBolNode(), plus . handling
  • RegexCharClass.cs -- add NotNewLineOrCarriageReturnClass constant
  • Regex.cs / RegexCompilationInfo.cs -- validation

Tests

  • ~120 new test cases covering dot, anchors ($, ^, \Z), RightToLeft, Singleline, Multiline, Replace, Split, Count, EnumerateMatches, NonBacktracking rejection, edge cases (adjacent newlines, empty lines, all-newline strings), and PCRE2-inspired scenarios

Fixes #25598

danmoseley and others added 14 commits February 20, 2026 21:59
Add AnyNewLine = 0x0800 to RegexOptions enum. Update ValidateOptions to
bump MaxOptionShift to 12 and reject AnyNewLine | NonBacktracking.
ECMAScript already rejects unknown options via allowlist.

Update source generator to include AnyNewLine in SupportedOptions mask.
Update tests that used 0x800 as an invalid option value to use 0x1000.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When AnyNewLine is set without Multiline, lower $ from EndZ into an
equivalent sub-tree: (?=\r\n\z|\r?\z)|(?<!\r)(?=\n\z)

This matches at end of string, or before \r\n, \r, or \n at end of
string, but not between \r and \n. Works across all engines
(interpreter, compiled, source generator) since it's pure parser
lowering.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When AnyNewLine is set, lower \Z using the same sub-tree as $ without
Multiline. \Z is not affected by Multiline, so the same lowering
applies regardless.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When both Multiline and AnyNewLine are set, lower $ to:
  (?=\r\n|\r|\z)|(?<!\r)(?=\n)

This matches at \r\n, \r, \n boundaries and end-of-string,
without matching between \r and \n of a \r\n sequence.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When both Multiline and AnyNewLine are set, lower ^ to:
  (?<=\A|\r\n|\n)|(?<=\r)(?!\n)

This matches after \r\n, \n, bare \r (not followed by \n), and
at start of string. Without Multiline, ^ remains \A unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When AnyNewLine is set (without Singleline), lower . to [^\n\r]
instead of [^\n], so dot does not match \r or \n.

Add NotNewLineOrCarriageReturnClass constant to RegexCharClass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Combined ^/$/. tests, Replace/Split, RightToLeft, mixed newlines,
empty lines, \Z with trailing newlines, and edge cases.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Integration tests using a ~50 char string with all newline types
(\r\n, \r, \n, \u0085, \u2028, \u2029) exercising ^, $, \Z, and .
together. Replace/Split tests with MatchEvaluator line numbering.
Deduplicated cases moved into per-feature tests (RightToLeft,
empty lines).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Expand test coverage across all AnyNewLine-affected constructs:
- Dollar, EndZ, DollarMultiline, CaretMultiline, Dot test data
  with adjacent newlines, newlines at string boundaries,
  empty segments, RightToLeft, and all Unicode newline types
- Advanced tests: inline options, backreferences, conditionals,
  alternation with anchors, lookahead/lookbehind, quantified dot,
  lazy quantifiers, named/atomic groups, word boundaries near
  newlines, explicit char classes unaffected
- Methods test: IsMatch, Count, EnumerateMatches, Match with
  startat, Replace with group ref, Split
- Unicode expansion: \s/\S behavior, \w behavior, \p{Zl}/\p{Zp}
  categories, adjacent Unicode+ASCII newlines, baselines without
  AnyNewLine

No bugs found — all initial test failures were wrong expectations.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Verify the fixer correctly emits RegexOptions.Multiline |
RegexOptions.AnyNewLine in enum value order when upgrading
to GeneratedRegex.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Test cases derived from cross-validation with PCRE2 NEWLINE_ANY
behavior (BSD-licensed) and analysis of real-world patterns from
dotnet/runtime-assets:
- (.+)# greedy where .+ cannot cross newlines (PCRE2 JIT 472)
- (.)(.) requiring consecutive non-newlines (PCRE2 JIT 471)
- (.). with mixed newline types (PCRE2 JIT 469)
- Blank line detection (^ +$) with \n, \r\n, \u0085 separators

All 31,528 tests pass. No bugs found — our implementation is
fully consistent with PCRE2 NEWLINE_ANY behavior and handles
real-world patterns correctly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add more RightToLeft + AnyNewLine tests (various newline types, dot,
  anchors, \Z)
- Add more Singleline | AnyNewLine tests (all newline types, combined
  with Multiline)
- Replace RegexOptions.AnyNewLine with RegexHelpers.RegexOptionAnyNewLine
  throughout tests for net481 compilation compatibility
- Wrap Count/EnumerateMatches in #if NET for net481 compat
- Add clarifying comments on Split behavior with/without AnyNewLine

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
danmoseley (Member, Author) commented:

(Finally got around to having AI finish my lowering branch..)

Copilot AI (Contributor) left a comment

Pull request overview

This pull request implements RegexOptions.AnyNewLine (value 0x0800 = 2048), a new regex option that makes ^, $, \Z, and . recognize all Unicode line boundaries (\r, \r\n, \n, \u0085 NEL, \u2028 LS, \u2029 PS) instead of only \n. This addresses a major usability issue where users had to manually work around .NET's hardcoded \n-only line ending behavior.

Changes:

  • Added RegexOptions.AnyNewLine = 0x0800 enum value with incompatibility checks for NonBacktracking and ECMAScript modes
  • Implemented parser-level lowering of ^, $, \Z, and . into equivalent lookaround-based RegexNode trees when AnyNewLine is enabled
  • Added comprehensive test coverage (~800 new test lines) covering all anchor types, newline combinations, RightToLeft mode, inline options, and edge cases

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

File Description
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexOptions.cs Added AnyNewLine = 0x0800 enum value with XML documentation
src/libraries/System.Text.RegularExpressions/ref/System.Text.RegularExpressions.cs Updated ref assembly with AnyNewLine = 2048
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs Updated MaxOptionShift to 12 and added AnyNewLine to NonBacktracking incompatibility check
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexParser.cs Implemented lowering methods (AnyNewLineEndZNode, AnyNewLineEolNode, AnyNewLineBolNode) and integrated into ^, $, \Z, . parsing
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs Added NotNewLineOrCarriageReturnClass constant for . with AnyNewLine
src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Parser.cs Added AnyNewLine to source generator's supported options
src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/Regex.Match.Tests.cs Added ~800 lines of comprehensive tests for all anchor types, newline combinations, and edge cases
src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/Regex.Tests.Common.cs Added RegexOptionAnyNewLine constant for test compatibility
src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/Regex.Ctor.Tests.cs Updated invalid option test from 0x800 to 0x1000; added NonBacktracking+AnyNewLine incompatibility test
src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/Regex.MultipleMatches.Tests.cs Updated invalid option comments and tests from 0x800 to 0x1000
src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/Regex.EnumerateMatches.Tests.cs Updated invalid option tests from 0x800 to 0x1000
src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/RegexGeneratorParserTests.cs Updated invalid option tests from 0x800 to 0x1000
src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/UpgradeToGeneratedRegexAnalyzerTests.cs Updated tests for 0x1000 as invalid option; added AnyNewLine test case for code fixer

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated no new comments.

danmoseley (Member, Author) commented:

@MihuBot benchmark Regex

MihuBot commented Feb 21, 2026

danmoseley (Member, Author) commented:

MihuBot confirms zero perf impact on existing patterns/options.

danmoseley (Member, Author) commented:

AnyNewLine Performance Analysis (Release, Compiled, .NET 11.0, BenchmarkDotNet)

Measured impact of converting existing newline-workaround patterns to simplified AnyNewLine equivalents. All scenarios use RegexOptions.Compiled (representative of source-generated too). Measured with BenchmarkDotNet (InProcess, ShortRun). All match counts verified identical between old and new patterns.

Section 1: Real-World Patterns on Windows \r\n Text

Old Pattern                     New Pattern (+ AnyNewLine)   Old (μs)   New (μs)   Ratio
^.+\r?$ (1K lines)              ^.+$                         46.7       48.8       1.05x
^.+\r?$ (10K lines)             ^.+$                         1,694      1,760      1.04x
\[assembly:...\]\s*$(\r?\n)?    \[assembly:...\]\s*$         38.3       32.4       0.85x
^([^\s:]+):\s*(.+?)\r?$         ^([^\s:]+):\s*(.+?)$         105.9      105.9      1.00x
^# .+\r?$                       ^# .+$                       11.1       9.1        0.83x
^.+\r?$ (CSV, 1K rows)          ^.+$                         44.4       49.2       1.11x
[^\r\n]+                        .+                           44.2       43.8       0.99x
\w+\r?$                         \w+$                         90.8       128.7      1.42x
(?:^|\r\n)\w+                   ^\w+                         208.7      214.5      1.03x

Section 2: Unix \n Text (overhead of just enabling the flag)

Old Pattern   New Pattern (+ AnyNewLine)   Old (μs)   New (μs)   Ratio
^.+$          ^.+$                         43.5       48.9       1.12x
[^\n]+        .+                           39.0       44.9       1.15x

Section 3: Mixed \n/\r\n Text

Old Pattern                   New Pattern (+ AnyNewLine)   Old (μs)   New (μs)   Ratio
[^\r\n\u0085\u2028\u2029]+    .+                           45.4       44.2       0.97x
^.+\r?$ (1K lines)            ^.+$                         44.1       50.1       1.14x

Section 4: Non-anchor/dot Patterns (zero impact expected)

Old Pattern   New Pattern (+ AnyNewLine)   Old (μs)   New (μs)   Ratio
\r\n|\r|\n    \r\n|\r|\n                   20.0       21.7       1.08x
\w+           \w+                          322.4      336.4      1.04x

Section 5: Pathological Cases (unlikely in practice)

Old Pattern               New Pattern (+ AnyNewLine)   Old (μs)   New (μs)   Ratio
$                         $                            98.2       134.1      1.37x
^                         ^                            145.6      131.6      0.90x
\w+\r?\Z (329K chars)     \w+\Z                        494.2      1,039.3    2.10x

Summary

  1. Real-world patterns in Compiled mode show 0.83x--1.14x -- essentially zero cost, and sometimes faster because the AnyNewLine pattern is simpler (e.g., ^# .+$ vs ^# .+\r?$ -- removing the \r? node saves more than the lowered $ costs).

  2. Where small regressions occur (1.1x--1.4x), the cause is the lowered anchor tree: a native $ (Eol) is a single "is next char \n?" check, but AnyNewLine lowers it to a lookahead alternation like (?=\r\n|\r|\n|\u0085|\u2028|\u2029|\z). Even when the input only contains \r\n, the engine must evaluate the alternation branches. This overhead is proportionally more visible when the anchor dominates the work (e.g., \w+$ where the \w+ match is short), and nearly invisible when .+ dominates each line's work (e.g., ^.+$ at 1.04x).

  3. Patterns without anchors or dot are completely unaffected (1.04--1.08x, within noise) -- the flag only changes behavior of ., ^, $, \Z.

  4. Only pathological case: \w+\Z on very large input (329K chars) at 2.1x -- the lowered \Z alternation tree is evaluated during backtracking at many positions. Unlikely in practice.

  5. In Compiled/source-generated mode, the JIT compiles the lowered alternation branches into efficient single-char comparisons, keeping overhead minimal. Interpreted mode shows larger gaps (2--3x for typical patterns) but AnyNewLine + interpreted + perf-sensitive is an unlikely combination.

Benchmark source code (BenchmarkDotNet)
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Columns;
using BenchmarkDotNet.Reports;
using BenchmarkDotNet.Toolchains.InProcess.Emit;

BenchmarkRunner.Run<AnyNewLineBenchmarks>(
    DefaultConfig.Instance
        .WithSummaryStyle(SummaryStyle.Default.WithRatioStyle(RatioStyle.Percentage))
        .AddJob(Job.ShortRun.WithToolchain(InProcessEmitToolchain.Instance)));

[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "RatioSD", "Alloc Ratio")]
public class AnyNewLineBenchmarks
{
    private const RegexOptions AnyNewLine = (RegexOptions)0x0800;

    private static string GenerateText(int lineCount, string[] newlines)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < lineCount; i++)
        {
            sb.Append("Lorem ipsum dolor sit amet ");
            sb.Append(i);
            sb.Append(newlines[i % newlines.Length]);
        }
        return sb.ToString();
    }

    private static readonly string WinText1K = GenerateText(1000, ["\r\n"]);
    private static readonly string WinText10K = GenerateText(10000, ["\r\n"]);
    private static readonly string UnixText1K = GenerateText(1000, ["\n"]);
    private static readonly string MixedNR1K = GenerateText(1000, ["\n", "\r\n"]);
    private static readonly string MixedAll1K = GenerateText(1000,
        ["\n", "\r\n", "\r", "\u0085", "\u2028", "\u2029"]);

    private static readonly string AssemblyInfo;
    private static readonly string KvConfig;
    private static readonly string Markdown;
    private static readonly string CsvData;

    static AnyNewLineBenchmarks()
    {
        var sb = new StringBuilder();
        string[] attrs = {
            "[assembly: AssemblyTitle(\"MyApp\")]",
            "[assembly: AssemblyDescription(\"A sample app\")]",
            "[assembly: AssemblyConfiguration(\"\")]",
            "[assembly: AssemblyCompany(\"Contoso\")]",
            "[assembly: AssemblyProduct(\"MyApp\")]",
            "[assembly: AssemblyCopyright(\"Copyright 2024\")]",
            "[assembly: AssemblyTrademark(\"\")]",
            "[assembly: AssemblyCulture(\"\")]",
            "[assembly: AssemblyVersion(\"1.0.0.0\")]",
            "[assembly: AssemblyFileVersion(\"1.0.0.0\")]"
        };
        foreach (var attr in attrs) { sb.Append(attr); sb.Append("\r\n"); }
        AssemblyInfo = string.Concat(Enumerable.Repeat(sb.ToString(), 50));

        sb.Clear();
        string[] keys = { "Server", "Database", "User", "Password", "Timeout",
                          "MaxPool", "MinPool", "Encrypt", "TrustCert", "AppName" };
        for (int i = 0; i < 50; i++)
        {
            sb.Append(keys[i % keys.Length]); sb.Append(": value_"); sb.Append(i); sb.Append("\r\n");
        }
        KvConfig = string.Concat(Enumerable.Repeat(sb.ToString(), 20));

        sb.Clear();
        for (int i = 0; i < 200; i++)
        {
            sb.Append($"# Heading {i}\r\n");
            sb.Append($"Some paragraph text about topic {i}.\r\n");
            sb.Append($"Another line of content here.\r\n\r\n");
        }
        Markdown = sb.ToString();

        sb.Clear();
        sb.Append("Name,Age,City,Email\r\n");
        for (int i = 0; i < 1000; i++)
            sb.Append($"User{i},{20 + i % 50},City{i % 100},user{i}@example.com\r\n");
        CsvData = sb.ToString();
    }

    // Section 1: Real-world on Windows \r\n text
    private static readonly Regex Old_1a = new(@"^.+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_1a = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
    [Benchmark(Baseline = true, Description = "1a_Lines1K_Old")]
    public int Lines1K_Old() => Old_1a.Matches(WinText1K).Count;
    [Benchmark(Description = "1a_Lines1K_New")]
    public int Lines1K_New() => New_1a.Matches(WinText1K).Count;

    private static readonly Regex Old_1b = new(@"^.+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_1b = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
    [Benchmark(Description = "1b_Lines10K_Old")]
    public int Lines10K_Old() => Old_1b.Matches(WinText10K).Count;
    [Benchmark(Description = "1b_Lines10K_New")]
    public int Lines10K_New() => New_1b.Matches(WinText10K).Count;

    private static readonly Regex Old_2 = new(@"\[assembly:\s*\w+\(.*?\)\]\s*$(\r?\n)?",
        RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_2 = new(@"\[assembly:\s*\w+\(.*?\)\]\s*$",
        RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
    [Benchmark(Description = "2_Assembly_Old")]
    public int Assembly_Old() => Old_2.Matches(AssemblyInfo).Count;
    [Benchmark(Description = "2_Assembly_New")]
    public int Assembly_New() => New_2.Matches(AssemblyInfo).Count;

    private static readonly Regex Old_3 = new(@"^([^\s:]+):\s*(.+?)\r?$",
        RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_3 = new(@"^([^\s:]+):\s*(.+?)$",
        RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
    [Benchmark(Description = "3_KeyVal_Old")]
    public int KeyVal_Old() => Old_3.Matches(KvConfig).Count;
    [Benchmark(Description = "3_KeyVal_New")]
    public int KeyVal_New() => New_3.Matches(KvConfig).Count;

    private static readonly Regex Old_4 = new(@"^# .+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_4 = new(@"^# .+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
    [Benchmark(Description = "4_Markdown_Old")]
    public int Markdown_Old() => Old_4.Matches(Markdown).Count;
    [Benchmark(Description = "4_Markdown_New")]
    public int Markdown_New() => New_4.Matches(Markdown).Count;

    private static readonly Regex Old_5 = new(@"^.+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_5 = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
    [Benchmark(Description = "5_CSV_Old")]
    public int CSV_Old() => Old_5.Matches(CsvData).Count;
    [Benchmark(Description = "5_CSV_New")]
    public int CSV_New() => New_5.Matches(CsvData).Count;

    private static readonly Regex Old_6 = new(@"[^\r\n]+", RegexOptions.Compiled);
    private static readonly Regex New_6 = new(@".+", RegexOptions.Compiled | AnyNewLine);
    [Benchmark(Description = "6_DotExcl_Old")]
    public int DotExcl_Old() => Old_6.Matches(WinText1K).Count;
    [Benchmark(Description = "6_DotExcl_New")]
    public int DotExcl_New() => New_6.Matches(WinText1K).Count;

    private static readonly Regex Old_7 = new(@"\w+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_7 = new(@"\w+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
    [Benchmark(Description = "7_WordEOL_Old")]
    public int WordEOL_Old() => Old_7.Matches(WinText1K).Count;
    [Benchmark(Description = "7_WordEOL_New")]
    public int WordEOL_New() => New_7.Matches(WinText1K).Count;

    private static readonly Regex Old_8 = new(@"(?:^|\r\n)\w+", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_8 = new(@"^\w+", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
    [Benchmark(Description = "8_LineSt_Old")]
    public int LineStart_Old() => Old_8.Matches(WinText1K).Count;
    [Benchmark(Description = "8_LineSt_New")]
    public int LineStart_New() => New_8.Matches(WinText1K).Count;

    // Section 2: Unix \n text (control)
    private static readonly Regex Old_9 = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_9 = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
    [Benchmark(Description = "9_UnixLines_Old")]
    public int UnixLines_Old() => Old_9.Matches(UnixText1K).Count;
    [Benchmark(Description = "9_UnixLines_New")]
    public int UnixLines_New() => New_9.Matches(UnixText1K).Count;

    private static readonly Regex Old_10 = new(@"[^\n]+", RegexOptions.Compiled);
    private static readonly Regex New_10 = new(@".+", RegexOptions.Compiled | AnyNewLine);
    [Benchmark(Description = "10_UnixDot_Old")]
    public int UnixDot_Old() => Old_10.Matches(UnixText1K).Count;
    [Benchmark(Description = "10_UnixDot_New")]
    public int UnixDot_New() => New_10.Matches(UnixText1K).Count;

    // Section 3: Mixed newline text
    private static readonly Regex Old_11 = new(@"[^\r\n\u0085\u2028\u2029]+", RegexOptions.Compiled);
    private static readonly Regex New_11 = new(@".+", RegexOptions.Compiled | AnyNewLine);
    [Benchmark(Description = "11_MixedDot_Old")]
    public int MixedDot_Old() => Old_11.Matches(MixedAll1K).Count;
    [Benchmark(Description = "11_MixedDot_New")]
    public int MixedDot_New() => New_11.Matches(MixedAll1K).Count;

    private static readonly Regex Old_12 = new(@"^.+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_12 = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
    [Benchmark(Description = "12_MixedLines_Old")]
    public int MixedLines_Old() => Old_12.Matches(MixedNR1K).Count;
    [Benchmark(Description = "12_MixedLines_New")]
    public int MixedLines_New() => New_12.Matches(MixedNR1K).Count;

    // Section 4: Non-anchor patterns (zero impact)
    private static readonly Regex Old_14 = new(@"\r\n|\r|\n", RegexOptions.Compiled);
    private static readonly Regex New_14 = new(@"\r\n|\r|\n", RegexOptions.Compiled | AnyNewLine);
    [Benchmark(Description = "14_Literal_Old")]
    public int Literal_Old() => Old_14.Matches(MixedAll1K).Count;
    [Benchmark(Description = "14_Literal_New")]
    public int Literal_New() => New_14.Matches(MixedAll1K).Count;

    private static readonly Regex Old_15 = new(@"\w+", RegexOptions.Compiled);
    private static readonly Regex New_15 = new(@"\w+", RegexOptions.Compiled | AnyNewLine);
    [Benchmark(Description = "15_Words_Old")]
    public int Words_Old() => Old_15.Matches(WinText1K).Count;
    [Benchmark(Description = "15_Words_New")]
    public int Words_New() => New_15.Matches(WinText1K).Count;

    // Section 5: Pathological
    private static readonly Regex Old_P1 = new(@"$", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_P1 = new(@"$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
    [Benchmark(Description = "P1_BareEOL_Old")]
    public int BareEOL_Old() => Old_P1.Matches(WinText1K).Count;
    [Benchmark(Description = "P1_BareEOL_New")]
    public int BareEOL_New() => New_P1.Matches(WinText1K).Count;

    private static readonly Regex Old_P2 = new(@"^", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_P2 = new(@"^", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);
    [Benchmark(Description = "P2_BareBOL_Old")]
    public int BareBOL_Old() => Old_P2.Matches(WinText1K).Count;
    [Benchmark(Description = "P2_BareBOL_New")]
    public int BareBOL_New() => New_P2.Matches(WinText1K).Count;

    private static readonly Regex Old_P3 = new(@"\w+\r?\Z", RegexOptions.Compiled);
    private static readonly Regex New_P3 = new(@"\w+\Z", RegexOptions.Compiled | AnyNewLine);
    [Benchmark(Description = "P3_EndZ_Old")]
    public bool EndZ_Old() => Old_P3.IsMatch(WinText10K);
    [Benchmark(Description = "P3_EndZ_New")]
    public bool EndZ_New() => New_P3.IsMatch(WinText10K);
}

danmoseley (Member, Author) commented:

Verified that adding VT/FF made no material perf difference.

danmoseley (Member, Author) commented Mar 3, 2026

Below is an analysis of benchmarks that compare AnyNewLine (ANL) to the previous workarounds, and that add AnyNewLine to existing (perhaps pathological) patterns. TL;DR: sometimes ANL is faster than the workaround and sometimes not (it is just more functional, e.g. it handles more newline types). In some pathological cases ANL is slower. One principal example is a file that contains only \n newlines, where one option is simply not to use ANL. No surprises here.


AnyNewLine Performance Benchmark Results

Setup

  • Branch: anynewline-lower-v2 (PR #124701)
  • Baseline: PR base commit 213a41d3d95b (main)
  • HEAD: f08383ab77d (with VT/FF support)
  • Tool: BenchmarkDotNet 0.14.0, InProcessEmitToolchain, MediumRun
  • Runtime: locally-built .NET 11 via testhost
  • Machine: Developer workstation (Windows), not a controlled benchmark environment

HEAD: AnyNewLine ("New") vs Workaround Patterns ("Old")

Both columns run on the same HEAD binary. "Old" uses manual workaround patterns (no AnyNewLine). "New" uses simplified AnyNewLine patterns. All patterns use Compiled | Multiline unless noted.

Benchmark       Old Pattern                     New Pattern              Old (μs)   New (μs)   Ratio   Notes
1a_Lines1K      ^.+\r?$                         ^.+$                     48.1       48.7       1.01x
1b_Lines10K     ^.+\r?$                         ^.+$                     1,791      1,738      0.97x
2_Assembly      \[assembly:...\]\s*$(\r?\n)?    \[assembly:...\]\s*$     39.4       32.8       0.83x   AnyNewLine faster
3_KeyVal        ^([^\s:]+):\s*(.+?)\r?$         ^([^\s:]+):\s*(.+?)$     109.2      106.3      0.97x
4_Markdown      ^# .+\r?$                       ^# .+$                   11.2       9.6        0.86x   AnyNewLine faster
5_CSV           ^.+\r?$                         ^.+$                     45.9       48.2       1.05x
6_DotExcl       [^\r\n]+                        .+                       42.9       46.1       1.07x   No Multiline
7_WordEOL       \w+\r?$                         \w+$                     89.6       127.6      1.42x   Lookaround cost
8_LineSt        (?:^|\r\n)\w+                   ^\w+                     201.6      203.9      1.01x
9_UnixLines     ^.+$                            ^.+$                     42.5       47.8       1.13x   \n-only input
10_UnixDot      [^\n]+                          .+                       38.7       47.5       1.23x   No Multiline, \n-only
11_MixedDot     [^\r\n\u0085\u2028\u2029]+      .+                       45.4       48.9       1.08x   No Multiline, all newlines
12_MixedLines   ^.+\r?$                         ^.+$                     44.0       48.6       1.10x   \n/\r\n input
14_Literal      \r\n|\r|\n                      \r\n|\r|\n               21.7       20.6       0.95x   No anchors
15_Words        \w+                             \w+                      310.5      305.7      0.98x   No anchors
P1_BareEOL      $                               $                        93.3       124.3      1.33x   $ lowering cost
P2_BareBOL      ^                               ^                        139.7      116.1      0.83x   AnyNewLine faster
P3_EndZ         \w+\r?\Z                        \w+\Z                    489.1      895.6      1.83x   \Z lowering cost, no Multiline

Baseline (main) vs HEAD: "Old" Patterns Only

These patterns don't use AnyNewLine, so the regex code path is identical on both builds. Differences are machine noise.

Benchmark       Pattern                         Baseline (μs)   HEAD (μs)   Δ
1a_Lines1K      ^.+\r?$                         45.0            48.1        +7%
1b_Lines10K     ^.+\r?$                         1,311           1,791       +37%
2_Assembly      \[assembly:...\]\s*$(\r?\n)?    37.1            39.4        +6%
3_KeyVal        ^([^\s:]+):\s*(.+?)\r?$         108.6           109.2       +1%
4_Markdown      ^# .+\r?$                       11.1            11.2        +1%
5_CSV           ^.+\r?$                         45.8            45.9        +0%
6_DotExcl       [^\r\n]+                        43.2            42.9        −1%
7_WordEOL       \w+\r?$                         89.2            89.6        +0%
8_LineSt        (?:^|\r\n)\w+                   203.4           201.6       −1%
9_UnixLines     ^.+$                            44.7            42.5        −5%
10_UnixDot      [^\n]+                          37.2            38.7        +4%
11_MixedDot     [^\r\n\u0085\u2028\u2029]+      43.7            45.4        +4%
12_MixedLines   ^.+\r?$                         42.6            44.0        +3%
14_Literal      \r\n|\r|\n                      19.0            21.7        +14%
15_Words        \w+                             293.6           310.5       +6%
P1_BareEOL      $                               95.9            93.3        −3%
P2_BareBOL      ^                               138.2           139.7       +1%
P3_EndZ         \w+\r?\Z                        481.9           489.1       +1%

Conclusions

  1. No regression to existing patterns. The baseline-vs-HEAD comparison for "Old" patterns shows only machine noise (no consistent direction, outliers explained by system load).

  2. AnyNewLine perf profile unchanged after VT/FF. Adding VT (U+000B) and FF (U+000C) merged adjacent ranges in character classes, producing no measurable performance difference from the previous run.

  3. Expected costs are in line with design. The $, ^, and \Z lowerings trade a lookaround cost for correct Unicode newline handling. Bare $ (P1, +33%) and \Z (P3, +83%) show the highest overhead. Real-world patterns that combine anchors with other matching (benchmarks 1–12) mostly show small impact (0.83x to 1.23x); the exception is 7_WordEOL (1.42x), where the anchor dominates the work.

  4. Some AnyNewLine patterns are faster (2_Assembly, 4_Markdown, P2_BareBOL) because the simplified patterns allow better optimization by the regex engine.

Benchmark Code

PerfTest.csproj
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net11.0</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="BenchmarkDotNet" Version="0.14.0" />
  </ItemGroup>
</Project>
Program.cs
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Columns;
using BenchmarkDotNet.Reports;
using BenchmarkDotNet.Toolchains.InProcess.Emit;

BenchmarkRunner.Run<AnyNewLineBenchmarks>(
    DefaultConfig.Instance
        .WithSummaryStyle(SummaryStyle.Default.WithRatioStyle(BenchmarkDotNet.Columns.RatioStyle.Percentage))
        .AddJob(Job.MediumRun.WithToolchain(InProcessEmitToolchain.Instance)));

[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "RatioSD", "Alloc Ratio")]
public class AnyNewLineBenchmarks
{
    private const RegexOptions AnyNewLine = (RegexOptions)0x0800;

    // ── Inputs ──────────────────────────────────────────────────────
    private static string GenerateText(int lineCount, string[] newlines)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < lineCount; i++)
        {
            sb.Append("Lorem ipsum dolor sit amet ");
            sb.Append(i);
            sb.Append(newlines[i % newlines.Length]);
        }
        return sb.ToString();
    }

    private static readonly string WinText1K = GenerateText(1000, ["\r\n"]);
    private static readonly string WinText10K = GenerateText(10000, ["\r\n"]);
    private static readonly string UnixText1K = GenerateText(1000, ["\n"]);
    private static readonly string MixedNR1K = GenerateText(1000, ["\n", "\r\n"]);
    private static readonly string MixedAll1K = GenerateText(1000, ["\n", "\r\n", "\r", "\u0085", "\u2028", "\u2029"]);

    private static readonly string AssemblyInfo;
    private static readonly string KvConfig;
    private static readonly string Markdown;
    private static readonly string CsvData;

    static AnyNewLineBenchmarks()
    {
        var sb = new StringBuilder();
        string[] attrs = {
            "[assembly: AssemblyTitle(\"MyApp\")]",
            "[assembly: AssemblyDescription(\"A sample app\")]",
            "[assembly: AssemblyConfiguration(\"\")]",
            "[assembly: AssemblyCompany(\"Contoso\")]",
            "[assembly: AssemblyProduct(\"MyApp\")]",
            "[assembly: AssemblyCopyright(\"Copyright 2024\")]",
            "[assembly: AssemblyTrademark(\"\")]",
            "[assembly: AssemblyCulture(\"\")]",
            "[assembly: AssemblyVersion(\"1.0.0.0\")]",
            "[assembly: AssemblyFileVersion(\"1.0.0.0\")]"
        };
        foreach (var attr in attrs) { sb.Append(attr); sb.Append("\r\n"); }
        AssemblyInfo = string.Concat(Enumerable.Repeat(sb.ToString(), 50));

        sb.Clear();
        string[] keys = { "Server", "Database", "User", "Password", "Timeout",
                          "MaxPool", "MinPool", "Encrypt", "TrustCert", "AppName" };
        for (int i = 0; i < 50; i++)
        {
            sb.Append(keys[i % keys.Length]); sb.Append(": value_"); sb.Append(i); sb.Append("\r\n");
        }
        KvConfig = string.Concat(Enumerable.Repeat(sb.ToString(), 20));

        sb.Clear();
        for (int i = 0; i < 200; i++)
        {
            sb.Append($"# Heading {i}\r\n");
            sb.Append($"Some paragraph text about topic {i}.\r\n");
            sb.Append($"Another line of content here.\r\n\r\n");
        }
        Markdown = sb.ToString();

        sb.Clear();
        sb.Append("Name,Age,City,Email\r\n");
        for (int i = 0; i < 1000; i++)
            sb.Append($"User{i},{20 + i % 50},City{i % 100},user{i}@example.com\r\n");
        CsvData = sb.ToString();
    }

    // ── Section 1: Real-world patterns on Windows \r\n text ─────────

    // 1a. Line matching 1K
    private static readonly Regex Old_1a = new(@"^.+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_1a = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);

    [Benchmark(Baseline = true, Description = "1a_Lines1K_Old")]
    public int Lines1K_Old() => Old_1a.Matches(WinText1K).Count;
    [Benchmark(Description = "1a_Lines1K_New")]
    public int Lines1K_New() => New_1a.Matches(WinText1K).Count;

    // 1b. Line matching 10K
    private static readonly Regex Old_1b = new(@"^.+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_1b = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);

    [Benchmark(Description = "1b_Lines10K_Old")]
    public int Lines10K_Old() => Old_1b.Matches(WinText10K).Count;
    [Benchmark(Description = "1b_Lines10K_New")]
    public int Lines10K_New() => New_1b.Matches(WinText10K).Count;

    // 2. Assembly attributes
    private static readonly Regex Old_2 = new(@"\[assembly:\s*\w+\(.*?\)\]\s*$(\r?\n)?",
        RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_2 = new(@"\[assembly:\s*\w+\(.*?\)\]\s*$",
        RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);

    [Benchmark(Description = "2_Assembly_Old")]
    public int Assembly_Old() => Old_2.Matches(AssemblyInfo).Count;
    [Benchmark(Description = "2_Assembly_New")]
    public int Assembly_New() => New_2.Matches(AssemblyInfo).Count;

    // 3. Key-value config
    private static readonly Regex Old_3 = new(@"^([^\s:]+):\s*(.+?)\r?$",
        RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_3 = new(@"^([^\s:]+):\s*(.+?)$",
        RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);

    [Benchmark(Description = "3_KeyVal_Old")]
    public int KeyVal_Old() => Old_3.Matches(KvConfig).Count;
    [Benchmark(Description = "3_KeyVal_New")]
    public int KeyVal_New() => New_3.Matches(KvConfig).Count;

    // 4. Markdown headings
    private static readonly Regex Old_4 = new(@"^# .+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_4 = new(@"^# .+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);

    [Benchmark(Description = "4_Markdown_Old")]
    public int Markdown_Old() => Old_4.Matches(Markdown).Count;
    [Benchmark(Description = "4_Markdown_New")]
    public int Markdown_New() => New_4.Matches(Markdown).Count;

    // 5. CSV line parsing
    private static readonly Regex Old_5 = new(@"^.+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_5 = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);

    [Benchmark(Description = "5_CSV_Old")]
    public int CSV_Old() => Old_5.Matches(CsvData).Count;
    [Benchmark(Description = "5_CSV_New")]
    public int CSV_New() => New_5.Matches(CsvData).Count;

    // 6. [^\r\n]+ vs .+
    private static readonly Regex Old_6 = new(@"[^\r\n]+", RegexOptions.Compiled);
    private static readonly Regex New_6 = new(@".+", RegexOptions.Compiled | AnyNewLine);

    [Benchmark(Description = "6_DotExcl_Old")]
    public int DotExcl_Old() => Old_6.Matches(WinText1K).Count;
    [Benchmark(Description = "6_DotExcl_New")]
    public int DotExcl_New() => New_6.Matches(WinText1K).Count;

    // 7. \w+\r?$ vs \w+$
    private static readonly Regex Old_7 = new(@"\w+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_7 = new(@"\w+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);

    [Benchmark(Description = "7_WordEOL_Old")]
    public int WordEOL_Old() => Old_7.Matches(WinText1K).Count;
    [Benchmark(Description = "7_WordEOL_New")]
    public int WordEOL_New() => New_7.Matches(WinText1K).Count;

    // 8. (?:^|\r\n)\w+ vs ^\w+
    private static readonly Regex Old_8 = new(@"(?:^|\r\n)\w+", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_8 = new(@"^\w+", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);

    [Benchmark(Description = "8_LineSt_Old")]
    public int LineStart_Old() => Old_8.Matches(WinText1K).Count;
    [Benchmark(Description = "8_LineSt_New")]
    public int LineStart_New() => New_8.Matches(WinText1K).Count;

    // ── Section 2: Unix \n text (control) ───────────────────────────

    // 9. Same pattern, flag only
    private static readonly Regex Old_9 = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_9 = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);

    [Benchmark(Description = "9_UnixLines_Old")]
    public int UnixLines_Old() => Old_9.Matches(UnixText1K).Count;
    [Benchmark(Description = "9_UnixLines_New")]
    public int UnixLines_New() => New_9.Matches(UnixText1K).Count;

    // 10. [^\n]+ vs .+
    private static readonly Regex Old_10 = new(@"[^\n]+", RegexOptions.Compiled);
    private static readonly Regex New_10 = new(@".+", RegexOptions.Compiled | AnyNewLine);

    [Benchmark(Description = "10_UnixDot_Old")]
    public int UnixDot_Old() => Old_10.Matches(UnixText1K).Count;
    [Benchmark(Description = "10_UnixDot_New")]
    public int UnixDot_New() => New_10.Matches(UnixText1K).Count;

    // ── Section 3: Mixed newline text ───────────────────────────────

    // 11. Full char class workaround vs .+
    private static readonly Regex Old_11 = new(@"[^\r\n\u0085\u2028\u2029]+", RegexOptions.Compiled);
    private static readonly Regex New_11 = new(@".+", RegexOptions.Compiled | AnyNewLine);

    [Benchmark(Description = "11_MixedDot_Old")]
    public int MixedDot_Old() => Old_11.Matches(MixedAll1K).Count;
    [Benchmark(Description = "11_MixedDot_New")]
    public int MixedDot_New() => New_11.Matches(MixedAll1K).Count;

    // 12. Mixed \n/\r\n line matching
    private static readonly Regex Old_12 = new(@"^.+\r?$", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_12 = new(@"^.+$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);

    [Benchmark(Description = "12_MixedLines_Old")]
    public int MixedLines_Old() => Old_12.Matches(MixedNR1K).Count;
    [Benchmark(Description = "12_MixedLines_New")]
    public int MixedLines_New() => New_12.Matches(MixedNR1K).Count;

    // ── Section 4: Non-anchor patterns (zero impact) ────────────────

    // 14. Literal newlines
    private static readonly Regex Old_14 = new(@"\r\n|\r|\n", RegexOptions.Compiled);
    private static readonly Regex New_14 = new(@"\r\n|\r|\n", RegexOptions.Compiled | AnyNewLine);

    [Benchmark(Description = "14_Literal_Old")]
    public int Literal_Old() => Old_14.Matches(MixedAll1K).Count;
    [Benchmark(Description = "14_Literal_New")]
    public int Literal_New() => New_14.Matches(MixedAll1K).Count;

    // 15. Word matching
    private static readonly Regex Old_15 = new(@"\w+", RegexOptions.Compiled);
    private static readonly Regex New_15 = new(@"\w+", RegexOptions.Compiled | AnyNewLine);

    [Benchmark(Description = "15_Words_Old")]
    public int Words_Old() => Old_15.Matches(WinText1K).Count;
    [Benchmark(Description = "15_Words_New")]
    public int Words_New() => New_15.Matches(WinText1K).Count;

    // ── Section 5: Pathological ─────────────────────────────────────

    // P1. Bare $
    private static readonly Regex Old_P1 = new(@"$", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_P1 = new(@"$", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);

    [Benchmark(Description = "P1_BareEOL_Old")]
    public int BareEOL_Old() => Old_P1.Matches(WinText1K).Count;
    [Benchmark(Description = "P1_BareEOL_New")]
    public int BareEOL_New() => New_P1.Matches(WinText1K).Count;

    // P2. Bare ^
    private static readonly Regex Old_P2 = new(@"^", RegexOptions.Compiled | RegexOptions.Multiline);
    private static readonly Regex New_P2 = new(@"^", RegexOptions.Compiled | RegexOptions.Multiline | AnyNewLine);

    [Benchmark(Description = "P2_BareBOL_Old")]
    public int BareBOL_Old() => Old_P2.Matches(WinText1K).Count;
    [Benchmark(Description = "P2_BareBOL_New")]
    public int BareBOL_New() => New_P2.Matches(WinText1K).Count;

    // P3. \w+\Z on large input
    private static readonly Regex Old_P3 = new(@"\w+\r?\Z", RegexOptions.Compiled);
    private static readonly Regex New_P3 = new(@"\w+\Z", RegexOptions.Compiled | AnyNewLine);

    [Benchmark(Description = "P3_EndZ_Old")]
    public bool EndZ_Old() => Old_P3.IsMatch(WinText10K);
    [Benchmark(Description = "P3_EndZ_New")]
    public bool EndZ_New() => New_P3.IsMatch(WinText10K);
}

@danmoseley
Member Author

@MihuBot benchmark Regex

@danmoseley
Member Author

@MihuBot regexdiff

@MihuBot

MihuBot commented Mar 3, 2026

0 out of 18857 patterns have generated source code changes.

JIT assembly changes
Total bytes of base: 55915821
Total bytes of diff: 55915821
Total bytes of delta: 0 (0.00 % of base)
Sample source code for further analysis
const string JsonPath = "RegexResults-1803.json";
if (!File.Exists(JsonPath))
{
    await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/FIQBKc7A");
    using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
    archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}

using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");



record KnownPattern(string Pattern, RegexOptions Options, int Count);

sealed class RegexEntry
{
    public required KnownPattern Regex { get; set; }
    public required string MainSource { get; set; }
    public required string PrSource { get; set; }
    public string? FullDiff { get; set; }
    public string? ShortDiff { get; set; }
    public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
    public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
}

@MihuBot

MihuBot commented Mar 3, 2026

See benchmark results at https://gist.github.com/MihuBot/d906406489feb8a231966adc754246f4

@danmoseley
Member Author

MihuBot Benchmark Analysis (Post VT/FF)

Zero regressions. The PR has no impact on existing patterns.

Confirmed by two independent signals:

  1. Regexdiff: 0 out of 18,857 patterns changed. JIT assembly is byte-for-byte identical (Total bytes of delta: 0 (0.00% of base)). Since no code paths are altered for patterns without AnyNewLine, any benchmark differences are definitionally noise.

  2. Benchmark ratios are centered on 1.00x across hundreds of benchmarks. The few apparent outliers are explained by noise:

Benchmark Ratio Why noise
SliceSlice IgnoreCase, None 1.07x Interpreted mode; IgnoreCase path untouched by PR
BoostDocs Id 10 Compiled 1.41x 22→32 ns absolute; about 10 ns of jitter at nanosecond scale
BoostDocs Id 13 Compiled 1.34x Same — 26→35 ns, nanosecond-scale noise
Common ReplaceWords IgnoreCase,Compiled 1.25x Same config shows SplitWords at 0.77x and MatchesWords at 0.82x — contradictory = noise
Common ReplaceWords Compiled 1.12x 1149→1290 ns; within typical BDN jitter
Leipzig (?i)Tom|...|Finn Compiled 1.13x Error bars ~865 μs on ~3000 μs mean — meaningless

Consistent with the first MihuBot run (pre-VT/FF), which also showed zero impact with an identical regexdiff.

The PR adds parser-only lowering gated behind RegexOptions.AnyNewLine (0x0800). When the flag isn't set, no code path is touched. MihuBot confirms this with zero assembly diff and noise-level benchmark variation across all suites (Sherlock, Leipzig, BoostDocs, Common, Cache, RegexRedux, Mariomkas, SliceSlice, Russian, Chinese).

@danmoseley danmoseley closed this Mar 3, 2026
@danmoseley danmoseley reopened this Mar 3, 2026
@danmoseley
Member Author

nuts, didn't mean to close/reopen.

@danmoseley
Member Author

For interest, once we've taken this we can consider \R. We'd need to decide we actually want it as a feature first (there are good reasons, including parity with other major engines). But here's what the code looks like -- it's a small change, non-breaking, and pay-for-play: danmoseley#35

@jzabroski
Contributor

I'm excited to use it.

One interesting use case for more powerful Regex functionality is AI models with large context windows. There have been some interesting studies suggesting that agents are more effective using grep than RAG pipelines built on vector databases, and that the inflection point is largely due to large context windows. It seems the main advantage of a vector database is compliance with GDPR and other privacy laws, since you can mask the data behind embeddings using GUIDs and have strong data governance controls over which parts of an ontology graph a given user has rights to. For anything not sensitive, grep with regex wins.

Restructure all three anchor lowering methods (Eol, Bol, EndZ) to
replace the 2-branch outer Alternate node with a sequential
Concatenate of: primary lookaround + shared CRLF guard.

Key idea: include ALL newline chars (including \n for $, \r for ^) in
the primary lookaround's character class, then append (?!(?<=\r)\n) as
a guard to block matching at the \r\n boundary.

Before ($ example): (?=[\r\v\f\u0085\u2028\u2029]|\z)|(?<!\r)(?=\n)
After:              (?=[\n\r\v\f\u0085\u2028\u2029]|\z)(?!(?<=\r)\n)

At non-newline positions (the vast majority during backtracking), the
primary lookaround fails immediately and the Concatenate short-circuits
— the CRLF guard is never evaluated. The old structure evaluated both
branches of the outer Alternate at every position.

Extract shared AnyNewLineCrLfGuardNode() helper used by all three
methods. Replace AnyNewLineExceptLfClass / AnyNewLineExceptCrClass
with unified AnyNewLineClass constant.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 6, 2026 18:20
@danmoseley
Member Author

danmoseley commented Mar 6, 2026

Optimized the lowering, given the observation that newlines are less common than non-newlines. The lowering of . can't be improved without engine changes, which we're avoiding. -- Dan

====

Optimize anchor lowering: eliminate outer alternation

The previous lowerings used a two-branch Alternate at the top level. The problem: at non-newline positions, both branches must be evaluated and fail. Since most characters in typical input are non-newline, this doubles the per-character rejection cost for anchors inside loops like \w+$.

The new structure replaces the outer Alternate with a sequential Concatenate: a single primary lookaround that matches all newline characters (including \n and \r), followed by a shared CRLF guard (?!(?<=\r)\n) that rejects the position between the \r and the \n of a \r\n pair. At non-newline positions the primary lookaround fails immediately and the guard is never evaluated.

Before/after lowerings:

Construct: $ (multiline)
Before: (?=[\r\u0085\u2028\u2029]|\z)|(?<!\r)(?=\n)
After:  (?=[\n\r\v\f\u0085\u2028\u2029]|\z)(?!(?<=\r)\n)

Construct: $ (non-multiline) / \Z
Before: (?=\r\n\z|[\r\u0085\u2028\u2029]?\z)|(?<!\r)(?=\n\z)
After:  (?=\r\n\z|[\n\r\v\f\u0085\u2028\u2029]?\z)(?!(?<=\r)\n)

Construct: ^ (multiline)
Before: (?<=[\n\u0085\u2028\u2029]|\A)|(?<=\r)(?!\n)
After:  (?<=[\n\r\v\f\u0085\u2028\u2029]|\A)(?!(?<=\r)\n)

The key structural change in each case: branch1 | branch2 becomes unified_lookaround + guard. The AnyNewLineExceptLfClass / AnyNewLineExceptCrClass constants are replaced by a single AnyNewLineClass constant since the CRLF guard handles the split.

This change is entirely within the AnyNewLine lowering code path -- it has no effect on patterns that don't use RegexOptions.AnyNewLine.
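
As a rough sanity check, the "After" lowerings for multiline $ and ^ listed above can be written out by hand as ordinary patterns and run on a released .NET, with no AnyNewLine flag involved. The sketch below does that; the class name LoweredAnchorDemo and the helper Positions are hypothetical, introduced only for illustration, and the patterns are copied from this comment rather than taken from RegexParser.cs.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class: the patterns are hand-copied from the "After" lowerings above.
public static class LoweredAnchorDemo
{
    // Multiline $ : unified lookahead over all newline chars (or \z), then the CRLF guard.
    public const string Eol = @"(?=[\n\r\v\f\u0085\u2028\u2029]|\z)(?!(?<=\r)\n)";

    // Multiline ^ : unified lookbehind over all newline chars (or \A), then the same guard.
    public const string Bol = @"(?<=[\n\r\v\f\u0085\u2028\u2029]|\A)(?!(?<=\r)\n)";

    // Returns every index at which the zero-width pattern matches.
    public static int[] Positions(string pattern, string input) =>
        Regex.Matches(input, pattern).Select(m => m.Index).ToArray();

    public static void Main()
    {
        const string text = "foo\r\nbar"; // \r at index 3, \n at index 4, end of input at 8

        // Line ends: before the \r and at end of input, never between \r and \n.
        Console.WriteLine(string.Join(",", Positions(Eol, text))); // prints 3,8

        // Line starts: start of input and after the \n, never between \r and \n.
        Console.WriteLine(string.Join(",", Positions(Bol, text))); // prints 0,5
    }
}
```

At index 4 the primary lookaround succeeds in both patterns (a newline character is adjacent), and only the guard rejects the position, which is the one case where the second half of the Concatenate does any work.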

@danmoseley
Member Author

danmoseley commented Mar 6, 2026

Perf results after anchor lowering optimization

Measured locally with BenchmarkDotNet MediumRun (15 iterations), RegexOptions.Compiled, Release build, .NET 11.0. Same methodology as the PR's "AnyNewLine vs Workaround Patterns" table -- ratio is New(AnyNewLine) / Old(manual workaround). All match counts verified identical.

Section 1: Real-world patterns on Windows \r\n text

# Workaround pattern AnyNewLine pattern Workaround time AnyNewLine time Ratio Notes
1a ^.+\r?$ (1K lines) ^.+$ 47.5 us 50.8 us 1.07x . overhead (Set vs Notone)
1b ^.+\r?$ (10K lines) ^.+$ 1717 us 1766 us 1.03x Same, amortized over longer input
2 \[assembly:...\]$(\r?\n)? \[assembly:...\]$ 39.4 us 35.4 us 0.90x Simpler pattern wins
3 ^([^\s:]+):\s*(.+?)\r?$ ^([^\s:]+):\s*(.+?)$ 111.2 us 105.5 us 0.95x Simpler pattern wins
4 ^# .+\r?$ ^# .+$ 11.8 us 11.0 us 0.93x Faster: literal # prefix optimizes well
5 ^.+\r?$ (CSV) ^.+$ 48.4 us 52.3 us 1.08x . overhead
6 [^\r\n]+ .+ 46.3 us 47.3 us 1.02x . overhead, minimal
7 \w+\r?$ \w+$ 91.6 us 92.0 us 1.00x Was 1.33x before this optimization
8 (?:^|\r\n)\w+ ^\w+ 200.8 us 189.3 us 0.94x Simpler pattern wins

Section 2: Unix \n text (overhead of just enabling the flag)

# Workaround pattern AnyNewLine pattern Workaround time AnyNewLine time Ratio Notes
9 ^.+$ ^.+$ 48.0 us 51.0 us 1.06x . overhead
10 [^\n]+ .+ 41.2 us 47.9 us 1.16x . overhead (Notone vs Set)

Section 3: Mixed \n/\r\n text

# Workaround pattern AnyNewLine pattern Workaround time AnyNewLine time Ratio Notes
11 [^\r\n\u0085\u2028\u2029]+ .+ 47.1 us 52.4 us 1.11x . overhead
12 ^.+\r?$ (mixed 1K) ^.+$ 46.6 us 51.7 us 1.11x . + anchor overhead

Section 4: Non-anchor/dot patterns (zero impact expected)

# Workaround pattern AnyNewLine pattern Workaround time AnyNewLine time Ratio Notes
14 \r\n|\r|\n \r\n|\r|\n 41.1 us 42.8 us 1.04x No lowering, within noise
15 \w+ \w+ 314.3 us 310.7 us 0.99x No lowering, within noise

Section 5: Bare anchors (no simple workaround exists for these)

# Pattern (no workaround) Pattern (+ AnyNewLine) Without AnyNewLine AnyNewLine Ratio Notes
P1 $ (multiline, \n-only) $ (all newlines) 106.0 us 122.9 us 1.16x Now correct; was 1.37x before optimization
P2 ^ (multiline, \n-only) ^ (all newlines) 152.9 us 119.4 us 0.78x Now correct; faster here due to a curiosity (issue)
P3 \w+\r?\Z (partial) \w+\Z 113.4 us 114.9 us 1.01x Was ~1.9x before optimization

Summary:

  • The remaining overhead in dot-heavy patterns (1.02x--1.16x) comes entirely from . being lowered to a Set node ([^\n\r\v\f\u0085\u2028\u2029]) instead of the engine's native Notone node -- this is inherent to the lowering approach and would require engine changes to address.
  • The anchor optimization eliminated the worst regressions: \w+$ from 1.33x to 1.00x, bare $ from 1.37x to 1.16x, and \w+\Z from ~1.9x to 1.01x.
  • Patterns where AnyNewLine simplifies the regex (removing \r?, (\r?\n)?, (?:^|\r\n)) are often faster than the workaround (0.90x--0.95x).
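
To make the first bullet concrete, here is a small sketch comparing the default . (which excludes only \n) with the character class that . is described as lowering to under AnyNewLine. It uses the class as a plain pattern on a released .NET, so it illustrates the matching behavior only, not the Set-versus-Notone node cost. DotClassDemo and Runs are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class; the character class is copied from the summary above.
public static class DotClassDemo
{
    // The Set that . is described as lowering to when AnyNewLine is set.
    public const string AnyNewLineDot = @"[^\n\r\v\f\u0085\u2028\u2029]";

    // Returns the text of every match of the pattern in the input.
    public static string[] Runs(string pattern, string input) =>
        Regex.Matches(input, pattern).Select(m => m.Value).ToArray();

    public static void Main()
    {
        const string text = "a\rb\u2028c"; // contains \r and U+2028, but no \n

        // Default dot excludes only \n, so the whole string is a single run.
        Console.WriteLine(Runs(".+", text).Length);                // prints 1

        // The lowered class stops at \r and U+2028, giving "a", "b", "c".
        Console.WriteLine(Runs(AnyNewLineDot + "+", text).Length); // prints 3
    }
}
```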

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.

- Remove UTF-8 BOM from RegexParser.cs and Regex.Match.Tests.cs
- Remove extra blank line in RegexParser.cs (line 24)
- Add blank line between AnyNewLine_Dollar_TestData and AnyNewLine_EndZ
- Add missing \u2028 (Line Separator) test case for \Z
- Add RegexOptionAnyNewLine assertion in test helpers

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

Regex Match, Split and Matches should support RegexOptions.AnyNewLine as (?=\r\z|\n\z|\r\n\z|\z)

5 participants