Skip to content

Rework RegexGenerator incremental source model to support value equality#125438

Closed
eiriktsarpalis wants to merge 36 commits intomainfrom
fix/regex-generator-incrementality
Closed

Rework RegexGenerator incremental source model to support value equality#125438
eiriktsarpalis wants to merge 36 commits intomainfrom
fix/regex-generator-incrementality

Conversation

@eiriktsarpalis
Copy link
Copy Markdown
Member

@eiriktsarpalis eiriktsarpalis commented Mar 11, 2026

Fixes #125409

Discovered while working on #125404 — the RegexGenerator's incremental source model lacked proper value equality, causing the source output callback to fire on every compilation change regardless of whether regex patterns had actually changed.

Approach

Note

AI-generated description update via Copilot.

Refactors the generator to keep the regular parsed RegexMethod / RegexTree / AnalysisResults objects in the parser step, while wrapping each parsed method in a dedicated public RegexMethodSpec model that provides structural equality for incremental caching.

This avoids the earlier mirrored *Spec object graph and also avoids re-parsing in the emitter. The emitter now consumes the parsed tree and analysis directly, and Roslyn can skip source emission when the structural regex state is unchanged.

Focused incremental tests cover deterministic ordering, deep-nesting fallback, and signature-affecting changes.

eiriktsarpalis and others added 4 commits March 11, 2026 13:05
Add RegexGeneratorIncrementalTests.cs with 18 tests covering:
- Same compilation caching (SameInput_DoesNotRegenerate)
- Unrelated code changes (UnrelatedCodeChange_DoesNotRegenerate)
- Pattern/options/timeout/culture changes triggering regeneration
- Declaring type and member name changes
- Method add/remove scenarios
- Property vs method support
- Complex patterns with unrelated changes (Theory with 5 patterns)
- Limited support regex (NonBacktracking) caching
- Multiple regex methods with partial changes

These tests establish the baseline behavior before refactoring the
generator's incremental pipeline for proper value equality.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…gex generator

Add ImmutableEquatableDictionary<TKey, TValue> to Common/src/SourceGenerators/
following the same pattern as the existing ImmutableEquatableArray<T>.

Link ImmutableEquatableArray, ImmutableEquatableDictionary, and HashHelpers
to the Regex generator project for use in the upcoming equatable model types.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…line

Add gen/Model/ directory with immutable, structurally equatable record types:
- RegexGenerationSpec: top-level envelope (incremental cache boundary)
- RegexMethodSpec: per-method data with declaring type, pattern, options, tree
- RegexTreeSpec: equatable regex tree with baked-in analysis results
- RegexNodeSpec: equatable tree node with analysis flags per node
- FindOptimizationsSpec: equatable find optimizations
- FixedDistanceSetSpec: equatable fixed-distance set
- RegexTypeSpec: declaring type hierarchy

Also add required member polyfills for netstandard2.0 compatibility.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ssion

Rewrites the RegexGenerator Initialize() pipeline to:
1. Collect per-method results into a single batch
2. Run a Parser that builds an equatable RegexGenerationSpec model
3. Split into source model (with value equality) and diagnostics providers
4. Register separate source outputs for model and diagnostics

The equatable model (RegexGenerationSpec -> RegexMethodSpec -> RegexTreeSpec ->
RegexNodeSpec) enables Roslyn to skip the emitter when regex patterns/options
haven't changed.

During emission, the spec is converted back to the mutable RegexMethod/RegexTree
types by re-parsing the regex pattern. This re-parse is fast (~1ms) and only
occurs on incremental cache misses, while the expensive emission (10-100ms) is
fully skipped on cache hits.

Also fixes CultureChange_Regenerates test to check incremental step reason
rather than output string equality, since culture changes may not always
produce different generated code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 11, 2026 12:55
@dotnet-policy-service
Copy link
Copy Markdown
Contributor

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request refactors the RegexGenerator incremental source generator pipeline to use deeply equatable model types, enabling Roslyn incremental caching to skip expensive emission when regex inputs haven’t changed.

Changes:

  • Replaced the heterogeneous ImmutableArray<object> incremental model with equatable *Spec model records (using ImmutableEquatableArray / new ImmutableEquatableDictionary).
  • Reworked the pipeline to split source emission from diagnostic reporting and re-parse the regex during emission only on cache misses.
  • Added a new incremental-caching-focused test suite and wired it into the functional test project.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/System.Text.RegularExpressions.Tests.csproj Adds the new incremental generator tests to the test project.
src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/RegexGeneratorIncrementalTests.cs New tests validating incremental caching behavior via tracked generator steps.
src/libraries/System.Text.RegularExpressions/gen/System.Text.RegularExpressions.Generator.csproj Links shared equatable collection helpers and required-member polyfills into the generator project.
src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.cs Pipeline rewrite to produce a structurally equatable model and re-parse on emission.
src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Conversion.cs Converts RegexTree / analysis into equatable *Spec snapshots.
src/libraries/System.Text.RegularExpressions/gen/Model/RegexGenerationSpec.cs New top-level equatable incremental cache boundary model.
src/libraries/System.Text.RegularExpressions/gen/Model/RegexMethodSpec.cs New per-member equatable model for generation inputs.
src/libraries/System.Text.RegularExpressions/gen/Model/RegexTreeSpec.cs New equatable representation of regex trees and baked-in analysis flags.
src/libraries/System.Text.RegularExpressions/gen/Model/RegexNodeSpec.cs New equatable representation of regex parse nodes.
src/libraries/System.Text.RegularExpressions/gen/Model/RegexTypeSpec.cs New equatable representation of declaring type hierarchy.
src/libraries/System.Text.RegularExpressions/gen/Model/FindOptimizationsSpec.cs New equatable representation of find-optimization state.
src/libraries/System.Text.RegularExpressions/gen/Model/FixedDistanceSetSpec.cs New equatable representation of fixed-distance set optimization data.
src/libraries/Common/src/SourceGenerators/ImmutableEquatableDictionary.cs New shared equatable dictionary helper for structural equality in generator models.

You can also share your feedback on Copilot code review. Take the survey.

eiriktsarpalis and others added 2 commits March 11, 2026 15:29
…mitter

- F1: Extract LiteralAfterLoopSpec record from nested tuple in
  FindOptimizationsSpec for clarity.
- F2+F3: Rewrite RegexGenerator.Parser.cs with unified parsing flow.
  Parse() is the outer method accepting GeneratorAttributeSyntaxContext
  batch; ParseMethod() handles each member end-to-end (attribute
  extraction + regex parsing + spec conversion). Eliminates the
  intermediate RegexPatternAndSyntax type and ImmutableArray<object>
  boxing.
- F4: Move Emit() orchestration and ConvertRegexTypeSpecToRegexType to
  RegexGenerator.Emitter.cs. RegexGenerator.cs now only contains
  Initialize() pipeline wiring, constants, and CompilationData.
- Fix parent chain construction for immutable RegexTypeSpec records
  (build outside-in instead of mutating).
- Remove unused usings from slimmed RegexGenerator.cs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 11, 2026 13:58
Add ToImmutableEquatableDictionary<TKey, TValue>(this Hashtable)
extension method following the existing factory pattern, and simplify
the ConvertHashtable caller in Conversion.cs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.


You can also share your feedback on Copilot code review. Take the survey.

- Add ImmutableEquatableSet<T> (from PolyType) providing order-
  independent structural equality via HashSet<T>. Use it for
  RegexGenerationSpec.RegexMethods to prevent spurious cache misses
  from non-deterministic ordering of GeneratorAttributeSyntaxContext.
- Fix null-safety bug in ImmutableEquatableDictionary.Equals: use
  EqualityComparer<TValue>.Default.Equals instead of direct call
  that would NRE on null values.
- Remove unused keySelector/valueSelector and IEnumerable<KVP>
  overloads from ImmutableEquatableDictionary factory.
- Inline ConvertHashtable calls in Conversion.cs.
- Extract AddDiagnostic helper to reduce noise in Parser.cs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 11, 2026 15:29
…culture

- Replace var with explicit types throughout changed code
- Change RegexMethod/RegexType from internal sealed record to private sealed class
- Add using var to runnerSw/runnerWriter for proper disposal
- Include CultureName in dedup key for emitted expressions
- Remove unused using SourceGenerators from Emitter.cs
- Add using System.Text to incremental tests
- Strengthen SameInput test with cached/unchanged assertion

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@eiriktsarpalis eiriktsarpalis force-pushed the fix/regex-generator-incrementality branch from 53e38f5 to 93dd4a4 Compare March 11, 2026 15:31
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 8 comments.

Comments suppressed due to low confidence (1)

src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Parser.cs:150

  • The spec currently doesn’t preserve the cultureName argument from [GeneratedRegex]. CreateRegexTreeSpec later relies on tree.Culture?.Name, but RegexParser.Parse sets RegexTree.Culture to non-null only for ignore-case backreferences; for most ignore-case regexes the tree culture will be null. That means the emitter may re-parse using InvariantCulture even when a culture was specified, potentially generating incorrect code for culture-sensitive case folding. Store the requested/selected culture name directly on RegexMethodSpec (or alongside Tree) and use that for emission and de-duping.
            string? pattern = items[0].Value as string;
            int? options = null;
            int? matchTimeout = null;
            string? cultureName = string.Empty;
            if (items.Length >= 2)
            {
                options = items[1].Value as int?;
                if (items.Length == 4)
                {
                    matchTimeout = items[2].Value as int?;
                    cultureName = items[3].Value as string;
                }
                // If there are 3 parameters, we need to check if the third argument is
                // int matchTimeoutMilliseconds, or string cultureName.
                else if (items.Length == 3)
                {
                    if (items[2].Type?.SpecialType == SpecialType.System_Int32)
                    {
                        matchTimeout = items[2].Value as int?;
                    }
                    else
                    {
                        cultureName = items[2].Value as string;
                    }

You can also share your feedback on Copilot code review. Take the survey.

eiriktsarpalis and others added 2 commits March 11, 2026 17:48
Sort method specs by fully-qualified type name before ID assignment
and emission to ensure deterministic output regardless of HashSet
iteration order (which varies with string hash randomization).

Add test for CompilationData (AllowUnsafe) change triggering regeneration.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Move CompilationData record struct to gen/Model/CompilationData.cs
- Change Parse() accumulators from HashSet/ImmutableArray.Builder to List
- Reorder declarations so methods come before diagnostics

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 11, 2026 16:05
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.


You can also share your feedback on Copilot code review. Take the survey.

Inline diagnostic accumulation using (diagnostics ??= []).Add(...).
Remove sort in Emit() since ImmutableEquatableSet iteration order
is deterministic within a process and the emitter only runs on
cache misses.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@eiriktsarpalis eiriktsarpalis force-pushed the fix/regex-generator-incrementality branch from 6421f6a to 4740968 Compare March 11, 2026 16:14
@github-actions

This comment has been minimized.

eiriktsarpalis and others added 4 commits April 3, 2026 17:57
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 3, 2026 16:03
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions

This comment has been minimized.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 3, 2026 16:49
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions

This comment has been minimized.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 3, 2026 18:24
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

@github-actions

This comment has been minimized.

eiriktsarpalis and others added 2 commits April 3, 2026 21:48
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 3, 2026 18:51
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

@github-actions

This comment has been minimized.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 3, 2026

🤖 Copilot Code Review — PR #125438

Note

This review was generated by Copilot (Claude Opus 4.6), with cross-validation from GPT-5.2.

Holistic Assessment

Motivation: The PR addresses a real and important problem — the RegexGenerator's incremental pipeline was not properly caching because the model types (RegexTree, AnalysisResults, RegexFindOptimizations) are mutable reference types without structural equality. This caused the emitter to re-run on every compilation, degrading IDE performance. The "source generators must be properly incremental" convention is well-established in this repo.

Approach: The approach of wrapping RegexMethod in an equatable RegexMethodSpec with deep structural comparison — rather than making all regex library types immutable/equatable — is pragmatic and well-suited to this codebase. It avoids modifying the large regex library while achieving proper incrementality. The pipeline restructuring (batch Collect → single Parse → equatable model → Emit) follows the correct incremental generator pattern, consistent with other generators in the repo (e.g., LoggerMessageGenerator).

Summary: ✅ LGTM. The change is well-designed, thoroughly tested, and the equality implementation correctly covers all fields that affect code generation. Both independent model reviews converged on the same conclusion. Two minor style and robustness observations are noted below as non-blocking suggestions.


Detailed Findings

✅ Equality Implementation — All code-gen-relevant fields are covered

Verified the RegexMethodSpec.Equals comparison against every field the emitter reads. All fields that affect generated output are compared:

  • RegexMethod fields: DeclaringType, IsProperty, MemberName, Modifiers, NullableRegex, Pattern, Options, MatchTimeout, CultureName, Tree, Analysis, LimitedSupportReason, CompilationData — all compared ✓
  • RegexTree fields: Options, CaptureCount, Culture?.Name, CaptureNames, CaptureNameToNumberMapping, CaptureNumberSparseMapping, Root, FindOptimizations — all compared ✓
  • RegexNode fields: Kind, Str, Ch, M, N, Options, child structure — all compared ✓
  • RegexFindOptimizations fields: FindMode, LeadingAnchor, TrailingAnchor, MinRequiredLength, MaxPossibleLength, LeadingPrefix, LeadingPrefixes, FixedDistanceLiteral (Char/String/Distance), FixedDistanceSets (Set/Negated/Chars/Distance/Range), LiteralAfterLoop (LoopNode + Literal tuple) — all compared ✓
  • AnalysisResults: HasIgnoreCase, HasRightToLeft, plus per-node IsAtomicByAncestor/MayContainCapture/MayBacktrack/IsInLoop — all compared ✓
  • Not compared but correctly excluded: LeadingStrings (SearchValues<string>?) is behind #if SYSTEM_TEXT_REGULAREXPRESSIONS which is not defined in the generator build. The emitter reconstructs SearchValues from LeadingPrefixes + FindMode (both compared). GeneratedName/IsDuplicate are emission-time fields set after equality checks.

Hash codes are consistent with Equals across all types — every compared field participates in the hash.

[Flagged by both Claude Opus 4.6 and GPT-5.2]

✅ Pipeline Architecture — Correct incremental pattern

The pipeline restructuring is sound:

  1. ForAttributeWithMetadataNameCollectSelect(Parse) produces (RegexGenerationSpec?, ImmutableArray<Diagnostic>)
  2. Source model is projected with WithTrackingName for test observability, cached via structural equality
  3. Diagnostics are projected separately, preserving SourceLocation for #pragma suppressibility (fixing Address diagnostic issues in runtime incremental source generators #92509)
  4. Emit callback only fires when the equatable model changes

The deterministic sorting in Emit (by Namespace/FullName/MemberName/Pattern/Options/Timeout/CultureName) ensures stable output despite non-deterministic Collect() ordering.

Bug fix included: The dedup key now includes CultureName — the old code only deduped on (Pattern, Options, Timeout), which would incorrectly merge identical patterns with different cultures.

[Flagged by both models]

✅ Test Coverage — Comprehensive incremental verification

The 891-line test file covers the critical scenarios:

  • Same compilation caching, unrelated code changes, pattern/options/timeout/culture changes
  • Declaring type changes, signature changes (method↔property, modifiers, nullable)
  • Nested type determinism, deeply nested pattern fallback
  • Adding/removing regex methods, limited support regex caching
  • Culture normalization equivalence (e.g., zh-CHSzh-Hans)
  • Same pattern with different cultures not being deduplicated

Tests use WithTrackingName + TrackedSteps to verify actual pipeline caching behavior (not just output text equality).

💡 ImmutableEquatableSet.cs — Missing blank line before namespace

ImmutableEquatableSet.cs (line 7-8) is missing the blank line between using directives and the namespace declaration. The sibling file ImmutableEquatableArray.cs has this blank line (line 9-10). Minor formatting inconsistency.

 using System.Collections.Generic;
 using System.Diagnostics;
+
 namespace SourceGenerators

💡 AnalysisEquals — Implicitly relies on RegexTreeEquals having passed first

AnalysisEquals (line 164) iterates children using left.ChildCount() without validating that right has the same child count. This is safe because RegexTreeEquals is evaluated first (via && short-circuiting at line 37-40), which guarantees identical tree structure. However, a Debug.Assert would make this invariant explicit and protect against future refactoring:

// In AnalysisEquals, inside the loop body (after line 163):
int childCount = left.ChildCount();
Debug.Assert(childCount == right.ChildCount());
for (int i = childCount - 1; i >= 0; i--)

[Flagged by GPT-5.2, confirmed by Claude Opus 4.6]

Generated by Code Review for issue #125438 ·

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-System.Text.RegularExpressions source-generator Indicates an issue with a source generator feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RegexGenerator source model lacks value equality, preventing effective incremental caching

4 participants