Skip to content

diff/catalog: fix generated-column expression drift#96

Merged
winebarrel merged 7 commits into
mainfrom
diff-strip-introducer-from-generated-expr
May 6, 2026
Merged

diff/catalog: fix generated-column expression drift#96
winebarrel merged 7 commits into
mainfrom
diff-strip-introducer-from-generated-expr

Conversation

@winebarrel

Copy link
Copy Markdown
Owner

Summary

A generated column whose body contained any string literal — even a simple `' '` — fired a no-op `MODIFY COLUMN` on every plan and apply never converged. Two independent issues, both root-caused by probing the actual catalog form against MySQL 8.0:

Issue 1: catalog encoding

`information_schema.COLUMNS.GENERATION_EXPRESSION` stores generated bodies in MySQL's escape form: every `\` doubled to `\\`, every `'` (delimiter or content) preceded by `\`. Desired `CONCAT(name, ' x ', name)` round-trips on the catalog side as `concat(\`name\`,_utf8mb4\\\' x \\\',\`name\`)`. vitess refuses the `\\'`-as-delimiter form, so `equalExprPtr` fell back to byte equality and the diff fired.

Fix. A two-pass decoder `catalog.decodeGenerationExpr`: `\\` → `\` first, then `\'` → `'`. Order matters — doing `\'` first would over-double a literal `\` whose encoded form is `\\\\`. Probed for the three real-world body shapes (no escapes / single-backslash / single-apostrophe); all three round-trip cleanly through vitess after decoding.

Issue 2: charset introducer asymmetry

Even after decoding, the catalog body carries MySQL's `_utf8mb4` (or `_latin1`, etc.) introducer that the parser side doesn't have. vitess preserves introducers across parse → restore, so the symmetric canonicalisation in `equalExprPtr` still saw the two sides differing.

Fix. `diff/expr.go::stripIntroducers` walks the AST and replaces every `*sqlparser.IntroducerExpr` with its inner expression. AST-based, so it handles every charset name vitess parses (not just `_utf8mb4`) without hard-coding.

Trade-off documented in CAVEATS. A deliberate `_latin1` ↔ `_utf8mb4` change in a generated literal looks identical to the diff. Manage that change out-of-band if it matters.

Tests

  • `diff/expr_test.go`: 5 new cases pin introducer absorption (plain ↔ `_utf8mb4`, plain ↔ `_latin1`, both-explicit-same, different-introducers-equal-by-design, different-payload-still-diffs).
  • `catalog/tables_test.go`: `TestDecodeGenerationExpr` pins the three real-world MySQL escape shapes.
  • `testdata/apply/generated_column_string_literal.yml`: end-to-end YAML fixture exercising the full pipeline (apply → re-plan empty under `verify_no_drift: true`).

Docs

  • `CAVEATS.md`: new "Generated column expression bodies that contain string literals" section. Default is now drift-free; the documented limitations are introducer-only-change detection and pathological literal-with-quote-and-backslash mixes.
  • `TODO.md`: drop the corresponding "vitess-keyword identifier" entry (the original framing only saw one of three drift causes; all three are resolved).

Test plan

  • `make lint` clean.
  • `make test` all 6 packages green.
  • `make test-scenario` 5/5 passing.
  • End-to-end repro confirmed: `apply` then `plan` reports `-- No changes` on a generated column with a string literal.

🤖 Generated with Claude Code

A generated column whose body contained any string literal — even a
simple `' '` — fired a no-op MODIFY COLUMN on every plan and apply
never converged. Two independent issues, both root-caused via probing
the catalog form:

1. **Catalog encoding.** MySQL stores
   `information_schema.COLUMNS.GENERATION_EXPRESSION` with every `\`
   doubled to `\\` and every `'` (delimiter or content) preceded
   by `\`, so a desired-side `CONCAT(name, ' x ', name)` round-trips
   as `concat(\`name\`,_utf8mb4\\\' x \\\',\`name\`)`. vitess refuses
   the `\\\'`-as-delimiter form, so equalExprPtr fell back to byte
   equality and the diff fired.

   Fix: a two-pass decoder in catalog (`decodeGenerationExpr`):
   `\\` → `\` first, then `\'` → `'`. Order matters — doing `\'`
   first would over-double a literal `\` whose encoded form is
   `\\\\`. Probed against MySQL 8.0 for the three real-world body
   shapes (no escapes, single-backslash, single-apostrophe); all
   three round-trip cleanly through vitess after decoding.

2. **Charset introducer asymmetry.** Even after decoding, the
   catalog body still carries MySQL's `_utf8mb4` (or `_latin1`,
   etc.) introducer that the parser side doesn't have. vitess
   preserves introducers across parse → restore, so the symmetric
   canonicalisation in `equalExprPtr` left the two sides differing.

   Fix: `stripIntroducers` walks the AST and replaces every
   `*sqlparser.IntroducerExpr` with its inner expression. AST-based,
   so it handles every charset name vitess parses (not just
   `_utf8mb4`) without hard-coding.

   Trade-off documented in CAVEATS — a deliberate `_latin1` ↔
   `_utf8mb4` change in a generated literal looks identical to the
   diff. Manage that change out-of-band if it matters.

Tests:
- diff/expr_test.go: 5 new cases pin introducer absorption (plain ↔
  `_utf8mb4`, plain ↔ `_latin1`, both-explicit-same-introducer,
  different-introducers-equal-by-design, different-payload-still-diffs).
- catalog/tables_test.go: TestDecodeGenerationExpr pins the
  three real-world MySQL escape shapes.
- testdata/apply/generated_column_string_literal.yml:
  end-to-end YAML fixture exercising the full pipeline (apply → re-plan
  empty under verify_no_drift).

Docs:
- CAVEATS.md: new "Generated column expression bodies that contain
  string literals" section. Default behaviour is now drift-free; the
  documented limitations are introducer-only-change detection and
  pathological literal-with-quote-and-backslash mixes.
- TODO.md: drop the corresponding "vitess-keyword identifier" entry
  (the original framing only saw one of the three drift causes; all
  three are now resolved).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 6, 2026 02:46
@codecov

codecov Bot commented May 6, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 84.61538% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.96%. Comparing base (fa1f2ba) to head (36f63ef).

Files with missing lines Patch % Lines
diff/expr.go 82.97% 5 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #96      +/-   ##
==========================================
- Coverage   87.00%   86.96%   -0.05%     
==========================================
  Files          30       30              
  Lines        3271     3321      +50     
==========================================
+ Hits         2846     2888      +42     
- Misses        261      266       +5     
- Partials      164      167       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes persistent no-op drift for generated columns whose expressions contain string literals by normalizing MySQL catalog GENERATION_EXPRESSION into a Vitess-parseable/canonical form and stripping charset introducers during semantic expression comparison.

Changes:

  • Decode MySQL’s escape-encoded information_schema.COLUMNS.GENERATION_EXPRESSION before storing/comparing generated expressions.
  • Strip Vitess IntroducerExpr nodes during expression canonicalization to avoid introducer-only drift.
  • Add focused unit tests plus an end-to-end YAML apply fixture; update docs/TODO accordingly.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
TODO.md Removes resolved TODO entry related to generated-column drift.
testdata/apply/generated_column_string_literal.yml New end-to-end fixture ensuring apply converges (no drift) for generated columns with string literals.
diff/expr.go Strips charset introducers during canonical expression comparison (now used for GENERATED as well).
diff/expr_test.go Adds test cases covering introducer stripping behavior and its documented trade-off.
CAVEATS.md Documents new drift-free behavior and limitations for generated bodies with string literals.
catalog/tables.go Decodes GENERATION_EXPRESSION from info_schema before storing into Column.Generated.
catalog/tables_test.go Adds unit tests pinning the two-pass decoding behavior.
catalog/export_test.go Re-exports the decoder helper for external test packages.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread CAVEATS.md Outdated
Comment thread diff/expr.go Outdated
Two valid points:

- The CAVEATS example "catalog body of `concat(\`email\`,...)`"
  was inside a single-backtick inline code span, but the example
  itself contains backticks (for the `email` / `name` column
  references) that markdown can't escape inside a single-backtick
  span. The block was rendering broken on GitHub. Promoted to a
  fenced code block so the embedded back-ticked identifiers
  display correctly.

- The introducer strip in `canonicalExpr` actually applies to
  *every* expression slot that flows through the helper —
  DEFAULT, ON UPDATE, CHECK, and GENERATED — not just generated
  bodies. The previous CAVEATS wording implied the trade-off was
  scoped to generated columns, which understated the behaviour.
  Updated both the function doc and CAVEATS to spell out that an
  introducer-only difference is invisible across all four slots,
  and explain why this is the right trade-off (the catalog
  charset introducer is the source of every-plan reshape in the
  common case; deliberate `_latin1` vs `_utf8mb4` differences in
  literals are vanishingly rare).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Coverage audit on PR #96 turned up gaps between the documented
behaviour and what the tests actually pinned:

- CAVEATS calls out that the introducer strip applies to DEFAULT,
  ON UPDATE, CHECK, and GENERATED expressions (every slot that
  goes through canonicalExpr), but no test pinned the CHECK path
  specifically. Add two cases to TestEqualCheckDef:
  - Plain `'open'` ↔ `_utf8mb4'open'` → equal (the common
    introducer-on-catalog case).
  - `_latin1'open'` ↔ `_utf8mb4'open'` → equal by design (the
    documented limitation in CAVEATS — same trade-off as the
    GENERATED case, now pinned at the CHECK call site).

- The introducer-stripping AST walk has to handle multiple
  literals in one expression and literals at any depth, but
  TestEqualExpr only had cases with one literal at the top level.
  Add:
  - Five literals across one CONCAT — proves the walk doesn't
    stop at the first introducer.
  - Nested CONCAT(UPPER(CONCAT(…))) — proves depth doesn't matter.

- decodeGenerationExpr's two-pass replacement walks the whole
  string, but the existing fixtures only exercised single literals.
  Add multiple-literals-in-one-expression and the empty-literal
  edge case (catalog body `_utf8mb4\\\'\\\'`).

Line coverage on the new functions is unchanged (the 87.5% gap
on canonicalExpr / stripIntroducers is two unreachable defensive
returns that fire only if `sqlparser.Rewrite` were to return a
non-Expr from an Expr input — per CLAUDE.md "don't add
validation for scenarios that can't happen", I'm leaving those
arms uncovered rather than mocking the impossible).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread diff/expr.go Outdated
Comment thread CAVEATS.md Outdated
The original `canonicalExpr` lower-cased the entire restored
expression with `strings.ToLower`. While generated-column bodies
were unparseable on the catalog side, that hadn't mattered:
equalExprPtr fell back to byte equality, which is case-sensitive.
The decoder + introducer-strip in this PR finally make catalog
bodies parseable, so canonicalExpr now actually runs on those
values — and the global ToLower silently canonicalised
`DEFAULT 'X'` and `DEFAULT 'x'` to the same string, hiding a
real case-sensitive value change. Same trap for CHECK and
GENERATED literals.

Replace `strings.ToLower` with `lowerOutsideStringLiterals`, a
state machine that lower-cases every byte except the content of
single-quoted string literals. Function names, keywords, and
unquoted identifiers still get the lower-case treatment they
need (vitess preserves user-written case for those), but literal
payload survives byte-for-byte. The state machine respects
backslash escapes (`\'`) and SQL-standard doubled quotes
(`''`) so it tracks literal boundaries correctly.

Tests cover:
- Upper-case vs lower-case literal differs (the regression).
- Function name case canonicalised but literal case preserved.
- Backslash escape inside literal doesn't fool the boundary
  tracker.
- SQL-standard doubled-quote escape inside literal doesn't get
  read as a closing quote.

Also addresses the CAVEATS wording about `dump` — the catalog
reader now runs `decodeGenerationExpr` before populating
`Column.Generated`, so dump output is the decoded
(parser-friendly) form, not a byte-for-byte echo of
`information_schema.COLUMNS.GENERATION_EXPRESSION`. Updated the
"escape hatch" paragraph to reflect that.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread diff/expr.go Outdated
The previous round's `lowerOutsideStringLiterals` doc comment
included the SQL-standard doubled-apostrophe escape written as
two ASCII apostrophes (`''`); gofumpt's smart-quote heuristic
treated the second `'` as the closing half of a quoted span and
rewrote the pair into U+201D (`”`).

A grep across the repo turned up four pre-existing instances of
the same pattern — comments containing
`` `<word>` `` code spans whose trailing backtick fell next to a
backslash-escape sequence, where gofumpt then collapsed
`` \` `` into the U+201C left-double-quote (`“`):

- catalog/tables.go:629 (introduced last commit, my fault)
- diff/views.go:26
- parser/view.go:67
- parser/view.go:68

All four were typos that survived because the offending bytes
are non-ASCII, escape `git diff`'s usual visual cues, and don't
trip lint. Fix all five in one pass by rewriting the doc lines
to drop markdown-style code-span back-ticks where they were
adjacent to escape sequences. The replacements use plain prose
or single back-ticks far from any `\\` so gofumpt has nothing to
canonicalise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread parser/view.go Outdated
Comment thread diff/views.go
Comment thread diff/expr.go Outdated
Three doc accuracy fixes:

- diff/expr.go::lowerOutsideStringLiterals — the previous round's
  doc said "lower-cases every byte" but the implementation only
  touches ASCII A–Z. ASCII-only is deliberate (vitess restores
  pure-ASCII keywords and identifiers into canonicalExpr's input,
  so a Unicode-aware lower would be slower without changing the
  result), so make the comment match the implementation and
  spell out the rationale.
- diff/views.go::DiffViews — the previous round's parenthetical
  example "qualified form FROM ... )." had a stray closing paren
  that read as unbalanced. Rewrite the sentence so the parens
  bracket the SQL example clearly.
- parser/view.go::NormalizeViewDefinition — the function's doc
  claimed a "ReplaceAll spaces" pass that hasn't existed in the
  implementation for a while; only the lowercase pass survives.
  Update the doc to describe what actually happens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread diff/expr.go Outdated
Round 6 of Copilot review: the previous round narrowed the
lowering to ASCII A–Z and asserted vitess-restored identifiers
are pure ASCII. The reviewer pointed out that the repo already
exercises non-ASCII identifiers (e.g. `\`tαble\`` in
diff_all_internal_test.go), so a back-ticked Greek-letter
identifier whose case differs between desired and catalog would
silently drift under ASCII-only lowering — the previous global
`strings.ToLower` happened to handle this through Unicode case-
folding.

Switch the helper to gather outside-literal segments and run
`strings.ToLower` on each segment while still copying literal
content byte-for-byte. The state machine and escape-handling are
unchanged. Byte-by-byte literal scanning stays safe for multi-
byte UTF-8 because the `'` and `\` marker bytes never appear
inside a UTF-8 continuation byte.

Tests:
- "non-ASCII identifier case difference is folded": back-ticked
  `tΑble` (Greek capital alpha) compares equal to `tαble`
  (lowercase alpha). Probed against vitess: bare non-ASCII
  identifiers fail to parse, but the back-ticked form
  round-trips and the canonicalised lower output catches the
  case fold.
- "non-ASCII literal content case-preserved": Greek-α inside a
  string literal differs by case between the two sides and
  must stay distinct, since the literal is part of the value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@winebarrel winebarrel merged commit e0f7af6 into main May 6, 2026
9 checks passed
@winebarrel winebarrel deleted the diff-strip-introducer-from-generated-expr branch May 6, 2026 04:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants