diff/catalog: fix generated-column expression drift by winebarrel · Pull Request #96 · winebarrel/myschema

winebarrel · 2026-05-06T02:46:43Z

Summary

A generated column whose body contained any string literal — even a simple `' '` — fired a no-op `MODIFY COLUMN` on every plan and apply never converged. Two independent issues, both root-caused by probing the actual catalog form against MySQL 8.0:

Issue 1: catalog encoding

`information_schema.COLUMNS.GENERATION_EXPRESSION` stores generated bodies in MySQL's escape form: every `\` doubled to `\\`, every `'` (delimiter or content) preceded by `\`. Desired `CONCAT(name, ' x ', name)` round-trips on the catalog side as `concat(\`name\`,_utf8mb4\\\' x \\\',\`name\`)`. vitess refuses the `\\'`-as-delimiter form, so `equalExprPtr` fell back to byte equality and the diff fired.

Fix. A two-pass decoder `catalog.decodeGenerationExpr`: `\\` → `\` first, then `\'` → `'`. Order matters — doing `\'` first would over-double a literal `\` whose encoded form is `\\\\`. Probed for the three real-world body shapes (no escapes / single-backslash / single-apostrophe); all three round-trip cleanly through vitess after decoding.

Issue 2: charset introducer asymmetry

Even after decoding, the catalog body carries MySQL's `_utf8mb4` (or `_latin1`, etc.) introducer that the parser side doesn't have. vitess preserves introducers across parse → restore, so the symmetric canonicalisation in `equalExprPtr` still saw the two sides differing.

Fix. `diff/expr.go::stripIntroducers` walks the AST and replaces every `*sqlparser.IntroducerExpr` with its inner expression. AST-based, so it handles every charset name vitess parses (not just `_utf8mb4`) without hard-coding.

Trade-off documented in CAVEATS. A deliberate `_latin1` ↔ `_utf8mb4` change in a generated literal looks identical to the diff. Manage that change out-of-band if it matters.

Tests

`diff/expr_test.go`: 5 new cases pin introducer absorption (plain ↔ `_utf8mb4`, plain ↔ `_latin1`, both-explicit-same, different-introducers-equal-by-design, different-payload-still-diffs).
`catalog/tables_test.go`: `TestDecodeGenerationExpr` pins the three real-world MySQL escape shapes.
`testdata/apply/generated_column_string_literal.yml`: end-to-end YAML fixture exercising the full pipeline (apply → re-plan empty under `verify_no_drift: true`).

Docs

`CAVEATS.md`: new "Generated column expression bodies that contain string literals" section. Default is now drift-free; the documented limitations are introducer-only-change detection and pathological literal-with-quote-and-backslash mixes.
`TODO.md`: drop the corresponding "vitess-keyword identifier" entry (the original framing only saw one of three drift causes; all three are resolved).

Test plan

`make lint` clean.
`make test` all 6 packages green.
`make test-scenario` 5/5 passing.
End-to-end repro confirmed: `apply` then `plan` reports `-- No changes` on a generated column with a string literal.

🤖 Generated with Claude Code

A generated column whose body contained any string literal — even a simple `' '` — fired a no-op MODIFY COLUMN on every plan and apply never converged. Two independent issues, both root-caused via probing the catalog form: 1. **Catalog encoding.** MySQL stores `information_schema.COLUMNS.GENERATION_EXPRESSION` with every `\` doubled to `\\` and every `'` (delimiter or content) preceded by `\`, so a desired-side `CONCAT(name, ' x ', name)` round-trips as `concat(\`name\`,_utf8mb4\\\' x \\\',\`name\`)`. vitess refuses the `\\\'`-as-delimiter form, so equalExprPtr fell back to byte equality and the diff fired. Fix: a two-pass decoder in catalog (`decodeGenerationExpr`): `\\` → `\` first, then `\'` → `'`. Order matters — doing `\'` first would over-double a literal `\` whose encoded form is `\\\\`. Probed against MySQL 8.0 for the three real-world body shapes (no escapes, single-backslash, single-apostrophe); all three round-trip cleanly through vitess after decoding. 2. **Charset introducer asymmetry.** Even after decoding, the catalog body still carries MySQL's `_utf8mb4` (or `_latin1`, etc.) introducer that the parser side doesn't have. vitess preserves introducers across parse → restore, so the symmetric canonicalisation in `equalExprPtr` left the two sides differing. Fix: `stripIntroducers` walks the AST and replaces every `*sqlparser.IntroducerExpr` with its inner expression. AST-based, so it handles every charset name vitess parses (not just `_utf8mb4`) without hard-coding. Trade-off documented in CAVEATS — a deliberate `_latin1` ↔ `_utf8mb4` change in a generated literal looks identical to the diff. Manage that change out-of-band if it matters. Tests: - diff/expr_test.go: 5 new cases pin introducer absorption (plain ↔ `_utf8mb4`, plain ↔ `_latin1`, both-explicit-same-introducer, different-introducers-equal-by-design, different-payload-still-diffs). - catalog/tables_test.go: TestDecodeGenerationExpr pins the three real-world MySQL escape shapes. - testdata/apply/generated_column_string_literal.yml: end-to-end YAML fixture exercising the full pipeline (apply → re-plan empty under verify_no_drift). Docs: - CAVEATS.md: new "Generated column expression bodies that contain string literals" section. Default behaviour is now drift-free; the documented limitations are introducer-only-change detection and pathological literal-with-quote-and-backslash mixes. - TODO.md: drop the corresponding "vitess-keyword identifier" entry (the original framing only saw one of the three drift causes; all three are now resolved). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov · 2026-05-06T02:48:15Z

Codecov Report

❌ Patch coverage is 84.61538% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.96%. Comparing base (fa1f2ba) to head (36f63ef).

Files with missing lines	Patch %	Lines
diff/expr.go	82.97%	5 Missing and 3 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #96      +/-   ##
==========================================
- Coverage   87.00%   86.96%   -0.05%     
==========================================
  Files          30       30              
  Lines        3271     3321      +50     
==========================================
+ Hits         2846     2888      +42     
- Misses        261      266       +5     
- Partials      164      167       +3

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot

Pull request overview

Fixes persistent no-op drift for generated columns whose expressions contain string literals by normalizing MySQL catalog GENERATION_EXPRESSION into a Vitess-parseable/canonical form and stripping charset introducers during semantic expression comparison.

Changes:

Decode MySQL’s escape-encoded information_schema.COLUMNS.GENERATION_EXPRESSION before storing/comparing generated expressions.
Strip Vitess IntroducerExpr nodes during expression canonicalization to avoid introducer-only drift.
Add focused unit tests plus an end-to-end YAML apply fixture; update docs/TODO accordingly.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
TODO.md	Removes resolved TODO entry related to generated-column drift.
testdata/apply/generated_column_string_literal.yml	New end-to-end fixture ensuring apply converges (no drift) for generated columns with string literals.
diff/expr.go	Strips charset introducers during canonical expression comparison (now used for GENERATED as well).
diff/expr_test.go	Adds test cases covering introducer stripping behavior and its documented trade-off.
CAVEATS.md	Documents new drift-free behavior and limitations for generated bodies with string literals.
catalog/tables.go	Decodes `GENERATION_EXPRESSION` from info_schema before storing into `Column.Generated`.
catalog/tables_test.go	Adds unit tests pinning the two-pass decoding behavior.
catalog/export_test.go	Re-exports the decoder helper for external test packages.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Two valid points: - The CAVEATS example "catalog body of `concat(\`email\`,...)`" was inside a single-backtick inline code span, but the example itself contains backticks (for the `email` / `name` column references) that markdown can't escape inside a single-backtick span. The block was rendering broken on GitHub. Promoted to a fenced code block so the embedded back-ticked identifiers display correctly. - The introducer strip in `canonicalExpr` actually applies to *every* expression slot that flows through the helper — DEFAULT, ON UPDATE, CHECK, and GENERATED — not just generated bodies. The previous CAVEATS wording implied the trade-off was scoped to generated columns, which understated the behaviour. Updated both the function doc and CAVEATS to spell out that an introducer-only difference is invisible across all four slots, and explain why this is the right trade-off (the catalog charset introducer is the source of every-plan reshape in the common case; deliberate `_latin1` vs `_utf8mb4` differences in literals are vanishingly rare). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Coverage audit on PR #96 turned up gaps between the documented behaviour and what the tests actually pinned: - CAVEATS calls out that the introducer strip applies to DEFAULT, ON UPDATE, CHECK, and GENERATED expressions (every slot that goes through canonicalExpr), but no test pinned the CHECK path specifically. Add two cases to TestEqualCheckDef: - Plain `'open'` ↔ `_utf8mb4'open'` → equal (the common introducer-on-catalog case). - `_latin1'open'` ↔ `_utf8mb4'open'` → equal by design (the documented limitation in CAVEATS — same trade-off as the GENERATED case, now pinned at the CHECK call site). - The introducer-stripping AST walk has to handle multiple literals in one expression and literals at any depth, but TestEqualExpr only had cases with one literal at the top level. Add: - Five literals across one CONCAT — proves the walk doesn't stop at the first introducer. - Nested CONCAT(UPPER(CONCAT(…))) — proves depth doesn't matter. - decodeGenerationExpr's two-pass replacement walks the whole string, but the existing fixtures only exercised single literals. Add multiple-literals-in-one-expression and the empty-literal edge case (catalog body `_utf8mb4\\\'\\\'`). Line coverage on the new functions is unchanged (the 87.5% gap on canonicalExpr / stripIntroducers is two unreachable defensive returns that fire only if `sqlparser.Rewrite` were to return a non-Expr from an Expr input — per CLAUDE.md "don't add validation for scenarios that can't happen", I'm leaving those arms uncovered rather than mocking the impossible). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

The original `canonicalExpr` lower-cased the entire restored expression with `strings.ToLower`. While generated-column bodies were unparseable on the catalog side, that hadn't mattered: equalExprPtr fell back to byte equality, which is case-sensitive. The decoder + introducer-strip in this PR finally make catalog bodies parseable, so canonicalExpr now actually runs on those values — and the global ToLower silently canonicalised `DEFAULT 'X'` and `DEFAULT 'x'` to the same string, hiding a real case-sensitive value change. Same trap for CHECK and GENERATED literals. Replace `strings.ToLower` with `lowerOutsideStringLiterals`, a state machine that lower-cases every byte except the content of single-quoted string literals. Function names, keywords, and unquoted identifiers still get the lower-case treatment they need (vitess preserves user-written case for those), but literal payload survives byte-for-byte. The state machine respects backslash escapes (`\'`) and SQL-standard doubled quotes (`''`) so it tracks literal boundaries correctly. Tests cover: - Upper-case vs lower-case literal differs (the regression). - Function name case canonicalised but literal case preserved. - Backslash escape inside literal doesn't fool the boundary tracker. - SQL-standard doubled-quote escape inside literal doesn't get read as a closing quote. Also addresses the CAVEATS wording about `dump` — the catalog reader now runs `decodeGenerationExpr` before populating `Column.Generated`, so dump output is the decoded (parser-friendly) form, not a byte-for-byte echo of `information_schema.COLUMNS.GENERATION_EXPRESSION`. Updated the "escape hatch" paragraph to reflect that. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

The previous round's `lowerOutsideStringLiterals` doc comment included the SQL-standard doubled-apostrophe escape written as two ASCII apostrophes (`''`); gofumpt's smart-quote heuristic treated the second `'` as the closing half of a quoted span and rewrote the pair into U+201D (`”`). A grep across the repo turned up four pre-existing instances of the same pattern — comments containing `` `<word>` `` code spans whose trailing backtick fell next to a backslash-escape sequence, where gofumpt then collapsed `` \` `` into the U+201C left-double-quote (`“`): - catalog/tables.go:629 (introduced last commit, my fault) - diff/views.go:26 - parser/view.go:67 - parser/view.go:68 All four were typos that survived because the offending bytes are non-ASCII, escape `git diff`'s usual visual cues, and don't trip lint. Fix all five in one pass by rewriting the doc lines to drop markdown-style code-span back-ticks where they were adjacent to escape sequences. The replacements use plain prose or single back-ticks far from any `\\` so gofumpt has nothing to canonicalise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Three doc accuracy fixes: - diff/expr.go::lowerOutsideStringLiterals — the previous round's doc said "lower-cases every byte" but the implementation only touches ASCII A–Z. ASCII-only is deliberate (vitess restores pure-ASCII keywords and identifiers into canonicalExpr's input, so a Unicode-aware lower would be slower without changing the result), so make the comment match the implementation and spell out the rationale. - diff/views.go::DiffViews — the previous round's parenthetical example "qualified form FROM ... )." had a stray closing paren that read as unbalanced. Rewrite the sentence so the parens bracket the SQL example clearly. - parser/view.go::NormalizeViewDefinition — the function's doc claimed a "ReplaceAll spaces" pass that hasn't existed in the implementation for a while; only the lowercase pass survives. Update the doc to describe what actually happens. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Round 6 of Copilot review: the previous round narrowed the lowering to ASCII A–Z and asserted vitess-restored identifiers are pure ASCII. The reviewer pointed out that the repo already exercises non-ASCII identifiers (e.g. `\`tαble\`` in diff_all_internal_test.go), so a back-ticked Greek-letter identifier whose case differs between desired and catalog would silently drift under ASCII-only lowering — the previous global `strings.ToLower` happened to handle this through Unicode case- folding. Switch the helper to gather outside-literal segments and run `strings.ToLower` on each segment while still copying literal content byte-for-byte. The state machine and escape-handling are unchanged. Byte-by-byte literal scanning stays safe for multi- byte UTF-8 because the `'` and `\` marker bytes never appear inside a UTF-8 continuation byte. Tests: - "non-ASCII identifier case difference is folded": back-ticked `tΑble` (Greek capital alpha) compares equal to `tαble` (lowercase alpha). Probed against vitess: bare non-ASCII identifiers fail to parse, but the back-ticked form round-trips and the canonicalised lower output catches the case fold. - "non-ASCII literal content case-preserved": Greek-α inside a string literal differs by case between the two sides and must stay distinct, since the literal is part of the value. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings May 6, 2026 02:46

Copilot started reviewing on behalf of winebarrel May 6, 2026 02:47 View session