diff/catalog: fix generated-column expression drift#96
Conversation
A generated column whose body contained any string literal — even a simple `' '` — fired a no-op MODIFY COLUMN on every plan and apply never converged. Two independent issues, both root-caused via probing the catalog form: 1. **Catalog encoding.** MySQL stores `information_schema.COLUMNS.GENERATION_EXPRESSION` with every `\` doubled to `\\` and every `'` (delimiter or content) preceded by `\`, so a desired-side `CONCAT(name, ' x ', name)` round-trips as `concat(\`name\`,_utf8mb4\\\' x \\\',\`name\`)`. vitess refuses the `\\\'`-as-delimiter form, so equalExprPtr fell back to byte equality and the diff fired. Fix: a two-pass decoder in catalog (`decodeGenerationExpr`): `\\` → `\` first, then `\'` → `'`. Order matters — doing `\'` first would over-double a literal `\` whose encoded form is `\\\\`. Probed against MySQL 8.0 for the three real-world body shapes (no escapes, single-backslash, single-apostrophe); all three round-trip cleanly through vitess after decoding. 2. **Charset introducer asymmetry.** Even after decoding, the catalog body still carries MySQL's `_utf8mb4` (or `_latin1`, etc.) introducer that the parser side doesn't have. vitess preserves introducers across parse → restore, so the symmetric canonicalisation in `equalExprPtr` left the two sides differing. Fix: `stripIntroducers` walks the AST and replaces every `*sqlparser.IntroducerExpr` with its inner expression. AST-based, so it handles every charset name vitess parses (not just `_utf8mb4`) without hard-coding. Trade-off documented in CAVEATS — a deliberate `_latin1` ↔ `_utf8mb4` change in a generated literal looks identical to the diff. Manage that change out-of-band if it matters. Tests: - diff/expr_test.go: 5 new cases pin introducer absorption (plain ↔ `_utf8mb4`, plain ↔ `_latin1`, both-explicit-same-introducer, different-introducers-equal-by-design, different-payload-still-diffs). - catalog/tables_test.go: TestDecodeGenerationExpr pins the three real-world MySQL escape shapes. - testdata/apply/generated_column_string_literal.yml: end-to-end YAML fixture exercising the full pipeline (apply → re-plan empty under verify_no_drift). Docs: - CAVEATS.md: new "Generated column expression bodies that contain string literals" section. Default behaviour is now drift-free; the documented limitations are introducer-only-change detection and pathological literal-with-quote-and-backslash mixes. - TODO.md: drop the corresponding "vitess-keyword identifier" entry (the original framing only saw one of the three drift causes; all three are now resolved). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #96 +/- ##
==========================================
- Coverage 87.00% 86.96% -0.05%
==========================================
Files 30 30
Lines 3271 3321 +50
==========================================
+ Hits 2846 2888 +42
- Misses 261 266 +5
- Partials 164 167 +3 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Fixes persistent no-op drift for generated columns whose expressions contain string literals by normalizing MySQL catalog GENERATION_EXPRESSION into a Vitess-parseable/canonical form and stripping charset introducers during semantic expression comparison.
Changes:
- Decode MySQL’s escape-encoded
information_schema.COLUMNS.GENERATION_EXPRESSIONbefore storing/comparing generated expressions. - Strip Vitess
IntroducerExprnodes during expression canonicalization to avoid introducer-only drift. - Add focused unit tests plus an end-to-end YAML apply fixture; update docs/TODO accordingly.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| TODO.md | Removes resolved TODO entry related to generated-column drift. |
| testdata/apply/generated_column_string_literal.yml | New end-to-end fixture ensuring apply converges (no drift) for generated columns with string literals. |
| diff/expr.go | Strips charset introducers during canonical expression comparison (now used for GENERATED as well). |
| diff/expr_test.go | Adds test cases covering introducer stripping behavior and its documented trade-off. |
| CAVEATS.md | Documents new drift-free behavior and limitations for generated bodies with string literals. |
| catalog/tables.go | Decodes GENERATION_EXPRESSION from info_schema before storing into Column.Generated. |
| catalog/tables_test.go | Adds unit tests pinning the two-pass decoding behavior. |
| catalog/export_test.go | Re-exports the decoder helper for external test packages. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Two valid points: - The CAVEATS example "catalog body of `concat(\`email\`,...)`" was inside a single-backtick inline code span, but the example itself contains backticks (for the `email` / `name` column references) that markdown can't escape inside a single-backtick span. The block was rendering broken on GitHub. Promoted to a fenced code block so the embedded back-ticked identifiers display correctly. - The introducer strip in `canonicalExpr` actually applies to *every* expression slot that flows through the helper — DEFAULT, ON UPDATE, CHECK, and GENERATED — not just generated bodies. The previous CAVEATS wording implied the trade-off was scoped to generated columns, which understated the behaviour. Updated both the function doc and CAVEATS to spell out that an introducer-only difference is invisible across all four slots, and explain why this is the right trade-off (the catalog charset introducer is the source of every-plan reshape in the common case; deliberate `_latin1` vs `_utf8mb4` differences in literals are vanishingly rare). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Coverage audit on PR #96 turned up gaps between the documented behaviour and what the tests actually pinned: - CAVEATS calls out that the introducer strip applies to DEFAULT, ON UPDATE, CHECK, and GENERATED expressions (every slot that goes through canonicalExpr), but no test pinned the CHECK path specifically. Add two cases to TestEqualCheckDef: - Plain `'open'` ↔ `_utf8mb4'open'` → equal (the common introducer-on-catalog case). - `_latin1'open'` ↔ `_utf8mb4'open'` → equal by design (the documented limitation in CAVEATS — same trade-off as the GENERATED case, now pinned at the CHECK call site). - The introducer-stripping AST walk has to handle multiple literals in one expression and literals at any depth, but TestEqualExpr only had cases with one literal at the top level. Add: - Five literals across one CONCAT — proves the walk doesn't stop at the first introducer. - Nested CONCAT(UPPER(CONCAT(…))) — proves depth doesn't matter. - decodeGenerationExpr's two-pass replacement walks the whole string, but the existing fixtures only exercised single literals. Add multiple-literals-in-one-expression and the empty-literal edge case (catalog body `_utf8mb4\\\'\\\'`). Line coverage on the new functions is unchanged (the 87.5% gap on canonicalExpr / stripIntroducers is two unreachable defensive returns that fire only if `sqlparser.Rewrite` were to return a non-Expr from an Expr input — per CLAUDE.md "don't add validation for scenarios that can't happen", I'm leaving those arms uncovered rather than mocking the impossible). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The original `canonicalExpr` lower-cased the entire restored expression with `strings.ToLower`. While generated-column bodies were unparseable on the catalog side, that hadn't mattered: equalExprPtr fell back to byte equality, which is case-sensitive. The decoder + introducer-strip in this PR finally make catalog bodies parseable, so canonicalExpr now actually runs on those values — and the global ToLower silently canonicalised `DEFAULT 'X'` and `DEFAULT 'x'` to the same string, hiding a real case-sensitive value change. Same trap for CHECK and GENERATED literals. Replace `strings.ToLower` with `lowerOutsideStringLiterals`, a state machine that lower-cases every byte except the content of single-quoted string literals. Function names, keywords, and unquoted identifiers still get the lower-case treatment they need (vitess preserves user-written case for those), but literal payload survives byte-for-byte. The state machine respects backslash escapes (`\'`) and SQL-standard doubled quotes (`''`) so it tracks literal boundaries correctly. Tests cover: - Upper-case vs lower-case literal differs (the regression). - Function name case canonicalised but literal case preserved. - Backslash escape inside literal doesn't fool the boundary tracker. - SQL-standard doubled-quote escape inside literal doesn't get read as a closing quote. Also addresses the CAVEATS wording about `dump` — the catalog reader now runs `decodeGenerationExpr` before populating `Column.Generated`, so dump output is the decoded (parser-friendly) form, not a byte-for-byte echo of `information_schema.COLUMNS.GENERATION_EXPRESSION`. Updated the "escape hatch" paragraph to reflect that. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The previous round's `lowerOutsideStringLiterals` doc comment included the SQL-standard doubled-apostrophe escape written as two ASCII apostrophes (`''`); gofumpt's smart-quote heuristic treated the second `'` as the closing half of a quoted span and rewrote the pair into U+201D (`”`). A grep across the repo turned up four pre-existing instances of the same pattern — comments containing `` `<word>` `` code spans whose trailing backtick fell next to a backslash-escape sequence, where gofumpt then collapsed `` \` `` into the U+201C left-double-quote (`“`): - catalog/tables.go:629 (introduced last commit, my fault) - diff/views.go:26 - parser/view.go:67 - parser/view.go:68 All four were typos that survived because the offending bytes are non-ASCII, escape `git diff`'s usual visual cues, and don't trip lint. Fix all five in one pass by rewriting the doc lines to drop markdown-style code-span back-ticks where they were adjacent to escape sequences. The replacements use plain prose or single back-ticks far from any `\\` so gofumpt has nothing to canonicalise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Three doc accuracy fixes: - diff/expr.go::lowerOutsideStringLiterals — the previous round's doc said "lower-cases every byte" but the implementation only touches ASCII A–Z. ASCII-only is deliberate (vitess restores pure-ASCII keywords and identifiers into canonicalExpr's input, so a Unicode-aware lower would be slower without changing the result), so make the comment match the implementation and spell out the rationale. - diff/views.go::DiffViews — the previous round's parenthetical example "qualified form FROM ... )." had a stray closing paren that read as unbalanced. Rewrite the sentence so the parens bracket the SQL example clearly. - parser/view.go::NormalizeViewDefinition — the function's doc claimed a "ReplaceAll spaces" pass that hasn't existed in the implementation for a while; only the lowercase pass survives. Update the doc to describe what actually happens. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Round 6 of Copilot review: the previous round narrowed the lowering to ASCII A–Z and asserted vitess-restored identifiers are pure ASCII. The reviewer pointed out that the repo already exercises non-ASCII identifiers (e.g. `\`tαble\`` in diff_all_internal_test.go), so a back-ticked Greek-letter identifier whose case differs between desired and catalog would silently drift under ASCII-only lowering — the previous global `strings.ToLower` happened to handle this through Unicode case- folding. Switch the helper to gather outside-literal segments and run `strings.ToLower` on each segment while still copying literal content byte-for-byte. The state machine and escape-handling are unchanged. Byte-by-byte literal scanning stays safe for multi- byte UTF-8 because the `'` and `\` marker bytes never appear inside a UTF-8 continuation byte. Tests: - "non-ASCII identifier case difference is folded": back-ticked `tΑble` (Greek capital alpha) compares equal to `tαble` (lowercase alpha). Probed against vitess: bare non-ASCII identifiers fail to parse, but the back-ticked form round-trips and the canonicalised lower output catches the case fold. - "non-ASCII literal content case-preserved": Greek-α inside a string literal differs by case between the two sides and must stay distinct, since the literal is part of the value. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Summary
A generated column whose body contained any string literal — even a simple `' '` — fired a no-op `MODIFY COLUMN` on every plan and apply never converged. Two independent issues, both root-caused by probing the actual catalog form against MySQL 8.0:
Issue 1: catalog encoding
`information_schema.COLUMNS.GENERATION_EXPRESSION` stores generated bodies in MySQL's escape form: every `\` doubled to `\\`, every `'` (delimiter or content) preceded by `\`. Desired `CONCAT(name, ' x ', name)` round-trips on the catalog side as `concat(\`name\`,_utf8mb4\\\' x \\\',\`name\`)`. vitess refuses the `\\'`-as-delimiter form, so `equalExprPtr` fell back to byte equality and the diff fired.
Fix. A two-pass decoder `catalog.decodeGenerationExpr`: `\\` → `\` first, then `\'` → `'`. Order matters — doing `\'` first would over-double a literal `\` whose encoded form is `\\\\`. Probed for the three real-world body shapes (no escapes / single-backslash / single-apostrophe); all three round-trip cleanly through vitess after decoding.
Issue 2: charset introducer asymmetry
Even after decoding, the catalog body carries MySQL's `_utf8mb4` (or `_latin1`, etc.) introducer that the parser side doesn't have. vitess preserves introducers across parse → restore, so the symmetric canonicalisation in `equalExprPtr` still saw the two sides differing.
Fix. `diff/expr.go::stripIntroducers` walks the AST and replaces every `*sqlparser.IntroducerExpr` with its inner expression. AST-based, so it handles every charset name vitess parses (not just `_utf8mb4`) without hard-coding.
Trade-off documented in CAVEATS. A deliberate `_latin1` ↔ `_utf8mb4` change in a generated literal looks identical to the diff. Manage that change out-of-band if it matters.
Tests
Docs
Test plan
🤖 Generated with Claude Code