Skip to content

catalog/diff: surface CHARACTER SET / COLLATE diffs at table and column level#40

Merged
winebarrel merged 6 commits into
mainfrom
charset-collation-diff
May 3, 2026
Merged

catalog/diff: surface CHARACTER SET / COLLATE diffs at table and column level#40
winebarrel merged 6 commits into
mainfrom
charset-collation-diff

Conversation

@winebarrel

Copy link
Copy Markdown
Owner

Summary

Closes the Medium-priority "Character set / collation diff at the table and column level" entry. Both layers were already loaded into model.Table / model.Column but the diff intentionally skipped them — silent drift on every CHARSET change.

What changes diff-side

  • diff/tables.go now emits ALTER TABLE … DEFAULT CHARSET=… [COLLATE=…] when desired and current diverge. Engine and Comment stay un-diffed (separate scope).
  • columnEqual no longer skips CharacterSet / Collation. They go through the same normalisation on both sides so an "inherited" column reads as nil → equal to a parser-side column the user didn't spell out.

Normalisation (parser ↔ catalog symmetry)

  • New model/charset.go owns the MySQL 8.0+ charset → default-collation table and the CollapseDefaultCollation helper. Both sides call it so CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci and bare CHARSET=utf8mb4 describe the same MySQL state.
  • catalog joins information_schema.COLLATIONS to populate Table.Charset from TABLE_COLLATION (single source of truth, no fragile prefix splitting). Per-column CHARACTER_SET_NAME / COLLATION_NAME that match either the table default or the charset's default collation collapse to nil.
  • parser/parseColumnDef now reads cd.Type.Charset.Name (vitess holds it on a separate struct, not in opts) so column-level CHARACTER SET isn't dropped on the floor. Previously parser saw it as nil and the catalog as non-nil → endless MODIFY COLUMN drift.

Documented limitation

AGENTS.md operational rules + a new CAVEATS.md section pin the one rough edge: changing DEFAULT CHARSET on a table with pre-existing string columns converges in two applies. The first changes the table default; the second emits MODIFY COLUMN per string column to inherit the new default. This is the honest expression of MySQL's own behaviour — ALTER TABLE … DEFAULT CHARSET=… doesn't rewrite per-column charset metadata. The change_table_charset fixture has verify_no_drift: false to honour this with the reasoning in the header comment.

A future --convert-charset flag / directive can opt into single-apply convergence via ALTER TABLE … CONVERT TO CHARACTER SET … (rewrites stored bytes, heavyweight); filed as a Medium TODO.

Tests

  • parser: TestParseTableOptions updated (utf8mb4_0900_ai_ci collapses to nil); new TestParseTableOptionsExplicitNonDefaultCollation pins that a non-default collation does survive.
  • end-to-end:
    • testdata/apply/change_column_charset.yml — single MODIFY COLUMN, verify_no_drift: true
    • testdata/apply/change_table_charset.yml — single ALTER TABLE, verify_no_drift: false documents the two-apply convergence
  • dump fixtures (4) updated for the new round-trip shape: table-level DEFAULT CHARSET=utf8mb4 instead of per-column CHARACTER SET clauses + table-only COLLATE.

Verified live on MySQL 8.0 + 9.4: round-trip plan / apply / dump, scenario suite, lint all green.

Test plan

  • make test (8.0)
  • make test-mysql9 (9.4)
  • make test-scenario (8.0)
  • make lint

🤖 Generated with Claude Code

…mn level

Closes the Medium-priority "charset diff at the table and column
level" entry. Both layers were already loaded into model.Table /
model.Column but the diff intentionally skipped them — silent drift
on every CHARSET change.

Diff
- diff/tables.go now emits `ALTER TABLE … DEFAULT CHARSET=… [COLLATE=…]`
  when desired and current diverge. Engine and Comment stay un-diffed
  (separate scope).
- columnEqual no longer skips CharacterSet / Collation. They go
  through the same normalisation on both sides so an "inherited"
  column reads as nil → equal to a parser-side column the user didn't
  spell out.

Normalisation (parser ↔ catalog symmetry)
- New model/charset.go owns the MySQL 8.0+ charset → default-collation
  table and the CollapseDefaultCollation helper. Both sides call it
  so `CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci` and bare
  `CHARSET=utf8mb4` describe the same MySQL state.
- catalog joins information_schema.COLLATIONS to populate Table.Charset
  from TABLE_COLLATION (single source of truth, no fragile prefix
  splitting). Per-column CHARACTER_SET_NAME / COLLATION_NAME that
  match either the table default or the charset's default collation
  collapse to nil.
- parser/parseColumnDef now reads cd.Type.Charset.Name (vitess holds
  it on a separate struct, not in opts) so column-level CHARACTER SET
  isn't dropped on the floor. Previously parser saw it as nil and the
  catalog as non-nil → endless MODIFY COLUMN drift.

Documented limitation
- AGENTS.md operational rules + new CAVEATS.md section pin the one
  rough edge: changing DEFAULT CHARSET on a table with pre-existing
  string columns converges in two applies. The first changes the
  table default; the second emits MODIFY COLUMN per string column to
  inherit the new default (MySQL freezes the old charset onto each
  COLUMN row when the table default flips). The change_table_charset
  fixture has verify_no_drift: false to honour this with the
  reasoning spelled out in the header comment.
- A future `--convert-charset` flag / directive can opt into single-
  apply convergence via `ALTER TABLE … CONVERT TO CHARACTER SET …`;
  filed as a Medium TODO.

Tests
- catalog: existing tests survive thanks to the symmetric
  normalisation.
- parser: TestParseTableOptions updated (utf8mb4_0900_ai_ci collapses
  to nil); TestParseTableOptionsExplicitNonDefaultCollation pins that
  a non-default collation does survive.
- end-to-end: testdata/apply/change_column_charset.yml (single
  MODIFY COLUMN, verify_no_drift: true), testdata/apply/
  change_table_charset.yml (single ALTER TABLE, verify_no_drift:
  false documents the two-apply convergence).
- testdata/dump fixtures (4 of them) updated for the new shape:
  table-level DEFAULT CHARSET=utf8mb4 instead of per-column
  CHARACTER SET clauses + table-only COLLATE.

Verified live on MySQL 8.0 + 9.4: round-trip plan / apply / dump,
scenario suite, lint all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 3, 2026 04:35
@codecov

codecov Bot commented May 3, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 66.32653% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.58%. Comparing base (6afdf0f) to head (d4da28f).
⚠️ Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
diff/tables.go 9.67% 25 Missing and 3 partials ⚠️
catalog/tables.go 93.33% 1 Missing and 1 partial ⚠️
parser/parser.go 90.00% 1 Missing and 1 partial ⚠️
model/charset.go 94.11% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #40      +/-   ##
==========================================
- Coverage   74.02%   73.58%   -0.45%     
==========================================
  Files          27       28       +1     
  Lines        2264     2347      +83     
==========================================
+ Hits         1676     1727      +51     
- Misses        425      453      +28     
- Partials      163      167       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

PR #40's first cut tested the charset/collation behaviour end-to-end
through the YAML harness, which left two pieces under-pinned:

- model/charset.go (new file) had no direct unit test, so a future
  refactor of DefaultCollationOf / CollapseDefaultCollation could
  silently change the table or the collapse semantics.
- parseColumnDef now reads cd.Type.Charset.Name from vitess. That's
  the exact code path that was broken before this PR (parser saw nil
  while catalog returned the column charset, leading to endless
  MODIFY COLUMN drift), so it deserves a direct regression test
  rather than relying on the YAML harness alone.

Add:
- TestDefaultCollationOf — covers the four most-used charsets, the
  utf8 ↔ utf8mb3 alias, and the "unknown charset → empty string"
  contract.
- TestCollapseDefaultCollation — covers both-nil, one-nil, default-
  match (collapsed), non-default (kept), unrelated charset's default
  (kept), and unknown-charset (kept) shapes.
- TestParseColumnCharacterSet — pins that explicit `CHARACTER SET`
  on a column lands on model.Column.CharacterSet, that the COLLATE
  half is captured when given, that a non-default collation
  survives, and that the charset's default collation is collapsed to
  nil on the parser side too. Comment block calls out why this path
  matters (the catalog-vs-parser drift it prevents).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Surfaces table- and column-level MySQL CHARACTER SET / COLLATE drift in myschema by loading/normalizing charset+collation consistently in parser and catalog, and emitting the appropriate ALTER TABLE / MODIFY COLUMN statements.

Changes:

  • Add shared default-collation collapsing (model.CollapseDefaultCollation) and apply it on both parser and catalog paths so implicit MySQL defaults don’t cause churn.
  • Emit table-level charset/collation diffs (ALTER TABLE … DEFAULT CHARSET=… [COLLATE=…]) and include column charset/collation in column equality comparisons to surface real drift.
  • Update end-to-end fixtures and docs to reflect the new round-trip shape and the documented two-apply convergence behavior for table default charset changes.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
catalog/tables.go Load table charset via information_schema.COLLATIONS join and normalize column/table charset+collation inheritance/defaults.
diff/tables.go Emit table-level charset DDL and compare column charset/collation in columnEqual.
parser/parser.go Parse column CHARACTER SET from vitess AST and collapse default collations on parsed tables/columns.
parser/parser_test.go Adjust table-option expectations for collapsed default collation and add explicit non-default collation test.
model/charset.go Introduce default-collation lookup table and CollapseDefaultCollation helper shared across layers.
testdata/apply/change_table_charset.yml Add fixture for table default charset change (documented 2-apply convergence).
testdata/apply/change_column_charset.yml Add fixture for explicit per-column charset change (single-apply convergence).
testdata/dump/basic.yml Update dump output expectations to prefer table-level DEFAULT CHARSET over redundant per-column clauses.
testdata/dump/include_filter.yml Same as above for include-filter dump output.
testdata/dump/view.yml Same as above for view-related dump output.
testdata/dump/with_index_and_fk.yml Same as above for tables with indexes/FKs.
TODO.md Replace previous charset/collation diff TODO with a follow-up “one-shot CONVERT” TODO.
CAVEATS.md Document the two-apply convergence behavior when changing table default charset with existing string columns.
AGENTS.md Add operational note pointing to the charset two-apply caveat.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread parser/parser.go
Comment thread parser/parser.go
Comment thread model/charset.go
Comment thread model/charset.go
…ation

+ map-completeness test

All four nits were valid (or near-valid):

1, 2. Parser preserved the user's original SQL casing, so
   `CHARSET=UTF8MB4` would never compare equal to the catalog's
   canonical lower-case `utf8mb4` and tables/columns would diff
   forever. Lower-case the column-level CHARACTER SET, the column-
   level COLLATE, and both halves of the table-level option in
   parser/parser.go so parser-side and catalog-side values are in
   the same form before any comparison.

3. `model.DefaultCollationOf` and `CollapseDefaultCollation` did a
   direct map lookup, so a mixed-case input from any future caller
   would silently bypass normalisation. Lower-case `charset` inside
   DefaultCollationOf and use `strings.EqualFold` for the coll
   comparison in CollapseDefaultCollation. Defensive — the parser
   already lower-cases above, but a pure helper that depends on
   call-site discipline is fragile.

4. Reviewer claimed `cp874` and `latin9` were missing from the
   default-collations map, which would cause endless drift on
   columns using them. Verified live against MySQL 8.0 + 9.4: those
   charsets do not exist on either server (the map already covers
   all 41 stock charsets reported by information_schema). Add a
   `TestCharsetDefaultCollationsCoverServer` integration test that
   walks `information_schema.CHARACTER_SETS` and asserts every entry
   resolves through `model.DefaultCollationOf`. The test pins the
   current completeness and will fail loudly if a future MySQL
   release ships a new charset (or the project bumps the baseline)
   so the map can be filled in before drift starts.

Unit-side coverage:
- TestDefaultCollationOf: add mixed-case rows (`UTF8MB4`, `Latin1`)
  to pin case-insensitive lookup.
- TestCollapseDefaultCollation: append a case-insensitive collapse
  case (`UTF8MB4` + `UTF8MB4_0900_AI_CI` → nil) so the EqualFold
  branch in the helper has direct coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread parser/parser.go Outdated
Comment thread diff/tables.go Outdated
… side

Two related convergence bugs surfaced in the second Copilot pass; both
boil down to "the desired side spelled out only one half of the
charset/collation pair, but the other side carries both".

#1 parser/parser.go column-level COLLATE-only

Per-column collation collapse only used `c.CharacterSet`, which is
nil when the user wrote just `COLLATE …` on the column. The catalog
side resolves the effective charset (column charset, falling back to
the table default) before collapsing, so a redundantly-spelled
default collation collapsed on the catalog side but survived on the
parser side, drifting forever as MODIFY COLUMN.

Fix: use the same effective-charset fallback in the parser pass
(column.CharacterSet ?? table.Charset) before calling
CollapseDefaultCollation.

#2 diff/tables.go don't-care semantics for table-level diff

`tableCharsetCollationSQL` did a strict ptrEq on both fields. When
desired specifies only `COLLATE=…` (Charset=nil), the catalog's
non-nil Charset never matched and the same `ALTER TABLE …
COLLATE=…` got emitted on every plan. Symmetrically, a desired
`DEFAULT CHARSET=…` with no COLLATE would have looped if we'd taken
that path on the Collation field.

Fix: treat a desired field set to nil as "the user didn't say, so
anything matches". The result is convergent semantics for either
side specified alone, with the existing both-non-nil and both-nil
paths unchanged.

Tests
- TestParseColumnCollateOnlyInheritsTableCharset: COLLATE-only column
  with the table-default charset's default collation collapses to nil;
  a non-default COLLATE on a COLLATE-only column survives. Pins #1.
- testdata/plan/charset_collate_only_no_changes.yml: desired SQL has
  `COLLATE=utf8mb4_unicode_ci` only (no DEFAULT CHARSET), matches the
  catalog state, plan is empty. Pins #2 end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread model/charset.go
Comment thread diff/tables.go Outdated
Two nits in this pass:

#1 model/charset.go map "missing cp1252" — false alarm.
Same shape as Copilot's earlier `cp874` / `latin9` claim: the
identifier doesn't actually exist in MySQL 8.0+. Verified live on
8.0 + 9.4: information_schema.CHARACTER_SETS lists cp1250, cp1251,
cp1256, cp1257 — there is no cp1252. The map already covers all 41
stock charsets (verified via TestCharsetDefaultCollationsCoverServer
on both servers). No code change.

#2 diff/tables.go docstring inaccuracy — fixed.
`tableCharsetCollationSQL`'s docstring claimed the emitted DDL was
always `ALTER TABLE … DEFAULT CHARSET=… COLLATE=…`, but the
function only spells out the clauses the desired side actually set
(either / both, depending on what differed). Reword the docstring
to match: `… DEFAULT CHARSET=…` alone, `… COLLATE=…` alone, or both
together — three shapes covered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread diff/tables.go Outdated
Comment thread parser/parser.go
"reset to default" semantics

Two more convergence holes the previous pass left open. Both stem
from over-reaching the simplifications I made earlier.

#1 parser table-level COLLATE-only

Last pass added effective-charset fallback for column-level COLLATE-
only, but the *table-level* equivalent was still left bare:

    CREATE TABLE t (...) COLLATE=utf8mb4_0900_ai_ci;

(no DEFAULT CHARSET) → parser had t.Charset=nil so
CollapseDefaultCollation couldn't fire, while catalog (which always
knows the effective charset via information_schema.COLLATIONS) did
collapse — endless `ALTER TABLE … COLLATE=…` loops.

Add `model.CharsetOfCollation(coll)` (splits on the first `_`, with
the `binary` self-named special case) and a parser-side
`effectiveCharsetForCollation` helper that walks the chain
(declared charset → table default → derived from collation name)
before calling CollapseDefaultCollation. Used for both table-level
and column-level normalisation.

#2 diff don't-care semantics was too loose

Last pass treated `desired.Collation == nil` as "don't care", which
silently swallowed a real intent: dropping COLLATE from desired SQL
while keeping DEFAULT CHARSET means "reset to the charset's default
collation". With the catch-all don't-care, that case never emitted
an ALTER even when the catalog carried an explicit non-default
collation.

Refine the comparison:
- desired.Collation != nil → strict ptrEq (unchanged).
- desired.Collation == nil && desired.Charset != nil → catalog must
  be at default (current.Collation == nil after collapse); otherwise
  emit `ALTER TABLE … DEFAULT CHARSET=…` so MySQL resets.
- desired.Charset == nil treated as "no opinion on charset".
- Both nil → handled by the existing early return.

Tests
- model: TestCharsetOfCollation pins prefix split, the `binary`
  special case, mixed-case input, and the no-underscore fallback.
- parser: TestParseTableOptionsCollateOnly has two sub-tests — the
  default collation collapses to nil, a non-default one survives.
- diff: testdata/plan/charset_drop_explicit_collation.yml — catalog
  has explicit COLLATE=utf8mb4_unicode_ci, desired drops it but
  keeps DEFAULT CHARSET=utf8mb4 → plan emits a single
  `ALTER TABLE … DEFAULT CHARSET=utf8mb4` (no COLLATE).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@winebarrel winebarrel merged commit 1a34091 into main May 3, 2026
7 of 9 checks passed
@winebarrel winebarrel deleted the charset-collation-diff branch May 3, 2026 05:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants