Migrate kimi-k2-turbo-preview → kimi-k2.6 (closes #68) #69
Moonshot announced (email 2026-05-04) that the kimi-k2-* hyphenated named-feature variants, including kimi-k2-turbo-preview, will be discontinued on 2026-05-25 and replaced by kimi-k2.6. The dot-versioned line (k2.5, k2.6) is the continuing one; our other Moonshot model (kimi-k2.5) is unaffected.

Three active code references updated:
- scripts/plot_results.py MODELS registry entry
- scripts/run_full_benchmark.py MODELS dict entry
- scripts/README.md documented sweep recipe

Tier placement is left as the existing sonnet slot for this PR. K2.6 is structurally the new flagship-line model rather than a "secondary/cheaper" variant, but moving it to the flagship tier forces a chart-layout discontinuity (K2.5 sliding from flagship → sonnet between v0.0.11 results and the next sweep). Better timed with fresh K2.6 result files than with a registry-only change. Issue #68 tracks the deferred decision.

Out of scope (unchanged):
- 9 historical results/moonshot-kimi-k2-turbo-preview-*.jsonl files: frozen audit records of what was measured at the time.
- README.md / ROADMAP.md v0.0.7 narrative: historical text describing past benchmarks. Rewriting falsifies history.
- tests/test_models.py uses moonshot/kimi-k2 (generic, not the deprecated SKU) for routing tests. Nothing to update.

No bench-version bump: this is a model-registry update, not a methodology change. Same scoring contract, no result files affected.

Verified:
- ruff check + ruff format --check clean.
- pytest tests/ → 495 passed (unchanged from main).
- scripts/plot_results.py runs against new registry; warns gracefully on missing K2.6 result files and produces a chart with zero K2.6 bars.

Closes #68.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 Walkthrough
Replace deprecated Moonshot SKU
Changes
Moonshot Model Deprecation Migration
Estimated Code Review Effort
🎯 2 (Simple) | ⏱️ ~10 minutes
Possibly Related PRs
Suggested Labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #69 +/- ##
=======================================
Coverage 83.65% 83.65%
=======================================
Files 10 10
Lines 1395 1395
=======================================
Hits 1167 1167
Misses 228 228
|
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/README.md (1)
106-115: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Stale model name in the timing table.
The "Sweeping the full matrix" section now loops over moonshot/kimi-k2.6, but the timing reference table immediately below still lists the deprecated predecessor:
| Moonshot K2 Turbo | ~1.5 h |
This is the scripts/ operational README (not the top-level README.md / ROADMAP.md preserved as historical record), so it should reflect the active sweep. As no K2.6 timing data exists yet, the row can be updated with a TBD note until a real sweep is completed.
📝 Proposed fix
-| Moonshot K2 Turbo | ~1.5 h |
+| Moonshot K2.6 | TBD (first sweep pending) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/README.md` around lines 106 - 115, Update the timing table row that currently reads "Moonshot K2 Turbo | ~1.5 h" to reflect the active sweep which loops over "moonshot/kimi-k2.6": replace the K2 Turbo entry with a placeholder like "Moonshot K2.6 | TBD" (or similar TBD text) so the scripts/README.md "Sweeping the full matrix" section matches the current sweep; ensure you modify the table row containing the literal "Moonshot K2 Turbo" so the docs no longer reference the deprecated predecessor.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@scripts/README.md`:
- Around line 106-115: Update the timing table row that currently reads
"Moonshot K2 Turbo | ~1.5 h" to reflect the active sweep which loops over
"moonshot/kimi-k2.6": replace the K2 Turbo entry with a placeholder like
"Moonshot K2.6 | TBD" (or similar TBD text) so the scripts/README.md "Sweeping
the full matrix" section matches the current sweep; ensure you modify the table
row containing the literal "Moonshot K2 Turbo" so the docs no longer reference
the deprecated predecessor.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 5d8cfb8a-7243-42a8-a1ff-2f3ab2f669fb
📒 Files selected for processing (3)
- scripts/README.md
- scripts/plot_results.py
- scripts/run_full_benchmark.py
The 'Timing expectations' table in scripts/README.md still showed 'Moonshot K2 Turbo | ~1.5 h' after the migration, which mismatched the sweep recipe that loops over moonshot/kimi-k2.6. Both rows now appear: K2.6 with TBD pending the first sweep, and K2 Turbo kept with an explicit 'historical; SKU deprecated 2026-05-25' annotation so the prior real-world data point isn't lost. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@coderabbitai Valid finding, and verified against the current branch — line 112 of Addressed in b35e4da with a small variation on your suggestion: rather than replacing the K2 Turbo row outright, I kept it with an explicit "historical; SKU deprecated 2026-05-25" annotation and added a The trailing paragraph also updated to note that K2.6 timings will be filled in after the first sweep. |
|
The previous comment defended placing K2.6 in the sonnet slot 'for now' on the grounds that K2.6 is the new flagship — which is self-contradictory if you read it twice. Moonshot promoted K2.6 to its flagship line; the chart tier should reflect that. K2.5 moves to sonnet, joining Claude Sonnet 4 and GPT-4o as 'previous-generation / secondary slot from each provider' — a tighter semantic grouping than the prior mix anyway. No code or test surface affected; chart generation is unchanged in mechanics. Registry-only change. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Followup: promoted K2.6 to flagship and demoted K2.5 to sonnet in 609171d. The earlier "K2.6 in sonnet, defer the tier call" framing was self-contradictory once I re-read it — defending a placement on the grounds that it's structurally wrong is a flag that the placement is wrong. New layout:
Layout symmetry preserved (3 + 3), and sonnet is now a tighter semantic grouping ("previous-generation slot from each provider") rather than a mix of previous-gen + cost-optimized variants. The K2.5 flagship→sonnet shift between v0.0.11 and future sweeps is a real chart discontinuity but it's information, not noise — Moonshot updated their flagship, and the chart should reflect that the moment the registry does. PR body updated to match. Issue #68's "tier placement deferred to follow-up" framing no longer applies — closing it here resolves the whole question. |
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/plot_results.py`:
- Around line 96-106: The PR incorrectly changes benchmark tiers for Kimi
models; instead of promoting "Kimi K2.6" (ModelSpec("Kimi K2.6",
"moonshot-kimi-k2.6", ...)) to "flagship" and demoting "Kimi K2.5"
(ModelSpec("Kimi K2.5", "moonshot-kimi-k2.5", ...)) to "sonnet", preserve their
original tier assignments and only perform the registry rename/deprecation swap
on the identifier fields; update the model name/ID strings as needed but leave
the third argument (tier) for both ModelSpec entries unchanged to respect the
registry-only migration scope.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 79ea0898-b688-411b-b4ea-3ce1ea04b7dd
📒 Files selected for processing (1)
scripts/plot_results.py
Standalone script that renders the v0.0.7 result panels as 16:9 slides sized and styled for talk presentation.

Three slide types:
- delta: "Does Vera beat Python / TypeScript?" headline chart
- tiers: Flagship and Sonnet tier comparisons side-by-side
- all-modes: all 6 models × 4 modes in a single grouped-bar panel

Specialised, not general. The v0.0.7 model lineup (K2.5 in flagship, K2 Turbo in sonnet) is hard-coded because the live plot_results.MODELS registry now reflects the post-K2.6 migration (PR aallan#69). Reuses palette + extract_data() from plot_results.py so slide numbers match the README chart by construction.

Typography is bumped roughly 3× from the documentation-chart sizes so the slide reads from the back of a room. Slide canvas is 16×9 in at dpi=180 (2880×1620 px), with a tunable --background flag offering four light-theme options (paper / white / cream / light-grey). Dark mode is deliberately out of scope: it would require cascading text-colour inversion that the current talk's design doesn't need.

Output handling:
- PNGs default to /tmp/ because they're talk-prep ephemera that belong in the speaker's deck rather than the repo
- assets/vera-bench_slide_*.png is gitignored for the case where someone outputs to assets/ for preview; the canonical artefact is the script itself, and regeneration is cheap
- Brief section added to scripts/README.md flagging the v0.0.7 pin so future-me doesn't try to use this for a different release

Verified:
- ruff check + ruff format --check clean
- All three slides render cleanly on default paper background
- Numbers cross-check against the v0.0.7 documentation chart (each of 6 models × 4 modes = 24 cells)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
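The canvas arithmetic and the `--background` flag described in this commit can be sketched roughly as follows. This is a minimal illustration and not the script itself: the names `BACKGROUNDS`, `pixel_size`, and `parse_args` are hypothetical, and the hex colours are placeholder assumptions. Only the 16×9 in canvas, dpi=180, the four option names, and the paper default come from the commit message.

```python
import argparse

# The four light-theme option names come from the commit message;
# the hex values are placeholder assumptions, not the script's palette.
BACKGROUNDS = {
    "paper": "#f7f4ec",
    "white": "#ffffff",
    "cream": "#fff8e7",
    "light-grey": "#eeeeee",
}

FIGSIZE_IN = (16, 9)  # slide canvas in inches (16:9)
DPI = 180


def pixel_size(figsize_in, dpi):
    """Convert a (width, height) size in inches to pixels at the given dpi."""
    return tuple(int(side * dpi) for side in figsize_in)


def parse_args(argv=None):
    """Parse the hypothetical CLI; only --background is sketched here."""
    parser = argparse.ArgumentParser(description="Render talk slides (sketch)")
    parser.add_argument(
        "--background",
        choices=sorted(BACKGROUNDS),
        default="paper",
        help="light-theme slide background",
    )
    return parser.parse_args(argv)


args = parse_args(["--background", "cream"])
width_px, height_px = pixel_size(FIGSIZE_IN, DPI)
print(args.background, BACKGROUNDS[args.background], width_px, height_px)
```

Restricting the flag with `choices` means an unsupported theme such as a dark background is rejected at argument-parsing time, which matches the stated decision to keep dark mode out of scope.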
Summary
Closes #68.
Moonshot announced (email 2026-05-04) that the `kimi-k2-*` hyphenated named-feature variants, including `kimi-k2-turbo-preview`, will be discontinued on 2026-05-25 and replaced by `kimi-k2.6`. The dot-versioned line (`k2.5`, `k2.6`) is the continuing one.
What changed
- `scripts/plot_results.py`: `Kimi K2.6` → flagship slot; `Kimi K2.5` demoted to sonnet (replacing the previous `Kimi K2 Turbo` row entirely)
- `scripts/run_full_benchmark.py`: `moonshot/kimi-k2.6` replaces `moonshot/kimi-k2-turbo-preview`
- `scripts/README.md`: `moonshot/kimi-k2.6`; timing table keeps the K2 Turbo row annotated as historical (deprecated 2026-05-25) and adds a `Moonshot K2.6
Tier reassignment
K2.6 is structurally the new flagship-line model from Moonshot, so it goes in the flagship tier alongside Claude Opus 4 and GPT-4.1. K2.5 moves to sonnet, joining Claude Sonnet 4 and GPT-4o as "previous-generation / secondary slot from each provider" — a tighter semantic grouping than the previous mix (where the sonnet Moonshot slot held the cost-optimized Turbo variant rather than a same-generation sibling).
Layout symmetry preserved: 3 flagship + 3 sonnet. The K2.5 flagship→sonnet move is a real chart-layout discontinuity between v0.0.11 results and future sweeps, but the discontinuity is information — Moonshot updated their flagship, and the chart should reflect that the moment the registry does, not several PRs later.
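The resulting layout can be sketched as a small registry. This is an illustration under assumptions rather than the contents of `scripts/plot_results.py`: the review thread shows `ModelSpec` taking a display name, a file prefix, and a tier as its first three arguments, but the field names below, the omission of any further fields, and the file prefixes of the non-Moonshot models are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Illustrative stand-in; the real ModelSpec may carry more fields."""

    label: str        # display name on the chart
    file_prefix: str  # prefix of the model's result files
    tier: str         # "flagship" or "sonnet"


MODELS = [
    # Non-Moonshot file prefixes are hypothetical placeholders.
    ModelSpec("Claude Opus 4", "anthropic-claude-opus-4", "flagship"),
    ModelSpec("GPT-4.1", "openai-gpt-4.1", "flagship"),
    ModelSpec("Kimi K2.6", "moonshot-kimi-k2.6", "flagship"),
    ModelSpec("Claude Sonnet 4", "anthropic-claude-sonnet-4", "sonnet"),
    ModelSpec("GPT-4o", "openai-gpt-4o", "sonnet"),
    ModelSpec("Kimi K2.5", "moonshot-kimi-k2.5", "sonnet"),
]


def by_tier(models):
    """Group display labels by tier, preserving registry order."""
    tiers = {}
    for spec in models:
        tiers.setdefault(spec.tier, []).append(spec.label)
    return tiers


tiers = by_tier(MODELS)
# Layout symmetry: three models per tier.
assert len(tiers["flagship"]) == len(tiers["sonnet"]) == 3
print(tiers)
```

A check like the final assertion is one way the 3 + 3 symmetry could be guarded if the registry changes again.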
Out of scope (deliberately untouched)
- `results/moonshot-kimi-k2-turbo-preview-*.jsonl` files: frozen audit records of what was measured at the time
- `README.md` / `ROADMAP.md` v0.0.7 narrative: historical text describing past benchmarks; rewriting falsifies history
- `tests/test_models.py` uses `moonshot/kimi-k2` (generic, not the deprecated SKU) for routing tests; no change needed
No version bump
Model-registry update + chart-tier reassignment, not a methodology change. Same scoring contract, no result files affected. Mirrors the precedent of registry/config changes (PR #58 Cloudflare User-Agent, PR #67 baseline `bench_version`).
Test plan
- `ruff check` + `ruff format --check` clean
- `pytest tests/` → 495 passed (unchanged from `main`)
- `scripts/plot_results.py --version 0.0.7` runs against new registry; warns gracefully on missing K2.6 result files; K2.5 in sonnet correctly resolves existing v0.0.7 file (file lookup is by `file_prefix`, not tier)
🤖 Generated with Claude Code
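The last test-plan item relies on result files being located by `file_prefix` alone, so a tier change does not affect which files a model resolves to. A minimal sketch of that behaviour, assuming a hypothetical helper `find_result_files` and an example file name (neither is taken from `plot_results.py`; only the `<prefix>-*.jsonl` naming pattern appears in this PR):

```python
import tempfile
import warnings
from pathlib import Path


def find_result_files(results_dir: Path, file_prefix: str) -> list[Path]:
    """Return result files matching <file_prefix>-*.jsonl; warn if none exist.

    Hypothetical helper: the tier is deliberately not an input, so
    moving a model between tiers cannot change what is found.
    """
    matches = sorted(results_dir.glob(f"{file_prefix}-*.jsonl"))
    if not matches:
        warnings.warn(f"no result files for prefix {file_prefix!r}; bars omitted")
    return matches


with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp)
    # Example file name is an assumption; only the prefix matters here.
    (results / "moonshot-kimi-k2.5-example.jsonl").write_text("{}\n")

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        k25 = find_result_files(results, "moonshot-kimi-k2.5")  # existing file
        k26 = find_result_files(results, "moonshot-kimi-k2.6")  # none yet

print(len(k25), len(k26), len(caught))
```

Returning an empty list with a warning, instead of raising, is what lets a chart render with zero K2.6 bars until the first sweep produces result files.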