Migrate kimi-k2-turbo-preview → kimi-k2.6 (closes #68) #69

Merged: aallan merged 3 commits into main from deps/kimi-k2.6-migration on May 12, 2026

Conversation

aallan (Owner) commented May 7, 2026

Summary

Closes #68.

Moonshot announced (email 2026-05-04) that the kimi-k2-* hyphenated named-feature variants — including kimi-k2-turbo-preview — will be discontinued 2026-05-25 and replaced by kimi-k2.6. The dot-versioned line (k2.5, k2.6) is the continuing one.

What changed

| File | Change |
| --- | --- |
| scripts/plot_results.py | Registry: Kimi K2.6 takes the flagship slot; Kimi K2.5 demoted to sonnet (replacing the previous Kimi K2 Turbo row entirely) |
| scripts/run_full_benchmark.py | moonshot/kimi-k2.6 replaces moonshot/kimi-k2-turbo-preview |
| scripts/README.md | Sweep recipe references moonshot/kimi-k2.6; timing table keeps the K2 Turbo row annotated as historical (deprecated 2026-05-25) and adds a `Moonshot K2.6 |

Tier reassignment

K2.6 is structurally the new flagship-line model from Moonshot, so it goes in the flagship tier alongside Claude Opus 4 and GPT-4.1. K2.5 moves to sonnet, joining Claude Sonnet 4 and GPT-4o as "previous-generation / secondary slot from each provider" — a tighter semantic grouping than the previous mix (where the sonnet Moonshot slot held the cost-optimized Turbo variant rather than a same-generation sibling).

Layout symmetry preserved: 3 flagship + 3 sonnet. The K2.5 flagship→sonnet move is a real chart-layout discontinuity between v0.0.11 results and future sweeps, but the discontinuity is information — Moonshot updated their flagship, and the chart should reflect that the moment the registry does, not several PRs later.
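The tier layout described above can be sketched roughly as follows. This is an illustration, not the contents of scripts/plot_results.py: the `ModelSpec` field names and the Claude/GPT file prefixes are assumptions, and only the three-positional-argument form and the two Moonshot prefixes appear in this thread.

```python
from collections import Counter
from typing import NamedTuple


class ModelSpec(NamedTuple):
    # Field names are assumed; the thread only shows three positional arguments.
    name: str
    file_prefix: str
    tier: str


MODELS = [
    # The non-Moonshot file prefixes below are hypothetical placeholders.
    ModelSpec("Claude Opus 4", "anthropic-claude-opus-4", "flagship"),
    ModelSpec("GPT-4.1", "openai-gpt-4.1", "flagship"),
    ModelSpec("Kimi K2.6", "moonshot-kimi-k2.6", "flagship"),  # new flagship entry
    ModelSpec("Claude Sonnet 4", "anthropic-claude-sonnet-4", "sonnet"),
    ModelSpec("GPT-4o", "openai-gpt-4o", "sonnet"),
    ModelSpec("Kimi K2.5", "moonshot-kimi-k2.5", "sonnet"),  # moved from flagship
]

# Count models per tier to confirm the 3 + 3 layout symmetry.
tier_counts = Counter(spec.tier for spec in MODELS)
print(dict(tier_counts))
```

A count like this could also serve as a cheap guard in a test, so that a future registry edit that breaks the 3 + 3 layout is noticed before the chart is regenerated.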

Out of scope (deliberately untouched)

  • 9 historical results/moonshot-kimi-k2-turbo-preview-*.jsonl files — frozen audit records of what was measured at the time
  • README.md / ROADMAP.md v0.0.7 narrative — historical text describing past benchmarks; rewriting falsifies history
  • tests/test_models.py uses moonshot/kimi-k2 (generic, not the deprecated SKU) for routing tests; no change needed

No version bump

Model-registry update + chart-tier reassignment, not a methodology change. Same scoring contract, no result files affected. Mirrors the precedent of registry/config changes (PR #58 Cloudflare User-Agent, PR #67 baseline bench_version).

Test plan

  • ruff check + ruff format --check clean
  • pytest tests/ → 495 passed (unchanged from main)
  • scripts/plot_results.py --version 0.0.7 runs against new registry; warns gracefully on missing K2.6 result files; K2.5 in sonnet correctly resolves existing v0.0.7 file (file lookup is by file_prefix, not tier)
  • Chart generation produces a sensible layout with K2.6 in flagship + K2.5 in sonnet
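The lookup behaviour noted in the test plan (files resolved by file_prefix rather than tier, with a warning when nothing matches) might look something like the sketch below. `find_result_files` is a hypothetical helper, not the function in scripts/plot_results.py, and the `<prefix>-*.jsonl` pattern is inferred from the result file names quoted in this PR.

```python
import tempfile
import warnings
from pathlib import Path


def find_result_files(results_dir, file_prefix):
    """Return result files whose names start with file_prefix (hypothetical helper)."""
    matches = sorted(Path(results_dir).glob(f"{file_prefix}-*.jsonl"))
    if not matches:
        # Warn instead of raising so the remaining models still get plotted.
        warnings.warn(f"no result files for {file_prefix!r}; skipping")
    return matches


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for an existing K2.5 result file; the suffix is made up.
    (Path(tmp) / "moonshot-kimi-k2.5-example.jsonl").touch()

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        found_k25 = find_result_files(tmp, "moonshot-kimi-k2.5")
        found_k26 = find_result_files(tmp, "moonshot-kimi-k2.6")
```

Because the tier never enters the lookup, moving K2.5 from flagship to sonnet leaves its existing result files resolvable, while K2.6 only produces a warning until its first sweep lands.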

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Chores
    • Replaced Kimi K2 Turbo with Kimi K2.6 across benchmark tooling and model lists.
  • Documentation
    • Updated examples and timing notes to reference Moonshot K2.6, with timings marked TBD until full sweep completes.

Review Change Stack

Moonshot announced (email 2026-05-04) that the kimi-k2-* hyphenated
named-feature variants — including kimi-k2-turbo-preview — will be
discontinued on 2026-05-25 and replaced by kimi-k2.6. The dot-versioned
line (k2.5, k2.6) is the continuing one; our other Moonshot model
(kimi-k2.5) is unaffected.

Three active code references updated:

- scripts/plot_results.py MODELS registry entry
- scripts/run_full_benchmark.py MODELS dict entry
- scripts/README.md documented sweep recipe

Tier placement is left as the existing sonnet slot for this PR. K2.6 is
structurally the new flagship-line model rather than a "secondary/cheaper"
variant, but moving it to the flagship tier forces a chart-layout
discontinuity (K2.5 sliding from flagship → sonnet between v0.0.11
results and the next sweep). Better timed with fresh K2.6 result files
than with a registry-only change. Issue #68 tracks the deferred decision.

Out of scope (unchanged):
- 9 historical results/moonshot-kimi-k2-turbo-preview-*.jsonl files —
  frozen audit records of what was measured at the time.
- README.md / ROADMAP.md v0.0.7 narrative — historical text describing
  past benchmarks. Rewriting falsifies history.
- tests/test_models.py uses moonshot/kimi-k2 (generic, not the deprecated
  SKU) for routing tests. Nothing to update.

No bench-version bump: this is a model-registry update, not a
methodology change. Same scoring contract, no result files affected.

Verified:
- ruff check + ruff format --check clean.
- pytest tests/ → 495 passed (unchanged from main).
- scripts/plot_results.py runs against new registry; warns gracefully
  on missing K2.6 result files and produces a chart with zero K2.6 bars.

Closes #68.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
coderabbitai Bot commented May 7, 2026

📝 Walkthrough

Replace deprecated Moonshot SKU kimi-k2-turbo-preview with kimi-k2.6 in the plotting registry, the benchmark runner model list, and the README sweep example and timing table.

Changes

Moonshot Model Deprecation Migration

| Layer / File(s) | Summary |
| --- | --- |
| Model registry update (scripts/plot_results.py) | Remove ModelSpec("Kimi K2 Turbo", "moonshot-kimi-k2-turbo-preview", "sonnet"); add Kimi K2.6 and move Kimi K2.5 into sonnet with inline comments documenting tier changes. |
| Benchmark runner update (scripts/run_full_benchmark.py) | Replace ("Kimi K2 Turbo", "moonshot/kimi-k2-turbo-preview") with ("Kimi K2.6", "moonshot/kimi-k2.6") in MODELS["moonshot"]. |
| Documentation & examples (scripts/README.md) | Update sweep example to reference moonshot/kimi-k2.6; add Moonshot K2.6 timing row marked TBD and update narrative to state timings will be filled after the first full sweep. |
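As a rough sketch of the runner-side change summarized above, the snippet below shows a MODELS["moonshot"] entry in the (display name, model id) tuple form quoted in the walkthrough, plus a guard against the discontinued SKU. The other entries of the real dict are omitted, and `DEPRECATED_SKUS`, `file_prefix_for`, and `deprecated_entries` are hypothetical names. The slash-to-hyphen mapping is an assumption inferred from the identifiers in this thread (moonshot/kimi-k2.6 versus moonshot-kimi-k2.6).

```python
# SKU that Moonshot discontinues on 2026-05-25, per the PR description.
DEPRECATED_SKUS = {"moonshot/kimi-k2-turbo-preview"}

# Only the entry touched by this PR is shown; other entries are omitted.
MODELS = {
    "moonshot": [
        ("Kimi K2.6", "moonshot/kimi-k2.6"),
    ],
}


def file_prefix_for(model_id):
    """Map a provider/model id to a result-file prefix (assumed convention)."""
    return model_id.replace("/", "-")


def deprecated_entries(models):
    """List model ids in the registry that are on the deprecated list."""
    return [
        model_id
        for entries in models.values()
        for _name, model_id in entries
        if model_id in DEPRECATED_SKUS
    ]


print(file_prefix_for("moonshot/kimi-k2.6"))
print(deprecated_entries(MODELS))
```

A check along the lines of `deprecated_entries` is one way to catch a stale SKU before a sweep starts spending API budget on a model that no longer exists.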

Estimated Code Review Effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly Related PRs

Suggested Labels

ci

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Out of Scope Changes check | ❓ Inconclusive | Changes are scoped to the three identified files with documented tier-placement deferral. However, commit message mentions tier promotion decisions inconsistent with PR objectives stating deferral. | Clarify whether the commit message tier decisions (K2.6→flagship, K2.5→sonnet) override the PR's stated deferral until K2.6 sweep data is available. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: migrating from the deprecated kimi-k2-turbo-preview model to kimi-k2.6 across three configuration files. |
| Linked Issues check | ✅ Passed | All three acceptance criteria are met: deprecated SKU removed, K2.6 added in sonnet tier, existing tests pass. Tier-placement deferral is documented in code comments. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


codecov Bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.65%. Comparing base (e697ea2) to head (609171d).

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #69   +/-   ##
=======================================
  Coverage   83.65%   83.65%           
=======================================
  Files          10       10           
  Lines        1395     1395           
=======================================
  Hits         1167     1167           
  Misses        228      228           
| Flag | Coverage Δ |
| --- | --- |
| python | 83.65% <ø> (ø) |


coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/README.md (1)

106-115: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stale model name in the timing table.

The "Sweeping the full matrix" section now loops over moonshot/kimi-k2.6, but the timing reference table immediately below still lists the deprecated predecessor:

| Moonshot K2 Turbo | ~1.5 h |

This is the scripts/ operational README (not the top-level README.md/ROADMAP.md preserved as historical record), so it should reflect the active sweep. As no K2.6 timing data exists yet, the row can be updated with a TBD note until a real sweep is completed.

📝 Proposed fix
-| Moonshot K2 Turbo | ~1.5 h |
+| Moonshot K2.6     | TBD (first sweep pending) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/README.md` around lines 106 - 115, Update the timing table row that
currently reads "Moonshot K2 Turbo | ~1.5 h" to reflect the active sweep which
loops over "moonshot/kimi-k2.6": replace the K2 Turbo entry with a placeholder
like "Moonshot K2.6 | TBD" (or similar TBD text) so the scripts/README.md
"Sweeping the full matrix" section matches the current sweep; ensure you modify
the table row containing the literal "Moonshot K2 Turbo" so the docs no longer
reference the deprecated predecessor.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5d8cfb8a-7243-42a8-a1ff-2f3ab2f669fb

📥 Commits

Reviewing files that changed from the base of the PR and between e697ea2 and 3c97621.

📒 Files selected for processing (3)
  • scripts/README.md
  • scripts/plot_results.py
  • scripts/run_full_benchmark.py

The 'Timing expectations' table in scripts/README.md still showed 'Moonshot K2 Turbo | ~1.5 h' after the migration, which mismatched the sweep recipe that loops over moonshot/kimi-k2.6. Both rows now appear: K2.6 with TBD pending the first sweep, and K2 Turbo kept with an explicit 'historical; SKU deprecated 2026-05-25' annotation so the prior real-world data point isn't lost.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
aallan (Owner, Author) commented May 12, 2026

@coderabbitai Valid finding, and verified against the current branch — line 112 of scripts/README.md did still read | Moonshot K2 Turbo | ~1.5 h |, which mismatched the K2.6 sweep recipe two sections above.

Addressed in b35e4da with a small variation on your suggestion: rather than replacing the K2 Turbo row outright, I kept it with an explicit "historical; SKU deprecated 2026-05-25" annotation and added a Moonshot K2.6 | TBD row pointing at issue #68. Reasoning: the table's preamble explicitly frames it as "Rough per-model totals observed on v0.0.9" — historical observations, not current projections — so the K2 Turbo timing is the only real-world Moonshot Turbo data point we have, useful for budgeting future K2.6 sweeps. Dropping it loses information; relabeling it preserves the data while removing the table-vs-sweep-recipe confusion you correctly flagged.

The trailing paragraph also updated to note that K2.6 timings will be filled in after the first sweep.

coderabbitai Bot commented May 12, 2026

@aallan That's a better resolution than a straight replacement — annotating the K2 Turbo row as historical with the deprecation date preserves the only real-world Moonshot Turbo sweep timing we have (useful as a rough lower-bound budget for K2.6 runs), while the new | Moonshot K2.6 | TBD | row pointing at #68 makes the forward path explicit. The table now correctly reflects both the historical record and the current registry state. Thanks for the clear reasoning.

The previous comment defended placing K2.6 in the sonnet slot 'for now' on the grounds that K2.6 is the new flagship — which is self-contradictory if you read it twice. Moonshot promoted K2.6 to its flagship line; the chart tier should reflect that. K2.5 moves to sonnet, joining Claude Sonnet 4 and GPT-4o as 'previous-generation / secondary slot from each provider' — a tighter semantic grouping than the prior mix anyway.

No code or test surface affected; chart generation is unchanged in mechanics. Registry-only change.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
aallan (Owner, Author) commented May 12, 2026

Followup: promoted K2.6 to flagship and demoted K2.5 to sonnet in 609171d. The earlier "K2.6 in sonnet, defer the tier call" framing was self-contradictory once I re-read it — defending a placement on the grounds that it's structurally wrong is a flag that the placement is wrong. New layout:

  • Flagship: Claude Opus 4, GPT-4.1, Kimi K2.6
  • Sonnet: Claude Sonnet 4, GPT-4o, Kimi K2.5

Layout symmetry preserved (3 + 3), and sonnet is now a tighter semantic grouping ("previous-generation slot from each provider") rather than a mix of previous-gen + cost-optimized variants. The K2.5 flagship→sonnet shift between v0.0.11 and future sweeps is a real chart discontinuity but it's information, not noise — Moonshot updated their flagship, and the chart should reflect that the moment the registry does.

PR body updated to match. Issue #68's "tier placement deferred to follow-up" framing no longer applies — closing it here resolves the whole question.

coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/plot_results.py`:
- Around line 96-106: The PR incorrectly changes benchmark tiers for Kimi
models; instead of promoting "Kimi K2.6" (ModelSpec("Kimi K2.6",
"moonshot-kimi-k2.6", ...)) to "flagship" and demoting "Kimi K2.5"
(ModelSpec("Kimi K2.5", "moonshot-kimi-k2.5", ...)) to "sonnet", preserve their
original tier assignments and only perform the registry rename/deprecation swap
on the identifier fields; update the model name/ID strings as needed but leave
the third argument (tier) for both ModelSpec entries unchanged to respect the
registry-only migration scope.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 79ea0898-b688-411b-b4ea-3ce1ea04b7dd

📥 Commits

Reviewing files that changed from the base of the PR and between b35e4da and 609171d.

📒 Files selected for processing (1)
  • scripts/plot_results.py

Comment thread scripts/plot_results.py
@aallan aallan merged commit 6915d87 into main May 12, 2026
10 checks passed
@aallan aallan deleted the deps/kimi-k2.6-migration branch May 12, 2026 09:57
aallan added a commit to sunholo-voight-kampff/vera-bench that referenced this pull request May 22, 2026
Standalone script that renders the v0.0.7 result panels as 16:9 slides
sized and styled for talk presentation. Three slide types:

- delta     — "Does Vera beat Python / TypeScript?" headline chart
- tiers     — Flagship and Sonnet tier comparisons side-by-side
- all-modes — all 6 models × 4 modes in a single grouped-bar panel

Specialised, not general. The v0.0.7 model lineup (K2.5 in flagship,
K2 Turbo in sonnet) is hard-coded because the live plot_results.MODELS
registry now reflects the post-K2.6 migration (PR aallan#69). Reuses palette
+ extract_data() from plot_results.py so slide numbers match the README
chart by construction.

Typography is bumped roughly 3× from the documentation-chart sizes so
the slide reads from the back of a room. Slide canvas is 16×9 in at
dpi=180 (2880×1620 px), with a tunable --background flag offering four
light-theme options (paper / white / cream / light-grey). Dark mode is
deliberately out of scope — would require cascading text-colour
inversion that the current talk's design doesn't need.

Output handling:
- PNGs default to /tmp/ because they're talk-prep ephemera that belong
  in the speaker's deck rather than the repo
- assets/vera-bench_slide_*.png is gitignored for the case where
  someone outputs to assets/ for preview — the canonical artefact is
  the script itself; regeneration is cheap
- Brief section added to scripts/README.md flagging the v0.0.7 pin so
  future-me doesn't try to use this for a different release

Verified:
- ruff check + ruff format --check clean
- All three slides render cleanly on default paper background
- Numbers cross-check against the v0.0.7 documentation chart (each of
  6 models × 4 modes = 24 cells)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
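The canvas arithmetic and the --background flag from the commit message above can be sketched as follows. This is not the slide script itself: `canvas_pixels` and `build_parser` are hypothetical names, and only the 16×9 in size, dpi=180, the four background names, and the paper default come from the commit message.

```python
import argparse

FIGSIZE_IN = (16, 9)  # slide canvas in inches, per the commit message
DPI = 180
BACKGROUNDS = ("paper", "white", "cream", "light-grey")


def canvas_pixels(figsize_in=FIGSIZE_IN, dpi=DPI):
    """Convert a figure size in inches to pixel dimensions at the given dpi."""
    width_in, height_in = figsize_in
    return (width_in * dpi, height_in * dpi)


def build_parser():
    """Build a minimal CLI exposing only the background choice (assumed shape)."""
    parser = argparse.ArgumentParser(description="Render talk slides (sketch)")
    parser.add_argument("--background", choices=BACKGROUNDS, default="paper")
    return parser


args = build_parser().parse_args(["--background", "cream"])
print(canvas_pixels(), args.background)
```

Restricting the flag with `choices` means an unsupported value such as a dark theme is rejected at argument-parsing time, which matches the commit's decision to keep dark mode out of scope.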

Development

Successfully merging this pull request may close these issues.

Migrate kimi-k2-turbo-preview → kimi-k2.6 (deprecation 2026-05-25)

1 participant