Migrate kimi-k2-turbo-preview → kimi-k2.6 (closes #68) by aallan · Pull Request #69 · aallan/vera-bench

aallan · 2026-05-07T14:35:23Z

Summary

Closes #68.

Moonshot announced (email 2026-05-04) that the kimi-k2-* hyphenated named-feature variants — including kimi-k2-turbo-preview — will be discontinued 2026-05-25 and replaced by kimi-k2.6. The dot-versioned line (k2.5, k2.6) is the continuing one.

What changed

File	Change
`scripts/plot_results.py`	Registry: `Kimi K2.6` → flagship slot; `Kimi K2.5` demoted to sonnet (replacing the previous `Kimi K2 Turbo` row entirely)
`scripts/run_full_benchmark.py`	`moonshot/kimi-k2.6` replaces `moonshot/kimi-k2-turbo-preview`
`scripts/README.md`	Sweep recipe references `moonshot/kimi-k2.6`; timing table keeps the K2 Turbo row annotated as historical (deprecated 2026-05-25) and adds a `Moonshot K2.6` row marked TBD

Tier reassignment

K2.6 is structurally the new flagship-line model from Moonshot, so it goes in the flagship tier alongside Claude Opus 4 and GPT-4.1. K2.5 moves to sonnet, joining Claude Sonnet 4 and GPT-4o as "previous-generation / secondary slot from each provider" — a tighter semantic grouping than the previous mix (where the sonnet Moonshot slot held the cost-optimized Turbo variant rather than a same-generation sibling).

Layout symmetry preserved: 3 flagship + 3 sonnet. The K2.5 flagship→sonnet move is a real chart-layout discontinuity between v0.0.11 results and future sweeps, but the discontinuity is information — Moonshot updated their flagship, and the chart should reflect that the moment the registry does, not several PRs later.
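
The layout described above can be sketched as a small registry. This is an illustration only, not the actual contents of `scripts/plot_results.py`: the `ModelSpec` field names and the Anthropic/OpenAI file prefixes are assumptions, and only the two Moonshot prefixes and the tier names are taken from this PR.

```python
from collections import Counter
from typing import NamedTuple


class ModelSpec(NamedTuple):
    # Field names are assumed; the PR only shows positional
    # (label, file prefix, tier) arguments.
    label: str
    file_prefix: str
    tier: str


MODELS = [
    # Non-Moonshot prefixes below are hypothetical placeholders.
    ModelSpec("Claude Opus 4", "anthropic-claude-opus-4", "flagship"),
    ModelSpec("GPT-4.1", "openai-gpt-4.1", "flagship"),
    ModelSpec("Kimi K2.6", "moonshot-kimi-k2.6", "flagship"),
    ModelSpec("Claude Sonnet 4", "anthropic-claude-sonnet-4", "sonnet"),
    ModelSpec("GPT-4o", "openai-gpt-4o", "sonnet"),
    ModelSpec("Kimi K2.5", "moonshot-kimi-k2.5", "sonnet"),
]


def tier_counts(models):
    """Count how many registry entries sit in each chart tier."""
    return Counter(spec.tier for spec in models)


# The 3 + 3 symmetry claim becomes a cheap sanity check.
print(dict(tier_counts(MODELS)))
```

A check like this could guard against an accidental 4 + 2 split the next time a provider retires a SKU.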

Out of scope (deliberately untouched)

9 historical results/moonshot-kimi-k2-turbo-preview-*.jsonl files — frozen audit records of what was measured at the time
README.md / ROADMAP.md v0.0.7 narrative — historical text describing past benchmarks; rewriting falsifies history
tests/test_models.py uses moonshot/kimi-k2 (generic, not the deprecated SKU) for routing tests; no change needed

No version bump

Model-registry update + chart-tier reassignment, not a methodology change. Same scoring contract, no result files affected. Mirrors the precedent of registry/config changes (PR #58 Cloudflare User-Agent, PR #67 baseline bench_version).

Test plan

ruff check + ruff format --check clean
pytest tests/ → 495 passed (unchanged from main)
scripts/plot_results.py --version 0.0.7 runs against new registry; warns gracefully on missing K2.6 result files; K2.5 in sonnet correctly resolves existing v0.0.7 file (file lookup is by file_prefix, not tier)
Chart generation produces a sensible layout with K2.6 in flagship + K2.5 in sonnet
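
A rough sketch of the "lookup by file_prefix, warn on missing files" behaviour exercised in the test plan. The function name, the warning text, and the example file name are hypothetical; the real logic in `scripts/plot_results.py` may differ.

```python
import tempfile
import warnings
from pathlib import Path


def find_result_files(results_dir, file_prefix):
    """Return result files for one model, warning (not failing) if none exist."""
    # The "-" after the prefix keeps a shorter prefix from matching a
    # longer sibling, e.g. "moonshot-kimi-k2" vs "moonshot-kimi-k2.5-...".
    matches = sorted(Path(results_dir).glob(f"{file_prefix}-*.jsonl"))
    if not matches:
        warnings.warn(f"no result files for prefix {file_prefix!r}; skipping")
    return matches


# Demo: K2.5 has a (made-up) result file, K2.6 has none yet.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "moonshot-kimi-k2.5-example.jsonl").touch()
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        k25_files = find_result_files(tmp, "moonshot-kimi-k2.5")
        k26_files = find_result_files(tmp, "moonshot-kimi-k2.6")
    found_names = [p.name for p in k25_files]
```

Because the lookup keys on the prefix alone, moving K2.5 from flagship to sonnet does not change which files it resolves.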

🤖 Generated with Claude Code

Summary by CodeRabbit

Chores
- Replaced Kimi K2 Turbo with Kimi K2.6 across benchmark tooling and model lists.
Documentation
- Updated examples and timing notes to reference Moonshot K2.6, with timings marked TBD until full sweep completes.

Moonshot announced (email 2026-05-04) that the kimi-k2-* hyphenated named-feature variants, including kimi-k2-turbo-preview, will be discontinued on 2026-05-25 and replaced by kimi-k2.6. The dot-versioned line (k2.5, k2.6) is the continuing one; our other Moonshot model (kimi-k2.5) is unaffected.

Three active code references updated:
- scripts/plot_results.py MODELS registry entry
- scripts/run_full_benchmark.py MODELS dict entry
- scripts/README.md documented sweep recipe

Tier placement is left as the existing sonnet slot for this PR. K2.6 is structurally the new flagship-line model rather than a "secondary/cheaper" variant, but moving it to the flagship tier forces a chart-layout discontinuity (K2.5 sliding from flagship → sonnet between v0.0.11 results and the next sweep). Better timed with fresh K2.6 result files than with a registry-only change. Issue #68 tracks the deferred decision.

Out of scope (unchanged):
- 9 historical results/moonshot-kimi-k2-turbo-preview-*.jsonl files: frozen audit records of what was measured at the time.
- README.md / ROADMAP.md v0.0.7 narrative: historical text describing past benchmarks. Rewriting falsifies history.
- tests/test_models.py uses moonshot/kimi-k2 (generic, not the deprecated SKU) for routing tests. Nothing to update.

No bench-version bump: this is a model-registry update, not a methodology change. Same scoring contract, no result files affected.

Verified:
- ruff check + ruff format --check clean.
- pytest tests/ → 495 passed (unchanged from main).
- scripts/plot_results.py runs against new registry; warns gracefully on missing K2.6 result files and produces a chart with zero K2.6 bars.

Closes #68.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai · 2026-05-07T14:35:35Z

📝 Walkthrough

Walkthrough

Replace deprecated Moonshot SKU kimi-k2-turbo-preview with kimi-k2.6 in the plotting registry, the benchmark runner model list, and the README sweep example and timing table.

Changes

Moonshot Model Deprecation Migration

Layer / File(s)	Summary
Model registry update `scripts/plot_results.py`	Remove `ModelSpec("Kimi K2 Turbo", "moonshot-kimi-k2-turbo-preview", "sonnet")`; add `Kimi K2.6` and move `Kimi K2.5` into `sonnet` with inline comments documenting tier changes.
Benchmark runner update `scripts/run_full_benchmark.py`	Replace `("Kimi K2 Turbo", "moonshot/kimi-k2-turbo-preview")` with `("Kimi K2.6", "moonshot/kimi-k2.6")` in `MODELS["moonshot"]`.
Documentation & examples `scripts/README.md`	Update sweep example to reference `moonshot/kimi-k2.6`; add `Moonshot K2.6` timing row marked `TBD` and update narrative to state timings will be filled after the first full sweep.
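
The runner-side swap summarised in the table might look roughly like the sketch below. The `Kimi K2.5` entry, the `DEPRECATED_IDS` set, and both helper functions are assumptions added for illustration. The `/` to `-` mapping is inferred from the identifiers quoted in this PR, not confirmed against the code.

```python
# Sketch of the Moonshot entry in the benchmark runner's model list.
MODELS = {
    "moonshot": [
        ("Kimi K2.5", "moonshot/kimi-k2.5"),  # assumed pre-existing entry
        ("Kimi K2.6", "moonshot/kimi-k2.6"),  # replaces kimi-k2-turbo-preview
    ],
}

# Hypothetical guard list of SKUs that should no longer be swept.
DEPRECATED_IDS = {"moonshot/kimi-k2-turbo-preview"}


def result_prefix(model_id: str) -> str:
    """Map a provider/model id to the prefix used in result file names."""
    return model_id.replace("/", "-")


def deprecated_entries(models):
    """List any model ids in the runner config that are marked deprecated."""
    return [
        model_id
        for entries in models.values()
        for _label, model_id in entries
        if model_id in DEPRECATED_IDS
    ]
```

For example, `result_prefix("moonshot/kimi-k2.6")` yields `moonshot-kimi-k2.6`, the prefix the plotting registry uses for the new model.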

Estimated Code Review Effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly Related PRs

aallan/vera-bench#59: Modifies the same model registry and related model identifiers across benchmark scripts.
aallan/vera-bench#41: Edits Moonshot model identifiers in scripts and docs.
aallan/vera-bench#38: Introduced Moonshot provider support and initial model entries updated here.

Suggested Labels

ci

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Out of Scope Changes check	❓ Inconclusive	Changes are scoped to the three identified files with documented tier-placement deferral. However, commit message mentions tier promotion decisions inconsistent with PR objectives stating deferral.	Clarify whether the commit message tier decisions (K2.6→flagship, K2.5→sonnet) override the PR's stated deferral until K2.6 sweep data is available.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: migrating from the deprecated kimi-k2-turbo-preview model to kimi-k2.6 across three configuration files.
Linked Issues check	✅ Passed	All three acceptance criteria are met: deprecated SKU removed, K2.6 added in sonnet tier, existing tests pass. Tier-placement deferral is documented in code comments.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

codecov · 2026-05-07T14:42:56Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.65%. Comparing base (e697ea2) to head (609171d).

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #69   +/-   ##
=======================================
  Coverage   83.65%   83.65%           
=======================================
  Files          10       10           
  Lines        1395     1395           
=======================================
  Hits         1167     1167           
  Misses        228      228

Flag	Coverage Δ
python	`83.65% <ø> (ø)`
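
As a quick cross-check of the figures in the diff above, the percentage can be recomputed from the hit and line counts. That the displayed value is truncated rather than rounded is an inference from these numbers, not something the report states.

```python
hits, lines = 1167, 1395

# Misses are simply the uncovered remainder.
misses = lines - hits

# Integer arithmetic in hundredths of a percent avoids float surprises.
truncated_hundredths = hits * 10000 // lines
coverage_truncated = truncated_hundredths / 100

# Conventional rounding, for comparison with the truncated figure.
coverage_rounded = round(hits / lines * 100, 2)

print(misses, coverage_truncated, coverage_rounded)
```

The truncated value agrees with the 83.65% shown for both base and head, which is consistent with a change that touches no coverable lines.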

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

scripts/README.md (1)
106-115: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stale model name in the timing table.

The "Sweeping the full matrix" section now loops over moonshot/kimi-k2.6, but the timing reference table immediately below still lists the deprecated predecessor:
| Moonshot K2 Turbo | ~1.5 h |
This is the scripts/ operational README (not the top-level README.md/ROADMAP.md preserved as historical record), so it should reflect the active sweep. As no K2.6 timing data exists yet, the row can be updated with a TBD note until a real sweep is completed.
📝 Proposed fix
-| Moonshot K2 Turbo | ~1.5 h |
+| Moonshot K2.6     | TBD (first sweep pending) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/README.md` around lines 106 - 115, Update the timing table row that
currently reads "Moonshot K2 Turbo | ~1.5 h" to reflect the active sweep which
loops over "moonshot/kimi-k2.6": replace the K2 Turbo entry with a placeholder
like "Moonshot K2.6 | TBD" (or similar TBD text) so the scripts/README.md
"Sweeping the full matrix" section matches the current sweep; ensure you modify
the table row containing the literal "Moonshot K2 Turbo" so the docs no longer
reference the deprecated predecessor.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5d8cfb8a-7243-42a8-a1ff-2f3ab2f669fb

📥 Commits

Reviewing files that changed from the base of the PR and between e697ea2 and 3c97621.

📒 Files selected for processing (3)

scripts/README.md
scripts/plot_results.py
scripts/run_full_benchmark.py

The 'Timing expectations' table in scripts/README.md still showed 'Moonshot K2 Turbo | ~1.5 h' after the migration, which mismatched the sweep recipe that loops over moonshot/kimi-k2.6. Both rows now appear: K2.6 with TBD pending the first sweep, and K2 Turbo kept with an explicit 'historical; SKU deprecated 2026-05-25' annotation so the prior real-world data point isn't lost. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

aallan · 2026-05-12T09:16:17Z

@coderabbitai Valid finding, and verified against the current branch — line 112 of scripts/README.md did still read | Moonshot K2 Turbo | ~1.5 h |, which mismatched the K2.6 sweep recipe two sections above.

Addressed in b35e4da with a small variation on your suggestion: rather than replacing the K2 Turbo row outright, I kept it with an explicit "historical; SKU deprecated 2026-05-25" annotation and added a Moonshot K2.6 | TBD row pointing at issue #68. Reasoning: the table's preamble explicitly frames it as "Rough per-model totals observed on v0.0.9" — historical observations, not current projections — so the K2 Turbo timing is the only real-world Moonshot Turbo data point we have, useful for budgeting future K2.6 sweeps. Dropping it loses information; relabeling it preserves the data while removing the table-vs-sweep-recipe confusion you correctly flagged.

The trailing paragraph also updated to note that K2.6 timings will be filled in after the first sweep.

coderabbitai · 2026-05-12T09:16:36Z

@aallan That's a better resolution than a straight replacement — annotating the K2 Turbo row as historical with the deprecation date preserves the only real-world Moonshot Turbo sweep timing we have (useful as a rough lower-bound budget for K2.6 runs), while the new | Moonshot K2.6 | TBD | row pointing at #68 makes the forward path explicit. The table now correctly reflects both the historical record and the current registry state. Thanks for the clear reasoning.

The previous comment defended placing K2.6 in the sonnet slot 'for now' on the grounds that K2.6 is the new flagship — which is self-contradictory if you read it twice. Moonshot promoted K2.6 to its flagship line; the chart tier should reflect that. K2.5 moves to sonnet, joining Claude Sonnet 4 and GPT-4o as 'previous-generation / secondary slot from each provider' — a tighter semantic grouping than the prior mix anyway. No code or test surface affected; chart generation is unchanged in mechanics. Registry-only change. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

aallan · 2026-05-12T09:32:06Z

Followup: promoted K2.6 to flagship and demoted K2.5 to sonnet in 609171d. The earlier "K2.6 in sonnet, defer the tier call" framing was self-contradictory once I re-read it — defending a placement on the grounds that it's structurally wrong is a flag that the placement is wrong. New layout:

Flagship: Claude Opus 4, GPT-4.1, Kimi K2.6
Sonnet: Claude Sonnet 4, GPT-4o, Kimi K2.5

Layout symmetry preserved (3 + 3), and sonnet is now a tighter semantic grouping ("previous-generation slot from each provider") rather than a mix of previous-gen + cost-optimized variants. The K2.5 flagship→sonnet shift between v0.0.11 and future sweeps is a real chart discontinuity but it's information, not noise — Moonshot updated their flagship, and the chart should reflect that the moment the registry does.

PR body updated to match. Issue #68's "tier placement deferred to follow-up" framing no longer applies — closing it here resolves the whole question.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/plot_results.py`:
- Around line 96-106: The PR incorrectly changes benchmark tiers for Kimi
models; instead of promoting "Kimi K2.6" (ModelSpec("Kimi K2.6",
"moonshot-kimi-k2.6", ...)) to "flagship" and demoting "Kimi K2.5"
(ModelSpec("Kimi K2.5", "moonshot-kimi-k2.5", ...)) to "sonnet", preserve their
original tier assignments and only perform the registry rename/deprecation swap
on the identifier fields; update the model name/ID strings as needed but leave
the third argument (tier) for both ModelSpec entries unchanged to respect the
registry-only migration scope.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 79ea0898-b688-411b-b4ea-3ce1ea04b7dd

📥 Commits

Reviewing files that changed from the base of the PR and between b35e4da and 609171d.

📒 Files selected for processing (1)

scripts/plot_results.py

Standalone script that renders the v0.0.7 result panels as 16:9 slides sized and styled for talk presentation. Three slide types:
- delta: "Does Vera beat Python / TypeScript?" headline chart
- tiers: Flagship and Sonnet tier comparisons side-by-side
- all-modes: all 6 models × 4 modes in a single grouped-bar panel

Specialised, not general. The v0.0.7 model lineup (K2.5 in flagship, K2 Turbo in sonnet) is hard-coded because the live plot_results.MODELS registry now reflects the post-K2.6 migration (PR aallan#69). Reuses palette + extract_data() from plot_results.py so slide numbers match the README chart by construction.

Typography is bumped roughly 3× from the documentation-chart sizes so the slide reads from the back of a room. Slide canvas is 16×9 in at dpi=180 (2880×1620 px), with a tunable --background flag offering four light-theme options (paper / white / cream / light-grey). Dark mode is deliberately out of scope: it would require cascading text-colour inversion that the current talk's design doesn't need.

Output handling:
- PNGs default to /tmp/ because they're talk-prep ephemera that belong in the speaker's deck rather than the repo
- assets/vera-bench_slide_*.png is gitignored for the case where someone outputs to assets/ for preview; the canonical artefact is the script itself, and regeneration is cheap
- Brief section added to scripts/README.md flagging the v0.0.7 pin so future-me doesn't try to use this for a different release

Verified:
- ruff check + ruff format --check clean
- All three slides render cleanly on default paper background
- Numbers cross-check against the v0.0.7 documentation chart (each of 6 models × 4 modes = 24 cells)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
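
The canvas arithmetic and the `--background` flag described in that commit message can be sketched with the standard library alone. The hex colours, constant names, and function names here are placeholders; only the four option names, the `paper` default, and the 16×9 in / dpi=180 figures come from the message.

```python
import argparse

SLIDE_INCHES = (16, 9)
DPI = 180

# Option names are from the commit message; the hex values are made up.
BACKGROUNDS = {
    "paper": "#f7f4ee",
    "white": "#ffffff",
    "cream": "#fdf6e3",
    "light-grey": "#eeeeee",
}


def canvas_pixels(size_in=SLIDE_INCHES, dpi=DPI):
    """Convert a figure size in inches to output pixels at the given dpi."""
    width_in, height_in = size_in
    return width_in * dpi, height_in * dpi


def build_parser():
    """Build a CLI parser exposing the light-theme background choice."""
    parser = argparse.ArgumentParser(description="Render talk slides (sketch)")
    parser.add_argument(
        "--background",
        choices=sorted(BACKGROUNDS),
        default="paper",
        help="light-theme slide background",
    )
    return parser


args = build_parser().parse_args([])
print(canvas_pixels(), args.background, BACKGROUNDS[args.background])
```

Restricting the flag with `choices` means an unsupported value such as a dark theme is rejected at parse time rather than producing a half-styled slide.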

coderabbitai Bot reviewed May 7, 2026

coderabbitai Bot reviewed May 12, 2026

aallan merged commit 6915d87 into main May 12, 2026
10 checks passed

aallan deleted the deps/kimi-k2.6-migration branch May 12, 2026 09:57

aallan mentioned this pull request May 22, 2026

Add scripts/plot_slide.py — v0.0.7 talk-slide renderer at 16:9 #71

Merged

5 tasks

aallan merged 3 commits into main from deps/kimi-k2.6-migration
