Skip to content

Moonshot provider support + full benchmark script (v0.0.7)#38

Merged
aallan merged 9 commits into
mainfrom
feature/run-all-script
Apr 7, 2026
Merged

Moonshot provider support + full benchmark script (v0.0.7)#38
aallan merged 9 commits into
mainfrom
feature/run-all-script

Conversation

@aallan

@aallan aallan commented Apr 7, 2026

Copy link
Copy Markdown
Owner

Summary

Adds Moonshot (Kimi) as a third LLM provider and a script to run the complete benchmark suite with one command.

Moonshot provider

  • MoonshotClient in models.py — OpenAI-compatible API via api.moonshot.cn/v1
  • create_client() routes moonshot/* prefixed models
  • Uses MOONSHOT_API_KEY environment variable

Full benchmark script

# Interactive — prompts for provider, model, API key
python scripts/run_full_benchmark.py

# Autonomous
python scripts/run_full_benchmark.py --model claude-sonnet-4-20250514 --api-key sk-ant-...

# Skip baselines (they don't change between runs)
python scripts/run_full_benchmark.py --model claude-sonnet-4-20250514 --skip-baselines

Runs all 6 targets: Vera full-spec, Vera spec-from-NL, Python LLM, TypeScript LLM, Python baselines, TypeScript baselines. Then generates the combined report.

Supports the workflow for issue 24 (multi-model runs).

Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Moonshot (Kimi) provider support via a moonshot/* model prefix
    • Full benchmark suite runner with interactive and CLI modes
  • Documentation

    • Added supported-providers reference and benchmark-runner usage
    • CHANGELOG and roadmap updated for 0.0.7
  • Tests

    • Added and extended tests covering Moonshot provider and completion flows
  • Chores

    • Project version bumped to 0.0.7 and citation/date updated

Interactive mode: prompts for provider, model, and API key.
Autonomous mode: --model and --api-key on command line.

Runs all 6 targets: Vera full-spec, Vera spec-from-NL, Python LLM,
TypeScript LLM, Python baselines, TypeScript baselines. Then generates
the combined report.

Supports Anthropic (Claude), OpenAI (GPT-4o), and Moonshot (Kimi).
--skip-baselines flag available for LLM-only runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Apr 7, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a full-benchmark orchestration CLI script, introduces Moonshot provider support with a new MoonshotClient and model routing, and extends tests to cover Moonshot credential/SDK error handling; updates docs, version, and metadata for v0.0.7.

Changes

Cohort / File(s) Summary
Benchmark orchestration script
scripts/run_full_benchmark.py
New executable to run all VeraBench targets (interactive or CLI-driven). Resolves provider from model, obtains API key (flag, env var, or secure prompt), builds ordered targets (Vera specs, Python/TypeScript LLM runs, optional baselines), runs each via subprocess.run with per-target timeouts, aggregates PASS/FAIL, tries vera-bench report, and returns non-zero on any failure.
LLM client & factory
vera_bench/models.py
Added MOONSHOT_BASE_URL and MoonshotClient that dynamically imports openai, requires MOONSHOT_API_KEY, issues OpenAI-compatible chat completion requests, maps responses to LLMResponse (text, tokens, model), handles OpenAI timeout exceptions, and extended create_client() to accept moonshot/* prefixes.
Tests
tests/test_models.py
Added Moonshot-focused tests: ensure missing MOONSHOT_API_KEY or SDK raises appropriate errors; added mocked completion test for Moonshot; refactored Anthropic/OpenAI tests to replace module patching with direct client instantiation and internal _client mocking.
Docs & metadata
README.md, CHANGELOG.md, ROADMAP.md, CITATION.cff, pyproject.toml
Bumped package to 0.0.7, documented the full-benchmark runner and supported providers (Anthropic, OpenAI, Moonshot) with required env vars, updated changelog/roadmap, and updated citation release date/version.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

harness, ci, docs

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the two main changes: adding Moonshot provider support and introducing a full benchmark script, with version identifier.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/run-all-script

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/run_full_benchmark.py`:
- Around line 41-43: MODELS and the _detect_provider() function advertise
"moonshot/kimi-k2" but the client factory only supports Anthropic/OpenAI
prefixes, so either remove the Moonshot entry or implement Moonshot provider
wiring; to fix, delete the "moonshot" entry from the MODELS mapping (and any
corresponding entries in the same block around lines 53-65) and ensure
_detect_provider() no longer returns a "moonshot" provider, or alternatively add
proper provider detection and client creation for the "moonshot" prefix in the
client factory so that _detect_provider(), MODELS, and the client creation logic
(client factory function) are consistent.
- Around line 117-121: The interactive prompt currently uses input() to read the
API key (variable key when env_key is missing), which echoes the secret; replace
input(f"\nEnter {env_key}: ").strip() with a secure prompt using
getpass.getpass() so the entered API key is not echoed or stored in shell
history, and add an import for getpass at the top of the script; keep the
existing empty-check and sys.exit(1) behavior unchanged, and ensure you still
call .strip() on the returned value.
- Around line 105-106: The public helpers and entry point lack precise type
hints: update _ensure_api_key to use Optional for api_key and a precise mapping
return type (change def _ensure_api_key(model: str, api_key: str | None) -> dict
to def _ensure_api_key(model: str, api_key: str | None) -> dict[str, str]), add
an explicit return annotation to main (e.g., def main(...) -> None or -> int
depending on the script's exit behavior), and replace any other unparameterised
dict occurrences with parameterised types like dict[str, str] (and use
Optional[str] or str | None for nullable strings) so all public helpers and the
entry point have full, concrete type hints.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ffb1d00b-ae01-4d1b-9a0b-510fef3bae41

📥 Commits

Reviewing files that changed from the base of the PR and between ea6abef and 6bdb1c8.

📒 Files selected for processing (1)
  • scripts/run_full_benchmark.py

Comment thread scripts/run_full_benchmark.py
Comment thread scripts/run_full_benchmark.py
Comment thread scripts/run_full_benchmark.py
@codecov

codecov Bot commented Apr 7, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 84.61538% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.94%. Comparing base (ea6abef) to head (8dfafa0).
⚠️ Report is 10 commits behind head on main.

Files with missing lines Patch % Lines
vera_bench/models.py 84.61% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #38      +/-   ##
==========================================
+ Coverage   82.93%   84.94%   +2.01%     
==========================================
  Files          10       10              
  Lines        1090     1116      +26     
==========================================
+ Hits          904      948      +44     
+ Misses        186      168      -18     
Flag Coverage Δ
python 84.94% <84.61%> (+2.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

aallan and others added 2 commits April 7, 2026 21:12
- MoonshotClient in models.py — OpenAI-compatible API with custom
  base_url (api.moonshot.cn/v1). Uses MOONSHOT_API_KEY env var.
- create_client() now routes moonshot/* prefixed models
- run_full_benchmark.py: getpass for secure API key input
- Tests for Moonshot client creation and missing key

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@aallan aallan changed the title Add run_full_benchmark.py for complete benchmark suite Moonshot provider support + full benchmark script (v0.0.7) Apr 7, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/run_full_benchmark.py`:
- Around line 129-135: The _run function currently calls subprocess.run(cmd,
env=env, check=False) with no timeout, so a stalled vera-bench process can hang
the full-suite runner; update _run to pass a timeout (e.g.,
timeout=DEFAULT_TIMEOUT or a passed-in parameter) to subprocess.run and wrap the
call in a try/except that catches subprocess.TimeoutExpired, logs the timeout
(including the cmd), terminates/cleans up if needed, and returns a non-zero
error code; reference the _run function, the cmd and env variables, and the
subprocess.TimeoutExpired exception when implementing this change.
- Around line 231-242: The report generation step currently calls
_run(["vera-bench", "report", "results/"], env) but ignores its return code,
allowing the script to exit 0 even if report generation failed; modify the block
around the _run call (the report generation section) to capture the return value
(e.g., rc = _run(...)), check if rc != 0, print a clear error message including
the rc, and return that non-zero code (or increment the failed count) before
proceeding to count test failures so the script exits non-zero when report
generation fails.
- Line 27: The script currently imports and uses argparse; replace that with
click-based CLI: remove the argparse import and any ArgumentParser usage and
instead create a click-decorated function (e.g., `@click.command`() def main(...))
with click.option()s that map to the former parser arguments (the variables
referenced as args in the script), update calls that read args.<name> to use the
function parameters, and replace the if __name__ == "__main__": block to call
main() so the CLI uses click. Target identifiers to change: the top-level import
argparse, the ArgumentParser/parse_args usage, the args.* references, and the
script entrypoint (main or the current entry function).

In `@vera_bench/models.py`:
- Around line 49-50: Update the ValueError message in vera_bench/models.py to
include the missing accepted prefixes so it matches create_client's supported
list: add "anthropic/*" and "openai/*" to the string that currently reads
"Expected claude-*, gpt-*, o1-*, o3-*, or moonshot/* prefix." Locate the error
message near the model validation (the ValueError raised for unknown model) and
adjust the text to enumerate all accepted prefixes exactly as implemented in
create_client.
- Around line 181-203: The complete method currently only catches
openai.APITimeoutError; add retry-on-rate-limit logic with exponential backoff
inside MoonshotClient.complete: wrap the
_client.with_options(...).chat.completions.create call in a retry loop (e.g.,
3-5 attempts), catch openai.error.RateLimitError and
openai.APIError/requests-like 429 responses, sleep with exponential backoff +
jitter between retries, and rethrow the last error if attempts exhausted; ensure
the timeout exception handling for openai.APITimeoutError remains. Also update
the earlier user-facing error message (the one referencing allowed provider
prefixes) to include the accepted prefixes "anthropic/" and "openai/" so it
reads correctly for users.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: dfc1eaa9-78b9-4bf0-a562-ef54656ab5bf

📥 Commits

Reviewing files that changed from the base of the PR and between 6bdb1c8 and 495949a.

📒 Files selected for processing (3)
  • scripts/run_full_benchmark.py
  • tests/test_models.py
  • vera_bench/models.py

Comment thread scripts/run_full_benchmark.py
Comment thread scripts/run_full_benchmark.py Outdated
Comment thread scripts/run_full_benchmark.py
Comment thread vera_bench/models.py Outdated
Comment thread vera_bench/models.py
- Add 30-minute timeout to subprocess calls in run_full_benchmark.py
- Check and warn on report generation failure exit code
- Update ValueError message to list all accepted model prefixes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CITATION.cff`:
- Around line 5-6: Update CITATION.cff so the version metadata is complete and
correct: change the date-released value to "2026-04-07" to match the v0.0.7
release and add a doi field pointing to the published benchmark results (e.g.,
doi: "10.xxxx/xxxxx") following standard CFF syntax; also verify the file
includes authors and title fields alongside version and date to comply with the
project citation policy.

In `@scripts/run_full_benchmark.py`:
- Around line 154-157: Remove the command-line flag that accepts secrets: stop
adding "--api-key" via parser.add_argument and instead read the API key from a
secure source (environment variable or prompt via getpass/STDIN) where
parser.add_argument("--api-key", ...) is currently defined; update any code that
reads args.api_key to fall back to os.environ["API_KEY"] (or call
getpass.getpass() for interactive prompts) and ensure no new CLI flags accept
secrets (also remove/replace the similar flag referenced at the other
occurrence).
- Around line 111-113: The current env_key check (if not env_key) only prints a
warning and returns os.environ, allowing the benchmark to continue; change this
to fail-fast by terminating execution when an unknown provider/model prefix is
encountered: in the block that checks env_key (the condition referencing env_key
and model), replace the silent return with an immediate exit (e.g., call
sys.exit(1) or raise SystemExit with a clear message mentioning model) so the
script stops instead of proceeding with a doomed full run.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: cb855484-280b-4d93-bc94-75a04761afd7

📥 Commits

Reviewing files that changed from the base of the PR and between 495949a and 1b63a26.

📒 Files selected for processing (7)
  • CHANGELOG.md
  • CITATION.cff
  • README.md
  • ROADMAP.md
  • pyproject.toml
  • scripts/run_full_benchmark.py
  • vera_bench/models.py

Comment thread CITATION.cff Outdated
Comment thread scripts/run_full_benchmark.py Outdated
Comment thread scripts/run_full_benchmark.py
…ovider

- Update date-released to 2026-04-07 matching v0.0.7 release
- Fail-fast with sys.exit(1) when model has unknown provider prefix

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
CITATION.cff (1)

5-6: ⚠️ Potential issue | 🟠 Major

Version/date bump is correct, but citation metadata is still incomplete without a DOI.

Good update on version and date-released, but this release metadata still does not satisfy citation completeness because doi is missing from CITATION.cff.

Suggested patch
 version: "0.0.7"
 date-released: "2026-04-07"
+doi: "10.xxxx/xxxxx"

As per coding guidelines: "Citation metadata file must include authors, title, version, date, and DOI linking to published benchmark results; follow standard CFF format".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CITATION.cff` around lines 5 - 6, CITATION.cff was updated with version and
date-released but is missing the required DOI and possibly other mandatory
fields; update the CITATION.cff to include a doi field with the published
benchmark DOI string and verify presence of authors and title entries (ensure
fields named authors, title, version, date-released, doi appear in the file) so
the citation conforms to standard CFF metadata requirements.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@CITATION.cff`:
- Around line 5-6: CITATION.cff was updated with version and date-released but
is missing the required DOI and possibly other mandatory fields; update the
CITATION.cff to include a doi field with the published benchmark DOI string and
verify presence of authors and title entries (ensure fields named authors,
title, version, date-released, doi appear in the file) so the citation conforms
to standard CFF metadata requirements.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: efb2cdc4-fe15-400f-9240-2809fb052db5

📥 Commits

Reviewing files that changed from the base of the PR and between 1b63a26 and a506061.

📒 Files selected for processing (2)
  • CITATION.cff
  • scripts/run_full_benchmark.py

Mock tests were silently skipping because they used
patch("vera_bench.models.anthropic") which fails since the SDKs
are lazy-imported inside methods. Fixed to construct real clients
then replace _client with MagicMock. Coverage: models.py 48% → 87%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@aallan

aallan commented Apr 7, 2026

Copy link
Copy Markdown
Owner Author
  • Around line 5-6: CITATION.cff was updated with version and date-released but
    is missing the required DOI and possibly other mandatory fields; update the
    CITATION.cff to include a doi field with the published benchmark DOI string and
    verify presence of authors and title entries (ensure fields named authors,
    title, version, date-released, doi appear in the file) so the citation conforms
    to standard CFF metadata requirements.

@coderabbitai We don't have a DOI because there's no published paper yet. Nothing to fix.

@coderabbitai

coderabbitai Bot commented Apr 7, 2026

Copy link
Copy Markdown

Tip

For best results, initiate chat on the files or code changes.

@aallan Understood — no DOI required until a paper is published. I'll update my notes so I don't flag this again in future reviews.


✏️ Learnings added
Learnt from: aallan
URL: https://github.com/aallan/vera-bench/pull/38

Timestamp: 2026-04-07T20:36:57.806Z
Learning: Applies to CITATION.cff in aallan/vera-bench: The project does not yet have a published paper, so no DOI is available. Do not flag a missing `doi` field in CITATION.cff as a deficiency. The file should still include authors, title, version, and date-released fields following standard CFF format.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


⛔ Learnings removed
Learnt from: CR
Repo: aallan/vera-bench PR: 0
File: BRIEFING.md:0-0
Timestamp: 2026-03-29T19:42:24.529Z
Learning: Applies to CITATION.cff : Citation metadata file must include authors, title, version, date, and DOI linking to published benchmark results; follow standard CFF format

🧠 Learnings used
Learnt from: CR
Repo: aallan/vera-bench PR: 0
File: BRIEFING.md:0-0
Timestamp: 2026-03-29T19:42:24.529Z
Learning: Applies to CITATION.cff : Citation metadata file must include authors, title, version, date, and DOI linking to published benchmark results; follow standard CFF format

Learnt from: CR
Repo: aallan/vera-bench PR: 0
File: BRIEFING.md:0-0
Timestamp: 2026-03-29T19:42:24.529Z
Learning: Applies to vera_bench/models.py : LLM API abstraction must support multiple providers (Anthropic, OpenAI), extract code blocks from responses, and handle rate limits with backoff

Learnt from: aallan
Repo: aallan/vera-bench PR: 38
File: vera_bench/models.py:182-204
Timestamp: 2026-04-07T20:21:17.067Z
Learning: In vera_bench/models.py, MoonshotClient (and OpenAIClient) use the openai-python SDK which has built-in automatic retry with exponential backoff for rate limit errors (429), timeouts, and 5xx responses. Do not flag the absence of manual retry/backoff logic in these clients as a deficiency.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_models.py`:
- Around line 144-148: The test should assert the returned model identifier from
OpenAIClient.complete to close the response-contract gap: update the assertion
block after calling client.complete("system", "user") to include an assertion on
result.model (e.g., assert result.model == <expected_model_identifier>), where
<expected_model_identifier> is the model string used by your test mock or
fixture; reference the client.complete call and the result.model property when
making this change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0c463745-9526-4058-badc-e8bc64cc9e81

📥 Commits

Reviewing files that changed from the base of the PR and between a506061 and 218f2af.

📒 Files selected for processing (1)
  • tests/test_models.py

Comment thread tests/test_models.py
aallan and others added 2 commits April 7, 2026 21:40
Test job was installing [dev] only, so anthropic/openai SDKs were
missing and all model mock tests silently skipped. Install [dev,llm]
to get full coverage of models.py (48% → 87%).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant