Moonshot provider support + full benchmark script (v0.0.7)#38
Conversation
Interactive mode: prompts for provider, model, and API key. Autonomous mode: --model and --api-key on command line. Runs all 6 targets: Vera full-spec, Vera spec-from-NL, Python LLM, TypeScript LLM, Python baselines, TypeScript baselines. Then generates the combined report. Supports Anthropic (Claude), OpenAI (GPT-4o), and Moonshot (Kimi). --skip-baselines flag available for LLM-only runs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a full-benchmark orchestration CLI script, introduces Moonshot provider support with a new MoonshotClient and model routing, and extends tests to cover Moonshot credential/SDK error handling; updates docs, version, and metadata for v0.0.7. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/run_full_benchmark.py`:
- Around line 41-43: MODELS and the _detect_provider() function advertise
"moonshot/kimi-k2" but the client factory only supports Anthropic/OpenAI
prefixes, so either remove the Moonshot entry or implement Moonshot provider
wiring; to fix, delete the "moonshot" entry from the MODELS mapping (and any
corresponding entries in the same block around lines 53-65) and ensure
_detect_provider() no longer returns a "moonshot" provider, or alternatively add
proper provider detection and client creation for the "moonshot" prefix in the
client factory so that _detect_provider(), MODELS, and the client creation logic
(client factory function) are consistent.
- Around line 117-121: The interactive prompt currently uses input() to read the
API key (variable key when env_key is missing), which echoes the secret; replace
input(f"\nEnter {env_key}: ").strip() with a secure prompt using
getpass.getpass() so the entered API key is not echoed or stored in shell
history, and add an import for getpass at the top of the script; keep the
existing empty-check and sys.exit(1) behavior unchanged, and ensure you still
call .strip() on the returned value.
- Around line 105-106: The public helpers and entry point lack precise type
hints: update _ensure_api_key to use Optional for api_key and a precise mapping
return type (change def _ensure_api_key(model: str, api_key: str | None) -> dict
to def _ensure_api_key(model: str, api_key: str | None) -> dict[str, str]), add
an explicit return annotation to main (e.g., def main(...) -> None or -> int
depending on the script's exit behavior), and replace any other unparameterised
dict occurrences with parameterised types like dict[str, str] (and use
Optional[str] or str | None for nullable strings) so all public helpers and the
entry point have full, concrete type hints.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: ffb1d00b-ae01-4d1b-9a0b-510fef3bae41
📒 Files selected for processing (1)
scripts/run_full_benchmark.py
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #38 +/- ##
==========================================
+ Coverage 82.93% 84.94% +2.01%
==========================================
Files 10 10
Lines 1090 1116 +26
==========================================
+ Hits 904 948 +44
+ Misses 186 168 -18
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- MoonshotClient in models.py — OpenAI-compatible API with custom base_url (api.moonshot.cn/v1). Uses MOONSHOT_API_KEY env var. - create_client() now routes moonshot/* prefixed models - run_full_benchmark.py: getpass for secure API key input - Tests for Moonshot client creation and missing key Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/run_full_benchmark.py`:
- Around line 129-135: The _run function currently calls subprocess.run(cmd,
env=env, check=False) with no timeout, so a stalled vera-bench process can hang
the full-suite runner; update _run to pass a timeout (e.g.,
timeout=DEFAULT_TIMEOUT or a passed-in parameter) to subprocess.run and wrap the
call in a try/except that catches subprocess.TimeoutExpired, logs the timeout
(including the cmd), terminates/cleans up if needed, and returns a non-zero
error code; reference the _run function, the cmd and env variables, and the
subprocess.TimeoutExpired exception when implementing this change.
- Around line 231-242: The report generation step currently calls
_run(["vera-bench", "report", "results/"], env) but ignores its return code,
allowing the script to exit 0 even if report generation failed; modify the block
around the _run call (the report generation section) to capture the return value
(e.g., rc = _run(...)), check if rc != 0, print a clear error message including
the rc, and return that non-zero code (or increment the failed count) before
proceeding to count test failures so the script exits non-zero when report
generation fails.
- Line 27: The script currently imports and uses argparse; replace that with
click-based CLI: remove the argparse import and any ArgumentParser usage and
instead create a click-decorated function (e.g., `@click.command`() def main(...))
with click.option()s that map to the former parser arguments (the variables
referenced as args in the script), update calls that read args.<name> to use the
function parameters, and replace the if __name__ == "__main__": block to call
main() so the CLI uses click. Target identifiers to change: the top-level import
argparse, the ArgumentParser/parse_args usage, the args.* references, and the
script entrypoint (main or the current entry function).
In `@vera_bench/models.py`:
- Around line 49-50: Update the ValueError message in vera_bench/models.py to
include the missing accepted prefixes so it matches create_client's supported
list: add "anthropic/*" and "openai/*" to the string that currently reads
"Expected claude-*, gpt-*, o1-*, o3-*, or moonshot/* prefix." Locate the error
message near the model validation (the ValueError raised for unknown model) and
adjust the text to enumerate all accepted prefixes exactly as implemented in
create_client.
- Around line 181-203: The complete method currently only catches
openai.APITimeoutError; add retry-on-rate-limit logic with exponential backoff
inside MoonshotClient.complete: wrap the
_client.with_options(...).chat.completions.create call in a retry loop (e.g.,
3-5 attempts), catch openai.error.RateLimitError and
openai.APIError/requests-like 429 responses, sleep with exponential backoff +
jitter between retries, and rethrow the last error if attempts exhausted; ensure
the timeout exception handling for openai.APITimeoutError remains. Also update
the earlier user-facing error message (the one referencing allowed provider
prefixes) to include the accepted prefixes "anthropic/" and "openai/" so it
reads correctly for users.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: dfc1eaa9-78b9-4bf0-a562-ef54656ab5bf
📒 Files selected for processing (3)
scripts/run_full_benchmark.pytests/test_models.pyvera_bench/models.py
- Add 30-minute timeout to subprocess calls in run_full_benchmark.py - Check and warn on report generation failure exit code - Update ValueError message to list all accepted model prefixes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@CITATION.cff`:
- Around line 5-6: Update CITATION.cff so the version metadata is complete and
correct: change the date-released value to "2026-04-07" to match the v0.0.7
release and add a doi field pointing to the published benchmark results (e.g.,
doi: "10.xxxx/xxxxx") following standard CFF syntax; also verify the file
includes authors and title fields alongside version and date to comply with the
project citation policy.
In `@scripts/run_full_benchmark.py`:
- Around line 154-157: Remove the command-line flag that accepts secrets: stop
adding "--api-key" via parser.add_argument and instead read the API key from a
secure source (environment variable or prompt via getpass/STDIN) where
parser.add_argument("--api-key", ...) is currently defined; update any code that
reads args.api_key to fall back to os.environ["API_KEY"] (or call
getpass.getpass() for interactive prompts) and ensure no new CLI flags accept
secrets (also remove/replace the similar flag referenced at the other
occurrence).
- Around line 111-113: The current env_key check (if not env_key) only prints a
warning and returns os.environ, allowing the benchmark to continue; change this
to fail-fast by terminating execution when an unknown provider/model prefix is
encountered: in the block that checks env_key (the condition referencing env_key
and model), replace the silent return with an immediate exit (e.g., call
sys.exit(1) or raise SystemExit with a clear message mentioning model) so the
script stops instead of proceeding with a doomed full run.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: cb855484-280b-4d93-bc94-75a04761afd7
📒 Files selected for processing (7)
CHANGELOG.mdCITATION.cffREADME.mdROADMAP.mdpyproject.tomlscripts/run_full_benchmark.pyvera_bench/models.py
…ovider - Update date-released to 2026-04-07 matching v0.0.7 release - Fail-fast with sys.exit(1) when model has unknown provider prefix Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
♻️ Duplicate comments (1)
CITATION.cff (1)
5-6:⚠️ Potential issue | 🟠 MajorVersion/date bump is correct, but citation metadata is still incomplete without a DOI.
Good update on
versionanddate-released, but this release metadata still does not satisfy citation completeness becausedoiis missing fromCITATION.cff.Suggested patch
version: "0.0.7" date-released: "2026-04-07" +doi: "10.xxxx/xxxxx"As per coding guidelines: "Citation metadata file must include authors, title, version, date, and DOI linking to published benchmark results; follow standard CFF format".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@CITATION.cff` around lines 5 - 6, CITATION.cff was updated with version and date-released but is missing the required DOI and possibly other mandatory fields; update the CITATION.cff to include a doi field with the published benchmark DOI string and verify presence of authors and title entries (ensure fields named authors, title, version, date-released, doi appear in the file) so the citation conforms to standard CFF metadata requirements.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@CITATION.cff`:
- Around line 5-6: CITATION.cff was updated with version and date-released but
is missing the required DOI and possibly other mandatory fields; update the
CITATION.cff to include a doi field with the published benchmark DOI string and
verify presence of authors and title entries (ensure fields named authors,
title, version, date-released, doi appear in the file) so the citation conforms
to standard CFF metadata requirements.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: efb2cdc4-fe15-400f-9240-2809fb052db5
📒 Files selected for processing (2)
CITATION.cffscripts/run_full_benchmark.py
Mock tests were silently skipping because they used
patch("vera_bench.models.anthropic") which fails since the SDKs
are lazy-imported inside methods. Fixed to construct real clients
then replace _client with MagicMock. Coverage: models.py 48% → 87%.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai We don't have a DOI because there's no published paper yet. Nothing to fix. |
|
Tip For best results, initiate chat on the files or code changes.
✏️ Learnings added
⛔ Learnings removed🧠 Learnings used |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_models.py`:
- Around line 144-148: The test should assert the returned model identifier from
OpenAIClient.complete to close the response-contract gap: update the assertion
block after calling client.complete("system", "user") to include an assertion on
result.model (e.g., assert result.model == <expected_model_identifier>), where
<expected_model_identifier> is the model string used by your test mock or
fixture; reference the client.complete call and the result.model property when
making this change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0c463745-9526-4058-badc-e8bc64cc9e81
📒 Files selected for processing (1)
tests/test_models.py
Test job was installing [dev] only, so anthropic/openai SDKs were missing and all model mock tests silently skipped. Install [dev,llm] to get full coverage of models.py (48% → 87%). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Adds Moonshot (Kimi) as a third LLM provider and a script to run the complete benchmark suite with one command.
Moonshot provider
MoonshotClientin models.py — OpenAI-compatible API viaapi.moonshot.cn/v1create_client()routesmoonshot/*prefixed modelsMOONSHOT_API_KEYenvironment variableFull benchmark script
Runs all 6 targets: Vera full-spec, Vera spec-from-NL, Python LLM, TypeScript LLM, Python baselines, TypeScript baselines. Then generates the combined report.
Supports the workflow for issue 24 (multi-model runs).
Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Tests
Chores