Context
During v0.0.7 benchmarking, Kimi K2.5 took ~48 minutes per target while Claude Opus 4 took ~4 minutes. We had no way to know this because the script doesn't log start/completion times.
What's needed
Log start time and elapsed time for each target in scripts/run_full_benchmark.py
Add per-target timing to the summary table at the end
Optionally: write a timing.json alongside the results for programmatic analysis
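A minimal sketch of what the three items above could look like, in case it helps scope the change. It does not reflect the actual structure of scripts/run_full_benchmark.py: `run_with_timing`, `format_summary`, the `run_target` callable, and the field names in timing.json are all hypothetical. Catching exceptions so that one failed target still gets a timing row is also an assumption, not existing behaviour.

```python
import json
import time
from datetime import datetime, timezone
from pathlib import Path


def run_with_timing(targets, run_target, out_dir):
    """Run each target, log its start time and elapsed time, write timing.json.

    `run_target` is a hypothetical callable standing in for whatever the
    benchmark script does per target.
    """
    timings = []
    for name in targets:
        started_at = datetime.now(timezone.utc).isoformat(timespec="seconds")
        print(f"[{started_at}] start  {name}")
        t0 = time.perf_counter()  # monotonic clock, safe for measuring durations
        status = "ok"
        try:
            run_target(name)
        except Exception as exc:  # assumption: record the failure and keep going
            status = f"error: {exc}"
        elapsed_s = round(time.perf_counter() - t0, 3)
        print(f"[{name}] finished in {elapsed_s:.1f}s ({status})")
        timings.append(
            {
                "target": name,
                "started_at": started_at,
                "elapsed_s": elapsed_s,
                "status": status,
            }
        )

    # Optional: machine-readable copy alongside the results.
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "timing.json").write_text(json.dumps(timings, indent=2))
    return timings


def format_summary(timings):
    """Render per-target timing as a plain-text table for the final summary."""
    width = max([len("target")] + [len(t["target"]) for t in timings])
    lines = [f"{'target':<{width}}  {'elapsed':>10}  status"]
    for t in timings:
        lines.append(
            f"{t['target']:<{width}}  {t['elapsed_s']:>9.1f}s  {t['status']}"
        )
    return "\n".join(lines)
```

Usage would be along the lines of `print(format_summary(run_with_timing(targets, run_target, "results/")))`, where `targets` and the output directory come from whatever the script already uses.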
Flagged during v0.0.7 runs but deferred.