Skip to content

Add per-target timing to benchmark runner#55

Merged
aallan merged 3 commits into
mainfrom
feature/benchmark-timing
Apr 13, 2026
Merged

Add per-target timing to benchmark runner#55
aallan merged 3 commits into
mainfrom
feature/benchmark-timing

Conversation

@aallan

@aallan aallan commented Apr 13, 2026

Copy link
Copy Markdown
Owner

Summary

  • Log start time and elapsed duration for each benchmark target
  • Show per-target timing in the summary table (aligned columns with human-readable durations)
  • Write results/timing.json with model, timestamps, and per-target seconds for programmatic analysis
  • Total wall-clock time shown at the bottom of the summary

Closes #51

Example summary output

  Vera full-spec      PASS                  4m 12s
  Vera spec-from-NL   PASS                  3m 58s
  Python LLM          PASS                  1m 22s
  Kimi K2.5           PASS                 48m 03s

  Total                                  57m 35s

Test plan

  • ruff check . passes
  • CI green

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Benchmark runs now show per-target and total elapsed time in a human‑readable format.
    • Run timing is persistently exported to a results file with run metadata, timestamps and per‑target durations.
  • Improvements

    • Reporting now includes aligned timing for each benchmark target and overall run.
    • Failure counting and report generation updated to work with the new structured results and explicit status fields.

Log elapsed time for each target, show durations in the summary table,
write timing.json alongside results for programmatic analysis.

Closes #51

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Apr 13, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: fc6f8d17-1468-4148-aea0-423fd30b5a67

📥 Commits

Reviewing files that changed from the base of the PR and between 6e1c59c and 1ff2869.

📒 Files selected for processing (1)
  • scripts/run_full_benchmark.py

📝 Walkthrough

Walkthrough

Added per-target and total timing to scripts/run_full_benchmark.py. _run() now returns a (return_code, elapsed_seconds) tuple; results store status and elapsed fields, are persisted to results/timing.json, and summaries print formatted durations and UTC start/finish timestamps.

Changes

Cohort / File(s) Summary
Benchmark runner
scripts/run_full_benchmark.py
Added _format_duration(seconds: float); changed _run(...) signature to return (int, float) and measure elapsed time with time.monotonic(); print UTC start/finish timestamps and per-target formatted durations; on timeout print formatted duration and return (1, elapsed); results now store objects with status and elapsed; persist run timing to results/timing.json (creating results/ if missing); adjust report invocation unpacking and failure counting to use info["status"]; updated main() signature to -> int.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

ci

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main objective of the PR, which is to add timing measurements to the benchmark runner for each target.
Linked Issues check ✅ Passed The PR comprehensively addresses all three requirements from issue #51: logging start/elapsed times, adding per-target timing to summary, and writing timing.json.
Out of Scope Changes check ✅ Passed All changes are directly related to adding timing functionality as specified in issue #51; no unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/benchmark-timing

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov Bot commented Apr 13, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.06%. Comparing base (02e229f) to head (1ff2869).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #55   +/-   ##
=======================================
  Coverage   83.06%   83.06%           
=======================================
  Files          10       10           
  Lines        1346     1346           
=======================================
  Hits         1118     1118           
  Misses        228      228           
Flag Coverage Δ
python 83.06% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/run_full_benchmark.py`:
- Around line 272-308: The timing.json currently sets "started_at" after all
targets run; use the earlier run_start to record the actual run start time:
convert run_start (captured before the loop) into an ISO 8601 UTC timestamp and
assign that to timing_data["started_at"] instead of datetime.now(...), keeping
total_seconds computed from total_elapsed; update the timing_data construction
(symbol: timing_data) to reference the start time derived from run_start and
ensure timezone is UTC and format matches .isoformat().
- Around line 156-168: Add explicit wall-clock start and finish timestamps
around the subprocess run so each target logs when it began and when it
completed/timeout; before setting t0 capture start_ts = datetime.now() and print
a "Started at {start_ts.isoformat()}" message (include cmd), and on both the
TimeoutExpired branch and the normal completion branch capture finish_ts =
datetime.now() and print "Finished at {finish_ts.isoformat()}" (alongside the
existing duration and cmd). Update the prints around variables cmd, t0, elapsed
and reuse _format_duration for duration output so both start and finish
timestamps (and durations) are logged for the function that runs the subprocess.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 71f9a0d8-1daf-4074-a3ea-a76d1d52f4f8

📥 Commits

Reviewing files that changed from the base of the PR and between 02e229f and 1f695ac.

📒 Files selected for processing (1)
  • scripts/run_full_benchmark.py

Comment thread scripts/run_full_benchmark.py
Comment thread scripts/run_full_benchmark.py
Record the actual run start time (not post-run time) in timing.json.
Log wall-clock start/finish timestamps for each target so long runs
(e.g. 48min Kimi targets) show when they began and ended.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
scripts/run_full_benchmark.py (1)

300-316: 🧹 Nitpick | 🔵 Trivial

Consider adding finished_at to timing.json.

The structure meets the PR objectives, and started_at is now captured correctly. However, downstream consumers might benefit from an explicit finished_at timestamp rather than having to compute it from started_at + total_seconds.

💡 Optional enhancement
     timing_data = {
         "model": model,
         "started_at": run_start_ts.isoformat(),
+        "finished_at": datetime.now(timezone.utc).isoformat(),
         "total_seconds": round(total_elapsed, 1),
         "targets": {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/run_full_benchmark.py` around lines 300 - 316, The timing.json lacks
an explicit finished_at; compute the end timestamp (either capture run_end_ts
when the run completes or derive run_end_ts = run_start_ts +
timedelta(seconds=total_elapsed)) and add "finished_at": run_end_ts.isoformat()
into the timing_data dict alongside "started_at" and "total_seconds" (update
references to run_start_ts, total_elapsed, and timing_data/timing_path
accordingly so timing_path.write_text writes the new field).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/run_full_benchmark.py`:
- Line 177: The main() function lacks a return type annotation; update its
signature to declare it returns an int (e.g., change def main(): to def main()
-> int:) and ensure all return paths within main (the places returning 0 or 1)
remain consistent with the int type annotation so type checkers and readers see
the intended return type.
- Around line 147-174: In _run, the timeout branch currently prints "Finished:"
which can be misleading; update the subprocess.TimeoutExpired except block to
print a clearer label (e.g., "Timed out at:" or "Timeout at:") instead of
"Finished:" so the console output matches the earlier "TIMEOUT" message — adjust
the print that uses finish_ts.strftime('%H:%M:%S UTC') in the exception handler
within _run to use the new label.

---

Duplicate comments:
In `@scripts/run_full_benchmark.py`:
- Around line 300-316: The timing.json lacks an explicit finished_at; compute
the end timestamp (either capture run_end_ts when the run completes or derive
run_end_ts = run_start_ts + timedelta(seconds=total_elapsed)) and add
"finished_at": run_end_ts.isoformat() into the timing_data dict alongside
"started_at" and "total_seconds" (update references to run_start_ts,
total_elapsed, and timing_data/timing_path accordingly so timing_path.write_text
writes the new field).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 673e3cdb-94ef-4a5d-8cf7-7d8df4d59999

📥 Commits

Reviewing files that changed from the base of the PR and between 1f695ac and 6e1c59c.

📒 Files selected for processing (1)
  • scripts/run_full_benchmark.py

Comment thread scripts/run_full_benchmark.py
Comment thread scripts/run_full_benchmark.py Outdated
…json

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@aallan aallan merged commit acf1693 into main Apr 13, 2026
10 checks passed
@aallan aallan deleted the feature/benchmark-timing branch April 13, 2026 23:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add timing to run_full_benchmark.py

1 participant