Add per-target timing to benchmark runner by aallan · Pull Request #55 · aallan/vera-bench

aallan · 2026-04-13T23:02:01Z

Summary

Log start time and elapsed duration for each benchmark target
Show per-target timing in the summary table (aligned columns with human-readable durations)
Write results/timing.json with model, timestamps, and per-target seconds for programmatic analysis
Total wall-clock time shown at the bottom of the summary

Closes #51

Example summary output

  Vera full-spec      PASS                  4m 12s
  Vera spec-from-NL   PASS                  3m 58s
  Python LLM          PASS                  1m 22s
  Kimi K2.5           PASS                 48m 03s

  Total                                  57m 35s

Test plan

ruff check . passes
CI green

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Benchmark runs now show per-target and total elapsed time in a human‑readable format.
- Run timing is persistently exported to a results file with run metadata, timestamps and per‑target durations.
Improvements
- Reporting now includes aligned timing for each benchmark target and overall run.
- Failure counting and report generation updated to work with the new structured results and explicit status fields.

Log elapsed time for each target, show durations in the summary table, write timing.json alongside results for programmatic analysis. Closes #51 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai · 2026-04-13T23:02:13Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: fc6f8d17-1468-4148-aea0-423fd30b5a67

📥 Commits

Reviewing files that changed from the base of the PR and between 6e1c59c and 1ff2869.

📒 Files selected for processing (1)

scripts/run_full_benchmark.py

📝 Walkthrough

Walkthrough

Added per-target and total timing to scripts/run_full_benchmark.py. _run() now returns a (return_code, elapsed_seconds) tuple; results store status and elapsed fields, are persisted to results/timing.json, and summaries print formatted durations and UTC start/finish timestamps.

Changes

Cohort / File(s)	Summary
Benchmark runner `scripts/run_full_benchmark.py`	Added `_format_duration(seconds: float)`; changed `_run(...)` signature to return `(int, float)` and measure elapsed time with `time.monotonic()`; print UTC start/finish timestamps and per-target formatted durations; on timeout print formatted duration and return `(1, elapsed)`; results now store objects with `status` and `elapsed`; persist run timing to `results/timing.json` (creating `results/` if missing); adjust report invocation unpacking and failure counting to use `info["status"]`; updated `main()` signature to `-> int`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Bump script timeout to 60 minutes per target #42: Modifies the same _run() function in scripts/run_full_benchmark.py; check for interactions with timeout/default changes.

Suggested labels

ci

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately reflects the main objective of the PR, which is to add timing measurements to the benchmark runner for each target.
Linked Issues check	✅ Passed	The PR comprehensively addresses all three requirements from issue `#51`: logging start/elapsed times, adding per-target timing to summary, and writing timing.json.
Out of Scope Changes check	✅ Passed	All changes are directly related to adding timing functionality as specified in issue `#51`; no unrelated modifications detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feature/benchmark-timing

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-04-13T23:07:25Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.06%. Comparing base (02e229f) to head (1ff2869).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #55   +/-   ##
=======================================
  Coverage   83.06%   83.06%           
=======================================
  Files          10       10           
  Lines        1346     1346           
=======================================
  Hits         1118     1118           
  Misses        228      228

Flag	Coverage Δ
python	`83.06% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/run_full_benchmark.py`:
- Around line 272-308: The timing.json currently sets "started_at" after all
targets run; use the earlier run_start to record the actual run start time:
convert run_start (captured before the loop) into an ISO 8601 UTC timestamp and
assign that to timing_data["started_at"] instead of datetime.now(...), keeping
total_seconds computed from total_elapsed; update the timing_data construction
(symbol: timing_data) to reference the start time derived from run_start and
ensure timezone is UTC and format matches .isoformat().
- Around line 156-168: Add explicit wall-clock start and finish timestamps
around the subprocess run so each target logs when it began and when it
completed/timeout; before setting t0 capture start_ts = datetime.now() and print
a "Started at {start_ts.isoformat()}" message (include cmd), and on both the
TimeoutExpired branch and the normal completion branch capture finish_ts =
datetime.now() and print "Finished at {finish_ts.isoformat()}" (alongside the
existing duration and cmd). Update the prints around variables cmd, t0, elapsed
and reuse _format_duration for duration output so both start and finish
timestamps (and durations) are logged for the function that runs the subprocess.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 71f9a0d8-1daf-4074-a3ea-a76d1d52f4f8

📥 Commits

Reviewing files that changed from the base of the PR and between 02e229f and 1f695ac.

📒 Files selected for processing (1)

scripts/run_full_benchmark.py

Record the actual run start time (not post-run time) in timing.json. Log wall-clock start/finish timestamps for each target so long runs (e.g. 48min Kimi targets) show when they began and ended. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

scripts/run_full_benchmark.py (1)
300-316: 🧹 Nitpick | 🔵 Trivial

Consider adding finished_at to timing.json.

The structure meets the PR objectives, and started_at is now captured correctly. However, downstream consumers might benefit from an explicit finished_at timestamp rather than having to compute it from started_at + total_seconds.
💡 Optional enhancement
     timing_data = {
         "model": model,
         "started_at": run_start_ts.isoformat(),
+        "finished_at": datetime.now(timezone.utc).isoformat(),
         "total_seconds": round(total_elapsed, 1),
         "targets": {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/run_full_benchmark.py` around lines 300 - 316, The timing.json lacks
an explicit finished_at; compute the end timestamp (either capture run_end_ts
when the run completes or derive run_end_ts = run_start_ts +
timedelta(seconds=total_elapsed)) and add "finished_at": run_end_ts.isoformat()
into the timing_data dict alongside "started_at" and "total_seconds" (update
references to run_start_ts, total_elapsed, and timing_data/timing_path
accordingly so timing_path.write_text writes the new field).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/run_full_benchmark.py`:
- Line 177: The main() function lacks a return type annotation; update its
signature to declare it returns an int (e.g., change def main(): to def main()
-> int:) and ensure all return paths within main (the places returning 0 or 1)
remain consistent with the int type annotation so type checkers and readers see
the intended return type.
- Around line 147-174: In _run, the timeout branch currently prints "Finished:"
which can be misleading; update the subprocess.TimeoutExpired except block to
print a clearer label (e.g., "Timed out at:" or "Timeout at:") instead of
"Finished:" so the console output matches the earlier "TIMEOUT" message — adjust
the print that uses finish_ts.strftime('%H:%M:%S UTC') in the exception handler
within _run to use the new label.

---

Duplicate comments:
In `@scripts/run_full_benchmark.py`:
- Around line 300-316: The timing.json lacks an explicit finished_at; compute
the end timestamp (either capture run_end_ts when the run completes or derive
run_end_ts = run_start_ts + timedelta(seconds=total_elapsed)) and add
"finished_at": run_end_ts.isoformat() into the timing_data dict alongside
"started_at" and "total_seconds" (update references to run_start_ts,
total_elapsed, and timing_data/timing_path accordingly so timing_path.write_text
writes the new field).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 673e3cdb-94ef-4a5d-8cf7-7d8df4d59999

📥 Commits

Reviewing files that changed from the base of the PR and between 1f695ac and 6e1c59c.

📒 Files selected for processing (1)

scripts/run_full_benchmark.py

…json Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add per-target timing to run_full_benchmark.py

1f695ac

Log elapsed time for each target, show durations in the summary table, write timing.json alongside results for programmatic analysis. Closes #51 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai Bot reviewed Apr 13, 2026

View reviewed changes

Comment thread scripts/run_full_benchmark.py

Comment thread scripts/run_full_benchmark.py

coderabbitai Bot reviewed Apr 13, 2026

View reviewed changes

Comment thread scripts/run_full_benchmark.py

Comment thread scripts/run_full_benchmark.py Outdated

Add main() return type, fix timeout label, add finished_at to timing.…

1ff2869

…json Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

aallan merged commit acf1693 into main Apr 13, 2026
10 checks passed

aallan deleted the feature/benchmark-timing branch April 13, 2026 23:24

coderabbitai Bot mentioned this pull request Apr 17, 2026

docs: document all scripts in scripts/README.md; make plot_results data-driven #59

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add per-target timing to benchmark runner#55

Add per-target timing to benchmark runner#55
aallan merged 3 commits into
mainfrom
feature/benchmark-timing

aallan commented Apr 13, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 13, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

❌ Failed checks (1 warning)

Uh oh!

codecov Bot commented Apr 13, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

aallan commented Apr 13, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Example summary output

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

❌ Failed checks (1 warning)

Uh oh!

codecov Bot commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

aallan commented Apr 13, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 13, 2026 •

edited

Loading

codecov Bot commented Apr 13, 2026 •

edited

Loading