Skip to content

Report T1-T4 aggregate separately for cross-language comparison#56

Merged
aallan merged 2 commits into
aallan:mainfrom
jasisz:tier5-methodology
Apr 16, 2026
Merged

Report T1-T4 aggregate separately for cross-language comparison#56
aallan merged 2 commits into
aallan:mainfrom
jasisz:tier5-methodology

Conversation

@jasisz

@jasisz jasisz commented Apr 14, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add exclude_tiers parameter to compute_metrics() for tier-filtered aggregation
  • Report shows "All Tiers (T1–T5)" and "Comparable (T1–T4)" sections in summary table
  • Methodology note explains why T5 is reported separately
  • README note directs readers to T1–T4 for cross-language headline rates

Context

Tier 5 tests algebraic effect handlers (State, Exn, IO) unique to Vera. Python uses try/except, TypeScript uses try/catch, Aver uses Result<T,E> and recursion. Cross-language T5 comparison is apples-to-oranges.

Discussed in PR #48, formalized in #50.

Changes

  • vera_bench/metrics.pycompute_metrics(results, *, exclude_tiers={5}) filters problems before aggregation
  • vera_bench/report.py — two summary sections + methodology note; extracted _SUMMARY_HEADER, _SUMMARY_SEP, _summary_row() to avoid duplication
  • README.md — cross-language comparison note
  • tests/test_runner.py — tests for tier filtering, backward compatibility, report rendering

Test plan

  • exclude_tiers={5} correctly removes T5 from aggregates
  • exclude_tiers=None and exclude_tiers=set() preserve existing behavior
  • Report renders both sections with correct problem counts
  • 434 tests pass, 85.5% coverage, triple ruff clean

Closes #50

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation

    • Clarified how to compute cross-language headline rates and explained why Tier 5 is reported separately; changelog notes Comparable section suppression when no T1–T4 problems exist.
  • New Features

    • Reports show separate metric tables for All Tiers (T1–T5) and Comparable (T1–T4).
    • Added option to exclude specific tiers from metric aggregation.
  • Tests

    • Expanded tests for tier-exclusion behaviour and report rendering scenarios.

@jasisz jasisz requested a review from aallan as a code owner April 14, 2026 07:16
@coderabbitai

coderabbitai Bot commented Apr 14, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4fa0d3f5-71ff-457d-bf3b-d5acc668663f

📥 Commits

Reviewing files that changed from the base of the PR and between b3e0a25 and 237bc75.

📒 Files selected for processing (5)
  • CHANGELOG.md
  • README.md
  • tests/test_runner.py
  • vera_bench/metrics.py
  • vera_bench/report.py

📝 Walkthrough

Walkthrough

Adds tier-filtered metrics and reporting so the primary cross-language headline uses a T1–T4 "Comparable" aggregate while Tier 5 is reported separately; implements filtering in metrics, dual-section report generation, test coverage, and documentation updates. (50 words)

Changes

Cohort / File(s) Summary
Documentation
README.md, CHANGELOG.md
Adds methodology note and changelog entries describing T1–T4 headline reporting and separate Tier‑5 reporting.
Metrics Filtering
vera_bench/metrics.py
`compute_metrics(results, *, exclude_tiers: set[int]
Report Generation
vera_bench/report.py
generate_report now computes both full and comparable (T1–T4, via exclude_tiers={5}) metric sets; _summary_table refactored to optionally render a "Comparable (T1–T4)" section, new _summary_row and shared formatting constants.
Tests
tests/test_runner.py
Adds tests for exclude_tiers behavior and report generation asserting both "All Tiers" and "Comparable" sections, plus scenarios where comparable section is omitted (T5‑only).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

Possibly related PRs

Suggested labels

harness, docs

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 21.43% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main objective: reporting T1-T4 aggregate separately for cross-language comparison, which is the core change across metrics, reporting, and documentation.
Linked Issues check ✅ Passed All coding objectives from issue #50 are met: compute_metrics accepts exclude_tiers parameter [vera_bench/metrics.py], report renders dual sections T1-T5 and T1-T4 with methodology note [vera_bench/report.py], and README updated with T1-T4 guidance [README.md]. Plotting script updates are deferred (acknowledged in PR context as separate work).
Out of Scope Changes check ✅ Passed All changes directly support the stated objectives: tier filtering logic, dual-section reporting, test coverage for new behaviour, and documentation updates. No unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_runner.py`:
- Around line 345-348: The current assertions in tests/test_runner.py check
generic fragments ("| 2 |", "| 1 |") on the shared variable report, which can
match unrelated table cells; update the assertions to scope to the correct
section and assert full model rows. Locate the assertions referencing report and
replace them with checks that first isolate the section header text (e.g., "All
tiers" and "T1-T4 comparable") and then assert the exact row or use a anchored
regex that matches the full row containing the expected count (for example
assert re.search(r"^All tiers.*\|\s*2\s*\|", section_text, re.M) and similarly
for the "T1-T4 comparable" section to ensure you are matching the intended row
and count).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1792b57c-fdb7-4603-8c63-bc73aebad63a

📥 Commits

Reviewing files that changed from the base of the PR and between acf1693 and b3e0a25.

📒 Files selected for processing (4)
  • README.md
  • tests/test_runner.py
  • vera_bench/metrics.py
  • vera_bench/report.py

Comment thread tests/test_runner.py Outdated
@codecov

codecov Bot commented Apr 14, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.30%. Comparing base (acf1693) to head (237bc75).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #56      +/-   ##
==========================================
+ Coverage   83.06%   83.30%   +0.24%     
==========================================
  Files          10       10              
  Lines        1346     1360      +14     
==========================================
+ Hits         1118     1133      +15     
+ Misses        228      227       -1     
Flag Coverage Δ
python 83.30% <100.00%> (+0.24%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…an#50)

Tier 5 tests algebraic effect handlers unique to Vera. Other languages
solve these with native idioms (try/except, Result types, recursion),
making cross-language T5 comparison apples-to-oranges.

- metrics.py: add exclude_tiers parameter to compute_metrics()
- report.py: show "All Tiers (T1-T5)" and "Comparable (T1-T4)" sections
  in summary table, with methodology note
- README.md: add cross-language comparison note referencing aallan#50
- tests: add tests for exclude_tiers filtering, backward compatibility,
  and report T1-T4 section rendering

Closes aallan#50

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@aallan aallan left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jasisz Overall this is a clean, well-scoped PR — backward-compatible compute_metrics signature, correct filter placement, and the rendered report looks right against our real results. A few notes:

1. Empty "Comparable (T1–T4)" row when only T5 problems are present (cosmetic)

If someone runs vera-bench run --model X --tier 5 and then reports, the Comparable section renders a row with all dashes and | 0 | for Problems. Not wrong, but looks odd. Consider suppressing rows where total_problems == 0 in the comparable section.

2. Hard-coded row assertion in test_report_t1t4_shows_different_counts (minor brittleness)

The test asserts exact strings like "| test-model | 100% | - | - | 100% | 2 |". Any future cosmetic change to _summary_row (new column, padding, em-dash) will break this without a real regression. Consider checking for the key data points separately (model name, problem count) rather than the full formatted row.

3. No CHANGELOG entry

Worth adding an [Unreleased] entry for this change. We can stamp the version when we merge — we are planning to merge this alongside #25 (expanding the test-case pool) since the two are complementary: this PR splits the headline by tiers, and #25 ensures the T1–T4 pool has enough testable problems to be statistically meaningful.


We are starting work on #25 now (adding test_cases to T2-004/T2-005 and new T2/T3 problems with testable signatures). Will link the PR here once it is up — plan is to merge them back-to-back.

@aallan

aallan commented Apr 16, 2026

Copy link
Copy Markdown
Owner

PR #57 is up — adds 10 new T2/T3 problems with testable signatures, bringing the T1–T4 testable pool from 18 to 30 problems. Plan is to merge these back-to-back so the T1–T4 headline rate in the report has a meaningful denominator.

- Suppress Comparable (T1–T4) section when no T1–T4 problems exist
- Replace brittle full-row assertions with key data point checks
- Add test for T5-only report hiding Comparable section
- Add CHANGELOG [Unreleased] entry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@aallan aallan merged commit b59fdc6 into aallan:main Apr 16, 2026
10 checks passed
@jasisz

jasisz commented Apr 16, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the review! All three points addressed in 237bc75:

  1. Comparable section is now suppressed when no T1–T4 problems are present
  2. Test assertions check key data points instead of full formatted rows
  3. CHANGELOG [Unreleased] entry added

Looking forward to #57 landing — merging them back-to-back sounds like the right plan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tier 5 cross-language methodology: report T1-T4 aggregate separately

2 participants