Report T1-T4 aggregate separately for cross-language comparison#56
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (5)
📝 WalkthroughWalkthroughAdds tier-filtered metrics and reporting so the primary cross-language headline uses a T1–T4 "Comparable" aggregate while Tier 5 is reported separately; implements filtering in metrics, dual-section report generation, test coverage, and documentation updates. (50 words) Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~28 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_runner.py`:
- Around line 345-348: The current assertions in tests/test_runner.py check
generic fragments ("| 2 |", "| 1 |") on the shared variable report, which can
match unrelated table cells; update the assertions to scope to the correct
section and assert full model rows. Locate the assertions referencing report and
replace them with checks that first isolate the section header text (e.g., "All
tiers" and "T1-T4 comparable") and then assert the exact row or use a anchored
regex that matches the full row containing the expected count (for example
assert re.search(r"^All tiers.*\|\s*2\s*\|", section_text, re.M) and similarly
for the "T1-T4 comparable" section to ensure you are matching the intended row
and count).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1792b57c-fdb7-4603-8c63-bc73aebad63a
📒 Files selected for processing (4)
README.mdtests/test_runner.pyvera_bench/metrics.pyvera_bench/report.py
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #56 +/- ##
==========================================
+ Coverage 83.06% 83.30% +0.24%
==========================================
Files 10 10
Lines 1346 1360 +14
==========================================
+ Hits 1118 1133 +15
+ Misses 228 227 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…an#50) Tier 5 tests algebraic effect handlers unique to Vera. Other languages solve these with native idioms (try/except, Result types, recursion), making cross-language T5 comparison apples-to-oranges. - metrics.py: add exclude_tiers parameter to compute_metrics() - report.py: show "All Tiers (T1-T5)" and "Comparable (T1-T4)" sections in summary table, with methodology note - README.md: add cross-language comparison note referencing aallan#50 - tests: add tests for exclude_tiers filtering, backward compatibility, and report T1-T4 section rendering Closes aallan#50 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
b3e0a25 to
1e40379
Compare
There was a problem hiding this comment.
@jasisz Overall this is a clean, well-scoped PR — backward-compatible compute_metrics signature, correct filter placement, and the rendered report looks right against our real results. A few notes:
1. Empty "Comparable (T1–T4)" row when only T5 problems are present (cosmetic)
If someone runs vera-bench run --model X --tier 5 and then reports, the Comparable section renders a row with all dashes and | 0 | for Problems. Not wrong, but looks odd. Consider suppressing rows where total_problems == 0 in the comparable section.
2. Hard-coded row assertion in test_report_t1t4_shows_different_counts (minor brittleness)
The test asserts exact strings like "| test-model | 100% | - | - | 100% | 2 |". Any future cosmetic change to _summary_row (new column, padding, em-dash) will break this without a real regression. Consider checking for the key data points separately (model name, problem count) rather than the full formatted row.
3. No CHANGELOG entry
Worth adding an [Unreleased] entry for this change. We can stamp the version when we merge — we are planning to merge this alongside #25 (expanding the test-case pool) since the two are complementary: this PR splits the headline by tiers, and #25 ensures the T1–T4 pool has enough testable problems to be statistically meaningful.
We are starting work on #25 now (adding test_cases to T2-004/T2-005 and new T2/T3 problems with testable signatures). Will link the PR here once it is up — plan is to merge them back-to-back.
|
PR #57 is up — adds 10 new T2/T3 problems with testable signatures, bringing the T1–T4 testable pool from 18 to 30 problems. Plan is to merge these back-to-back so the T1–T4 headline rate in the report has a meaningful denominator. |
- Suppress Comparable (T1–T4) section when no T1–T4 problems exist - Replace brittle full-row assertions with key data point checks - Add test for T5-only report hiding Comparable section - Add CHANGELOG [Unreleased] entry Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Thanks for the review! All three points addressed in 237bc75:
Looking forward to #57 landing — merging them back-to-back sounds like the right plan. |
Summary
exclude_tiersparameter tocompute_metrics()for tier-filtered aggregationContext
Tier 5 tests algebraic effect handlers (
State,Exn,IO) unique to Vera. Python usestry/except, TypeScript usestry/catch, Aver usesResult<T,E>and recursion. Cross-language T5 comparison is apples-to-oranges.Discussed in PR #48, formalized in #50.
Changes
vera_bench/metrics.py—compute_metrics(results, *, exclude_tiers={5})filters problems before aggregationvera_bench/report.py— two summary sections + methodology note; extracted_SUMMARY_HEADER,_SUMMARY_SEP,_summary_row()to avoid duplicationREADME.md— cross-language comparison notetests/test_runner.py— tests for tier filtering, backward compatibility, report renderingTest plan
exclude_tiers={5}correctly removes T5 from aggregatesexclude_tiers=Noneandexclude_tiers=set()preserve existing behaviorCloses #50
🤖 Generated with Claude Code
Summary by CodeRabbit
Documentation
New Features
Tests