Add 10 new T2/T3 problems with testable signatures#57
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: ⛔ Files ignored due to path filters (5)
📒 Files selected for processing (4)
📝 WalkthroughWalkthroughThis PR expands the benchmark from 50 → 60 problems by adding five Tier‑2 string problems and five Tier‑3 ADT pattern‑matching problems, fills test cases for two existing Tier‑2 problems, and updates docs/metadata (version, changelog, roadmap, citation). Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@problems/tier2/VB_T2_012_ends_with_suffix.json`:
- Line 23: The vera_verify_tier1 flag is incorrectly set to true for a problem
that uses string contracts; update the JSON so the "vera_verify_tier1" property
is false for this problem (VB_T2_012_ends_with_suffix) to reflect that string
contracts fall to Tier 3 runtime verification.
- Around line 15-21: The test harness entries under the "test_cases" array must
be removed because this problem's function accepts String parameters which vera
run --fn cannot parse; update the JSON by clearing the "test_cases" array (make
it empty) so there are no runtime tests; ensure the "test_cases" key remains but
its value is an empty list to rely on vera check/verify for validation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 04218f3b-6234-423c-8ff8-111fc382c7bc
⛔ Files ignored due to path filters (40)
solutions/aver/VB_T2_011_starts_with_prefix.avis excluded by!**/*.av,!solutions/aver/**solutions/aver/VB_T2_012_ends_with_suffix.avis excluded by!**/*.av,!solutions/aver/**solutions/aver/VB_T2_013_get_char_code.avis excluded by!**/*.av,!solutions/aver/**solutions/aver/VB_T2_014_combined_length.avis excluded by!**/*.av,!solutions/aver/**solutions/aver/VB_T2_015_is_longer_than.avis excluded by!**/*.av,!solutions/aver/**solutions/aver/VB_T3_011_safe_divide.avis excluded by!**/*.av,!solutions/aver/**solutions/aver/VB_T3_012_pair_sum.avis excluded by!**/*.av,!solutions/aver/**solutions/aver/VB_T3_013_classify_sign.avis excluded by!**/*.av,!solutions/aver/**solutions/aver/VB_T3_014_color_code.avis excluded by!**/*.av,!solutions/aver/**solutions/aver/VB_T3_015_either_select.avis excluded by!**/*.av,!solutions/aver/**solutions/python/VB_T2_011_starts_with_prefix.pyis excluded by!solutions/python/**solutions/python/VB_T2_012_ends_with_suffix.pyis excluded by!solutions/python/**solutions/python/VB_T2_013_get_char_code.pyis excluded by!solutions/python/**solutions/python/VB_T2_014_combined_length.pyis excluded by!solutions/python/**solutions/python/VB_T2_015_is_longer_than.pyis excluded by!solutions/python/**solutions/python/VB_T3_011_safe_divide.pyis excluded by!solutions/python/**solutions/python/VB_T3_012_pair_sum.pyis excluded by!solutions/python/**solutions/python/VB_T3_013_classify_sign.pyis excluded by!solutions/python/**solutions/python/VB_T3_014_color_code.pyis excluded by!solutions/python/**solutions/python/VB_T3_015_either_select.pyis excluded by!solutions/python/**solutions/typescript/VB_T2_011_starts_with_prefix.tsis excluded by!solutions/typescript/**solutions/typescript/VB_T2_012_ends_with_suffix.tsis excluded by!solutions/typescript/**solutions/typescript/VB_T2_013_get_char_code.tsis excluded by!solutions/typescript/**solutions/typescript/VB_T2_014_combined_length.tsis excluded by!solutions/typescript/**solutions/typescript/VB_T2_015_is_longer_than.tsis excluded by!solutions/typescript/**solutions/typescript/VB_T3_011_safe_divide.tsis excluded by!solutions/typescript/**solutions/typescript/VB_T3_012_pair_sum.tsis excluded by!solutions/typescript/**solutions/typescript/VB_T3_013_classify_sign.tsis excluded by!solutions/typescript/**solutions/typescript/VB_T3_014_color_code.tsis excluded by!solutions/typescript/**solutions/typescript/VB_T3_015_either_select.tsis excluded by!solutions/typescript/**solutions/vera/VB-T2-011_starts_with_prefix.verais excluded by!**/*.verasolutions/vera/VB-T2-012_ends_with_suffix.verais excluded by!**/*.verasolutions/vera/VB-T2-013_get_char_code.verais excluded by!**/*.verasolutions/vera/VB-T2-014_combined_length.verais excluded by!**/*.verasolutions/vera/VB-T2-015_is_longer_than.verais excluded by!**/*.verasolutions/vera/VB-T3-011_safe_divide.verais excluded by!**/*.verasolutions/vera/VB-T3-012_pair_sum.verais excluded by!**/*.verasolutions/vera/VB-T3-013_classify_sign.verais excluded by!**/*.verasolutions/vera/VB-T3-014_color_code.verais excluded by!**/*.verasolutions/vera/VB-T3-015_either_select.verais excluded by!**/*.vera
📒 Files selected for processing (15)
CHANGELOG.mdCONTRIBUTING.mdREADME.mdproblems/tier2/VB_T2_004_is_empty_string.jsonproblems/tier2/VB_T2_005_contains_substring.jsonproblems/tier2/VB_T2_011_starts_with_prefix.jsonproblems/tier2/VB_T2_012_ends_with_suffix.jsonproblems/tier2/VB_T2_013_get_char_code.jsonproblems/tier2/VB_T2_014_combined_length.jsonproblems/tier2/VB_T2_015_is_longer_than.jsonproblems/tier3/VB_T3_011_safe_divide.jsonproblems/tier3/VB_T3_012_pair_sum.jsonproblems/tier3/VB_T3_013_classify_sign.jsonproblems/tier3/VB_T3_014_color_code.jsonproblems/tier3/VB_T3_015_either_select.json
|
Re: CodeRabbit suggestion to remove test_cases from VB-T2-012 — this is incorrect. The test_cases are correct and should stay. What |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #57 +/- ##
=======================================
Coverage 83.30% 83.30%
=======================================
Files 10 10
Lines 1360 1360
=======================================
Hits 1133 1133
Misses 227 227
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Expand the testable problem pool so that T1-T4 run_correct has a meaningful denominator when T5 is excluded for cross-language comparison. Before: 18/40 testable T1-T4 problems (T2: 0, T3: 0) After: 30/50 testable T1-T4 problems (T2: 7, T3: 5) New T2 problems (VB-T2-011 through 015) test string builtin discovery with testable signatures (String/Int inputs, Bool/Nat outputs): starts_with, ends_with, char_code, combined_length, is_longer_than. New T3 problems (VB-T3-011 through 015) test ADT definition and pattern matching with Int-only signatures — the function takes primitive args but must construct ADTs and match internally: safe_divide, pair_sum, classify_sign, color_code, either_select. Also adds test_cases to VB-T2-004 (is_empty_string) and VB-T2-005 (contains_substring), which were testable but had empty test_cases. All canonical solutions provided in Vera, Python, TypeScript, and Aver. Python and TypeScript baselines verified passing. Partial progress on #25. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Vera let syntax is `let @type = expr;`, not `let name: Type = expr;`. The parser expects AT or UPPER_IDENT after let, not a lowercase identifier. Fixed all 5 new T3 canonical solutions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adding problems is a version break for the benchmark — results from v0.0.9 (60 problems) must not be conflated with v0.0.8 (50 problems). The bench_version field in JSONL output comes from pyproject.toml via importlib.metadata, so bumping here ensures result files are tagged correctly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ff07af5 to
8928723
Compare
|
Heads up — the new Aver solutions have a couple of issues: T2: T3: ADT syntax uses Haskell-style (See T2-014 and T2-015 ( |
|
Good catches, thanks @jasisz! Fixed in 7ad8d7e: T2: Removed VB-T2-011/012/013 Aver solutions entirely — T3: All 5 solutions now use indented constructor syntax and qualified names ( I don't have Aver installed locally to verify — would you mind giving these a quick sanity check? |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@problems/tier2/VB_T2_011_starts_with_prefix.json`:
- Line 24: The notes for starts_with_prefix are missing the De Bruijn slot
ordering which causes param inversion; update the "notes" field for
starts_with_prefix(`@String`, `@String` -> `@Bool`) to explicitly state slot ordering:
`@String.0` = prefix (the first argument), `@String.1` = haystack (the
second/rightmost argument), and that `@Bool` is the result, so future readers
won't confuse the parameter order. Reference the function name
starts_with_prefix and the param slots `@String.0` and `@String.1` when adding the
sentence.
In `@problems/tier2/VB_T2_012_ends_with_suffix.json`:
- Line 24: Update the "notes" field in the JSON entry to explicitly call out the
parameter slot ordering for the two-String problem: state that `@String.0` refers
to the rightmost parameter and warn that string_ends_with(haystack, suffix) can
be easily inverted (suffix vs haystack); mention which slot is expected for
"suffix" and which for "haystack" to prevent argument inversion when using
string_ends_with(...).
In `@problems/tier2/VB_T2_013_get_char_code.json`:
- Around line 23-24: Update the JSON metadata to enable Tier 1 verification by
changing the boolean field "vera_verify_tier1" from false to true; this file
uses trivial contracts (requires(true) and ensures(true)) so set
"vera_verify_tier1": true to reflect that Tier 1 verification applies.
In `@problems/tier2/VB_T2_014_combined_length.json`:
- Around line 23-24: Update the JSON to enable Tier 1 verification by setting
"vera_verify_tier1" to true and revise the "notes" field to state that the
contracts (requires(true), ensures(true)) are trivially true and Z3-verifiable
at Tier 1; modify the "vera_verify_tier1" and "notes" entries accordingly so
they reflect that string_length in the function body does not force Tier 3
fallback.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: cf815003-10bc-40f7-89a2-415ac6215345
⛔ Files ignored due to path filters (40)
solutions/aver/VB_T2_011_starts_with_prefix.avis excluded by!**/*.av,!solutions/aver/**solutions/aver/VB_T2_012_ends_with_suffix.avis excluded by!**/*.av,!solutions/aver/**solutions/aver/VB_T2_013_get_char_code.avis excluded by!**/*.av,!solutions/aver/**solutions/aver/VB_T2_014_combined_length.avis excluded by!**/*.av,!solutions/aver/**solutions/aver/VB_T2_015_is_longer_than.avis excluded by!**/*.av,!solutions/aver/**solutions/aver/VB_T3_011_safe_divide.avis excluded by!**/*.av,!solutions/aver/**solutions/aver/VB_T3_012_pair_sum.avis excluded by!**/*.av,!solutions/aver/**solutions/aver/VB_T3_013_classify_sign.avis excluded by!**/*.av,!solutions/aver/**solutions/aver/VB_T3_014_color_code.avis excluded by!**/*.av,!solutions/aver/**solutions/aver/VB_T3_015_either_select.avis excluded by!**/*.av,!solutions/aver/**solutions/python/VB_T2_011_starts_with_prefix.pyis excluded by!solutions/python/**solutions/python/VB_T2_012_ends_with_suffix.pyis excluded by!solutions/python/**solutions/python/VB_T2_013_get_char_code.pyis excluded by!solutions/python/**solutions/python/VB_T2_014_combined_length.pyis excluded by!solutions/python/**solutions/python/VB_T2_015_is_longer_than.pyis excluded by!solutions/python/**solutions/python/VB_T3_011_safe_divide.pyis excluded by!solutions/python/**solutions/python/VB_T3_012_pair_sum.pyis excluded by!solutions/python/**solutions/python/VB_T3_013_classify_sign.pyis excluded by!solutions/python/**solutions/python/VB_T3_014_color_code.pyis excluded by!solutions/python/**solutions/python/VB_T3_015_either_select.pyis excluded by!solutions/python/**solutions/typescript/VB_T2_011_starts_with_prefix.tsis excluded by!solutions/typescript/**solutions/typescript/VB_T2_012_ends_with_suffix.tsis excluded by!solutions/typescript/**solutions/typescript/VB_T2_013_get_char_code.tsis excluded by!solutions/typescript/**solutions/typescript/VB_T2_014_combined_length.tsis excluded by!solutions/typescript/**solutions/typescript/VB_T2_015_is_longer_than.tsis excluded by!solutions/typescript/**solutions/typescript/VB_T3_011_safe_divide.tsis excluded by!solutions/typescript/**solutions/typescript/VB_T3_012_pair_sum.tsis excluded by!solutions/typescript/**solutions/typescript/VB_T3_013_classify_sign.tsis excluded by!solutions/typescript/**solutions/typescript/VB_T3_014_color_code.tsis excluded by!solutions/typescript/**solutions/typescript/VB_T3_015_either_select.tsis excluded by!solutions/typescript/**solutions/vera/VB-T2-011_starts_with_prefix.verais excluded by!**/*.verasolutions/vera/VB-T2-012_ends_with_suffix.verais excluded by!**/*.verasolutions/vera/VB-T2-013_get_char_code.verais excluded by!**/*.verasolutions/vera/VB-T2-014_combined_length.verais excluded by!**/*.verasolutions/vera/VB-T2-015_is_longer_than.verais excluded by!**/*.verasolutions/vera/VB-T3-011_safe_divide.verais excluded by!**/*.verasolutions/vera/VB-T3-012_pair_sum.verais excluded by!**/*.verasolutions/vera/VB-T3-013_classify_sign.verais excluded by!**/*.verasolutions/vera/VB-T3-014_color_code.verais excluded by!**/*.verasolutions/vera/VB-T3-015_either_select.verais excluded by!**/*.vera
📒 Files selected for processing (18)
CHANGELOG.mdCITATION.cffCONTRIBUTING.mdREADME.mdROADMAP.mdproblems/tier2/VB_T2_004_is_empty_string.jsonproblems/tier2/VB_T2_005_contains_substring.jsonproblems/tier2/VB_T2_011_starts_with_prefix.jsonproblems/tier2/VB_T2_012_ends_with_suffix.jsonproblems/tier2/VB_T2_013_get_char_code.jsonproblems/tier2/VB_T2_014_combined_length.jsonproblems/tier2/VB_T2_015_is_longer_than.jsonproblems/tier3/VB_T3_011_safe_divide.jsonproblems/tier3/VB_T3_012_pair_sum.jsonproblems/tier3/VB_T3_013_classify_sign.jsonproblems/tier3/VB_T3_014_color_code.jsonproblems/tier3/VB_T3_015_either_select.jsonpyproject.toml
|
The T3 Aver solutions still fail — Aver has no All branching must go through VB_T3_011_safe_divide.avVB_T3_012_pair_sum.avVB_T3_013_classify_sign.avVB_T3_014_color_code.avVB_T3_015_either_select.avKey patterns:
|
Aver has no `let` or `if/then/else` — conditionals use `match` on booleans, bindings use `name = expr`. Rewrote all 5 T3 solutions to use idiomatic Aver. All 25 verify cases pass (aver 0.9.5). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
7ad8d7e to
2bb0329
Compare
|
Thanks @jasisz — installed Aver 0.9.5 locally and fixed everything properly this time. T2: Removed VB-T2-011/012/013 (no T3: Turned out the qualified constructor fix alone wasn't enough — Aver also has no Verified with |
- T2-011/012: Add De Bruijn slot ordering to notes (matches T2-005 pattern) - T2-013/014: Change vera_verify_tier1 from false to true — contracts are trivial (ensures(true)) so Z3-verifiable at Tier 1. String ops are in the function body, not in the contracts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
||
| fn color_code(code: Int) -> Int | ||
| ? "Returns the RGB integer for the given color code." | ||
| c: Color = match code |
There was a problem hiding this comment.
I've just re-discovered this is valid Aver, love the pattern!
Items 2, 3, 4 from @aallan's consolidated review on PR aallan#70. (Item 1 — extracting --parallel N into its own PR — addressed via PR aallan#73.) ### Item 2: README headline section -> single sentence in §Overview Removed the "AILANG: AI-designed language..." headline section (13 lines: the heading, the description paragraphs, the per-mode results table, the "full-circle finding" paragraph). The phrasing included editorial claims about VeraBench's identity that should be a project-owner call, and "added in this fork" wouldn't read correctly post-merge. Replaced the §Overview line about baselines with the form @aallan suggested verbatim: The same problems are also run in Python, TypeScript, [Aver](https://github.com/jasisz/aver), and [AILANG](https://ailang.sunholo.com/) as baselines. AILANG and Aver are zero-training-data languages, providing additional data points alongside Vera for the language-design-vs-training-data thesis. Matches the existing Aver pattern: light-touch mention without results writeups in the README. ### Item 3: Delete AILANG_MAPPING.md and AILANG_RESULTS.md Neither file is load-bearing — no code or tests reference them. Aver landed across PRs aallan#57 / aallan#62 / aallan#65 without AVER_RESULTS.md or AVER_MAPPING.md. Numbers and writeups go in PR descriptions and external content; in-repo docs are reserved for things future maintainers need. ### Item 4: .coderabbit.yaml path_filters Added the two missing AILANG entries to mirror the existing {python, typescript, aver} pattern: - "!**/*.ail" (alongside !**/*.vera, !**/*.av) - "!solutions/ailang/**" (alongside the other solutions/* entries) This stops CodeRabbit from generating speculative findings on .ail solution files in future review passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Context
When #56 splits the report into T1-T5 and T1-T4 sections, the T1-T4
run_correctdenominator was only 18 problems (T1: 10, T4: 8) because T2 and T3 had zero test cases --vera run --fncould not handle their Array/ADT/String-return signatures.This PR increases the T1-T4 testable pool to 30 problems by:
@String -> @Bool, T2-005@String, @String -> @Bool)starts_with_prefix,ends_with_suffix,get_char_code,combined_length,is_longer_thansafe_divide,pair_sum,classify_sign,color_code,either_selectThe T3 design is the key insight: functions take
@Intargs and return@Int, but the LLM must defineprivate datatypes and writematchexpressions internally. This preserves the tier's intent (testing ADT + pattern matching) while making the problems testable viavera run --fn.Partial progress on #25. Intended to be merged alongside #56.
Test plan
npx tsxdatadefinitionsruff checkcleanvera check+vera verify(blocked by local vera install)Generated with Claude Code
Summary by CodeRabbit
New Features
Tests
Documentation
Chores