Tier 5 cross-language methodology: report T1-T4 aggregate separately #50

@aallan

Context

Tier 5 problems test effect handling (State, Exn, IO). In Vera, this means algebraic effect handlers. But every other language solves these problems with native idioms:

  • Python: try/except, mutable variables
  • TypeScript: try/catch, closures
  • Aver: Result<T,E>, pure recursion

This means T5 run_correct does not compare the same capability across languages: Vera tests effect-handler wiring, while the other languages test general error handling and state management.
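To make the mismatch concrete, here is a minimal Python sketch of how a State-plus-Exn style task is usually solved with native idioms. The task and the function name `sum_until_negative` are hypothetical and not taken from the benchmark. The sketch only illustrates that a mutable local and try/except stand in for what Vera would express through effect handlers.

```python
def sum_until_negative(values):
    """Sum values, aborting on the first negative one.

    Hypothetical task for illustration only; not a real benchmark problem.
    """
    total = 0  # "State" is just a mutable local variable
    try:
        for v in values:
            if v < 0:
                # "Exn" is a native exception, with no handler wiring involved
                raise ValueError(f"negative input: {v}")
            total += v
    except ValueError:
        return -1  # sentinel result when the computation aborts
    return total
```

A solution like this exercises ordinary control flow, so a passing run says little about a model's ability to wire up effect handlers.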

Proposal

Report T1-T4 aggregate as the primary cross-language headline score, with T5 reported separately as "functional equivalents" that test language-specific mechanisms.

This was discussed in PR #48 (Aver support) with @jasisz, who offered to submit a follow-up PR for the reporting changes.

What needs changing

  • vera_bench/report.py — add T1-T4 aggregate row alongside the existing all-tier row
  • vera_bench/metrics.py — add tier-filtered metric computation
  • scripts/plot_results.py — update charts to show T1-T4 and T5 separately
  • README results section — use T1-T4 as the headline comparison
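The first two items could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the existing code in `vera_bench/metrics.py` or `vera_bench/report.py`: the `Result` record, `run_correct_rate`, and `summary_rows` are hypothetical names, and the real result schema may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Result:
    """Hypothetical per-problem record; the real schema may differ."""
    problem_id: str
    tier: int          # 1..5
    run_correct: bool  # did the generated program produce the right output?


def run_correct_rate(results: Iterable[Result],
                     tiers: Optional[Set[int]] = None) -> Optional[float]:
    """Fraction of run_correct results, restricted to `tiers` if given.

    Returns None when no results fall in the selected tiers, so callers
    can render an empty cell instead of a misleading 0.0.
    """
    selected = [r for r in results if tiers is None or r.tier in tiers]
    if not selected:
        return None
    return sum(r.run_correct for r in selected) / len(selected)


def summary_rows(results: Iterable[Result]) -> List[Tuple[str, Optional[float]]]:
    """Build report rows: T1-T4 headline, T5 separately, then all tiers."""
    results = list(results)  # allow multiple passes over a generator
    return [
        ("T1-T4", run_correct_rate(results, {1, 2, 3, 4})),
        ("T5", run_correct_rate(results, {5})),
        ("All tiers", run_correct_rate(results)),
    ]
```

Keeping the all-tier row next to the new T1-T4 row lets existing numbers stay comparable while the headline shifts to the tiers that test the same capability in every language.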

Relates to

  • #48 — Aver support (where this was identified)
  • #21 — Go support (will have same T5 mismatch)
  • #49 — MoonBit support (same)

Labels: evaluation (Benchmark evaluation modes and model runs)
