Skip to content

lfglabs-dev/verity-benchmark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

323 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Verity Benchmark

Measuring AI agents at formally verifying smart contracts in Lean 4.

Verity documentation Built with Lean 4 Check

Documentation  ·  Verity compiler  ·  Research note  ·  Paper (PDF)


What is this?

Verity Benchmark is an open evaluation suite that measures how well AI agents can produce formal proofs of smart contract correctness in Lean 4, on top of the Verity formally verified smart contract compiler. Cases are drawn from real-world Ethereum protocols, DeFi systems, token standards, and security challenge contracts.

Verity lets you write smart contracts, state what they should do, prove correctness, and compile to EVM bytecode with machine-checked proofs that compilation preserves semantics. This benchmark is an initiative made in partnership with the Ethereum Foundation and various protocols of the ecosystem. Full documentation lives at veritylang.com; the team behind it is LFG Labs.

Each benchmark task gives an agent:

  • A fixed contract implementation
  • A fixed formal specification
  • One editable proof file with a single theorem to prove

The agent must produce a valid Lean proof. No placeholders (sorry, admit) are allowed, and benchmark proof files may not introduce axiom declarations. A small CI-enforced trusted boundary axiom ledger documents semantic boundaries such as fixed-point exp/ln models.


Benchmark suite

21 active cases, 124 active task manifests, and 8 backlog task manifests are drawn from real-world contracts. All active and backlog task manifests are currently runnable proof tasks with hidden reference proofs.

Case Source Tasks
alchemix/earmark_conservation Alchemix V3 5
balancer/reclamm_swap_rounding Balancer ReClamm 1
cork/pool_solvency Cork Phoenix 1
damn_vulnerable_defi/side_entrance Damn Vulnerable DeFi 5
ethereum/deposit_contract_minimal Ethereum deposit contract 5
forgeyields/global_solvency ForgeYields TokenGateway 7
kleros/sortition_trees Kleros sortition module 6
lagoon/guardrails Lagoon vault guardrails 3
lido/vaulthub_locked Lido VaultHub 5
nexus_mutual/ramm_price_band Nexus Mutual RAMM 4
onedelta/caller_address_integrity OneDelta callback caller integrity 10
paladin_votes/stream_recovery_claim_usdc Paladin Votes 26
piku/fund_conservation Piku / Inverter oracle funding manager 4
polygon/agglayer_bridge Polygon Agglayer bridge 2
reserve/auction_price_band Reserve DTF 4
rootstock/flyover_quote_lifecycle Rootstock Flyover quote lifecycle 3
safe/owner_manager_reach Safe OwnerManager 15
termmax/order_v2_buy_xt_single_segment TermMax Order V2 1
usual/dao_collateral Usual DaoCollateral 5
wildcat/borrow_liquidity_safety Wildcat V2 1
zama/erc7984_confidential_token Zama / OpenZeppelin ERC-7984 12

Every runnable task includes a reference proof hidden from the agent during benchmarking. Case-level proof_status: partial means the broader case family is not fully complete; it does not imply that runnable per-task reference proofs are missing.

Coverage is strongest today for accounting, local state preservation, storage effects, linked-list ownership structures, and solvency invariants. Known thinner areas include reentrancy beyond modeled guards, oracle manipulation, governance/timelock properties, temporal or liveness properties, cross-contract compositional reasoning, cryptographic assumptions, and adversarial EVM-level behavior. See docs/evaluated-surface.md for the current evaluation surface.


Results

Verity bench MiniMax M3 Grok Build 0.1 GLM 5 Turbo

We measure cost to a verified proof, not pass/fail alone. Each agent runs in an isolated workspace with the reference proofs withheld; an independent verifier recompiles the submitted file and checks the theorem statement is untouched. Token usage is metered at the API boundary and priced at live OpenRouter rates. We evaluate two harness families on identical tasks: the builtin harness (a minimal Lean-native tool loop: goal inspection, declaration search, proof checking) and generic coding agents (opencode, codex, grok CLI) given shell access to the same workspace.

Current results on a 5-task slice spanning four proof families, ranked by total cost (full table, per-task data, and methodology notes in the leaderboard):

Harness Model Verified Median cost / proof Total cost
builtin MiniMax M3 5/5 $0.24 $1.49
opencode MiniMax M3 3/5 $0.59 $3.38
codex GPT-5.5 5/5 ~$0.8–1.2 (est.) ~$4–6 (est.)
opencode GLM 5 Turbo 5/5 $0.39 $5.29
builtin Grok Build 0.1 4/5 $0.48 $5.84
builtin GLM 5 Turbo 5/5 $1.52 $7.39
builtin GPT-5.5 5/5 $1.23 $8.42
grok CLI Grok Build 0.1 4/5 ~$0.5–5 (est.) ~$3–25 (est.)

Two observations so far, to be confirmed at larger scale:

  1. Given enough budget, every model proves almost everything. The discriminating variable is cost: across models the spread is ~6× in total cost at equal success.
  2. Harness×model interaction is real. More capable models (GPT-5.5, MiniMax M3) perform best inside the constrained builtin loop, while cheaper models (GLM 5 Turbo) do better as unconstrained shell agents — the structured tool protocol appears to help models that can exploit it and hinder those that cannot.

Estimates marked (est.) cover harnesses that expose no token telemetry (grok CLI) or only an undecomposed total (codex); derivation is documented in the leaderboard. Results come from the manually-dispatched benchmark workflow (models, budgets, task slice, and endpoint are dispatch inputs) and publish to the benchmark-results branch; single-seed runs, so treat small deltas as noise.

Running the benchmark

Verify reference proofs

# Single task
./scripts/run_task.sh ethereum/deposit_contract_minimal/deposit_count

# All tasks in a case
./scripts/run_case.sh ethereum/deposit_contract_minimal

# Full suite
./scripts/run_all.sh

Run with a harness

There are two kinds of harness:

  • default: the built-in fair harness. The model works through an OpenAI-compatible tool loop with Lean-native tools (show_task, read_file, show_goal, definition_outline, tactic_sandbox, check_proof, try_tactics, search_declarations); every tool call and conversation turn is logged in run artifacts. Native tool calls and JSON-encoded text tool calls are both supported.
  • shell agent profiles (grok-build, opencode, codex, ... from harness/agents/*.json): an off-the-shelf coding agent CLI runs inside the workspace against a local metering proxy that measures token usage at the API boundary.

What every harness sees is identical and enforced, not promised:

  • The workspace contains only public case files (contracts, specs, skeletons) plus the generic, contract-agnostic Grindset (Attr, Monad, Core, Reach, ArithCore) — the same lemma library the repo's own reference proofs compile against. scripts/check_grindset_generic.py (CI) forbids case-specific content in it.
  • Hidden reference proofs (Benchmark/Cases/*/Proofs.lean) and .env are absent from the workspace, and the private .lake build dir is pruned to workspace sources, so they are not importable either.
  • The verifier rebuilds the submission in its own private copy and rejects imports of any module the agent could not see, plus sorry/admit/axiom and theorem-statement changes.

Every harness receives the same generated harness/TASK_SUMMARY.md (target theorem, editable files, public files, check command, policy) and the same ./harness/check.sh.

Operational notes: chat requests retry transient provider failures and log retry events in conversations/*.jsonl; task results carry failure_class so provider failures, no-tool loops, parse errors, unknown names, unsolved goals, and Lean timeouts are distinguishable. Provider switching lives in .env (DEFAULT_HARNESS_PROVIDER=qwen|glm reads DEFAULT_HARNESS_<PROVIDER>_* before the generic DEFAULT_HARNESS_* values); for small-context providers set DEFAULT_HARNESS_NATIVE_TOOLS=0 and lower DEFAULT_HARNESS_TOOL_RESULT_CHARS / DEFAULT_HARNESS_TASK_SUMMARY_CHARS.

Budget profiles:

  • --budget quick: CI-sized smoke budget.
  • --budget normal: small comparison budget.
  • --budget deep: long agent budget for real attempts.
# Run a single task with the default harness
python3 -m harness.cli run-task lido/vaulthub_locked/locked_funds_solvency --harness default

# Run a deeper fair agent attempt
python3 -m harness.cli run-task ethereum/deposit_contract_minimal/deposit_count --harness default --budget deep

# Run a full case with the default harness
./scripts/run_default_harness_group.sh lido/vaulthub_locked --max-attempts 2

# Run the full suite with the default harness
./scripts/run_default_harness_suite.sh --suite active --max-attempts 1

# Run a shell agent profile (grok-build, opencode, codex)
VERITY_ALLOW_HOST_GROK_AUTH=1 python3 -m harness.cli run-task ethereum/deposit_contract_minimal/deposit_count --harness grok-build --budget deep

# Compare runs across harnesses
python3 -m harness.cli run-task ethereum/deposit_contract_minimal/deposit_count --harness default
python3 -m harness.cli compare --runs results/runs/<default-fair-run> results/runs/<grok-build-run>

Default harness API configuration:

cp .env.example .env
$EDITOR .env

Grok Build can use GROK_CODE_XAI_API_KEY in CI. For local comparisons against an already logged-in grok CLI, set VERITY_ALLOW_HOST_GROK_AUTH=1; the runner copies only ~/.grok/auth.json into an isolated temporary home for that run.


Project structure

verity-benchmark/
├── Benchmark/
│   ├── Cases/           # Reference proofs (hidden from agents)
│   └── Generated/       # Public proof templates
├── cases/               # Task manifests and contract sources
├── harness/             # Agent runner, tools, and evaluation
├── scripts/             # CLI entry points
├── schemas/             # JSON schemas for results
└── results/             # Run artifacts

Documentation

Document Description
harness/README.md Harness internals and agent integration
docs/architecture/task-api.md Task contract and manifest format

About

Benchmark for Verity-based smart contract verification research

Resources

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors