Evaluate code efficiency on the Mercury benchmark.
| Metric | Description |
|---|---|
| Pass@1 | Fraction of problems where the solution passes all test cases |
| Beyond@1 | Efficiency score in [0, 1] — how close to the fastest reference solution |
| Percentile | Rank among reference solutions — P85 means faster than 85% of references |
score = (max_ref_runtime − solution_runtime) / (max_ref_runtime − min_ref_runtime)
Clamped to [0, 1]. 1.0 = matches the fastest reference; 0.0 = at or slower than the slowest.
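The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the package's own code: the function name `beyond_score` is hypothetical, and the handling of the degenerate case where all reference runtimes are equal is an assumption, since the source does not specify it.

```python
def beyond_score(solution_runtime: float, ref_runtimes: list[float]) -> float:
    """Normalize a runtime against the reference range and clamp to [0, 1]."""
    lo, hi = min(ref_runtimes), max(ref_runtimes)
    if hi == lo:
        # Assumption: with a zero-width reference range, score 1.0 if we
        # match or beat the references and 0.0 otherwise.
        return 1.0 if solution_runtime <= lo else 0.0
    raw = (hi - solution_runtime) / (hi - lo)
    return max(0.0, min(1.0, raw))


refs = [1.0, 2.0, 5.0]
print(beyond_score(3.0, refs))  # halfway between the slowest and fastest reference
```

The clamp means a solution faster than every reference still scores 1.0, and one slower than every reference scores 0.0.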
Each solution is benchmarked with:
- 128 timed iterations per test case with copy.deepcopy inputs (no mutation bias)
- CPU time via resource.getrusage (user + system), same as time -v, immune to system load
- 95% confidence interval (CI) from per-iteration wall-clock times
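The measurement loop described above might look roughly like the following. This is a sketch under stated assumptions rather than the package's implementation: the function name `benchmark` and the returned keys are hypothetical, the confidence interval uses a normal approximation (mean ± 1.96 standard errors), and `resource` is only available on Unix-like systems.

```python
import copy
import math
import resource
import statistics
import time


def benchmark(fn, args, iterations=128):
    """Time fn(*args) repeatedly, giving each call a fresh deep copy of args."""
    cpu_total = 0.0
    wall_times = []
    for _ in range(iterations):
        fresh = copy.deepcopy(args)  # copied outside the timed region
        r0 = resource.getrusage(resource.RUSAGE_SELF)
        t0 = time.perf_counter()
        fn(*fresh)
        t1 = time.perf_counter()
        r1 = resource.getrusage(resource.RUSAGE_SELF)
        # CPU time = user + system, so time spent descheduled is not counted.
        cpu_total += (r1.ru_utime - r0.ru_utime) + (r1.ru_stime - r0.ru_stime)
        wall_times.append(t1 - t0)
    mean = statistics.fmean(wall_times)
    # Assumption: normal-approximation 95% interval on the mean wall time.
    half_width = 1.96 * statistics.stdev(wall_times) / math.sqrt(len(wall_times))
    return {
        "cpu_s": cpu_total / iterations,
        "wall_mean_s": mean,
        "ci95_lo": mean - half_width,
        "ci95_hi": mean + half_width,
    }


data = [3, 1, 2]
stats = benchmark(lambda xs: xs.sort(), (data,), iterations=16)
print(data)  # the caller's list is untouched because each run sorts a copy
```

Copying the inputs before every iteration matters for in-place algorithms: without it, the second and later runs would see already-processed data and report misleading times.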
git clone https://github.com/Elfsong/Mercury_Eval.git
cd Mercury_Eval
uv sync --extra all

# Individual solver backends
uv sync --extra openai
uv sync --extra gemini
uv sync --extra huggingface
# Everything (all solvers + .env support)
uv sync --extra all

Create a .env file in the project root:
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=AI...
HF_TOKEN=hf_...
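Before a long run it can help to check which credentials are actually set. The sketch below is illustrative only: `configured_backends` is a hypothetical helper, and the mapping from backend name to environment variable is an assumption inferred from the extras and key names listed above.

```python
import os

# Assumed mapping, inferred from the extras and .env keys shown above.
REQUIRED_KEY = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "huggingface": "HF_TOKEN",
}


def configured_backends(env=None):
    """Return the backends whose API key is present and non-empty."""
    env = os.environ if env is None else env
    return sorted(name for name, key in REQUIRED_KEY.items() if env.get(key))


print(configured_backends())
```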
# Evaluate reference solutions
mercury-eval
# Evaluate with a specific model (backend auto-detected)
mercury-eval gpt-4.1 --limit 20
mercury-eval gemini-2.5-pro --limit 20
mercury-eval Qwen/Qwen2.5-Coder-32B-Instruct --limit 20
# Train split, custom timeout
mercury-eval gpt-4.1 --split train --limit 50 --timeout 120
# Re-collect reference runtimes (ignores cache)
mercury-eval --recollect

Backend auto-detection:
| Model prefix | Backend |
|---|---|
| gemini* | Google Gemini |
| gpt-*, o1-*, o3-*, o4-* | OpenAI |
| org/model (contains /) | HuggingFace Inference |
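The rules in the table can be expressed as a small dispatch function. This is a sketch, not the CLI's actual logic: `detect_backend` is a hypothetical name, and the order in which the rules are checked, the returned labels, and the error raised for unknown names are all assumptions.

```python
def detect_backend(model: str) -> str:
    """Map a model name to a backend label using the prefix rules above."""
    if "/" in model:
        # org/model identifiers are routed to HuggingFace Inference.
        return "huggingface"
    if model.startswith("gemini"):
        return "gemini"
    if model.startswith(("gpt-", "o1-", "o3-", "o4-")):
        return "openai"
    raise ValueError(f"cannot infer a backend for model {model!r}")


for name in ("gpt-4.1", "gemini-2.5-pro", "Qwen/Qwen2.5-Coder-32B-Instruct"):
    print(name, "->", detect_backend(name))
```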
The evaluation displays a live-updating rich terminal UI:
╭──────────────────────── Mercury Eval ────────────────────────╮
│ gpt-4.1 Split: eval Progress: 15/20 Pass@1: 0.867 │
│ Beyond@1: 0.732 │
╰──────────────────────────────────────────────────────────────╯
┏━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━┳━━━━━┓
┃ # ┃ Problem ┃Diff┃STATUS┃CPU Time┃ Pctl ┃ B@1 ┃
┡━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━╇━━━━━┩
│ 1 │ spiral-matrix │ M │ OK │ 1.0µs │ P92 │0.986│
│ 2 │ summary-ranges │ E │ OK │ 2.3µs │ P43 │0.449│
│ 3 │ distinct-subsequences │ H │ OK │0.020ms │ P96 │0.991│
│ 4 │ expression-add-operators │ H │ FAIL │ - │ - │ - │
│ 5 │ burst-balloons │ H │ TLE │ - │ - │ - │
└───┴─────────────────────────────┴────┴──────┴────────┴──────┴─────┘
╭──────────────────────────────────────────────────────────────╮
│ Evaluating ━━━━━━━━━━━━━━━━━━━━━━━━━ 15/20 0:01:23 ETA 0:28 │
╰──────────────────────────────────────────────────────────────╯
from mercury_eval import evaluate
# Evaluate reference solutions
results = evaluate(split="eval", limit=10)
print(f"Pass@1: {results['pass_at_1']:.4f}")
print(f"Beyond@1: {results['beyond_at_1']:.4f}")

Write a function that takes a problem dict and returns Python source code with a class Solution:
from mercury_eval import evaluate

def my_solver(problem: dict) -> str:
    # Call an LLM, look up a cache, etc.
    return '''
class Solution:
    def twoSum(self, nums, target):
        seen = {}
        for i, n in enumerate(nums):
            if target - n in seen:
                return [seen[target - n], i]
            seen[n] = i
'''

results = evaluate(solution_fn=my_solver, split="eval", limit=50)

# Use the built-in OpenAI solver
from functools import partial
from mercury_eval import evaluate
from mercury_eval.solvers.openai import solve_with_openai

results = evaluate(
    solution_fn=partial(solve_with_openai, model="gpt-4.1"),
    limit=10,
)

Run a single solution against test cases:
import json
from datasets import load_dataset
from mercury_eval import Sandbox
ds = load_dataset("Elfsong/Mercury", split="eval")
problem = ds[0]
sandbox = Sandbox(timeout=60)
output = sandbox.run(
    solution_code=problem["solutions"][0]["solution"],
    test_cases=json.loads(problem["test_cases"]),
    entry_point=problem["entry_point"],
    convert_offline=problem["convert_offline"],
    evaluate_offline=problem["evaluate_offline"],
)

for case in output["results"]:
    status = "PASS" if case["passed"] else "FAIL"
    print(f"  {status}  {case['runtime_ms']:.3f}ms  CI[{case['ci95_lo']:.3f}, {case['ci95_hi']:.3f}]")

- Reference collection: Every reference solution in the dataset is executed (128 iterations each) and its CPU time recorded. Results are cached to ref_runtimes_cache.json, so this slow step (~1 min per problem) runs only once per machine.
- Candidate evaluation: Your solver is called for each problem. The returned code runs in an isolated subprocess with the dataset's test cases, also benchmarked with 128 iterations.
- Scoring: Pass@1 counts correct solutions. Beyond@1 normalizes your runtime against the reference distribution. Percentile ranks you among reference solutions.
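The percentile ranking in the scoring step can be sketched as follows. The function name `percentile_rank` is hypothetical, and counting only strictly slower references (so ties do not count in your favor) is an assumption; the source only states that P85 means faster than 85% of references.

```python
def percentile_rank(solution_runtime: float, ref_runtimes: list[float]) -> int:
    """Percentage of reference solutions that are strictly slower."""
    slower = sum(1 for r in ref_runtimes if r > solution_runtime)
    return round(100 * slower / len(ref_runtimes))


refs = [1.0, 2.0, 3.0, 4.0]
print(f"P{percentile_rank(2.5, refs)}")  # two of the four references are slower
```

Unlike Beyond@1, this rank depends only on ordering, so one unusually slow reference does not shift it.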
The subprocess sandbox injects LeetCode-standard imports (TreeNode, ListNode, math, collections, etc.) so solutions don't need boilerplate.
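To make the idea concrete, here is a minimal sketch of prelude injection combined with an isolated, time-limited subprocess. It is not the package's Sandbox class: `PRELUDE`, `run_isolated`, the JSON-based argument passing, and the exact set of injected imports are all assumptions chosen for illustration.

```python
import json
import subprocess
import sys

# Hypothetical prelude; the real sandbox's injected names may differ.
PRELUDE = """
import math, collections, heapq, itertools
from typing import List, Optional

class ListNode:
    def __init__(self, val=0, next=None):
        self.val, self.next = val, next

class TreeNode:
    def __init__(self, val=0, left=None, right=None):
        self.val, self.left, self.right = val, left, right
"""


def run_isolated(solution_code, method, args, expected, timeout=5.0):
    """Run one test case in a child interpreter; return 'OK', 'FAIL', or 'TLE'."""
    # The driver decodes the arguments, calls Solution().<method>, prints JSON.
    driver = (
        "\nimport json\n"
        f"result = getattr(Solution(), {method!r})(*json.loads({json.dumps(args)!r}))\n"
        "print(json.dumps(result))\n"
    )
    try:
        proc = subprocess.run(
            [sys.executable, "-c", PRELUDE + solution_code + driver],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "TLE"
    if proc.returncode != 0:
        return "FAIL"
    try:
        return "OK" if json.loads(proc.stdout) == expected else "FAIL"
    except ValueError:
        return "FAIL"  # output was not valid JSON


code = """
class Solution:
    def twoSum(self, nums: List[int], target: int) -> List[int]:
        seen = {}
        for i, n in enumerate(nums):
            if target - n in seen:
                return [seen[target - n], i]
            seen[n] = i
"""
print(run_isolated(code, "twoSum", [[2, 7, 11, 15], 9], [0, 1]))
```

Note that the solution uses `List` without importing it; the prelude supplies it, which is the point of injecting the boilerplate. Running in a child process also means a crash or infinite loop in the candidate cannot take down the evaluator.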
The Mercury dataset contains:
- 1,889 problems (train: 1,633 / eval: 256)
- Multiple reference solutions per problem with recorded runtimes
- Test cases with input/output pairs
- convert_offline / evaluate_offline for type conversion and custom evaluation
License: MIT
