Elfsong/Mercury_Eval

mercury-eval

Evaluate code efficiency on the Mercury benchmark.

Metrics

| Metric | Description |
| --- | --- |
| Pass@1 | Fraction of problems where the solution passes all test cases |
| Beyond@1 | Efficiency score in [0, 1]: how close the solution is to the fastest reference solution |
| Percentile | Rank among reference solutions; P85 means faster than 85% of references |
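The Percentile metric can be illustrated with a small sketch. The function name `percentile_rank` is hypothetical, and counting only strictly slower references is an assumption; the package may handle ties differently.

```python
def percentile_rank(solution_runtime: float, ref_runtimes: list[float]) -> float:
    """Percentage of reference runtimes that are strictly slower than the solution.

    Assumption: ties do not count as "faster than"; mercury-eval may differ.
    """
    if not ref_runtimes:
        raise ValueError("need at least one reference runtime")
    slower = sum(1 for r in ref_runtimes if r > solution_runtime)
    return 100.0 * slower / len(ref_runtimes)


refs = [1.0, 2.0, 3.0, 4.0]
# 3 of the 4 references are slower than 1.5, so this prints "P75".
print(f"P{percentile_rank(1.5, refs):.0f}")
```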

Beyond@1

score = (max_ref_runtime − solution_runtime) / (max_ref_runtime − min_ref_runtime)

The score is clamped to [0, 1]. 1.0 means the solution matches or beats the fastest reference; 0.0 means it is as slow as, or slower than, the slowest reference.
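As a minimal sketch, the formula and clamping above could be written as follows. The name `beyond_score` is hypothetical, and the handling of the degenerate case where all references have the same runtime is an assumption, since the formula itself does not define it.

```python
def beyond_score(solution_runtime: float, ref_runtimes: list[float]) -> float:
    """Normalize a runtime against the fastest and slowest reference, clamped to [0, 1]."""
    fastest, slowest = min(ref_runtimes), max(ref_runtimes)
    if slowest == fastest:
        # Assumption: with a zero-width reference range, score 1.0 if the
        # solution is at least as fast as the references, else 0.0.
        return 1.0 if solution_runtime <= fastest else 0.0
    raw = (slowest - solution_runtime) / (slowest - fastest)
    return max(0.0, min(1.0, raw))


print(beyond_score(2.0, [1.0, 3.0]))  # halfway between fastest and slowest: 0.5
print(beyond_score(0.5, [1.0, 3.0]))  # faster than every reference: clamped to 1.0
```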

Runtime Measurement

Each solution is benchmarked with:

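  • 128 timed iterations per test case, each on a copy.deepcopy of the inputs, so in-place mutation in one iteration cannot affect the next
  • CPU time (user + system) via resource.getrusage, the same quantities /usr/bin/time -v reports; far less sensitive to system load than wall-clock time
  • A 95% confidence interval computed from per-iteration wall-clock times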
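The measurement scheme above can be sketched roughly as follows. This is not the package's implementation: the function names, the returned keys, the normal-approximation interval (1.96 standard errors), and copying the inputs outside the timed region are all assumptions made for illustration. The `resource` module is Unix-only.

```python
import copy
import resource
import statistics
import time


def cpu_seconds() -> float:
    """User + system CPU time consumed by this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_utime + usage.ru_stime


def benchmark(fn, args: tuple, iterations: int = 128) -> dict:
    """Time fn(*args) repeatedly, giving every iteration a fresh copy of args."""
    cpu_total = 0.0
    wall_ms = []
    for _ in range(iterations):
        fresh = copy.deepcopy(args)  # copied before the clocks start
        cpu_start = cpu_seconds()
        wall_start = time.perf_counter()
        fn(*fresh)
        wall_ms.append((time.perf_counter() - wall_start) * 1000.0)
        cpu_total += cpu_seconds() - cpu_start
    mean = statistics.fmean(wall_ms)
    # Normal-approximation 95% interval around the mean wall-clock time.
    half_width = 1.96 * statistics.stdev(wall_ms) / len(wall_ms) ** 0.5
    return {
        "cpu_ms": cpu_total * 1000.0 / iterations,
        "wall_mean_ms": mean,
        "ci95_lo": mean - half_width,
        "ci95_hi": mean + half_width,
    }


stats = benchmark(sorted, ([5, 3, 1, 4, 2] * 200,))
print(f"{stats['cpu_ms']:.4f}ms CPU  CI[{stats['ci95_lo']:.4f}, {stats['ci95_hi']:.4f}]")
```

Because each call receives its own deep copy, a function that sorts or clears its input in place still sees the original data on every iteration.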

Installation

git clone https://github.com/Elfsong/Mercury_Eval.git
cd Mercury_Eval
uv sync --extra all

Optional dependencies

# Individual solver backends
uv sync --extra openai
uv sync --extra gemini
uv sync --extra huggingface

# Everything (all solvers + .env support)
uv sync --extra all

API keys

Create a .env file in the project root:

OPENAI_API_KEY=sk-...
GEMINI_API_KEY=AI...
HF_TOKEN=hf_...
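Before a long run, it can help to confirm that the key for the chosen backend is actually set. The sketch below uses only the standard library; the mapping from backend to variable is inferred from the variable names above, and `missing_key` is a hypothetical helper, not part of the package.

```python
import os
from typing import Mapping, Optional

# Assumption: each backend reads exactly one of the variables listed above.
REQUIRED_KEYS = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "huggingface": "HF_TOKEN",
}


def missing_key(backend: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the name of the unset variable for this backend, or None if it is set."""
    name = REQUIRED_KEYS[backend]
    return None if env.get(name) else name


for backend in REQUIRED_KEYS:
    name = missing_key(backend)
    print(f"{backend}: {'ok' if name is None else 'missing ' + name}")
```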

Quick Start

# Evaluate reference solutions
mercury-eval

# Evaluate with a specific model (backend auto-detected)
mercury-eval gpt-4.1 --limit 20
mercury-eval gemini-2.5-pro --limit 20
mercury-eval Qwen/Qwen2.5-Coder-32B-Instruct --limit 20

# Train split, custom timeout
mercury-eval gpt-4.1 --split train --limit 50 --timeout 120

# Re-collect reference runtimes (ignores cache)
mercury-eval --recollect

Backend auto-detection:

| Model prefix | Backend |
| --- | --- |
| gemini* | Google Gemini |
| gpt-*, o1-*, o3-*, o4-* | OpenAI |
| org/model (contains /) | HuggingFace Inference |
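The detection rules above could be expressed along these lines. `detect_backend` and the returned backend labels are hypothetical, and checking for `/` before the prefixes is an assumption about precedence that the rules do not specify.

```python
OPENAI_PREFIXES = ("gpt-", "o1-", "o3-", "o4-")


def detect_backend(model: str) -> str:
    """Map a model name to a backend label using the prefix rules above."""
    if "/" in model:  # org/model style names
        return "huggingface"
    if model.startswith("gemini"):
        return "gemini"
    if model.startswith(OPENAI_PREFIXES):
        return "openai"
    raise ValueError(f"cannot infer a backend for model {model!r}")


for name in ("gpt-4.1", "gemini-2.5-pro", "Qwen/Qwen2.5-Coder-32B-Instruct"):
    print(name, "->", detect_backend(name))
```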

Terminal UI

The evaluation displays a live-updating rich terminal UI:

╭──────────────────────── Mercury Eval ────────────────────────╮
│ gpt-4.1  Split: eval  Progress: 15/20  Pass@1: 0.867        │
│ Beyond@1: 0.732                                              │
╰──────────────────────────────────────────────────────────────╯
┏━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━┳━━━━━┓
┃ # ┃ Problem                     ┃Diff┃STATUS┃CPU Time┃ Pctl ┃ B@1 ┃
┡━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━╇━━━━━┩
│ 1 │ spiral-matrix               │ M  │  OK  │  1.0µs │  P92 │0.986│
│ 2 │ summary-ranges              │ E  │  OK  │  2.3µs │  P43 │0.449│
│ 3 │ distinct-subsequences       │ H  │  OK  │0.020ms │  P96 │0.991│
│ 4 │ expression-add-operators    │ H  │ FAIL │      - │    - │   - │
│ 5 │ burst-balloons              │ H  │  TLE │      - │    - │   - │
└───┴─────────────────────────────┴────┴──────┴────────┴──────┴─────┘
╭──────────────────────────────────────────────────────────────╮
│ Evaluating ━━━━━━━━━━━━━━━━━━━━━━━━━ 15/20 0:01:23 ETA 0:28 │
╰──────────────────────────────────────────────────────────────╯

Python API

from mercury_eval import evaluate

# Evaluate reference solutions
results = evaluate(split="eval", limit=10)
print(f"Pass@1:   {results['pass_at_1']:.4f}")
print(f"Beyond@1: {results['beyond_at_1']:.4f}")

Custom solver

Write a function that takes a problem dict and returns Python source code that defines a class named Solution:

from mercury_eval import evaluate

def my_solver(problem: dict) -> str:
    # Call an LLM, look up a cache, etc.
    return '''
class Solution:
    def twoSum(self, nums, target):
        seen = {}
        for i, n in enumerate(nums):
            if target - n in seen:
                return [seen[target - n], i]
            seen[n] = i
'''

results = evaluate(solution_fn=my_solver, split="eval", limit=50)
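Since a solver may call a paid LLM API, caching its output on disk avoids regenerating code on repeated runs. The wrapper below is a sketch under stated assumptions: `cached` and the `solver_cache` directory are hypothetical, and the cache key is a hash of the whole problem dict because the README does not name a problem ID field.

```python
import hashlib
import json
from pathlib import Path


def cached(solver, cache_dir: str = "solver_cache"):
    """Wrap a solver so each distinct problem is solved at most once."""
    root = Path(cache_dir)
    root.mkdir(parents=True, exist_ok=True)

    def wrapper(problem: dict) -> str:
        # Assumption: the serialized problem dict is a stable identifier.
        payload = json.dumps(problem, sort_keys=True, default=str)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        path = root / f"{key}.py"
        if path.exists():
            return path.read_text(encoding="utf-8")
        code = solver(problem)
        path.write_text(code, encoding="utf-8")
        return code

    return wrapper
```

Passing `cached(my_solver)` as `solution_fn` should then behave like `my_solver`, except that repeated problems are served from disk.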

Built-in solvers

from functools import partial
from mercury_eval import evaluate
from mercury_eval.solvers.openai import solve_with_openai

results = evaluate(
    solution_fn=partial(solve_with_openai, model="gpt-4.1"),
    limit=10,
)

Sandbox (low-level)

Run a single solution against test cases:

import json
from datasets import load_dataset
from mercury_eval import Sandbox

ds = load_dataset("Elfsong/Mercury", split="eval")
problem = ds[0]

sandbox = Sandbox(timeout=60)
output = sandbox.run(
    solution_code=problem["solutions"][0]["solution"],
    test_cases=json.loads(problem["test_cases"]),
    entry_point=problem["entry_point"],
    convert_offline=problem["convert_offline"],
    evaluate_offline=problem["evaluate_offline"],
)

for case in output["results"]:
    status = "PASS" if case["passed"] else "FAIL"
    print(f"  {status}  {case['runtime_ms']:.3f}ms  CI[{case['ci95_lo']:.3f}, {case['ci95_hi']:.3f}]")
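To reduce the per-case output to a single verdict, a small helper like the one below may be convenient. It relies only on the `results`, `passed`, and `runtime_ms` keys used in the example above; `summarize` itself is hypothetical, and skipping failed cases when summing runtimes is an assumption.

```python
def summarize(output: dict) -> dict:
    """Collapse per-case sandbox results into pass counts and a total runtime."""
    cases = output["results"]
    passed = [c for c in cases if c["passed"]]
    return {
        "all_passed": bool(cases) and len(passed) == len(cases),
        "passed": len(passed),
        "total": len(cases),
        # Only passing cases contribute to the runtime total.
        "total_runtime_ms": sum(c["runtime_ms"] for c in passed),
    }


demo = {"results": [{"passed": True, "runtime_ms": 0.5},
                    {"passed": False, "runtime_ms": 0.25}]}
print(summarize(demo))
```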

How It Works

  1. Reference collection: Every reference solution in the dataset is executed (128 iterations each) and its CPU time is recorded. Results are cached in ref_runtimes_cache.json, so this slow step (~1 min per problem) runs only once per machine unless you pass --recollect.

  2. Candidate evaluation: Your solver is called for each problem. The returned code runs in an isolated subprocess against the dataset's test cases and is benchmarked with the same 128 iterations.

  3. Scoring: Pass@1 counts correct solutions. Beyond@1 normalizes your runtime against the reference distribution. Percentile ranks you among the reference solutions.

The subprocess sandbox injects LeetCode-standard imports (TreeNode, ListNode, math, collections, etc.) so solutions don't need boilerplate.
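The general idea of subprocess isolation with injected imports can be sketched as below. This is not the package's sandbox: `run_isolated` and `PRELUDE` are hypothetical, the prelude covers only two standard-library modules, and the OK / FAIL / TLE labels are borrowed from the terminal UI's STATUS column.

```python
import subprocess
import sys

# Hypothetical prelude standing in for the imports the real sandbox injects.
PRELUDE = "import math, collections\n"


def run_isolated(source: str, timeout: float = 60.0) -> str:
    """Run source in a fresh interpreter and report OK, FAIL, or TLE."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", PRELUDE + source],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "TLE"  # the child is killed once the timeout expires
    return "OK" if proc.returncode == 0 else "FAIL"


print(run_isolated("assert math.isqrt(16) == 4"))
print(run_isolated("raise RuntimeError('boom')"))
```

Running candidates in a separate process means a crash or an infinite loop in generated code cannot take down the evaluator itself.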

Dataset

The Mercury dataset contains:

  • 1,889 problems (train: 1,633 / eval: 256)
  • Multiple reference solutions per problem with recorded runtimes
  • Test cases with input/output pairs
  • convert_offline / evaluate_offline for type conversion and custom evaluation

License

MIT
