
LiveCodeBench Pro - LLM Benchmarking Toolkit

This repository contains a benchmarking toolkit for evaluating Large Language Models (LLMs) on competitive programming tasks. The toolkit provides a standardized way to test your LLM's code generation capabilities across a diverse set of problems.

Overview

LiveCodeBench Pro evaluates LLMs on their ability to generate solutions for programming problems. The benchmark includes problems of varying difficulty levels from different competitive programming platforms.

Getting Started

Prerequisites

  • Ubuntu 20.04 or higher (or another distribution with kernel version >= 3.10 and cgroup support; see go-judge for details)
  • Python 3.12 or higher
  • pip package manager
  • Docker (for running the judge server); your user must have permission to run Docker commands

Installation

  1. Install the required dependencies:

    pip install -r requirements.txt

    Or, if you use uv, sync the dependencies:

    uv sync
  2. Ensure Docker is installed and running:

    docker --version

    Make sure your user has permission to run Docker commands. On Linux, you may need to add your user to the docker group:

    sudo usermod -aG docker $USER

    Then log out and back in for the changes to take effect.
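
    To confirm the change took effect, you can run Docker's self-test image:

    docker run hello-world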

How to Use

Step 1: Implement Your LLM Interface

Create your own LLM class by extending the abstract LLMInterface class in api_interface.py. Your implementation needs to override the call_llm method.

Example:

from api_interface import LLMInterface

class YourLLM(LLMInterface):
    def __init__(self):
        super().__init__()
        # Initialize your LLM client or other resources here
        self.client = ...  # e.g., an SDK client for your model

    def call_llm(self, user_prompt: str):
        # Call your LLM with user_prompt and return a tuple of
        # (response_text, metadata)
        response = self.client.generate(user_prompt)  # placeholder call
        return response.text, response.metadata

You can use the ExampleLLM class as a reference; it shows how to integrate with OpenAI's API.
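
For instance, a minimal OpenAI-backed implementation might look like the sketch below. This is illustrative, not the actual ExampleLLM code; the model name and the choice of token usage as metadata are assumptions.

from openai import OpenAI

from api_interface import LLMInterface

class SketchOpenAILLM(LLMInterface):
    def __init__(self):
        super().__init__()
        # The OpenAI client reads OPENAI_API_KEY from the environment.
        self.client = OpenAI()

    def call_llm(self, user_prompt: str):
        # Send the prompt as a single user message; "gpt-4o" is illustrative.
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": user_prompt}],
        )
        # Return (response_text, metadata); token usage serves as metadata here.
        text = response.choices[0].message.content
        metadata = response.usage.model_dump() if response.usage else {}
        return text, metadata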

Step 2: Configure the Benchmark

Edit the benchmark.py file to use your LLM implementation:

from your_module import YourLLM

# Replace the default LLM instance with your own implementation:
llm_instance = YourLLM()

Also adjust the number of judge workers; a value no greater than the number of physical CPU cores is recommended.
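
If you prefer to derive the worker count from the machine rather than hard-coding it, something like the following works; NUM_JUDGE_WORKERS is a hypothetical name, so use whatever variable benchmark.py actually defines:

import os

# os.cpu_count() reports logical cores; halving is a rough proxy for physical
# cores on machines with SMT. Adjust for your hardware.
NUM_JUDGE_WORKERS = max(1, (os.cpu_count() or 2) // 2)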

Step 3: Run the Benchmark

Execute the benchmark script:

python benchmark.py

The script will:

  1. Load the LiveCodeBench-Pro dataset from Hugging Face
  2. Process each problem with your LLM
  3. Extract C++ code from LLM responses automatically
  4. Submit solutions to the integrated judge system for evaluation
  5. Collect judge results and generate comprehensive statistics
  6. Save the results to benchmark_result.json (see the snippet below for a quick way to inspect the file)
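
The exact schema of benchmark_result.json is defined by benchmark.py; the snippet below is a schema-agnostic way to sanity-check the output after a run.

import json

with open("benchmark_result.json") as f:
    results = json.load(f)

# Print the top-level structure without assuming a particular schema.
if isinstance(results, dict):
    print("top-level keys:", list(results.keys()))
elif isinstance(results, list):
    print(f"{len(results)} entries; first entry:")
    print(json.dumps(results[0], indent=2))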

(Optional) Step 4: Submit Your Results

Email your benchmark_result.json file to zz4242@nyu.edu to have it displayed on the leaderboard.

Please include the following information in your submission:

  • LLM name and version
  • Any relevant details about your setup (e.g., prompting strategy or sampling parameters)
  • Contact information

Understanding the Codebase

api_interface.py

This file defines the abstract interface for LLM integration:

  • LLMInterface: Abstract base class with methods for LLM interaction
  • ExampleLLM: Example implementation with OpenAI's GPT-4o

benchmark.py

The main benchmarking script that:

  • Loads the dataset
  • Processes each problem through your LLM
  • Extracts C++ code from responses
  • Submits solutions to the judge system
  • Collects results and generates statistics
  • Saves comprehensive results with judge verdicts

judge.py

Contains the judge system integration:

  • Judge: Abstract base class for judge implementations (sketched after this list)
  • LightCPVerifierJudge: LightCPVerifier integration for local solution evaluation
  • Automatic problem data downloading from Hugging Face
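
The exact interface isn't reproduced here, but an abstract judge class typically has roughly the following shape. The method name and signature are assumptions for illustration, not the actual judge.py API.

from abc import ABC, abstractmethod

class JudgeSketch(ABC):
    """Illustrative shape of a judge; the real Judge class may differ."""

    @abstractmethod
    def judge(self, problem_id: str, source_code: str) -> dict:
        """Run source_code against problem_id's test cases and return a verdict."""
        ...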

util.py

Utility functions for code processing:

  • extract_longest_cpp_code(): extracts the longest C++ code block from an LLM response (sketched below)
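
As a rough illustration of the idea (not the actual implementation), extracting the longest fenced C++ block from a markdown-formatted response could look like this:

import re

def extract_longest_cpp_sketch(response: str) -> str:
    # A simplified sketch: the real extract_longest_cpp_code() may also handle
    # unfenced code, other language tags, and similar edge cases.
    blocks = re.findall(r"```(?:cpp|c\+\+)?\s*\n(.*?)```", response, re.DOTALL)
    return max(blocks, key=len) if blocks else ""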

Dataset

The benchmark uses the QAQAQAQAQ/LiveCodeBench-Pro and QAQAQAQAQ/LiveCodeBench-Pro-Testcase datasets from Hugging Face, which contain competitive programming problems of varying difficulty along with their test cases.
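
Both datasets can be pulled with the standard Hugging Face datasets library, for example (split and field names depend on how the datasets are published):

from datasets import load_dataset

# Downloads and caches the problem set from the Hugging Face Hub.
problems = load_dataset("QAQAQAQAQ/LiveCodeBench-Pro")
print(problems)  # shows the available splits and their features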

Contact

For questions or support, please contact us at zz4242@nyu.edu.
