A Framework for Evaluating and Developing Next-Generation Unified Agents
- CocoaBench Dataset — Benchmark tasks for agents that solve complex problems by writing code, operating GUIs, and more
- CocoaAgent Framework — Model-agnostic agent executor that equips agents with general tools (browser, terminal, file operations, code interpreter) via AIO Sandbox
- Python 3.13+
- Docker & Docker Compose
- uv (recommended) or pip
```bash
# 1. Download and decrypt
curl -LO https://cocoabench.github.io/assets/data/cocoa-bench-v0.1.zip
unzip cocoa-bench-v0.1.zip && rm cocoa-bench-v0.1.zip
python decrypt.py

# 2. Browse tasks
ls cocoa-bench-v0.1/
```

Each task directory contains:
| File | Purpose |
|---|---|
| `task.yaml` | Task instruction to give your agent |
| `test.py` | Evaluation script with a `test(result)` function |
| `Dockerfile` | Task environment setup |
| `docker-compose.yaml` | Docker config |
| `assets/` | Additional files for the task (optional) |
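As a quick sanity check after decrypting, you can iterate over the task directories and print each instruction. This is a minimal sketch: the exact `task.yaml` schema isn't specified here, so the `instruction` field below is an assumption.

```python
from pathlib import Path

import yaml  # pip install pyyaml

# Print each task's instruction (assumes task.yaml has an "instruction" field).
for task_dir in sorted(Path("cocoa-bench-v0.1").iterdir()):
    task_file = task_dir / "task.yaml"
    if not task_file.is_file():
        continue
    task = yaml.safe_load(task_file.read_text())
    print(task_dir.name, "->", str(task.get("instruction", ""))[:80])
```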
**Evaluation:** Each `test.py` exports a `test(result)` function. If you're using your own agent, you typically just need to pass `{"task_result": "<agent's final answer>"}`. See Evaluation for details.
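For example, if you're bringing your own agent, you can load a task's `test.py` and call `test()` directly. A minimal sketch (the task path below is a placeholder):

```python
import importlib.util

# Load one task's test.py as a module (path is illustrative).
spec = importlib.util.spec_from_file_location(
    "task_test", "cocoa-bench-v0.1/example-task/test.py"
)
task_test = importlib.util.module_from_spec(spec)
spec.loader.exec_module(task_test)

# Pass your agent's final answer; most scripts only need task_result.
verdict = task_test.test({"task_result": "<agent's final answer>"})
print(verdict["passed"], verdict["feedback"])
```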
```bash
# 1. Install
git clone https://github.com/cocoabench/cocoa-agent.git && cd cocoa-agent
uv sync  # or: pip install -r requirements.txt

# 2. Choose tasks
# See included example tasks: cocoabench-example-tasks/
# Or download the full benchmark dataset: follow Option A above

# 3. Configure
cp configs/default_gpt.json configs/my-config.json
# Edit my-config.json: set your API key

# 4. Run with example tasks
python inference_main.py \
    --config configs/my-config.json \
    --tasks-dir cocoabench-example-tasks/ \
    --output-dir results/

# Or run with the full dataset (after downloading):
# python inference_main.py \
#     --config configs/my-config.json \
#     --tasks-dir cocoa-bench-v0.1/ \
#     --output-dir results/
```

Edit your config file to customize the agent:
```json
{
  "controller": {
    "type": "llm",
    "args": {
      "model": "gpt-5.2",
      "api_key": "sk-...",
      "base_url": ""
    }
  },
  "sandbox": {
    "docker_port": 8080,
    "max_iterations": 30
  }
}
```

| Key | Description |
|---|---|
| `controller.args.model` | Model name (e.g., `gpt-5.2`) |
| `controller.args.api_key` | Your API key |
| `controller.args.base_url` | Custom endpoint for local models (optional) |
| `sandbox.docker_port` | Port for the sandbox container (default: `8080`) |
| `sandbox.max_iterations` | Max agent iterations per task (default: `30`) |
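For instance, to point the controller at a locally hosted OpenAI-compatible server via `base_url`, a config might look like the sketch below; the model name, key, and URL are placeholders, not values the framework requires.

```json
{
  "controller": {
    "type": "llm",
    "args": {
      "model": "my-local-model",
      "api_key": "EMPTY",
      "base_url": "http://localhost:8000/v1"
    }
  },
  "sandbox": {
    "docker_port": 8080,
    "max_iterations": 30
  }
}
```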
Each task includes a `test.py` that runs on the host machine after the agent completes. The framework calls `test(result)` with the full execution result and expects a pass/fail verdict.
```python
def test(result: dict) -> dict:
    """Evaluate task results after execution.

    Args:
        result: Complete execution result containing:
            - task_result: Agent's final answer
            - conversation: Full message history with the controller
            - execution_trace: All actions and their outputs
            - status: Task status ("success" or "failed")
            - instruction: Original task instruction
            - iterations: Number of iterations completed
            - sandbox: Sandbox configuration (docker_port, etc.)

    Returns:
        Dictionary with:
            - passed (bool): Whether the task passed evaluation
            - feedback (str): Human-readable evaluation message
            - details (dict, optional): Additional metrics
    """
```

> [!TIP]
> Most `test.py` scripts first try to extract the answer from `task_result`, then fall back to searching the conversation history. If you're using your own agent, you can typically just pass `task_result` with the agent's final answer.
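Putting this together, a hypothetical `test.py` for a task whose expected answer is "42" might follow that extract-then-fallback pattern. This is a sketch, not a real benchmark task; the shape of `conversation` messages (dicts with a `content` key) is an assumption:

```python
import re


def test(result: dict) -> dict:
    """Sketch of a typical evaluator; the expected answer '42' is hypothetical."""
    # 1. Prefer the agent's final answer.
    answer = (result.get("task_result") or "").strip()

    # 2. Fall back to scanning the conversation history (message shape assumed).
    if not answer:
        for message in reversed(result.get("conversation", [])):
            match = re.search(r"\b42\b", str(message.get("content", "")))
            if match:
                answer = match.group(0)
                break

    passed = answer == "42"
    return {
        "passed": passed,
        "feedback": "Correct answer." if passed
                    else f"Expected '42', got '{answer or '(empty)'}'.",
    }
```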
Results are saved to `results/<task-name>.json` when using the CocoaAgent framework.
Learn more:
- Evaluation Guide — Complete result dictionary structure and return format
- Sandbox API Reference — How to access files and state inside the sandbox container
We welcome new benchmark tasks! See `contrib/CONTRIBUTING.md` for guidelines.
> [!IMPORTANT]
> Please encrypt your task before submitting a PR to keep benchmark data safe.
```bibtex
@misc{cocoabench2025,
  title={CocoaBench: An Evaluation Framework for General Agents with Compositional Cognitive Abilities},
  author={Shibo Hao and Zhining Zhang and Zhiqi Liang and Tianyang Liu and Zilong Wang and others},
  howpublished={Blog post},
  month={December},
  year={2025},
  url={https://cocoabench.github.io/}
}
```