feat: Support Tau-Bench and BFCL evaluation benchmarks #11953
neubig merged 5 commits into OpenHands:main
Conversation
neubig
left a comment
Hi @aaron-seq, thanks so much! We'd be happy to accept this if you can fix the issues that are causing the GitHub CI to fail. As a note, though, we're transitioning our benchmarks over to https://github.com/OpenHands/benchmarks, so it'd be even better if we could get a contribution there.
@OpenHands fix linting and merge conflicts
I'm on it! neubig can track my progress at all-hands.dev |
- Resolve poetry.lock conflicts
- Remove unused imports in tau_bench/run_infer.py and bfcl/run_infer.py
- Apply ruff formatting fixes

Co-authored-by: openhands <openhands@all-hands.dev>
I've successfully fixed the linting issues and resolved the merge conflicts for PR #11953. Here's a summary of the changes:

Changes Made

1. Resolved Merge Conflicts

2. Fixed Linting Issues
The following files had linting issues fixed by ruff:
- evaluation/benchmarks/tau_bench/run_infer.py
- evaluation/benchmarks/bfcl/run_infer.py
- evaluation/utils/shared.py

Commit
The changes were committed and pushed to the fork.
The ruff auto-fix incorrectly removed the Awaitable import and changed the type hint, which would break other benchmarks that use async process_instance functions.

Co-authored-by: openhands <openhands@all-hands.dev>
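For context, here is a minimal sketch of the kind of signature at stake, assuming the shared evaluation runner accepts either a sync or an async process_instance callable. The names EvalOutput, ProcessInstanceFn, and run_one are illustrative, not the actual evaluation/utils/shared.py API:

```python
# Illustrative sketch only, not the actual evaluation/utils/shared.py code.
# It shows why the Awaitable import matters: when process_instance may be an
# async function, the annotation needs Awaitable[...], so stripping the import
# (as the ruff auto-fix did) breaks the type hint.
import asyncio
import inspect
from typing import Awaitable, Callable, Union


class EvalOutput:  # stand-in for the real per-instance result type
    pass


# A process_instance function may return the result directly (sync) or an
# awaitable of it (async def).
ProcessInstanceFn = Callable[..., Union[EvalOutput, Awaitable[EvalOutput]]]


def run_one(process_instance: ProcessInstanceFn, instance: dict) -> EvalOutput:
    """Run a single instance, supporting both sync and async callables."""
    result = process_instance(instance)
    if inspect.isawaitable(result):
        # Wrap in a coroutine so asyncio.run() accepts any awaitable.
        result = asyncio.run(_await(result))
    return result


async def _await(awaitable: Awaitable[EvalOutput]) -> EvalOutput:
    return await awaitable
```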
Description
This PR adds support for two new evaluation benchmarks: Tau-Bench and BFCL (Berkeley Function-Calling Leaderboard).
Changes
Tau-Bench:
Added evaluation/benchmarks/tau_bench/ directory.
Implemented run_infer.py to interface with the Tau-Bench environment.
Added robustness handling for a missing tau-bench package (see the sketch after this list).
BFCL:
Added evaluation/benchmarks/bfcl/ directory.
Implemented run_infer.py for function-calling evaluation.
Dependencies:
Added pandas to pyproject.toml (needed for evaluation utilities).
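As a rough illustration of the robustness handling mentioned under Tau-Bench above, here is a hedged sketch of a guarded import; the helper name require_tau_bench and the messages are hypothetical, not the PR's exact code:

```python
# Hypothetical sketch of the "missing tau-bench package" guard; the helper
# name and messages are illustrative, not the PR's implementation.
import sys

try:
    import tau_bench  # optional external benchmark package
except ImportError:
    tau_bench = None


def require_tau_bench() -> None:
    """Exit with an actionable message instead of a bare ImportError."""
    if tau_bench is None:
        print(
            'The tau-bench package is not installed. Install it following '
            'the instructions in the tau-bench repository, then re-run '
            'evaluation/benchmarks/tau_bench/run_infer.py.',
            file=sys.stderr,
        )
        sys.exit(1)
```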
How to Test
Tau-Bench:
poetry run python evaluation/benchmarks/tau_bench/run_infer.py --agent-cls CodeActAgent --max-iterations 1 --eval-n-limit 1
BFCL:
poetry run python evaluation/benchmarks/bfcl/run_infer.py --agent-cls CodeActAgent --max-iterations 1 --eval-n-limit 1
Related Issue
Fixes #10629