feat: Support Tau-Bench and BFCL evaluation benchmarks #11953

Merged
neubig merged 5 commits into OpenHands:main from aaron-seq:feat/tau-bench-bfcl-support
Dec 31, 2025


@aaron-seq
Contributor

Description
This PR adds support for two new evaluation benchmarks: Tau-Bench and BFCL (Berkeley Function-Calling Leaderboard).

Changes
- Tau-Bench:
  - Added the evaluation/benchmarks/tau_bench/ directory.
  - Implemented run_infer.py to interface with the Tau-Bench environment.
  - Added robust handling for the case where the tau-bench package is missing.
- BFCL:
  - Added the evaluation/benchmarks/bfcl/ directory.
  - Implemented run_infer.py for function-calling evaluation.
- Dependencies:
  - Added pandas to pyproject.toml (needed for evaluation utilities).
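
For readers unfamiliar with the pattern, handling a missing optional package usually means checking for it up front and failing with a clear message. The following is a minimal sketch of that idea, not the code from this PR: the helper names `is_package_available` and `require_package` and the wording of the error are assumptions made for illustration. Only the module name `tau_bench` comes from this thread.

```python
import importlib.util


def is_package_available(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    # find_spec returns None when a top-level module is not installed.
    return importlib.util.find_spec(module_name) is not None


def require_package(module_name: str, hint: str) -> None:
    """Raise a descriptive ImportError if an optional dependency is absent.

    Hypothetical helper: the real run_infer.py may handle this differently.
    """
    if not is_package_available(module_name):
        raise ImportError(
            f"The optional package '{module_name}' is required for this "
            f"benchmark but is not installed. {hint}"
        )
```

A benchmark entry point could call `require_package("tau_bench", "See the Tau-Bench project for installation instructions.")` before doing any work, so that a missing dependency surfaces as one readable error rather than a traceback from deep inside the run.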

How to Test
Tau-Bench:
    poetry run python evaluation/benchmarks/tau_bench/run_infer.py --agent-cls CodeActAgent --max-iterations 1 --eval-n-limit 1
BFCL:
    poetry run python evaluation/benchmarks/bfcl/run_infer.py --agent-cls CodeActAgent --max-iterations 1 --eval-n-limit 1

Related Issue
Fixes #10629

Contributor

@neubig neubig left a comment

Hi @aaron-seq, thanks so much! We'd be happy to accept this if you can fix the issues that are causing GitHub CI to fail. Also, as a note, we're transitioning our benchmarks over to https://github.com/OpenHands/benchmarks, so it would be even better if we could get a contribution there.

@neubig
Contributor

neubig commented Dec 31, 2025

@OpenHands fix linting and merge conflicts

@openhands-ai

openhands-ai Bot commented Dec 31, 2025

I'm on it! neubig can track my progress at all-hands.dev

- Resolve poetry.lock conflicts
- Remove unused imports in tau_bench/run_infer.py and bfcl/run_infer.py
- Apply ruff formatting fixes

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai

openhands-ai Bot commented Dec 31, 2025

I've successfully fixed the linting issues and resolved the merge conflicts for PR #11953. Here's a summary of the changes:

Changes Made

1. Resolved Merge Conflicts

  • poetry.lock: Regenerated the lock file to resolve conflicts between the PR branch and upstream main

2. Fixed Linting Issues

The following linting issues were fixed by ruff:

evaluation/benchmarks/tau_bench/run_infer.py:

  • Removed unused imports: json, logging, multiprocessing, time, Event, Runtime
  • Removed unused import EnvInfo from tau_bench.types (F401 error)
  • Added missing blank lines between function definitions
  • Fixed string quote consistency

evaluation/benchmarks/bfcl/run_infer.py:

  • Removed unused imports: json, logging, multiprocessing, time, Any, Event, Runtime
  • Added missing blank lines between function definitions
  • Fixed string quote consistency
  • Reformatted multi-line dictionary literals

evaluation/utils/shared.py:

  • Removed unused import Awaitable
  • Reformatted function signature for better readability
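
As background on the unused-import findings listed above, the idea behind an F401-style check can be approximated in a few lines with the standard `ast` module. This is a rough illustration only, not how ruff is implemented: the function name `find_unused_imports` is made up for this sketch, and it ignores cases a real linter handles, such as `__all__`, re-exports, and names used only inside string annotations.

```python
import ast


def find_unused_imports(source: str) -> list[str]:
    """Return imported names that are never referenced in the source."""
    tree = ast.parse(source)

    # Collect the local name each import statement binds.
    imported: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import a.b" binds the name "a" unless an alias is given.
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)

    # Every bare identifier in the module; "json.dumps" contributes "json".
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}

    return sorted(imported - used)


example = (
    "import json\n"
    "import time\n"
    "from typing import Any\n"
    "\n"
    "print(time.time())\n"
)
unused = find_unused_imports(example)
```

For the sample module above, `unused` contains `Any` and `json`, since only `time` is referenced after being imported.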

Commit

The changes were committed with the message:

Merge upstream/main and fix linting issues

- Resolve poetry.lock conflicts
- Remove unused imports in tau_bench/run_infer.py and bfcl/run_infer.py
- Apply ruff formatting fixes

The changes have been pushed to the fork at aaron-seq/OpenHands on branch feat/tau-bench-bfcl-support, which will automatically update PR #11953.

The ruff auto-fix incorrectly removed the Awaitable import and changed
the type hint, which would break other benchmarks that use async
process_instance functions.

Co-authored-by: openhands <openhands@all-hands.dev>
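
To illustrate why the `Awaitable` import matters here, the sketch below shows a type hint that admits both synchronous and asynchronous `process_instance` functions, plus a caller that handles either. It is written under assumptions: the actual signature in evaluation/utils/shared.py is not shown in this thread, so `Instance`, `Output`, `ProcessInstanceFn`, and `call_process_instance` are hypothetical names used only for this example.

```python
import asyncio
import inspect
from typing import Any, Awaitable, Callable, Dict, Union

# Hypothetical stand-ins for the real instance and output types.
Instance = Dict[str, Any]
Output = Dict[str, Any]

# Without Awaitable in the return type, async implementations would no
# longer match the hint.
ProcessInstanceFn = Callable[[Instance], Union[Output, Awaitable[Output]]]


async def _resolve(awaitable: Awaitable[Output]) -> Output:
    # Wrap any awaitable in a coroutine so asyncio.run can drive it.
    return await awaitable


def call_process_instance(fn: ProcessInstanceFn, instance: Instance) -> Output:
    """Call fn and, if it returned an awaitable, run it to completion."""
    result = fn(instance)
    if inspect.isawaitable(result):
        return asyncio.run(_resolve(result))
    return result


def sync_process_instance(instance: Instance) -> Output:
    return {"id": instance["id"], "mode": "sync"}


async def async_process_instance(instance: Instance) -> Output:
    await asyncio.sleep(0)  # stands in for real asynchronous work
    return {"id": instance["id"], "mode": "async"}
```

Both `call_process_instance(sync_process_instance, {"id": 1})` and `call_process_instance(async_process_instance, {"id": 2})` return a plain dictionary, which is the behavior a narrowed, sync-only type hint would stop describing.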
@neubig neubig enabled auto-merge (squash) December 31, 2025 02:41
@neubig neubig merged commit 4c0f0a1 into OpenHands:main Dec 31, 2025
22 checks passed
@aaron-seq aaron-seq deleted the feat/tau-bench-bfcl-support branch January 4, 2026 10:11
Development

Successfully merging this pull request may close these issues.

Feature request: Support evaluation on Tau-Bench and BFCL

3 participants