feat: Support Tau-Bench and BFCL evaluation benchmarks #11953
neubig merged 5 commits into OpenHands:main
Conversation
neubig
left a comment
Hi @aaron-seq, thanks so much! We'd be happy to accept this if you can fix the issues that are causing the GitHub CI to fail. As a note, though, we're transitioning our benchmarks over to https://github.com/OpenHands/benchmarks, so it'd be even better if we could get a contribution there.
@OpenHands fix linting and merge conflicts
I'm on it! neubig can track my progress at all-hands.dev |
- Resolve poetry.lock conflicts
- Remove unused imports in tau_bench/run_infer.py and bfcl/run_infer.py
- Apply ruff formatting fixes

Co-authored-by: openhands <openhands@all-hands.dev>
I've successfully fixed the linting issues and resolved the merge conflicts for PR #11953. Here's a summary of the changes:

Changes Made

1. Resolved Merge Conflicts

2. Fixed Linting Issues
The following files had linting issues fixed by ruff:
- evaluation/benchmarks/tau_bench/run_infer.py
- evaluation/benchmarks/bfcl/run_infer.py
- evaluation/utils/shared.py

Commit
The changes were committed and pushed to the fork.
The ruff auto-fix incorrectly removed the Awaitable import and changed the type hint, which would break other benchmarks that use async process_instance functions.

Co-authored-by: openhands <openhands@all-hands.dev>
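For context, here is a minimal sketch of the kind of signature at stake, assuming the shared evaluation runner accepts either a sync or an async process_instance callable. The names EvalOutput, ProcessInstanceFn, and run_one are illustrative, not the actual evaluation/utils/shared.py API:

```python
# Illustrative sketch only, not the actual evaluation/utils/shared.py code.
# It shows why the Awaitable import matters: when process_instance may be an
# async function, the annotation needs Awaitable[...], so stripping the import
# (as the ruff auto-fix did) breaks the type hint.
import asyncio
import inspect
from typing import Awaitable, Callable, Union


class EvalOutput:  # stand-in for the real per-instance result type
    pass


# A process_instance function may return the result directly (sync) or an
# awaitable of it (async def).
ProcessInstanceFn = Callable[..., Union[EvalOutput, Awaitable[EvalOutput]]]


def run_one(process_instance: ProcessInstanceFn, instance: dict) -> EvalOutput:
    """Run a single instance, supporting both sync and async callables."""
    result = process_instance(instance)
    if inspect.isawaitable(result):
        # Wrap in a coroutine so asyncio.run() accepts any awaitable.
        result = asyncio.run(_await(result))
    return result


async def _await(awaitable: Awaitable[EvalOutput]) -> EvalOutput:
    return await awaitable
```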
Description
This PR adds support for two new evaluation benchmarks: Tau-Bench and BFCL (Berkeley Function-Calling Leaderboard).
Changes
Tau-Bench:
Added evaluation/benchmarks/tau_bench/ directory.
Implemented run_infer.py to interface with the Tau-Bench environment.
Added robustness handling for a missing tau-bench package (see the sketch after this list).
BFCL:
Added evaluation/benchmarks/bfcl/ directory.
Implemented run_infer.py for function-calling evaluation.
Dependencies:
Added pandas to pyproject.toml (needed for evaluation utilities).
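As a rough illustration of the robustness handling mentioned under Tau-Bench above, here is a hedged sketch of a guarded import; the helper name require_tau_bench and the messages are hypothetical, not the PR's exact code:

```python
# Hypothetical sketch of the "missing tau-bench package" guard; the helper
# name and messages are illustrative, not the PR's implementation.
import sys

try:
    import tau_bench  # optional external benchmark package
except ImportError:
    tau_bench = None


def require_tau_bench() -> None:
    """Exit with an actionable message instead of a bare ImportError."""
    if tau_bench is None:
        print(
            'The tau-bench package is not installed. Install it following '
            'the instructions in the tau-bench repository, then re-run '
            'evaluation/benchmarks/tau_bench/run_infer.py.',
            file=sys.stderr,
        )
        sys.exit(1)
```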
How to Test
Tau-Bench:
poetry run python evaluation/benchmarks/tau_bench/run_infer.py --agent-cls CodeActAgent --max-iterations 1 --eval-n-limit 1
BFCL:
poetry run python evaluation/benchmarks/bfcl/run_infer.py --agent-cls CodeActAgent --max-iterations 1 --eval-n-limit 1
Related Issue
Fixes #10629