
# $\tau$-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains


## 🚀 τ³-bench is here!

From text-only to multimodal, knowledge-aware agent evaluation:
voice full-duplex · knowledge retrieval · 75+ task fixes.

τ-Voice paper · τ-Knowledge paper · Task fixes paper · Release notes

*How do you say $\tau^3$-bench? We just say "tau three," but you do you!*

## What's New in $\tau^3$-bench

- **Knowledge Domain (`banking_knowledge`):** A knowledge-retrieval-based customer service domain with configurable RAG pipelines, document search, embeddings, and agentic shell-based search. Learn more →
- **Voice Full-Duplex (Audio Native):** End-to-end voice evaluation with realtime providers (OpenAI, Gemini, xAI). Learn more →
- **Task Quality (75+ fixes):** Removed incorrect expected actions, clarified ambiguous instructions, fixed impossible constraints, and added missing fallback behaviors across the airline, retail, and banking domains. Based on analysis from SABER (Cuadron et al., 2025). Learn more →
- **Updated Leaderboard:** Now includes voice and knowledge results. Compare model performance at taubench.com. Submit your results →

See CHANGELOG.md for the full version history.

**Backward compatibility note:** If you are evaluating an agent (rather than training one), use the `base` task split: it contains the complete task set and matches the original τ-bench structure. This is the default.

**Upgrading from $\tau^2$-bench?** Installation now uses `uv` instead of `pip install -e .`, and Python `>=3.12,<3.14` is required (previously `>=3.10`). Some internal APIs have been refactored; see CHANGELOG.md for details.

## Overview

$\tau$-bench is a simulation framework for evaluating customer service agents across multiple domains. It supports text-based half-duplex (turn-based) evaluation and voice full-duplex (simultaneous) evaluation using real-time audio APIs.

Each domain specifies:

- A **policy** that the agent must follow
- A set of **tools** that the agent can use
- A set of **tasks** to evaluate the agent's performance
- Optionally, a set of **user tools** for the user simulator

Available domains: `mock` · `airline` · `retail` · `telecom` · `banking_knowledge`

| Mode | Description |
|------|-------------|
| Text (half-duplex) | Turn-based chat with tool use |
| Voice (full-duplex) | End-to-end audio via realtime providers (OpenAI, Gemini, xAI) |
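
For intuition, the components listed above can be pictured roughly as one bundle per domain. This is an illustrative sketch only; every field name below is hypothetical, and the actual data format is documented in the Domains guide:

```python
# Illustrative sketch of what a domain bundles together.
# All field names here are HYPOTHETICAL, not tau2's actual schema;
# see the Domains documentation for the real data format.
example_domain = {
    "policy": "Verify the user's identity before modifying any reservation.",
    "tools": ["get_reservation", "modify_flight", "issue_refund"],  # agent-side tools
    "user_tools": ["check_email"],  # optional tools for the user simulator
    "tasks": [
        {
            "id": "airline_001",
            "user_instructions": "You want to move your flight to the next day.",
            "expected_actions": [("modify_flight", {"shift_days": 1})],
        }
    ],
}
```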

## Quick Start

### 1. Install

```bash
git clone https://github.com/sierra-research/tau2-bench
cd tau2-bench
uv sync                        # core only (text mode: airline, retail, telecom, mock)
```

Optional extras (install what you need):

```bash
uv sync --extra voice          # + voice/audio-native features
uv sync --extra knowledge      # + banking_knowledge domain (retrieval pipeline)
uv sync --extra gym            # + gymnasium RL interface
uv sync --extra dev            # + pytest, ruff, pre-commit (required for contributing)
uv sync --all-extras           # everything
```

These commands require `uv`. Voice features also need system dependencies (`brew install portaudio ffmpeg` on macOS); see the full installation guide for details.

### 2. Set up API keys

```bash
cp .env.example .env
# Edit .env with your API keys (uses LiteLLM, so any supported provider works)
```
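
Before running a full evaluation, you can sanity-check your keys with a one-off call through LiteLLM itself. A minimal sketch, assuming `OPENAI_API_KEY` (or another provider's key) is set in `.env` and that `python-dotenv` is available in your environment:

```python
# Sanity-check .env keys via LiteLLM before running an evaluation.
import litellm
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # load API keys from .env into the environment

response = litellm.completion(
    model="gpt-4.1",  # any LiteLLM-supported model string works here
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)
```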

### 3. Run an evaluation

```bash
tau2 run --domain airline --agent-llm gpt-4.1 --user-llm gpt-4.1 \
  --num-trials 1 --num-tasks 5
```

Results are saved to `data/simulations/`. Use `tau2 view` to browse them.
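
`tau2 view` is the intended way to inspect runs, but the saved files can also be skimmed programmatically. A minimal sketch, assuming each simulation is written as a JSON file under `data/simulations/` (the schema is tau2's own, so this only prints the top-level structure):

```python
# Skim saved simulation files; assumes JSON output under data/simulations/.
import json
from pathlib import Path

for path in sorted(Path("data/simulations").glob("*.json")):
    sim = json.loads(path.read_text())
    # Print only the top-level structure; check a real file for the schema.
    top = list(sim) if isinstance(sim, dict) else f"list of {len(sim)} items"
    print(path.name, top)
```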

> **Tip:** Run `tau2 intro` for an overview of available domains, commands, and examples.

## Documentation

### Getting Started

| Document | Description |
|----------|-------------|
| Getting Started | Installation, API keys, first run, output structure, configuration |
| CLI Reference | All `tau2` commands and options |

### Core Concepts

| Document | Description |
|----------|-------------|
| Agent Developer Guide | Build and evaluate your own agent |
| Domains | Domain structure, data format, and available domains |
| Orchestrator & Communication Modes | Half-duplex and full-duplex orchestration |

### Knowledge Retrieval

| Document | Description |
|----------|-------------|
| Knowledge Retrieval | Retrieval pipeline configs, embeddings, RAG, and sandbox setup for the `banking_knowledge` domain |

### Voice & Audio

| Document | Description |
|----------|-------------|
| Voice (Full-Duplex) | Providers, speech complexity, CLI options, and output structure for voice evaluation |
| Audio Native Architecture | Internal architecture for adding or modifying realtime provider adapters |

### RL & Training

| Document | Description |
|----------|-------------|
| Gym Interface | Gymnasium-compatible environment, play mode, train/test splits |
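
For orientation, driving a Gymnasium-compatible environment follows the standard reset/step loop sketched below. The environment id is hypothetical; the real registration name, observation space, and action space are described in the Gym Interface doc:

```python
# Standard Gymnasium interaction loop.
# The env id below is HYPOTHETICAL; see the Gym Interface doc for the real one.
import gymnasium as gym

env = gym.make("tau2/airline-v0")  # hypothetical id, for illustration only
obs, info = env.reset(seed=0)
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # replace with your agent's policy
    obs, reward, terminated, truncated, info = env.step(action)
env.close()
```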

### Leaderboard & Experiments

| Document | Description |
|----------|-------------|
| Leaderboard Submission | How to submit results to taubench.com |
| Experiments | Experimental features and research code |

### Project

| Document | Description |
|----------|-------------|
| Contributing | How to contribute to τ-bench |
| Changelog | Version history and release notes |

## Contributing

We welcome contributions! Whether you're fixing bugs, adding features, creating domains, or contributing research code, see our Contributing Guide to get started.

## Citation

If you use a specific component of $\tau^3$-bench, please cite the corresponding paper below.

### Knowledge Domain (banking_knowledge)

```bibtex
@article{shi2026tau,
  title={$\tau$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge},
  author={Shi, Quan and Zytek, Alexandra and Razavi, Pedram and Narasimhan, Karthik and Barres, Victor},
  journal={arXiv preprint arXiv:2603.04370},
  year={2026}
}
```

### Voice Full-Duplex Benchmark

```bibtex
@misc{ray2026tauvoicebenchmarkingfullduplexvoice,
  title={$\tau$-Voice: Benchmarking Full-Duplex Voice Agents on Real-World Domains},
  author={Soham Ray and Keshav Dhandhania and Victor Barres and Karthik Narasimhan},
  year={2026},
  eprint={2603.13686},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2603.13686}
}
```

### Core $\tau$-Bench

```bibtex
@misc{barres2025tau2,
  title={$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment},
  author={Victor Barres and Honghua Dong and Soham Ray and Xujie Si and Karthik Narasimhan},
  year={2025},
  eprint={2506.07982},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2506.07982}
}

@misc{yao2024tau,
  title={$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains},
  author={Shunyu Yao and Noah Shinn and Pedram Razavi and Karthik Narasimhan},
  year={2024},
  eprint={2406.12045},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2406.12045}
}
```

### Task Fixes

```bibtex
@inproceedings{cuadron2026saber,
  title={{SABER}: Small Actions, Big Errors {\textemdash} Safeguarding Mutating Steps in {LLM} Agents},
  author={Alejandro Cuadron and Pengfei Yu and Yang Liu and Arpit Gupta},
  booktitle={ICLR 2026 Workshop on Memory for LLM-Based Agentic Systems},
  year={2026},
  url={https://openreview.net/forum?id=En2z9dckgP}
}
```