From text-only to multimodal, knowledge-aware agent evaluation.
Voice full-duplex · Knowledge retrieval · 75+ task fixes
τ-Voice paper · τ-Knowledge paper · Task fixes paper · Release notes
How do you say $\tau^3$-bench? We just say "tau three," but you do you!
- Knowledge Domain (`banking_knowledge`): A knowledge-retrieval-based customer service domain with configurable RAG pipelines, document search, embeddings, and agentic shell-based search. Learn more →
- Voice Full-Duplex (Audio Native): End-to-end voice evaluation with realtime providers (OpenAI, Gemini, xAI). Learn more →
- Task Quality (75+ fixes): Removed incorrect expected actions, clarified ambiguous instructions, fixed impossible constraints, and added missing fallback behaviors across airline, retail, and banking domains. Based on analysis from SABER (Cuadron et al., 2025). Learn more →
- Updated Leaderboard: Now includes voice and knowledge results. Compare model performance at taubench.com. Submit your results →
See CHANGELOG.md for the full version history.
Backward compatibility note: If you are evaluating an agent (not training), use the `base` task split to evaluate on the complete task set that matches the original τ-bench structure. This is the default.
Upgrading from $\tau^2$-bench? Installation now uses `uv` instead of `pip install -e .`, and Python `>=3.12, <3.14` is required (was `>=3.10`). Some internal APIs have been refactored; see CHANGELOG.md for details.
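If you wrap the benchmark in your own scripts, the Python constraint stated above can be checked up front. The following is a minimal sketch; the helper name `python_supported` is hypothetical and is not part of the tau2 package.

```python
import sys

# Supported range taken from the upgrade note: >=3.12, <3.14.
MIN_VERSION = (3, 12)
MAX_EXCLUSIVE = (3, 14)


def python_supported(version: tuple[int, int] = sys.version_info[:2]) -> bool:
    """Return True if (major, minor) falls inside the supported range.

    Hypothetical helper for illustration; not a tau2 API.
    """
    return MIN_VERSION <= version < MAX_EXCLUSIVE


if __name__ == "__main__":
    if not python_supported():
        raise SystemExit("Python >=3.12, <3.14 is required")
```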
Each domain specifies:
- A policy that the agent must follow
- A set of tools that the agent can use
- A set of tasks to evaluate the agent's performance
- Optionally: a set of user tools for the user simulator
Available domains: mock · airline · retail · telecom · banking_knowledge
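The four parts of a domain listed above can be pictured as a small data structure. This is an illustrative sketch only: the class `DomainSketch`, its field names, and the example tool are assumptions made for this example and do not reflect the actual classes in the repository (see the Domains document for the real structure).

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DomainSketch:
    """Hypothetical container mirroring the four parts of a domain."""

    name: str
    policy: str                               # policy the agent must follow
    tools: dict[str, Callable[..., object]]   # tools the agent can use
    tasks: list[dict]                         # tasks used for evaluation
    # Optional: tools available to the user simulator.
    user_tools: dict[str, Callable[..., object]] = field(default_factory=dict)

    def has_user_tools(self) -> bool:
        """Return True if the user simulator has tools of its own."""
        return bool(self.user_tools)


def get_status(order_id: str) -> str:
    """Made-up agent tool used only to populate the example."""
    return f"order {order_id}: shipped"


example = DomainSketch(
    name="mock",
    policy="Always confirm the order ID before answering.",
    tools={"get_status": get_status},
    tasks=[{"id": "task_0", "instruction": "Ask about order 42."}],
)
```

A domain without user tools leaves `user_tools` empty, so `example.has_user_tools()` is false here.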
| Mode | Description |
|---|---|
| Text (half-duplex) | Turn-based chat with tool use |
| Voice (full-duplex) | End-to-end audio via realtime providers (OpenAI, Gemini, xAI) |
```bash
git clone https://github.com/sierra-research/tau2-bench
cd tau2-bench
uv sync                      # core only (text-mode: airline, retail, telecom, mock)
```

Optional extras (install what you need):

```bash
uv sync --extra voice        # + voice/audio-native features
uv sync --extra knowledge    # + banking_knowledge domain (retrieval pipeline)
uv sync --extra gym          # + gymnasium RL interface
uv sync --extra dev          # + pytest, ruff, pre-commit (required for contributing)
uv sync --all-extras         # everything
```

This requires `uv`. Voice features also need system dependencies (`brew install portaudio ffmpeg` on macOS). See the full installation guide for details.
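To make the mapping between features and extras concrete, here is a small sketch that assembles the `uv sync` command for a given set of domains. The function `uv_sync_command` and the two lookup tables are hypothetical helpers written for this example; the extras names and the core domain list come from the install commands above, and the sketch relies on `uv` accepting `--extra` more than once.

```python
# Extras and core domains as named in the install commands above.
CORE_DOMAINS = {"mock", "airline", "retail", "telecom"}
DOMAIN_EXTRAS = {"banking_knowledge": "knowledge"}


def uv_sync_command(domains: list[str], voice: bool = False, gym: bool = False) -> list[str]:
    """Build the `uv sync` argv needed for the requested domains and features.

    Hypothetical helper for illustration; not part of tau2.
    """
    extras: set[str] = set()
    for domain in domains:
        if domain in DOMAIN_EXTRAS:
            extras.add(DOMAIN_EXTRAS[domain])
        elif domain not in CORE_DOMAINS:
            raise ValueError(f"unknown domain: {domain}")
    if voice:
        extras.add("voice")
    if gym:
        extras.add("gym")

    command = ["uv", "sync"]
    for extra in sorted(extras):  # sorted for a deterministic command line
        command += ["--extra", extra]
    return command
```

For example, `uv_sync_command(["airline"])` yields the plain core install, while adding `banking_knowledge` pulls in the `knowledge` extra.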
```bash
cp .env.example .env
# Edit .env with your API keys (uses LiteLLM; any supported provider works)
```

```bash
tau2 run --domain airline --agent-llm gpt-4.1 --user-llm gpt-4.1 \
  --num-trials 1 --num-tasks 5
```

Results are saved to `data/simulations/`. Use `tau2 view` to browse them.
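When launching several runs from a script, it can help to build the command line programmatically. The sketch below uses only the flags shown in the command above; the function name `build_run_command` is hypothetical, and the snippet constructs the argument list without executing anything.

```python
import shlex


def build_run_command(
    domain: str,
    agent_llm: str,
    user_llm: str,
    num_trials: int,
    num_tasks: int,
) -> list[str]:
    """Assemble a `tau2 run` argv from the flags shown in the quick start.

    Hypothetical helper for illustration; not part of tau2.
    """
    return [
        "tau2", "run",
        "--domain", domain,
        "--agent-llm", agent_llm,
        "--user-llm", user_llm,
        "--num-trials", str(num_trials),
        "--num-tasks", str(num_tasks),
    ]


command = build_run_command("airline", "gpt-4.1", "gpt-4.1", num_trials=1, num_tasks=5)
# shlex.join renders the list as a shell-safe string for logging.
command_line = shlex.join(command)
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root once the environment is set up.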
Tip: Run `tau2 intro` for an overview of available domains, commands, and examples.
| Document | Description |
|---|---|
| Getting Started | Installation, API keys, first run, output structure, configuration |
| CLI Reference | All `tau2` commands and options |
| Agent Developer Guide | Build and evaluate your own agent |
| Domains | Domain structure, data format, and available domains |
| Orchestrator & Communication Modes | Half-duplex and full-duplex orchestration |
| Knowledge Retrieval | Retrieval pipeline configs, embeddings, RAG, and sandbox setup for the `banking_knowledge` domain |
| Voice (Full-Duplex) | Providers, speech complexity, CLI options, and output structure for voice evaluation |
| Audio Native Architecture | Internal architecture for adding or modifying realtime provider adapters |
| Gym Interface | Gymnasium-compatible environment, play mode, train/test splits |
| Leaderboard Submission | How to submit results to taubench.com |
| Experiments | Experimental features and research code |
| Contributing | How to contribute to τ-bench |
| Changelog | Version history and release notes |
We welcome contributions! Whether you're fixing bugs, adding features, creating domains, or contributing research code, see our Contributing Guide for guidelines.
If you use a specific component of $\tau^3$-bench, please cite the corresponding paper:
@article{shi2026tau,
title={$\tau$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge},
author={Shi, Quan and Zytek, Alexandra and Razavi, Pedram and Narasimhan, Karthik and Barres, Victor},
journal={arXiv preprint arXiv:2603.04370},
year={2026}
}

@misc{ray2026tauvoicebenchmarkingfullduplexvoice,
title={$\tau$-Voice: Benchmarking Full-Duplex Voice Agents on Real-World Domains},
author={Soham Ray and Keshav Dhandhania and Victor Barres and Karthik Narasimhan},
year={2026},
eprint={2603.13686},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2603.13686},
}

@misc{barres2025tau2,
title={$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment},
author={Victor Barres and Honghua Dong and Soham Ray and Xujie Si and Karthik Narasimhan},
year={2025},
eprint={2506.07982},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2506.07982},
}
@misc{yao2024tau,
title={$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains},
author={Shunyu Yao and Noah Shinn and Pedram Razavi and Karthik Narasimhan},
year={2024},
eprint={2406.12045},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2406.12045},
}

@inproceedings{cuadron2026saber,
title={{SABER}: Small Actions, Big Errors {\textemdash} Safeguarding Mutating Steps in {LLM} Agents},
author={Alejandro Cuadron and Pengfei Yu and Yang Liu and Arpit Gupta},
booktitle={ICLR 2026 Workshop on Memory for LLM-Based Agentic Systems},
year={2026},
url={https://openreview.net/forum?id=En2z9dckgP},
}