
# $\tau$-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains


## 🚀 τ³-bench is here!

From text-only to multimodal, knowledge-aware agent evaluation:
voice full-duplex · knowledge retrieval · 75+ task fixes.

τ-Voice paper · τ-Knowledge paper · Task fixes paper · Release notes

*How do you say $\tau^3$-bench? We just say "tau three," but you do you!*

## What's New in $\tau^3$-bench

- **Knowledge Domain (`banking_knowledge`):** A knowledge-retrieval-based customer service domain with configurable RAG pipelines, document search, embeddings, and agentic shell-based search. Learn more →
- **Voice Full-Duplex (Audio Native):** End-to-end voice evaluation with realtime providers (OpenAI, Gemini, xAI). Learn more →
- **Task Quality (75+ fixes):** Removed incorrect expected actions, clarified ambiguous instructions, fixed impossible constraints, and added missing fallback behaviors across the airline, retail, and banking domains. Based on analysis from SABER (Cuadron et al., 2025). Learn more →
- **Updated Leaderboard:** Now includes voice and knowledge results. Compare model performance at taubench.com. Submit your results →

See CHANGELOG.md for the full version history.

**Backward compatibility note:** If you are evaluating an agent (rather than training one), use the `base` task split: it contains the complete task set and matches the original τ-bench structure. This is the default.

**Upgrading from $\tau^2$-bench?** Installation now uses `uv` instead of `pip install -e .`, and Python `>=3.12,<3.14` is required (previously `>=3.10`). Some internal APIs have been refactored; see CHANGELOG.md for details.

## Overview

$\tau$-bench is a simulation framework for evaluating customer service agents across multiple domains. It supports text-based half-duplex (turn-based) evaluation and voice full-duplex (simultaneous) evaluation using real-time audio APIs.

Each domain specifies:

- A **policy** that the agent must follow
- A set of **tools** that the agent can use
- A set of **tasks** to evaluate the agent's performance
- Optionally, a set of **user tools** for the user simulator

Available domains: `mock` · `airline` · `retail` · `telecom` · `banking_knowledge`

| Mode | Description |
|------|-------------|
| Text (half-duplex) | Turn-based chat with tool use |
| Voice (full-duplex) | End-to-end audio via realtime providers (OpenAI, Gemini, xAI) |
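
For intuition, the components listed above can be pictured roughly as one bundle per domain. This is an illustrative sketch only; every field name below is hypothetical, and the actual data format is documented in the Domains guide:

```python
# Illustrative sketch of what a domain bundles together.
# All field names here are HYPOTHETICAL, not tau2's actual schema;
# see the Domains documentation for the real data format.
example_domain = {
    "policy": "Verify the user's identity before modifying any reservation.",
    "tools": ["get_reservation", "modify_flight", "issue_refund"],  # agent-side tools
    "user_tools": ["check_email"],  # optional tools for the user simulator
    "tasks": [
        {
            "id": "airline_001",
            "user_instructions": "You want to move your flight to the next day.",
            "expected_actions": [("modify_flight", {"shift_days": 1})],
        }
    ],
}
```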

## Quick Start

### 1. Install

```bash
git clone https://github.com/sierra-research/tau2-bench
cd tau2-bench
uv sync                        # core only (text mode: airline, retail, telecom, mock)
```

Optional extras (install what you need):

```bash
uv sync --extra voice          # + voice/audio-native features
uv sync --extra knowledge      # + banking_knowledge domain (retrieval pipeline)
uv sync --extra gym            # + gymnasium RL interface
uv sync --extra dev            # + pytest, ruff, pre-commit (required for contributing)
uv sync --all-extras           # everything
```

These commands require `uv`. Voice features also need system dependencies (`brew install portaudio ffmpeg` on macOS); see the full installation guide for details.

### 2. Set up API keys

```bash
cp .env.example .env
# Edit .env with your API keys (uses LiteLLM, so any supported provider works)
```
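
Before running a full evaluation, you can sanity-check your keys with a one-off call through LiteLLM itself. A minimal sketch, assuming `OPENAI_API_KEY` (or another provider's key) is set in `.env` and that `python-dotenv` is available in your environment:

```python
# Sanity-check .env keys via LiteLLM before running an evaluation.
import litellm
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # load API keys from .env into the environment

response = litellm.completion(
    model="gpt-4.1",  # any LiteLLM-supported model string works here
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)
```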

### 3. Run an evaluation

```bash
tau2 run --domain airline --agent-llm gpt-4.1 --user-llm gpt-4.1 \
  --num-trials 1 --num-tasks 5
```

Results are saved to `data/simulations/`. Use `tau2 view` to browse them.
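
`tau2 view` is the intended way to inspect runs, but the saved files can also be skimmed programmatically. A minimal sketch, assuming each simulation is written as a JSON file under `data/simulations/` (the schema is tau2's own, so this only prints the top-level structure):

```python
# Skim saved simulation files; assumes JSON output under data/simulations/.
import json
from pathlib import Path

for path in sorted(Path("data/simulations").glob("*.json")):
    sim = json.loads(path.read_text())
    # Print only the top-level structure; check a real file for the schema.
    top = list(sim) if isinstance(sim, dict) else f"list of {len(sim)} items"
    print(path.name, top)
```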

> **Tip:** Run `tau2 intro` for an overview of available domains, commands, and examples.

## Documentation

### Getting Started

| Document | Description |
|----------|-------------|
| Getting Started | Installation, API keys, first run, output structure, configuration |
| CLI Reference | All `tau2` commands and options |

### Core Concepts

| Document | Description |
|----------|-------------|
| Agent Developer Guide | Build and evaluate your own agent |
| Domains | Domain structure, data format, and available domains |
| Orchestrator & Communication Modes | Half-duplex and full-duplex orchestration |

### Knowledge Retrieval

| Document | Description |
|----------|-------------|
| Knowledge Retrieval | Retrieval pipeline configs, embeddings, RAG, and sandbox setup for the `banking_knowledge` domain |

### Voice & Audio

| Document | Description |
|----------|-------------|
| Voice (Full-Duplex) | Providers, speech complexity, CLI options, and output structure for voice evaluation |
| Audio Native Architecture | Internal architecture for adding or modifying realtime provider adapters |

### RL & Training

| Document | Description |
|----------|-------------|
| Gym Interface | Gymnasium-compatible environment, play mode, train/test splits |
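
For orientation, driving a Gymnasium-compatible environment follows the standard reset/step loop sketched below. The environment id is hypothetical; the real registration name, observation space, and action space are described in the Gym Interface doc:

```python
# Standard Gymnasium interaction loop.
# The env id below is HYPOTHETICAL; see the Gym Interface doc for the real one.
import gymnasium as gym

env = gym.make("tau2/airline-v0")  # hypothetical id, for illustration only
obs, info = env.reset(seed=0)
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # replace with your agent's policy
    obs, reward, terminated, truncated, info = env.step(action)
env.close()
```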

### Leaderboard & Experiments

| Document | Description |
|----------|-------------|
| Leaderboard Submission | How to submit results to taubench.com |
| Experiments | Experimental features and research code |

### Project

| Document | Description |
|----------|-------------|
| Contributing | How to contribute to τ-bench |
| Changelog | Version history and release notes |

## Contributing

We welcome contributions! Whether you're fixing bugs, adding features, creating domains, or contributing research code, see our Contributing Guide to get started.

## Citation

If you use a specific component of $\tau^3$-bench, please cite the corresponding paper below.

### Knowledge Domain (banking_knowledge)

```bibtex
@article{shi2026tau,
  title={$\tau$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge},
  author={Shi, Quan and Zytek, Alexandra and Razavi, Pedram and Narasimhan, Karthik and Barres, Victor},
  journal={arXiv preprint arXiv:2603.04370},
  year={2026}
}
```

### Voice Full-Duplex Benchmark

```bibtex
@misc{ray2026tauvoicebenchmarkingfullduplexvoice,
  title={$\tau$-Voice: Benchmarking Full-Duplex Voice Agents on Real-World Domains},
  author={Soham Ray and Keshav Dhandhania and Victor Barres and Karthik Narasimhan},
  year={2026},
  eprint={2603.13686},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2603.13686}
}
```

### Core $\tau$-Bench

```bibtex
@misc{barres2025tau2,
  title={$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment},
  author={Victor Barres and Honghua Dong and Soham Ray and Xujie Si and Karthik Narasimhan},
  year={2025},
  eprint={2506.07982},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2506.07982}
}

@misc{yao2024tau,
  title={$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains},
  author={Shunyu Yao and Noah Shinn and Pedram Razavi and Karthik Narasimhan},
  year={2024},
  eprint={2406.12045},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2406.12045}
}
```

### Task Fixes

```bibtex
@inproceedings{cuadron2026saber,
  title={{SABER}: Small Actions, Big Errors {\textemdash} Safeguarding Mutating Steps in {LLM} Agents},
  author={Alejandro Cuadron and Pengfei Yu and Yang Liu and Arpit Gupta},
  booktitle={ICLR 2026 Workshop on Memory for LLM-Based Agentic Systems},
  year={2026},
  url={https://openreview.net/forum?id=En2z9dckgP}
}
```