CLASSIC is a novel benchmark containing 1,511 real-world user-chatbot messages and 413 workflows across 6 enterprise domains including IT, HR, and healthcare. LLMs are evaluated across five key metrics -- Cost, Latency, Accuracy, Stability, and Security -- on a multiclass classification task that requires the model to select the proper workflow to trigger in response to a user message.
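To make the task concrete, here is a toy keyword-overlap baseline that picks the workflow whose name best matches the user message. This is only an illustrative sketch — the benchmark's actual agents (e.g. `cot_gpt4`) prompt an LLM, and the workflow names here are invented:

```python
def pick_workflow(message: str, workflows: dict[str, str]) -> str:
    """Toy baseline: return the uuid of the workflow whose name
    shares the most words with the user message."""
    msg_words = set(message.lower().split())

    def overlap(item: tuple[str, str]) -> int:
        _uuid, name = item
        return len(msg_words & set(name.lower().split()))

    return max(workflows.items(), key=overlap)[0]

# Hypothetical workflows spanning IT / HR / healthcare domains
workflows = {
    "wf-001": "reset password",
    "wf-002": "request PTO",
    "wf-003": "schedule doctor appointment",
}
print(pick_workflow("I forgot my password and need to reset it", workflows))
# "wf-001"
```

A real agent replaces the overlap score with an LLM call, but the interface — one message in, one workflow UUID out — is the same.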
conda create -n classicbench python=3.10 -y
conda activate classicbench
git clone https://github.com/Miking98/classic_benchmark.git
cd classic_benchmark && pip install -e .
Run the benchmark:
python3 run.py --data [PATH_TO_DATASET_YAML] --agent [PATH_TO_AGENT_YAML]
# Examples:
python3 run.py --data v1 --agent aisera --eval no_security
python3 run.py --data v1 --agent cot_gpt4 --eval no_security
Or, download the dataset from HuggingFace and run your own custom scripts.
from datasets import load_dataset
# Load dataset subsets
ds_messages = load_dataset('Miking98/classic_benchmark-v1', 'messages')
ds_workflows = load_dataset('Miking98/classic_benchmark-v1', 'workflows')
ds_domains = load_dataset('Miking98/classic_benchmark-v1', 'domains')
ds_jailbreak_prompts = load_dataset('Miking98/classic_benchmark-v1', 'jailbreak_prompts')
print(ds_messages)
"""
DatasetDict({
test: Dataset({
features: ['conversation_uuid', 'request_content', 'response_content', 'true_workflow_uuid', 'true_workflow_uuid_2', 'request_idx', 'domain_uuid'],
num_rows: 1511
})
})
"""- Run GPT-4o agent:
python3 run.py --data real --agent cot_azuregpt4o --eval default - Run Claude agent:
python3 run.py --data real --agent cot_claude35 --eval default
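Given a set of model predictions, accuracy can be scored against the `messages` split. The schema above includes a secondary gold label (`true_workflow_uuid_2`); one plausible scoring rule — an assumption here, not necessarily the benchmark's exact metric — is to count a prediction correct if it matches either gold workflow:

```python
def accuracy(rows: list[dict], predictions: list[str]) -> float:
    """Fraction of messages whose predicted workflow matches either
    gold label (true_workflow_uuid or true_workflow_uuid_2)."""
    correct = 0
    for row, pred in zip(rows, predictions):
        gold = {row["true_workflow_uuid"], row.get("true_workflow_uuid_2")}
        correct += pred in gold
    return correct / len(predictions)

# Toy rows mirroring the HuggingFace schema above
rows = [
    {"true_workflow_uuid": "wf-001", "true_workflow_uuid_2": None},
    {"true_workflow_uuid": "wf-002", "true_workflow_uuid_2": "wf-003"},
]
print(accuracy(rows, ["wf-001", "wf-003"]))  # 1.0
```

The same loop runs unchanged over `ds_messages['test']`, since each row of a Hugging Face `Dataset` is a plain dict.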
Download the dataset from 🤗 HuggingFace here
Listed in order of creation. Each subsequent folder depends on the previous one.
Raw data dump from Aisera.
Next, we sample a subset of chats from the raw data dump by running:
python3 scripts/scripts_to_create_dataset/1_convert_raw_to_sampled.py
Next, we generate an Excel file to send to AMT workers to annotate the sampled data by running:
python3 scripts/scripts_to_create_dataset/2_convert_sampled_to_annotations.py
In practice, we then need to:
- Use Amazon Mechanical Turk to annotate the chats. Generate one Excel file per annotator.
- Save the annotated Excel files into data/2_annotations and delete the original unannotated Excel file.
Next, we clean the annotated dataset by removing conversations flagged by our annotators, by running:
python3 scripts/scripts_to_create_dataset/3_convert_annotations_to_clean.py
This is our final, cleaned dataset.
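The cleaning step above can be sketched as a simple filter. Column names like `is_flagged` are assumptions for illustration, not the script's actual schema:

```python
def remove_flagged(conversations: list[dict], flags: list[dict]) -> list[dict]:
    """Drop any conversation that at least one annotator flagged."""
    flagged = {f["conversation_uuid"] for f in flags if f["is_flagged"]}
    return [c for c in conversations if c["conversation_uuid"] not in flagged]

# Hypothetical annotation records
conversations = [{"conversation_uuid": "a"}, {"conversation_uuid": "b"}]
flags = [
    {"conversation_uuid": "a", "is_flagged": True},
    {"conversation_uuid": "b", "is_flagged": False},
]
print(remove_flagged(conversations, flags))  # [{'conversation_uuid': 'b'}]
```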
Submitted to ICLR reviewers.
To generate:
python3 scripts/scripts_to_create_dataset/4_convert_clean_to_iclr_workshop_sample.py
Original dataset reported in ICLR paper.
Convert the dataset to a Hugging Face Dataset and upload it to the Hub.
To generate:
python3 scripts/scripts_to_create_dataset/6_hf_dataset.py --path_to_dataset_dir ./data/3_clean --hf_version v0
We keep a regularly updated leaderboard of model performance for each version of CLASSic.
- Original dataset from 2025 ICLR Workshop submission.
- Access: Not released due to privacy considerations.
- # of messages: 2311


- Filtered version of v0
- Access: 🤗 HuggingFace
- # of messages: 1511
@inproceedings{wornow2025top,
  title={Top of the CLASS: Benchmarking LLM Agents on Real-World Enterprise Tasks},
  author={Wornow, Michael and Garodia, Vaishnav and Vassalos, Vasilis and Contractor, Utkarsh},
  booktitle={ICLR 2025 Workshop on Building Trust in Language Models and Applications},
  year={2025}
}

