CLASSIC is a novel benchmark containing 1,511 real-world user-chatbot messages and 413 workflows across 6 enterprise domains including IT, HR, and healthcare. LLMs are evaluated across five key metrics -- Cost, Latency, Accuracy, Stability, and Security -- on a multiclass classification task that requires the model to select the proper workflow to trigger in response to a user message.
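To make the task concrete, here is a toy keyword-overlap baseline that picks the workflow whose name best matches the user message. This is only an illustrative sketch — the benchmark's actual agents (e.g. `cot_gpt4`) prompt an LLM, and the workflow names here are invented:

```python
def pick_workflow(message: str, workflows: dict[str, str]) -> str:
    """Toy baseline: return the uuid of the workflow whose name
    shares the most words with the user message."""
    msg_words = set(message.lower().split())

    def overlap(item: tuple[str, str]) -> int:
        _uuid, name = item
        return len(msg_words & set(name.lower().split()))

    return max(workflows.items(), key=overlap)[0]

# Hypothetical workflows spanning IT / HR / healthcare domains
workflows = {
    "wf-001": "reset password",
    "wf-002": "request PTO",
    "wf-003": "schedule doctor appointment",
}
print(pick_workflow("I forgot my password and need to reset it", workflows))
# "wf-001"
```

A real agent replaces the overlap score with an LLM call, but the interface — one message in, one workflow UUID out — is the same.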
conda create -n classicbench python=3.10 -y
conda activate classicbench
git clone https://github.com/Miking98/classic_benchmark.git
cd classic_benchmark && pip install -e .
Run the benchmark:
python3 run.py --data [PATH_TO_DATASET_YAML] --agent [PATH_TO_AGENT_YAML]
# Examples:
python3 run.py --data v1 --agent aisera --eval no_security
python3 run.py --data v1 --agent cot_gpt4 --eval no_security
Or, download the dataset from HuggingFace and run your own custom scripts.
from datasets import load_dataset
# Load dataset subsets
ds_messages = load_dataset('Miking98/classic_benchmark-v1', 'messages')
ds_workflows = load_dataset('Miking98/classic_benchmark-v1', 'workflows')
ds_domains = load_dataset('Miking98/classic_benchmark-v1', 'domains')
ds_jailbreak_prompts = load_dataset('Miking98/classic_benchmark-v1', 'jailbreak_prompts')
print(ds_messages)
"""
DatasetDict({
test: Dataset({
features: ['conversation_uuid', 'request_content', 'response_content', 'true_workflow_uuid', 'true_workflow_uuid_2', 'request_idx', 'domain_uuid'],
num_rows: 1511
})
})
"""- Run GPT-4o agent:
python3 run.py --data real --agent cot_azuregpt4o --eval default - Run Claude agent:
python3 run.py --data real --agent cot_claude35 --eval default
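Given a set of model predictions, accuracy can be scored against the `messages` split. The schema above includes a secondary gold label (`true_workflow_uuid_2`); one plausible scoring rule — an assumption here, not necessarily the benchmark's exact metric — is to count a prediction correct if it matches either gold workflow:

```python
def accuracy(rows: list[dict], predictions: list[str]) -> float:
    """Fraction of messages whose predicted workflow matches either
    gold label (true_workflow_uuid or true_workflow_uuid_2)."""
    correct = 0
    for row, pred in zip(rows, predictions):
        gold = {row["true_workflow_uuid"], row.get("true_workflow_uuid_2")}
        correct += pred in gold
    return correct / len(predictions)

# Toy rows mirroring the HuggingFace schema above
rows = [
    {"true_workflow_uuid": "wf-001", "true_workflow_uuid_2": None},
    {"true_workflow_uuid": "wf-002", "true_workflow_uuid_2": "wf-003"},
]
print(accuracy(rows, ["wf-001", "wf-003"]))  # 1.0
```

The same loop runs unchanged over `ds_messages['test']`, since each row of a Hugging Face `Dataset` is a plain dict.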
Download the dataset from 🤗 HuggingFace here
Listed in order of creation. Each subsequent folder depends on the previous one.
Raw data dump from Aisera.
Next, we sample a subset of chats from the raw data dump by running:
python3 scripts/scripts_to_create_dataset/1_convert_raw_to_sampled.py
Next, we generate an Excel file to send to AMT workers to annotate the sampled data by running:
python3 scripts/scripts_to_create_dataset/2_convert_sampled_to_annotations.py
In practice, we then need to:
- Use Amazon Mechanical Turk to annotate the chats. Generate one Excel file per annotator.
- Save the annotated Excel files into data/2_annotations and delete the original unannotated Excel file.
Next, we clean the annotated dataset by removing conversations flagged by our annotators, by running:
python3 scripts/scripts_to_create_dataset/3_convert_annotations_to_clean.py
This is our final, cleaned dataset.
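The cleaning step above can be sketched as a simple filter. Column names like `is_flagged` are assumptions for illustration, not the script's actual schema:

```python
def remove_flagged(conversations: list[dict], flags: list[dict]) -> list[dict]:
    """Drop any conversation that at least one annotator flagged."""
    flagged = {f["conversation_uuid"] for f in flags if f["is_flagged"]}
    return [c for c in conversations if c["conversation_uuid"] not in flagged]

# Hypothetical annotation records
conversations = [{"conversation_uuid": "a"}, {"conversation_uuid": "b"}]
flags = [
    {"conversation_uuid": "a", "is_flagged": True},
    {"conversation_uuid": "b", "is_flagged": False},
]
print(remove_flagged(conversations, flags))  # [{'conversation_uuid': 'b'}]
```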
Submitted to ICLR reviewers.
To generate:
python3 scripts/scripts_to_create_dataset/4_convert_clean_to_iclr_workshop_sample.py
Original dataset reported in ICLR paper.
Convert the dataset to a Hugging Face Dataset and upload it to the Hub.
To generate:
python3 scripts/scripts_to_create_dataset/6_hf_dataset.py --path_to_dataset_dir ./data/3_clean --hf_version v0
We keep a regularly updated leaderboard of model performance for each version of CLASSic.
- Original dataset from 2025 ICLR Workshop submission.
- Access: Not released due to privacy considerations.
- # of messages: 2311


- Filtered version of v0
- Access: 🤗 HuggingFace
- # of messages: 1511
@inproceedings{wornow2025top,
  title={Top of the CLASS: Benchmarking LLM Agents on Real-World Enterprise Tasks},
  author={Wornow, Michael and Garodia, Vaishnav and Vassalos, Vasilis and Contractor, Utkarsh},
  booktitle={ICLR 2025 Workshop on Building Trust in Language Models and Applications},
  year={2025}
}

