🌐 Homepage | 🤗 Paper | 📖 ArXiv | 🏆 Leaderboard | 🐙 GitHub
This repository contains the evaluation code for the paper "KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks".
Knowledge-Orthogonal Reasoning Benchmark (KOR-Bench) is designed to evaluate models' intrinsic reasoning and planning abilities while minimizing interference from pretrained knowledge. It introduces new rules that are independent of prior knowledge, allowing for a more accurate assessment of how models adapt to novel rule-driven tasks. KOR-Bench consists of five task categories: Operation, Logic, Cipher, Puzzle, and Counterfactual. Leading models, such as Claude-3.5-Sonnet and GPT-4o, score around 58% on this challenging benchmark.
To install the required packages, run:
```bash
# Prepare repository and environment
git clone https://github.com/KOR-Bench/KOR-Bench.git
cd ./KOR-Bench
pip install -r requirements.txt
```

You can directly perform inference on `<MODEL_NAME>` using the following command:
```bash
export PYTHONPATH=$(pwd)

# local model infer
python infer/infer.py --config <CONFIG_PATH> --split <TASKS> --mode <MODE> --model_name <MODEL_NAME> --output_dir <OUTPUT_DIR> --batch_size <BATCH_SIZE> --use_accel

# API calls
python infer/infer.py --config <CONFIG_PATH> --split <TASKS> --mode <MODE> --model_name <MODEL_NAME> --output_dir <OUTPUT_DIR> --num_workers <NUM_WORKERS>
```

Example:
```bash
export PYTHONPATH=$(pwd)

# local model infer
python infer/infer.py --config config/config.yaml --split logic cipher counterfactual operation puzzle --mode zero-shot --model_name Yi-1.5-6B-Chat --output_dir results --batch_size 250 --use_accel

# API calls
python infer/infer.py --config config/config.yaml --split logic cipher counterfactual operation puzzle --mode zero-shot --model_name gpt-4o --output_dir results --num_workers 16
```

More examples can be found in the shell scripts of this repository. 🔗
- `--config`: Path to the configuration file.
- `--split`: Specify the task categories to evaluate. Available options include: `logic`, `cipher`, `counterfactual`, `operation`, `puzzle`. Multiple categories can be selected at once, separated by spaces.
- `--mode`: Choose from the evaluation modes (`zero-shot`, `three-shot`, etc.). The default is to evaluate all modes.
- `--infer_limit`: Limit the number of problems processed during inference to save costs during API debugging. Defaults to unlimited.
- `--use_accel`: Enable acceleration options for faster inference. All local-model results in the paper were generated with vLLM.
- `--num_workers`: Set the number of concurrent processes (use for API calls; set to 1 for local models).
- `--batch_size`: Set the batch size for local model inference (use for local models; set to 1 for API calls).
- During inference, a temporary file `.jsonl.tmp` will be saved. If inference is unexpectedly interrupted, you can rerun the same command to resume from the last checkpoint.
- After inference is complete, check the `response` field in the JSONL file saved in `output_dir`. This field should normally be a string; if it is a dict, its `error` field contains the error information. Rerun the command to re-infer the samples that caused errors; a quick way to spot them is sketched below.
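The following minimal sketch assumes only the JSONL layout described above (a string `response` on success, a dict with an `error` field on failure); point `OUTPUT_DIR` at whatever you passed to `--output_dir`:

```python
import glob
import json
import os

# Scan every JSONL file under the output directory and count records whose
# `response` is a dict, i.e. records that still carry an error.
OUTPUT_DIR = "results"  # adjust to your --output_dir

for path in glob.glob(os.path.join(OUTPUT_DIR, "**", "*.jsonl"), recursive=True):
    total, errors = 0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            total += 1
            if isinstance(record.get("response"), dict):  # dict response => error
                errors += 1
    print(f"{path}: {errors}/{total} records still contain errors")
```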
- `--model_name`: This parameter must align with the filenames in the `infer/models` directory. Several built-in models are available for direct selection.
- Adding a custom model: if you want to add a custom model for testing, follow these steps (a hypothetical skeleton is sketched after this list):
  1. Refer to the files in the `infer/models` directory.
  2. Create and add a new `.py` file for your model.
  3. Update the configuration in `__init__.py`.
- For more details, please check the documentation of the specific model you are adding.
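The exact interface expected by `infer/infer.py` is defined by the existing files in `infer/models`, so treat the skeleton below as a hypothetical illustration only: the function names `load_model` and `infer` and the Hugging Face loading code are assumptions, not the repository's actual API. Mirror whichever built-in model file is closest to yours.

```python
# infer/models/my_custom_model.py -- hypothetical skeleton for a local model.
# The structure below is illustrative; copy an existing file in infer/models
# to get the interface that infer/infer.py actually expects.
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_model(model_path: str):
    """Load the tokenizer and model once so they can be reused across batches."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
    return model, tokenizer


def infer(prompts, model, tokenizer, max_new_tokens=1024):
    """Generate one response string per prompt."""
    responses = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
        new_tokens = output[0][inputs["input_ids"].shape[1]:]
        responses.append(tokenizer.decode(new_tokens, skip_special_tokens=True))
    return responses
```

After adding the file, register the new model name in `infer/models/__init__.py`, following the pattern of the existing entries, so that `--model_name` can resolve it.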
Before you begin: To ensure accurate evaluation, please make sure to install the following packages:
- SymPy: Required for handling symbolic mathematical computations.
- antlr4-python3-runtime (version 4.11): Needed for processing LaTeX format answers.
```bash
pip install sympy
pip install antlr4-python3-runtime==4.11
```

If you are using an environment created from the `requirements.txt` file, no additional installation is needed, as these dependencies are already included.
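As a quick sanity check that both dependencies are installed correctly, you can parse a small LaTeX expression with SymPy, whose LaTeX parser relies on the ANTLR runtime; the expression below is just an arbitrary example:

```python
# Environment check: parsing LaTeX with SymPy requires antlr4-python3-runtime.
from sympy.parsing.latex import parse_latex

expr = parse_latex(r"\frac{1}{2} + \frac{1}{3}")
print(expr)             # the parsed, unevaluated SymPy expression
print(expr.simplify())  # 5/6
```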
Next Step: After you finish inference and confirm there are no error messages, please run the answer parsing and evaluation pipeline as follows:
```bash
export PYTHONPATH=$(pwd)
python eval/eval.py <source_folder> <target_root_folder> <csv_file>

# example:
python eval/eval.py results/0923_main eval/results/0923_main eval/results_0923_main.csv
```

Detailed evaluation results can be found in the `target_root_folder`.
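For a quick look at the summary CSV without opening a spreadsheet, a generic reader such as the sketch below is enough; the path is the example file from the command above, and nothing here assumes specific column names:

```python
import csv

# Print the aggregated results CSV produced by eval/eval.py.
with open("eval/results_0923_main.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        print("\t".join(row))
```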
Kaijing Ma: mkj3085003@gmail.com
Xinrun Du: duxinrun2000@gmail.com
Ge Zhang: gezhang@umich.edu
BibTeX:
```bibtex
@misc{ma2024korbenchbenchmarkinglanguagemodels,
      title={KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks},
      author={Kaijing Ma and Xinrun Du and Yunran Wang and Haoran Zhang and Zhoufutu Wen and Xingwei Qu and Jian Yang and Jiaheng Liu and Minghao Liu and Xiang Yue and Wenhao Huang and Ge Zhang},
      year={2024},
      eprint={2410.06526},
      archivePrefix={arXiv},
      primaryClass={cs.DB},
      url={https://arxiv.org/abs/2410.06526},
}
```