KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks

🌐 Homepage | 🤗 Paper | 📖 ArXiv | 🏆 Leaderboard | 🐙 GitHub

This repository contains the evaluation code for the paper "KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks".


🔔 Introduction

KOR-Bench Overview

Knowledge-Orthogonal Reasoning Benchmark (KOR-Bench) is designed to evaluate models' intrinsic reasoning and planning abilities while minimizing interference from pretrained knowledge. It introduces new rules that are independent of prior knowledge, allowing for a more accurate assessment of how models adapt to novel rule-driven tasks. KOR-Bench consists of five task categories: Operation, Logic, Cipher, Puzzle, and Counterfactual. Leading models, such as Claude-3.5-Sonnet and GPT-4o, score around 58% on this challenging benchmark.


⚙️ Installation

To install the required packages, run:

# Prepare repository and environment
git clone https://github.com/KOR-Bench/KOR-Bench.git
cd ./KOR-Bench
pip install -r requirements.txt

🔍 Inference

You can run inference directly with <MODEL_NAME> using the following commands:

export PYTHONPATH=$(pwd)

# local model infer
python infer/infer.py --config <CONFIG_PATH> --split <TASKS> --mode <MODE> --model_name <MODEL_NAME> --output_dir <OUTPUT_DIR> --batch_size <BATCH_SIZE> --use_accel

# API calls
python infer/infer.py --config <CONFIG_PATH> --split <TASKS> --mode <MODE> --model_name <MODEL_NAME> --output_dir <OUTPUT_DIR> --num_workers <NUM_WORKERS>

Examples:

export PYTHONPATH=$(pwd)

# local model infer
python infer/infer.py --config config/config.yaml --split logic cipher counterfactual operation puzzle --mode zero-shot --model_name Yi-1.5-6B-Chat --output_dir results --batch_size 250 --use_accel

# API calls
python infer/infer.py --config config/config.yaml --split logic cipher counterfactual operation puzzle --mode zero-shot --model_name gpt-4o --output_dir results --num_workers 16

More examples can be found in the shell scripts of this repository. 🔗

📜Parameter Explanations for Inference Script

  • --config: Path to the configuration file.

  • --split: Specify the task categories for evaluation.
    Available options include:

    • logic
    • cipher
    • counterfactual
    • operation
    • puzzle
    • ...

    Multiple categories can be selected at once, separated by spaces.

  • --mode: Choose from the different evaluation modes (zero-shot, three-shot, etc.). The default is to evaluate all modes.

  • --infer_limit: Limit the number of problems processed during inference to save costs during API debugging. Defaults to unlimited.

  • --use_accel: Enables acceleration for faster inference. All local-model results reported in the paper were obtained with vLLM.

  • --num_workers: Set the number of concurrent processes (use for API calls; set to 1 for local models).

  • --batch_size: Set the batch size for local model inference (use for local model infer; set to 1 for API calls).

📝 Notes

  • During inference, a temporary .jsonl.tmp file is saved. If inference is unexpectedly interrupted, rerun the same command to resume from the last checkpoint.

  • After inference is complete, check the response field in the JSONL files saved in output_dir. This field should normally be a string; if it is a dict, the error field contains the error information. Rerun the command to re-infer the problems that produced errors; a sketch of such a check is shown below.
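
A minimal sketch of such a check (the file path is only an illustrative placeholder, and the exact field layout is assumed from the note above):

import json

# Scan one output JSONL file for failed entries (the path is a placeholder).
output_file = "results/example_model/logic_zero-shot.jsonl"

with open(output_file, encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        record = json.loads(line)
        # A string response is a normal completion; a dict indicates a failure
        # whose details live in the error field, per the note above.
        if isinstance(record.get("response"), dict):
            print(f"line {line_no}: {record.get('error') or record['response'].get('error')}")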

🛠️ Run Custom Model

  • --model_name: This parameter must match a filename in the infer/models directory. Several built-in models are available for direct selection.

  • Adding a Custom Model:
    If you want to add a custom model for testing, follow these steps (a rough sketch follows this list):

    1. Refer to the files in the infer/models directory.
    2. Create and add a new .py file for your model.
    3. Update the configuration in __init__.py.
  • For more details, please check the documentation of the specific model you are adding.
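
A rough, hypothetical sketch of what such a file could look like is shown below; the function names and signatures here are assumptions for illustration only, so copy the exact interface from an existing file in infer/models:

# infer/models/my_custom_model.py (hypothetical example; mirror an existing
# file in infer/models for the interface the framework actually expects)
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_model(model_path="path/to/MyCustomModel"):
    """Load the tokenizer and model once so they can be reused across batches."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
    return model, tokenizer


def infer(prompts, model, tokenizer, max_new_tokens=1024):
    """Return one generated response string per input prompt."""
    responses = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
        new_tokens = output[0][inputs["input_ids"].shape[1]:]
        responses.append(tokenizer.decode(new_tokens, skip_special_tokens=True))
    return responses

After creating the file, register it in infer/models/__init__.py so that --model_name my_custom_model (matching the filename) resolves to it.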


⭐ Evaluation

Before you begin: To ensure accurate evaluation, please make sure to install the following packages:

  1. SymPy: Required for handling symbolic mathematical computations.
  2. antlr4-python3-runtime (version 4.11): Needed for processing LaTeX-format answers.

pip install sympy
pip install antlr4-python3-runtime==4.11

If you are using an environment created from the requirements.txt file, no additional installation is needed, as these dependencies are already included.
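
As a quick, illustrative way to confirm both packages work together (this is not the benchmark's actual answer-matching code), SymPy's LaTeX parser, which relies on the ANTLR runtime, can be exercised directly:

from sympy import simplify
from sympy.parsing.latex import parse_latex  # needs antlr4-python3-runtime

# Parse a LaTeX-format answer and compare it symbolically to a reference value.
candidate = parse_latex(r"\frac{1}{2} + \frac{1}{3}")
reference = parse_latex(r"\frac{5}{6}")
print(simplify(candidate - reference) == 0)  # True when the answers agree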

Next Step: After you finish inference and confirm there are no error messages, please run the answer parsing and evaluation pipeline as follows:

export PYTHONPATH=$(pwd)

python eval/eval.py <source_folder> <target_root_folder> <csv_file>

# example:
python eval/eval.py results/0923_main eval/results/0923_main eval/results_0923_main.csv

Detailed evaluation results can be found in the target_root_folder.
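
If you prefer to inspect the summary programmatically, here is a minimal sketch (the CSV's column layout is not documented here, so this simply prints every row of the file produced by the example command above):

import csv

# Print the rows of the summary CSV written by eval/eval.py.
with open("eval/results_0923_main.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row)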

📫 Contact

Kaijing Ma: mkj3085003@gmail.com

Xinrun Du: duxinrun2000@gmail.com

Ge Zhang: gezhang@umich.edu

📚 Citation

BibTeX:

@misc{ma2024korbenchbenchmarkinglanguagemodels,
title={KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks}, 
author={Kaijing Ma and Xinrun Du and Yunran Wang and Haoran Zhang and Zhoufutu Wen and Xingwei Qu and Jian Yang and Jiaheng Liu and Minghao Liu and Xiang Yue and Wenhao Huang and Ge Zhang},
year={2024},
eprint={2410.06526},
archivePrefix={arXiv},
primaryClass={cs.DB},
url={https://arxiv.org/abs/2410.06526}, 
}
