- CodeIF-Bench is a benchmark for evaluating the instruction-following ability of LLMs in interactive code generation tasks.
- CodeIF-Bench contains 9 verifiable instruction strategies collected from code review tasks.
- CodeIF-Bench contains 900+ verifiable instructions with test cases that cover both SA and Non-SA programming tasks and support Multi-Turn dialogue.
- The original repositories can be downloaded from link.
- The data file can be found in /data.
conda create --name xxx --file environment.txt
conda activate xxx
pip install -r requirement.txt
- Run `inference.sh`. Note that you should first set the LLM settings (such as the URL or keys) in `llm_factory.py`.
- Run `run_metrics.sh` to get the metrics.
- Run `inference_mbpp.sh` or `inference_repo.sh`. Note that you should first set the LLM settings (such as the URL or keys) in `multi_turn_xxx_eval.py`.
- Run `run_metrics.sh` to get the metrics.
- IA: The LLM's ability to follow current instructions
- CA: The LLM's ability to follow instructions throughout the entire conversation
- IFR: The proportion of instructions an LLM forgets during the conversation
- CIF: The number of instructions last followed in a dynamic conversation
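To make the four metrics above more concrete, here is a minimal Python sketch of how they could be computed for a single conversation. It is an illustration only, not the code behind `run_metrics.sh`: the function name `conversation_metrics`, the `followed` input layout, and the exact formulas are assumptions chosen to match the one-line descriptions above, and the CIF reading in particular is only one possible interpretation. The authoritative definitions are in the paper.

```python
from typing import Dict, List


def conversation_metrics(followed: List[List[bool]]) -> Dict[str, float]:
    """Compute illustrative IA / CA / IFR / CIF values for one conversation.

    ASSUMPTION: one new instruction is issued per turn, and
    followed[t][i] is True when the response at turn t satisfies
    instruction i (0 <= i <= t). Row t therefore has t + 1 entries.
    """
    n = len(followed)
    if n == 0:
        raise ValueError("conversation must contain at least one turn")
    for t, row in enumerate(followed):
        if len(row) != t + 1:
            raise ValueError(f"row {t} must have {t + 1} entries")

    # IA (assumed): share of turns whose newly issued instruction is followed.
    ia = sum(followed[t][t] for t in range(n)) / n

    # CA (assumed): share of all (turn, instruction-so-far) pairs that hold.
    total_pairs = n * (n + 1) // 2
    ca = sum(sum(row) for row in followed) / total_pairs

    # IFR (assumed): among instructions followed when first issued,
    # the share no longer followed in the final turn.
    initially_followed = [i for i in range(n) if followed[i][i]]
    forgotten = [i for i in initially_followed if not followed[n - 1][i]]
    ifr = len(forgotten) / len(initially_followed) if initially_followed else 0.0

    # CIF (assumed reading): number of instructions satisfied by the final turn.
    cif = sum(followed[n - 1])

    return {"IA": ia, "CA": ca, "IFR": ifr, "CIF": float(cif)}


if __name__ == "__main__":
    # Three turns; the final response drops the first instruction.
    example = [
        [True],
        [True, True],
        [False, True, True],
    ]
    print(conversation_metrics(example))
```

The triangular `followed` layout keeps the sketch independent of how test cases are actually executed; in practice each entry would come from running the verifiable test for that instruction against the response at that turn.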
For further details, please refer to our paper. A new version is coming soon!
If you have any questions or suggestions, please email us at wangpeiding@buaa.edu.cn.
If you find this repository useful, please cite our paper:
@misc{wang2025codeifbenchevaluatinginstructionfollowingcapabilities,
title={CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation},
author={Peiding Wang and Li Zhang and Fang Liu and Lin Shi and Minxiao Li and Bo Shen and An Fu},
year={2025},
eprint={2503.22688},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2503.22688},
}