ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts
https://huggingface.co/datasets/KISTI-KONI/ScholarBench
conda create -n sb_env python=3.12.9
conda activate sb_env
pip install -r requirements.txt
# install bleurt
wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip
unzip BLEURT-20.zip
- data : numbered subdirectories hold the model answers for each experiment: 1 gives the LLM only the problem; 2 gives the problem and its topic; 3 gives the problem and its paragraph; 4 gives the problem and its paragraph together with a step-by-step reasoning instruction; 5 gives the problem and its category.
  - API (1~5)
    - GPT-4o
    - o1-mini
    - o3-mini
  - Memory (1~4)
    - Bllossom-8b
    - Bllossom-70b
    - Exaone-8b
    - Exaone-32b
    - Exaone-32b-reasoning
    - Gemma2-9b
    - Gemma2-27b
    - Koni-8b
    - llama-8b
    - llama-70b
    - Mistral-8b
    - Mistral-24b
    - Qwen-7b
    - Qwen-32b-reasoning
    - Qwen-72b
    - Trilion-7b
- eval_scripts
  - eval_all : summarization, short_answer, multiple_choice, multiple_select, true_false
    - config.py
    - data_loader.py
    - evaluation_utils.py
    - metrics_calculator.py
    - main.py
    - quality_eval.py
  - eval_mc_ms_tf : multiple_choice, multiple_select, true_false
    - config.py
    - data_loader.py
    - evaluation_utils.py
    - metrics_calculator.py
    - main.py
    - accuracy_eval.py
  - eval_sum_sa : summarization, short_answer
    - config.py
    - data_loader.py
    - evaluation_utils.py
    - metrics_calculator.py
    - main.py
    - quality_eval.py
- Experiment 1: Includes multiple-choice, multiple-select, short-answer, true/false, and summarization tasks.
- Experiment 2: Includes multiple-choice, multiple-select, short-answer, and true/false tasks with a topic field.
- Experiment 3: Includes multiple-choice, multiple-select, short-answer, and true/false tasks with a paragraph field.
- Experiment 4: Includes multiple-choice, multiple-select, short-answer, true/false, and summarization tasks with a paragraph field and a "Think through step by step" instruction.
- Experiment 5: Includes multiple-choice, multiple-select, short-answer, true/false, and summarization tasks with a category field (summarization uses paragraph).
- Memory LLM: Inference for the Memory LLMs uses the vllm package. Experiments 1-4 can be selected in the shell script (original, topic, paragraph, cot). GPU used: A100 x 4.
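The experiment settings above differ mainly in which extra field accompanies the problem. The following is a minimal sketch of that mapping, not the repository's prompt code: the record keys (`question`, `topic`, `paragraph`, `category`), the prompt layout, and the `build_prompt` helper are all assumptions made for illustration.

```python
# Hypothetical mapping from experiment type to the extra field shown to the model.
# Field names are assumptions; check the dataset schema before relying on them.
# Note: in experiment 5 the summarization task uses the paragraph instead (not modeled here).
EXTRA_FIELD = {"1": None, "2": "topic", "3": "paragraph", "4": "paragraph", "5": "category"}
COT_INSTRUCTION = "Think through step by step"


def build_prompt(record: dict, experiment_type: str) -> str:
    """Assemble a prompt for one problem under the given experiment type."""
    parts = []
    field = EXTRA_FIELD[experiment_type]
    if field is not None:
        parts.append(f"{field.capitalize()}: {record[field]}")
    parts.append(f"Question: {record['question']}")
    if experiment_type == "4":
        # Experiment 4 adds the chain-of-thought instruction.
        parts.append(COT_INSTRUCTION)
    return "\n".join(parts)


example = {
    "question": "What does BLEU measure?",
    "topic": "MT evaluation",
    "paragraph": "BLEU is an n-gram overlap metric.",
    "category": "NLP",
}
print(build_prompt(example, "4"))
```

Keeping the mapping in one dictionary makes it easy to see at a glance how the settings differ and to add a new setting without touching the prompt logic.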
Create a .env file in the root directory and add your OpenAI API key:
OPENAI_API_KEY=your-api-key
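To check that the key is picked up before launching a long run, a small dependency-free loader can help. This is a sketch under assumptions: the repository may load the file differently (for example with a dotenv package), and `load_env_file` is a hypothetical helper, not part of the codebase.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines from a .env file and export them to os.environ."""
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        value = value.strip().strip("'\"")
        loaded[key] = value
        os.environ.setdefault(key, value)  # keep any value already set in the shell
    return loaded


load_env_file()
print("OPENAI_API_KEY set:", bool(os.environ.get("OPENAI_API_KEY")))
```

Using `setdefault` means a key exported in the shell takes precedence over the one in the file.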
The dataset is not included in this repository. Download it in JSONL format from the KISTI Hugging Face page (https://huggingface.co/datasets/KISTI-KONI/ScholarBench) and place it in the dataset/ directory.
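Once the file is in place, a quick read confirms that it parses as JSONL (one JSON object per line). The sketch below assumes the file name used in the configuration example (`en_eval_dataset.jsonl`); `read_jsonl` is a hypothetical helper, and the repository's own data_loader.py may behave differently.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return one dict per non-empty line of a JSONL file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                # Report the offending line so a truncated download is easy to spot.
                raise ValueError(f"{path}:{line_no}: invalid JSON ({err})") from err
    return records


dataset_path = Path("dataset/en_eval_dataset.jsonl")  # assumed file name
if dataset_path.exists():
    print(f"Loaded {len(read_jsonl(dataset_path))} records")
else:
    print(f"{dataset_path} not found; download it first")
```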
Open src/config.py and specify the input_file_path and output_file_path for each experiment. Use relative paths based on the project root.
For example:
MODEL_CONFIGS = {
    "gpt-4o_1": {
        "input_file_path": "../../../dataset/en_eval_dataset.jsonl",
        "output_file_path": "../../../result/gpt-4o/gpt-4o_result_1_en.json",
        "model_name": "gpt-4o",
        "experiment_type": "1"
    },
    # Other experiment configurations
}
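The keys in the example follow a `<model>_<experiment>` pattern, so an entry can be looked up and sanity-checked before a run. The sketch below assumes that naming pattern holds for every entry; `get_config` and `validate_configs` are hypothetical helpers, not functions from src/config.py.

```python
# Same shape as the MODEL_CONFIGS example above, reduced to one entry.
MODEL_CONFIGS = {
    "gpt-4o_1": {
        "input_file_path": "../../../dataset/en_eval_dataset.jsonl",
        "output_file_path": "../../../result/gpt-4o/gpt-4o_result_1_en.json",
        "model_name": "gpt-4o",
        "experiment_type": "1",
    },
}

REQUIRED_KEYS = {"input_file_path", "output_file_path", "model_name", "experiment_type"}


def validate_configs(configs: dict) -> None:
    """Raise ValueError if any entry lacks one of the required keys."""
    for name, cfg in configs.items():
        missing = REQUIRED_KEYS - cfg.keys()
        if missing:
            raise ValueError(f"{name}: missing keys {sorted(missing)}")


def get_config(model_name: str, experiment_type: str) -> dict:
    """Look up an entry by the assumed '<model>_<experiment>' key."""
    key = f"{model_name}_{experiment_type}"
    if key not in MODEL_CONFIGS:
        known = ", ".join(sorted(MODEL_CONFIGS))
        raise KeyError(f"No configuration '{key}'; known keys: {known}")
    return MODEL_CONFIGS[key]


validate_configs(MODEL_CONFIGS)
print(get_config("gpt-4o", "1")["output_file_path"])
```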
Run an experiment for a specific model and type from the project root (/home/kilab_ndw/generate_model_answer):
python3 -m src.exp1.gpt-4o_1 # GPT-4o
python3 -m src.exp1.o1-mini_1 # o1-mini
python3 -m src.exp1.o3-mini_1 # o3-mini
python3 -m src.exp2.gpt-4o_2 # GPT-4o
python3 -m src.exp2.o1-mini_2 # o1-mini
python3 -m src.exp2.o3-mini_2 # o3-mini
python3 -m src.exp3.gpt-4o_3 # GPT-4o
python3 -m src.exp3.o1-mini_3 # o1-mini
python3 -m src.exp3.o3-mini_3 # o3-mini
python3 -m src.exp4.gpt-4o_4 # GPT-4o
python3 -m src.exp4.o1-mini_4 # o1-mini
python3 -m src.exp4.o3-mini_4 # o3-mini
python3 -m src.exp5.gpt-4o_5 # GPT-4o
python3 -m src.exp5.o1-mini_5 # o1-mini
python3 -m src.exp5.o3-mini_5 # o3-mini
python3 main.py --model_nick USE_MODEL_NICK \
--task_list PROBLEM_TYPES \
--exp_type EXPERIMENT_TYPE \
--save_path YOUR_SAVE_PATH \
--data_path YOUR_DATA_PATH
Alternatively, edit the parameters in predict_vllm.sh and run the script:
sh ./predict_vllm.sh
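For orientation, the flags of the main.py command above could be parsed along the following lines. Only the flag names come from this README; the argument types, the multi-value `--task_list`, the `--exp_type` choices (taken from the shell-script options original, topic, paragraph, cot), and the example values are assumptions, so treat this as a sketch rather than the actual main.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags shown in the main.py command."""
    parser = argparse.ArgumentParser(description="Sketch of the main.py flags")
    parser.add_argument("--model_nick", required=True, help="short model name")
    # Assumption: several problem types may be passed at once.
    parser.add_argument("--task_list", nargs="+", required=True, help="problem types to run")
    # Assumption: experiment types match the shell-script options.
    parser.add_argument("--exp_type", required=True,
                        choices=["original", "topic", "paragraph", "cot"])
    parser.add_argument("--save_path", required=True, help="where to write model answers")
    parser.add_argument("--data_path", required=True, help="location of the JSONL dataset")
    return parser


# Example values are illustrative only.
args = build_parser().parse_args([
    "--model_nick", "Qwen-7b",
    "--task_list", "multiple_choice", "true_false",
    "--exp_type", "cot",
    "--save_path", "result/",
    "--data_path", "dataset/",
])
print(args.model_nick, args.task_list, args.exp_type)
```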
The script processes the dataset specified in config.py and saves results to the corresponding result/ folder (e.g., result/gpt-4o/gpt-4o_result_1_en.json).
| Problem type | Evaluation Metrics |
|---|---|
| Summarization | rouge-1, rouge-2, rouge-l, bert_score |
| Short_answer | exact_match, f1, bert_score, bleurt_score, rouge-1, bleu-1 |
| Multiple_choice | accuracy |
| Multiple_select | accuracy |
| True_false | accuracy |
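To make the simpler metrics in the table concrete, the sketch below shows one common way to compute exact match, token-level F1, and accuracy. It is an illustration under assumptions: the normalization rules and whitespace tokenization are choices made here, and the repository's metrics_calculator.py may normalize or tokenize differently (particularly for Korean).

```python
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def accuracy(predictions, references) -> float:
    """Fraction of items whose prediction equals the reference."""
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


print(exact_match("The Transformer.", "the transformer"))
print(token_f1("the cat sat", "the cat"))
# For multiple_select, comparing answers as sets ignores the order of the chosen options.
print(accuracy([frozenset("AC")], [frozenset("CA")]))
```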
Once the model answers are ready, specify the ground-truth file, the directory containing the model answers, and the output path, then run the evaluation as follows:
# Summarization, Short_answer, Multiple_choice, Multiple_select, True_false
python scripts/eval_all/evaluate_all.py \
--language [ko or en] \
batch \
--ground-truth [ground-truth-path] \
--results-dir [model-answer-path] \
--output [output-path] \
--detailed  # optional: add this flag for a per-category summary
# Multiple_choice, Multiple_select, True_false
python scripts/eval_mc_ms_tf/accuracy_eval.py \
batch \
--ground-truth [ground-truth-path] \
--results-dir [model-answer-path] \
--output [output-path]
# Summarization, Short_answer
python scripts/eval_sum_sa/quality_eval.py \
--language [ko or en] \
batch \
--ground-truth [ground-truth-path] \
--results-dir [model-answer-path] \
--output [output-path]
If you find our work (dataset, code, paper, etc.) useful, please consider citing our paper:
@article{noh2025scholarbench,
title={ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts},
author={Noh, Dongwon and Koh, Donghyeok and Yuk, Junghun and Kim, Gyuwan and Lee, Jaeyong and Lim, Kyungtae and Park, Cheoneum},
journal={arXiv preprint arXiv:2505.16566},
year={2025}
}