ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts

Dataset URL

https://huggingface.co/datasets/KISTI-KONI/ScholarBench

Overview and Installation

Setup

conda create -n sb_env python=3.12.9
conda activate sb_env
pip install -r requirements.txt
 
# install bleurt
wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip
unzip BLEURT-20.zip

Overview

data : model answers, numbered by experiment. 1 is the answer generated by giving only the problem to the LLM; 2, the problem and its topic; 3, the problem and its paragraph; 4, the problem and its paragraph with a step-by-step reasoning instruction; 5, the problem and its category.
- API (1~5)
  - GPT-4o
  - o1-mini
  - o3-mini
- Memory (1~4)
  - Bllossom-8b
  - Bllossom-70b
  - Exaone-8b
  - Exaone-32b
  - Exaone-32b-reasoning
  - Gemma2-9b
  - Gemma2-27b
  - Koni-8b
  - llama-8b
  - llama-70b
  - Mistral-8b
  - Mistral-24b
  - Qwen-7b
  - Qwen-32b-reasoning
  - Qwen-72b
  - Trilion-7b
eval_scripts
- eval_all : summarization, short_answer, multiple_choice, multiple_select, true_false
  - config.py
  - data_loader.py
  - evaluation_utils.py
  - metrics_calculator.py
  - main.py
  - quality_eval.py
- eval_mc_ms_tf : multiple_choice, multiple_select, true_false
  - config.py
  - data_loader.py
  - evaluation_utils.py
  - metrics_calculator.py
  - main.py
  - accuracy_eval.py
- eval_sum_sa : summarization, short_answer
  - config.py
  - data_loader.py
  - evaluation_utils.py
  - metrics_calculator.py
  - main.py
  - quality_eval.py

Generate model answer: generate_model_answer

Experiment 1: Includes multiple-choice, multiple-select, short-answer, true/false, and summarization tasks.
Experiment 2: Includes multiple-choice, multiple-select, short-answer, and true/false tasks with a topic field.
Experiment 3: Includes multiple-choice, multiple-select, short-answer, and true/false tasks with a paragraph field.
Experiment 4: Includes multiple-choice, multiple-select, short-answer, true/false, and summarization tasks with a paragraph field and "Think through step by step" instruction.
Experiment 5: Includes multiple-choice, multiple-select, short-answer, true/false, and summarization tasks with a category field (summarization uses paragraph).
Memory LLM: Inference code for the Memory LLM models, built on the vllm package. Experiments 1-4 can be selected in the shell script (original, topic, paragraph, cot). GPUs used: A100 x 4.
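The difference between the experiments comes down to which extra field accompanies each problem. The following is a minimal sketch of that idea, not the repository's actual prompt code: the function build_prompt, the item keys ("problem", "topic", "paragraph", "category"), and the prompt layout are all assumptions made for illustration. For brevity it ignores the special case where Experiment 5 summarization uses the paragraph.

```python
COT_INSTRUCTION = "Think through step by step"

# Extra field attached to the problem in each experiment (assumed key names).
EXTRA_FIELD = {
    "1": None,
    "2": "topic",
    "3": "paragraph",
    "4": "paragraph",
    "5": "category",
}


def build_prompt(item, exp_type):
    """Assemble an illustrative prompt for one dataset item.

    Hypothetical helper: the real prompts used in the experiments may differ.
    """
    if exp_type not in EXTRA_FIELD:
        raise ValueError(f"unknown experiment type: {exp_type!r}")
    parts = []
    field = EXTRA_FIELD[exp_type]
    if field is not None:
        parts.append(f"{field.capitalize()}: {item[field]}")
    parts.append(f"Problem: {item['problem']}")
    # Experiment 4 adds the chain-of-thought instruction.
    if exp_type == "4":
        parts.append(COT_INSTRUCTION)
    return "\n".join(parts)
```

Keeping the mapping in one table makes it easy to see that Experiments 3 and 4 differ only in the added instruction.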

Setup environment variables

Create a .env file in the root directory and add your OpenAI API key:

OPENAI_API_KEY=your-api-key
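As a rough illustration of how the key could be picked up at runtime, the sketch below parses plain KEY=value lines and exports them to the process environment. The helper load_env_file is hypothetical and not part of the repository, which may instead rely on a package such as python-dotenv.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=value lines and export them to os.environ.

    Hypothetical helper for illustration; not part of the repository.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # Values already set in the environment take precedence.
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded
```

After calling load_env_file(), the key is available as os.environ["OPENAI_API_KEY"].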

Prepare the dataset

The dataset is not included in this repository. Download it in JSONL format from the KISTI Hugging Face page (see Dataset URL above) and place it in the dataset/ directory.
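To check that the downloaded file is readable, a small JSONL reader along these lines may help. It is a sketch only: read_jsonl is a hypothetical name, and the repository's own data_loader.py may work differently.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return one dict per non-empty line of a JSONL file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the offending line so a bad download is easy to spot.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
    return records
```

For example, read_jsonl("dataset/en_eval_dataset.jsonl") would return the English evaluation items as a list of dictionaries, assuming the file was saved under that name.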

Configure input and output paths

Open src/config.py and specify the input_file_path and output_file_path for each experiment. Use relative paths based on the project root.

For example:

MODEL_CONFIGS = {
    "gpt-4o_1": {
        "input_file_path": "../../../dataset/en_eval_dataset.jsonl",
        "output_file_path": "../../../result/gpt-4o/gpt-4o_result_1_en.json",
        "model_name": "gpt-4o",
        "experiment_type": "1"
    },
    # Other experiment configurations
}
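A configuration entry can be validated before a long run starts, so that a missing path fails early. The sketch below assumes the four keys shown in the example are required; get_config and REQUIRED_KEYS are hypothetical names, not part of src/config.py.

```python
REQUIRED_KEYS = (
    "input_file_path",
    "output_file_path",
    "model_name",
    "experiment_type",
)

# Same entry as in the example above.
MODEL_CONFIGS = {
    "gpt-4o_1": {
        "input_file_path": "../../../dataset/en_eval_dataset.jsonl",
        "output_file_path": "../../../result/gpt-4o/gpt-4o_result_1_en.json",
        "model_name": "gpt-4o",
        "experiment_type": "1",
    },
}


def get_config(name, configs=MODEL_CONFIGS):
    """Look up one configuration and check that every required key is set."""
    if name not in configs:
        raise KeyError(f"unknown configuration: {name}")
    config = configs[name]
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"{name} is missing keys: {', '.join(missing)}")
    return config
```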

Usage

Run an experiment for a specific model and experiment type from the project root (for example, /home/kilab_ndw/generate_model_answer):

Experiment 1

python3 -m src.exp1.gpt-4o_1   # GPT-4o
python3 -m src.exp1.o1-mini_1  # o1-mini
python3 -m src.exp1.o3-mini_1  # o3-mini

Experiment 2

python3 -m src.exp2.gpt-4o_2   # GPT-4o
python3 -m src.exp2.o1-mini_2  # o1-mini
python3 -m src.exp2.o3-mini_2  # o3-mini

Experiment 3

python3 -m src.exp3.gpt-4o_3   # GPT-4o
python3 -m src.exp3.o1-mini_3  # o1-mini
python3 -m src.exp3.o3-mini_3  # o3-mini

Experiment 4

python3 -m src.exp4.gpt-4o_4   # GPT-4o
python3 -m src.exp4.o1-mini_4  # o1-mini
python3 -m src.exp4.o3-mini_4  # o3-mini

Experiment 5

python3 -m src.exp5.gpt-4o_5   # GPT-4o
python3 -m src.exp5.o1-mini_5  # o1-mini
python3 -m src.exp5.o3-mini_5  # o3-mini
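The commands above follow a regular naming pattern, src.exp<N>.<model>_<N>, so they can be generated rather than typed one by one. The sketch below is an illustration under that assumption; module_name and run_all are hypothetical helpers, not part of the repository.

```python
import subprocess
import sys

API_MODELS = ("gpt-4o", "o1-mini", "o3-mini")


def module_name(model, experiment):
    """Build the module path used by the commands above."""
    if model not in API_MODELS:
        raise ValueError(f"unknown model: {model}")
    if experiment not in range(1, 6):
        raise ValueError(f"experiment must be 1-5, got {experiment!r}")
    return f"src.exp{experiment}.{model}_{experiment}"


def run_all(dry_run=True):
    """Print (or, with dry_run=False, execute) every model/experiment pair."""
    commands = [
        [sys.executable, "-m", module_name(model, experiment)]
        for experiment in range(1, 6)
        for model in API_MODELS
    ]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # Must be run from the project root, as with the manual commands.
            subprocess.run(command, check=True)
    return commands
```

The dry-run default only prints the commands, which avoids spending API credits by accident.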

Memory LLM

python3 main.py --model_nick USE_MODEL_NICK \
                --task_list PROBLEM_TYPES \
                --exp_type EXPERIMENT_TYPE \
                --save_path YOUR_SAVE_PATH \
                --data_path YOUR_DATA_PATH

Alternatively, edit the parameters in predict_vllm.sh and run the script:

sh ./predict_vllm.sh
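For orientation, the flags accepted by main.py could be declared with argparse roughly as follows. This is a sketch, not the repository's parser: the flag names come from the command above and the experiment names from the Memory LLM description, while the types, defaults, and the choice to accept several task names are assumptions.

```python
import argparse

# Experiment names listed for the shell script.
EXP_TYPES = ("original", "topic", "paragraph", "cot")


def build_parser():
    """Declare the command-line flags shown in the main.py invocation."""
    parser = argparse.ArgumentParser(description="Memory LLM inference (sketch)")
    parser.add_argument("--model_nick", required=True, help="model nickname")
    # Assumption: one or more problem types separated by spaces.
    parser.add_argument("--task_list", nargs="+", required=True,
                        help="problem types to run")
    parser.add_argument("--exp_type", choices=EXP_TYPES, default="original",
                        help="experiment variant")
    parser.add_argument("--save_path", required=True, help="output location")
    parser.add_argument("--data_path", required=True, help="input dataset")
    return parser
```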

The script processes the dataset specified in config.py and saves results to the corresponding result/ folder (e.g., result/gpt-4o/gpt-4o_result_1_en.json).

Evaluation : Eval_scripts

Problem type	Evaluation Metrics
Summarization	rouge-1, rouge-2, rouge-l, bert_score
Short_answer	exact_match, f1, bert_score, bluert_score, rouge-1, bleu-1
Multiple_choice	accuracy
Multiple_select	accuracy
True_false	accuracy
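To make the simpler metrics concrete, here is a minimal sketch of accuracy, exact match, and token-level F1. It is not the code in metrics_calculator.py: the normalization (lowercasing and whitespace tokenization) is an assumption, and whitespace tokenization in particular is a poor fit for Korean text.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(text.lower().split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def accuracy(predictions, references):
    """Fraction of predictions equal to their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)
```

For instance, the prediction "the cat sat" against the reference "the cat" shares two tokens, giving precision 2/3, recall 1, and an F1 of 0.8.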

Run evaluation

Once the model answers are ready, specify the ground-truth (correct answer) file, the directory containing the model answers, and the output path for the evaluation results, then run one of the following:

# Summarization, Short_answer, Multiple_choice, Multiple_select, True_false
python scripts/eval_all/evaluate_all.py \
    --language [ko or en] \
    batch \
    --ground-truth [ground-truth-path] \
    --results-dir [model-answer-path] \
    --output [output-path] \
    --detailed  # optional: add this flag if you want a categorical summary

# Multiple_choice, Multiple_select, True_false
python scripts/eval_mc_ms_tf/accuracy_eval.py \
    batch \
    --ground-truth [ground-truth-path] \
    --results-dir [model-answer-path] \
    --output [output-path]

# Summarization, Short_answer
python scripts/eval_sum_sa/quality_eval.py \
    --language [ko or en] \
    batch \
    --ground-truth [ground-truth-path] \
    --results-dir [model-answer-path] \
    --output [output-path]

Citation

If you find our work (dataset, code, paper, etc.) useful, please consider citing our paper:

@article{noh2025scholarbench,
  title={ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts},
  author={Noh, Dongwon and Koh, Donghyeok and Yuk, Junghun and Kim, Gyuwan and Lee, Jaeyong and Lim, Kyungtae and Park, Cheoneum},
  journal={arXiv preprint arXiv:2505.16566},
  year={2025}
}
