
hbnu-kilab/ScholarBench

 
 


ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts

EMNLP Hugging Face Dataset

Dataset URL

https://huggingface.co/datasets/KISTI-KONI/ScholarBench

Overview and Installation

Setup

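conda create -n sb_env python=3.12.9
conda activate sb_env
# -r tells pip to read the package list from the file
pip install -r requirements.txt

# install bleurt (download and unpack the BLEURT-20 checkpoint)
wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip
unzip BLEURT-20.zip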

Overview

  • data : model answers, grouped by experiment number. 1 is the answer generated by giving only the problem to the LLM, 2 by giving the problem and its topic, 3 by giving the problem and its paragraph, 4 by giving the problem and its paragraph with a step-by-step reasoning instruction, and 5 by giving the problem and its category.
    • API (1~5)
      • GPT-4o
      • o1-mini
      • o3-mini
    • Memory (1~4)
      • Bllossom-8b
      • Bllossom-70b
      • Exaone-8b
      • Exaone-32b
      • Exaone-32b-reasoning
      • Gemma2-9b
      • Gemma2-27b
      • Koni-8b
      • llama-8b
      • llama-70b
      • Mistral-8b
      • Mistral-24b
      • Qwen-7b
      • Qwen-32b-reasoning
      • Qwen-72b
      • Trilion-7b
  • eval_scripts
    • eval_all : summarization, short_answer, multiple_choice, multiple_select, true_false
      • config.py
      • data_loader.py
      • evaluation_utils.py
      • metrics_calculator.py
      • main.py
      • quality_eval.py
    • eval_mc_ms_tf : multiple_choice, multiple_select, true_false
      • config.py
      • data_loader.py
      • evaluation_utils.py
      • metrics_calculator.py
      • main.py
      • accuracy_eval.py
    • eval_sum_sa : summarization, short_answer
      • config.py
      • data_loader.py
      • evaluation_utils.py
      • metrics_calculator.py
      • main.py
      • quality_eval.py

Generate model answer: generate_model_answer

  • Experiment 1: Includes multiple-choice, multiple-select, short-answer, true/false, and summarization tasks.

  • Experiment 2: Includes multiple-choice, multiple-select, short-answer, and true/false tasks with a topic field.

  • Experiment 3: Includes multiple-choice, multiple-select, short-answer, and true/false tasks with a paragraph field.

  • Experiment 4: Includes multiple-choice, multiple-select, short-answer, true/false, and summarization tasks with a paragraph field and "Think through step by step" instruction.

  • Experiment 5: Includes multiple-choice, multiple-select, short-answer, true/false, and summarization tasks with a category field (summarization uses paragraph).

  • Memory LLM: Code that uses the vllm package to run inference with the Memory LLMs. Experiments 1-4 can be selected in the shell script (original, topic, paragraph, cot). GPU used: A100 x 4.
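
The experiment types above differ mainly in which extra field accompanies each problem. A minimal sketch of that mapping follows. The record keys ("question", "topic", "paragraph", "category"), the prompt layout, and the function name build_prompt are assumptions made for illustration and are not taken from the repository.

```python
# Hypothetical sketch: the record keys and prompt wording are assumptions,
# not the repository's actual prompt templates.
EXTRA_FIELD = {
    "1": None,         # problem only
    "2": "topic",      # problem + topic
    "3": "paragraph",  # problem + paragraph
    "4": "paragraph",  # problem + paragraph + step-by-step instruction
    "5": "category",   # problem + category
}
COT_INSTRUCTION = "Think through step by step."


def build_prompt(record: dict, experiment_type: str, task: str = "short_answer") -> str:
    """Assemble a prompt for one problem under the given experiment type."""
    field = EXTRA_FIELD[experiment_type]
    # Experiment 5 uses the paragraph instead of the category for summarization.
    if experiment_type == "5" and task == "summarization":
        field = "paragraph"

    parts = []
    if field is not None:
        parts.append(f"{field.capitalize()}: {record[field]}")
    parts.append(f"Question: {record['question']}")
    if experiment_type == "4":
        parts.append(COT_INSTRUCTION)
    return "\n".join(parts)
```

Keeping the mapping in one dictionary makes it easy to compare what each experiment adds to the bare problem.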

Setup environment variables

Create a .env file in the root directory and add your OpenAI API key:

OPENAI_API_KEY=your-api-key
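
For reference, a .env file of this shape can be read with the standard library alone. The sketch below is an assumption about how such a file might be loaded; the repository may well rely on a dedicated package instead, and load_env_file is a hypothetical helper name.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines and export them without overriding existing variables."""
    loaded = {}
    env_path = Path(path)
    if not env_path.is_file():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        os.environ.setdefault(key, value)  # a variable already set in the shell wins
        loaded[key] = value
    return loaded
```

After calling load_env_file(), the key is available as os.environ["OPENAI_API_KEY"], provided the file contains it.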

Prepare the dataset

The dataset is not included in this repository. Download it in JSONL format from the KISTI-KONI Hugging Face dataset page (see Dataset URL above) and place it in the dataset/ directory.
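
Once the files are in place, a quick sanity check is to read one of them back. The sketch below assumes only that each non-empty line is a single JSON object; read_jsonl is a hypothetical helper, and the file name in the usage note is borrowed from the configuration example in this README.

```python
import json
from pathlib import Path


def read_jsonl(path) -> list:
    """Read a JSONL file into a list of records, one per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as error:
                # Report the offending line so a broken download is easy to spot.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from error
    return records
```

For example, len(read_jsonl("dataset/en_eval_dataset.jsonl")) gives the number of problems in the English file, assuming it was saved under that name.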

Configure input and output paths

Open src/config.py and specify the input_file_path and output_file_path for each experiment. Use relative paths based on the project root.

For example:

MODEL_CONFIGS = {
    "gpt-4o_1": {
        "input_file_path": "../../../dataset/en_eval_dataset.jsonl",
        "output_file_path": "../../../result/gpt-4o/gpt-4o_result_1_en.json",
        "model_name": "gpt-4o",
        "experiment_type": "1"
        },
    # Other experiment configurations
    }
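
Before launching a run, it can help to check that an entry has every key shown above and to see where its relative paths actually point. The following is a sketch under that assumption; validate_config and REQUIRED_KEYS are hypothetical names, not part of src/config.py.

```python
from pathlib import Path

# Keys taken from the example entry above.
REQUIRED_KEYS = ("input_file_path", "output_file_path", "model_name", "experiment_type")


def validate_config(name: str, configs: dict, base_dir: str = ".") -> dict:
    """Return a copy of configs[name] with both paths resolved against base_dir."""
    if name not in configs:
        raise KeyError(f"unknown configuration: {name}")
    config = configs[name]
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"{name} is missing keys: {', '.join(missing)}")
    base = Path(base_dir)
    return {
        **config,
        "input_file_path": (base / config["input_file_path"]).resolve(),
        "output_file_path": (base / config["output_file_path"]).resolve(),
    }
```

Printing the resolved input_file_path is a quick way to confirm that the relative path lands inside the dataset/ directory you expect.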

Usage

Run an experiment for a specific model and type from the project root (/home/kilab_ndw/generate_model_answer):

Experiment 1

python3 -m src.exp1.gpt-4o_1   # GPT-4o
python3 -m src.exp1.o1-mini_1  # o1-mini
python3 -m src.exp1.o3-mini_1  # o3-mini

Experiment 2

python3 -m src.exp2.gpt-4o_2   # GPT-4o
python3 -m src.exp2.o1-mini_2  # o1-mini
python3 -m src.exp2.o3-mini_2  # o3-mini

Experiment 3

python3 -m src.exp3.gpt-4o_3   # GPT-4o
python3 -m src.exp3.o1-mini_3  # o1-mini
python3 -m src.exp3.o3-mini_3  # o3-mini

Experiment 4

python3 -m src.exp4.gpt-4o_4   # GPT-4o
python3 -m src.exp4.o1-mini_4  # o1-mini
python3 -m src.exp4.o3-mini_4  # o3-mini

Experiment 5

python3 -m src.exp5.gpt-4o_5   # GPT-4o
python3 -m src.exp5.o1-mini_5  # o1-mini
python3 -m src.exp5.o3-mini_5  # o3-mini

Memory LLM

python3 main.py --model_nick USE_MODEL_NICK \
                --task_list PROBLEM_TYPES \
                --exp_type EXPERIMENT_TYPE \
                --save_path YOUR_SAVE_PATH \
                --data_path YOUR_DATA_PATH

Alternatively, edit the parameters in predict_vllm.sh and run the script:

sh ./predict_vllm.sh

The script processes the dataset specified in config.py and saves results to the corresponding result/ folder (e.g., result/gpt-4o/gpt-4o_result_1_en.json).

Evaluation : Eval_scripts

Problem type    | Evaluation Metrics
----------------|------------------------------------------------------------
Summarization   | rouge-1, rouge-2, rouge-l, bert_score
Short_answer    | exact_match, f1, bert_score, bluert_score, rouge-1, bleu-1
Multiple_choice | accuracy
Multiple_select | accuracy
True_false      | accuracy
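
To make three of the metrics above concrete, here is a sketch of accuracy, exact_match, and token-level f1 using their common textbook definitions. The normalization (lowercasing and whitespace tokenization) is an assumption, and the repository's metrics_calculator.py may compute these differently, particularly for Korean text.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, minimal normalization)."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, otherwise 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over whitespace tokens."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def accuracy(predictions: list, references: list) -> float:
    """Fraction of predictions that equal their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        return 0.0
    return sum(p == r for p, r in zip(predictions, references)) / len(references)
```

For instance, token_f1("the cat sat", "the cat") has precision 2/3 and recall 1, which gives an F1 of 0.8.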

Run evaluation

Once the model answers are ready, specify the ground-truth (correct answer) file, the directory containing the model answers, and the path where the results should be stored, then run one of the following commands.

# Summarization, Short_answer, Multiple_choice, Multiple_select, True_false
python scripts/eval_all/evaluate_all.py \
    --language [ko or en] \
    batch \
    --ground-truth [ground-truth-path] \
    --results-dir [model-answer-path] \
    --output [output-path] \
    --detailed  # optional: add this flag if you want a categorical summary

# Multiple_choice, Multiple_select, True_false
python scripts/eval_mc_ms_tf/accuracy_eval.py \
    batch \
    --ground-truth [ground-truth-path] \
    --results-dir [model-answer-path] \
    --output [output-path]

# Summarization, Short_answer
python scripts/eval_sum_sa/quality_eval.py \
    --language [ko or en] \
    batch \
    --ground-truth [ground-truth-path] \
    --results-dir [model-answer-path] \
    --output [output-path]

Citation

If you find our work (dataset, code, paper, etc.) useful, please consider citing our paper:

@article{noh2025scholarbench,
  title={ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts},
  author={Noh, Dongwon and Koh, Donghyeok and Yuk, Junghun and Kim, Gyuwan and Lee, Jaeyong and Lim, Kyungtae and Park, Cheoneum},
  journal={arXiv preprint arXiv:2505.16566},
  year={2025}
}
