The dataset is available on Hugging Face in two versions:
This repository contains the scripts used for generating and evaluating the conclusions:
generate.py: Script to generate conclusions from abstract inputs using various LLMs via API.
evaluate.py: Script to execute evaluation metrics (Rule-based scores like ROUGE/BLEU, Perplexity, and LLM-as-a-judge).
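To give a feel for what a rule-based score measures, here is a minimal sketch of a ROUGE-1 style unigram-overlap F1 in plain Python. It is an illustration only and is not taken from evaluate.py: the function name `rouge1_f1`, the whitespace tokenization, and the lowercasing are all assumptions made for this example, and the repository's script may rely on a dedicated library with stemming or different tokenization.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a generated conclusion.

    Hypothetical helper for illustration; not the repository's implementation.
    """
    # Assumption: simple lowercase + whitespace tokenization.
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0

    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "the treatment reduced mortality in adults"
    candidate = "the treatment reduced symptoms in children"
    print(f"ROUGE-1 F1: {rouge1_f1(reference, candidate):.3f}")
```

Scores of this kind reward lexical overlap only, which is presumably why the evaluation also includes perplexity and LLM-as-a-judge.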
If you find this work useful, please cite:
@article{li2026medconclusion,
title={MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts},
author={Li, Weiyue and Qian, Ruizhi and Li, Yi and Li, Yongce and Long, Yunfan and Cai, Jiahui and Luo, Yan and Wang, Mengyu},
journal={arXiv preprint arXiv:2604.06505},
year={2026}
}