🎉 News: Our paper has been accepted at EMNLP 2025 Findings!
Note: For the Chinese version of this README, please refer to README_zh.md.
LLMEval-Med provides a comprehensive, physician-validated benchmark for evaluating Large Language Models (LLMs) on real-world clinical tasks. The dataset covers a wide range of medical scenarios and is designed to facilitate rigorous, standardized assessment of medical LLMs. For details on the benchmark design, evaluation protocol, and baseline results, please refer to our paper. The dataset is also available on Hugging Face.
.
├── dataset/
│ └── dataset.json # Medical domain evaluation dataset
├── evaluate/
│ ├── Answer.py # Script for getting model responses
│ └── Evaluate.py # Script for evaluating model responses
The dataset/dataset.json file contains a test set of 667 medical questions, organized by different categories:
- Medical Knowledge
- Medical Language Understanding
- Medical Reasoning
- Medical Ethics and Safety
- Medical Text Generation
Each question in the test set is a JSON object with the following fields:
- category1: Primary category of the question (e.g., "Medical Knowledge").
- category2: Secondary category, providing more specific grouping.
- scene: Scenario or context for the question.
- round: Round number, used for multi-turn conversations (1 for single-turn).
- problem: The medical question or prompt presented to the model.
- groupCode: Group identifier for the question.
- sanswer: The standard (reference) answer provided by medical experts.
- difficulty: Difficulty level.
- checklist: Key points or criteria for evaluation, ensuring the answer covers essential aspects.
Note:
The scoring prompts for each category (e.g., Medical Knowledge, Medical Language Understanding, Medical Reasoning, Medical Ethics and Safety, Medical Text Generation) are defined directly in evaluate/Evaluate.py.
Each prompt is carefully designed to guide the evaluation process and ensure consistency across different types of questions.
Example:
{
"category1": "Medical Knowledge",
"category2": "Basic Medical Knowledge/Medical Exam",
"scene": "Basic Medical Knowledge/Medical Exam_Traditional Chinese Medicine",
"round": 1,
"problem": "Why is β-OH anthraquinone more acidic than α-OH anthraquinone?",
"groupCode": 5,
"sanswer": "The stronger acidity of β-OH anthraquinone compared to α-OH anthraquinone is mainly due to resonance effects, hydrogen bonding, and steric hindrance...",
"difficulty": "Medium",
"checklist": "Core requirements:\n1. Explain the enhanced resonance effect, reduced hydrogen bonding, and steric hindrance for β-OH anthraquinone acidity.\n2. Detail how the β-OH position stabilizes the anion via resonance, and how the α-OH position's intramolecular hydrogen bond reduces acidity.\n\nSecondary requirements:\n1. Emphasize the role of the conjugated system and electron-withdrawing effects."
}
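For reference, here is a minimal sketch of loading the dataset and grouping multi-turn questions. It assumes dataset.json is a JSON array of question objects and that the turns of one conversation share a groupCode and are ordered by round; check your copy of the file if it differs.

```python
import json
from collections import defaultdict

# Load the benchmark questions (path relative to the repository root).
with open("dataset/dataset.json", encoding="utf-8") as f:
    questions = json.load(f)   # assumed to be a list of question objects

print(f"Loaded {len(questions)} questions")

# Assumption: turns of a multi-turn conversation share a groupCode within a
# category and are ordered by their round field.
conversations = defaultdict(list)
for q in questions:
    conversations[(q["category1"], q["groupCode"])].append(q)

for turns in conversations.values():
    turns.sort(key=lambda q: q["round"])
```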
Use evaluate/Answer.py to get responses from your LLM:

python evaluate/Answer.py

Key configurations in Answer.py:
- Set your model path in model_name
- Configure GPU settings in CUDA_VISIBLE_DEVICES
- Adjust output paths in inputs_dir and outputs_dir
The script will:
- Load questions from dataset/dataset.json
- Generate responses using the specified model
- Save results in JSON format
- Handle multi-turn conversations using conversation history
- Manage GPU memory efficiently
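As a rough illustration of this flow (not the exact logic in Answer.py), the sketch below queries a local Hugging Face model over the dataset. The model path, output path, and the model_answer field name are placeholders, and it assumes the tokenizer provides a chat template and that turns of a conversation appear consecutively in the file starting at round 1.

```python
import json
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")   # adjust for your GPUs

model_name = "/path/to/your/model"        # placeholder: set your model path
inputs_dir = "dataset/dataset.json"
outputs_dir = "outputs/responses.json"    # placeholder output path

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"   # device_map needs `accelerate`
)

with open(inputs_dir, encoding="utf-8") as f:
    questions = json.load(f)

results, history = [], []
for q in questions:
    if q["round"] == 1:
        history = []                       # start a new conversation
    history.append({"role": "user", "content": q["problem"]})
    prompt = tokenizer.apply_chat_template(
        history, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=1024)
    answer = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    history.append({"role": "assistant", "content": answer})
    results.append({**q, "model_answer": answer})   # "model_answer" is a placeholder key

os.makedirs(os.path.dirname(outputs_dir), exist_ok=True)
with open(outputs_dir, "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)
```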
Use evaluate/Evaluate.py to assess model responses:
python evaluate/Evaluate.py

Key configurations in Evaluate.py:
- Set your OpenAI API configuration:
client = OpenAI( base_url='<Your API URL>', api_key='<Your API Key>' )
- Configure input/output paths in inputs_dir and outputs_dir
The evaluation process:
- Loads model responses
- Evaluates each response using GPT-4
- Assigns scores on a 5-point scale
- Provides detailed feedback for each response
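A minimal sketch of the judging step, for orientation only: the real category-specific scoring prompts live in evaluate/Evaluate.py, and the prompt wording, response-file path, and model_answer field below are assumptions.

```python
import json
import re
from openai import OpenAI

client = OpenAI(base_url='<Your API URL>', api_key='<Your API Key>')

with open("outputs/responses.json", encoding="utf-8") as f:   # placeholder path
    responses = json.load(f)

def judge(item):
    # Hypothetical judging prompt; the real prompts in Evaluate.py are category-specific.
    prompt = (
        f"Question: {item['problem']}\n"
        f"Reference answer: {item['sanswer']}\n"
        f"Checklist: {item['checklist']}\n"
        f"Model answer: {item['model_answer']}\n\n"
        "Score the model answer from 1 to 5 against the checklist and reference answer, "
        "then explain your reasoning. Begin your reply with 'Score: <number>'."
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    match = re.search(r"Score:\s*([1-5])", reply)
    return (int(match.group(1)) if match else None), reply

for item in responses:
    item["score"], item["feedback"] = judge(item)
```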
Responses are scored on a 5-point scale based on:
- Accuracy (5 points)
  - Perfect alignment with reference answer
  - Meets both core and secondary requirements
  - No medical safety risks
- Good (4 points)
  - Correct core information
  - Minor non-critical errors
  - No medical safety risks
- Fair (3 points)
  - Some key information misalignment
  - Partial fulfillment of core requirements
  - No medical safety risks
- Poor (2 points)
  - Significant information errors
  - Medical safety concerns
  - Failure to meet core requirements
- Unacceptable (1 point)
  - Major factual errors
  - Serious medical safety risks
  - Complete misalignment with requirements
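To summarize results against this rubric, one straightforward option (not necessarily how Evaluate.py reports results) is to average the 1-5 scores per primary category:

```python
from collections import defaultdict

def average_by_category(scored_items):
    """Average 1-5 judge scores per primary category (category1)."""
    buckets = defaultdict(list)
    for item in scored_items:
        if item.get("score") is not None:    # skip items the judge failed to score
            buckets[item["category1"]].append(item["score"])
    return {cat: sum(scores) / len(scores) for cat, scores in buckets.items()}
```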
- GPU Requirements:
  - The system supports multi-GPU configurations
  - Automatically selects the GPU with the most available memory
  - Implements memory management for efficient processing
- API Keys:
  - Requires OpenAI API access for evaluation
  - Configure API keys and endpoints in Evaluate.py
- Data Processing:
  - Supports batch processing
  - Handles conversation history for multi-turn scenarios
  - Implements robust error handling
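For the GPU note above, selecting the device with the most free memory could be done roughly as follows (a sketch assuming PyTorch with at least one CUDA device, not the repository's exact code):

```python
import torch

def pick_freest_gpu() -> int:
    """Return the index of the visible CUDA device with the most free memory."""
    assert torch.cuda.is_available(), "this sketch assumes at least one CUDA device"
    free_per_device = []
    for i in range(torch.cuda.device_count()):
        free_bytes, _total_bytes = torch.cuda.mem_get_info(i)
        free_per_device.append((free_bytes, i))
    return max(free_per_device)[1]

device = torch.device(f"cuda:{pick_freest_gpu()}")
```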
Contributions are welcome! Please feel free to submit issues and pull requests.
For questions or suggestions, please:
- Open an issue on GitHub
- Contact the project maintainers:
  - Ming Zhang: mingzhang23@m.fudan.edu.cn
  - Yujiong Shen: shenyj22@m.fudan.edu.cn
If you find this benchmark useful, please cite our paper:
@inproceedings{zhang-etal-2025-llmeval,
title = "{LLME}val-{M}ed: A Real-world Clinical Benchmark for Medical {LLM}s with Physician Validation",
author = "Zhang, Ming and
Shen, Yujiong and
Li, Zelin and
Sha, Huayu and
Hu, Binze and
Wang, Yuhui and
Huang, Chenhao and
Liu, Shichun and
Tong, Jingqi and
Jiang, Changhao and
Chai, Mingxu and
Xi, Zhiheng and
Dou, Shihan and
Gui, Tao and
Zhang, Qi and
Huang, Xuanjing",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.263/",
doi = "10.18653/v1/2025.findings-emnlp.263",
pages = "4888--4914",
ISBN = "979-8-89176-335-7",
abstract = "Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. Current medical benchmarks have three main types: medical exam-based, comprehensive medical, and specialized assessments. However, these benchmarks have limitations in question design (mostly multiple-choice), data sources (often not derived from real clinical scenarios), and evaluation methods (poor assessment of complex reasoning). To address these issues, we present LLMEval-Medicine, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios. We also design an automated evaluation pipeline, incorporating expert-developed checklists into our LLM-as-Judge framework. Furthermore, our methodology validates machine scoring through human-machine agreement analysis, dynamically refining checklists and prompts based on expert feedback to ensure reliability. We evaluate 13 LLMs across three categories (specialized medical models, open-source models, and closed-source models) on LLMEval-Med, providing valuable insights for the safe and effective deployment of LLMs in medical domains."
}