ll0ruc/AutoMIR

AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels

arXiv | Hugging Face Dataset | License: MIT

Lei Li, Xiangxu Zhang, Xiao Zhou, Zheng Liu

Gaoling School of Artificial Intelligence, Renmin University of China

Beijing Academy of Artificial Intelligence

🔭 Overview

AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels (accepted to Findings of EMNLP 2025)

In this work, we propose Self-Learning Hypothetical Document Embeddings (SL-HyDE) for zero-shot medical information retrieval, eliminating the need for relevance-labeled data.

We also develop a comprehensive Chinese Medical Information Retrieval Benchmark (CMIRB) and evaluate the performance of various text embedding models on it.

AutoMIR

⚙️ Installation

Note that the code in this repo has been tested only on Linux. We have not verified whether it works on other operating systems.

  1. Clone this repository:

    git clone https://github.com/ll0ruc/AutoMIR.git
    cd AutoMIR

  2. Create and activate the conda environment:

    conda create -n automir python=3.10
    conda activate automir
    pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
    pip install beir==2.0.0
    pip install mteb==1.1.1
    pip install deepspeed==0.15.1
    pip install peft==0.12.0
    pip install transformers==4.44.2
    pip install sentence-transformers==3.1.1
    pip install datasets==2.21.0
    pip install vllm==0.5.4

💾 Datasets Preparation

CMIRB Description:

CMIRB (Chinese Medical Information Retrieval Benchmark) is a specialized multi-task dataset designed specifically for medical information retrieval.

It comprises data collected from a range of online medical websites, covering 5 tasks and 10 datasets that reflect practical application scenarios.

CMIRB

Datasets

The data preprocessing steps are described in data_collection_and_processing.

An overview of the datasets available in CMIRB is provided in the following table:

| Name | Hub URL | Description | Query #Samples | Doc #Samples |
|---|---|---|---|---|
| MedExamRetrieval | CMIRB/MedExamRetrieval | Medical multiple-choice exam | 697 | 27,871 |
| DuBaikeRetrieval | CMIRB/DuBaikeRetrieval | Medical search query from Baidu Search | 318 | 56,441 |
| DXYDiseaseRetrieval | CMIRB/DXYDiseaseRetrieval | Disease question from medical website | 1,255 | 54,021 |
| MedicalRetrieval | CMIRB/MedicalRetrieval | Passage retrieval dataset collected from Alibaba search engine systems in medical domain | 1,000 | 100,999 |
| CmedqaRetrieval | CMIRB/CmedqaRetrieval | Online medical consultation text | 3,999 | 100,001 |
| DXYConsultRetrieval | CMIRB/DXYConsultRetrieval | Online medical consultation text | 943 | 12,577 |
| CovidRetrieval | CMIRB/CovidRetrieval | COVID-19 news articles | 949 | 100,001 |
| IIYiPostRetrieval | CMIRB/IIYiPostRetrieval | Medical post articles | 789 | 27,570 |
| CSLCiteRetrieval | CMIRB/CSLCiteRetrieval | Medical literature citation prediction | 573 | 36,703 |
| CSLRelatedRetrieval | CMIRB/CSLRelatedRetrieval | Medical similar literature | 439 | 36,758 |

Download the CMIRB dataset:

  • CMIRB: HF Datasets

    Place all zip files under ./AutoMIR/dataset and extract them.

Data Structure:

For each dataset, the data is expected in the following structure:

${DATASET_ROOT}      # Dataset root directory, e.g., ./dataset/MedExamRetrieval
├── query.jsonl      # Query file
├── corpus.jsonl     # Document file
└── qrels.txt        # Relevance label file
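
The three files above could be read with a small loader like the one below. This is a minimal sketch, not the loader used by this repo: the function names are hypothetical, the JSONL records are returned as plain dicts because their field names are not documented here, and the qrels parser assumes whitespace-separated columns with the query id first and the document id and score last. Please check the actual files before relying on it.

```python
import json
from collections import defaultdict
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def read_qrels(path):
    """Parse relevance labels into {query_id: {doc_id: score}}.

    ASSUMPTION: each line is whitespace-separated, with the query id in the
    first column and the document id and score in the last two columns
    (this covers both 3-column and 4-column TREC-style layouts).
    """
    qrels = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue  # blank or malformed line
            query_id, doc_id, score = parts[0], parts[-2], parts[-1]
            try:
                qrels[query_id][doc_id] = int(float(score))
            except ValueError:
                continue  # non-numeric score, e.g. a header row
    return dict(qrels)


def load_dataset_dir(root):
    """Load (queries, corpus, qrels) from one dataset directory."""
    root = Path(root)
    return (
        read_jsonl(root / "query.jsonl"),
        read_jsonl(root / "corpus.jsonl"),
        read_qrels(root / "qrels.txt"),
    )
```

For example, `queries, corpus, qrels = load_dataset_dir("./dataset/MedExamRetrieval")` would return two lists of records and a nested dict of relevance labels, provided the files follow the assumed layout.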

🤖 Training

Download the medical corpus from huatuo_encyclopedia_qa

1.0 Generate queries from the corpus

python gen_Data.gen_Query_data.py --corpus_path "./train_data/corpus.jsonl" --llm_name Qwen-32b

You will get the query.jsonl file in the train_data folder, which contains the generated queries for each document in the corpus.

1.1 Generate training data for LLM

python gen_Data.gen_LLM_data.py --query_path "./train_data/query.jsonl" --llm_name_gen_llm qwen

You will get the llm_train_data.jsonl file in the train_data/qwen folder, which contains the generated training data for the LLM.

1.2 Fine-tuning LLM as Generator

bash run train_llm.sh

You will get the fine-tuned LLM in the outputs/qwen folder, which can be used as a generator for producing rewritten queries.

2.1 Generate training data for Retriever

python gen_Data.gen_EMB_data.py --llm_name qwen

You will get the emb_train_data.jsonl file in the train_data/qwen folder, which contains the generated training data for the retriever.

2.2 Fine-tuning Retriever

bash run train_emb.sh

You will get the fine-tuned retriever model in the outputs/qwen folder, which can be used for retrieving relevant documents based on the generated queries.

💽 Evaluate

We evaluate 10+ representative retrieval models of diverse sizes and architectures. Run the following command to get results:

cd ./src
python evaluate.py --retrieval_name bge-FT --llm_name qwen

  • `--retriever_name`: the retrieval model to evaluate.
  • `--llm_name`: the generator to evaluate.
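
To illustrate how a generator and a retriever combine at query time, here is a minimal sketch of basic HyDE-style retrieval: the generator writes a hypothetical document for the query, and that text (rather than the raw query) is embedded and compared against the corpus. This is not the code in evaluate.py. `toy_embed`, `fake_generator`, and `hyde_search` are hypothetical stand-ins; in practice the embedder would be the fine-tuned retriever and the generator would be the fine-tuned LLM.

```python
import math
import zlib


def toy_embed(text, dim=256):
    """Stand-in encoder (ASSUMPTION): L2-normalised hashed character-bigram counts."""
    vec = [0.0] * dim
    for i in range(len(text) - 1):
        bucket = zlib.crc32(text[i:i + 2].encode("utf-8")) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two already-normalised vectors (a dot product)."""
    return sum(x * y for x, y in zip(a, b))


def hyde_search(query, corpus, generate, embed=toy_embed, top_k=3):
    """Rank documents by similarity to a generated hypothetical document."""
    hypothetical_doc = generate(query)   # the LLM call would happen here
    query_vec = embed(hypothetical_doc)  # embed the generated text, not the raw query
    scored = [(doc_id, cosine(query_vec, embed(text))) for doc_id, text in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def fake_generator(query):
    """Placeholder for the LLM generator (ASSUMPTION): returns a canned passage."""
    return query + " Common causes include tension, dehydration and lack of sleep."


if __name__ == "__main__":
    corpus = {
        "d1": "Headaches are often caused by tension, dehydration or lack of sleep.",
        "d2": "A fractured wrist is usually treated with a cast for several weeks.",
    }
    for doc_id, score in hyde_search("Why do I get headaches?", corpus, fake_generator):
        print(doc_id, round(score, 3))
```

Swapping `embed` for a real sentence encoder and `generate` for a real LLM call keeps the same control flow, and only those two callables change.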

🏆 Leaderboard

Information Retrieval

| Model | Dim. | Avg. | MedExam | DuBaike | DXYDisease | Medical | Cmedqa | DXYConsult | Covid | IIYiPost | CSLCite | CSLRel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| text2vec-large-zh | 1024 | 30.56 | 41.39 | 21.13 | 41.52 | 30.93 | 15.53 | 21.92 | 60.48 | 29.47 | 20.21 | 23.01 |
| mcontriever (msmarco) | 768 | 35.20 | 51.5 | 22.25 | 44.34 | 38.5 | 22.71 | 20.04 | 56.01 | 28.11 | 34.59 | 33.95 |
| bm25 | - | 35.35 | 31.95 | 17.89 | 40.12 | 29.33 | 6.83 | 17.78 | 78.9 | 66.95 | 33.74 | 29.97 |
| text-embedding-ada-002 | - | 42.55 | 53.48 | 43.12 | 58.72 | 37.92 | 22.36 | 27.69 | 57.21 | 48.6 | 32.97 | 43.4 |
| m3e-large | 768 | 45.25 | 33.29 | 46.48 | 62.57 | 48.66 | 30.73 | 41.05 | 61.33 | 45.03 | 35.79 | 47.54 |
| multilingual-e5-large | 1024 | 52.08 | 53.96 | 53.27 | 72.1 | 51.47 | 28.67 | 41.35 | 75.54 | 63.86 | 42.65 | 37.94 |
| piccolo-large-zh | 1024 | 54.75 | 43.11 | 45.91 | 70.69 | 59.04 | 41.99 | 47.35 | 85.04 | 65.89 | 44.31 | 44.21 |
| gte-large-zh | 1024 | 55.40 | 41.22 | 42.66 | 70.59 | 62.88 | 43.15 | 46.3 | 88.41 | 63.02 | 46.4 | 49.32 |
| bge-large-zh-v1.5 | 1024 | 55.40 | 58.61 | 44.26 | 71.71 | 59.6 | 42.57 | 47.73 | 73.33 | 67.13 | 43.27 | 45.79 |
| peg | 1024 | 57.46 | 52.78 | 51.68 | 77.38 | 60.96 | 44.42 | 49.3 | 82.56 | 70.38 | 44.74 | 40.38 |
| HyDE (qwen+bge) | 1024 | 56.62 | 64.39 | 52.73 | 73.98 | 57.27 | 38.52 | 47.11 | 74.32 | 73.07 | 46.16 | 38.68 |
| SL-HyDE (qwen+bge) | 1024 | 59.38 | 71.49 | 60.96 | 75.34 | 58.58 | 39.07 | 50.13 | 76.95 | 73.81 | 46.78 | 40.71 |
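
The Avg. column appears to be the unweighted mean of the ten per-dataset scores. The short check below reproduces it for the last two rows, using numbers copied from the table; the helper name `average` is ours, not part of the repo.

```python
# Per-dataset scores copied from the leaderboard rows above, in column order:
# MedExam, DuBaike, DXYDisease, Medical, Cmedqa, DXYConsult, Covid, IIYiPost, CSLCite, CSLRel
scores = {
    "HyDE (qwen+bge)": [64.39, 52.73, 73.98, 57.27, 38.52, 47.11, 74.32, 73.07, 46.16, 38.68],
    "SL-HyDE (qwen+bge)": [71.49, 60.96, 75.34, 58.58, 39.07, 50.13, 76.95, 73.81, 46.78, 40.71],
}


def average(values):
    """Unweighted mean, rounded to two decimals as in the table."""
    return round(sum(values) / len(values), 2)


averages = {name: average(vals) for name, vals in scores.items()}
gain = round(averages["SL-HyDE (qwen+bge)"] - averages["HyDE (qwen+bge)"], 2)

print(averages)  # HyDE: 56.62, SL-HyDE: 59.38
print(gain)      # 2.76
```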

📜 Reference

If this code or dataset contributes to your research, please consider citing our paper and giving this repo a ⭐️ :)

@inproceedings{li-etal-2025-automir,
    title = "{A}uto{MIR}: Effective Zero-Shot Medical Information Retrieval without Relevance Labels",
    author = "Li, Lei  and Zhang, Xiangxu  and Zhou, Xiao  and Liu, Zheng",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.1305/",
    doi = "10.18653/v1/2025.findings-emnlp.1305",
    pages = "24028--24047",
    ISBN = "979-8-89176-335-7"
}
