This repository provides the source code, data, and supplemental material for our paper "CE-RAG4EM: Cost-Efficient RAG for Entity Matching with LLMs: A Blocking-based Exploration". The full version of the paper, including additional related work and technical details, is available on GitHub.
CE-RAG4EM is a cost-efficient RAG framework for entity matching that reduces computation through blocking-based batch retrieval and generation.
- Introduces a blocking strategy that reduces the overall cost of context retrieval and LLM inference for entity matching
- Retrieves relevant context from external knowledge graphs (e.g., Wikidata)
- Augments LLMs for entity matching with retrieved and refined context from an external knowledge base
- Supports multiple entity matching datasets (abt, amgo, beer, dbac, dbgo, foza, itam, waam, wdc)
The project supports multiple entity matching benchmark datasets:
- abt: Abt-Buy dataset
- amgo: Amazon-Google dataset
- beer: Beer dataset
- dbac: DBLP-ACM dataset
- dbgo: DBLP-GoogleScholar dataset
- foza: Fodors-Zagats dataset
- itam: iTunes-Amazon dataset
- waam: Walmart-Amazon dataset
- wdc: Web Data Commons dataset
Place your raw datasets in the data/raw/ directory.
The project includes a rag4em.yml file that lists all necessary dependencies; create and activate a conda environment from this YAML file:
# Create conda environment from the YAML file
conda env create -f rag4em.yml
# Activate the environment
conda activate rag4em

Set your OpenAI API key as an environment variable:
export OPENAI_API_KEY="your-openai-api-key-here"

Set your Google API key:
export GEMINI_API_KEY="your-gemini-api-key-here"

Log in with your Hugging Face token (required for gated models):
huggingface-cli login

# Step 1: Generate blocking pairs
python blocking_pair_generation.py -d abt -p test
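To illustrate what a blocking step does, here is a minimal q-gram (QG) blocking sketch over two toy record tables. This is a hypothetical illustration, not the repository's implementation; the function names, inputs, and thresholds are assumptions.

```python
from collections import defaultdict

def qgrams(text, q=3):
    """Split a lowercased string into overlapping character q-grams."""
    text = text.lower()
    return {text[i:i + q] for i in range(max(1, len(text) - q + 1))}

def qgram_blocking(left, right, q=3, min_shared=1):
    """Pair records from two tables that share at least `min_shared` q-grams.

    An inverted index maps q-grams to right-table ids, so each left record
    is only compared against right records sharing some q-gram.
    """
    index = defaultdict(set)
    for rid, text in right.items():
        for g in qgrams(text, q):
            index[g].add(rid)

    pairs = set()
    for lid, text in left.items():
        counts = defaultdict(int)
        for g in qgrams(text, q):
            for rid in index[g]:
                counts[rid] += 1
        for rid, c in counts.items():
            if c >= min_shared:
                pairs.add((lid, rid))
    return pairs

left = {"a1": "Sony Alpha 7 camera"}
right = {"b1": "Sony A7 digital camera", "b2": "Kitchen blender"}
print(qgram_blocking(left, right, q=3, min_shared=2))
```

Only candidate pairs that share enough q-grams survive, which is what makes downstream retrieval and LLM inference cheaper than comparing all record pairs.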
# Step 2: Retrieve contextual knowledge per block
python batch_retrieval.py -d abt -p test -b QG -maxb 6
# Step 3: Run CE-RAG for knowledge-augmented inference for entity matching
python ce_rag4em_main.py -d abt -p test -m gpt-4o-mini -b QG -maxb 6

Key Arguments:
- -d: Dataset to use (abt, amgo, beer, dbac, dbgo, foza, itam, waam, wdc)
- -p: Data partition (train, test, valid)
- -m: LLM model to use (gpt-4o-mini, qwen3-4b, etc.)
- -b: Blocking method to use (SB, QG, EQG, SA, ESA)
- -maxb: Maximum blocking size to process for batch retrieval and inference
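The key arguments above could be exposed with argparse roughly as follows. This is a sketch under the assumption that the scripts use standard argparse; the repository's actual flag handling may differ.

```python
import argparse

def build_parser():
    """Minimal argparse sketch mirroring the key arguments listed above."""
    parser = argparse.ArgumentParser(description="CE-RAG4EM entity matching")
    parser.add_argument("-d", required=True,
                        choices=["abt", "amgo", "beer", "dbac", "dbgo",
                                 "foza", "itam", "waam", "wdc"],
                        help="Dataset to use")
    parser.add_argument("-p", choices=["train", "test", "valid"],
                        default="test", help="Data partition")
    parser.add_argument("-m", default="gpt-4o-mini", help="LLM model to use")
    parser.add_argument("-b", choices=["SB", "QG", "EQG", "SA", "ESA"],
                        default="QG", help="Blocking method to use")
    parser.add_argument("-maxb", type=int, default=6,
                        help="Maximum blocking size for batch retrieval/inference")
    return parser

args = build_parser().parse_args(
    ["-d", "abt", "-p", "test", "-m", "gpt-4o-mini", "-b", "QG", "-maxb", "6"])
print(args.d, args.b, args.maxb)
```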
# Step 1: Configure context_config in the `ce_rag4em_main.py`
context_config = {
"enabled": False, # Set to False to disable context retrieval
"context_type": "qid", # "pid", "qid", or "triple"
"top_k": 2 # Number of top retrieval results to use (1 or 2)
}
# Step 2: Run the main python file
python ce_rag4em_main.py -d abt -p test -m gpt-4o-mini -b QG -maxb 6

# Step 1: Configure context_config in the `ce_rag4em_main.py`
context_config = {
"enabled": True, # Set to False to disable context retrieval
"context_type": "qid", # "pid", "qid", or "triple"
"top_k": 1 # Number of top retrieval results to use (1 or 2)
}
# Step 2: Run the main python file
python ce_rag4em_main.py -d abt -p test -m gemini-2.0-flash-lite -b QG -maxb 6

# Step 1: Configure context_config in the `ce_rag4em_main.py`
context_config = {
"enabled": True, # Set to False to disable context retrieval
"context_type": "triple", # "pid", "qid", or "triple"
"top_k": 2 # Number of top retrieval results to use (1 or 2)
}
# Step 2: Configure triple settings in the `ce_rag4em_main.py` if context_type is "triple"
triple_id_type = "QID" # "QID" or "PID"
triple_generation_type = "BFS" # "BFS" or "EXP" (expansion); triple search approach for triple generation
top_k_entities = 3 # Number of top entities/properties to use for triple generation
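A BFS-style triple search, as selected by triple_generation_type above, can be sketched over a toy knowledge graph. The graph, function name, and cut-off logic here are illustrative assumptions, not the repository's code.

```python
from collections import deque

# Toy knowledge graph: subject QID -> list of (property PID, object QID) edges
KG = {
    "Q1": [("P31", "Q2"), ("P176", "Q3")],
    "Q2": [("P279", "Q4")],
    "Q3": [("P17", "Q5")],
}

def bfs_triples(kg, seed_qid, max_triples=3):
    """Collect up to `max_triples` (subject, property, object) triples
    by breadth-first expansion from a seed entity."""
    triples, visited, queue = [], {seed_qid}, deque([seed_qid])
    while queue and len(triples) < max_triples:
        subject = queue.popleft()
        for prop, obj in kg.get(subject, []):
            triples.append((subject, prop, obj))
            if len(triples) >= max_triples:
                break
            if obj not in visited:
                visited.add(obj)
                queue.append(obj)
    return triples

print(bfs_triples(KG, "Q1", max_triples=3))
```

BFS favors triples close to the seed entity, whereas an expansion-style search could instead grow the frontier around the most promising entities first.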
# Step 3: Run the main python file
python ce_rag4em_main.py -d abt -p test -m qwen3-4b -b QG -maxb 6

The system generates several types of outputs:
- Blocking outputs (blocking_outputs/): Candidate entity pairs generated by blocking methods
- Retrieval outputs (retrieval_outputs/): Retrieved context from knowledge graphs
- Outputs (output/): The prompts with different retrieved contexts, the output of LLM inference, and the final results with evaluation metrics
- Logs (logs/): Detailed execution logs for further analysis
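Entity matching results are commonly scored with precision, recall, and F1 over the predicted match pairs; a minimal sketch of such an evaluation (hypothetical helper, not the repository's evaluator) is:

```python
def evaluate_matches(predicted, gold):
    """Compute precision, recall, and F1 over sets of matched pairs."""
    tp = len(predicted & gold)  # true positives: pairs predicted and correct
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

pred = {("a1", "b1"), ("a2", "b2")}
gold = {("a1", "b1"), ("a3", "b3")}
print(evaluate_matches(pred, gold))  # → (0.5, 0.5, 0.5)
```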
If you find our work helpful, please cite it using the following BibTeX entry:
@article{ma2026cerag4em,
title={Cost-Efficient RAG for Entity Matching with LLMs: A Blocking-based Exploration},
author={Ma, Chuangtao and Zhang, Zeyu and Khan, Arijit and Schelter, Sebastian and Groth, Paul},
journal={arXiv preprint arXiv:2602.05708},
year={2026}
}

The abt, amgo, beer, dbac, dbgo, foza, itam, waam datasets and the wdc dataset originated from the following works:
Deep Learning for Entity Matching: A Design Space Exploration
https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md
SC-Block: Supervised Contrastive Blocking Within Entity Resolution Pipelines
https://webdatacommons.org/largescaleproductcorpus/wdc-block/
We thank the authors for sharing these datasets.
The Wikidata VectorDB and its API access are provided by the team behind the Wikidata Embedding Project. We thank them for creating and maintaining this excellent project.