
CE-RAG4EM: Cost-Efficient RAG for Entity Matching with LLMs: A Blocking-based Exploration

This repository provides the source code, data, and supplemental material for our paper "CE-RAG4EM: Cost-Efficient RAG for Entity Matching with LLMs: A Blocking-based Exploration". The full version of the paper, including additional related work and technical details, has been made available on GitHub.

Introduction

CE-RAG4EM is a cost-efficient RAG for entity matching that reduces computation through blocking-based batch retrieval and generation.

  • Introduces a blocking strategy to reduce the overall cost of context retrieval and LLM inference for entity matching
  • Retrieves relevant context from external knowledge graphs (e.g., Wikidata)
  • Augments LLMs for entity matching with retrieved and refined context from the external knowledge base
  • Supports multiple entity matching datasets (abt, amgo, beer, dbac, dbgo, foza, itam, waam, wdc)
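The core cost-saving idea above can be illustrated with a toy sketch (not the repository's code): candidate records are grouped into blocks, and context is retrieved once per block rather than once per record. The record values, the blocking-key function, and the call counter below are all hypothetical.

```python
# Toy illustration of blocking-based batch retrieval: grouping records into
# blocks means one retrieval call per block instead of one per record.
from collections import defaultdict

def first_token_key(record: str) -> str:
    """Hypothetical blocking key: the lowercased first token of a record."""
    return record.split()[0].lower()

def count_retrieval_calls(records, blocker):
    """Compare per-record retrieval against one retrieval per block."""
    blocks = defaultdict(list)
    for r in records:
        blocks[blocker(r)].append(r)
    per_record = len(records)  # naive: one retrieval call per record
    per_block = len(blocks)    # batched: one retrieval call per block
    return per_record, per_block

records = [
    "Sony WH-1000XM4 headphones",
    "Sony WF-1000XM4 earbuds",
    "Apple AirPods Pro",
    "Apple AirPods Max",
]
naive, batched = count_retrieval_calls(records, first_token_key)
print(naive, batched)  # 4 records collapse into 2 blocks ("sony", "apple")
```

The actual blocking methods in this repository (SB, QG, EQG, SA, ESA) are more sophisticated than this first-token key, but the cost argument is the same.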

Datasets

The project supports multiple entity matching benchmark datasets:

  • abt: Abt-Buy dataset
  • amgo: Amazon-Google dataset
  • beer: Beer dataset
  • dbac: DBLP-ACM dataset
  • dbgo: DBLP-GoogleScholar dataset
  • foza: Fodors-Zagats dataset
  • itam: iTunes-Amazon dataset
  • waam: Walmart-Amazon dataset
  • wdc: Web Data Commons dataset

Place your raw datasets in the data/raw/ directory.
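One possible way to prepare that directory is a subfolder per dataset code; this layout is an assumption for illustration, so check the repository's data loaders for the exact structure expected under data/raw/.

```shell
# Create one subdirectory per supported dataset code (assumed layout).
for d in abt amgo beer dbac dbgo foza itam waam wdc; do
  mkdir -p "data/raw/$d"
done
ls data/raw
```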

Quick Start

1. Environment Setup

The project includes a rag4em.yml file that lists all necessary dependencies; create a conda environment from it:

# Create conda environment from the YAML file
conda env create -f rag4em.yml

# Activate the environment
conda activate rag4em

2. Configure LLM APIs

For OpenAI GPT Models

Set your OpenAI API key as an environment variable:

export OPENAI_API_KEY="your-openai-api-key-here"

For Google Gemini Models

Set your Google API key:

export GEMINI_API_KEY="your-gemini-api-key-here"

For Hugging Face Models

Login with your Hugging Face token (required for gated models):

huggingface-cli login
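Before running the pipeline, it can help to verify that the API keys from the steps above are actually set. The snippet below is a convenience sketch, not part of the repository; it only checks the two environment variable names documented above.

```python
# Sanity check that the expected API-key environment variables are set.
import os

def missing_keys(env=None):
    """Return the names of expected API-key variables that are not set."""
    if env is None:
        env = os.environ
    expected = ["OPENAI_API_KEY", "GEMINI_API_KEY"]
    return [k for k in expected if not env.get(k)]

if __name__ == "__main__":
    absent = missing_keys()
    if absent:
        print("Missing:", ", ".join(absent))
    else:
        print("All API keys set.")
```

If you only use one provider (e.g., only OpenAI models), a warning about the other key can be ignored.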

3. Run the Method

End-to-End Pipeline from Blocking to Matching

# Step 1: Generate blocking pairs
python blocking_pair_generation.py -d abt -p test 

# Step 2: Retrieve contextual knowledge per block
python batch_retrieval.py -d abt -p test -b QG -maxb 6

# Step 3: Run CE-RAG for knowledge-augmented inference for entity matching
python ce_rag4em_main.py -d abt -p test -m gpt-4o-mini -b QG -maxb 6

Key Arguments:

  • -d: Dataset to use (abt, amgo, beer, dbac, dbgo, foza, itam, waam, wdc)
  • -p: Data partition (train, test, valid)
  • -m: LLM model to use (gpt-4o-mini, qwen3-4b, etc.)
  • -b: Blocking method to use (SB, QG, EQG, SA, ESA)
  • -maxb: Maximum blocking size to process for batch retrieval and inference
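The flags above can be mirrored with a small argparse sketch; the actual scripts may define defaults and help text differently, but the flag names and valid choices come from the list above.

```python
# Illustrative argument parser for the documented flags (a sketch, not the
# repository's own parser).
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="CE-RAG4EM arguments (sketch)")
    p.add_argument("-d", required=True, help="dataset",
                   choices=["abt", "amgo", "beer", "dbac", "dbgo",
                            "foza", "itam", "waam", "wdc"])
    p.add_argument("-p", choices=["train", "test", "valid"],
                   default="test", help="data partition")
    p.add_argument("-m", default="gpt-4o-mini", help="LLM model")
    p.add_argument("-b", choices=["SB", "QG", "EQG", "SA", "ESA"],
                   default="QG", help="blocking method")
    p.add_argument("-maxb", type=int, default=6,
                   help="maximum blocking size")
    return p

args = build_parser().parse_args(["-d", "abt", "-p", "test", "-maxb", "6"])
print(args.d, args.b, args.maxb)
```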

Usage Examples

Example 1: LLM4EM with GPT-4o-mini

# Step 1: Configure context_config in `ce_rag4em_main.py`
context_config = {
        "enabled": False,  # Set to False to disable context retrieval
        "context_type": "qid",  # "pid", "qid", or "triple"
        "top_k": 2   # Number of top retrieval results to use (1 or 2)
}

# Step 2: Run the main Python file
python ce_rag4em_main.py -d abt -p test -m gpt-4o-mini -b QG -maxb 6

Example 2: RAG4EM with Top-1 QID triple and Gemini-2.0-flash-lite

# Step 1: Configure context_config in `ce_rag4em_main.py`
context_config = {
        "enabled": True,  # Set to False to disable context retrieval
        "context_type": "qid",  # "pid", "qid", or "triple"
        "top_k": 1   # Number of top retrieval results to use (1 or 2)
}

# Step 2: Run the main Python file
python ce_rag4em_main.py -d abt -p test -m gemini-2.0-flash-lite -b QG -maxb 6

Example 3: KG-RAG4EM with Top-2 BFS triple and Qwen3-4b

# Step 1: Configure context_config in `ce_rag4em_main.py`
context_config = {
        "enabled": True,  # Set to False to disable context retrieval
        "context_type": "triple",  # "pid", "qid", or "triple"
        "top_k": 2   # Number of top retrieval results to use (1 or 2)
}
# Step 2: Configure the triple settings in `ce_rag4em_main.py` if context_type is "triple"
triple_id_type = "QID"  # "QID" or "PID"
triple_generation_type = "BFS"  # "BFS" or "EXP" (expansion): triple search approach for triple generation
top_k_entities = 3  # Number of top entities/properties to use for triple generation

# Step 3: Run the main Python file
python ce_rag4em_main.py -d abt -p test -m qwen3-4b -b QG -maxb 6
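The "BFS" triple generation setting above suggests a breadth-first traversal of the knowledge graph from a seed entity. The following is a toy sketch on a hypothetical adjacency structure (entity QID mapped to (PID, QID) edges); the identifiers are invented, not real Wikidata IDs, and this is not the repository's retrieval code.

```python
# Toy breadth-first triple collection from a seed entity, capped at a
# maximum number of triples (analogous to top_k_entities above).
from collections import deque

GRAPH = {
    "Q1": [("P1", "Q2"), ("P2", "Q3")],
    "Q2": [("P3", "Q4")],
    "Q3": [],
    "Q4": [],
}

def bfs_triples(graph, start, max_triples):
    """Collect (subject, predicate, object) triples in BFS order."""
    triples, seen, queue = [], {start}, deque([start])
    while queue and len(triples) < max_triples:
        subj = queue.popleft()
        for pred, obj in graph.get(subj, []):
            triples.append((subj, pred, obj))
            if len(triples) == max_triples:
                break
            if obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return triples

print(bfs_triples(GRAPH, "Q1", 3))
```

An "EXP" (expansion) strategy would grow the neighborhood differently, e.g. by repeatedly expanding the most relevant entities rather than by strict level order.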

Output

The system generates several types of outputs:

  1. Blocking outputs (blocking_outputs/): Candidate entity pairs generated by blocking methods
  2. Retrieval outputs (retrieval_outputs/): Retrieved context from knowledge graphs
  3. Outputs (output/): The prompt with different retrieved contexts, output of LLM inference, and the final results with evaluation metrics
  4. Logs (logs/): Detailed execution logs for further analysis
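The evaluation metrics mentioned above are the standard entity-matching ones (precision, recall, F1 over binary match/non-match predictions). The sketch below shows how they are computed; the function and variable names are illustrative, not the repository's evaluation code.

```python
# Precision/recall/F1 over parallel 0/1 label lists (1 = match).
def em_metrics(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = em_metrics([1, 1, 0, 0], [1, 0, 1, 0])
print(p, r, f1)  # 0.5 0.5 0.5
```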

Citation

If you find our work helpful, please cite it by using the following BibTeX entry:

@article{ma2026cerag4em,
    title={Cost-Efficient RAG for Entity Matching with LLMs: A Blocking-based Exploration}, 
    author={Ma, Chuangtao and Zhang, Zeyu and Khan, Arijit and Schelter, Sebastian and Groth, Paul},
    journal={arXiv preprint arXiv:2602.05708},
    year={2026}
}

Acknowledgment

Dataset

The abt, amgo, beer, dbac, dbgo, foza, itam, and waam datasets and the wdc dataset originate from the following works:

Deep Learning for Entity Matching: A Design Space Exploration
https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md

SC-Block: Supervised Contrastive Blocking Within Entity Resolution Pipelines
https://webdatacommons.org/largescaleproductcorpus/wdc-block/

We thank the authors for sharing the datasets.

VectorDB

The Wikidata VectorDB and its API access are provided by the team behind the Wikidata Embedding Project. We thank them for creating and maintaining this excellent project.
