
Cross-Dataset-EM-Study

[Figures: leave-one-dataset-out (LODO) results; F1 vs. cost]

This repository contains the code and artifacts for the experiments of the paper "[Experiments & Analysis] A Deep Dive Into Cross-Dataset Entity Matching with Large and Small Language Models".

It includes the basic components and configurations for reproducing the evaluations described in the paper. For the full implementation of each method, please refer to the original repositories linked here.

Dataset

We use ten recognized benchmark datasets from the Magellan repository (the first eight are widely used in existing studies), along with the WDC dataset, a more recent benchmark from the e-commerce domain. For detailed source information about these datasets, please visit the links provided below:

| Abbr. | Dataset           | Link           |
| ----- | ----------------- | -------------- |
| wdc   | wdc               | wdc            |
| abt   | abt_buy           | magellan       |
| amgo  | amazon_google     | magellan       |
| beer  | beer              | magellan       |
| dbac  | dblp_acm          | magellan       |
| dbgo  | dblp_scholar      | magellan       |
| foza  | fodors_zagat      | magellan       |
| itam  | itunes_amazon     | magellan       |
| waam  | walmart_amazon    | magellan       |
| roim  | rottentomato_imdb | magellan, data |
| zoye  | zomato_yelp       | magellan, data |

The training and validation sets for each dataset originate from their respective sources. The test set, however, is down-sampled to a maximum of 1,250 samples to keep the cost of OpenAI API calls manageable. For all baseline comparisons, the test set remains identical, adhering to the same leave-one-dataset-out configuration.
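As an illustration of the down-sampling step, here is a minimal pandas sketch; the file paths, column layout, and random seed are assumptions, and the actual processing lives in data/data_preparation.ipynb:

```python
import pandas as pd

MAX_TEST_SIZE = 1250  # cap that keeps OpenAI API costs manageable

# hypothetical input/output paths for one dataset's test split
test_df = pd.read_csv("data/abt/test.csv")

if len(test_df) > MAX_TEST_SIZE:
    # a fixed seed keeps the down-sampled test set identical across all baselines
    test_df = test_df.sample(n=MAX_TEST_SIZE, random_state=42)

test_df.to_csv("data/abt/test_sampled.csv", index=False)
```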

To conserve space, only the raw datasets are included in this repository. The code for processing and formatting them is available in data/data_preparation.ipynb. The datasets are first processed into a base format, which is subsequently adapted to create method-specific datasets.

Moreover, to address the data leakage issue that might occur during the fine-tuning phase, we conduct a SQL-based analysis to verify that there are zero overlapping datapoints between any pair of datasets; the script can be found in data/data_leakage.py.
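For illustration only, a pandas-based sketch of such an overlap check (the real script uses SQL; the dataset names and column names below are assumptions):

```python
import pandas as pd
from itertools import combinations

# hypothetical subset of datasets with hypothetical serialized-pair columns
datasets = {name: pd.read_csv(f"data/{name}/train.csv") for name in ["abt", "amgo", "beer"]}

for (name_a, df_a), (name_b, df_b) in combinations(datasets.items(), 2):
    # an inner join on the record-pair text reveals any shared datapoints
    overlap = df_a.merge(df_b, on=["left_text", "right_text"], how="inner")
    print(f"{name_a} vs {name_b}: {len(overlap)} overlapping pairs")
```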

Different cross-dataset EM matchers

We compare the following cross-dataset matchers along two dimensions: predictive quality and cost.

ZeroER

The code can be found in the zeroer folder. We use the vanilla implementation with the transitivity constraint applied. To run the experiments, navigate to the folder and run:

python zeroer.py DATASET_NAME --run_transitivity

Ditto

The code is located in the ditto folder. To comply with our 'leave-one-dataset-out' strategy, the training, validation, and test data must be configured accordingly, which results in a new config.json file in the folder. Additionally, each dataset needs to be serialized into the format that Ditto recognizes (a sketch of this format is shown after the command); the necessary conversion code is provided in the 'ditto/data' folder. To run the experiments, navigate to the folder and run:

python train_ditto.py \
  --task DATASET_NAME \
  --batch_size 64 --max_len 64 --lr 3e-5 \
  --n_epochs 40 --lm bert --fp16 --da del --summarize
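For reference, Ditto consumes record pairs serialized into COL/VAL token sequences, one tab-separated pair per line. A minimal sketch follows; the example records are invented, and the repository's own conversion code lives in ditto/data:

```python
def serialize(record: dict) -> str:
    # Ditto's textual format: "COL <attribute> VAL <value>" for every attribute
    return " ".join(f"COL {col} VAL {val}" for col, val in record.items())

left = {"title": "iphone 11 64gb", "price": "699"}
right = {"title": "apple iphone 11 (64 gb)", "price": "699.00"}
label = 1  # 1 = match, 0 = non-match

# one record pair per line: left entity, right entity, and label, separated by tabs
print(f"{serialize(left)}\t{serialize(right)}\t{label}")
```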

Unicorn

The code is located in the unicorn folder. To ensure the evaluation aligns with the 'leave-one-dataset-out' strategy, we adjust the 'dataprocess/dataformat.py' file, using an ordinal number to represent each left-out dataset. To run the experiments, navigate to the folder and run:

python main-zero-ins.py --pretrain --model deberta_base --loo DATASET_ORDINAL

AnyMatch

The code is located in the anymatch folder. To run the experiments, navigate to the folder and run:

python loo.py --leaved_dataset_name DATASET_NAME --base_model BASE_MODEL

Jellyfish

The code is located in the jellyfish folder. We use the prompt suggested by the authors and download the model directly from Hugging Face for inference. A script to run the experiments is also provided:

python jellyfish.py
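A minimal inference sketch with Hugging Face transformers, assuming a Jellyfish checkpoint and a shortened stand-in prompt (the model id and prompt wording below are illustrative; jellyfish.py uses the authors' recommended prompt):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NECOUDBFM/Jellyfish-13B"  # illustrative checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# shortened stand-in for the authors' entity-matching prompt
prompt = (
    "Determine whether the two records refer to the same real-world entity.\n"
    "Record A: [title: iphone 11 64gb]\n"
    "Record B: [title: apple iphone 11 (64 gb)]\n"
    "Answer with 'Yes' or 'No'."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```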

MatchGPT

The code is located in the matchgpt folder. Instead of using a notebook for all experiments, we adopt the same prompt system described in the original MatchGPT paper and convert everything into a Python script for improved parallel execution. For GPT models, we provide code both with and without demonstrations. To run inference with GPT models, please use the following command:

python gpt.py --mid MODEL_ID --dem DEMONSTRATION_METHOD
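Conceptually, each test pair is posed to the model as a yes/no matching question. A minimal sketch with the openai Python client follows; the model id and record texts are illustrative, while gpt.py selects the model via --mid and adds demonstrations via --dem:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# illustrative record pair; gpt.py builds the prompt following the MatchGPT prompt design
question = (
    "Do the two entity descriptions refer to the same real-world product? "
    "Answer with 'Yes' or 'No'.\n"
    "Entity 1: iphone 11 64gb, 699 USD\n"
    "Entity 2: apple iphone 11 (64 gb), 699.00 USD"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model id, normally chosen via --mid
    messages=[{"role": "user", "content": question}],
    temperature=0,
)
print(response.choices[0].message.content)
```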

To run other open models, use the following command:

python open_models.py --mid MODEL_ID

TableGPT & GPT-3

We are unable to evaluate these two models because TableGPT is not open-sourced, and GPT-3 has been deprecated. Therefore, we include their results from the original papers for reference.

Inference throughput experiments

We provide the code to run the throughput experiments. To run them, please use the following command:

python throughput.py --model_name MODEL_NAME
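A minimal sketch of how throughput can be measured as record pairs processed per second (the function name and batching scheme are assumptions, not the exact logic in throughput.py):

```python
import time

def measure_throughput(predict_fn, pairs, batch_size=32):
    """Return the number of record pairs a matcher processes per second."""
    start = time.perf_counter()
    for i in range(0, len(pairs), batch_size):
        predict_fn(pairs[i:i + batch_size])  # batched prediction call
    elapsed = time.perf_counter() - start
    return len(pairs) / elapsed

# usage: pairs_per_second = measure_throughput(model_predict, test_pairs)
```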

Experimental results and analysis

The raw results for the reported numbers in Table 3 and Table 4 can be found in results. Moreover, a separate notebook containing all the analyses presented in the paper is available in results/analysis.ipynb.
