A lightweight, interpretable framework for detecting and quantifying Benchmark Data Contamination (BDC) in Large Language Models (LLMs).
The Data Contamination Risk (DCR) framework provides an efficient method to detect and quantify contamination risk during LLM evaluations. It decomposes contamination into four distinct levels and leverages a fuzzy inference system to compute a comprehensive DCR Factor.
- Four-level contamination detection: Semantic (L1), Information (L2), Data (L3), and Label (L4)
- Fuzzy inference system for interpretable risk quantification
- Lightweight and efficient - suitable for real-world applications with limited resources
- Contamination-aware metrics - adjusts performance scores based on detected contamination
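To make the idea of fuzzy risk quantification concrete, here is a minimal sketch of how four per-level contamination scores might be combined into a single factor and then used to discount a raw score. It is not the repository's implementation. The level weights, the triangular membership functions, the output values, the function names `sketch_dcr_factor` and `adjust_score`, and the `raw * (1 - factor)` adjustment are all assumptions chosen for illustration.

```python
from typing import Dict

# Hypothetical severity weights per level (assumption, not taken from the repo).
LEVEL_WEIGHTS: Dict[str, float] = {"L1": 1.0, "L2": 2.0, "L3": 3.0, "L4": 4.0}

# Assumed crisp outputs for each rule consequent (zero-order Sugeno style).
RISK_OUTPUT: Dict[str, float] = {"low": 0.0, "medium": 0.4, "high": 1.0}


def tri(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership peaking at b; a == b or b == c gives a shoulder."""
    if x <= a:
        return 1.0 if a == b else 0.0
    if x >= c:
        return 1.0 if b == c else 0.0
    if x < b:
        return (x - a) / (b - a)
    if x == b:
        return 1.0
    return (c - x) / (c - b)


def memberships(score: float) -> Dict[str, float]:
    """Fuzzify a score in [0, 1] into low / medium / high degrees."""
    return {
        "low": tri(score, 0.0, 0.0, 0.5),
        "medium": tri(score, 0.0, 0.5, 1.0),
        "high": tri(score, 0.5, 1.0, 1.0),
    }


def sketch_dcr_factor(level_scores: Dict[str, float]) -> float:
    """Weighted-average defuzzification over one rule per (level, fuzzy set)."""
    numerator = 0.0
    denominator = 0.0
    for level, weight in LEVEL_WEIGHTS.items():
        # Missing levels count as 0; scores are clamped to [0, 1].
        score = min(max(level_scores.get(level, 0.0), 0.0), 1.0)
        for label, degree in memberships(score).items():
            strength = weight * degree  # rule firing strength
            numerator += strength * RISK_OUTPUT[label]
            denominator += strength
    return numerator / denominator


def adjust_score(raw_score: float, dcr_factor: float) -> float:
    """Assumed contamination-aware adjustment: discount the raw score by the factor."""
    return raw_score * (1.0 - dcr_factor)


if __name__ == "__main__":
    factor = sketch_dcr_factor({"L1": 0.6, "L2": 0.2, "L3": 0.0, "L4": 0.0})
    print(f"DCR factor: {factor:.3f}, adjusted accuracy: {adjust_score(0.90, factor):.3f}")
```

Because each rule is a readable statement of the form "if level X is high, risk is high", the contribution of every level to the final factor can be inspected directly, which is the kind of interpretability a fuzzy system is meant to offer.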
# Clone the repository
git clone https://github.com/chengxuphd/dcr.git
cd dcr
# Install dependencies
pip install -r requirements.txt

dcr/
├── src/
│   ├── core/
│   │   ├── fuzzy_system.py             # Fuzzy inference system implementation
│   │   └── dcr_calculator.py           # Main DCR calculation logic
│   └── utils/
│       ├── data_loader.py              # Data loading utilities
│       └── output_formatter.py         # Result formatting utilities
├── data/
│   ├── sst2_experimental_data.csv      # SST-2 benchmark data
│   ├── liar2_experimental_data.csv     # LIAR2 benchmark data
│   └── gsm8k_experimental_data.csv     # GSM8K benchmark data
├── config/
│   └── settings.py                     # Configuration settings
├── output/
│   ├── sst2_results.csv                # SST-2 benchmark results
│   ├── liar2_results.csv               # LIAR2 benchmark results
│   └── gsm8k_results.csv               # GSM8K benchmark results
├── main.py                             # Main entry point
└── requirements.txt
# Analyze a specific benchmark
python main.py --benchmark sst2
# Analyze all benchmarks
python main.py --all
# Save results to CSV
python main.py --benchmark liar2 --save-csv

from src.core import DCRCalculator
from src.utils import DataLoader
# Initialize calculator
calculator = DCRCalculator()
# Load experimental data
data = DataLoader.load_csv('data/sst2_experimental_data.csv')
# Process and analyze
results = calculator.process_experiment_data(data)
# Access results
for result in results['results']:
    print(f"Model: {result['model']}, DCR: {result['dcr']:.4f}")

- Semantic Level (L1): Model exposed to semantically equivalent content
- Information Level (L2): Model exposed to benchmark metadata or statistics
- Data Level (L3): Model exposed to actual test data (without labels)
- Label Level (L4): Model exposed to test data with labels
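As a small illustration of the four levels above, the sketch below models them as an ordered enumeration and reports the highest level flagged for a model. The names `ContaminationLevel` and `highest_level` are hypothetical and do not come from the repository, and treating a higher level number as more severe is an assumption based only on the L1 to L4 numbering.

```python
from enum import IntEnum
from typing import Dict, Optional


class ContaminationLevel(IntEnum):
    """The four contamination levels, numbered as in the list above."""
    SEMANTIC = 1      # semantically equivalent content
    INFORMATION = 2   # benchmark metadata or statistics
    DATA = 3          # actual test data, without labels
    LABEL = 4         # test data together with labels


def highest_level(flags: Dict[ContaminationLevel, bool]) -> Optional[ContaminationLevel]:
    """Return the highest level flagged True, or None if nothing is flagged."""
    detected = [level for level, hit in flags.items() if hit]
    return max(detected) if detected else None


if __name__ == "__main__":
    flags = {
        ContaminationLevel.SEMANTIC: True,
        ContaminationLevel.INFORMATION: False,
        ContaminationLevel.DATA: True,
        ContaminationLevel.LABEL: False,
    }
    level = highest_level(flags)
    print(f"Highest detected level: L{int(level)} ({level.name.title()})")
```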
The framework has been validated on 9 LLMs (0.5B-72B parameters) across three benchmarks:
- SST-2: Sentiment Analysis
- LIAR2: Fake News Detection
- GSM8K: Arithmetic Reasoning
Average error across the three benchmarks: < 4%
If you find our work useful in your research, please consider citing:
@inproceedings{xu2025dcr,
title = "{DCR}: Quantifying Data Contamination in {LLM}s Evaluation",
author = "Xu, Cheng and
Yan, Nan and
Guan, Shuhao and
Jin, Changhong and
Mei, Yuke and
Guo, Yibing and
Kechadi, Tahar",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.1173/",
pages = "23013--23031",
ISBN = "979-8-89176-332-6",
}