DCR: Data Contamination Risk Framework

A lightweight, interpretable framework for detecting and quantifying Benchmark Data Contamination (BDC) in Large Language Models (LLMs).

Overview

The Data Contamination Risk (DCR) framework efficiently detects and quantifies contamination risk during LLM evaluation. It decomposes contamination into four distinct levels and uses a fuzzy inference system to compute a single, comprehensive DCR Factor.

Key Features

Four-level contamination detection: Semantic (L1), Information (L2), Data (L3), and Label (L4)
Fuzzy inference system for interpretable risk quantification
Lightweight and efficient - suitable for real-world applications with limited resources
Contamination-aware metrics - adjusts performance scores based on detected contamination
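
The contamination-aware metric idea can be illustrated with a minimal sketch. It assumes the DCR Factor lies in [0, 1] and that a raw score is discounted linearly by it. Both the function name `adjust_score` and the linear rule are assumptions made for illustration, not necessarily the formula implemented in this repository.

```python
def adjust_score(raw_score: float, dcr_factor: float) -> float:
    """Discount a raw benchmark score by a contamination risk factor.

    Assumption: dcr_factor is in [0, 1] and the discount is linear.
    """
    if not 0.0 <= dcr_factor <= 1.0:
        raise ValueError("dcr_factor must be in [0, 1]")
    return raw_score * (1.0 - dcr_factor)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(f"adjusted accuracy: {adjust_score(0.90, 0.25):.3f}")
```

With a factor of 0 the score is unchanged, and with a factor of 1 it drops to 0, so a higher detected risk always yields a more conservative score.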

Installation

# Clone the repository
git clone https://github.com/chengxuphd/dcr.git
cd dcr

# Install dependencies
pip install -r requirements.txt

Project Structure

dcr/
├── src/
│   ├── core/
│   │   ├── fuzzy_system.py      # Fuzzy inference system implementation
│   │   └── dcr_calculator.py    # Main DCR calculation logic
│   └── utils/
│       ├── data_loader.py       # Data loading utilities
│       └── output_formatter.py  # Result formatting utilities
├── data/
│   ├── sst2_experimental_data.csv   # SST-2 benchmark data
│   ├── liar2_experimental_data.csv  # LIAR2 benchmark data
│   └── gsm8k_experimental_data.csv  # GSM8K benchmark data
├── config/
│   └── settings.py              # Configuration settings
├── output/
│   ├── sst2_results.csv     # SST-2 benchmark results
│   ├── liar2_results.csv    # LIAR2 benchmark results
│   └── gsm8k_results.csv    # GSM8K benchmark results
├── main.py                      # Main entry point
└── requirements.txt

Usage

Basic Usage

# Analyze a specific benchmark
python main.py --benchmark sst2

# Analyze all benchmarks
python main.py --all

# Save results to CSV
python main.py --benchmark liar2 --save-csv
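
To run several benchmarks and save each result in one go, the commands above can be wrapped in a small script. This is a sketch that uses only the `--benchmark` and `--save-csv` flags shown above; the helper names `build_command` and `run_benchmarks` are hypothetical, and the script is assumed to run from the repository root.

```python
import subprocess
import sys
from typing import List


def build_command(benchmark: str, save_csv: bool = False) -> List[str]:
    """Build the argument list for one `main.py` run."""
    cmd = [sys.executable, "main.py", "--benchmark", benchmark]
    if save_csv:
        cmd.append("--save-csv")
    return cmd


def run_benchmarks(benchmarks: List[str], save_csv: bool = False) -> None:
    """Run each benchmark in turn, stopping at the first failing run."""
    for name in benchmarks:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(build_command(name, save_csv), check=True)


# Example (run from the repository root):
# run_benchmarks(["sst2", "liar2"], save_csv=True)
```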

Python API

from src.core import DCRCalculator
from src.utils import DataLoader

# Initialize calculator
calculator = DCRCalculator()

# Load experimental data
data = DataLoader.load_csv('data/sst2_experimental_data.csv')

# Process and analyze
results = calculator.process_experiment_data(data)

# Access results
for result in results['results']:
    print(f"Model: {result['model']}, DCR: {result['dcr']:.4f}")
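
Once the results are available, a common next step is to rank models by risk. The sketch below relies only on the `results['results']` list and the `model` and `dcr` keys shown above. The helper `rank_by_dcr`, the `high_risk` flag, and the 0.5 threshold are hypothetical choices for illustration, and the sample data is made up.

```python
from typing import Any, Dict, List


def rank_by_dcr(results: Dict[str, Any], threshold: float = 0.5) -> List[Dict[str, Any]]:
    """Return per-model entries sorted by DCR (highest first), each with a risk flag."""
    ranked = sorted(results["results"], key=lambda r: r["dcr"], reverse=True)
    # The threshold is an arbitrary illustrative cut-off, not a value from the framework.
    return [{**r, "high_risk": r["dcr"] >= threshold} for r in ranked]


if __name__ == "__main__":
    # Hypothetical stand-in for calculator.process_experiment_data(data).
    mock = {"results": [
        {"model": "model-a", "dcr": 0.12},
        {"model": "model-b", "dcr": 0.67},
    ]}
    for row in rank_by_dcr(mock):
        print(f"{row['model']}: DCR={row['dcr']:.4f} high_risk={row['high_risk']}")
```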

Contamination Levels

Semantic Level (L1): Model exposed to semantically equivalent content
Information Level (L2): Model exposed to benchmark metadata or statistics
Data Level (L3): Model exposed to actual test data (without labels)
Label Level (L4): Model exposed to test data with labels
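
To give a feel for how four level scores might be fused into one number, here is a minimal zero-order Sugeno fuzzy inference sketch. It is not the implementation in `src/core/fuzzy_system.py`: the membership functions, the rule base, the rule weights (which assume deeper levels matter more), and the idea that each level arrives as an evidence score in [0, 1] are all assumptions made for illustration.

```python
from typing import Dict

# Assumed rule weights: deeper contamination levels count more. Illustrative only.
RULE_WEIGHTS: Dict[str, float] = {"L1": 1.0, "L2": 2.0, "L3": 3.0, "L4": 4.0}


def high(x: float, start: float = 0.2, end: float = 0.8) -> float:
    """Membership of x in the fuzzy set 'high': a linear ramp from start to end."""
    return min(1.0, max(0.0, (x - start) / (end - start)))


def low(x: float) -> float:
    """Membership of x in the fuzzy set 'low', the complement of 'high'."""
    return 1.0 - high(x)


def risk_score(levels: Dict[str, float]) -> float:
    """Zero-order Sugeno inference over per-level evidence scores in [0, 1].

    For each level there are two rules:
      IF level is high THEN risk = 1.0
      IF level is low  THEN risk = 0.0
    The output is the average of the rule consequents, weighted by
    firing strength times rule weight.
    """
    numerator = 0.0
    denominator = 0.0
    for name, weight in RULE_WEIGHTS.items():
        x = levels[name]
        for strength, consequent in ((high(x), 1.0), (low(x), 0.0)):
            numerator += weight * strength * consequent
            denominator += weight * strength
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical evidence scores, for illustration only.
    print(f"{risk_score({'L1': 0.9, 'L2': 0.4, 'L3': 0.1, 'L4': 0.0}):.3f}")
```

Because every level fires both a "high" and a "low" rule, the denominator never reaches zero and the output stays in [0, 1]: no evidence at any level gives 0, and full evidence at every level gives 1.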

Experimental Results

The framework has been validated on 9 LLMs (0.5B-72B parameters) across three benchmarks:

SST-2: Sentiment Analysis
LIAR2: Fake News Detection
GSM8K: Arithmetic Reasoning

Average error of DCR-adjusted accuracy relative to the uncontaminated baseline, across the three benchmarks: < 4%

Citation

If you find our work useful in your research, please consider citing:

@inproceedings{xu2025dcr,
    title = "{DCR}: Quantifying Data Contamination in {LLM}s Evaluation",
    author = "Xu, Cheng  and
      Yan, Nan  and
      Guan, Shuhao  and
      Jin, Changhong  and
      Mei, Yuke  and
      Guo, Yibing  and
      Kechadi, Tahar",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.1173/",
    pages = "23013--23031",
    ISBN = "979-8-89176-332-6",
}
