GitHub - zhang-wei-chao/DC-PDD: This repository presents the original implementation of Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method by Weichao Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng

This is the official repository for the paper Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method by Weichao Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng

Overview

we proposes DC-PDD to improve methods that directly rely on token probabilities for pretraining data detection, which tend to misclassify non-training texts containing many common words as training texts. The key idea of DC-PDD is to calibrate the token probabilities and thereby make them more informative signals for detection. The calibration process is achieved by computing the cross-entropy (i.e., the divergence) between the token probability distribution and the token frequency distribution. To facilitate the evaluation of pretraining data detection for LLMs, we introduce a new benchmark named PatentMIA, specifically designed for Chinese-language pretraining data detection.

PatentMIA

The PatentMIA dataset serves as a benchmark designed to evaluate pretraining data detection methods, specifically in detecting Chinese-language pretraining data from models that are open-source Chinese LLMs released between January 1, 2023 and March 1, 2024 (e.g., Qwen1.5)

The dataset contains non-training and training data:

non-training data includes text snippets crawled from Google Patent websites published after March 1, 2024.
training data includes text snippets crawled from Google Patent websites published before January 1, 2023.

DC-PDD (& baselines)

We first obtain the token probability distribution by querying the LLM with the text.

python src/com_pro_dis.py --tar_mod <model_name> --ref_dat <data_file> --data <data_file>

Next, we use a large-scale publicly available corpus as a reference corpus to obtain an estimation of the token frequency distribution since an LLM’s pretraining corpus is not accessible usually.

python src/com_fre_dis.py --model <model_name> --ref_dat <reference_dataset>

We then calibrate the token probabilities by comparing the token probability distribution to the token frequency distribution to calibrate the token probability for each token, and derive a score for pretraining data detection based on the calibrated token probabilities.

python src/com_det_sco.py --tar_mod <model_name> --data <data_file> --max_cha <text-length> --lang <language> --a <hyperparameter of DC-PDD>

Citation

If you find this work useful, please consider citing our paper:

@article{zhang2024pretraining,
  title={Pretraining data detection for large language models: A divergence-based calibration method},
  author={Zhang, Weichao and Zhang, Ruqing and Guo, Jiafeng and de Rijke, Maarten and Fan, Yixing and Cheng, Xueqi},
  journal={arXiv preprint arXiv:2409.14781},
  year={2024}
}

Name		Name	Last commit message	Last commit date
Latest commit History 41 Commits
figures		figures
src		src
README.md		README.md
data.zip		data.zip

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Overview

PatentMIA

DC-PDD (& baselines)

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Overview

PatentMIA

DC-PDD (& baselines)

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages