This repository is the official implementation of the paper "Provable Training Data Identification for Large Language Models" (ICML 2026).
1. Installation
To set up the environment, run:
conda env create -f environment.yaml
2. Evaluation
To evaluate our method, run:
./run_eval.sh
If you find this useful in your research, please consider citing:
@inproceedings{liu2026provable,
  title={Provable Training Data Identification for Large Language Models},
  author={Liu, Zhenlong and Zeng, Hao and Huang, Weiran and Wei, Hongxin},
  booktitle={Forty-third International Conference on Machine Learning},
  year={2026},
  url={https://arxiv.org/abs/2510.09717}
}
Our code is inspired by Min-K% Prob. We thank the authors for releasing their code.