Code for the Security'26 submission "Membership Inference Attacks on Tokenizers of Large Language Models"
Note that this repository is anonymous and intended for review purposes only.
First, set up the Python environment and install all required dependencies.
conda create -n MIA python=3.12
conda activate MIA
pip install -r requirements.txt
Next, download the datasets used in our evaluations. These datasets have been collected by Google.
python download_datasets.py
In this step, train the target tokenizers, which serve as the attack targets in the MIA experiments.
python train_target_tokenizer.py
Shadow tokenizers are trained to mimic the behavior of the target tokenizer. They are used in the attack phase to help infer membership.
python train_shadow_tokenizer.py
Now, conduct membership inference attacks using various methods. Each script below implements a different attack method.
python mia_via_compression_rate.py
python mia_via_vocabulary_overlap.py
python mia_via_frequency_estimation.py
python mia_via_merge_similarity.py
python mia_via_naive_bayes.py
All experimental results will be saved in the infer_results folder for further analysis.
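To give a rough intuition for one of the attacks above, the following is a minimal sketch of a compression-rate signal. It is not the code in mia_via_compression_rate.py. It assumes that a tokenizer compresses text resembling its training data into fewer tokens. The toy greedy tokenizer, the function names (greedy_tokenize, compression_rate, infer_membership), and the fixed threshold are all hypothetical and chosen only for illustration.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match segmentation (a toy stand-in for a real tokenizer).

    Characters not covered by any vocabulary entry become single-character tokens.
    """
    max_len = max((len(t) for t in vocab), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first and fall back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


def compression_rate(texts, vocab):
    """Average number of characters per token over a collection of texts."""
    n_chars = sum(len(t) for t in texts)
    n_tokens = sum(len(greedy_tokenize(t, vocab)) for t in texts)
    return n_chars / n_tokens if n_tokens else 0.0


def infer_membership(texts, vocab, threshold):
    """Guess 'member' when the texts compress better than the threshold.

    The threshold is a free parameter here (an assumption of this sketch).
    """
    return compression_rate(texts, vocab) > threshold


if __name__ == "__main__":
    toy_vocab = {"token", "izer", "member", "ship", " "}
    print(compression_rate(["tokenizer membership"], toy_vocab))
    print(compression_rate(["quantum zebra"], toy_vocab))
    print(infer_membership(["tokenizer membership"], toy_vocab, threshold=2.0))
```

In this toy setting, text that shares many substrings with the vocabulary yields more characters per token than unrelated text. In practice the decision threshold would have to be calibrated, for instance with the help of shadow tokenizers.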
The code for the min count defense is provided in the 'min_defense' folder. It can be run with the following command:
python min_defense.py
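As a rough illustration of the idea, and not the code in min_defense.py, the sketch below assumes that a min count defense drops vocabulary candidates that occur fewer than a minimum number of times in the training data, so that rare, dataset-specific strings do not enter the vocabulary. The word-level vocabulary, the function name build_vocab, and the parameter min_count are hypothetical simplifications.

```python
from collections import Counter


def build_vocab(corpus, min_count=1):
    """Build a toy word-level vocabulary, keeping words seen at least min_count times.

    A real tokenizer would operate on subword units; whitespace-separated words
    are used here only to keep the sketch short.
    """
    counts = Counter(word for doc in corpus for word in doc.split())
    return {word for word, count in counts.items() if count >= min_count}


if __name__ == "__main__":
    corpus = ["the cat sat", "the dog sat", "the rare-name sat"]
    # Without the filter, every word, including the rare one, enters the vocabulary.
    print(sorted(build_vocab(corpus, min_count=1)))
    # With min_count=2, words that appear only once are dropped.
    print(sorted(build_vocab(corpus, min_count=2)))
```

A higher min_count removes more rare entries, which under this assumption would leave fewer dataset-specific traces in the vocabulary, at the cost of a smaller vocabulary.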