We present Tabular-MIA, the first benchmark designed to evaluate membership inference attacks (MIAs) on large language models (LLMs) trained on tabular data. We systematically explore various table encodings, model architectures, and attack methods, introducing a unified evaluation framework for analyzing privacy risks in tabular LLM training.
Key contributions include:
- A suite of datasets with consistent train/test/synthetic splits
- Six table encoding schemes: JSON, HTML, Markdown, Key-Value Pairs, Key-Is-Value, and Line-separated text
- Support for fine-tuning LLMs with QLoRA
- Comprehensive MIA evaluation pipeline with AUROC, TPR@FPR, and robustness tests
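To make the two headline metrics concrete, here is a minimal pure-Python sketch of AUROC and TPR at a fixed FPR computed from attack scores. It assumes that higher scores mean "more likely a member" and that labels are 1 for members and 0 for non-members. The function names are illustrative and are not taken from the Tab-MIA code base.

```python
def _split(scores, labels):
    """Separate scores into member (label 1) and non-member (label 0) lists."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one member and one non-member")
    return pos, neg


def auroc(scores, labels):
    """Probability that a random member outscores a random non-member (ties count half)."""
    pos, neg = _split(scores, labels)
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(scores, labels, max_fpr=0.05):
    """Highest TPR over thresholds whose FPR does not exceed max_fpr."""
    pos, neg = _split(scores, labels)
    best = 0.0
    for t in sorted(set(scores)):
        # A record is predicted "member" when its score is >= t.
        fpr = sum(n >= t for n in neg) / len(neg)
        if fpr <= max_fpr:
            tpr = sum(p >= t for p in pos) / len(pos)
            best = max(best, tpr)
    return best


# Toy scores for three members and three non-members (made-up values).
example_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
example_labels = [1, 1, 0, 1, 0, 0]
print(auroc(example_scores, example_labels))
print(tpr_at_fpr(example_scores, example_labels, max_fpr=0.0))
```

The quadratic pairwise loop keeps the definition visible. For large evaluations a library routine is the more practical choice.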
To set up the environment:

    conda create -n tab-mia-env python=3.11 -y
    conda activate tab-mia-env
    # Install dependencies
    pip install -r requirements.txt

The datasets can be found and downloaded from Hugging Face: Tab-MIA
We provide formatted datasets for:
- Adult Income
- California Housing
- TabFact
- WikiTableQuestions (WTQ)
- WikiSQL
Each dataset is provided in six different encodings:
- JSON - Serializes the table as a JSON list of records.
- HTML - Converts the table into an HTML `<table>` element.
- Markdown - Formats the table as a Markdown table.
- Key-Value Pairs - Each cell is represented as a "Key: Value" pair.
- Key-Is-Value - Similar to Key-Value, but in a natural language sentence format.
- Line-separated - Outputs each row as a comma-separated line of text.
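The six encodings above can be illustrated on a one-row table. This is a rough sketch under assumptions: the exact strings Tab-MIA emits (separators, header handling, sentence wording) are not specified here, so details such as the trailing period in the Key-Is-Value form and the header line in the line-separated form are guesses. `encode_table` is a hypothetical helper, not a function from the repository.

```python
import html
import json


def encode_table(columns, rows, encoding):
    """Serialize a small table (list of column names + list of rows) as text."""
    records = [dict(zip(columns, row)) for row in rows]
    if encoding == "json":
        return json.dumps(records)
    if encoding == "html":
        head = "".join(f"<th>{html.escape(str(c))}</th>" for c in columns)
        body = "".join(
            "<tr>" + "".join(f"<td>{html.escape(str(v))}</td>" for v in row) + "</tr>"
            for row in rows
        )
        return f"<table><tr>{head}</tr>{body}</table>"
    if encoding == "markdown":
        lines = [
            "| " + " | ".join(map(str, columns)) + " |",
            "| " + " | ".join("---" for _ in columns) + " |",
        ]
        lines += ["| " + " | ".join(map(str, row)) + " |" for row in rows]
        return "\n".join(lines)
    if encoding == "key-value-pair":
        return "\n".join(
            ", ".join(f"{k}: {v}" for k, v in rec.items()) for rec in records
        )
    if encoding == "key-is-value":
        return "\n".join(
            ", ".join(f"{k} is {v}" for k, v in rec.items()) + "." for rec in records
        )
    if encoding == "line-sep":
        # Assumption: the header is emitted as the first comma-separated line.
        lines = [",".join(map(str, columns))]
        lines += [",".join(map(str, row)) for row in rows]
        return "\n".join(lines)
    raise ValueError(f"unknown encoding: {encoding}")


columns = ["age", "income"]
rows = [[39, "<=50K"]]
for name in ["json", "html", "markdown", "key-value-pair", "key-is-value", "line-sep"]:
    print(f"--- {name} ---")
    print(encode_table(columns, rows, name))
```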
Each file includes an indicator specifying whether each record is a member (part of the training set) or a non-member.
Directory structure:
    datasets/
    └── adult/
        ├── adult_format_json.jsonl
        ├── adult_format_html.jsonl
        └── ...
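Because every file carries a membership indicator, a first step in most analyses is to split the records into members and non-members. The sketch below does this for JSONL lines. The field name `label` and the convention that 1 marks a member are assumptions, as is the `input` field in the toy data, so inspect a downloaded file and adjust `label_key` accordingly.

```python
import json


def split_by_membership(lines, label_key="label"):
    """Parse JSONL lines and split records into (members, non_members).

    Assumes record[label_key] == 1 marks a member; this key name is a guess.
    """
    members, non_members = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record[label_key] == 1:
            members.append(record)
        else:
            non_members.append(record)
    return members, non_members


# In-memory stand-in for a file such as datasets/adult/adult_format_json.jsonl.
toy_lines = [
    '{"input": "[{\\"age\\": 39}]", "label": 1}',
    '{"input": "[{\\"age\\": 52}]", "label": 0}',
]
members, non_members = split_by_membership(toy_lines)
print(len(members), len(non_members))
```

The function accepts any iterable of strings, so an open file handle can be passed directly.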
To run the full pipeline, use the main script, `main.py`, with the parameters described below.

Parameter explanations:
- `--target_model`: Name or path of the base LLM to fine-tune (e.g., `mistralai/Mistral-7B-v0.1` or any other model available on Hugging Face).
- `--output_dir`: Directory where models, logs, and result files will be saved.
- `--num_epochs`: Number of epochs for QLoRA fine-tuning.
- `--use_existing`: Whether to reuse previously generated data/models (`all`, `data`, or `model`).
- `--table_encoding`: The encoding format used to serialize tables. Supported values are:
  - `json` - JSON list of row dictionaries
  - `html` - HTML `<table>` format
  - `markdown` - Markdown table syntax
  - `key-value-pair` - "Key: Value" per cell
  - `key-is-value` - Same as above, in natural sentence form
  - `line-sep` - Line-separated, comma-separated rows
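For readers who want to see how these flags fit together, here is a hypothetical `argparse` definition that mirrors them. It is not the parser from `main.py`. The defaults are illustrative and copied from the example command in this README, and the real script may define types, defaults, and required flags differently.

```python
import argparse

ENCODINGS = ["json", "html", "markdown", "key-value-pair", "key-is-value", "line-sep"]


def build_parser():
    """Hypothetical parser mirroring the documented flags (defaults are assumptions)."""
    parser = argparse.ArgumentParser(description="Sketch of the Tab-MIA command line")
    parser.add_argument("--data", required=True,
                        help="Dataset name (e.g. tabMIA_adult) or path to a data file")
    parser.add_argument("--target_model", default="mistralai/Mistral-7B-v0.1",
                        help="Name or path of the base LLM to fine-tune")
    parser.add_argument("--output_dir", default="results/",
                        help="Where models, logs, and result files are saved")
    parser.add_argument("--num_epochs", type=int, default=3,
                        help="Number of QLoRA fine-tuning epochs")
    parser.add_argument("--use_existing", choices=["all", "data", "model"], default=None,
                        help="Reuse previously generated data and/or models")
    parser.add_argument("--table_encoding", choices=ENCODINGS, default="json",
                        help="How tables are serialized to text")
    return parser


args = build_parser().parse_args(["--data", "tabMIA_adult", "--table_encoding", "html"])
print(args.data, args.table_encoding, args.num_epochs)
```

Restricting `--table_encoding` and `--use_existing` with `choices` means an unsupported value is rejected before any data is processed.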
The `--data` argument determines the mode:
- Run Tab-MIA: When `--data` starts with `tabMIA_`, the data is automatically fetched from Hugging Face datasets (e.g., `tabMIA_adult`).
- Create Tab-MIA Dataset:
  - Long-Context Tables: To create a Tab-MIA dataset for long-context tables, use `--data` with a path to a CSV file. The code will process the CSV into text chunks based on the selected encoding if the encoded version does not already exist.
  - Short-Context Tables: For short-context tables, the dataset name should match one of the short-context datasets (`wtq`, `wikisql`, or `tabfact` only). The code will generate the .jsonl files in the specified encoding if they do not already exist.
- Pretrained Model: When running `main_syn.py`, `--data` must be a path to a JSONL file. The code will use the provided data for MIA detection without fine-tuning.
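The mode selection described above can be sketched as a small dispatcher, together with a row-chunking helper for long-context CSVs. Both are assumptions made for illustration. The mode labels are hypothetical, and the actual code may chunk by token count or context length rather than by a fixed number of rows.

```python
SHORT_CONTEXT_DATASETS = {"wtq", "wikisql", "tabfact"}


def resolve_data_source(data_arg):
    """Map a --data value to a (hypothetical) mode label."""
    if data_arg.startswith("tabMIA_"):
        return "hub"  # fetched from Hugging Face datasets
    if data_arg in SHORT_CONTEXT_DATASETS:
        return "short_context"  # .jsonl files generated per encoding
    if data_arg.lower().endswith(".csv"):
        return "long_context_csv"  # CSV split into encoded text chunks
    if data_arg.lower().endswith(".jsonl"):
        return "jsonl"  # input expected by main_syn.py
    raise ValueError(f"cannot interpret --data value: {data_arg!r}")


def chunk_rows(rows, rows_per_chunk):
    """Split a list of rows into consecutive chunks of at most rows_per_chunk rows."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be positive")
    return [rows[i:i + rows_per_chunk] for i in range(0, len(rows), rows_per_chunk)]


for value in ["tabMIA_adult", "wtq", "my_table.csv", "records.jsonl"]:
    print(value, "->", resolve_data_source(value))
print(chunk_rows(list(range(5)), 2))
```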
Example:

    python main.py --data tabMIA_adult \
        --target_model mistralai/Mistral-7B-v0.1 \
        --output_dir results/ \
        --num_epochs 3 \
        --use_existing all \
        --table_encoding json

The script handles:
- Preprocessing the data for each encoding
- Fine-tuning with QLoRA (or reusing a previous model)
- Running MIA detection.
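To give a sense of what the MIA detection step computes, the sketch below scores a record from its per-token log-probabilities under the target model. It shows two widely used baselines, average log-likelihood (the loss attack) and Min-K% Prob. Whether Tab-MIA runs exactly these attacks is an assumption, and in practice the log-probabilities would come from the fine-tuned LLM rather than from the hard-coded list used here.

```python
def loss_score(token_logprobs):
    """Average token log-likelihood; higher values suggest the text was seen in training."""
    if not token_logprobs:
        raise ValueError("empty sequence")
    return sum(token_logprobs) / len(token_logprobs)


def min_k_score(token_logprobs, k=0.2):
    """Mean log-probability of the k fraction of least likely tokens (Min-K% Prob)."""
    if not token_logprobs:
        raise ValueError("empty sequence")
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


# Made-up log-probabilities for a five-token record.
logprobs = [-0.1, -0.2, -3.0, -0.5, -1.2]
print(loss_score(logprobs))
print(min_k_score(logprobs, k=0.4))
```

Either score can be thresholded to predict membership, or passed with the member/non-member labels to an AUROC or TPR@FPR computation.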
To run MIA detection on a pretrained model without fine-tuning, use `main_syn.py`:

    python main_syn.py --data <path_to_JSONL_file> \
        --target_model mistralai/Mistral-7B-v0.1 \
        --output_dir results/ \
        --use_existing all \
        --table_encoding json

See docs/EXTENDING.md for instructions on adding new attack methods, using different language models, and evaluating custom CSV datasets.
This project is licensed under the MIT License.