Code for the paper "Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models"
Here we provide the code to replicate the COLM'24 paper "Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models".
The code is organized as follows.
- The files `run_tabular_experiments.py`, `run_time_series_experiments.py` and `run_statistical_experiments.py` run the different experiments, that is, sending queries to the LLM.
- The LLM queries are saved to disk and analyzed in Jupyter Notebooks. These are contained in the `notebooks` folder. The notebooks generate the figures and tables in the paper.
- The memorization tests can be directly performed with the `tabmemcheck` package, see the notebook `memorization-tests.ipynb`.
- The `datasets` folder contains CSV files.
- The `preprocessing` folder contains notebooks that create the ACS Income, ACS Travel and ICU datasets.
- The `config` folder contains prompt configurations and YAML files that specify the different dataset transforms.
- The environment used to run the experiments is given in `environment.yml`.
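
The workflow described above (send queries to the LLM, save them to disk, analyze them later in a notebook) can be illustrated with a minimal sketch. This is not the code from the experiment scripts: the JSON Lines file layout, the function names `run_queries` and `load_queries`, and the `fake_llm` stand-in are all assumptions made for illustration, and the stand-in replaces a real API call so the example runs offline.

```python
import json
import tempfile
from pathlib import Path


def run_queries(prompts, query_llm, out_path):
    """Send each prompt to the model and append the prompt/response pair to a JSON Lines file.

    Hypothetical helper: the file layout is an assumption, not the repository's format.
    """
    with open(out_path, "a", encoding="utf-8") as f:
        for prompt in prompts:
            response = query_llm(prompt)
            f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")


def load_queries(path):
    """Read the saved prompt/response pairs back, e.g. inside an analysis notebook."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def fake_llm(prompt):
    # Stand-in for a real LLM API call (assumption) so the sketch runs without network access.
    return prompt.upper()


# Write the log to a fresh temporary directory so repeated runs do not accumulate records.
log_path = Path(tempfile.mkdtemp()) / "queries.jsonl"
run_queries(["age,workclass", "sepal_length,sepal_width"], fake_llm, log_path)

records = load_queries(log_path)
print(len(records), records[0]["response"])
```

Keeping the querying step separate from the analysis step means the saved responses can be re-analyzed without querying the LLM again.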
The data to replicate our results (LLM queries and responses) is available here.
@article{bordt2024colm,
title={Elephants Never Forget: Memorization and Learning of Tabular Data in
Large Language Models},
author={Bordt, Sebastian and Nori, Harsha and Rodrigues, Vanessa and Nushi, Besmira and Caruana, Rich},
journal={Conference on Language Modeling (COLM)},
year={2024}
}