SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought

Paper: https://www.arxiv.org/abs/2511.07772
Accepted to Neurips25 ResponsibleFM Workshop, AAAI26-Trustworthy Agentic AI Workshop

Overview

This is our research codebase for SALT. Includes evaluating and steering large language models to reduce privacy leakage in chain-of-thought (CoT) reasoning. It contains scripts to:

run baseline and steered generations while capturing layer activations,
compute privacy/utility metrics (with optional LLM-as-a-judge), and
analyze which layers are most associated with leakage and save steering vectors.

If you use this repository in academic work, please cite the accompanying paper:

@misc{batra2025saltsteeringactivationsleakagefree,
      title={SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought}, 
      author={Shourya Batra and Pierce Tillman and Samarth Gaggar and Shashank Kesineni and Kevin Zhu and Sunishchal Dev and Ashwinee Panda and Vasu Sharma and Maheep Chaudhary},
      year={2025},
      eprint={2511.07772},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2511.07772}, 
}

Repository structure

leak_eval/
- eval_cp.py: Baseline evaluation with optional multi-layer activation capture and optional GPT eval.
- steered_eval_cp_resume.py: Run evaluation with steering vectors applied (single- or multi-layer), resume-safe.
- find_leak_layers.py: Contrastive per-neuron analysis to rank layers by leakage association and optionally export steering vectors.
- cp_eval_utils.py: Metric helpers (utility/leakage), GPT evaluation, cost estimation.
- generate_utils.py: Provider/model helpers and generation utilities.
- prompts/cp_open_ended_chat/: Prompt templates (vanilla.txt, cot_explicit_unk.txt, reasoning_explicit_unk.txt, situation_template.txt).
- scripts/: Small utilities (precompute vectors, merge/split datasets, run GPT eval on results, count leaks, sweeps).
notebooks/: Prototyping and blueprint pipeline (instruction.ipynb).
results/: Example outputs, sweeps and paper figures (reference only; you can regenerate locally).

Requirements

Python 3.10+
A GPU is recommended for HF models; CPU works for small tests.
Optional APIs:
- OpenAI (for LLM-as-a-judge): OPENAI_API_KEY
- OpenRouter (if using --model_provider openrouter): OPENROUTER_API_KEY
- Hugging Face token as needed for gated models (HF_TOKEN or HUGGINGFACEHUB_API_TOKEN).

Quickstart

See notebooks/instruction.ipynb for a quick start.

Other

All of the other code in here is code we used for the paper. Data from our paper is also provided in /results/.

Reference results are under results/final_results/ and layer-analysis CSVs under results/leak_layer_csvs/.
Plot scripts: results/final_results/results_graph/graph.py and results/leak_layer_csvs/graphs/graph.py.

Notebooks

notebooks/instruction.ipynb is a high-level blueprint for the end-to-end pipeline.

Environment notes

Some models do not support a system role; the code handles this automatically (e.g., Gemma) by stripping the system role in chat templates.
Batch size is auto-tuned from GPU VRAM if not provided.

Results schema (abridged)

eval_cp.py writes a JSON with at least data (list of examples with outputs/metrics) and summary (aggregate metrics, averages). When GPT eval is run, the summary also includes gpt_utility_score, gpt_pii_leakage, and total_gpt_api_cost.

Acknowledgements

This work builds upon Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers by Tommaso Green, Martin Gubri, Haritz Puerto, Sangdoo Yun, and Seong Joon Oh (arXiv:2506.15674), and the accompanying AirGapAgent-R Dataset.

License

Contact & contributions Issues and PRs are welcome. For substantive contributions, please open an issue first to discuss scope. If you build on SALT, let us know—happy to link community extensions here.

Developers

Shourya Batra

Sophomore at Homestead High School who enjoys experimenting with LLMs and playing volleyball and the Euphonium.

Pierce Tillman

Junior at West Campus High School who loves to search for new ways to make LLMs more intuitive and enthusiast photographer (check out my work on Instagram @warrriorwatch)

Samarth Gaggar

Sophomore at Dublin High School who enjoys understanding LLM trustworthiness analysis as well as robotics, debate, and research.

Shashank Kesineni

Sophomore at Rock Ridge High School who loves learning why LLMs behave the way they do and enjoys soccer, debate, and volunteering.

Name		Name	Last commit message	Last commit date
Latest commit History 107 Commits
leak_eval		leak_eval
notebooks		notebooks
results		results
.DS_Store		.DS_Store
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought

Overview

Repository structure

Requirements

Quickstart

Other

Acknowledgements

License

Developers

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought

Overview

Repository structure

Requirements

Quickstart

Other

Acknowledgements

License

Developers

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages