Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation

Initial code release for the paper:

Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation (ICLR 2025)

Xinpeng Wang, Chengzhi Hu, Paul Röttger and Barbara Plank.

This code is built on top of the code from the great work Refusal in Language Models Is Mediated by a Single Direction.

🪜 Environment Setup

source setup.sh

Install the evaluation harness from source:

cd lm-evaluation-harness
pip install -e .

🔭 Experiments

To run vector extraction, ablation, and evaluation, run the script below:

python -m pipeline.run_pipeline --config_path configs/cfg.yaml

🏄‍♂️ Demo

We also provide a demo notebook (demo.ipynb). We recommend it as a hands-on introduction to how our pipeline works and how the model changes under (fine-grained) vector ablation.
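
For readers who want the gist before opening the notebook, here is a minimal sketch of what ablating a single vector from an activation can look like. It is not the repository's code: the function names (`mean_difference`, `ablate`), the `scale` parameter, and the toy 3-dimensional activations are all hypothetical. It assumes the vector is taken as a difference of mean activations between two prompt groups, and that "fine-grained" ablation means removing only a fraction of the component along that vector. The actual pipeline may differ on both points.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def mean_difference(group_a: List[Vector], group_b: List[Vector]) -> Vector:
    """Candidate direction: mean activation of group_a minus that of group_b."""
    dim = len(group_a[0])
    mean_a = [sum(v[i] for v in group_a) / len(group_a) for i in range(dim)]
    mean_b = [sum(v[i] for v in group_b) / len(group_b) for i in range(dim)]
    return [a - b for a, b in zip(mean_a, mean_b)]


def ablate(activation: Vector, direction: Vector, scale: float = 1.0) -> Vector:
    """Remove `scale` times the component of `activation` along `direction`.

    scale=1.0 projects the direction out entirely; 0 < scale < 1 removes
    only part of it (an assumed reading of "fine-grained" ablation).
    """
    unit = normalize(direction)
    coeff = dot(activation, unit)
    return [x - scale * coeff * u for x, u in zip(activation, unit)]


# Toy activations (hypothetical): the two groups differ only along axis 0.
refused = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
complied = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
direction = mean_difference(refused, complied)

activation = [5.0, 2.0, 1.0]
full = ablate(activation, direction)                # component along axis 0 removed
partial = ablate(activation, direction, scale=0.5)  # half of that component removed
```

With full ablation the activation has no remaining component along the extracted direction, while everything orthogonal to it is left untouched. That is why a single-vector edit of this kind can be described as surgical.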

Cite

@inproceedings{wang2025surgical,
    title={Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation},
    author={Xinpeng Wang and Chengzhi Hu and Paul R{\"o}ttger and Barbara Plank},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=SCBn8MCLwc}
}

