The official code for the ACL 2025 paper "Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region"
uv venv
uv syncuv run get_insts.py --model-path PATH_TO_YOUR_MODELThe following models are supported:
- Meta-Llama-3
- Llama-2
- gemma-2
- Mistral
Example:
uv run get_insts.py --model-path /data/models/Meta-Llama-3-8B-InstructThe following script computes the causal effects of each head's value states from different regions (section 3.3):
uv run get_patching_scores.py --model-path PATH_TO_YOUR_MODEL --dataset DATASET_TYPEThe following datasets are supported:
- jbb (i.e, JailbreakBench)
- hb (i.e, HarmBench)
Example:
uv run get_patching_scores.py --model-path /data/models/Meta-Llama-3-8B-Instruct --dataset jbb The following script implements TempPatch operation to intervene response generation (section 4.1):
uv run temppatch.py --model-path PATH_TO_YOUR_MODEL --dataset DATASET_TYPE --patched-rate RATE --max-new-tokens NUM_TOKENS--patched_rate is the proportion where the attention head with the highest causal effects will be patched during generation.
Example:
uv run temppatch.py --model-path /data/models/Meta-Llama-3-8B-Instruct --dataset jbb --patched-rate 0.1 --max-new-tokens 512The following script evaluates the Attack Success Rate (ASR) of jailbreak responses using both SorryBench evaluation and string matching:
uv run asr_eval.py --input_file PATH_TO_JSONL_FILE --model_path PATH_TO_EVALUATION_MODEL--input_file should point to a JSONL file containing jailbreak responses (typically generated by TempPatch).
--model_path should point to the SorryBench evaluation model (https://huggingface.co/sorry-bench/ft-mistral-7b-instruct-v0.2-sorry-bench-202406).
The script performs two types of evaluation:
- SorryBench Evaluation: Uses a fine-tuned model to judge if responses comply with harmful requests
- String Matching: Checks for common refusal patterns in responses
Example:
uv run asr_eval.py --input_file outputs/jailbreak_resps/temp_patching/Meta-Llama-3-8B-Instruct_jbb_pr-0.1.jsonl --model_path /data/models/ft-mistral-7b-instruct-v0.2-sorry-bench-202406