Fengqing Jiang¹, Zhangchen Xu¹, Yuetai Li¹, Luyao Niu¹, Zhen Xiang², Bill Yuchen Lin¹, Bo Li³, Radha Poovendran¹

¹University of Washington, ²University of Georgia, ³University of Chicago
Warning: This paper contains model outputs that may be considered offensive.
[Paper] [Project Page] [HuggingFace]
- [2025/04/02] 📢 We released two models trained on the SafeChain dataset, hosted on Hugging Face.
- [2025/03/21] 🚀 SafeChain is selected for an oral presentation at the ICLR 2025 BiAlign workshop. See you in Singapore!
- [2025/02/21] We released our source code.
Before running our code:

- At the safechain dir, run
  ```bash
  bash scripts/build_env.sh safechain
  ```
- Add your HF token to access models (see the example after this list).
- Update the model config (e.g., the number of GPUs for each model, or to add new models) in `config.py`.
- If you are running models with API access, make sure to add the endpoint setup in `utils_model.py`. To use a different model endpoint, also switch it in `config.py` (refer to the setup for DeepSeek R1).
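For example, one common way to make the token available is the standard Hugging Face environment variable (a minimal sketch; whether the scripts read `HF_TOKEN` directly is an assumption, and `huggingface-cli login` is an equivalent alternative):

```bash
# Assumption: the standard Hugging Face token environment variable is honored.
# Alternatively, `huggingface-cli login` caches the token locally.
export HF_TOKEN="hf_xxxxxxxxxxxx"   # replace with your own access token
```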
Our pipeline includes two steps: generating model responses and evaluation.

To run the generation step separately, use:

```bash
python resp_gen.py
```

Below is a summary of the command-line arguments provided by the script, along with their descriptions and default values; an example invocation follows the table.
| Argument | Type | Default | Choices | Description |
|---|---|---|---|---|
| `--model` | str | `RSM_LIST[0]` (first entry in `RSM_LIST`) | entry in `RSM_LIST` | The model name to use, selected from the list of available models in `RSM_LIST` in `config.py`. |
| `--data` | str | `strongreject` | entry in `EVAL_DATA` | The dataset to use for evaluation, selected from `EVAL_DATA` in `config.py`. |
| `--prompt` | str | `normal` | `normal`, `zerothink`, `lessthink`, `overthink` | Setup for the generation input (e.g., type of prompt). |
| `--system` | bool | `DEFAULT_GEN_CONFIG['system']` | None | Whether to override the system prompt setup in `config.py`. |
| `--temperature` | float | `DEFAULT_GEN_CONFIG['temperature']` | None | Sampling temperature for text generation (higher means more randomness). |
| `--topp` | float | `DEFAULT_GEN_CONFIG['topp']` | None | Nucleus sampling probability (top-p). |
| `--topk` | int | `DEFAULT_GEN_CONFIG['topk']` | None | Top-k sampling parameter. |
| `--max_tokens` | int | `DEFAULT_GEN_CONFIG['max_tokens']` | None | Maximum number of tokens to generate. |
| `--repeat_n` | int | `DEFAULT_GEN_CONFIG['repeat_n']` | None | Number of samples to generate per prompt input. |
| `--n` | int | `-1` | None | Number of samples to use. Use `-1` to include all available samples. |
| `--start_idx` | int | `0` | None | The starting index from which to use the dataset samples. |
| `--port` | int | `8000` | None | Port number (or used as an identifier in file naming, depending on your use case). |
| `--think_budget` | int | `10000` | None | Thinking budget for internal "thinking" or hidden reasoning. |
| `--enforce_num` | int | `10` | None | Enforced time limit for the MoreThink setup. |
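For example, a typical generation run combining a few of these arguments might look like the following (values are illustrative; the model must appear in `RSM_LIST` and the dataset in `EVAL_DATA` in `config.py`):

```bash
# Illustrative generation run; argument values are examples, not required settings.
python resp_gen.py \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
    --data strongreject \
    --prompt normal \
    --repeat_n 5
```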
Then, with the output file, run the following command:

```bash
python resp_eval.py --file file_name
```
The experiment can also be run end-to-end by replacing `resp_gen.py` with `pipeline.py`.
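A sketch of an end-to-end run, assuming `pipeline.py` takes the same arguments as `resp_gen.py` (as the replacement wording above suggests; the values are illustrative):

```bash
# End-to-end run: response generation followed by evaluation in one command.
python pipeline.py --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B --data strongreject --prompt normal
```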
We provide an efficient implementation for the MoreThink setup. You must first start the vLLM server and then run `resp_gen.py`. We also provide a script to run this setup.
Under the scripts dir, run

```bash
bash morethink_uni.sh MODEL_PATH TENSOR_PARALLEL_SIZE GEN_DEVICE EVAL_DEVICE RUN_PY
```
If RUN_PY is `gen`, the script will not run evaluation after response generation; this is helpful when you do not have enough GPU devices (e.g., only 1 GPU).
Example:

```bash
bash morethink_uni.sh deepseek-ai/DeepSeek-R1-Distill-Llama-70B 4 "0,1,2,3" "2" "gen"
```
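For the single-GPU case mentioned above, a minimal sketch is to generate first and then evaluate separately (the model path is just an example, and the output file placeholder should be replaced by the file produced by the generation step):

```bash
# Generation only (RUN_PY=gen) on a single GPU; the model path is an example.
bash morethink_uni.sh deepseek-ai/DeepSeek-R1-Distill-Llama-8B 1 "0" "0" "gen"
# Then evaluate the generated responses separately.
python resp_eval.py --file <generated_output_file>
```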
For math-related tasks, we adapted the codebase here.
For coding tasks, we adapted EvalPlus for HumanEval and MBPP, and we adapted the SkyThought codebase for LiveCodeBench evaluation (we upgraded to v5 for evaluation). As mentioned in our paper, we use greedy decoding for evaluation, and we set repetition_penalty to 1.1 for coding tasks.
We will prepare off-the-shelf scripts for easy evaluation.
If you find our work useful, please consider citing our paper:
```bibtex
@article{jiang2025safechain,
  title={SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities},
  author={Jiang, Fengqing and Xu, Zhangchen and Li, Yuetai and Niu, Luyao and Xiang, Zhen and Li, Bo and Lin, Bill Yuchen and Poovendran, Radha},
  journal={arXiv preprint arXiv:2502.12025},
  year={2025}
}
```