
SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities

Fengqing Jiang1 ,  Zhangchen Xu1 ,  Yuetai Li1 ,  Luyao Niu1 , 
Zhen Xiang2 ,  Bill Yuchen Lin1 ,  Bo Li3 ,  Radha Poovendran1

1University of Washington    2University of Georgia    3University of Chicago

Warning: This paper contains model outputs that may be considered offensive.

[Paper]    [Project Page]    [HuggingFace]

News

  • [2025/04/02] 📢 We released two models trained on the SafeChain dataset, hosted on Hugging Face.
  • [2025/03/21] 🚀 SafeChain was selected for an oral presentation at the ICLR 2025 BiAlign workshop. See you in Singapore!
  • [2025/02/21] We released our source code.

How to use our project

Before running our code, complete the following setup.

Build the environment

In the safechain directory, run

bash scripts/build_env.sh safechain

Configuration setup

  • Add your Hugging Face (HF) token to access gated models (see the example after this list).
  • Update the model config in config.py (e.g., the number of GPUs for each model, or to add new models).
  • If you are running models with API access, make sure to add the endpoint setup in utils_model.py. To use a different model endpoint, also switch it in config.py (refer to the setup for DeepSeek R1).
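For example, the HF token can be supplied via the HF_TOKEN environment variable, which the huggingface_hub library reads automatically (a minimal sketch; check config.py for the exact mechanism it expects):

# Make your Hugging Face token available to huggingface_hub
export HF_TOKEN="hf_your_token_here"  # placeholder; replace with your own token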

Run our code

Our pipeline consists of two steps: model response generation and evaluation.

To run the steps separately, start with response generation (see the argument summary below and an example invocation after the table):

python resp_gen.py

Command-Line Arguments Help

Below is a summary of the command-line arguments provided by the script, along with their descriptions and default values.

| Argument | Type | Default | Choices | Description |
| --- | --- | --- | --- | --- |
| --model | str | RSM_LIST[0] (first entry in RSM_LIST) | entries in RSM_LIST | Model name to use, selected from the available models in RSM_LIST in config.py. |
| --data | str | strongreject | entries in EVAL_DATA | Dataset to use for evaluation, selected from EVAL_DATA in config.py. |
| --prompt | str | normal | normal, zerothink, lessthink, overthink | Setup for the generation input (i.e., the type of prompt). |
| --system | bool | DEFAULT_GEN_CONFIG['system'] | — | Whether to override the system prompt setup in config.py. |
| --temperature | float | DEFAULT_GEN_CONFIG['temperature'] | — | Sampling temperature for text generation (higher means more randomness). |
| --topp | float | DEFAULT_GEN_CONFIG['topp'] | — | Nucleus sampling probability (top-p). |
| --topk | int | DEFAULT_GEN_CONFIG['topk'] | — | Top-k sampling parameter. |
| --max_tokens | int | DEFAULT_GEN_CONFIG['max_tokens'] | — | Maximum number of tokens to generate. |
| --repeat_n | int | DEFAULT_GEN_CONFIG['repeat_n'] | — | Number of samples to generate per prompt input. |
| --n | int | -1 | — | Number of dataset samples to use; -1 uses all available samples. |
| --start_idx | int | 0 | — | Starting index of the dataset samples to use. |
| --port | int | 8000 | — | Port number (also used as an identifier in file naming, depending on your use case). |
| --think_budget | int | 10000 | — | Token budget for the model's internal "thinking" (hidden reasoning). |
| --enforce_num | int | 10 | — | Enforced number of times for the MoreThink setup. |
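For instance, a generation run might look like the following (an illustrative invocation; the model must be an entry in RSM_LIST in config.py, and here we reuse the DeepSeek model from the MoreThink example below):

python resp_gen.py --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B --data strongreject --prompt normal --n -1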

Then, using the output file produced by generation, run the evaluation:

python resp_eval.py --file file_name
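For example, assuming the generation step wrote its output to outputs/strongreject_normal.json (a hypothetical file name; substitute the file actually produced by resp_gen.py):

python resp_eval.py --file outputs/strongreject_normal.json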

The experiment can also be run in an end-to-end manner by replacing resp_gen.py with pipeline.py; see the example below.
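Assuming pipeline.py accepts the same arguments as resp_gen.py (an assumption; check the script for its exact interface):

python pipeline.py --data strongreject --prompt normal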

Running MoreThink Experiment

We provide an efficient implementation of the MoreThink setup. You must first boot the vLLM server and then run resp_gen.py; we also provide a script that runs this setup for you (a manual server launch is sketched below).
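For reference, a vLLM OpenAI-compatible server can be launched manually with vLLM's standard entry point (a sketch; the script below handles this step for you, and the exact flags in our setup may differ):

# Launch a vLLM OpenAI-compatible server on the default port 8000
python -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B --tensor-parallel-size 4 --port 8000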

In the scripts directory, run

bash morethink_uni.sh  MODEL_PATH TENSOR_PARALLEL_SIZE GEN_DEVICE EVAL_DEVICE RUN_PY 

If RUN_PY is gen, the script will not run evaluation after response generation; this is useful when you do not have enough GPU devices (e.g., only one GPU).

Example:

bash morethink_uni.sh deepseek-ai/DeepSeek-R1-Distill-Llama-70B 4  "0,1,2,3" "2" "gen"

Benchmark Evaluation on Math and Coding

For math-related tasks, we adapted the codebase here. For coding tasks, we adapted EvalPlus for HumanEval and MBPP, and we adapted the SkyThought codebase for LiveCodeBench evaluation (we upgraded to v5 for evaluation). As mentioned in our paper, we use greedy decoding for evaluation, and we set repetition_penalty to 1.1 for coding tasks.
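As a reference point, greedy code generation with EvalPlus typically looks like the following (an assumption based on EvalPlus's documented CLI; verify flag names against your EvalPlus version, and note that our adapted version additionally sets repetition_penalty to 1.1):

# Greedy generation on HumanEval via EvalPlus (flags per EvalPlus docs; verify locally)
evalplus.codegen --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B --dataset humaneval --backend vllm --greedy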

We will prepare off-the-shelf scripts for easy evaluation.

Citation

If you find our work useful, please consider citing our paper:

@article{jiang2025safechain,
  title={SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities},
  author={Jiang, Fengqing and Xu, Zhangchen and Li, Yuetai and Niu, Luyao and Xiang, Zhen and Li, Bo and Lin, Bill Yuchen and Poovendran, Radha},
  journal={arXiv preprint arXiv:2502.12025},
  year={2025}
}

About

[ACL 25] SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
