Nebularaid2000/rethink_sft_generalization

Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

ArXiv Github Huggingface ModelScope

Overview

📢News

  • [2026/04/11] We release the raw dataset before filtering and random selection (44k queries, each with 32 responses): Huggingface, ModelScope. We also release the token-level log probability and entropy from the teacher model. We did not make full use of this dataset ourselves and believe it may be useful for future research.
  • [2026/04/10] Our paper is featured on the Huggingface daily papers page. If you enjoy our work, we warmly invite you to upvote it on Huggingface!
  • [2026/04/09] Our paper is available on arXiv. We open-sourced all models and datasets in this Huggingface collection.

Introduction

This paper revisits the prevailing narrative that "SFT memorizes, RL generalizes", focusing mainly on reasoning SFT with long-CoT supervision. Our core conclusion is that generalization in reasoning SFT is highly conditional, jointly shaped by:

  • Optimization dynamics
  • Training data quality and structure
  • Base model capability

Main findings:

  1. Cross-domain performance often follows a dip-and-recovery trajectory, so short training can underestimate final generalization.
  2. Verified long-CoT supervision yields stronger transfer, while low-quality data can produce misleading non-generalization signals.
  3. Stronger base models are more likely to internalize transferable procedural reasoning patterns and generalize better.
  4. Generalization is asymmetric: reasoning improves, but safety can degrade.

Repository Structure

|-- training_scripts/                 # SFT scripts used for the paper experiments
|-- evaluation/                       # Unified evaluation toolkit (lm-eval-harness/evalchemy/math/alpaca/safety)
|-- verl/                             # Core training framework (based on verl)
|   |-- trainer/fsdp_sft_trainer_ours.py
|   `-- utils/dataset/sft_dataset.py

Getting Started

1) [WIP] Environment setup

We will provide the requirements and a Docker image within a week.

Coming soon.

2) Required script edits before training

Update placeholders in training_scripts/*.sh:

  1. ROOT_DIR=/path/to/this/repo
  2. TRAIN_DATA=/path/to/{dataset_name}
  3. WANDB_API_KEY=your_wandb_key

MODEL_PATH is already set in the provided scripts. Change it only if you want to use a different base model.

Distributed variables are expected from the runtime environment:

  1. NODE_COUNT
  2. PROC_PER_NODE
  3. NODE_RANK
  4. MASTER_ADDR

We trained all models on 8 H200 GPUs. If you encounter OOM errors, consider using a smaller micro_batch_size.
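For illustration only, these variables typically map onto a torchrun-style launch. The sketch below is a hypothetical helper (not part of this repo) that reads them with single-node fallbacks:

```python
import os

# Hypothetical helper (not part of this repo): read the distributed
# launch variables listed above, falling back to single-node defaults.
def read_dist_env(env=os.environ):
    return {
        "nnodes": int(env.get("NODE_COUNT", 1)),
        "nproc_per_node": int(env.get("PROC_PER_NODE", 8)),
        "node_rank": int(env.get("NODE_RANK", 0)),
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
    }

cfg = read_dist_env()
# These values would feed a torchrun-style launcher's flags, e.g.:
launch_args = (
    f"--nnodes {cfg['nnodes']} "
    f"--nproc_per_node {cfg['nproc_per_node']} "
    f"--node_rank {cfg['node_rank']} "
    f"--master_addr {cfg['master_addr']}"
)
print(launch_args)
```

The exact launcher flags depend on how the provided scripts wire things up; treat the mapping above as an assumption, not the scripts' actual contents.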

3) Training data schema

Current scripts call verl.trainer.fsdp_sft_trainer_ours with:

  1. data.prompt_key=message
  2. data.response_key=response
  3. data.advantage_key=advantage

Each parquet row should include at least:

  1. message (chat-message list)
  2. response (string)
  3. advantage (scalar, commonly just 1.0)
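As a sketch, one such row could be constructed like this (the message and response contents are made up; only the keys and types follow the schema above):

```python
# One illustrative training row matching the schema above.
row = {
    # consumed via data.prompt_key=message (chat-message list)
    "message": [{"role": "user", "content": "What is 12 * 7?"}],
    # consumed via data.response_key=response (string)
    "response": "<think>12 * 7 = 84.</think>\nThe answer is 84.",
    # consumed via data.advantage_key=advantage (scalar, commonly 1.0 for SFT)
    "advantage": 1.0,
}

assert isinstance(row["message"], list)
assert isinstance(row["response"], str)
assert isinstance(row["advantage"], float)
```

A list of such rows can then be written to parquet, e.g. with `pandas.DataFrame(rows).to_parquet(path)` (pandas plus a parquet engine such as pyarrow assumed).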

Usage

1) Run training (example)

bash training_scripts/Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256.sh

2) Representative scripts aligned with paper settings

  1. Default Math-CoT setting
    training_scripts/Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256.sh
  2. Short-training replication
    training_scripts/Qwen3-14B_Math-CoT-20k_lr5e-5_ep1_bs256.sh
  3. Smaller LR setting
    training_scripts/Qwen3-14B_Math-CoT-20k_lr1e-5_ep1_bs256.sh
  4. Overfitting stress-test schedules
    training_scripts/Qwen3-14B_Math-CoT-20k_lr5e-5_ep16_bs256_ConstLR.sh
    training_scripts/Qwen3-14B_Math-CoT-20k_lr1e-4_ep16_bs256_ConstLR.sh
  5. Data variants
    training_scripts/Qwen3-14B_Math-NoCoT-20k_lr5e-5_ep8_bs256.sh
    training_scripts/Qwen3-14B_Countdown-CoT-20k_lr5e-5_ep8_bs256.sh
    training_scripts/Qwen3-14B_DeepSeek-R1-20k_lr5e-5_ep8_bs256.sh
    training_scripts/Qwen3-14B_NuminaMath-20k_lr5e-5_ep8_bs256.sh
  6. Capability scaling (Qwen3 1.7B/4B/8B/14B)
    training_scripts/Qwen3-1.7B_Math-CoT-20k_lr5e-5_ep8_bs256.sh
    training_scripts/Qwen3-4B_Math-CoT-20k_lr5e-5_ep8_bs256.sh
    training_scripts/Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256.sh
    training_scripts/Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256.sh

3) Merge FSDP checkpoints

python -m verl.model_merger merge \
  --backend fsdp \
  --local_dir /path/to/ckpt/global_step_640 \
  --target_dir /path/to/ckpt/merged_step640 \
  --trust_remote_code

Batch merge helper:

bash training_scripts/model_merger.sh

Before using model_merger.sh, update ckpt_path_list in the script (for example /path/to/model/global_step_640).
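To illustrate what such a batch helper does (a sketch of the idea, not the script's actual contents), the loop below builds one merge command per checkpoint directory:

```python
# Sketch: build one `verl.model_merger merge` command per checkpoint.
# The ckpt_path_list entries are placeholders, as in model_merger.sh.
ckpt_path_list = [
    "/path/to/model/global_step_640",
    "/path/to/model/global_step_1280",
]

commands = []
for ckpt in ckpt_path_list:
    step = ckpt.rsplit("global_step_", 1)[-1]
    parent = ckpt.rsplit("/global_step_", 1)[0]
    commands.append([
        "python", "-m", "verl.model_merger", "merge",
        "--backend", "fsdp",
        "--local_dir", ckpt,
        "--target_dir", f"{parent}/merged_step{step}",
        "--trust_remote_code",
    ])

# Each entry could then be launched with subprocess.run(cmd, check=True).
for cmd in commands:
    print(" ".join(cmd))
```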

Open-source Models

We have open-sourced ALL models trained in our experiments, including the intermediate checkpoints (available in the stepxxx folders of each model repo).

Note that the following model list may include repeated entries, as it is organized by the experiments and conclusions presented in the paper.

| Model Name | Huggingface | ModelScope |
| --- | --- | --- |
| **Weak cross-domain generalization is more pronounced under short training and smaller learning rates (refer to Sec. 3.1; App. C.1, Table 4)** | | |
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep1_bs256 | Huggingface | ModelScope |
| Qwen3-14B_Math-CoT-20k_lr1e-5_ep1_bs256 | Huggingface | ModelScope |
| Qwen3-14B_Math-CoT-20k_lr1e-5_ep2_bs256 | Huggingface | ModelScope |
| **Apparent non-generalization can be an under-optimization artifact, with a dip-and-recovery pattern under extended training (refer to Sec. 3.1-3.2, Fig. 3)** | | |
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| InternLM2.5-20B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| **The above optimization dynamics remain robust under a different teacher model (refer to App. C.2, Fig. 7)** | | |
| Qwen3-14B_DeepSeek-R1-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-8B_DeepSeek-R1-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| InternLM2.5-20B_DeepSeek-R1-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| **Under a fixed 640-step budget, repeated exposure is more effective than one-pass coverage (refer to Sec. 3.3, Table 1)** | | |
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-14B_Math-CoT-2.5k_lr5e-5_ep8_bs32 | Huggingface | ModelScope |
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep1_bs32 | Huggingface | ModelScope |
| **Overfitting symptoms emerge mainly under combined aggressive schedules (refer to Sec. 3.4, Fig. 4; App. C.4)** | | |
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep16_bs256 | Huggingface | ModelScope |
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep16_bs256_ConstLR | Huggingface | ModelScope |
| Qwen3-14B_Math-CoT-20k_lr1e-4_ep16_bs256_ConstLR | Huggingface | ModelScope |
| **Training data quality and structure jointly shape generalization (refer to Sec. 4, Table 2)** | | |
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-14B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-14B_Numina-Math-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-14B_Countdown-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-8B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-8B_Numina-Math-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-8B_Countdown-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| InternLM2.5-20B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| InternLM2.5-20B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| InternLM2.5-20B_Numina-Math-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| InternLM2.5-20B_Countdown-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| **Higher-capability models internalize transferable reasoning patterns more effectively and generalize better (refer to Sec. 5, Fig. 5)** | | |
| Qwen3-1.7B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-4B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| **The capability-dependent trend extends to another model family (refer to App. C.2/C.5, Fig. 8/14/15)** | | |
| Qwen2.5-1.5B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen2.5-3B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen2.5-7B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen2.5-14B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| **Asymmetric generalization: reasoning improves while safety degrades under long-CoT SFT (refer to Sec. 6, Fig. 6)** | | |
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-14B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-8B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| InternLM2.5-20B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| InternLM2.5-20B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| **Appendix: smaller and mid-scale models across data configurations (refer to App. D)** | | |
| Qwen3-1.7B_Countdown-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-1.7B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-4B_Countdown-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-4B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |

Open-source Datasets

We provide the main datasets used in our experiments.

| Dataset Name | Description | Size | Huggingface | ModelScope |
| --- | --- | --- | --- | --- |
| Math-CoT-20k | Verified long-CoT math reasoning data (default setting in the paper) | 20,480 | Huggingface | ModelScope |
| Math-NoCoT-20k | Math-CoT-20k with CoT traces removed (final summary/answer retained) | 20,480 | Huggingface | ModelScope |
| Countdown-CoT-20k | Countdown arithmetic-game long-CoT data for procedural transfer analysis | 20,480 | Huggingface | ModelScope |
| NuminaMath-20k | No-CoT math data with matched queries, sourced from NuminaMath-1.5 | 20,480 | Huggingface | ModelScope |
| DeepSeek-R1-20k | Verified long-CoT responses from DeepSeek-R1 on the same queries, sourced from the LUFFY dataset | 20,480 | Huggingface | ModelScope |

Note: We also release the raw dataset before filtering and random selection: Huggingface, ModelScope. This dataset contains around 44k queries, each with 32 responses generated by Qwen3-32B. For each textual response, we also release the token-level log probability and entropy from the teacher model. We believe this may be useful to researchers interested in response diversity, token probability distributions, or related topics.
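As a hypothetical example of working with these per-token statistics (the field names `responses` and `entropy` below are assumptions; check the dataset card for the real schema):

```python
# Toy record mimicking the assumed raw-dataset layout: one query with
# multiple teacher responses, each carrying per-token statistics.
record = {
    "query": "Solve x + 3 = 5.",
    "responses": [
        {"text": "x = 2",  "logprob": [-0.1, -0.2, -0.05], "entropy": [0.3, 0.5, 0.1]},
        {"text": "x = 2.", "logprob": [-0.3, -0.4, -0.2],  "entropy": [0.9, 0.7, 0.4]},
    ],
}

def mean_entropy(resp):
    # Average token entropy of one response.
    return sum(resp["entropy"]) / len(resp["entropy"])

# Rank a query's responses by average token entropy, e.g. to study
# response diversity or teacher confidence.
ranked = sorted(record["responses"], key=mean_entropy)
print([round(mean_entropy(r), 3) for r in ranked])
```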

Evaluation

Please refer to the evaluation guide in evaluation/README.md for details.

Citation

If you use our code, model, or dataset in your project, please consider citing us.

@article{ren2026rethinking_sft_generalization,
  title={Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability},
  author={Qihan Ren and Peng Wang and Ruikun Cai and Shuai Shao and Dadi Guo and Yuejin Xie and Yafu Li and Quanshi Zhang and Xia Hu and Jing Shao and Dongrui Liu},
  journal={arXiv preprint arXiv:2604.06628},
  year={2026}
}

Acknowledgement

[WIP]
