- News
- Introduction
- Repository Structure
- Getting Started
- Usage
- Open-source Models
- Open-source Datasets
- Evaluation
- Citation
- [2025/04/11] We release the raw dataset before filtering and random selection (44k queries, each with 32 responses): Huggingface, Modelscope. We also release the token-level log probability and entropy from the teacher model. We did not make full use of this dataset, and believe that it might be useful for future research.
- [2025/04/10] Our paper is available on Huggingface daily paper. If you enjoy our work, we warmly invite you to upvote it on Huggingface!
- [2026/04/09] Our paper is available on arXiv. We open-sourced all models and datasets in this Huggingface collection.
The paper revisits the prevailing narrative that "SFT memorizes, RL generalizes", mainly focusing on reasoning SFT (with long-CoT supervision). Our core conclusion is that generalization in reasoning SFT is highly conditional, jointly shaped by:
- Optimization dynamics
- Training data quality and structure
- Base model capability
Main findings:
- Cross-domain performance often follows a dip-and-recovery trajectory, so short training can underestimate final generalization.
- Verified long-CoT supervision yields stronger transfer, while low-quality data can produce misleading non-generalization signals.
- Stronger base models are more likely to internalize transferable procedural reasoning patterns and generalize better.
- Generalization is asymmetric: reasoning improves, but safety can degrade.
|-- training_scripts/ # SFT scripts used for the paper experiments
|-- evaluation/ # Unified evaluation toolkit (lm-eval-harness/evalchemy/math/alpaca/safety)
|-- verl/ # Core training framework (based on verl)
| |-- trainer/fsdp_sft_trainer_ours.py
| `-- utils/dataset/sft_dataset.py
We will provide the requirements and a docker image in one week.
Comming soon.Update placeholders in training_scripts/*.sh:
ROOT_DIR=/path/to/this/repoTRAIN_DATA=/path/to/{dataset_name}WANDB_API_KEY=your_wandb_key
MODEL_PATH is already set in the provided scripts. Change it only if you want to use a different base model.
Distributed variables are expected from the runtime environment:
NODE_COUNTPROC_PER_NODENODE_RANKMASTER_ADDR
We trained all models on 8 H200 GPUs. If you encounter OOM errors, consider using a smaller micro_batch_size.
Current scripts call verl.trainer.fsdp_sft_trainer_ours with:
data.prompt_key=messagedata.response_key=responsedata.advantage_key=advantage
Each parquet row should include at least:
message(chat-message list)response(string)advantage(scalar, commonly just1.0)
bash training_scripts/Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256.sh- Default Math-CoT setting
training_scripts/Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256.sh - Short-training replication
training_scripts/Qwen3-14B_Math-CoT-20k_lr5e-5_ep1_bs256.sh - Smaller LR setting
training_scripts/Qwen3-14B_Math-CoT-20k_lr1e-5_ep1_bs256.sh - Overfitting stress-test schedules
training_scripts/Qwen3-14B_Math-CoT-20k_lr5e-5_ep16_bs256_ConstLR.sh
training_scripts/Qwen3-14B_Math-CoT-20k_lr1e-4_ep16_bs256_ConstLR.sh - Data variants
training_scripts/Qwen3-14B_Math-NoCoT-20k_lr5e-5_ep8_bs256.sh
training_scripts/Qwen3-14B_Countdown-CoT-20k_lr5e-5_ep8_bs256.sh
training_scripts/Qwen3-14B_DeepSeek-R1-20k_lr5e-5_ep8_bs256.shtraining_scripts/Qwen3-14B_NuminaMath-20k_lr5e-5_ep8_bs256.sh - Capability scaling (Qwen3 1.7B/4B/8B/14B)
training_scripts/Qwen3-1.7B_Math-CoT-20k_lr5e-5_ep8_bs256.sh
training_scripts/Qwen3-4B_Math-CoT-20k_lr5e-5_ep8_bs256.sh
training_scripts/Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256.sh
training_scripts/Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256.sh
python -m verl.model_merger merge \
--backend fsdp \
--local_dir /path/to/ckpt/global_step_640 \
--target_dir /path/to/ckpt/merged_step640 \
--trust_remote_codeBatch merge helper:
bash training_scripts/model_merger.shBefore using model_merger.sh, update ckpt_path_list in the script (for example /path/to/model/global_step_640).
We have open-sourced ALL models trained in our experiments, including the intermediate checkpoints (you can find them in the stepxxx folder in the repo).
Note that the following model list may include repeated entries, as it is organized by the experiments and conclusions presented in the paper.
| Model Name | Huggingface | ModelScope |
|---|---|---|
| Weak cross-domain generalization is more pronounced under short training and smaller learning rates (refer to Sec. 3.1; App. C.1, Table 4) | ||
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep1_bs256 | Huggingface | ModelScope |
| Qwen3-14B_Math-CoT-20k_lr1e-5_ep1_bs256 | Huggingface | ModelScope |
| Qwen3-14B_Math-CoT-20k_lr1e-5_ep2_bs256 | Huggingface | ModelScope |
| Apparent non-generalization can be an under-optimization artifact, with a dip-and-recovery pattern under extended training (refer to Sec. 3.1-3.2, Fig. 3) | ||
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| InternLM2.5-20B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| The above optimization dynamics remain robust under a different teacher model (refer to App. C.2, Fig. 7) | ||
| Qwen3-14B_DeepSeek-R1-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-8B_DeepSeek-R1-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| InternLM2.5-20B_DeepSeek-R1-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Under a fixed 640-step budget, repeated exposure is more effective than one-pass coverage (refer to Sec. 3.3, Table 1) | ||
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-14B_Math-CoT-2.5k_lr5e-5_ep8_bs32 | Huggingface | ModelScope |
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep1_bs32 | Huggingface | ModelScope |
| Overfitting symptoms emerge mainly under combined aggressive schedules (refer to Sec. 3.4, Fig. 4; App. C.4) | ||
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep16_bs256 | Huggingface | ModelScope |
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep16_bs256_ConstLR | Huggingface | ModelScope |
| Qwen3-14B_Math-CoT-20k_lr1e-4_ep16_bs256_ConstLR | Huggingface | ModelScope |
| Training data quality and structure jointly shape generalization (refer to Sec. 4, Table 2) | ||
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-14B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-14B_Numina-Math-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-14B_Countdown-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-8B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-8B_Numina-Math-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-8B_Countdown-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| InternLM2.5-20B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| InternLM2.5-20B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| InternLM2.5-20B_Numina-Math-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| InternLM2.5-20B_Countdown-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Higher-capability models internalize transferable reasoning patterns more effectively and generalize better (refer to Sec. 5, Fig. 5) | ||
| Qwen3-1.7B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-4B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| The capability-dependent trend extends to another model family (refer to App. C.2/C.5, Fig. 8/14/15) | ||
| Qwen2.5-1.5B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen2.5-3B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen2.5-7B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen2.5-14B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Asymmetric generalization: reasoning improves while safety degrades under long-CoT SFT (refer to Sec. 6, Fig. 6) | ||
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-14B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-8B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| InternLM2.5-20B_Math-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| InternLM2.5-20B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Appendix: smaller and mid-scale models across data configurations (refer to App. D) | ||
| Qwen3-1.7B_Countdown-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-1.7B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-4B_Countdown-CoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
| Qwen3-4B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Huggingface | ModelScope |
We provide the main datasets used in our experiments.
| Dataset Name | Description | Size | Huggingface | ModelScope |
|---|---|---|---|---|
| Math-CoT-20k | Verified long-CoT math reasoning data (default setting in the paper) | 20,480 | Huggingface | ModelScope |
| Math-NoCoT-20k | Math-CoT-20k with CoT traces removed (final summary/answer retained) | 20,480 | Huggingface | ModelScope |
| Countdown-CoT-20k | Countdown arithmetic-game long-CoT data for procedural transfer analysis | 20,480 | Huggingface | ModelScope |
| NuminaMath-20k | No-CoT math data with the matched queries, sourced from NuminaMath-1.5 | 20,480 | Huggingface | ModelScope |
| DeepSeek-R1-20k | Verified long-CoT responses from DeepSeek-R1 on the same queries, sourced from the LUFFY dataset | 20,480 | Huggingface | ModelScope |
✨Note: We also release the raw dataset before filtering and random selection: Huggingface, Modelscope. This dataset contains around 44k queries, each with 32 responses generated by Qwen3-32B. For each textual response, we also release the token-level log probability and entropy from the teacher model. We believe this might be useful for someone who is interested in response diversity, token probability distribution, or other related topics.
Please refer to the evaluation guide in evaluation/README.md for details.
If you use our code, model, or dataset in your project, please consider citing us.
@article{ren2026rethinking_sft_generalization,
title={Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability},
author={Qihan Ren and Peng Wang and Ruikun Cai and Shuai Shao and Dadi Guo and Yuejin Xie and Yafu Li and Quanshi Zhang and Xia Hu and Jing Shao and Dongrui Liu},
journal={arXiv preprint arXiv:2604.06628},
year={2026}
}[WIP]
