[trainer, fsdp, vllm, recipe] feat: one step off async training recipe #2231
Conversation
PeterSH6
left a comment
Nice work
recipe/async/async_ray_trainer.py
Outdated
Can you simplify the code in this file? There's too much redundancy
OK, I will try.
@PeterSH6 I've removed some redundant code, but I'm not sure whether it's enough.
force-pushed from 7a6c847 to 071ddc2
ccclyu
left a comment
Great work! Have you tried testing on multiple nodes and observed any throughput delta?
eric-haibin-lin
left a comment
Thanks for the contribution! Please add a README.md describing the scope of this recipe, for instance indicating the support status of features available in the original ray trainer such as vlm/multi-turn.
Please also make a copy of the doc section in docs/advance/ for documentation, and include the convergence curve in these docs.
```python
# Define worker classes based on the actor strategy.
if config.actor_rollout_ref.actor.strategy in ["fsdp", "fsdp2"]:
    assert config.critic.strategy in ["fsdp", "fsdp2"]
```
since we're deprecating fsdp, we can limit this recipe to fsdp2 only, and make sure it is tested with fsdp2
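A minimal sketch of that suggested tightening, assuming the same config layout as the diff above (hypothetical guard, not the recipe's actual code):

```python
# Hypothetical fsdp2-only guard, per the review suggestion above.
if config.actor_rollout_ref.actor.strategy != "fsdp2":
    raise NotImplementedError(
        "This recipe is only tested with fsdp2; "
        f"got strategy={config.actor_rollout_ref.actor.strategy!r}"
    )
assert config.critic.strategy == "fsdp2"
```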
```python
if not self.hybrid_engine:
    self.actor_wg.sync_rollout_weights()
    ray.get(self.rollout_wg.sync_rollout_weights())
    # param_ref = self.actor_wg.sync_rollout_weights_v2(None)
```
Please remove the unused code.
Hello! Thank you so much for implementing the asynchronous RLHF framework. I have a few questions:
@lzxdjb I agree. Mistral AI does the same without even recomputing the kv cache (see the Magistral paper, https://arxiv.org/pdf/2506.10910).
```python
do_profile = (
    self.global_steps in self.config.trainer.profile_steps
    if self.config.trainer.profile_steps is not None
    else False
)
if do_profile:
```
In the next PR, could you reuse the functions _start_profiling and _stop_profiling from the parent class? Thanks.
https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/ray_trainer.py#L1042-L1063
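For illustration, a sketch of what that reuse might look like; the helper signatures, the trainer class name, and the loop body are assumptions based on the linked parent-class code, not a definitive implementation:

```python
# Illustrative only: delegate profiling to the parent-class helpers instead
# of recomputing do_profile inside the recipe.
from verl.trainer.ppo.ray_trainer import RayPPOTrainer

class OneStepOffRayTrainer(RayPPOTrainer):  # hypothetical recipe trainer name
    def fit(self):
        for step in range(self.total_training_steps):
            self._start_profiling(step)  # parent-class helper
            ...                          # generation / training for this step
            self._stop_profiling(step)   # parent-class helper
```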
The snapshot of this recipe's development branch has been pushed to https://github.com/volcengine/verl/tree/recipe/one_step_off_async. Thanks to the team for the great work!
What does this PR do?
This PR provides a simple implementation of one-step-off async training with the fsdp and vllm backends.
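For readers new to the scheme, here is a rough pseudocode sketch of a one-step-off loop (the worker handles and method names are illustrative, not this recipe's exact API): the trainer updates on the batch generated in the previous step while the rollout workers already generate the next one.

```python
import ray

def one_step_off_fit(actor_wg, rollout_wg, dataloader, total_steps):
    # Prime the pipeline: generate the first batch before training starts.
    batch_ref = rollout_wg.generate_sequences.remote(next(dataloader))
    for step in range(total_steps):
        batch = ray.get(batch_ref)  # rollout batch, one step stale
        if step + 1 < total_steps:
            # Launch generation for the next step with the current rollout
            # weights while we train on the stale batch below.
            batch_ref = rollout_wg.generate_sequences.remote(next(dataloader))
        actor_wg.update_actor(batch)     # train on the one-step-off batch
        actor_wg.sync_rollout_weights()  # push fresh weights to rollout
```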
We conducted three different experiments with the qwen2.5_3b model on 8 A100 GPUs. The pictures below demonstrate the results of these experiments:



In these experiments, the baseline has the highest throughput, but we think this is only because we have not yet found the best configuration for one-step-off async training.
The exciting point is that our NCCL-based weight update for the rollout model performs very well. The latency is shown below:

Most of the time, the latency is under 300 ms, which is negligible for RLHF. Although it is only implemented with fsdp and vllm for now, we think it would not be complex to extend it to other backends.
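For illustration, a minimal sketch of the kind of NCCL broadcast such a weight update builds on, using plain `torch.distributed`; the function name, the process-group handling, and the assumption that full (unsharded) parameters are available are all simplifications, not the recipe's actual code:

```python
import torch.distributed as dist

def sync_rollout_weights(model, group, src_rank=0):
    # Broadcast every parameter tensor from the actor's source rank to the
    # rollout workers over an NCCL process group. With FSDP, parameters
    # would first have to be summoned/gathered to their full shape.
    for _, param in model.named_parameters():
        dist.broadcast(param.data, src=src_rank, group=group)
```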
Checklist Before Starting
- Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI).
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, like `[megatron, fsdp, doc]`.
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`.
- If the PR is a breaking change, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`.

Test
API and Usage Example
To use this feature, the `hybrid_engine` option must be disabled to separate the actor model and the rollout model onto different GPU clusters. A `rollout.n_gpus` option has been added to the configuration file to indicate how many GPUs the rollout model will occupy. The script below is an example of training `qwen2.5_3b` with 8 GPUs.

```bash
python3 -m recipe.async.async_main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.train_batch_size=1024 \
    data.max_prompt_length=512 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    data.shuffle=False \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-3B-Instruct \
    actor_rollout_ref.actor.optim.lr=3e-6 \
    actor_rollout_ref.hybrid_engine=False \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=40 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=40 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.n=5 \
    actor_rollout_ref.rollout.n_gpus=4 \
    actor_rollout_ref.rollout.load_format=safetensors \
    actor_rollout_ref.rollout.layered_summon=True \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=40 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.val_before_train=True \
    trainer.logger=['console','wandb'] \
    trainer.project_name='verl_grpo_example_gsm8k' \
    trainer.experiment_name='qwen2.5_3b_grpo_async_one_step_off' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=-1 \
    trainer.test_freq=-1 \
    trainer.total_epochs=15 $@
```

High-Level Design
Specific Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`.
- Request CI approval via the `ci-request` channel in the `verl` Slack workspace.