
misc: update kl penaty names#979

Closed
richardodliu wants to merge 3 commits into verl-project:main from richardodliu:main

Conversation

@richardodliu

Rename the KL penalty parameter to be consistent with common sense

rename kl penalty
@vermouth1992
Collaborator

Could you point the reference to the common sense?

@richardodliu
Author

> Could you point the reference to the common sense?

http://joschu.net/blog/kl-approx.html

  • When given only the chosen token's probability, all three estimators were proposed in the blog, not only $k_3$. Meanwhile, I can't find a reference that supports the docstring's claim that k3 is low variance.
  • We can find the correct KL penalty in OpenRLHF.
  • We should add a note that points to the true KL estimator function.
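For reference, the three estimators from the blog can be sketched as a small scalar helper. This is an illustrative sketch only (verl's actual code operates on torch tensors); the names k1/k2/k3 follow Schulman's post, and the variance comments deliberately avoid the disputed "k3 is low variance" claim:

```python
import math

def kl_estimators(logprob: float, ref_logprob: float) -> tuple[float, float, float]:
    """Per-token estimators of KL(pi || pi_ref) from Schulman's blog
    (http://joschu.net/blog/kl-approx.html). Scalar sketch for illustration."""
    log_ratio = logprob - ref_logprob          # log(pi(x) / pi_ref(x))
    k1 = log_ratio                             # unbiased estimator
    k2 = 0.5 * log_ratio ** 2                  # biased estimator
    k3 = math.exp(-log_ratio) - 1 + log_ratio  # unbiased and always >= 0
    return k1, k2, k3
```

Note that k3 is nonnegative for every sample (it is of the form e^r - 1 - r with r = -log_ratio), which is one reason it is often preferred as a per-token penalty.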

@vermouth1992
Collaborator

Nice, could you add a reference for the naming in the code? Thanks!

@richardodliu
Author

> Nice, could you add a reference of naming in the code? Thanks!

already done

Collaborator

@eric-haibin-lin eric-haibin-lin left a comment


Thank you for the PR! We'd like to keep backward compatibility. Could you also add checks for the old names? We can update examples and docs with the better names to guide new users.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@richardodliu
Author

> Thank you for the PR! We'd like to keep backward compatibility. Could you also add the checks for the old name as well. We can update example and docs with the better name to guide new users

Is this version ok?

def kl_penalty(logprob: torch.FloatTensor, ref_logprob: torch.FloatTensor, kl_penalty) -> torch.FloatTensor:
"""Compute KL divergence given logprob and ref_logprob.
Copied from https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1104
reference from https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/models/utils.py#L7
Author

Yes, but these three methods are from joschu's blog. TRL named these methods in the wrong way; I am trying to correct them.

@hiyouga hiyouga changed the title Update core_algos.py misc: update kl penaty names Apr 8, 2025
@@ -458,18 +458,18 @@ def kl_penalty(logprob: torch.FloatTensor, ref_logprob: torch.FloatTensor, kl_pe
Returns:
Collaborator


Please update the list of values in docs/examples/config.rst, and remove the old names from the doc since we do not want others to use them.

Also please search the codebase for existing values used in scripts, for example:

  • examples/reinforce_plus_plus_trainer/run_qwen2-7b_math_rf.sh
  • examples/reinforce_plus_plus_trainer/run_qwen2-7b_math_rf_baseline.sh


"""
if kl_penalty == "kl":
if kl_penalty == "k1":
Collaborator


We can simply do `if kl_penalty in ("kl", "k1"):` and then remove the verbose changes in verl/trainer/main_ppo.py.
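The suggested backward-compatible dispatch can be sketched as follows. This is a minimal scalar illustration, not verl's exact code: the old alias names besides "kl" (here "mse" and "low_var_kl") are assumptions for illustration, and real inputs would be torch tensors:

```python
import math

def kl_penalty(logprob: float, ref_logprob: float, kl_penalty: str) -> float:
    """Backward-compatible dispatch sketch: old names ("kl", and assumed
    aliases "mse"/"low_var_kl") map to the same branches as the
    blog-style names ("k1", "k2", "k3")."""
    log_ratio = logprob - ref_logprob
    if kl_penalty in ("kl", "k1"):
        return log_ratio
    if kl_penalty in ("mse", "k2"):
        return 0.5 * log_ratio ** 2
    if kl_penalty in ("low_var_kl", "k3"):
        return math.exp(-log_ratio) - 1 + log_ratio
    raise NotImplementedError(f"unknown kl_penalty: {kl_penalty}")
```

Accepting both names in one membership test keeps old configs working without duplicating branches.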

@eric-haibin-lin
Collaborator

Thanks for the suggestion. Moving the changes to #1781.

eric-haibin-lin added a commit that referenced this pull request May 31, 2025
### Checklist Before Starting

- [x] Search for similar PR(s).

This PR includes contribution and suggestions from
[richardodliu](https://github.com/richardodliu) in
#979

### What does this PR do?

Update documentation page, include key configs for PPO and other
recipes.
Pending docs:
- GRPO
- DrGRPO
- DAPO, etc

TODO: let config.rst directly show the content of ppo_trainer.yaml and
other related yaml files. In the yaml file, colocate the comment and
explanation with the option. This way the yaml is always consistent with
the documentation page.

For critical feature or algorithms, we list the core configs in a
self-contained page like PPO.md

### High-Level Design

None

### Specific Changes

- use k1, k2, k3 for the kl calculation, still backward compatible
- changed ppo.rst to baseline.md 
- added ppo.md to explain core options for ppo 


### Test

> For changes that cannot be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron,
both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang,
both, or none]

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add CI test(s) if necessary.
yzlnew pushed a commit to yzlnew/verl that referenced this pull request Jun 4, 2025
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jun 6, 2025
wwwjn pushed a commit to wwwjn/verl that referenced this pull request Jun 10, 2025
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
paolo328 added a commit to paolo328/Verl that referenced this pull request Nov 27, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
FortPercent pushed a commit to FortPercent/TeleBoost that referenced this pull request May 7, 2026

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants