
misc: update kl penaty names#979

Closed
richardodliu wants to merge 3 commits into verl-project:main from richardodliu:main

Conversation

@richardodliu

Rename the KL penalty parameter to be consistent with common sense

rename kl penalty
@vermouth1992
Collaborator

Could you point the reference to the common sense?

@richardodliu
Author

> Could you point the reference to the common sense?

http://joschu.net/blog/kl-approx.html

  • When given only the chosen token's probability, all three estimators were proposed in the blog, not only $k_3$. Meanwhile, I can't find a reference that supports the docstring's claim that k3 is low variance.
  • We can find the correct KL penalty in OpenRLHF.
  • We should add a note that points to the true KL estimator function.
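For reference, the three estimators from the blog can be sketched as a small scalar helper. This is an illustrative sketch only (verl's actual code operates on torch tensors); the names k1/k2/k3 follow Schulman's post, and the variance comments deliberately avoid the disputed "k3 is low variance" claim:

```python
import math

def kl_estimators(logprob: float, ref_logprob: float) -> tuple[float, float, float]:
    """Per-token estimators of KL(pi || pi_ref) from Schulman's blog
    (http://joschu.net/blog/kl-approx.html). Scalar sketch for illustration."""
    log_ratio = logprob - ref_logprob          # log(pi(x) / pi_ref(x))
    k1 = log_ratio                             # unbiased estimator
    k2 = 0.5 * log_ratio ** 2                  # biased estimator
    k3 = math.exp(-log_ratio) - 1 + log_ratio  # unbiased and always >= 0
    return k1, k2, k3
```

Note that k3 is nonnegative for every sample (it is of the form e^r - 1 - r with r = -log_ratio), which is one reason it is often preferred as a per-token penalty.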

@vermouth1992
Collaborator

Nice, could you add a reference for the naming in the code? Thanks!

@richardodliu
Author

> Nice, could you add a reference of naming in the code? Thanks!

already done

Collaborator

@eric-haibin-lin eric-haibin-lin left a comment


Thank you for the PR! We'd like to keep backward compatibility. Could you also add checks for the old names? We can update examples and docs with the better names to guide new users.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@richardodliu
Author

> Thank you for the PR! We'd like to keep backward compatibility. Could you also add the checks for the old name as well. We can update example and docs with the better name to guide new users

Is this version ok?

def kl_penalty(logprob: torch.FloatTensor, ref_logprob: torch.FloatTensor, kl_penalty) -> torch.FloatTensor:
"""Compute KL divergence given logprob and ref_logprob.
Copied from https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1104
reference from https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/models/utils.py#L7
Author

Yes, but these three methods are from joschu's blog. TRL named these methods in the wrong way; I am trying to correct them.

@hiyouga hiyouga changed the title Update core_algos.py misc: update kl penaty names Apr 8, 2025
@@ -458,18 +458,18 @@ def kl_penalty(logprob: torch.FloatTensor, ref_logprob: torch.FloatTensor, kl_pe
Returns:
Collaborator


Please update the list of values in docs/examples/config.rst, and remove the old names from the doc since we do not want others to use them.

Also please search the codebase for existing values used in scripts, for example:

  • examples/reinforce_plus_plus_trainer/run_qwen2-7b_math_rf.sh
  • examples/reinforce_plus_plus_trainer/run_qwen2-7b_math_rf_baseline.sh


"""
if kl_penalty == "kl":
if kl_penalty == "k1":
Collaborator


We can simply do `if kl_penalty in ("kl", "k1"):` and then remove the verbose changes in verl/trainer/main_ppo.py.
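The suggested backward-compatible dispatch can be sketched as follows. This is a minimal scalar illustration, not verl's exact code: the old alias names besides "kl" (here "mse" and "low_var_kl") are assumptions for illustration, and real inputs would be torch tensors:

```python
import math

def kl_penalty(logprob: float, ref_logprob: float, kl_penalty: str) -> float:
    """Backward-compatible dispatch sketch: old names ("kl", and assumed
    aliases "mse"/"low_var_kl") map to the same branches as the
    blog-style names ("k1", "k2", "k3")."""
    log_ratio = logprob - ref_logprob
    if kl_penalty in ("kl", "k1"):
        return log_ratio
    if kl_penalty in ("mse", "k2"):
        return 0.5 * log_ratio ** 2
    if kl_penalty in ("low_var_kl", "k3"):
        return math.exp(-log_ratio) - 1 + log_ratio
    raise NotImplementedError(f"unknown kl_penalty: {kl_penalty}")
```

Accepting both names in one membership test keeps old configs working without duplicating branches.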

@eric-haibin-lin
Collaborator

Thanks for the suggestion. Moving the changes to #1781.

eric-haibin-lin added a commit that referenced this pull request May 31, 2025
### Checklist Before Starting

- [x] Search for similar PR(s).

This PR includes contribution and suggestions from
[richardodliu](https://github.com/richardodliu) in
#979

### What does this PR do?

Update documentation page, include key configs for PPO and other
recipes.
Pending docs:
- GRPO
- DrGRPO
- DAPO, etc

TODO: let config.rst directly show the content of ppo_trainer.yaml and
other related yaml files. In the yaml file, colocate the comment and
explanation with the option. This way the yaml is always consistent with
the documentation page.

For critical feature or algorithms, we list the core configs in a
self-contained page like PPO.md

### High-Level Design

None

### Specific Changes

- use k1, k2, k3 for the kl calculation, still backward compatible
- changed ppo.rst to baseline.md 
- added ppo.md to explain core options for ppo 


### Test

> For changes that cannot be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron,
both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang,
both, or none]

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add CI test(s) if necessary.
yzlnew pushed a commit to yzlnew/verl that referenced this pull request Jun 4, 2025
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jun 6, 2025
wwwjn pushed a commit to wwwjn/verl that referenced this pull request Jun 10, 2025
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
paolo328 added a commit to paolo328/Verl that referenced this pull request Nov 27, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
FortPercent pushed a commit to FortPercent/TeleBoost that referenced this pull request May 7, 2026

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants