Skip to content

[sglang] feat: update weights in batch with FSDP#2559

Merged
zhaochenyang20 merged 11 commits intoverl-project:mainfrom
zhaochenyang20:fsdp_buffer
Jul 15, 2025
Merged

[sglang] feat: update weights in batch with FSDP#2559
zhaochenyang20 merged 11 commits intoverl-project:mainfrom
zhaochenyang20:fsdp_buffer

Conversation

@zhaochenyang20
Copy link
Copy Markdown
Collaborator

What does this PR do?

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Thanks so much to @Yangruipis and @zhuzilin, we implemented the group-wise weights update for SGLang in FSDP.

We are still testing the speed up in megtron and FSDP.

For megatron: #2418

At sgl, we're currently exploring two approaches to optimize resharding:

  1. Grouped calls to update weights from tensor: Previously, we called this endpoint for each tensor individually. We're now grouping tensors to reduce the CPU overhead of these calls.
  2. Single large data buffer update: We're investigating whether we can form a single large data buffer to update a group of tensors all at once. This would reduce the number of times the IPC handler is opened and closed.

For the first approach, we're implementing it separately in Megatron and FSDP. I'm starting by merging the FSDP implementation, and then I'll create a common interface for Megatron. We're still evaluating the second approach to see if it's feasible.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an optimization for updating weights in FSDP by batching tensor updates. There are two critical issues that must be addressed: the tensor bucketing logic is not synchronized across ranks when using FSDP v1 with sharded state dicts, and the use of zip(..., strict=False) can hide bugs by silently truncating data if batches from different ranks have inconsistent lengths.

Comment thread verl/workers/sharding_manager/fsdp_sglang.py
Comment thread verl/workers/sharding_manager/fsdp_sglang.py Outdated
@zhaochenyang20 zhaochenyang20 changed the title Fsdp buffer [sglang] feat: update weights in batch with FSDP Jul 15, 2025
@zhaochenyang20
Copy link
Copy Markdown
Collaborator Author

Move #2499 into this

Comment thread verl/workers/sharding_manager/fsdp_sglang.py Outdated
Copy link
Copy Markdown
Collaborator

@hebiao064 hebiao064 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one minor question

Comment thread verl/trainer/config/rollout/rollout.yaml Outdated
@zhaochenyang20
Copy link
Copy Markdown
Collaborator Author

zhaochenyang20 commented Jul 15, 2025

thanks @hebiao064 @Yangruipis

@zhaochenyang20
Copy link
Copy Markdown
Collaborator Author

@zhaochenyang20 zhaochenyang20 merged commit f0d4c76 into verl-project:main Jul 15, 2025
28 of 29 checks passed
HelloWorld686 pushed a commit to HelloWorld686/verl that referenced this pull request Jul 17, 2025
### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or
accomplish. Reference related GitHub issues and PRs that help with the
review.

Thanks so much to @Yangruipis and @zhuzilin, we implemented the
group-wise weights update for SGLang in FSDP.

We are still testing the speed up in megtron and FSDP.

For megatron: verl-project#2418


At sgl, we're currently exploring two approaches to optimize resharding:

1. **Grouped calls to `update weights from tensor`**: Previously, we
called this endpoint for each tensor individually. We're now grouping
tensors to reduce the CPU overhead of these calls.
2. **Single large data buffer update**: We're investigating whether we
can form a single large data buffer to update a group of tensors all at
once. This would reduce the number of times the IPC handler is opened
and closed.

For the first approach, we're implementing it separately in Megatron and
FSDP. I'm starting by merging the FSDP implementation, and then I'll
create a common interface for Megatron. We're still evaluating the
second approach to see if it's feasible.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

---------

Co-authored-by: zhaochenyang <zhaochenyang20@gmail.com>
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jul 25, 2025
### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or
accomplish. Reference related GitHub issues and PRs that help with the
review.

Thanks so much to @Yangruipis and @zhuzilin, we implemented the
group-wise weights update for SGLang in FSDP.

We are still testing the speed up in megtron and FSDP.

For megatron: verl-project#2418


At sgl, we're currently exploring two approaches to optimize resharding:

1. **Grouped calls to `update weights from tensor`**: Previously, we
called this endpoint for each tensor individually. We're now grouping
tensors to reduce the CPU overhead of these calls.
2. **Single large data buffer update**: We're investigating whether we
can form a single large data buffer to update a group of tensors all at
once. This would reduce the number of times the IPC handler is opened
and closed.

For the first approach, we're implementing it separately in Megatron and
FSDP. I'm starting by merging the FSDP implementation, and then I'll
create a common interface for Megatron. We're still evaluating the
second approach to see if it's feasible.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

---------

Co-authored-by: zhaochenyang <zhaochenyang20@gmail.com>
oseyosey pushed a commit to oseyosey/verl that referenced this pull request Jul 28, 2025
### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or
accomplish. Reference related GitHub issues and PRs that help with the
review.

Thanks so much to @Yangruipis and @zhuzilin, we implemented the
group-wise weights update for SGLang in FSDP.

We are still testing the speed up in megtron and FSDP.

For megatron: verl-project#2418


At sgl, we're currently exploring two approaches to optimize resharding:

1. **Grouped calls to `update weights from tensor`**: Previously, we
called this endpoint for each tensor individually. We're now grouping
tensors to reduce the CPU overhead of these calls.
2. **Single large data buffer update**: We're investigating whether we
can form a single large data buffer to update a group of tensors all at
once. This would reduce the number of times the IPC handler is opened
and closed.

For the first approach, we're implementing it separately in Megatron and
FSDP. I'm starting by merging the FSDP implementation, and then I'll
create a common interface for Megatron. We're still evaluating the
second approach to see if it's feasible.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

---------

Co-authored-by: zhaochenyang <zhaochenyang20@gmail.com>
Juniper1021 pushed a commit to Juniper1021/verl that referenced this pull request Aug 7, 2025
### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or
accomplish. Reference related GitHub issues and PRs that help with the
review.

Thanks so much to @Yangruipis and @zhuzilin, we implemented the
group-wise weights update for SGLang in FSDP.

We are still testing the speed up in megtron and FSDP.

For megatron: verl-project#2418


At sgl, we're currently exploring two approaches to optimize resharding:

1. **Grouped calls to `update weights from tensor`**: Previously, we
called this endpoint for each tensor individually. We're now grouping
tensors to reduce the CPU overhead of these calls.
2. **Single large data buffer update**: We're investigating whether we
can form a single large data buffer to update a group of tensors all at
once. This would reduce the number of times the IPC handler is opened
and closed.

For the first approach, we're implementing it separately in Megatron and
FSDP. I'm starting by merging the FSDP implementation, and then I'll
create a common interface for Megatron. We're still evaluating the
second approach to see if it's feasible.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

---------

Co-authored-by: zhaochenyang <zhaochenyang20@gmail.com>
whatadayG pushed a commit to whatadayG/verl that referenced this pull request Sep 5, 2025
### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or
accomplish. Reference related GitHub issues and PRs that help with the
review.

Thanks so much to @Yangruipis and @zhuzilin, we implemented the
group-wise weights update for SGLang in FSDP.

We are still testing the speed up in megtron and FSDP.

For megatron: verl-project#2418


At sgl, we're currently exploring two approaches to optimize resharding:

1. **Grouped calls to `update weights from tensor`**: Previously, we
called this endpoint for each tensor individually. We're now grouping
tensors to reduce the CPU overhead of these calls.
2. **Single large data buffer update**: We're investigating whether we
can form a single large data buffer to update a group of tensors all at
once. This would reduce the number of times the IPC handler is opened
and closed.

For the first approach, we're implementing it separately in Megatron and
FSDP. I'm starting by merging the FSDP implementation, and then I'll
create a common interface for Megatron. We're still evaluating the
second approach to see if it's feasible.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

---------

Co-authored-by: zhaochenyang <zhaochenyang20@gmail.com>
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or
accomplish. Reference related GitHub issues and PRs that help with the
review.

Thanks so much to @Yangruipis and @zhuzilin, we implemented the
group-wise weights update for SGLang in FSDP.

We are still testing the speed up in megtron and FSDP.

For megatron: verl-project#2418


At sgl, we're currently exploring two approaches to optimize resharding:

1. **Grouped calls to `update weights from tensor`**: Previously, we
called this endpoint for each tensor individually. We're now grouping
tensors to reduce the CPU overhead of these calls.
2. **Single large data buffer update**: We're investigating whether we
can form a single large data buffer to update a group of tensors all at
once. This would reduce the number of times the IPC handler is opened
and closed.

For the first approach, we're implementing it separately in Megatron and
FSDP. I'm starting by merging the FSDP implementation, and then I'll
create a common interface for Megatron. We're still evaluating the
second approach to see if it's feasible.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

---------

Co-authored-by: zhaochenyang <zhaochenyang20@gmail.com>
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or
accomplish. Reference related GitHub issues and PRs that help with the
review.

Thanks so much to @Yangruipis and @zhuzilin, we implemented the
group-wise weights update for SGLang in FSDP.

We are still testing the speed up in megtron and FSDP.

For megatron: verl-project#2418


At sgl, we're currently exploring two approaches to optimize resharding:

1. **Grouped calls to `update weights from tensor`**: Previously, we
called this endpoint for each tensor individually. We're now grouping
tensors to reduce the CPU overhead of these calls.
2. **Single large data buffer update**: We're investigating whether we
can form a single large data buffer to update a group of tensors all at
once. This would reduce the number of times the IPC handler is opened
and closed.

For the first approach, we're implementing it separately in Megatron and
FSDP. I'm starting by merging the FSDP implementation, and then I'll
create a common interface for Megatron. We're still evaluating the
second approach to see if it's feasible.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

---------

Co-authored-by: zhaochenyang <zhaochenyang20@gmail.com>
oseyosey pushed a commit to oseyosey/verl that referenced this pull request Jan 20, 2026
### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or
accomplish. Reference related GitHub issues and PRs that help with the
review.

Thanks so much to @Yangruipis and @zhuzilin, we implemented the
group-wise weights update for SGLang in FSDP.

We are still testing the speed up in megtron and FSDP.

For megatron: verl-project#2418


At sgl, we're currently exploring two approaches to optimize resharding:

1. **Grouped calls to `update weights from tensor`**: Previously, we
called this endpoint for each tensor individually. We're now grouping
tensors to reduce the CPU overhead of these calls.
2. **Single large data buffer update**: We're investigating whether we
can form a single large data buffer to update a group of tensors all at
once. This would reduce the number of times the IPC handler is opened
and closed.

For the first approach, we're implementing it separately in Megatron and
FSDP. I'm starting by merging the FSDP implementation, and then I'll
create a common interface for Megatron. We're still evaluating the
second approach to see if it's feasible.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

---------

Co-authored-by: zhaochenyang <zhaochenyang20@gmail.com>
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or
accomplish. Reference related GitHub issues and PRs that help with the
review.

Thanks so much to @Yangruipis and @zhuzilin, we implemented the
group-wise weights update for SGLang in FSDP.

We are still testing the speed up in megtron and FSDP.

For megatron: verl-project#2418


At sgl, we're currently exploring two approaches to optimize resharding:

1. **Grouped calls to `update weights from tensor`**: Previously, we
called this endpoint for each tensor individually. We're now grouping
tensors to reduce the CPU overhead of these calls.
2. **Single large data buffer update**: We're investigating whether we
can form a single large data buffer to update a group of tensors all at
once. This would reduce the number of times the IPC handler is opened
and closed.

For the first approach, we're implementing it separately in Megatron and
FSDP. I'm starting by merging the FSDP implementation, and then I'll
create a common interface for Megatron. We're still evaluating the
second approach to see if it's feasible.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

---------

Co-authored-by: zhaochenyang <zhaochenyang20@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants