[RL] Support Multi-Stage Awake by hebiao064 · Pull Request #6962 · sgl-project/sglang

hebiao064 · 2025-06-08T01:09:24Z

To reviewers: this PR depends on fzyzcjy/torch_memory_saver#17, which needs to be merged first.

Many unit tests currently fail because, without that change, tms supports only a singleton.

Related PR:

Thanks a lot to @fzyzcjy for the guidance and help.

Motivation

vLLM already supports Multi-Stage Awake for the vLLM engine: vllm-project/vllm#15254

SGLang, however, uses torch_memory_saver to hold the virtual addresses of the model weights and the KV cache (so that CUDA Graph keeps working across rollouts).

torch_memory_saver is a singleton, which makes it hard for SGLang to support Multi-Stage Awake, a feature that is critical for RL use cases.

Without this feature, the KV cache memory fraction can only be set to 0.7 or lower (e.g., 0.3).
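To make the memory argument concrete, here is a minimal, self-contained sketch of why resuming in stages can lower peak GPU usage. It does not use the real torch_memory_saver API: `TaggedMemoryPool`, the tag names, the step ordering, and the sizes in GB are all hypothetical and chosen only for illustration. It assumes the usual RL pattern in which the trainer's copy of the weights is still on the GPU while the rollout engine wakes up to receive updated weights.

```python
from dataclasses import dataclass, field

# Illustrative sizes in GB (assumptions, not measurements).
SIZES_GB = {"trainer_weights": 16, "weights": 16, "kv_cache": 40}


@dataclass
class TaggedMemoryPool:
    """Toy stand-in for a tag-aware memory saver (hypothetical, not the real API)."""

    sizes_gb: dict
    resident: set = field(default_factory=set)

    def resume(self, tag: str) -> None:
        # Bring the region with this tag back onto the (simulated) GPU.
        if tag not in self.sizes_gb:
            raise KeyError(f"unknown tag: {tag}")
        self.resident.add(tag)

    def pause(self, tag: str) -> None:
        # Release the physical memory behind this tag.
        self.resident.discard(tag)

    def used_gb(self) -> int:
        return sum(self.sizes_gb[tag] for tag in self.resident)


def peak_usage(steps) -> int:
    """Replay (action, tag) steps and return the highest usage observed."""
    # Start right after a training step: only the trainer's weights are resident.
    pool = TaggedMemoryPool(dict(SIZES_GB), {"trainer_weights"})
    peak = pool.used_gb()
    for action, tag in steps:
        if action == "resume":
            pool.resume(tag)
        else:
            pool.pause(tag)
        peak = max(peak, pool.used_gb())
    return peak


# Single-stage awake: weights and KV cache come back together, before the
# trainer's copy of the weights has been released.
single_stage_peak = peak_usage(
    [("resume", "weights"), ("resume", "kv_cache"), ("pause", "trainer_weights")]
)

# Multi-stage awake: resume weights, sync them, release the trainer's copy,
# and only then resume the KV cache.
multi_stage_peak = peak_usage(
    [("resume", "weights"), ("pause", "trainer_weights"), ("resume", "kv_cache")]
)

print(single_stage_peak, multi_stage_peak)  # prints: 72 56
```

Under these assumed numbers the staged order never holds the trainer weights and the KV cache at the same time, which is what would let the KV cache take a larger share of the device.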

This PR needs fzyzcjy/torch_memory_saver#17 to be merged first.

Modifications

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

fzyzcjy

Not yet fully reviewed; I just spent a few minutes glancing through it.

fzyzcjy

some nits

fzyzcjy · 2025-06-09T13:21:59Z

+                self.tp_worker.worker.model_runner.model
+            )
+
+            self.weights_memory_saver_adapter.check_validity(


nit: to be honest, I feel the PR is a little overcomplicated; if time permits, could you please simplify it a bit?

Simplified, let me know if it's better now.

zhaochenyang20

LGTM

hebiao064 · 2025-06-17T04:06:48Z

We will use this PR instead: #7099

The tracking issue is #7009.

Support Multiple Torch Memory Saver for Multi-Stage-Awake

22f87aa

hebiao064 requested review from ByronHsu, Ying1123, hnyls2002, ispobock, merrymercy, xiezhq-hermann, zhaochenyang20 and zhyncs as code owners June 8, 2025 01:09

hebiao064 had a problem deploying to prod June 8, 2025 01:09 — with GitHub Actions Failure

rm unnecessary code

eead001

hebiao064 had a problem deploying to prod June 8, 2025 01:12 — with GitHub Actions Failure

add test

b834ea6

hebiao064 temporarily deployed to prod June 8, 2025 01:57 — with GitHub Actions Inactive

hebiao064 requested a review from fzyzcjy June 8, 2025 02:00

hebiao064 assigned hebiao064 and fzyzcjy Jun 8, 2025

hebiao064 changed the title ~~Support Multiple Torch Memory Saver for Multi-Stage-Awake~~ [RL] Support Multi-Stage-Awake Jun 8, 2025

zhaochenyang20 requested changes Jun 8, 2025

Comment thread python/sglang/srt/managers/scheduler.py Outdated

Comment thread test/srt/test_release_memory_occupation.py Outdated

hebiao064 changed the title ~~[RL] Support Multi-Stage-Awake~~ [RL] Support Multi-Stage Awake Jun 8, 2025

uncomment disable_cuda_graph

11e0461

hebiao064 temporarily deployed to prod June 8, 2025 04:55 — with GitHub Actions Inactive

fix server up issue

a336515

hebiao064 had a problem deploying to prod June 8, 2025 06:12 — with GitHub Actions Error

remove debugging code

ceb314f

hebiao064 temporarily deployed to prod June 8, 2025 06:22 — with GitHub Actions Inactive

fzyzcjy reviewed Jun 8, 2025

Comment thread python/sglang/srt/torch_memory_saver_adapter.py Outdated

Comment thread python/sglang/srt/torch_memory_saver_adapter.py Outdated

polish the test

7095f02

hebiao064 temporarily deployed to prod June 8, 2025 22:31 — with GitHub Actions Inactive

Address reviewers feedback

d4eae34

hebiao064 temporarily deployed to prod June 8, 2025 22:57 — with GitHub Actions Inactive

modify test

062b349

hebiao064 temporarily deployed to prod June 9, 2025 06:28 — with GitHub Actions Inactive

fzyzcjy reviewed Jun 9, 2025

hebiao064 mentioned this pull request Jun 9, 2025

[RFC] Support Multi-Stage Awake for RL #7009

Closed

10 tasks

hebiao064 added 2 commits June 9, 2025 21:06

fix test del model issue

10d9e26

simplify code

0ea29ca

zhaochenyang20 reviewed Jun 9, 2025

Comment thread python/sglang/srt/managers/scheduler.py Outdated

zhaochenyang20 and others added 4 commits June 9, 2025 16:29

Merge branch 'main' into bhe/support_multiple_tms

bf5e4b3

removing unnecesary code comment

effced3

upd

723609f

Merge branch 'bhe/support_multiple_tms' of https://github.com/sgl-project/sglang into bhe/support_multiple_tms

c9d7be8

hebiao064 mentioned this pull request Jun 11, 2025

[rollout] feat: Support Multi-stage Awake for SGLang verl-project/verl#1911

Merged

10 tasks

hebiao064 closed this Jun 17, 2025

zhyncs deleted the bhe/support_multiple_tms branch June 20, 2025 05:03

hebiao064 wants to merge 15 commits into main from bhe/support_multiple_tms


3 participants
