[HiSparse] Optimize the scheduling of decode backup. #21932
Merged
xiezhq-hermann merged 12 commits into sgl-project:main on Apr 7, 2026
Code Review
This pull request implements an asynchronous backup mechanism for decode tokens within the HiSparseCoordinator, aiming to overlap host memory transfers with model execution using a dedicated stream and CUDA events. The review feedback identifies a critical issue regarding Tensor Parallelism where backups might be skipped on non-scheduler ranks, and suggests several performance optimizations, such as using collections.deque for the pending backup queue and removing redundant tensor operations like .clone(), .contiguous(), and inefficient list comprehensions.
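The `collections.deque` suggestion can be illustrated with a minimal sketch. The `PendingBackup` record and its fields are hypothetical names chosen for illustration, not the PR's actual data structure; the point is only that `deque.popleft()` drains a FIFO queue in O(1) per item, whereas `list.pop(0)` shifts every remaining element.

```python
from collections import deque


class PendingBackup:
    """Hypothetical record for one queued decode backup (illustrative fields)."""

    def __init__(self, req_id, device_loc):
        self.req_id = req_id
        self.device_loc = device_loc


# Queue backups in arrival order.
pending = deque()
for i in range(4):
    pending.append(PendingBackup(req_id=i, device_loc=100 + i))

# Drain oldest-first; popleft() is O(1), unlike list.pop(0), which is O(n).
drained = []
while pending:
    item = pending.popleft()
    drained.append((item.req_id, item.device_loc))
```

For a queue that only ever holds a handful of entries the difference is small, but the deque also states the FIFO intent directly.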
Co-authored-by: hzh0425 <hzh0425@apache.org>
xiezhq-hermann approved these changes on Apr 7, 2026
Motivation
sequenceDiagram
    participant S as Scheduler
    participant C as HiSparseCoordinator
    participant P as Decode Producer Stream
    participant B as Decode Backup Stream
    S->>C: map_last_loc_to_buffer(...)
    C->>C: prepare previous-token backup metadata
    C->>B: enqueue backup work
    Note over B: Backup is queued immediately<br/>but waits on forward_done_event
    C->>C: grow device buffer / remap reserved slot
    P->>C: wait_for_pending_backup()
    Note over P,C: Before each decode pass,<br/>wait for the previous backup to finish
    C-->>P: clear pending_backup_done_event
    P->>P: run decode forward
    P->>C: note_decode_forward_done()
    C->>C: record decode_forward_done_event
    B->>B: wait(decode_forward_done_event)
    B->>B: host_locs = alloc(...)
    B->>B: req_to_host_pool[...] = host_locs
    B->>B: backup_from_device_all_layer(...)
    B->>C: record backup_done_event
    C->>C: publish pending_backup_done_event
    Note over C: The next decode pass<br/>waits on this event

With this optimization, end-to-end TPOT performance improves by 5%.
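The ordering constraints in the sequence diagram can be sketched with a CPU-only analogy. This is not the PR's implementation: `threading.Event` stands in for a CUDA event, a worker thread stands in for the decode backup stream, and `run_decode_steps` is a hypothetical helper. It shows only the two dependencies (each backup waits for its decode forward; each decode pass waits for the previous backup), not the GPU-side overlap that produces the speedup.

```python
import threading


def run_decode_steps(num_steps):
    """Simulate the event ordering; returns the order in which work ran."""
    log = []
    lock = threading.Lock()

    def record(msg):
        with lock:
            log.append(msg)

    pending_backup_done = None  # event published by the previous step
    workers = []
    for step in range(num_steps):
        forward_done = threading.Event()  # stand-in for decode_forward_done_event
        backup_done = threading.Event()   # stand-in for backup_done_event

        def backup(step=step, forward_done=forward_done, backup_done=backup_done):
            # Queued immediately, but blocked until this step's forward is done.
            forward_done.wait()
            record(f"backup {step}")
            backup_done.set()

        worker = threading.Thread(target=backup)
        worker.start()
        workers.append(worker)

        # Analogue of wait_for_pending_backup(): wait for the previous backup.
        if pending_backup_done is not None:
            pending_backup_done.wait()

        record(f"forward {step}")
        forward_done.set()                 # analogue of note_decode_forward_done()
        pending_backup_done = backup_done  # publish for the next decode pass

    for worker in workers:
        worker.join()
    return log
```

In this analogy the backup for step `k` is enqueued before the forward for step `k` runs, yet it never executes until that forward has finished, and forward `k + 1` never starts until backup `k` has completed.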
Benchmark
h20-96g
bench serving
before:
after
Accuracy Tests
gsm8k
before
after
Speed Tests and Profiling