
Fix CUDA stream synchronization in sampler logprobs extraction #20064

Draft
chanh wants to merge 1 commit into sgl-project:main from chanh:cnguyen/fix-sampler-memsync

Conversation

@chanh (Contributor) commented Mar 6, 2026

Motivation

`get_token_ids_logprobs_batch_optimized` was creating tensors directly on GPU and calling `torch.repeat_interleave` with a GPU tensor, both of which force a `cudaStreamSynchronize`, a known PyTorch issue (pytorch/pytorch#108968). This stalls the GPU and prevents it from executing kernels concurrently during the sampler phase.
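
For reference, a minimal sketch of the pattern that triggers the sync, with simplified names and made-up data rather than the actual sampler code:

```python
import torch

# Hypothetical per-request token-id lists whose logprobs need to be gathered.
token_ids_per_req = [[5, 17, 42], [7], [3, 9]]

# Both statements below force an implicit cudaStreamSynchronize:

# 1) Building a tensor from Python data directly on the GPU copies through a
#    pageable host buffer that PyTorch waits on (pytorch/pytorch#108968).
lengths_gpu = torch.tensor([len(ids) for ids in token_ids_per_req], device="cuda")

# 2) repeat_interleave with a GPU `repeats` tensor has to read the repeat
#    counts back on the host to size its output, stalling the stream.
row_indices = torch.repeat_interleave(
    torch.arange(len(token_ids_per_req), device="cuda"), lengths_gpu
)
```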

Modifications

  • Compute lengths and flatten token IDs as Python lists on CPU
  • Create tensors on CPU, then transfer to GPU with non_blocking=True
  • This eliminates the implicit device sync and allows the GPU to continue executing kernels without stalling (see the sketch below)
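
A minimal sketch of the resulting pattern, with simplified names; `pin_memory=True` is an extra assumption for illustration, the PR itself only relies on `non_blocking=True`:

```python
import torch

token_ids_per_req = [[5, 17, 42], [7], [3, 9]]  # hypothetical inputs

# Lengths, flattened token IDs, and row indices are computed purely in Python.
lengths = [len(ids) for ids in token_ids_per_req]
flat_token_ids = [t for ids in token_ids_per_req for t in ids]
row_indices = [i for i, n in enumerate(lengths) for _ in range(n)]

# Tensors are created on the CPU and moved to the GPU asynchronously; no
# GPU-resident `repeats` tensor is needed, so nothing forces a stream sync.
flat_ids_gpu = torch.tensor(flat_token_ids, dtype=torch.int64,
                            pin_memory=True).to("cuda", non_blocking=True)
row_idx_gpu = torch.tensor(row_indices, dtype=torch.int64,
                           pin_memory=True).to("cuda", non_blocking=True)

# The requested logprobs can then be gathered with plain advanced indexing,
# e.g. logprobs[row_idx_gpu, flat_ids_gpu].
```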

Accuracy Tests

No model output changes — this is a performance fix only.

Benchmarking and Profiling

Profiling confirms the cudaStreamSynchronize call is removed after this change.

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@Qiaolin-Yu (Collaborator) left a comment

Is it possible to do something similar here? For example, add a `no_copy_to_cpu` flag to

def get_token_ids_logprobs(logprobs, token_ids_logprobs, no_copy_to_cpu=False):

and do the actual GPU -> CPU transfer in

def copy_to_cpu(self, return_logprob: bool):
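
A speculative sketch of that suggestion, for illustration only: it uses just the two signatures quoted above, and the holder class and attribute names below are invented rather than SGLang's actual code.

```python
import torch

def get_token_ids_logprobs(logprobs, token_ids_logprobs, no_copy_to_cpu=False):
    # Gather the requested logprobs on the GPU for each request in the batch.
    gathered = [logprobs[i, ids] for i, ids in enumerate(token_ids_logprobs)]
    if no_copy_to_cpu:
        # Leave the results on the GPU; the transfer is deferred to copy_to_cpu().
        return gathered
    return [t.cpu() for t in gathered]

class LogprobOutput:  # hypothetical holder for the sampler's logprob results
    def __init__(self, token_ids_logprobs_gpu):
        self.token_ids_logprobs = None
        self.token_ids_logprobs_gpu = token_ids_logprobs_gpu

    def copy_to_cpu(self, return_logprob: bool):
        # Single deferred GPU -> CPU transfer, issued once per batch.
        if return_logprob:
            self.token_ids_logprobs = [
                t.to("cpu", non_blocking=True) for t in self.token_ids_logprobs_gpu
            ]
```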
