
Enable piecewise-cuda-graph when logprob_start_len = -1 #19453

Merged
Fridge003 merged 10 commits into sgl-project:main from Qiaolin-Yu:fix_pcg_logprob
Mar 10, 2026


@Qiaolin-Yu
Collaborator

@Qiaolin-Yu Qiaolin-Yu commented Feb 26, 2026

Motivation

Modifications

Currently, setting logprob_start_len = len(input_ids) - 1 is useless: the logprob of the first decode token is always included in output_log_probs and is not controlled by this attribute. With logprob_start_len = len(input_ids) - 1, the request only adds a redundant computation and blocks piecewise CUDA graph (PCG). As a workaround, we adjust the default value.
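To make the argument concrete, here is a minimal counting sketch. It assumes (the PR text does not confirm this) that prompt logprobs are produced for every prompt position from logprob_start_len to the end of the prompt. The helper name num_prompt_logprob_positions and the token ids are hypothetical and not part of SGLang.

```python
def num_prompt_logprob_positions(input_ids, logprob_start_len):
    """Hypothetical helper, not an SGLang API.

    Counts the prompt positions at or after logprob_start_len, under the
    assumption that prompt logprobs cover [logprob_start_len, len(input_ids)).
    """
    if not 0 <= logprob_start_len <= len(input_ids):
        raise ValueError("logprob_start_len must be within the prompt length")
    return len(input_ids) - logprob_start_len


input_ids = [101, 7592, 1010, 2088, 999]  # made-up token ids for illustration

# Starting at len(input_ids) - 1 still asks for one prompt position,
# even though the first decode token's logprob is reported separately.
print(num_prompt_logprob_positions(input_ids, len(input_ids) - 1))
```

Under that assumption, the extra prompt position is the redundant work the description refers to.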

Tests

python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --enable-piecewise-cuda-graph

# -1 (last token)
curl -X POST http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, world!",
    "sampling_params": {
      "max_new_tokens": 16,
      "temperature": 0
    },
    "return_logprob": true,
    "logprob_start_len": -1,
    "top_logprobs_num": 3
  }'

See the prefill log.
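For scripted testing, the same request can be sketched in Python with only the standard library. The endpoint URL and JSON fields are copied from the curl command above; the function names build_generate_request and send are illustrative, and nothing is assumed about the shape of the response beyond it being JSON.

```python
import json
import urllib.request


def build_generate_request(text, max_new_tokens=16,
                           url="http://localhost:30000/generate"):
    """Build the POST request with the same JSON body as the curl command."""
    payload = {
        "text": text,
        "sampling_params": {"max_new_tokens": max_new_tokens, "temperature": 0},
        "return_logprob": True,
        "logprob_start_len": -1,  # -1 (last token), as in the curl example
        "top_logprobs_num": 3,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request to a running server and return the decoded JSON body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with the server from the launch command running:
#   result = send(build_generate_request("Hello, world!"))
```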

Benchmarking and Profiling

Checklist



@Qiaolin-Yu
Collaborator Author

/tag-and-rerun-ci

other_args=[
"--attention-backend",
"flashinfer",
"triton",
Collaborator Author

@Qiaolin-Yu Qiaolin-Yu Mar 9, 2026


FlashInfer uses the non-ragged path with PCG and the ragged path without PCG. The two runs are therefore not bitwise identical, which makes the test fail.
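As an illustration of why a bitwise comparison is brittle here, the following sketch contrasts an exact bit-pattern check with a tolerance-based one. The function names, tolerances, and logprob values are all made up for this example and are not taken from the SGLang test suite.

```python
import math
import struct


def bitwise_equal(a, b):
    """True only if both sequences hold identical IEEE-754 double bit patterns."""
    return len(a) == len(b) and all(
        struct.pack("<d", x) == struct.pack("<d", y) for x, y in zip(a, b)
    )


def close_enough(a, b, rel_tol=1e-5, abs_tol=1e-6):
    """True if every pair of values agrees within the given tolerances."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol) for x, y in zip(a, b)
    )


# Made-up logprobs standing in for two runs that use different kernels.
with_pcg = [-0.1234567, -2.5]
without_pcg = [-0.1234568, -2.5]

print(bitwise_equal(with_pcg, without_pcg))  # False
print(close_enough(with_pcg, without_pcg))   # True
```

A test that requires bitwise equality needs both runs to take the same kernel path, which is the reason given for moving this test off the FlashInfer backend.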

@Fridge003 Fridge003 merged commit a3d88a2 into sgl-project:main Mar 10, 2026
160 of 169 checks passed
liubiyongge pushed a commit to liubiyongge/sglang that referenced this pull request Mar 13, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026