Convert cu_seqlens to CPU for npu_flash_attention_unpad operator#15434

Merged
iforgetmyname merged 15 commits into sgl-project:main from xiaobaicxy:main
Jan 4, 2026
Conversation

@xiaobaicxy (Contributor) commented Dec 19, 2025

Motivation

To improve the performance of VisionAscendAttention, we convert cu_seqlens to CPU once, before the first transformer layer: converting it per layer would interrupt operator dispatch and cause kernel bubbles.
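The pattern can be sketched as follows. Note that `forward_vision_blocks`, `blocks`, and the `on_npu` flag are hypothetical stand-ins for illustration, not the PR's actual function names:

```python
import torch

def forward_vision_blocks(x, cu_seqlens, blocks, on_npu=True):
    # Hoist the device-to-host copy out of the per-layer loop: on NPU the
    # npu_flash_attention_unpad operator consumes cu_seqlens from the host,
    # and copying it inside every layer would stall operator dispatch.
    if on_npu:
        cu_seqlens = cu_seqlens.to("cpu")  # one synchronous copy, up front
    for blk in blocks:
        x = blk(x, cu_seqlens)  # every layer reuses the host-side tensor
    return x
```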

Modifications

Accuracy Tests

None of the modifications affect precision.

Benchmarking and Profiling

Checklist


@github-actions github-actions Bot added the Multi-modal multi-modal language model label Dec 19, 2025
Comment thread python/sglang/srt/models/qwen2_5_vl.py Outdated
self.act = ACT2FN[hidden_act]
self.hidden_act = hidden_act
if self.hidden_act == "silu":
    from sglang.srt.layers.activation import SiluAndMul
@yuan-luo (Collaborator) commented Dec 19, 2025

Move imports to the top of the file.

@xiaobaicxy (Contributor, Author) replied:

done
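For context, the SiluAndMul activation referenced in the diff fuses the gate activation with the elementwise multiply of a gated MLP. A minimal functional sketch, assumed equivalent to the class in sglang.srt.layers.activation (this is an illustrative reimplementation, not the project's code):

```python
import torch
import torch.nn.functional as F

def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    # Split the last dimension in half: the first half is the gate
    # projection, the second half is the up projection.
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]
```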

* 'main' of https://github.com/sgl-project/sglang: (136 commits)
  fix: unreachable error check in retraction (sgl-project#15433)
  [sgl-kernel] chore: update deepgemm version (sgl-project#13402)
  [diffusion] multi-platform: support diffusion on amd and fix encoder loading on MI325 (sgl-project#13760)
  [amd] Add deterministic all-reduce kernel for AMD (ROCm) (sgl-project#15340)
  [diffusion] refactor: refactor _build_req_from_sampling to use shallow_asdict (sgl-project#13782)
  Add customized sampler registration (sgl-project#15423)
  Update readme (sgl-project#15425)
  Fix Mindspore model import warning (sgl-project#15287)
  [Feature] Xiaomi `MiMo-V2-Flash` day0 support (sgl-project#15207)
  [diffusion] profiling: add bench_serving.py and VBench (sgl-project#15410)
  [DLLM] Fix dLLM regression (sgl-project#15371)
  [Deepseek V3.2] Fix Deepseek MTP in V1 mode (sgl-project#15429)
  chore: update CI_PERMISSIONS (sgl-project#15431)
  [DLLM] Add CI for diffusion LLMs (sgl-project#14723)
  Support using different attention backend for draft decoding. (sgl-project#14843)
  feat(dsv32): better error handling for DeepSeek-v3.2 encoder (sgl-project#14353)
  tiny fix lint on main (sgl-project#15424)
  multimodal: precompute hash for MultimodalDataItem (sgl-project#14354)
  [AMD] Clear pre-built AITER kernels and warmup to prevent segfaults and test timeouts (sgl-project#15318)
  [Performance] optimize NSA backend metadata computation for multi-step speculative decoding (sgl-project#14781)
  ...
@JustinTong0323 (Collaborator) commented:

Please fix lint by running pre-commit run -a ~

@xiaobaicxy xiaobaicxy closed this Dec 23, 2025
@xiaobaicxy xiaobaicxy reopened this Dec 23, 2025
iforgetmyname added a commit that referenced this pull request Dec 26, 2025
Liwansi added a commit to iforgetmyname/sglang that referenced this pull request Dec 29, 2025
…glang into eagle-sche

* 'ifmn/eagle-dp-attn' of https://github.com/sgl-project/sglang: (22 commits)
  dp scheduler enhance support with chunked prefill (sgl-project#16071)
  modify suffix decoding
  CI dependency update (sgl-project#16063)
  fix rotary_embedding init npu (sgl-project#16011)
  feat: bugfix and accuracy fix for stablelm2_1_6b (sgl-project#15932)
  Update model and feature support for Ascend NPU (sgl-project#16005)
  Bugfix for Llama4 (sgl-project#15929)
  Bugfix for ds-vl2 (sgl-project#15894)
  gme qwen vl runners fix (sgl-project#15899)
  add profiling in scheduler (sgl-project#15876)
  llama use triton rope op (sgl-project#15855)
  suffix decoding adapt npu
  suffix decoding adapt npu
  Add suffix decoding speculative algorithm from feature 13553
  cherry sgl-project#15434: qwen3 vl performance update
  cherry sgl-project#15597: fix Qwen3-VL-30B-A3B-Instruct accuracy loss
  [Schedule] bug fix for schedule enhancer (sgl-project#15834)
  minilb support roundrobin (sgl-project#15824)
  fix torchair compile issue
  cherry sgl-project#15187: lora fix
  ...

# Conflicts:
#	python/sglang/srt/managers/scheduler.py
#	python/sglang/srt/managers/scheduler_enhancer.py
@iforgetmyname (Collaborator) commented:

/tag-and-rerun-ci

Comment thread python/sglang/srt/models/qwen3_vl.py Outdated
Comment on lines +490 to +493
if is_npu():
    cu_seqlens = cu_seqlens.to("cpu")
else:
    cu_seqlens = cu_seqlens.to(self.device, non_blocking=True)
A Collaborator commented:
Suggested change
- if is_npu():
-     cu_seqlens = cu_seqlens.to("cpu")
- else:
-     cu_seqlens = cu_seqlens.to(self.device, non_blocking=True)
+ if not is_npu():
+     xxx
+ else:
+     xxx

@xiaobaicxy (Contributor, Author) replied:

done
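The accepted suggestion only reorders the branch so the common (non-NPU) path comes first. A sketch of the resulting logic, with `place_cu_seqlens` and the `on_npu` flag as hypothetical names for illustration:

```python
import torch

def place_cu_seqlens(cu_seqlens, device, on_npu):
    # Common path first: asynchronous copy onto the compute device.
    if not on_npu:
        return cu_seqlens.to(device, non_blocking=True)
    # NPU path: npu_flash_attention_unpad reads cu_seqlens from host memory.
    return cu_seqlens.to("cpu")
```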

@xiaobaicxy xiaobaicxy changed the title Qwen2.5-vl support SiluAndMul/GeluAndMul & Convert cu_seqlens to CPU for npu_flash_attention_unpad operator Convert cu_seqlens to CPU for npu_flash_attention_unpad operator Jan 3, 2026
@iforgetmyname iforgetmyname merged commit 25fa2ac into sgl-project:main Jan 4, 2026
153 of 168 checks passed
JiaruiChang5268 pushed a commit to JiaruiChang5268/sglang that referenced this pull request Jan 10, 2026

Labels

Multi-modal (multi-modal language model), run-ci


4 participants