Fix InternVL and vision attention for non-CUDA backends (e.g. XPU)#19997
hnyls2002 merged 9 commits into sgl-project:main from
Conversation
@mingfeima and @airMeng, I am currently working with @yangw1234 on these pull requests. I was wondering if you could help us get these reviewed.
@jmunetong Thank you for your help. Would you mind adding at least one test case to the XPU CI to avoid this breaking again? You can refer to https://github.com/sgl-project/sglang/tree/main/test/srt/xpu cc validation leader @MingxuZh
@airMeng Should be passing now. I forgot to rebase from a different PR where we modified some CUDA calls that were breaking the XPU test.
@jmunetong The InternVL-specific changes are gone; you need to add them back. And I think @airMeng is asking you to add a test case similar to this one: https://github.com/sgl-project/sglang/blob/main/test/srt/xpu/test_deepseek_ocr.py. You may also refer to this file: https://github.com/sgl-project/sglang/blob/main/test/registered/vlm/test_vision_openai_server_a.py#L112
/tag-and-rerun-ci |
…gl-project#19997) Co-authored-by: Yang Wang <mr.yang.wang@outlook.com>
This commit introduces comprehensive ROCm wheel building infrastructure for SGLang, targeting AWS S3 for internal distribution.

scripts/check_aiter_version.sh:
- Smart AITER version detection from docker/rocm.Dockerfile
- Handles both version tags (e.g., v0.1.12.post1) and commit SHAs
- Checks S3 for existing wheels to avoid unnecessary rebuilds
- Returns rebuild decision and detected version

.github/workflows/release-whl-sglang-rocm.yml:
- Unified workflow with 4 stages and explicit job dependencies
- Stage 1: Check if AITER needs rebuild (conditional)
- Stage 2: Build AITER wheels for rocm700/rocm720 (only if needed)
- Stage 3: Build sglang wheels for both ROCm versions
- Stage 4: Upload to S3 with proper directory structure

1. S3 structure: clean separation of HTML indices and wheel files
   - simple/: HTML indices (PEP 503 compliant)
   - packages/: actual wheel files
   - Relative links: ../../packages/pkg/file.whl
2. AITER workflow: integrated as a conditional stage
   - Triggered by docker/rocm.Dockerfile changes
   - Smart rebuild: only when the version changes
   - AITER must complete before the sglang build
3. Version format: standard Python versioning
   - Release: 0.5.9
   - Nightly: 0.5.10.dev20260421+g4cf4f08
   - Note: AITER and sglang-kernel keep the +rocm suffix (compiled binaries)
4. AWS secrets: AMD_* naming convention
   - AMD_AWS_ACCESS_KEY_ID
   - AMD_AWS_SECRET_ACCESS_KEY
   - AMD_S3_BUCKET_NAME
5. Workflow triggers:
   - Daily schedule (3 AM UTC)
   - Push to docker/rocm.Dockerfile
   - Manual dispatch with options

Related-to: sgl-project#19997
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
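The version-detection and rebuild-decision logic in scripts/check_aiter_version.sh can be sketched as follows. This is a hypothetical Python equivalent for illustration only: the real script is shell, and the regexes, function names, and the 8-character SHA truncation here are assumptions, not the script's actual behavior.

```python
import re

def classify_aiter_ref(ref: str) -> str:
    """Classify an AITER reference pulled from docker/rocm.Dockerfile as
    either a release tag (e.g. v0.1.12.post1) or a bare commit SHA."""
    if re.fullmatch(r"v\d+(\.\d+)*(\.(post|dev|rc)\d+)?", ref):
        return "tag"
    if re.fullmatch(r"[0-9a-f]{7,40}", ref):
        return "sha"
    return "unknown"

def wheel_needs_rebuild(ref: str, existing_versions: set[str]) -> bool:
    """Rebuild only when no wheel for this version is already in S3.
    Tags are normalized by dropping the leading 'v'; SHAs are shortened
    (the 8-char truncation is an assumption for this sketch)."""
    if classify_aiter_ref(ref) == "tag":
        version = ref.lstrip("v")
    else:
        version = ref[:8]
    return version not in existing_versions
```

The point of the check is simply to skip the expensive AITER build whenever the Dockerfile still pins a version whose wheel already exists in the bucket.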
Add pyproject_rocm.toml with ROCm/HIP-specific dependencies:
- Inherits runtime_common, diffusion_common, tracing, test extras
- Defines rocm700 and rocm720 extras with pinned packages:
* torch, triton, torchaudio, torchvision from repo.radeon.com
* sglang-kernel from GitHub releases
* amd-aiter (discovered via --extra-index-url)
* mooncake-transfer-engine-non-cuda
- Defines srt_hip and diffusion_hip for HIP runtime
- Removed non-ROCm architectures (HPU, MUSA, MPS)
Users install with:
pip install 'sglang[srt_hip,rocm700]' \
--extra-index-url https://aioss-pypi-prod.s3.amazonaws.com/sglang/rocm700/simple/
setuptools-scm configured with local_scheme='no-local-version' to suppress
+rocm suffix on sglang wheels (standard Python versioning).
Related-to: sgl-project#19997
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
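The setuptools-scm setting described above would look roughly like this in pyproject_rocm.toml (a sketch; the actual section layout and surrounding options in the file may differ):

```toml
[tool.setuptools_scm]
# Drop the local version segment (e.g. "+rocm") so sglang wheels use
# standard Python versioning such as 0.5.10.dev20260421.
local_scheme = "no-local-version"
```

With this scheme, only the compiled-binary packages (AITER, sglang-kernel) retain a `+rocm` local suffix, while pure-Python sglang wheels stay PEP 440 "standard" versions.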
Motivation
InternVL and vision attention currently assume CUDA: they use a hardcoded "cuda" device and .cuda() calls, and the vision attention backend selection does not handle XPU. This prevents running InternVL and vision models on Intel XPU (and other non-CUDA devices). This PR makes both components device-agnostic so they work on the configured backend (CUDA, XPU, etc.).

Modifications
python/sglang/srt/multimodal/processors/internvl.py:
- Import get_device from sglang.srt.utils.
- Replace device="cuda" with device=get_device() in normalization and preprocessing.
- Replace .cuda() and .to("cuda") with .to(get_device()) for image/video tensors and input_ids so tensors are created on the actual backend device.

python/sglang/srt/layers/attention/vision.py:
- Import is_xpu and set _is_xpu = is_xpu() alongside the existing _is_cuda, _is_npu, _is_hip.
- VisionTritonAttention: use cu_seqlens.to(q.device) and seq_lens.to(q.device) instead of .cuda() so tensors follow the model device (works on XPU and other backends).
- VisionAttention backend selection: add elif _is_xpu: backend = "triton_attn" so XPU uses the Triton attention backend instead of falling through to SDPA or unsupported paths.

Accuracy Tests
This PR does not change model forward or kernel math; it only changes device placement and backend selection for non-CUDA. No new accuracy test results are required. Existing InternVL and vision model behavior on CUDA is unchanged; on XPU, models can now run with correct device placement and backend.
Benchmarking and Profiling
Not applicable. Changes are for correctness and multi-device support (device placement and backend selection). No intentional inference-speed changes; benchmarking can be done by maintainers on XPU if needed.
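The device-placement and backend-selection changes described under Modifications follow a common pattern, sketched below. This is a minimal illustration, not sglang's actual code: get_device here is a simplified stand-in for sglang.srt.utils.get_device, and the functions, signatures, and the "fa3" default are assumptions.

```python
import torch

def get_device() -> str:
    # Simplified stand-in for sglang.srt.utils.get_device:
    # prefer CUDA, then XPU, then fall back to CPU.
    if torch.cuda.is_available():
        return "cuda"
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return "xpu"
    return "cpu"

def preprocess(pixel_values: torch.Tensor, input_ids: torch.Tensor):
    # Before: pixel_values.cuda() and input_ids.to("cuda") hardwired CUDA.
    # After: tensors are placed on whatever backend is configured.
    device = get_device()
    return pixel_values.to(device), input_ids.to(device)

def select_vision_backend(_is_cuda: bool, _is_hip: bool, _is_xpu: bool) -> str:
    # Backend selection with the new XPU branch; the non-XPU defaults
    # shown here are illustrative, not sglang's real choices.
    if _is_cuda:
        return "fa3"
    elif _is_hip:
        return "triton_attn"
    elif _is_xpu:
        return "triton_attn"  # new: XPU uses the Triton attention backend
    return "sdpa"
```

The same idea covers the VisionTritonAttention fix: replacing `cu_seqlens.cuda()` with `cu_seqlens.to(q.device)` makes the metadata tensors follow the query tensor's device instead of assuming CUDA.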