[XPU] Whisper model support on XPU Platform #25123

Merged
jikunshang merged 1 commit into vllm-project:main from chaojun-zhang:whisper_model_support on Sep 18, 2025

Conversation

@chaojun-zhang (Contributor) commented Sep 18, 2025

Purpose

Add Whisper model support on the XPU platform.

Test Plan

VLLM_USE_V1=1 XPU_CCL_BACKEND=xccl CCL_ATL_SHM=1 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 VLLM_WORKER_MULTIPROC_METHOD=spawn python3 -m vllm.entrypoints.openai.api_server --model openai/whisper-large-v3 --dtype=float16 --enforce-eager --port 8000 --trust-remote-code --max_num_batched_tokens 32768 --gpu-memory-util 0.85
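
Once the server is up, a quick transcription request can confirm the model actually serves audio. A minimal sketch, assuming vLLM's OpenAI-compatible /v1/audio/transcriptions endpoint and a local sample.wav (both placeholders of mine, not part of the original test plan):

```python
import requests

# Hypothetical smoke test: endpoint path and form fields follow the
# OpenAI-compatible transcription API; "sample.wav" is any local audio file.
with open("sample.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/v1/audio/transcriptions",
        files={"file": ("sample.wav", f, "audio/wav")},
        data={"model": "openai/whisper-large-v3"},
    )
resp.raise_for_status()
print(resp.json()["text"])  # the transcribed text
```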

Test Result

With this PR:

The server starts successfully.

Without this PR:

  1. raises the error "ViT attention hasn't supported _Backend.IPEX"
  2. raises a "_Backend.IPEX" error during bind_kv_cache

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: chzhang <chaojun.zhang@intel.com>
@gemini-code-assist (Bot) left a comment

Code Review

This pull request adds support for the Whisper model on the XPU platform. The changes are minimal and confined to two files, vllm/attention/layer.py and vllm/v1/worker/utils.py. In both cases, the changes add current_platform.is_xpu() to existing platform-specific conditional logic to make XPU behave similarly to CUDA or ROCm. Specifically, it forces the use of the TORCH_SDPA attention backend for MultiHeadAttention and allows a simplified KV cache binding logic for encoder-decoder models. These changes appear to be a correct and standard approach for enabling a new hardware backend. I have not found any issues of high or critical severity.
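
For readers unfamiliar with the pattern, the shape of the change is roughly as follows. This is an illustrative sketch based on the review summary above, not the actual diff: _Backend here is a stand-in for vLLM's internal enum, and select_mha_backend is an invented helper name.

```python
from enum import Enum


class _Backend(Enum):  # stand-in for vLLM's internal backend enum
    TORCH_SDPA = "TORCH_SDPA"
    IPEX = "IPEX"


def select_mha_backend(current_platform) -> _Backend:
    """Hypothetical helper mirroring the check in vllm/attention/layer.py.

    XPU is added alongside CUDA/ROCm so that MultiHeadAttention falls
    back to PyTorch's scaled_dot_product_attention rather than the IPEX
    backend, which does not yet cover ViT-style attention.
    """
    if (current_platform.is_cuda()
            or current_platform.is_rocm()
            or current_platform.is_xpu()):  # is_xpu() is the new condition
        return _Backend.TORCH_SDPA
    raise NotImplementedError("other platforms not covered in this sketch")
```

Per the review summary, the change in vllm/v1/worker/utils.py follows the same pattern: XPU joins the existing platform check so encoder-decoder models take the simplified KV cache binding path already used on CUDA/ROCm.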

@jikunshang (Collaborator) left a comment

thanks for fixing.

@jikunshang jikunshang enabled auto-merge (squash) September 18, 2025 02:57
@github-actions (Bot) added the "ready" label (ONLY add when PR is ready to merge / full CI is needed) on Sep 18, 2025
@jikunshang jikunshang merged commit 3bc1812 into vllm-project:main Sep 18, 2025
58 checks passed
845473182 pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 18, 2025
…litPR into model_register

* 'model_register' of https://github.com/dsxsteven/vllm_splitPR: (138 commits)
  Retrieve `sliding_window` from text config in Gemma3 MM (vllm-project#25085)
  [Docs] Fix API Reference (vllm-project#25140)
  [Kernel] Better inf handling for grouped topk cu (vllm-project#24886)
  [CLI] Use streaming in CLI chat and completion commands (vllm-project#23769)
  [benchmark] add peak throughput metrics and plot (vllm-project#23867)
  [Spec Decode] Efficient padded speculation (vllm-project#24539)
  [V0 Deprecation] Remove more V0 tests (vllm-project#25117)
  [EPLB] Add EPLB support for hunyuan_v1 (vllm-project#23078)
  [XPU] Whisper model support on XPU Platform (vllm-project#25123)
  Mark prompt logprobs as incompatible with prompt embeds at API level (vllm-project#25077)
  [Model] enable data parallel for InternVL vision encoder (vllm-project#23909)
  [Kernels] Overlap shared experts with combine instead of dispatch (vllm-project#24254)
  [Bugfix][Qwen3-Next] add prefixes to shared_expert in qwen3-next and mlp in qwen2moe to successfully load ignored params in quantized models (vllm-project#24960)
  [Core][MM] Cleanup `MultiModalCache` (vllm-project#25006)
  [Docs] Clean up the contributing README (vllm-project#25099)
  [MM Encoder] Apply DP ViT for Qwen3-VL model series (vllm-project#24955)
  [Kernels] Enable DeepGEMM by default (vllm-project#24462)
  [V0 Deprecation] Skip PP test (vllm-project#25128)
  [V0 Deprecation] Remove misc V0 tests (vllm-project#25118)
  [V0 Deprecation] Remove V0 Tracing & Metrics tests (vllm-project#25115)
  ...
debroy-rh pushed a commit to debroy-rh/vllm that referenced this pull request Sep 19, 2025
Signed-off-by: chzhang <chaojun.zhang@intel.com>
ABC12345anouys pushed a commit to ABC12345anouys/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: chzhang <chaojun.zhang@intel.com>
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: chzhang <chaojun.zhang@intel.com>
Signed-off-by: charlifu <charlifu@amd.com>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
Signed-off-by: chzhang <chaojun.zhang@intel.com>

Labels

ready (ONLY add when PR is ready to merge / full CI is needed), v1
