Resolve invalid argument index error for SDPA backend execution #35021
Merged
isanghao merged 4 commits into openvinotoolkit:master on Apr 1, 2026
Signed-off-by: Min, Byungil <byungil.min@intel.com>
Signed-off-by: Min, Byungil <byungil.min@intel.com>
872430c to 41be0a3
Signed-off-by: Min, Byungil <byungil.min@intel.com>
isanghao reviewed Mar 31, 2026
#if HAS_KV_CACHE_ZP_INPUT
VALUE_COMPRESSION_SCALE_TYPE comp_zp = val_zp[comp_offset];
#else
VALUE_COMPRESSION_SCALE_TYPE comp_zp = val_scale[comp_offset + 1];
Contributor
What about introducing a macro like this? I think it would be easier to read.
#if HAS_KV_CACHE_ZP_INPUT
#define GET_ZP(zp, scale, comp_offset) ((zp)[(comp_offset)])
#else
#define GET_ZP(zp, scale, comp_offset) ((scale)[(comp_offset) + 1])
#endif
Contributor
Author
Looks good. I also added GET_SCALE alongside it.
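For illustration, the two macros might be exercised on the host as in the C sketch below. Only GET_ZP appears in this thread; the GET_SCALE definition, the `dequantize_u8` helper, the `float` element types, and the `(q - zp) * scale` formula are assumptions made for this example, not code taken from `sdpa_opt.cl`.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 1 means zero-points arrive in their own buffer (Planar),
   0 means each scale is immediately followed by its zero-point. */
#ifndef HAS_KV_CACHE_ZP_INPUT
#define HAS_KV_CACHE_ZP_INPUT 1
#endif

/* GET_ZP as suggested in the review comment. */
#if HAS_KV_CACHE_ZP_INPUT
#define GET_ZP(zp, scale, comp_offset) ((zp)[(comp_offset)])
#else
#define GET_ZP(zp, scale, comp_offset) ((scale)[(comp_offset) + 1])
#endif

/* Hypothetical: the merged GET_SCALE is not shown in the thread. Here it is
   assumed to read the scale at comp_offset in both layouts. */
#define GET_SCALE(zp, scale, comp_offset) ((scale)[(comp_offset)])

/* Illustrative asymmetric dequantization of one u8 value.
   The formula (q - zp) * scale is an assumption of this sketch. */
float dequantize_u8(unsigned char q, const float *zp, const float *scale,
                    size_t comp_offset) {
    assert(scale != NULL);
    const float s = GET_SCALE(zp, scale, comp_offset);
    const float z = GET_ZP(zp, scale, comp_offset);
    return ((float)q - z) * s;
}
```

With the default setting above, the zero-point is read from the separate `zp` buffer. Building with `-DHAS_KV_CACHE_ZP_INPUT=0` switches the same call site to the interleaved layout without touching the dequantization code, which is the readability benefit the suggestion is after.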
Contributor
Pull request overview
This PR fixes an “invalid argument index” failure when executing the GPU SDPA optimized OpenCL kernel (sdpa_opt) with KV-cache compression in Planar output storage mode (used when supports_immad=true), where zero-points (ZP) are provided as separate buffers rather than interleaved with scales.
Changes:
- Update `sdpa_opt.cl` kernel signatures and dequantization logic to accept optional `key_zp`/`val_zp` inputs when asymmetric quantization is used with non-interleaved (Planar) scale/ZP storage.
- Add a unit transformation test that validates the KV-cache compression rewrite for Planar storage, including separate ZP buffers passed into `IndirectSDPA`.
- Extend functional KV-cache+SDPA dynamic tests to include compressed beam-search cases with `batch > 1` to exercise the indirect `sdpa_opt` path.
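One way to picture the failure mode is a host that binds the two extra ZP buffers to a kernel whose signature does not declare them. The C sketch below simulates that with plain integers instead of the OpenCL runtime. The function names, the base parameter count, and the assumption that exactly two parameters (`key_zp`, `val_zp`) are added are all hypothetical; only the numeric value -49 for CL_INVALID_ARG_INDEX comes from the reported error.

```c
#include <assert.h>

#define SIM_CL_SUCCESS 0
/* Same numeric value as the CL_INVALID_ARG_INDEX code in the error log. */
#define SIM_CL_INVALID_ARG_INDEX (-49)

/* Hypothetical: number of parameters a kernel declares, given a base count
   and whether its signature includes the two separate ZP inputs. */
int kernel_param_count(int base_params, int declares_zp_inputs) {
    return base_params + (declares_zp_inputs ? 2 : 0);
}

/* Simulated argument binding: an index outside the declared parameter
   list is rejected, mirroring how an out-of-range index is reported. */
int sim_set_kernel_arg(int param_count, int arg_index) {
    assert(param_count >= 0);
    if (arg_index >= 0 && arg_index < param_count) {
        return SIM_CL_SUCCESS;
    }
    return SIM_CL_INVALID_ARG_INDEX;
}
```

In this model, a kernel compiled without the ZP parameters rejects the first extra index, while one compiled with them accepts both. That matches the direction of the fix described above, which makes the kernel signature declare the optional inputs.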
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/plugins/intel_gpu/tests/unit/transformations/kv_cache_compression.cpp | Adds a Planar-mode KV-cache compression transformation test that wires separate scale and ZP buffers into IndirectSDPA. |
| src/plugins/intel_gpu/tests/functional/subgraph_tests/dynamic/kv_cache_sdpa.cpp | Adds compressed beam-search parameter sets (batch=2) to cover the indirect optimized SDPA path. |
| src/plugins/intel_gpu/src/graph/impls/ocl_v2/sdpa_opt.cl | Adds conditional kernel args for separate key/value ZP buffers and uses them in asymmetric dequantization when scales and ZP are not combined. |
Signed-off-by: Min, Byungil <byungil.min@intel.com>
isanghao approved these changes Apr 1, 2026
Details:
Description: The benchmark failed during execution when the KV-cache SDPA backend was enabled on a dGPU (systolic array).
RuntimeError: Exception from src/inference/src/cpp/infer_request.cpp:246: Exception from src/plugins/intel_gpu/src/runtime/ocl/ocl_stream.cpp:277: [GPU] [CL_EXT] setArgUsm in KernelIntel failed, error code: -49 CL_INVALID_ARG_INDEX
The code and line that caused this issue:
supports_immad at kv_cache_compression.cpp:145~148
Reproduction step and snapshot:
python benchmark.py -m qwen2.5-7b-instruct/pytorch/ov/OV_FP16-INT8_ASYM -d GPU.1 -n 0 --genai -mc 1 -pf repo-prompts/32_1024/qwen2.5-7b-instruct.jsonl -lc enable_sdpa_cache_u8_by-channel.json
cat enable_sdpa_cache_u8_by-channel.json
{ "ATTENTION_BACKEND": "SDPA", "KV_CACHE_PRECISION": "u8", "KEY_CACHE_QUANT_MODE": "BY_CHANNEL" }
Tickets:
AI Assistance:
Generated the unit tests for this fix.