
[Intel GPU] Enable backward for SDPA XPU [WIP]#156272

Closed
LuFinch wants to merge 14 commits into pytorch:main from LuFinch:lfq/sdpa_traning

Conversation

@LuFinch
Contributor

@LuFinch LuFinch commented Jun 18, 2025

@pytorch-bot

pytorch-bot bot commented Jun 18, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156272

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures, 1 Unrelated Failure

As of commit acbac05 with merge base 908c5cc:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added module: cpu CPU specific problem (e.g., perf, algorithm) module: mkldnn Related to Intel IDEEP or oneDNN (a.k.a. mkldnn) integration labels Jun 18, 2025
@LuFinch LuFinch changed the title [Intel GPU] Enable training for SDPA XPU [Intel GPU] Enable training for SDPA XPU [WIP] Jun 18, 2025
@LuFinch
Contributor Author

LuFinch commented Jun 18, 2025

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jun 18, 2025
@github-actions
Contributor

Attention! native_functions.yaml was changed

If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.


Caused by:

@LuFinch LuFinch force-pushed the lfq/sdpa_traning branch from 7014f89 to 232387d Compare June 18, 2025 11:03
@guangyey guangyey moved this to In Progress in PyTorch Intel Jun 19, 2025
@LuFinch LuFinch force-pushed the lfq/sdpa_traning branch 7 times, most recently from 40bafeb to 435e14a Compare June 24, 2025 06:27
@LuFinch LuFinch changed the title [Intel GPU] Enable training for SDPA XPU [WIP] [Intel GPU] Enable backward for SDPA XPU [WIP] Jun 25, 2025
@LuFinch LuFinch force-pushed the lfq/sdpa_traning branch from 672a28f to 4d05632 Compare July 16, 2025 06:32
@LuFinch LuFinch force-pushed the lfq/sdpa_traning branch from 4d05632 to 6de83d4 Compare July 24, 2025 01:56
@LuFinch LuFinch force-pushed the lfq/sdpa_traning branch from be64cfc to c8d7c3b Compare July 30, 2025 03:33
@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Aug 12, 2025
Collaborator

@EikanWang EikanWang left a comment

@LuFinch, the newly added parameter of the overrideable SDPA backward is not backward compatible.

tags: nondeterministic_seeded

- func: _scaled_dot_product_fused_attention_overrideable(Tensor query, Tensor key, Tensor value, Tensor? attn_bias=None, float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False, *, float? scale=None) -> (Tensor output, Tensor logsumexp, Tensor cum_seq_q, Tensor cum_seq_k, SymInt max_q, SymInt max_k, Tensor philox_seed, Tensor philox_offset, Tensor debug_attn_mask)
- func: _scaled_dot_product_fused_attention_overrideable(Tensor query, Tensor key, Tensor value, Tensor? attn_bias=None, bool compute_log_sumexp=False, float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False, *, float? scale=None) -> (Tensor output, Tensor logsumexp, Tensor cum_seq_q, Tensor cum_seq_k, SymInt max_q, SymInt max_k, Tensor philox_seed, Tensor philox_offset, Tensor debug_attn_mask)
Collaborator

@LuFinch, why do we need to add compute_log_sumexp? It breaks ABI backward compatibility.

Contributor Author

Ideally, we could check the input tensors' attributes, e.g. compute_logsumexp = query.requires_grad() || key.requires_grad() || value.requires_grad(), to decide whether to compute logsumexp, and this check works in eager mode.

However, in torch.compile mode, input tensors query/key/value that have requires_grad() == True at the beginning may end up with requires_grad() == False inside the op after aot_autograd in some models. Hence a bool flag is needed to tell the op to compute logsumexp. I am not an expert on aot_autograd and am not sure why it behaves this way, but the cuDNN and efficient-attention backends also have this parameter; I guess they hit the same issue, otherwise they could have moved this check into the op.

case SDPBackend::cudnn_attention: {
  bool compute_logsumexp = should_compute_logsumexp(query_, key, value);
  auto out_lse_softmax = at::_scaled_dot_product_cudnn_attention(
      query_, key, value, attn_mask, compute_logsumexp, dropout_p,
      is_causal, false /*return_debug_mask*/, scale);

case SDPBackend::efficient_attention: {
  bool compute_logsumexp = should_compute_logsumexp(query_, key, value);
  if (attn_mask.has_value()) {
    attn_mask.value() = preprocess_mask(attn_mask.value(), query_, key, value);
  }
  auto out_and_lse = at::_scaled_dot_product_efficient_attention(
      query_, key, value, attn_mask, compute_logsumexp, dropout_p, is_causal, scale);
  return std::get<0>(out_and_lse);
}

Collaborator

@LuFinch, why do we need to add compute_log_sumexp? It breaks ABI backward compatibility.

By default compute_log_sumexp=False, so this should not break API-level BC, right?

Contributor Author

Eikan recommended moving this argument to the last position; then it will not break BC.

Collaborator

Cool!

Contributor Author

Moved this argument to the last position.

(attn_mask.has_value() && attn_mask.value().requires_grad()));
}

bool check_grad(sdp::sdp_params const& params, bool debug) {
Collaborator

According to the implementation details, it returns True when

  • Grad mode is not enabled
  • All input tensors do not require a gradient
  • Not Group Query Attention, and the attention mask does not require a gradient

@LuFinch , is my understanding correct? If so, I would suggest refining the name of check_grad a little bit. Something could be like is_onednn_attention_backward_supported to illustrate your idea clearly.

Collaborator

This function should be used to check the grad requirements of inputs to determine whether they are suitable for supporting overrideable SDPA on XPU in the future.

Contributor Author

As Guangye said, it is used to determine whether to use overrideable SDPA. If it returns True, the overrideable SDPA path can be used.

In oneDNN v3.9, SDPA training forward and backward don't support GQA and don't output a gradient for attn_mask.

Hence this function means:

  • If grad mode is not enabled, we can use overrideable SDPA to run the oneDNN SDPA inference forward graph.
  • If grad mode is enabled but none of q/k/v needs grad, we can use overrideable SDPA to run the oneDNN SDPA inference forward graph.
  • If we need to compute grads, the attention is not GQA, and attn_mask doesn't require grad, then we can use overrideable SDPA to run the oneDNN SDPA training forward graph.
  • Otherwise, it should fall back to the MATH backend.
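For illustration, the decision logic above can be sketched as standalone code; the struct and function names below are simplified stand-ins of mine, not the real sdp::sdp_params or check_grad signature:

```cpp
// Simplified stand-in types; the real code reads these flags from
// at::GradMode and the tensors in sdp::sdp_params.
struct SdpaGradParams {
  bool grad_mode_enabled;
  bool qkv_requires_grad;        // any of q/k/v requires grad
  bool is_gqa;                   // grouped-query attention
  bool attn_mask_requires_grad;
};

// Returns true when the overrideable (oneDNN) SDPA path can be used.
bool can_use_overrideable_sdpa(const SdpaGradParams& p) {
  // Inference-only cases: no grad mode, or no input needs grad.
  if (!p.grad_mode_enabled || !p.qkv_requires_grad) return true;
  // Training case: oneDNN v3.9 supports neither GQA nor attn_mask gradients.
  return !p.is_gqa && !p.attn_mask_requires_grad;
}
```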

auto k_num_heads = params.key.sym_size(-3);
auto v_num_heads = params.value.sym_size(-3);
bool is_gqa = q_num_heads != k_num_heads || q_num_heads != v_num_heads;
if (debug && is_gqa)
Collaborator

Since it is already GQA here, why does this function return false?

Contributor Author

In oneDNN v3.9, SDPA training forward and backward don't support GQA and don't compute a gradient for attn_mask.


bool attn_mask_needs_grad =
params.attn_mask.has_value() && params.attn_mask.value().requires_grad();
if (debug && attn_mask_needs_grad) {
Collaborator

ditto

Contributor Author

In oneDNN v3.9, SDPA training forward and backward don't support GQA and don't compute a gradient for attn_mask.

auto grad_attn_bias = attn_bias_opt.has_value()
? at::empty_like(attn_bias_opt.value())
: at::Tensor();
at::native::onednn::gpu_float_sdpa_backward(
Collaborator

gpu_float_sdpa_backward has been defined — does the name mean the backward function only supports float?

Collaborator

I have the same question.

Contributor Author

It supports FP32/FP16/BF16. I copied this function name directly from the SDPA inference path. We could rename it.

Contributor Author

Renamed it to sdpa_backward.

grad_out.dim() == 4 && out.dim() == 4 &&
grad_out.size(0) == out.size(0) && grad_out.size(1) == out.size(1) &&
grad_out.size(2) == out.size(2) && grad_out.size(3) == out.size(3),
"scaled_dot_product_fused_attention_overrideable_backward_xpu: grad_out and out should have the same shape of {(B), H, T, K}");
Collaborator

What's the meaning of (B)?

Contributor Author

Copied from the forward code — it just means batch size. I have removed the brackets.

is_causal, logical_params);
auto i = logical_params.get_input();
auto o = logical_params.get_output();
auto compiled_partition = partition_.compile(i, o, eng);
Collaborator

This variable shadows a similar declaration at line 972. It's fine but not good. I recommend renaming this to avoid name shadowing.

Contributor Author

Thanks for your patient review. Renamed it.

if (is_causal) {
neg_inf = at::full(
{},
-INFINITY,
Collaborator

at::numeric_limits<>::lower_bound is better.

Contributor Author

This file already uses -std::numeric_limits&lt;float&gt;::infinity(), so I replaced -INFINITY with it as well.
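As a standalone illustration of why the replacement is behavior-preserving (the constant names below are mine, not from the PR):

```cpp
#include <cmath>
#include <limits>

// Both spellings denote the same IEEE-754 negative infinity; the <limits>
// form is explicitly typed, while INFINITY is a C macro from <cmath>.
constexpr float kNegInfMacro = -INFINITY;
constexpr float kNegInfLimits = -std::numeric_limits<float>::infinity();
```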

Comment on lines +1024 to +1036
inputs.reserve(l_inputs.size());
inputs.emplace_back(l_inputs[i++], eng, grad_out.data_ptr());
inputs.emplace_back(l_inputs[i++], eng, query.data_ptr());
inputs.emplace_back(l_inputs[i++], eng, key.data_ptr());
inputs.emplace_back(l_inputs[i++], eng, value.data_ptr());
inputs.emplace_back(l_inputs[i++], eng, out.data_ptr());
inputs.emplace_back(l_inputs[i++], eng, logsumexp.data_ptr());
inputs.emplace_back(l_inputs[i++], eng, softmax_scale.data_ptr());
if (neg_inf.has_value()) {
  inputs.emplace_back(l_inputs[i++], eng, neg_inf->data_ptr());
}
if (attn_mask.has_value()) {
  inputs.emplace_back(l_inputs[i++], eng, attn_mask->data_ptr());
Collaborator

Use a macro to reduce the duplicated code, such as:

#define ADD_INPUT(variable) \
  inputs.emplace_back(l_inputs[i++], eng, variable.data_ptr())

ADD_INPUT(grad_out);
ADD_INPUT(query);
...
#undef ADD_INPUT

Contributor Author

Cool. Done.

partition& find_or_create_backward_graph_partition(
bool is_causal,
const SDPABackwardLogicalParams& params) {
thread_local static PartitionCache cache;
Collaborator

Suggested change
thread_local static PartitionCache cache;
thread_local PartitionCache cache;

Contributor Author

Done.
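A small standalone sketch of why dropping `static` is safe (the function name here is illustrative):

```cpp
// At block scope, `thread_local` already implies static storage duration,
// so `thread_local static PartitionCache cache;` and
// `thread_local PartitionCache cache;` are equivalent: one instance per
// thread, initialized once per thread, persisting across calls.
int& per_thread_counter() {
  thread_local int n = 0;  // same as `thread_local static int n = 0;`
  return n;
}
```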

std::bitset<32> patternID;
if (dtype == data_type::f32) {
// bit 3 corresponds to float32 dtype
patternID.set(3, 1);
Collaborator

This is fine, but I recommend using a name like kBitFloat32 instead of the hardcoded number 3. Also, kBitFloat32 could be shared with find_or_create_graph_partition.

Contributor Author

Good idea. Done.
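A minimal standalone sketch of the naming suggestion; the constant name follows the review comment, while the helper function is hypothetical and not from the PR:

```cpp
#include <bitset>
#include <cstddef>

// Named bit position instead of a hardcoded 3; this constant could be
// shared between the forward and backward partition-cache lookups.
constexpr std::size_t kBitFloat32 = 3;  // bit 3 marks a float32 pattern

std::bitset<32> make_pattern_id(bool is_f32) {
  std::bitset<32> patternID;
  if (is_f32) {
    patternID.set(kBitFloat32);  // replaces patternID.set(3, 1)
  }
  return patternID;
}
```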

at::Tensor reshaped_attention = attention_;
at::Tensor reshaped_logsumexp = logsumexp_.unsqueeze(-1);
at::Tensor reshaped_attn_mask = attn_mask_.value_or(at::Tensor());
if (at::native::onednn::is_broadcast(reshaped_query)) {
Collaborator

With this code change, SDPA will not support broadcast anymore. Is this BC-breaking? Is there any impact on old scripts?

Contributor Author

After offline discussion, I added back broadcast support for Q/K/V. However, the output tensors attention and logsumexp are allocated by us and should not be broadcast, so I did not add broadcast support for those two.

Collaborator

@guangyey guangyey left a comment

LGTM.

@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Aug 13, 2025
@LuFinch
Contributor Author

LuFinch commented Aug 13, 2025

@guangyey Thanks for your review. Eikan recommends splitting this PR into two parts: one for the oneDNN code only and another for the PyTorch API change. This will make it easier for the community to review. I will open new PRs.

@LuFinch LuFinch closed this Aug 20, 2025
@github-project-automation github-project-automation bot moved this from In Progress to Done in PyTorch Intel Aug 20, 2025
pytorchmergebot pushed a commit that referenced this pull request Sep 8, 2025

This PR is the first split PR of #156272, only contains the OneDNN code. Please help review.

Pending on OneDNN v3.9 commit update. Don't merge.

Pull Request resolved: #161058
Approved by: https://github.com/guangyey, https://github.com/EikanWang
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
pytorchmergebot pushed a commit that referenced this pull request Oct 31, 2025
@pytorch pytorch deleted a comment from pytorch-bot bot Oct 31, 2025
@pytorch pytorch deleted a comment from pytorchmergebot Oct 31, 2025
@pytorch pytorch deleted a comment from pytorchmergebot Oct 31, 2025
BoyuanFeng pushed a commit that referenced this pull request Oct 31, 2025
etaf pushed a commit to etaf/pytorch-inductor-xpu that referenced this pull request Nov 4, 2025
@LuFinch LuFinch deleted the lfq/sdpa_traning branch January 28, 2026 05:33

Labels

ciflow/xpu Run XPU CI tasks module: cpu CPU specific problem (e.g., perf, algorithm) module: mkldnn Related to Intel IDEEP or oneDNN (a.k.a. mkldnn) integration module: xpu Intel XPU related issues open source release notes: inductor (aoti) topic: not user facing topic category

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

6 participants