
[AOTI][XPU] Support lazy Triton kernel compilation for cpp-wrapper on XPU#179239

Closed
etaf wants to merge 2 commits into gh/etaf/217/base from gh/etaf/217/head

Conversation

@etaf
Collaborator

@etaf etaf commented Apr 3, 2026

Stack from ghstack (oldest at bottom):

The lazy Triton kernel compilation feature (#175416) introduced
CUDA-specific assumptions that broke XPU cpp-wrapper codegen.

Fix the XPU incompatibilities:

- lazy_triton_compile.h: conditional include for XPU vs CUDA device
  headers; change the runTritonKernelWithAutotune stream param from
  cudaStream_t to void* (both pointer types convert implicitly)
- cpp_wrapper_gpu.py: use a device-appropriate pointer type for scratch
  allocations instead of the hardcoded CUdeviceptr
- sycl_runtime_wrappers.h: query threads_per_warp from the kernel
  object via compile_sub_group_size instead of requiring it as an
  external parameter
- Remove the threads_per_warp plumbing from triton_heuristics.py and
  cpp_wrapper_gpu.py codegen, since it is no longer needed

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

etaf added 2 commits April 2, 2026 20:08
[ghstack-poisoned]
[ghstack-poisoned]
etaf added a commit that referenced this pull request Apr 3, 2026
… XPU

The lazy Triton kernel compilation feature (#175416) introduced
CUDA-specific assumptions that broke XPU cpp-wrapper codegen.

Fix the XPU incompatibilities:
- lazy_triton_compile.h: conditional include for XPU vs CUDA device
  headers; change runTritonKernelWithAutotune stream param from
  cudaStream_t to void* (both pointer types convert implicitly)
- cpp_wrapper_gpu.py: use device-appropriate pointer type for scratch
  allocations instead of hardcoded CUdeviceptr
- sycl_runtime_wrappers.h: query threads_per_warp from the kernel
  object via compile_sub_group_size instead of requiring it as an
  external parameter, matching static_launcher/xpu.cpp
- Remove threads_per_warp plumbing from triton_heuristics.py and
  cpp_wrapper_gpu.py codegen since it is no longer needed


ghstack-source-id: 58de386
Pull-Request: #179239
@pytorch-bot

pytorch-bot Bot commented Apr 3, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/179239

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 275b274 with merge base 1dc5e2f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@etaf etaf added the ciflow/xpu Run XPU CI tasks label Apr 3, 2026
@etaf etaf added the topic: not user facing topic category label Apr 3, 2026
etaf added a commit that referenced this pull request Apr 3, 2026
… XPU

ghstack-source-id: 9bcb595
Pull-Request: #179239
@etaf etaf requested review from desertfire and jansel April 3, 2026 14:36
@etaf
Collaborator Author

etaf commented Apr 3, 2026

Hi @desertfire @jansel,

PR #175416 has caused a significant number of test failures on XPU, and this fix is needed to address them. Could you please help review this PR?

@etaf etaf added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 3, 2026
@etaf
Collaborator Author

etaf commented Apr 3, 2026

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

weifengpy pushed a commit that referenced this pull request Apr 7, 2026
… XPU (#179239)

Pull Request resolved: #179239
Approved by: https://github.com/desertfire
nklshy-aws pushed a commit to nklshy-aws/pytorch that referenced this pull request Apr 7, 2026
… XPU (pytorch#179239)

Pull Request resolved: pytorch#179239
Approved by: https://github.com/desertfire
pytorch-bot Bot pushed a commit that referenced this pull request Apr 10, 2026
… XPU (#179239)

Pull Request resolved: #179239
Approved by: https://github.com/desertfire
@github-actions github-actions Bot deleted the gh/etaf/217/head branch May 4, 2026 02:26