[7/n] Migrate pos_encoding and norm kernels to libtorch stable ABI #38783

Closed

mikaylagawarecki wants to merge 31 commits into vllm-project:main from mikaylagawarecki:new-stable-abi-phase7

mikaylagawarecki commented Apr 2, 2026

Contributor

Purpose

Stacked on #38757. Commits to review: https://github.com/vllm-project/vllm/pull/38783/changes/deea6618c38afb4735b442c61e2697c273654292..8754a4250584115db08113e0889313c939d85eb6

Note: some declarations remain in csrc/ops.h even though they have been moved to csrc/libtorch_stable/ops.h, because the CPU build also uses them. The affected declarations are:

Layernorm kernels: rms_norm, fused_add_rms_norm
Pos encoding kernels: rotary_embedding
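
For reviewers who want the intended semantics of the two layernorm ops in front of them, here is a minimal CPU reference sketch. It assumes the conventional RMSNorm definition and an in-place "add residual, then normalize" behaviour for the fused variant; the names rms_norm_ref and fused_add_rms_norm_ref are hypothetical and are not taken from this PR's kernels.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference RMSNorm for a single row (assumed conventional definition):
//   out[i] = x[i] / sqrt(mean(x^2) + epsilon) * weight[i]
std::vector<float> rms_norm_ref(const std::vector<float>& x,
                                const std::vector<float>& weight,
                                float epsilon) {
  assert(!x.empty() && x.size() == weight.size());
  float sum_sq = 0.0f;
  for (float v : x) sum_sq += v * v;
  const float inv_rms =
      1.0f / std::sqrt(sum_sq / static_cast<float>(x.size()) + epsilon);
  std::vector<float> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    out[i] = x[i] * inv_rms * weight[i];
  }
  return out;
}

// Assumed fused semantics: residual += input, then input = rms_norm(residual).
// Both arguments are updated in place.
void fused_add_rms_norm_ref(std::vector<float>& input,
                            std::vector<float>& residual,
                            const std::vector<float>& weight,
                            float epsilon) {
  assert(input.size() == residual.size());
  for (std::size_t i = 0; i < input.size(); ++i) residual[i] += input[i];
  input = rms_norm_ref(residual, weight, epsilon);
}
```

A reference like this is only useful for comparing against kernel output within a tolerance; it says nothing about how the stable-ABI kernels are implemented.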

Test Plan

pytest tests/kernels/core/test_pos_encoding.py
pytest tests/kernels/core/test_fused_qk_norm_rope.py
pytest tests/kernels/core/test_layernorm.py
pytest tests/kernels/core/test_fused_quant_layernorm.py

Test Result

[Screenshot: 2026-04-02, 2:45:33 PM]

[Screenshot: 2026-04-02, 2:46:02 PM]

mergify Bot added ci/build nvidia rocm labels

github-project-automation Bot added this to NVIDIA and AMD

github-project-automation Bot moved this to Todo in AMD

gemini-code-assist Bot reviewed

gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a new _C_stable_libtorch extension to support a stable ABI, enabling better compatibility across different PyTorch versions and environments. It refactors several core kernels and quantization operations to use this stable ABI, including layernorm, positional encoding, and various quantization kernels. Additionally, it enables this stable extension for both CUDA and HIP backends. I have identified a potential compilation issue where the hadacore_transform declaration is placed outside the appropriate conditional compilation block, which may cause build failures on non-CUDA backends.

csrc/libtorch_stable/ops.h

Comment on lines +158 to +159

    
    torch::stable::Tensor hadacore_transform(torch::stable::Tensor& x,
                                             bool inplace);

gemini-code-assist Bot Apr 2, 2026

Contributor

The hadacore_transform function is compiled only for CUDA, but its declaration is outside the #ifdef VLLM_CUDA block. This will lead to compilation errors when building for other backends like ROCm/HIP. This declaration should be moved inside the #ifdef VLLM_CUDA block, before the #endif on line 156.

mikaylagawarecki Apr 2, 2026

Contributor Author

This was pre-existing before this stack; see:

vllm/csrc/ops.h

Line 296 in 08ed2b9

torch::Tensor hadacore_transform(torch::Tensor& x, bool inplace);
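
For readers following this thread, the guard placement under discussion can be sketched in plain C++ as follows. This is only an illustration: DEMO_CUDA_BACKEND is a hypothetical stand-in for whatever macro the build actually uses, and the *_demo functions are placeholders rather than the real ops.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the build-system macro that marks a CUDA build.
#define DEMO_CUDA_BACKEND 1

// Built for every backend.
std::string rms_norm_demo() { return "rms_norm"; }

#ifdef DEMO_CUDA_BACKEND
// Declaration and definition live under the same guard, so a build without
// the macro never sees this symbol at all.
std::string hadacore_transform_demo() { return "hadacore_transform"; }
#endif

// Registration repeats the guard, so the op list matches what was compiled.
std::vector<std::string> registered_ops() {
  std::vector<std::string> ops{rms_norm_demo()};
#ifdef DEMO_CUDA_BACKEND
  ops.push_back(hadacore_transform_demo());
#endif
  return ops;
}
```

With the macro defined, registered_ops() returns both names; removing the #define leaves only the backend-independent op.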

mikaylagawarecki force-pushed the new-stable-abi-phase7 branch 2 times, most recently from 30e40eb to 59af75d

April 2, 2026 03:57

mergify Bot commented Apr 2, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mikaylagawarecki.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify Bot added the needs-rebase label

mikaylagawarecki added 16 commits

April 2, 2026 08:58


          Move CUTLASS MLA files from csrc to csrc/libtorch_stable

bfe9697

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>


          [a/n] Migrate CUTLASS MLA to torch stable ABI

e3606eb

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>


          Move Hadamard files from csrc to csrc/libtorch_stable

ec7ca63

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>


          [b/n] Migrate Hadamard (hadacore) kernel to torch stable ABI

364e676

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>


          Move AWQ files from csrc to csrc/libtorch_stable

f0b7eee

Pure move, no code changes. Preparatory step for stable ABI migration.

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>


          [c/n] Migrate AWQ kernels to torch stable ABI

3500eae

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>


          Move DSV3 fused A GEMM from csrc to csrc/libtorch_stable

5e1b090

Pure move, no code changes. Preparatory step for stable ABI migration.

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>


          [d/n] Migrate DSV3 fused A GEMM to torch stable ABI

ec7793c

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>


          Move AllSpark files from csrc to csrc/libtorch_stable

a9b1ed2

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>


          [e/n] Migrate AllSpark kernels to torch stable ABI

2c13410

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>


          Enable _C_stable_libtorch for ROCm (HIP)

c41a8f8

Restructure the stable ABI extension build so it compiles on both CUDA
and HIP:
- Widen outer guard to include HIP
- Move CUDA-only sources (CUTLASS, FP4, AWQ, permute_cols) into
  a CUDA-conditional block
- Gate USE_CUDA / CUTLASS_ENABLE_DIRECT_CUDA_DRIVER_CALL to CUDA;
  define USE_ROCM for HIP
- Link PyTorch's bundled libamdhip64.so on ROCm to avoid a dual HIP
  runtime (from 985769a)
- Enable _C_stable_libtorch in setup.py for HIP builds

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>


          Move activation kernel file from csrc to csrc/libtorch_stable

f64dd26

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>


          [f/n] Migrate activation kernels to torch stable ABI

690ee02

Move 9 basic activation ops (silu_and_mul, mul_and_silu, gelu_and_mul,
gelu_tanh_and_mul, fatrelu_and_mul, swigluoai_and_mul, gelu_new,
gelu_fast, gelu_quick) from the _C extension to _C_stable_libtorch.

Convert ATen types/APIs to stable ABI equivalents:
- torch::Tensor -> torch::stable::Tensor
- ATen device guard/stream -> stable accelerator APIs
- VLLM_DISPATCH_FLOATING_TYPES -> VLLM_STABLE_DISPATCH_FLOATING_TYPES
- data_ptr -> mutable_data_ptr

Quantized activation ops (silu_and_mul_quant,
persistent_masked_m_silu_mul_quant) remain in _C.

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
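
As a reading aid for the activation ops named in this commit, here is a small CPU reference sketch of silu_and_mul. It assumes the usual gated-activation layout, in which each input row of length 2*d is split into two halves and the output has length d; silu_and_mul_ref is a hypothetical name, not a function from this PR.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// SiLU (also called swish): x * sigmoid(x).
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Reference for one row of length 2*d (assumed layout):
//   out[i] = silu(x[i]) * x[d + i]   for i in [0, d)
std::vector<float> silu_and_mul_ref(const std::vector<float>& x) {
  assert(x.size() % 2 == 0);
  const std::size_t d = x.size() / 2;
  std::vector<float> out(d);
  for (std::size_t i = 0; i < d; ++i) {
    out[i] = silu(x[i]) * x[d + i];
  }
  return out;
}
```

The other *_and_mul ops listed above presumably follow the same split-and-gate shape with a different activation, but that is an assumption rather than something this commit message states.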


          Move INT8 quant kernel file from csrc to csrc/libtorch_stable

66206be

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>


          [g/n] Migrate INT8 quant kernels to torch stable ABI

d0cf841

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>


          Move FP8 quant kernel file from csrc to csrc/libtorch_stable

2fa10a1

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>

mikaylagawarecki force-pushed the new-stable-abi-phase7 branch from 59af75d to 4b5c459

April 2, 2026 16:27

mergify Bot removed the needs-rebase label

mikaylagawarecki commented

csrc/libtorch_stable/quantization/fused_kernels/fused_layernorm_dynamic_per_token_quant.cu Outdated

mikaylagawarecki added 5 commits

April 2, 2026 11:39


          Move layernorm kernel file from csrc to csrc/libtorch_stable

34d59e7

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>


          [m/n] Migrate layernorm kernels (rms_norm, fused_add_rms_norm) to torch stable ABI

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>


          Move layernorm quant kernel file from csrc to csrc/libtorch_stable

810523f

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>


          [n/n] Migrate layernorm quant kernels (rms_norm_static_fp8_quant, fused_add_rms_norm_static_fp8_quant) to torch stable ABI

aafa91b

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>


          Move fused layernorm dynamic per-token quant files from csrc to csrc/libtorch_stable

3fa5976

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>

mikaylagawarecki changed the title ~~[7/n] libtorch stable ABI~~ [7/n] Migrate pos_encoding and norm kernels to libtorch stable ABI


          Migrate fused_layernorm_dynamic_per_token_quant to torch stable ABI

8754a42

Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>

mikaylagawarecki force-pushed the new-stable-abi-phase7 branch from 11661d8 to 8754a42

April 2, 2026 18:53

mikaylagawarecki mentioned this pull request

[8/n] Migrate merge_attn_states, mamba, sampler to torch stable ABI #38841

Closed

5 tasks

janeyx99 reviewed

csrc/ops.h

		@@ -91,12 +91,6 @@ void rms_norm(torch::Tensor& out, torch::Tensor& input, torch::Tensor& weight,
		void fused_add_rms_norm(torch::Tensor& input, torch::Tensor& residual,

janeyx99 Apr 2, 2026

Contributor

is this used by cpu too?

mikaylagawarecki Apr 2, 2026

Contributor Author

yep

vllm/csrc/cpu/torch_bindings.cpp

Lines 189 to 192 in 188defb

    
    ops.def(
        "fused_add_rms_norm(Tensor! input, Tensor! residual, Tensor weight, "
        "float epsilon) -> ()");
    ops.impl("fused_add_rms_norm", torch::kCPU, &fused_add_rms_norm);

janeyx99 reviewed

csrc/type_convert.cuh

+              #include <torch/headeronly/util/Half.h>
               #ifndef USE_ROCM
+                #include <cuda.h>

janeyx99 Apr 2, 2026

Contributor

why do we need this?

mikaylagawarecki Apr 2, 2026

Contributor Author

I think torch/all.h or some other torch include used to pull this in, but now we need to include it explicitly for CUDA_VERSION, which is used below on line 50.
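
One reason a dropped transitive include like this can be easy to miss: in a preprocessor #if, an identifier that is not defined as a macro evaluates to 0 instead of raising an error. The sketch below shows that general C++ behaviour. DEMO_TOOLKIT_VERSION is a hypothetical stand-in for CUDA_VERSION, and this is not a claim about how type_convert.cuh is actually written.

```cpp
#include <string>

// DEMO_TOOLKIT_VERSION is deliberately left undefined, as if the header that
// defines it were no longer included. In #if, an undefined identifier is
// replaced by 0, so the comparison below is simply false.
std::string selected_path() {
#if DEMO_TOOLKIT_VERSION >= 12000
  return "new-toolkit path";
#else
  return "fallback path";
#endif
}
```

Here selected_path() takes the fallback branch without any diagnostic unless a warning such as -Wundef is enabled, which is why including the defining header explicitly is the safer choice.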

janeyx99 reviewed

CMakeLists.txt

+                  "csrc/libtorch_stable/fused_qknorm_rope_kernel.cu"
+                  "csrc/libtorch_stable/layernorm_kernels.cu"
+                  "csrc/libtorch_stable/layernorm_quant_kernels.cu"
+                  "csrc/libtorch_stable/quantization/fused_kernels/fused_layernorm_dynamic_per_token_quant.cu")

janeyx99 Apr 2, 2026

Contributor

so cleannnn cleanest cmake change so far!

zou3519 Apr 3, 2026

Collaborator

😅

janeyx99 approved these changes

mikaylagawarecki marked this pull request as ready for review

April 3, 2026 15:09

mikaylagawarecki requested review from LucasWilkinson, tjtanaa and tlrmchlsmth as code owners

April 3, 2026 15:09

zou3519 approved these changes

github-project-automation Bot moved this to Ready in NVIDIA

mergify Bot commented Apr 8, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mikaylagawarecki.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify Bot added the needs-rebase label

stmcgovern added a commit to TorchedHat/pytorch-stable-abi-transform that referenced this pull request


          Expand transformation coverage based on vLLM PR #38783 analysis

19819e1

Gaps identified by comparing tool output against a real 1,858-line manual
migration PR (vllm-project/vllm#38783):

Rules.h:
- Add torch::k* scalar type shorthands (kFloat, kBFloat16, kInt8, kInt32, etc.)
- Add c10::/at:: scalar type rewrites (Half, BFloat16, Float8_e4m3fn, etc.)
- Add CUDA check macro rules (C10_CUDA_CHECK, AT_CUDA_CHECK, C10_CUDA_KERNEL_LAUNCH_CHECK)
- Add TORCH_CHECK_NOT_IMPLEMENTED → STD_TORCH_CHECK_NOT_IMPLEMENTED
- Add more method-to-free-function rules (sum, pad, new_zeros, permute, slice,
  index_select, repeat, expand)

AstCallbacks.cpp:
- Register new type names and scalar type shorthands in AST matchers
- Register new method names for method-to-free-function conversion

Verifier.cpp:
- Detect torch::k* shorthands as unstable
- Detect C10_CUDA_CHECK, AT_CUDA_CHECK, C10_CUDA_KERNEL_LAUNCH_CHECK
- Detect TORCH_CHECK_NOT_IMPLEMENTED
- Detect .dtype() usage (unstable caffe2::TypeMeta, use .scalar_type())
- Detect torch::TensorOptions (needs decomposition into explicit args)
- Detect at::Half, c10::Half, c10::BFloat16, c10::Float8_* types
- Detect at::elementSize (use tensor.element_size())

mergify Bot removed the needs-rebase label

mergify Bot commented May 18, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mikaylagawarecki.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify Bot added the needs-rebase label

cleonard530 mentioned this pull request

[7/n] Migrate pos_encoding and norm kernels to libtorch stable ABI (continued) #43209

Merged

5 tasks

Harry-Chen commented Jun 2, 2026

Member

Superseded by newer PRs.

Harry-Chen closed this

github-project-automation Bot moved this from Todo to Done in AMD

github-project-automation Bot moved this from Ready to Done in NVIDIA

Reviewers

zou3519: approved these changes

tjtanaa: awaiting requested review (code owner)

tlrmchlsmth: awaiting requested review (code owner)

LucasWilkinson: awaiting requested review (code owner)

+2 more reviewers

gemini-code-assist[bot]: left review comments

janeyx99: approved these changes

Labels

ci/build needs-rebase nvidia rocm