[8/n] Migrate merge_attn_states, mamba, sampler to torch stable ABI by mikaylagawarecki · Pull Request #38841 · vllm-project/vllm

mikaylagawarecki · 2026-04-02T19:32:13Z

Commits to review

https://github.com/vllm-project/vllm/pull/38841/changes/ea6c06bc84378e855ce82ff08f302346d5dc4983..af6ab01a4f5055230635a499fc328afc444f3dba

Purpose

Stacked on #38783

Test Plan

pytest tests/kernels/attention/test_merge_attn_states.py
pytest tests/kernels/test_top_k_per_row.py
pytest tests/kernels/test_apply_repetition_penalties.py
pytest tests/kernels/mamba/test_mamba_ssm.py

Test Result

gemini-code-assist

Code Review

This pull request introduces a new _C_stable_libtorch extension to support stable ABI, enabling cross-platform compatibility for custom operations. It includes significant refactoring of existing CUDA kernels to use torch::stable::Tensor and STD_TORCH_CHECK for stable ABI compliance, alongside updates to build configurations to support both CUDA and HIP. However, several critical issues were identified regarding const-correctness and in-place tensor modification logic. Specifically, multiple functions pass tensors as const references while attempting to obtain mutable pointers, and the in-place logic in hadacore_transform fails to correctly update the original storage when new storage is allocated.

gemini-code-assist · 2026-04-02T19:36:00Z

+    x = torch::stable::reshape(x, {-1, had_size});
+
    auto numel = x.numel();
    if (numel % 256 != 0) {
-        x = torch::nn::functional::pad(x, torch::nn::functional::PadFuncOptions({0, 0, 0, (256 - numel % 256) / had_size}));
+        x = torch::stable::pad(x, {0, 0, 0, (256 - numel % 256) / had_size});
    }
-    
+
    if (x.stride(-1) != 1) {
-        x = x.contiguous();
+        x = torch::stable::contiguous(x);


The in-place logic for hadacore_transform is broken when the input tensor x is reassigned due to reshape, pad, or contiguous. Reassigning the torch::stable::Tensor& x reference only updates the local handle and does not modify the original storage passed by the caller. If inplace is true, any operations that create new storage (like pad or contiguous) must eventually copy the result back to the original tensor's storage using torch::stable::copy_. Currently, the copy_ check at line 807 will always be false because out and x share the same handle after reassignment, meaning the original storage remains unchanged.
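
The failure mode described in this comment can be sketched without the torch headers. The snippet below is a minimal illustration, not the PR's code: `MockTensor`, `pad_to`, `scale_inplace_broken`, and `scale_inplace_fixed` are hypothetical names, and a `std::shared_ptr` stands in for tensor storage. It assumes only that a tensor handle shares storage when copied and that an op such as pad returns a handle to new storage.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a tensor handle: copying the handle shares storage.
struct MockTensor {
  std::shared_ptr<std::vector<float>> storage;
};

// Mimics an op that always allocates new storage (as a pad would).
MockTensor pad_to(const MockTensor& t, std::size_t n) {
  assert(t.storage && n >= t.storage->size());
  auto padded = std::make_shared<std::vector<float>>(*t.storage);
  padded->resize(n, 0.0f);
  return MockTensor{padded};
}

// Broken "in-place" op: after the reassignment, x points at new storage,
// so the caller's storage is never written.
void scale_inplace_broken(MockTensor x, std::size_t padded_size) {
  x = pad_to(x, padded_size);
  for (float& v : *x.storage) v *= 2.0f;
}

// Fixed variant: keep a handle to the original storage and copy the
// valid prefix of the result back into it.
void scale_inplace_fixed(MockTensor x, std::size_t padded_size) {
  const MockTensor out = x;  // still refers to the caller's storage
  x = pad_to(x, padded_size);
  for (float& v : *x.storage) v *= 2.0f;
  if (x.storage != out.storage) {
    const auto n = static_cast<std::ptrdiff_t>(out.storage->size());
    std::copy(x.storage->begin(), x.storage->begin() + n,
              out.storage->begin());
  }
}
```

In this sketch the copy-back is guarded by comparing the saved handle against the working handle, which only works because the original handle was saved before the first reassignment.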

pre-existing

gemini-code-assist · 2026-04-02T19:36:00Z

+void selective_scan_fwd(const torch::stable::Tensor &u, const torch::stable::Tensor &delta,
+                  const torch::stable::Tensor &A, const torch::stable::Tensor &B, const torch::stable::Tensor &C,
+                  const std::optional<torch::stable::Tensor> &D_,
+                  const std::optional<torch::stable::Tensor> &z_,
+                  const std::optional<torch::stable::Tensor> &delta_bias_,
                  bool delta_softplus,
-                  const std::optional<torch::Tensor> &query_start_loc,
-                  const std::optional<torch::Tensor> &cache_indices,
-                  const std::optional<torch::Tensor> &has_initial_state,
-                  const torch::Tensor &ssm_states,
+                  const std::optional<torch::stable::Tensor> &query_start_loc,
+                  const std::optional<torch::stable::Tensor> &cache_indices,
+                  const std::optional<torch::stable::Tensor> &has_initial_state,
+                  const torch::stable::Tensor &ssm_states,
                  // used to identify padding entries if cache_indices provided
                  // in case of padding, the kernel will return early
                  int64_t null_block_id,
                  int64_t block_size,
-                  const std::optional<torch::Tensor> &block_idx_first_scheduled_token,
-                  const std::optional<torch::Tensor> &block_idx_last_scheduled_token,
-                  const std::optional<torch::Tensor> &initial_state_idx,
-                  const std::optional<torch::Tensor> &cu_chunk_seqlen,
-                  const std::optional<torch::Tensor> &last_chunk_indices) {
+                  const std::optional<torch::stable::Tensor> &block_idx_first_scheduled_token,
+                  const std::optional<torch::stable::Tensor> &block_idx_last_scheduled_token,
+                  const std::optional<torch::stable::Tensor> &initial_state_idx,
+                  const std::optional<torch::stable::Tensor> &cu_chunk_seqlen,
+                  const std::optional<torch::stable::Tensor> &last_chunk_indices) {


The C++ signature for selective_scan_fwd uses const torch::stable::Tensor& for tensors that are actually modified in-place (like delta and ssm_states), as indicated by the Tensor! markers in ops.def and the use of const_cast to obtain mutable pointers later in the code (e.g., at lines 576-577). This violates const correctness and can lead to undefined behavior. These parameters should be passed as non-const references (torch::stable::Tensor&).
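
As a rough illustration of why a const signature can hide writes, the sketch below uses a hypothetical handle type whose accessor is const-qualified. `MockTensor`, `mutable_data`, `fill_through_const`, and `fill` are invented for this example; whether the accessors on `torch::stable::Tensor` are const-qualified is not established in this thread, so treat that as an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical handle type: constness of the handle does not extend
// to the storage it points at.
struct MockTensor {
  std::shared_ptr<std::vector<float>> storage;

  // Const-qualified accessor that still returns a writable pointer.
  float* mutable_data() const {
    assert(storage);
    return storage->data();
  }
};

// The signature reads as "input only", yet the body writes to the storage.
void fill_through_const(const MockTensor& t, float value) {
  float* p = t.mutable_data();
  for (std::size_t i = 0; i < t.storage->size(); ++i) p[i] = value;
}

// A non-const reference states the mutation in the signature itself.
void fill(MockTensor& t, float value) {
  float* p = t.mutable_data();
  for (std::size_t i = 0; i < t.storage->size(); ++i) p[i] = value;
}
```

Both functions behave identically here; the difference is only in what the signature tells a caller, which is the point of keeping the C++ parameter types consistent with the `Tensor!` markers in the schema.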

pre-existing

gemini-code-assist · 2026-04-02T19:36:00Z

+    torch::stable::Tensor const& b_qweight,
+    torch::stable::Tensor const& b_scales,
+    std::optional<torch::stable::Tensor> const& b_zeros, bool has_zp,
+    torch::stable::Tensor& b_qweight_reorder,
+    torch::stable::Tensor& b_scales_reorder,
+    std::optional<torch::stable::Tensor> const& b_zeros_reorder,
+    const int64_t K, const int64_t N, const int64_t N_32align) {


b_zeros_reorder is passed as a const reference, but mutable_data_ptr() is called on it at line 151. This is inconsistent and will likely fail to compile if mutable_data_ptr() is correctly implemented as a non-const member function. It should be passed as a non-const reference (std::optional<torch::stable::Tensor>&).

pre-existing

vllm/csrc/quantization/gptq_allspark/allspark_repack.cu

Line 105 in 4b506ff

std::optional<torch::Tensor> const& b_zeros_reorder, const int64_t K,

gemini-code-assist · 2026-04-02T19:36:00Z

+    torch::stable::Tensor const& out,
+    torch::stable::Tensor const& lse,
+    torch::stable::Tensor const& q_nope,
+    torch::stable::Tensor const& q_pe,
+    torch::stable::Tensor const& kv_c_and_k_pe_cache,
+    torch::stable::Tensor const& seq_lens,
+    torch::stable::Tensor const& page_table,
+    torch::stable::Tensor const& workspace,
    double sm_scale,


out is passed as a const reference, but it is the output tensor of the operation (marked as Tensor! out in ops.def). It should be passed as a non-const reference to allow obtaining a mutable data pointer for the kernel.

gemini-code-assist · 2026-04-02T19:36:00Z

+      "selective_scan_fwd(Tensor! u, Tensor! delta,"
+      "Tensor! A, Tensor! B, Tensor! C,"
+      "Tensor? D_, Tensor!? z_, Tensor? delta_bias_,"
+      "bool delta_softplus,"
+      "Tensor? query_start_loc,"
+      "Tensor? cache_indices,"
+      "Tensor? has_initial_state,"
+      "Tensor! ssm_states,"
+      "int null_block_id,"
+      "int block_size,"
+      "Tensor? block_idx_first_scheduled_token,"
+      "Tensor? block_idx_last_scheduled_token,"
+      "Tensor? initial_state_idx,"
+      "Tensor? cu_chunk_seqlen,"
+      "Tensor? last_chunk_indices) -> ()");


The ops.def for selective_scan_fwd incorrectly marks almost all input tensors as mutable (Tensor!). Only tensors that are actually modified by the kernel (like delta if used for output, and ssm_states) should have the ! suffix. Marking immutable inputs as mutable prevents PyTorch from performing certain optimizations and is misleading for users of the API.

pre-existing

mergify · 2026-04-03T03:33:47Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mikaylagawarecki.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork