[Refactor][MLA]: Expose mla to torch.compile by morrison-turnansky · Pull Request #39346 · vllm-project/vllm

morrison-turnansky · 2026-04-08T20:36:24Z

Expose MLA to torch.compile by splitting up the custom op.

Purpose

Test Plan

I didn't see a good way to test this other than an end-to-end test, which compares the exposed path against the un-exposed path, i.e. the current default implementation. Locally I also ran in eager mode and with stock torch.compile, both with exact matches, but I didn't want to put a lot of long-running tests in the repo.

Test Result

Tests are passing. I also did some local benchmarking. Initial results showed a slowdown, most likely due to the additional graph breaks introduced by splitting the custom op. If there is interest, we can expose slightly fewer operations to remove some of the graph breaks and close the performance gap.


gemini-code-assist

Code Review

This pull request implements the VLLM_MLA_EXPOSED_SPLIT feature to expose MLA prefill/decode batch splitting to torch.compile for improved fusion. It introduces several custom operators and modifies the compilation configuration and partition rules to support data-dependent batch sizes. Review feedback identifies a potential crash in the piecewise backend during AOT compilation for empty subgraphs, a missing check for null output_shape in the MLA forward pass, and a suggestion to use torch.cat for better optimization during tensor concatenation.

gemini-code-assist · 2026-04-08T20:40:06Z

+            assert self.graph is not None, "Eager fallback requires FX graph."
+            return self.graph(*args)


This assertion and subsequent call to self.graph will cause a crash when the model is loaded from an AOT compilation cache (where self.graph is None) and encounters an empty split subgraph (e.g., an all-prefill or all-decode microbatch). In AOT mode, the backend should handle shape 0 by either having a pre-compiled runnable or by returning appropriate empty/zero tensors without falling back to the FX graph module.

morrison-turnansky · 2026-04-09

My understanding is that 0 is not one of the standard compile sizes used for AOT compilation. If 0 is not included in the AOT-compiled graph, we would get a crash later anyway. Failing fast seems like a safer option than forcing users to do something non-standard.
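The trade-off discussed in this review thread can be illustrated with a small, torch-free sketch. `PiecewiseRunner` and its `compiled` table are hypothetical stand-ins and not vLLM's actual piecewise backend; only the assertion message is taken from the diff. The sketch assumes that a runner loaded from an AOT cache has no FX graph (`graph=None`), so a batch size without a precompiled entry, such as 0, fails immediately instead of later.

```python
from typing import Any, Callable, Optional


class PiecewiseRunner:
    """Hypothetical stand-in for a piecewise backend (not vLLM's class)."""

    def __init__(
        self,
        compiled: dict[int, Callable[..., Any]],
        graph: Optional[Callable[..., Any]],
    ) -> None:
        # Maps a batch size to a precompiled runnable.
        self.compiled = compiled
        # Eager FX graph; assumed to be None when loaded from an AOT cache.
        self.graph = graph

    def __call__(self, batch_size: int, *args: Any) -> Any:
        runnable = self.compiled.get(batch_size)
        if runnable is not None:
            return runnable(*args)
        # No precompiled entry for this size (e.g. an empty split subgraph):
        # fall back to eager, and fail fast if no graph is available.
        assert self.graph is not None, "Eager fallback requires FX graph."
        return self.graph(*args)
```

With a graph present, a call such as `PiecewiseRunner({8: f}, graph=g)(0, x)` falls back to `g`; with `graph=None`, the same call raises `AssertionError` at the point of the missing size.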

mergify · 2026-04-08T20:43:10Z

Hi @morrison-turnansky, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

mergify · 2026-04-10T08:40:49Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @morrison-turnansky.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

morrison-turnansky · 2026-04-10T15:05:15Z

current performance:

vllm serve deepseek-ai/DeepSeek-V2-Lite \
--trust-remote-code \
-cc.mode=VLLM_COMPILE \
-cc.cudagraph_mode=FULL_AND_PIECEWISE \
-cc.use_inductor_graph_partition=true \
--profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile"}'

vllm bench serve --backend vllm --model deepseek-ai/DeepSeek-V2-Lite --dataset-name sharegpt --num-prompts 128 --profile --dataset-path /workspaces/vllm-dev/vllm/sharegpt.json

profiler_out_0.txt

============ Serving Benchmark Result ============

Successful requests: 128
Failed requests: 0
Benchmark duration (s): 9.83
Total input tokens: 31879
Total generated tokens: 27379
Request throughput (req/s): 13.02
Output token throughput (tok/s): 2784.11
Peak output token throughput (tok/s): 6243.00
Peak concurrent requests: 128.00
Total token throughput (tok/s): 6025.82
---------------Time to First Token----------------
Mean TTFT (ms): 543.37
Median TTFT (ms): 625.09
P99 TTFT (ms): 645.95
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 17.00
Median TPOT (ms): 12.66
P99 TPOT (ms): 86.88
---------------Inter-token Latency----------------
Mean ITL (ms): 11.92
Median ITL (ms): 11.58
P99 ITL (ms): 23.41

==================================================

morrison-turnansky · 2026-04-10T15:06:18Z

PR performance:

For repro:
VLLM_MLA_EXPOSED_SPLIT=1 \
vllm serve deepseek-ai/DeepSeek-V2-Lite \
--trust-remote-code \
-cc.mode=VLLM_COMPILE \
-cc.cudagraph_mode=FULL_AND_PIECEWISE \
-cc.use_inductor_graph_partition=true \
--profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile"}'

vllm bench serve --backend vllm --model deepseek-ai/DeepSeek-V2-Lite --dataset-name sharegpt --num-prompts 128 --profile --dataset-path /workspaces/vllm-dev/vllm/sharegpt.json

profiler_out_0.txt

============ Serving Benchmark Result ============

Successful requests: 128
Failed requests: 0
Benchmark duration (s): 15.27
Total input tokens: 31879
Total generated tokens: 26575
Request throughput (req/s): 8.38
Output token throughput (tok/s): 1740.69
Peak output token throughput (tok/s): 6028.00
Peak concurrent requests: 128.00
Total token throughput (tok/s): 3828.79
---------------Time to First Token----------------
Mean TTFT (ms): 3135.06
Median TTFT (ms): 3175.92
P99 TTFT (ms): 3304.33
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 37.26
Median TPOT (ms): 24.28
P99 TPOT (ms): 129.87
---------------Inter-token Latency----------------
Mean ITL (ms): 21.72
Median ITL (ms): 11.85
P99 ITL (ms): 28.04

==================================================
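To make the gap between the two runs easier to read, the short script below computes the ratio of the PR run to the baseline run for a few of the reported metrics. The numbers are copied from the two benchmark outputs above; the dictionary keys and the `ratio` helper are names chosen here for illustration. Note that the two runs generated slightly different token counts (27379 vs 26575), so the ratios are only a rough comparison.

```python
# Metrics copied from the two "Serving Benchmark Result" blocks above.
baseline = {
    "duration_s": 9.83,
    "req_per_s": 13.02,
    "mean_ttft_ms": 543.37,
    "mean_tpot_ms": 17.00,
    "median_itl_ms": 11.58,
}
exposed = {
    "duration_s": 15.27,
    "req_per_s": 8.38,
    "mean_ttft_ms": 3135.06,
    "mean_tpot_ms": 37.26,
    "median_itl_ms": 11.85,
}


def ratio(metric: str) -> float:
    """Return the PR-run value divided by the baseline value."""
    return exposed[metric] / baseline[metric]


for metric in baseline:
    print(f"{metric}: {ratio(metric):.2f}x")
```

On these numbers the benchmark duration is roughly 1.55x the baseline and mean TTFT is roughly 5.8x, while median ITL is nearly unchanged at about 1.02x.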

mergify · 2026-04-15T13:47:55Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @morrison-turnansky.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


mergify · 2026-04-15T20:42:08Z

The pre-commit checks have failed again; the instructions are the same as in the first mergify comment above.

mergify · 2026-04-16T10:34:16Z

The pre-commit checks have failed again; the instructions are the same as in the first mergify comment above.


mergify · 2026-04-22T19:08:07Z

The pre-commit checks have failed again; the instructions are the same as in the first mergify comment above.


mergify · 2026-04-27T11:17:04Z

The pre-commit checks have failed again; the instructions are the same as in the first mergify comment above.

mergify · 2026-04-30T14:33:05Z

The pre-commit checks have failed again; the instructions are the same as in the first mergify comment above.

ProExpertProg · 2026-05-01T12:51:37Z

Btw I had this idea on how to unify the exposed/non-exposed paths, using the wrap_if_exposed decorator:

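from functools import wraps

import torch


def wrap_if_exposed(op_name: str):
    """Route a method through the custom op `op_name` when the layer is exposed."""
    def decorator(func):
        # could optionally register the custom op automatically here as well,
        # but that might be more effort than is worth

        @wraps(func)
        def wrapper(self: "MLAAttention", *args, **kwargs):
            if not self.exposed:
                # Non-exposed path: call the method directly.
                return func(self, *args, **kwargs)
            # Exposed path: dispatch to the per-method custom op by name.
            return getattr(torch.ops.vllm, op_name)(self.layer_name, *args, **kwargs)
        return wrapper
    return decorator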


class MLAAttention(...):
    def __init__(self):
        self.exposed = ...
        ...

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.exposed:
            return self.inner_forward(x)
        return torch.ops.vllm.mla_attention(self.layer_name, x)

    def inner_forward(self, x: torch.Tensor) -> torch.Tensor:
        """Inner forward with decomposed method calls."""
        num_prefills = self.split_batch()

        output = torch.empty_like(x)
        output[num_prefills:] = self.forward_mha(x[num_prefills:])
        output[:num_prefills] = self.forward_mqa(x[:num_prefills])

        return output

    def forward_mqa(self, x: torch.Tensor) -> torch.Tensor:
        ...

    @wrap_if_exposed("mla_forward_mha")
    def forward_mha(self, x: torch.Tensor) -> torch.Tensor:
        ...
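The dispatch idea in the snippet above can be tried out without torch or vLLM. The following is a minimal sketch under stated assumptions: `OP_REGISTRY` is a plain dictionary standing in for `torch.ops.vllm`, `LAYERS` is a hypothetical name-to-layer lookup, `ToyAttention` and the op name `toy_forward_mha` are invented for illustration, and the decorator registers the op automatically, which the original comment only mentions as an option.

```python
from functools import wraps
from typing import Callable

# Hypothetical stand-ins: a dict instead of torch.ops.vllm, and a
# name -> layer table so an op can find its layer from layer_name.
OP_REGISTRY: dict[str, Callable] = {}
LAYERS: dict[str, "ToyAttention"] = {}
OP_CALLS: list[str] = []  # records which ops were dispatched


def wrap_if_exposed(op_name: str):
    def decorator(func):
        def op(layer_name: str, *args, **kwargs):
            # The "custom op" looks the layer up by name and runs the body.
            OP_CALLS.append(op_name)
            return func(LAYERS[layer_name], *args, **kwargs)

        OP_REGISTRY[op_name] = op

        @wraps(func)
        def wrapper(self, *args, **kwargs):
            if not self.exposed:
                # Non-exposed path: call the method directly.
                return func(self, *args, **kwargs)
            # Exposed path: go through the registered op by name.
            return OP_REGISTRY[op_name](self.layer_name, *args, **kwargs)

        return wrapper

    return decorator


class ToyAttention:
    def __init__(self, layer_name: str, exposed: bool) -> None:
        self.layer_name = layer_name
        self.exposed = exposed
        LAYERS[layer_name] = self

    @wrap_if_exposed("toy_forward_mha")
    def forward_mha(self, x: list[int]) -> list[int]:
        # Placeholder computation in place of real attention.
        return [v * 2 for v in x]
```

Calling `forward_mha` on a layer built with `exposed=True` goes through `OP_REGISTRY` and appends to `OP_CALLS`, while a layer built with `exposed=False` runs the same body directly; both return the same result, which is the property the decorator is meant to preserve.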

mergify · 2026-05-04T15:43:11Z

The pre-commit checks have failed again; the instructions are the same as in the first mergify comment above.

mergify · 2026-05-23T08:07:07Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @morrison-turnansky.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

morrison-turnansky requested review from BoyuanFeng, LucasWilkinson, MatthewBonanni, ProExpertProg, WoosukKwon, hmellor, houseroad, mgoin, robertgshaw2-redhat, tlrmchlsmth, vadiklyutiy, yewentao256, youkaichao and zou3519 as code owners April 8, 2026 20:36

mergify Bot added the v1 label Apr 8, 2026

gemini-code-assist Bot reviewed Apr 8, 2026


ProExpertProg added the verified label (Run pre-commit for new contributors without triggering other tests) Apr 8, 2026

mergify Bot added the needs-rebase label Apr 10, 2026

morrison-turnansky force-pushed the issue-34823-mla-custom-op-unwrap-unoptimized branch from 8f98f9e to f8325db Compare April 10, 2026 15:14

mergify Bot removed the needs-rebase label Apr 10, 2026

parsshar-RH force-pushed the issue-34823-mla-custom-op-unwrap-unoptimized branch from 737a685 to 0d7149d Compare April 13, 2026 13:10

mergify Bot added the needs-rebase label Apr 15, 2026

therealnaveenkamal added 3 commits April 15, 2026 19:36

compile split - code upgraded

6d557c3

Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>

added test

7442aef

Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>

works now

fa7d6f9

Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>

morrison-turnansky force-pushed the issue-34823-mla-custom-op-unwrap-unoptimized branch from 0d7149d to 682be0f Compare April 15, 2026 19:40

parsshar-RH force-pushed the issue-34823-mla-custom-op-unwrap-unoptimized branch from 682be0f to 2d2d000 Compare April 16, 2026 10:29

mergify Bot removed the needs-rebase label Apr 16, 2026

morrison-turnansky force-pushed the issue-34823-mla-custom-op-unwrap-unoptimized branch from 2d2d000 to d53f813 Compare April 16, 2026 15:48

parsshar-RH and others added 10 commits April 21, 2026 12:07

replace torch.empty+copy_ with torch.cat

042a332

Signed-off-by: parsshar-RH <parsshar@redhat.com>

reverted turning off combo kernel from original pr, reviwer comments

5f91caf

Signed-off-by: morrison-turnansky <mturnans@redhat.com>

updated path for has context

7f2c022

Signed-off-by: morrison-turnansky <mturnans@redhat.com>

cleaned up after rebase

5668290

Signed-off-by: morrison-turnansky <mturnans@redhat.com>

changed mla_merge_prefill_decode_output to standard torch compile and marked cuda graph unsafe

3bd6bb1

Signed-off-by: morrison-turnansky <mturnans@redhat.com>

convert mla_split_batch to torch.library API with SymInt and make v_up_proj functional to eliminate temp buffer

1101665

Signed-off-by: parsshar-RH <parsshar@redhat.com>

allowed unbacked sym int to be compiled with torch check hints

5e020b0

Signed-off-by: morrison-turnansky <mturnans@redhat.com>

added dynamic batching to decode avoided extra allocations for mla_attention_decode

f7af444

Signed-off-by: morrison-turnansky <mturnans@redhat.com>

cleanup variable names

f6d9eb5

Signed-off-by: morrison-turnansky <mturnans@redhat.com>

more optimizations

92d2f88

Signed-off-by: parsshar-RH <parsshar@redhat.com>

parsshar-RH force-pushed the issue-34823-mla-custom-op-unwrap-unoptimized branch from abb4d92 to 92d2f88 Compare April 21, 2026 12:07

removed merge custom op, remvoed copy in prefill

2fe1388

Signed-off-by: morrison-turnansky <mturnans@redhat.com>

ProExpertProg mentioned this pull request Apr 23, 2026

[Refactor][MLA]: Expose prefill/decode split to torch.compile #34823

Closed

reduce copy and allocation overhead

db97579

Signed-off-by: parsshar-RH <parsshar@redhat.com>

parsshar-RH force-pushed the issue-34823-mla-custom-op-unwrap-unoptimized branch from fc1955a to db97579 Compare April 27, 2026 11:10

parsshar-RH force-pushed the issue-34823-mla-custom-op-unwrap-unoptimized branch from e3185dc to db97579 Compare May 4, 2026 15:34

xaguilar-amd mentioned this pull request May 4, 2026

[Performance][MLA] Lift decode Q-prep (q-absorb + cat + FP8 quant) out of forward_impl #41568

Open

mergify Bot added the needs-rebase label May 23, 2026
