[torch.compile] Enable attention and allreduce fusion without custom ops enabled #24604
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
…g utils, fix DCE bug (vllm-project#23091), fix test (vllm-project#24376), and prep for custom op matching (vllm-project#24604) (vllm-project#24542)
```diff
-STATIC_FP4_QUANT_OP = torch.ops._C.scaled_fp4_quant.default
+if hasattr(torch.ops._C, "scaled_fp4_quant"):
+    STATIC_FP4_QUANT_OP = torch.ops._C.scaled_fp4_quant.default
```
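As a hedged sketch of why this guard works (the helper name is illustrative, not vLLM's API): `torch.ops` namespaces resolve op names lazily and raise `AttributeError` when the underlying extension never registered the op, so `hasattr` is a safe probe for optionally-built custom ops.

```python
import torch

def lookup_optional_op(namespace, name):
    """Return the .default overload of an op, or None if it is not registered."""
    # hasattr() returns False when the extension that registers the op
    # (e.g. vLLM's _C library with FP4 support) was not built.
    if hasattr(namespace, name):
        return getattr(namespace, name).default
    return None

# aten ops are always registered; a made-up name is not.
add_op = lookup_optional_op(torch.ops.aten, "add")
missing = lookup_optional_op(torch.ops.aten, "surely_not_a_real_op")
```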
Why did this work before and not now? Should we change how this is registered?
Oh yeah I think the registration might have been fixed.
I think I'll punt this to a follow-up PR; in general, these ops should be cleaned up.
```diff
     )
 else:
-    scale = torch.empty(1, device=input.device, dtype=torch.float32)
+    scale = torch.empty((1, 1), device=input.device, dtype=torch.float32)
```
Why is this needed? AFAIK this tensor is just a scalar to the kernel
Needed for custom matching to work; (1, 1) is still just one element.
Can you elaborate? It's suspicious that it needs this change.
The native implementation returns (1, 1), so this just makes them consistent. I don't remember exactly what I was running into.
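A hedged sketch of the consistency point (the helper is illustrative, not vLLM's actual code): pattern matching compares the traced graph of the custom-op path against the native path, so allocating the scale buffer as (1, 1) mirrors the native implementation's output shape and avoids spurious view/reshape nodes in one of the two graphs. Either way the buffer holds a single element.

```python
import torch

def make_scale(input: torch.Tensor) -> torch.Tensor:
    # (1, 1) instead of (1,): same single element, but shape-identical to
    # what the native implementation produces, so traced graphs line up.
    return torch.empty((1, 1), device=input.device, dtype=torch.float32)

scale = make_scale(torch.randn(4, 8))
```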
zou3519 left a comment:
My current understanding is that when we pattern match against the torch native implementation of a custom operator, we register a pattern in Inductor using that native implementation. I'm worried that this approach might be fragile. When the torch native implementation is passed through torch.compile, various graph passes can transform it, so by the time it reaches the post-grad phase (where vLLM’s pattern matching currently happens), the structure may look different.
For example, with rms_norm, it seems we’d need to modify the implementation in a non-trivial way to make it pattern match. I don't know if this is an issue in practice, but it suggests that this scheme could unintentionally constrain how custom operators need to be authored — in ways we might not fully understand yet.
It might be more robust to preserve the custom operator as-is (i.e., avoid decomposing it into torch native ops) and then perform pattern matching directly on the custom operator itself. That would make the process less sensitive to internal graph transformations.
I did see that you wanted this in for the release. Was there a specific reason? If we are turning on the allreduce+rmsnorm fusion by default, for example, then could the fusion instead imply "+rmsnorm"?
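The decomposition concern can be illustrated with a short trace (a sketch, not vLLM's actual rms_norm): a native RMSNorm written in plain torch traces to several small nodes (pow, mean, rsqrt, mul) rather than one opaque op, and later graph passes are free to rewrite those nodes before post-grad pattern matching runs.

```python
import torch
import torch.fx

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    # A typical native RMSNorm formulation used here for illustration.
    variance = x.pow(2).mean(-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight

# The traced graph contains pow/mean/rsqrt/mul nodes, not a single rms_norm op.
gm = torch.fx.symbolic_trace(rms_norm)
targets = {n.target for n in gm.graph.nodes}
```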
The reason this is needed is that it lets us do fusion without having to enable custom ops (`-O.custom_ops=["+quant_fp8"]`). Enabling custom ops leads to lost performance, as demonstrated in the PR description; that's because there are 4 quant ops per layer, one per matmul. I agree this is a somewhat fragile approach. I would be happy to work on a "lowering" approach where we preserve the high-level structure of ops until later. The downside is that it would require more work (I think), and we might lose access to optimizations that currently happen before our passes. But I think it wouldn't hurt Inductor in general to have a more explicit sense of converting between higher-level and lower-level representations (or we could just move where our custom passes happen). We can tie this work into "autotuning custom op implementations" as done in pytorch/pytorch#164212.
As discussed offline, we are going to proceed by merging this PR. After PTC, we will move our custom op matching passes to
View/slice noop eliminations were upstreamed to PyTorch, so I'm wondering if this is sufficient: pytorch/pytorch#151095, pytorch/pytorch#151175
@BoyuanFeng wouldn't that run after
Purpose
This PR enables matching the torch implementations of the custom ops QuantFP8 and RMSNorm. On main, fusion currently requires enabling custom ops, but they are slower than their torch counterparts, so the benefit of the custom fusion passes is reduced. We add a number of "matcher util" objects which can be called in patterns and automatically get traced to the same fx nodes as the custom op they correspond to, in both enabled and disabled form.
This PR also adds additional debugging utilities and adds E2E fusion tests to verify fusions happen in models end-to-end instead of just in unit tests.
Test Plan
Unit tests, added more fusion E2E tests.
Test Result
All tests pass.
Performance numbers
Below are B200 numbers (with FlashInfer) from `vllm bench serve`. We test the following regimes with corresponding additional arguments:

1. `none`: `-O.custom_ops='["none"]' -O.pass_config={"enable_fi_allreduce_fusion":false,"enable_attn_fusion":false,"enable_noop":true}`
2. `none_fusion_attention`: `-O.custom_ops='["none"]' -O.pass_config={"enable_fi_allreduce_fusion":false,"enable_attn_fusion":true,"enable_noop":true}`
3. `none_fusion_attention_allreduce`: `-O.custom_ops='["none"]' -O.pass_config={"enable_fi_allreduce_fusion":true,"enable_attn_fusion":true,"enable_noop":true}`
4. `rms_quant`: `-O.custom_ops='["none", "+quant_fp8", "+rms_norm"]' -O.pass_config={"enable_fi_allreduce_fusion":false,"enable_attn_fusion":false,"enable_noop":true}`
5. `rms_quant_fusion_attention`: `-O.custom_ops='["none", "+quant_fp8", "+rms_norm"]' -O.pass_config={"enable_fi_allreduce_fusion":false,"enable_attn_fusion":true,"enable_noop":true}`
6. `rms_quant_fusion_attention_allreduce`: `-O.custom_ops='["none", "+quant_fp8", "+rms_norm"]' -O.pass_config={"enable_fi_allreduce_fusion":true,"enable_attn_fusion":true,"enable_noop":true}`

Regimes 2 (`none_fusion_attention`) and 3 (`none_fusion_attention_allreduce`) are newly possible with this PR. On main, results are similar except those two are worse, as fusion cannot happen without custom ops enabled.

redhatai/meta-llama-3.1-70B-Instruct-FP8 (TP=1): Past QPS=10 the server is overloaded, so the latency spikes and becomes much more variable. Also note that allreduce fusion is a no-op for TP=1.
📊 TTFT Median (ms)
📊 TPOT Median (ms)
📊 ITL Median (ms)
redhatai/meta-llama-3.1-70B-Instruct-FP8 (TP=4): Note that allreduce fusion reduces TPOT at low QPS but increases it at high QPS, and increases TTFT across the board; this will be addressed in #24248 and #24252.
📊 TTFT Median (ms)
📊 TPOT Median (ms)
📊 ITL Median (ms)