[PERF] Allreduce fusion. Support torch native matching. Tuning of the thresholds #24248
This pull request has merge conflicts that must be resolved before it can be
Code Review
This pull request is a significant enhancement to the all-reduce fusion capabilities, adding support for matching native PyTorch operations in addition to custom ops. This greatly improves usability and performance flexibility. The introduction of a comprehensive benchmark for tuning fusion thresholds is also a valuable addition. The changes are extensive, particularly with the large number of new fusion patterns in vllm/compilation/collective_fusion.py. While the overall approach is sound, I've identified several critical issues in the implementation of these new patterns. Specifically, the return values from some pattern and replacement functions appear to be incorrect, which could lead to fusion failures or incorrect model outputs. I've provided detailed comments and suggestions for these issues. The configuration updates and the new benchmark script are well-implemented and welcome improvements.
The return values from the replacement function are incorrect. The pattern returns (rms_output, allreduce_output), which correspond to the normalized output and the all-reduced tensor. The replacement function should return the same structure.
auto_functionalized(flashinfer_trtllm_fused_allreduce_norm, ...) returns a tuple of 5 mutated arguments: (allreduce_in, residual, norm_out, quant_out, scale_out).
The rms_result corresponds to norm_out, which is allreduce[2].
The allreduce_in (which is input to the replacement function) corresponds to allreduce[0].
Therefore, the return statement should be return allreduce[2], allreduce[0].
The current code returns allreduce[3], allreduce[1], which corresponds to (quant_out, residual). This is incorrect and will lead to fusion failures or wrong results.
Suggested change:
    return allreduce[3], allreduce[1]    (current)
    return allreduce[2], allreduce[0]    (suggested)
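The index reasoning in this review comment can be made concrete with a small stand-in. The sketch below is plain Python, not vLLM or PyTorch code: the `FusedAllreduceNormOutputs` type and the `replacement_outputs` helper are hypothetical, and the tuple layout is taken from the review comment as an assumption. It is worth confirming that layout against the actual op schema before relying on either set of indices.

```python
from typing import NamedTuple


class FusedAllreduceNormOutputs(NamedTuple):
    """Hypothetical names for the tuple layout the review comment describes."""

    allreduce_in: str
    residual: str
    norm_out: str
    quant_out: str
    scale_out: str


# Strings stand in for tensors so that each slot is easy to identify.
allreduce = FusedAllreduceNormOutputs(
    "allreduce_in", "residual", "norm_out", "quant_out", "scale_out"
)


def replacement_outputs(outputs: FusedAllreduceNormOutputs) -> tuple[str, str]:
    # The pattern returns (rms_output, allreduce_output), so under the assumed
    # layout the replacement picks norm_out (index 2) and allreduce_in (index 0).
    return outputs[2], outputs[0]


print(replacement_outputs(allreduce))
print((allreduce[3], allreduce[1]))  # what the indices 3 and 1 would select
```

Naming the slots this way makes a mismatch between the pattern and the replacement visible at a glance, which is harder to see with bare integer indices.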
The return values from the replacement function are incorrect. The pattern returns (rms_output, rms_residual), which are the normalized output and the residual output. The replacement function should return the same structure.
When norm_out=None is passed to flashinfer_trtllm_fused_allreduce_norm, the allreduce_in tensor is used as the output buffer for the normalization result and is mutated. auto_functionalized will return a tuple where the first element (allreduce[0]) is the mutated allreduce_in (i.e., norm_out), and the second element (allreduce[1]) is the mutated residual.
Therefore, the correct return should be return allreduce[0], allreduce[1].
The current code returns allreduce[1], allreduce[2], which corresponds to (residual, norm_out). Since norm_out is None in the call, this is incorrect.
Suggested change:
    return allreduce[1], allreduce[2]    (current)
    return allreduce[0], allreduce[1]    (suggested)
Just curious: why is the threshold still so low for TP8? I think AR+Norm should have pretty good perf up to some larger message sizes for TP8?
vllm/config/compilation.py
@nvpohanh Here are the results for TP=8 on Blackwell with torch symm mem enabled (VLLM_ALLREDUCE_USE_SYMM_MEM=1); see the first set of results below. I compared against the best-performing alternative to fused allreduce. We could make the threshold conditional on whether symm mem is available and enabled, but in my opinion that would overcomplicate the configuration. Compared to the default allreduce, the flashinfer fused alternative is not significantly better in the 4-16 MB region (see the results at the end).
Symm mem enabled
World Size: 8
Hidden Dimension: 8192
Warmup Iterations: 5
Benchmark Trials: 20
Quantization Mode: none
Configuration: seq_len=32, dtype=bfloat16, no residual
Input Size: 0.50 MB
| Operation | Time (ms) | Speedup |
|---|---|---|
| Standard Allreduce Rmsnorm | 0.029 | 1.00x |
| Standard Allreduce Rmsnorm Native Compiled | 0.030 | 0.99x |
| Flashinfer Fused Allreduce Rmsnorm Oneshot | 0.012 | 2.39x |
| Flashinfer Fused Allreduce Rmsnorm Twoshot | 0.086 | 0.34x |
Configuration: seq_len=64, dtype=bfloat16, no residual
Input Size: 1.00 MB
| Operation | Time (ms) | Speedup |
|---|---|---|
| Standard Allreduce Rmsnorm | 0.030 | 1.00x |
| Standard Allreduce Rmsnorm Native Compiled | 0.030 | 0.99x |
| Flashinfer Fused Allreduce Rmsnorm Oneshot | 0.018 | 1.62x |
| Flashinfer Fused Allreduce Rmsnorm Twoshot | 0.056 | 0.54x |
Configuration: seq_len=128, dtype=bfloat16, no residual
Input Size: 2.00 MB
| Operation | Time (ms) | Speedup |
|---|---|---|
| Standard Allreduce Rmsnorm | 0.023 | 1.00x |
| Standard Allreduce Rmsnorm Native Compiled | 0.024 | 0.99x |
| Flashinfer Fused Allreduce Rmsnorm Oneshot | 0.033 | 0.71x |
| Flashinfer Fused Allreduce Rmsnorm Twoshot | 0.052 | 0.45x |
Configuration: seq_len=256, dtype=bfloat16, no residual
Input Size: 4.00 MB
| Operation | Time (ms) | Speedup |
|---|---|---|
| Standard Allreduce Rmsnorm | 0.031 | 0.97x |
| Standard Allreduce Rmsnorm Native Compiled | 0.030 | baseline |
| Flashinfer Fused Allreduce Rmsnorm Oneshot | 0.064 | 0.47x |
| Flashinfer Fused Allreduce Rmsnorm Twoshot | 0.050 | 0.60x |
Configuration: seq_len=256, dtype=bfloat16, no residual
Input Size: 4.00 MB
| Operation | Time (ms) | Speedup |
|---|---|---|
| Standard Allreduce Rmsnorm | 0.031 | 0.97x |
| Standard Allreduce Rmsnorm Native Compiled | 0.030 | baseline |
| Flashinfer Fused Allreduce Rmsnorm Twoshot | 0.049 | 0.61x |
Configuration: seq_len=512, dtype=bfloat16, no residual
Input Size: 8.00 MB
| Operation | Time (ms) | Speedup |
|---|---|---|
| Standard Allreduce Rmsnorm | 0.044 | 0.98x |
| Standard Allreduce Rmsnorm Native Compiled | 0.043 | baseline |
| Flashinfer Fused Allreduce Rmsnorm Twoshot | 0.297 | 0.15x |
Configuration: seq_len=1024, dtype=bfloat16, no residual
Input Size: 16.00 MB
| Operation | Time (ms) | Speedup |
|---|---|---|
| Standard Allreduce Rmsnorm | 0.071 | 1.00x |
| Standard Allreduce Rmsnorm Native Compiled | 0.077 | 0.93x |
| Flashinfer Fused Allreduce Rmsnorm Twoshot | 0.109 | 0.66x |
Configuration: seq_len=2048, dtype=bfloat16, no residual
Input Size: 32.00 MB
| Operation | Time (ms) | Speedup |
|---|---|---|
| Standard Allreduce Rmsnorm | 0.135 | 1.00x |
| Standard Allreduce Rmsnorm Native Compiled | 0.143 | 0.94x |
| Flashinfer Fused Allreduce Rmsnorm Twoshot | 0.205 | 0.66x |
Default allreduce
Configuration: seq_len=32, dtype=bfloat16, no residual
Input Size: 0.50 MB
| Operation | Time (ms) | Speedup |
|---|---|---|
| Standard Allreduce Rmsnorm | 0.029 | 1.00x |
| Standard Allreduce Rmsnorm Native Compiled | 0.030 | 0.99x |
| Flashinfer Fused Allreduce Rmsnorm Oneshot | 0.012 | 2.44x |
| Flashinfer Fused Allreduce Rmsnorm Twoshot | 0.087 | 0.34x |
Configuration: seq_len=64, dtype=bfloat16, no residual
Input Size: 1.00 MB
| Operation | Time (ms) | Speedup |
|---|---|---|
| Standard Allreduce Rmsnorm | 0.030 | 1.00x |
| Standard Allreduce Rmsnorm Native Compiled | 0.030 | 1.00x |
| Flashinfer Fused Allreduce Rmsnorm Oneshot | 0.019 | 1.63x |
| Flashinfer Fused Allreduce Rmsnorm Twoshot | 0.056 | 0.54x |
Configuration: seq_len=128, dtype=bfloat16, no residual
Input Size: 2.00 MB
| Operation | Time (ms) | Speedup |
|---|---|---|
| Standard Allreduce Rmsnorm | 0.032 | 1.00x |
| Standard Allreduce Rmsnorm Native Compiled | 0.032 | 1.00x |
| Flashinfer Fused Allreduce Rmsnorm Oneshot | 0.033 | 0.97x |
| Flashinfer Fused Allreduce Rmsnorm Twoshot | 0.052 | 0.62x |
Configuration: seq_len=256, dtype=bfloat16, no residual
Input Size: 4.00 MB
| Operation | Time (ms) | Speedup |
|---|---|---|
| Standard Allreduce Rmsnorm | 0.051 | 0.98x |
| Standard Allreduce Rmsnorm Native Compiled | 0.050 | baseline |
| Flashinfer Fused Allreduce Rmsnorm Oneshot | 0.064 | 0.77x |
| Flashinfer Fused Allreduce Rmsnorm Twoshot | 0.050 | 1.00x |
Configuration: seq_len=512, dtype=bfloat16, no residual
Input Size: 8.00 MB
| Operation | Time (ms) | Speedup |
|---|---|---|
| Standard Allreduce Rmsnorm | 0.079 | 1.00x |
| Standard Allreduce Rmsnorm Native Compiled | 0.081 | 0.97x |
| Flashinfer Fused Allreduce Rmsnorm Twoshot | 0.068 | 1.17x |
Configuration: seq_len=1024, dtype=bfloat16, no residual
Input Size: 16.00 MB
| Operation | Time (ms) | Speedup |
|---|---|---|
| Standard Allreduce Rmsnorm | 0.119 | 1.00x |
| Standard Allreduce Rmsnorm Native Compiled | 0.125 | 0.95x |
| Flashinfer Fused Allreduce Rmsnorm Twoshot | 0.109 | 1.09x |
Configuration: seq_len=2048, dtype=bfloat16, no residual
Input Size: 32.00 MB
| Operation | Time (ms) | Speedup |
|---|---|---|
| Standard Allreduce Rmsnorm | 0.195 | 1.00x |
| Standard Allreduce Rmsnorm Native Compiled | 0.211 | 0.93x |
| Flashinfer Fused Allreduce Rmsnorm Twoshot | 0.204 | 0.96x |
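The tables above can be read more easily with a small script. The sketch below is only an illustration and is not the benchmark script from the PR: the helper names are hypothetical, the timings are copied from the "Default allreduce" tables, and the rule used (best fused time strictly below the best unfused time) is an assumption about how one might pick a threshold.

```python
BYTES_PER_MB = 1024 * 1024


def input_size_mb(seq_len: int, hidden_dim: int, bytes_per_elem: int = 2) -> float:
    """Size of a [seq_len, hidden_dim] tensor in MB (bfloat16 is 2 bytes)."""
    return seq_len * hidden_dim * bytes_per_elem / BYTES_PER_MB


# Input size in MB -> time in ms, copied from the "Default allreduce" tables.
DEFAULT_ALLREDUCE_MS = {
    0.5: {"standard": 0.029, "compiled": 0.030, "oneshot": 0.012, "twoshot": 0.087},
    1.0: {"standard": 0.030, "compiled": 0.030, "oneshot": 0.019, "twoshot": 0.056},
    2.0: {"standard": 0.032, "compiled": 0.032, "oneshot": 0.033, "twoshot": 0.052},
    4.0: {"standard": 0.051, "compiled": 0.050, "oneshot": 0.064, "twoshot": 0.050},
    8.0: {"standard": 0.079, "compiled": 0.081, "twoshot": 0.068},
    16.0: {"standard": 0.119, "compiled": 0.125, "twoshot": 0.109},
    32.0: {"standard": 0.195, "compiled": 0.211, "twoshot": 0.204},
}


def fused_wins(results: dict[float, dict[str, float]]) -> list[float]:
    """Return the input sizes where the best fused variant is strictly faster."""
    wins = []
    for size_mb in sorted(results):
        times = results[size_mb]
        baseline = min(times["standard"], times["compiled"])
        fused = min(t for name, t in times.items() if name in ("oneshot", "twoshot"))
        if fused < baseline:
            wins.append(size_mb)
    return wins


print(input_size_mb(seq_len=32, hidden_dim=8192))  # matches the 0.50 MB row
print(fused_wins(DEFAULT_ALLREDUCE_MS))
```

With these numbers the fused kernels come out ahead at 0.5, 1, 8 and 16 MB, tie at 4 MB, and lose at 2 and 32 MB, so the measurements do not reduce to a single clean cutoff.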
@ilmarkov Is VLLM_ALLREDUCE_USE_SYMM_MEM=1 something that normal vLLM users would set by default? If it's good for performance, why can't we enable it by default? Does it require a special environment or special builds? cc @ProExpertProg
@nvjullin Could you check if @ilmarkov 's measurements above match our understanding? Also, could you try if VLLM_ALLREDUCE_USE_SYMM_MEM=1 works in our case? Thanks!
Yes, it can be enabled by default. There is a PR for it. It works on Hopper and Blackwell.
Got it! We will try both of your PRs and run some experiments on our side.
@ilmarkov Just to clarify: the PyTorch SYMM_MEM implementation does not support AR+Norm fusion, right? So only the AR part uses SYMM_MEM while Norm part is based on native PyT?
Yes, symm mem is used only for the allreduce part; the norm and quantization parts run in native PyTorch.
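To make the distinction concrete, here is a minimal plain-Python sketch of what "allreduce followed by a native norm" computes compared with a single fused pass. It is a simulation under stated assumptions: ranks are modeled as lists, there is no real communication, the function names are hypothetical, and neither function reflects the actual FlashInfer or symm mem kernels.

```python
import math


def allreduce_sum(per_rank: list[list[float]]) -> list[float]:
    """Simulated allreduce: elementwise sum of every rank's vector."""
    return [sum(vals) for vals in zip(*per_rank)]


def rms_norm(x: list[float], weight: list[float], eps: float = 1e-6) -> list[float]:
    """RMSNorm: scale x by 1 / sqrt(mean(x^2) + eps), then by the weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def allreduce_then_norm(per_rank, weight, eps=1e-6):
    """Unfused path: two separate steps with an intermediate tensor."""
    return rms_norm(allreduce_sum(per_rank), weight, eps)


def fused_allreduce_norm(per_rank, weight, eps=1e-6):
    """Fused path: reduce and normalize inside one function."""
    n = len(weight)
    reduced = [0.0] * n
    for rank_vals in per_rank:
        for i, v in enumerate(rank_vals):
            reduced[i] += v
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in reduced) / n + eps)
    return [reduced[i] * inv_rms * weight[i] for i in range(n)]


ranks = [[1.0, 2.0], [3.0, 4.0]]
weight = [1.0, 1.0]
print(allreduce_then_norm(ranks, weight))
print(fused_allreduce_norm(ranks, weight))
```

Both paths produce the same values; the difference the thread is discussing is purely where the work runs and how many kernel launches and memory round trips it takes.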
cc @nvjullin @elvischenv for vis
e808818 to 61ebc95
Hi @ilmarkov, is there any progress on this change, or an ETA? Thanks!
Hi @nvpohanh. @ProExpertProg is working on general custom op matching in #24604, so we will apply the allreduce-related pattern matching after his PR lands. I have marked the current PR as a draft for now.
Signed-off-by: ilmarkov <markovilya197@gmail.com>
ProExpertProg left a comment
LGTM, a few minor notes and then we can merge!
    @staticmethod
    def default_fi_allreduce_fusion_max_size_mb() -> dict[int, float]:
        from vllm.compilation.collective_fusion import FI_ALLREDUCE_FUSION_MAX_SIZE_MB
@ilmarkov if this is still an issue to unblock we can just move this computation into the collective_fusion.py file. We can always move it back here later. I am also concerned that the head process (which shouldn't initialize CUDA) might initialize CUDA when querying device capability during device config (not sure if that happens in the head or just the workers).
But if stuff is working feel free to leave it as is
Signed-off-by: ilmarkov <markovilya197@gmail.com>
We need to move dispatch and combine back under the custom op because they conflict with torch.compile. In this PR we only need to move the (main experts) allreduce outside of the custom op.
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Validation. TP=4 (DP+EP)=4 (TP+EP)=4
    )
    if use_flashinfer:

    if num_tokens <= max_token_num:
@laithsakka and I ran over this a bit more offline and we're a bit worried this line might cause unintended specialization. Do you have instructions on how to trigger this line of code? (and if so, are you able to provide a tlparse of it? We want to check the symbolic shape constraints in the tlparse to see if this introduced anything negative)
@ilmarkov could you take a look? tlparse instructions here: https://docs.vllm.ai/en/latest/design/debug_vllm_compile
Btw, @zou3519 how do you want us to send you tlparse results, in an archive?
archive would work, maybe need a better way to share these...
Here are the tlparse results. I am not sure you will see the specialization in this line of code, though, given that it is inside the custom op flashinfer_trtllm_fused_allreduce_norm.
tl_out_vllm_fi_ar.tar.gz
… thresholds (#24248) Signed-off-by: Luka Govedič <lgovedic@redhat.com> Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Signed-off-by: ilmarkov <markovilya197@gmail.com> Co-authored-by: Luka Govedič <lgovedic@redhat.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> (cherry picked from commit d17ecc6)
… thresholds (vllm-project#24248) Signed-off-by: Luka Govedič <lgovedic@redhat.com> Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Signed-off-by: ilmarkov <markovilya197@gmail.com> Co-authored-by: Luka Govedič <lgovedic@redhat.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
First part of improvement on fused allreduce.
Purpose
Tunes the thresholds for FlashInfer allreduce fusion.
Adds a benchmark for allreduce fusion to determine the input size thresholds for FlashInfer allreduce.
Updates the FlashInfer allreduce thresholds on Hopper and Blackwell devices, and uses the two-shot algorithm where it performs better.
Moves allreduce out of the moe_forward custom op so that it can be matched for fusion in MoE models.
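As a rough illustration of how a per-world-size size threshold in MB can turn into a token-count check like the `num_tokens <= max_token_num` line discussed in the review, consider the sketch below. The function names and the 1.0 MB threshold are hypothetical; only the `dict[int, float]` shape (world size to MB) comes from the PR.

```python
BYTES_PER_MB = 1024 * 1024


def max_token_num(max_size_mb: float, hidden_dim: int, bytes_per_elem: int) -> int:
    """Largest token count whose [tokens, hidden_dim] tensor fits in max_size_mb."""
    return int(max_size_mb * BYTES_PER_MB) // (hidden_dim * bytes_per_elem)


def use_fused_allreduce(
    num_tokens: int,
    hidden_dim: int,
    bytes_per_elem: int,
    world_size: int,
    thresholds_mb: dict[int, float],
) -> bool:
    """Decide whether the fused kernel applies for this shape and world size."""
    max_mb = thresholds_mb.get(world_size)
    if max_mb is None:
        # No tuned threshold for this world size: fall back to the unfused path.
        return False
    return num_tokens <= max_token_num(max_mb, hidden_dim, bytes_per_elem)


# Hypothetical threshold: 1.0 MB for world size 8 (not a value from the PR).
example_thresholds = {8: 1.0}
print(max_token_num(1.0, hidden_dim=8192, bytes_per_elem=2))
print(use_fused_allreduce(64, 8192, 2, world_size=8, thresholds_mb=example_thresholds))
print(use_fused_allreduce(65, 8192, 2, world_size=8, thresholds_mb=example_thresholds))
```

Expressing the limit in MB keeps the tuning independent of hidden size and dtype; the conversion to a token count happens once per model configuration.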
Test Plan
Added tests for fusion with non-custom (torch native) ops.
Added an e2e test for Qwen3 MoE.
Based on #24604
Second part: #24252 Introduce compile ranges