[Compile] Conditional compilation. Introduce compile_ranges #24252

Merged
ProExpertProg merged 168 commits into vllm-project:main from neuralmagic:imarkov/conditional_compilation_ranges
Dec 5, 2025

@ilmarkov (Contributor) commented Sep 4, 2025

Second part of splitting #22086

Dynamic graph dispatch via compile_ranges: this PR introduces a new configuration option, compile_ranges, as an alternative to compile_sizes. It enables dynamic dispatch to different compiled graphs based on the input batch size.
With this approach, when allreduce fusion is enabled, vLLM adds an extra compile range split point to separate two graphs:
1. One with fused allreduce for small and medium input shapes.
2. One with NCCL-based allreduce for large input shapes.

compile_ranges extends and generalizes the existing compile_sizes feature. Ranges are defined by split points, and vLLM dispatches each request to the graph pre-compiled for the range that contains its input batch size. For example, a configuration of (32, 64) defines three distinct ranges: [1, 32], [33, 64], and [65, max_num_batched_tokens). This gives granular control: developers can statically enable or disable fusions within each graph to optimize performance for different batch sizes.
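The mapping from split points to ranges described above can be sketched in plain Python. This is only an illustration, not the code in this PR: the function names `ranges_from_split_points` and `find_range` are hypothetical, and treating every range (including the last one) as inclusive of both endpoints is an assumption of the sketch.

```python
from bisect import bisect_left


def ranges_from_split_points(split_points, max_num_batched_tokens):
    """Turn split points into consecutive (start, end) ranges.

    Assumption of this sketch: both ends of every range are inclusive.
    """
    bounds = sorted(set(split_points))
    if bounds and (bounds[0] < 1 or bounds[-1] >= max_num_batched_tokens):
        raise ValueError("split points must lie in [1, max_num_batched_tokens)")
    starts = [1] + [b + 1 for b in bounds]
    ends = bounds + [max_num_batched_tokens]
    return list(zip(starts, ends))


def find_range(ranges, num_tokens):
    """Return the range that contains num_tokens (hypothetical dispatch helper)."""
    ends = [end for _, end in ranges]
    # Index of the first range whose end is >= num_tokens.
    idx = bisect_left(ends, num_tokens)
    if num_tokens < 1 or idx == len(ranges):
        raise ValueError(f"{num_tokens} is outside every compile range")
    return ranges[idx]


ranges = ranges_from_split_points((32, 64), max_num_batched_tokens=8192)
print(ranges)
print(find_range(ranges, 33))
```

Each range would then own one compiled graph, so a lookup like `find_range` is all the runtime needs to pick the graph for a given batch.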

All compilation now goes through piecewise_backend.py. Every compilation is done within the bounds of a specific compile range; dynamic shape compilation is removed.

Purpose

Corresponding RFC: #23113
The primary motivation for these changes is to enhance vllm's performance and adaptability for diverse workloads. By supporting allreduce fusion without custom ops and introducing dynamic graph dispatch, we empower users to fine-tune vllm for more efficient and scalable inference.

Test Plan

Added test test_compile_ranges.py

Follow ups

  • Deal with the shape env being shared by all graphs, which can lead to one compilation constraining SymInts for the other compilations. This might need support from torch.compile, e.g. shapenv.assume_ranges, shapenv.do_error_at_specialize.
  • Put fusions under the O3 level of compilation.
  • Share range info with Inductor for the SymInt (see comment).

Performance benchmarks:

Server:

VLLM_ALLREDUCE_USE_SYMM_MEM=1 vllm serve {{model}} \
    --disable-log-requests --no-enable-prefix-caching -tp {{tp}} -dp 1 --max-num-seqs 256

To enable allreduce fusions:
--compilation-config "{\"pass_config\":{\"enable_fusion\":false,\"enable_attn_fusion\":false,\"enable_noop\":true,\"enable_sequence_parallelism\":false,\"enable_async_tp\":false,\"enable_fi_allreduce_fusion\":true}}"
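The escaped JSON string above is easy to get wrong by hand. As a minimal sketch, the same argument can be generated from a Python dict; the keys and values are copied from the command above, while the variable names are only illustrative.

```python
import json
import shlex

# Pass configuration exactly as in the command above.
pass_config = {
    "enable_fusion": False,
    "enable_attn_fusion": False,
    "enable_noop": True,
    "enable_sequence_parallelism": False,
    "enable_async_tp": False,
    "enable_fi_allreduce_fusion": True,
}

# Compact separators reproduce the whitespace-free form used above.
arg = json.dumps({"pass_config": pass_config}, separators=(",", ":"))

# shlex.quote makes the value safe to paste into a POSIX shell.
print("--compilation-config", shlex.quote(arg))
```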

Client. Input len 1024, output len 128.

B200 TP=2, Llama-3.1-70B-Instruct-FP8

Baseline:

| QPS | Mean TTFT (ms) | Median TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | Request Throughput (req/s) |
|---|---|---|---|---|---|
| 1 | 85.644 | 83.395 | 11.812 | 11.661 | 0.976 |
| 5 | 125.548 | 88.135 | 16.611 | 15.562 | 4.878 |
| 10 | 196.623 | 109.034 | 27.632 | 26.632 | 9.754 |
| 15 | 291.392 | 146.879 | 46.534 | 46.904 | 14.544 |

Allreduce + RMSNorm + QuantFp8

| QPS | Mean TTFT (ms) | Median TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | Request Throughput (req/s) |
|---|---|---|---|---|---|
| 1 | 71.489 | 70.008 | 10.725 | 10.647 | 0.978 |
| 5 | 116.128 | 74.080 | 14.436 | 13.352 | 4.888 |
| 10 | 183.171 | 91.187 | 23.219 | 20.959 | 9.776 |
| 15 | 201.879 | 124.434 | 36.656 | 34.716 | 14.607 |
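To read the two Llama-3.1-70B tables side by side, the relative change per request rate can be computed directly. The sketch below uses the median TPOT columns copied from the tables above; the helper name `relative_reduction` is illustrative.

```python
# Median TPOT (ms) per QPS, copied from the two tables above.
baseline_tpot = {1: 11.661, 5: 15.562, 10: 26.632, 15: 46.904}
fused_tpot = {1: 10.647, 5: 13.352, 10: 20.959, 15: 34.716}


def relative_reduction(before, after):
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before


for qps in sorted(baseline_tpot):
    change = relative_reduction(baseline_tpot[qps], fused_tpot[qps])
    print(f"QPS {qps}: median TPOT reduced by {change:.1%}")
```

The same helper applies unchanged to the TTFT columns and to the other two models.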

B200 TP=4 Qwen3-Next-80B-A3B-Instruct, No EP

Baseline:

| QPS | Mean TTFT (ms) | Median TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | Request Throughput (req/s) |
|---|---|---|---|---|---|
| 5 | 93.241 | 84.538 | 33.883 | 34.209 | 4.715 |
| 10 | 106.084 | 96.828 | 41.167 | 41.103 | 9.431 |
| 15 | 120.676 | 119.744 | 49.314 | 49.832 | 14.101 |

Allreduce + RMSNorm + QuantFp8

| QPS | Mean TTFT (ms) | Median TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | Request Throughput (req/s) |
|---|---|---|---|---|---|
| 5 | 96.324 | 85.852 | 33.873 | 33.878 | 4.761 |
| 10 | 103.219 | 91.413 | 39.743 | 39.887 | 9.436 |
| 15 | 116.451 | 114.429 | 47.549 | 47.940 | 14.118 |

B200 TP=8 DeepSeek-V3.1, No EP.

Baseline:

| QPS | Mean TTFT (ms) | Median TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | Request Throughput (req/s) |
|---|---|---|---|---|---|
| 1 | 97.928 | 48.912 | 13.845 | 13.535 | 0.972 |
| 5 | 68.071 | 51.548 | 16.486 | 16.476 | 4.864 |
| 10 | 81.586 | 60.076 | 22.646 | 22.421 | 9.677 |
| 15 | 102.587 | 73.730 | 27.765 | 27.719 | 14.442 |

Allreduce + RMSNorm + QuantFp8

| QPS | Mean TTFT (ms) | Median TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | Request Throughput (req/s) |
|---|---|---|---|---|---|
| 1 | 98.466 | 47.478 | 13.175 | 12.933 | 0.973 |
| 5 | 67.292 | 51.342 | 15.711 | 15.695 | 4.869 |
| 10 | 81.177 | 58.212 | 20.094 | 19.978 | 9.699 |
| 15 | 97.646 | 73.333 | 25.690 | 25.834 | 14.486 |

Start up time increase

This change increases startup time because it adds more graph compilations.
When two graphs are compiled (the typical case when allreduce fusion is enabled), a cold start for the DeepSeek-V3 model takes 181.91 s and a warm start takes 12.40 s.

Based on PR: #24604
First part: #24248

@ilmarkov changed the title from "[PERF] Introduce compile_ranges" to "[PERF] Conditional compilation. Introduce compile_ranges" Sep 4, 2025
@ilmarkov changed the title from "[PERF] Conditional compilation. Introduce compile_ranges" to "[Compile] Conditional compilation. Introduce compile_ranges" Sep 4, 2025
@mergify mergify bot added the ci/build label Sep 5, 2025
Comment on lines +102 to +103

def __call__(self, *args) -> Any:
Collaborator

Btw, does this PR work, or is it mostly WIP? (Are you sure that the graph generated ends up being dynamic on the specific range that is passed?)

There's one problem that I don't know how to solve yet. Let's say we're compiling with ranges [2, 16] and (16, 4096]. Each compilation needs its own ShapeEnv (environment with symbols in it), which has the batch_size constrained to the particular range.

So what we should do is, for each range, take the current ShapeEnv (which thinks the batch_size is dynamic on range [2, 4096]), clone it, constrain it to the current range (e.g. [2, 16]), and use it throughout the compilation.

I don't know how to "clone" ShapeEnvs. Is there anything else we can do here @laithsakka @bobrenjc93 ?

@ilmarkov (Contributor Author) commented Sep 9, 2025

It already works, aside from a PyTorch standalone_compile issue that should be fixed in a new PyTorch release (in this commit). The graphs for each range are generated dynamically, and fusions are applied differently in each graph.

Collaborator

Dynamo traces out a graph that is fully dynamic over the batch_size. We should tell torch.compile what we know about the batch_size for each range, for example, that it is constrained to [2, 16]. This will help it generate better code. To do this, you'll need to grab the SymInt that is the batch_size and add constraints to it.

Contributor Author

OK, got it. These are the hints for torch.compile that I meant at the meeting. Thanks, I'll add ShapeEnv here.

Contributor

If we are using is_applicable_for_range (the current form of the PR), this is fine. If we want to go with the other approach (see my other comment on the PR), which is more complicated (I think we would want a reason to do it), then yes, this is problematic.

Comment on lines +477 to +479
return compile_range is not None and (
    compile_range[0]
    == compile_range[1]) and (compile_range[1] % tp_size == 0)
@zou3519 (Collaborator) commented Sep 8, 2025

The way I originally thought of doing this is something like:

return statically_known_true(batch_size % tp_size == 0)

If we are able to access the batch_size SymInt here, then we are able to query things about it.

cc @laithsakka @bobrenjc93 on if I'm butchering this API

Contributor Author

Could you elaborate on how statically_known_true is going to improve the existing approach? Is it more stable?

Collaborator

Instead of implementing your own range analysis, PyTorch already encodes range information in the SymInts themselves. So this is more of a code-reuse thing.
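For comparison, the hand-written range check in the snippet under review can be written out in plain Python and checked against a brute-force definition of "every batch size in the range is divisible by tp_size". This is only an illustration of the range analysis being discussed, not the SymInt-based approach suggested here; both function names are hypothetical, and ranges are assumed to be inclusive (start, end) pairs.

```python
def is_single_divisible_size(compile_range, tp_size):
    """Same logic as the reviewed snippet: one exact size, divisible by tp_size."""
    return compile_range is not None and (
        compile_range[0] == compile_range[1]
    ) and (compile_range[1] % tp_size == 0)


def all_sizes_divisible(compile_range, tp_size):
    """Brute-force reference: check every size in the inclusive range."""
    start, end = compile_range
    return all(size % tp_size == 0 for size in range(start, end + 1))


# For tp_size > 1 the two agree: a range with more than one size contains
# two consecutive integers, and they cannot both be multiples of tp_size.
for tp_size in (2, 4, 8):
    for start in range(1, 33):
        for end in range(start, 33):
            rng = (start, end)
            assert is_single_divisible_size(rng, tp_size) == all_sizes_divisible(rng, tp_size)
print("closed-form check matches brute force")
```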

Contributor

So it really depends on the goal of those ranges. If the goal is solely or mainly to allow custom passes to branch on ranges, this is fine. In fact, it's simpler than mutating the shape env and having to fork it.
We can then also keep the invariant that Inductor itself does not specialize, and run the same checks here (which we do not have yet).
On the other hand, if someone really thinks Inductor itself can do significantly better if we actually specialize the shape env, then yes, we would have to do something else.
But it sounds to me like the intention is the former?

@bobrenjc93

@ilmarkov out of curiosity, do you have a sense of how large the perf wins from this will be (and for which models)?

mergify bot commented Sep 16, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ilmarkov.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 16, 2025
@ilmarkov (Contributor Author)

@bobrenjc93 Without multiple graphs, our fallback (for large input sizes, i.e. when we don't use allreduce fusion) uses either custom ops or non-optimized PyTorch operations, both of which are slower than torch Triton operations. I think a reasonable perf comparison was done in #19830.

if compile_range[0] == compile_range[1]:
    dynamic_shapes = "from_example_inputs"
else:
    dynamic_shapes = "from_graph"
Contributor

Do "from_graph" and "from_tracing_context" have the same effect here, i.e. getting the shape env we traced the DS graph with? If yes, let's reduce the divergence.

Collaborator

We want to get this PR over the line soon, could you take this on in a follow up?

@laithsakka (Contributor) left a comment

One good side effect of this, other than custom passes, is that each range is tuned in Inductor with a hint from that range. This means we can also use range splitting to ensure that small inputs and large inputs are max-autotuned with separate hints.

This would also work for unbacked, which is good! (Well, except that we would have to call override hint for unbacked with the actual example value when we do the range compilations, cc @bobrenjc93.)

@laithsakka (Contributor)

Here is one concern with this: it will make the soundness story harder with respect to the DS added by Inductor.
To explain: Inductor has the ability to specialize for dynamic shapes. Right now we assume it does not,
and maybe soon we will add a check that it actually does not (this could also be BC-breaking if it does).

The ideal, and only actually right, fix is to use unbacked, but unbacked comes with a perf hit.
So then came the idea of using unbacked as a fallback: evaluate the dynamo + Inductor guards on the input of the DS graph and call either the backed DS graph or the unbacked DS graph.

With this change we now have much more branching, and we would need to track Inductor guards for each of those compilations
(Inductor can guard differently on each of those ranges based on the example input). So the fallback solution becomes more expensive and more complicated. cc @zou3519 @bobrenjc93 @jamesjwu

Signed-off-by: Luka Govedič <lgovedic@redhat.com>
…nels)

Signed-off-by: Luka Govedič <lgovedic@redhat.com>
…replacements). TODO pass to remove unnecessary conversions?

Signed-off-by: Luka Govedič <lgovedic@redhat.com>
@mergify mergify bot removed the needs-rebase label Dec 3, 2025
Signed-off-by: ProExpertProg <lgovedic@redhat.com>
mergify bot commented Dec 4, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ilmarkov.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 4, 2025
Signed-off-by: ProExpertProg <lgovedic@redhat.com>
@mergify mergify bot removed the needs-rebase label Dec 4, 2025
Signed-off-by: ProExpertProg <lgovedic@redhat.com>
@ProExpertProg (Collaborator) commented Dec 4, 2025

vllm bench sweep serve --serve-cmd "vllm serve redhatai/meta-llama-3.1-70B-Instruct-FP8 --tensor-parallel-size 2 --no-enable-prefix-caching --port 8869" --bench-cmd "vllm bench serve --model redhatai/meta-llama-3.1-70B-Instruct-FP8 --dataset-name random --ignore-eos" --bench-params sweep-qps.json --serve-params sweep-allreduce-fusion.json
sweep-allreduce-fusion.json
{
  "native-unfused": {
    "compilation_config": {
      "custom_ops": [
        "none"
      ],
      "pass_config": {
        "fuse_allreduce_rms": false
      }
    }
  },
  "native-fused": {
    "compilation_config": {
      "custom_ops": [
        "none"
      ],
      "pass_config": {
        "fuse_allreduce_rms": true
      }
    }
  },
  "custom-unfused": {
    "compilation_config": {
      "custom_ops": [
        "all"
      ],
      "pass_config": {
        "fuse_allreduce_rms": false
      }
    }
  },
  "custom-fused": {
    "compilation_config": {
      "custom_ops": [
        "all"
      ],
      "pass_config": {
        "fuse_allreduce_rms": true
      }
    }
  }
}
sweep-qps.json
[
  {
    "num-prompts": 120,
    "request-rate": 1
  },{
    "num-prompts": 600,
    "request-rate": 5
  },{
    "num-prompts": 1200,
    "request-rate": 10
  },{
    "num-prompts": 1800,
    "request-rate": 15
  },{
    "num-prompts": 2400,
    "request-rate": 20
  },{
    "num-prompts": 1000,
    "request-rate": "inf"
  }
]
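The finite-rate entries in sweep-qps.json all use num-prompts equal to 120 times the request rate. A small sketch that regenerates the file contents from that pattern is shown below; the 120-second figure is inferred from the values above rather than stated in the thread, and the variable names are illustrative.

```python
import json

# Assumption: inferred from num-prompts / request-rate in the file above.
SECONDS_OF_TRAFFIC = 120
REQUEST_RATES = [1, 5, 10, 15, 20]

params = [
    {"num-prompts": rate * SECONDS_OF_TRAFFIC, "request-rate": rate}
    for rate in REQUEST_RATES
]
# The unbounded-rate run uses a fixed prompt count, as in the file above.
params.append({"num-prompts": 1000, "request-rate": "inf"})

print(json.dumps(params, indent=2))
```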

Signed-off-by: ProExpertProg <lgovedic@redhat.com>
@ProExpertProg (Collaborator) commented Dec 5, 2025

vllm bench sweep plot .auxtmpdir/bench/results/20251204_202022/ --var-y median_ttft_ms --var-x request_rate --curve-by _benchmark_name --fig-name ttft_20 --filter-by="request_rate!=inf" --fig-height 10
[Figure: tpot_20]
vllm bench sweep plot .auxtmpdir/bench/results/20251204_202022/ --var-y median_ttft_ms --var-x request_rate --curve-by _benchmark_name --fig-name ttft_20 --filter-by="request_rate!=inf" --fig-height 10
[Figure: ttft_20]

@ProExpertProg ProExpertProg enabled auto-merge (squash) December 5, 2025 00:39
mergify bot commented Dec 5, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ilmarkov.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 5, 2025
# Conflicts:
#	tests/conftest.py

Signed-off-by: ProExpertProg <lgovedic@redhat.com>
@mergify mergify bot removed the needs-rebase label Dec 5, 2025
@ProExpertProg ProExpertProg merged commit 4e26d3b into vllm-project:main Dec 5, 2025
53 checks passed
@hjjq (Contributor) commented Dec 6, 2025

Hi @ilmarkov, the following command seems to be broken after this PR. I suspect this is because setting --compilation_config.pass_config.fuse_allreduce_rms true with tp=1 causes the constructor to return early without setting self.max_token_num, so accessing it later in is_applicable_for_range() raises the error. Do you know how to handle this case gracefully? Thanks!

export VLLM_ATTENTION_BACKEND=FLASHINFER_MLA
export VLLM_FLASHINFER_MOE_BACKEND=latency
export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_USE_NCCL_SYMM_MEM=1
export NCCL_NVLS_ENABLE=1
export NCCL_CUMEM_ENABLE=1
export VLLM_USE_TRTLLM_RAGGED_DEEPSEEK_PREFILL=1

python3 -m vllm.entrypoints.openai.api_server --model nvidia/DeepSeek-R1-0528-FP4 --tokenizer nvidia/DeepSeek-R1-0528-FP4 --dtype auto --kv-cache-dtype fp8 --tensor-parallel-size 1 --pipeline-parallel-size 1 --data-parallel-size 4 --enable-expert-parallel --swap-space 16 --max-num-seqs 1024 --trust-remote-code --max-model-len 10240 --gpu-memory-utilization 0.9 --max-num-batched-tokens 8192 --no-enable-prefix-caching --async-scheduling --compilation_config.pass_config.fuse_allreduce_rms true --compilation_config.pass_config.fuse_attn_quant true --compilation_config.pass_config.eliminate_noops true --compilation_config.custom_ops+=+quant_fp8,+rms_norm --max-cudagraph-capture-size 2048 --compilation_config.cudagraph_mode FULL_DECODE_ONLY --stream-interval=20 --api-server-count=20
(EngineCore_DP1 pid=4041955)     if pass_.is_applicable_for_range(compile_range):
(EngineCore_DP1 pid=4041955)        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP1 pid=4041955)   File "/hanjieq/vllm/vllm/compilation/collective_fusion.py", line 1191, in is_applicable_for_range
(EngineCore_DP1 pid=4041955)     return compile_range.end <= self.max_token_num
(EngineCore_DP1 pid=4041955)                                 ^^^^^^^^^^^^^^^^^^
(EngineCore_DP1 pid=4041955) torch._inductor.exc.InductorError: AttributeError: 'AllReduceFusionPass' object has no attribute 'max_token_num'

@ProExpertProg (Collaborator)

We should probably check self.disabled in the is_applicable_for_range and emit a warning (warn_once), could you submit a PR?
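A minimal sketch of that guard, under stated assumptions: `FusionPassSketch` is a hypothetical stand-in rather than the real AllReduceFusionPass, the method takes the range end as a plain integer, and a boolean flag emulates the warn-once behaviour.

```python
import logging

logger = logging.getLogger(__name__)


class FusionPassSketch:
    """Hypothetical stand-in for a pass whose constructor may return early."""

    def __init__(self, tp_size, max_token_num):
        self.disabled = True
        self._warned = False
        if tp_size <= 1:
            # Early return: max_token_num is never set on this path.
            return
        self.max_token_num = max_token_num
        self.disabled = False

    def is_applicable_for_range(self, compile_range_end):
        if self.disabled:
            # Guard first, so the unset attribute is never read.
            if not self._warned:
                logger.warning("fusion pass is disabled; skipping it for all ranges")
                self._warned = True
            return False
        return compile_range_end <= self.max_token_num


print(FusionPassSketch(tp_size=1, max_token_num=8192).is_applicable_for_range(16))
print(FusionPassSketch(tp_size=2, max_token_num=8192).is_applicable_for_range(16))
```

Checking the flag before touching `max_token_num` turns the AttributeError from the traceback above into a plain "not applicable" answer.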

@ZJY0516 (Member) commented Dec 6, 2025

> We should probably check self.disabled in the is_applicable_for_range and emit a warning (warn_once), could you submit a PR?

I have a PR #30178 for this. PTAL @ProExpertProg @hjjq

@zou3519 (Collaborator) commented Dec 8, 2025

@ilmarkov compile_ranges tests are failing on main. Somehow the failing tests are not reporting as red. Could you take a look please?

https://buildkite.com/vllm/ci/builds/42175#019aefbd-bbf1-4ad0-a4ea-4b424efdbca4

Labels

ci/build, frontend, performance, ready, tool-calling, torch.compile, v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[RFC]: Enabling Multiple Graphs Based on pre-defined conditions

9 participants