[Feat] Iteration-level profiling for Torch and CUDA profiler #28987

vllm-bot merged 2 commits into vllm-project:main
Conversation
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Code Review
This pull request introduces valuable iteration-level profiling controls for the Torch and CUDA profilers, allowing for more lightweight and targeted profiling. The refactoring of profiling logic into a common WorkerProfiler base class is a great improvement for code structure and maintainability. The new features are well-covered by unit tests. I've found one critical issue in the base profiler class that needs to be addressed.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
Nice! Given you're adding a bunch more configurable knobs to the profiler, would now be a good time to add proper config support rather than continuing to use env vars? See #25700
@markmc I'm 50/50 on converting these into flags. I think it's fairly non-intrusive as-is, and it would take some care to get the frontend flags right, since there are so many options for the torch profiler. Referring to the discussion in that issue for guidance:

I feel intuitively, though not strongly, that the profiler flags fall more into the second kind of "instrumentation" type knobs, as opposed to "runtime configuration" type knobs. Maybe this is a meaningless distinction and everything should be a frontend flag. Given the volume of knobs for the torch profiler, do you think we should have something like

If you feel strongly that we should switch to flags ASAP, I can take this on. If so, would you like it to be a part of this PR?
If we do this, I'm going to strongly advocate for the
I'm not super convinced about the distinction. I just left a comment on that in the issue.
Absolutely, much more expressive and readable, especially if you're tweaking more than a handful of these knobs 👍
I'm happy for it to be a separate PR, but merging this PR first does mean there'll be three more env vars to add backwards-compat support for.
```python
if self._delay_iters == 0:
    self._call_start()

def step(self) -> None:
```
We also need the ability to log the number of requests that got scheduled/executed per iteration and how many tokens per request.
The profiling annotation will already add num_new and num_cached request-count information to the NVTX range for each iteration, thanks to the changes to annotate_profile. As for the number of tokens, this should probably just be an NVTX mark or range inside the model runner. This can be added separately and is independent of this PR.
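As a rough illustration of the pattern described above: wrapping each engine iteration in an NVTX range whose name carries the request counts. The helper name and the stub NVTX object below are mine, not vLLM's API; in vLLM the real markers would come from `torch.cuda.nvtx`, the stub just lets the sketch run without a GPU.

```python
from contextlib import contextmanager


# Stand-in for torch.cuda.nvtx so the sketch runs without a GPU; the real
# range_push/range_pop calls emit ranges visible in Nsight Systems.
class _NvtxStub:
    def __init__(self) -> None:
        self.stack: list[str] = []

    def range_push(self, msg: str) -> None:
        self.stack.append(msg)

    def range_pop(self) -> None:
        self.stack.pop()


nvtx = _NvtxStub()


@contextmanager
def annotate_iteration(num_new: int, num_cached: int):
    """Wrap one engine iteration in an NVTX range whose name carries the
    scheduled-request counts, so they appear on the profiler timeline."""
    label = f"engine_step(num_new={num_new}, num_cached={num_cached})"
    nvtx.range_push(label)
    try:
        yield label
    finally:
        nvtx.range_pop()


with annotate_iteration(num_new=3, num_cached=5) as label:
    pass  # model execution would happen here
```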
I'm fine with either env vars or CLI flags (or, if anyone prefers, as a parameter to the
mgoin left a comment
I think this looks great! It makes the torch profiler feel better too, so I'm fairly confident this is a step in the right direction. If you can fast-follow a `ProfilerConfig` before the next release, that would be best so we can remove the new env vars without needing deprecation.
…oject#28987) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
Purpose
This PR introduces `VLLM_PROFILER_DELAY_ITERS` and `VLLM_PROFILER_MAX_ITERS`, which control the behaviour of the Torch and CUDA profilers. Respectively, they offset the start of profiling and limit the total number of engine iterations that are profiled.

This mostly serves to facilitate lightweight profiling of heavy workloads on large models: to measure high-throughput traffic, we need to send a lot of requests to push the batch size high enough to be realistic, and we need to run for some time in order to warm up the hardware and reach a steady state. This leads to extremely large profiling footprints which can take a long time to record, serialize, store on disk, and process. This PR enables us to capture a small slice of a long benchmarking run, making it easy to capture a precise engine time slice.
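For example, using the delay/max values from the test plan below (the export form is just one way to set these; any mechanism that puts them in the server's environment works):

```shell
# Skip the first 128 engine iterations, then profile 4 iterations.
export VLLM_PROFILER_DELAY_ITERS=128
export VLLM_PROFILER_MAX_ITERS=4
```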
This change is compatible with both the Torch profiler and the CUDA profiler, with the exception that it is not currently integrated with the CPU-side AsyncLLM Torch profiling, which remains on for the full duration by default. I added an opt-in variable to disable this for cases where we need a Torch profile capturing a small slice of a long run.
In doing this, I refactored most of the torch profiling code out of `gpu_worker.py` and into `gpu_profiler.py`, alongside the CUDA profiler. A common base class coordinates the delay and max-limit calculations, and is thoroughly unit-tested.
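A minimal sketch of how such a base class could coordinate the two env vars; the class and method names below are illustrative (only `_call_start` and `step` echo the hunk quoted in the review above), and the actual vLLM implementation may differ:

```python
import os


class WorkerProfiler:
    """Sketch: defer the real profiler start by VLLM_PROFILER_DELAY_ITERS
    engine iterations, then stop after VLLM_PROFILER_MAX_ITERS profiled
    iterations. Subclasses would override _call_start/_call_stop to drive
    the Torch or CUDA profiler."""

    def __init__(self) -> None:
        self._delay_iters = int(os.environ.get("VLLM_PROFILER_DELAY_ITERS", "0"))
        self._max_iters = int(os.environ.get("VLLM_PROFILER_MAX_ITERS", "0"))  # 0 = no limit
        self._armed = False
        self.active = False
        self._remaining_delay = 0
        self._profiled_iters = 0

    def start(self) -> None:
        # Arm the profiler; with no configured delay, start immediately.
        self._armed = True
        self._remaining_delay = self._delay_iters
        self._profiled_iters = 0
        if self._delay_iters == 0:
            self._call_start()

    def step(self) -> None:
        # Called once per engine iteration to advance the delay/limit logic.
        if not self._armed:
            return
        if not self.active:
            self._remaining_delay -= 1
            if self._remaining_delay <= 0:
                self._call_start()
            return
        self._profiled_iters += 1
        if self._max_iters and self._profiled_iters >= self._max_iters:
            self.stop()

    def stop(self) -> None:
        if self.active:
            self._call_stop()
        self._armed = False

    def _call_start(self) -> None:  # overridden by Torch/CUDA subclasses
        self.active = True

    def _call_stop(self) -> None:
        self.active = False
```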
Also, I extended `annotate_profile` to work with the CUDA profiler, adding a handy NVTX range over each engine iteration.

TODO: update documentation
Test Plan
The common logic introduced in this PR is unit-tested. I manually checked that the integration works properly for both the CUDA and Torch profilers. Here are screenshots of a 256-batch-size slice with a delay of 128 and a max of 4 iterations:
Unit tests pass locally. I need to make sure they get run in CI; I think they should be covered by the `tests/v1/worker` runs, but I should double-check.