
[Feat] Iteration-level profiling for Torch and CUDA profiler #28987

Merged
vllm-bot merged 2 commits into vllm-project:main from CentML:iteration-level-profile-ranges
Nov 20, 2025

Conversation

@benchislett (Collaborator) commented Nov 19, 2025

Purpose

This PR introduces VLLM_PROFILER_DELAY_ITERS and VLLM_PROFILER_MAX_ITERS, which control the behaviour of the Torch and CUDA profilers. Respectively, they offset the start of profiling and limit the total number of engine iterations that are profiled.

This mostly serves to facilitate lightweight profiling of heavy workloads on large models: when trying to measure high-throughput traffic, we need to send a lot of requests to get the batch size sufficiently high to be realistic, and we need to run for some time in order to warm up the hardware and get into a steady state. This leads to extremely large profiling footprints which can take a long time to record, serialize, store on disk, and process. This PR enables us to capture a small slice of a long benchmarking run, making it easy to capture a precise engine time slice.

This change is compatible with both the Torch profiler and the CUDA profiler, with the exception that it is not currently integrated with the CPU-side AsyncLLM Torch profiling, which remains on for the full duration by default. I added an opt-in variable to disable this for cases when we need a Torch profile capturing a small slice of a long run.

In doing this, I refactored most of the torch profiling code out of gpu_worker.py and into gpu_profiler.py alongside the CUDA profiler. A common base class coordinates the delay and max-limit calculations, and is thoroughly unit-tested.
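The delay/max-limit coordination can be sketched roughly as follows. This is an illustrative stand-in, not vLLM's actual WorkerProfiler API; the class and method names are hypothetical:

```python
class IterationWindowProfiler:
    """Sketch of iteration-windowed profiling: skip `delay_iters` engine
    iterations, then profile at most `max_iters` of them (0 = no limit)."""

    def __init__(self, delay_iters: int = 0, max_iters: int = 0):
        self._delay_iters = delay_iters
        self._max_iters = max_iters
        self._iter = 0
        self.active = False

    def _call_start(self) -> None:
        # Subclasses would start the Torch or CUDA profiler here.
        self.active = True

    def _call_stop(self) -> None:
        # Subclasses would stop the profiler and flush the trace here.
        self.active = False

    def start(self) -> None:
        # With no delay configured, begin profiling immediately.
        if self._delay_iters == 0:
            self._call_start()

    def step(self) -> None:
        """Called once per engine iteration."""
        self._iter += 1
        if self._iter == self._delay_iters:
            self._call_start()
        if self._max_iters and self._iter == self._delay_iters + self._max_iters:
            self._call_stop()
```

With `delay_iters=128, max_iters=4`, this would capture exactly iterations 129 through 132 of a long run, matching the kind of slice shown in the test plan below.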

Also, I extended annotate_profile to work with the CUDA profiler, adding a handy NVTX range over each engine iteration.
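A minimal sketch of what a per-iteration NVTX annotation might look like. In vLLM the push/pop calls would be torch.cuda.nvtx.range_push / range_pop; here they are injected as callables so the sketch runs without a GPU, and the label format is hypothetical:

```python
from contextlib import contextmanager


@contextmanager
def iteration_range(label, push, pop):
    # push/pop stand in for torch.cuda.nvtx.range_push / range_pop.
    push(label)
    try:
        yield
    finally:
        pop()


def run_engine_iteration(step_idx, num_new, num_cached, push, pop):
    # Wrap one engine iteration in an NVTX range carrying request counts,
    # similar in spirit to the annotate_profile change described above.
    label = f"engine_iter={step_idx} new={num_new} cached={num_cached}"
    with iteration_range(label, push, pop):
        pass  # model forward / sampling would execute here
```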

TODO: update documentation

Test Plan

The common logic introduced in this PR is unit tested. I manually checked that the integration works properly for both the CUDA and Torch profilers. Here are screenshots of a 256-batch-size slice with a delay of 128 and a max of 4 iterations:

[Screenshots: profiler timelines showing the captured 4-iteration slice]

Unit tests pass locally. I need to make sure they get run in the CI; I think they should be covered by the tests/v1/worker runs, but I will double-check.

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces valuable iteration-level profiling controls for the Torch and CUDA profilers, allowing for more lightweight and targeted profiling. The refactoring of profiling logic into a common WorkerProfiler base class is a great improvement for code structure and maintainability. The new features are well-covered by unit tests. I've found one critical issue in the base profiler class that needs to be addressed.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
@markmc (Member) commented Nov 19, 2025

Nice!

Given you're adding a bunch more configurable knobs to the profiler, would now be a good time to add proper config support rather than continuing to use env vars? See #25700

@mgoin mgoin requested a review from WoosukKwon November 19, 2025 15:22
@benchislett (Collaborator, Author) commented:

@markmc I'm 50/50 on converting these into flags. I think it's fairly non-intrusive as-is, and it would take some care to get the frontend flags right, since there are so many options for the torch profiler. Referring to the discussion in that issue for guidance:

There are many envvars that should instead be configs, like the attention backend, the all2all kernel backend, and even a flag to control the KV cache layout. envvars are evil because:

  • It is the equivalent of using global variables everywhere in the code, which is bad programming practice.
  • envvars have no advanced structure like hierarchy, typechecks, etc.
    ...

I agree we should reduce our reliance on environment variables. I see a couple of reasonable uses for them:

  • debugging probes and other properties we want to be able to control directly from the outside, without changing code or needing access to the actual vllm command that's running, for example: VLLM_LOGGING_LEVEL, VLLM_CACHE_DIR, VLLM_DEBUG_DUMP_PATH (in #25651)
  • temporary switches for unstable experimental features

I feel intuitively, though not strongly, that the profiler flags fall more into the second kind of "instrumentation" type knobs, as opposed to "runtime configuration" type knobs. Maybe this is a meaningless distinction and everything should be a frontend flag.

Given the volume of knobs for the torch profiler, do you think we should have something like --profiler-config '{"mode": "torch", "max_iters": 10}', or more like --profiler torch --profiler-max-iters 10 ...? I think the former could be implemented a bit more cleanly, since it won't weigh down the frontend with too many new args.

If you feel strongly that we should switch to flags ASAP, I can take this on. If so, would you like it to be a part of this PR?
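For the record, a JSON-valued flag along these lines could be parsed with stock argparse. This is a hypothetical sketch of the proposed (not yet implemented) --profiler-config interface; the key names are illustrative:

```python
import argparse
import json

parser = argparse.ArgumentParser()
# A single JSON blob keeps the frontend surface small: one new flag
# instead of one flag per profiler knob.
parser.add_argument("--profiler-config", type=json.loads, default={})

args = parser.parse_args(
    ["--profiler-config", '{"mode": "torch", "max_iters": 10}']
)
# args.profiler_config is now a plain dict parsed from the JSON string.
```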

@mgoin (Member) commented Nov 19, 2025

If we do this, I'm going to strongly advocate for the --profiler-config '{"mode": "torch", "max_iters": 10}' style. I think we should not include that in this PR, since it will involve a lot of plumbing, and it is okay to land with env vars for now.

@markmc (Member) commented Nov 19, 2025

Maybe this is a meaningless distinction and everything should be a frontend flag.

I'm not super convinced about the distinction. I just left a comment on that in the issue.

Given the volume of knobs for the torch profiler, do you think we should have something like --profiler-config '{"mode": "torch", "max_iters": 10}'?

Absolutely, much more expressive and readable, especially if you're tweaking more than a handful of these knobs 👍

If you feel strongly that we should switch to flags ASAP, I can take this on. If so, would you like it to be a part of this PR?

I'm happy for it to be a separate PR, but merging this PR first does mean there'll be three more env vars to add backwards-compat support for.

```python
        if self._delay_iters == 0:
            self._call_start()

    def step(self) -> None:
```
A collaborator commented on this hunk:

We also need the ability to log the number of requests that got scheduled/executed per iteration and how many tokens per request.

@benchislett (Collaborator, Author) replied:

The profiling annotation will add num_new and num_cached request count information to the NVTX range for each iteration already, thanks to the changes to annotate_profile. As for the number of tokens, this should probably just be an NVTX mark or range inside the model runner. This can be added separately and is independent of this PR.

@wangshangsam (Collaborator) commented Nov 19, 2025

I'm fine with either env vars or CLI flags (or, if anyone prefers, a parameter to the /start_profile API), but I do hope we can enable iteration-based nsys profiling sooner rather than later. I myself have been getting a barrage of questions about when vLLM will support iteration-based profiling, and I would anticipate that others who work on vLLM at NV are in the same situation.

@mgoin (Member) left a review comment:

I think this looks great! It makes the torch profiler feel better too, so I'm fairly confident this is the right step. If you can fast-follow with a ProfilerConfig before the next release, that would be best, so we can remove the new env vars without needing a deprecation cycle.

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Nov 19, 2025
@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 19, 2025
@mgoin mgoin enabled auto-merge (squash) November 20, 2025 00:01
@vllm-bot vllm-bot merged commit fcbcba6 into vllm-project:main Nov 20, 2025
47 of 48 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Nov 20, 2025
RunkaiTao pushed a commit to RunkaiTao/vllm that referenced this pull request Nov 24, 2025
…oject#28987)

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…oject#28987)

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
…oject#28987)

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Labels

nvidia ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

Status: Done

5 participants