[Feat] Iteration-level profiling for Torch and CUDA profiler #28987

vllm-bot merged 2 commits into vllm-project:main
Conversation
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Code Review
This pull request introduces valuable iteration-level profiling controls for the Torch and CUDA profilers, allowing for more lightweight and targeted profiling. The refactoring of profiling logic into a common WorkerProfiler base class is a great improvement for code structure and maintainability. The new features are well-covered by unit tests. I've found one critical issue in the base profiler class that needs to be addressed.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
Nice! Given you're adding a bunch more configurable knobs to the profiler, would now be a good time to add proper config support rather than continuing to use env vars? See #25700
@markmc I'm 50/50 on converting these into flags. I think it's fairly non-intrusive as-is, and it would take some care to get the frontend flags right, since there are so many options for the torch profiler. Referring to the discussion in that issue for guidance:

I feel intuitively, though not strongly, that the profiler flags fall more into the second kind of "instrumentation" type knobs, as opposed to "runtime configuration" type knobs. Maybe this is a meaningless distinction and everything should be a frontend flag. Given the volume of knobs for the torch profiler, do you think we should have something like

If you feel strongly that we should switch to flags ASAP, I can take this on. If so, would you like it to be a part of this PR?
If we do this, I'm going to strongly advocate for the
I'm not super convinced about the distinction. I just left a comment on that in the issue.
Absolutely, much more expressive and readable, especially if you're tweaking more than a handful of these knobs 👍
I'm happy for it to be a separate PR, but merging this PR first does mean there'll be three more env vars to add backwards-compat support for.
```python
if self._delay_iters == 0:
    self._call_start()

def step(self) -> None:
```
We also need the ability to log the number of requests that got scheduled/executed per iteration and how many tokens per request.
The profiling annotation will already add num_new and num_cached request-count information to the NVTX range for each iteration, thanks to the changes to annotate_profile. As for the number of tokens, this should probably just be an NVTX mark or range inside the model runner. This can be added separately and is independent of this PR.
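As a rough illustration of the pattern described above: wrapping each engine iteration in an NVTX range whose name carries the request counts. The helper name and the stub NVTX object below are mine, not vLLM's API; in vLLM the real markers would come from `torch.cuda.nvtx`, the stub just lets the sketch run without a GPU.

```python
from contextlib import contextmanager


# Stand-in for torch.cuda.nvtx so the sketch runs without a GPU; the real
# range_push/range_pop calls emit ranges visible in Nsight Systems.
class _NvtxStub:
    def __init__(self) -> None:
        self.stack: list[str] = []

    def range_push(self, msg: str) -> None:
        self.stack.append(msg)

    def range_pop(self) -> None:
        self.stack.pop()


nvtx = _NvtxStub()


@contextmanager
def annotate_iteration(num_new: int, num_cached: int):
    """Wrap one engine iteration in an NVTX range whose name carries the
    scheduled-request counts, so they appear on the profiler timeline."""
    label = f"engine_step(num_new={num_new}, num_cached={num_cached})"
    nvtx.range_push(label)
    try:
        yield label
    finally:
        nvtx.range_pop()


with annotate_iteration(num_new=3, num_cached=5) as label:
    pass  # model execution would happen here
```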
I'm fine with either env vars or CLI flags (or, if anyone prefers, as a parameter to the
mgoin left a comment
I think this looks great! It makes the torch profiler feel better too, so I'm fairly confident this is a step in the right direction. If you can fast-follow a `ProfilerConfig` before the next release, that would be best so we can remove the new env vars without needing deprecation.
…oject#28987) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
Purpose
This PR introduces `VLLM_PROFILER_DELAY_ITERS` and `VLLM_PROFILER_MAX_ITERS`, which control the behaviour of the Torch and CUDA profilers. Respectively, they offset the start of profiling and limit the total number of engine iterations that are profiled.

This mostly serves to facilitate lightweight profiling of heavy workloads on large models: to measure high-throughput traffic, we need to send a lot of requests to push the batch size high enough to be realistic, and we need to run for some time in order to warm up the hardware and reach a steady state. This leads to extremely large profiling footprints which can take a long time to record, serialize, store on disk, and process. This PR enables us to capture a small slice of a long benchmarking run, making it easy to capture a precise engine time slice.
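For example, using the delay/max values from the test plan below (the export form is just one way to set these; any mechanism that puts them in the server's environment works):

```shell
# Skip the first 128 engine iterations, then profile 4 iterations.
export VLLM_PROFILER_DELAY_ITERS=128
export VLLM_PROFILER_MAX_ITERS=4
```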
This change is compatible with both the Torch profiler and the CUDA profiler, with the exception that it is not currently integrated with the CPU-side AsyncLLM Torch profiling, which remains on for the full duration by default. I added an opt-in variable to disable this for cases where we need a Torch profile capturing a small slice of a long run.
In doing this, I refactored most of the torch profiling code out of `gpu_worker.py` and into `gpu_profiler.py`, alongside the CUDA profiler. A common base class coordinates the delay and max-limit calculations, and is thoroughly unit-tested.
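A minimal sketch of how such a base class could coordinate the two env vars; the class and method names below are illustrative (only `_call_start` and `step` echo the hunk quoted in the review above), and the actual vLLM implementation may differ:

```python
import os


class WorkerProfiler:
    """Sketch: defer the real profiler start by VLLM_PROFILER_DELAY_ITERS
    engine iterations, then stop after VLLM_PROFILER_MAX_ITERS profiled
    iterations. Subclasses would override _call_start/_call_stop to drive
    the Torch or CUDA profiler."""

    def __init__(self) -> None:
        self._delay_iters = int(os.environ.get("VLLM_PROFILER_DELAY_ITERS", "0"))
        self._max_iters = int(os.environ.get("VLLM_PROFILER_MAX_ITERS", "0"))  # 0 = no limit
        self._armed = False
        self.active = False
        self._remaining_delay = 0
        self._profiled_iters = 0

    def start(self) -> None:
        # Arm the profiler; with no configured delay, start immediately.
        self._armed = True
        self._remaining_delay = self._delay_iters
        self._profiled_iters = 0
        if self._delay_iters == 0:
            self._call_start()

    def step(self) -> None:
        # Called once per engine iteration to advance the delay/limit logic.
        if not self._armed:
            return
        if not self.active:
            self._remaining_delay -= 1
            if self._remaining_delay <= 0:
                self._call_start()
            return
        self._profiled_iters += 1
        if self._max_iters and self._profiled_iters >= self._max_iters:
            self.stop()

    def stop(self) -> None:
        if self.active:
            self._call_stop()
        self._armed = False

    def _call_start(self) -> None:  # overridden by Torch/CUDA subclasses
        self.active = True

    def _call_stop(self) -> None:
        self.active = False
```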
Also, I extended `annotate_profile` to work with the CUDA profiler, adding a handy NVTX range over each engine iteration.

TODO: update documentation
Test Plan
The common logic introduced in this PR is unit-tested. I manually checked that the integration works properly for both the CUDA and Torch profilers. Here are screenshots of a 256-batch-size slice with a delay of 128 and a max of 4 iterations:
Unit tests pass locally. I need to make sure they get run in CI; I think they should be covered by the `tests/v1/worker` runs, but I should double-check.