[Core][WIP] Check for GPU<->CPU sync during CI#40561
Code Review
This pull request introduces a GPU-CPU synchronization check mechanism via the VLLM_GPU_SYNC_CHECK environment variable, which is set to "error" by default in the Dockerfiles. The check is applied to the sample_tokens and execute_model methods in the V1 GPU worker using a new decorator. Feedback indicates that the with_gpu_sync_check decorator should be improved to restore the previous synchronization mode rather than resetting to default and should check the environment variable at runtime to support dynamic disabling.
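The two points in this feedback (restore the previous mode, and read the environment variable at call time) can be illustrated with a minimal sketch. This is not the PR's implementation: the decorator-factory signature and the injected `get_mode`/`set_mode` hooks are assumptions made so the example runs without a GPU. With PyTorch, the hooks would presumably be `torch.cuda.get_sync_debug_mode` and `torch.cuda.set_sync_debug_mode`.

```python
import functools
import os

ENV_VAR = "VLLM_GPU_SYNC_CHECK"  # name taken from this PR
VALID_MODES = ("warn", "error")  # modes accepted by torch.cuda.set_sync_debug_mode


def with_gpu_sync_check(get_mode, set_mode):
    """Hypothetical decorator factory; the PR's real signature may differ.

    get_mode/set_mode are injected (an assumption of this sketch) so that
    the logic can be exercised without CUDA.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Read the env var on every call so the check can be
            # switched off at runtime, not only at import time.
            mode = os.environ.get(ENV_VAR, "").lower()
            if mode not in VALID_MODES:
                return fn(*args, **kwargs)
            previous = get_mode()  # remember the caller's current mode
            set_mode(mode)
            try:
                return fn(*args, **kwargs)
            finally:
                # Restore the previous mode instead of resetting to default.
                set_mode(previous)
        return wrapper
    return decorator
```

Restoring in a `finally` block means nested or pre-existing sync-debug settings survive even when the wrapped method raises.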
Signed-off-by: Nick Hill <nickhill123@gmail.com>
This pull request has merge conflicts that must be resolved before it can be
Thanks for discovering this!
Would it be possible to warn when a GPU sync is found like this? That way, Transformers backend users can request that the Transformers team fix it.
@hmellor once this PR is merged, the CI will fail when such syncs occur.
The sync checker does have a warn mode but I don't think we want to enable that at runtime since it may have some overhead.
However, we can always add a logged warning next to any of the added `with gpu_sync_allowed()` blocks like these.
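As a rough illustration of that idea, the sketch below shows a context manager that suspends the check around a known sync and optionally logs a warning. It is not the PR's code: the `gpu_sync_allowed()` in the PR takes no arguments as quoted here, while the `get_mode`, `set_mode`, and `reason` parameters are hypothetical additions so the example runs without a GPU.

```python
import contextlib
import logging

logger = logging.getLogger(__name__)


@contextlib.contextmanager
def gpu_sync_allowed(get_mode, set_mode, reason=None):
    """Hypothetical sketch: suspend sync checking around a known sync.

    get_mode/set_mode stand in for torch.cuda.get_sync_debug_mode and
    torch.cuda.set_sync_debug_mode; `reason` is an illustrative parameter.
    """
    previous = get_mode()
    if reason is not None:
        # Optional logged warning next to the allowed sync.
        logger.warning("GPU<->CPU sync allowed here: %s", reason)
    set_mode("default")  # "default" disables sync warnings and errors
    try:
        yield
    finally:
        set_mode(previous)  # re-enable whatever mode was active before
```

Leaving `reason` unset keeps the block silent, which matches the preference for no warning over always warning.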
Ok, in that case I think I'd prefer no warning instead of always warning. I'll add something to the Transformers backend docs explaining how users can enable this for development so that they can catch syncs in their models.
Replacing with #43107.
vLLM now uses asynchronous scheduling by default in the majority of cases. Performance relies on the absence of any GPU<->CPU synchronizations on the main CUDA stream, but such syncs can be opaque and it is easy for them to creep in accidentally.
This change adds a `VLLM_GPU_SYNC_CHECK` env var which enables `torch.cuda.set_sync_debug_mode` for the model forward pass and sampler, so that we can easily check for such syncs.

I'm trying first to enable it globally in the CI to flush out syncs that need to be fixed or where they are unavoidable and the check needs to be suppressed. Will then probably split the fixes into separate PR(s).
Update
Started to open separate PRs fixing identified sync points: