[Perf][1/n] Eliminate various GPU<->CPU syncs #41429
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Code Review
This pull request implements several optimizations to reduce CPU-GPU synchronization across the vLLM codebase, primarily by moving tensor construction to the CPU and utilizing non-blocking transfers. Key improvements include the introduction of a platform-aware PIN_MEMORY constant to handle WSL environments, the use of index_fill_ and slice assignments to avoid implicit syncs, and the migration of sequence length processing to CPU-resident tensors. Review feedback highlights the need to use the existing pin_memory configuration in the Sampler class for WSL compatibility and recommends explicitly pinning tensors in the GPUModelRunner to ensure that non-blocking transfers are effective.
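The pattern the summary describes (build the tensor on the CPU, pin it when the platform allows, then copy with a non-blocking transfer) can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `USE_PINNED` and `to_device_async` are hypothetical names, and the flag only approximates what a platform-aware `PIN_MEMORY` constant might check.

```python
import torch

# Hypothetical stand-in for a platform-aware pin-memory flag.
# Pinning requires an accelerator, so fall back to pageable memory on CPU-only hosts.
USE_PINNED = torch.cuda.is_available()
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def to_device_async(values: list[int], device: torch.device) -> torch.Tensor:
    """Build a tensor on the CPU, then issue a non-blocking host-to-device copy."""
    cpu_tensor = torch.tensor(
        values, dtype=torch.int64, device="cpu", pin_memory=USE_PINNED
    )
    # With a pinned source, the copy can be queued without blocking the host.
    # With a pageable source, non_blocking=True is accepted but may still block.
    return cpu_tensor.to(device, non_blocking=True)


seq_lens = to_device_async([3, 1, 4], DEVICE)
```

The review feedback about pinning explicitly follows from the comment in the sketch: a non-blocking copy from pageable host memory generally cannot overlap with host work, so the flag and the `non_blocking=True` argument need to be used together.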
yewentao256 left a comment
LGTM, thanks for the work!
    ) -> None:
        prompt_bin_mask[expanded_idx_mapping] = 0
        output_bin_counts[expanded_idx_mapping] = 0
        # Use index_fill_ instead of `tensor[idx] = 0` to avoid sync.
Seems redundant comment
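For readers unfamiliar with the substitution under discussion, here is a small sketch showing that `index_fill_` along dimension 0 produces the same result as indexed assignment with a 1-D index tensor. The tensor names and shapes are made up for illustration; the claim that the `index_fill_` form avoids a sync comes from the code comment in the diff and is not demonstrated here.

```python
import torch


def zero_rows_indexing(t: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    # Indexed assignment form, as in the lines shown in the diff.
    t[idx] = 0
    return t


def zero_rows_index_fill(t: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    # In-place fill of the selected rows; the fill value is passed as a plain scalar.
    t.index_fill_(0, idx, 0)
    return t


# Illustrative data: 4 rows of counts, rows 0 and 2 get reset.
counts = torch.arange(12, dtype=torch.int32).reshape(4, 3)
rows = torch.tensor([0, 2], dtype=torch.int64)

via_indexing = zero_rows_indexing(counts.clone(), rows)
via_index_fill = zero_rows_index_fill(counts.clone(), rows)
```

Both calls modify the tensor in place, so the sketch clones `counts` before each one to compare the two results independently.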
Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Fix first batch of unnecessary GPU/CPU syncs, found via #40561:
Should benefit the following features:
- `logprob_token_ids` impl
- `bad_words` sampling parameter impl