[Bugfix] Initialize attention bias on the same device as Query/Key/Value #13468
simon-mo merged 1 commit into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add 🚀
Force-pushed f466344 to 87824ed
The pre-commit CI passed once, but failed after I signed off and force-pushed. I'm not sure why.
This could solve issues like huggingface/open-r1#278 and facebookresearch/xformers#1064 (comment) |
Signed-off-by: Junlin Zhou <jameszhou2108@hotmail.com>
Force-pushed 87824ed to 275d082
Using vllm==0.7.3, I'm still having this issue.
Same question; how do I solve it?
You need to either install from the main branch, or wait for a release. |
[Bugfix] Initialize attention bias on the same device as Query/Key/Value (vllm-project#13468) Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
The attention bias in vLLM's xformers backend is currently initialized on the default device, rather than the device of the Q/K/V tensors:
vllm/vllm/attention/backends/xformers.py, lines 676 to 677 in b53d799
And here is how xformers decides which device to use:
https://github.com/facebookresearch/xformers/blob/8d91ce05a2f6a5ae059593922a631b9ff325b134/xformers/ops/fmha/attn_bias.py#L742:
https://github.com/facebookresearch/xformers/blob/8d91ce05a2f6a5ae059593922a631b9ff325b134/xformers/ops/fmha/attn_bias.py#L90
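To make the mechanism concrete, here is a minimal, illustrative sketch (not xformers' actual internals; the helper name is made up) of how a bias materialized with a default `device` argument ends up off the Q/K/V device unless the caller passes one explicitly:

```python
import torch

def materialize_causal_bias(seq_len, dtype=torch.float32, device="cpu"):
    # Stand-in for an AttentionBias.materialize()-style helper: the device
    # defaults to a fixed value rather than following Q/K/V.
    bias = torch.full((seq_len, seq_len), float("-inf"),
                      dtype=dtype, device=device)
    return torch.triu(bias, diagonal=1)  # causal: mask out future positions

query = torch.randn(1, 4, 8)  # imagine this living on cuda:7 in the GRPO setup
# Passing query.device explicitly keeps the bias on the same device as Q:
bias = materialize_causal_bias(4, dtype=query.dtype, device=query.device)
```

If the `device=` argument is omitted, the bias is allocated on the helper's default device, which is exactly the mismatch this PR addresses.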
This becomes problematic when vLLM is used in conjunction with libraries like trl for GRPO training. In such cases, vLLM might be assigned to run on a specific GPU (e.g., the next available GPU after those used for training, which is the default behaviour of trl). For example, if I have 8 GPUs and use cuda:0 to cuda:6 for GRPO training, vLLM will then be assigned to cuda:7. However, the current attention bias initialization will place the bias on cuda:0, leading to a device-mismatch error at runtime. This PR will probably solve this issue.
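The fix pattern can be sketched as follows (a hedged illustration, not the PR's exact diff; `check_same_device` is a hypothetical guard, not vLLM code): create the bias with `device=query.device`, and optionally fail fast when any tensor disagrees:

```python
import torch

def check_same_device(*tensors):
    # Hypothetical guard: raise early with a readable message instead of a
    # deep kernel error when Q/K/V and the attention bias disagree.
    devices = {t.device for t in tensors}
    if len(devices) > 1:
        raise RuntimeError(f"tensors span multiple devices: {devices}")

q = k = v = torch.randn(2, 4, 8)           # all on one device (cuda:7 in the report)
bias = torch.zeros(4, 4, device=q.device)  # fixed: the bias follows q.device
check_same_device(q, k, v, bias)           # passes

# Before the fix, the bias landed on the default device (cuda:0 in the
# 8-GPU scenario above), which a guard like this would catch immediately.
```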