Revert back to torch.equal over torch.allclose from #28819 #29086
NickLucche merged 4 commits into vllm-project:main
Conversation
Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
Code Review
This pull request reverts the weight comparison logic from torch.allclose back to torch.equal for determining whether to share weights between the target and draft models. The justification provided is sound, as using torch.equal avoids the ambiguity of atol/rtol tuning and prevents incorrect weight sharing for embed_tokens that might have small, meaningful updates. Additionally, the change to add .cpu() calls to the lm_head weight comparison is a great improvement for robustness, making it consistent with the embed_tokens comparison and preventing potential issues from tensors on different devices. The changes are correct and improve the codebase.
NickLucche
left a comment
Is there any other model with smaller footprint that we can use for the test that triggered this discussion in #28819?
target_embed_tokens.weight.cpu(),
self.model.model.embed_tokens.weight.cpu(),
as I commented in the previous PR, I don't think this check should happen on CPU when both weights are already loaded.
If memory is that tight, it will crash during the dummy run instead of during load.
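For reference, the check being debated can be sketched roughly like this (names are illustrative, not the exact vLLM code; the trade-off is a host copy for the `.cpu()` variant versus running the comparison on a possibly memory-tight device):

```python
import torch

def should_share_weights(target_w: torch.Tensor, draft_w: torch.Tensor) -> bool:
    # Offload both tensors to CPU before comparing, so the comparison
    # never touches GPU memory; comparing in place would avoid the host
    # copy but run on the (possibly memory-tight) device instead.
    return torch.equal(target_w.cpu(), draft_w.cpu())

t = torch.randn(8, 16)
print(should_share_weights(t, t.clone()))  # True: identical weights
print(should_share_weights(t, t + 1.0))   # False: weights differ
```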
also qq: can we just share these two layers?
No, we can't share the layers by default. A good number of eagle-3 speculators train embed_tokens and/or lm_head from scratch.
As for the memory issue, re-reading the original #28819 makes me think that the memory issue came from a completely different thing: https://github.com/vllm-project/vllm/pull/28819/files#diff-66eeb1275496be8941bef653c4e581ac25dfd2ba463f704801c0729213ab5f34L175
and not from torch.equal running on a GPU (basically one instantiation of a layer hasn't been treated properly in the weight-sharing logic).
As for the general CPU-vs-GPU torch.equal comparison: if we can't fit torch.equal(t1, t2) on a GPU, then what is the purpose of that deployment? It won't have memory to fit even a single forward pass. Shall we revert this check back to GPU?
@zhewenl could you perhaps confirm if the GPU OOM happened because of torch.equal(t1, t2) running on a GPU or because of the fact that one instantiation of a layer hasn't been treated properly in the weight-sharing logic (which you resolved here: https://github.com/vllm-project/vllm/pull/28819/files#diff-66eeb1275496be8941bef653c4e581ac25dfd2ba463f704801c0729213ab5f34L175)?
@eldarkurtic it happened because torch.equal(t1, t2) was running on GPU and OOMed on CI, which only has 24 GB of memory.
As for the general CPU-vs-GPU torch.equal comparison: if we can't fit torch.equal(t1, t2) on a GPU, then what is the purpose of that deployment? It won't have memory to fit even a single forward pass. Shall we revert this check back to GPU?
Yes, I fully agree, and from CI's perspective, we can check if we can replace the test suite with smaller models
No, we can't share the layers just by default
Yeah, no, I meant loading the same weight once and then sharing the tensor between the draft and target models (if not done already)
cc @benchislett
benchislett
left a comment
I'm overall indifferent to running the comparison on CPU vs GPU. I don't think it will make a big difference either way.
As for whether to use equal or allclose, it's hard to say. If we can't find a rigorous bound like a dtype-epsilon or something, we should just use a direct .equal comparison. We should be conservative here and avoid risking inaccuracy where possible.
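A minimal illustration of the concern (illustrative values, not taken from the PR): a tiny but real training update to the embeddings passes `torch.allclose` at default tolerances, while `torch.equal` correctly reports a difference:

```python
import torch

target = torch.ones(4, 8)   # stands in for the target's embed_tokens
draft = target + 1e-6       # tiny but meaningful speculator update

# Default tolerances (rtol=1e-5, atol=1e-8) absorb the difference,
# so allclose would wrongly trigger weight sharing:
print(torch.allclose(target, draft))  # True
# An exact comparison keeps the distinct draft weights:
print(torch.equal(target, draft))     # False
```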
Based on all comments above, I see two potential resolutions:
It's more like the opposite: if both tensors are already on GPU, what's the point of offloading the check?
I think this is the best option regardless, because if on the next commit any of the extra overhead we have (bookkeeping tensors like model runner v2, CUDA graphs, etc.) goes up even a little, we're back to the same problem.
Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
NickLucche
left a comment
Discussed offline; leaving the check on CPU to unblock CI until we have a valid smaller model that we can use for the test
As pointed out in #28819, we already had some offline discussions (cc @benchislett) on whether `torch.equal` or `torch.allclose` should be used here to detect whether the draft model has exactly the same layers as the target model, to enable weight-sharing. I have two major concerns with the `torch.allclose` approach:
- `embed_tokens` tend to receive very small updates during speculators training, making them look very close, but not equal, to the verifier model. `torch.allclose` opens up a pathway of overriding those weights due to their similarity, which is not something that a user who trained a speculator with distinct `embed_tokens` expects.
- `torch.allclose` also opens up an additional set of questions about how to tune `atol`/`rtol`, especially across different dtypes (which will become even more problematic once people start using quantized models).

I would advocate going back to `torch.equal` until someone demonstrates a use case where `torch.equal` fails to identify equal layers due to FP-precision issues. If that happens, we can update the logic to check for bitwise equality like this:

The PR also adds CPU offloading to prevent the GPU OOM issues reported in #28819.
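The bitwise-equality fallback mentioned above (the original snippet is not reproduced here) could be sketched like this; `bitwise_equal` is a hypothetical helper that reinterprets the underlying bytes, so that values with identical bit patterns compare equal even when float semantics (e.g. NaN != NaN) would say otherwise:

```python
import torch

def bitwise_equal(a: torch.Tensor, b: torch.Tensor) -> bool:
    # Hypothetical helper: compare raw bit patterns instead of float values.
    if a.shape != b.shape or a.dtype != b.dtype:
        return False
    # Reinterpret the underlying bytes as uint8 and compare exactly.
    return torch.equal(a.contiguous().view(torch.uint8),
                       b.contiguous().view(torch.uint8))

# torch.equal treats NaN != NaN, so two bit-identical tensors containing
# NaNs are reported as different; the bitwise check is not fooled.
a = torch.tensor([float("nan"), 1.0])
b = a.clone()
print(torch.equal(a, b))    # False
print(bitwise_equal(a, b))  # True
```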