
Revert back to torch.equal over torch.allclose from #28819 #29086

Merged
NickLucche merged 4 commits into vllm-project:main from eldarkurtic:equal-over-allclose
Nov 25, 2025

Conversation

@eldarkurtic
Contributor

@eldarkurtic eldarkurtic commented Nov 20, 2025

As pointed out in #28819, we already had some offline discussions (cc @benchislett) on whether torch.equal or torch.allclose should be used here to detect whether the draft model has exactly the same layers as the target model, in order to enable weight-sharing.
I have two major concerns with the torch.allclose approach:

  1. embed_tokens tends to receive very small updates during speculator training, making it look very close, but not equal, to the verifier model. torch.allclose opens a pathway to overriding those weights due to their similarity, which is not what a user who trained a speculator with distinct embed_tokens expects.
  2. torch.allclose also raises the question of how to tune atol/rtol, especially across dtypes (which will become even more problematic once people start using quantized models).
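To make the first concern concrete, here is a minimal sketch (the tensor values and perturbation size are illustrative, not taken from a real speculator checkpoint):

```python
import torch

# Hypothetical verifier embedding and a speculator copy that received tiny
# relative updates during training (well inside allclose's default rtol=1e-5).
target = torch.arange(1.0, 129.0).reshape(8, 16)
draft = target * (1 + 1e-6)

# allclose treats the weights as identical, which would wrongly trigger
# weight-sharing and silently discard the trained embed_tokens.
print(torch.allclose(target, draft))  # True
# equal demands exact element-wise equality, so the distinct weights survive.
print(torch.equal(target, draft))     # False
```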

I would advocate going back to torch.equal until someone demonstrates a use case where torch.equal fails to identify equal layers due to FP-precision issues. If that happens, we can update the logic to check for bitwise equality like this:

```python
# Reinterpret the raw bits as integers; int32 matches fp32's element size
# (fp16/bf16 tensors would need torch.int16 instead).
a_bits = a.view(torch.int32)
b_bits = b.view(torch.int32)

bitwise_equal = torch.equal(a_bits, b_bits)
```

The PR also adds CPU offloading to prevent the GPU OOM issues reported in #28819.
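A minimal sketch of what the offloaded check looks like (the helper name is hypothetical; the actual PR inlines the .cpu() calls at the comparison sites):

```python
import torch

def weights_match(a: torch.Tensor, b: torch.Tensor) -> bool:
    """Exact weight comparison that runs on the host.

    .cpu() copies each tensor to host memory (a no-op if it already lives
    there), so torch.equal's element-wise comparison allocates nothing on
    the GPU and cannot trigger a device OOM.
    """
    return torch.equal(a.cpu(), b.cpu())

# e.g. weights_match(target_embed_tokens.weight, draft_embed_tokens.weight)
```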

Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request reverts the weight comparison logic from torch.allclose back to torch.equal for determining whether to share weights between the target and draft models. The justification provided is sound, as using torch.equal avoids the ambiguity of atol/rtol tuning and prevents incorrect weight sharing for embed_tokens that might have small, meaningful updates. Additionally, the change to add .cpu() calls to the lm_head weight comparison is a great improvement for robustness, making it consistent with the embed_tokens comparison and preventing potential issues from tensors on different devices. The changes are correct and improve the codebase.

Collaborator

@NickLucche NickLucche left a comment


Is there any other model with a smaller footprint that we can use for the test that triggered this discussion in #28819?

Comment on lines 1032 to 1033
target_embed_tokens.weight.cpu(),
self.model.model.embed_tokens.weight.cpu(),
Collaborator


As I commented in the previous PR, I don't think this check should happen on CPU when both weights are already loaded.
If memory is that tight, it will crash during the dummy run instead of during load.

Collaborator


also qq: can we just share these two layers?

Contributor Author


No, we can't share the layers just by default. A good number of EAGLE-3 speculators train embed_tokens and/or lm_head from scratch.

Contributor Author

@eldarkurtic eldarkurtic Nov 21, 2025


As for the memory issue, re-reading the original #28819 makes me think it came from something else entirely: https://github.com/vllm-project/vllm/pull/28819/files#diff-66eeb1275496be8941bef653c4e581ac25dfd2ba463f704801c0729213ab5f34L175
and not from torch.equal running on a GPU (basically, one instantiation of a layer wasn't handled properly in the weight-sharing logic).

As for the general CPU-vs-GPU torch.equal comparison: if we can't fit torch.equal(t1, t2) on a GPU, then what is the purpose of that deployment? It won't have the memory for a single forward pass. Shall we revert this check back to GPU?

Contributor Author


@zhewenl could you perhaps confirm whether the GPU OOM happened because of torch.equal(t1, t2) running on a GPU, or because one instantiation of a layer wasn't handled properly in the weight-sharing logic (which you resolved here: https://github.com/vllm-project/vllm/pull/28819/files#diff-66eeb1275496be8941bef653c4e581ac25dfd2ba463f704801c0729213ab5f34L175)?

Collaborator


@eldarkurtic it's happening because torch.equal(t1, t2) runs on the GPU and OOMed on CI, which only has 24 GB of memory.

As for the general CPU-vs-GPU torch.equal comparison: if we can't fit torch.equal(t1, t2) on a GPU, then what is the purpose of that deployment? It won't have the memory for a single forward pass. Shall we revert this check back to GPU?

Yes, I fully agree, and from CI's perspective, we can check if we can replace the test suite with smaller models

Collaborator


@eldarkurtic

No, we can't share the layers just by default

Yeah, no, I meant loading the same weight once and then sharing the tensor between the draft and target models (if not done already)

@LucasWilkinson
Collaborator

cc @benchislett

Collaborator

@benchislett benchislett left a comment


I'm overall indifferent to running the computation on cpu vs gpu. I don't think it will make a big difference either way.

As for whether to use equal or allclose, it's hard to say. If we can't find a rigorous bound like a dtype epsilon or something, we should just use a direct .equal comparison. We should be conservative here and avoid risking inaccuracy where possible.
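The dtype-epsilon point can be seen directly from torch.finfo: machine epsilon spans several orders of magnitude across commonly served dtypes, so no single atol/rtol pair is obviously right:

```python
import torch

# Machine epsilon per dtype; any fixed atol/rtol would have to cover all of these.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    print(dtype, torch.finfo(dtype).eps)
# torch.float32  ~1.19e-07
# torch.float16  ~9.77e-04
# torch.bfloat16 ~7.81e-03
```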

@eldarkurtic
Contributor Author

Based on all the comments above, I see two potential resolutions:

  1. Offload tensors to CPU to run the check. @NickLucche could you perhaps clarify why you are against offloading to CPU?
  2. Use a smaller model in CI, as the current one doesn't make much sense on a 24 GB GPU.

@NickLucche
Collaborator

NickLucche commented Nov 24, 2025

could you perhaps clarify why you are against offloading to CPU?

It's more like the opposite: if both tensors are already on GPU, what's the point of offloading the check?
We're constantly trading memory for better inference performance. If we don't have space to hold a single weight matrix, then I think we have a bigger problem running vLLM on that particular setup to start with. We're basically allowing our CI infra to shape the code (albeit the impact is small in this case).

We use the smaller model in CI as the current one doesn’t make much sense on the 24GB GPU

I think this is the best option regardless, because if on the next commit any of our extra overhead (bookkeeping tensors like model runner v2, CUDA graphs, etc.) goes up even a little, we're back to the same problem.

Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
Collaborator

@NickLucche NickLucche left a comment


Discussed offline; leaving the check on CPU to unblock CI until we have a valid smaller model we can use for the test

@NickLucche NickLucche enabled auto-merge (squash) November 25, 2025 11:56
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 25, 2025
@NickLucche NickLucche merged commit 0231ce8 into vllm-project:main Nov 25, 2025
43 checks passed
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
vllm-project#29086)

Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
vllm-project#29086)

Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>

Labels

ready ONLY add when PR is ready to merge/full CI is needed speculative-decoding v1
