
Revert back to torch.equal over torch.allclose from #28819 #29086

Merged
NickLucche merged 4 commits into vllm-project:main from eldarkurtic:equal-over-allclose
Nov 25, 2025

Conversation

@eldarkurtic
Contributor

@eldarkurtic eldarkurtic commented Nov 20, 2025

As pointed out in #28819, we already had some offline discussions (cc @benchislett) on whether torch.equal or torch.allclose should be used here to detect whether the draft model has exactly the same layers as the target model, in order to enable weight-sharing.
I have two major concerns with the torch.allclose approach:

  1. embed_tokens tends to receive very small updates during speculator training, making it look very close, but not equal, to the verifier model. torch.allclose opens a pathway to overriding those weights due to their similarity, which is not what a user who trained a speculator with distinct embed_tokens expects.
  2. torch.allclose also raises the question of how to tune atol/rtol, especially across dtypes (which will become even more problematic once people start using quantized models).
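To make the first concern concrete, here is a minimal sketch (the tensor values and perturbation size are illustrative, not taken from a real speculator checkpoint):

```python
import torch

# Hypothetical verifier embedding and a speculator copy that received tiny
# relative updates during training (well inside allclose's default rtol=1e-5).
target = torch.arange(1.0, 129.0).reshape(8, 16)
draft = target * (1 + 1e-6)

# allclose treats the weights as identical, which would wrongly trigger
# weight-sharing and silently discard the trained embed_tokens.
print(torch.allclose(target, draft))  # True
# equal demands exact element-wise equality, so the distinct weights survive.
print(torch.equal(target, draft))     # False
```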

I would advocate going back to torch.equal until someone demonstrates a use case where torch.equal fails to identify equal layers due to FP-precision issues. If that happens, we can update the logic to check for bitwise equality like this:

```python
# Reinterpret the raw bits as integers; int32 matches fp32's element size
# (fp16/bf16 tensors would need torch.int16 instead).
a_bits = a.view(torch.int32)
b_bits = b.view(torch.int32)

bitwise_equal = torch.equal(a_bits, b_bits)
```

The PR also adds CPU offloading to prevent the GPU OOM issues reported in #28819.
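A minimal sketch of what the offloaded check looks like (the helper name is hypothetical; the actual PR inlines the .cpu() calls at the comparison sites):

```python
import torch

def weights_match(a: torch.Tensor, b: torch.Tensor) -> bool:
    """Exact weight comparison that runs on the host.

    .cpu() copies each tensor to host memory (a no-op if it already lives
    there), so torch.equal's element-wise comparison allocates nothing on
    the GPU and cannot trigger a device OOM.
    """
    return torch.equal(a.cpu(), b.cpu())

# e.g. weights_match(target_embed_tokens.weight, draft_embed_tokens.weight)
```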

Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request reverts the weight comparison logic from torch.allclose back to torch.equal for determining whether to share weights between the target and draft models. The justification provided is sound, as using torch.equal avoids the ambiguity of atol/rtol tuning and prevents incorrect weight sharing for embed_tokens that might have small, meaningful updates. Additionally, the change to add .cpu() calls to the lm_head weight comparison is a great improvement for robustness, making it consistent with the embed_tokens comparison and preventing potential issues from tensors on different devices. The changes are correct and improve the codebase.

Collaborator

@NickLucche NickLucche left a comment


Is there any other model with a smaller footprint that we can use for the test that triggered this discussion in #28819?

Comment on lines 1032 to 1033
target_embed_tokens.weight.cpu(),
self.model.model.embed_tokens.weight.cpu(),
Collaborator


As I commented in the previous PR, I don't think this check should happen on CPU when both weights are already loaded.
If memory is that tight, it will crash during the dummy run instead of during load.

Collaborator


also qq: can we just share these two layers?

Contributor Author


No, we can't share the layers just by default. A good number of EAGLE-3 speculators train embed_tokens and/or lm_head from scratch.

Contributor Author

@eldarkurtic eldarkurtic Nov 21, 2025


As for the memory issue, re-reading the original #28819 makes me think it came from something else entirely: https://github.com/vllm-project/vllm/pull/28819/files#diff-66eeb1275496be8941bef653c4e581ac25dfd2ba463f704801c0729213ab5f34L175
and not from torch.equal running on a GPU (basically, one instantiation of a layer wasn't handled properly in the weight-sharing logic).

As for the general CPU-vs-GPU torch.equal comparison: if we can't fit torch.equal(t1, t2) on a GPU, then what is the purpose of that deployment? It won't have the memory for a single forward pass. Shall we revert this check back to GPU?

Contributor Author


@zhewenl could you perhaps confirm whether the GPU OOM happened because of torch.equal(t1, t2) running on a GPU, or because one instantiation of a layer wasn't handled properly in the weight-sharing logic (which you resolved here: https://github.com/vllm-project/vllm/pull/28819/files#diff-66eeb1275496be8941bef653c4e581ac25dfd2ba463f704801c0729213ab5f34L175)?

Collaborator


@eldarkurtic it's happening because torch.equal(t1, t2) runs on the GPU and OOMed on CI, which only has 24 GB of memory.

As for the general CPU-vs-GPU torch.equal comparison: if we can't fit torch.equal(t1, t2) on a GPU, then what is the purpose of that deployment? It won't have the memory for a single forward pass. Shall we revert this check back to GPU?

Yes, I fully agree, and from CI's perspective, we can check if we can replace the test suite with smaller models

Collaborator


@eldarkurtic

No, we can't share the layers just by default

Yeah, no, I meant loading the same weight once and then sharing the tensor between the draft and target models (if not done already)

@LucasWilkinson
Collaborator

cc @benchislett

Collaborator

@benchislett benchislett left a comment


I'm overall indifferent to running the computation on cpu vs gpu. I don't think it will make a big difference either way.

As for whether to use equal or allclose, it's hard to say. If we can't find a rigorous bound like a dtype epsilon or something, we should just use a direct .equal comparison. We should be conservative here and avoid risking inaccuracy where possible.
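The dtype-epsilon point can be seen directly from torch.finfo: machine epsilon spans several orders of magnitude across commonly served dtypes, so no single atol/rtol pair is obviously right:

```python
import torch

# Machine epsilon per dtype; any fixed atol/rtol would have to cover all of these.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    print(dtype, torch.finfo(dtype).eps)
# torch.float32  ~1.19e-07
# torch.float16  ~9.77e-04
# torch.bfloat16 ~7.81e-03
```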

@eldarkurtic
Contributor Author

Based on all the comments above, I see two potential resolutions:

  1. Offload tensors to CPU to run the check. @NickLucche could you perhaps clarify why you are against offloading to CPU?
  2. Use a smaller model in CI, as the current one doesn't make much sense on a 24 GB GPU.

@NickLucche
Collaborator

NickLucche commented Nov 24, 2025

could you perhaps clarify why you are against offloading to CPU?

It's more like the opposite: if both tensors are already on GPU, what's the point of offloading the check?
We're constantly trading memory for better inference performance. If we don't have space to hold a single weight matrix, then I think we have a bigger problem running vLLM on that particular setup to start with. We're basically allowing our CI infra to shape the code (albeit the impact is small in this case).

We use the smaller model in CI as the current one doesn’t make much sense on the 24GB GPU

I think this is the best option regardless, because if on the next commit any of our extra overhead (bookkeeping tensors like model runner v2, CUDA graphs, etc.) goes up even a little, we're back to the same problem.

Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
Collaborator

@NickLucche NickLucche left a comment


Discussed offline; leaving the check on CPU to unblock CI until we have a valid smaller model we can use for the test

@NickLucche NickLucche enabled auto-merge (squash) November 25, 2025 11:56
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 25, 2025
@NickLucche NickLucche merged commit 0231ce8 into vllm-project:main Nov 25, 2025
43 checks passed
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
vllm-project#29086)

Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
vllm-project#29086)

Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>

Labels

ready ONLY add when PR is ready to merge/full CI is needed speculative-decoding v1
