[CI] Fix ColBERT HF comparison tests on AMD CI + refactor #34567
vllm-bot merged 6 commits into vllm-project:main
Conversation
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This pull request introduces a fix for a crash on AMD CI in ColBERT HF comparison tests and significantly refactors the tests for better maintainability. The crash, caused by using Triton attention kernels on CPU tensors, is correctly addressed by conditionally setting attn_implementation="eager" when running on CPU. The refactoring consolidates three near-identical test functions into a single, parameterized test, which greatly reduces code duplication and improves clarity. Additionally, the PR resolves torch.tensor copy-construction warnings by correctly using torch.as_tensor. The changes are well-implemented and improve both the correctness and quality of the test suite. I have no further comments.
cc @ieBoytsov @DarkLight1337 @noooop Follow-up to #34170
```python
cls = BertModel if hf_spec["model_cls"] == "BertModel" else AutoModel
trust = hf_spec.get("trust_remote_code", False)

# Flash / Triton kernels require GPU tensors; fall back to eager on CPU.
```
Hi @AndreasKaratzas, nice job, the refactored tests look great. I don't yet understand why we need the explicit cast to eager attention. The motivation makes sense, but the tests passed everywhere except AMD? Could you elaborate on whether there is anything specific about AMD in this context? Thanks in advance : )
So, I just tested it on ROCm as well without the eager step, and it passes at least on MI355. I would still prefer to keep it, though, since it only targets the HF API, not vLLM. There is also a related ROCm ticket for this: #30167
It's the same reason there is this conftest.py under the parent of this TG.
Also, I am not sure the test would work on CPU CI without eager. Btw, I just tested it on MI325 as well without the eager patch, and it passes there too. So, all in all, I could remove it, but I think it's beneficial, and it only affects the HF API side, not vLLM evaluation.
Thanks, glad it is merged now!
The failure in this one has to do with …
noooop left a comment
Thanks for this fix.
Like @ieBoytsov, I also think this is great.
@noooop The distributed model tests failure looks completely unrelated to this PR: https://buildkite.com/vllm/ci/builds/52483/steps/canvas?jid=019c7c16-c973-4172-a5f5-06ad78c33d20&tab=output#019c7c16-c973-4172-a5f5-06ad78c33d20/L2942 Can we merge it?
…ct#34567) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
This PR fixes the `test_colbert_hf_comparison_modernbert` crash on AMD CI and refactors the ColBERT HF comparison tests for maintainability and cross-platform robustness.

`test_colbert_hf_comparison_modernbert` crashes on AMD CI nightly because the HF reference model runs on CPU while ModernBERT defaults to Triton-based flash attention.

`test_colbert_hf_comparison_modernbert` failure log

The HF ModernBERT model defaults to flash/Triton attention kernels, which require GPU tensors. The test loads the HF reference model on CPU via `AutoModel.from_pretrained(model_name)` without specifying an attention backend, so the Triton rotary embedding kernel fails when it receives CPU tensors. Introduced by #34170, caught on AMD CI nightly.
Multiple ColBERT tests emit `UserWarning: To copy construct from a tensor` due to `torch.tensor()` being called on existing tensors.

`torch.tensor` copy-construct warnings

`torch.tensor()` always copies data and emits this warning when given an existing tensor. Since these call sites don't need a copy, `torch.as_tensor()` is the correct idiom: it returns a view when possible and never warns.

Changes
Fix Triton crash: detect device and set attention implementation
A new `_load_hf_model` helper detects whether inference will run on CPU or GPU and passes `attn_implementation="eager"` on CPU to avoid Triton kernel crashes. On GPU it uses the default (flash attention). This is the actual bug fix.

Diff details
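As a rough sketch of the device-aware loading logic described above (the helper name `hf_model_kwargs` and its exact shape are illustrative, not the PR's actual code):

```python
from typing import Any


def hf_model_kwargs(device: str, trust_remote_code: bool = False) -> dict[str, Any]:
    """Build kwargs for AutoModel.from_pretrained for the target device.

    Flash/Triton attention kernels require GPU tensors, so on CPU we
    fall back to the eager attention implementation.
    """
    kwargs: dict[str, Any] = {"trust_remote_code": trust_remote_code}
    if device == "cpu":
        kwargs["attn_implementation"] = "eager"
    return kwargs


# Usage sketch:
# model = AutoModel.from_pretrained(model_name, **hf_model_kwargs("cpu"))
```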
Refactor: collapse three HF comparison tests into one parametrized test
The three `test_colbert_hf_comparison_{bert,modernbert,jina}` functions were near-identical (~80 lines of duplication), differing only in which model class, weights file, and `trust_remote_code` flag to use. Added an `hf_comparison` sub-dict to each `COLBERT_MODELS` entry to capture these differences declaratively, extracted shared logic into helpers, and replaced the three functions with a single `@pytest.mark.parametrize`-driven test. This ensures the Triton fix (and any future fix) only needs to exist in one place.

Diff details
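The consolidated shape is roughly the following (a simplified stand-in, not the PR's actual code: only the `bert` entry's values come from the PR, the second entry and the test body are illustrative):

```python
import pytest

# Simplified stand-in for the real COLBERT_MODELS table.
COLBERT_MODELS = {
    "bert": {
        "model": "answerdotai/answerai-colbert-small-v1",
        "hf_comparison": {"model_cls": "BertModel", "trust_remote_code": False},
    },
    "jina": {
        "model": "<jina-colbert-model>",  # placeholder, not the real model id
        "hf_comparison": {"model_cls": "AutoModel", "trust_remote_code": True},
    },
}


@pytest.mark.parametrize("model_key", list(COLBERT_MODELS))
def test_colbert_hf_comparison(model_key: str) -> None:
    spec = COLBERT_MODELS[model_key]
    hf_spec = spec["hf_comparison"]
    # Shared helpers (_load_hf_model, weight loading, comparison) go here,
    # driven entirely by the declarative hf_spec.
    assert "model_cls" in hf_spec
```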
`hf_comparison` metadata added to each model spec:

```diff
 "bert": {
     "model": "answerdotai/answerai-colbert-small-v1",
     "colbert_dim": 96,
     "max_model_len": 512,
     "extra_kwargs": {},
+    "hf_comparison": {
+        "weights_file": "model.safetensors",
+        "weights_key": "linear.weight",
+        "trust_remote_code": False,
+        "model_cls": "BertModel",
+    },
 },
```

(Similarly for `modernbert` and `jina`; `modernbert` uses `"1_Dense/model.safetensors"`, `jina` uses `"trust_remote_code": True`.)

New shared helpers for weight loading and embedding computation:
`.float()` is called unconditionally on both `hidden_states` and `linear_weight`: this is a no-op for bert/modernbert (already float32) and matches the explicit `.float()` calls the old Jina test required. Results are moved to CPU for device-agnostic comparison.

Three separate test functions replaced by one:
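A minimal sketch of such a shared embedding helper, assuming ColBERT's usual per-token L2 normalization (the helper name and exact signature are illustrative, not the PR's code):

```python
import torch
import torch.nn.functional as F


def colbert_embed(hidden_states: torch.Tensor,
                  linear_weight: torch.Tensor) -> torch.Tensor:
    """Project token states through the ColBERT linear head.

    .float() is a no-op when inputs are already float32, but makes the
    helper safe for half-precision checkpoints (the old Jina path).
    """
    proj = hidden_states.float() @ linear_weight.float().t()
    # L2-normalize per token, then move to CPU for device-agnostic comparison.
    return F.normalize(proj, p=2, dim=-1).cpu()
```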
Resolve `torch.tensor` copy-construct warnings

All `torch.tensor(existing_tensor)` calls were replaced with `torch.as_tensor()` across every test function to eliminate the `UserWarning`. `torch.as_tensor()` returns a view when possible and is the correct idiom when a copy is not needed.

Diff details
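The difference can be demonstrated in a few lines (an illustrative snippet, not taken from the PR):

```python
import warnings

import torch

t = torch.arange(4, dtype=torch.float32)

# torch.tensor() on an existing tensor always copies and emits a UserWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    copied = torch.tensor(t)
assert copied.data_ptr() != t.data_ptr()  # new storage was allocated
assert any("copy construct" in str(w.message) for w in caught)

# torch.as_tensor() returns the input itself when dtype/device already match.
same = torch.as_tensor(t)
assert same.data_ptr() == t.data_ptr()  # no copy, no warning
```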