[ROCm][CI] Fix spec decode logprobs flakiness and parametrize tree attention backends #34599
Conversation
…hroughout Signed-off-by: Andreas Karatzas <akaratza@amd.com>
…ference backends Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This pull request addresses test flakiness on ROCm by disabling skinny GEMM in logprob tests and parametrizing tree attention tests over available backends. The changes include converting VllmRunner to use context managers for better resource management and adding KV cache layout adaptation. While the overall direction is correct, there are a few issues in the test setup that could lead to crashes or unexpected failures depending on the environment.
for backend in AttentionBackendEnum:
    if backend.value is not None and backend.get_path() == backend_path:
        return backend
This loop will raise a ValueError if it encounters an AttentionBackendEnum member with an empty string value (such as TORCH_SDPA). This is because get_path() explicitly checks for non-empty paths and raises an error otherwise. This could crash test collection on platforms where such backends are present.
for backend in AttentionBackendEnum:
    try:
        if backend.get_path() == backend_path:
            return backend
    except ValueError:
        continue

backends: list[AttentionBackendEnum] = []

# 1. Whatever the platform would auto-select at runtime.
backends.append(_get_platform_default_backend())
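The failure mode the bot describes can be reproduced in isolation. The sketch below uses a hypothetical miniature `Backend` enum and `resolve` helper (stand-ins for vLLM's real `AttentionBackendEnum` and the test's lookup loop, not the actual API) to show why the plain loop raises and how the `try`/`except` pattern skips members with no registered path:

```python
# Minimal sketch, assuming a simplified stand-in for AttentionBackendEnum
# whose get_path() raises ValueError for members with an empty value.
from enum import Enum


class Backend(Enum):
    FLASH_ATTN = "vllm.attn.FlashAttn"
    TORCH_SDPA = ""  # no backend path registered

    def get_path(self) -> str:
        if not self.value:
            # Mirrors the behavior described in the review comment.
            raise ValueError(f"{self.name} has no backend path")
        return self.value


def resolve(backend_path: str) -> "Backend | None":
    for backend in Backend:
        try:
            if backend.get_path() == backend_path:
                return backend
        except ValueError:
            continue  # skip members without a registered path
    return None


# Matching path resolves; an unknown path returns None instead of crashing
# on TORCH_SDPA's empty value.
assert resolve("vllm.attn.FlashAttn") is Backend.FLASH_ATTN
assert resolve("missing") is None
```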
The platform default backend should be filtered against known incompatible backends (as documented in the TODOs below) before being added to the reference backends list. If ROCM_AITER_FA or ROCM_ATTN is selected as the default by the platform (e.g., via environment variables like VLLM_ROCM_USE_AITER), the test will fail due to the documented incompatibilities.
default_backend = _get_platform_default_backend()
if default_backend not in (AttentionBackendEnum.ROCM_AITER_FA,
AttentionBackendEnum.ROCM_ATTN):
    backends.append(default_backend)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
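The suggested filtering can be sketched standalone. Here `AttnBackend`, `INCOMPATIBLE`, and `reference_backends` are hypothetical names for illustration, not the test's real identifiers:

```python
# Hedged sketch: build the reference-backend list while excluding backends
# documented as incompatible with the tree-attention test.
from enum import Enum, auto


class AttnBackend(Enum):
    FLASH_ATTN = auto()
    TRITON_ATTN = auto()
    ROCM_AITER_FA = auto()
    ROCM_ATTN = auto()


# Known-incompatible reference backends per the review comment.
INCOMPATIBLE = {AttnBackend.ROCM_AITER_FA, AttnBackend.ROCM_ATTN}


def reference_backends(platform_default: AttnBackend) -> list[AttnBackend]:
    backends: list[AttnBackend] = []
    # Add the platform default only if it is not a known-incompatible
    # backend (e.g. one selected via an env var like VLLM_ROCM_USE_AITER).
    if platform_default not in INCOMPATIBLE:
        backends.append(platform_default)
    return backends


assert reference_backends(AttnBackend.TRITON_ATTN) == [AttnBackend.TRITON_ATTN]
assert reference_backends(AttnBackend.ROCM_AITER_FA) == []
```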
SageMoore
left a comment
Looks reasonable. One minor nit.
tests/v1/sample/test_logprobs.py
Outdated
    contention. Both use identical chunked prefill settings and eager
    mode to control for infrastructure differences.

    On ROCm, the custom skinny GEMM kernels are non-deterministic
Nit: I think one block comment describing why we are disabling skinny gemms is sufficient :).
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
…tention backends (vllm-project#34599) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
This PR fixes `V1 Test others` test flakiness on ROCm.

`test_logprobs.py`

`test_spec_decode_logprobs` was intermittently failing on ROCm due to logprob differences between the base and speculative-decode LLM that were misattributed to spec decode itself. The real cause is ROCm skinny GEMM non-determinism: the `wvSplitK` kernels in `gemm_kernels.cu` use persistent workgroup scheduling and wave-level shuffle reductions that produce different results across LLM instantiations, even with identical configs and seeds.

The fix disables the skinny GEMM via `VLLM_ROCM_USE_SKINNY_GEMM=0` for this test. Descriptive assertion messages are added for easier future triage.

Additional cleanup: converted standalone `VllmRunner` instantiations throughout the file to use context managers for proper resource cleanup.

`test_tree_attention.py`

Parametrizes `test_tree_attn_correctness` over all reference attention backends available on the current platform rather than hardcoding `FLASH_ATTN`. On ROCm this includes `TRITON_ATTN` and the platform default. Adds KV cache layout adaptation (flash <-> block) so backends with different cache layouts can be used as references. Documents known incompatibilities with `ROCM_ATTN` (paged layout) and `ROCM_AITER_FA` (head count mismatch) as TODOs.
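The flash <-> block layout adaptation mentioned above amounts to reordering KV-cache axes. The shapes below are illustrative assumptions, not vLLM's actual cache layouts: here "flash" is `(2, num_blocks, block_size, num_heads, head_dim)` and "block" swaps the first two axes; `flash_to_block`/`block_to_flash` are hypothetical helper names.

```python
# Hedged sketch of a KV-cache layout adapter between two assumed layouts:
#   flash: (2, num_blocks, block_size, num_heads, head_dim)
#   block: (num_blocks, 2, block_size, num_heads, head_dim)
import numpy as np


def flash_to_block(kv: np.ndarray) -> np.ndarray:
    # Swap the K/V axis and the block axis; values are unchanged,
    # only the memory layout differs.
    return np.ascontiguousarray(kv.transpose(1, 0, 2, 3, 4))


def block_to_flash(kv: np.ndarray) -> np.ndarray:
    # The inverse permutation happens to be the same axis swap.
    return np.ascontiguousarray(kv.transpose(1, 0, 2, 3, 4))


rng = np.random.default_rng(0)
kv_flash = rng.random((2, 4, 16, 8, 64))  # (2, blocks, block_size, heads, dim)
kv_block = flash_to_block(kv_flash)

assert kv_block.shape == (4, 2, 16, 8, 64)
# Round-tripping recovers the original cache exactly.
assert np.array_equal(block_to_flash(kv_block), kv_flash)
```

With an adapter like this, a reference backend that expects the other layout can still consume the same cache contents, which is what makes cross-backend parametrization possible.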