[Hybrid] Pass kernel block size to builders by tdoublep · Pull Request #27753 · vllm-project/vllm

tdoublep · 2025-10-29T15:58:40Z

Purpose

This PR makes two changes:

The GPU model runner now passes the kernel block size to the metadata builders. This fixes a pretty bad bug that exists on main.
I also restrict the kernel block sizes for the FlashAttention backend, based on the discussion in [Bug]: Cache malformation in hybrid models with SSM cache dtype float32 and block allocation wrap around #27264. This is required because, for block_size >= 128, FA will read partial block data that may contain NaNs if the mamba state is in fp32, which leads to NaNs appearing in the output.
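
To make the first change concrete, here is a minimal sketch of how a block-size mismatch can go wrong. It is not vLLM code: the helper names (`expand_to_kernel_blocks`, `blocks_in_use`) and the sizes (128 and 16) are assumptions chosen for illustration. The idea is that the KV cache manager may allocate large blocks that the attention kernel addresses as several smaller kernel blocks, so a metadata builder that divides sequence lengths by the manager block size will disagree with a block table expressed in kernel blocks.

```python
def expand_to_kernel_blocks(manager_block_ids, manager_block_size, kernel_block_size):
    """Map each manager-level block ID to the kernel-level block IDs it covers.

    Hypothetical helper for illustration; assumes kernel blocks are laid out
    contiguously inside each manager block.
    """
    if manager_block_size % kernel_block_size != 0:
        raise ValueError("manager block size must be a multiple of kernel block size")
    ratio = manager_block_size // kernel_block_size
    return [bid * ratio + i for bid in manager_block_ids for i in range(ratio)]


def blocks_in_use(seq_len, block_size):
    """Number of blocks needed to hold seq_len tokens (ceiling division)."""
    return -(-seq_len // block_size)


MANAGER_BLOCK_SIZE = 128  # assumed allocation granularity (illustrative)
KERNEL_BLOCK_SIZE = 16    # assumed granularity the kernel indexes with (illustrative)

# One manager block (ID 3) is addressed by the kernel as eight smaller blocks.
kernel_table = expand_to_kernel_blocks([3], MANAGER_BLOCK_SIZE, KERNEL_BLOCK_SIZE)

seq_len = 40
# Block count computed with the kernel block size, matching the kernel-level table.
correct = blocks_in_use(seq_len, KERNEL_BLOCK_SIZE)
# Block count computed with the manager block size, which no longer matches the table.
mismatched = blocks_in_use(seq_len, MANAGER_BLOCK_SIZE)

print(kernel_table[:correct])     # kernel blocks that actually hold the tokens
print(kernel_table[:mismatched])  # what a mismatched builder would consider in use
```

In this toy case the mismatched count covers only the first kernel block, so most of the cached tokens would be ignored or misaddressed. The real failure mode depends on what each builder does with the block size.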

Test Plan

I'm using the following test script:

import os

# Select the attention backend before importing vllm.
os.environ['VLLM_ATTENTION_BACKEND'] = 'FLASH_ATTN'
#os.environ['VLLM_ATTENTION_BACKEND'] = 'FLASHINFER'

from vllm import LLM, SamplingParams

prompts = ["Solve the following math problem step by step. The last line of your response should be of the form Answer: $Answer (without quotes) where $Answer is the answer to the problem.\n\nConsider the paths of length $16$ that follow the lines from the lower left corner to the upper right corner on an $8\\times 8$ grid. Find the number of such paths that change direction exactly four times, as in the examples shown below.\n\nRemember to put your answer on its own line after \"Answer:\"."]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    trust_remote_code=True,
    num_gpu_blocks_override=10,
    compilation_config={'cudagraph_capture_sizes': [1]},
    mamba_ssm_cache_dtype="float32",
    max_num_seqs=1,
    enable_prefix_caching=True,
)

outputs = []
for _ in range(10):
    outputs.append(llm.generate(prompts, sampling_params)[0])

# Print the prompt once, then each generation's text and token IDs.
print(f"Prompt: {prompts[0]!r}")
for output in outputs:
    generated_text = output.outputs[0].text
    print(f"Generated text: {generated_text!r}")
    print(f"Generated token IDs: {output.outputs[0].token_ids!r}")

On main, using FLASH_ATTN, this produces:

Generated text: '</think>\nThe problem requires finding the number of paths of length 16'
Generated token IDs: [1885, 74045, 1561, 1784, 4127, 10867, 13170, 1278, 2782, 1307, 22344, 1307, 5592, 1032, 1049, 1054]
Generated text: ''
Generated token IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Generated text: ''
Generated token IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Generated text: ''
Generated token IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Generated text: ''
Generated token IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Generated text: ''
Generated token IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Generated text: ''
Generated token IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Generated text: ''
Generated token IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Generated text: ''
Generated token IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Generated text: ''
Generated token IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

and on main with FLASHINFER produces:

Generated text: ''
Generated token IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Generated text: ''
Generated token IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Generated text: ''
Generated token IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Generated text: ''
Generated token IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Generated text: ''
Generated token IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Generated text: ''
Generated token IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Generated text: ''
Generated token IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Generated text: ''
Generated token IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Generated text: ''
Generated token IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Generated text: ''
Generated token IDs: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
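
The failure on main is easy to spot by eye, but a small check makes before/after comparisons less error-prone. The sketch below assumes the failure signature seen in the logs above (every generated token ID equal to 0). The helper names are hypothetical and not part of vLLM, and token ID 0 may be legitimate in other contexts.

```python
def is_degenerate(token_ids):
    """True if a non-empty generation consists only of token ID 0."""
    return len(token_ids) > 0 and all(t == 0 for t in token_ids)


def count_degenerate(generations):
    """Count generations that match the all-zero failure signature."""
    return sum(1 for ids in generations if is_degenerate(ids))


# Shortened stand-ins for the logged outputs: one healthy run, two corrupted ones.
sample = [
    [1885, 74045, 1561, 1784],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(f"{count_degenerate(sample)} of {len(sample)} generations look corrupted")
```

With the test script above, the input could be built as `[o.outputs[0].token_ids for o in outputs]`.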

Test Result

With this PR, FLASH_ATTN produces:

Generated text: '</think>\nThe problem requires finding the number of paths of length 16'
Generated token IDs: [1885, 74045, 1561, 1784, 4127, 10867, 13170, 1278, 2782, 1307, 22344, 1307, 5592, 1032, 1049, 1054]
Generated text: ' \n\nExample 1: \n\nExample 2:\n\n\n<think>\n'
Generated token IDs: [1032, 1267, 20396, 1032, 1049, 1058, 1032, 1267, 20396, 1032, 1050, 2100, 1010, 49250, 2077, 1561]
Generated text: '</think>\nThe number of paths of length 16 from the lower left'
Generated token IDs: [1885, 74045, 1561, 1784, 2782, 1307, 22344, 1307, 5592, 1032, 1049, 1054, 1562, 1278, 4953, 3979]
Generated text: '</think>\nThe problem involves finding the number of paths of length 16'
Generated token IDs: [1885, 74045, 1561, 1784, 4127, 19263, 13170, 1278, 2782, 1307, 22344, 1307, 5592, 1032, 1049, 1054]
Generated text: ' output: To solve the problem of finding the number of paths of length 1'
Generated token IDs: [4848, 1058, 3870, 15047, 1278, 4127, 1307, 13170, 1278, 2782, 1307, 22344, 1307, 5592, 1032, 1049]
Generated text: ' \n</think>\nThe problem requires finding the number of paths of length '
Generated token IDs: [1032, 1010, 1885, 74045, 1561, 1784, 4127, 10867, 13170, 1278, 2782, 1307, 22344, 1307, 5592, 1032]
Generated text: '</think>\nThe problem requires finding the number of paths of length 16'
Generated token IDs: [1885, 74045, 1561, 1784, 4127, 10867, 13170, 1278, 2782, 1307, 22344, 1307, 5592, 1032, 1049, 1054]
Generated text: ' \n</think>\nThe paths of length 16 from the lower left'
Generated token IDs: [1032, 1010, 1885, 74045, 1561, 1784, 22344, 1307, 5592, 1032, 1049, 1054, 1562, 1278, 4953, 3979]
Generated text: " \n\n<think>\nOkay, let's see. I need to find the"
Generated token IDs: [1032, 1267, 49250, 2077, 1561, 44053, 1044, 2878, 1681, 3219, 1046, 1362, 2534, 1317, 3081, 1278]
Generated text: '<think>\nOkay, so I need to find the number of paths on an'
Generated token IDs: [1060, 74045, 1561, 44053, 1044, 1878, 1362, 2534, 1317, 3081, 1278, 2782, 1307, 22344, 1408, 1420]

and FLASHINFER produces:

Generated text: '</think>\nThe problem requires finding the number of paths of length 16'
Generated token IDs: [1885, 74045, 1561, 1784, 4127, 10867, 13170, 1278, 2782, 1307, 22344, 1307, 5592, 1032, 1049, 1054]
Generated text: ' \n\nExample paths:\n- Path 1: Right, Right, Down,'
Generated token IDs: [1032, 1267, 20396, 22344, 1877, 1045, 17669, 1032, 1049, 1058, 21285, 1044, 21285, 1044, 16999, 1044]
Generated text: '</think>\nThe number of paths of length 16 from the lower left'
Generated token IDs: [1885, 74045, 1561, 1784, 2782, 1307, 22344, 1307, 5592, 1032, 1049, 1054, 1562, 1278, 4953, 3979]
Generated text: '</think>\nThe problem involves finding the number of paths of length 16'
Generated token IDs: [1885, 74045, 1561, 1784, 4127, 19263, 13170, 1278, 2782, 1307, 22344, 1307, 5592, 1032, 1049, 1054]
Generated text: ' output: To solve the problem of finding the number of paths of length 1'
Generated token IDs: [4848, 1058, 3870, 15047, 1278, 4127, 1307, 13170, 1278, 2782, 1307, 22344, 1307, 5592, 1032, 1049]
Generated text: ' \n</think>\nThe problem requires finding the number of paths of length '
Generated token IDs: [1032, 1010, 1885, 74045, 1561, 1784, 4127, 10867, 13170, 1278, 2782, 1307, 22344, 1307, 5592, 1032]
Generated text: '</think>\nThe problem requires finding the number of paths of length 16'
Generated token IDs: [1885, 74045, 1561, 1784, 4127, 10867, 13170, 1278, 2782, 1307, 22344, 1307, 5592, 1032, 1049, 1054]
Generated text: ' \n</think>\nThe paths of length 16 from the lower left'
Generated token IDs: [1032, 1010, 1885, 74045, 1561, 1784, 22344, 1307, 5592, 1032, 1049, 1054, 1562, 1278, 4953, 3979]
Generated text: " \n\n<think>\nOkay, let's see. I need to find the"
Generated token IDs: [1032, 1267, 49250, 2077, 1561, 44053, 1044, 2878, 1681, 3219, 1046, 1362, 2534, 1317, 3081, 1278]
Generated text: ' \n</think>\nThe problem requires finding the number of paths of length '
Generated token IDs: [1032, 1010, 1885, 74045, 1561, 1784, 4127, 10867, 13170, 1278, 2782, 1307, 22344, 1307, 5592, 1032]


vllm/v1/worker/utils.py

mergify · 2025-10-31T21:33:10Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tdoublep.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


tdoublep · 2025-10-31T21:37:29Z

@heheda12345 This is ready for review now.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.


vllm/v1/worker/gpu_model_runner.py


heheda12345 · 2025-11-01T22:45:16Z

@tdoublep My only concern is Codex's comment. The other parts look great.


vllm/v1/worker/gpu_model_runner.py


tdoublep · 2025-11-02T11:38:17Z

@heheda12345 The v1-test-attention (B200) job is failing, but it looks unrelated (the container doesn't even start up).

heheda12345

LGTM!


Bump vLLM version to v0.11.2

What's broken and changed by vLLM:
1. structured_output is broken by vllm-project/vllm#26866
2. get_mrope_input_positions is broken by vllm-project/vllm#28399
3. graph mode is broken by vllm-project/vllm#25110; we'll upgrade torch to 2.8 to fix the problem later
4. embedding is broken by vllm-project/vllm#27583
5. `get_attn_backend_cls` and the attention backend are broken by vllm-project/vllm#28534
6. spec decode is broken by vllm-project/vllm#28771
7. sp feature is broken by vllm-project/vllm#27126
8. mtp is broken by vllm-project/vllm#27922
9. lora is broken by vllm-project/vllm#21068
10. execute_model is broken by vllm-project/vllm#26866
11. `VLLM_DISABLE_SHARED_EXPERTS_STREAM` env is broken by vllm-project/vllm#28159
12. kv cache is broken by vllm-project/vllm#27753
13. dp is broken by vllm-project/vllm#25110

What's broken and changed by ourselves:
1. qwen vl is broken by vllm-project/vllm#28455. We'll remove model files in the future to avoid this kind of error.
2. Engine core is broken by vllm-project/vllm#23691. We'll remove the patch file in the future.
3. Ascend scheduler is broken by vllm-project/vllm#28733. We'll remove the ascend scheduler later.
4. qwen3-next is broken by vllm-project/vllm#28083. We'll remove model files in the future to avoid this kind of error.
5. qwen vl is broken by vllm-project/vllm#27764. We'll remove model files in the future.

Known issues:
1. ray doesn't work
2. the accuracy of qwen3-next is not correct
3. qwen3-vl is broken
4. prefix cache + ascend scheduler + deepseek v2 lite is broken

- vLLM version: v0.11.2

Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>
Co-authored-by: shen-shanshan <467638484@qq.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>


mergify bot added the v1 label Oct 29, 2025

This was referenced Oct 29, 2025

[Bug]: Cache malformation in hybrid models with SSM cache dtype float32 and block allocation wrap around #27264

Closed

[BUG] Fix hybrid kvcache kernel page size issue #27547

Closed

heheda12345 reviewed Oct 31, 2025

vllm/v1/worker/utils.py (outdated, resolved)

mergify bot added the needs-rebase label Oct 31, 2025

Pass kernel block sizes to metadata builders

7c7fab3

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

tdoublep force-pushed the pass_kernel_block_size_to_builders branch from b9528e1 to 7c7fab3 Compare October 31, 2025 21:34

Reduce diff

3d0061d

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

tdoublep marked this pull request as ready for review October 31, 2025 21:36

tdoublep requested review from LucasWilkinson, mgoin and pavanimajety as code owners October 31, 2025 21:36

mergify bot removed the needs-rebase label Oct 31, 2025

chatgpt-codex-connector bot reviewed Oct 31, 2025

vllm/v1/worker/gpu_model_runner.py (outdated, resolved)

Replace block size in KVCacheSpec

61813aa

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

heheda12345 added the ready label Nov 1, 2025

Fix codex suggestion

083afeb

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

heheda12345 mentioned this pull request Nov 2, 2025

[GPUModelRunner] Refactor initialize_kv_cache #27935

Draft

5 tasks

heheda12345 reviewed Nov 2, 2025

vllm/v1/worker/gpu_model_runner.py (outdated, resolved)

Fix issue for encoders

2e5efe5

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

heheda12345 approved these changes Nov 2, 2025


heheda12345 enabled auto-merge (squash) November 2, 2025 18:01

simon-mo disabled auto-merge November 2, 2025 22:43

heheda12345 enabled auto-merge (squash) November 3, 2025 05:43

heheda12345 merged commit 18961c5 into vllm-project:main Nov 3, 2025
52 checks passed

heheda12345 mentioned this pull request Nov 3, 2025

[Bug]: Hybrid Attention models broken after switching to flashinfer 0.4 (tested on Granite 4.0 H, Qwen3-Next, Jamba-3B, Nemotron-H-8b) #26936

Open

1 task

soaringk pushed a commit to soaringk/vllm that referenced this pull request Nov 3, 2025

[Hybrid] Pass kernel block size to builders (vllm-project#27753)

4b2dfef

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Signed-off-by: soaringk <k3vin.zhang@gmail.com>

ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025

[Hybrid] Pass kernel block size to builders (vllm-project#27753)

5bc65e2

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025

[Hybrid] Pass kernel block size to builders (vllm-project#27753)

5c2b343

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

This was referenced Nov 13, 2025

[Bugfix][Nixl] Fix kernel physical<>logical block_size issue #28677

Merged

[Attention] Refactor FA block_size limitations to hybrid models only #29084

Merged

wangxiyuan mentioned this pull request Nov 25, 2025

upgrade to vllm 0.11.2 vllm-project/vllm-ascend#4400

Merged

devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

[Hybrid] Pass kernel block size to builders (vllm-project#27753)

166f62b

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

tdoublep mentioned this pull request Mar 10, 2026

[Core] Remove FlashAttention block size restriction for hybrid models #36701

Open

heheda12345 merged 5 commits into vllm-project:main from tdoublep:pass_kernel_block_size_to_builders
