
Fix a bug in 1D input shape #5

Merged
WoosukKwon merged 4 commits into main from bugfix on Mar 6, 2023

Conversation

@WoosukKwon
Collaborator

This PR fixes a miscalculation of the input shape when iteration-level scheduling is used.
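A minimal sketch of the shape issue in question, assuming PyTorch tensors (the token values and counts are made up): under iteration-level scheduling, the tokens of all scheduled sequences are packed into one flat tensor rather than a padded 2D batch, so shape logic that assumes a `[batch, max_len]` layout miscounts.

```python
import torch

# Hypothetical per-sequence token lists scheduled in one iteration.
seqs = [[101, 7592, 2088], [101, 2129], [2024]]

# Packed 1D input: shape is (total_tokens,) == (6,), not a padded (3, 3).
input_ids = torch.tensor([t for s in seqs for t in s], dtype=torch.long)
assert input_ids.shape == (6,)  # 3 + 2 + 1 tokens, no padding
```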

@WoosukKwon WoosukKwon merged commit 04e5acc into main Mar 6, 2023
@WoosukKwon WoosukKwon deleted the bugfix branch March 6, 2023 18:05
v1nc3nt27 pushed a commit to v1nc3nt27/vllm that referenced this pull request Sep 12, 2023
xiangyuT added a commit to xiangyuT/vllm that referenced this pull request Oct 24, 2023
* finish changing scheduler

* finish merge

* fix model

* Fix (vllm-project#5)

* fix problems

* fix

* delete unused params

* remove redundant comments

---------

Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 14, 2024
Align optimum-intel based model signature with vLLM signature
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 25, 2024
…imum

Install optimum-intel from latest main
mzusman added a commit to mzusman/vllm that referenced this pull request Apr 16, 2024
* Drop indecies when finish

* min 1 attention layer

* CG is working on forward pass passing

* Remove comments

* cosmetics - rename indecies -> indices, organize some whitespaces

* Add some TODOs

* Adding mamba cache for cg

* Remove useless vars from input_metadata

* Remove unused import

* Set the seqlen offset to boolean

* Return only hidden state

* Return only hidden states

* Add padding to match forward pass bs

* Is prompt instead of seqlen offset

* Remove mamba cache class (not used)

* Another remove

* Remove

* Use mamba4gc

* Fix mamba forward, run update only on non prompt

* Use 1 index after the maximal index

* Remove import

* Remove import

* typo

* typo

* place holder

* Padding and empty token takes it from the first empty place

* reformat

* Apply suggestions from code review

Whitespaces

---------

Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: Tomer Asida <tomera@ai21.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
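One item in the commit list above, "Add padding to match forward pass bs", refers to a general CUDA-graph constraint; a hedged sketch of that idea (all sizes illustrative, not taken from the actual code):

```python
import torch

captured_bs = 8   # batch size the CUDA graph was captured with
real_bs = 5       # batch actually scheduled this step
hidden = torch.randn(real_bs, 4096)

# CUDA graphs replay with a fixed batch size, so smaller batches are
# padded with zero rows up to the captured size, then sliced back down.
pad_rows = hidden.new_zeros(captured_bs - real_bs, hidden.shape[1])
padded = torch.cat([hidden, pad_rows], dim=0)   # shape (8, 4096)
out = padded                                    # stand-in for graph.replay()
out = out[:real_bs]                             # drop the padding rows
```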
linxihui added a commit to linxihui/vllm that referenced this pull request May 14, 2024
…3small

 [Model][Kernels] Support Phi3small architecture, blocksparse attention prefilling kernel, CUDA+Triton paged attn kernels
Starmys pushed a commit to Starmys/vllm that referenced this pull request May 20, 2024
Faster v2 hopper fused moe kernel configs
@alixiaodi alixiaodi mentioned this pull request Aug 2, 2024
zeroorhero pushed a commit to zeroorhero/vllm that referenced this pull request Sep 23, 2024
Bounty-hunter pushed a commit to Bounty-hunter/vllm that referenced this pull request Nov 4, 2025
* # This is a combination of 6 commits, all with the same message:

mooncake store connector

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

* mooncake store connector

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

* mooncake store connector

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

fix comments

* Update vllm/distributed/ec_transfer/utils/tensor_memory_pool.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update vllm/distributed/ec_transfer/ec_lookup_buffer/mooncake_store.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update vllm/distributed/ec_transfer/ec_connector/mooncake_storage_connector.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @wuhang2014

line length format

* Apply suggestion from @wuhang2014

remove extra empty line

---------

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
Co-authored-by: wuhang <whlbx@hotmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
yma11 pushed a commit to yma11/vllm that referenced this pull request Nov 14, 2025
…tionalGeneration` (vllm-project#27895) (vllm-project#5)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
dik654 pushed a commit to dik654/vllm-for-study that referenced this pull request Nov 18, 2025
…ections

Manufacturing enhancements:
- Add complete Vision Inspection MCP with Vision AI defect detection
- Add Manufacturing MES MCP with PostgreSQL integration
- Include detailed defect classification and statistics
- Add ROI analysis showing 78% cost reduction and 99.6% time savings

Healthcare enhancements:
- Enhance existing Medical OCR, Drug Interaction, and EHR MCPs
- Add ROI analysis showing 97.2% time reduction
- Include medical accident prevention benefits (KRW 500 million in annual savings)
- Demonstrate HIPAA-compliant prescription OCR workflow

Summary:
- Sections vllm-project#5-8: Fully detailed implementations (2,000+ lines each)
- Sections vllm-project#9-10: Enhanced with complete code + ROI
- Sections vllm-project#11-20+: Comprehensive summaries covering all major industries
- Total guide provides 20+ real-world MCP + Agent architecture patterns
xiaoshudian555 pushed a commit to xiaoshudian555/vllm that referenced this pull request Nov 26, 2025
guyueh1 referenced this pull request in guyueh1/vllm Dec 9, 2025
wojciech-wais added a commit to wojciech-wais/vllm that referenced this pull request Mar 6, 2026
Follow-up item vllm-project#5 from RFC vllm-project#24885 (shutdown semantics).

Currently Ctrl-C does not respect --shutdown-timeout. Engine core and
worker child processes receive SIGINT directly from the terminal,
causing them to exit immediately rather than draining in-flight
requests.

Changes:

1. Engine core processes call os.setpgrp() at startup to create their
   own process group, isolating them from terminal SIGINT. Worker
   processes (children of engine core) inherit this group. The parent
   API server is the only process receiving terminal Ctrl-C, and it
   orchestrates child shutdown via SIGTERM.

2. Launcher (serve_http): first SIGINT/SIGTERM triggers graceful drain
   with --shutdown-timeout. A second signal forces immediate exit by
   setting timeout=0 and cancelling the server task.

3. CLI serve handlers (run_headless, run_multi_api_server): first
   signal triggers graceful SystemExit. Second signal calls os._exit(1)
   for immediate termination.

Fixes vllm-project#24885

Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com>
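A hedged sketch of the two mechanisms described above (function names are illustrative, not vLLM's; `os.setpgrp` is Unix-only):

```python
import os
import signal

def engine_core_entrypoint():
    # Child detaches into its own process group, so terminal Ctrl-C
    # (delivered to the foreground group) never reaches it directly;
    # the parent API server signals it explicitly with SIGTERM instead.
    os.setpgrp()

def install_two_stage_handler():
    seen_first = False

    def handler(signum, frame):
        nonlocal seen_first
        if not seen_first:
            seen_first = True
            raise SystemExit(0)   # first signal: begin graceful drain
        os._exit(1)               # second signal: exit immediately

    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)
```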
IWantFight pushed a commit to IWantFight/vllm that referenced this pull request Mar 12, 2026
bigshanedogg pushed a commit to bigshanedogg/vllm that referenced this pull request Mar 19, 2026
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 26, 2026
## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All
fixes are from upstream vLLM `main` and address critical bugs affecting
RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:**
[INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|------------|------|---------|
| 1 | [vllm-project#30550](vllm-project#30550) | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models |
| 2 | [vllm-project#31406](vllm-project#31406) | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend |
| 3 | [vllm-project#34243](vllm-project#34243) | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) |
| 4 | [vllm-project#34454](vllm-project#34454) | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) |
| 5 | [vllm-project#34507](vllm-project#34507) | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models |
| 6 | [vllm-project#35085](vllm-project#35085) | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support |
| 7 | [vllm-project#35456](vllm-project#35456) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) |
| 8 | [vllm-project#35510](vllm-project#35510) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint |


## Conflict resolutions

<details>
<summary><b>#1 — llama-nemotron-embed / score-template support
(vllm-project#30550)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#2 — Triton Attention (vllm-project#31406)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#3 — Llama-4 attn quant (vllm-project#34243)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but
the fix targets a self-contained block.
</details>

<details>
<summary><b>vllm-project#4 — GPT-OSS multi-turn (vllm-project#34454)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored
imports in `gptoss_reasoning_parser.py`. The fix logic (adding
`eom_token_id` early-exit check in `is_reasoning_end`) was independent
of the import changes.
</details>

<details>
<summary><b>vllm-project#5 — Fused MoE int32 overflow (vllm-project#34507)</b>: Conflicts in 2
files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30
intermediate upstream commits refactored `fused_moe_kernel` with
conditional `naive_block_assignment` logic that doesn't exist in
`rhai/0.13.0`. Resolved by keeping our simpler code and applying only
the int64 cast fix:
- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()`
result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)`
before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on
`make_dummy_moe_config()` from intermediate refactors. Resolved by
keeping our existing test code (no test changes).
</details>
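A hedged illustration of the int64-cast pattern described above, as a standalone Triton kernel (not the vLLM fused-MoE kernel itself): indices loaded as int32 overflow once `stride * offset` exceeds 2**31 - 1, so the index is widened before any pointer arithmetic.

```python
import triton
import triton.language as tl

@triton.jit
def gather_rows(src_ptr, idx_ptr, out_ptr, row_stride, n_cols,
                BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    # Widen the loaded row index to int64 before the multiply; with a
    # large row_stride, row * row_stride would wrap around in int32.
    row = tl.load(idx_ptr + pid).to(tl.int64)
    offs = tl.arange(0, BLOCK)
    mask = offs < n_cols
    vals = tl.load(src_ptr + row * row_stride + offs, mask=mask)
    tl.store(out_ptr + pid * n_cols + offs, vals, mask=mask)
```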

<details>
<summary><b>vllm-project#6 — AllReduceFusionPass multicast (vllm-project#35085)</b>: Conflict
due to file rename + API change</summary>

Upstream moved `collective_fusion.py` →
`compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API
from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to
`create_allreduce_fusion_workspace()`. Resolved by applying the
try/except wrapper around our existing
`trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in
`collective_fusion.py`. The error handling logic (catching RuntimeError
with "multicast" in message, logging warning, returning early) is
identical to upstream.
</details>
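A minimal sketch of the fallback shape described above (the wrapper and logging are illustrative; only the catch-RuntimeError-on-"multicast" behavior comes from the resolution notes):

```python
import logging

logger = logging.getLogger(__name__)

def try_enable_allreduce_fusion(create_workspace):
    try:
        return create_workspace()
    except RuntimeError as e:
        if "multicast" in str(e).lower():
            # GPU lacks multicast support: warn and disable the fusion
            # pass instead of crashing at startup.
            logger.warning(
                "Multicast unsupported; disabling AllReduceFusionPass")
            return None
        raise
```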

<details>
<summary><b>vllm-project#7 — response_format validation for completions
(vllm-project#35456)</b>: Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and
`chat_completion/protocol.py`. Our branch still has the monolithic
`protocol.py`. Resolved by:
- Removing the non-existent
`vllm/entrypoints/openai/completion/protocol.py`
- Manually adding `validate_response_format` model_validator to
`CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which
doesn't exist in our branch; `ValueError` is already handled as 400 Bad
Request in `serving_engine.py`)
- Test additions from upstream applied cleanly to
`test_completion_error.py`
</details>
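A hedged sketch of the validator added above, assuming pydantic v2; the field set is trimmed and the exact `CompletionRequest` definition on this branch will differ:

```python
from typing import Any, Optional

from pydantic import BaseModel, model_validator

class CompletionRequest(BaseModel):
    model: str
    prompt: str
    response_format: Optional[dict[str, Any]] = None

    @model_validator(mode="after")
    def validate_response_format(self):
        rf = self.response_format
        if rf is not None and rf.get("type") == "json_schema":
            # Raise ValueError (already mapped to 400 Bad Request in
            # serving_engine.py) rather than asserting, so a malformed
            # request cannot 500 the server.
            if not rf.get("json_schema"):
                raise ValueError(
                    "response_format type 'json_schema' requires a "
                    "'json_schema' field")
        return self
```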

<details>
<summary><b>vllm-project#8 — response_format validation for chat completions
(vllm-project#35510)</b>: Conflict due to file restructuring</summary>

Same file restructuring issue as vllm-project#7. Resolved by:
- Removing the non-existent
`vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding `validate_response_format` model_validator to
`ChatCompletionRequest` in our `protocol.py`
- Only accepting the `test_json_schema_response_format_missing_schema`
test from the conflict (discarding ~140 lines of intermediate upstream
tests that reference non-existent paths in our branch)
</details>

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the
backported score-template / bidirectional model support
- [ ] Verify Llama-4 quantized model loads correctly with int8/fp8
attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format
returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32
overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast)
gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without
`json_schema` field returns 400 (not 500) for both `/v1/completions` and
`/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with Triton attention
backend on ROCm


[INFERENG-4743]:
https://redhat.atlassian.net/browse/INFERENG-4743?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4800]:
https://redhat.atlassian.net/browse/INFERENG-4800?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4746]:
https://redhat.atlassian.net/browse/INFERENG-4746?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5032]:
https://redhat.atlassian.net/browse/INFERENG-5032?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5038]:
https://redhat.atlassian.net/browse/INFERENG-5038?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

[INFERENG-5106]:
https://redhat.atlassian.net/browse/INFERENG-5106?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
roy-shih added a commit to UnieAI/vllm that referenced this pull request Mar 31, 2026
Grep-verified, one by one, that the integration code for every completed item actually exists:
- #3 spec decode: _batch_precompute_spec_decode() is already in scheduler.py
- vllm-project#5 builtin hash: already in the config/cache.py Literal type
- vllm-project#15 batch spec decode: the _precomputed_spec fast path is already in the loop

Cleaned up the strikethrough noise and unified everything into a clean two-table "completed/incomplete" format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>