
Fix a bug in 1D input shape #5

Merged
WoosukKwon merged 4 commits into main from bugfix on Mar 6, 2023

Conversation

@WoosukKwon
Collaborator

This PR fixes a miscalculation of the input shape when iteration-level scheduling is used.
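A minimal sketch of the shape issue in question, assuming PyTorch tensors (the token values and counts are made up): under iteration-level scheduling, the tokens of all scheduled sequences are packed into one flat tensor rather than a padded 2D batch, so shape logic that assumes a `[batch, max_len]` layout miscounts.

```python
import torch

# Hypothetical per-sequence token lists scheduled in one iteration.
seqs = [[101, 7592, 2088], [101, 2129], [2024]]

# Packed 1D input: shape is (total_tokens,) == (6,), not a padded (3, 3).
input_ids = torch.tensor([t for s in seqs for t in s], dtype=torch.long)
assert input_ids.shape == (6,)  # 3 + 2 + 1 tokens, no padding
```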

@WoosukKwon WoosukKwon merged commit 04e5acc into main Mar 6, 2023
@WoosukKwon WoosukKwon deleted the bugfix branch March 6, 2023 18:05
v1nc3nt27 pushed a commit to v1nc3nt27/vllm that referenced this pull request Sep 12, 2023
xiangyuT added a commit to xiangyuT/vllm that referenced this pull request Oct 24, 2023
* finish changing scheduler

* finish merge

* fix model

* Fix (vllm-project#5)

* fix problems

* fix

* delete unused params

* remove redundant comments

---------

Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 14, 2024
Align optimum-intel based model signature with vLLM signature
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 25, 2024
…imum

Install optimum-intel from latest main
mzusman added a commit to mzusman/vllm that referenced this pull request Apr 16, 2024
* Drop indecies when finish

* min 1 attention layer

* CG is working on forward pass passing

* Remove comments

* cosmetics - rename indecies -> indices, organize some whitespaces

* Add some TODOs

* Adding mamba cache for cg

* Remove useless vars from input_metadata

* Remove unused import

* Set the seqlen offset to boolean

* Return only hidden state

* Return only hidden states

* Add padding to match forward pass bs

* Is prompt instead of seqlen offset

* Remove mamba cache class (not used)

* Another remove

* Remove

* Use mamba4gc

* Fix mamba forward, run update only on non prompt

* Use 1 index after the maximal index

* Remove import

* Remove import

* typo

* typo

* place holder

* Padding and empty token takes it from the first empty place

* reformat

* Apply suggestions from code review

Whitespaces

---------

Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: Tomer Asida <tomera@ai21.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
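One item in the commit list above, "Add padding to match forward pass bs", refers to a general CUDA-graph constraint; a hedged sketch of that idea (all sizes illustrative, not taken from the actual code):

```python
import torch

captured_bs = 8   # batch size the CUDA graph was captured with
real_bs = 5       # batch actually scheduled this step
hidden = torch.randn(real_bs, 4096)

# CUDA graphs replay with a fixed batch size, so smaller batches are
# padded with zero rows up to the captured size, then sliced back down.
pad_rows = hidden.new_zeros(captured_bs - real_bs, hidden.shape[1])
padded = torch.cat([hidden, pad_rows], dim=0)   # shape (8, 4096)
out = padded                                    # stand-in for graph.replay()
out = out[:real_bs]                             # drop the padding rows
```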
linxihui added a commit to linxihui/vllm that referenced this pull request May 14, 2024
…3small

 [Model][Kernels] Support Phi3small architecture, blocksparse attention prefilling kernel, CUDA+Triton paged attn kernels
Starmys pushed a commit to Starmys/vllm that referenced this pull request May 20, 2024
Faster v2 hopper fused moe kernel configs
@alixiaodi alixiaodi mentioned this pull request Aug 2, 2024
zeroorhero pushed a commit to zeroorhero/vllm that referenced this pull request Sep 23, 2024
Bounty-hunter pushed a commit to Bounty-hunter/vllm that referenced this pull request Nov 4, 2025
* # This is a combination of 6 commits, all with the same message:

mooncake store connector

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

* mooncake store connector

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

* mooncake store connector

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

fix comments

* Update vllm/distributed/ec_transfer/utils/tensor_memory_pool.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update vllm/distributed/ec_transfer/ec_lookup_buffer/mooncake_store.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update vllm/distributed/ec_transfer/ec_connector/mooncake_storage_connector.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @wuhang2014

line length format

* Apply suggestion from @wuhang2014

remove extra empty line

---------

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
Co-authored-by: wuhang <whlbx@hotmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
yma11 pushed a commit to yma11/vllm that referenced this pull request Nov 14, 2025
…tionalGeneration` (vllm-project#27895) (vllm-project#5)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
dik654 pushed a commit to dik654/vllm-for-study that referenced this pull request Nov 18, 2025
…ections

Manufacturing enhancements:
- Add complete Vision Inspection MCP with Vision AI defect detection
- Add Manufacturing MES MCP with PostgreSQL integration
- Include detailed defect classification and statistics
- Add ROI analysis showing 78% cost reduction and 99.6% time savings

Healthcare enhancements:
- Enhance existing Medical OCR, Drug Interaction, and EHR MCPs
- Add ROI analysis showing 97.2% time reduction
- Include medical accident prevention benefits (KRW 500 million in annual savings)
- Demonstrate HIPAA-compliant prescription OCR workflow

Summary:
- Sections vllm-project#5-8: Fully detailed implementations (2,000+ lines each)
- Sections vllm-project#9-10: Enhanced with complete code + ROI
- Sections vllm-project#11-20+: Comprehensive summaries covering all major industries
- Total guide provides 20+ real-world MCP + Agent architecture patterns
xiaoshudian555 pushed a commit to xiaoshudian555/vllm that referenced this pull request Nov 26, 2025
guyueh1 referenced this pull request in guyueh1/vllm Dec 9, 2025
wojciech-wais added a commit to wojciech-wais/vllm that referenced this pull request Mar 6, 2026
Follow-up item vllm-project#5 from RFC vllm-project#24885 (shutdown semantics).

Currently Ctrl-C does not respect --shutdown-timeout. Engine core and
worker child processes receive SIGINT directly from the terminal,
causing them to exit immediately rather than draining in-flight
requests.

Changes:

1. Engine core processes call os.setpgrp() at startup to create their
   own process group, isolating them from terminal SIGINT. Worker
   processes (children of engine core) inherit this group. The parent
   API server is the only process receiving terminal Ctrl-C, and it
   orchestrates child shutdown via SIGTERM.

2. Launcher (serve_http): first SIGINT/SIGTERM triggers graceful drain
   with --shutdown-timeout. A second signal forces immediate exit by
   setting timeout=0 and cancelling the server task.

3. CLI serve handlers (run_headless, run_multi_api_server): first
   signal triggers graceful SystemExit. Second signal calls os._exit(1)
   for immediate termination.

Fixes vllm-project#24885

Signed-off-by: Wojciech Wais <wojciech.wais@gmail.com>
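A hedged sketch of the two mechanisms described above (function names are illustrative, not vLLM's; `os.setpgrp` is Unix-only):

```python
import os
import signal

def engine_core_entrypoint():
    # Child detaches into its own process group, so terminal Ctrl-C
    # (delivered to the foreground group) never reaches it directly;
    # the parent API server signals it explicitly with SIGTERM instead.
    os.setpgrp()

def install_two_stage_handler():
    seen_first = False

    def handler(signum, frame):
        nonlocal seen_first
        if not seen_first:
            seen_first = True
            raise SystemExit(0)   # first signal: begin graceful drain
        os._exit(1)               # second signal: exit immediately

    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)
```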
IWantFight pushed a commit to IWantFight/vllm that referenced this pull request Mar 12, 2026
bigshanedogg pushed a commit to bigshanedogg/vllm that referenced this pull request Mar 19, 2026
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 26, 2026
## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All
fixes are from upstream vLLM `main` and address critical bugs affecting
RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:**
[INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|------------|------|---------|
| 1 | [vllm-project#30550](vllm-project#30550) | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models |
| 2 | [vllm-project#31406](vllm-project#31406) | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend |
| 3 | [vllm-project#34243](vllm-project#34243) | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) |
| 4 | [vllm-project#34454](vllm-project#34454) | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) |
| 5 | [vllm-project#34507](vllm-project#34507) | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models |
| 6 | [vllm-project#35085](vllm-project#35085) | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support |
| 7 | [vllm-project#35456](vllm-project#35456) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) |
| 8 | [vllm-project#35510](vllm-project#35510) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint |


## Conflict resolutions

<details>
<summary><b>#1 — llama-nemotron-embed / score-template support
(vllm-project#30550)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#2 — Triton Attention (vllm-project#31406)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#3 — Llama-4 attn quant (vllm-project#34243)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but
the fix targets a self-contained block.
</details>

<details>
<summary><b>vllm-project#4 — GPT-OSS multi-turn (vllm-project#34454)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored
imports in `gptoss_reasoning_parser.py`. The fix logic (adding
`eom_token_id` early-exit check in `is_reasoning_end`) was independent
of the import changes.
</details>

<details>
<summary><b>vllm-project#5 — Fused MoE int32 overflow (vllm-project#34507)</b>: Conflicts in 2
files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30
intermediate upstream commits refactored `fused_moe_kernel` with
conditional `naive_block_assignment` logic that doesn't exist in
`rhai/0.13.0`. Resolved by keeping our simpler code and applying only
the int64 cast fix:
- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()`
result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)`
before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on
`make_dummy_moe_config()` from intermediate refactors. Resolved by
keeping our existing test code (no test changes).
</details>
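A hedged illustration of the int64-cast pattern described above, as a standalone Triton kernel (not the vLLM fused-MoE kernel itself): indices loaded as int32 overflow once `stride * offset` exceeds 2**31 - 1, so the index is widened before any pointer arithmetic.

```python
import triton
import triton.language as tl

@triton.jit
def gather_rows(src_ptr, idx_ptr, out_ptr, row_stride, n_cols,
                BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    # Widen the loaded row index to int64 before the multiply; with a
    # large row_stride, row * row_stride would wrap around in int32.
    row = tl.load(idx_ptr + pid).to(tl.int64)
    offs = tl.arange(0, BLOCK)
    mask = offs < n_cols
    vals = tl.load(src_ptr + row * row_stride + offs, mask=mask)
    tl.store(out_ptr + pid * n_cols + offs, vals, mask=mask)
```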

<details>
<summary><b>vllm-project#6 — AllReduceFusionPass multicast (vllm-project#35085)</b>: Conflict
due to file rename + API change</summary>

Upstream moved `collective_fusion.py` →
`compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API
from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to
`create_allreduce_fusion_workspace()`. Resolved by applying the
try/except wrapper around our existing
`trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in
`collective_fusion.py`. The error handling logic (catching RuntimeError
with "multicast" in message, logging warning, returning early) is
identical to upstream.
</details>
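A minimal sketch of the fallback shape described above (the wrapper and logging are illustrative; only the catch-RuntimeError-on-"multicast" behavior comes from the resolution notes):

```python
import logging

logger = logging.getLogger(__name__)

def try_enable_allreduce_fusion(create_workspace):
    try:
        return create_workspace()
    except RuntimeError as e:
        if "multicast" in str(e).lower():
            # GPU lacks multicast support: warn and disable the fusion
            # pass instead of crashing at startup.
            logger.warning(
                "Multicast unsupported; disabling AllReduceFusionPass")
            return None
        raise
```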

<details>
<summary><b>vllm-project#7 — response_format validation for completions
(vllm-project#35456)</b>: Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and
`chat_completion/protocol.py`. Our branch still has the monolithic
`protocol.py`. Resolved by:
- Removing the non-existent
`vllm/entrypoints/openai/completion/protocol.py`
- Manually adding `validate_response_format` model_validator to
`CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which
doesn't exist in our branch; `ValueError` is already handled as 400 Bad
Request in `serving_engine.py`)
- Test additions from upstream applied cleanly to
`test_completion_error.py`
</details>
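A hedged sketch of the validator added above, assuming pydantic v2; the field set is trimmed and the exact `CompletionRequest` definition on this branch will differ:

```python
from typing import Any, Optional

from pydantic import BaseModel, model_validator

class CompletionRequest(BaseModel):
    model: str
    prompt: str
    response_format: Optional[dict[str, Any]] = None

    @model_validator(mode="after")
    def validate_response_format(self):
        rf = self.response_format
        if rf is not None and rf.get("type") == "json_schema":
            # Raise ValueError (already mapped to 400 Bad Request in
            # serving_engine.py) rather than asserting, so a malformed
            # request cannot 500 the server.
            if not rf.get("json_schema"):
                raise ValueError(
                    "response_format type 'json_schema' requires a "
                    "'json_schema' field")
        return self
```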

<details>
<summary><b>vllm-project#8 — response_format validation for chat completions
(vllm-project#35510)</b>: Conflict due to file restructuring</summary>

Same file restructuring issue as vllm-project#7. Resolved by:
- Removing the non-existent
`vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding `validate_response_format` model_validator to
`ChatCompletionRequest` in our `protocol.py`
- Only accepting the `test_json_schema_response_format_missing_schema`
test from the conflict (discarding ~140 lines of intermediate upstream
tests that reference non-existent paths in our branch)
</details>

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the
backported score-template / bidirectional model support
- [ ] Verify Llama-4 quantized model loads correctly with int8/fp8
attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format
returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32
overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast)
gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without
`json_schema` field returns 400 (not 500) for both `/v1/completions` and
`/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with Triton attention
backend on ROCm


[INFERENG-4743]:
https://redhat.atlassian.net/browse/INFERENG-4743?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4800]:
https://redhat.atlassian.net/browse/INFERENG-4800?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4746]:
https://redhat.atlassian.net/browse/INFERENG-4746?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5032]:
https://redhat.atlassian.net/browse/INFERENG-5032?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5038]:
https://redhat.atlassian.net/browse/INFERENG-5038?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

[INFERENG-5106]:
https://redhat.atlassian.net/browse/INFERENG-5106?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
roy-shih added a commit to UnieAI/vllm that referenced this pull request Mar 31, 2026
Grep-verified, one by one, that the integration code for every completed item actually exists:
- #3 spec decode: _batch_precompute_spec_decode() is already in scheduler.py
- vllm-project#5 builtin hash: already in the config/cache.py Literal type
- vllm-project#15 batch spec decode: the _precomputed_spec fast path is already in the loop

Cleaned up the strikethrough noise and unified everything into a clean two-table "completed/incomplete" format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>