Merged
xiangyuT
added a commit
to xiangyuT/vllm
that referenced
this pull request
Oct 24, 2023
* finish changing scheduler
* finish merge
* fix model
* Fix (vllm-project#5)
* fix problems
* fix
* delete unused params
* remove redundant comments

---------

Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>
hongxiayang
pushed a commit
to hongxiayang/vllm
that referenced
this pull request
Feb 13, 2024
slyalin
pushed a commit
to slyalin/vllm
that referenced
this pull request
Mar 21, 2024
Add missing Python requirements
mzusman
added a commit
to mzusman/vllm
that referenced
this pull request
Apr 16, 2024
Co-authored-by: Mor Zusman <morz@ai21.com>
dtrifiro
referenced
this pull request
in dtrifiro/vllm
Apr 26, 2024
[CI/Build] Dockerfile.ubi : Remove test stage
Starmys
pushed a commit
to Starmys/vllm
that referenced
this pull request
May 20, 2024
FP8 on A100 for PHIMOE
yma11
pushed a commit
to yma11/vllm
that referenced
this pull request
Nov 10, 2025
* [kernel][DS-R1][linear] use default Fp8LinearMethod/Fp8MoEMethod
  Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
* [kernel][DS-R1][Attention] enable Triton MLA attention
  Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
* enable MHA for deepseek, need padding head_size to make flash attn kernel happy
  Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
* not break fp8 path
  Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

---------

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
dik654
pushed a commit
to dik654/vllm-for-study
that referenced
this pull request
Nov 18, 2025
- Add complete Accounting ERP MCP server code (140+ lines)
- Add complete File Storage MCP server code (40+ lines)
- Add detailed Agent execution sequence (STEP 1-56)
- Add comprehensive ROI analysis
- 99.1% time savings, 7.26 million KRW/year cost savings
xiaoshudian555
pushed a commit
to xiaoshudian555/vllm
that referenced
this pull request
Nov 26, 2025
cam aclgraph ok.
pjin-nvidia
pushed a commit
to pjin-nvidia/vllm
that referenced
this pull request
Jan 21, 2026
Use replicated linear latent
sriumcp
referenced
this pull request
in inference-sim/vllm
Jan 26, 2026
This commit adds the api.FIRST_RESPONSE_FROM_CORE event to complete the API journey event stream for detailed timing analysis.

Implementation:
1. Emit event when first output received from engine
   - Both streaming (chat_completion_stream_generator)
   - And non-streaming (chat_completion_full_generator) paths
2. Event emitted at same time as first_response_time tracking
   - Captures monotonic timestamp for consistency
   - Uses epoch timestamp for OTEL timeline placement

Event attributes:
- name: "api.FIRST_RESPONSE_FROM_CORE"
- EVENT_TS_MONOTONIC: monotonic timestamp
- timestamp: epoch nanoseconds

Timing analysis now possible:
- Queue + scheduling time: FIRST_RESPONSE - HANDOFF_TO_CORE
- API processing overhead: HANDOFF_TO_CORE - ARRIVED
- Complete API latency: DEPARTED - ARRIVED

Related: Addresses Task #6 (optional but good to have)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
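The three intervals listed in that commit message are plain differences between event timestamps. A minimal sketch of the arithmetic follows, assuming all four timestamps come from the same monotonic clock; the `JourneyTimestamps` class and `timing_breakdown` function are hypothetical names chosen for illustration and are not taken from the referenced commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JourneyTimestamps:
    """Hypothetical holder for the four event timestamps (same clock, same unit)."""
    arrived: float
    handoff_to_core: float
    first_response_from_core: float
    departed: float


def timing_breakdown(ts: JourneyTimestamps) -> dict:
    """Compute the three intervals named in the commit message."""
    return {
        # API processing overhead: HANDOFF_TO_CORE - ARRIVED
        "api_processing_overhead": ts.handoff_to_core - ts.arrived,
        # Queue + scheduling time: FIRST_RESPONSE - HANDOFF_TO_CORE
        "queue_and_scheduling": ts.first_response_from_core - ts.handoff_to_core,
        # Complete API latency: DEPARTED - ARRIVED
        "complete_api_latency": ts.departed - ts.arrived,
    }
```

Differences are only meaningful when both endpoints use the monotonic timestamp; the epoch timestamp is better suited to placing the event on a timeline.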
tjtanaa
pushed a commit
to tjtanaa/vllm
that referenced
this pull request
Jan 29, 2026
init main repo structure and demonstrate the AR + DiT demo for omni models
eble-amd
pushed a commit
to eble-amd/vllm
that referenced
this pull request
Mar 16, 2026
…e_to_matthias.awq_gemv Port rogarcia.exllama_moe to matthias.awq gemv
Pwspang
pushed a commit
to hyscale-lab/vllm-thought-eviction
that referenced
this pull request
Mar 26, 2026
- wrap_stream async generator middleware with finally cleanup
- _accumulate: differential L2 norms, reasoning content extraction, offset computation
- _maybe_schedule_cycle: time-based and token-based triggers via asyncio.create_task
- _run_eviction_cycle: guard conditions (ENG-09, ENG-10, Pitfall vllm-project#6), strategy dispatch
- Reasoning-relative to absolute offset conversion (D-05)
- merge_overlapping_ranges + apply_retention_window + align_ranges_to_blocks pipeline
- engine_client.update_request_mask call (D-04)
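To make the range pipeline in that commit message concrete, here is a rough sketch of two of its stages. Only the function names come from the commit message; the bodies, the half-open `(start, end)` token ranges, and the choice to shrink ranges inward to whole blocks are all assumptions made for illustration, and the real code may behave differently.

```python
def merge_overlapping_ranges(ranges):
    """Merge half-open (start, end) ranges that overlap or touch (assumed semantics)."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Extend the previous range instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def align_ranges_to_blocks(ranges, block_size):
    """Shrink each range to the whole blocks it fully covers (assumed semantics)."""
    aligned = []
    for start, end in ranges:
        block_start = -(-start // block_size) * block_size  # round start up
        block_end = (end // block_size) * block_size        # round end down
        if block_start < block_end:
            aligned.append((block_start, block_end))
    return aligned
```

Shrinking inward means a block is only selected when every token in it falls inside a range; rounding outward would be the other plausible choice.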
Pwspang
pushed a commit
to hyscale-lab/vllm-thought-eviction
that referenced
this pull request
Mar 26, 2026
- 21 tests covering accumulation, guard conditions, cycle scheduling, passthrough
- Tests: ENG-09, ENG-10, Pitfall vllm-project#6, D-05 offset, ENG-06 permanent ranges, ENG-07 isolation
- Fix: apply_retention_window only when floor > 0 to avoid discarding all ranges
- Fix: used asyncio.run() for async tests (no pytest-asyncio installed)
khairulkabir1661
pushed a commit
to khairulkabir1661/vllm
that referenced
this pull request
Mar 26, 2026
## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All fixes are from upstream vLLM `main` and address critical bugs affecting RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:** [INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|------------|------|---------|
| 1 | [vllm-project#30550](vllm-project#30550) | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models |
| 2 | [vllm-project#31406](vllm-project#31406) | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend |
| 3 | [vllm-project#34243](vllm-project#34243) | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) |
| 4 | [vllm-project#34454](vllm-project#34454) | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) |
| 5 | [vllm-project#34507](vllm-project#34507) | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models |
| 6 | [vllm-project#35085](vllm-project#35085) | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support |
| 7 | [vllm-project#35456](vllm-project#35456) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) |
| 8 | [vllm-project#35510](vllm-project#35510) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint |

## Conflict resolutions

<details>
<summary><b>#1 - llama-nemotron-embed / score-template support (vllm-project#30550)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.

</details>

<details>
<summary><b>#2 - Triton Attention (vllm-project#31406)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.

</details>

<details>
<summary><b>#3 - Llama-4 attn quant (vllm-project#34243)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but the fix targets a self-contained block.

</details>

<details>
<summary><b>#4 - GPT-OSS multi-turn (vllm-project#34454)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored imports in `gptoss_reasoning_parser.py`. The fix logic (adding `eom_token_id` early-exit check in `is_reasoning_end`) was independent of the import changes.

</details>

<details>
<summary><b>#5 - Fused MoE int32 overflow (vllm-project#34507)</b>: Conflicts in 2 files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30 intermediate upstream commits refactored `fused_moe_kernel` with conditional `naive_block_assignment` logic that doesn't exist in `rhai/0.13.0`. Resolved by keeping our simpler code and applying only the int64 cast fix:
- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()` result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)` before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on `make_dummy_moe_config()` from intermediate refactors. Resolved by keeping our existing test code (no test changes).

</details>

<details>
<summary><b>#6 - AllReduceFusionPass multicast (vllm-project#35085)</b>: Conflict due to file rename + API change</summary>

Upstream moved `collective_fusion.py` → `compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to `create_allreduce_fusion_workspace()`. Resolved by applying the try/except wrapper around our existing `trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in `collective_fusion.py`. The error handling logic (catching RuntimeError with "multicast" in message, logging warning, returning early) is identical to upstream.

</details>

<details>
<summary><b>#7 - response_format validation for completions (vllm-project#35456)</b>: Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and `chat_completion/protocol.py`. Our branch still has the monolithic `protocol.py`. Resolved by:
- Removing the non-existent `vllm/entrypoints/openai/completion/protocol.py`
- Manually adding `validate_response_format` model_validator to `CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which doesn't exist in our branch; `ValueError` is already handled as 400 Bad Request in `serving_engine.py`)
- Test additions from upstream applied cleanly to `test_completion_error.py`

</details>

<details>
<summary><b>#8 - response_format validation for chat completions (vllm-project#35510)</b>: Conflict due to file restructuring</summary>

Same file restructuring issue as #6. Resolved by:
- Removing the non-existent `vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding `validate_response_format` model_validator to `ChatCompletionRequest` in our `protocol.py`
- Only accepting the `test_json_schema_response_format_missing_schema` test from the conflict (discarding ~140 lines of intermediate upstream tests that reference non-existent paths in our branch)

</details>

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the backported score-template / bidirectional model support
- [ ] Verify Llama-4 quantized model loads correctly with int8/fp8 attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32 overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast) gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without `json_schema` field returns 400 (not 500) for both `/v1/completions` and `/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with Triton attention backend on ROCm

[INFERENG-4743]: https://redhat.atlassian.net/browse/INFERENG-4743?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4800]: https://redhat.atlassian.net/browse/INFERENG-4800?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4746]: https://redhat.atlassian.net/browse/INFERENG-4746?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5032]: https://redhat.atlassian.net/browse/INFERENG-5032?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5038]: https://redhat.atlassian.net/browse/INFERENG-5038?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5106]: https://redhat.atlassian.net/browse/INFERENG-5106?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
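The int64 cast in the fused MoE fix (vllm-project#34507) guards against a product of token offset and stride that no longer fits in a signed 32-bit integer. The sketch below simulates that wraparound in plain Python; it is not Triton code, and `wrap_int32`, `element_offset_int32`, and `element_offset_int64` are hypothetical helpers written only to show the effect.

```python
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Reduce an integer to what a signed 32-bit register would hold."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT32_MAX else value


def element_offset_int32(token_index: int, stride: int) -> int:
    """Offset computed entirely in 32-bit arithmetic (the buggy case)."""
    return wrap_int32(token_index * stride)


def element_offset_int64(token_index: int, stride: int) -> int:
    """Offset computed after widening; Python ints stand in for int64 here."""
    return token_index * stride
```

With a token index of 70,000 and a stride of 40,000 the true offset is 2.8 billion, which exceeds `INT32_MAX`, so the 32-bit version wraps to a negative value while the widened version stays correct. Casting the index before the multiplication is what keeps the whole expression in 64-bit arithmetic.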
Damon-Salvetore
pushed a commit
to Damon-Salvetore/vllm
that referenced
this pull request
Mar 31, 2026
…rk-slidesparse Add comprehensive SlideSparse integration documentation for vLLM
This PR adds an OPT memory analyzer to the system and uses it to automatically determine the KV cache size.
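The description does not spell out how the analyzer arrives at a cache size, so the following is only a rough sketch of one way such an estimate could work, not the PR's implementation. It assumes the cache gets whatever memory is left after weights and peak activations, and that each cached token stores one key and one value vector of `hidden_size` per layer. The function names, the `utilization` factor, and the `block_size` default are all hypothetical.

```python
def kv_block_bytes(num_layers: int, hidden_size: int, block_size: int,
                   dtype_bytes: int = 2) -> int:
    """Bytes for one cache block: a key and a value vector per token per layer."""
    return 2 * num_layers * hidden_size * block_size * dtype_bytes


def max_kv_blocks(gpu_bytes: int, num_params: int, peak_activation_bytes: int,
                  num_layers: int, hidden_size: int, block_size: int = 16,
                  dtype_bytes: int = 2, utilization: float = 0.9) -> int:
    """Estimate how many KV cache blocks fit in the remaining GPU memory."""
    usable = int(gpu_bytes * utilization)   # leave headroom for fragmentation
    weights = num_params * dtype_bytes      # model parameters
    free = usable - weights - peak_activation_bytes
    if free <= 0:
        return 0                            # model alone does not fit
    return free // kv_block_bytes(num_layers, hidden_size, block_size, dtype_bytes)
```

Passing the parameter count and peak activation size in as arguments keeps the sketch independent of any particular OPT variant; a real analyzer would presumably derive both from the model configuration.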
Tested models:
Tested GPUs: