Add miscellaneous updates by WoosukKwon · Pull Request #8 · vllm-project/vllm

WoosukKwon · 2023-03-13T20:48:11Z

This PR contains several miscellaneous updates to the system, with two notable changes:

- The size of the CPU KV cache is now calculated from the `swap_space` size provided by the user (default: 20 GiB).
- The default value of `max_num_batched_tokens` has been increased from 2048 to 2560.
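As a rough illustration of the first change, a swap-space budget can be turned into a CPU block count by dividing it by the size of one KV cache block. This is a minimal sketch, not the code from this PR: the function names and the model dimensions (block size, layer count, head count, head size, dtype width) are assumptions chosen for the example.

```python
GiB = 1 << 30  # bytes per GiB


def kv_block_bytes(block_size: int, num_layers: int, num_heads: int,
                   head_size: int, dtype_bytes: int) -> int:
    """Bytes needed to hold one KV cache block across all layers (hypothetical helper)."""
    # Each layer stores a key tensor and a value tensor of the same shape.
    per_layer = 2 * block_size * num_heads * head_size * dtype_bytes
    return num_layers * per_layer


def num_cpu_blocks(swap_space_gib: int, block_bytes: int) -> int:
    """Number of whole blocks that fit into the swap-space budget (hypothetical helper)."""
    return (swap_space_gib * GiB) // block_bytes


# Illustrative dimensions only; they are not taken from the PR.
block_bytes = kv_block_bytes(block_size=8, num_layers=40, num_heads=40,
                             head_size=128, dtype_bytes=2)
cpu_blocks = num_cpu_blocks(swap_space_gib=20, block_bytes=block_bytes)
print(f"{block_bytes} bytes per block, {cpu_blocks} CPU blocks")
```

The point of the design is that the CPU cache size follows from a single user-facing number (the swap space) rather than from a memory-utilization fraction.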


* Return support for other models apart from jamba
* Support n>1
* A little cleanup
* Rename
* Apply whitespace suggestions from code review
* Add max batch size to the main func
* Fixed attention kv cache bug
* log where requests id are deleted from the dict to debug mode
* Fix typo
* Align with v0.3.3 vllm code
* Remove comments
* Take out model config from CUDAGraph object
* Fix
* Fix typo
* Make the kv cache selection cleaner
* Another typo
* Took the num layers calc outside
* Remove the -1
* Set as num layer / period

---------

Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>




Test coverage (Issue #8 requirements):

- Quantization: FP16, BF16, FP8 with per-layer and per-token scales
- Acceptance patterns: 0%, 50%, 100%, contiguous prefix, sparse
- Multi-layer staging across multiple attention layers
- Cache layouts: Flash [B,T,H,D] and Paged [B,H,T,D]
- Safety: disabled NWOR overhead, empty mask, int32 slot conversion
- Edge cases: zero acceptance, full acceptance, partial patterns

Total: 17 test cases covering all design review requirements. User will run with pytest for validation.

This is Phase 7 (tests) of the draft commit implementation. Implementation complete - ready for user testing.


- Add Exam Grading OCR MCP with Vision AI for handwriting recognition
- Add Learning Analytics MCP with PostgreSQL integration
- Add Student Database MCP for student information management
- Add detailed Agent execution sequence (STEP 1-315, 600+ lines)
- Add comprehensive ROI analysis showing 99.6% cost reduction
- Include automated grading, weak area analysis, and personalized study plans
- Support both multiple choice (OMR) and essay grading with AI


…Manager

- Add store_threshold >= 2 validation in FilterReusedOffloadingManager constructor (mirrors the existing max_tracker_size >= 1 guard)
- Fix cpu.py gate from > 1 to >= 2; update comment to clarify that values < 2 disable filtering
- Add internal assertions to test_filter_reused_manager to verify tracker eviction and count reset (Comments vllm-project#8 and vllm-project#9)
- Remove tests/v1/kv_offload/__init__.py (not needed for pytest discovery)
- Remove accidentally tracked dev-workflow files (.patch, diff*.txt, error.txt, log files, mypy/test output files)

Signed-off-by: Srinivasoo7 <158864704+Srinivasoo7@users.noreply.github.com>

## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All fixes are from upstream vLLM `main` and address critical bugs affecting RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:** [INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|------------|------|---------|
| 1 | [vllm-project#30550](vllm-project#30550) | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models |
| 2 | [vllm-project#31406](vllm-project#31406) | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend |
| 3 | [vllm-project#34243](vllm-project#34243) | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) |
| 4 | [vllm-project#34454](vllm-project#34454) | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) |
| 5 | [vllm-project#34507](vllm-project#34507) | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models |
| 6 | [vllm-project#35085](vllm-project#35085) | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support |
| 7 | [vllm-project#35456](vllm-project#35456) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) |
| 8 | [vllm-project#35510](vllm-project#35510) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint |

## Conflict resolutions

<details>
<summary>#1 - llama-nemotron-embed / score-template support (vllm-project#30550): Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.

</details>

<details>
<summary>#2 - Triton Attention (vllm-project#31406): Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.

</details>

<details>
<summary>#3 - Llama-4 attn quant (vllm-project#34243): Clean cherry-pick, no conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but the fix targets a self-contained block.

</details>

<details>
<summary>#4 - GPT-OSS multi-turn (vllm-project#34454): Clean cherry-pick, no conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored imports in `gptoss_reasoning_parser.py`. The fix logic (adding `eom_token_id` early-exit check in `is_reasoning_end`) was independent of the import changes.

</details>

<details>
<summary>#5 - Fused MoE int32 overflow (vllm-project#34507): Conflicts in 2 files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30 intermediate upstream commits refactored `fused_moe_kernel` with conditional `naive_block_assignment` logic that doesn't exist in `rhai/0.13.0`. Resolved by keeping our simpler code and applying only the int64 cast fix:

- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()` result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)` before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on `make_dummy_moe_config()` from intermediate refactors. Resolved by keeping our existing test code (no test changes).

</details>

<details>
<summary>#6 - AllReduceFusionPass multicast (vllm-project#35085): Conflict due to file rename + API change</summary>

Upstream moved `collective_fusion.py` → `compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to `create_allreduce_fusion_workspace()`. Resolved by applying the try/except wrapper around our existing `trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in `collective_fusion.py`. The error handling logic (catching RuntimeError with "multicast" in message, logging warning, returning early) is identical to upstream.

</details>

<details>
<summary>#7 - response_format validation for completions (vllm-project#35456): Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and `chat_completion/protocol.py`. Our branch still has the monolithic `protocol.py`. Resolved by:

- Removing the non-existent `vllm/entrypoints/openai/completion/protocol.py`
- Manually adding `validate_response_format` model_validator to `CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which doesn't exist in our branch; `ValueError` is already handled as 400 Bad Request in `serving_engine.py`)
- Test additions from upstream applied cleanly to `test_completion_error.py`

</details>

<details>
<summary>#8 - response_format validation for chat completions (vllm-project#35510): Conflict due to file restructuring</summary>

Same file restructuring issue as #6. Resolved by:

- Removing the non-existent `vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding `validate_response_format` model_validator to `ChatCompletionRequest` in our `protocol.py`
- Only accepting the `test_json_schema_response_format_missing_schema` test from the conflict (discarding ~140 lines of intermediate upstream tests that reference non-existent paths in our branch)

</details>

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the backported score-template / bidirectional model support
- [ ] Verify Llama-4 quantized model loads correctly with int8/fp8 attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32 overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast) gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without `json_schema` field returns 400 (not 500) for both `/v1/completions` and `/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with Triton attention backend on ROCm

[INFERENG-4743]: https://redhat.atlassian.net/browse/INFERENG-4743?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4800]: https://redhat.atlassian.net/browse/INFERENG-4800?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4746]: https://redhat.atlassian.net/browse/INFERENG-4746?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5032]: https://redhat.atlassian.net/browse/INFERENG-5032?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5038]: https://redhat.atlassian.net/browse/INFERENG-5038?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5106]: https://redhat.atlassian.net/browse/INFERENG-5106?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
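The int32 overflow behind the fused MoE fix in the description above can be illustrated outside of Triton. The following is a small emulation in plain Python, not the kernel code: `wrap_int32` is a hypothetical helper that mimics 32-bit signed wrap-around, and the stride and offset values are made up for the example. Python integers never overflow, so the unwrapped product stands in for the result obtained after widening the operands to int64.

```python
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def wrap_int32(x: int) -> int:
    """Return the value a 32-bit signed integer would hold after wrap-around."""
    return (x - INT32_MIN) % (1 << 32) + INT32_MIN


# Illustrative values only: a row stride and a token offset whose product
# exceeds INT32_MAX, as can happen when indexing into a very large tensor.
stride = 40_000
offset = 70_000

exact = stride * offset      # what a 64-bit multiply would produce
wrapped = wrap_int32(exact)  # what a 32-bit multiply would produce

print(f"64-bit product: {exact}")
print(f"32-bit product: {wrapped}")
print(f"fits in int32:  {INT32_MIN <= exact <= INT32_MAX}")
```

A wrapped, negative address offset is why widening the offsets before the multiply matters: the cast has to happen on the operands, not on the already-overflowed product.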

…llm-project#8)

Add optional `get_desired_lora_slots()` method to the `LoRAResolver` ABC with a default `return None` so all existing subclasses remain unaffected.

The engine will call this hook between batches when dynamic_lora_slots=True to let resolver implementations signal a desired GPU slot count. The returned value is clamped to [min_loras, max_loras] by the engine (implemented in vllm-project#13).

Closes vllm-project#8

Co-authored-by: Claude
Signed-off-by: Chen Wang <Chen.Wang1@ibm.com>
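The pattern this commit message describes (an optional hook on an abstract base class, with the caller clamping the result) might look roughly like the sketch below. It is self-contained and does not import vLLM: `ResolverSketch`, its `resolve` method, the two subclasses, and `next_slot_count` are hypothetical stand-ins, and only the `get_desired_lora_slots` name, the `None` default, and the `[min_loras, max_loras]` clamp come from the message itself.

```python
from abc import ABC, abstractmethod
from typing import Optional


class ResolverSketch(ABC):
    """Hypothetical stand-in for a resolver ABC; not the real vLLM interface."""

    @abstractmethod
    def resolve(self, lora_name: str) -> Optional[str]:
        """Placeholder for whatever abstract method the real ABC requires."""

    def get_desired_lora_slots(self) -> Optional[int]:
        # Non-abstract with a None default, so existing subclasses need no changes.
        return None


class LegacyResolver(ResolverSketch):
    """An existing subclass that never heard of the new hook."""

    def resolve(self, lora_name: str) -> Optional[str]:
        return None


class DemandAwareResolver(ResolverSketch):
    """A subclass that opts in and signals a desired slot count."""

    def __init__(self, desired: int) -> None:
        self.desired = desired

    def resolve(self, lora_name: str) -> Optional[str]:
        return None

    def get_desired_lora_slots(self) -> Optional[int]:
        return self.desired


def next_slot_count(resolver: ResolverSketch, current: int,
                    min_loras: int, max_loras: int) -> int:
    """Caller-side logic: keep the current count on None, otherwise clamp."""
    desired = resolver.get_desired_lora_slots()
    if desired is None:
        return current
    return max(min_loras, min(max_loras, desired))


print(next_slot_count(LegacyResolver(), current=4, min_loras=1, max_loras=8))
print(next_slot_count(DemandAwareResolver(32), current=4, min_loras=1, max_loras=8))
```

Keeping the clamp on the caller side means a resolver cannot push the slot count outside the configured bounds, whatever value it returns.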


WoosukKwon added 6 commits March 13, 2023 18:44

Handle empty inputs

e58e731

Minor

63ba824

Add namespace

d7eb2e0

Default batch size 2048 -> 2560

d02d394

memory utilization -> swap space

532365e

Fetch requests every step

d87b2b0

WoosukKwon merged commit cfae35b into main Mar 13, 2023

WoosukKwon deleted the minor branch March 13, 2023 20:48

TheBloke mentioned this pull request Jul 20, 2023

Can't launch OpenAI API server on newly installed vLLM in Docker - fastchat not found #537

Closed

v1nc3nt27 pushed a commit to v1nc3nt27/vllm that referenced this pull request Sep 12, 2023

Merge pull request vllm-project#8 from ri938/organise

2617c55

Organise

xiangyuT pushed a commit to xiangyuT/vllm that referenced this pull request Oct 24, 2023

Comments & minor changes (vllm-project#8)

a8561b8

shanshanpt mentioned this pull request Nov 17, 2023

Run long context error: CUDA error: an illegal memory access was encountered #1700

Closed

junior-zsy mentioned this pull request Nov 20, 2023

Error with 32k Long Text in chatglm2-6b-32k Model #1725

Closed

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

Add miscellaneous updates (vllm-project#8)

cd9f1ac

sfc-gh-hazhang pushed a commit to sfc-gh-hazhang/vllm that referenced this pull request May 7, 2024

Merge pull request vllm-project#8 from Snowflake-Labs/remove-dummy

15de0c2

remove dummy path in arctic

yuhuixu1993 mentioned this pull request Jun 2, 2024

[Bug]: loading squeezellm model #5190

Closed

ykim362 pushed a commit to ykim362/vllm that referenced this pull request Jun 17, 2024

Merge pull request vllm-project#8 from Starmys/dev/chengzhang/phi3moe…

dfaba7c

…128k Support Phi3SuScaledRotaryEmbedding for 128k model

This was referenced Jul 5, 2024

Support W4A8 quantization for vllm #5218

Merged

[Bug]: call for stack trace for "Watchdog caught collective operation timeout" #6042

Closed

xinzaifeixiang1992 mentioned this pull request Jul 24, 2024

[Bug]: Deploying the Qwen2-72b-instruct-awq model with vllm-0.5.3.post1: the service works normally at first, but errors occur under high concurrency #6734

Closed

alixiaodi mentioned this pull request Aug 2, 2024

[Bug]: #7072

Closed

Minami-su mentioned this pull request Aug 11, 2024

[Bug]: vllm is crashed on v0.5.3.post1 #7161

Closed

zeroorhero pushed a commit to zeroorhero/vllm that referenced this pull request Sep 23, 2024

Merge pull request vllm-project#8 from KuntaiDu/jiayi-dev-v2

0dd3571

update overhead benchmark

liulisi16323 mentioned this pull request Sep 24, 2024

[Bug]: v0.5.5 crash: "AssertionError: expected running sequences" #8016

Closed

1 task

SpaceHunterInf mentioned this pull request Sep 30, 2024

[Bug]: Bus error (core dumped) #8974

Closed

1 task

Michel-debug mentioned this pull request Oct 23, 2025

[Bug]: qwen3-vl-2b after ms-swift fine-tuning lance errors #27405

Closed

1 task

whwangovo mentioned this pull request Oct 23, 2025

[Bug]: vLLM (TP=8) on 235B model triggers "CUDA error: unspecified launch failure" and persistent "ERR!" state in nvidia-smi #27430

Open

1 task

acodercat mentioned this pull request Nov 10, 2025

[Bugfix] Add strong reference to CUDA pluggable allocator callbacks #23477

Merged

4 tasks

pisceskkk added a commit to pisceskkk/vllm that referenced this pull request Nov 12, 2025

Merge pull request vllm-project#8 from pisceskkk/long_seq_dev

6041fdf

Merge to vllm:main

sravan500 mentioned this pull request Nov 25, 2025

[Bug]: vllm/vllm-openai:v0.11.0 deployment --quantization fp8 throws cuda and tensor errors #29374

Closed

1 task

slwang-ustc mentioned this pull request Nov 27, 2025

[Bug]: RuntimeError "cancelled" when using pipeline parallelism with Qwen3-14B #29085

Closed

1 task

BJWang-ant mentioned this pull request Dec 17, 2025

[Bug]: Qwen3-32B with MTP, run failed. #30766

Open

1 task

This was referenced Jan 27, 2026

[Feature] Emit journey events to core spans (PR #4/9) #33136

Closed

[Feature] Add API parent span lifecycle management (PR #6/9) #33182

Closed

[Feature] Add API↔Engine context propagation for journey tracing (PR #7/9) #33190

Closed

tjtanaa pushed a commit to tjtanaa/vllm that referenced this pull request Jan 29, 2026

Merge pull request vllm-project#8 from hsliuustc0106/hsliu-dev-C

c150346

Add PR and issue templates from vLLM project

Lrcx mentioned this pull request Jan 29, 2026

[Bug]: Crash when using presence_penalty with Qwen3-VL in v0.11.0 #33338

Open

1 task

HervorTao mentioned this pull request Feb 3, 2026

[Bug]: [CPU Backend] AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' #33675

Closed

1 task

JGSweets mentioned this pull request Mar 9, 2026

[Bug]: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. #28028

Open

1 task

LironKesem mentioned this pull request Mar 12, 2026

[Bug] DGX Spark (sm_121): CUTLASS can_implement() rejects sm_120f binaries #36835

Closed

1 task

lavanyabollepalli mentioned this pull request Mar 12, 2026

[Bug]: GPU failure during repeated model loading when using --enable-prefix-caching with KV transfer (LMCacheConnectorV1) #36852

Open

1 task

mahaocong90 mentioned this pull request Mar 17, 2026

[Bug]: QWEN 3.5-397B-A17B report "RPC call to sample_tokens timed out" #37250

Closed

1 task

watch-Ultra mentioned this pull request Mar 18, 2026

[Bug]: Error during inference and the model shut down; the deployed model is Qwen3.5-122B-A10B-FP8 #37392

Open

1 task

This was referenced Mar 20, 2026

Fix XPU segfault when tensor_parallel_size exceeds available devices hongbolv/vllm#5

Closed

Fix XPU Level Zero crash by setting per-worker ZE_AFFINITY_MASK hongbolv/vllm#6

Closed

RocketRider mentioned this pull request Mar 21, 2026

Mamba-2 Triton kernels crash with illegal instruction on SM121 (DGX Spark) without CUDA_LAUNCH_BLOCKING=1 #37431

Open

Damon-Salvetore pushed a commit to Damon-Salvetore/vllm that referenced this pull request Mar 31, 2026

Merge pull request vllm-project#8 from bcacdwk/copilot/update-framewo…

4447366

…rk-slidesparse Update framework_slidesparse.md: restructure into a seven-stage engineering workflow and refine implementation details
