Support beam search & parallel generation by WoosukKwon · Pull Request #7 · vllm-project/vllm

WoosukKwon · 2023-03-09T20:40:36Z

This PR adds support for beam search and parallel generation (i.e., n > 1).

NOTE: Correctness has been checked only for beam search, not for the random sampling methods.

Tested models:

OPT-125M
OPT-350M
OPT-1.3B
OPT-2.7B
OPT-6.7B
OPT-13B

Tested GPUs:

A100
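
For readers unfamiliar with the technique, the following is a minimal sketch of a single beam search step. It is not this PR's implementation: `beam_search_step`, `toy_model`, and the three-token vocabulary are hypothetical names chosen for illustration. Each beam is expanded by every candidate token, and only the `beam_width` candidates with the highest cumulative log-probability survive.

```python
import math
from typing import Callable, Dict, List, Tuple

Beam = Tuple[List[int], float]  # (token ids, cumulative log-probability)


def beam_search_step(
    beams: List[Beam],
    next_logprobs: Callable[[List[int]], Dict[int, float]],
    beam_width: int,
) -> List[Beam]:
    """Expand every beam by one token and keep the `beam_width` best candidates."""
    candidates: List[Beam] = []
    for tokens, score in beams:
        for token_id, logprob in next_logprobs(tokens).items():
            # `tokens + [token_id]` builds a new list, so beams never share state.
            candidates.append((tokens + [token_id], score + logprob))
    # Highest cumulative log-probability first; deterministic, no sampling.
    candidates.sort(key=lambda beam: beam[1], reverse=True)
    return candidates[:beam_width]


def toy_model(tokens: List[int]) -> Dict[int, float]:
    """Hypothetical stand-in for a model: a fixed distribution over 3 tokens."""
    return {0: math.log(0.5), 1: math.log(0.3), 2: math.log(0.2)}


beams: List[Beam] = [([], 0.0)]
for _ in range(2):
    beams = beam_search_step(beams, toy_model, beam_width=2)
```

Because the selection is a deterministic top-k over scores, no random sampling is involved at any step.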

…llm-project#7)

* add xpu path

Signed-off-by: Lin, Fanli <fanli.lin@intel.com>

* use partial to create a function wrapper

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Signed-off-by: Lin, Fanli <fanli.lin@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

- Add Legal Document OCR MCP with Vision AI for image PDF processing
- Add Legal Database MCP with Korean legal API integration (대법원 API)
- Add Legal Document Generation MCP with ReportLab PDF generation
- Add detailed Agent execution sequence (STEP 1-9, 300+ lines)
- Add comprehensive ROI analysis showing 940% revenue increase potential
- Include fraud detection and contract risk analysis capabilities


This commit adds comprehensive test coverage for span cleanup in error and completion paths to ensure no memory leaks.

New Tests:

1. test_cleanup_on_abort_path
   - Verifies _core_spans cleanup when request is aborted
   - Tests FINISHED_ABORTED status path
   - Confirms span is closed and all tracking state removed
   - Validates FINISHED event emitted with "aborted" status

2. test_cleanup_on_natural_completion
   - Verifies _core_spans cleanup on natural completion (EOS/max_tokens)
   - Tests FINISHED_STOPPED status path
   - Confirms span is closed and all tracking state removed
   - Validates FINISHED event emitted with "stopped" status

Coverage:

- Abort path: finish_requests() with FINISHED_ABORTED
- Natural completion: finish_requests() with FINISHED_STOPPED
- Both paths verify:
  - _core_spans[request_id] removed
  - _journey_prefill_hiwater[request_id] removed
  - _first_token_emitted does not contain request_id
  - Span is properly closed (end_called=True, is_recording()=False)

All 11 span tests pass, confirming no memory leaks in any termination path.

Related: Addresses Task #7 (optional but important for production safety)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
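
The cleanup checks described in this commit message could look roughly like the sketch below. Only the attribute names `_core_spans`, `_journey_prefill_hiwater`, `_first_token_emitted` and the "aborted"/"stopped" statuses come from the message; `FakeSpan`, `JourneyTracker`, `finish_request`, and `check_cleanup` are hypothetical stand-ins, not the code under test.

```python
from typing import Dict, List, Set, Tuple


class FakeSpan:
    """Hypothetical test double that records whether end() was called."""

    def __init__(self) -> None:
        self.end_called = False

    def end(self) -> None:
        self.end_called = True

    def is_recording(self) -> bool:
        return not self.end_called


class JourneyTracker:
    """Hypothetical holder of the per-request state named in the commit message."""

    def __init__(self) -> None:
        self._core_spans: Dict[str, FakeSpan] = {}
        self._journey_prefill_hiwater: Dict[str, int] = {}
        self._first_token_emitted: Set[str] = set()
        self.events: List[Tuple[str, str, str]] = []

    def start(self, request_id: str) -> FakeSpan:
        span = FakeSpan()
        self._core_spans[request_id] = span
        self._journey_prefill_hiwater[request_id] = 0
        self._first_token_emitted.add(request_id)
        return span

    def finish_request(self, request_id: str, status: str) -> None:
        # Emit the FINISHED event, close the span, then drop all tracking state.
        self.events.append((request_id, "FINISHED", status))
        span = self._core_spans.pop(request_id, None)
        if span is not None:
            span.end()
        self._journey_prefill_hiwater.pop(request_id, None)
        self._first_token_emitted.discard(request_id)


def check_cleanup(status: str) -> None:
    tracker = JourneyTracker()
    span = tracker.start("req-1")
    tracker.finish_request("req-1", status)
    assert "req-1" not in tracker._core_spans
    assert "req-1" not in tracker._journey_prefill_hiwater
    assert "req-1" not in tracker._first_token_emitted
    assert span.end_called and not span.is_recording()
    assert tracker.events[-1] == ("req-1", "FINISHED", status)


check_cleanup("aborted")  # abort path
check_cleanup("stopped")  # natural completion path
```

Running the same assertions against both statuses mirrors the point of the commit: every termination path must leave no per-request state behind.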

## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All fixes are from upstream vLLM `main` and address critical bugs affecting RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:** [INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|------------|------|---------|
| 1 | [vllm-project#30550](vllm-project#30550) | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models |
| 2 | [vllm-project#31406](vllm-project#31406) | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend |
| 3 | [vllm-project#34243](vllm-project#34243) | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) |
| 4 | [vllm-project#34454](vllm-project#34454) | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) |
| 5 | [vllm-project#34507](vllm-project#34507) | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models |
| 6 | [vllm-project#35085](vllm-project#35085) | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support |
| 7 | [vllm-project#35456](vllm-project#35456) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) |
| 8 | [vllm-project#35510](vllm-project#35510) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint |

## Conflict resolutions

<details>
<summary>#1 — llama-nemotron-embed / score-template support (vllm-project#30550): Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.

</details>

<details>
<summary>#2 — Triton Attention (vllm-project#31406): Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.

</details>

<details>
<summary>#3 — Llama-4 attn quant (vllm-project#34243): Clean cherry-pick, no conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but the fix targets a self-contained block.

</details>

<details>
<summary>#4 — GPT-OSS multi-turn (vllm-project#34454): Clean cherry-pick, no conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored imports in `gptoss_reasoning_parser.py`. The fix logic (adding `eom_token_id` early-exit check in `is_reasoning_end`) was independent of the import changes.

</details>

<details>
<summary>#5 — Fused MoE int32 overflow (vllm-project#34507): Conflicts in 2 files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30 intermediate upstream commits refactored `fused_moe_kernel` with conditional `naive_block_assignment` logic that doesn't exist in `rhai/0.13.0`. Resolved by keeping our simpler code and applying only the int64 cast fix:

- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()` result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)` before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on `make_dummy_moe_config()` from intermediate refactors. Resolved by keeping our existing test code (no test changes).

</details>

<details>
<summary>#6 — AllReduceFusionPass multicast (vllm-project#35085): Conflict due to file rename + API change</summary>

Upstream moved `collective_fusion.py` → `compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to `create_allreduce_fusion_workspace()`. Resolved by applying the try/except wrapper around our existing `trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in `collective_fusion.py`. The error handling logic (catching RuntimeError with "multicast" in message, logging warning, returning early) is identical to upstream.

</details>

<details>
<summary>#7 — response_format validation for completions (vllm-project#35456): Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and `chat_completion/protocol.py`. Our branch still has the monolithic `protocol.py`. Resolved by:

- Removing the non-existent `vllm/entrypoints/openai/completion/protocol.py`
- Manually adding `validate_response_format` model_validator to `CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which doesn't exist in our branch; `ValueError` is already handled as 400 Bad Request in `serving_engine.py`)
- Test additions from upstream applied cleanly to `test_completion_error.py`

</details>

<details>
<summary>#8 — response_format validation for chat completions (vllm-project#35510): Conflict due to file restructuring</summary>

Same file restructuring issue as #6. Resolved by:

- Removing the non-existent `vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding `validate_response_format` model_validator to `ChatCompletionRequest` in our `protocol.py`
- Only accepting the `test_json_schema_response_format_missing_schema` test from the conflict (discarding ~140 lines of intermediate upstream tests that reference non-existent paths in our branch)

</details>

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the backported score-template / bidirectional model support
- [ ] Verify Llama-4 quantized model loads correctly with int8/fp8 attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32 overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast) gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without `json_schema` field returns 400 (not 500) for both `/v1/completions` and `/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with Triton attention backend on ROCm

[INFERENG-4743]: https://redhat.atlassian.net/browse/INFERENG-4743?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4800]: https://redhat.atlassian.net/browse/INFERENG-4800?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4746]: https://redhat.atlassian.net/browse/INFERENG-4746?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5032]: https://redhat.atlassian.net/browse/INFERENG-5032?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5038]: https://redhat.atlassian.net/browse/INFERENG-5038?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5106]: https://redhat.atlassian.net/browse/INFERENG-5106?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
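
The fused MoE fix in this description casts `offs_token` to int64 before it is multiplied by a stride. The arithmetic behind that can be sketched in plain Python, which has unbounded integers, by simulating 32-bit wraparound explicitly. This is not Triton code; `wrap_int32`, `offset_int32`, `offset_int64`, and the sample numbers are hypothetical and chosen only to cross the 2^31 boundary.

```python
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Reduce an integer to the value a 32-bit signed register would hold."""
    return (value - INT32_MIN) % 2**32 + INT32_MIN


def offset_int32(token_index: int, stride: int) -> int:
    """Offset computed entirely in 32-bit arithmetic (the buggy variant)."""
    return wrap_int32(token_index * stride)


def offset_int64(token_index: int, stride: int) -> int:
    """Offset computed after widening to 64 bits (the fixed variant)."""
    return token_index * stride  # 2.5e9 fits comfortably in 64 bits


token_index, stride = 50_000, 50_000     # illustrative values only
true_offset = offset_int64(token_index, stride)
wrapped_offset = offset_int32(token_index, stride)

print(true_offset > INT32_MAX)  # True: the product exceeds the int32 range
print(wrapped_offset)           # negative: the pointer offset is corrupted
```

A negative or wrapped offset makes the kernel read or write outside the intended buffer, which is consistent with crashes appearing only once models are large enough for the product to exceed 2^31 - 1.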

…-project#7)

Adds five new fields to LoRAConfig in vllm/config/lora.py to support runtime dynamic resizing of GPU LoRA adapter slots:

- min_loras (int, ge=1): floor for dynamic slot shrinking
- dynamic_lora_slots (bool): enables automatic watermark-driven scaling
- lora_mem_high_watermark (float, 0<x<1): scale-down threshold
- lora_mem_low_watermark (float, 0<x<1): scale-up threshold
- lora_slot_resize_cooldown_s (float, ge=0): anti-thrash cooldown

Cross-field validation added to _validate_lora_config():

- min_loras <= max_loras (when dynamic_lora_slots=True)
- lora_mem_low_watermark < lora_mem_high_watermark (when dynamic=True)

Field-level bounds (ge/gt/lt) enforced by Pydantic at construction time.

dynamic_lora_slots added to compute_hash() as it affects the CudaGraph specialization path (disables LoRA cudagraph when True, see issue vllm-project#14).

All new fields default to safe values so existing configs are unaffected when dynamic_lora_slots=False (the default).

Includes 16 unit tests in tests/lora/test_lora_config_dynamic.py covering defaults, valid configs, all validation error paths, and compute_hash() behavior.

Closes vllm-project#7
Closes vllm-project#18

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Chen Wang <Chen.Wang1@ibm.com>
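
The validation rules in this commit message can be illustrated with a standard-library dataclass. This is a sketch under assumptions, not the fork's code: the commit uses Pydantic for the field-level bounds, whereas here they are checked by hand; `DynamicLoRAConfigSketch` is a hypothetical class name; and the default values are illustrative, since the message only says the defaults are "safe".

```python
from dataclasses import dataclass


@dataclass
class DynamicLoRAConfigSketch:
    """Hypothetical stand-in for the LoRAConfig fields named in the commit."""

    max_loras: int = 4                        # illustrative default
    min_loras: int = 1
    dynamic_lora_slots: bool = False
    lora_mem_high_watermark: float = 0.9      # illustrative default
    lora_mem_low_watermark: float = 0.7       # illustrative default
    lora_slot_resize_cooldown_s: float = 5.0  # illustrative default

    def __post_init__(self) -> None:
        # Field-level bounds (the commit leaves these to Pydantic).
        if self.min_loras < 1:
            raise ValueError("min_loras must be >= 1")
        for name in ("lora_mem_high_watermark", "lora_mem_low_watermark"):
            if not 0.0 < getattr(self, name) < 1.0:
                raise ValueError(f"{name} must be in (0, 1)")
        if self.lora_slot_resize_cooldown_s < 0:
            raise ValueError("lora_slot_resize_cooldown_s must be >= 0")
        # Cross-field checks apply only when dynamic scaling is enabled.
        if self.dynamic_lora_slots:
            if self.min_loras > self.max_loras:
                raise ValueError("min_loras must be <= max_loras")
            if self.lora_mem_low_watermark >= self.lora_mem_high_watermark:
                raise ValueError("low watermark must be < high watermark")


# With dynamic scaling off, the cross-field rules are not enforced.
static_cfg = DynamicLoRAConfigSketch(max_loras=2, min_loras=3)

# With dynamic scaling on, the same values are rejected.
try:
    DynamicLoRAConfigSketch(max_loras=2, min_loras=3, dynamic_lora_slots=True)
    rejected = False
except ValueError:
    rejected = True
```

Gating the cross-field checks on `dynamic_lora_slots` is what keeps existing configurations unaffected when the feature is disabled.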


WoosukKwon added 30 commits March 6, 2023 19:06

Minor

e0a8519

Add get_seqs

a320382

Minor

cf9536f

max_context_len -> context_window_size

19ff0d0

Add __repr__ to SamplingParams

c6ae9e1

Minor

22822d8

Enhance Frontend and SamplingParams

271d5df

Add get_last_token_id

ab13c30

Add InputSequenceGroup

2f04d46

Support temperature & top_p sampling

f743df9

Support parallel generation

f10fbac

Use n=2 for test inputs

f1f49f8

Enforce zero temperature for beam search

ac85d81

Remove group_id from seq_groups

b790887

Refactor Sampler

bdbb3f9

Minor

261f3cd

Use replacement=True for torch.multinomial

6184793

Fix a bug in block copy

c158f6e

InputSequenceGroup -> SequenceGroupInput

4340cdb

SequenceGroupInput -> SequenceGroupInputs

893c1a0

Add SequenceOutputs & Store logprobs for sequences

7b8889c

Add num_logprobs to SamplingParams

f8493e6

Use num_logprobs in sampling_params

38244e4

[WIP] Refactor

2a4b8bb

Minor

d449b3d

[WIP] Refactor

2ac01dc

Implement beam search

de1c3d7

Minor

0daed38

Shallow copy -> deep copy

a0a55b0

Bugfix for beam search

e1f359a

Michel-debug mentioned this pull request Oct 23, 2025

[Bug]: qwen3-vl-2b after ms-swift fine-tuning lance errors #27405

Closed

1 task

whwangovo mentioned this pull request Oct 23, 2025

[Bug]: vLLM (TP=8) on 235B model triggers "CUDA error: unspecified launch failure" and persistent "ERR!" state in nvidia-smi #27430

Open

1 task

acodercat mentioned this pull request Nov 10, 2025

[Bugfix] Add strong reference to CUDA pluggable allocator callbacks #23477

Merged

4 tasks

sravan500 mentioned this pull request Nov 25, 2025

[Bug]: vllm/vllm-openai:v0.11.0 deployment --quantization fp8 throws cuda and tensor errors #29374

Closed

1 task

prashanth058 pushed a commit to prashanth058/vllm that referenced this pull request Nov 25, 2025

Merge pull request vllm-project#7 from prashanth058/mlm-connector-sup…

112779f

…port add connector support

slwang-ustc mentioned this pull request Nov 27, 2025

[Bug]: RuntimeError "cancelled" when using pipeline parallelism with Qwen3-14B #29085

Closed

1 task

guyueh1 referenced this pull request in guyueh1/vllm Dec 8, 2025

Merge pull request #7 from TomerBN-Nvidia/support-tp-8-for-mxfp8

e7e3064

support tp8

BJWang-ant mentioned this pull request Dec 17, 2025

[Bug]: Qwen3-32B with MTP, run failed. #30766

Open

1 task

This was referenced Jan 27, 2026

[Feature] Emit journey events to core spans (PR #4/9) #33136

Closed

[Feature] Add API parent span lifecycle management (PR #6/9) #33182

Closed

[Feature] Add API↔Engine context propagation for journey tracing (PR #7/9) #33190

Closed

Lrcx mentioned this pull request Jan 29, 2026

[Bug]: Crash when using presence_penalty with Qwen3-VL in v0.11.0 #33338

Open

1 task

HervorTao mentioned this pull request Feb 3, 2026

[Bug]: [CPU Backend] AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' #33675

Closed

1 task

JGSweets mentioned this pull request Mar 9, 2026

[Bug]: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. #28028

Open

1 task

LironKesem mentioned this pull request Mar 12, 2026

[Bug] DGX Spark (sm_121): CUTLASS can_implement() rejects sm_120f binaries #36835

Closed

1 task

lavanyabollepalli mentioned this pull request Mar 12, 2026

[Bug]: GPU failure during repeated model loading when using --enable-prefix-caching with KV transfer (LMCacheConnectorV1) #36852

Open

1 task

mahaocong90 mentioned this pull request Mar 17, 2026

[Bug]: QWEN 3.5-397B-A17B report "RPC call to sample_tokens timed out" #37250

Closed

1 task

watch-Ultra mentioned this pull request Mar 18, 2026

[Bug]: Error during inference, the model shut down. Deployed model: Qwen3.5-122B-A10B-FP8 #37392

Open

1 task

This was referenced Mar 20, 2026

Fix XPU segfault when tensor_parallel_size exceeds available devices hongbolv/vllm#5

Closed

Fix XPU Level Zero crash by setting per-worker ZE_AFFINITY_MASK hongbolv/vllm#6

Closed

RocketRider mentioned this pull request Mar 21, 2026

Mamba-2 Triton kernels crash with illegal instruction on SM121 (DGX Spark) without CUDA_LAUNCH_BLOCKING=1 #37431

Open

yuezhu1 added a commit to yuezhu1/vllm that referenced this pull request Mar 30, 2026

feat(lora): add dynamic LoRA slot scaling config fields

cd6e8d1

[Core] Add dynamic LoRA slot scaling fields to LoRAConfig (issue vllm-project#7)

Damon-Salvetore pushed a commit to Damon-Salvetore/vllm that referenced this pull request Mar 31, 2026

Merge pull request vllm-project#7 from bcacdwk/copilot/edit-framework…

bb39217

…-slidesparse-md Optimize framework_slidesparse.md: simplify code blocks, add implementation guidance

WoosukKwon merged 43 commits into main from parallel-generation

2 participants
