Support beam search & parallel generation #7

Merged
WoosukKwon merged 43 commits into main from parallel-generation
Mar 10, 2023

@WoosukKwon
Collaborator

@WoosukKwon WoosukKwon commented Mar 9, 2023

This PR adds support for beam search and parallel generation (i.e., n > 1).

NOTE: Correctness has been verified only for beam search, not for random sampling methods.
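
For readers less familiar with the decoding mode named above, the following is a minimal sketch of beam search over a toy next-token distribution. It is illustrative only and is not the code added by this PR: `toy_next_token_logprobs`, `beam_search`, and the two-token vocabulary are hypothetical names introduced here.

```python
import math
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[str, ...]


# Hypothetical toy "model": maps a token prefix to next-token log-probabilities.
# A real engine would run the language model here.
def toy_next_token_logprobs(prefix: Prefix) -> Dict[str, float]:
    if not prefix:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix[-1] == "a":
        return {"a": math.log(0.1), "b": math.log(0.9)}
    return {"a": math.log(0.5), "b": math.log(0.5)}


def beam_search(
    step_fn: Callable[[Prefix], Dict[str, float]],
    beam_width: int,
    max_len: int,
) -> List[Tuple[Prefix, float]]:
    """Keep the `beam_width` highest-scoring prefixes at every step."""
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    for _ in range(max_len):
        candidates: List[Tuple[Prefix, float]] = []
        for tokens, score in beams:
            for tok, logp in step_fn(tokens).items():
                candidates.append((tokens + (tok,), score + logp))
        # Rank by cumulative log-probability, best first, then prune.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


results = beam_search(toy_next_token_logprobs, beam_width=2, max_len=2)
```

Parallel generation with n > 1 differs in that each of the n sequences is sampled independently from the same prompt rather than being ranked and pruned against the others.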

Tested models:

  • OPT-125M
  • OPT-350M
  • OPT-1.3B
  • OPT-2.7B
  • OPT-6.7B
  • OPT-13B

Tested GPUs:

  • A100

yma11 pushed a commit to yma11/vllm that referenced this pull request Nov 14, 2025
…llm-project#7)

* add xpu path

Signed-off-by: Lin, Fanli <fanli.lin@intel.com>

* use partial to create a function wrapper

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Signed-off-by: Lin, Fanli <fanli.lin@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
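
The second bullet mentions using `partial` to create a function wrapper. Assuming this refers to `functools.partial` from the Python standard library, the pattern looks roughly like the sketch below. The wrapped function is not shown in the commit message, so `run_on_device` and the `"xpu"` argument are hypothetical stand-ins.

```python
from functools import partial


# Hypothetical helper: the function actually wrapped in the commit is not shown.
def run_on_device(device: str, x: int) -> str:
    return f"{device}:{x * 2}"


# partial() pre-binds `device`, yielding a wrapper that only needs `x`.
run_on_xpu = partial(run_on_device, "xpu")

result = run_on_xpu(21)
```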
yma11 pushed a commit to yma11/vllm that referenced this pull request Nov 16, 2025
…llm-project#7)

dik654 pushed a commit to dik654/vllm-for-study that referenced this pull request Nov 18, 2025
- Add Legal Document OCR MCP with Vision AI for image PDF processing
- Add Legal Database MCP with Korean legal API integration (대법원 API)
- Add Legal Document Generation MCP with ReportLab PDF generation
- Add detailed Agent execution sequence (STEP 1-9, 300+ lines)
- Add comprehensive ROI analysis showing 940% revenue increase potential
- Include fraud detection and contract risk analysis capabilities
prashanth058 pushed a commit to prashanth058/vllm that referenced this pull request Nov 25, 2025
guyueh1 referenced this pull request in guyueh1/vllm Dec 8, 2025
sriumcp referenced this pull request in inference-sim/vllm Jan 26, 2026
This commit adds comprehensive test coverage for span cleanup in error and
completion paths to ensure no memory leaks.

New Tests:

1. test_cleanup_on_abort_path
   - Verifies _core_spans cleanup when request is aborted
   - Tests FINISHED_ABORTED status path
   - Confirms span is closed and all tracking state removed
   - Validates FINISHED event emitted with "aborted" status

2. test_cleanup_on_natural_completion
   - Verifies _core_spans cleanup on natural completion (EOS/max_tokens)
   - Tests FINISHED_STOPPED status path
   - Confirms span is closed and all tracking state removed
   - Validates FINISHED event emitted with "stopped" status

Coverage:
- Abort path: finish_requests() with FINISHED_ABORTED
- Natural completion: finish_requests() with FINISHED_STOPPED
- Both paths verify:
  - _core_spans[request_id] removed
  - _journey_prefill_hiwater[request_id] removed
  - _first_token_emitted does not contain request_id
  - Span is properly closed (end_called=True, is_recording()=False)

All 11 span tests pass, confirming no memory leaks in any termination path.
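
The checks listed under "Coverage" can be sketched as follows. This is a simplified stand-in, not the tests from the commit: `FakeSpan`, `ToyTracker`, and `check_cleanup` are hypothetical, and only the attribute names `_core_spans`, `_journey_prefill_hiwater`, and `_first_token_emitted` and the method name `finish_requests` are taken from the message above.

```python
from typing import Dict, List, Set, Tuple


class FakeSpan:
    """Stand-in for a tracing span; records whether end() was called."""

    def __init__(self) -> None:
        self.end_called = False

    def end(self) -> None:
        self.end_called = True

    def is_recording(self) -> bool:
        return not self.end_called


class ToyTracker:
    """Hypothetical holder of the per-request tracking state."""

    def __init__(self) -> None:
        self._core_spans: Dict[str, FakeSpan] = {}
        self._journey_prefill_hiwater: Dict[str, int] = {}
        self._first_token_emitted: Set[str] = set()
        self.events: List[Tuple[str, str]] = []

    def add_request(self, request_id: str) -> FakeSpan:
        span = FakeSpan()
        self._core_spans[request_id] = span
        self._journey_prefill_hiwater[request_id] = 0
        self._first_token_emitted.add(request_id)
        return span

    def finish_requests(self, request_id: str, status: str) -> None:
        # Emit the FINISHED event, close the span, then drop all state.
        self.events.append(("FINISHED", status))
        span = self._core_spans.pop(request_id, None)
        if span is not None:
            span.end()
        self._journey_prefill_hiwater.pop(request_id, None)
        self._first_token_emitted.discard(request_id)


def check_cleanup(status: str) -> bool:
    """Return True when no state for the request survives termination."""
    tracker = ToyTracker()
    span = tracker.add_request("req-1")
    tracker.finish_requests("req-1", status)
    return (
        "req-1" not in tracker._core_spans
        and "req-1" not in tracker._journey_prefill_hiwater
        and "req-1" not in tracker._first_token_emitted
        and span.end_called
        and not span.is_recording()
        and tracker.events == [("FINISHED", status)]
    )
```

Calling `check_cleanup` once with an aborted status and once with a stopped status mirrors the two termination paths the commit covers.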

Related: Addresses Task #7 (optional but important for production safety)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 26, 2026
## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All
fixes are from upstream vLLM `main` and address critical bugs affecting
RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:**
[INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|------------|------|---------|
| 1 | [vllm-project#30550](vllm-project#30550) | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models |
| 2 | [vllm-project#31406](vllm-project#31406) | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend |
| 3 | [vllm-project#34243](vllm-project#34243) | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) |
| 4 | [vllm-project#34454](vllm-project#34454) | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) |
| 5 | [vllm-project#34507](vllm-project#34507) | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models |
| 6 | [vllm-project#35085](vllm-project#35085) | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support |
| 7 | [vllm-project#35456](vllm-project#35456) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) |
| 8 | [vllm-project#35510](vllm-project#35510) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint |


## Conflict resolutions

<details>
<summary><b>#1 — llama-nemotron-embed / score-template support
(vllm-project#30550)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#2 — Triton Attention (vllm-project#31406)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#3 — Llama-4 attn quant (vllm-project#34243)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but
the fix targets a self-contained block.
</details>

<details>
<summary><b>#4 — GPT-OSS multi-turn (vllm-project#34454)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored
imports in `gptoss_reasoning_parser.py`. The fix logic (adding
`eom_token_id` early-exit check in `is_reasoning_end`) was independent
of the import changes.
</details>

<details>
<summary><b>#5 — Fused MoE int32 overflow (vllm-project#34507)</b>: Conflicts in 2
files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30
intermediate upstream commits refactored `fused_moe_kernel` with
conditional `naive_block_assignment` logic that doesn't exist in
`rhai/0.13.0`. Resolved by keeping our simpler code and applying only
the int64 cast fix:
- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()`
result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)`
before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on
`make_dummy_moe_config()` from intermediate refactors. Resolved by
keeping our existing test code (no test changes).
</details>
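
To illustrate why the int64 cast in item #5 matters, here is a small pure-Python sketch of the overflow class being fixed: a `stride * offset` product that fits in 64 bits but wraps around in 32 bits. The numbers are hypothetical and chosen only so that the product exceeds the int32 maximum. `wrap_int32` is a helper introduced here to emulate 32-bit arithmetic and is not part of the Triton kernel.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def wrap_int32(value: int) -> int:
    """Emulate two's-complement wraparound of a signed 32-bit integer."""
    return (value - INT32_MIN) % 2**32 + INT32_MIN


# Hypothetical sizes, not taken from any real model.
token_offset = 600_000  # index of a token row
stride = 4_096          # elements per row

exact = token_offset * stride  # result when computed in int64
wrapped = wrap_int32(exact)    # result when computed in int32
```

Here `exact` is 2,457,600,000, which is above `INT32_MAX`, so the 32-bit result wraps to a negative value and would address the wrong memory. Casting the offset to int64 before the multiplication avoids the wraparound.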

<details>
<summary><b>#6 — AllReduceFusionPass multicast (vllm-project#35085)</b>: Conflict
due to file rename + API change</summary>

Upstream moved `collective_fusion.py` →
`compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API
from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to
`create_allreduce_fusion_workspace()`. Resolved by applying the
try/except wrapper around our existing
`trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in
`collective_fusion.py`. The error handling logic (catching RuntimeError
with "multicast" in message, logging warning, returning early) is
identical to upstream.
</details>
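
The error handling described in item #6 (catch a `RuntimeError` whose message mentions "multicast", log a warning, return early) can be sketched as below. This is a stand-alone approximation, not the backported code: `init_fusion_workspace` and `_no_multicast` are hypothetical, the workspace factory is passed in as a callable so the example runs without a GPU, and re-raising unrelated errors is an assumption.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("allreduce_fusion_sketch")


def init_fusion_workspace(
    create_workspace: Callable[[], object],
) -> Optional[object]:
    """Return the workspace, or None when multicast is unavailable."""
    try:
        return create_workspace()
    except RuntimeError as exc:
        if "multicast" not in str(exc):
            raise  # assumption: unrelated failures should still surface
        logger.warning("Disabling all-reduce fusion: %s", exc)
        return None


def _no_multicast() -> object:
    # Simulates a GPU that lacks multicast support.
    raise RuntimeError("multicast is not supported on this device")


workspace = init_fusion_workspace(_no_multicast)
```

A caller that receives `None` skips the fusion pass and continues with the unfused path instead of crashing at startup.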

<details>
<summary><b>#7 — response_format validation for completions
(vllm-project#35456)</b>: Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and
`chat_completion/protocol.py`. Our branch still has the monolithic
`protocol.py`. Resolved by:
- Removing the non-existent
`vllm/entrypoints/openai/completion/protocol.py`
- Manually adding `validate_response_format` model_validator to
`CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which
doesn't exist in our branch; `ValueError` is already handled as 400 Bad
Request in `serving_engine.py`)
- Test additions from upstream applied cleanly to
`test_completion_error.py`
</details>
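
As a rough, standard-library-only illustration of the validation in item #7: the actual change adds a pydantic `model_validator` to `CompletionRequest`, which is not reproduced here. The function body, the error message, and the `status_for` helper below are assumptions that only mirror the behavior stated in the test plan (a `json_schema` response format without a `json_schema` field yields 400 rather than 500).

```python
from typing import Any, Dict, Optional


def validate_response_format(
    response_format: Optional[Dict[str, Any]],
) -> None:
    """Raise ValueError for a json_schema request that carries no schema."""
    if response_format is None:
        return
    is_json_schema = response_format.get("type") == "json_schema"
    if is_json_schema and not response_format.get("json_schema"):
        raise ValueError(
            "response_format of type 'json_schema' requires a "
            "'json_schema' field"
        )


def status_for(response_format: Optional[Dict[str, Any]]) -> int:
    """Hypothetical handler: map the validation outcome to an HTTP status."""
    try:
        validate_response_format(response_format)
    except ValueError:
        return 400  # client error, rather than an unhandled 500
    return 200
```

Raising `ValueError` instead of using `assert` matters because assertions can be stripped with `python -O` and an `AssertionError` is not normally mapped to a client error.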

<details>
<summary><b>#8 — response_format validation for chat completions
(vllm-project#35510)</b>: Conflict due to file restructuring</summary>

Same file restructuring issue as #7. Resolved by:
- Removing the non-existent
`vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding `validate_response_format` model_validator to
`ChatCompletionRequest` in our `protocol.py`
- Only accepting the `test_json_schema_response_format_missing_schema`
test from the conflict (discarding ~140 lines of intermediate upstream
tests that reference non-existent paths in our branch)
</details>

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the
backported score-template / bidirectional model support
- [ ] Verify Llama-4 quantized model loads correctly with int8/fp8
attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format
returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32
overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast)
gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without
`json_schema` field returns 400 (not 500) for both `/v1/completions` and
`/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with Triton attention
backend on ROCm


yuezhu1 pushed a commit to yuezhu1/vllm that referenced this pull request Mar 30, 2026
…-project#7)

Adds five new fields to LoRAConfig in vllm/config/lora.py to support
runtime dynamic resizing of GPU LoRA adapter slots:

  - min_loras (int, ge=1): floor for dynamic slot shrinking
  - dynamic_lora_slots (bool): enables automatic watermark-driven scaling
  - lora_mem_high_watermark (float, 0<x<1): scale-down threshold
  - lora_mem_low_watermark (float, 0<x<1): scale-up threshold
  - lora_slot_resize_cooldown_s (float, ge=0): anti-thrash cooldown

Cross-field validation added to _validate_lora_config():
  - min_loras <= max_loras (when dynamic_lora_slots=True)
  - lora_mem_low_watermark < lora_mem_high_watermark (when dynamic=True)

Field-level bounds (ge/gt/lt) enforced by Pydantic at construction time.

dynamic_lora_slots added to compute_hash() as it affects the CudaGraph
specialization path (disables LoRA cudagraph when True, see issue vllm-project#14).

All new fields default to safe values so existing configs are unaffected
when dynamic_lora_slots=False (the default).

Includes 16 unit tests in tests/lora/test_lora_config_dynamic.py
covering defaults, valid configs, all validation error paths, and
compute_hash() behavior.
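
The field bounds and cross-field rules described above can be sketched with a plain dataclass. This is not the `LoRAConfig` from the commit, which relies on Pydantic: `ToyLoRAConfig` is a hypothetical stand-in, and every default value below is a placeholder rather than the real default.

```python
from dataclasses import dataclass


@dataclass
class ToyLoRAConfig:
    # All defaults here are placeholders, not the values used upstream.
    max_loras: int = 4
    min_loras: int = 1
    dynamic_lora_slots: bool = False
    lora_mem_high_watermark: float = 0.9
    lora_mem_low_watermark: float = 0.5
    lora_slot_resize_cooldown_s: float = 0.0

    def __post_init__(self) -> None:
        # Field-level bounds (handled by Pydantic in the real config).
        if self.min_loras < 1:
            raise ValueError("min_loras must be >= 1")
        for name in ("lora_mem_high_watermark", "lora_mem_low_watermark"):
            if not 0.0 < getattr(self, name) < 1.0:
                raise ValueError(f"{name} must be in (0, 1)")
        if self.lora_slot_resize_cooldown_s < 0:
            raise ValueError("lora_slot_resize_cooldown_s must be >= 0")
        # Cross-field checks apply only when dynamic scaling is enabled.
        if self.dynamic_lora_slots:
            if self.min_loras > self.max_loras:
                raise ValueError("min_loras must be <= max_loras")
            if self.lora_mem_low_watermark >= self.lora_mem_high_watermark:
                raise ValueError("low watermark must be < high watermark")
```

Gating the cross-field checks on `dynamic_lora_slots` is what keeps existing configurations valid when the feature is left off.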

Closes vllm-project#7
Closes vllm-project#18

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Chen Wang <Chen.Wang1@ibm.com>
yuezhu1 added a commit to yuezhu1/vllm that referenced this pull request Mar 30, 2026
[Core] Add dynamic LoRA slot scaling fields to LoRAConfig (issue vllm-project#7)
Damon-Salvetore pushed a commit to Damon-Salvetore/vllm that referenced this pull request Mar 31, 2026
…-slidesparse-md

Optimize framework_slidesparse.md: simplify code blocks, add implementation guidance