
Automatically configure KV cache size #6

Merged
WoosukKwon merged 17 commits into main from autoconfig
Mar 12, 2023
Collaborator

WoosukKwon commented Mar 3, 2023

This PR adds an OPT memory analyzer to the system and uses it to automatically determine the KV cache size.

Tested models:

  • OPT-125M
  • OPT-350M
  • OPT-1.3B
  • OPT-2.7B
  • OPT-6.7B
  • OPT-13B

Tested GPUs:

  • A100

WoosukKwon merged commit e9d3f2f into main Mar 12, 2023
WoosukKwon deleted the autoconfig branch March 12, 2023 07:23
xiangyuT added a commit to xiangyuT/vllm that referenced this pull request Oct 24, 2023
* finish changing scheduler

* finish merge

* fix model

* Fix (vllm-project#5)

* fix problems

* fix

* delete unused params

* remove redundant comments

---------

Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
slyalin pushed a commit to slyalin/vllm that referenced this pull request Mar 21, 2024
mzusman added a commit to mzusman/vllm that referenced this pull request Apr 16, 2024
Co-authored-by: Mor Zusman <morz@ai21.com>
dtrifiro referenced this pull request in dtrifiro/vllm Apr 26, 2024
[CI/Build] Dockerfile.ubi : Remove test stage
Starmys pushed a commit to Starmys/vllm that referenced this pull request May 20, 2024
yma11 pushed a commit to yma11/vllm that referenced this pull request Nov 10, 2025
* [kernel][DS-R1][linear] use default Fp8LinearMethod/Fp8MoEMethod

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* [kernel][DS-R1][Attention] enable Triton MLA attention

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* enable MHA for deepseek, need padding head_size to make flash attn kernel happy

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* not break fp8 path

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

---------

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
dik654 pushed a commit to dik654/vllm-for-study that referenced this pull request Nov 18, 2025
- Add complete Accounting ERP MCP server code (140+ lines)
- Add complete File Storage MCP server code (40+ lines)
- Add detailed Agent execution sequence (STEP 1-56)
- Add comprehensive ROI analysis
- 99.1% time savings, 7.26 million KRW/year cost savings
xiaoshudian555 pushed a commit to xiaoshudian555/vllm that referenced this pull request Nov 26, 2025
guyueh1 referenced this pull request in guyueh1/vllm Dec 4, 2025
pjin-nvidia pushed a commit to pjin-nvidia/vllm that referenced this pull request Jan 21, 2026
sriumcp referenced this pull request in inference-sim/vllm Jan 26, 2026
This commit adds the api.FIRST_RESPONSE_FROM_CORE event to complete the
API journey event stream for detailed timing analysis.

Implementation:

1. Emit event when first output received from engine
   - Both streaming (chat_completion_stream_generator)
   - And non-streaming (chat_completion_full_generator) paths

2. Event emitted at same time as first_response_time tracking
   - Captures monotonic timestamp for consistency
   - Uses epoch timestamp for OTEL timeline placement

Event attributes:
- name: "api.FIRST_RESPONSE_FROM_CORE"
- EVENT_TS_MONOTONIC: monotonic timestamp
- timestamp: epoch nanoseconds

Timing analysis now possible:
- Queue + scheduling time: FIRST_RESPONSE - HANDOFF_TO_CORE
- API processing overhead: HANDOFF_TO_CORE - ARRIVED
- Complete API latency: DEPARTED - ARRIVED
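
The three differences listed above can be sketched as a small helper. This is an illustrative sketch only: the function name and the dictionary keys are assumptions chosen to mirror the event names in this message, and the timestamps are assumed to come from the same monotonic clock.

```python
from typing import Dict, Mapping


def api_timing_breakdown(ts: Mapping[str, int]) -> Dict[str, int]:
    """Derive the three durations from monotonic event timestamps.

    `ts` maps a (hypothetical) event key to a monotonic timestamp;
    all values must use the same unit, e.g. nanoseconds.
    """
    return {
        # Time spent queued and scheduled inside the engine core.
        "queue_and_scheduling": ts["FIRST_RESPONSE_FROM_CORE"] - ts["HANDOFF_TO_CORE"],
        # Time the API layer spent before handing the request off.
        "api_processing_overhead": ts["HANDOFF_TO_CORE"] - ts["ARRIVED"],
        # End-to-end latency as seen by the API layer.
        "complete_api_latency": ts["DEPARTED"] - ts["ARRIVED"],
    }
```

Using monotonic timestamps for the subtraction avoids negative durations when the wall clock is adjusted between events.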

Related: Addresses Task #6 (optional but good to have)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
tjtanaa pushed a commit to tjtanaa/vllm that referenced this pull request Jan 29, 2026
init main repo structure and demonstrate the AR + DiT demo for omni models
eble-amd pushed a commit to eble-amd/vllm that referenced this pull request Mar 16, 2026
…e_to_matthias.awq_gemv

Port rogarcia.exllama_moe to matthias.awq gemv
Pwspang pushed a commit to hyscale-lab/vllm-thought-eviction that referenced this pull request Mar 26, 2026
- wrap_stream async generator middleware with finally cleanup
- _accumulate: differential L2 norms, reasoning content extraction, offset computation
- _maybe_schedule_cycle: time-based and token-based triggers via asyncio.create_task
- _run_eviction_cycle: guard conditions (ENG-09, ENG-10, Pitfall #6), strategy dispatch
- Reasoning-relative to absolute offset conversion (D-05)
- merge_overlapping_ranges + apply_retention_window + align_ranges_to_blocks pipeline
- engine_client.update_request_mask call (D-04)
Pwspang pushed a commit to hyscale-lab/vllm-thought-eviction that referenced this pull request Mar 26, 2026
- 21 tests covering accumulation, guard conditions, cycle scheduling, passthrough
- Tests: ENG-09, ENG-10, Pitfall #6, D-05 offset, ENG-06 permanent ranges, ENG-07 isolation
- Fix: apply_retention_window only when floor > 0 to avoid discarding all ranges
- Fix: used asyncio.run() for async tests (no pytest-asyncio installed)
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 26, 2026
## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All
fixes are from upstream vLLM `main` and address critical bugs affecting
RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:**
[INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|------------|------|---------|
| 1 | vllm-project#30550 | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models |
| 2 | vllm-project#31406 | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend |
| 3 | vllm-project#34243 | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) |
| 4 | vllm-project#34454 | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) |
| 5 | vllm-project#34507 | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models |
| 6 | vllm-project#35085 | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support |
| 7 | vllm-project#35456 | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) |
| 8 | vllm-project#35510 | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint |


## Conflict resolutions

<details>
<summary><b>#1 — llama-nemotron-embed / score-template support
(vllm-project#30550)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#2 — Triton Attention (vllm-project#31406)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#3 — Llama-4 attn quant (vllm-project#34243)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but
the fix targets a self-contained block.
</details>

<details>
<summary><b>#4 — GPT-OSS multi-turn (vllm-project#34454)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored
imports in `gptoss_reasoning_parser.py`. The fix logic (adding
`eom_token_id` early-exit check in `is_reasoning_end`) was independent
of the import changes.
</details>

<details>
<summary><b>#5 — Fused MoE int32 overflow (vllm-project#34507)</b>: Conflicts in 2
files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30
intermediate upstream commits refactored `fused_moe_kernel` with
conditional `naive_block_assignment` logic that doesn't exist in
`rhai/0.13.0`. Resolved by keeping our simpler code and applying only
the int64 cast fix:
- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()`
result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)`
before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on
`make_dummy_moe_config()` from intermediate refactors. Resolved by
keeping our existing test code (no test changes).
</details>
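
To make the overflow in the fused MoE fix concrete, here is a minimal sketch in plain Python that emulates the `stride*offset` product at 32-bit and 64-bit width. It does not use Triton; the helper names and the example token index and stride are assumptions picked only to push the product past the int32 limit.

```python
import ctypes

INT32_MAX = 2**31 - 1


def offset_as_int32(token_index: int, stride: int) -> int:
    """Emulate a 32-bit signed multiply, which wraps on overflow."""
    return ctypes.c_int32(token_index * stride).value


def offset_as_int64(token_index: int, stride: int) -> int:
    """Emulate the same multiply after casting the index to 64 bits."""
    return ctypes.c_int64(token_index * stride).value


if __name__ == "__main__":
    token_index, stride = 300_000, 8192  # hypothetical large-model values
    print(token_index * stride > INT32_MAX)       # product exceeds int32 range
    print(offset_as_int32(token_index, stride))   # wrapped, negative offset
    print(offset_as_int64(token_index, stride))   # correct offset
```

A wrapped, negative offset is what turns into an out-of-bounds memory access in a kernel, which is why widening the token offsets before the multiply is sufficient.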

<details>
<summary><b>#6 — AllReduceFusionPass multicast (vllm-project#35085)</b>: Conflict
due to file rename + API change</summary>

Upstream moved `collective_fusion.py` →
`compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API
from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to
`create_allreduce_fusion_workspace()`. Resolved by applying the
try/except wrapper around our existing
`trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in
`collective_fusion.py`. The error handling logic (catching RuntimeError
with "multicast" in message, logging warning, returning early) is
identical to upstream.
</details>
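
The error-handling pattern described for the AllReduceFusionPass fix can be sketched as follows. This is not the code from `collective_fusion.py`: the wrapper name is hypothetical, and the workspace constructor is passed in as a callable so that the sketch does not depend on the real library call or its signature.

```python
import logging
from typing import Callable, Optional, TypeVar

logger = logging.getLogger(__name__)
T = TypeVar("T")


def create_workspace_or_disable(create_workspace: Callable[[], T]) -> Optional[T]:
    """Return the workspace, or None when multicast is unsupported.

    `create_workspace` stands in for the real workspace-creation call.
    """
    try:
        return create_workspace()
    except RuntimeError as exc:
        # Only swallow the specific multicast failure; re-raise anything else.
        if "multicast" not in str(exc):
            raise
        logger.warning("Disabling all-reduce fusion: %s", exc)
        return None
```

A caller would treat a `None` result as "pass disabled" and return early, so unrelated `RuntimeError`s still surface instead of being silently ignored.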

<details>
<summary><b>#7 — response_format validation for completions
(vllm-project#35456)</b>: Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and
`chat_completion/protocol.py`. Our branch still has the monolithic
`protocol.py`. Resolved by:
- Removing the non-existent
`vllm/entrypoints/openai/completion/protocol.py`
- Manually adding `validate_response_format` model_validator to
`CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which
doesn't exist in our branch; `ValueError` is already handled as 400 Bad
Request in `serving_engine.py`)
- Test additions from upstream applied cleanly to
`test_completion_error.py`
</details>
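
The shape of the validation can be sketched with a plain function. This is a simplified stand-in, not the backported code: the real check is a pydantic `model_validator` on the request class, and the exact condition and error message below are assumptions based on the test plan (a `json_schema` response format without a `json_schema` field must be rejected).

```python
from typing import Any, Dict


def validate_response_format(data: Dict[str, Any]) -> Dict[str, Any]:
    """Reject a json_schema response_format that carries no schema."""
    response_format = data.get("response_format")
    if (
        isinstance(response_format, dict)
        and response_format.get("type") == "json_schema"
        and response_format.get("json_schema") is None
    ):
        # ValueError (rather than assert) lets the server map it to 400.
        raise ValueError(
            "response_format of type 'json_schema' requires a 'json_schema' field"
        )
    return data
```

Raising `ValueError` instead of using `assert` matters because an `AssertionError` would surface as a 500, and assertions are stripped entirely when Python runs with `-O`.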

<details>
<summary><b>#8 — response_format validation for chat completions
(vllm-project#35510)</b>: Conflict due to file restructuring</summary>

Same file restructuring issue as #7. Resolved by:
- Removing the non-existent
`vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding `validate_response_format` model_validator to
`ChatCompletionRequest` in our `protocol.py`
- Only accepting the `test_json_schema_response_format_missing_schema`
test from the conflict (discarding ~140 lines of intermediate upstream
tests that reference non-existent paths in our branch)
</details>

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the
backported score-template / bidirectional model support
- [ ] Verify Llama-4 quantized model loads correctly with int8/fp8
attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format
returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32
overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast)
gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without
`json_schema` field returns 400 (not 500) for both `/v1/completions` and
`/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with Triton attention
backend on ROCm


Damon-Salvetore pushed a commit to Damon-Salvetore/vllm that referenced this pull request Mar 31, 2026
…rk-slidesparse

Add comprehensive SlideSparse integration documentation for vLLM