Automatically configure KV cache size by WoosukKwon · Pull Request #6 · vllm-project/vllm

WoosukKwon · 2023-03-03T10:05:40Z

This PR adds an OPT memory analyzer to the system and uses it to automatically determine the KV cache size.

Tested models:

OPT-125M
OPT-350M
OPT-1.3B
OPT-2.7B
OPT-6.7B
OPT-13B

Tested GPUs:

A100
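
As a rough illustration of how a memory analyzer could turn a memory budget into a KV cache size, the sketch below subtracts parameter and peak activation memory from the usable GPU memory and divides what is left by the size of one cache block. The function names, the 0.95 utilization factor, and the example model dimensions are assumptions made for illustration and are not taken from this PR's code.

```python
def kv_block_bytes(block_size: int, num_layers: int, hidden_size: int,
                   dtype_bytes: int) -> int:
    """Bytes needed to cache keys and values for one block of tokens."""
    # Factor 2: every layer stores one key vector and one value vector per token.
    return 2 * block_size * num_layers * hidden_size * dtype_bytes


def num_gpu_blocks(total_gpu_bytes: int, param_bytes: int,
                   peak_activation_bytes: int, block_bytes: int,
                   utilization: float = 0.95) -> int:
    """Number of KV cache blocks that fit after weights and activations."""
    # `utilization` is a hypothetical safety margin, not a value from the PR.
    usable = int(total_gpu_bytes * utilization)
    free = usable - param_bytes - peak_activation_bytes
    # Never report a negative block count when the model does not fit.
    return max(free // block_bytes, 0)


if __name__ == "__main__":
    # Assumed dimensions for a 13B-class decoder: 40 layers, hidden size 5120, fp16.
    block = kv_block_bytes(block_size=16, num_layers=40, hidden_size=5120,
                           dtype_bytes=2)
    blocks = num_gpu_blocks(total_gpu_bytes=40 * 1024**3,
                            param_bytes=26 * 1024**3,
                            peak_activation_bytes=2 * 1024**3,
                            block_bytes=block)
    print(f"bytes per block: {block}, blocks that fit: {blocks}")
```

The peak activation term would in practice depend on the maximum number of batched tokens, which is presumably why the commit list includes a `max_num_batched_tokens` argument.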

* finish changing scheduler
* finish merge
* fix model
* Fix (vllm-project#5)
* fix problems
* fix
* delete unused params
* remove redundant comments

---------

Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>

* [kernel][DS-R1][linear] use default Fp8LinearMethod/Fp8MoEMethod

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* [kernel][DS-R1][Attention] enable Triton MLA attention

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* enable MHA for deepseek, need padding head_size to make flash attn kernel happy

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* not break fp8 path

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

---------

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

- Add complete Accounting ERP MCP server code (140+ lines)
- Add complete File Storage MCP server code (40+ lines)
- Add detailed Agent execution sequence (STEP 1-56)
- Add comprehensive ROI analysis
  - 99.1% time savings, 7.26 million KRW/year cost savings

This commit adds the api.FIRST_RESPONSE_FROM_CORE event to complete the API journey event stream for detailed timing analysis.

Implementation:
1. Emit event when first output received from engine
   - Both streaming (chat_completion_stream_generator)
   - And non-streaming (chat_completion_full_generator) paths
2. Event emitted at same time as first_response_time tracking
   - Captures monotonic timestamp for consistency
   - Uses epoch timestamp for OTEL timeline placement

Event attributes:
- name: "api.FIRST_RESPONSE_FROM_CORE"
- EVENT_TS_MONOTONIC: monotonic timestamp
- timestamp: epoch nanoseconds

Timing analysis now possible:
- Queue + scheduling time: FIRST_RESPONSE - HANDOFF_TO_CORE
- API processing overhead: HANDOFF_TO_CORE - ARRIVED
- Complete API latency: DEPARTED - ARRIVED

Related: Addresses Task #6 (optional but good to have)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
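
The three timing differences listed in that commit message can be written out as a small helper. This is only a sketch: the dictionary keys reuse the short labels from the message (`ARRIVED`, `HANDOFF_TO_CORE`, `FIRST_RESPONSE`, `DEPARTED`), and both the function name and the idea of collecting the monotonic timestamps into a plain mapping are assumptions, not the commit's actual data model.

```python
from typing import Dict, Mapping


def journey_durations(ts: Mapping[str, int]) -> Dict[str, int]:
    """Derive per-stage durations from monotonic event timestamps.

    `ts` maps an event label to its monotonic timestamp; all values must
    share one unit (for example nanoseconds).
    """
    return {
        # Time spent queued and scheduled inside the engine core.
        "queue_and_scheduling": ts["FIRST_RESPONSE"] - ts["HANDOFF_TO_CORE"],
        # Time the API layer spent before handing the request over.
        "api_processing_overhead": ts["HANDOFF_TO_CORE"] - ts["ARRIVED"],
        # End-to-end latency as seen by the API layer.
        "complete_api_latency": ts["DEPARTED"] - ts["ARRIVED"],
    }


if __name__ == "__main__":
    example = {
        "ARRIVED": 100,
        "HANDOFF_TO_CORE": 130,
        "FIRST_RESPONSE": 480,
        "DEPARTED": 900,
    }
    for stage, duration in journey_durations(example).items():
        print(stage, duration)
```

Differences are taken on monotonic timestamps rather than epoch timestamps because a monotonic clock cannot jump backwards between two events.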

- wrap_stream async generator middleware with finally cleanup
- _accumulate: differential L2 norms, reasoning content extraction, offset computation
- _maybe_schedule_cycle: time-based and token-based triggers via asyncio.create_task
- _run_eviction_cycle: guard conditions (ENG-09, ENG-10, Pitfall vllm-project#6), strategy dispatch
- Reasoning-relative to absolute offset conversion (D-05)
- merge_overlapping_ranges + apply_retention_window + align_ranges_to_blocks pipeline
- engine_client.update_request_mask call (D-04)
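
Two stages of the range pipeline named in that commit message can be sketched as follows. Only the function names come from the message; the signatures, the half-open `(start, end)` token ranges, and the choice to shrink ranges inward to whole blocks are assumptions made for illustration and may differ from the actual implementation.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open token range [start, end)


def merge_overlapping_ranges(ranges: List[Range]) -> List[Range]:
    """Merge ranges that overlap or touch into a sorted, disjoint list."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def align_ranges_to_blocks(ranges: List[Range], block_size: int) -> List[Range]:
    """Shrink each range inward so it covers only whole blocks (assumed policy)."""
    aligned: List[Range] = []
    for start, end in ranges:
        block_start = -(-start // block_size) * block_size  # round up
        block_end = (end // block_size) * block_size        # round down
        if block_start < block_end:
            aligned.append((block_start, block_end))
    return aligned


if __name__ == "__main__":
    merged = merge_overlapping_ranges([(10, 20), (0, 5), (18, 40), (5, 7)])
    print(merged)
    print(align_ranges_to_blocks(merged, block_size=16))
```

Rounding inward means a partially covered block is never touched; a pipeline that instead wanted to cover every affected block would round outward.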

- 21 tests covering accumulation, guard conditions, cycle scheduling, passthrough
- Tests: ENG-09, ENG-10, Pitfall vllm-project#6, D-05 offset, ENG-06 permanent ranges, ENG-07 isolation
- Fix: apply_retention_window only when floor > 0 to avoid discarding all ranges
- Fix: used asyncio.run() for async tests (no pytest-asyncio installed)

## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All fixes are from upstream vLLM `main` and address critical bugs affecting RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:** [INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|------------|------|---------|
| 1 | [vllm-project#30550](vllm-project#30550) | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models |
| 2 | [vllm-project#31406](vllm-project#31406) | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend |
| 3 | [vllm-project#34243](vllm-project#34243) | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) |
| 4 | [vllm-project#34454](vllm-project#34454) | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) |
| 5 | [vllm-project#34507](vllm-project#34507) | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models |
| 6 | [vllm-project#35085](vllm-project#35085) | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support |
| 7 | [vllm-project#35456](vllm-project#35456) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) |
| 8 | [vllm-project#35510](vllm-project#35510) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint |

## Conflict resolutions

<details>
<summary>#1 — llama-nemotron-embed / score-template support (vllm-project#30550): Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.

</details>

<details>
<summary>#2 — Triton Attention (vllm-project#31406): Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.

</details>

<details>
<summary>#3 — Llama-4 attn quant (vllm-project#34243): Clean cherry-pick, no conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but the fix targets a self-contained block.

</details>

<details>
<summary>vllm-project#4 — GPT-OSS multi-turn (vllm-project#34454): Clean cherry-pick, no conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored imports in `gptoss_reasoning_parser.py`. The fix logic (adding `eom_token_id` early-exit check in `is_reasoning_end`) was independent of the import changes.

</details>

<details>
<summary>vllm-project#5 — Fused MoE int32 overflow (vllm-project#34507): Conflicts in 2 files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30 intermediate upstream commits refactored `fused_moe_kernel` with conditional `naive_block_assignment` logic that doesn't exist in `rhai/0.13.0`. Resolved by keeping our simpler code and applying only the int64 cast fix:

- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()` result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)` before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on `make_dummy_moe_config()` from intermediate refactors. Resolved by keeping our existing test code (no test changes).

</details>

<details>
<summary>vllm-project#6 — AllReduceFusionPass multicast (vllm-project#35085): Conflict due to file rename + API change</summary>

Upstream moved `collective_fusion.py` → `compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to `create_allreduce_fusion_workspace()`. Resolved by applying the try/except wrapper around our existing `trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in `collective_fusion.py`. The error handling logic (catching RuntimeError with "multicast" in message, logging warning, returning early) is identical to upstream.

</details>

<details>
<summary>vllm-project#7 — response_format validation for completions (vllm-project#35456): Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and `chat_completion/protocol.py`. Our branch still has the monolithic `protocol.py`. Resolved by:

- Removing the non-existent `vllm/entrypoints/openai/completion/protocol.py`
- Manually adding `validate_response_format` model_validator to `CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which doesn't exist in our branch; `ValueError` is already handled as 400 Bad Request in `serving_engine.py`)
- Test additions from upstream applied cleanly to `test_completion_error.py`

</details>

<details>
<summary>vllm-project#8 — response_format validation for chat completions (vllm-project#35510): Conflict due to file restructuring</summary>

Same file restructuring issue as vllm-project#6. Resolved by:

- Removing the non-existent `vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding `validate_response_format` model_validator to `ChatCompletionRequest` in our `protocol.py`
- Only accepting the `test_json_schema_response_format_missing_schema` test from the conflict (discarding ~140 lines of intermediate upstream tests that reference non-existent paths in our branch)

</details>

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the backported score-template / bidirectional model support
- [ ] Verify Llama-4 quantized model loads correctly with int8/fp8 attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32 overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast) gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without `json_schema` field returns 400 (not 500) for both `/v1/completions` and `/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with Triton attention backend on ROCm

[INFERENG-4743]: https://redhat.atlassian.net/browse/INFERENG-4743?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4800]: https://redhat.atlassian.net/browse/INFERENG-4800?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4746]: https://redhat.atlassian.net/browse/INFERENG-4746?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5032]: https://redhat.atlassian.net/browse/INFERENG-5032?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5038]: https://redhat.atlassian.net/browse/INFERENG-5038?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5106]: https://redhat.atlassian.net/browse/INFERENG-5106?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
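
The reason the int64 cast matters for the fused MoE fix can be shown without Triton. The sketch below emulates 32-bit signed wraparound in plain Python and compares a `stride * offset` product computed in int32 against the same product computed with wider integers. The helper names and the example token index and stride are hypothetical, chosen only so that the product exceeds the int32 range.

```python
def wrap_int32(value: int) -> int:
    """Emulate two's-complement wraparound of a 32-bit signed integer."""
    return (value + 2**31) % 2**32 - 2**31


def element_offset_int32(token_index: int, stride: int) -> int:
    """Offset as a kernel would compute it if both operands stay int32."""
    return wrap_int32(wrap_int32(token_index) * wrap_int32(stride))


def element_offset_int64(token_index: int, stride: int) -> int:
    """Offset after widening first; Python ints stand in for int64 here."""
    return token_index * stride


if __name__ == "__main__":
    token_index, stride = 70_000, 32_768  # hypothetical values
    print("int32:", element_offset_int32(token_index, stride))
    print("int64:", element_offset_int64(token_index, stride))
```

With these values the true product is larger than 2**31 - 1, so the int32 version wraps to a negative offset, which in a real kernel would mean reading or writing the wrong memory.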

WoosukKwon added 17 commits March 3, 2023 04:16

Fix a bug in 1D shape

e5a1fa8

Minor

342275f

Minor

b91a2fa

[WIP] Add memory analyzer

d78e2fb

Automatically config GPU/CPU blocks

2649eb5

Remove TODO

1ae7420

Merge branch 'main' into autoconfig

6654b34

Merge branch 'main' into autoconfig

fcbf027

Add max_num_batched_tokens argument

350ed27

Minor

6f5b41b

Minor

2d03918

Refactor model utils

8ec00fe

Re-implement memory analyzer

84203fc

Fix __init__

96b216c

Use memory analyzer in server.py

c89d440

Add psutil to README

f5d1e2c

Fix comment

cc63c24

WoosukKwon merged commit e9d3f2f into main Mar 12, 2023

WoosukKwon deleted the autoconfig branch March 12, 2023 07:23

TheBloke mentioned this pull request Jul 20, 2023

Can't launch OpenAI API server on newly installed vLLM in Docker - fastchat not found #537

Closed

shanshanpt mentioned this pull request Nov 17, 2023

Run long conetxt error : CUDA error: an illegal memory access was encountered #1700

Closed

junior-zsy mentioned this pull request Nov 20, 2023

Error with 32k Long Text in chatglm2-6b-32k Model #1725

Closed

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

Add memory analyzer & utomatically configure KV cache size (vllm-project#6)

de10960

slyalin pushed a commit to slyalin/vllm that referenced this pull request Mar 21, 2024

Merge pull request vllm-project#6 from mzegla/extended_requirements

2922b06

Add missing Python requirements

mzusman added a commit to mzusman/vllm that referenced this pull request Apr 16, 2024

dtype (vllm-project#6)

00bce1f

Co-authored-by: Mor Zusman <morz@ai21.com>

dtrifiro referenced this pull request in dtrifiro/vllm Apr 26, 2024

Merge pull request #6 from z103cb/ibm_main_docker_ubi_updates

91e4a51

[CI/Build] Dockerfile.ubi : Remove test stage

dlopes78 mentioned this pull request May 8, 2024

[Bug]: VLLM + tritonserver #4695

Closed

Starmys pushed a commit to Starmys/vllm that referenced this pull request May 20, 2024

Merge pull request vllm-project#6 from wenxcs/wenxh/fp8-on-a100

4e56e27

FP8 on A100 for PHIMOE

oliver-li mentioned this pull request Jul 5, 2024

[Bug]: NCCL hangs and causes timeout #5484

Closed

acodercat mentioned this pull request Nov 10, 2025

[Bugfix] Add strong reference to CUDA pluggable allocator callbacks #23477

Merged

4 tasks

sravan500 mentioned this pull request Nov 25, 2025

[Bug]: vllm/vllm-openai:v0.11.0 deployment --quantization fp8 throws cuda and tensor errors #29374

Closed

1 task

xiaoshudian555 pushed a commit to xiaoshudian555/vllm that referenced this pull request Nov 26, 2025

Merge pull request vllm-project#6 from GuoRen868/jcz_afd_v0.11.0rc3

1d5de21

cam aclgraph ok.

slwang-ustc mentioned this pull request Nov 27, 2025

[Bug]: RuntimeError "cancelled" when using pipeline parallelism with Qwen3-14B #29085

Closed

1 task

guyueh1 referenced this pull request in guyueh1/vllm Dec 4, 2025

Merge pull request #6 from TomerBN-Nvidia/linear-mxfp8-triton-support

3623cd4

Linear mxfp8 triton support

BJWang-ant mentioned this pull request Dec 17, 2025

[Bug]: Qwen3-32B with MTP, run failed. #30766

Open

1 task

pjin-nvidia pushed a commit to pjin-nvidia/vllm that referenced this pull request Jan 21, 2026

Merge pull request vllm-project#6 from amirkl94/add-replicated-linear

c077bc3

Use replicated linear latent

This was referenced Jan 27, 2026

[Feature] Emit journey events to core spans (PR #4/9) #33136

Closed

[Feature] Add API parent span lifecycle management (PR #6/9) #33182

Closed

[Feature] Add API↔Engine context propagation for journey tracing (PR #7/9) #33190

Closed

tjtanaa pushed a commit to tjtanaa/vllm that referenced this pull request Jan 29, 2026

Merge pull request vllm-project#6 from hsliuustc0106/hsliu-dev-C

5a503f3

init main repo structure and demonstrate the AR + DiT demo for omni models

Lrcx mentioned this pull request Jan 29, 2026

[Bug]: Crash when using presence_penalty with Qwen3-VL in v0.11.0 #33338

Open

1 task

HervorTao mentioned this pull request Feb 3, 2026

[Bug]: [CPU Backend] AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' #33675

Closed

1 task

JGSweets mentioned this pull request Mar 9, 2026

[Bug]: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. #28028

Open

1 task

LironKesem mentioned this pull request Mar 12, 2026

[Bug] DGX Spark (sm_121): CUTLASS can_implement() rejects sm_120f binaries #36835

Closed

1 task

lavanyabollepalli mentioned this pull request Mar 12, 2026

[Bug]: GPU failure during repeated model loading when using --enable-prefix-caching with KV transfer (LMCacheConnectorV1) #36852

Open

1 task

eble-amd pushed a commit to eble-amd/vllm that referenced this pull request Mar 16, 2026

Merge pull request vllm-project#6 from roberteg16/rogarcia.exllama_moe_to_matthias.awq_gemv

8568900

Port rogarcia.exllama_moe to matthias.awq gemv

mahaocong90 mentioned this pull request Mar 17, 2026

[Bug]: QWEN 3.5-397B-A17B report "RPC call to sample_tokens timed out" #37250

Closed

1 task

watch-Ultra mentioned this pull request Mar 18, 2026

[Bug]:推理时报错，模型关闭了。部署的Qwen3.5-122B-A10B-FP8模型 #37392

Open

1 task

Copilot AI mentioned this pull request Mar 20, 2026

Fix XPU segfault when tensor_parallel_size exceeds available devices hongbolv/vllm#5

Closed

RocketRider mentioned this pull request Mar 21, 2026

Mamba-2 Triton kernels crash with illegal instruction on SM121 (DGX Spark) without CUDA_LAUNCH_BLOCKING=1 #37431

Open

Damon-Salvetore pushed a commit to Damon-Salvetore/vllm that referenced this pull request Mar 31, 2026

Merge pull request vllm-project#6 from bcacdwk/copilot/create-framework-slidesparse

64ac2fb

Add comprehensive SlideSparse integration documentation for vLLM

WoosukKwon merged 17 commits into main from autoconfig
