
Add cache watermark to avoid frequent cache eviction#11

Merged
WoosukKwon merged 5 commits into main from watermark
Mar 29, 2023
Conversation

@WoosukKwon
Collaborator

@WoosukKwon WoosukKwon commented Mar 27, 2023

This PR implements the watermark mechanism to prevent frequent preemption.

If we admit new sequences until the GPU KV cache becomes completely full, preemptions are highly likely in the next few steps. Instead, we can reserve a small portion of the cache and refrain from using the entire cache space when admitting new sequences. This helps us avoid these inefficiencies.
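The admission rule can be sketched as follows. This is a minimal illustrative version, not vLLM's actual block-manager API: the function name `can_allocate` and the parameter names are assumptions, and the 1% default watermark is only an example value.

```python
def can_allocate(num_required_blocks: int,
                 num_free_gpu_blocks: int,
                 num_total_gpu_blocks: int,
                 watermark: float = 0.01) -> bool:
    """Admit a new sequence only if, after allocating its KV cache
    blocks, at least a `watermark` fraction of all GPU blocks
    remains free as headroom for running sequences to grow."""
    watermark_blocks = int(watermark * num_total_gpu_blocks)
    return num_free_gpu_blocks - num_required_blocks >= watermark_blocks

# With 1000 total blocks and a 1% watermark, 10 blocks stay reserved:
print(can_allocate(85, 100, 1000))  # True:  100 - 85 = 15 >= 10
print(can_allocate(95, 100, 1000))  # False: 100 - 95 = 5  < 10
```

Without the watermark, the second request would be admitted, the cache would fill, and a running sequence would likely be preempted a few steps later when it needs another block.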

@WoosukKwon WoosukKwon requested a review from zhuohan123 March 28, 2023 08:16
@WoosukKwon WoosukKwon changed the title Add cache watermark to avoid frequent preemptions Add cache watermark to avoid frequent cache eviction Mar 29, 2023
@WoosukKwon
Collaborator Author

@zhuohan123 I'm merging this PR as it does not conflict with any other PR and it (slightly) improves the system performance.

@WoosukKwon WoosukKwon merged commit 64e0e38 into main Mar 29, 2023
@WoosukKwon WoosukKwon deleted the watermark branch March 29, 2023 23:38
bigPYJ1151 pushed a commit to bigPYJ1151/vllm that referenced this pull request Sep 12, 2023
* add pos_encoding impl

* add benchmark and add open mp parallel
xiangyuT pushed a commit to xiangyuT/vllm that referenced this pull request Oct 25, 2023
* Comments done above worker

* format

* fixed missing arguments

* fix

* format
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 25, 2024
ykim362 pushed a commit to ykim362/vllm that referenced this pull request Jun 17, 2024
@alixiaodi alixiaodi mentioned this pull request Aug 2, 2024
zeroorhero pushed a commit to zeroorhero/vllm that referenced this pull request Sep 23, 2024
Xaenalt pushed a commit to Xaenalt/vllm that referenced this pull request Jan 15, 2025
wuhuikx pushed a commit to wuhuikx/vllm that referenced this pull request Mar 27, 2025
### What this PR does / why we need it?
Add feature and model support matrix

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI test is enough

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
njhill pushed a commit to njhill/vllm that referenced this pull request May 10, 2025
dcmaddix pushed a commit to dcmaddix/vllm that referenced this pull request Oct 11, 2025
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Oct 17, 2025
Issue: TPU sampler and Eagle code had two separate but related issues:
1. TPU sampler divides by zero for greedy requests (temperature=0.0)
2. Eagle code triggers mypy type errors due to missing None check

Root Cause:
- TPU sampler's apply_temperature() method lacks epsilon guard to prevent
  division by zero when temperature=0.0 (greedy sampling)
- Eagle's compute_probs_and_sample_next_token() uses temperature without
  asserting it's not None, causing mypy type errors

Impact:
- TPU: Division by zero produces NaN/Inf logits, breaking speculative
  decoding on TPU platforms for all models using Eagle/rejection sampling
- Eagle: mypy type checking failures prevent pre-commit hooks from passing

Fix:
1. TPU Sampler (vllm/v1/sample/tpu/sampler.py):
   - Add all_random parameter to apply_temperature() method
   - Add epsilon guard: if not all_random: temp = torch.where(temp < _SAMPLING_EPS, 1.0, temp)
   - Update call site to pass sampling_metadata.all_random

2. TPU Metadata (vllm/v1/sample/tpu/metadata.py):
   - Add all_random property to TPUSupportedSamplingMetadata
   - Populate all_random from input_batch in from_input_batch()

3. Eagle (vllm/v1/spec_decode/eagle.py):
   - Add assert sampling_metadata.temperature is not None after all_greedy early return
   - Matches sampler.py pattern (line 162) for type safety

Files Modified:
- vllm/v1/sample/tpu/sampler.py: Epsilon guard in apply_temperature()
- vllm/v1/sample/tpu/metadata.py: Added all_random property
- vllm/v1/spec_decode/eagle.py: Added temperature None assertion
- CLAUDE.md: Updated modification vllm-project#11 to document fixes

This addresses PR vllm-project#27077 reviewer feedback and resolves mypy type errors.

Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
yma11 added a commit to yma11/vllm that referenced this pull request Nov 14, 2025
Signed-off-by: Yan Ma <yan.ma@intel.com>
yma11 added a commit to yma11/vllm that referenced this pull request Nov 16, 2025
Signed-off-by: Yan Ma <yan.ma@intel.com>
dik654 pushed a commit to dik654/vllm-for-study that referenced this pull request Nov 18, 2025
…ections

Manufacturing enhancements:
- Add complete Vision Inspection MCP with Vision AI defect detection
- Add Manufacturing MES MCP with PostgreSQL integration
- Include detailed defect classification and statistics
- Add ROI analysis showing 78% cost reduction and 99.6% time savings

Healthcare enhancements:
- Enhance existing Medical OCR, Drug Interaction, and EHR MCPs
- Add ROI analysis showing 97.2% time reduction
- Include medical accident prevention benefits (annual savings of 500 million KRW)
- Demonstrate HIPAA-compliant prescription OCR workflow

Summary:
- Sections vllm-project#5-8: Fully detailed implementations (2,000+ lines each)
- Sections vllm-project#9-10: Enhanced with complete code + ROI
- Sections vllm-project#11-20+: Comprehensive summaries covering all major industries
- Total guide provides 20+ real-world MCP + Agent architecture patterns
chaojun-zhang pushed a commit to chaojun-zhang/vllm that referenced this pull request Nov 20, 2025
Signed-off-by: Yan Ma <yan.ma@intel.com>
GuoRen868 pushed a commit to GuoRen868/vllm that referenced this pull request Dec 26, 2025
[AFD][MTP] Adapt MTP layer for AFD mode, co-locate with attention, and fix several quantization bugs.
Contributor

@hsliuustc0106 hsliuustc0106 left a comment


Good, focused change. The watermark logic is sound and well-scoped. A couple of minor observations below.

yuezhu1 pushed a commit to yuezhu1/vllm that referenced this pull request Mar 30, 2026
- Add _lora_slots field on LoRAModelManager, decoupled from lora_config
  so dynamic scaling does not mutate the original config object
- Add _evict_adapters_to_fit() hook on base class (raises on overflow);
  LRUCacheLoRAModelManager overrides it with LRU eviction and rebuilds
  _active_adapters cache with new capacity (cachetools maxsize read-only)
- Implement resize_lora_slots() on base class: validates, evicts, calls
  reallocate_lora_weights() on all modules, empty_cache() once, resizes
  lora_index_to_id, updates _lora_slots
- Step 7 (re-load surviving adapters) intentionally omitted — weights are
  preserved via GPU-to-GPU copy in reallocate_lora_weights(); comment
  notes what to do if a remote weight store is introduced in future
- Add tests/lora/test_lora_model_manager_resize.py: 6 CPU-only unit tests
  covering validation, no-op, grow, LRU shrink, base-class overflow raise,
  and empty_cache() called exactly once

Closes vllm-project#11
Closes vllm-project#21

AI assistance was used; all changed lines reviewed by the submitter.

Co-authored-by: Claude
Signed-off-by: Yue Zhu <Yue.Zhu@ibm.com>
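The validate-then-evict-then-shrink flow described in the commit message above can be sketched in a few lines. This is a hypothetical simplification: the class and method names (`LoRASlotManager`, `resize_slots`) and the `OrderedDict`-based LRU are illustrative, not vLLM's real `LoRAModelManager` API, and weight reallocation is omitted.

```python
from collections import OrderedDict

class LoRASlotManager:
    def __init__(self, num_slots: int):
        self.num_slots = num_slots
        self._active = OrderedDict()  # adapter_id -> weights, LRU order

    def activate(self, adapter_id, weights):
        """Activate an adapter, evicting the least recently used one
        if all slots are occupied."""
        if adapter_id not in self._active and len(self._active) >= self.num_slots:
            self._active.popitem(last=False)  # evict LRU
        self._active[adapter_id] = weights
        self._active.move_to_end(adapter_id)  # mark as most recently used

    def resize_slots(self, new_num_slots: int):
        """Validate the new capacity, evict LRU adapters until the
        active set fits, then record the new slot count."""
        if new_num_slots < 1:
            raise ValueError("need at least one LoRA slot")
        while len(self._active) > new_num_slots:
            self._active.popitem(last=False)  # LRU eviction to fit
        self.num_slots = new_num_slots

mgr = LoRASlotManager(3)
for i in range(3):
    mgr.activate(i, object())
mgr.resize_slots(2)           # evicts adapter 0 (least recently used)
print(sorted(mgr._active))    # [1, 2]
```

The base-class behavior in the commit (raise on overflow instead of evicting) would replace the `while` loop with a capacity check that raises.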
Damon-Salvetore pushed a commit to Damon-Salvetore/vllm that referenced this pull request Mar 31, 2026
…-for-vllm

Replace vLLM README with comprehensive SlideSparse documentation for ICML submission
