[Misc] Adopt the new token matching solution #2599
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Summary of Changes
Hello @ApostaC, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request refactors the token matching mechanism within LMCache to use a more flexible range-based approach, introducing a new RangePatternMatcher class in place of ParallelPatternMatcher.
Code Review
This pull request introduces a new RangePatternMatcher class to handle range-based token matching, replacing ParallelPatternMatcher in the BlendEngine. While the new matcher correctly identifies start and end patterns and simplifies token separation by removing AutoTokenizer and related environment variables, a high-severity Denial of Service (DoS) vulnerability has been identified in the C++ implementation due to quadratic time complexity in certain scenarios. This critical issue needs to be addressed to prevent potential server hangs.
I was unable to create individual review comments, so my feedback is listed below.
csrc/storage_manager/utils.cpp (85-96)
The RangePatternMatcher::match function, specifically within this section, has a high-severity Denial of Service (DoS) vulnerability due to its O(N^2) time complexity in the worst case. For each start_pattern_, the algorithm performs a linear search for end_pattern_. If start_pattern_ is frequent and end_pattern_ is rare or absent, this leads to repeated full scans of the data. A malicious client can exploit this by sending a large sequence of tokens consisting primarily of the start_pattern_ without any end_pattern_, causing a significant denial of service by hanging the server process. This quadratic complexity needs to be addressed to prevent potential system instability.
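One linear-time shape for this loop, as a sketch rather than the PR's actual code (the member names are taken from this review, and the function is written free-standing for illustration): when the forward scan for `end_pattern_` falls off the end of `data`, no later `start_pattern_` occurrence can complete a range either, so the match can stop instead of rescanning the tail once per start occurrence.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch of a linear-time variant of RangePatternMatcher::match.
// Assumes the semantics described in this review: emit half-open
// (start, end) index pairs and resume scanning after each matched
// end pattern.
std::vector<std::pair<size_t, size_t>> match(
    const std::vector<int>& data,
    const std::vector<int>& start_pattern_,
    const std::vector<int>& end_pattern_) {
  std::vector<std::pair<size_t, size_t>> ranges;
  if (data.size() < start_pattern_.size() ||
      data.size() < end_pattern_.size()) {
    return ranges;
  }
  size_t i = 0;
  while (i + start_pattern_.size() <= data.size()) {
    if (!std::equal(start_pattern_.begin(), start_pattern_.end(),
                    data.begin() + i)) {
      ++i;
      continue;
    }
    // Found a start pattern; scan forward once for the end pattern.
    size_t j = i + start_pattern_.size();
    bool found_end = false;
    while (j + end_pattern_.size() <= data.size()) {
      if (std::equal(end_pattern_.begin(), end_pattern_.end(),
                     data.begin() + j)) {
        found_end = true;
        break;
      }
      ++j;
    }
    if (!found_end) {
      // No end pattern remains after i, so no later start can
      // complete a range either -- stop instead of rescanning.
      break;
    }
    ranges.emplace_back(i, j + end_pattern_.size());
    i = j + end_pattern_.size();
  }
  return ranges;
}
```

Because `i` only moves forward and a failed end-pattern scan terminates the whole match, each index is compared against a pattern a bounded number of times, giving O(N * pattern_length) overall.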
csrc/storage_manager/pybind.cpp (12)
The RangePatternMatcher is imported but not used in this file. It's good practice to remove unused imports to keep the code clean and avoid potential confusion.
csrc/storage_manager/utils.cpp (71-73)
This edge case check is good, but it might be more robust to also check if data.size() is less than start_pattern_.size() alone, as a start pattern cannot be matched if the data is too short for it, regardless of the end pattern. This would prevent unnecessary iterations.
if (data.size() < start_pattern_.size() || data.size() < end_pattern_.size()) {
return ranges;
}
csrc/storage_manager/utils.cpp (76)
The loop condition i <= data.size() - start_pattern_.size() is correct for finding start patterns. However, if start_pattern_.size() is 0 (which is prevented by the constructor, but good to consider defensively), data.size() - start_pattern_.size() could be data.size(), leading to i <= data.size(). If data.size() is SIZE_MAX, this could lead to an infinite loop. Given the constructor check, this is not a critical issue, but it's a good practice to be aware of potential size_t underflow issues.
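Defensively, the subtraction can be avoided altogether by keeping the pattern length on the left-hand side; a small sketch of the idiom:

```cpp
// Overflow-safe bound: addition on the left instead of subtraction
// on the right means size_t never underflows, even if data were
// shorter than the pattern.
for (size_t i = 0; i + start_pattern_.size() <= data.size(); ++i) {
  // ... compare data[i..] against start_pattern_ ...
}
```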
csrc/storage_manager/utils.cpp (92)
The current logic for i = j + end_pattern_.size() correctly advances the search to avoid re-matching the same end pattern. However, if the start pattern can overlap with the end pattern, or if the next start pattern can begin immediately after the current end pattern, this might skip potential valid start patterns. For example, if start_pattern = [1] and end_pattern = [1], and data = [1, 1], it would find (0, 2) and then jump i to 2, missing the [1] at index 1 as a new start. If this behavior is intended, it's fine, but it's worth clarifying.
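A small check of that behavior, reusing the hypothetical `match` sketch above (an illustration of the described semantics, not a test from the PR):

```cpp
#include <cassert>
#include <vector>

int main() {
  // start_pattern = [1], end_pattern = [1], data = [1, 1]:
  // the matcher emits (0, 2) and advances past the end pattern,
  // so the token at index 1 is never considered as a new start.
  std::vector<int> data = {1, 1};
  auto ranges = match(data, /*start_pattern=*/{1}, /*end_pattern=*/{1});
  assert(ranges.size() == 1);
  assert(ranges[0].first == 0 && ranges[0].second == 2);
  return 0;
}
```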
csrc/storage_manager/utils.h (49-50)
The constructor documentation mentions "(1-5 elements)" for start_pattern and end_pattern. However, the C++ implementation of RangePatternMatcher (in utils.cpp) only checks for !empty() and does not enforce a maximum length of 5 elements. This discrepancy could lead to unexpected behavior or confusion if the Python binding or other parts of the system rely on this implicit constraint. It's best to either enforce the constraint in C++ or remove it from the documentation.
* @param start_pattern The pattern marking the start of a range
* @param end_pattern The pattern marking the end of a range
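If the bound is to be enforced rather than dropped from the docs, a minimal constructor sketch (the class layout and member names are assumed from utils.h; pybind11 translates std::invalid_argument into a Python ValueError by default, which would line up with the documented behavior in native_storage_ops.pyi):

```cpp
#include <stdexcept>
#include <utility>
#include <vector>

// Sketch: enforce the documented 1-5 element bound at construction.
RangePatternMatcher::RangePatternMatcher(std::vector<int> start_pattern,
                                         std::vector<int> end_pattern)
    : start_pattern_(std::move(start_pattern)),
      end_pattern_(std::move(end_pattern)) {
  if (start_pattern_.empty() || end_pattern_.empty()) {
    throw std::invalid_argument("patterns must be non-empty");
  }
  if (start_pattern_.size() > 5 || end_pattern_.size() > 5) {
    throw std::invalid_argument("patterns must have at most 5 elements");
  }
}
```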
lmcache/native_storage_ops.pyi (99-100)
The docstring for RangePatternMatcher.__init__ states that it raises ValueError if either pattern has "more than 5 elements". However, the C++ implementation does not enforce this 5-element limit. This creates a mismatch between the Python interface's documented behavior and the actual C++ implementation. It's important to either implement the length check in C++ or remove this part from the Python docstring to avoid confusion and potential bugs.
Raises:
ValueError: If either pattern is empty.
lmcache/v1/multiprocess/blend_server.py (564-567)
The previous logic for handling no matches and constructing ranges from match_start and _sep_token_len has been removed. This is a significant change in how ranges are determined when no patterns are found or when patterns are sparse. The new RangePatternMatcher directly returns the desired ranges, simplifying this part of the code. Ensure that the new behavior correctly handles cases where no ranges are found or where there are leading/trailing tokens outside of any defined range.
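For intuition, a usage sketch against the hypothetical `match` above, using the GPT-OSS separators mentioned later in this review ([200006] / [200007]); the half-open range convention is an assumption of the sketch, not a confirmed detail of the binding:

```cpp
#include <cstddef>
#include <vector>

int main() {
  std::vector<int> tokens = {7, 200006, 3, 4, 200007, 9};
  auto ranges = match(tokens, /*start=*/{200006}, /*end=*/{200007});

  // Tokens outside every returned range (indices 0 and 5 here) are
  // simply not covered; callers that previously derived ranges from
  // match_start and _sep_token_len must also handle ranges.empty().
  size_t covered = 0;
  for (const auto& [start, end] : ranges) {
    covered += end - start;  // half-open [start, end) in this sketch
  }
  // Here ranges holds the single range [1, 5), so covered == 4.
  return 0;
}
```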
lmcache/v1/multiprocess/blend_server.py (589-594)
The get_sep_tokens function now hardcodes the separator tokens [200006], [200007] specifically for GPT-OSS models. This removes the flexibility of defining custom separator strings and offsets via environment variables. While this simplifies the current implementation, it might limit future extensibility if other model types or custom separators are needed. The TODO comment acknowledges this, but it's a design decision to be aware of.
tests/v1/multiprocess/test_blend_server.py (706-707)
The create_cb_cache_key function is replaced with create_cache_key here. While both functions create IPCCacheEngineKey objects, create_cache_key is typically used for normal (non-CB) operations. Given that this is a CB test, it might be more consistent to use create_cb_cache_key or ensure that create_cache_key is appropriate for CB contexts.
key = create_cb_cache_key(token_ids, request_id="store-lookup-single")
tests/v1/native_storage_ops/test_pattern_matcher.py (17)
The RangePatternMatcher is imported but not used in the TestParallelPatternMatcherBasic class. It's good practice to remove unused imports to keep the code clean and avoid potential confusion.
from lmcache.native_storage_ops import ParallelPatternMatcher
* [add] start-end matching for mp blend
Signed-off-by: ApostaC <yihua98@uchicago.edu>
* revert non-full chunk matching and add Nemotron support
Signed-off-by: ApostaC <yihua98@uchicago.edu>
[Hotfix][CI] Fail-fast when vLLM CLI import chain is broken post-install (#3093)

* [Hotfix][CI] Fail-fast when vLLM CLI import chain is broken post-install

The k3 integration tests have been red since 2026-04-21 ~04:00 UTC with:

ImportError: cannot import name 'GenerationConfig' from 'transformers' (/opt/venv/lib/python3.12/site-packages/transformers/__init__.py)

at vllm/transformers_utils/config.py line 18. The failure surfaces 180s after the test starts as a generic "vLLM failed to start on port 8000 within 180s" in wait_for_server, and only then does the harness tail vllm.log to show the real traceback.

Root cause is that setup-env.sh declared the environment "ready" without exercising the CLI import chain that `vllm serve` runs at startup. The existing sequence was:

1. Install vLLM nightly
2. Probe `from vllm.entrypoints.cli.main import main` (auto-heal)
3. `uv pip install -e . --no-build-isolation` (LMCache install)
4. `python -c "import vllm; import lmcache"` (final probe)

Step 3 silently downgrades 9 transitive packages (opentelemetry-* 1.41->1.40, prometheus-client 0.25->0.24.1) to honor the caps in requirements/common.txt. Step 4 is the only post-install check, but plain `import vllm` doesn't pull vllm.entrypoints.cli.main -> vllm.config -> vllm.transformers_utils.config, so any CLI-chain breakage introduced by the downgrades slips through until the first `vllm serve` subprocess fails 180s later.

Fixes:

- Extract the CLI import probe into a `probe_vllm_cli` function so the same check runs both during the auto-heal loop (pre-install) and as a hard probe after the LMCache install.
- Add a post-install CLI probe that fails fast with the actual traceback and a full `uv pip freeze` if the env is broken, instead of letting the 180s test-harness timeout hide the real failure.
- Snapshot `uv pip freeze` before and after `uv pip install -e .` and diff them, so the silent downgrades done by LMCache's pins are visible in the build log instead of having to be reconstructed from package-install stderr.

With this change, the current k3 failure mode surfaces in ~10s at setup time with a clear ImportError traceback and the exact package versions at fault, instead of a 180s port-wait timeout.

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>

* [Hotfix][CI] Evict stale base-image bytecode and actually exercise the CLI

Two bugs in the last fix, both now addressed:

1. The probe did not exercise the failing import chain. `from vllm.entrypoints.cli.main import main` only resolves the `main` symbol; the problematic `import vllm.entrypoints.cli.benchmark.main` lives *inside* main()'s body and is only reached when the CLI is actually invoked. Build #2599 confirmed this: the post-install probe printed "vLLM CLI import chain OK post-install" and then `vllm serve` immediately failed with the same `ImportError: cannot import name 'GenerationConfig' from 'transformers'` that started this whole thread. Switch the probe to `vllm --help`, which runs main() as a subprocess end-to-end and walks the full vllm.entrypoints.cli.main -> vllm.entrypoints.cli.benchmark.main -> vllm.config -> vllm.transformers_utils.config chain.

2. Root cause of the env breakage: stale bytecode from base-image layers. The CI base image pre-installs packages from requirements/*.txt at image build time, which populates /opt/venv/.../<pkg>/__pycache__/*.pyc with mtimes from the image build. When setup-env.sh later runs `uv pip install -U vllm ...`, uv extracts the new wheel using the mtimes recorded in the wheel itself -- often *older* than the pre-existing .pyc. Python's import system compares .py vs .pyc mtimes and keeps using the older .pyc, so Python executes 5.5.0's bytecode for transformers/__init__.py even though the .py on disk is 5.5.4 -- and 5.5.0's _import_structure differs enough from 5.5.4's that GenerationConfig doesn't get exposed at the top level. The result is the ImportError observed only on the CI pods (base image cached), not on any fresh venv. Wipe /opt/venv/**/__pycache__ after all upgrades so Python is forced to re-byte-compile from the current .py sources on first import. This is mechanically idempotent and cheap (a few seconds on first-use recompile, no network).

This combination fixes the observed CI failure and, more importantly, closes the class of failure: any future base-image -> per-job upgrade that would otherwise leave stale bytecode behind now self-heals, and any future import-chain break that wouldn't have tripped the old probe now fails fast with the real traceback.

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>

* [Hotfix][CI] Force-reinstall transformers chain to bypass base-image state

Build #2599 with the `vllm --help` probe in place proved the env is already broken immediately after `uv pip install -U vllm ...`, before LMCache install and before any post-install eviction: the auto-heal loop trips the "non-ModuleNotFoundError" branch with the exact ImportError traceback from vllm/transformers_utils/config.py:18. The same install recipe replayed in a fresh local venv (including a full requirements/cuda.txt-based base-image emulation) always succeeds. The divergence is therefore filesystem state on the K3s pod coming out of the cached base image, not something we can fix by regenerating bytecode after the fact.

Apply the minimum-blast-radius fix: tell uv to uninstall-and-reinstall the full vllm serve import chain (transformers, tokenizers, huggingface-hub, safetensors, vllm) even when it thinks the existing install is already up to date. `--reinstall-package` implies `--refresh-package`, so the wheels come down fresh and are extracted over freshly cleared paths. Combined with a pre-install `uv cache clean` + `__pycache__` wipe and the existing post-install eviction, this puts the import chain on guaranteed-clean ground regardless of what the base image had.

Cost is a few extra seconds of re-download; the base image stays the same. If a future job hits the same failure, the setup still fails fast with the full traceback (via the pre-install auto-heal loop), pointing at whatever upstream break is actually at fault.

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>

* [Hotfix][CI] Dump transformers state on probe failure to unblock debugging

Build #2652 with --reinstall-package on the whole import chain still fails with the same ImportError: freshly extracted transformers 5.5.4 wheel, GenerationConfig still missing from the top-level namespace according to Python, while an identical recipe in any fresh local venv produces a working transformers import. I'm out of remote-debuggable hypotheses for why this is CI-specific.

Add a diagnostic block that the auto-heal loop runs when the probe hits the "non-ModuleNotFoundError" branch. It dumps:

- `uv pip list` for the transformers chain
- ls+stat of transformers/__init__.py and its .pyc
- the dist-info METADATA Version
- the __version__ and _import_structure["generation"] block from the actual __init__.py on disk
- what Python itself sees: sys.executable, sys.path, transformers.__file__, whether GenerationConfig is in dir() and in _class_to_module / _import_structure, and the traceback of an isolated `from transformers import GenerationConfig` attempt

Three outcomes, each unblocks the next step:

1. The file-on-disk _import_structure does *not* contain GenerationConfig -> the wheel or its extraction is corrupt; pin transformers or change the index.
2. Python loads a different transformers.__file__ than we expect, or _import_structure is absent -> shadowing/.pth/PYTHONPATH issue; inspect sys.path.
3. Isolated `from transformers import GenerationConfig` WORKS in the diagnostic block -> the failure depends on vllm's prior imports; we can then bisect the vllm import chain.

This commit just adds the dump. Once a build runs with this script the real fix will be obvious from the diagnostic output.

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>

---------

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>