
[Misc] Adopt the new token matching solution #2599

Merged
ApostaC merged 5 commits into LMCache:dev from ApostaC:local-dev/mp-cb-3
Feb 27, 2026

Conversation

@ApostaC
Contributor

@ApostaC ApostaC commented Feb 13, 2026

What this PR does / why we need it:

Special notes for your reviewers:

If applicable:

  • this PR contains user facing changes - docs added
  • this PR contains unit tests

Signed-off-by: ApostaC <yihua98@uchicago.edu>
@gemini-code-assist
Contributor

Summary of Changes

Hello @ApostaC, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the token matching mechanism within LMCache to use a more flexible range-based approach. By introducing a RangePatternMatcher, the system can now identify and process token sequences that are delimited by distinct start and end patterns, rather than relying on a single separator token. This change enhances the robustness of token separation, particularly for GPT-OSS models, and simplifies the overall tokenization pipeline by removing an external tokenizer dependency for separator identification.
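
As a rough sketch of those semantics (pure Python with illustrative names, assuming the minimal-range, resume-after-the-end behavior described in the review comments below; this is not the actual C++ implementation):

    # Minimal pure-Python sketch of range-based matching. Assumes each
    # start pattern is paired with the nearest following end pattern and
    # the search resumes after the matched end, per the review discussion.
    def match_ranges(data, start_pattern, end_pattern):
        """Return (start, end) pairs, end exclusive, where data[start:end]
        begins with start_pattern and ends with end_pattern."""
        def matches_at(pos, pattern):
            return data[pos:pos + len(pattern)] == pattern

        ranges, i = [], 0
        while i + len(start_pattern) <= len(data):
            if matches_at(i, start_pattern):
                j = i + len(start_pattern)
                while j + len(end_pattern) <= len(data):
                    if matches_at(j, end_pattern):
                        ranges.append((i, j + len(end_pattern)))
                        i = j + len(end_pattern)  # resume after the match
                        break
                    j += 1
                else:
                    return ranges  # a start with no matching end: done
            else:
                i += 1
        return ranges

    # match_ranges([9, 1, 5, 5, 2, 9, 1, 2], [1], [2]) -> [(1, 5), (6, 8)]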

Highlights

  • New Range Pattern Matching: Introduced a new RangePatternMatcher class in C++ to identify token ranges defined by distinct start and end patterns, replacing the previous single-pattern matching approach.
  • Python Bindings and Integration: Exposed the new RangePatternMatcher to Python via pybind11 and integrated it into the BlendEngine for token separation, simplifying the logic and removing the transformers.AutoTokenizer dependency for this specific task.
  • Simplified Separator Token Handling: The get_sep_tokens function now directly returns hardcoded start and end patterns for GPT-OSS models, streamlining the configuration for blend token matching.
  • Comprehensive Testing: Added extensive unit tests for the RangePatternMatcher covering various scenarios, including basic functionality, edge cases, and complex pattern interactions, and updated existing blend server tests to align with the new range-based token separation.


Changelog
  • csrc/storage_manager/pybind.cpp
    • Added RangePatternMatcher to the Python bindings for native_storage_ops.
  • csrc/storage_manager/utils.cpp
    • Implemented the RangePatternMatcher class, including its constructor, matchesAt helper, and the core match method for finding token ranges.
  • csrc/storage_manager/utils.h
    • Declared the RangePatternMatcher class with its public and private members.
  • lmcache/native_storage_ops.pyi
    • Added type hints and docstrings for the new RangePatternMatcher class and its methods.
  • lmcache/v1/multiprocess/blend_server.py
    • Removed transformers.AutoTokenizer import.
    • Updated BlendEngine's __init__ to use RangePatternMatcher with sep_tokens as a tuple of start and end patterns.
    • Modified _separate_tokens_by_pattern to directly return matches from RangePatternMatcher.
    • Refactored get_sep_tokens to return hardcoded start and end token patterns for GPT-OSS models, removing environment variable dependencies for separator string and tokenizer offset.
  • tests/v1/multiprocess/test_blend_server.py
    • Added create_token_ids_with_sep_tokens helper function to encapsulate the new separator token logic (a hypothetical sketch of this helper follows the changelog).
    • Updated various test cases (test_cb_store_pre_computed_basic, test_cb_store_pre_computed_various_offsets, test_cb_store_pre_computed_long_doc, test_cb_lookup_after_store_single_paragraph, test_cb_lookup_after_store_multiple_paragraphs, test_cb_lookup_partial_match, test_cb_retrieve_after_store_and_lookup, test_cb_retrieve_verify_data_correctness, test_cb_retrieve_empty_ranges, test_cb_retrieve_invalid_ranges, test_cb_store_final_basic, test_cb_store_final_then_normal_lookup_retrieve, test_cb_store_final_not_visible_to_cb_lookup) to use the new create_token_ids_with_sep_tokens helper and adapt to range-based matching expectations.
  • tests/v1/native_storage_ops/test_pattern_matcher.py
    • Imported RangePatternMatcher.
    • Added TestRangePatternMatcherBasic class with tests for single, multiple, and no matches, as well as cases with missing start/end patterns and minimal range selection.
    • Added TestRangePatternMatcherEdgeCases class with tests for empty data, data too short, patterns at boundaries, negative numbers, zeros, and large numbers.
    • Added TestRangePatternMatcherConsistency class to verify consistent results across repeated calls and multiple instances.
    • Added TestRangePatternMatcherComplexScenarios class with tests for nested-like patterns, multiple end patterns, consecutive ranges, long data with sparse patterns, and patterns longer than data.
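
For orientation, here is a hypothetical sketch of the test helper mentioned above. The GPT-OSS start/end token IDs ([200006] and [200007]) come from this PR; the signature and body are assumptions:

    # Hypothetical sketch of create_token_ids_with_sep_tokens; the real
    # helper in tests/v1/multiprocess/test_blend_server.py may differ.
    START_PATTERN = [200006]  # GPT-OSS start token (from this PR)
    END_PATTERN = [200007]    # GPT-OSS end token (from this PR)

    def create_token_ids_with_sep_tokens(paragraphs):
        """Wrap each paragraph's token IDs in the start/end patterns so
        the RangePatternMatcher recovers one range per paragraph."""
        token_ids = []
        for paragraph in paragraphs:
            token_ids.extend(START_PATTERN)
            token_ids.extend(paragraph)
            token_ids.extend(END_PATTERN)
        return token_ids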

@ApostaC ApostaC added the mp label (Buildkite trigger for multi-processing mode test) Feb 13, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new RangePatternMatcher class to handle range-based token matching, replacing ParallelPatternMatcher in the BlendEngine. While the new matcher correctly identifies start and end patterns and simplifies token separation by removing AutoTokenizer and related environment variables, a high-severity Denial of Service (DoS) vulnerability has been identified in the C++ implementation due to quadratic time complexity in certain scenarios. This critical issue needs to be addressed to prevent potential server hangs.

I am having trouble creating individual review comments, so my feedback is collected below.

csrc/storage_manager/utils.cpp (85-96)

security-high high

The RangePatternMatcher::match function, specifically within this section, has a high-severity Denial of Service (DoS) vulnerability due to its O(N^2) time complexity in the worst case. For each start_pattern_, the algorithm performs a linear search for end_pattern_. If start_pattern_ is frequent and end_pattern_ is rare or absent, this leads to repeated full scans of the data. A malicious client can exploit this by sending a large sequence of tokens consisting primarily of the start_pattern_ without any end_pattern_, causing a significant denial of service by hanging the server process. This quadratic complexity needs to be addressed to prevent potential system instability.
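
To make the worst case concrete, here is a Python sketch of the pathological input, together with one standard way to keep the scan linear (illustrative only, not the C++ code under review):

    # Worst case: every token matches the start pattern and no end pattern
    # exists, so each of the N starts triggers a scan over the ~N remaining
    # tokens: roughly N^2/2 comparisons in total.
    start, end = [7], [8]
    data = start * 100_000  # 100k starts, zero ends

    # One linear alternative: precompute all end-pattern positions once,
    # then consume them with a single forward-moving cursor.
    def match_ranges_linear(data, start_pattern, end_pattern):
        n, s, e = len(data), len(start_pattern), len(end_pattern)
        ends = [j for j in range(n - e + 1) if data[j:j + e] == end_pattern]
        ranges, cursor, i = [], 0, 0
        while i + s <= n:
            if data[i:i + s] == start_pattern:
                # Skip to the first end pattern at or after this start's body.
                while cursor < len(ends) and ends[cursor] < i + s:
                    cursor += 1
                if cursor == len(ends):
                    break  # no end pattern remains anywhere in the data
                j = ends[cursor]
                ranges.append((i, j + e))
                i = j + e
            else:
                i += 1
        return ranges

    print(match_ranges_linear(data, start, end))  # [] in milliseconds, not minutes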

csrc/storage_manager/pybind.cpp (12)

medium

The RangePatternMatcher is imported but not used in this file. It's good practice to remove unused imports to keep the code clean and avoid potential confusion.

csrc/storage_manager/utils.cpp (71-73)

medium

This edge case check is good, but it might be more robust to also check if data.size() is less than start_pattern_.size() alone, as a start pattern cannot be matched if the data is too short for it, regardless of the end pattern. This would prevent unnecessary iterations.

  if (data.size() < start_pattern_.size() || data.size() < end_pattern_.size()) {
    return ranges;
  }

csrc/storage_manager/utils.cpp (76)

medium

The loop condition i <= data.size() - start_pattern_.size() is correct for finding start patterns. However, if start_pattern_.size() is 0 (which is prevented by the constructor, but good to consider defensively), data.size() - start_pattern_.size() could be data.size(), leading to i <= data.size(). If data.size() is SIZE_MAX, this could lead to an infinite loop. Given the constructor check, this is not a critical issue, but it's a good practice to be aware of potential size_t underflow issues.

csrc/storage_manager/utils.cpp (92)

medium

The current logic for i = j + end_pattern_.size() correctly advances the search to avoid re-matching the same end pattern. However, if the start pattern can overlap with the end pattern, or if the next start pattern can begin immediately after the current end pattern, this might skip potential valid start patterns. For example, if start_pattern = [1] and end_pattern = [1], and data = [1, 1], it would find (0, 2) and then jump i to 2, missing the [1] at index 1 as a new start. If this behavior is intended, it's fine, but it's worth clarifying.

csrc/storage_manager/utils.h (49-50)

medium

The constructor documentation mentions "(1-5 elements)" for start_pattern and end_pattern. However, the C++ implementation of RangePatternMatcher (in utils.cpp) only checks for !empty() and does not enforce a maximum length of 5 elements. This discrepancy could lead to unexpected behavior or confusion if the Python binding or other parts of the system rely on this implicit constraint. It's best to either enforce the constraint in C++ or remove it from the documentation.

   * @param start_pattern The pattern marking the start of a range
   * @param end_pattern The pattern marking the end of a range

lmcache/native_storage_ops.pyi (99-100)

medium

The docstring for RangePatternMatcher.__init__ states that it raises ValueError if either pattern has "more than 5 elements". However, the C++ implementation does not enforce this 5-element limit. This creates a mismatch between the Python interface's documented behavior and the actual C++ implementation. It's important to either implement the length check in C++ or remove this part from the Python docstring to avoid confusion and potential bugs.

        Raises:
            ValueError: If either pattern is empty.

lmcache/v1/multiprocess/blend_server.py (564-567)

medium

The previous logic for handling no matches and constructing ranges from match_start and _sep_token_len has been removed. This is a significant change in how ranges are determined when no patterns are found or when patterns are sparse. The new RangePatternMatcher directly returns the desired ranges, simplifying this part of the code. Ensure that the new behavior correctly handles cases where no ranges are found or where there are leading/trailing tokens outside of any defined range.
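
As a hedged sketch of the simplified wiring this comment describes (names and signatures are assumptions based on the changelog, not the actual code):

    # Hypothetical sketch of the simplified BlendEngine wiring; attribute
    # and parameter names are assumptions.
    from lmcache.native_storage_ops import RangePatternMatcher

    class BlendEngine:
        def __init__(self, sep_tokens):
            # sep_tokens is now a (start_pattern, end_pattern) tuple
            start_pattern, end_pattern = sep_tokens
            self._matcher = RangePatternMatcher(start_pattern, end_pattern)

        def _separate_tokens_by_pattern(self, token_ids):
            # The matcher already returns the final (start, end) ranges, so
            # the old post-processing from match_start and _sep_token_len
            # is no longer needed.
            return self._matcher.match(token_ids)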

lmcache/v1/multiprocess/blend_server.py (589-594)

medium

The get_sep_tokens function now hardcodes the separator tokens [200006], [200007] specifically for GPT-OSS models. This removes the flexibility of defining custom separator strings and offsets via environment variables. While this simplifies the current implementation, it might limit future extensibility if other model types or custom separators are needed. The TODO comment acknowledges this, but it's a design decision to be aware of.
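
For reference, the hardcoded behavior being discussed, as a sketch (the token IDs [200006] and [200007] and the GPT-OSS condition come from this PR; the function body is illustrative):

    # Illustrative sketch only; the real get_sep_tokens may differ.
    def get_sep_tokens(model_name):
        if "gpt-oss" in model_name.lower():
            # Hardcoded start/end separator patterns for GPT-OSS models.
            return [200006], [200007]
        # TODO (per the PR): extend to other model types / custom separators.
        return None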

tests/v1/multiprocess/test_blend_server.py (706-707)

medium

The create_cb_cache_key function is replaced with create_cache_key here. While both functions create IPCCacheEngineKey objects, create_cache_key is typically used for normal (non-CB) operations. Given that this is a CB test, it might be more consistent to use create_cb_cache_key or ensure that create_cache_key is appropriate for CB contexts.

    key = create_cb_cache_key(token_ids, request_id="store-lookup-single")

tests/v1/native_storage_ops/test_pattern_matcher.py (17)

medium

The RangePatternMatcher is imported but not used in the TestParallelPatternMatcherBasic class. It's good practice to remove unused imports to keep the code clean and avoid potential confusion.

from lmcache.native_storage_ops import ParallelPatternMatcher

@ApostaC ApostaC added the full label (Run comprehensive tests on this PR) Feb 13, 2026
Contributor

@KuntaiDu KuntaiDu left a comment


LGTM!


Contributor

@sammshen sammshen left a comment


LGTM!

Signed-off-by: ApostaC <yihua98@uchicago.edu>
@ApostaC ApostaC merged commit 21ad108 into LMCache:dev Feb 27, 2026
24 checks passed
sammshen pushed a commit to sammshen/LMCache that referenced this pull request Mar 1, 2026
* [add] start-end matching for mp blend

Signed-off-by: ApostaC <yihua98@uchicago.edu>

* revert unfull chunk matching and add nemontron support

Signed-off-by: ApostaC <yihua98@uchicago.edu>
hlin99 pushed a commit to hlin99/LMCache that referenced this pull request Mar 2, 2026
* [add] start-end matching for mp blend

Signed-off-by: ApostaC <yihua98@uchicago.edu>

* revert unfull chunk matching and add nemontron support

Signed-off-by: ApostaC <yihua98@uchicago.edu>
oferki pushed a commit to oferki/LMCache that referenced this pull request Mar 3, 2026
* [add] start-end matching for mp blend

Signed-off-by: ApostaC <yihua98@uchicago.edu>

* revert unfull chunk matching and add nemontron support

Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: Ofer Kiselov Nahman <ofer.kiselovnahman@weka.io>
oferki pushed a commit to oferki/LMCache that referenced this pull request Mar 3, 2026
* [add] start-end matching for mp blend

Signed-off-by: ApostaC <yihua98@uchicago.edu>

* revert unfull chunk matching and add nemontron support

Signed-off-by: ApostaC <yihua98@uchicago.edu>
mauryaavinash95 pushed a commit to mauryaavinash95/LMCache that referenced this pull request Mar 7, 2026
* [add] start-end matching for mp blend

Signed-off-by: ApostaC <yihua98@uchicago.edu>

* revert unfull chunk matching and add nemontron support

Signed-off-by: ApostaC <yihua98@uchicago.edu>
shaoxiawjc pushed a commit to shaoxiawjc/LMCache that referenced this pull request Mar 11, 2026
* [add] start-end matching for mp blend

Signed-off-by: ApostaC <yihua98@uchicago.edu>

* revert unfull chunk matching and add nemontron support

Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: shaoxiawjc <wjc2800@163.com>
realAaronWu pushed a commit to realAaronWu/LMCache that referenced this pull request Mar 20, 2026
* [add] start-end matching for mp blend

Signed-off-by: ApostaC <yihua98@uchicago.edu>

* revert unfull chunk matching and add nemontron support

Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: Aaron Wu <aaron.wu@dell.com>
sammshen added a commit to sammshen/LMCache that referenced this pull request Apr 21, 2026
[Hotfix][CI] Evict stale base-image bytecode and actually exercise the CLI

Two bugs in the last fix, both now addressed:

1. The probe did not exercise the failing import chain. `from
   vllm.entrypoints.cli.main import main` only resolves the `main`
   symbol; the problematic `import vllm.entrypoints.cli.benchmark.main`
   lives *inside* main()'s body and is only reached when the CLI is
   actually invoked. Build LMCache#2599 confirmed this: the post-install
   probe printed "vLLM CLI import chain OK post-install" and then
   `vllm serve` immediately failed with the same
   `ImportError: cannot import name 'GenerationConfig' from
   'transformers'` that started this whole thread.

   Switch the probe to `vllm --help`, which runs main() as a
   subprocess end-to-end and walks the full vllm.entrypoints.cli.main
   -> vllm.entrypoints.cli.benchmark.main -> vllm.config ->
   vllm.transformers_utils.config chain.

2. Root cause of the env breakage: stale bytecode from base-image
   layers. The CI base image pre-installs packages from
   requirements/*.txt at image build time, which populates
   /opt/venv/.../<pkg>/__pycache__/*.pyc with mtimes from the image
   build. When setup-env.sh later runs `uv pip install -U vllm ...`,
   uv extracts the new wheel using the mtimes recorded in the wheel
   itself -- often *older* than the pre-existing .pyc. Python's
   import system compares .py vs .pyc mtimes and keeps using the
   older .pyc, so Python executes 5.5.0's bytecode for
   transformers/__init__.py even though the .py on disk is 5.5.4 --
   and 5.5.0's _import_structure differs enough from 5.5.4's that
   GenerationConfig doesn't get exposed at the top level. The result
   is the ImportError observed only on the CI pods (base image
   cached), not on any fresh venv.

   Wipe /opt/venv/**/__pycache__ after all upgrades so Python is
   forced to re-byte-compile from the current .py sources on first
   import. This is mechanically idempotent and cheap (a few seconds
   on first-use recompile, no network).

This combination fixes the observed CI failure and, more
importantly, closes the class of failure: any future base-image ->
per-job upgrade that would otherwise leave stale bytecode behind
now self-heals, and any future import-chain break that wouldn't
have tripped the old probe now fails fast with the real traceback.

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>
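
CPython's timestamp-based bytecode check compares the source's current mtime and size against the values recorded in the .pyc header, so a reinstalled .py whose size and recorded mtime happen to match a stale .pyc is never recompiled. A self-contained Python illustration of that failure mode (not the CI script; the matching size and mtime are the assumption here):

    # Demonstrates timestamp-based .pyc validation: a cached .pyc is reused
    # as long as the source's recorded mtime and size still match, even if
    # the source's *contents* changed.
    import os, py_compile, sys

    os.makedirs("demo_pkg", exist_ok=True)
    src = "demo_pkg/mod.py"

    with open(src, "w") as f:
        f.write("VALUE = 'old'\n")
    mtime = os.stat(src).st_mtime
    py_compile.compile(src)        # writes demo_pkg/__pycache__/mod.*.pyc

    with open(src, "w") as f:
        f.write("VALUE = 'new'\n")  # same byte length as the old source
    os.utime(src, (mtime, mtime))   # restore the recorded mtime

    sys.path.insert(0, "demo_pkg")
    import mod
    print(mod.VALUE)                # 'old': the stale .pyc was reused
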
sammshen added a commit to sammshen/LMCache that referenced this pull request Apr 21, 2026
[Hotfix][CI] Force-reinstall transformers chain to bypass base-image state

Build LMCache#2599 with the `vllm --help` probe in place proved the env is
already broken immediately after `uv pip install -U vllm ...`, before
LMCache install and before any post-install eviction: the auto-heal
loop trips the "non-ModuleNotFoundError" branch with the exact
ImportError traceback from vllm/transformers_utils/config.py:18.

The same install recipe replayed in a fresh local venv (including a
full requirements/cuda.txt-based base-image emulation) always
succeeds. The divergence is therefore filesystem state on the K3s
pod coming out of the cached base image, not something we can fix
by regenerating bytecode after the fact.

Apply the minimum-blast-radius fix: tell uv to uninstall-and-
reinstall the full vllm serve import chain (transformers, tokenizers,
huggingface-hub, safetensors, vllm) even when it thinks the existing
install is already up to date. `--reinstall-package` implies
`--refresh-package`, so the wheels come down fresh and are extracted
over freshly cleared paths. Combined with a pre-install
`uv cache clean` + `__pycache__` wipe and the existing post-install
eviction, this puts the import chain on guaranteed-clean ground
regardless of what the base image had.

Cost is a few extra seconds of re-download; the base image stays
the same. If a future job hits the same failure, the setup still
fails fast with the full traceback (via the pre-install auto-heal
loop), pointing at whatever upstream break is actually at fault.

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>
sammshen added a commit that referenced this pull request Apr 21, 2026
[Hotfix][CI] Fail-fast when vLLM CLI import chain is broken post-install (#3093)

* [Hotfix][CI] Fail-fast when vLLM CLI import chain is broken post-install

The k3 integration tests have been red since 2026-04-21 ~04:00 UTC with:

    ImportError: cannot import name 'GenerationConfig' from 'transformers'
    (/opt/venv/lib/python3.12/site-packages/transformers/__init__.py)

at vllm/transformers_utils/config.py line 18. The failure surfaces
180s after the test starts as a generic "vLLM failed to start on
port 8000 within 180s" in wait_for_server, and only then does the
harness tail vllm.log to show the real traceback.

Root cause is that setup-env.sh declared the environment "ready"
without exercising the CLI import chain that `vllm serve` runs at
startup. The existing sequence was:

  1. Install vLLM nightly
  2. Probe `from vllm.entrypoints.cli.main import main` (auto-heal)
  3. `uv pip install -e . --no-build-isolation` (LMCache install)
  4. `python -c "import vllm; import lmcache"` (final probe)

Step 3 silently downgrades 9 transitive packages (opentelemetry-*
1.41->1.40, prometheus-client 0.25->0.24.1) to honor the caps in
requirements/common.txt. Step 4 is the only post-install check, but
plain `import vllm` doesn't pull vllm.entrypoints.cli.main ->
vllm.config -> vllm.transformers_utils.config, so any CLI-chain
breakage introduced by the downgrades slips through until the first
`vllm serve` subprocess fails 180s later.

Fixes:

- Extract the CLI import probe into a `probe_vllm_cli` function so
  the same check runs both during the auto-heal loop (pre-install)
  and as a hard probe after the LMCache install.
- Add a post-install CLI probe that fails fast with the actual
  traceback and a full `uv pip freeze` if the env is broken, instead
  of letting the 180s test-harness timeout hide the real failure.
- Snapshot `uv pip freeze` before and after `uv pip install -e .`
  and diff them, so the silent downgrades done by LMCache's pins
  are visible in the build log instead of having to be reconstructed
  from package-install stderr.

With this change, the current k3 failure mode surfaces in ~10s at
setup time with a clear ImportError traceback and the exact package
versions at fault, instead of a 180s port-wait timeout.

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>

* [Hotfix][CI] Evict stale base-image bytecode and actually exercise the CLI

Two bugs in the last fix, both now addressed:

1. The probe did not exercise the failing import chain. `from
   vllm.entrypoints.cli.main import main` only resolves the `main`
   symbol; the problematic `import vllm.entrypoints.cli.benchmark.main`
   lives *inside* main()'s body and is only reached when the CLI is
   actually invoked. Build #2599 confirmed this: the post-install
   probe printed "vLLM CLI import chain OK post-install" and then
   `vllm serve` immediately failed with the same
   `ImportError: cannot import name 'GenerationConfig' from
   'transformers'` that started this whole thread.

   Switch the probe to `vllm --help`, which runs main() as a
   subprocess end-to-end and walks the full vllm.entrypoints.cli.main
   -> vllm.entrypoints.cli.benchmark.main -> vllm.config ->
   vllm.transformers_utils.config chain.

2. Root cause of the env breakage: stale bytecode from base-image
   layers. The CI base image pre-installs packages from
   requirements/*.txt at image build time, which populates
   /opt/venv/.../<pkg>/__pycache__/*.pyc with mtimes from the image
   build. When setup-env.sh later runs `uv pip install -U vllm ...`,
   uv extracts the new wheel using the mtimes recorded in the wheel
   itself -- often *older* than the pre-existing .pyc. Python's
   import system compares .py vs .pyc mtimes and keeps using the
   older .pyc, so Python executes 5.5.0's bytecode for
   transformers/__init__.py even though the .py on disk is 5.5.4 --
   and 5.5.0's _import_structure differs enough from 5.5.4's that
   GenerationConfig doesn't get exposed at the top level. The result
   is the ImportError observed only on the CI pods (base image
   cached), not on any fresh venv.

   Wipe /opt/venv/**/__pycache__ after all upgrades so Python is
   forced to re-byte-compile from the current .py sources on first
   import. This is mechanically idempotent and cheap (a few seconds
   on first-use recompile, no network).

This combination fixes the observed CI failure and, more
importantly, closes the class of failure: any future base-image ->
per-job upgrade that would otherwise leave stale bytecode behind
now self-heals, and any future import-chain break that wouldn't
have tripped the old probe now fails fast with the real traceback.

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>

* [Hotfix][CI] Force-reinstall transformers chain to bypass base-image state

Build #2599 with the `vllm --help` probe in place proved the env is
already broken immediately after `uv pip install -U vllm ...`, before
LMCache install and before any post-install eviction: the auto-heal
loop trips the "non-ModuleNotFoundError" branch with the exact
ImportError traceback from vllm/transformers_utils/config.py:18.

The same install recipe replayed in a fresh local venv (including a
full requirements/cuda.txt-based base-image emulation) always
succeeds. The divergence is therefore filesystem state on the K3s
pod coming out of the cached base image, not something we can fix
by regenerating bytecode after the fact.

Apply the minimum-blast-radius fix: tell uv to uninstall-and-
reinstall the full vllm serve import chain (transformers, tokenizers,
huggingface-hub, safetensors, vllm) even when it thinks the existing
install is already up to date. `--reinstall-package` implies
`--refresh-package`, so the wheels come down fresh and are extracted
over freshly cleared paths. Combined with a pre-install
`uv cache clean` + `__pycache__` wipe and the existing post-install
eviction, this puts the import chain on guaranteed-clean ground
regardless of what the base image had.

Cost is a few extra seconds of re-download; the base image stays
the same. If a future job hits the same failure, the setup still
fails fast with the full traceback (via the pre-install auto-heal
loop), pointing at whatever upstream break is actually at fault.

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>

* [Hotfix][CI] Dump transformers state on probe failure to unblock debugging

Build #2652 with --reinstall-package on the whole import chain still
fails with the same ImportError: freshly extracted transformers 5.5.4
wheel, GenerationConfig still missing from the top-level namespace
according to Python, while an identical recipe in any fresh local
venv produces a working transformers import.

I'm out of remote-debuggable hypotheses for why this is CI-specific.
Add a diagnostic block that the auto-heal loop runs when the probe
hits the "non-ModuleNotFoundError" branch. It dumps:

- `uv pip list` for the transformers chain
- ls+stat of transformers/__init__.py and its .pyc
- the dist-info METADATA Version
- the __version__ and _import_structure["generation"] block from the
  actual __init__.py on disk
- what Python itself sees: sys.executable, sys.path,
  transformers.__file__, whether GenerationConfig is in dir() and in
  _class_to_module / _import_structure, and the traceback of an
  isolated `from transformers import GenerationConfig` attempt

Three outcomes, each unblocks the next step:

1. The file-on-disk _import_structure does *not* contain
   GenerationConfig -> the wheel or its extraction is corrupt; pin
   transformers or change the index.
2. Python loads a different transformers.__file__ than we expect, or
   _import_structure is absent -> shadowing/.pth/PYTHONPATH issue;
   inspect sys.path.
3. Isolated `from transformers import GenerationConfig` WORKS in
   the diagnostic block -> the failure depends on vllm's prior
   imports; we can then bisect the vllm import chain.

This commit just adds the dump. Once a build runs with this script
the real fix will be obvious from the diagnostic output.

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>

---------

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>

Labels

full: Run comprehensive tests on this PR
mp: Buildkite trigger for multi-processing mode test


3 participants