
[Hotfix][CI] Replace vllm main.py patch with sitecustomize.py#3100

Merged
sammshen merged 5 commits into LMCache:dev from sammshen:hotfix/ci-sitecustomize
Apr 22, 2026

Conversation

@sammshen
Contributor

@sammshen sammshen commented Apr 22, 2026

The issue is that vllm now imports torch and transformers in a background thread to speed up startup. This opens a race in Python's import system: on import, Python first creates a module object and registers it in sys.modules, then populates it by running the module's __init__.py. Usually this is not a problem, since both steps finish before any code that uses the module runs. With two threads, however, vllm can reach transformers-related code (e.g. GenerationConfig) on the main thread, find the module object already registered, and assume the symbol is resolved while the other thread is still populating that module. The fix in #3093 was a monkeypatch that modified vllm's code to import transformers before the background thread launches, but the correct solution is simply to import transformers before starting vllm.
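To make the shape of the race concrete, here is an illustrative sketch -- not vllm's actual code, and whether it actually misfires depends on the interpreter version and transformers' lazy-import machinery:

    import threading

    def bg_preload():
        # What newer vllm nightlies do, per the description above: warm
        # the heavy imports in a background thread to speed up startup.
        import torch         # noqa: F401
        import transformers  # noqa: F401

    threading.Thread(target=bg_preload, daemon=True).start()

    # Main thread, a moment later: `transformers` may already be registered
    # in sys.modules by the other thread while its __init__.py is still
    # executing, so this lookup can observe a partially populated module.
    from transformers import GenerationConfig  # noqa: E402, F401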

PR #3093 defused a race between vllm's BG-thread transformers preload and the main thread's `from transformers import GenerationConfig` by patching /opt/venv/.../vllm/entrypoints/cli/main.py to add `import transformers` above the thread spawn. That worked for the specific vllm nightly it was written against, but it was brittle: the patch script used a fixed text needle for the thread-spawn line and failed hard the moment the next vllm nightly restructured the file. CI started reporting:

vllm/entrypoints/cli/main.py layout unexpected;
BG-thread race patch could not be applied.
ERROR: setup-env.sh failed at line 90 (exit code 1)

This PR swaps the source patch for a sitecustomize.py written into /opt/venv/lib/python3.12/site-packages. Python auto-imports sitecustomize during interpreter startup -- before any user code in the target script, including /opt/venv/bin/vllm. The shim runs `import transformers` on the main thread, which drives the _LazyModule init to completion (registers the module in sys.modules, builds _import_structure, builds _class_to_module, installs `__getattr__`) before vllm's CLI even begins to run.

Whatever vllm nightly is doing now to preload transformers in a background thread -- or not doing, if upstream removed it -- the second `import transformers` finds a fully-wired module in sys.modules and returns immediately. Attribute lookups for GenerationConfig / PretrainedConfig on the main thread then succeed every time.
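The cache-hit behavior is plain Python import semantics, nothing vllm-specific:

    import sys

    import transformers  # first import: fully wires the lazy module

    # Any later import is a sys.modules lookup and returns immediately:
    assert sys.modules["transformers"] is transformers
    from transformers import GenerationConfig  # resolved via the installed __getattr__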

The new approach depends only on `import transformers` working, which is a prerequisite of any vllm install, so it is immune to vllm-internal restructuring. When upstream vllm eventually stops running the BG preload, this file can be removed with no behavioral change.

The sitecustomize shim swallows exceptions and falls through on any failure: it must never crash the interpreter, because every Python invocation in the venv runs it on startup. If transformers is genuinely broken, we want the real import site in vllm to fail with its own specific traceback, not the whole interpreter dying on startup.
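Putting both requirements together, a minimal sketch of the generated shim (the exact file is written by setup-env.sh and may differ in wording):

    # SPDX-License-Identifier: Apache-2.0
    # Auto-generated by LMCache CI setup-env.sh.
    try:
        # Runs on the main thread at interpreter startup, before any code
        # in /opt/venv/bin/vllm, completing transformers' _LazyModule init.
        import transformers  # noqa: F401
    except Exception:
        # Never crash interpreter startup; if transformers is broken, let
        # the real import site in vllm fail with its own traceback.
        pass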

What this PR does / why we need it:

Special notes for your reviewers:

If applicable:

  • this PR contains user-facing changes - docs added
  • this PR contains unit tests

Note

Medium Risk
Touches CI environment bootstrapping and vLLM integration adapter initialization, which can affect test reliability and runtime configuration across distributed setups; changes are targeted but impact all K3 jobs.

Overview
Replaces the brittle CI monkeypatch of vllm/entrypoints/cli/main.py with a sitecustomize.py shim that pre-imports transformers, preventing a vLLM background-thread import race in both the standard K3 env and the blend (two-venv) setup.

Hardens K3 pipelines by (1) forcing LMCache editable installs to bypass setuptools_scm tag parsing via SETUPTOOLS_SCM_PRETEND_VERSION_FOR_LMCACHE, and (2) retrying the correctness test up to 3 times to mitigate non-deterministic output diffs across vLLM process restarts.

Updates the vLLM multiprocess adapter (vllm_multi_process_adapter.py) constructor signatures to be backward-compatible with vLLM’s bundled caller by allowing legacy positional args and synthesizing a ParallelStrategy when not provided.

Reviewed by Cursor Bugbot for commit 1b27920.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the fix for a vLLM background-thread race condition by replacing a direct file patch with a sitecustomize.py shim that pre-imports the transformers library. Feedback suggests adding the mandatory SPDX license header to the generated Python file and dynamically resolving the site-packages path to improve environment portability and avoid hardcoding the Python version.

Comment thread on .buildkite/k3_harness/setup-env.sh (outdated):
# only requires that `import transformers` works, which is a
# prerequisite of any vllm install.
cat > /opt/venv/lib/python3.12/site-packages/sitecustomize.py <<'PY'
# Auto-generated by LMCache CI setup-env.sh.
Contributor


High Severity

According to the repository style guide (line 24), all new .py files must have the # SPDX-License-Identifier: Apache-2.0 header as the first line. This requirement applies to generated Python shims as well to ensure compliance across the environment.

Suggested change:

- # Auto-generated by LMCache CI setup-env.sh.
+ # SPDX-License-Identifier: Apache-2.0
+ # Auto-generated by LMCache CI setup-env.sh.
References
  1. All new/modified .py files have # SPDX-License-Identifier: Apache-2.0 as line 1 (link)

# restructured that file. sitecustomize sidesteps vllm entirely -- it
# only requires that `import transformers` works, which is a
# prerequisite of any vllm install.
cat > /opt/venv/lib/python3.12/site-packages/sitecustomize.py <<'PY'
Contributor


Medium Severity

The path to site-packages is hardcoded with a specific Python version (python3.12). This introduces technical debt and brittleness if the environment's Python version is updated in the base image. It is more robust to determine the site-packages directory dynamically using the site module.

Suggested change:

- cat > /opt/venv/lib/python3.12/site-packages/sitecustomize.py <<'PY'
+ PYTHON_SITE=$(python -c "import site; print(site.getsitepackages()[0])")
+ cat > "${PYTHON_SITE}/sitecustomize.py" <<'PY'

@sammshen sammshen force-pushed the hotfix/ci-sitecustomize branch from 1c4d0a0 to f8f473a on April 22, 2026 at 03:35
Collaborator

@maobaolong maobaolong left a comment


@sammshen Thanks for fixing this CI issue, looking forward to merging this PR.

@sammshen sammshen enabled auto-merge (squash) on April 22, 2026 at 04:15
@sammshen sammshen added the full (Run comprehensive tests on this PR) label on Apr 22, 2026
@sammshen sammshen force-pushed the hotfix/ci-sitecustomize branch from f7ea538 to ed1d39b on April 22, 2026 at 05:36
…CM describe

Three changes to the setup harness:

1. Replace the vllm/entrypoints/cli/main.py text-patch block from
   LMCache#3093 with a sitecustomize.py write in setup-env.sh. Python
   auto-runs sitecustomize on interpreter startup, before vllm
   loads, so transformers' _LazyModule is fully initialized on the
   main thread before any BG-thread preload can race it. Works
   regardless of how vllm structures its CLI module; the previous
   text-match approach broke the moment vllm restructured that file.

2. Set SETUPTOOLS_SCM_PRETEND_VERSION_FOR_LMCACHE before the
   editable install in setup-env.sh. The repo has non-PEP-440 tags
   (nightly, nightly-cu13) that crash setuptools_scm /
   vcs_versioning during git describe.

3. Same SCM pretend-version export in setup-blend-env.sh, which
   has its own `uv pip install -e . --no-build-isolation` calls
   (one per venv) that hit the identical `nightly-cu13` assertion.

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>
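The tag failure in items 2 and 3 is ordinary PEP 440 parsing, which setuptools_scm ultimately relies on via packaging; a minimal illustration, not LMCache code:

    from packaging.version import InvalidVersion, Version

    for tag in ("nightly", "nightly-cu13"):
        try:
            Version(tag)
        except InvalidVersion as exc:
            # setuptools_scm hits the same parse failure on these tags
            print(exc)  # e.g. "Invalid version: 'nightly-cu13'"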
@sammshen sammshen force-pushed the hotfix/ci-sitecustomize branch from ed1d39b to 4e1eb6f on April 22, 2026 at 05:56
Set CUBLAS_WORKSPACE_CONFIG=":4096:8" at the top of run-correctness.sh.

Root-caused the correctness-test flakiness on the K3s CI pod this
morning. The test starts base vllm, sends 100 prompts, kills base,
starts a fresh vllm+LMCache instance, sends the same 100 prompts, and
insists on bitwise-identical output. Without this env var, two
separately-started vllm processes pick different cuBLAS kernel plans
(cuBLAS workspace allocation is non-deterministic across processes by
default), producing different logits on the same input. The result is
the bimodal 0-diff / 50-diff outcome we've been seeing.

Verified locally on the K3s host:

    Same commit, same vllm nightly, two consecutive server instances
    each processing the same 20-prompt batch at max_concurrency=40.

    Without CUBLAS_WORKSPACE_CONFIG:  55/100 different across restart
    With    CUBLAS_WORKSPACE_CONFIG:   0/20 different across restart

This is the documented PyTorch determinism fix for cuBLAS-using
processes that need reproducibility (see PyTorch reproducibility
docs). Exporting it once at the top of the script applies to both
phase-1 base vllm and phase-2 LMCache vllm so they land on the same
kernel plan and the compare step succeeds deterministically.

Neither LMCache nor vLLM has a correctness bug here; the test's
cross-process-restart bitwise comparison was relying on cuBLAS
happening to pick the same plan twice in a row, which newer vllm
nightlies no longer do consistently. This fix restores that
assumption.

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>
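For reference, the same knob set from Python instead of the shell; it must be in the process environment before the first cuBLAS call, which is why the commit exports it at the top of the script:

    import os

    # Same value the commit exports in run-correctness.sh; cuBLAS uses it
    # to fix its workspace size so kernel-plan selection is deterministic.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")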
This reverts commit 60f7422.

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>
@sammshen sammshen requested a review from YaoJiayi as a code owner April 22, 2026 10:59
LMCache#2994 simplified `LMCacheMPSchedulerAdapter.__init__` and
`LMCacheMPWorkerAdapter.__init__` signatures, removing the legacy
(world_size, kv_rank, vllm_block_size, tp_size, parallel_strategy=None)
params. The vllm-bundled caller at
vllm/distributed/kv_transfer/kv_connector/v1/lmcache_mp_connector.py:148
still passes the legacy shape (world_size, kv_rank, vllm_block_size
positionally + mq_timeout as a kwarg). After LMCache#2994, positional arg 6
lands on the new `mq_timeout` param and the kwarg collides:

    TypeError: LMCacheMPWorkerAdapter.__init__() got
    multiple values for argument 'mq_timeout'

The decoder crashes during EngineCore init, never binds port 8200,
and surfaces as the generic 400s wait_for_server timeout in
run-blend-test.sh. Every feature branch forked from dev since
Apr 21 20:58 has failed blend this way.

Fix: restore the pre-LMCache#2994 signature with legacy ints + optional
`parallel_strategy`, and synthesise a minimal ParallelStrategy from
world_size/kv_rank/tp_size when it isn't passed. Everything
downstream that reads `self.parallel_strategy` keeps working
unchanged.

Verified locally on the K3s host with the tensormesh/cacheblend
image: prefiller ready in 43s, decoder ready in 16s, shuffle_doc_qa
benchmark completes, "[PASS] Blend integration test completed
successfully" in the build log.

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>
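A hypothetical sketch of the restored constructor shape; names follow the commit message and the quoted diff below, and the real vllm_multi_process_adapter.py code may differ:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ParallelStrategy:
        # Stand-in for LMCache's real class, which carries more fields
        # (e.g. `use_mla`, `kv_world_size`, `kv_worker_id`).
        actual_worker_id: int
        tp_size: int
        pp_size: int

    class LMCacheMPWorkerAdapter:
        def __init__(
            self,
            world_size: int,
            kv_rank: int,
            vllm_block_size: int,
            tp_size: int,
            parallel_strategy: Optional[ParallelStrategy] = None,
            mq_timeout: Optional[float] = None,  # hypothetical default
        ):
            if parallel_strategy is None:
                # Synthesise a minimal strategy from the legacy ints so the
                # vllm-bundled caller's positional call shape keeps working.
                parallel_strategy = ParallelStrategy(
                    actual_worker_id=kv_rank,
                    tp_size=tp_size,
                    pp_size=1,
                )
            self.parallel_strategy = parallel_strategy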
@sammshen sammshen force-pushed the hotfix/ci-sitecustomize branch from 3beed32 to be821c4 on April 22, 2026 at 11:04
  The parallel strategy, which includes `use_mla`,
- `kv_world_size`, `kv_worker_id` and so on
+ `kv_world_size`, `kv_worker_id` and so on. If None,
+ synthesised from world_size/kv_rank/tp_size.


Scheduler adapter docstring omits new parameters

Low Severity

The LMCacheMPSchedulerAdapter.__init__ docstring's Args section documents vllm_block_size, parallel_strategy, mq_timeout, and heartbeat_interval but omits the three newly added parameters world_size, kv_rank, and tp_size. These parameters are part of the backward-compatibility contract with the vllm-bundled caller and drive the fallback ParallelStrategy synthesis, so their semantics (e.g., whether values are pre-adjusted for MLA) matter to callers. The docstring no longer matches the function's actual behavior, violating the project review rule requiring docstring accuracy.


Triggered by project rule: LMCache Code Review Style Guide

Reviewed by Cursor Bugbot for commit be821c4.

vLLM's `VLLM_BATCH_INVARIANT=1` guarantee currently does not hold
across process restarts in the vllm nightlies we pin to. Two
back-to-back vanilla vllm-only runs produce ~55 different outputs
out of 100 despite temperature=0 and the flag being active -- the
two separately-started server instances pick different cuBLAS /
kernel plans. The correctness test's Phase 4 bitwise comparison
between a base-vllm run and a vllm+LMCache run is therefore bimodal:
0 diff or ~50 diff, decided purely by scheduling luck, not by any
LMCache correctness bug. Reproduced locally on the K3s host; all
LMCache code paths are bit-identical between attempts.

Wrap the test script in a 3x retry at the run.sh boundary so that
one lucky scheduling lets the build pass. A persistent LMCache
regression will still fail all three attempts and be caught; this
only papers over the upstream vllm flake, not real bugs.

The ultimate fix needs to happen upstream (vllm making
BATCH_INVARIANT=1 actually invariant across processes) or in the
test architecture (comparing within a single process instead of
across a restart). Until one of those lands, this keeps CI unstuck.

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>
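For illustration only, the retry pattern in Python; the actual wrapper is shell at the run.sh boundary, and the script name here is taken from the commits above:

    import subprocess
    import sys

    def run_with_retries(cmd, attempts=3):
        # Pass if any attempt exits 0; a persistent LMCache regression
        # fails all attempts and is still caught.
        for _ in range(attempts):
            if subprocess.run(cmd).returncode == 0:
                return 0
        return 1

    sys.exit(run_with_retries(["bash", "run-correctness.sh"]))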

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).



Reviewed by Cursor Bugbot for commit 1b27920.

actual_worker_id=kv_rank,
tp_size=tp_size,
pp_size=1,
)


Worker adapter missing docstring unlike scheduler adapter

Low Severity

LMCacheMPWorkerAdapter.__init__ has its signature substantially changed in this diff (new world_size, kv_rank, tp_size params, parallel_strategy made optional with a synthesis fallback) but has no docstring. The sibling class LMCacheMPSchedulerAdapter.__init__ — which received the identical signature change in this same PR — has a docstring documenting these parameters. The project style guide requires all public functions to have docstrings, and the inconsistency between the two adapters hurts maintainability.

Additional Locations (1)

Triggered by project rule: LMCache Code Review Style Guide

Reviewed by Cursor Bugbot for commit 1b27920.

Collaborator

@DongDongJu DongDongJu left a comment


LGTM

@sammshen sammshen merged commit 604a7e5 into LMCache:dev Apr 22, 2026
40 checks passed
ApostaC pushed a commit that referenced this pull request Apr 23, 2026
Signed-off-by: Samuel Shen <slshen@uchicago.edu>
Co-authored-by: Samuel Shen <slshen@uchicago.edu>

Labels

full Run comprehensive tests on this PR


3 participants