[Chore][CI] Force vLLM Model Runner V1 in the PD comprehensive test #3502

Merged: deng451e merged 1 commit into LMCache:dev from sammshen:fix/pd-force-mrv1-model-runner on Jun 3, 2026
@sammshen sammshen commented Jun 3, 2026

Contributor

The PD comprehensive test (k3-comprehensive-test) started timing out on 2026-06-02 when the floating vLLM nightly bumped from 0.22.1rc1.dev91 to .dev95. That delta includes vllm-project/vllm#43458, which enables Model Runner V2 (MRV2) by default for Llama/Mistral dense models -- including meta-llama/Llama-3.2-1B-Instruct, the model used by pd.yaml.

MRV2 is incompatible with KV cache connectors, so the prefiller/decoder (both driven through LMCacheConnectorV1) hang on the first request and the job is killed by the 30-minute timeout. The vLLM servers start fine; only the disaggregated request path hangs.

Pin the PD prefiller and decoder to MRV1 via VLLM_USE_V2_MODEL_RUNNER=0 until MRV2 either supports KV connectors or reliably falls back to MRV1.
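
As a rough illustration of the same pin outside Buildkite, the sketch below builds an environment mapping with `VLLM_USE_V2_MODEL_RUNNER=0` that could be handed to a child process. The helper name `mrv1_env` is hypothetical and is not part of vLLM or LMCache; only the variable name and its value come from this PR.

```python
import os


def mrv1_env(base=None):
    """Return a copy of an environment mapping with Model Runner V1 forced.

    `mrv1_env` is a hypothetical helper for illustration only. The variable
    name and value are the ones this PR sets for the prefiller and decoder.
    """
    # Copy so that neither the caller's mapping nor os.environ is mutated.
    env = dict(os.environ if base is None else base)
    env["VLLM_USE_V2_MODEL_RUNNER"] = "0"
    return env


if __name__ == "__main__":
    # The result can be passed as env=... to subprocess.Popen or subprocess.run
    # when launching a server process by hand.
    print(mrv1_env({"PATH": "/usr/bin"}))
```

Returning a copy rather than editing `os.environ` in place keeps the pin scoped to the processes that need it, which mirrors how the PR limits the change to the PD prefiller and decoder.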

What this PR does / why we need it:

Special notes for your reviewers:

If applicable:

  • this PR contains user facing changes - docs added
  • this PR contains unit tests

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the Buildkite configuration in .buildkite/configs/pd.yaml to force the use of vLLM Model Runner V1 (VLLM_USE_V2_MODEL_RUNNER=0) for both the docker-prefiller and docker-decoder environments. This change prevents hangs caused by incompatibilities between Model Runner V2 and KV cache connectors. No review comments were provided, and the changes are clear and well-documented.
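
A guard along the following lines could catch a regression where one of the two environments loses the pin. It is only a sketch: the layout of `.buildkite/configs/pd.yaml` is not shown in this thread, so the input here is assumed to be a plain mapping from environment name to its variables, and `services_missing_pin` is a hypothetical name.

```python
# Variable and value taken from this PR.
PIN_NAME = "VLLM_USE_V2_MODEL_RUNNER"
PIN_VALUE = "0"


def services_missing_pin(service_envs,
                         required=("docker-prefiller", "docker-decoder")):
    """List the required environments that do not force Model Runner V1.

    `service_envs` is an assumed shape (environment name -> variables);
    the real pd.yaml may be structured differently.
    """
    missing = []
    for service in required:
        value = service_envs.get(service, {}).get(PIN_NAME)
        # Compare as strings so a YAML-parsed integer 0 is also accepted.
        if str(value) != PIN_VALUE:
            missing.append(service)
    return missing


if __name__ == "__main__":
    example = {
        "docker-prefiller": {PIN_NAME: "0"},
        "docker-decoder": {},
    }
    print(services_missing_pin(example))
```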

@sammshen sammshen added the "full" label (Run comprehensive tests on this PR) on Jun 3, 2026

@hlin99 hlin99 left a comment

Collaborator

LGTM!

@deng451e deng451e left a comment

Collaborator

6

@deng451e deng451e enabled auto-merge (squash) June 3, 2026 01:45
@deng451e deng451e merged commit 9fbd742 into LMCache:dev Jun 3, 2026
32 checks passed
Lyj1007 pushed a commit to Lyj1007/LMCache that referenced this pull request Jun 3, 2026
…MCache#3502)

[CI/CD] Force vLLM Model Runner V1 in the PD comprehensive test
