Skip to content

fix(vllm): unblock InstrumentedScheduler decode benchmark in disagg + at batch>1#9360

Merged
tedzhouhk merged 6 commits into
mainfrom
hzhou/vllm-bench-kv-connector-fix
May 11, 2026
Merged

fix(vllm): unblock InstrumentedScheduler decode benchmark in disagg + at batch>1#9360
tedzhouhk merged 6 commits into
mainfrom
hzhou/vllm-bench-kv-connector-fix

Conversation

@tedzhouhk

@tedzhouhk tedzhouhk commented May 9, 2026

Copy link
Copy Markdown
Contributor

Summary

Two independent bugs in InstrumentedScheduler (DYN_BENCHMARK_MODE=decode) that combine to fully block decode-mode self-benchmark, especially in disagg deployments. Both diagnosed end-to-end on Qwen3-0.6B + RTX 6000 Ada + vllm 0.20.0; the same code paths exist on the version reported by users (vllm 0.19.0).

Fix 1 — attach kv_connector_metadata to benchmark-built SchedulerOutputs (d2fdd1fa158)

The parent Scheduler.schedule() populates kv_connector_metadata on every SchedulerOutput when a connector is configured (vllm/v1/core/sched/scheduler.py:912-923). The worker-side _get_kv_connector_output then assert scheduler_output.kv_connector_metadata is not None before binding (vllm/v1/worker/kv_connector_model_runner_mixin.py:105).

InstrumentedScheduler builds two SchedulerOutputs from scratch under benchmark mode:

  1. _bench_inject_fake_decode — the synthetic decode batch
  2. schedule() empty drain frame — between decode points

Neither populated kv_connector_metadata, so on any worker with a KV connector (NixlConnector for any disagg worker; FlexKV/PdConnector users) EngineCore dies with AssertionError on the very first synthetic decode batch — before the worker can register and the planner can ever read get_perf_metrics.

This fix mirrors the parent: output.kv_connector_metadata = self.connector.build_connector_meta(output) (and ec_connector symmetrically) on both benchmark-built outputs. Our fake decode requests have KV pre-allocated locally via allocate_slots, so build_connector_meta produces no-op metadata — no transfers planned, just a non-None object the worker contract requires.

Fix 2 — pad bench fake-decode prompt to avoid async sampler -1 placeholder (e8e938808c2)

In vLLM's async-scheduling path, gpu_model_runner._update_states_after_model_execute writes a -1 placeholder into token_ids_cpu[req_idx, num_tokens_no_spec[req_idx]] after every sampling step (vllm/v1/worker/gpu_model_runner.py:3457-3478):

if self.use_async_scheduling:
    sampled_ids = [-1] if req_idx not in invalid_req_indices_set else None
...
self.input_batch.token_ids_cpu[req_idx, start_idx:end_idx] = sampled_ids  # writes -1

num_tokens_no_spec equals the request's prompt length when it joins the InputBatch (gpu_input_batch.py:355). With prompt_token_ids = [0] * ctx_len and num_computed_tokens = ctx_len, the -1 placeholder lands at exactly position ctx_len — the slot the next benchmark batch's request reads as its decode input when the InputBatch slot gets recycled. The embedding lookup OOBs because -1 is out of vocab.

Diagnosed via debug prints in _prepare_input_ids: with batch=6, input_ids.cpu[:6] = [-1, 0, 0, 0, 0, 0] — the leading -1 came from the prior iteration's async-sampler placeholder write at token_ids_cpu[0, 16].

Why batch=1 happens to work: the very first decode point reuses the slot last touched by warmup (which has a longer prompt covering position ctx_len with a valid 0). The OOB only fires from the second decode point onward, once a fake-decode sample has stamped -1.

Fix: pad the synthetic prompt to ctx_len + 1 so position ctx_len is part of the prompt (filled with a valid 0 by InputBatch.add_request). The -1 placeholder write then lands at position ctx_len + 1, out of the read range. Allocate ctx_len + 1 KV slots for the same reason — block-table indexing for position ctx_len requires block ctx_len // block_size, which is a NEW block when ctx_len % block_size == 0. Drop the append_output_token_ids(0) call — the padded prompt now provides the input-token slot the model needs, and the appended token never propagated to the worker's CachedRequestState anyway (output_token_ids is hard-coded to [] at gpu_model_runner.py:1168).

End-to-end verification

Scenario Before After
Disagg decode worker with --kv-transfer-config '{"kv_connector":"NixlConnector",...}' + --benchmark-mode decode AssertionError: kv_connector_metadata is not None on first batch Connector contract OK, decode sweep runs
Decode benchmark, agg, batch ∈ {1,6,11,16} × ctx ∈ {16, 1014} CUDA indexSelectSmallIndex: srcIndex < srcSelectDimSize on the second decode point 8/8 points complete with sensible ITLs (20–30 ms)
Combined: disagg decode + benchmark across batch ∈ {1,6,11,16} × ctx ∈ {16, 1014} Connector assertion 8/8 points complete, FPMs written to JSON

Repro command:

python -m dynamo.vllm \
    --model Qwen/Qwen3-0.6B \
    --disaggregation-mode decode \
    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
    --benchmark-mode decode \
    --benchmark-decode-batch-granularity 4

Test plan

  • Unit tests pass (pytest components/src/dynamo/vllm/tests/test_vllm_instrumented_scheduler.py — 19 passed)
  • New tests added for both fixes:
    • test_decode_sweep_empty_frame_attaches_kv_connector_metadata
    • test_decode_sweep_empty_frame_attaches_ec_connector_metadata_when_set
    • test_decode_sweep_empty_frame_no_connector_leaves_metadata_none
    • test_bench_inject_fake_decode_pads_prompt_for_async_placeholder
  • Each new test verified to fail on the unpatched scheduler (sanity-checked via git stash)
  • End-to-end: aggregated worker, single GPU, decode benchmark completes (8/8 points)
  • End-to-end: disagg decode + NixlConnector, benchmark completes (8/8 points)
  • Pre-commit hooks pass on the changed files
  • CI green on this branch

🤖 Generated with Claude Code


Open in Devin Review

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Fixed a boundary alignment issue in synthetic request generation during benchmark decode operations to prevent memory overflow.
  • Tests

    • Added comprehensive test coverage for benchmark decode sweep operations, including connector metadata handling and synthetic request validation.

tedzhouhk added 2 commits May 8, 2026 19:03
InstrumentedScheduler builds two SchedulerOutputs from scratch during
DYN_BENCHMARK_MODE=decode: the synthetic decode batch from
_bench_inject_fake_decode and the empty drain frame in schedule().
Neither populated kv_connector_metadata.

When a KV connector is configured (e.g. NixlConnector for any disagg
worker), vLLM's worker-side _get_kv_connector_output asserts
"scheduler_output.kv_connector_metadata is not None" before calling
bind_connector_metadata. The synthetic outputs failed this assertion
on the first decode-sweep iteration, killing EngineCore before the
worker could register and the planner could ever read get_perf_metrics.

Mirror the parent scheduler (vllm/v1/core/sched/scheduler.py) and call
connector.build_connector_meta(output) on both benchmark-built outputs.
Our fake decode reqs have KV pre-allocated locally via allocate_slots,
so build_connector_meta produces a no-op metadata: no transfers
planned, just a non-None object the worker contract requires.

End-to-end repro (Qwen3-0.6B, RTX 6000 Ada, vllm 0.20):

  python -m dynamo.vllm \
    --model Qwen/Qwen3-0.6B \
    --disaggregation-mode decode \
    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
    --benchmark-mode decode

Before: EngineCore fatal at gpu_model_runner execute_model on the
first benchmark decode batch. After: sweep advances past the connector
contract; the previous failure mode is gone.

Add three unit tests covering both connector-set and connector-absent
paths via the schedule() empty-frame branch (the _bench_inject_fake_decode
path uses the same pattern and is covered by symmetry). Tests fail on
the unpatched scheduler.

Signed-off-by: Hongkuan Zhou <hongkuanz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
…ceholder

In vLLM's async scheduler path,
gpu_model_runner._update_states_after_model_execute writes a -1
placeholder into ``token_ids_cpu[req_idx, num_tokens_no_spec]`` after
every sampling step (see ``sampled_ids = [-1]`` for async scheduling).
``num_tokens_no_spec`` is set from the request's prompt length when the
request joins the InputBatch.

InstrumentedScheduler._bench_inject_fake_decode previously used
``prompt_token_ids=[0] * ctx_len`` and immediately set
``num_computed_tokens = ctx_len``. The placeholder write therefore
landed at exactly position ``ctx_len`` -- the same slot the next
benchmark batch's request reads as its decode input when the
InputBatch slot gets recycled. The embedding lookup OOBs because -1 is
out of vocab. Reproduces reliably at batch_size > 1; batch=1 happens
to dodge it because the slot was last touched by warmup with a longer
prompt that didn't pollute position ctx_len.

Diagnosis was via debug prints in vLLM's ``_prepare_input_ids``: with
batch=6, ``input_ids.cpu[:6]=[-1, 0, 0, 0, 0, 0]`` -- the leading -1
came from the prior iteration's async-sampler placeholder write at
``token_ids_cpu[0, 16]``.

Fix: pad the synthetic prompt to ``ctx_len + 1`` so position
``ctx_len`` is part of the prompt and is filled with a valid token id
(0) by ``InputBatch.add_request``. The placeholder write then lands
at ``ctx_len + 1``, out of the read range. Allocate ``ctx_len + 1``
KV slots for the same reason (block-table indexing for position
``ctx_len`` requires block ``ctx_len // block_size``, which is a NEW
block when ``ctx_len % block_size == 0``). Drop the
``append_output_token_ids(0)`` call -- the padded prompt now provides
the input-token slot the model needs, and the appended token never
propagated to the worker's CachedRequestState anyway (output_token_ids
is hard-coded to [] at gpu_model_runner.py:1168).

End-to-end repro (Qwen3-0.6B, RTX 6000 Ada, vllm 0.20):

  python -m dynamo.vllm \
    --model Qwen/Qwen3-0.6B \
    --benchmark-mode decode \
    --benchmark-decode-batch-granularity 4

Before: CUDA ``indexSelectSmallIndex: srcIndex < srcSelectDimSize``
on the second decode point (first batch_size > 1). After: 8/8 points
complete and FPMs are written to the output JSON. Same in disagg mode
combined with the connector-metadata fix.

Add a unit test that asserts ``allocate_slots`` is called with
``ctx_len + 1`` (not ``ctx_len``); the test fails on the unpatched
scheduler.

Signed-off-by: Hongkuan Zhou <hongkuanz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
@tedzhouhk tedzhouhk requested review from a team as code owners May 9, 2026 02:54
@github-actions github-actions Bot added fix backend::vllm Relates to the vllm backend labels May 9, 2026
@coderabbitai

coderabbitai Bot commented May 9, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Walkthrough

This PR enhances InstrumentedScheduler's benchmark decode path: it fixes a prompt padding issue in synthetic requests (ctx_len + 1 instead of ctx_len for KV allocation) to prevent out-of-bounds writes, and ensures connector metadata is consistently populated in both the decode-sweep empty frame and synthetic SchedulerOutput objects returned by _bench_inject_fake_decode.

Changes

Benchmark Decode Path Enhancement

Layer / File(s) Summary
Benchmark Fake Decode: Prompt Padding & Metadata
components/src/dynamo/vllm/instrumented_scheduler.py
_bench_inject_fake_decode pads synthetic request prompts to ctx_len + 1 for KV allocation while keeping num_computed_tokens at ctx_len, and populates kv_connector_metadata and ec_connector_metadata in returned SchedulerOutput.
Decode Sweep Empty Frame: Connector Metadata
components/src/dynamo/vllm/instrumented_scheduler.py
During benchmark decode sweep empty frame, SchedulerOutput conditionally populates kv_connector_metadata and ec_connector_metadata from configured connectors.
Test Coverage & Regression Assertions
components/src/dynamo/vllm/tests/test_vllm_instrumented_scheduler.py
Tests validate connector metadata attachment in decode sweep empty frame and verify prompt padding regression fix to prevent out-of-bounds async scheduler writes.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 70.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title directly addresses the main change: fixing bugs in InstrumentedScheduler's decode benchmark that blocked disaggregated deployments and batch>1 scenarios.
Description check ✅ Passed The description comprehensively covers all required sections: overview (two independent bugs explained), details (both fixes described with code references and root cause analysis), reviewer guidance (specific file calls to start review), and related issues.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/src/dynamo/vllm/instrumented_scheduler.py`:
- Around line 362-369: Replace uses of getattr(self, "connector", None) and
getattr(self, "ec_connector", None) with direct attribute access checks (e.g.,
if self.connector is not None) so missing attributes raise AttributeError
instead of being silently masked; update both the shown block where
empty.kv_connector_metadata and empty.ec_connector_metadata are set (use
self.connector.build_connector_meta and self.ec_connector.build_connector_meta
guarded by "if self.connector is not None" / "if self.ec_connector is not None")
and the similar pattern in _bench_inject_fake_decode to follow the same
direct-access guard.

In `@components/src/dynamo/vllm/tests/test_vllm_instrumented_scheduler.py`:
- Around line 350-352: The tests currently import MagicMock and _BenchPhase
inside _make_decode_sweep_stub and inside three test functions; hoist "from
unittest.mock import MagicMock" and "from dynamo.vllm.instrumented_scheduler
import _BenchPhase" to the module-level imports alongside the existing
InstrumentedScheduler import, then remove the in-function/in-test duplicate
import lines in _make_decode_sweep_stub and the three new tests so all imports
live at the top of the file.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1d22e18f-6833-44d2-ad3e-77a973fb0319

📥 Commits

Reviewing files that changed from the base of the PR and between 6cd4238 and e8e9388.

📒 Files selected for processing (2)
  • components/src/dynamo/vllm/instrumented_scheduler.py
  • components/src/dynamo/vllm/tests/test_vllm_instrumented_scheduler.py

Comment thread components/src/dynamo/vllm/instrumented_scheduler.py Outdated
Comment thread components/src/dynamo/vllm/tests/test_vllm_instrumented_scheduler.py Outdated
Comment thread components/src/dynamo/vllm/instrumented_scheduler.py
…nchmark

Pure unit tests on _bench_inject_fake_decode and the empty-frame
schedule branch can't catch the two recently-fixed regressions in
isolation -- both only manifest when the model actually executes on a
GPU:

* kv_connector_metadata must be attached to benchmark-built
  SchedulerOutputs, otherwise the worker-side
  _get_kv_connector_output assertion fires (any disagg config / any
  worker with a KV connector).
* The synthetic decode prompt must be padded to ctx_len + 1 so the
  async-scheduler's -1 placeholder write doesn't pollute the read
  slot the next batch reuses (CUDA OOB on the second decode point,
  first batch>1 sweep).

Add a single-GPU end-to-end test that spawns python -m dynamo.vllm
with --benchmark-mode {agg,decode}, waits for the benchmark JSON
to be written, validates structure (right mode, every point produces
at least one FPM with positive wall_time), and tears down. Two
scenarios:

  * test_self_benchmark_agg_mode -- aggregated worker, no
    kv-transfer-config; covers the prompt-padding regression on the
    common-case path.
  * test_self_benchmark_disagg_decode_with_nixl_connector -- disagg
    decode + NixlConnector kv_both; this is the original
    user-reported configuration and exercises both fixes together.

Both runs are tight (granularity=2, max-model-len=1024,
gpu-memory-utilization=0.4 on Qwen3-0.6B) so each completes in ~30s
of GPU time inside the 45-minute pre-merge slot.

Markers (vllm, gpu_1, e2e, pre_merge, model(QWEN)) match the
selector pr.yaml uses for the vllm-test GPU stage:

    gpu_test_markers: pre_merge and vllm and gpu_1

so the tests gate every PR. Verified the selector picks them up via
``pytest --collect-only -m "pre_merge and vllm and gpu_1"``.

Signed-off-by: Hongkuan Zhou <hongkuanz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Extend the real-GPU pre-merge gate so it also verifies the worker can
serve normal traffic after self-benchmarking. Replaces the earlier
"start worker just for the benchmark" approach with the production
flow: worker startup runs the benchmark, registers with the runtime,
then serves a real chat completion.

Setup mirrors tests/fault_tolerance/cancellation/test_vllm.py for CI
parity:

* Same model: FAULT_TOLERANCE_MODEL_NAME (Qwen/Qwen3-0.6B).
* Same DynamoFrontendProcess + ManagedProcess worker layout, same
  health-check URLs, same env (DYN_HEALTH_CHECK_ENABLED=false,
  DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS, DYN_SYSTEM_PORT,
  DYN_HTTP_PORT).
* Same NixlConnector kv_both wiring and same prefill-side
  --kv-events-config + VLLM_NIXL_SIDE_CHANNEL_PORT for the disagg
  case.
* Same fixtures: runtime_services_dynamic_ports, predownload_models,
  set_ucx_tls_no_mm (for the disagg pair).
* Same per-worker allocate_port(9100) / deallocate_port pattern.

Two scenarios:

* test_self_benchmark_agg_serves_after_bench (gpu_1):
    aggregated worker, --benchmark-mode agg. Worker startup runs the
    full agg sweep (prefill + decode points, decode includes
    batch>1), then serves a chat completion.

* test_self_benchmark_disagg_serves_after_bench (gpu_2):
    prefill worker --benchmark-mode prefill plus decode worker
    --benchmark-mode decode kv_both. Both workers run their sweeps,
    register, and together serve a chat completion that traverses
    both engines.

Each test, after both workers' health checks pass:

1. Validates the per-worker benchmark JSON: right mode, every
   benchmark point produced at least one FPM with positive
   wall_time -- catches a worker that registered but ran a degenerate
   sweep.
2. Asserts the decode sweep covers batch>1 (the prompt-padding
   regression case) so a future regression that only affects batch>1
   doesn't slip through.
3. Sends a real chat completion via tests/utils/client.send_request
   and asserts a non-empty assistant content.

Marker selectors verified locally:

* pre_merge and vllm and gpu_1 -> agg test
  (matches pr.yaml's vllm-test gpu_test_markers)
* pre_merge and vllm and (gpu_2 or gpu_4) -> disagg test
  (matches pr.yaml's vllm-multi-gpu-test gpu_test_markers)

so both gate every PR.

Signed-off-by: Hongkuan Zhou <hongkuanz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Three changes addressing PR review:

1. (dynamo-ops) ``_bench_generate_decode_grid`` previously sized
   ``max_batch`` from raw ``ctx_len`` token count
   (``total_kv_tokens // ctx_len``). After the prompt-padding fix
   ``_bench_inject_fake_decode`` actually requests ``ctx_len + 1``
   tokens, which the KV cache manager rounds UP to the next block
   boundary. So at boundary ctx_lens (notably ctx_len that is a
   multiple of block_size) the advertised ``max_batch`` was
   2x what the allocator could fit, and the allocator silently
   truncated the batch (logged as "KV exhausted at ctx_len=...,
   truncating batch") -- the benchmark recorded the partial result
   under the wrong batch size.

   Fix: size ``max_batch`` from blocks-per-request derived from the
   padded length:

       blocks_per_req = ceil((ctx_len + 1) / block_size)
       max_batch = num_gpu_blocks // blocks_per_req

   Add two unit tests that fail on the unpatched grid sizer (e.g. at
   block_size=16 / num_gpu_blocks=100 / ctx_len=16 the old code
   advertised batch_size=100, which needs 200 blocks).

2. (CodeRabbit) Replace ``getattr(self, "connector", None)`` and
   ``getattr(self, "ec_connector", None)`` guards with direct
   attribute access (``self.connector is not None``,
   ``self.ec_connector is not None``). Both attributes are part of
   the parent vLLM ``Scheduler``'s public contract; using
   ``getattr`` with a default would silently mask a future vLLM bump
   that drops them. Direct access makes such a contract violation
   fail loudly. Same fix at both call sites (the empty-frame
   schedule branch and ``_bench_inject_fake_decode``).

3. (CodeRabbit) Hoist ``from unittest.mock import MagicMock`` and
   ``from dynamo.vllm.instrumented_scheduler import _BenchPhase`` to
   the module-level imports in
   ``test_vllm_instrumented_scheduler.py``, matching the project's
   python-guidelines (all imports at module top).

Local pytest: 21/21 unit tests pass; the two new grid-sizing tests
verified to fail on the unpatched scheduler.

Signed-off-by: Hongkuan Zhou <hongkuanz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
…rop shebang

Three CI failures on PR #9360 addressed:

1. ``check-shebang-scripts-are-executable`` (pre-commit):
   ``tests/vllm/test_self_benchmark_gpu.py`` carried a ``#!/usr/bin/env
   python3`` shebang but isn't intended to be directly executed (pytest
   collects it). Drop the shebang to satisfy the hook without granting
   exec bits.

2. ``pytest-marker-report`` (pre-commit) -- 9 collection errors:
   ``tests/vllm/__init__.py`` makes ``tests/vllm`` a Python package
   that shadows the real ``vllm`` distribution when ``tests`` is on
   sys.path during marker discovery. Every other vLLM-suite test that
   does ``from vllm import LLM`` (or similar) broke with
   ``ImportError: cannot import name 'LLM' from 'vllm'`` pointing at
   our ``tests/vllm/__init__.py``. Renamed to
   ``tests/vllm_self_benchmark/`` to remove the name collision; all 9
   shadowed imports resolve to the real ``vllm`` again.

3. ``vllm-runtime / Multi-GPU Test`` -- disagg test failed with:
       (EngineCore pid=...) zmq.error.ZMQError: Address already in use
                                  (addr='tcp://*:20380')
   ``InstrumentedScheduler`` (auto-injected by ``--benchmark-mode``)
   binds the FPM publisher to ``tcp://*:DYN_FORWARDPASS_METRIC_PORT +
   dp_rank``, defaulting to 20380. Two workers on the same host (the
   disagg prefill + decode pair) both try to bind 20380 and the second
   loses. The operator works around this by injecting
   ``DYN_FORWARDPASS_METRIC_PORT`` per engine
   (deploy/operator/internal/dynamo/component_worker.go:111); the
   single-GPU agg test happens to avoid the collision because only
   one worker runs. Apply the same fix in the test: allocate a
   per-worker port via ``allocate_port(20380)`` and inject it as the
   env var, with matching ``deallocate_port`` on teardown.

   (The cancellation/migration tests don't hit this because they
   don't pass ``--benchmark-mode`` and so don't get
   InstrumentedScheduler.)

Verified locally:
* ``import vllm`` resolves to the real site-packages vllm again.
* pytest collection of the renamed test passes; marker selectors
  ``pre_merge and vllm and gpu_1`` (agg test) and
  ``pre_merge and vllm and (gpu_2 or gpu_4)`` (disagg test) still
  match -- so both stay wired to ``pr.yaml``'s vllm-test and
  vllm-multi-gpu-test GPU stages and continue to gate the PR.

The single-GPU agg test passed in the prior CI run (only the disagg
test failed) so issue #3 is specifically what was needed for disagg
to pass.

Signed-off-by: Hongkuan Zhou <hongkuanz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
@tedzhouhk tedzhouhk merged commit 9dd5d56 into main May 11, 2026
87 of 88 checks passed
@tedzhouhk tedzhouhk deleted the hzhou/vllm-bench-kv-connector-fix branch May 11, 2026 16:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend::vllm Relates to the vllm backend fix size/XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants