fix(vllm): unblock InstrumentedScheduler decode benchmark in disagg + at batch>1#9360
Conversation
InstrumentedScheduler builds two SchedulerOutputs from scratch during
DYN_BENCHMARK_MODE=decode: the synthetic decode batch from
_bench_inject_fake_decode and the empty drain frame in schedule().
Neither populated kv_connector_metadata.
When a KV connector is configured (e.g. NixlConnector for any disagg
worker), vLLM's worker-side _get_kv_connector_output asserts
"scheduler_output.kv_connector_metadata is not None" before calling
bind_connector_metadata. The synthetic outputs failed this assertion
on the first decode-sweep iteration, killing EngineCore before the
worker could register and the planner could ever read get_perf_metrics.
Mirror the parent scheduler (vllm/v1/core/sched/scheduler.py) and call
connector.build_connector_meta(output) on both benchmark-built outputs.
Our fake decode reqs have KV pre-allocated locally via allocate_slots,
so build_connector_meta produces a no-op metadata: no transfers
planned, just a non-None object the worker contract requires.
End-to-end repro (Qwen3-0.6B, RTX 6000 Ada, vllm 0.20):
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--disaggregation-mode decode \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
--benchmark-mode decode
Before: EngineCore fatal at gpu_model_runner execute_model on the
first benchmark decode batch. After: sweep advances past the connector
contract; the previous failure mode is gone.
Add three unit tests covering both connector-set and connector-absent
paths via the schedule() empty-frame branch (the _bench_inject_fake_decode
path uses the same pattern and is covered by symmetry). Tests fail on
the unpatched scheduler.
Signed-off-by: Hongkuan Zhou <hongkuanz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
…ceholder
In vLLM's async scheduler path,
gpu_model_runner._update_states_after_model_execute writes a -1
placeholder into ``token_ids_cpu[req_idx, num_tokens_no_spec]`` after
every sampling step (see ``sampled_ids = [-1]`` for async scheduling).
``num_tokens_no_spec`` is set from the request's prompt length when the
request joins the InputBatch.
InstrumentedScheduler._bench_inject_fake_decode previously used
``prompt_token_ids=[0] * ctx_len`` and immediately set
``num_computed_tokens = ctx_len``. The placeholder write therefore
landed at exactly position ``ctx_len`` -- the same slot the next
benchmark batch's request reads as its decode input when the
InputBatch slot gets recycled. The embedding lookup OOBs because -1 is
out of vocab. Reproduces reliably at batch_size > 1; batch=1 happens
to dodge it because the slot was last touched by warmup with a longer
prompt that didn't pollute position ctx_len.
Diagnosis was via debug prints in vLLM's ``_prepare_input_ids``: with
batch=6, ``input_ids.cpu[:6]=[-1, 0, 0, 0, 0, 0]`` -- the leading -1
came from the prior iteration's async-sampler placeholder write at
``token_ids_cpu[0, 16]``.
Fix: pad the synthetic prompt to ``ctx_len + 1`` so position
``ctx_len`` is part of the prompt and is filled with a valid token id
(0) by ``InputBatch.add_request``. The placeholder write then lands
at ``ctx_len + 1``, out of the read range. Allocate ``ctx_len + 1``
KV slots for the same reason (block-table indexing for position
``ctx_len`` requires block ``ctx_len // block_size``, which is a NEW
block when ``ctx_len % block_size == 0``). Drop the
``append_output_token_ids(0)`` call -- the padded prompt now provides
the input-token slot the model needs, and the appended token never
propagated to the worker's CachedRequestState anyway (output_token_ids
is hard-coded to [] at gpu_model_runner.py:1168).
End-to-end repro (Qwen3-0.6B, RTX 6000 Ada, vllm 0.20):
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--benchmark-mode decode \
--benchmark-decode-batch-granularity 4
Before: CUDA ``indexSelectSmallIndex: srcIndex < srcSelectDimSize``
on the second decode point (first batch_size > 1). After: 8/8 points
complete and FPMs are written to the output JSON. Same in disagg mode
combined with the connector-metadata fix.
Add a unit test that asserts ``allocate_slots`` is called with
``ctx_len + 1`` (not ``ctx_len``); the test fails on the unpatched
scheduler.
Signed-off-by: Hongkuan Zhou <hongkuanz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
WalkthroughThis PR enhances ChangesBenchmark Decode Path Enhancement
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/src/dynamo/vllm/instrumented_scheduler.py`:
- Around line 362-369: Replace uses of getattr(self, "connector", None) and
getattr(self, "ec_connector", None) with direct attribute access checks (e.g.,
if self.connector is not None) so missing attributes raise AttributeError
instead of being silently masked; update both the shown block where
empty.kv_connector_metadata and empty.ec_connector_metadata are set (use
self.connector.build_connector_meta and self.ec_connector.build_connector_meta
guarded by "if self.connector is not None" / "if self.ec_connector is not None")
and the similar pattern in _bench_inject_fake_decode to follow the same
direct-access guard.
In `@components/src/dynamo/vllm/tests/test_vllm_instrumented_scheduler.py`:
- Around line 350-352: The tests currently import MagicMock and _BenchPhase
inside _make_decode_sweep_stub and inside three test functions; hoist "from
unittest.mock import MagicMock" and "from dynamo.vllm.instrumented_scheduler
import _BenchPhase" to the module-level imports alongside the existing
InstrumentedScheduler import, then remove the in-function/in-test duplicate
import lines in _make_decode_sweep_stub and the three new tests so all imports
live at the top of the file.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1d22e18f-6833-44d2-ad3e-77a973fb0319
📒 Files selected for processing (2)
components/src/dynamo/vllm/instrumented_scheduler.pycomponents/src/dynamo/vllm/tests/test_vllm_instrumented_scheduler.py
…nchmark
Pure unit tests on _bench_inject_fake_decode and the empty-frame
schedule branch can't catch the two recently-fixed regressions in
isolation -- both only manifest when the model actually executes on a
GPU:
* kv_connector_metadata must be attached to benchmark-built
SchedulerOutputs, otherwise the worker-side
_get_kv_connector_output assertion fires (any disagg config / any
worker with a KV connector).
* The synthetic decode prompt must be padded to ctx_len + 1 so the
async-scheduler's -1 placeholder write doesn't pollute the read
slot the next batch reuses (CUDA OOB on the second decode point,
first batch>1 sweep).
Add a single-GPU end-to-end test that spawns python -m dynamo.vllm
with --benchmark-mode {agg,decode}, waits for the benchmark JSON
to be written, validates structure (right mode, every point produces
at least one FPM with positive wall_time), and tears down. Two
scenarios:
* test_self_benchmark_agg_mode -- aggregated worker, no
kv-transfer-config; covers the prompt-padding regression on the
common-case path.
* test_self_benchmark_disagg_decode_with_nixl_connector -- disagg
decode + NixlConnector kv_both; this is the original
user-reported configuration and exercises both fixes together.
Both runs are tight (granularity=2, max-model-len=1024,
gpu-memory-utilization=0.4 on Qwen3-0.6B) so each completes in ~30s
of GPU time inside the 45-minute pre-merge slot.
Markers (vllm, gpu_1, e2e, pre_merge, model(QWEN)) match the
selector pr.yaml uses for the vllm-test GPU stage:
gpu_test_markers: pre_merge and vllm and gpu_1
so the tests gate every PR. Verified the selector picks them up via
``pytest --collect-only -m "pre_merge and vllm and gpu_1"``.
Signed-off-by: Hongkuan Zhou <hongkuanz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Extend the real-GPU pre-merge gate so it also verifies the worker can
serve normal traffic after self-benchmarking. Replaces the earlier
"start worker just for the benchmark" approach with the production
flow: worker startup runs the benchmark, registers with the runtime,
then serves a real chat completion.
Setup mirrors tests/fault_tolerance/cancellation/test_vllm.py for CI
parity:
* Same model: FAULT_TOLERANCE_MODEL_NAME (Qwen/Qwen3-0.6B).
* Same DynamoFrontendProcess + ManagedProcess worker layout, same
health-check URLs, same env (DYN_HEALTH_CHECK_ENABLED=false,
DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS, DYN_SYSTEM_PORT,
DYN_HTTP_PORT).
* Same NixlConnector kv_both wiring and same prefill-side
--kv-events-config + VLLM_NIXL_SIDE_CHANNEL_PORT for the disagg
case.
* Same fixtures: runtime_services_dynamic_ports, predownload_models,
set_ucx_tls_no_mm (for the disagg pair).
* Same per-worker allocate_port(9100) / deallocate_port pattern.
Two scenarios:
* test_self_benchmark_agg_serves_after_bench (gpu_1):
aggregated worker, --benchmark-mode agg. Worker startup runs the
full agg sweep (prefill + decode points, decode includes
batch>1), then serves a chat completion.
* test_self_benchmark_disagg_serves_after_bench (gpu_2):
prefill worker --benchmark-mode prefill plus decode worker
--benchmark-mode decode kv_both. Both workers run their sweeps,
register, and together serve a chat completion that traverses
both engines.
Each test, after both workers' health checks pass:
1. Validates the per-worker benchmark JSON: right mode, every
benchmark point produced at least one FPM with positive
wall_time -- catches a worker that registered but ran a degenerate
sweep.
2. Asserts the decode sweep covers batch>1 (the prompt-padding
regression case) so a future regression that only affects batch>1
doesn't slip through.
3. Sends a real chat completion via tests/utils/client.send_request
and asserts a non-empty assistant content.
Marker selectors verified locally:
* pre_merge and vllm and gpu_1 -> agg test
(matches pr.yaml's vllm-test gpu_test_markers)
* pre_merge and vllm and (gpu_2 or gpu_4) -> disagg test
(matches pr.yaml's vllm-multi-gpu-test gpu_test_markers)
so both gate every PR.
Signed-off-by: Hongkuan Zhou <hongkuanz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Three changes addressing PR review:
1. (dynamo-ops) ``_bench_generate_decode_grid`` previously sized
``max_batch`` from raw ``ctx_len`` token count
(``total_kv_tokens // ctx_len``). After the prompt-padding fix
``_bench_inject_fake_decode`` actually requests ``ctx_len + 1``
tokens, which the KV cache manager rounds UP to the next block
boundary. So at boundary ctx_lens (notably ctx_len that is a
multiple of block_size) the advertised ``max_batch`` was
2x what the allocator could fit, and the allocator silently
truncated the batch (logged as "KV exhausted at ctx_len=...,
truncating batch") -- the benchmark recorded the partial result
under the wrong batch size.
Fix: size ``max_batch`` from blocks-per-request derived from the
padded length:
blocks_per_req = ceil((ctx_len + 1) / block_size)
max_batch = num_gpu_blocks // blocks_per_req
Add two unit tests that fail on the unpatched grid sizer (e.g. at
block_size=16 / num_gpu_blocks=100 / ctx_len=16 the old code
advertised batch_size=100, which needs 200 blocks).
2. (CodeRabbit) Replace ``getattr(self, "connector", None)`` and
``getattr(self, "ec_connector", None)`` guards with direct
attribute access (``self.connector is not None``,
``self.ec_connector is not None``). Both attributes are part of
the parent vLLM ``Scheduler``'s public contract; using
``getattr`` with a default would silently mask a future vLLM bump
that drops them. Direct access makes such a contract violation
fail loudly. Same fix at both call sites (the empty-frame
schedule branch and ``_bench_inject_fake_decode``).
3. (CodeRabbit) Hoist ``from unittest.mock import MagicMock`` and
``from dynamo.vllm.instrumented_scheduler import _BenchPhase`` to
the module-level imports in
``test_vllm_instrumented_scheduler.py``, matching the project's
python-guidelines (all imports at module top).
Local pytest: 21/21 unit tests pass; the two new grid-sizing tests
verified to fail on the unpatched scheduler.
Signed-off-by: Hongkuan Zhou <hongkuanz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
…rop shebang Three CI failures on PR #9360 addressed: 1. ``check-shebang-scripts-are-executable`` (pre-commit): ``tests/vllm/test_self_benchmark_gpu.py`` carried a ``#!/usr/bin/env python3`` shebang but isn't intended to be directly executed (pytest collects it). Drop the shebang to satisfy the hook without granting exec bits. 2. ``pytest-marker-report`` (pre-commit) -- 9 collection errors: ``tests/vllm/__init__.py`` makes ``tests/vllm`` a Python package that shadows the real ``vllm`` distribution when ``tests`` is on sys.path during marker discovery. Every other vLLM-suite test that does ``from vllm import LLM`` (or similar) broke with ``ImportError: cannot import name 'LLM' from 'vllm'`` pointing at our ``tests/vllm/__init__.py``. Renamed to ``tests/vllm_self_benchmark/`` to remove the name collision; all 9 shadowed imports resolve to the real ``vllm`` again. 3. ``vllm-runtime / Multi-GPU Test`` -- disagg test failed with: (EngineCore pid=...) zmq.error.ZMQError: Address already in use (addr='tcp://*:20380') ``InstrumentedScheduler`` (auto-injected by ``--benchmark-mode``) binds the FPM publisher to ``tcp://*:DYN_FORWARDPASS_METRIC_PORT + dp_rank``, defaulting to 20380. Two workers on the same host (the disagg prefill + decode pair) both try to bind 20380 and the second loses. The operator works around this by injecting ``DYN_FORWARDPASS_METRIC_PORT`` per engine (deploy/operator/internal/dynamo/component_worker.go:111); the single-GPU agg test happens to avoid the collision because only one worker runs. Apply the same fix in the test: allocate a per-worker port via ``allocate_port(20380)`` and inject it as the env var, with matching ``deallocate_port`` on teardown. (The cancellation/migration tests don't hit this because they don't pass ``--benchmark-mode`` and so don't get InstrumentedScheduler.) Verified locally: * ``import vllm`` resolves to the real site-packages vllm again. * pytest collection of the renamed test passes; marker selectors ``pre_merge and vllm and gpu_1`` (agg test) and ``pre_merge and vllm and (gpu_2 or gpu_4)`` (disagg test) still match -- so both stay wired to ``pr.yaml``'s vllm-test and vllm-multi-gpu-test GPU stages and continue to gate the PR. The single-GPU agg test passed in the prior CI run (only the disagg test failed) so issue #3 is specifically what was needed for disagg to pass. Signed-off-by: Hongkuan Zhou <hongkuanz@nvidia.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Summary
Two independent bugs in
InstrumentedScheduler(DYN_BENCHMARK_MODE=decode) that combine to fully block decode-mode self-benchmark, especially in disagg deployments. Both diagnosed end-to-end on Qwen3-0.6B + RTX 6000 Ada + vllm 0.20.0; the same code paths exist on the version reported by users (vllm 0.19.0).Fix 1 — attach
kv_connector_metadatato benchmark-builtSchedulerOutputs (d2fdd1fa158)The parent
Scheduler.schedule()populateskv_connector_metadataon everySchedulerOutputwhen a connector is configured (vllm/v1/core/sched/scheduler.py:912-923). The worker-side_get_kv_connector_outputthenassert scheduler_output.kv_connector_metadata is not Nonebefore binding (vllm/v1/worker/kv_connector_model_runner_mixin.py:105).InstrumentedSchedulerbuilds twoSchedulerOutputs from scratch under benchmark mode:_bench_inject_fake_decode— the synthetic decode batchschedule()empty drain frame — between decode pointsNeither populated
kv_connector_metadata, so on any worker with a KV connector (NixlConnector for any disagg worker; FlexKV/PdConnector users) EngineCore dies withAssertionErroron the very first synthetic decode batch — before the worker can register and the planner can ever readget_perf_metrics.This fix mirrors the parent:
output.kv_connector_metadata = self.connector.build_connector_meta(output)(andec_connectorsymmetrically) on both benchmark-built outputs. Our fake decode requests have KV pre-allocated locally viaallocate_slots, sobuild_connector_metaproduces no-op metadata — no transfers planned, just a non-None object the worker contract requires.Fix 2 — pad bench fake-decode prompt to avoid async sampler
-1placeholder (e8e938808c2)In vLLM's async-scheduling path,
gpu_model_runner._update_states_after_model_executewrites a-1placeholder intotoken_ids_cpu[req_idx, num_tokens_no_spec[req_idx]]after every sampling step (vllm/v1/worker/gpu_model_runner.py:3457-3478):num_tokens_no_specequals the request's prompt length when it joins theInputBatch(gpu_input_batch.py:355). Withprompt_token_ids = [0] * ctx_lenandnum_computed_tokens = ctx_len, the-1placeholder lands at exactly positionctx_len— the slot the next benchmark batch's request reads as its decode input when the InputBatch slot gets recycled. The embedding lookup OOBs because-1is out of vocab.Diagnosed via debug prints in
_prepare_input_ids: with batch=6,input_ids.cpu[:6] = [-1, 0, 0, 0, 0, 0]— the leading-1came from the prior iteration's async-sampler placeholder write attoken_ids_cpu[0, 16].Why batch=1 happens to work: the very first decode point reuses the slot last touched by warmup (which has a longer prompt covering position
ctx_lenwith a valid 0). The OOB only fires from the second decode point onward, once a fake-decode sample has stamped-1.Fix: pad the synthetic prompt to
ctx_len + 1so positionctx_lenis part of the prompt (filled with a valid 0 byInputBatch.add_request). The-1placeholder write then lands at positionctx_len + 1, out of the read range. Allocatectx_len + 1KV slots for the same reason — block-table indexing for positionctx_lenrequires blockctx_len // block_size, which is a NEW block whenctx_len % block_size == 0. Drop theappend_output_token_ids(0)call — the padded prompt now provides the input-token slot the model needs, and the appended token never propagated to the worker'sCachedRequestStateanyway (output_token_idsis hard-coded to[]atgpu_model_runner.py:1168).End-to-end verification
--kv-transfer-config '{"kv_connector":"NixlConnector",...}'+--benchmark-mode decodeAssertionError: kv_connector_metadata is not Noneon first batchindexSelectSmallIndex: srcIndex < srcSelectDimSizeon the second decode pointRepro command:
Test plan
pytest components/src/dynamo/vllm/tests/test_vllm_instrumented_scheduler.py— 19 passed)test_decode_sweep_empty_frame_attaches_kv_connector_metadatatest_decode_sweep_empty_frame_attaches_ec_connector_metadata_when_settest_decode_sweep_empty_frame_no_connector_leaves_metadata_nonetest_bench_inject_fake_decode_pads_prompt_for_async_placeholdergit stash)🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
Bug Fixes
Tests