fix(vllm): unblock InstrumentedScheduler decode benchmark in disagg + at batch>1 by tedzhouhk · Pull Request #9360 · ai-dynamo/dynamo

tedzhouhk · 2026-05-09T02:54:28Z

Summary

Two independent bugs in InstrumentedScheduler (DYN_BENCHMARK_MODE=decode) that combine to fully block decode-mode self-benchmark, especially in disagg deployments. Both diagnosed end-to-end on Qwen3-0.6B + RTX 6000 Ada + vllm 0.20.0; the same code paths exist on the version reported by users (vllm 0.19.0).

Fix 1 — attach `kv_connector_metadata` to benchmark-built `SchedulerOutput`s (`d2fdd1fa158`)

The parent Scheduler.schedule() populates kv_connector_metadata on every SchedulerOutput when a connector is configured (vllm/v1/core/sched/scheduler.py:912-923). The worker-side _get_kv_connector_output then assert scheduler_output.kv_connector_metadata is not None before binding (vllm/v1/worker/kv_connector_model_runner_mixin.py:105).

InstrumentedScheduler builds two SchedulerOutputs from scratch under benchmark mode:

_bench_inject_fake_decode — the synthetic decode batch
schedule() empty drain frame — between decode points

Neither populated kv_connector_metadata, so on any worker with a KV connector (NixlConnector for any disagg worker; FlexKV/PdConnector users) EngineCore dies with AssertionError on the very first synthetic decode batch — before the worker can register and the planner can ever read get_perf_metrics.

This fix mirrors the parent: output.kv_connector_metadata = self.connector.build_connector_meta(output) (and ec_connector symmetrically) on both benchmark-built outputs. Our fake decode requests have KV pre-allocated locally via allocate_slots, so build_connector_meta produces no-op metadata — no transfers planned, just a non-None object the worker contract requires.

Fix 2 — pad bench fake-decode prompt to avoid async sampler `-1` placeholder (`e8e938808c2`)

In vLLM's async-scheduling path, gpu_model_runner._update_states_after_model_execute writes a -1 placeholder into token_ids_cpu[req_idx, num_tokens_no_spec[req_idx]] after every sampling step (vllm/v1/worker/gpu_model_runner.py:3457-3478):

if self.use_async_scheduling:
    sampled_ids = [-1] if req_idx not in invalid_req_indices_set else None
...
self.input_batch.token_ids_cpu[req_idx, start_idx:end_idx] = sampled_ids  # writes -1

num_tokens_no_spec equals the request's prompt length when it joins the InputBatch (gpu_input_batch.py:355). With prompt_token_ids = [0] * ctx_len and num_computed_tokens = ctx_len, the -1 placeholder lands at exactly position ctx_len — the slot the next benchmark batch's request reads as its decode input when the InputBatch slot gets recycled. The embedding lookup OOBs because -1 is out of vocab.

Diagnosed via debug prints in _prepare_input_ids: with batch=6, input_ids.cpu[:6] = [-1, 0, 0, 0, 0, 0] — the leading -1 came from the prior iteration's async-sampler placeholder write at token_ids_cpu[0, 16].

Why batch=1 happens to work: the very first decode point reuses the slot last touched by warmup (which has a longer prompt covering position ctx_len with a valid 0). The OOB only fires from the second decode point onward, once a fake-decode sample has stamped -1.

Fix: pad the synthetic prompt to ctx_len + 1 so position ctx_len is part of the prompt (filled with a valid 0 by InputBatch.add_request). The -1 placeholder write then lands at position ctx_len + 1, out of the read range. Allocate ctx_len + 1 KV slots for the same reason — block-table indexing for position ctx_len requires block ctx_len // block_size, which is a NEW block when ctx_len % block_size == 0. Drop the append_output_token_ids(0) call — the padded prompt now provides the input-token slot the model needs, and the appended token never propagated to the worker's CachedRequestState anyway (output_token_ids is hard-coded to [] at gpu_model_runner.py:1168).

End-to-end verification

Scenario	Before	After
Disagg decode worker with `--kv-transfer-config '{"kv_connector":"NixlConnector",...}'` + `--benchmark-mode decode`	`AssertionError: kv_connector_metadata is not None` on first batch	Connector contract OK, decode sweep runs
Decode benchmark, agg, batch ∈ {1,6,11,16} × ctx ∈ {16, 1014}	CUDA `indexSelectSmallIndex: srcIndex < srcSelectDimSize` on the second decode point	8/8 points complete with sensible ITLs (20–30 ms)
Combined: disagg decode + benchmark across batch ∈ {1,6,11,16} × ctx ∈ {16, 1014}	Connector assertion	8/8 points complete, FPMs written to JSON

Repro command:

python -m dynamo.vllm \
    --model Qwen/Qwen3-0.6B \
    --disaggregation-mode decode \
    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
    --benchmark-mode decode \
    --benchmark-decode-batch-granularity 4

Test plan

Unit tests pass (pytest components/src/dynamo/vllm/tests/test_vllm_instrumented_scheduler.py — 19 passed)
New tests added for both fixes:
- test_decode_sweep_empty_frame_attaches_kv_connector_metadata
- test_decode_sweep_empty_frame_attaches_ec_connector_metadata_when_set
- test_decode_sweep_empty_frame_no_connector_leaves_metadata_none
- test_bench_inject_fake_decode_pads_prompt_for_async_placeholder
Each new test verified to fail on the unpatched scheduler (sanity-checked via git stash)
End-to-end: aggregated worker, single GPU, decode benchmark completes (8/8 points)
End-to-end: disagg decode + NixlConnector, benchmark completes (8/8 points)
Pre-commit hooks pass on the changed files
CI green on this branch

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

Bug Fixes
- Fixed a boundary alignment issue in synthetic request generation during benchmark decode operations to prevent memory overflow.
Tests
- Added comprehensive test coverage for benchmark decode sweep operations, including connector metadata handling and synthetic request validation.

InstrumentedScheduler builds two SchedulerOutputs from scratch during DYN_BENCHMARK_MODE=decode: the synthetic decode batch from _bench_inject_fake_decode and the empty drain frame in schedule(). Neither populated kv_connector_metadata. When a KV connector is configured (e.g. NixlConnector for any disagg worker), vLLM's worker-side _get_kv_connector_output asserts "scheduler_output.kv_connector_metadata is not None" before calling bind_connector_metadata. The synthetic outputs failed this assertion on the first decode-sweep iteration, killing EngineCore before the worker could register and the planner could ever read get_perf_metrics. Mirror the parent scheduler (vllm/v1/core/sched/scheduler.py) and call connector.build_connector_meta(output) on both benchmark-built outputs. Our fake decode reqs have KV pre-allocated locally via allocate_slots, so build_connector_meta produces a no-op metadata: no transfers planned, just a non-None object the worker contract requires. End-to-end repro (Qwen3-0.6B, RTX 6000 Ada, vllm 0.20): python -m dynamo.vllm \ --model Qwen/Qwen3-0.6B \ --disaggregation-mode decode \ --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \ --benchmark-mode decode Before: EngineCore fatal at gpu_model_runner execute_model on the first benchmark decode batch. After: sweep advances past the connector contract; the previous failure mode is gone. Add three unit tests covering both connector-set and connector-absent paths via the schedule() empty-frame branch (the _bench_inject_fake_decode path uses the same pattern and is covered by symmetry). Tests fail on the unpatched scheduler. Signed-off-by: Hongkuan Zhou <hongkuanz@nvidia.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

…ceholder In vLLM's async scheduler path, gpu_model_runner._update_states_after_model_execute writes a -1 placeholder into ``token_ids_cpu[req_idx, num_tokens_no_spec]`` after every sampling step (see ``sampled_ids = [-1]`` for async scheduling). ``num_tokens_no_spec`` is set from the request's prompt length when the request joins the InputBatch. InstrumentedScheduler._bench_inject_fake_decode previously used ``prompt_token_ids=[0] * ctx_len`` and immediately set ``num_computed_tokens = ctx_len``. The placeholder write therefore landed at exactly position ``ctx_len`` -- the same slot the next benchmark batch's request reads as its decode input when the InputBatch slot gets recycled. The embedding lookup OOBs because -1 is out of vocab. Reproduces reliably at batch_size > 1; batch=1 happens to dodge it because the slot was last touched by warmup with a longer prompt that didn't pollute position ctx_len. Diagnosis was via debug prints in vLLM's ``_prepare_input_ids``: with batch=6, ``input_ids.cpu[:6]=[-1, 0, 0, 0, 0, 0]`` -- the leading -1 came from the prior iteration's async-sampler placeholder write at ``token_ids_cpu[0, 16]``. Fix: pad the synthetic prompt to ``ctx_len + 1`` so position ``ctx_len`` is part of the prompt and is filled with a valid token id (0) by ``InputBatch.add_request``. The placeholder write then lands at ``ctx_len + 1``, out of the read range. Allocate ``ctx_len + 1`` KV slots for the same reason (block-table indexing for position ``ctx_len`` requires block ``ctx_len // block_size``, which is a NEW block when ``ctx_len % block_size == 0``). Drop the ``append_output_token_ids(0)`` call -- the padded prompt now provides the input-token slot the model needs, and the appended token never propagated to the worker's CachedRequestState anyway (output_token_ids is hard-coded to [] at gpu_model_runner.py:1168). End-to-end repro (Qwen3-0.6B, RTX 6000 Ada, vllm 0.20): python -m dynamo.vllm \ --model Qwen/Qwen3-0.6B \ --benchmark-mode decode \ --benchmark-decode-batch-granularity 4 Before: CUDA ``indexSelectSmallIndex: srcIndex < srcSelectDimSize`` on the second decode point (first batch_size > 1). After: 8/8 points complete and FPMs are written to the output JSON. Same in disagg mode combined with the connector-metadata fix. Add a unit test that asserts ``allocate_slots`` is called with ``ctx_len + 1`` (not ``ctx_len``); the test fails on the unpatched scheduler. Signed-off-by: Hongkuan Zhou <hongkuanz@nvidia.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

coderabbitai · 2026-05-09T02:58:27Z

Walkthrough

This PR enhances InstrumentedScheduler's benchmark decode path: it fixes a prompt padding issue in synthetic requests (ctx_len + 1 instead of ctx_len for KV allocation) to prevent out-of-bounds writes, and ensures connector metadata is consistently populated in both the decode-sweep empty frame and synthetic SchedulerOutput objects returned by _bench_inject_fake_decode.

Changes

Benchmark Decode Path Enhancement

Layer / File(s)	Summary
Benchmark Fake Decode: Prompt Padding & Metadata `components/src/dynamo/vllm/instrumented_scheduler.py`	`_bench_inject_fake_decode` pads synthetic request prompts to `ctx_len + 1` for KV allocation while keeping `num_computed_tokens` at `ctx_len`, and populates `kv_connector_metadata` and `ec_connector_metadata` in returned `SchedulerOutput`.
Decode Sweep Empty Frame: Connector Metadata `components/src/dynamo/vllm/instrumented_scheduler.py`	During benchmark decode sweep empty frame, `SchedulerOutput` conditionally populates `kv_connector_metadata` and `ec_connector_metadata` from configured connectors.
Test Coverage & Regression Assertions `components/src/dynamo/vllm/tests/test_vllm_instrumented_scheduler.py`	Tests validate connector metadata attachment in decode sweep empty frame and verify prompt padding regression fix to prevent out-of-bounds async scheduler writes.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 70.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly addresses the main change: fixing bugs in InstrumentedScheduler's decode benchmark that blocked disaggregated deployments and batch>1 scenarios.
Description check	✅ Passed	The description comprehensively covers all required sections: overview (two independent bugs explained), details (both fixes described with code references and root cause analysis), reviewer guidance (specific file calls to start review), and related issues.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/src/dynamo/vllm/instrumented_scheduler.py`:
- Around line 362-369: Replace uses of getattr(self, "connector", None) and
getattr(self, "ec_connector", None) with direct attribute access checks (e.g.,
if self.connector is not None) so missing attributes raise AttributeError
instead of being silently masked; update both the shown block where
empty.kv_connector_metadata and empty.ec_connector_metadata are set (use
self.connector.build_connector_meta and self.ec_connector.build_connector_meta
guarded by "if self.connector is not None" / "if self.ec_connector is not None")
and the similar pattern in _bench_inject_fake_decode to follow the same
direct-access guard.

In `@components/src/dynamo/vllm/tests/test_vllm_instrumented_scheduler.py`:
- Around line 350-352: The tests currently import MagicMock and _BenchPhase
inside _make_decode_sweep_stub and inside three test functions; hoist "from
unittest.mock import MagicMock" and "from dynamo.vllm.instrumented_scheduler
import _BenchPhase" to the module-level imports alongside the existing
InstrumentedScheduler import, then remove the in-function/in-test duplicate
import lines in _make_decode_sweep_stub and the three new tests so all imports
live at the top of the file.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1d22e18f-6833-44d2-ad3e-77a973fb0319

📥 Commits

Reviewing files that changed from the base of the PR and between 6cd4238 and e8e9388.

📒 Files selected for processing (2)

components/src/dynamo/vllm/instrumented_scheduler.py
components/src/dynamo/vllm/tests/test_vllm_instrumented_scheduler.py

…nchmark Pure unit tests on _bench_inject_fake_decode and the empty-frame schedule branch can't catch the two recently-fixed regressions in isolation -- both only manifest when the model actually executes on a GPU: * kv_connector_metadata must be attached to benchmark-built SchedulerOutputs, otherwise the worker-side _get_kv_connector_output assertion fires (any disagg config / any worker with a KV connector). * The synthetic decode prompt must be padded to ctx_len + 1 so the async-scheduler's -1 placeholder write doesn't pollute the read slot the next batch reuses (CUDA OOB on the second decode point, first batch>1 sweep). Add a single-GPU end-to-end test that spawns python -m dynamo.vllm with --benchmark-mode {agg,decode}, waits for the benchmark JSON to be written, validates structure (right mode, every point produces at least one FPM with positive wall_time), and tears down. Two scenarios: * test_self_benchmark_agg_mode -- aggregated worker, no kv-transfer-config; covers the prompt-padding regression on the common-case path. * test_self_benchmark_disagg_decode_with_nixl_connector -- disagg decode + NixlConnector kv_both; this is the original user-reported configuration and exercises both fixes together. Both runs are tight (granularity=2, max-model-len=1024, gpu-memory-utilization=0.4 on Qwen3-0.6B) so each completes in ~30s of GPU time inside the 45-minute pre-merge slot. Markers (vllm, gpu_1, e2e, pre_merge, model(QWEN)) match the selector pr.yaml uses for the vllm-test GPU stage: gpu_test_markers: pre_merge and vllm and gpu_1 so the tests gate every PR. Verified the selector picks them up via ``pytest --collect-only -m "pre_merge and vllm and gpu_1"``. Signed-off-by: Hongkuan Zhou <hongkuanz@nvidia.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

Extend the real-GPU pre-merge gate so it also verifies the worker can serve normal traffic after self-benchmarking. Replaces the earlier "start worker just for the benchmark" approach with the production flow: worker startup runs the benchmark, registers with the runtime, then serves a real chat completion. Setup mirrors tests/fault_tolerance/cancellation/test_vllm.py for CI parity: * Same model: FAULT_TOLERANCE_MODEL_NAME (Qwen/Qwen3-0.6B). * Same DynamoFrontendProcess + ManagedProcess worker layout, same health-check URLs, same env (DYN_HEALTH_CHECK_ENABLED=false, DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS, DYN_SYSTEM_PORT, DYN_HTTP_PORT). * Same NixlConnector kv_both wiring and same prefill-side --kv-events-config + VLLM_NIXL_SIDE_CHANNEL_PORT for the disagg case. * Same fixtures: runtime_services_dynamic_ports, predownload_models, set_ucx_tls_no_mm (for the disagg pair). * Same per-worker allocate_port(9100) / deallocate_port pattern. Two scenarios: * test_self_benchmark_agg_serves_after_bench (gpu_1): aggregated worker, --benchmark-mode agg. Worker startup runs the full agg sweep (prefill + decode points, decode includes batch>1), then serves a chat completion. * test_self_benchmark_disagg_serves_after_bench (gpu_2): prefill worker --benchmark-mode prefill plus decode worker --benchmark-mode decode kv_both. Both workers run their sweeps, register, and together serve a chat completion that traverses both engines. Each test, after both workers' health checks pass: 1. Validates the per-worker benchmark JSON: right mode, every benchmark point produced at least one FPM with positive wall_time -- catches a worker that registered but ran a degenerate sweep. 2. Asserts the decode sweep covers batch>1 (the prompt-padding regression case) so a future regression that only affects batch>1 doesn't slip through. 3. Sends a real chat completion via tests/utils/client.send_request and asserts a non-empty assistant content. Marker selectors verified locally: * pre_merge and vllm and gpu_1 -> agg test (matches pr.yaml's vllm-test gpu_test_markers) * pre_merge and vllm and (gpu_2 or gpu_4) -> disagg test (matches pr.yaml's vllm-multi-gpu-test gpu_test_markers) so both gate every PR. Signed-off-by: Hongkuan Zhou <hongkuanz@nvidia.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

Three changes addressing PR review: 1. (dynamo-ops) ``_bench_generate_decode_grid`` previously sized ``max_batch`` from raw ``ctx_len`` token count (``total_kv_tokens // ctx_len``). After the prompt-padding fix ``_bench_inject_fake_decode`` actually requests ``ctx_len + 1`` tokens, which the KV cache manager rounds UP to the next block boundary. So at boundary ctx_lens (notably ctx_len that is a multiple of block_size) the advertised ``max_batch`` was 2x what the allocator could fit, and the allocator silently truncated the batch (logged as "KV exhausted at ctx_len=..., truncating batch") -- the benchmark recorded the partial result under the wrong batch size. Fix: size ``max_batch`` from blocks-per-request derived from the padded length: blocks_per_req = ceil((ctx_len + 1) / block_size) max_batch = num_gpu_blocks // blocks_per_req Add two unit tests that fail on the unpatched grid sizer (e.g. at block_size=16 / num_gpu_blocks=100 / ctx_len=16 the old code advertised batch_size=100, which needs 200 blocks). 2. (CodeRabbit) Replace ``getattr(self, "connector", None)`` and ``getattr(self, "ec_connector", None)`` guards with direct attribute access (``self.connector is not None``, ``self.ec_connector is not None``). Both attributes are part of the parent vLLM ``Scheduler``'s public contract; using ``getattr`` with a default would silently mask a future vLLM bump that drops them. Direct access makes such a contract violation fail loudly. Same fix at both call sites (the empty-frame schedule branch and ``_bench_inject_fake_decode``). 3. (CodeRabbit) Hoist ``from unittest.mock import MagicMock`` and ``from dynamo.vllm.instrumented_scheduler import _BenchPhase`` to the module-level imports in ``test_vllm_instrumented_scheduler.py``, matching the project's python-guidelines (all imports at module top). Local pytest: 21/21 unit tests pass; the two new grid-sizing tests verified to fail on the unpatched scheduler. Signed-off-by: Hongkuan Zhou <hongkuanz@nvidia.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

…rop shebang Three CI failures on PR #9360 addressed: 1. ``check-shebang-scripts-are-executable`` (pre-commit): ``tests/vllm/test_self_benchmark_gpu.py`` carried a ``#!/usr/bin/env python3`` shebang but isn't intended to be directly executed (pytest collects it). Drop the shebang to satisfy the hook without granting exec bits. 2. ``pytest-marker-report`` (pre-commit) -- 9 collection errors: ``tests/vllm/__init__.py`` makes ``tests/vllm`` a Python package that shadows the real ``vllm`` distribution when ``tests`` is on sys.path during marker discovery. Every other vLLM-suite test that does ``from vllm import LLM`` (or similar) broke with ``ImportError: cannot import name 'LLM' from 'vllm'`` pointing at our ``tests/vllm/__init__.py``. Renamed to ``tests/vllm_self_benchmark/`` to remove the name collision; all 9 shadowed imports resolve to the real ``vllm`` again. 3. ``vllm-runtime / Multi-GPU Test`` -- disagg test failed with: (EngineCore pid=...) zmq.error.ZMQError: Address already in use (addr='tcp://*:20380') ``InstrumentedScheduler`` (auto-injected by ``--benchmark-mode``) binds the FPM publisher to ``tcp://*:DYN_FORWARDPASS_METRIC_PORT + dp_rank``, defaulting to 20380. Two workers on the same host (the disagg prefill + decode pair) both try to bind 20380 and the second loses. The operator works around this by injecting ``DYN_FORWARDPASS_METRIC_PORT`` per engine (deploy/operator/internal/dynamo/component_worker.go:111); the single-GPU agg test happens to avoid the collision because only one worker runs. Apply the same fix in the test: allocate a per-worker port via ``allocate_port(20380)`` and inject it as the env var, with matching ``deallocate_port`` on teardown. (The cancellation/migration tests don't hit this because they don't pass ``--benchmark-mode`` and so don't get InstrumentedScheduler.) Verified locally: * ``import vllm`` resolves to the real site-packages vllm again. * pytest collection of the renamed test passes; marker selectors ``pre_merge and vllm and gpu_1`` (agg test) and ``pre_merge and vllm and (gpu_2 or gpu_4)`` (disagg test) still match -- so both stay wired to ``pr.yaml``'s vllm-test and vllm-multi-gpu-test GPU stages and continue to gate the PR. The single-GPU agg test passed in the prior CI run (only the disagg test failed) so issue #3 is specifically what was needed for disagg to pass. Signed-off-by: Hongkuan Zhou <hongkuanz@nvidia.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

tedzhouhk added 2 commits May 8, 2026 19:03

tedzhouhk requested review from a team as code owners May 9, 2026 02:54

pull-request-size Bot added the size/L label May 9, 2026

github-actions Bot added fix backend::vllm Relates to the vllm backend labels May 9, 2026

coderabbitai Bot reviewed May 9, 2026

View reviewed changes

Comment thread components/src/dynamo/vllm/instrumented_scheduler.py Outdated

Comment thread components/src/dynamo/vllm/tests/test_vllm_instrumented_scheduler.py Outdated

dynamo-ops reviewed May 9, 2026

View reviewed changes

Comment thread components/src/dynamo/vllm/instrumented_scheduler.py

pull-request-size Bot added size/XL and removed size/L labels May 9, 2026

copy-pr-bot Bot temporarily deployed to GITLAB May 9, 2026 03:15 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB May 9, 2026 03:29 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB May 9, 2026 03:44 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB May 9, 2026 04:03 Inactive

PeaBrane approved these changes May 9, 2026

View reviewed changes

copy-pr-bot Bot temporarily deployed to GITLAB May 11, 2026 15:58 Inactive

tedzhouhk merged commit 9dd5d56 into main May 11, 2026
87 of 88 checks passed

tedzhouhk deleted the hzhou/vllm-bench-kv-connector-fix branch May 11, 2026 16:50

copy-pr-bot Bot had a problem deploying to GITLAB May 11, 2026 17:07 Failure

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(vllm): unblock InstrumentedScheduler decode benchmark in disagg + at batch>1#9360

fix(vllm): unblock InstrumentedScheduler decode benchmark in disagg + at batch>1#9360
tedzhouhk merged 6 commits into
mainfrom
hzhou/vllm-bench-kv-connector-fix

tedzhouhk commented May 9, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 9, 2026

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

tedzhouhk commented May 9, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Fix 1 — attach kv_connector_metadata to benchmark-built SchedulerOutputs (d2fdd1fa158)

Fix 2 — pad bench fake-decode prompt to avoid async sampler -1 placeholder (e8e938808c2)

End-to-end verification

Test plan

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented May 9, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tedzhouhk commented May 9, 2026 •

edited by coderabbitai Bot

Loading

Fix 1 — attach `kv_connector_metadata` to benchmark-built `SchedulerOutput`s (`d2fdd1fa158`)

Fix 2 — pad bench fake-decode prompt to avoid async sampler `-1` placeholder (`e8e938808c2`)