Skip to content

ci(mm-router): parallel worker boot + pre_merge gater for MM routing#9542

Merged
dmitry-tokarev-nv merged 10 commits into
mainfrom
krish/mm-router-premerge-gater
May 29, 2026
Merged

ci(mm-router): parallel worker boot + pre_merge gater for MM routing#9542
dmitry-tokarev-nv merged 10 commits into
mainfrom
krish/mm-router-premerge-gater

Conversation

@krishung5

@krishung5 krishung5 commented May 14, 2026

Copy link
Copy Markdown
Contributor

Summary

Three changes to add MM-aware-routing (lightseek) coverage back to pre_merge without re-introducing the ~28-min wall-clock growth that triggered #9452:

  1. agg_multimodal_router.sh — workers now boot in parallel under SINGLE_GPU=true. The script previously launched each worker with & then immediately wait_ready-ed inside the same loop, so 2 workers booted serially even though they share the GPU. Move wait_ready to a second loop so phase 1 fires all workers concurrently and phase 2 waits for all of them.

  2. tests/serve/multimodal_profiles/vllm.py — promote the qwen3-vl-2b agg_router topology from post_mergepre_merge. Its existing cached_tokens > 0 assertion (already present from fix: MM-aware KV routing for Phi-3, Qwen2-VL, Qwen2.5-VL #9441) fails closed if MM-aware routing silently degrades to text-prefix — catching the same class of regression the post_merge tests/mm_router/* tests do, but earlier.

  3. tests/utils/multimodal.py — trim max_tokens on the cached-tokens payload from 100 → 50. All assertions are prompt-side (cached_tokens, expected substring match on "green", routing block-count log), so output length is safe to halve. Shaves ~5s/test in CI.

Test time

Before:

PASSED [w2] tests/serve/test_vllm.py::test_serve_deployment[mm_agg_router_qwen3-vl-2b] [142s/400s]

After:

PASSED [w219] tests/serve/test_vllm.py::test_serve_deployment[mm_agg_router_qwen3-vl-2b-2] [96s/400s]

Why pre_merge needed this

After #9452 moved the lightseek-path MM-routing tests to post_merge, no pre_merge test exercises lightseek + MM-aware routing. The two remaining `tests/mm_router/*` pre_merge tests cover the TRT-LLM MM Router Worker and the Python vLLM-processor paths only, not the Rust-frontend lightseek path. So any future regression that breaks lightseek init, placeholder expansion, or KV-block hash alignment — exactly the class of bugs #9441 fixes — would only surface in post_merge.

Closes DIS-1983

Test plan

  • pre_merge picks up `mm_agg_router_qwen3-vl-2b` and reports `cached_tokens > 0`
  • post_merge `mm_agg_router_*` topologies retain previous behavior under the new parallel boot
  • No new flake introduced by concurrent worker boot under `SINGLE_GPU=true` (vLLM's `--kv-cache-memory-bytes` byte cap, used in CI, sizes KV cache independently of `--gpu-memory-utilization`, so concurrent profile passes don't OOM)

Related

🤖 Generated with Claude Code


Open in Devin Review

Summary by CodeRabbit

  • Documentation

    • Improved clarity in vLLM worker launch process with phase demarcation comments.
  • Tests

    • Adjusted test gating to run multimodal router validation earlier in the testing pipeline for faster issue detection.

Review Change Stack

@krishung5 krishung5 requested review from a team as code owners May 14, 2026 05:47
@github-actions github-actions Bot added ci Issues/PRs that reference CI build/test backend::vllm Relates to the vllm backend and removed ci Issues/PRs that reference CI build/test labels May 14, 2026
@coderabbitai

coderabbitai Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

This PR adds clarifying phase comments to the vLLM aggregation router launch script and re-gates the agg_router test topology from post-merge to pre-merge with updated documentation describing its warm-KV-cache correctness and routing-plumbing validation role.

Changes

vLLM aggregation router clarification and test gating

Layer / File(s) Summary
vLLM launch script phase demarcation
examples/backends/vllm/launch/agg_multimodal_router.sh
Adds "Phase 1" comment above the parallel worker startup loop and "Phase 2" comment at the transition to the health-endpoint readiness-check loop, clarifying the two-stage worker launch and validation sequence.
agg_router test re-gating to pre-merge
tests/serve/multimodal_profiles/vllm.py
Changes the agg_router topology marker from pytest.mark.post_merge to pytest.mark.pre_merge and updates inline documentation to describe it as a warm-KV-cache correctness gate that surfaces routing-plumbing issues before merge.

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description provides a comprehensive summary with detailed changes, rationale, test times, and a clear test plan.
Title check ✅ Passed The title clearly refers to the two main changes: parallel worker boot optimization and the pre_merge gate for MM routing.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

agg_multimodal_router.sh launched each worker in the background but
then blocked on its health endpoint inside the same loop, so two
workers booted serially even though they could share the GPU. Move
wait_ready outside the launch loop so phase 1 fires all workers and
phase 2 waits for all of them concurrently.

Combined with this, promote the qwen3-vl-2b agg_router topology to
pre_merge. Its cached_tokens > 0 assertion fails closed when MM-aware
routing silently degrades to text-prefix — catching the same class of
regression the post_merge tests/mm_router/* tests do, but in pre_merge.

Wall-clock in mm-vllm-dev (RTX 5880 Ada, --kv-cache-memory-bytes-capped
profile matching CI): 66s pass time / 72s real (was 131-142s in the
last main post_merge run).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@krishung5 krishung5 force-pushed the krish/mm-router-premerge-gater branch from 152b57f to 6d462c9 Compare May 14, 2026 05:53
krishung5 added a commit that referenced this pull request May 14, 2026
…ting-blocks

The base CachedTokensChatPayload (cached_tokens >= 1 on repeat) can pass
when MM-aware routing has silently degraded to text-prefix overlap: with
two workers and identical text bodies, the fallback router will
frequently land both requests on the same warm worker by luck, and
prompt-prefix reuse alone produces cached_tokens > 0. Real silent
regressions in the lightseek path were caught only by reading the
frontend logs by hand.

Extend CachedTokensChatPayload with two opt-in flags that auto-populate
the existing expected_log regex contract:

* require_lightseek_init=True — assert the per-model "MM-aware KV
  routing enabled (lightseek)" INFO log fired, proving init succeeded
  for this model+placeholder spec.

* min_routing_total_blocks=N — assert at least one "[ROUTING] ... with
  X/M blocks overlap" log line has M >= N. Real MM-aware routing
  inflates the block-count by 30-150x over text-only (image
  placeholders -> ~14 tokens/block each); N=10 reliably distinguishes
  the two modes for all profiled models without false positives.

All 6 sglang agg_router profiles (Qwen3-VL-2B pre_merge + Qwen2.5-VL-3B
/ Qwen2-VL-2B / Phi-3-vision / LLaVA-1.5 / LLaVA-NeXT post_merge) opt
into both flags.

Note: this overlaps verbatim with PR #9542 (vLLM-side same
strong-gating). Whichever lands first rebases the other cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
krishung5 added a commit that referenced this pull request May 14, 2026
…ting-blocks

The base CachedTokensChatPayload (cached_tokens >= 1 on repeat) can pass
when MM-aware routing has silently degraded to text-prefix overlap: with
two workers and identical text bodies, the fallback router will
frequently land both requests on the same warm worker by luck, and
prompt-prefix reuse alone produces cached_tokens > 0. Real silent
regressions in the lightseek path were caught only by reading the
frontend logs by hand.

Extend CachedTokensChatPayload with two opt-in flags that auto-populate
the existing expected_log regex contract:

* require_lightseek_init=True — assert the per-model "MM-aware KV
  routing enabled (lightseek)" INFO log fired, proving init succeeded
  for this model+placeholder spec.

* min_routing_total_blocks=N — assert at least one "[ROUTING] ... with
  X/M blocks overlap" log line has M >= N. Real MM-aware routing
  inflates the block-count by 30-150x over text-only (image
  placeholders -> ~14 tokens/block each); N=10 reliably distinguishes
  the two modes for all profiled models without false positives.

All 6 sglang agg_router profiles (Qwen3-VL-2B pre_merge + Qwen2.5-VL-3B
/ Qwen2-VL-2B / Phi-3-vision / LLaVA-1.5 / LLaVA-NeXT post_merge) opt
into both flags.

Note: this overlaps verbatim with PR #9542 (vLLM-side same
strong-gating). Whichever lands first rebases the other cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rmccorm4 rmccorm4 changed the title ci(mm-router): parallel worker boot + pre_merge gater for lightseek ci(mm-router): parallel worker boot + pre_merge gater for rust MM routing May 28, 2026
Shaves ~10s off mm_agg_router_qwen3-vl-2b in CI (2 requests × ~5s
generation). All assertions are prompt-side, so output length is safe
to halve.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@dmitry-tokarev-nv dmitry-tokarev-nv left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@krishung5 can you see if agg_multimodal_router_chat_processor.sh:173-200 needs same change (point 2 below). Other claude's concerns are up to you to decide on.

Issues & suggestions

1. profiled_vram_gib=18.7 was measured under serial boot — re-profile or watch the first run. (Main risk.)
Concurrent boot keeps the steady-state footprint identical (both workers resident either way) but raises the transient peak: two vLLM memory-profiling forward passes now overlap instead of one running against the other's steady state. profiled_vram_gib is scheduling/selection metadata (pytest_parallel_gpu.py packing + --max-vram-gib deselect in conftest.py:575), not a runtime assertion — so nothing fails on the marker, but the orchestrator may co-locate another test based on a now-stale 18.7 figure and the physical GPU could OOM. --enforce-eager (no graph capture) + a 2B model + headroom on a 24 GiB card make this likely fine, but it's genuinely unverified — your own test-plan item #3 is still unchecked. Worth re-profiling under parallel boot, or at least confirming the peak on the first pre_merge CI run before considering this settled.

2. The sibling script wasn't given the same treatment. agg_multimodal_router_chat_processor.sh:173-200 still has the exact launch-&-then-wait_ready-in-the-same-loop pattern this PR fixes. It stays on post_merge, so it's not blocking — but the same speedup applies and the divergence will confuse the next reader. Either mirror the Phase 1/Phase 2 split there or drop a one-line note on why only the lightseek script was changed.

3. Pre-existing, but the PR's reasoning lives right here: GPU_MEMORY_UTILIZATION (agg_multimodal_router.sh:41) is declared, echoed, and documented in --help, but never passed to the worker — the only GPU-mem args come from ${GPU_MEM_ARGS}. So a manual SINGLE_GPU=true run without the KV-bytes override (the script's own defaults!) falls back to vLLM's 0.9 utilization and would OOM under concurrent boot. That path was already broken under serial boot (worker 2 failed the free-mem check), so this isn't a regression — but since the OOM-safety reasoning now hinges on the marker being set, a one-line comment near Phase 1 ("safe only when the KV-bytes cap is set; CI sets it via the requested_vllm_kv_cache_bytes marker") would save a future debugger real time.

Minor / non-blocking
- wait_ready is called with 2 args here (defaulting to the 900 s timeout, :122) while the chat_processor sibling passes 900 explicitly — harmless, but the inconsistency is a tiny readability snag.
- The topology comment was trimmed of its "why cached_tokens catches silent text-prefix regressions" rationale; it survives in the make_image_payload_cached_tokens docstring (tests/utils/multimodal.py:72-84), so this is fine.
- max_tokens is now 50 only in the cached-tokens factory; make_image_payload/_b64/_video/_audio remain at 100. Intentional (only this one is on the pre_merge time budget), just noting the asymmetry.

…o pre_merge

agg_multimodal_router_chat_processor.sh had the same serial-boot pattern as
the lightseek script (wait_ready inside the launch loop). Mirror the
Phase 1/Phase 2 split so the chat_processor sibling gets the same speedup,
and promote mm_agg_router_chat_processor_qwen3-vl-2b to pre_merge so the
chat-processor MM-routing path is gated before merge alongside lightseek.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both lightseek and chat_processor mm_agg_router tests peak at 14.9 GiB
(measured at 0.5s nvidia-smi polling locally + 15.0 GiB at 10s polling
in CI run 26596313542). 18.7 was conservative; 16 keeps 1 GiB safety
headroom while freeing the scheduler to pack smaller co-located tests
alongside instead of reserving the whole GPU0 slot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch the 3 agg_router topologies (lightseek pre_merge, chat_processor
pre_merge, frontend_decode post_merge) from Qwen3-VL-2B-Instruct (bf16)
to Qwen3-VL-2B-Instruct-FP8. The router is precision-agnostic — it
operates on token IDs, block hashes, and counts, not float values — so
FP8 produces bit-identical routing behavior (verified: same cached_tokens
0/272/272, same kv_hit_rate 0.944, same 17/18 block overlap on R3).

Drop profiled_vram_gib 16.0 → 13.0 (measured peak 12.16 GiB + 0.8 GiB
margin) and requested_vllm_kv_cache_bytes 1.6 GB → 512 MB (test needs
<100 MB; smaller cap frees the scheduler's accounting without affecting
real peak which is activation-dominated).

Net effect: ~5 GiB more co-scheduling headroom on a 22 GiB L4 vs the
original 18.7 GiB marker, no coverage loss.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit's sed range matched agg_router topologies across
all VLLM_MULTIMODAL_PROFILES, not just qwen3-vl-2b. Restore the original
profiled_vram_gib and requested_vllm_kv_cache_bytes for
qwen2.5-vl-3b, qwen2-vl-2b, and phi3-vision agg_router topologies.
qwen2-vl-2b's profiled was dropped 16→13 without verification.

Only qwen3-vl-2b's 3 agg_router* topologies should carry the FP8 swap
and tightened VRAM markers measured in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@krishung5 krishung5 changed the title ci(mm-router): parallel worker boot + pre_merge gater for rust MM routing ci(mm-router): parallel worker boot + pre_merge gater for MM routing May 28, 2026
krishung5 and others added 2 commits May 28, 2026 19:52
The chat_processor variant doubles GPU0 queue time at 4-worker scale
without catching distinct bugs — both lightseek (Rust frontend) and
chat_processor (Python preprocessor) share the kv_router downstream,
so a routing regression would surface in either. Keep the lighter
lightseek-path test as the pre_merge gate; chat_processor stays in
post_merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@krishung5 krishung5 enabled auto-merge (squash) May 29, 2026 03:49
@dmitry-tokarev-nv dmitry-tokarev-nv merged commit 49c9c80 into main May 29, 2026
241 of 245 checks passed
@dmitry-tokarev-nv dmitry-tokarev-nv deleted the krish/mm-router-premerge-gater branch May 29, 2026 04:21
joshuayao added a commit to joshuayao/dynamo that referenced this pull request Jun 1, 2026
Signed-off-by: Yi Yao <yi.a.yao@intel.com>
tmonty12 pushed a commit that referenced this pull request Jun 8, 2026
…9542)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend::vllm Relates to the vllm backend ci Issues/PRs that reference CI build/test size/M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants