ci(mm-router): parallel worker boot + pre_merge gater for MM routing by krishung5 · Pull Request #9542 · ai-dynamo/dynamo

krishung5 · 2026-05-14T05:47:36Z

Summary

Three changes to add MM-aware-routing (lightseek) coverage back to pre_merge without re-introducing the ~28-min wall-clock growth that triggered #9452:

agg_multimodal_router.sh — workers now boot in parallel under SINGLE_GPU=true. The script previously launched each worker with & then immediately wait_ready-ed inside the same loop, so 2 workers booted serially even though they share the GPU. Move wait_ready to a second loop so phase 1 fires all workers concurrently and phase 2 waits for all of them.
tests/serve/multimodal_profiles/vllm.py — promote the qwen3-vl-2b agg_router topology from post_merge → pre_merge. Its existing cached_tokens > 0 assertion (already present from fix: MM-aware KV routing for Phi-3, Qwen2-VL, Qwen2.5-VL #9441) fails closed if MM-aware routing silently degrades to text-prefix — catching the same class of regression the post_merge tests/mm_router/* tests do, but earlier.
tests/utils/multimodal.py — trim max_tokens on the cached-tokens payload from 100 → 50. All assertions are prompt-side (cached_tokens, expected substring match on "green", routing block-count log), so output length is safe to halve. Shaves ~5s/test in CI.

Test time

Before:

PASSED [w2] tests/serve/test_vllm.py::test_serve_deployment[mm_agg_router_qwen3-vl-2b] [142s/400s]

After:

PASSED [w219] tests/serve/test_vllm.py::test_serve_deployment[mm_agg_router_qwen3-vl-2b-2] [96s/400s]

Why pre_merge needed this

After #9452 moved the lightseek-path MM-routing tests to post_merge, no pre_merge test exercises lightseek + MM-aware routing. The two remaining `tests/mm_router/*` pre_merge tests cover the TRT-LLM MM Router Worker and the Python vLLM-processor paths only, not the Rust-frontend lightseek path. So any future regression that breaks lightseek init, placeholder expansion, or KV-block hash alignment — exactly the class of bugs #9441 fixes — would only surface in post_merge.

Closes DIS-1983

Test plan

pre_merge picks up `mm_agg_router_qwen3-vl-2b` and reports `cached_tokens > 0`
post_merge `mm_agg_router_*` topologies retain previous behavior under the new parallel boot
No new flake introduced by concurrent worker boot under `SINGLE_GPU=true` (vLLM's `--kv-cache-memory-bytes` byte cap, used in CI, sizes KV cache independently of `--gpu-memory-utilization`, so concurrent profile passes don't OOM)

Summary by CodeRabbit

Documentation
- Improved clarity in vLLM worker launch process with phase demarcation comments.
Tests
- Adjusted test gating to run multimodal router validation earlier in the testing pipeline for faster issue detection.

coderabbitai · 2026-05-14T05:50:44Z

Walkthrough

This PR adds clarifying phase comments to the vLLM aggregation router launch script and re-gates the agg_router test topology from post-merge to pre-merge with updated documentation describing its warm-KV-cache correctness and routing-plumbing validation role.

Changes

vLLM aggregation router clarification and test gating

Layer / File(s)	Summary
vLLM launch script phase demarcation `examples/backends/vllm/launch/agg_multimodal_router.sh`	Adds "Phase 1" comment above the parallel worker startup loop and "Phase 2" comment at the transition to the health-endpoint readiness-check loop, clarifying the two-stage worker launch and validation sequence.
agg_router test re-gating to pre-merge `tests/serve/multimodal_profiles/vllm.py`	Changes the `agg_router` topology marker from `pytest.mark.post_merge` to `pytest.mark.pre_merge` and updates inline documentation to describe it as a warm-KV-cache correctness gate that surfaces routing-plumbing issues before merge.

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description provides a comprehensive summary with detailed changes, rationale, test times, and a clear test plan.
Title check	✅ Passed	The title clearly refers to the two main changes: parallel worker boot optimization and the pre_merge gate for MM routing.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

agg_multimodal_router.sh launched each worker in the background but then blocked on its health endpoint inside the same loop, so two workers booted serially even though they could share the GPU. Move wait_ready outside the launch loop so phase 1 fires all workers and phase 2 waits for all of them concurrently. Combined with this, promote the qwen3-vl-2b agg_router topology to pre_merge. Its cached_tokens > 0 assertion fails closed when MM-aware routing silently degrades to text-prefix — catching the same class of regression the post_merge tests/mm_router/* tests do, but in pre_merge. Wall-clock in mm-vllm-dev (RTX 5880 Ada, --kv-cache-memory-bytes-capped profile matching CI): 66s pass time / 72s real (was 131-142s in the last main post_merge run). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ting-blocks The base CachedTokensChatPayload (cached_tokens >= 1 on repeat) can pass when MM-aware routing has silently degraded to text-prefix overlap: with two workers and identical text bodies, the fallback router will frequently land both requests on the same warm worker by luck, and prompt-prefix reuse alone produces cached_tokens > 0. Real silent regressions in the lightseek path were caught only by reading the frontend logs by hand. Extend CachedTokensChatPayload with two opt-in flags that auto-populate the existing expected_log regex contract: * require_lightseek_init=True — assert the per-model "MM-aware KV routing enabled (lightseek)" INFO log fired, proving init succeeded for this model+placeholder spec. * min_routing_total_blocks=N — assert at least one "[ROUTING] ... with X/M blocks overlap" log line has M >= N. Real MM-aware routing inflates the block-count by 30-150x over text-only (image placeholders -> ~14 tokens/block each); N=10 reliably distinguishes the two modes for all profiled models without false positives. All 6 sglang agg_router profiles (Qwen3-VL-2B pre_merge + Qwen2.5-VL-3B / Qwen2-VL-2B / Phi-3-vision / LLaVA-1.5 / LLaVA-NeXT post_merge) opt into both flags. Note: this overlaps verbatim with PR #9542 (vLLM-side same strong-gating). Whichever lands first rebases the other cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rge-gater

Shaves ~10s off mm_agg_router_qwen3-vl-2b in CI (2 requests × ~5s generation). All assertions are prompt-side, so output length is safe to halve. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dmitry-tokarev-nv

@krishung5 can you see if agg_multimodal_router_chat_processor.sh:173-200 needs same change (point 2 below). Other claude's concerns are up to you to decide on.

Issues & suggestions

1. profiled_vram_gib=18.7 was measured under serial boot — re-profile or watch the first run. (Main risk.)
Concurrent boot keeps the steady-state footprint identical (both workers resident either way) but raises the transient peak: two vLLM memory-profiling forward passes now overlap instead of one running against the other's steady state. profiled_vram_gib is scheduling/selection metadata (pytest_parallel_gpu.py packing + --max-vram-gib deselect in conftest.py:575), not a runtime assertion — so nothing fails on the marker, but the orchestrator may co-locate another test based on a now-stale 18.7 figure and the physical GPU could OOM. --enforce-eager (no graph capture) + a 2B model + headroom on a 24 GiB card make this likely fine, but it's genuinely unverified — your own test-plan item #3 is still unchecked. Worth re-profiling under parallel boot, or at least confirming the peak on the first pre_merge CI run before considering this settled.

2. The sibling script wasn't given the same treatment. agg_multimodal_router_chat_processor.sh:173-200 still has the exact launch-&-then-wait_ready-in-the-same-loop pattern this PR fixes. It stays on post_merge, so it's not blocking — but the same speedup applies and the divergence will confuse the next reader. Either mirror the Phase 1/Phase 2 split there or drop a one-line note on why only the lightseek script was changed.

3. Pre-existing, but the PR's reasoning lives right here: GPU_MEMORY_UTILIZATION (agg_multimodal_router.sh:41) is declared, echoed, and documented in --help, but never passed to the worker — the only GPU-mem args come from ${GPU_MEM_ARGS}. So a manual SINGLE_GPU=true run without the KV-bytes override (the script's own defaults!) falls back to vLLM's 0.9 utilization and would OOM under concurrent boot. That path was already broken under serial boot (worker 2 failed the free-mem check), so this isn't a regression — but since the OOM-safety reasoning now hinges on the marker being set, a one-line comment near Phase 1 ("safe only when the KV-bytes cap is set; CI sets it via the requested_vllm_kv_cache_bytes marker") would save a future debugger real time.

Minor / non-blocking
- wait_ready is called with 2 args here (defaulting to the 900 s timeout, :122) while the chat_processor sibling passes 900 explicitly — harmless, but the inconsistency is a tiny readability snag.
- The topology comment was trimmed of its "why cached_tokens catches silent text-prefix regressions" rationale; it survives in the make_image_payload_cached_tokens docstring (tests/utils/multimodal.py:72-84), so this is fine.
- max_tokens is now 50 only in the cached-tokens factory; make_image_payload/_b64/_video/_audio remain at 100. Intentional (only this one is on the pre_merge time budget), just noting the asymmetry.

…o pre_merge agg_multimodal_router_chat_processor.sh had the same serial-boot pattern as the lightseek script (wait_ready inside the launch loop). Mirror the Phase 1/Phase 2 split so the chat_processor sibling gets the same speedup, and promote mm_agg_router_chat_processor_qwen3-vl-2b to pre_merge so the chat-processor MM-routing path is gated before merge alongside lightseek. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both lightseek and chat_processor mm_agg_router tests peak at 14.9 GiB (measured at 0.5s nvidia-smi polling locally + 15.0 GiB at 10s polling in CI run 26596313542). 18.7 was conservative; 16 keeps 1 GiB safety headroom while freeing the scheduler to pack smaller co-located tests alongside instead of reserving the whole GPU0 slot. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Switch the 3 agg_router topologies (lightseek pre_merge, chat_processor pre_merge, frontend_decode post_merge) from Qwen3-VL-2B-Instruct (bf16) to Qwen3-VL-2B-Instruct-FP8. The router is precision-agnostic — it operates on token IDs, block hashes, and counts, not float values — so FP8 produces bit-identical routing behavior (verified: same cached_tokens 0/272/272, same kv_hit_rate 0.944, same 17/18 block overlap on R3). Drop profiled_vram_gib 16.0 → 13.0 (measured peak 12.16 GiB + 0.8 GiB margin) and requested_vllm_kv_cache_bytes 1.6 GB → 512 MB (test needs <100 MB; smaller cap frees the scheduler's accounting without affecting real peak which is activation-dominated). Net effect: ~5 GiB more co-scheduling headroom on a 22 GiB L4 vs the original 18.7 GiB marker, no coverage loss. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous commit's sed range matched agg_router topologies across all VLLM_MULTIMODAL_PROFILES, not just qwen3-vl-2b. Restore the original profiled_vram_gib and requested_vllm_kv_cache_bytes for qwen2.5-vl-3b, qwen2-vl-2b, and phi3-vision agg_router topologies. qwen2-vl-2b's profiled was dropped 16→13 without verification. Only qwen3-vl-2b's 3 agg_router* topologies should carry the FP8 swap and tightened VRAM markers measured in this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The chat_processor variant doubles GPU0 queue time at 4-worker scale without catching distinct bugs — both lightseek (Rust frontend) and chat_processor (Python preprocessor) share the kv_router downstream, so a routing regression would surface in either. Keep the lighter lightseek-path test as the pre_merge gate; chat_processor stays in post_merge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rge-gater

Signed-off-by: Yi Yao <yi.a.yao@intel.com>

…9542) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

krishung5 requested review from a team as code owners May 14, 2026 05:47

pull-request-size Bot added the size/S label May 14, 2026

github-actions Bot added ci Issues/PRs that reference CI build/test backend::vllm Relates to the vllm backend and removed ci Issues/PRs that reference CI build/test labels May 14, 2026

krishung5 force-pushed the krish/mm-router-premerge-gater branch from 152b57f to 6d462c9 Compare May 14, 2026 05:53

copy-pr-bot Bot temporarily deployed to GITLAB May 14, 2026 05:53 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB May 14, 2026 06:01 Inactive

dynamo-ops approved these changes May 14, 2026

View reviewed changes

krishung5 mentioned this pull request May 14, 2026

feat(sglang): MM-aware KV routing via pad_value substitution #9561

Merged

3 tasks

Merge remote-tracking branch 'origin/main' into krish/mm-router-preme…

5a81137

…rge-gater

rmccorm4 changed the title ~~ci(mm-router): parallel worker boot + pre_merge gater for lightseek~~ ci(mm-router): parallel worker boot + pre_merge gater for rust MM routing May 28, 2026

Merge remote-tracking branch 'origin/main' into krish/mm-router-preme…

7f243c8

…rge-gater

copy-pr-bot Bot temporarily deployed to GITLAB May 28, 2026 19:09 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB May 28, 2026 19:22 Inactive

test(mm-router): trim max_tokens 100 → 50 in cached-tokens payload

f63cb3e

Shaves ~10s off mm_agg_router_qwen3-vl-2b in CI (2 requests × ~5s generation). All assertions are prompt-side, so output length is safe to halve. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

copy-pr-bot Bot temporarily deployed to GITLAB May 28, 2026 20:05 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB May 28, 2026 20:37 Inactive

dmitry-tokarev-nv approved these changes May 28, 2026

View reviewed changes

pull-request-size Bot removed the size/S label May 28, 2026

pull-request-size Bot added the size/M label May 28, 2026

copy-pr-bot Bot temporarily deployed to GITLAB May 28, 2026 21:51 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB May 28, 2026 22:02 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB May 28, 2026 22:12 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB May 28, 2026 22:34 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB May 28, 2026 23:00 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB May 28, 2026 23:03 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB May 28, 2026 23:11 Inactive

krishung5 changed the title ~~ci(mm-router): parallel worker boot + pre_merge gater for rust MM routing~~ ci(mm-router): parallel worker boot + pre_merge gater for MM routing May 28, 2026

krishung5 and others added 2 commits May 28, 2026 19:52

Merge remote-tracking branch 'origin/main' into krish/mm-router-preme…

2ca53fb

…rge-gater

copy-pr-bot Bot temporarily deployed to GITLAB May 29, 2026 02:52 Inactive

dmitry-tokarev-nv approved these changes May 29, 2026

View reviewed changes

copy-pr-bot Bot temporarily deployed to GITLAB May 29, 2026 03:07 Inactive

krishung5 enabled auto-merge (squash) May 29, 2026 03:49

dmitry-tokarev-nv disabled auto-merge May 29, 2026 04:21

dmitry-tokarev-nv merged commit 49c9c80 into main May 29, 2026
241 of 245 checks passed

dmitry-tokarev-nv deleted the krish/mm-router-premerge-gater branch May 29, 2026 04:21

joshuayao added a commit to joshuayao/dynamo that referenced this pull request Jun 1, 2026

Align with ai-dynamo#9542

c87f46c

Signed-off-by: Yi Yao <yi.a.yao@intel.com>

tmonty12 pushed a commit that referenced this pull request Jun 8, 2026

ci(mm-router): parallel worker boot + pre_merge gater for MM routing (#…

6c92b92

…9542) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci(mm-router): parallel worker boot + pre_merge gater for MM routing#9542

ci(mm-router): parallel worker boot + pre_merge gater for MM routing#9542
dmitry-tokarev-nv merged 10 commits into
mainfrom
krish/mm-router-premerge-gater

krishung5 commented May 14, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 14, 2026 •

edited

Loading

Uh oh!

dmitry-tokarev-nv left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

krishung5 commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test time

Why pre_merge needed this

Test plan

Related

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Uh oh!

dmitry-tokarev-nv left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

krishung5 commented May 14, 2026 •

edited

Loading

coderabbitai Bot commented May 14, 2026 •

edited

Loading