ci(mm-router): parallel worker boot + pre_merge gater for MM routing#9542
Conversation
WalkthroughThis PR adds clarifying phase comments to the vLLM aggregation router launch script and re-gates the ChangesvLLM aggregation router clarification and test gating
🎯 1 (Trivial) | ⏱️ ~3 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
agg_multimodal_router.sh launched each worker in the background but then blocked on its health endpoint inside the same loop, so two workers booted serially even though they could share the GPU. Move wait_ready outside the launch loop so phase 1 fires all workers and phase 2 waits for all of them concurrently. Combined with this, promote the qwen3-vl-2b agg_router topology to pre_merge. Its cached_tokens > 0 assertion fails closed when MM-aware routing silently degrades to text-prefix — catching the same class of regression the post_merge tests/mm_router/* tests do, but in pre_merge. Wall-clock in mm-vllm-dev (RTX 5880 Ada, --kv-cache-memory-bytes-capped profile matching CI): 66s pass time / 72s real (was 131-142s in the last main post_merge run). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
152b57f to
6d462c9
Compare
…ting-blocks The base CachedTokensChatPayload (cached_tokens >= 1 on repeat) can pass when MM-aware routing has silently degraded to text-prefix overlap: with two workers and identical text bodies, the fallback router will frequently land both requests on the same warm worker by luck, and prompt-prefix reuse alone produces cached_tokens > 0. Real silent regressions in the lightseek path were caught only by reading the frontend logs by hand. Extend CachedTokensChatPayload with two opt-in flags that auto-populate the existing expected_log regex contract: * require_lightseek_init=True — assert the per-model "MM-aware KV routing enabled (lightseek)" INFO log fired, proving init succeeded for this model+placeholder spec. * min_routing_total_blocks=N — assert at least one "[ROUTING] ... with X/M blocks overlap" log line has M >= N. Real MM-aware routing inflates the block-count by 30-150x over text-only (image placeholders -> ~14 tokens/block each); N=10 reliably distinguishes the two modes for all profiled models without false positives. All 6 sglang agg_router profiles (Qwen3-VL-2B pre_merge + Qwen2.5-VL-3B / Qwen2-VL-2B / Phi-3-vision / LLaVA-1.5 / LLaVA-NeXT post_merge) opt into both flags. Note: this overlaps verbatim with PR #9542 (vLLM-side same strong-gating). Whichever lands first rebases the other cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ting-blocks The base CachedTokensChatPayload (cached_tokens >= 1 on repeat) can pass when MM-aware routing has silently degraded to text-prefix overlap: with two workers and identical text bodies, the fallback router will frequently land both requests on the same warm worker by luck, and prompt-prefix reuse alone produces cached_tokens > 0. Real silent regressions in the lightseek path were caught only by reading the frontend logs by hand. Extend CachedTokensChatPayload with two opt-in flags that auto-populate the existing expected_log regex contract: * require_lightseek_init=True — assert the per-model "MM-aware KV routing enabled (lightseek)" INFO log fired, proving init succeeded for this model+placeholder spec. * min_routing_total_blocks=N — assert at least one "[ROUTING] ... with X/M blocks overlap" log line has M >= N. Real MM-aware routing inflates the block-count by 30-150x over text-only (image placeholders -> ~14 tokens/block each); N=10 reliably distinguishes the two modes for all profiled models without false positives. All 6 sglang agg_router profiles (Qwen3-VL-2B pre_merge + Qwen2.5-VL-3B / Qwen2-VL-2B / Phi-3-vision / LLaVA-1.5 / LLaVA-NeXT post_merge) opt into both flags. Note: this overlaps verbatim with PR #9542 (vLLM-side same strong-gating). Whichever lands first rebases the other cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Shaves ~10s off mm_agg_router_qwen3-vl-2b in CI (2 requests × ~5s generation). All assertions are prompt-side, so output length is safe to halve. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dmitry-tokarev-nv
left a comment
There was a problem hiding this comment.
@krishung5 can you see if agg_multimodal_router_chat_processor.sh:173-200 needs same change (point 2 below). Other claude's concerns are up to you to decide on.
Issues & suggestions
1. profiled_vram_gib=18.7 was measured under serial boot — re-profile or watch the first run. (Main risk.)
Concurrent boot keeps the steady-state footprint identical (both workers resident either way) but raises the transient peak: two vLLM memory-profiling forward passes now overlap instead of one running against the other's steady state. profiled_vram_gib is scheduling/selection metadata (pytest_parallel_gpu.py packing + --max-vram-gib deselect in conftest.py:575), not a runtime assertion — so nothing fails on the marker, but the orchestrator may co-locate another test based on a now-stale 18.7 figure and the physical GPU could OOM. --enforce-eager (no graph capture) + a 2B model + headroom on a 24 GiB card make this likely fine, but it's genuinely unverified — your own test-plan item #3 is still unchecked. Worth re-profiling under parallel boot, or at least confirming the peak on the first pre_merge CI run before considering this settled.
2. The sibling script wasn't given the same treatment. agg_multimodal_router_chat_processor.sh:173-200 still has the exact launch-&-then-wait_ready-in-the-same-loop pattern this PR fixes. It stays on post_merge, so it's not blocking — but the same speedup applies and the divergence will confuse the next reader. Either mirror the Phase 1/Phase 2 split there or drop a one-line note on why only the lightseek script was changed.
3. Pre-existing, but the PR's reasoning lives right here: GPU_MEMORY_UTILIZATION (agg_multimodal_router.sh:41) is declared, echoed, and documented in --help, but never passed to the worker — the only GPU-mem args come from ${GPU_MEM_ARGS}. So a manual SINGLE_GPU=true run without the KV-bytes override (the script's own defaults!) falls back to vLLM's 0.9 utilization and would OOM under concurrent boot. That path was already broken under serial boot (worker 2 failed the free-mem check), so this isn't a regression — but since the OOM-safety reasoning now hinges on the marker being set, a one-line comment near Phase 1 ("safe only when the KV-bytes cap is set; CI sets it via the requested_vllm_kv_cache_bytes marker") would save a future debugger real time.
Minor / non-blocking
- wait_ready is called with 2 args here (defaulting to the 900 s timeout, :122) while the chat_processor sibling passes 900 explicitly — harmless, but the inconsistency is a tiny readability snag.
- The topology comment was trimmed of its "why cached_tokens catches silent text-prefix regressions" rationale; it survives in the make_image_payload_cached_tokens docstring (tests/utils/multimodal.py:72-84), so this is fine.
- max_tokens is now 50 only in the cached-tokens factory; make_image_payload/_b64/_video/_audio remain at 100. Intentional (only this one is on the pre_merge time budget), just noting the asymmetry.
…o pre_merge agg_multimodal_router_chat_processor.sh had the same serial-boot pattern as the lightseek script (wait_ready inside the launch loop). Mirror the Phase 1/Phase 2 split so the chat_processor sibling gets the same speedup, and promote mm_agg_router_chat_processor_qwen3-vl-2b to pre_merge so the chat-processor MM-routing path is gated before merge alongside lightseek. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both lightseek and chat_processor mm_agg_router tests peak at 14.9 GiB (measured at 0.5s nvidia-smi polling locally + 15.0 GiB at 10s polling in CI run 26596313542). 18.7 was conservative; 16 keeps 1 GiB safety headroom while freeing the scheduler to pack smaller co-located tests alongside instead of reserving the whole GPU0 slot. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch the 3 agg_router topologies (lightseek pre_merge, chat_processor pre_merge, frontend_decode post_merge) from Qwen3-VL-2B-Instruct (bf16) to Qwen3-VL-2B-Instruct-FP8. The router is precision-agnostic — it operates on token IDs, block hashes, and counts, not float values — so FP8 produces bit-identical routing behavior (verified: same cached_tokens 0/272/272, same kv_hit_rate 0.944, same 17/18 block overlap on R3). Drop profiled_vram_gib 16.0 → 13.0 (measured peak 12.16 GiB + 0.8 GiB margin) and requested_vllm_kv_cache_bytes 1.6 GB → 512 MB (test needs <100 MB; smaller cap frees the scheduler's accounting without affecting real peak which is activation-dominated). Net effect: ~5 GiB more co-scheduling headroom on a 22 GiB L4 vs the original 18.7 GiB marker, no coverage loss. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit's sed range matched agg_router topologies across all VLLM_MULTIMODAL_PROFILES, not just qwen3-vl-2b. Restore the original profiled_vram_gib and requested_vllm_kv_cache_bytes for qwen2.5-vl-3b, qwen2-vl-2b, and phi3-vision agg_router topologies. qwen2-vl-2b's profiled was dropped 16→13 without verification. Only qwen3-vl-2b's 3 agg_router* topologies should carry the FP8 swap and tightened VRAM markers measured in this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The chat_processor variant doubles GPU0 queue time at 4-worker scale without catching distinct bugs — both lightseek (Rust frontend) and chat_processor (Python preprocessor) share the kv_router downstream, so a routing regression would surface in either. Keep the lighter lightseek-path test as the pre_merge gate; chat_processor stays in post_merge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yi Yao <yi.a.yao@intel.com>
…9542) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Three changes to add MM-aware-routing (lightseek) coverage back to pre_merge without re-introducing the ~28-min wall-clock growth that triggered #9452:
agg_multimodal_router.sh— workers now boot in parallel underSINGLE_GPU=true. The script previously launched each worker with&then immediatelywait_ready-ed inside the same loop, so 2 workers booted serially even though they share the GPU. Movewait_readyto a second loop so phase 1 fires all workers concurrently and phase 2 waits for all of them.tests/serve/multimodal_profiles/vllm.py— promote theqwen3-vl-2bagg_routertopology frompost_merge→pre_merge. Its existingcached_tokens > 0assertion (already present from fix: MM-aware KV routing for Phi-3, Qwen2-VL, Qwen2.5-VL #9441) fails closed if MM-aware routing silently degrades to text-prefix — catching the same class of regression the post_mergetests/mm_router/*tests do, but earlier.tests/utils/multimodal.py— trimmax_tokenson the cached-tokens payload from 100 → 50. All assertions are prompt-side (cached_tokens, expected substring match on "green", routing block-count log), so output length is safe to halve. Shaves ~5s/test in CI.Test time
Before:
After:
Why pre_merge needed this
After #9452 moved the lightseek-path MM-routing tests to post_merge, no pre_merge test exercises lightseek + MM-aware routing. The two remaining `tests/mm_router/*` pre_merge tests cover the TRT-LLM MM Router Worker and the Python vLLM-processor paths only, not the Rust-frontend lightseek path. So any future regression that breaks lightseek init, placeholder expansion, or KV-block hash alignment — exactly the class of bugs #9441 fixes — would only surface in post_merge.
Closes DIS-1983
Test plan
Related
🤖 Generated with Claude Code
Summary by CodeRabbit
Documentation
Tests