feat(sglang): MM-aware KV routing via pad_value substitution#9561
Conversation
3791c43 to
f492144
Compare
When DYN_MM_ROUTING_BACKEND=sglang, the Rust frontend substitutes per-image pad_value = MM_PAD_SHIFT_VALUE + (mm_hash % 2^30) at image positions in routing_token_ids and emits no block_mm_infos. This makes the routing-side hash byte-identical to sglang's BlockStored token_ids (which inline pad_value and carry no extra_keys), so external KV-aware routing works without any sglang event-protocol change — only the minimal C1 mm_hashes interop hook. Default behavior (env unset) is unchanged: canonical image_token_id with mm_hashes appended via block_mm_infos (vLLM-compatible protocol). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pieces required for end-to-end MM-aware routing with SGLang: - decode_handler.py: signature-probe SGLang's async_generate for the mm_hashes kwarg and forward extra_args["mm_hashes"] from the Rust frontend when supported. Older SGLang builds without the interop hook drop the kwarg and fall back to text-prefix MM routing transparently. - agg_multimodal_router.sh: 2-worker aggregated MM router launch script with xdist-aware port allocation. - multimodal_profiles/sglang.py: profile registry covering Qwen3-VL-2B, Qwen2.5-VL-3B, Qwen2-VL-2B, Phi-3-vision, LLaVA-1.5-7B, LLaVA-NeXT-mistral-7B. Each profile uses make_image_payload_cached_tokens to validate the second request hits the cache populated by the first. - test_sglang.py + tests/utils/multimodal.py: wire SGLang into the shared multimodal test harness (image_server fixture, profile expansion). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ting-blocks The base CachedTokensChatPayload (cached_tokens >= 1 on repeat) can pass when MM-aware routing has silently degraded to text-prefix overlap: with two workers and identical text bodies, the fallback router will frequently land both requests on the same warm worker by luck, and prompt-prefix reuse alone produces cached_tokens > 0. Real silent regressions in the lightseek path were caught only by reading the frontend logs by hand. Extend CachedTokensChatPayload with two opt-in flags that auto-populate the existing expected_log regex contract: * require_lightseek_init=True — assert the per-model "MM-aware KV routing enabled (lightseek)" INFO log fired, proving init succeeded for this model+placeholder spec. * min_routing_total_blocks=N — assert at least one "[ROUTING] ... with X/M blocks overlap" log line has M >= N. Real MM-aware routing inflates the block-count by 30-150x over text-only (image placeholders -> ~14 tokens/block each); N=10 reliably distinguishes the two modes for all profiled models without false positives. All 6 sglang agg_router profiles (Qwen3-VL-2B pre_merge + Qwen2.5-VL-3B / Qwen2-VL-2B / Phi-3-vision / LLaVA-1.5 / LLaVA-NeXT post_merge) opt into both flags. Note: this overlaps verbatim with PR #9542 (vLLM-side same strong-gating). Whichever lands first rebases the other cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
f492144 to
a6d7031
Compare
Fixture bypasses __init__ via __new__(), so the new attribute set in DecodeWorkerHandler.__init__ must be set manually. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SGLang's supported-models docs list only Qwen-family VLMs for multimodal; phi-3 fails upstream with a rope_scaling shape error and llava variants need GPU 2x for sizing. Keep Qwen3-VL-2B (pre_merge) + Qwen2.5-VL-3B / Qwen2-VL-2B (post_merge) only. Also drops unused make_image_payload import (ruff F401). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
WalkthroughThis PR enables multimodal hash-aware routing for SGLang by adding MM hash forwarding in the decode handler, using backend-specific token padding in preprocessing, extending test validation infrastructure, and registering multimodal profiles with a complete launcher script. ChangesMultimodal Hash-Aware Routing for SGLang
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
lib/llm/src/preprocessor.rs (1)
1020-1022: ⚡ Quick winMove environment variable read to startup.
The
DYN_MM_ROUTING_BACKENDenvironment variable is read and parsed on every call togather_mm_exact_routing_info(i.e., once per multimodal request). For high-throughput workloads, this is wasteful. Consider reading the env var once inOpenAIPreprocessor::new_with_partsand storing the resolved mode (e.g.,mm_routing_backend: MmRoutingBackendenum or bool flag) as a field onOpenAIPreprocessor. This would reduce per-request overhead to a single field read.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/llm/src/preprocessor.rs` around lines 1020 - 1022, The code currently reads DYN_MM_ROUTING_BACKEND inside gather_mm_exact_routing_info on every request (via sglang_pad_value_mode), causing unnecessary per-request env access; move this logic into OpenAIPreprocessor::new_with_parts by resolving the env once (e.g., parse into a MmRoutingBackend enum or a bool field named mm_routing_backend or similar) and store it as a field on OpenAIPreprocessor; update gather_mm_exact_routing_info to read that field instead of calling std::env::var, and ensure constructors set the new field when instantiating OpenAIPreprocessor.components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (1)
236-253: ⚡ Quick winConsider validating list element types.
The method returns
Optional[List[str]]but doesn't verify thatmm_hasheslist elements are actually strings. If the list contains non-string items (e.g.,[1, 2, 3]), they will be forwarded to SGLang and may cause a runtime error downstream. Per Python guidelines, prefer failing fast.🛡️ Proposed element-type guard
mm_hashes = extra_args.get("mm_hashes") if not mm_hashes: return None if not isinstance(mm_hashes, list): return None + if not all(isinstance(h, str) for h in mm_hashes): + return None return mm_hashes🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/src/dynamo/sglang/request_handlers/llm/decode_handler.py` around lines 236 - 253, The _extract_mm_hashes function currently returns a list without checking element types; update _extract_mm_hashes to validate that mm_hashes is a list of str by using an element-type guard (e.g., all(isinstance(h, str) for h in mm_hashes)) and if any element is not a string fail fast by raising a TypeError (include a clear message referencing mm_hashes and the offending values) so downstream SGLang callers never receive non-string hashes.examples/backends/sglang/launch/agg_multimodal_router.sh (1)
65-68: 💤 Low valueConsider using
print_launch_bannerfor consistency.The script prints a custom banner, but the bash launch guidelines recommend calling
print_launch_banner(fromlaunch_utils.sh) with script description, model, and port to ensure consistent, debuggable startup logs across all launch scripts.📋 Example usage of print_launch_banner
+print_launch_banner \ + "Lightseek MM Exact Routing (SGLang)" \ + "${MODEL}" \ + "${HTTP_PORT}" +echo "NUM_WORKERS=${NUM_WORKERS}, BLOCK_SIZE=${BLOCK_SIZE}" +echo "NAMESPACE=${NAMESPACE}" -echo "=== Lightseek MM Exact Routing Launch (SGLang) ===" -echo "MODEL=${MODEL}" -echo "NUM_WORKERS=${NUM_WORKERS}, BLOCK_SIZE=${BLOCK_SIZE}" -echo "HTTP_PORT=${HTTP_PORT}, NAMESPACE=${NAMESPACE}"As per coding guidelines: "Call
print_launch_bannerwith script description, model, and port to ensure consistent, debuggable startup logs."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/backends/sglang/launch/agg_multimodal_router.sh` around lines 65 - 68, Replace the custom echo banner with a call to print_launch_banner to follow launch guidelines: call print_launch_banner with a concise script description (e.g., "Lightseek MM Exact Routing (SGLang)"), pass MODEL and HTTP_PORT as the required model and port arguments, and keep printing any extra runtime vars like NUM_WORKERS and BLOCK_SIZE separately if needed; update the section containing the current echo lines to invoke print_launch_banner (referencing the print_launch_banner function) and remove the redundant echo banner to maintain consistent startup logs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/utils/payloads.py`:
- Around line 390-395: The code builds a regex using min_routing_total_blocks
but only documents that it must be a power of 10; add an explicit validation
before constructing the regex (in the same block where min_routing_total_blocks
and log_patterns are used) to ensure min_routing_total_blocks is a power of 10
and > 0, and raise a clear ValueError/AssertionError if not; reference the
symbol min_routing_total_blocks and the list log_patterns so the check sits
immediately before the regex construction and prevents generating an
under-enforcing pattern for invalid inputs.
---
Nitpick comments:
In `@components/src/dynamo/sglang/request_handlers/llm/decode_handler.py`:
- Around line 236-253: The _extract_mm_hashes function currently returns a list
without checking element types; update _extract_mm_hashes to validate that
mm_hashes is a list of str by using an element-type guard (e.g.,
all(isinstance(h, str) for h in mm_hashes)) and if any element is not a string
fail fast by raising a TypeError (include a clear message referencing mm_hashes
and the offending values) so downstream SGLang callers never receive non-string
hashes.
In `@examples/backends/sglang/launch/agg_multimodal_router.sh`:
- Around line 65-68: Replace the custom echo banner with a call to
print_launch_banner to follow launch guidelines: call print_launch_banner with a
concise script description (e.g., "Lightseek MM Exact Routing (SGLang)"), pass
MODEL and HTTP_PORT as the required model and port arguments, and keep printing
any extra runtime vars like NUM_WORKERS and BLOCK_SIZE separately if needed;
update the section containing the current echo lines to invoke
print_launch_banner (referencing the print_launch_banner function) and remove
the redundant echo banner to maintain consistent startup logs.
In `@lib/llm/src/preprocessor.rs`:
- Around line 1020-1022: The code currently reads DYN_MM_ROUTING_BACKEND inside
gather_mm_exact_routing_info on every request (via sglang_pad_value_mode),
causing unnecessary per-request env access; move this logic into
OpenAIPreprocessor::new_with_parts by resolving the env once (e.g., parse into a
MmRoutingBackend enum or a bool field named mm_routing_backend or similar) and
store it as a field on OpenAIPreprocessor; update gather_mm_exact_routing_info
to read that field instead of calling std::env::var, and ensure constructors set
the new field when instantiating OpenAIPreprocessor.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 320446bd-dfb5-4e69-9660-a4b7cfc6fef2
📒 Files selected for processing (8)
components/src/dynamo/sglang/request_handlers/llm/decode_handler.pycomponents/src/dynamo/sglang/tests/test_sglang_frontend_decoding.pyexamples/backends/sglang/launch/agg_multimodal_router.shlib/llm/src/preprocessor.rstests/serve/multimodal_profiles/sglang.pytests/serve/test_sglang.pytests/utils/multimodal.pytests/utils/payloads.py
Adds the SGLang row to the support matrix, a How-It-Works subsection covering the pad_value substitution path, and a Launching block with env vars + verification log signals. Cross-link from multimodal-sglang.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collapse three call-site branches on MmRoutingProtocol::Sglang into
methods on the enum:
- format_mm_hash_hex: 16-char (sglang) vs 64-char (vLLM)
- image_fill_token: pad_value(mm_hash) (sglang) vs find_token_id (vLLM)
- emits_block_mm_infos: false (sglang) vs true (vLLM)
Wire format split stays load-bearing: parse_mm_hash_from_extra_key in
lib/kv-router/src/zmq_wire/extra_keys.rs uses 64-char length to filter
MM hashes from LoRA/cache_salt/prompt-embed extra_keys in vLLM
BlockStored events.
Verified:
- cargo check + clippy clean (lightseek-mm feature)
- mm_pad_value_matches_sglang_protocol unit test still pins formula
- qwen3-vl-2b sglang MM-routing e2e PASSED with kv_hit_rate=0.944
(>= 0.9 strong gate)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Report pytest markers gate in pre-commit requires every test to declare a Test Type and Hardware marker. Both tests in this file are pure unit (import sglang module, assert on values; no GPU, no engine startup), so unit + gpu_0 is the appropriate pair. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dynamo-runtime sequential job runs on an image without sglang installed and filters by `not sglang`. Without the marker, the test file matches the filter, gets collected, fails on `import sglang.srt`. Adding pytest.mark.sglang gates it to the sglang container image where the module is available.
Tighten per-method docs to 1-2 lines and add a TODO on the enum pointing at the kv-router 64-char type-tag dependency that blocks full wire-format unification.
… + comment Strip user-visible "lightseek" mentions from the sglang multimodal-router launch script (header, default NAMESPACE, banner echos) and one stray comment in preprocessor.rs. The cargo feature name `lightseek-mm` and the `lightseek_mm` module/type names stay since they're real identifiers.
This comment has been minimized.
This comment has been minimized.
Patches are version-pinned under v${SGLANG_VER}/; the dir-existence guard
already handles upstream-absorbed bumps, so the --forward / rc-tolerance
layer was redundant. Switch to strict: any non-zero `patch` exit aborts
the build.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nv-anants
left a comment
There was a problem hiding this comment.
reviewed container/ changes
Summary
Adds MM-aware KV routing support on the SGLang backend, mirroring the existing vLLM path. The Rust frontend substitutes per-image
pad_values into the routing-side token stream so the routing hash matches what SGLang publishes inBlockStoredevents byte-for-byte. Backend identity is auto-detected from the worker'sModelDeploymentCard— no deployer-side env var required.Closes DIS-1933
Companion PRs
mm_hashesfield onGenerateReqInputthat this PR's pad_value substitution relies on. Without that upstream PR, MM-aware routing silently degrades to text-prefix overlap.tests/utils/payloads.pyandtests/utils/multimodal.py). Whichever lands first rebases the other cleanly.What this PR adds
<img_1>etc.BlockStoredtoken_idsdecode_handler.pysignature-probesmm_hasheskwarg; new sglangagg_multimodal_router.sh(parallel worker boot); sglang test profile matrix covering 6 VLMsrequire_lightseek_init+min_routing_total_blocks=10— fails closed when MM-aware routing silently degradesmm_hashesfor sglangpad_valueto a constant; sglang gets the bare 16-hex u64backend_framework: "sglang" | "vllm"inModelRuntimeConfigat registration; frontend reads it instead of an env var. DropsDYN_MM_ROUTING_BACKENDfrom the SGLang launcher + docsTest coverage
agg_routerLLaVA-1.5-7b, LLaVA-NeXT-mistral-7b — all on
agg_routerAll 6 profiles use strong-gating (lightseek init log + 10-block routing threshold) so a silent regression to text-prefix-only routing fails the test rather than passing on text-prefix luck.
Test plan
pytest tests/serve/test_sglang.py -k agg_routerpost-merge greenthe sglang image built from that branch — we control our image)
🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
Tests