Skip to content

feat(sglang): MM-aware KV routing via pad_value substitution#9561

Merged
krishung5 merged 39 commits into
mainfrom
krish/sglang-mm-routing
Jun 3, 2026
Merged

feat(sglang): MM-aware KV routing via pad_value substitution#9561
krishung5 merged 39 commits into
mainfrom
krish/sglang-mm-routing

Conversation

@krishung5

@krishung5 krishung5 commented May 14, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds MM-aware KV routing support on the SGLang backend, mirroring the existing vLLM path. The Rust frontend substitutes per-image pad_values into the routing-side token stream so the routing hash matches what SGLang publishes in BlockStored events byte-for-byte. Backend identity is auto-detected from the worker's ModelDeploymentCard — no deployer-side env var required.

Closes DIS-1933

Companion PRs

  • Upstream SGLang: sgl-project/sglang#25300 adds the optional mm_hashes field on GenerateReqInput that this PR's pad_value substitution relies on. Without that upstream PR, MM-aware routing silently degrades to text-prefix overlap.
  • PR ci(mm-router): parallel worker boot + pre_merge gater for MM routing #9542 (vLLM-side strong-gating) — overlaps with one commit in this PR (test infra in tests/utils/payloads.py and tests/utils/multimodal.py). Whichever lands first rebases the other cleanly.

What this PR adds

Commit Purpose
MM-routing rewrite for BPE-shatterable numbered placeholders Lightseek init: handle tokenizers that BPE-shatter <img_1> etc.
Qwen2-VL / Qwen2.5-VL fix Resolve dual-token placeholder via tokenizer-loader
rustfmt Style only
pad_value mode in Rust frontend Routing-token substitution switches to pad-value-based for sglang workers, matching sglang's BlockStored token_ids
SGLang MM-routing glue + test scaffolding decode_handler.py signature-probes mm_hashes kwarg; new sglang agg_multimodal_router.sh (parallel worker boot); sglang test profile matrix covering 6 VLMs
Strong-gate sglang test profiles Opt all 6 profiles into require_lightseek_init + min_routing_total_blocks=10 — fails closed when MM-aware routing silently degrades
16-char hex mm_hashes for sglang vLLM's 64-char-padded form would collapse sglang's per-block pad_value to a constant; sglang gets the bare 16-hex u64
Auto-detect backend via MDC Workers advertise backend_framework: "sglang" | "vllm" in ModelRuntimeConfig at registration; frontend reads it instead of an env var. Drops DYN_MM_ROUTING_BACKEND from the SGLang launcher + docs

Test coverage

  • pre_merge: Qwen3-VL-2B-Instruct on agg_router
  • post_merge: Qwen2.5-VL-3B, Qwen2-VL-2B, Phi-3-vision-128k,
    LLaVA-1.5-7b, LLaVA-NeXT-mistral-7b — all on agg_router

All 6 profiles use strong-gating (lightseek init log + 10-block routing threshold) so a silent regression to text-prefix-only routing fails the test rather than passing on text-prefix luck.

Test plan

  • pytest tests/serve/test_sglang.py -k agg_router post-merge green
  • Local 1-GPU Qwen3-VL-2B smoke on mm-sglang-dev container
  • Upstream sglang companion PR merged before this lands (or at least
    the sglang image built from that branch — we control our image)

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added multimodal-aware routing support for SGLang backend clusters, enabling efficient processing of multimodal requests.
  • Tests

    • Added comprehensive test profiles for multimodal vision-language models with SGLang backend (Qwen3-VL-2B, Qwen2.5-VL-3B, Qwen2-VL-2B).
    • Enhanced test infrastructure for multimodal request routing and KV-cache optimization validation.

Review Change Stack

@github-actions github-actions Bot added backend::sglang Relates to the sglang backend frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` feat labels May 14, 2026
@krishung5 krishung5 force-pushed the krish/sglang-mm-routing branch from 3791c43 to f492144 Compare May 14, 2026 17:46
krishung5 and others added 3 commits May 14, 2026 10:56
When DYN_MM_ROUTING_BACKEND=sglang, the Rust frontend substitutes per-image
pad_value = MM_PAD_SHIFT_VALUE + (mm_hash % 2^30) at image positions in
routing_token_ids and emits no block_mm_infos. This makes the routing-side
hash byte-identical to sglang's BlockStored token_ids (which inline pad_value
and carry no extra_keys), so external KV-aware routing works without any
sglang event-protocol change — only the minimal C1 mm_hashes interop hook.

Default behavior (env unset) is unchanged: canonical image_token_id with
mm_hashes appended via block_mm_infos (vLLM-compatible protocol).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pieces required for end-to-end MM-aware routing with SGLang:

- decode_handler.py: signature-probe SGLang's async_generate for the mm_hashes
  kwarg and forward extra_args["mm_hashes"] from the Rust frontend when
  supported. Older SGLang builds without the interop hook drop the kwarg and
  fall back to text-prefix MM routing transparently.

- agg_multimodal_router.sh: 2-worker aggregated MM router launch script with
  xdist-aware port allocation.

- multimodal_profiles/sglang.py: profile registry covering Qwen3-VL-2B,
  Qwen2.5-VL-3B, Qwen2-VL-2B, Phi-3-vision, LLaVA-1.5-7B, LLaVA-NeXT-mistral-7B.
  Each profile uses make_image_payload_cached_tokens to validate the second
  request hits the cache populated by the first.

- test_sglang.py + tests/utils/multimodal.py: wire SGLang into the shared
  multimodal test harness (image_server fixture, profile expansion).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ting-blocks

The base CachedTokensChatPayload (cached_tokens >= 1 on repeat) can pass
when MM-aware routing has silently degraded to text-prefix overlap: with
two workers and identical text bodies, the fallback router will
frequently land both requests on the same warm worker by luck, and
prompt-prefix reuse alone produces cached_tokens > 0. Real silent
regressions in the lightseek path were caught only by reading the
frontend logs by hand.

Extend CachedTokensChatPayload with two opt-in flags that auto-populate
the existing expected_log regex contract:

* require_lightseek_init=True — assert the per-model "MM-aware KV
  routing enabled (lightseek)" INFO log fired, proving init succeeded
  for this model+placeholder spec.

* min_routing_total_blocks=N — assert at least one "[ROUTING] ... with
  X/M blocks overlap" log line has M >= N. Real MM-aware routing
  inflates the block-count by 30-150x over text-only (image
  placeholders -> ~14 tokens/block each); N=10 reliably distinguishes
  the two modes for all profiled models without false positives.

All 6 sglang agg_router profiles (Qwen3-VL-2B pre_merge + Qwen2.5-VL-3B
/ Qwen2-VL-2B / Phi-3-vision / LLaVA-1.5 / LLaVA-NeXT post_merge) opt
into both flags.

Note: this overlaps verbatim with PR #9542 (vLLM-side same
strong-gating). Whichever lands first rebases the other cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
krishung5 and others added 2 commits May 14, 2026 18:34
Fixture bypasses __init__ via __new__(), so the new attribute set in
DecodeWorkerHandler.__init__ must be set manually.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SGLang's supported-models docs list only Qwen-family VLMs for multimodal;
phi-3 fails upstream with a rope_scaling shape error and llava variants
need GPU 2x for sizing. Keep Qwen3-VL-2B (pre_merge) + Qwen2.5-VL-3B /
Qwen2-VL-2B (post_merge) only.

Also drops unused make_image_payload import (ruff F401).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@krishung5 krishung5 marked this pull request as ready for review May 15, 2026 01:36
@krishung5 krishung5 requested review from a team as code owners May 15, 2026 01:36
@coderabbitai

coderabbitai Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

This PR enables multimodal hash-aware routing for SGLang by adding MM hash forwarding in the decode handler, using backend-specific token padding in preprocessing, extending test validation infrastructure, and registering multimodal profiles with a complete launcher script.

Changes

Multimodal Hash-Aware Routing for SGLang

Layer / File(s) Summary
Decode handler MM hash detection and forwarding
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py, components/src/dynamo/sglang/tests/test_sglang_frontend_decoding.py
DecodeWorkerHandler probes engine.async_generate for mm_hashes kwarg support at init, extracts MM hashes from per-request extra_args, and conditionally forwards them during aggregated generation. Test helper initializes the probe flag for frontend-decoding tests.
Preprocessor SGLang MM routing mode
lib/llm/src/preprocessor.rs
When DYN_MM_ROUTING_BACKEND is sglang, preprocessor replaces repeated image tokens with computed pad values that encode MM hash bits, and skips block-level MM object derivation since routing identity is carried in the pad value.
Test payload MM routing assertions
tests/utils/payloads.py, tests/utils/multimodal.py
CachedTokensChatPayload accepts require_lightseek_init and min_routing_total_blocks flags; test factory functions thread these parameters and TopologyConfig gains requested_sglang_kv_tokens for pytest mark injection.
Multimodal SGLang profile registry
tests/serve/multimodal_profiles/sglang.py
Registers three Qwen VLM models (Qwen3-VL-2B, Qwen2.5-VL-3B, Qwen2-VL-2B) with agg_router topology, per-model timeout/VRAM limits, KV token constraints, single-GPU packing, and cached-token payloads configured for lightseek initialization and routing overlap validation.
Test suite multimodal config wiring
tests/serve/test_sglang.py
Imports multimodal profiles and topology scripts; generates SGLangConfig instances from SGLANG_MULTIMODAL_PROFILES via make_multimodal_configs, merges into sglang_configs, and adds image_server fixture to test_sglang_deployment.
MM-aware aggregated router launcher
examples/backends/sglang/launch/agg_multimodal_router.sh
New bash script orchestrates multi-worker SGLang cluster: parses CLI args, launches workers with per-worker system ports and KV-events TCP endpoints, waits for /health readiness, starts frontend router with HTTP routing and KV mode, polls chat endpoint until ready, and blocks on service termination.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately captures the main feature: MM-aware KV routing support on SGLang via pad_value substitution.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description is comprehensive and well-structured, covering overview, changes, related issues, companion PRs, and test coverage with clear details.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
lib/llm/src/preprocessor.rs (1)

1020-1022: ⚡ Quick win

Move environment variable read to startup.

The DYN_MM_ROUTING_BACKEND environment variable is read and parsed on every call to gather_mm_exact_routing_info (i.e., once per multimodal request). For high-throughput workloads, this is wasteful. Consider reading the env var once in OpenAIPreprocessor::new_with_parts and storing the resolved mode (e.g., mm_routing_backend: MmRoutingBackend enum or bool flag) as a field on OpenAIPreprocessor. This would reduce per-request overhead to a single field read.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/llm/src/preprocessor.rs` around lines 1020 - 1022, The code currently
reads DYN_MM_ROUTING_BACKEND inside gather_mm_exact_routing_info on every
request (via sglang_pad_value_mode), causing unnecessary per-request env access;
move this logic into OpenAIPreprocessor::new_with_parts by resolving the env
once (e.g., parse into a MmRoutingBackend enum or a bool field named
mm_routing_backend or similar) and store it as a field on OpenAIPreprocessor;
update gather_mm_exact_routing_info to read that field instead of calling
std::env::var, and ensure constructors set the new field when instantiating
OpenAIPreprocessor.
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (1)

236-253: ⚡ Quick win

Consider validating list element types.

The method returns Optional[List[str]] but doesn't verify that mm_hashes list elements are actually strings. If the list contains non-string items (e.g., [1, 2, 3]), they will be forwarded to SGLang and may cause a runtime error downstream. Per Python guidelines, prefer failing fast.

🛡️ Proposed element-type guard
     mm_hashes = extra_args.get("mm_hashes")
     if not mm_hashes:
         return None
     if not isinstance(mm_hashes, list):
         return None
+    if not all(isinstance(h, str) for h in mm_hashes):
+        return None
     return mm_hashes
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/src/dynamo/sglang/request_handlers/llm/decode_handler.py` around
lines 236 - 253, The _extract_mm_hashes function currently returns a list
without checking element types; update _extract_mm_hashes to validate that
mm_hashes is a list of str by using an element-type guard (e.g.,
all(isinstance(h, str) for h in mm_hashes)) and if any element is not a string
fail fast by raising a TypeError (include a clear message referencing mm_hashes
and the offending values) so downstream SGLang callers never receive non-string
hashes.
examples/backends/sglang/launch/agg_multimodal_router.sh (1)

65-68: 💤 Low value

Consider using print_launch_banner for consistency.

The script prints a custom banner, but the bash launch guidelines recommend calling print_launch_banner (from launch_utils.sh) with script description, model, and port to ensure consistent, debuggable startup logs across all launch scripts.

📋 Example usage of print_launch_banner
+print_launch_banner \
+    "Lightseek MM Exact Routing (SGLang)" \
+    "${MODEL}" \
+    "${HTTP_PORT}"
+echo "NUM_WORKERS=${NUM_WORKERS}, BLOCK_SIZE=${BLOCK_SIZE}"
+echo "NAMESPACE=${NAMESPACE}"
-echo "=== Lightseek MM Exact Routing Launch (SGLang) ==="
-echo "MODEL=${MODEL}"
-echo "NUM_WORKERS=${NUM_WORKERS}, BLOCK_SIZE=${BLOCK_SIZE}"
-echo "HTTP_PORT=${HTTP_PORT}, NAMESPACE=${NAMESPACE}"

As per coding guidelines: "Call print_launch_banner with script description, model, and port to ensure consistent, debuggable startup logs."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/backends/sglang/launch/agg_multimodal_router.sh` around lines 65 -
68, Replace the custom echo banner with a call to print_launch_banner to follow
launch guidelines: call print_launch_banner with a concise script description
(e.g., "Lightseek MM Exact Routing (SGLang)"), pass MODEL and HTTP_PORT as the
required model and port arguments, and keep printing any extra runtime vars like
NUM_WORKERS and BLOCK_SIZE separately if needed; update the section containing
the current echo lines to invoke print_launch_banner (referencing the
print_launch_banner function) and remove the redundant echo banner to maintain
consistent startup logs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/utils/payloads.py`:
- Around line 390-395: The code builds a regex using min_routing_total_blocks
but only documents that it must be a power of 10; add an explicit validation
before constructing the regex (in the same block where min_routing_total_blocks
and log_patterns are used) to ensure min_routing_total_blocks is a power of 10
and > 0, and raise a clear ValueError/AssertionError if not; reference the
symbol min_routing_total_blocks and the list log_patterns so the check sits
immediately before the regex construction and prevents generating an
under-enforcing pattern for invalid inputs.

---

Nitpick comments:
In `@components/src/dynamo/sglang/request_handlers/llm/decode_handler.py`:
- Around line 236-253: The _extract_mm_hashes function currently returns a list
without checking element types; update _extract_mm_hashes to validate that
mm_hashes is a list of str by using an element-type guard (e.g.,
all(isinstance(h, str) for h in mm_hashes)) and if any element is not a string
fail fast by raising a TypeError (include a clear message referencing mm_hashes
and the offending values) so downstream SGLang callers never receive non-string
hashes.

In `@examples/backends/sglang/launch/agg_multimodal_router.sh`:
- Around line 65-68: Replace the custom echo banner with a call to
print_launch_banner to follow launch guidelines: call print_launch_banner with a
concise script description (e.g., "Lightseek MM Exact Routing (SGLang)"), pass
MODEL and HTTP_PORT as the required model and port arguments, and keep printing
any extra runtime vars like NUM_WORKERS and BLOCK_SIZE separately if needed;
update the section containing the current echo lines to invoke
print_launch_banner (referencing the print_launch_banner function) and remove
the redundant echo banner to maintain consistent startup logs.

In `@lib/llm/src/preprocessor.rs`:
- Around line 1020-1022: The code currently reads DYN_MM_ROUTING_BACKEND inside
gather_mm_exact_routing_info on every request (via sglang_pad_value_mode),
causing unnecessary per-request env access; move this logic into
OpenAIPreprocessor::new_with_parts by resolving the env once (e.g., parse into a
MmRoutingBackend enum or a bool field named mm_routing_backend or similar) and
store it as a field on OpenAIPreprocessor; update gather_mm_exact_routing_info
to read that field instead of calling std::env::var, and ensure constructors set
the new field when instantiating OpenAIPreprocessor.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 320446bd-dfb5-4e69-9660-a4b7cfc6fef2

📥 Commits

Reviewing files that changed from the base of the PR and between e77222c and b7c53ff.

📒 Files selected for processing (8)
  • components/src/dynamo/sglang/request_handlers/llm/decode_handler.py
  • components/src/dynamo/sglang/tests/test_sglang_frontend_decoding.py
  • examples/backends/sglang/launch/agg_multimodal_router.sh
  • lib/llm/src/preprocessor.rs
  • tests/serve/multimodal_profiles/sglang.py
  • tests/serve/test_sglang.py
  • tests/utils/multimodal.py
  • tests/utils/payloads.py

Comment thread tests/utils/payloads.py
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label May 15, 2026
Adds the SGLang row to the support matrix, a How-It-Works subsection
covering the pad_value substitution path, and a Launching block with
env vars + verification log signals. Cross-link from multimodal-sglang.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
krishung5 and others added 2 commits June 2, 2026 21:19
Collapse three call-site branches on MmRoutingProtocol::Sglang into
methods on the enum:

  - format_mm_hash_hex: 16-char (sglang) vs 64-char (vLLM)
  - image_fill_token:   pad_value(mm_hash) (sglang) vs find_token_id (vLLM)
  - emits_block_mm_infos: false (sglang) vs true (vLLM)

Wire format split stays load-bearing: parse_mm_hash_from_extra_key in
lib/kv-router/src/zmq_wire/extra_keys.rs uses 64-char length to filter
MM hashes from LoRA/cache_salt/prompt-embed extra_keys in vLLM
BlockStored events.

Verified:
  - cargo check + clippy clean (lightseek-mm feature)
  - mm_pad_value_matches_sglang_protocol unit test still pins formula
  - qwen3-vl-2b sglang MM-routing e2e PASSED with kv_hit_rate=0.944
    (>= 0.9 strong gate)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The vendored sglang patches dir (added in c733963) wasn't matched by
any glob in .github/filters.yaml, so the changed-files coverage gate
failed closed and cascaded into skipping all backend jobs on PR #9561.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Report pytest markers gate in pre-commit requires every test to
declare a Test Type and Hardware marker. Both tests in this file are
pure unit (import sglang module, assert on values; no GPU, no engine
startup), so unit + gpu_0 is the appropriate pair.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dynamo-runtime sequential job runs on an image without sglang
installed and filters by `not sglang`. Without the marker, the test
file matches the filter, gets collected, fails on `import sglang.srt`.
Adding pytest.mark.sglang gates it to the sglang container image where
the module is available.
Tighten per-method docs to 1-2 lines and add a TODO on the enum
pointing at the kv-router 64-char type-tag dependency that blocks
full wire-format unification.
… + comment

Strip user-visible "lightseek" mentions from the sglang
multimodal-router launch script (header, default NAMESPACE, banner
echos) and one stray comment in preprocessor.rs. The cargo feature
name `lightseek-mm` and the `lightseek_mm` module/type names stay
since they're real identifiers.
@datadog-official

This comment has been minimized.

Comment thread container/templates/sglang_runtime.Dockerfile Outdated
Patches are version-pinned under v${SGLANG_VER}/; the dir-existence guard
already handles upstream-absorbed bumps, so the --forward / rc-tolerance
layer was redundant. Switch to strict: any non-zero `patch` exit aborts
the build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@nv-anants nv-anants left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reviewed container/ changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

actions backend::sglang Relates to the sglang backend backend::vllm Relates to the vllm backend container documentation Improvements or additions to documentation feat frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` size/XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants