fix: MM-aware KV routing for Phi-3, Qwen2-VL, Qwen2.5-VL by krishung5 · Pull Request #9441 · ai-dynamo/dynamo

krishung5 · 2026-05-12T19:16:42Z

Overview

The mm_agg_router_* post-merge tests were passing for the wrong reasons. The base assertion cached_tokens >= 1 is satisfied any time vLLM's per-worker engine prefix cache fires — even when MM-aware KV routing has silently degraded to text-prefix routing and load-balancing just happens to put both repeats on the same worker. We could not distinguish "MM routing engaged" from "lucky round-robin landing." Real regressions would have slipped through.

This PR tightens those tests so they fail closed when MM-aware routing degrades, and addresses the structural mismatches that the tighter gate then exposed for Phi-3, Qwen2-VL, Qwen2.5-VL, and the fastokens tokenizer path. All four affected model profiles now route deterministically with non-zero [ROUTING] N/M blocks overlap on the repeat (load-balance luck no longer required).

How the test was hardened

CachedTokensChatPayload grows three opt-in kwargs (each off by default, no behavior change for callers that don't set them):

require_lightseek_init=True — asserts the MM-aware KV routing enabled (lightseek) init log fired (proves the routing path is wired up at deployment time).
min_avg_kv_hit_rate=0.9 — scrapes the router_kv_hit_rate Prometheus histogram from the frontend /metrics endpoint, snapshots it after R1, and asserts the R2+ delta mean is at or above the threshold. In MM-aware mode every routed model converges to (M-1)/M (≥0.944 observed across the four profiles); in the broken-hash regression class fixed in this PR the rate is ~0. Pure-text-fallback is caught separately by require_lightseek_init.
require_vllm_mm_processor_init=True — chat-processor path equivalent for [mm-routing] Transfer mode:.

make_image_payload_cached_tokens also bumps repeat_count from 2 to 3 so the metric mean covers both R2 and R3 — a single bad routing decision on either one drops the mean below 0.9 and fails the test.

The four routed MM profiles (mm_agg_router_qwen3-vl-2b, mm_agg_router_qwen2.5-vl-3b, mm_agg_router_qwen2-vl-2b, mm_agg_router_phi3-vision) opt into the strong gate.

What the strong gate exposed (and how each was fixed)

1. Phi-3-vision — chat template emits BPE-shatterable text + vLLM does a decode-roundtrip

Phi-3's chat template renders <|image_{n}|> as plain text. The Rust tokenizers crate BPE-shatters that into ~7 sub-tokens, and the leading < can merge with the preceding . (word.< → token id 19423). vLLM's _apply_token_matches therefore can't find the placeholder in the prompt tokens and falls back to _apply_text_matches, which decodes the prompt tokens back to text, substitutes, and re-encodes — and the decode inserts a space after each special token (e.g. <|user|>\n → <|user|> \n), bumping the SentencePiece prefix from ▁ (id 29871) to ▁▁ (id 259). The worker therefore hashes its KV blocks on a token sequence that differs from the router's routing_token_ids by a few ids — and every block hash from byte 0 onward diverges.

The fix mirrors vLLM's behavior exactly: the routing path does an encode → decode → re-encode roundtrip on the prompt, then splits at <|image_N|> and tokenizes each segment with the segment-BOS that Phi-3's HF processor produces. Locally captured all_token_ids from vLLM's HF processor and the router's routing_token_ids now match byte-for-byte.

Phi-3's tokenizer_config.json also declares add_bos_token: true, so the routing-side sequence prepends <s> (id 1) at the very front to match the backend's HF-processor output. No-op for add_bos_token: false models.

2. Qwen2-VL / Qwen2.5-VL — chat-template token id ≠ per-patch expansion token id

These families' config.json declares image_token_id: 151655 (<|image_pad|>) — what the chat template emits per image and what the backend's HF processor fills the expanded sequence with. Lightseek's placeholder_token_id() returns a different id (151654 = <|vision_pad|>); that id never appears in the worker's stored sequence, so using it as the routing-side fill token leaves every image-region block hashing differently from the backend.

Resolve chat_placeholder_token_id from config.json's image_token_id field and use it as both the find-target and the expansion fill token. Falls back to lightseek's value when config.json doesn't expose the field. Families where the two coincide (Qwen3-VL, LLaVA, Phi-3 after substitution) take the same code path as before — no behavior change.

3. Qwen2-VL-2B — Rust tokenizer loader misses tokenizer_config.json

Qwen2-VL-2B's release declares <|image_pad|> as a special token only in tokenizer_config.json's added_tokens_decoder field — not in tokenizer.json's added_tokens list. HF Python's AutoTokenizer.from_pretrained() reads both and merges. The Rust tokenizers crate reads only tokenizer.json and BPE-shatters <|image_pad|> into 6 sub-tokens. This gap predates MM routing; text-prefix routing didn't care because it didn't require token-identity alignment with vLLM's view.

HuggingFaceTokenizer::from_file (and a new from_tokenizer_with_model_dir variant called from model_card.rs) now reads the sibling tokenizer_config.json at load time and merges any added_tokens_decoder entries with special: true via the tokenizers crate's add_special_tokens() API. Aligns Rust with HF Python reference behavior. Idempotent for models that already have those tokens in tokenizer.json.

Cost: load-time only (~10-30 ms one-time, one extra file read + JSON parse). Encode hot path: zero overhead — tokenizers uses a trie for special-token recognition, lookup is O(input length) independent of trie size.

4. fastokens — no tokenizer_config.json merge available

The opt-in DYN_TOKENIZER=fastokens backend doesn't expose a special-token mutator on its Tokenizer type, so we can't replay the tokenizer_config.json merge there. Running MM-aware routing on a tokenizer that's BPE-shattering placeholders would silently emit wrong block hashes — far worse than falling back. The preprocessor now logs a single warning at init when DYN_TOKENIZER=fastokens is set and disables MM-aware routing (falls back to text-prefix routing). Tagged TODO(mm-routing) in the source; the guard goes away once fastokens upstream exposes a mutator.

Local validation

Each routed model run 3× (Phi-3 run 10× as the most-changed path); every R2 routing decision shows real [ROUTING] N/M blocks overlap matching, not load-balance luck.

Test profile	R2/R3 mean `router_kv_hit_rate`	Routing path
`mm_agg_router_qwen3-vl-2b` (pre_merge baseline)	0.944 (17/18)	fast path, unchanged
`mm_agg_router_qwen2.5-vl-3b`	0.957 (22/23)	`config.json` `image_token_id` as find + fill
`mm_agg_router_qwen2-vl-2b`	0.957 (22/23)	tokenizer-loader merge + `image_token_id`
`mm_agg_router_phi3-vision`	0.994 (158/159)	decode-roundtrip + per-segment BOS

The "1 missing block" in every overlap is structural: the router hashes N blocks (the trailing one is zero-padded to block boundary), but vLLM only publishes BlockStored for complete blocks of 16 tokens — so the partial-tail block sits just outside the worker's stored set. Every other block matches. The 0.9 gate threshold sits below the lowest observed floor (0.944) with margin, and well above the ~0.0 the broken-hash regression class produces.

A post-merge CI dispatch on this branch is running: https://github.com/ai-dynamo/dynamo/actions/runs/26542084108

Where reviewers should start

tests/utils/payloads.py + tests/utils/multimodal.py — the strong-gate kwargs (require_lightseek_init, min_routing_total_blocks) and the helper they wrap.
tests/serve/multimodal_profiles/vllm.py — the four model profiles opting into the gate.
lib/llm/src/preprocessor.rs — splice_numbered_placeholders_at_token_level (decode-roundtrip + per-segment BOS); chat_placeholder_token_id resolution; the DYN_TOKENIZER=fastokens guard.
lib/tokenizers/src/hf.rs — the tokenizer_config.json merge. Low-level loader used by all of dynamo; changes are additive/idempotent.
lib/llm/src/model_card.rs — the one-call-site update that routes the preprocessor tokenizer through the merging constructor.

A TODO(mm-routing) in read_image_token_id_from_config and the fastokens guard flag follow-ups (consolidate dual-token info into a single lightseek API; remove the fastokens guard once upstream exposes a mutator). Not blocking this PR.

Closes DIS-1977

Related issues

relates to: feat(mm-routing): lightseek-mm Rust-frontend MM-aware KV routing #9272 (the MM-aware KV routing PR that introduced the dependency exposing these gaps)
builds on: fix(self-host): harvest non-weight siblings into slug_dir (gh-8749) #9610 (MDC verify-and-cache slot fix for preprocessor_config.json, now merged to main)

🤖 Generated with Claude Code

coderabbitai · 2026-05-12T19:21:12Z

Walkthrough

This PR enhances HuggingFace tokenizer loading to merge special tokens from tokenizer_config.json, then leverages that capability in model card loading and multimodal preprocessing to support models where the chat-template image placeholder token differs from the expansion-time image token, with routing logic updated to substitute numbered placeholders and re-tokenize for validation.

Changes

Special Token Merging and Dual-Token Multimodal Routing

Layer / File(s)	Summary
Special Token Merging in Tokenizers `lib/tokenizers/src/hf.rs`	Adds `from_tokenizer_with_model_dir()` public constructor and private `merge_special_tokens_from_config()` helper to conditionally merge `added_tokens_decoder` entries from sibling `tokenizer_config.json` into the tokenizer's special token set; import wiring updated for `AddedToken`.
Model Card Tokenizer Loading `lib/llm/src/model_card.rs`	`ModelDeploymentCard::tokenizer()` now uses `from_tokenizer_with_model_dir()` when tokenizer path has a parent directory, enabling config-based special token merging; otherwise falls back to `from_tokenizer()`.
Multimodal Preprocessor Fields and Config Reading `lib/llm/src/preprocessor.rs`	Adds `read_image_token_id_from_config()` helper, extends `OpenAIPreprocessor` struct with `image_token_text` and `chat_placeholder_token_id` fields, and updates `new_with_parts()` to resolve the chat placeholder token from config.json with fallback to expansion-time token; includes `image_token_text` computation via tokenizer round-trip validation.
Multimodal Routing with Placeholder Substitution `lib/llm/src/preprocessor.rs`	`preprocess_request()` now passes `formatted_prompt` to `gather_mm_exact_routing_info()`, which accepts the formatted prompt and uses `find_token_id` (derived from `chat_placeholder_token_id` or `image_token_id`) to locate placeholders; replaces numbered-placeholder normalization with conditional text rewriting via new `substitute_numbered_placeholders()` helper, re-tokenizes for validation, and updates the expansion loop to replace `find_token_id` occurrences with per-image `image_token_id` blocks.
Qwen Multimodal Profile Tests `tests/serve/multimodal_profiles/vllm.py`	`agg_router` test cases for `Qwen/Qwen2.5-VL-3B-Instruct` and `Qwen/Qwen2-VL-2B-Instruct` updated to use `cached_tokens`-asserting image payloads with comments documenting dual-token routing expectations and cached-tokens failure conditions.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main change: fixing MM-aware KV routing for three specific VLM families (Phi-3, Qwen2-VL, Qwen2.5-VL).
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	PR description comprehensively covers all changes, includes detailed technical context, local validation results, and clear guidance on where to start review.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…uest clone Plumb formatted_prompt as Option<&str> through gather_tokens and gather_multi_modal_data instead of Option<String>. The owned String now lives once in preprocess_request and is cloned exactly at the site that actually requires ownership — when inserting into the multimodal extra_args JSON. Previously each text request paid two full prompt-string clones even when no multimodal routing work was done or lightseek-mm was compiled out. Addresses PR #9441 review comment on preprocessor.rs:578. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-fix Pulls latest main to re-run Post-Merge CI on the freshest base. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ssor_init Per #9441 review (rmccorm4): generalize the kwarg name away from the lightseek implementation detail. Log regex is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rmccorm4

🤖 This review comment is AI-generated.

Thermo-nuclear code-quality review — verdict: request changes (posting as comments).

The behavior fixes are correct and the empirical validation is strong. Two structural issues are presumptive blockers under a strict bar: a cohesive MM-routing subsystem grown inside the 3.6k-line preprocessor.rs when a canonical home (preprocessor/lightseek_mm.rs) already exists and already parses config.json; and a canonical-helper bypass in the test gate. Details inline. #3/#4 are strongly recommended cleanups to ride along.

Credit: hf.rs merge is clean/idempotent; the formatted_prompt: Option<&str> clone-removal is a real win; the mid-PR misdiagnosis was reverted rather than left as cruft; the fastokens guard is a sound fail-closed default.

Renames the two MM-routing log strings introduced in this PR to be generic (no "lightseek" mention), and updates the matching regex. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Moves read_image_token_id_from_config and read_bos_token_from_config out of preprocessor.rs into a new lightseek_mm::resolve_routing_tokens that returns a RoutingTokens bundle (image_token_id, chat_placeholder_token_id, bos_token_string). Reads config.json and tokenizer_config.json once each instead of duplicating I/O across modules. Addresses #9441 review: "fold both readers into lightseek_mm.rs so config.json is read once and the module returns a resolved bundle." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

OpenAIPreprocessor previously stored both image_token_id and chat_placeholder_token_id, with usage doing chat_placeholder_token_id.unwrap_or(image_token_id) after an early-return on image_token_id.is_none(). By construction of resolve_routing_tokens, chat_placeholder_token_id is Some whenever image_token_id is Some, so the .unwrap_or fallback was unreachable. Collapse to a single routing_image_token_id field (the value used to fill routing-side block hashes). Logs use the unified id; the spec-resolution step is already logged separately by lightseek_mm. Addresses #9441 review: "Collapse to a single routing_image_token_id resolved at init; the .unwrap_or branch is unreachable." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ce helper Rename splice_numbered_placeholders_at_token_level to make the Phi-3 specificity explicit, gate it on routing_prepend_bos.is_some() so a future model with {n}-style placeholders but a non-LlamaTokenizer processor falls through to text-prefix routing instead of silently producing wrong block hashes, and move the leading-BOS push from gather_mm_exact_routing_info into the helper. The caller now skips its own leading-BOS prepend when normalized_token_ids comes from the helper (splice_owns_bos), keeping the non-splice single-placeholder path unchanged. Addresses #9441 review: split-brain BOS handling + generic name hiding Phi-3-specific encode-decode roundtrip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…al constants Use prometheus_names.name_prefix.COMPONENT + prometheus_names.router.KV_HIT_RATE to build the scraped histogram name instead of the literal "dynamo_component_router_kv_hit_rate". Matches the pattern used at the 4 other call sites in payloads.py (lines 1104, 1214, 1281, 1371). A future metric rename now cascades through this strong gate instead of silently breaking it. Addresses #9441 review: canonical-helper bypass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ount) _scrape_router_histograms wrapped a single (sum, count) tuple in a Dict keyed by the hardcoded string "router_kv_hit_rate", which callers then unpacked with .get(). Collapse to a direct Optional[tuple]: None on scrape failure, (sum, count) on success. _metrics_baseline becomes Optional[tuple] too. Renamed to _scrape_router_kv_hit_rate to match the narrowed return type. Addresses #9441 review: thin abstraction with single hardcoded key. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@keivenchang

…rn on promotion Per @keivenchang's DM review: the existing tool-calling / reasoning parser tests inject already-decoded text and never run tokenizer.decode, so they cannot catch the class of bug where a previously-non-special marker gets accidentally promoted to special and silently disappears under skip_special_tokens=True. Add a single unit test in dynamo-tokenizers pinning both halves of the gate (one special:true -> promoted, one special:false -> skipped) end-to-end through the actual encode -> decode(skip_special=true) round trip — the layer the parser tests cannot reach. Also upgrade the promotion log from debug to warn, including the literal token strings, so any accidental promotion shows up loudly in worker logs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…Exception Per .ai/python-guidelines.md ("If you must catch broadly, use except Exception: and always re-raise after logging") and Devin's #9441 review: narrow the bare except in _scrape_router_kv_hit_rate to requests.RequestException. We expect transient endpoint flakes here (timeout / connection refused while the frontend is still binding /metrics) and the strong gate already has its own `is None` guard. Programming errors now propagate at test time instead of being swallowed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The test profile comment claimed vLLM expands `<|image_pad|>` to N `<|vision_pad|>` (151654), but the actual routing-side fill uses config.json's `image_token_id` (151655, `<|image_pad|>`) because that's what the worker stores in the expanded sequence. The stale claim predates the placeholder/expansion-token fix. Addresses #9441 Devin review: test comment contradicts the preprocessor.rs code comment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…L1 cache specials extract_hf_special_tokens was called on the raw HfTokenizer (only tokenizer.json's specials), then the merge applied later inside wrap_hf. The L1 prefix cache's boundary list therefore missed tokens declared only in tokenizer_config.json (Qwen2-VL's <|image_pad|>), even though the wrapped tokenizer treated them as atomic — chat prefixes straddling those boundaries lost cache hits with DYN_TOKENIZER_CACHE=1. Apply the merge eagerly on the raw HfTokenizer before extracting specials so both halves stay in sync. Expose merge_special_tokens_from_config as pub for the cross-crate call. Simplify wrap_hf to a plain from_tokenizer since the merge has already run. Addresses #9441 Devin review: L1 tokenizer cache specials inconsistency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-fix Pulls latest main to re-run Post-Merge CI on the freshest base after Ryan/Keiven/Devin review refactors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- cargo fmt: rejoin overflowed `if let` and `merge_special_tokens_from_config(...)` call sites. - clippy: move the dynamo-tokenizers `#[cfg(test)] mod tests` to the end of hf.rs to satisfy `clippy::items_after_test_module`. Verified locally: cargo fmt --check, cargo clippy --no-deps --all-targets (-D warnings) on dynamo-tokenizers + dynamo-llm, cargo test on the new hf::tests::merge_gate_round_trips_through_decode unit test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…llow-up Tracks the cohesion follow-up from #9441 review: the router-metric- specific fields and assertions on CachedTokensChatPayload (min_avg_kv_hit_rate, _metrics_baseline, _scrape_router_kv_hit_rate, the kv_hit_rate delta assertion) should move into a RouterMetricsAssertion mixin once a second router-specific strong gate gets added. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Commit 211d975 dropped the "(lightseek)" suffix from the MM-routing init INFO log; test_router_rust_mm_logs_lightseek_initialization still pinned the old string and failed in post-merge CI (run 26788111103). Match the current emitted text instead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

krishung5 requested review from a team as code owners May 12, 2026 19:16

pull-request-size Bot added the size/L label May 12, 2026

github-actions Bot added frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` fix labels May 12, 2026

pull-request-size Bot added size/XL and removed size/L labels May 12, 2026

copy-pr-bot Bot temporarily deployed to GITLAB May 12, 2026 19:20 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB May 12, 2026 21:08 Inactive

krishung5 requested review from furionw and indrajit96 May 12, 2026 22:42

copy-pr-bot Bot temporarily deployed to GITLAB May 13, 2026 00:28 Inactive

dynamo-ops reviewed May 13, 2026

View reviewed changes

Comment thread lib/llm/src/model_card.rs Outdated

Comment thread lib/llm/src/preprocessor.rs Outdated

krishung5 force-pushed the krish/phi3-mm-routing-fix branch from 81761ab to aeb7fb5 Compare May 14, 2026 00:08

copy-pr-bot Bot temporarily deployed to GITLAB May 14, 2026 00:08 Inactive

krishung5 force-pushed the krish/phi3-mm-routing-fix branch from aeb7fb5 to 4bf5736 Compare May 14, 2026 00:21

copy-pr-bot Bot temporarily deployed to GITLAB May 14, 2026 00:21 Inactive

krishung5 force-pushed the krish/phi3-mm-routing-fix branch from 4bf5736 to b49d01b Compare May 14, 2026 00:28

copy-pr-bot Bot temporarily deployed to GITLAB May 14, 2026 00:28 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB May 14, 2026 01:15 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB May 14, 2026 04:58 Inactive

krishung5 mentioned this pull request May 14, 2026

ci(mm-router): parallel worker boot + pre_merge gater for MM routing #9542

Merged

3 tasks

copy-pr-bot Bot temporarily deployed to GITLAB May 14, 2026 21:17 Inactive

krishung5 force-pushed the krish/phi3-mm-routing-fix branch from 307f576 to b30766d Compare May 14, 2026 22:32

copy-pr-bot Bot temporarily deployed to GITLAB May 29, 2026 17:35 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB May 29, 2026 17:36 Inactive

Merge remote-tracking branch 'origin/main' into krish/phi3-mm-routing…

646fe91

…-fix Pulls latest main to re-run Post-Merge CI on the freshest base. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

copy-pr-bot Bot temporarily deployed to GITLAB June 1, 2026 19:50 Inactive

rmccorm4 reviewed Jun 1, 2026

View reviewed changes

Comment thread tests/utils/payloads.py Outdated

test(mm-routing): rename require_lightseek_init -> require_rust_proce…

19d263d

…ssor_init Per #9441 review (rmccorm4): generalize the kwarg name away from the lightseek implementation detail. Log regex is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rmccorm4 reviewed Jun 1, 2026

View reviewed changes

Comment thread lib/tokenizers/src/hf.rs

copy-pr-bot Bot temporarily deployed to GITLAB June 1, 2026 19:54 Inactive

rmccorm4 reviewed Jun 1, 2026

View reviewed changes

Comment thread lib/llm/src/preprocessor.rs Outdated

Comment thread lib/llm/src/preprocessor.rs Outdated

Comment thread lib/llm/src/preprocessor.rs Outdated

Comment thread tests/utils/payloads.py Outdated

Comment thread tests/utils/payloads.py Outdated

test(mm-routing): drop lightseek wording from PR-added log strings

211d975

Renames the two MM-routing log strings introduced in this PR to be generic (no "lightseek" mention), and updates the matching regex. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

This comment has been minimized.

Sign in to view

devin-ai-integration Bot reviewed Jun 1, 2026

View reviewed changes

Comment thread tests/utils/payloads.py Outdated

devin-ai-integration Bot reviewed Jun 1, 2026

View reviewed changes

Comment thread tests/utils/payloads.py

devin-ai-integration Bot reviewed Jun 1, 2026

View reviewed changes

Comment thread lib/llm/src/model_card.rs Outdated

devin-ai-integration Bot reviewed Jun 1, 2026

View reviewed changes

Comment thread tests/serve/multimodal_profiles/vllm.py

krishung5 and others added 13 commits June 1, 2026 15:16

Merge remote-tracking branch 'origin/main' into krish/phi3-mm-routing…

8f6d7a6

…-fix Pulls latest main to re-run Post-Merge CI on the freshest base after Ryan/Keiven/Devin review refactors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rmccorm4 approved these changes Jun 2, 2026

View reviewed changes

krishung5 mentioned this pull request Jun 2, 2026

feat(sglang): MM-aware KV routing via pad_value substitution #9561

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: MM-aware KV routing for Phi-3, Qwen2-VL, Qwen2.5-VL#9441

fix: MM-aware KV routing for Phi-3, Qwen2-VL, Qwen2.5-VL#9441
rmccorm4 merged 27 commits into
mainfrom
krish/phi3-mm-routing-fix

krishung5 commented May 12, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 12, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rmccorm4 left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment has been minimized.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

krishung5 commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

How the test was hardened

What the strong gate exposed (and how each was fixed)

Local validation

Where reviewers should start

Related issues

Uh oh!

coderabbitai Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rmccorm4 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment has been minimized.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

krishung5 commented May 12, 2026 •

edited

Loading

coderabbitai Bot commented May 12, 2026 •

edited

Loading