Skip to content

fix: MM-aware KV routing for Phi-3, Qwen2-VL, Qwen2.5-VL#9441

Merged
rmccorm4 merged 27 commits into
mainfrom
krish/phi3-mm-routing-fix
Jun 2, 2026
Merged

fix: MM-aware KV routing for Phi-3, Qwen2-VL, Qwen2.5-VL#9441
rmccorm4 merged 27 commits into
mainfrom
krish/phi3-mm-routing-fix

Conversation

@krishung5

@krishung5 krishung5 commented May 12, 2026

Copy link
Copy Markdown
Contributor

Overview

The mm_agg_router_* post-merge tests were passing for the wrong reasons. The base assertion cached_tokens >= 1 is satisfied any time vLLM's per-worker engine prefix cache fires — even when MM-aware KV routing has silently degraded to text-prefix routing and load-balancing just happens to put both repeats on the same worker. We could not distinguish "MM routing engaged" from "lucky round-robin landing." Real regressions would have slipped through.

This PR tightens those tests so they fail closed when MM-aware routing degrades, and addresses the structural mismatches that the tighter gate then exposed for Phi-3, Qwen2-VL, Qwen2.5-VL, and the fastokens tokenizer path. All four affected model profiles now route deterministically with non-zero [ROUTING] N/M blocks overlap on the repeat (load-balance luck no longer required).

How the test was hardened

CachedTokensChatPayload grows three opt-in kwargs (each off by default, no behavior change for callers that don't set them):

  • require_lightseek_init=True — asserts the MM-aware KV routing enabled (lightseek) init log fired (proves the routing path is wired up at deployment time).
  • min_avg_kv_hit_rate=0.9 — scrapes the router_kv_hit_rate Prometheus histogram from the frontend /metrics endpoint, snapshots it after R1, and asserts the R2+ delta mean is at or above the threshold. In MM-aware mode every routed model converges to (M-1)/M (≥0.944 observed across the four profiles); in the broken-hash regression class fixed in this PR the rate is ~0. Pure-text-fallback is caught separately by require_lightseek_init.
  • require_vllm_mm_processor_init=True — chat-processor path equivalent for [mm-routing] Transfer mode:.

make_image_payload_cached_tokens also bumps repeat_count from 2 to 3 so the metric mean covers both R2 and R3 — a single bad routing decision on either one drops the mean below 0.9 and fails the test.

The four routed MM profiles (mm_agg_router_qwen3-vl-2b, mm_agg_router_qwen2.5-vl-3b, mm_agg_router_qwen2-vl-2b, mm_agg_router_phi3-vision) opt into the strong gate.

What the strong gate exposed (and how each was fixed)

1. Phi-3-vision — chat template emits BPE-shatterable text + vLLM does a decode-roundtrip

Phi-3's chat template renders <|image_{n}|> as plain text. The Rust tokenizers crate BPE-shatters that into ~7 sub-tokens, and the leading < can merge with the preceding . (word.< → token id 19423). vLLM's _apply_token_matches therefore can't find the placeholder in the prompt tokens and falls back to _apply_text_matches, which decodes the prompt tokens back to text, substitutes, and re-encodes — and the decode inserts a space after each special token (e.g. <|user|>\n<|user|> \n), bumping the SentencePiece prefix from (id 29871) to ▁▁ (id 259). The worker therefore hashes its KV blocks on a token sequence that differs from the router's routing_token_ids by a few ids — and every block hash from byte 0 onward diverges.

The fix mirrors vLLM's behavior exactly: the routing path does an encode → decode → re-encode roundtrip on the prompt, then splits at <|image_N|> and tokenizes each segment with the segment-BOS that Phi-3's HF processor produces. Locally captured all_token_ids from vLLM's HF processor and the router's routing_token_ids now match byte-for-byte.

Phi-3's tokenizer_config.json also declares add_bos_token: true, so the routing-side sequence prepends <s> (id 1) at the very front to match the backend's HF-processor output. No-op for add_bos_token: false models.

2. Qwen2-VL / Qwen2.5-VL — chat-template token id ≠ per-patch expansion token id

These families' config.json declares image_token_id: 151655 (<|image_pad|>) — what the chat template emits per image and what the backend's HF processor fills the expanded sequence with. Lightseek's placeholder_token_id() returns a different id (151654 = <|vision_pad|>); that id never appears in the worker's stored sequence, so using it as the routing-side fill token leaves every image-region block hashing differently from the backend.

Resolve chat_placeholder_token_id from config.json's image_token_id field and use it as both the find-target and the expansion fill token. Falls back to lightseek's value when config.json doesn't expose the field. Families where the two coincide (Qwen3-VL, LLaVA, Phi-3 after substitution) take the same code path as before — no behavior change.

3. Qwen2-VL-2B — Rust tokenizer loader misses tokenizer_config.json

Qwen2-VL-2B's release declares <|image_pad|> as a special token only in tokenizer_config.json's added_tokens_decoder field — not in tokenizer.json's added_tokens list. HF Python's AutoTokenizer.from_pretrained() reads both and merges. The Rust tokenizers crate reads only tokenizer.json and BPE-shatters <|image_pad|> into 6 sub-tokens. This gap predates MM routing; text-prefix routing didn't care because it didn't require token-identity alignment with vLLM's view.

HuggingFaceTokenizer::from_file (and a new from_tokenizer_with_model_dir variant called from model_card.rs) now reads the sibling tokenizer_config.json at load time and merges any added_tokens_decoder entries with special: true via the tokenizers crate's add_special_tokens() API. Aligns Rust with HF Python reference behavior. Idempotent for models that already have those tokens in tokenizer.json.

Cost: load-time only (~10-30 ms one-time, one extra file read + JSON parse). Encode hot path: zero overhead — tokenizers uses a trie for special-token recognition, lookup is O(input length) independent of trie size.

4. fastokens — no tokenizer_config.json merge available

The opt-in DYN_TOKENIZER=fastokens backend doesn't expose a special-token mutator on its Tokenizer type, so we can't replay the tokenizer_config.json merge there. Running MM-aware routing on a tokenizer that's BPE-shattering placeholders would silently emit wrong block hashes — far worse than falling back. The preprocessor now logs a single warning at init when DYN_TOKENIZER=fastokens is set and disables MM-aware routing (falls back to text-prefix routing). Tagged TODO(mm-routing) in the source; the guard goes away once fastokens upstream exposes a mutator.

Local validation

Each routed model run 3× (Phi-3 run 10× as the most-changed path); every R2 routing decision shows real [ROUTING] N/M blocks overlap matching, not load-balance luck.

Test profile R2/R3 mean router_kv_hit_rate Routing path
mm_agg_router_qwen3-vl-2b (pre_merge baseline) 0.944 (17/18) fast path, unchanged
mm_agg_router_qwen2.5-vl-3b 0.957 (22/23) config.json image_token_id as find + fill
mm_agg_router_qwen2-vl-2b 0.957 (22/23) tokenizer-loader merge + image_token_id
mm_agg_router_phi3-vision 0.994 (158/159) decode-roundtrip + per-segment BOS

The "1 missing block" in every overlap is structural: the router hashes N blocks (the trailing one is zero-padded to block boundary), but vLLM only publishes BlockStored for complete blocks of 16 tokens — so the partial-tail block sits just outside the worker's stored set. Every other block matches. The 0.9 gate threshold sits below the lowest observed floor (0.944) with margin, and well above the ~0.0 the broken-hash regression class produces.

A post-merge CI dispatch on this branch is running: https://github.com/ai-dynamo/dynamo/actions/runs/26542084108

Where reviewers should start

  1. tests/utils/payloads.py + tests/utils/multimodal.py — the strong-gate kwargs (require_lightseek_init, min_routing_total_blocks) and the helper they wrap.
  2. tests/serve/multimodal_profiles/vllm.py — the four model profiles opting into the gate.
  3. lib/llm/src/preprocessor.rssplice_numbered_placeholders_at_token_level (decode-roundtrip + per-segment BOS); chat_placeholder_token_id resolution; the DYN_TOKENIZER=fastokens guard.
  4. lib/tokenizers/src/hf.rs — the tokenizer_config.json merge. Low-level loader used by all of dynamo; changes are additive/idempotent.
  5. lib/llm/src/model_card.rs — the one-call-site update that routes the preprocessor tokenizer through the merging constructor.

A TODO(mm-routing) in read_image_token_id_from_config and the fastokens guard flag follow-ups (consolidate dual-token info into a single lightseek API; remove the fastokens guard once upstream exposes a mutator). Not blocking this PR.

Closes DIS-1977

Related issues

🤖 Generated with Claude Code

@krishung5 krishung5 requested review from a team as code owners May 12, 2026 19:16
@github-actions github-actions Bot added frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` fix labels May 12, 2026
@coderabbitai

coderabbitai Bot commented May 12, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

This PR enhances HuggingFace tokenizer loading to merge special tokens from tokenizer_config.json, then leverages that capability in model card loading and multimodal preprocessing to support models where the chat-template image placeholder token differs from the expansion-time image token, with routing logic updated to substitute numbered placeholders and re-tokenize for validation.

Changes

Special Token Merging and Dual-Token Multimodal Routing

Layer / File(s) Summary
Special Token Merging in Tokenizers
lib/tokenizers/src/hf.rs
Adds from_tokenizer_with_model_dir() public constructor and private merge_special_tokens_from_config() helper to conditionally merge added_tokens_decoder entries from sibling tokenizer_config.json into the tokenizer's special token set; import wiring updated for AddedToken.
Model Card Tokenizer Loading
lib/llm/src/model_card.rs
ModelDeploymentCard::tokenizer() now uses from_tokenizer_with_model_dir() when tokenizer path has a parent directory, enabling config-based special token merging; otherwise falls back to from_tokenizer().
Multimodal Preprocessor Fields and Config Reading
lib/llm/src/preprocessor.rs
Adds read_image_token_id_from_config() helper, extends OpenAIPreprocessor struct with image_token_text and chat_placeholder_token_id fields, and updates new_with_parts() to resolve the chat placeholder token from config.json with fallback to expansion-time token; includes image_token_text computation via tokenizer round-trip validation.
Multimodal Routing with Placeholder Substitution
lib/llm/src/preprocessor.rs
preprocess_request() now passes formatted_prompt to gather_mm_exact_routing_info(), which accepts the formatted prompt and uses find_token_id (derived from chat_placeholder_token_id or image_token_id) to locate placeholders; replaces numbered-placeholder normalization with conditional text rewriting via new substitute_numbered_placeholders() helper, re-tokenizes for validation, and updates the expansion loop to replace find_token_id occurrences with per-image image_token_id blocks.
Qwen Multimodal Profile Tests
tests/serve/multimodal_profiles/vllm.py
agg_router test cases for Qwen/Qwen2.5-VL-3B-Instruct and Qwen/Qwen2-VL-2B-Instruct updated to use cached_tokens-asserting image payloads with comments documenting dual-token routing expectations and cached-tokens failure conditions.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: fixing MM-aware KV routing for three specific VLM families (Phi-3, Qwen2-VL, Qwen2.5-VL).
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed PR description comprehensively covers all changes, includes detailed technical context, local validation results, and clear guidance on where to start review.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread lib/llm/src/model_card.rs Outdated
Comment thread lib/llm/src/preprocessor.rs Outdated
krishung5 added a commit that referenced this pull request May 14, 2026
…uest clone

Plumb formatted_prompt as Option<&str> through gather_tokens and
gather_multi_modal_data instead of Option<String>. The owned String now
lives once in preprocess_request and is cloned exactly at the site that
actually requires ownership — when inserting into the multimodal extra_args
JSON. Previously each text request paid two full prompt-string clones even
when no multimodal routing work was done or lightseek-mm was compiled out.

Addresses PR #9441 review comment on preprocessor.rs:578.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@krishung5 krishung5 force-pushed the krish/phi3-mm-routing-fix branch from 81761ab to aeb7fb5 Compare May 14, 2026 00:08
krishung5 added a commit that referenced this pull request May 14, 2026
…uest clone

Plumb formatted_prompt as Option<&str> through gather_tokens and
gather_multi_modal_data instead of Option<String>. The owned String now
lives once in preprocess_request and is cloned exactly at the site that
actually requires ownership — when inserting into the multimodal extra_args
JSON. Previously each text request paid two full prompt-string clones even
when no multimodal routing work was done or lightseek-mm was compiled out.

Addresses PR #9441 review comment on preprocessor.rs:578.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@krishung5 krishung5 force-pushed the krish/phi3-mm-routing-fix branch from aeb7fb5 to 4bf5736 Compare May 14, 2026 00:21
krishung5 added a commit that referenced this pull request May 14, 2026
…uest clone

Plumb formatted_prompt as Option<&str> through gather_tokens and
gather_multi_modal_data instead of Option<String>. The owned String now
lives once in preprocess_request and is cloned exactly at the site that
actually requires ownership — when inserting into the multimodal extra_args
JSON. Previously each text request paid two full prompt-string clones even
when no multimodal routing work was done or lightseek-mm was compiled out.

Addresses PR #9441 review comment on preprocessor.rs:578.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@krishung5 krishung5 force-pushed the krish/phi3-mm-routing-fix branch from 4bf5736 to b49d01b Compare May 14, 2026 00:28
@krishung5 krishung5 force-pushed the krish/phi3-mm-routing-fix branch from 307f576 to b30766d Compare May 14, 2026 22:32
…-fix

Pulls latest main to re-run Post-Merge CI on the freshest base.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread tests/utils/payloads.py Outdated
…ssor_init

Per #9441 review (rmccorm4): generalize the kwarg name away from the
lightseek implementation detail. Log regex is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread lib/tokenizers/src/hf.rs

@rmccorm4 rmccorm4 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review comment is AI-generated.

Thermo-nuclear code-quality review — verdict: request changes (posting as comments).

The behavior fixes are correct and the empirical validation is strong. Two structural issues are presumptive blockers under a strict bar: a cohesive MM-routing subsystem grown inside the 3.6k-line preprocessor.rs when a canonical home (preprocessor/lightseek_mm.rs) already exists and already parses config.json; and a canonical-helper bypass in the test gate. Details inline. #3/#4 are strongly recommended cleanups to ride along.

Credit: hf.rs merge is clean/idempotent; the formatted_prompt: Option<&str> clone-removal is a real win; the mid-PR misdiagnosis was reverted rather than left as cruft; the fastokens guard is a sound fail-closed default.

Comment thread lib/llm/src/preprocessor.rs Outdated
Comment thread lib/llm/src/preprocessor.rs Outdated
Comment thread lib/llm/src/preprocessor.rs Outdated
Comment thread tests/utils/payloads.py Outdated
Comment thread tests/utils/payloads.py Outdated
Renames the two MM-routing log strings introduced in this PR to be
generic (no "lightseek" mention), and updates the matching regex.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@datadog-official

This comment has been minimized.

Comment thread tests/utils/payloads.py Outdated
Comment thread tests/utils/payloads.py
Comment thread lib/llm/src/model_card.rs Outdated
Comment thread tests/serve/multimodal_profiles/vllm.py
krishung5 and others added 13 commits June 1, 2026 15:16
Moves read_image_token_id_from_config and read_bos_token_from_config out
of preprocessor.rs into a new lightseek_mm::resolve_routing_tokens that
returns a RoutingTokens bundle (image_token_id, chat_placeholder_token_id,
bos_token_string). Reads config.json and tokenizer_config.json once each
instead of duplicating I/O across modules.

Addresses #9441 review: "fold both readers into lightseek_mm.rs so
config.json is read once and the module returns a resolved bundle."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OpenAIPreprocessor previously stored both image_token_id and
chat_placeholder_token_id, with usage doing
    chat_placeholder_token_id.unwrap_or(image_token_id)
after an early-return on image_token_id.is_none(). By construction of
resolve_routing_tokens, chat_placeholder_token_id is Some whenever
image_token_id is Some, so the .unwrap_or fallback was unreachable.

Collapse to a single routing_image_token_id field (the value used to fill
routing-side block hashes). Logs use the unified id; the spec-resolution
step is already logged separately by lightseek_mm.

Addresses #9441 review: "Collapse to a single routing_image_token_id
resolved at init; the .unwrap_or branch is unreachable."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ce helper

Rename splice_numbered_placeholders_at_token_level to make the Phi-3
specificity explicit, gate it on routing_prepend_bos.is_some() so a
future model with {n}-style placeholders but a non-LlamaTokenizer
processor falls through to text-prefix routing instead of silently
producing wrong block hashes, and move the leading-BOS push from
gather_mm_exact_routing_info into the helper. The caller now skips its
own leading-BOS prepend when normalized_token_ids comes from the helper
(splice_owns_bos), keeping the non-splice single-placeholder path
unchanged.

Addresses #9441 review: split-brain BOS handling + generic name hiding
Phi-3-specific encode-decode roundtrip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…al constants

Use prometheus_names.name_prefix.COMPONENT + prometheus_names.router.KV_HIT_RATE
to build the scraped histogram name instead of the literal
"dynamo_component_router_kv_hit_rate". Matches the pattern used at the
4 other call sites in payloads.py (lines 1104, 1214, 1281, 1371). A
future metric rename now cascades through this strong gate instead of
silently breaking it.

Addresses #9441 review: canonical-helper bypass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ount)

_scrape_router_histograms wrapped a single (sum, count) tuple in a
Dict keyed by the hardcoded string "router_kv_hit_rate", which callers
then unpacked with .get(). Collapse to a direct Optional[tuple]: None
on scrape failure, (sum, count) on success. _metrics_baseline becomes
Optional[tuple] too. Renamed to _scrape_router_kv_hit_rate to match
the narrowed return type.

Addresses #9441 review: thin abstraction with single hardcoded key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rn on promotion

Per @keivenchang's DM review: the existing tool-calling / reasoning
parser tests inject already-decoded text and never run tokenizer.decode,
so they cannot catch the class of bug where a previously-non-special
marker gets accidentally promoted to special and silently disappears
under skip_special_tokens=True.

Add a single unit test in dynamo-tokenizers pinning both halves of the
gate (one special:true -> promoted, one special:false -> skipped)
end-to-end through the actual encode -> decode(skip_special=true) round
trip — the layer the parser tests cannot reach.

Also upgrade the promotion log from debug to warn, including the
literal token strings, so any accidental promotion shows up loudly in
worker logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Exception

Per .ai/python-guidelines.md ("If you must catch broadly, use except
Exception: and always re-raise after logging") and Devin's #9441 review:
narrow the bare except in _scrape_router_kv_hit_rate to
requests.RequestException. We expect transient endpoint flakes here
(timeout / connection refused while the frontend is still binding
/metrics) and the strong gate already has its own `is None` guard.
Programming errors now propagate at test time instead of being swallowed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The test profile comment claimed vLLM expands `<|image_pad|>` to N
`<|vision_pad|>` (151654), but the actual routing-side fill uses
config.json's `image_token_id` (151655, `<|image_pad|>`) because that's
what the worker stores in the expanded sequence. The stale claim
predates the placeholder/expansion-token fix.

Addresses #9441 Devin review: test comment contradicts the
preprocessor.rs code comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…L1 cache specials

extract_hf_special_tokens was called on the raw HfTokenizer (only
tokenizer.json's specials), then the merge applied later inside
wrap_hf. The L1 prefix cache's boundary list therefore missed tokens
declared only in tokenizer_config.json (Qwen2-VL's <|image_pad|>),
even though the wrapped tokenizer treated them as atomic — chat
prefixes straddling those boundaries lost cache hits with
DYN_TOKENIZER_CACHE=1.

Apply the merge eagerly on the raw HfTokenizer before extracting
specials so both halves stay in sync. Expose
merge_special_tokens_from_config as pub for the cross-crate call.
Simplify wrap_hf to a plain from_tokenizer since the merge has
already run.

Addresses #9441 Devin review: L1 tokenizer cache specials inconsistency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-fix

Pulls latest main to re-run Post-Merge CI on the freshest base
after Ryan/Keiven/Devin review refactors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- cargo fmt: rejoin overflowed `if let` and `merge_special_tokens_from_config(...)` call sites.
- clippy: move the dynamo-tokenizers `#[cfg(test)] mod tests` to the end of hf.rs to satisfy `clippy::items_after_test_module`.

Verified locally: cargo fmt --check, cargo clippy --no-deps --all-targets
(-D warnings) on dynamo-tokenizers + dynamo-llm, cargo test on the new
hf::tests::merge_gate_round_trips_through_decode unit test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…llow-up

Tracks the cohesion follow-up from #9441 review: the router-metric-
specific fields and assertions on CachedTokensChatPayload
(min_avg_kv_hit_rate, _metrics_baseline, _scrape_router_kv_hit_rate,
the kv_hit_rate delta assertion) should move into a
RouterMetricsAssertion mixin once a second router-specific strong
gate gets added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Commit 211d975 dropped the "(lightseek)" suffix from the MM-routing init
INFO log; test_router_rust_mm_logs_lightseek_initialization still pinned
the old string and failed in post-merge CI (run 26788111103). Match the
current emitted text instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend::vllm Relates to the vllm backend fix frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants