fix: MM-aware KV routing for Phi-3, Qwen2-VL, Qwen2.5-VL#9441
Conversation
WalkthroughThis PR enhances HuggingFace tokenizer loading to merge special tokens from ChangesSpecial Token Merging and Dual-Token Multimodal Routing
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…uest clone Plumb formatted_prompt as Option<&str> through gather_tokens and gather_multi_modal_data instead of Option<String>. The owned String now lives once in preprocess_request and is cloned exactly at the site that actually requires ownership — when inserting into the multimodal extra_args JSON. Previously each text request paid two full prompt-string clones even when no multimodal routing work was done or lightseek-mm was compiled out. Addresses PR #9441 review comment on preprocessor.rs:578. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
81761ab to
aeb7fb5
Compare
…uest clone Plumb formatted_prompt as Option<&str> through gather_tokens and gather_multi_modal_data instead of Option<String>. The owned String now lives once in preprocess_request and is cloned exactly at the site that actually requires ownership — when inserting into the multimodal extra_args JSON. Previously each text request paid two full prompt-string clones even when no multimodal routing work was done or lightseek-mm was compiled out. Addresses PR #9441 review comment on preprocessor.rs:578. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aeb7fb5 to
4bf5736
Compare
…uest clone Plumb formatted_prompt as Option<&str> through gather_tokens and gather_multi_modal_data instead of Option<String>. The owned String now lives once in preprocess_request and is cloned exactly at the site that actually requires ownership — when inserting into the multimodal extra_args JSON. Previously each text request paid two full prompt-string clones even when no multimodal routing work was done or lightseek-mm was compiled out. Addresses PR #9441 review comment on preprocessor.rs:578. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4bf5736 to
b49d01b
Compare
307f576 to
b30766d
Compare
…-fix Pulls latest main to re-run Post-Merge CI on the freshest base. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ssor_init Per #9441 review (rmccorm4): generalize the kwarg name away from the lightseek implementation detail. Log regex is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rmccorm4
left a comment
There was a problem hiding this comment.
🤖 This review comment is AI-generated.
Thermo-nuclear code-quality review — verdict: request changes (posting as comments).
The behavior fixes are correct and the empirical validation is strong. Two structural issues are presumptive blockers under a strict bar: a cohesive MM-routing subsystem grown inside the 3.6k-line preprocessor.rs when a canonical home (preprocessor/lightseek_mm.rs) already exists and already parses config.json; and a canonical-helper bypass in the test gate. Details inline. #3/#4 are strongly recommended cleanups to ride along.
Credit: hf.rs merge is clean/idempotent; the formatted_prompt: Option<&str> clone-removal is a real win; the mid-PR misdiagnosis was reverted rather than left as cruft; the fastokens guard is a sound fail-closed default.
Renames the two MM-routing log strings introduced in this PR to be generic (no "lightseek" mention), and updates the matching regex. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
Moves read_image_token_id_from_config and read_bos_token_from_config out of preprocessor.rs into a new lightseek_mm::resolve_routing_tokens that returns a RoutingTokens bundle (image_token_id, chat_placeholder_token_id, bos_token_string). Reads config.json and tokenizer_config.json once each instead of duplicating I/O across modules. Addresses #9441 review: "fold both readers into lightseek_mm.rs so config.json is read once and the module returns a resolved bundle." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OpenAIPreprocessor previously stored both image_token_id and
chat_placeholder_token_id, with usage doing
chat_placeholder_token_id.unwrap_or(image_token_id)
after an early-return on image_token_id.is_none(). By construction of
resolve_routing_tokens, chat_placeholder_token_id is Some whenever
image_token_id is Some, so the .unwrap_or fallback was unreachable.
Collapse to a single routing_image_token_id field (the value used to fill
routing-side block hashes). Logs use the unified id; the spec-resolution
step is already logged separately by lightseek_mm.
Addresses #9441 review: "Collapse to a single routing_image_token_id
resolved at init; the .unwrap_or branch is unreachable."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ce helper
Rename splice_numbered_placeholders_at_token_level to make the Phi-3
specificity explicit, gate it on routing_prepend_bos.is_some() so a
future model with {n}-style placeholders but a non-LlamaTokenizer
processor falls through to text-prefix routing instead of silently
producing wrong block hashes, and move the leading-BOS push from
gather_mm_exact_routing_info into the helper. The caller now skips its
own leading-BOS prepend when normalized_token_ids comes from the helper
(splice_owns_bos), keeping the non-splice single-placeholder path
unchanged.
Addresses #9441 review: split-brain BOS handling + generic name hiding
Phi-3-specific encode-decode roundtrip.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…al constants Use prometheus_names.name_prefix.COMPONENT + prometheus_names.router.KV_HIT_RATE to build the scraped histogram name instead of the literal "dynamo_component_router_kv_hit_rate". Matches the pattern used at the 4 other call sites in payloads.py (lines 1104, 1214, 1281, 1371). A future metric rename now cascades through this strong gate instead of silently breaking it. Addresses #9441 review: canonical-helper bypass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ount) _scrape_router_histograms wrapped a single (sum, count) tuple in a Dict keyed by the hardcoded string "router_kv_hit_rate", which callers then unpacked with .get(). Collapse to a direct Optional[tuple]: None on scrape failure, (sum, count) on success. _metrics_baseline becomes Optional[tuple] too. Renamed to _scrape_router_kv_hit_rate to match the narrowed return type. Addresses #9441 review: thin abstraction with single hardcoded key. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rn on promotion Per @keivenchang's DM review: the existing tool-calling / reasoning parser tests inject already-decoded text and never run tokenizer.decode, so they cannot catch the class of bug where a previously-non-special marker gets accidentally promoted to special and silently disappears under skip_special_tokens=True. Add a single unit test in dynamo-tokenizers pinning both halves of the gate (one special:true -> promoted, one special:false -> skipped) end-to-end through the actual encode -> decode(skip_special=true) round trip — the layer the parser tests cannot reach. Also upgrade the promotion log from debug to warn, including the literal token strings, so any accidental promotion shows up loudly in worker logs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Exception
Per .ai/python-guidelines.md ("If you must catch broadly, use except
Exception: and always re-raise after logging") and Devin's #9441 review:
narrow the bare except in _scrape_router_kv_hit_rate to
requests.RequestException. We expect transient endpoint flakes here
(timeout / connection refused while the frontend is still binding
/metrics) and the strong gate already has its own `is None` guard.
Programming errors now propagate at test time instead of being swallowed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The test profile comment claimed vLLM expands `<|image_pad|>` to N `<|vision_pad|>` (151654), but the actual routing-side fill uses config.json's `image_token_id` (151655, `<|image_pad|>`) because that's what the worker stores in the expanded sequence. The stale claim predates the placeholder/expansion-token fix. Addresses #9441 Devin review: test comment contradicts the preprocessor.rs code comment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…L1 cache specials extract_hf_special_tokens was called on the raw HfTokenizer (only tokenizer.json's specials), then the merge applied later inside wrap_hf. The L1 prefix cache's boundary list therefore missed tokens declared only in tokenizer_config.json (Qwen2-VL's <|image_pad|>), even though the wrapped tokenizer treated them as atomic — chat prefixes straddling those boundaries lost cache hits with DYN_TOKENIZER_CACHE=1. Apply the merge eagerly on the raw HfTokenizer before extracting specials so both halves stay in sync. Expose merge_special_tokens_from_config as pub for the cross-crate call. Simplify wrap_hf to a plain from_tokenizer since the merge has already run. Addresses #9441 Devin review: L1 tokenizer cache specials inconsistency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-fix Pulls latest main to re-run Post-Merge CI on the freshest base after Ryan/Keiven/Devin review refactors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- cargo fmt: rejoin overflowed `if let` and `merge_special_tokens_from_config(...)` call sites. - clippy: move the dynamo-tokenizers `#[cfg(test)] mod tests` to the end of hf.rs to satisfy `clippy::items_after_test_module`. Verified locally: cargo fmt --check, cargo clippy --no-deps --all-targets (-D warnings) on dynamo-tokenizers + dynamo-llm, cargo test on the new hf::tests::merge_gate_round_trips_through_decode unit test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…llow-up Tracks the cohesion follow-up from #9441 review: the router-metric- specific fields and assertions on CachedTokensChatPayload (min_avg_kv_hit_rate, _metrics_baseline, _scrape_router_kv_hit_rate, the kv_hit_rate delta assertion) should move into a RouterMetricsAssertion mixin once a second router-specific strong gate gets added. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Commit 211d975 dropped the "(lightseek)" suffix from the MM-routing init INFO log; test_router_rust_mm_logs_lightseek_initialization still pinned the old string and failed in post-merge CI (run 26788111103). Match the current emitted text instead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Overview
The
mm_agg_router_*post-merge tests were passing for the wrong reasons. The base assertioncached_tokens >= 1is satisfied any time vLLM's per-worker engine prefix cache fires — even when MM-aware KV routing has silently degraded to text-prefix routing and load-balancing just happens to put both repeats on the same worker. We could not distinguish "MM routing engaged" from "lucky round-robin landing." Real regressions would have slipped through.This PR tightens those tests so they fail closed when MM-aware routing degrades, and addresses the structural mismatches that the tighter gate then exposed for Phi-3, Qwen2-VL, Qwen2.5-VL, and the fastokens tokenizer path. All four affected model profiles now route deterministically with non-zero
[ROUTING] N/M blocks overlapon the repeat (load-balance luck no longer required).How the test was hardened
CachedTokensChatPayloadgrows three opt-in kwargs (each off by default, no behavior change for callers that don't set them):require_lightseek_init=True— asserts theMM-aware KV routing enabled (lightseek)init log fired (proves the routing path is wired up at deployment time).min_avg_kv_hit_rate=0.9— scrapes therouter_kv_hit_ratePrometheus histogram from the frontend/metricsendpoint, snapshots it after R1, and asserts the R2+ delta mean is at or above the threshold. In MM-aware mode every routed model converges to(M-1)/M(≥0.944 observed across the four profiles); in the broken-hash regression class fixed in this PR the rate is ~0. Pure-text-fallback is caught separately byrequire_lightseek_init.require_vllm_mm_processor_init=True— chat-processor path equivalent for[mm-routing] Transfer mode:.make_image_payload_cached_tokensalso bumpsrepeat_countfrom 2 to 3 so the metric mean covers both R2 and R3 — a single bad routing decision on either one drops the mean below 0.9 and fails the test.The four routed MM profiles (
mm_agg_router_qwen3-vl-2b,mm_agg_router_qwen2.5-vl-3b,mm_agg_router_qwen2-vl-2b,mm_agg_router_phi3-vision) opt into the strong gate.What the strong gate exposed (and how each was fixed)
1. Phi-3-vision — chat template emits BPE-shatterable text + vLLM does a decode-roundtrip
Phi-3's chat template renders
<|image_{n}|>as plain text. The Rusttokenizerscrate BPE-shatters that into ~7 sub-tokens, and the leading<can merge with the preceding.(word.<→ token id 19423). vLLM's_apply_token_matchestherefore can't find the placeholder in the prompt tokens and falls back to_apply_text_matches, which decodes the prompt tokens back to text, substitutes, and re-encodes — and the decode inserts a space after each special token (e.g.<|user|>\n→<|user|> \n), bumping the SentencePiece prefix from▁(id 29871) to▁▁(id 259). The worker therefore hashes its KV blocks on a token sequence that differs from the router'srouting_token_idsby a few ids — and every block hash from byte 0 onward diverges.The fix mirrors vLLM's behavior exactly: the routing path does an encode → decode → re-encode roundtrip on the prompt, then splits at
<|image_N|>and tokenizes each segment with the segment-BOS that Phi-3's HF processor produces. Locally capturedall_token_idsfrom vLLM's HF processor and the router'srouting_token_idsnow match byte-for-byte.Phi-3's
tokenizer_config.jsonalso declaresadd_bos_token: true, so the routing-side sequence prepends<s>(id 1) at the very front to match the backend's HF-processor output. No-op foradd_bos_token: falsemodels.2. Qwen2-VL / Qwen2.5-VL — chat-template token id ≠ per-patch expansion token id
These families'
config.jsondeclaresimage_token_id: 151655(<|image_pad|>) — what the chat template emits per image and what the backend's HF processor fills the expanded sequence with. Lightseek'splaceholder_token_id()returns a different id (151654=<|vision_pad|>); that id never appears in the worker's stored sequence, so using it as the routing-side fill token leaves every image-region block hashing differently from the backend.Resolve
chat_placeholder_token_idfromconfig.json'simage_token_idfield and use it as both the find-target and the expansion fill token. Falls back to lightseek's value whenconfig.jsondoesn't expose the field. Families where the two coincide (Qwen3-VL, LLaVA, Phi-3 after substitution) take the same code path as before — no behavior change.3. Qwen2-VL-2B — Rust tokenizer loader misses
tokenizer_config.jsonQwen2-VL-2B's release declares
<|image_pad|>as a special token only intokenizer_config.json'sadded_tokens_decoderfield — not intokenizer.json'sadded_tokenslist. HF Python'sAutoTokenizer.from_pretrained()reads both and merges. The Rusttokenizerscrate reads onlytokenizer.jsonand BPE-shatters<|image_pad|>into 6 sub-tokens. This gap predates MM routing; text-prefix routing didn't care because it didn't require token-identity alignment with vLLM's view.HuggingFaceTokenizer::from_file(and a newfrom_tokenizer_with_model_dirvariant called frommodel_card.rs) now reads the siblingtokenizer_config.jsonat load time and merges anyadded_tokens_decoderentries withspecial: truevia thetokenizerscrate'sadd_special_tokens()API. Aligns Rust with HF Python reference behavior. Idempotent for models that already have those tokens intokenizer.json.Cost: load-time only (~10-30 ms one-time, one extra file read + JSON parse). Encode hot path: zero overhead —
tokenizersuses a trie for special-token recognition, lookup is O(input length) independent of trie size.4. fastokens — no
tokenizer_config.jsonmerge availableThe opt-in
DYN_TOKENIZER=fastokensbackend doesn't expose a special-token mutator on itsTokenizertype, so we can't replay thetokenizer_config.jsonmerge there. Running MM-aware routing on a tokenizer that's BPE-shattering placeholders would silently emit wrong block hashes — far worse than falling back. The preprocessor now logs a single warning at init whenDYN_TOKENIZER=fastokensis set and disables MM-aware routing (falls back to text-prefix routing). TaggedTODO(mm-routing)in the source; the guard goes away oncefastokensupstream exposes a mutator.Local validation
Each routed model run 3× (Phi-3 run 10× as the most-changed path); every R2 routing decision shows real
[ROUTING] N/M blocks overlapmatching, not load-balance luck.router_kv_hit_ratemm_agg_router_qwen3-vl-2b(pre_merge baseline)mm_agg_router_qwen2.5-vl-3bconfig.jsonimage_token_idas find + fillmm_agg_router_qwen2-vl-2bimage_token_idmm_agg_router_phi3-visionThe "1 missing block" in every overlap is structural: the router hashes N blocks (the trailing one is zero-padded to block boundary), but vLLM only publishes BlockStored for complete blocks of 16 tokens — so the partial-tail block sits just outside the worker's stored set. Every other block matches. The 0.9 gate threshold sits below the lowest observed floor (0.944) with margin, and well above the ~0.0 the broken-hash regression class produces.
A post-merge CI dispatch on this branch is running: https://github.com/ai-dynamo/dynamo/actions/runs/26542084108
Where reviewers should start
tests/utils/payloads.py+tests/utils/multimodal.py— the strong-gate kwargs (require_lightseek_init,min_routing_total_blocks) and the helper they wrap.tests/serve/multimodal_profiles/vllm.py— the four model profiles opting into the gate.lib/llm/src/preprocessor.rs—splice_numbered_placeholders_at_token_level(decode-roundtrip + per-segment BOS);chat_placeholder_token_idresolution; theDYN_TOKENIZER=fastokensguard.lib/tokenizers/src/hf.rs— thetokenizer_config.jsonmerge. Low-level loader used by all of dynamo; changes are additive/idempotent.lib/llm/src/model_card.rs— the one-call-site update that routes the preprocessor tokenizer through the merging constructor.A
TODO(mm-routing)inread_image_token_id_from_configand the fastokens guard flag follow-ups (consolidate dual-token info into a single lightseek API; remove the fastokens guard once upstream exposes a mutator). Not blocking this PR.Closes DIS-1977
Related issues
preprocessor_config.json, now merged to main)🤖 Generated with Claude Code