[v0.5.10][2] Fix apply_chat_template behavior for transformers >=5.0 #926

yueming-yuan merged 20 commits into main from
Conversation
…ng-v0.5.10" This reverts commit d549b26.
Code Review
This pull request introduces a utility function _apply_chat_template_ids to handle changes in the transformers library (version 5.0+) where apply_chat_template may return a dictionary instead of a list. All existing calls to the tokenizer have been updated to use this wrapper. Feedback suggests adding type hints to the new function and explicitly setting return_dict=False to improve robustness and maintainability.
    def _apply_chat_template_ids(tokenizer, messages, **kwargs) -> list[int]:
        """Wrapper that always returns list[int] from apply_chat_template(tokenize=True).
        transformers >=5.0 returns BatchEncoding instead of list[int]."""
        result = tokenizer.apply_chat_template(messages, tokenize=True, **kwargs)
        if isinstance(result, list):
            return result
        return result["input_ids"]
The _apply_chat_template_ids wrapper is a good addition for compatibility with transformers 5.0. However, it can be improved by adding type hints for better maintainability and consistency with the rest of the file. Also, explicitly setting return_dict=False via kwargs.setdefault ensures that current versions of transformers return the expected list type, while the isinstance check provides a robust fallback for future versions where the default might change or the flag might be ignored.
Note: Passing tokenize in kwargs to this function will cause a TypeError because it is already explicitly passed to apply_chat_template.
Suggested change:

    def _apply_chat_template_ids(tokenizer: AutoTokenizer, messages: list[dict], **kwargs) -> list[int]:
        """Wrapper that always returns list[int] from apply_chat_template(tokenize=True).
        transformers >=5.0 returns BatchEncoding instead of list[int]."""
        kwargs.setdefault("return_dict", False)
        result = tokenizer.apply_chat_template(messages, tokenize=True, **kwargs)
        if isinstance(result, list):
            return result
        return result["input_ids"]
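For reference, a minimal usage sketch of the wrapper defined above; the model name and messages are illustrative and not taken from mask_utils.py:

```python
# Minimal usage sketch for the _apply_chat_template_ids wrapper defined above.
# The model and messages are illustrative; any chat-capable tokenizer works.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
]

# Returns list[int] whether apply_chat_template yields list[int] (transformers v4)
# or a BatchEncoding (transformers >=5.0).
prompt_ids = _apply_chat_template_ids(tokenizer, messages, add_generation_prompt=False)
assert all(isinstance(t, int) for t in prompt_ids)
```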
Remove models broken by transformers v5 tokenizer unification (DeepSeek-V3, step3, glm-4-9b-chat) and track them in a TOOL_CALL_KNOWN_FAILURES list with root cause comments. Add new passing models: Qwen3.5, Qwen3-Coder-Next, GLM-4.7-Flash, Kimi-K2.5, MiniMax-M2.5, Nemotron-3-Super. Clean up debug helpers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
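The commit does not show the list itself; below is a hypothetical sketch of what a TOOL_CALL_KNOWN_FAILURES registry with root-cause comments could look like, using the entries and reasons from this commit message and the tables later in this thread (the structure and exact strings are assumptions, not the actual test code):

```python
# Hypothetical sketch of a known-failures registry with root-cause comments.
# The real structure in the test suite may differ; reasons mirror the analysis below.
TOOL_CALL_KNOWN_FAILURES = {
    # transformers v5 LlamaTokenizer overwrites the ByteLevel pre_tokenizer/decoder with Metaspace
    "deepseek-ai/DeepSeek-V3": "tokenizer unification regression (upstream #43066)",
    # model-side: tool-call chat template expects string function.arguments, test provides a dict
    "stepfun-ai/step3": "model-side template issue",
    # v5 removed the legacy _decode segmentation path, exposing a custom tokenizer bug
    "THUDM/glm-4-9b-chat": "model-side tokenizer bug",
}
```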
transformers >=5.0 changed apply_chat_template(tokenize=True) to return BatchEncoding instead of list[int]. Pass return_dict=False to all 6 call sites in mask_utils.py to ensure list[int] on both v4 and v5. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
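As a rough illustration of that change at one call site (variable names are illustrative, not the actual mask_utils.py code):

```python
# Before: on transformers >=5.0 this returns a BatchEncoding, breaking callers
# that expect list[int].
token_ids = tokenizer.apply_chat_template(messages, tokenize=True)

# After: return_dict=False keeps the list[int] return type on both v4 and v5.
token_ids = tokenizer.apply_chat_template(messages, tokenize=True, return_dict=False)
```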
775a86c to f058d37
Move Step-3.5-Flash from known failures into active tool-call test models, and clarify comments for remaining transformers v5 tokenizer/template incompatibilities. Made-with: Cursor
A report generated by cc & codex and briefly reviewed by me; it generally makes sense.

Transformers v5 Tokenizer Compatibility Analysis

Background

Incremental detokenization assumes that the decode of the token delta equals the text delta. For a few models, that assumption effectively relies on the tokenizer's decoder behavior, which changed in transformers v5.

This document separates transformers-side regressions (upstream bugs) from model-side template/tokenizer issues.
Root Cause 1: LlamaTokenizer Overwrites ByteLevel with Metaspace

Directly affected models: deepseek-ai/DeepSeek-V3 (see the table below).

What changed

In v4, the fast tokenizer kept the ByteLevel pre_tokenizer and decoder defined in tokenizer.json. In v5, LlamaTokenizer overwrites both during initialization:

    self._tokenizer.pre_tokenizer = Metaspace(replacement="▁", prepend_scheme="always")
    self._tokenizer.decoder = Sequence([Replace("▁", " "), ByteFallback(), Fuse(), Strip()])

This happens because LlamaTokenizer hardcodes these components regardless of what tokenizer.json specifies.

Consequences

- Encoding changed: the Metaspace pre_tokenizer handles spaces differently from ByteLevel.
- Decoding changed: the Metaspace-based decoder chain no longer produces the same text as the original ByteLevel decoder for the same token ids. Note: the token's stored text (the vocabulary piece) itself does not change; only how it is decoded back to text does.

Why this explains the failing models
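To make this concrete, here is a small diagnostic sketch (not part of this PR) that inspects which pre_tokenizer/decoder the fast tokenizer ends up with and checks the delta-decode assumption; the model name comes from the table below, and the v5 behavior is as reported above:

```python
# Diagnostic sketch: inspect the fast tokenizer's components and check whether
# decoding a prefix and a delta separately stitches back to the full decode.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3", trust_remote_code=True)

# Under v4 these reflect tokenizer.json (ByteLevel); under v5 LlamaTokenizer is
# reported to overwrite them with Metaspace-based components.
print(type(tok.backend_tokenizer.pre_tokenizer).__name__)
print(type(tok.backend_tokenizer.decoder).__name__)

# Incremental-decode assumption: decode(prefix) + decode(delta) == decode(prefix + delta).
ids = tok.encode("Hello world, how are you?", add_special_tokens=False)
prefix, delta = ids[:3], ids[3:]
stitched = tok.decode(prefix) + tok.decode(delta)
print(stitched == tok.decode(ids))  # False indicates the assumption is violated
```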
Upstream references
Root Cause 2: Legacy _decode Segmentation Removed in v5

Directly affected models: THUDM/glm-4-9b-chat (see the table below).
| Model | Root Cause | Upstream Issue |
|---|---|---|
| deepseek-ai/DeepSeek-V3 | LlamaTokenizer overwrites ByteLevel with Metaspace | #43066 |
| deepseek-ai/DeepSeek-V3.1 | Tool-call chat template expects string function.arguments; current dummy tool-call shape provides a dict | Model-side template issue |
| stepfun-ai/step3 | Same as above | Same |
| THUDM/glm-4-9b-chat | v5 removed legacy _decode segmentation, exposing custom tokenizer bug | N/A (model-side bug) |
Summary of Passing Models
| Model | Tokenizer Class | Backend | Decoder | Why Unaffected |
|---|---|---|---|---|
| Qwen2.5-0.5B-Instruct | Qwen2Tokenizer | TokenizersBackend | ByteLevel | Hardcoded ByteLevel matches tokenizer.json |
| Qwen3-0.6B | Qwen2Tokenizer | TokenizersBackend | ByteLevel | Same |
| Qwen3-4B-Instruct-2507 | Qwen2Tokenizer | TokenizersBackend | ByteLevel | Same |
| Qwen3-Coder-30B-A3B-Instruct | Qwen2Tokenizer | TokenizersBackend | ByteLevel | Same |
| Qwen3.5-0.8B | Qwen2Tokenizer | TokenizersBackend | ByteLevel | Same |
| Qwen3-Coder-Next | Qwen2Tokenizer | TokenizersBackend | ByteLevel | Same |
| Mistral-7B-Instruct-v0.3 | TokenizersBackend | TokenizersBackend | Metaspace | Direct load from tokenizer.json, no overwrite |
| GLM-4.7-Flash | TokenizersBackend | TokenizersBackend | ByteLevel | Direct load from tokenizer.json; does not use the old ChatGLM Python decode path |
| Step-3.5-Flash | TokenizersBackend | TokenizersBackend | ByteLevel-compatible | Direct load; passes tool-response round-trip as of April 8, 2026 |
| Nemotron-3-Super-120B | TokenizersBackend | TokenizersBackend | ByteLevel | Direct load from tokenizer.json, no overwrite |
| MiniMax-M2 | GPT2Tokenizer | TokenizersBackend | ByteLevel | Hardcoded ByteLevel matches tokenizer.json |
| MiniMax-M2.5 | GPT2Tokenizer | TokenizersBackend | ByteLevel | Same |
| internlm3-8b-instruct | InternLM3Tokenizer | PythonBackend | N/A | Pure Python tokenizer, no Rust decode, no bug |
| Kimi-K2-Instruct | TikTokenTokenizer | PythonBackend | N/A | Pure Python tokenizer, no Rust decode, no bug |
| Kimi-K2.5 | TikTokenTokenizer | PythonBackend | N/A | Same |
| MiMo-7B-RL | Qwen2Tokenizer | TokenizersBackend | ByteLevel | Hardcoded ByteLevel matches tokenizer.json |
- Revert CI image back to radixark/miles:dev
- Revert SGLANG_PR default back to sglang-miles
- Revert SGLANG_BRANCH back to sglang-miles
- Revert Megatron-Bridge back to merged-megatron-0.16.0rc0-miles
…adixark#926)
Co-authored-by: guapisolo <guapisolo@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…region clusters (#10)

* Revert "[BUGFIX] [P2PRDMA] Add rollout post-processing after P2PRDMA weight updates" (radixark#882)
* [Fix] fix ci (radixark#894)
* Avoid threading for ray getting object (radixark#886)
* Add explicit errors for unsupported Megatron profiles (radixark#887)
* Add nvfp4 quantizer files (radixark#907)
* Bump flash-linear-attention version to 0.4.2 (radixark#892)
* [BUGFIX] Invoke "post_process_quantization" by default after weight updating (radixark#890) Co-authored-by: Yueming Yuan <yym022502@gmail.com>
* Add heartbeat and id to session server (radixark#866)
* fix: adding thin glm5 image to docker build + latest tag sync (radixark#871)
* Add consistent hashing routing policy for rollout (radixark#891) Co-authored-by: Yueming Yuan <yueming@Mac.attlocal.net>
* [example] add retool v2 example with multi-turn framework interfaces (radixark#654) Co-authored-by: GuanxingLu <gxlu02@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Expose rollout-batch-size, n-samples-per-prompt, global-batch-size as CLI args in swe-agent-v2 (radixark#954) Co-authored-by: Shi Dong <shi.dong@radixark.ai>
* chore: remove obsolete swe-agent server.py and run-qwen3.sh (radixark#952) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add weight staleness control for fully async rollout (radixark#958)
* Fix/pause generation mode (radixark#924) Co-authored-by: Yueming Yuan <yym022502@gmail.com>
* [v0.5.10][1] Bump sglang to v0.5.10 (radixark#898)
* [v0.5.10][2] Fix apply_chat_template behavior for transformers >=5.0 (radixark#926) Co-authored-by: guapisolo <guapisolo@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* [v0.5.10][3] Fix processor return_tensors duplicate kwarg for transformers >=5.0 (radixark#927) Co-authored-by: guapisolo <guapisolo@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* [v0.5.10][4] Fix _no_split_modules set not subscriptable in transformers >=5.0 (radixark#931)
* [v0.5.10][5] Disable piecewise cuda graph to avoid NVLS oom (radixark#935)
* [v0.5.10][6][FSDP] fix outdated weight update logic in FSDP (radixark#948) Co-authored-by: guapisolo <guapisolo@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: maocheng23 <35615230+maocheng23@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [v0.5.10][7][FSDP] move FSDP to experimental and disable by default (radixark#961)
* Add skiplist and more robust calculation on val (radixark#965)
* [fix] tiny fix debug rollout only in weight version check (radixark#967)
* feat: real cp support with relayout fix for qwen3.5 train/rollout mismatch (radixark#885)
* [AMD] Upgrade to sglv0.5.10 (radixark#973)
* switch model to actor (radixark#756)
* [fix] support general logic to bypass fp32 downcast and fix qwen35 A_log dtype (radixark#975) Co-authored-by: yueming-yuan <yym022502@gmail.com>
* fix: populate prefix_cache_info in OpenAI/session rollout path (radixark#960)
* Remove prepare_harbor_tasks.py; use harbor-private adapters (radixark#982)
* [fix] Skip flush_cache in in_place mode and add fully async example (radixark#974) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* GLM47 full cmd for async and sync reasoning (radixark#986) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: handle non-tool appended messages in TITO incremental tokenization (radixark#949)
  Co-authored-by: Yanbin Jiang <jybsuper@gmail.com>
* [docker] Add sgl-model-gateway install and download .tar.gz assets (radixark#895)
* [ci] fix hf rate limit error by caching tokenizer loading (radixark#1014) Co-authored-by: maocheng23 <35615230+maocheng23@users.noreply.github.com>
* Use load_generate_function in legacy sglang_rollout path (radixark#1016)
* Update CODEOWNERS to add new reviewers (radixark#1021)
* Support moe lora for gpt-oss (radixark#798) Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com>
* [fix] restore expert_bias to fp32 before bridge weight export (radixark#811)
* [chore] drop legacy transformers upgrade pin for glm47-flash and qwen35 (radixark#1018)
* [fix] Enforce param dtype before wrap ddp (radixark#992) Co-authored-by: Zhichen Zeng <zczeng@uw.edu>
* [upgrade] update Megatron-Bridge source and LoRA CI to megatron e2e tests and (radixark#1023)
* [CI] Drop --use-miles-router from R3 tests and add r3 comparasion test between sgl & miles router (radixark#1015)
* wandb: raise init_timeout, add retry wrapper, fix shared-mode init for cross-region clusters

  In online + shared mode, both `init_wandb_primary` and `init_wandb_secondary` make HTTPS round-trips to wandb cloud (login + run create/attach). On high-latency cross-region clusters (e.g. Abu Dhabi MBZUAI ↔ wandb-cloud US-West) with concurrent actor bursts, a single round-trip can exceed the wandb SDK's 90s default `init_timeout` — tearing down the whole run with a silent handshake abort. Observed on RL360 job 1564420, which forced `WANDB_MODE=offline` as a global default ever since (see https://github.com/LLM360/RL360/issues/87).

  The issue's original diagnosis assumed a local primary↔secondary socket handshake race. That's not how shared mode works — per wandb's own feature PR (wandb/wandb#6882), each writer spawns an independent wandb-core that talks to the cloud directly; aggregation is server-side by run_id. No local socket exists. The failure mode is pure network/latency, not a local readiness race.

  Changes
  -------
  - Bump `init_timeout` to 300s for primary and secondary Settings. Configurable via `WANDB_INIT_TIMEOUT_SECS` env var for tuning.
  - Wrap both init paths in a bounded exponential-backoff retry (`_wandb_init_with_retry`) that re-attempts on wandb.errors.CommError and wandb.errors.UsageError. 3 attempts with 5→10→20s backoff by default, tunable via `WANDB_INIT_RETRY_ATTEMPTS` / `WANDB_INIT_RETRY_BACKOFF_SECS`.
  - Add `x_label` tagging per wandb distributed-training docs: primary gets `rank_<rank>_primary`, secondaries get `rank_<rank>_secondary`. Enables per-rank console-log filtering in the wandb UI.
  - Drop `reinit=True` from secondary init_kwargs. Shared mode natively supports concurrent writers on a single run; `reinit=True` triggered stale-state warnings on secondary actors without functional benefit.

  Followups this change enables
  -----------------------------
  - `WANDB_MODE=offline` can be removed from scale.yaml's extra_env default once a pilot run confirms online mode boots cleanly.
  - The tmux-based `~/bin/wandb-sync-rl360.sh` workaround on David's M2 account becomes obsolete (no more offline-only default).
  - Near-realtime wandb dashboards replace the ~2-minute-lag offline sync; per-rank system metrics via x_label filtering.
---------

Co-authored-by: JD <jaedon.guo@gmail.com>
Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Ziang Li <ziangli@umich.edu>
Co-authored-by: Zhichen Zeng <zczeng@uw.edu>
Co-authored-by: JensenFire <xinji1@microsoft.com>
Co-authored-by: Yueming Yuan <yym022502@gmail.com>
Co-authored-by: maocheng23 <35615230+maocheng23@users.noreply.github.com>
Co-authored-by: Douglas Yang <douglasyang88@gmail.com>
Co-authored-by: Yueming Yuan <yueming@Mac.attlocal.net>
Co-authored-by: Huapeng Zhou <73010314+PopSoda2002@users.noreply.github.com>
Co-authored-by: GuanxingLu <gxlu02@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Shi-Dong <Shi-Dong@users.noreply.github.com>
Co-authored-by: Shi Dong <shi.dong@radixark.ai>
Co-authored-by: Jiajun Li <48857426+guapisolo@users.noreply.github.com>
Co-authored-by: guapisolo <guapisolo@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Yuzhen Zhou <82826991+zyzshishui@users.noreply.github.com>
Co-authored-by: Yanbin Jiang <jybsuper@gmail.com>
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: Yisheng Gong <yishenggong9437@gmail.com>
ci-sglang-pr: sglang-miles-v0.5.10
Summary
- transformers 5.x changed apply_chat_template(tokenize=True) to return BatchEncoding instead of list[int]
- Added a _apply_chat_template_ids() wrapper that normalizes the return type
- Replaces #925 (was merged then reverted)