feat: DeepSeek new v3.2 encoding #14249
Conversation

> Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
@Eva20150932-atlascloud Can we be compatible with the former template of …

I've tried the old v3.2 chat template, but the model doesn't pass my tool-call tests.

I mean, can we put the different chat templates in separate files and apply them to the different models (V3.2/V3.2-Exp)?

Verified it works.

@Fridge003 Possible, though it needs to set an XML attribute, and I'm not experienced in building Jinja templates.
…=ChoiceDeltaToolCallFunction(arguments={}, name=None), type=function)] when streaming
…Added detection logic for using DPSK V3.2 encoding based on tokenizer configuration and architecture. Updated tests to validate the encoding path and functionality. Adapted encoding_dsv32.py from the Hugging Face repository.

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
/tag-and-rerun-ci
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
I believe this PR is ready. cc @Fridge003
```python
self.use_dpsk_v32_encoding = self._use_dpsk_v32_encoding()
```

```python
def _use_dpsk_v32_encoding(self) -> bool:
    has_chat_template = (
        self.tokenizer_manager.tokenizer is not None
        and self.tokenizer_manager.tokenizer.chat_template is not None
    )
    architectures = self.tokenizer_manager.server_args.get_hf_config().architectures
    is_dpsk_v32 = "DeepseekV3" in architectures[0] if architectures else False
    return not has_chat_template and is_dpsk_v32
```
Just don't determine this with `architectures`; use `tool_call_parser` instead:

```python
self.use_dpsk_v32_encoding = self.tokenizer_manager.server_args.tool_call_parser == "deepseekv32"
```
We shouldn't, since `tool_call_parser` isn't required in some cases, but this code path still matters.
We could just add an environment variable `SGLANG_USE_DPSKV32_ENCODING=True`; then there's no need to worry about how to determine this.
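A minimal sketch of that suggestion, assuming a hypothetical helper name and the proposed `SGLANG_USE_DPSKV32_ENCODING` variable (not code from this PR):

```python
import os


def use_dpsk_v32_encoding_from_env() -> bool:
    # Hypothetical helper: gate the encoding code path on an environment
    # variable instead of inspecting the model architecture.
    return os.environ.get("SGLANG_USE_DPSKV32_ENCODING", "").lower() in ("1", "true")
```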
@JustinTong0323 I think this custom encoding is a temporary approach; actually it is a kind of `apply_chat_template`.
Not quite sure; do you mean we should not enable it by default? But this code is adapted from DeepSeek's HF repo, so I think it should be enabled by default.
> Not quite sure; do you mean we should not enable it by default? But this code is adapted from DeepSeek's HF repo, so I think it should be enabled by default.

Do you remember when, with huggingface transformers ~=4.2x (2023/2024), open-source models usually provided a tokenizer.py with a `def apply_chat_template`? This encoding_dsv32.py is that `apply_chat_template`.
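As a rough illustration of that pattern (the role markers below are made up, not DeepSeek's actual format), such a tokenizer-side `apply_chat_template` is just a function from messages to a prompt string:

```python
def apply_chat_template(messages, add_generation_prompt=True):
    # Illustrative sketch only: turn a list of {"role", "content"} dicts
    # into a single prompt string using invented role markers.
    parts = [f"<|{m['role']}|>{m['content']}<|end|>" for m in messages]
    if add_generation_prompt:
        parts.append("<|assistant|>")
    return "".join(parts)
```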
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
```python
# Check if invoke_content is empty or whitespace only.
# If so, skip this tool call entirely (it's likely incomplete or malformed).
if not invoke_content.strip():
```
This will ignore no-parameter functions, like:

```
<|DSML|invoke name="get_current_time">
</|DSML|invoke>
```
Will fix that later.
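For illustration, a hedged sketch of such a fix (`parse_invokes` is a hypothetical helper, and the regex is assumed from the DSML markers above, not taken from the detector): treat an empty invoke body as a zero-argument call instead of skipping it.

```python
import re

# Assumed DSML invoke-block shape; re.S lets the body span newlines.
INVOKE_RE = re.compile(
    r'<\|DSML\|invoke name="([^"]+)">\s*(.*?)\s*</\|DSML\|invoke>', re.S
)


def parse_invokes(text):
    # Empty body -> zero-argument call ("{}"), so no-parameter functions
    # like get_current_time are not dropped.
    calls = []
    for m in INVOKE_RE.finditer(text):
        name, body = m.group(1), m.group(2)
        calls.append({"name": name, "arguments": body if body else "{}"})
    return calls
```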
There is a …
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Hi, may I ask why you use the `while` loop here?
Do you mean we only need to prepare for parsing one invoke-block, since the model generates only one token per forward? PR #11652, which supports MTP on v3.2, makes generating more than one invoke-block possible (though with very low probability). And by the way, I think it's harmless to use the while loop, as it would break once the invoke regex is not matched. @Muqi1029
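A sketch of the while-loop behavior described above (hypothetical helper; the regex is assumed from the DSML markers in this thread, not the detector's actual pattern):

```python
import re

INVOKE_NAME_RE = re.compile(
    r'<\|DSML\|invoke name="([^"]+)">(.*?)</\|DSML\|invoke>', re.S
)


def parse_all_invokes(buffer):
    # Keep consuming invoke blocks from the front of the buffer; the loop
    # exits as soon as the invoke regex stops matching, so zero, one, or
    # many blocks are all handled.
    names = []
    while True:
        m = INVOKE_NAME_RE.search(buffer)
        if m is None:
            break
        names.append(m.group(1))
        buffer = buffer[m.end():]
    return names
```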
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
But even though MTP generates more than one token at once, the logic without the `while` loop …
That logic sounds good, and it makes me rethink things. Could there be a case where the response doesn't have a "next time"? For instance, what if the MTP forward generates the EOS token?
@Eva20150932-atlascloud Thanks for your answer! I think maybe you are right, but here I have another question: why do you use these markers here?

sglang/python/sglang/srt/function_call/deepseekv32_detector.py Lines 185 to 201 in ef3f8c9

I think the model outputs at the token level; you can use the following script to see the tokens:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3.2")
special_tokens = [
    "<|DSML|function_calls>",
    "</|DSML|function_calls>",
    "<|DSML|invoke",
    "</|DSML|invoke",
]
for marker in special_tokens:
    print("\n\n")
    print(f" Processing {marker} ".center(80, "-"))
    ids = tokenizer.encode(marker, add_special_tokens=False)
    for token_id in ids:
        token = tokenizer.decode(token_id)
        print(f"'{token}' : {token_id}")
```

The output is as follows:

So …
@Eva20150932-atlascloud Can you ensure that the function calls are output in the expected streaming manner? #14711
@jxz542189 |
This reverts commit 7c38eca.
When using smg and gRPC mode, I think it should do a similar thing to this PR. @slin1237
Motivation
#14227
DeepSeek officially released a new encoding function to replace the chat template, and I made a workable version (though it is hard-coded and breaks other models).
If you still use the old chat template for the formal v3.2, tool calling works badly, so we need the new encoding to run the new v3.2 model.
Modifications
Accuracy Tests
Start a server like:

```shell
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --trust-remote-code --tp-size 8 --host 0.0.0.0 --tool-call-parser deepseekv32 --enable-metrics --max-queued-requests 3 --max-running-requests 64 --cuda-graph-max-bs 64 --reasoning-parser deepseek-v3
```

My test for tool_calling passed.

Benchmarking and Profiling
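As a hedged illustration of such a tool_calling test, here is a request body one might POST to the server's OpenAI-compatible `/v1/chat/completions` endpoint. The field names follow the OpenAI chat API; the tool schema and the user message are made up for the example.

```python
import json

# Hypothetical tool-calling request body; the get_current_time tool
# schema below is an example, not part of this PR.
payload = {
    "model": "deepseek-ai/DeepSeek-V3.2",
    "messages": [{"role": "user", "content": "What time is it in Tokyo?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_current_time",
                "description": "Get the current time for a timezone",
                "parameters": {
                    "type": "object",
                    "properties": {"timezone": {"type": "string"}},
                    "required": ["timezone"],
                },
            },
        }
    ],
}
body = json.dumps(payload)
```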
Checklist