feat: DeepSeek-V3.2 Streaming tool call output#15278
Fridge003 merged 7 commits into sgl-project:main from
Conversation
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Summary of Changes

Hello @JustinTong0323, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces streaming capabilities for DeepSeek-V3.2 tool calls, enabling the model to output tool names and arguments incrementally as they are generated. This enhancement significantly improves the user experience for applications requiring real-time interaction with tool-using models by providing immediate feedback on tool call progress, rather than waiting for a complete tool call to be formed.

Highlights
Code Review
This pull request introduces streaming support for tool call arguments in the DeepSeek-V3.2 model, which is a great enhancement for user experience. The core logic in deepseekv32_detector.py has been significantly refactored to handle partial parsing and incremental streaming of arguments. The implementation appears solid and correctly handles the complexities of streaming structured data. I have one minor suggestion to simplify a piece of the new logic.
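To illustrate the kind of incremental streaming such a refactor enables, here is a minimal, self-contained sketch (not the actual `deepseekv32_detector.py` code; the class and method names are hypothetical) of emitting only the not-yet-sent portion of a partially parsed argument string on each chunk:

```python
# Illustrative sketch: a streaming emitter that remembers what it has already
# sent downstream and emits only the delta on each new partial parse.

class StreamingArgEmitter:
    def __init__(self):
        self.sent = ""  # arguments text already emitted downstream

    def emit_delta(self, parsed_args_so_far: str) -> str:
        """Return the not-yet-emitted suffix of the partially parsed arguments."""
        if not parsed_args_so_far.startswith(self.sent):
            # The partial parse was revised (e.g. a quote closed differently);
            # a real detector would need a resync strategy here.
            self.sent = ""
        delta = parsed_args_so_far[len(self.sent):]
        self.sent = parsed_args_so_far
        return delta

emitter = StreamingArgEmitter()
print(emitter.emit_delta('{"city": "Par'))    # {"city": "Par
print(emitter.emit_delta('{"city": "Paris"}'))  # is"}
```

Concatenating the emitted deltas reproduces the full argument string, which is the invariant a streaming detector has to preserve.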
    # For partial values, we just take what we have so far
    # We don't try JSON parsing for partial values unless they look complete,
    # but simplistic approach is to just treat as string/partial value
    if param_type == "true":
        # For strings, the value is just the content so far
        # We might need to be careful if the value itself contains partial closing tag
        # But greedy match .* at end should capture everything
        parameters[param_name] = param_value
    else:
        # For non-strings (JSON), partial parsing is tricky without a dedicated parser
        # But we can try to return the raw string or try partial json
        parameters[param_name] = param_value
The if param_type == "true": and else: blocks contain identical code: parameters[param_name] = param_value. This conditional is redundant. You can simplify this section by removing the if/else and using a single assignment, which makes the code cleaner and easier to maintain.
    # For partial values, we just take what we have so far.
    # For both string and JSON-like types, we'll take the raw partial value
    # since proper partial JSON parsing is complex.
    parameters[param_name] = param_value

When the buffer contains accumulated content from previous chunks (e.g., when a chunk ends with "<"), the code was returning `new_text` instead of `current_text` when determining that the content is not DSML. This caused previously buffered content to be discarded. For example, when streaming text containing `<user_maybe_say>`:

- Chunk 1: "...<" (buffered, returns empty)
- Chunk 2: "user_maybe_say>..." (returns only new_text, discards "<")

This resulted in `<user_maybe_say>` being output as `user_maybe_say>`, with the leading "<" lost. The fix returns `current_text` (the buffer content) instead of `new_text` (the current chunk only), ensuring no content is lost when the buffer is cleared.
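The buffering behavior described above can be sketched as follows. This is an illustrative stand-in, not the real sglang detector: `TOKEN_START`, `Detector`, and `parse_chunk` are hypothetical names, and the actual tool-call parsing branch is omitted.

```python
TOKEN_START = "<|dsml|"  # illustrative DSML marker, not the real one

class Detector:
    def __init__(self):
        self.buffer = ""

    def parse_chunk(self, new_text: str) -> str:
        """Return the normal-text output for this chunk (tool-call handling omitted)."""
        current_text = self.buffer + new_text
        if current_text.endswith("<"):
            # "<" may begin a DSML tag; buffer everything and emit nothing yet.
            self.buffer = current_text
            return ""
        if not current_text.startswith(TOKEN_START):
            # Plain text after all: clear the buffer and flush ALL of it.
            # Returning only `new_text` here was the bug -- the buffered "<"
            # from the previous chunk would be silently dropped.
            self.buffer = ""
            return current_text
        # (a real detector would parse the tool call here)
        self.buffer = current_text
        return ""

d = Detector()
out = d.parse_chunk("hello <") + d.parse_chunk("user_maybe_say> world")
print(out)  # hello <user_maybe_say> world
```

With the buggy `return new_text`, the same two chunks would produce `hello user_maybe_say> world`, matching the failure mode described in the commit message.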
/tag-and-rerun-ci
Co-authored-by: Muqi Li <muqi1029@gmail.com> Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
The reason I don't implement the full streaming is that I must handle the annoying trailing part of the parameter end token. I have tested this PR, and the problem hasn't been solved. @JustinTong0323 You can reproduce the bad case by setting the interval in the test case to 1, and you will see the failure.

Reproducing script (step 2):

    cd test/registered/function_call
    python -m unittest test_function_call_parser.TestDeepSeekV32Detector.test_streaming_xml_format

BTW, I have made the test stricter; maybe you can cherry-pick that commit into this PR.
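The "interval = 1" reproduction amounts to a round-trip harness like the following sketch. The parser here is a trivial stand-in that only mimics the trailing-"<" buffering (the real `DeepSeekV32Detector` is not reproduced); the point is the one-character-per-chunk assertion that concatenated outputs must reproduce the input exactly.

```python
def make_stub_parser():
    """Stand-in for the real detector: buffers a trailing '<' like the PR's code."""
    state = {"buf": ""}

    def parse(chunk: str) -> str:
        text = state["buf"] + chunk
        if text.endswith("<"):
            state["buf"] = text  # might be the start of a tag: hold it back
            return ""
        state["buf"] = ""
        return text

    return parse

def streamed_roundtrip(full: str, interval: int = 1) -> str:
    """Feed `full` to the parser `interval` characters at a time."""
    parse = make_stub_parser()
    chunks = [full[i:i + interval] for i in range(0, len(full), interval)]
    return "".join(parse(c) for c in chunks)

# Strictest setting: one character per chunk, as suggested in the comment above.
full = "before <user_maybe_say> after"
assert streamed_roundtrip(full, interval=1) == full
print("round-trip OK")
```

A stricter test in this style catches buffering bugs that larger chunk sizes can mask, since small intervals maximize the number of chunk boundaries falling inside tags.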
How did you get tool calling to work? I can't get it to work with DSV3.2 AWQ. hmm..
Could you try again on the latest main?
Hi @Fridge003, thanks for the reminder! But the question is not the same, and there are still some bugs in this PR. I think you SHOULD NOT merge this into main instantly, which is very risky. I have pointed out the bugs in the review, and you can also run the CI with stricter test cases.
The failure case hasn't occurred in my test cases, so we merged it to support this feature and will aim to solve the issue you referenced before the release. Thanks for your check~
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com> Co-authored-by: momaek <momaek17@gmail.com> Co-authored-by: Muqi Li <muqi1029@gmail.com>


Motivation
Fixes #14711
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist