Fix v32 continue_final_message not work by whybeyoung · Pull Request #16567 · sgl-project/sglang

whybeyoung · 2026-01-06T11:08:14Z

According to the official discussion at https://huggingface.co/deepseek-ai/DeepSeek-V3.2/discussions/4, PR #14249 implemented a new encoding for DS3.2.

However, that change ignored SGLang's original handling of the continue_final_message continuation mode.
The core logic changes of this PR are as follows:

Abstracted a unified method for processing the last message in the user-supplied messages.

The specific processing logic is:

For the new DS3.2 encoding mode, added a check for the continue_final_message mode.
If the role of the last input message is assistant but continue_final_message is not enabled, the last assistant message is forcibly converted to a user message with its content unchanged.
If the role of the last input message is assistant and continue_final_message is enabled, continuation mode is triggered.
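The two branches above can be sketched as a small standalone function. This is only an illustration of the described behavior, not the code in the PR: the name `resolve_last_message` and the plain-dict message shape are assumptions made for the example.

```python
from typing import Any, Dict, List, Tuple

Message = Dict[str, Any]


def resolve_last_message(
    messages: List[Message], continue_final_message: bool
) -> Tuple[List[Message], bool]:
    """Hypothetical helper: return (messages, should_continue).

    should_continue is True only when the last message is from the
    assistant and continue_final_message is enabled.
    """
    # Nothing to do unless the conversation ends with an assistant turn.
    if not messages or messages[-1].get("role") != "assistant":
        return messages, False

    # Continuation mode: keep the assistant turn so it can be used as a prefix.
    if continue_final_message:
        return messages, True

    # Otherwise convert the trailing assistant turn to a user turn,
    # leaving the content unchanged. The input list is not mutated.
    converted = dict(messages[-1], role="user")
    return messages[:-1] + [converted], False
```

For example, calling it with `continue_final_message=False` on a conversation that ends with an assistant turn returns a new list whose last entry has role `user` and the same content.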

gemini-code-assist · 2026-01-06T11:08:28Z

Summary of Changes

Hello @whybeyoung, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an issue where the v32 continue_final_message feature was not functioning as expected. The core of the fix involves encapsulating the logic for extracting and appending a final assistant message into dedicated, reusable helper functions. These functions are then integrated into the message processing pipeline, ensuring that conversations can seamlessly continue from a previous assistant response by correctly handling the prefix, thereby addressing the reported bug.

Highlights

New Helper Methods: Introduced two new private helper methods: _handle_continue_final_message to extract the last assistant message content and _append_assistant_prefix_to_prompt_ids to correctly append this content to the prompt token IDs, including handling of the Beginning Of Sequence (BOS) token.
Refactored continue_final_message Logic: The existing logic for processing the continue_final_message feature has been refactored into these new helper methods, improving code organization and reusability.
Consistent Application: The refactored logic is now consistently applied within the _apply_jinja_template method for both raw message processing and OpenAI-compatible message processing paths, ensuring the feature works correctly regardless of the input message format.
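A rough sketch of what the two helpers named above might do, under stated assumptions. The function names, the `ToyTokenizer` class, and the "drop a leading BOS from the encoded prefix" rule are illustrative guesses at the described behavior, not the PR's actual implementation.

```python
from typing import Any, Dict, List, Optional, Tuple

Message = Dict[str, Any]


class ToyTokenizer:
    """Stand-in tokenizer (hypothetical): prepends BOS on every encode call."""

    bos_token_id = 1

    def encode(self, text: str) -> List[int]:
        return [self.bos_token_id] + [ord(ch) for ch in text]


def pop_assistant_prefix(
    messages: List[Message], continue_final_message: bool
) -> Tuple[List[Message], Optional[str]]:
    """Split off the trailing assistant message as a text prefix, if enabled."""
    if (
        continue_final_message
        and messages
        and messages[-1].get("role") == "assistant"
    ):
        return messages[:-1], messages[-1].get("content")
    return messages, None


def append_prefix_ids(
    prompt_ids: List[int], prefix: Optional[str], tokenizer: ToyTokenizer
) -> List[int]:
    """Append the encoded prefix, dropping a duplicate leading BOS token."""
    if not prefix:
        return prompt_ids
    encoded = tokenizer.encode(prefix)
    # The prompt already starts with BOS, so a second one from encoding
    # the prefix on its own would corrupt the sequence.
    if encoded and encoded[0] == tokenizer.bos_token_id:
        encoded = encoded[1:]
    return prompt_ids + encoded
```

The prompt is first rendered from the messages without the trailing assistant turn, and the prefix tokens are appended afterwards so that generation resumes mid-response.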

gemini-code-assist

Code Review

This pull request adds support for continue_final_message in the dpsk_v32_encoding path and refactors the logic into two new helper methods, _handle_continue_final_message and _append_assistant_prefix_to_prompt_ids. This is a good change that fixes a bug for the v32 path and improves code maintainability by removing duplicated logic.

I have one suggestion to make _handle_continue_final_message more robust by handling cases where the last assistant message contains multimodal content, which would currently cause a runtime error.
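One way to guard against that case is to normalize the message content before treating it as a text prefix. The sketch below assumes OpenAI-style content, where `content` is either a string or a list of parts such as `{"type": "text", "text": ...}`. The helper name `text_from_content` is hypothetical, and the review's actual suggested change is not shown on this page.

```python
from typing import Any, List, Optional, Union


def text_from_content(content: Union[str, List[Any], None]) -> Optional[str]:
    """Hypothetical helper: return the text of a message, or None if it has none.

    Accepts a plain string or a list of content parts and ignores
    non-text parts (for example image parts) instead of raising.
    """
    if content is None:
        return None
    if isinstance(content, str):
        return content

    texts = []
    for part in content:
        # Only dict parts explicitly marked as text contribute to the prefix.
        if isinstance(part, dict) and part.get("type") == "text":
            texts.append(part.get("text", ""))
    return "".join(texts) if texts else None
```

A caller could then skip continuation when the helper returns `None`, rather than passing a list to code that expects a string.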

whybeyoung · 2026-01-06T14:01:21Z

/tag-and-rerun-ci

whybeyoung · 2026-01-08T10:47:41Z

/tag-and-rerun-ci

JustinTong0323 · 2026-01-08T13:36:35Z

/tag-and-rerun-ci

JustinTong0323 · 2026-01-09T22:12:54Z

/tag-and-rerun-ci

Fix v32 continue_final_message not work

0493d81

whybeyoung requested review from CatherineSue, JustinTong0323, ispobock, merrymercy and slin1237 as code owners January 6, 2026 11:08

fix lint

c9a4257

gemini-code-assist Bot reviewed Jan 6, 2026

Comment thread python/sglang/srt/entrypoints/openai/serving_chat.py Outdated

upd

e63d32f

github-actions Bot added the run-ci label Jan 6, 2026

Fridge003 assigned JustinTong0323 Jan 6, 2026

whybeyoung and others added 2 commits January 8, 2026 16:43

Merge branch 'main' into fix_v32_continue

79d9ba1

Fix continue_final_msg in dsv32

1bd3c64

Merge branch 'main' into fix_v32_continue

5a2f882

whybeyoung mentioned this pull request Jan 15, 2026

[Track] DeepSeek v3.2/GLM 5 bug tracking #14511

Open

Merge branch 'main' into fix_v32_continue

911f059

JustinTong0323 approved these changes Jan 19, 2026

Fridge003 merged commit ebca587 into sgl-project:main Jan 19, 2026
248 of 275 checks passed

Fridge003 merged 7 commits into sgl-project:main from whybeyoung:fix_v32_continue
