fix(sglang): preserve reasoning replay history #81091
Codex review: needs real behavior proof before merge.
Summary
Reproducibility: no, for the full user-visible empty-response symptom: I did not have a live SGLang/Kimi setup or logs proving the failure and fix end to end. The source path is clear, because SGLang uses
Real behavior proof
Next step before merge
Security
Review details
Best possible solution: Land the provider-scoped SGLang replay hook after the contributor adds real after-fix proof showing a multi-turn SGLang/Kimi reasoning response produces visible final text while Gemma 4 replay remains protected.
Do we have a high-confidence way to reproduce the issue? No, for the full user-visible empty-response symptom: I did not have a live SGLang/Kimi setup or logs proving the failure and fix end to end. The source path is clear, because SGLang uses
Is this the best way to solve the issue? Yes, based on source inspection this is the narrow owner-boundary fix: SGLang should declare its OpenAI-compatible replay policy through the existing plugin SDK helper instead of changing the global fallback. The added Gemma 4 test preserves the known guardrail, but merge should wait for real behavior proof.
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 652a56fc74a9. Re-review progress:
Maintainer verification for landing:
Thanks @akrimm702.
Landed via rebase onto
Thanks @akrimm702.
Summary
`openai-completions` model ids still drop historical reasoning.
Closes #81058.
Root Cause
SGLang models are registered as `openai-completions`, but the provider did not own its replay policy. That made it fall through to the core strict OpenAI-compatible fallback, which sets `dropReasoningFromHistory: true` for unowned `openai-completions` providers. For reasoning-capable self-hosted SGLang/Kimi models, that can strip replayed reasoning history and contribute to empty user-facing responses.
Fix
SGLang now uses `buildProviderReplayFamilyHooks({ family: "openai-compatible", dropReasoningFromHistory: false })`, matching the existing provider-owned opt-out pattern used by reasoning-capable OpenAI-compatible providers. The helper still forces `dropReasoningFromHistory: true` for Gemma 4 model ids, preserving the prior parser-safety behavior.
Regression Proof
I checked the fix before republishing the branch:
- `main` at validation base `ea7f74ff` had no SGLang replay-policy hook.
- `d162f0c7cf83187bb578d7f90405e1d6b0123f1b`.
- `main`.
Validation
Latest local validation after rebasing and publishing the PR branch:
- `PNPM_CONFIG_OFFLINE=true pnpm test extensions/sglang/index.test.ts src/plugins/provider-replay-helpers.test.ts src/agents/transcript-policy.test.ts src/agents/pi-embedded-runner.sanitize-session-history.test.ts`
- `PNPM_CONFIG_OFFLINE=true pnpm exec oxfmt --check extensions/sglang/index.ts extensions/sglang/index.test.ts`
- `PNPM_CONFIG_OFFLINE=true pnpm exec tsc -p extensions/sglang/tsconfig.json --noEmit --pretty false`
- `PNPM_CONFIG_OFFLINE=true pnpm check:changed`
Risk
Low and provider-scoped. This does not change the global OpenAI-compatible fallback. The main compatibility risk is accidental reintroduction of replayed reasoning for Gemma 4-style chat-completions models, covered by the new test.
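
The precedence described above (strict fallback for unowned providers, provider-owned opt-out, Gemma 4 guard on top) can be sketched as follows. This is a minimal illustration, not the plugin SDK's implementation: `resolveReplayPolicy`, `isGemma4ModelId`, the `providerPreference` field, the Gemma 4 pattern, and the example model ids are all hypothetical names chosen for this sketch.

```typescript
// Hypothetical sketch: none of these names are the plugin SDK's real API.
type ReplayPolicy = { dropReasoningFromHistory: boolean };

interface ReplayPolicyInput {
  modelId: string;
  // What the owning provider asked for; undefined means no provider owns the policy.
  providerPreference?: boolean;
}

// Assumed matcher for Gemma 4 model ids; the real pattern may differ.
function isGemma4ModelId(modelId: string): boolean {
  return /gemma[-_ ]?4/i.test(modelId);
}

function resolveReplayPolicy(input: ReplayPolicyInput): ReplayPolicy {
  // The parser-safety guard wins over any provider preference.
  if (isGemma4ModelId(input.modelId)) {
    return { dropReasoningFromHistory: true };
  }
  // Unowned providers fall back to the strict default (drop reasoning).
  return { dropReasoningFromHistory: input.providerPreference ?? true };
}

// Illustrative model ids only.
const sglangReasoningModel = resolveReplayPolicy({
  modelId: "example/kimi-reasoning",
  providerPreference: false,
});
const sglangGemma4Model = resolveReplayPolicy({
  modelId: "example/gemma-4-it",
  providerPreference: false,
});
const unownedProviderModel = resolveReplayPolicy({ modelId: "example/some-model" });
```

Under these assumptions, the provider opt-out keeps reasoning for the first model, the guard still drops it for the Gemma 4 id, and a provider with no declared preference keeps the strict default.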
Maintainer Verification
Maintainer proof after rebasing and adding changelog:
- `d55083d7e126ce1fc4d576fb841b47efe1c9093e`.
- `pnpm test extensions/sglang/index.test.ts -- --reporter=verbose` -> 2 tests passed.
- `git diff --check` -> passed.
- `tbx_01krgt6yqe7bte96dwsmker05p`, GitHub Actions run `25803983869`.
- `sglang 0.0.0.dev1+g9e00b7ca9.d20260513`, `torch 2.9.0+cpu`, model `Qwen/Qwen3-0.6B`, `/v1/models` returned the model id.
- `reasoning_content`; SGLang returned `content: "replay ok"`, non-empty `reasoning_content`, `finish: "stop"`.
- `pnpm test extensions/sglang/index.test.ts -- --reporter=verbose` -> 2 tests passed.
Real behavior proof
- `reasoning_content` for thinking-capable local models instead of stripping reasoning history through the strict fallback.
- `tbx_01krgt6yqe7bte96dwsmker05p`, GitHub Actions run `25803983869`, source-built SGLang CPU server, `sglang 0.0.0.dev1+g9e00b7ca9.d20260513`, `torch 2.9.0+cpu`, model `Qwen/Qwen3-0.6B` served on `127.0.0.1:30000`.
- `python -m sglang.launch_server --model Qwen/Qwen3-0.6B --trust-remote-code --disable-overlap-schedule --device cpu --host 127.0.0.1 --port 30000 --tp 1 --reasoning-parser qwen3 --max-total-tokens 4096`, then sent a live `curl http://127.0.0.1:30000/v1/chat/completions` request containing a prior assistant message with `reasoning_content`.
- `/v1/models` returned `Qwen/Qwen3-0.6B`; the direct replay response JSON was `{ "model": "Qwen/Qwen3-0.6B", "content": "replay ok", "reasoning": "Okay, the user wants me to answer exactly: ...", "finish": "stop" }`.
- `reasoning_content`, returned visible content `replay ok`, returned non-empty `reasoning_content`, and finished with `stop`; the PR regression test on the same Testbox also passed 2/2.
- `g5.xlarge` and `g4dn.xlarge`; CPU SGLang with Qwen3 tested the same OpenAI-compatible `reasoning_content` replay contract.
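
For readers who want to see the shape of the replay check, here is a minimal sketch of a request body that carries a prior assistant message with `reasoning_content`, plus a helper that reduces a chat-completions response to the `model` / `content` / `reasoning` / `finish` summary quoted above. The type names, `buildReplayRequest`, `summarizeReply`, the prompt text, and `sampleCompletion` are assumptions made for illustration; the sample object is hand-written and is not output captured from the Testbox run.

```typescript
// Minimal subset of an OpenAI-compatible chat message, as assumed for this sketch.
type ChatMessage = {
  role: "system" | "user" | "assistant";
  content: string;
  reasoning_content?: string;
};

type ChatRequest = { model: string; messages: ChatMessage[] };

// Builds a multi-turn request whose history replays assistant reasoning.
function buildReplayRequest(model: string): ChatRequest {
  return {
    model,
    messages: [
      { role: "user", content: "What is 2 + 2?" },
      {
        role: "assistant",
        content: "4",
        reasoning_content: "The user asks for a simple sum; 2 + 2 = 4.",
      },
      { role: "user", content: "Answer exactly: replay ok" },
    ],
  };
}

// Assumed response subset; real servers return more fields.
type ChatCompletion = {
  model: string;
  choices: Array<{
    message: { content: string | null; reasoning_content?: string | null };
    finish_reason: string | null;
  }>;
};

type ReplySummary = { model: string; content: string; reasoning: string; finish: string | null };

// Reduces a completion to the four fields used in the proof summary.
function summarizeReply(completion: ChatCompletion): ReplySummary {
  const choice = completion.choices[0];
  return {
    model: completion.model,
    content: choice?.message.content ?? "",
    reasoning: choice?.message.reasoning_content ?? "",
    finish: choice?.finish_reason ?? null,
  };
}

const replayRequest = buildReplayRequest("Qwen/Qwen3-0.6B");

// Hand-written stand-in for a server response, not a captured log.
const sampleCompletion: ChatCompletion = {
  model: "Qwen/Qwen3-0.6B",
  choices: [
    {
      message: { content: "replay ok", reasoning_content: "(model reasoning text)" },
      finish_reason: "stop",
    },
  ],
};

const replySummary = summarizeReply(sampleCompletion);

// The proof treats a reply as healthy when text and reasoning are both non-empty.
const replayLooksHealthy =
  replySummary.content.length > 0 &&
  replySummary.reasoning.length > 0 &&
  replySummary.finish === "stop";
```

Posting `replayRequest` as JSON to the `/v1/chat/completions` endpoint of a running server and passing the parsed response to `summarizeReply` would give a summary comparable to the one reported in the proof.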