fix(openai): when samplingParams is set, pass it through verbatim#3458
Conversation
f4131cb to
28ff82d
Compare
|
Heads up to anyone already peeking: I just force-pushed a redesign. The earlier version tried to auto-detect GPT-5/o-series by model name and swap |
28ff82d to
2adcfb6
Compare
|
Quick update: refined the design once more based on user feedback. Previous revision spread |
7df0882 to
56f124f
Compare
Previously pipeline.ts always hardcoded max_tokens as the output-token parameter name on the OpenAI-compatible path, falling back from samplingParams.max_tokens to request.config.maxOutputTokens to provider defaults. This broke GPT-5 / o-series on OpenAI and Azure OpenAI, which require max_completion_tokens and reject max_tokens with a 400 error.
Fix: when the user provides samplingParams explicitly, treat it as the complete source of truth for the wire shape and pass its keys through verbatim. No client-injected defaults, no request fallbacks, no hardcoded parameter names. The user describes what the provider wants; the client trusts them.
When samplingParams is absent, the historical default behavior (request fallback through temperature/top_p/.../max_tokens plus provider defaults) is preserved unchanged — existing users see no difference.
Concretely, users can now set any of:
samplingParams: { max_tokens: 4096 } # GPT-4 / Qwen / DeepSeek
samplingParams: { max_completion_tokens: 4096 } # GPT-5 / o-series
samplingParams: { reasoning_effort: 'medium' } # future knobs
without waiting for a qwen-code release that adds model-specific branches.
Signed-off-by: Gordon Lam (SH) <yeelam@microsoft.com>
56f124f to
a1fce04
Compare
wenshao
left a comment
There was a problem hiding this comment.
No issues found. LGTM! ✅ — gpt-5.4 via Qwen Code /review
|
@wenshao. Thanks for the review. However, since I used a forked repo, I don’t have permission to check in. I need your help with that. |
End-to-end verification resultsI ran a wire-level verification that bypasses all mocks — a local HTTP server captures the exact JSON body the OpenAI SDK sends, and the script drives the real Four scenarios, 13 wire-level assertions:
Totals
Also run locally on this branch
|
) Previously pipeline.ts always hardcoded max_tokens as the output-token parameter name on the OpenAI-compatible path, falling back from samplingParams.max_tokens to request.config.maxOutputTokens to provider defaults. This broke GPT-5 / o-series on OpenAI and Azure OpenAI, which require max_completion_tokens and reject max_tokens with a 400 error. Fix: when the user provides samplingParams explicitly, treat it as the complete source of truth for the wire shape and pass its keys through verbatim. No client-injected defaults, no request fallbacks, no hardcoded parameter names. The user describes what the provider wants; the client trusts them. When samplingParams is absent, the historical default behavior (request fallback through temperature/top_p/.../max_tokens plus provider defaults) is preserved unchanged — existing users see no difference. Concretely, users can now set any of: samplingParams: { max_tokens: 4096 } # GPT-4 / Qwen / DeepSeek samplingParams: { max_completion_tokens: 4096 } # GPT-5 / o-series samplingParams: { reasoning_effort: 'medium' } # future knobs without waiting for a qwen-code release that adds model-specific branches. Signed-off-by: Gordon Lam (SH) <yeelam@microsoft.com>
* fix: backport upstream trivial bug fixes (QwenLM#3499, QwenLM#3630, QwenLM#3320) Cherry-picked from QwenLM/qwen-code: - QwenLM#3499 fix(core): use empty string instead of null for reasoning-only assistant content. Some OpenAI-compatible providers (e.g. Ollama qwen3.5:9b) reject content: null with HTTP 400 when reasoning_content is also present. Tool-call-only messages keep null per OpenAI spec. - QwenLM#3630 fix(telemetry): switch FileExporter.serialize from JSON.stringify to safeJsonStringify. OTel ReadableSpans hold a BatchSpanProcessor back-reference that forms a cycle and crashed --telemetry-outfile users. - QwenLM#3320 fix(core): cap chokidar depth at 2 in SkillManager and skip .git / special file types. Prevents FD exhaustion when a skill dir contains node_modules etc., which silently broke node-pty I/O. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(core): scope StreamingToolCallParser per-stream (QwenLM#3525) Backports upstream PR QwenLM#3525 + extends the per-stream context to also cover our fork's <think>-tag parser state. Bug: every caller of Config.getContentGenerator() — foreground turns, fork subagents, run_in_background subagents, ACP concurrent Agent calls (after QwenLM#3463) — shared a single OpenAIContentConverter, which held the StreamingToolCallParser as an instance field. Concurrent streams corrupted each other's tool-call buffers, surfacing as NO_RESPONSE_TEXT. Fix: - New ConverterStreamContext interface holds toolCallParser, thinkBuffer, inThinkTag — one per stream. - createStreamContext() factory replaces resetStreamingToolCalls(). - convertOpenAIChunkToGemini(chunk, ctx) and processThinkChunk(chunk, ctx) thread the context through every parser/think-buffer access. - ContentGenerationPipeline.processStreamWithLogging creates one context at stream entry. The error path no longer manually resets — the context is GC'd when the generator unwinds. Our protoInternal recovery-note logic is preserved on the new shape. Note: upstream's follow-up QwenLM#3550 (full stateless converter refactor) is deferred — it's hygiene without a functional bug; QwenLM#3525 alone fixes the concurrency race. Tests: - New createStreamContext describe replaces resetStreamingToolCalls suite - Streaming <think> tests use a per-test context - pipeline.test.ts mock updated to match the new API - pipeline.concurrent.test.ts (from upstream commit 38edd9d) drives two real concurrent streams and asserts neither corrupts the other's tool-call output (positive control: pre-fix, this test fails with exactly the user-reported bug shape). Refs upstream QwenLM#3516, QwenLM#3525. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(core): strip thinking blocks from history on model switch (QwenLM#3304) When switching models mid-session, reasoning_content fields from thinking-capable models leaked into API requests sent to the new provider, causing 422 errors on strict OpenAI-compatible endpoints. Call stripThoughtsFromHistory() in handleModelChange() so thought parts are removed before the next request is built for the new model. * fix(core): reject truncated subagent write_file calls (QwenLM#3505) Backport of upstream QwenLM#3505. Propagates MAX_TOKENS truncation from subagent responses into tool requests and rejects truncated edit calls before schema validation can surface misleading missing-parameter errors. Adapted to our fork's coreToolScheduler.ts which already had the truncation rejection block — kept both, dropped the unused clearRetryCountsForTool() call (we don't have that retry-counter machinery yet). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(core): prevent malformed permission rules from becoming tool-wide catch-alls (QwenLM#3467) Backport of upstream QwenLM#3467. A permission rule with unbalanced parens was silently parsed with specifier: undefined, causing matchesRule to treat it as a catch-all. For deny rules this blocked all commands; for allow rules a typo could silently auto-approve everything. - Adds an invalid flag to PermissionRule - parseRule marks unbalanced-paren rules as invalid - matchesRule short-circuits invalid rules to never match - parseRules / addSession*Rule / addPersistentRule warn on malformed input - listRules filters invalid rules from /permissions UI Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(core): preserve reasoning_content during session resume and active sessions (GH#3579) * test(config): drop fork-incompatible QwenLM#3304 strip-thoughts test The test from upstream QwenLM#3304 backport assumed an in-place qwen-oauth model switch path that our fork doesn't have; the source-side fix in config.ts (stripThoughtsFromHistory call in handleModelChange) is preserved. Coverage will be re-added when the fork's switch flow stabilizes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(acp): run Agent tool calls concurrently + graceful degrade (QwenLM#3463) Backport of upstream QwenLM#3463. When the model returns multiple Agent tool calls in a single turn, ACP Session was executing them sequentially in a for-loop, multiplying latency by sub-agent count. - Add private runToolCalls() helper that mirrors coreToolScheduler's partition logic: consecutive Agent calls form a parallel batch (safe because sub-agents have no shared mutable state); other tools form sequential batches. - Replace 2 for-loops in Session.ts with runToolCalls() calls. - Switch the AgentTool eventEmitter guard from key-presence check to truthy check (commit 651979c) — the key-presence check passed for { eventEmitter: undefined } and crashed inside SubAgentTracker.setup. Note: upstream replaced 3 for-loops; our fork only had 2 in those code paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(acp): support SSE and HTTP MCP servers in ACP mode In ACP mode, the Mcp server list sent by the IDE client can include SSE (type: "sse") and HTTP (type: "http") transports, but the previous implementation only handled stdio servers via toStdioServer(). Non-stdio servers were silently skipped (continue), so any SSE/HTTP-configured MCP server would never be registered. Changes: - Add toSseServer() helper: detects type=="sse" servers and maps them to MCPServerConfig(url=..., headers=...) - Add toHttpServer() helper: detects type=="http" servers and maps them to MCPServerConfig(httpUrl=..., headers=...) - Refactor newSessionConfig() loop to handle all three transport types - Declare mcpCapabilities: { sse: true, http: true } in agentCapabilities so IDE clients know this agent supports these transports without needing a transparent proxy - Export the three helper functions for unit testing Tests: - Unit tests for toStdioServer / toSseServer / toHttpServer helpers (type discrimination, mutual exclusion) - Integration-style tests for QwenAgent.initialize() mcpCapabilities - Integration-style tests for newSession() with SSE/HTTP MCP servers, verifying MCPServerConfig is constructed with the correct arguments (url vs httpUrl, headers passthrough, empty-headers → undefined) Fixes QwenLM#3472 * fix(openai): when samplingParams is set, pass it through verbatim Previously pipeline.ts always hardcoded max_tokens as the output-token parameter name on the OpenAI-compatible path, falling back from samplingParams.max_tokens to request.config.maxOutputTokens to provider defaults. This broke GPT-5 / o-series on OpenAI and Azure OpenAI, which require max_completion_tokens and reject max_tokens with a 400 error. Fix: when the user provides samplingParams explicitly, treat it as the complete source of truth for the wire shape and pass its keys through verbatim. No client-injected defaults, no request fallbacks, no hardcoded parameter names. The user describes what the provider wants; the client trusts them. When samplingParams is absent, the historical default behavior (request fallback through temperature/top_p/.../max_tokens plus provider defaults) is preserved unchanged — existing users see no difference. Concretely, users can now set any of: samplingParams: { max_tokens: 4096 } # GPT-4 / Qwen / DeepSeek samplingParams: { max_completion_tokens: 4096 } # GPT-5 / o-series samplingParams: { reasoning_effort: 'medium' } # future knobs without waiting for a qwen-code release that adds model-specific branches. Signed-off-by: Gordon Lam (SH) <yeelam@microsoft.com> --------- Signed-off-by: Gordon Lam (SH) <yeelam@microsoft.com> Co-authored-by: Automaker <automaker@localhost> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: tanzhenxin <tanzhenxing1987@gmail.com> Co-authored-by: Yuchen Fu <fuyuchen0904@163.com> Co-authored-by: LaZzyMan <zeusdream7@gmail.com> Co-authored-by: Gordon Lam (SH) <yeelam@microsoft.com>
…enLM#3458) Previously pipeline.ts always hardcoded max_tokens as the output-token parameter name on the OpenAI-compatible path, falling back from samplingParams.max_tokens to request.config.maxOutputTokens to provider defaults. This broke GPT-5 / o-series on OpenAI and Azure OpenAI, which require max_completion_tokens and reject max_tokens with a 400 error. Fix: when the user provides samplingParams explicitly, treat it as the complete source of truth for the wire shape and pass its keys through verbatim. No client-injected defaults, no request fallbacks, no hardcoded parameter names. The user describes what the provider wants; the client trusts them. When samplingParams is absent, the historical default behavior (request fallback through temperature/top_p/.../max_tokens plus provider defaults) is preserved unchanged — existing users see no difference. Concretely, users can now set any of: samplingParams: { max_tokens: 4096 } # GPT-4 / Qwen / DeepSeek samplingParams: { max_completion_tokens: 4096 } # GPT-5 / o-series samplingParams: { reasoning_effort: 'medium' } # future knobs without waiting for a qwen-code release that adds model-specific branches. Signed-off-by: Gordon Lam (SH) <yeelam@microsoft.com>
What's broken
Azure GPT-5 / o-series deployments are unusable on the OpenAI-compatible code path, because the client unconditionally injects
max_tokenson the wire and these models reject it:No user config can opt out —
max_tokensis added in two places even whensamplingParamsdoesn't mention it:buildSamplingParameters(pipeline.ts) — hardcoded viaaddParameterIfDefined.applyOutputTokenLimit(provider/default.ts) — downstream default injectingmax_tokens: 8000when absent.Why
Parameter names are a moving target (GPT-5 →
max_completion_tokens, future reasoning knobs, provider-specific fields). Hardcoding per-model translation locks every new parameter behind a client release and misfires on custom deployment names, proxies, and Azure endpoints. Simpler primitive: let the user own the wire shape.The fix
Make
samplingParamsan opt-in, verbatim passthrough:samplingParamsset (even to{}) → wire carries exactly those keys. Both injection sites early-return.samplingParamsabsent → existing behavior preserved byte-for-byte.Also adds
[key: string]: unknownto thesamplingParamstype so users can pass arbitrary provider-specific keys (max_completion_tokens,reasoning_effort,verbosity, …) without waiting for a client release.Test plan
pipeline.test.ts— arbitrary-keys passthrough, no-cap passthrough, original behavior preserved whensamplingParamsabsent.provider/default.test.ts—applyOutputTokenLimitskips injection; a user-suppliedmax_tokenspasses through verbatim even when it exceeds the known model cap.All existing tests pass.
Example
Azure GPT-5 deployment (no longer accepts
max_tokens,temperature, ortop_p):{ "modelProviders": { "openai": [{ "id": "gpt-5", "baseUrl": "https://<your-azure>.cognitiveservices.azure.com/openai/v1/", "generationConfig": { "samplingParams": { "max_completion_tokens": 4096 } } }] } }Wire request carries exactly
max_completion_tokens: 4096.samplingParams: {}also works — passthrough engaged, nothing sent, deployment defaults apply.