fix(openai): when samplingParams is set, pass it through verbatim by yeelam-gordon · Pull Request #3458 · QwenLM/qwen-code

yeelam-gordon · 2026-04-20T04:20:22Z

What's broken

Azure GPT-5 / o-series deployments are unusable on the OpenAI-compatible code path, because the client unconditionally injects max_tokens on the wire and these models reject it:

Unsupported parameter: 'max_tokens' is not supported with this model.
Use 'max_completion_tokens' instead.

No user config can opt out — max_tokens is added in two places even when samplingParams doesn't mention it:

buildSamplingParameters (pipeline.ts) — hardcoded via addParameterIfDefined.
applyOutputTokenLimit (provider/default.ts) — downstream default injecting max_tokens: 8000 when absent.

Why

Parameter names are a moving target (GPT-5 → max_completion_tokens, future reasoning knobs, provider-specific fields). Hardcoding per-model translation locks every new parameter behind a client release and misfires on custom deployment names, proxies, and Azure endpoints. Simpler primitive: let the user own the wire shape.

The fix

Make samplingParams an opt-in, verbatim passthrough:

samplingParams set (even to {}) → wire carries exactly those keys. Both injection sites early-return.
samplingParams absent → existing behavior preserved byte-for-byte.

Also adds [key: string]: unknown to the samplingParams type so users can pass arbitrary provider-specific keys (max_completion_tokens, reasoning_effort, verbosity, …) without waiting for a client release.

Test plan

pipeline.test.ts — arbitrary-keys passthrough, no-cap passthrough, original behavior preserved when samplingParams absent.
provider/default.test.ts — applyOutputTokenLimit skips injection; a user-supplied max_tokens passes through verbatim even when it exceeds the known model cap.

All existing tests pass.

Example

Azure GPT-5 deployment (no longer accepts max_tokens, temperature, or top_p):

{
  "modelProviders": {
    "openai": [{
      "id": "gpt-5",
      "baseUrl": "https://<your-azure>.cognitiveservices.azure.com/openai/v1/",
      "generationConfig": {
        "samplingParams": { "max_completion_tokens": 4096 }
      }
    }]
  }
}

Wire request carries exactly max_completion_tokens: 4096. samplingParams: {} also works — passthrough engaged, nothing sent, deployment defaults apply.

yeelam-gordon · 2026-04-20T04:29:55Z

Heads up to anyone already peeking: I just force-pushed a redesign. The earlier version tried to auto-detect GPT-5/o-series by model name and swap max_tokens -> max_completion_tokens. That encoded a moving target in the client. The new version just makes samplingParams the source of truth for the wire shape — users write whatever key the provider wants, verbatim. Much smaller diff, no model-name regex, no client-side catalog of model families. Sorry for the churn.

yeelam-gordon · 2026-04-20T04:35:28Z

Quick update: refined the design once more based on user feedback. Previous revision spread samplingParams at the end of the params object, which overrode fallbacks but still mixed them in. New revision is stricter: if the user sets samplingParams at all, we emit only its keys on the wire — no defaults, no request fallbacks, nothing else injected. If samplingParams is absent, the original code path (including the hardcoded max_tokens emission) is preserved unchanged so existing users see zero difference. Final diff is now just an early-return block: +20/-3 across 2 files.

Previously pipeline.ts always hardcoded max_tokens as the output-token parameter name on the OpenAI-compatible path, falling back from samplingParams.max_tokens to request.config.maxOutputTokens to provider defaults. This broke GPT-5 / o-series on OpenAI and Azure OpenAI, which require max_completion_tokens and reject max_tokens with a 400 error. Fix: when the user provides samplingParams explicitly, treat it as the complete source of truth for the wire shape and pass its keys through verbatim. No client-injected defaults, no request fallbacks, no hardcoded parameter names. The user describes what the provider wants; the client trusts them. When samplingParams is absent, the historical default behavior (request fallback through temperature/top_p/.../max_tokens plus provider defaults) is preserved unchanged — existing users see no difference. Concretely, users can now set any of: samplingParams: { max_tokens: 4096 } # GPT-4 / Qwen / DeepSeek samplingParams: { max_completion_tokens: 4096 } # GPT-5 / o-series samplingParams: { reasoning_effort: 'medium' } # future knobs without waiting for a qwen-code release that adds model-specific branches. Signed-off-by: Gordon Lam (SH) <yeelam@microsoft.com>

wenshao

No issues found. LGTM! ✅ — gpt-5.4 via Qwen Code /review

yeelam-gordon · 2026-04-20T09:39:22Z

@wenshao. Thanks for the review. However, since I used a forked repo, I don’t have permission to check in. I need your help with that.

wenshao · 2026-04-21T13:46:01Z

End-to-end verification results

I ran a wire-level verification that bypasses all mocks — a local HTTP server captures the exact JSON body the OpenAI SDK sends, and the script drives the real DefaultOpenAICompatibleProvider + ContentGenerationPipeline code paths. Ran it on both main and this branch for a differential.

Four scenarios, 13 wire-level assertions:

#	`samplingParams` input	This branch	`main` (baseline)
S1	`{ max_completion_tokens: 4096, reasoning_effort: 'medium', verbosity: 'low' }` (Azure GPT-5 shape)	Exactly those 3 keys on the wire	All 3 keys dropped; wire carries `temperature=0.5, top_p=0.6, max_tokens=2048`
S2	`{}` (explicit opt-out)	Zero sampling keys on the wire	`{}` ignored; legacy defaults sent anyway
S3	`undefined` (back-compat)	Legacy behavior preserved (`temperature`, `top_p`, `max_tokens` derived from request.config)	Same
S4	`{ max_tokens: 100000 }` (exceeds gpt-4 output cap)	Passes through verbatim as `100000`	Capped to `16384`

Totals

This branch: 13/13 assertions pass
main: 3/13 pass (only the back-compat scenario) — the other 10 failures are exactly the behavior this PR changes

Also run locally on this branch

vitest run src/core/openaiContentGenerator/ → 286/286 pass
tsc --noEmit → clean
eslint on the 5 changed files → clean

) Previously pipeline.ts always hardcoded max_tokens as the output-token parameter name on the OpenAI-compatible path, falling back from samplingParams.max_tokens to request.config.maxOutputTokens to provider defaults. This broke GPT-5 / o-series on OpenAI and Azure OpenAI, which require max_completion_tokens and reject max_tokens with a 400 error. Fix: when the user provides samplingParams explicitly, treat it as the complete source of truth for the wire shape and pass its keys through verbatim. No client-injected defaults, no request fallbacks, no hardcoded parameter names. The user describes what the provider wants; the client trusts them. When samplingParams is absent, the historical default behavior (request fallback through temperature/top_p/.../max_tokens plus provider defaults) is preserved unchanged — existing users see no difference. Concretely, users can now set any of: samplingParams: { max_tokens: 4096 } # GPT-4 / Qwen / DeepSeek samplingParams: { max_completion_tokens: 4096 } # GPT-5 / o-series samplingParams: { reasoning_effort: 'medium' } # future knobs without waiting for a qwen-code release that adds model-specific branches. Signed-off-by: Gordon Lam (SH) <yeelam@microsoft.com>

* fix: backport upstream trivial bug fixes (QwenLM#3499, QwenLM#3630, QwenLM#3320) Cherry-picked from QwenLM/qwen-code: - QwenLM#3499 fix(core): use empty string instead of null for reasoning-only assistant content. Some OpenAI-compatible providers (e.g. Ollama qwen3.5:9b) reject content: null with HTTP 400 when reasoning_content is also present. Tool-call-only messages keep null per OpenAI spec. - QwenLM#3630 fix(telemetry): switch FileExporter.serialize from JSON.stringify to safeJsonStringify. OTel ReadableSpans hold a BatchSpanProcessor back-reference that forms a cycle and crashed --telemetry-outfile users. - QwenLM#3320 fix(core): cap chokidar depth at 2 in SkillManager and skip .git / special file types. Prevents FD exhaustion when a skill dir contains node_modules etc., which silently broke node-pty I/O. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(core): scope StreamingToolCallParser per-stream (QwenLM#3525) Backports upstream PR QwenLM#3525 + extends the per-stream context to also cover our fork's <think>-tag parser state. Bug: every caller of Config.getContentGenerator() — foreground turns, fork subagents, run_in_background subagents, ACP concurrent Agent calls (after QwenLM#3463) — shared a single OpenAIContentConverter, which held the StreamingToolCallParser as an instance field. Concurrent streams corrupted each other's tool-call buffers, surfacing as NO_RESPONSE_TEXT. Fix: - New ConverterStreamContext interface holds toolCallParser, thinkBuffer, inThinkTag — one per stream. - createStreamContext() factory replaces resetStreamingToolCalls(). - convertOpenAIChunkToGemini(chunk, ctx) and processThinkChunk(chunk, ctx) thread the context through every parser/think-buffer access. - ContentGenerationPipeline.processStreamWithLogging creates one context at stream entry. The error path no longer manually resets — the context is GC'd when the generator unwinds. Our protoInternal recovery-note logic is preserved on the new shape. Note: upstream's follow-up QwenLM#3550 (full stateless converter refactor) is deferred — it's hygiene without a functional bug; QwenLM#3525 alone fixes the concurrency race. Tests: - New createStreamContext describe replaces resetStreamingToolCalls suite - Streaming <think> tests use a per-test context - pipeline.test.ts mock updated to match the new API - pipeline.concurrent.test.ts (from upstream commit 38edd9d) drives two real concurrent streams and asserts neither corrupts the other's tool-call output (positive control: pre-fix, this test fails with exactly the user-reported bug shape). Refs upstream QwenLM#3516, QwenLM#3525. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(core): strip thinking blocks from history on model switch (QwenLM#3304) When switching models mid-session, reasoning_content fields from thinking-capable models leaked into API requests sent to the new provider, causing 422 errors on strict OpenAI-compatible endpoints. Call stripThoughtsFromHistory() in handleModelChange() so thought parts are removed before the next request is built for the new model. * fix(core): reject truncated subagent write_file calls (QwenLM#3505) Backport of upstream QwenLM#3505. Propagates MAX_TOKENS truncation from subagent responses into tool requests and rejects truncated edit calls before schema validation can surface misleading missing-parameter errors. Adapted to our fork's coreToolScheduler.ts which already had the truncation rejection block — kept both, dropped the unused clearRetryCountsForTool() call (we don't have that retry-counter machinery yet). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(core): prevent malformed permission rules from becoming tool-wide catch-alls (QwenLM#3467) Backport of upstream QwenLM#3467. A permission rule with unbalanced parens was silently parsed with specifier: undefined, causing matchesRule to treat it as a catch-all. For deny rules this blocked all commands; for allow rules a typo could silently auto-approve everything. - Adds an invalid flag to PermissionRule - parseRule marks unbalanced-paren rules as invalid - matchesRule short-circuits invalid rules to never match - parseRules / addSession*Rule / addPersistentRule warn on malformed input - listRules filters invalid rules from /permissions UI Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(core): preserve reasoning_content during session resume and active sessions (GH#3579) * test(config): drop fork-incompatible QwenLM#3304 strip-thoughts test The test from upstream QwenLM#3304 backport assumed an in-place qwen-oauth model switch path that our fork doesn't have; the source-side fix in config.ts (stripThoughtsFromHistory call in handleModelChange) is preserved. Coverage will be re-added when the fork's switch flow stabilizes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(acp): run Agent tool calls concurrently + graceful degrade (QwenLM#3463) Backport of upstream QwenLM#3463. When the model returns multiple Agent tool calls in a single turn, ACP Session was executing them sequentially in a for-loop, multiplying latency by sub-agent count. - Add private runToolCalls() helper that mirrors coreToolScheduler's partition logic: consecutive Agent calls form a parallel batch (safe because sub-agents have no shared mutable state); other tools form sequential batches. - Replace 2 for-loops in Session.ts with runToolCalls() calls. - Switch the AgentTool eventEmitter guard from key-presence check to truthy check (commit 651979c) — the key-presence check passed for { eventEmitter: undefined } and crashed inside SubAgentTracker.setup. Note: upstream replaced 3 for-loops; our fork only had 2 in those code paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(acp): support SSE and HTTP MCP servers in ACP mode In ACP mode, the Mcp server list sent by the IDE client can include SSE (type: "sse") and HTTP (type: "http") transports, but the previous implementation only handled stdio servers via toStdioServer(). Non-stdio servers were silently skipped (continue), so any SSE/HTTP-configured MCP server would never be registered. Changes: - Add toSseServer() helper: detects type=="sse" servers and maps them to MCPServerConfig(url=..., headers=...) - Add toHttpServer() helper: detects type=="http" servers and maps them to MCPServerConfig(httpUrl=..., headers=...) - Refactor newSessionConfig() loop to handle all three transport types - Declare mcpCapabilities: { sse: true, http: true } in agentCapabilities so IDE clients know this agent supports these transports without needing a transparent proxy - Export the three helper functions for unit testing Tests: - Unit tests for toStdioServer / toSseServer / toHttpServer helpers (type discrimination, mutual exclusion) - Integration-style tests for QwenAgent.initialize() mcpCapabilities - Integration-style tests for newSession() with SSE/HTTP MCP servers, verifying MCPServerConfig is constructed with the correct arguments (url vs httpUrl, headers passthrough, empty-headers → undefined) Fixes QwenLM#3472 * fix(openai): when samplingParams is set, pass it through verbatim Previously pipeline.ts always hardcoded max_tokens as the output-token parameter name on the OpenAI-compatible path, falling back from samplingParams.max_tokens to request.config.maxOutputTokens to provider defaults. This broke GPT-5 / o-series on OpenAI and Azure OpenAI, which require max_completion_tokens and reject max_tokens with a 400 error. Fix: when the user provides samplingParams explicitly, treat it as the complete source of truth for the wire shape and pass its keys through verbatim. No client-injected defaults, no request fallbacks, no hardcoded parameter names. The user describes what the provider wants; the client trusts them. When samplingParams is absent, the historical default behavior (request fallback through temperature/top_p/.../max_tokens plus provider defaults) is preserved unchanged — existing users see no difference. Concretely, users can now set any of: samplingParams: { max_tokens: 4096 } # GPT-4 / Qwen / DeepSeek samplingParams: { max_completion_tokens: 4096 } # GPT-5 / o-series samplingParams: { reasoning_effort: 'medium' } # future knobs without waiting for a qwen-code release that adds model-specific branches. Signed-off-by: Gordon Lam (SH) <yeelam@microsoft.com> --------- Signed-off-by: Gordon Lam (SH) <yeelam@microsoft.com> Co-authored-by: Automaker <automaker@localhost> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: tanzhenxin <tanzhenxing1987@gmail.com> Co-authored-by: Yuchen Fu <fuyuchen0904@163.com> Co-authored-by: LaZzyMan <zeusdream7@gmail.com> Co-authored-by: Gordon Lam (SH) <yeelam@microsoft.com>

…enLM#3458) Previously pipeline.ts always hardcoded max_tokens as the output-token parameter name on the OpenAI-compatible path, falling back from samplingParams.max_tokens to request.config.maxOutputTokens to provider defaults. This broke GPT-5 / o-series on OpenAI and Azure OpenAI, which require max_completion_tokens and reject max_tokens with a 400 error. Fix: when the user provides samplingParams explicitly, treat it as the complete source of truth for the wire shape and pass its keys through verbatim. No client-injected defaults, no request fallbacks, no hardcoded parameter names. The user describes what the provider wants; the client trusts them. When samplingParams is absent, the historical default behavior (request fallback through temperature/top_p/.../max_tokens plus provider defaults) is preserved unchanged — existing users see no difference. Concretely, users can now set any of: samplingParams: { max_tokens: 4096 } # GPT-4 / Qwen / DeepSeek samplingParams: { max_completion_tokens: 4096 } # GPT-5 / o-series samplingParams: { reasoning_effort: 'medium' } # future knobs without waiting for a qwen-code release that adds model-specific branches. Signed-off-by: Gordon Lam (SH) <yeelam@microsoft.com>

yeelam-gordon force-pushed the fix/honor-sampling-params branch from f4131cb to 28ff82d Compare April 20, 2026 04:29

yeelam-gordon changed the title ~~fix(openai): honor samplingParams for max output tokens; support GPT-5/o-series~~ fix(openai): let samplingParams control wire shape on OpenAI-compatible path Apr 20, 2026

yeelam-gordon force-pushed the fix/honor-sampling-params branch from 28ff82d to 2adcfb6 Compare April 20, 2026 04:34

yeelam-gordon changed the title ~~fix(openai): let samplingParams control wire shape on OpenAI-compatible path~~ fix(openai): when samplingParams is set, pass it through verbatim Apr 20, 2026

yeelam-gordon force-pushed the fix/honor-sampling-params branch 5 times, most recently from 7df0882 to 56f124f Compare April 20, 2026 04:59

yeelam-gordon force-pushed the fix/honor-sampling-params branch from 56f124f to a1fce04 Compare April 20, 2026 05:15

wenshao approved these changes Apr 20, 2026

View reviewed changes

tanzhenxin added the type/bug Something isn't working as expected label Apr 21, 2026

wenshao merged commit 309b25d into QwenLM:main Apr 21, 2026
13 checks passed

mabry1985 mentioned this pull request Apr 26, 2026

fix(openai): pass samplingParams verbatim (#3458) protoLabsAI/protoCLI#116

Merged

2 tasks

mabry1985 mentioned this pull request Apr 26, 2026

promote: dev -> main (5 upstream backport PRs through v0.26.9) protoLabsAI/protoCLI#118

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(openai): when samplingParams is set, pass it through verbatim#3458

fix(openai): when samplingParams is set, pass it through verbatim#3458
wenshao merged 1 commit into
QwenLM:mainfrom
yeelam-gordon:fix/honor-sampling-params

yeelam-gordon commented Apr 20, 2026 •

edited

Loading

Uh oh!

yeelam-gordon commented Apr 20, 2026

Uh oh!

yeelam-gordon commented Apr 20, 2026

Uh oh!

wenshao left a comment

Uh oh!

yeelam-gordon commented Apr 20, 2026

Uh oh!

wenshao commented Apr 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

yeelam-gordon commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What's broken

Why

The fix

Test plan

Example

Uh oh!

yeelam-gordon commented Apr 20, 2026

Uh oh!

yeelam-gordon commented Apr 20, 2026

Uh oh!

wenshao left a comment

Choose a reason for hiding this comment

Uh oh!

yeelam-gordon commented Apr 20, 2026

Uh oh!

wenshao commented Apr 21, 2026

End-to-end verification results

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

yeelam-gordon commented Apr 20, 2026 •

edited

Loading