Skip to content

fix(agents): repair malformed tool-call args on openai-completions#70294

Merged
steipete merged 6 commits into
openclaw:mainfrom
MonkeyLeeT:codex/fix-openai-completions-toolcall-repair
Apr 23, 2026
Merged

fix(agents): repair malformed tool-call args on openai-completions#70294
steipete merged 6 commits into
openclaw:mainfrom
MonkeyLeeT:codex/fix-openai-completions-toolcall-repair

Conversation

@MonkeyLeeT

Copy link
Copy Markdown
Contributor

Summary

  • Problem: self-hosted Kimi/SGLang routes on openai-completions could emit streamed tool-call arguments in shapes the existing repair wrapper knows how to fix, but the runner only enabled that wrapper for anthropic-messages + Kimi.
  • Why it matters: fragmented or malformed tool-call args could reach tool dispatch as empty or unusable objects, breaking the turn.
  • What changed: renamed the stale Anthropic-only gate to shouldRepairMalformedToolCallArguments, kept the existing Kimi provider behavior, and additionally enabled the repair wrapper for openai-completions transports. Added focused guard tests and generalized the repair log wording.
  • What did NOT change (scope boundary): this does not rewrite the repair algorithm itself, broaden the helper to every transport, or change tool-call ID handling.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Root Cause (if applicable)

  • Root cause: the malformed-argument repair wrapper already existed, but run/attempt.ts only enabled it when params.model.api === "anthropic-messages" and the provider normalized to Kimi.
  • Missing detection / guardrail: there was no direct test locking in the enablement gate itself, only wrapper-behavior tests once the wrapper was already active.
  • Contributing context (if known): self-hosted Kimi/SGLang reports the problematic stream shape over openai-completions, so the repair path stayed dormant even though the runtime already had logic for this class of malformed stream.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file:
    • src/agents/pi-embedded-runner/run/attempt.tool-call-argument-repair.test.ts
    • src/agents/pi-embedded-runner/run/attempt.test.ts
  • Scenario the test should lock in: Kimi on anthropic-messages still enables repair, openai-completions now enables repair even for non-Kimi provider ids, and unrelated transports remain off.
  • Why this is the smallest reliable guardrail: the bug was in wrapper enablement, not in the repair parser itself.
  • Existing test that already covers this (if any): src/agents/pi-embedded-runner/run/attempt.test.ts already covers the live wrapper behavior after enablement.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

  • Self-hosted openai-completions backends such as Kimi/SGLang now get the existing malformed streamed tool-call argument repair path, reducing empty-argument or unusable-argument tool dispatch failures.

Diagram (if applicable)

Before:
[self-hosted openai-completions stream] -> [wrapper gate off] -> [malformed args reach tool dispatch]

After:
[self-hosted openai-completions stream] -> [wrapper gate on] -> [existing repair wrapper normalizes args]

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: local pnpm workspace checkout
  • Model/provider: self-hosted Kimi/SGLang on openai-completions; existing Kimi anthropic-messages path preserved
  • Integration/channel (if any): N/A
  • Relevant config (redacted): N/A

Steps

  1. Run a model on openai-completions that streams fragmented tool-call arguments.
  2. Observe the embedded runner stream wrapper selection in run/attempt.ts.
  3. Verify the malformed-argument repair wrapper is now installed for that transport and existing repair tests remain green.

Expected

  • The runner enables malformed tool-call argument repair for openai-completions and preserves existing Kimi behavior.

Actual

  • Matches expected in local verification.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios:
    • pnpm check:changed
    • pnpm tsgo
    • pnpm test src/agents/pi-embedded-runner/run/attempt.tool-call-argument-repair.test.ts src/agents/pi-embedded-runner/run/attempt.test.ts
  • Edge cases checked:
    • normalized Kimi provider ids still return true on anthropic-messages
    • openai-completions now returns true even when the provider id is not normalized to Kimi
    • openai-responses remains disabled
  • What you did not verify:
    • live round-trip against a real external SGLang endpoint from this PR branch

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: enabling the wrapper for every openai-completions route is broader than a Kimi-only gate and could affect other OpenAI-compatible backends.
    • Mitigation: the wrapper already has targeted repair heuristics and existing tests remain green; unrelated non-openai-completions transports remain unchanged.
  • Risk: future parser changes might regress the enablement contract again.
    • Mitigation: the new focused gate tests lock in the intended transport/provider matrix.

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S labels Apr 22, 2026
@greptile-apps

greptile-apps Bot commented Apr 22, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes the malformed tool-call argument repair wrapper being gated only to anthropic-messages+Kimi, now also enabling it for all openai-completions transports. It also bundles several clean-up fixes: retiring the always-empty builtInTools in favor of sessionToolAllowlist, adding a fallback in resolveApiKeyForProvider when the runtime module lacks the function, and preserving pre-existing plugin IDs when building QA image-generation config patches. The previous reviewer concern about Kimi unintentionally broadening to all transports is explicitly addressed and tested.

Confidence Score: 5/5

Safe to merge — all changes are well-scoped, the prior Kimi-scoping concern is resolved, and new gate tests lock in the intended transport/provider matrix.

All remaining findings are P2 or lower. The core fix is correct and covered by four focused tests. The ancillary changes (builtInTools removal, resolveApiKeyForProvider fallback, plugin-ID preservation) are internally consistent and have matching test updates.

No files require special attention.

Reviews (2): Last reviewed commit: "docs: credit openai-completions repair P..." | Re-trigger Greptile

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 21301ee53b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/agents/pi-embedded-runner/run/attempt.tool-call-argument-repair.ts Outdated
@MonkeyLeeT MonkeyLeeT force-pushed the codex/fix-openai-completions-toolcall-repair branch 3 times, most recently from ca1a763 to 008f4a9 Compare April 22, 2026 21:05
@steipete steipete force-pushed the codex/fix-openai-completions-toolcall-repair branch from 008f4a9 to f2d511f Compare April 23, 2026 00:35
@steipete steipete closed this Apr 23, 2026
@steipete steipete reopened this Apr 23, 2026
@steipete steipete force-pushed the codex/fix-openai-completions-toolcall-repair branch from 43f910b to f47a60e Compare April 23, 2026 01:54

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f47a60e2c0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@@ -71,12 +85,15 @@ async function resolveGeneratedImagePath(params: {
}

async function ensureImageGenerationConfigured(env: QaSuiteRuntimeEnv) {
const snapshot = await readConfigSnapshot(env);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Retry pre-patch config snapshot reads during gateway restarts

Calling readConfigSnapshot(env) directly here introduces a new hard-fail point that bypasses patchConfig’s restart/hash-conflict retry logic (runConfigMutation in suite-runtime-gateway.ts). If ensureImageGenerationConfigured runs while the gateway is restarting (a normal QA flow race), this await can throw on config.get (e.g., closed/restarting gateway) and abort media setup before patchConfig gets a chance to recover, making the scenario flaky compared with the previous implementation.

Useful? React with 👍 / 👎.

@steipete steipete force-pushed the codex/fix-openai-completions-toolcall-repair branch from f47a60e to 08b2f52 Compare April 23, 2026 02:09
@steipete

Copy link
Copy Markdown
Contributor

Maintainer review: no blocker from my pass.

The repair gate now has the shape I want: Kimi repair remains scoped to anthropic-messages, openai-completions opts into the existing malformed-args repair path, and openai-responses stays out. The focused gate tests cover that matrix. I also checked the ancillary tool allowlist / QA config / provider-auth fallback changes in this branch; they look consistent and covered.

Recommendation: merge once the remaining pending install-smoke check finishes green.

@varoudis

Copy link
Copy Markdown

@steipete @skyfallsin

Just to flag, as far as I can tell, #70294 doesn't cover two other wire-level issues on the Kimi/SGLang + openai-completions route:

  1. Tool-call IDs get sanitized on replay (functions.read:0 -> functionsread0), because the sglang provider still hits the generic openai-completions default. A provider-owned opt-out (same shape as the recent Moonshot fix) would fix it.

  2. Historical reasoning_content is replayed into assistant messages on the next request, which confuses Kimi's chat template. A narrow strip for openai-completions + sglang would handle it.

ogt-redknie pushed a commit to ogt-redknie/OPENX that referenced this pull request May 2, 2026
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
globalcaos pushed a commit to globalcaos/tinkerclaw that referenced this pull request May 13, 2026
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 24, 2026
jameslcowan pushed a commit to jameslcowan/openclaw that referenced this pull request Jun 2, 2026
@MonkeyLeeT MonkeyLeeT deleted the codex/fix-openai-completions-toolcall-repair branch June 4, 2026 06:19
sablehead pushed a commit to sablehead/openclaw that referenced this pull request Jun 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: openai-completions: Kimi/SGLang streamed tool-call arguments can arrive in a shape that breaks tool dispatch

3 participants