Skip to content

fix(openclaw): canonicalize mixed Kimi tool calls#4040

Merged
jyaunches merged 1 commit into
mainfrom
fix/kimi-mixed-toolcall-canonicalization
May 22, 2026
Merged

fix(openclaw): canonicalize mixed Kimi tool calls#4040
jyaunches merged 1 commit into
mainfrom
fix/kimi-mixed-toolcall-canonicalization

Conversation

@jyaunches

@jyaunches jyaunches commented May 22, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR fixes the Kimi K2.6/OpenClaw compatibility plugin after the latest OpenClaw runtime can persist a mixed streamed state containing already-split hostname/date calls plus the original combined hostname; date; uptime call. The plugin now canonicalizes that mixed safe diagnostic shape back to exactly hostname, date, and uptime so the Kimi E2E trajectory acceptance check can pass.

Related Issue

Fixes #2620

Changes

  • Refactors the Kimi exec splitter to reuse safe exec command extraction across single-call and mixed-call rewrites.
  • Adds canonicalization for the observed mixed streamed state while preserving the allowlist-only behavior for hostname, date, and uptime.
  • Adds a unit regression covering the exact hostname, date, hostname; date; uptime trajectory shape from kimi-inference-compat-e2e.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Focused verification run:

  • npm ci --ignore-scripts
  • npx vitest run test/kimi-inference-compat-plugin.test.ts
  • npm run build:cli
  • git diff --check

AI Disclosure

  • AI-assisted — tool: pi coding agent

Signed-off-by: Julie Yaunches jyaunches@nvidia.com

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved Kimi inference compatibility layer's handling of command-chain execution with enhanced support for different command sequence formats and refined canonicalization.
  • Tests

    • Added test coverage for complex, mixed command-execution scenarios.

Review Change Stack

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
@jyaunches jyaunches self-assigned this May 22, 2026
@coderabbitai

coderabbitai Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e4a142d5-a66d-4066-aa3c-b133e9545859

📥 Commits

Reviewing files that changed from the base of the PR and between 1bdb519 and 46927b8.

📒 Files selected for processing (2)
  • nemoclaw-blueprint/openclaw-plugins/kimi-inference-compat/index.js
  • test/kimi-inference-compat-plugin.test.ts

📝 Walkthrough

Walkthrough

The PR refactors safe exec tool-call rewriting in the Kimi inference compatibility layer. It adds a helper to extract individual exec commands, introduces two new rewrite strategies for parsing both single combined calls and canonical multi-toolCall sequences, and integrates the new logic into message and event rewriting paths with a corresponding test.

Changes

Safe Exec Tool-Call Canonicalization

Layer / File(s) Summary
Safe exec command extraction helper
nemoclaw-blueprint/openclaw-plugins/kimi-inference-compat/index.js
Introduces getSafeExecToolCallCommand(toolCall) to extract and validate individual exec commands, establishing the foundation for expanded rewrite logic.
Core rewrite logic with dual-strategy support
nemoclaw-blueprint/openclaw-plugins/kimi-inference-compat/index.js
Adds three new functions: getSafeCombinedExecToolCall(message) to parse single combined exec calls with semicolon-delimited commands, getCanonicalSafeExecToolCallSequence(message) to recognize and validate canonical expanded sequences, and getSafeExecRewrite(message) to select between strategies.
Message and event rewriting integration
nemoclaw-blueprint/openclaw-plugins/kimi-inference-compat/index.js
Updates rewriteSafeCombinedExecToolCallInMessage and rewriteSafeCombinedExecToolCallInEvent to use the new getSafeExecRewrite() selector instead of the older combined-only rewrite function, handling both partial and complete message streams.
Test verification of mixed split and combined calls
test/kimi-inference-compat-plugin.test.ts
Adds a test verifying the plugin canonicalizes messages with mixed already-split streamed tool calls and the original combined safe exec call, rewriting the content to three split tool calls with stable split ids.

Sequence Diagram

sequenceDiagram
  participant Message as Kimi Message
  participant Rewrite as getSafeExecRewrite
  participant Combined as getSafeCombinedExecToolCall
  participant Canonical as getCanonicalSafeExecToolCallSequence
  participant Result as Rewritten Message
  
  Message->>Rewrite: Check message content
  alt Single combined exec call
    Rewrite->>Combined: Parse delimited command chain
    Combined->>Combined: Validate hostname date uptime sequence
    Combined->>Result: Split into 3 exec toolCalls
  else Multiple exec toolCalls
    Rewrite->>Canonical: Recognize canonical sequence
    Canonical->>Canonical: Verify 2+ toolCalls match expected set
    Canonical->>Result: Deduplicate and order canonically
  else No match
    Rewrite->>Result: Return no rewrite
  end
Loading

🎯 3 (Moderate) | ⏱️ ~20 minutes

🐰 A fuzzy bunny hops through the toolkit,
Finding lost exec calls hiding in the brush—
Hostname, date, and uptime align in a row,
No more mid-task resets when logic flows!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(openclaw): canonicalize mixed Kimi tool calls' directly describes the main code change—refactoring the Kimi exec splitter to canonicalize mixed streamed tool calls.
Linked Issues check ✅ Passed The PR addresses issue #2620 by fixing mixed tool-call parsing that was causing agent context loss, and adds a regression test for the exact observed trajectory (hostname, date, hostname; date; uptime).
Out of Scope Changes check ✅ Passed All changes are focused on Kimi exec tool-call rewriting logic and a directly related regression test; no unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/kimi-mixed-toolcall-canonicalization

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: kimi-inference-compat-e2e
Optional E2E: inference-routing-e2e, messaging-compatible-endpoint-e2e

Dispatch hint: kimi-inference-compat-e2e

Auto-dispatched E2E: kimi-inference-compat-e2e via nightly-e2e.yaml at 46927b8126acbf6371ce335ffbb71989f490fc85nightly run

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • kimi-inference-compat-e2e (medium-high, sandbox/onboard plus hermetic mock endpoint, timeout 45 minutes): Direct existing E2E coverage for this plugin: onboards OpenClaw with a hermetic OpenAI-compatible Kimi mock, exercises streamed Kimi tool-call output through inference.local, and verifies the sandbox trajectory records clean split exec calls instead of a combined semicolon command.

Optional E2E

  • inference-routing-e2e (medium, sandbox inference routing validation): Useful adjacent confidence that the broader inference.local routing path still works, but the PR is provider/plugin-specific and the Kimi compatibility E2E is the merge-blocking signal.
  • messaging-compatible-endpoint-e2e (medium-high, messaging plus compatible endpoint sandbox flow): Optional adjacent OpenAI-compatible endpoint coverage through inference.local; helpful if reviewers want extra confidence that compatible-endpoint streaming/routing behavior remains intact.

New E2E recommendations

  • Kimi streamed mixed split/combined tool-call canonicalization (medium): The new unit test covers a mixed streamed sequence containing already split safe exec calls plus the original combined call, but the existing Kimi E2E mock appears to emit only one combined hostname; date; uptime tool call. Add an E2E mode or assertion to test this mixed streaming shape through the real OpenClaw trajectory path.
    • Suggested test: Extend test/e2e/test-kimi-inference-compat.sh with a mock response mode that emits mixed split tool calls plus the original combined call and verifies the final trajectory contains exactly hostname, date, uptime once each with no combined semicolon command.

Dispatch hint

  • Workflow: .github/workflows/nightly-e2e.yaml
  • jobs input: kimi-inference-compat-e2e

@github-actions

Copy link
Copy Markdown
Contributor

PR Review Advisor

Recommendation: blocked
Confidence: medium
Analyzed HEAD: 46927b8126acbf6371ce335ffbb71989f490fc85
Findings: 1 blocker(s), 3 warning(s), 0 suggestion(s)

This is an automated advisory review. A human maintainer must make the final merge decision.

Limitations: Review used deterministic PR context, the provided diff, and read-only file inspection; no scripts, package-manager commands, or tests were executed.; The required E2E run was reported as auto-dispatched, but no completed passing kimi-inference-compat-e2e result for the head SHA was available in the gathered status rollup.; Review thread state was not fully available beyond the provided GraphQL/comment context.; The linked issue is already closed by PR #3046, so acceptance mapping includes historical clauses that this follow-up only partially exercises.

Workflow run

Full advisor summary

PR Review Advisor

Base: origin/main
Head: HEAD
Analyzed SHA: 46927b8126acbf6371ce335ffbb71989f490fc85
Recommendation: blocked
Confidence: medium

The code change is narrowly scoped and has a focused unit regression, but mergeability is blocked and the required Kimi E2E for the head SHA has not been observed as passing.

Gate status

  • CI: pass — 5 required status context(s) completed with no failures for 46927b8: checks, commit-lint, dco-check, check-hash, changes. Non-required contexts were still pending/in-progress in the gathered context.
  • Mergeability: fail — GitHub GraphQL reports mergeStateStatus=BLOCKED and reviewDecision=REVIEW_REQUIRED for PR fix(openclaw): canonicalize mixed Kimi tool calls #4040.
  • Review threads: unknown — No review thread state was available in the deterministic context; GraphQL reviewThreads.nodes was empty, but the advisory context marks review thread state unavailable.
  • Risky code tested: warning — Risky areas detected: credentials/inference/network and sandbox/policy/SSRF. A unit test was added for mixed Kimi tool-call canonicalization, but the E2E advisor explicitly recommends Kimi E2E coverage and a new mixed-streaming E2E mode.

🔴 Blockers

  • PR is currently merge-blocked: The hard gate for mergeability is not satisfied. GitHub reports the PR as blocked and review is still required.
    • Recommendation: Resolve the repository merge blockers and required review state, then re-evaluate the PR at the same head SHA.
    • Evidence: mergeStateStatus=BLOCKED; reviewDecision=REVIEW_REQUIRED; mergeability gate status fail.

🟡 Warnings

  • Required Kimi E2E result not confirmed for the head SHA: The E2E Advisor found that kimi-inference-compat-e2e is required for this provider/plugin change and was auto-dispatched, but the gathered status context does not include a completed passing result for that required E2E job at 46927b8.
    • Recommendation: Wait for and verify a passing kimi-inference-compat-e2e result for head SHA 46927b8 before treating the risky inference/sandbox behavior as validated.
    • Evidence: E2E Advisor comment: Required E2E: kimi-inference-compat-e2e; Auto-dispatched via nightly-e2e.yaml at 46927b8. No matching passed required job was present in the gathered rollup.
  • Mixed split-plus-combined streaming shape lacks E2E coverage (test/e2e/test-kimi-inference-compat.sh): The unit test covers the exact mixed message shape, but the E2E Advisor notes the existing Kimi E2E mock appears to emit only one combined hostname/date/uptime tool call and does not exercise the newly added mixed streamed split-plus-combined trajectory through the real OpenClaw persistence path.
    • Recommendation: Extend the Kimi inference compatibility E2E mock or assertions to emit mixed split tool calls plus the original combined call and verify the final trajectory contains exactly hostname, date, and uptime once each with no persisted semicolon command.
    • Evidence: E2E Advisor: 'Add an E2E mode or assertion to test this mixed streaming shape through the real OpenClaw trajectory path.'
  • Active overlapping PR touches the same compatibility plugin and tests: The deterministic drift evidence reports open PR chore: upgrade agent runtime dependencies #3925 touching both changed files. This creates rebase and behavior-drift risk for Kimi/OpenClaw runtime compatibility logic.

🔵 Suggestions

  • None.

Acceptance coverage

  • partial — During a multi-step task, the agent abandons its in-progress plan partway through and emits a brand-new conversational turn asking the user what they want — completely forgetting the task context.: The patch targets Kimi tool-call canonicalization that can affect persisted/replayed tool-call context, but the provided diff does not directly test abandonment of an in-progress plan or a new conversational reset.
  • unknown — Platform: macOS: The changed unit test is platform-neutral; gathered checks show macos-e2e success, but no macOS-specific result for the required Kimi mixed canonicalization scenario was provided.
  • partial — Model: moonshotai/kimi-k2.5 (NVIDIA Endpoints variant): The changed plugin is the Kimi inference compatibility plugin and the E2E advisor identifies kimi-inference-compat-e2e, but the unit test uses mocked plugin inputs rather than the real NVIDIA Endpoints model path.
  • unknown — Surface: OpenClaw Web UI at 127.0.0.1:18789/chat?session=: No Web UI or session-level test evidence is present in the diff; only plugin-level unit coverage was added.
  • unknown — Fresh install + nemoclaw onboard.: No installer/onboard path is changed or tested in the diff. The E2E Advisor says the required Kimi E2E onboards OpenClaw, but a passed result for this head SHA was not available.
  • unknown — Open the dashboard URL, start a fresh chat session.: The diff does not include dashboard or chat-session E2E assertions.
  • partial — Send this prompt: Run these 3 commands in the sandbox shell, one at a time, and after each command explain what its output means in 2 sentences: 1. hostname 2. date 3. uptime When done, summarize all three findings in a single paragraph that ties them together.: The added unit test constructs the resulting tool-call shape for hostname, date, and hostname; date; uptime, but it does not exercise the full natural-language prompt or explanation/summary behavior.
  • met — 3 exec calls in order: hostname → date → uptime.: The new unit test 'canonicalizes mixed streamed split calls plus the original combined call' expects message.content to become three exec tool calls with commands hostname, date, and uptime in order.
  • missing — 2-sentence explanation after each.: No added or changed test verifies explanatory text segments or sentence counts after each command.
  • missing — 1 final summary paragraph.: No added or changed test verifies a final summary paragraph after all tool results.
  • partial — Total: 3 tool calls + 4 text segments.: The unit test verifies exactly 3 tool calls after canonicalization, but not the 4 text segments.
  • partial — hostname exec ✓; full 2-sentence explanation rendered correctly.: The new test covers hostname as a canonical exec call, but not rendered explanation content.
  • unknown — agent emits several Tool not found / Unknown tool calls (separate UX bug, see Related).: The patch does not modify unknown-tool UX handling and no test evidence addresses this clause.
  • partial — date exec ✓; explanation begins streaming but is truncated mid-sentence at "...you sent the initial" (no period, no continuation).: The new test covers date as a canonical exec call; it does not exercise streaming truncation or partial sentence continuation.
  • missing — agent emits a brand-new turn: Hey! You had me in the middle of running those three commands — I got through hostname and date, but hadn't run uptime yet. Want me to finish that up and give you the summary? Or if you're onto something else, what's up?: No test covers prevention of a new assistant reset turn after partial tool execution.
  • partial — uptime is never executed; the summary paragraph never appears.: The unit regression ensures uptime is added to the canonicalized exec calls for the mixed tool-call state, but it does not prove the sandbox actually executes uptime or emits the summary.
  • partial — Multi-step tasks become unreliable: agent may abandon any in-progress plan and ask the user to restart, even when it has the data and tools to finish.: The code addresses one concrete Kimi tool-call persistence shape that can contribute to this failure; broader multi-step reliability is not proven by the unit test.
  • unknown — Forces the user to manually resume / re-prompt, losing time and context.: No user-flow or chat-session test verifies that manual resume is no longer required.
  • partial — Combined with the related streaming-truncation bug, makes long-form tool-using interactions effectively unusable on this build/model combo.: The plugin change handles a mixed streamed state, but does not demonstrate complete long-form interaction recovery.
  • partial — Investigate whether the truncation event (incomplete final assistant message at 8:42 PM above) is being parsed by the agent loop as a successful turn end, causing the next iteration to "forget" it was in the middle of a plan.: The patch investigates/handles persisted mixed streamed tool-call content, but does not directly test successful-turn-end parsing.

Security review

  • pass — 1. Secrets and Credentials: No hardcoded secrets, API keys, passwords, tokens, credential files, or connection strings were added. The modified code handles in-memory tool-call arguments only.
  • pass — 2. Input Validation and Data Sanitization: The exec rewrite path strictly requires toolCall type, exec name, a single command argument, JSON/object arguments only, and exact allowlisted commands hostname/date/uptime. Unsafe strings such as shell operators, redirection, substitutions, unknown commands, and trailing semicolons remain covered by negative tests.
  • pass — 3. Authentication and Authorization: No authentication or authorization paths, endpoints, tokens, scopes, or privilege checks are modified by this plugin-only canonicalization change.
  • pass — 4. Dependencies and Third-Party Libraries: No dependencies, package manifests, registries, or lockfiles are changed.
  • pass — 5. Error Handling and Logging: The new helper paths fail closed by returning null/false for malformed tool calls and do not add logging that could leak tool arguments, secrets, stack traces, or internal paths.
  • pass — 6. Cryptography and Data Protection: Not applicable — no cryptographic operations or data-protection mechanisms are changed.
  • pass — 7. Configuration and Security Headers: No HTTP headers, CORS/CSP, debug settings, container images, Dockerfiles, sandbox permissions, or configuration defaults are changed.
  • warning — 8. Security Testing: Unit tests include allowlist and unsafe-command negative cases plus the new mixed canonicalization regression, but the required Kimi E2E for this head SHA was not confirmed as passing and the E2E Advisor recommends adding mixed-streaming coverage through the real OpenClaw trajectory path.
  • warning — 9. Holistic Security Posture: The change appears fail-closed and preserves the narrow allowlist, but it affects inference-to-sandbox exec tool-call boundaries, which are high risk for sandbox/policy behavior. Active overlap with PR chore: upgrade agent runtime dependencies #3925 and missing confirmed Kimi E2E leave residual integration risk.

Test / E2E status

  • Test depth: mocks_recommended — Changed code has inference/provider stream behavior and affects sandbox exec tool-call persistence. The added unit regression covers the mixed split-plus-combined shape and existing negative cases cover unsafe command strings, but behavioral mocks or E2E are still needed for stream/event and trajectory boundaries.
  • E2E Advisor: missing
  • Required E2E jobs: kimi-inference-compat-e2e
  • Missing for analyzed SHA: kimi-inference-compat-e2e

✅ What looks good

  • The patch is narrowly scoped to the Kimi inference compatibility plugin and its focused unit test.
  • The rewrite remains constrained to exec tool calls with a single command argument and exact safe diagnostic commands.
  • The new regression test covers the observed mixed streamed split-plus-combined shape and verifies stable canonical split IDs.
  • Existing unsafe command negative tests remain in place for shell metacharacter and unknown-command cases.
  • Codebase drift evidence shows the files still exist and recent history is relevant to Kimi/OpenClaw compatibility work.

Review completeness

  • Review used deterministic PR context, the provided diff, and read-only file inspection; no scripts, package-manager commands, or tests were executed.
  • The required E2E run was reported as auto-dispatched, but no completed passing kimi-inference-compat-e2e result for the head SHA was available in the gathered status rollup.
  • Review thread state was not fully available beyond the provided GraphQL/comment context.
  • The linked issue is already closed by PR fix: support reasoning models in the OpenClaw harness #3046, so acceptance mapping includes historical clauses that this follow-up only partially exercises.
  • Human maintainer review required: yes

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26263352050
Target ref: 46927b8126acbf6371ce335ffbb71989f490fc85
Workflow ref: main
Requested jobs: kimi-inference-compat-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
kimi-inference-compat-e2e ✅ success

@jyaunches jyaunches merged commit 45e1f4c into main May 22, 2026
30 checks passed
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026
@jyaunches jyaunches deleted the fix/kimi-mixed-toolcall-canonicalization branch June 12, 2026 13:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[NemoClaw][macOS][Agent&Skills] Agent abandons multi-step task and starts new conversation after partial response — task context lost mid-run

3 participants