
fix(inference): suppress kimi reasoning output #3244

Merged
ericksoa merged 1 commit into NVIDIA:main from yimoj:fix/3177-kimi-reasoning-output on May 9, 2026


yimoj (Contributor) commented on May 8, 2026

Summary

Suppresses Kimi K2.6 managed inference thinking/reasoning output so that failed tool-call turns do not leak provider reasoning into user-visible agent output. The runtime request patch now disables Kimi thinking, and the Kimi OpenClaw compatibility plugin defensively strips reasoning fields and events.

Related Issue

Fixes #3177

Changes

  • Add Kimi K2.6 to the managed inference request preload and merge chat_template_kwargs.thinking=false.
  • Filter Kimi reasoning/thinking fields, blocks, and stream events in the Kimi compatibility plugin while preserving text and tool-call deltas.
  • Align Kimi onboarding probes with the runtime thinking suppression payload.
  • Add regression tests for non-streaming and streaming failed-tool-call responses with reasoning fields.
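
The first change above (merging chat_template_kwargs.thinking=false into the request) can be sketched roughly as follows. This is a minimal illustration, not the preload's actual code: the helper name `withThinkingDisabled`, the placeholder model id, and the `enable_tools` kwarg are all assumptions made for the example.

```javascript
// Hypothetical helper: returns a copy of a chat-completions request body with
// chat_template_kwargs.thinking forced to false.
function withThinkingDisabled(body) {
  const existing = body.chat_template_kwargs;
  // Keep caller-supplied kwargs only when they are a plain (non-array) object.
  const base =
    existing !== null && typeof existing === "object" && !Array.isArray(existing)
      ? existing
      : {};
  return { ...body, chat_template_kwargs: { ...base, thinking: false } };
}

const patched = withThinkingDisabled({
  model: "example/kimi-k2.6", // placeholder model id, not taken from the PR
  messages: [{ role: "user", content: "hi" }],
  chat_template_kwargs: { enable_tools: true }, // hypothetical existing kwarg
});
console.log(JSON.stringify(patched.chat_template_kwargs));
// prints {"enable_tools":true,"thinking":false}
```

Returning a copy rather than mutating the input keeps the sketch easy to test; the real preload may well patch the body in place.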

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

Targeted tests and the Kimi compatibility E2E passed. Full hook/test suites were run but are not marked passing because unrelated environment/test failures remain (test/fetch-guard-patch-regression.test.ts host permission issue; earlier installer-preflight failures in the full run).

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Additional verification run:

  • npm test -- test/kimi-inference-compat-plugin.test.ts test/nemotron-inference-fix.test.ts src/lib/inference/onboard-probes.test.ts test/generate-openclaw-config.test.ts test/validate-config-schemas.test.ts
  • npm run typecheck:cli
  • NEMOCLAW_SANDBOX_NAME=e2e-kimi-3177 NEMOCLAW_NON_INTERACTIVE=1 NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 NEMOCLAW_YES=1 bash test/e2e/test-kimi-inference-compat.sh

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

Summary by CodeRabbit

  • New Features

    • Added reasoning and thinking content filtering for Kimi K2.6 model outputs to ensure cleaner responses.
  • Bug Fixes

    • Disabled thinking mode for Kimi K2.6 to prevent thinking-only outputs and improve response quality.
    • Enhanced compatibility for Kimi K2.6 managed inference integration.
  • Tests

    • Updated test coverage for Kimi K2.6 payload validation and reasoning output handling.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
yimoj self-assigned this on May 8, 2026

coderabbitai (Bot, Contributor) commented on May 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6a3ec1a8-3ddf-4985-a367-a2d04daa66fa

📥 Commits

Reviewing files that changed from the base of the PR and between b1320d5 and bf6c389.

📒 Files selected for processing (8)
  • nemoclaw-blueprint/model-specific-setup/openclaw/kimi-k2.6-managed-inference.json
  • nemoclaw-blueprint/openclaw-plugins/kimi-inference-compat/index.js
  • nemoclaw-blueprint/scripts/nemotron-inference-fix.js
  • scripts/nemoclaw-start.sh
  • src/lib/inference/onboard-probes.test.ts
  • src/lib/inference/onboard-probes.ts
  • test/kimi-inference-compat-plugin.test.ts
  • test/nemotron-inference-fix.test.ts

📝 Walkthrough

Walkthrough

This PR addresses Kimi K2.6 reasoning output leakage by implementing two coordinated suppression mechanisms: request-time injection of thinking: false to disable reasoning generation at the inference API, and stream-time filtering to remove any reasoning content that still appears in responses or failures.
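
As a rough illustration of the stream-time half of this design, the sketch below filters one parsed streaming chunk at a time. The field names in `REASONING_DELTA_KEYS` and the function name `filterStreamChunk` are assumptions for the example; the plugin's real constants and structure are not shown in this thread.

```javascript
// Assumed reasoning field names; the plugin's actual constants may differ.
const REASONING_DELTA_KEYS = ["reasoning", "reasoning_content", "thinking"];

// Returns a filtered copy of one parsed streaming chunk, or null when the
// chunk carried nothing but reasoning and can be dropped entirely.
function filterStreamChunk(chunk) {
  // Chunks without choices (for example usage-only chunks) pass through as-is.
  if (!Array.isArray(chunk.choices) || chunk.choices.length === 0) return chunk;
  const choices = chunk.choices.map((choice) => {
    const delta = { ...(choice.delta || {}) };
    for (const key of REASONING_DELTA_KEYS) delete delta[key];
    return { ...choice, delta };
  });
  // Keep the chunk if any choice still has delta fields or ends the turn.
  const hasPayload = choices.some(
    (c) => Object.keys(c.delta).length > 0 || c.finish_reason != null
  );
  return hasPayload ? { ...chunk, choices } : null;
}

const chunks = [
  { choices: [{ index: 0, delta: { reasoning_content: "PRIVATE plan" } }] },
  { choices: [{ index: 0, delta: { content: "The command failed." } }] },
  { choices: [{ index: 0, delta: { tool_calls: [{ index: 0, function: { name: "exec" } }] } }] },
];
const visible = chunks.map(filterStreamChunk).filter((c) => c !== null);
console.log(visible.length); // prints 2
```

Content and tool-call deltas survive untouched, which matches the behavior the PR describes; dropping reasoning-only chunks outright is a design choice of this sketch.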

Changes

Kimi K2.6 Reasoning Output Suppression

Layer: Model Configuration & Probe Setup
File(s): nemoclaw-blueprint/model-specific-setup/openclaw/kimi-k2.6-managed-inference.json, src/lib/inference/onboard-probes.ts, src/lib/inference/onboard-probes.test.ts, scripts/nemoclaw-start.sh
Summary: Kimi K2.6 model manifest updated to document reasoning-output normalization. Probe payload for Kimi now includes chat_template_kwargs: { thinking: false } alongside existing max_tokens configuration. Probe test expectations and startup documentation both updated to reflect multi-model (DeepSeek V4 Pro + Kimi K2.6) inference fix scope.

Layer: Request Interception & Template Injection
File(s): nemoclaw-blueprint/scripts/nemotron-inference-fix.js
Summary: Preload intercepts POST /v1/chat/completions requests to Kimi K2.6 and conditionally initializes chat_template_kwargs (preserving existing non-array objects), then sets thinking: false alongside existing Nemotron model handling. Updated documentation clarifies the expanded model set and request shape.

Layer: Stream & Output Reasoning Filtering
File(s): nemoclaw-blueprint/openclaw-plugins/kimi-inference-compat/index.js
Summary: Kimi inference compatibility plugin now defines reasoning-field detection (constants for field names, event types, content block types) and adds utilities to recursively strip reasoning content from messages, choices, and streamed events. Stream chunk processing updated to apply reasoning filtering in addition to existing safe exec tool-call rewriting. Provider description and __testing exports expanded.

Layer: Integration Tests & Validation
File(s): test/kimi-inference-compat-plugin.test.ts, test/nemotron-inference-fix.test.ts
Summary: New test helpers construct failed tool-result scenarios with Kimi-specific reasoning fields. Tests verify final-message filtering removes reasoning while preserving visible content, streaming filtering drops reasoning deltas while preserving content/tool-call deltas, and request injection correctly appends thinking: false without overwriting existing template kwargs. Serialization checks confirm no "PRIVATE" reasoning fields leak into outputs.
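
The recursive stripping and the "PRIVATE" serialization check summarized above might look something like the sketch below for a non-streaming response. The key names, block types, and the `stripReasoning` function are assumptions for illustration only; the plugin exports its own helpers through `__testing`, and their names are not given here.

```javascript
// Assumed names; the real plugin defines its own constants.
const REASONING_KEYS = new Set(["reasoning", "reasoning_content", "thinking"]);
const REASONING_BLOCK_TYPES = new Set(["thinking", "reasoning"]);

// Recursively copies a response, omitting reasoning fields and content blocks.
function stripReasoning(value) {
  if (Array.isArray(value)) {
    return value
      .filter(
        (item) =>
          !(item && typeof item === "object" && REASONING_BLOCK_TYPES.has(item.type))
      )
      .map(stripReasoning);
  }
  if (value !== null && typeof value === "object") {
    const out = {};
    for (const [key, child] of Object.entries(value)) {
      if (!REASONING_KEYS.has(key)) out[key] = stripReasoning(child);
    }
    return out;
  }
  return value; // strings, numbers, booleans, null pass through unchanged
}

// A made-up failed-tool-call response carrying a "PRIVATE" marker.
const response = {
  choices: [
    {
      message: {
        role: "assistant",
        reasoning_content: "PRIVATE: retry with sudo?",
        content: [
          { type: "thinking", thinking: "PRIVATE scratchpad" },
          { type: "text", text: "The tool call failed: permission denied." },
        ],
      },
    },
  ],
};
const cleaned = stripReasoning(response);
console.log(JSON.stringify(cleaned).includes("PRIVATE")); // prints false
console.log(cleaned.choices[0].message.content[0].text);
```

Checking the serialized output for a marker string is a blunt but effective regression guard, since it catches leaks through any nested field rather than only the ones the test author thought of.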

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A cottontail's ode to thinking suppressed
When Kimi mused too loud, we passed the test,
With thinking: false whispered soft and low,
And filters that catch what the streams bestow,
The reasoning thoughts stay private, hushed—
User-facing answers, pristine and flushed! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title clearly and specifically describes the main change: suppressing Kimi reasoning output, which directly addresses the primary objective of preventing thinking/reasoning content leakage.
  • Linked Issues check: ✅ Passed. Changes comprehensively address #3177 requirements: runtime thinking suppression via chat_template_kwargs.thinking=false injection, reasoning field filtering in the Kimi plugin, and aligned onboarding probes with regression tests for failed tool calls.
  • Out of Scope Changes check: ✅ Passed. All changes are directly scoped to Kimi K2.6 reasoning suppression: managed inference config, Kimi compatibility plugin filtering, onboarding probe alignment, nemotron fix script, and corresponding tests.


ericksoa (Contributor) left a comment

Approved. I verified the Kimi path disables thinking in the runtime preload and onboarding probe payload, and the OpenClaw Kimi compat plugin strips reasoning/thinking fields while preserving normal content and tool-call deltas. Also checked the broader preload remains scoped to known affected models so non-Kimi/Nemotron/DeepSeek requests pass through unchanged.

Validation on a local no-commit merge with current main:

  • npm run build:cli
  • npm run typecheck:cli
  • npm test -- test/kimi-inference-compat-plugin.test.ts test/nemotron-inference-fix.test.ts src/lib/inference/onboard-probes.test.ts test/generate-openclaw-config.test.ts test/validate-config-schemas.test.ts
  • bash -n scripts/nemoclaw-start.sh
  • git diff --check
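
The scoping property checked in this review (requests for other models pass through unchanged) could be sketched as below. Everything here is hypothetical: the `PATCHERS` table, the substring match on the model name, and the `patchRequest` function are illustrative choices, and only the Kimi entry is shown because the handling for the other affected models is not described in this thread.

```javascript
// Hypothetical table of per-model request patchers.
const PATCHERS = [
  {
    // Placeholder substring match; the real preload's model check may differ.
    matches: (model) => model.toLowerCase().includes("kimi-k2.6"),
    // Simplification: assumes chat_template_kwargs, if present, is a plain object.
    apply: (body) => ({
      ...body,
      chat_template_kwargs: { ...(body.chat_template_kwargs || {}), thinking: false },
    }),
  },
  // Entries for the other affected models would go here.
];

// Returns a patched copy for affected models, or the very same object otherwise.
function patchRequest(method, path, body) {
  if (method !== "POST" || path !== "/v1/chat/completions") return body;
  if (!body || typeof body.model !== "string") return body;
  const patcher = PATCHERS.find((p) => p.matches(body.model));
  return patcher ? patcher.apply(body) : body;
}

const other = { model: "example/other-model", messages: [] };
console.log(patchRequest("POST", "/v1/chat/completions", other) === other); // prints true

const kimi = { model: "example/kimi-k2.6", messages: [] };
console.log(patchRequest("POST", "/v1/chat/completions", kimi).chat_template_kwargs.thinking); // prints false
```

Returning the identical object for unaffected requests makes the pass-through guarantee trivially testable with a reference comparison.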

ericksoa merged commit f012dd8 into NVIDIA:main on May 9, 2026
22 checks passed
ericksoa pushed a commit that referenced this pull request May 11, 2026
## Summary
- Add an E2E Advisor workflow that recommends required/optional E2E
tests for PRs.
- Add a semantic Pi analysis layer using the NVIDIA inference-style Pi
config template.
- Add sticky PR comments with E2E recommendations and dispatch hints.
- Add E2E manifest/rules/schema/config validation for advisor metadata.

## Safety model
- Static analysis only; does not execute PR-provided
scripts/tests/package managers.
- Pi runs with read-only tools only: read, grep, find, ls.
- Pi runs with no session, extensions, skills, prompt templates, or
context files.
- Generated Pi credential config is written under /tmp, outside uploaded
artifacts.
- PR commenting is best-effort and can use E2E_ADVISOR_GITHUB_TOKEN if
github.token lacks write permissions.

## Secrets
- PI_E2E_ADVISOR_API_KEY: required for Pi semantic analysis; otherwise
semantic analysis skips and the deterministic baseline is used.
- E2E_ADVISOR_GITHUB_TOKEN: optional repo-write token for PR comments
when github.token cannot comment.

## Validation
- npm run checks
- Manually ran advisor workflow from fork against #3244
and received a successful semantic recommendation.

## Follow-up
- Add an E2E Recommendation Gate required check that verifies
recommended E2E jobs passed for the same PR head SHA before merge.


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * Added an E2E Advisor workflow that analyzes PR changes with a
    Pi-powered advisor, produces structured recommendations, and can post a
    sticky PR comment with test suggestions.

* **Documentation**
  * Added a guide describing workflow usage, required secrets, outputs,
    and manual invocation.

* **Chores**
  * Emits stable artifacts (prompt, raw output, parsed JSON, final result,
    markdown summary) and a machine-readable schema for advisor results.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
miyoungc mentioned this pull request on May 12, 2026
12 tasks
miyoungc added a commit that referenced this pull request May 12, 2026
## Summary
Refreshes the release-prep docs for v0.0.39 based on changes merged
since the Friday 4pm doc refresh. Updates the source docs, bumps the
docs version metadata, and regenerates the NemoClaw user skills from the
refreshed docs.

## Changes
- #3314 -> `docs/get-started/prerequisites.md`,
`docs/get-started/quickstart.md`, `docs/reference/troubleshooting.md`:
Documents installer Docker setup, Docker group activation, and retry
guidance.
- #3317 -> `docs/get-started/quickstart.md`,
`docs/reference/commands.md`: Documents the DGX Spark and DGX Station
express install prompt and `NEMOCLAW_NO_EXPRESS`.
- #3328 and #3329 -> `docs/security/best-practices.md`,
`docs/deployment/sandbox-hardening.md`: Updates sandbox capability
hardening docs for the stricter bounding-set and `setpriv` step-down
behavior.
- #3330, #3335, and #3346 -> `docs/inference/use-local-inference.md`:
Documents Windows-host Ollama relaunch behavior, NIM key passthrough,
early health-fail diagnostics, and mixed-GPU preflight detail.
- #2406, #2883, #3001, #3244, #3267, #3318, #3320, and #3354 ->
`docs/about/release-notes.md`: Adds the v0.0.39 release-prep section
while keeping the v0.0.38 release notes intact.
- Advances the release-prep docs metadata from v0.0.38 to v0.0.39.
- Regenerates `.agents/skills/nemoclaw-user-*` from the updated source
docs.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes v0.0.39

* **New Features**
  * Host alias management commands for easier configuration
  * Sandbox GPU control options during onboarding
  * Update command with check and confirmation modes

* **Documentation**
  * Enhanced Linux installer guidance with Docker and group membership
    handling
  * Expanded troubleshooting for permission and connectivity issues
  * Improved capability-dropping security documentation
  * Updated inference model switching commands
  * Brev environment-specific troubleshooting

* **Improvements**
  * DGX Spark/Station express install flow
  * Windows Ollama relay and health-check enhancements
  * NVIDIA NIM preflight GPU reporting


<!-- end of auto-generated comment: release notes by coderabbit.ai -->
wscurran added the bug-fix label (PR fixes a bug or regression) on Jun 8, 2026

Labels

bug-fix: PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[macOS][Inference] Kimi K2.6 reasoning chain-of-thought leaks into user-visible agent output when tool call fails

3 participants