fix(inference): inject tool-less system prompt for Ultra 550B (#4851)#5085

Merged
cv merged 11 commits into main from feat/ultra-550b-toolless-systemprompt-4851
Jun 11, 2026

Conversation

@cjagwani

@cjagwani cjagwani commented Jun 9, 2026

Contributor

Summary

When nvidia/nemotron-3-ultra-550b-a55b is asked to perform a multi-step task (e.g. "create a file then run it") and the request has no system message and no execution-capable tools, the model plans all steps in reasoning_content but silently drops intermediate steps from content. The reporter saw content with only the final run command; in my repro content was empty. chat_template_kwargs.force_nonempty_content does not help — verified by direct curl to NVIDIA Endpoints with and without that kwarg.

Extends the existing nemotron-inference-fix preload to also inject a one-paragraph system message when the model matches Ultra 550B AND the caller supplied no system message AND no execution-capable tools (bash_execute, write_file, etc. — tight allowlist, not a substring regex). Scoped narrowly so it never overrides caller intent, never interferes with tool-using flows, and doesn't trip on harmless business tools like create_ticket or run_query.
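
The three gating conditions described above can be sketched roughly as follows. This is a minimal illustration, not the preload's actual code: the prompt wording, the prefix-based model match, and the abbreviated allowlist are assumptions made for the example. Only the names `EXECUTION_TOOL_NAMES` and `applyToolLessSystemPrompt` come from this PR.

```javascript
// Abbreviated, illustrative allowlist; the Set in the PR is longer.
const EXECUTION_TOOL_NAMES = new Set(["bash_execute", "write_file", "exec"]);

// Placeholder wording (assumption): the real prompt text lives in the preload.
const TOOL_LESS_PROMPT = "You do not have tools. Include complete code and commands.";

// Assumption: the model is matched by prefix.
const ULTRA_550B_PREFIX = "nvidia/nemotron-3-ultra-550b";

// Mutates `body` in place; returns true only when a system message was prepended.
function applyToolLessSystemPrompt(body) {
  if (typeof body.model !== "string" || !body.model.startsWith(ULTRA_550B_PREFIX)) {
    return false;
  }
  const messages = Array.isArray(body.messages) ? body.messages : [];
  // Caller intent wins: a system message at any index blocks the injection.
  if (messages.some((m) => m && m.role === "system")) return false;
  const tools = Array.isArray(body.tools) ? body.tools : [];
  const hasExecTool = tools.some((t) => {
    const name = t && ((t.function && t.function.name) || t.name);
    return typeof name === "string" && EXECUTION_TOOL_NAMES.has(name.toLowerCase());
  });
  if (hasExecTool) return false;
  body.messages = [{ role: "system", content: TOOL_LESS_PROMPT }, ...messages];
  return true;
}
```

With this shape, a request carrying only `create_ticket` still receives the prompt, while one carrying `write_file` is left untouched.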

Changes

  • nemoclaw-blueprint/scripts/nemotron-inference-fix.js:
    • Add TOOL_LESS_SYSTEM_PROMPT_RULES + applyToolLessSystemPrompt chained into patchJsonBody alongside applyChatTemplateKwargs
    • Tight EXECUTION_TOOL_NAMES allowlist (Set of canonical exec/write tool names) to avoid false positives on harmless tool names containing substrings like "create", "run", "save", "command"
    • Scan ALL messages (not just messages[0]) for an existing system message
    • Source-of-truth / removal contract block matching the existing #4063 format (invalid state, source boundary, why-not-fix-source, regression proof, removal condition)
    • Comment documenting path+model as the intentional trust boundary (this preload only runs inside NemoClaw-managed sandboxes via NODE_OPTIONS)
  • test/nemotron-inference-fix.test.ts:
    • New (#4851) test with 9 branches: inject, skip-system-at-0, skip-system-at-mid, skip-with-exec-tool, skip-non-matching-model, inject-with-non-exec-tools, skip-with-mixed-tools-containing-exec, inject-with-broad-token-business-tools (create_ticket/run_query/save_search/command_palette), skip-with-write_file
    • New pins path+model as the intended scope boundary contract test asserting all three hosts (inference.local, integrate.api.nvidia.com, some-other-openai-compat-host) get the injection — pins the documented scope contract
    • Extended real-fetch/undici test to cover Ultra 550B injection AND Content-Length refresh

Live verification (GCP Brev box, direct curl to integrate.api.nvidia.com)

| Variant | content length |
| --- | --- |
| Plain request (no preload) | 1 char |
| + chat_template_kwargs.force_nonempty_content only (existing preload) | 1 char (no help) |
| + this PR's system-prompt injection | 501 chars (full heredoc + python3 /tmp/hello.py) |
| With caller-supplied system message | caller's preserved, no injection |
| With execution-capable tools | finish_reason: tool_calls, no injection (correct) |

Reasoning side stays stable (~184 chars) — the fix lets the model emit what it was already planning.
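
For anyone repeating the comparison above, a small helper along these lines could extract the three values being compared from a response. It is a sketch under assumptions: the function name `summarizeChoice` is hypothetical, and the response is assumed to follow the OpenAI-compatible shape with `reasoning_content` alongside `content` on the message.

```javascript
// Summarize the fields compared in the live-verification table for one response.
// Assumption: `reasoning_content` sits next to `content` on choices[0].message.
function summarizeChoice(response) {
  const choice = (response && response.choices && response.choices[0]) || {};
  const message = choice.message || {};
  const lengthOf = (value) => (typeof value === "string" ? value.length : 0);
  return {
    contentLength: lengthOf(message.content),
    reasoningLength: lengthOf(message.reasoning_content),
    finishReason: typeof choice.finish_reason === "string" ? choice.finish_reason : null,
  };
}
```

A near-empty `contentLength` next to a non-trivial `reasoningLength` is the symptom described in the summary.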

Scope note (model-output runtime validation)

PR Review Advisor flagged "request-mutation tests do not prove the linked model-output behavior". Live curl above is the acceptance evidence. CI-side runtime model-output validation requires API-key secret infrastructure (not currently set up for this preload's CI test path) and is intentionally out of scope for this PR.

Verification checklist

  • npm test passes (6/6 on test/nemotron-inference-fix.test.ts, 9 injection branches + scope contract test + all kwargs regressions)
  • Live end-to-end verification through the preload against integrate.api.nvidia.com
  • Tests added for new behavior
  • No secrets, API keys, or credentials committed
  • CodeRabbit Major (system message scan) addressed
  • PR Review Advisor needs-attention items addressed (tool predicate refined)
  • PR Review Advisor worth-checking items addressed (SoT contract, fetch coverage, broad-regex tightened, scope contract test added)

Refs #4851 (NVB#6272828 tracked upstream).

Summary by CodeRabbit

  • New Features

    • Model-specific system prompt is prepended for certain Nemotron Ultra requests when no system message and no execution-capable tools are present.
  • Bug Fixes

    • Body mutations are applied in sequence and the request body/Content-Length are updated only when changes occur.
  • Tests

    • Expanded unit and integration tests covering model matching, tool presence, message placement, and cross-host behavior.
  • Documentation

    • Added an e2e runtime validation runbook for the Ultra tool-less injection scenario.

When `nvidia/nemotron-3-ultra-550b-a55b` is asked to perform a multi-step
task (e.g. "create a file then run it") and the request has no system
message and no tools, the model plans all steps in `reasoning_content`
but silently drops intermediate steps from `content`. The reporter saw
content with only the final run command; in my repro content was empty.
chat_template_kwargs.force_nonempty_content does not help — verified by
direct curl to NVIDIA Endpoints with and without that kwarg.

Extends the existing nemotron-inference-fix preload to also inject a
one-paragraph system message when the model matches Ultra 550B AND the
caller supplied no system message AND no tools. Scoped narrowly so it
never overrides caller intent or interferes with tool-using flows.

Live-verified end-to-end through the preload via Node fetch to NVIDIA
Endpoints:
- Without fix:    content = 1 char (empty)
- With this fix:  content = 501 chars including heredoc + run command
- Caller-system:  caller's "Respond only with 'OK'" preserved
- With-tools:     no injection, model uses tool path normally

Tests cover the four branches (inject, skip-with-system,
skip-with-tools, skip-for-other-model) and all four kwargs regression
tests still pass.

Refs #4851 (NVB#6272828 tracked upstream).

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

The PR extends the Nemotron inference preload's request-body mutation pipeline to prepend a model-specific system prompt for nvidia/nemotron-3-ultra-550b when body.messages contains no system role and body.tools is absent or empty, and updates tests to validate the conditional injection while preserving existing chat_template_kwargs behavior.

Changes

Ultra 550B System Message Injection

| Layer / File(s) | Summary |
| --- | --- |
| System Prompt Rules and Request Mutation Pipeline / nemoclaw-blueprint/scripts/nemotron-inference-fix.js | Defines TOOL_LESS_SYSTEM_PROMPT_RULES for nvidia/nemotron-3-ultra-550b, adds execution-capable tool detection (EXECUTION_TOOL_NAME_RE, isExecutionCapableTool, hasExecutionCapableTool), implements toolLessSystemPromptForModel and applyToolLessSystemPrompt, and updates patchJsonBody to apply chat_template_kwargs and tool-less system prompt injection, re-serializing only when changes occur. |
| System Prompt Injection Test Coverage / test/nemotron-inference-fix.test.ts | Updates import ordering, extends the real fetch/undici harness with an Ultra 550B request asserting injected system message and refreshed Content-Length, and adds a stubbed http.request Vitest test covering injection, preservation, and skip cases across model/tool/message variants and verifying chat_template_kwargs. |
| E2E Runtime Validation Runbook / test/e2e-runtime/4851-ultra-toolless-validation.md | Adds a runtime validation runbook with prerequisites and three curl/JQ scenarios (baseline, kwarg-only, and system-message+kwarg) describing expected verification steps and outcomes. |

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant FetchWrapper
  participant patchJsonBody
  participant ToolLessRuleEngine

  Client->>FetchWrapper: POST /v1/chat/completions (body)
  FetchWrapper->>patchJsonBody: patchJsonBody(body)
  patchJsonBody->>ToolLessRuleEngine: evaluate model, messages, tools
  ToolLessRuleEngine-->>patchJsonBody: decision (inject / skip)
  patchJsonBody->>patchJsonBody: applyChatTemplateKwargs()
  patchJsonBody->>patchJsonBody: applyToolLessSystemPrompt() (if applicable)
  patchJsonBody-->>FetchWrapper: patchedBody or null
  FetchWrapper->>Client: proceed with modified or original request
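
The flow in the diagram (apply each mutation, re-serialize only on change, return null otherwise) might look roughly like this. The two step functions are simplified stand-ins written for this sketch: they omit model gating and tool detection, and the placeholder prompt is not the real text. The `{ text, contentLength }` return shape is also an assumption.

```javascript
// Each step mutates `body` in place and returns true only if it changed something.
function applyChatTemplateKwargs(body) {
  const kwargs = body.chat_template_kwargs;
  if (kwargs && kwargs.force_nonempty_content === true) return false;
  body.chat_template_kwargs = { ...kwargs, force_nonempty_content: true };
  return true;
}

function applyToolLessSystemPrompt(body) {
  const messages = Array.isArray(body.messages) ? body.messages : [];
  if (messages.some((m) => m && m.role === "system")) return false;
  body.messages = [{ role: "system", content: "(placeholder prompt)" }, ...messages];
  return true;
}

// Returns null when nothing changed so the caller can forward the original bytes.
function patchJsonBody(rawBody) {
  let body;
  try {
    body = JSON.parse(rawBody);
  } catch {
    return null; // not JSON: leave the request untouched
  }
  if (body === null || typeof body !== "object" || Array.isArray(body)) return null;
  // Run every step (no short-circuit) so both mutations can apply to one request.
  const changed = [applyChatTemplateKwargs, applyToolLessSystemPrompt]
    .map((step) => step(body))
    .some(Boolean);
  if (!changed) return null;
  const text = JSON.stringify(body);
  // Content-Length counts bytes, not characters, so measure the UTF-8 encoding.
  return { text, contentLength: Buffer.byteLength(text, "utf8") };
}
```

Measuring the re-serialized body with `Buffer.byteLength` rather than `text.length` matters as soon as a message contains non-ASCII characters.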

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4188: Modifies the same nemotron-inference-fix.js request-body mutation pipeline for /v1/chat/completions.

Suggested labels

bug-fix

Suggested reviewers

  • cv

Poem

🐰 I hop through JSON, gentle and spry,
When Ultra is quiet and no tools apply,
I tuck a prompt at the very start,
So Nemotron greets the user's heart,
A merry fix — a carrot, oh my!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: injecting a tool-less system prompt for Ultra 550B models, with a specific issue reference. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@cjagwani cjagwani added labels Jun 9, 2026: bug (Something fails against expected or documented behavior), area: inference (Inference routing, serving, model selection, or outputs), platform: ubuntu (Affects Ubuntu Linux environments), v0.0.62 (Release target)
@cjagwani cjagwani self-assigned this Jun 9, 2026
@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

E2E Advisor Recommendation

Required E2E: cloud-inference-e2e, agent-turn-latency-e2e
Optional E2E: inference-routing-e2e, kimi-inference-compat-e2e

Dispatch hint: cloud-inference-e2e,agent-turn-latency-e2e

Auto-dispatched E2E: cloud-inference-e2e, agent-turn-latency-e2e via nightly-e2e.yaml at b64911fe2dfd1df9943857ccc32fefef84663df2 (nightly run)

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • cloud-inference-e2e (medium; live NVIDIA API key, Docker, timeout 30 minutes): Validates the live sandbox -> inference.local -> NVIDIA Endpoints /v1/chat/completions path after changing the preload that mutates those requests. This should catch runtime preload wiring, syntax, Content-Length, and normal Nemotron-family live chat regressions.
  • agent-turn-latency-e2e (high; live NVIDIA API key, two sandbox installs, timeout 120 minutes): Closest existing live E2E coverage for the affected Ultra 550B model. It installs OpenClaw and Hermes sandboxes, configures nvidia/nemotron-3-ultra-550b-a55b through inference.local, and verifies real model-backed assistant turns do not stall or route through the slow/broken path.

Optional E2E

  • inference-routing-e2e (medium; live NVIDIA API key, timeout 30 minutes): Useful adjacent confidence for provider/gateway inference routing because the changed preload is scoped by /v1/chat/completions path and runs on sandbox inference traffic, even though this PR does not directly change route configuration.
  • kimi-inference-compat-e2e (medium; hermetic mock endpoint plus sandbox, timeout 45 minutes): The same preload still contains the Kimi thinking=false compatibility branch. This hermetic Kimi/OpenAI-compatible endpoint flow is a useful regression check that the refactored patchJson path did not break adjacent model-specific inference behavior.

New E2E recommendations

Dispatch hint

  • Workflow: .github/workflows/nightly-e2e.yaml
  • jobs input: cloud-inference-e2e,agent-turn-latency-e2e

@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

Vitest E2E Scenario Recommendation

Required Vitest E2E scenarios: ubuntu-repo-cloud-openclaw
Optional Vitest E2E scenarios: None

Dispatch required Vitest E2E scenarios:

  • gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Workflow run

Full Vitest E2E advisor summary

Vitest E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: medium

Required Vitest E2E scenarios

  • ubuntu-repo-cloud-openclaw: The PR changes the NemoClaw blueprint sandbox preload that mutates NVIDIA/OpenAI-compatible chat-completions requests. The Ubuntu repo cloud OpenClaw scenario is the smallest live-supported typed Vitest scenario that onboards an OpenClaw/NVIDIA sandbox and exercises the affected sandbox startup/inference-route surface.
    • Dispatch: gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Optional Vitest E2E scenarios

  • None.

Relevant changed files

  • nemoclaw-blueprint/scripts/nemotron-inference-fix.js

@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

PR Review Advisor

Findings: 0 needs attention, 1 worth checking, 0 nice ideas
Since last review: 0 prior items resolved, 1 still applies, 0 new items found

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • Execution-tool detection still only matches exact names despite the suffix contract (nemoclaw-blueprint/scripts/nemotron-inference-fix.js:183): The Ultra 550B tool-less prompt injection skips when `hasExecutionCapableTool(body)` detects an execution-capable tool, but `isExecutionCapableTool` still only performs an exact lookup in `EXECUTION_TOOL_NAMES`. The nearby comment says the allowlist matches exact known names plus canonical OpenClaw/MCP suffixes. If real OpenClaw/MCP requests expose namespaced or server-prefixed execution tool names, this preload could incorrectly prepend “You do not have tools...” even when an execution-capable tool is present.
    • Recommendation: Either implement suffix-aware matching for the documented OpenClaw/MCP execution-tool suffixes, or clarify that exact-name-only matching is the intended contract and add a regression test that pins the real request shape.
    • Evidence: `return EXECUTION_TOOL_NAMES.has(name.toLowerCase());` only checks exact lowercased names. Added tests cover exact names such as `exec`, `bash_execute`, `write_file`, `tool_call`, and top-level `tool.name`, but not a namespaced/server-prefixed MCP execution name.
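
To make the two options in the recommendation concrete, the sketch below contrasts the exact-name lookup quoted in the evidence with one possible suffix-aware variant. The separator set (`__`, `.`, `/`, `:`), the sample namespaced names, and both function names are assumptions for illustration. The finding does not say what shape real OpenClaw/MCP tool names take, which is why it also asks for a test that pins the real request shape.

```javascript
// Abbreviated allowlist for illustration; the Set in the PR is longer.
const EXECUTION_TOOL_NAMES = new Set(["exec", "bash_execute", "write_file"]);

// Exact-name lookup, matching the line quoted in the evidence.
function isExactExecutionTool(name) {
  return EXECUTION_TOOL_NAMES.has(name.toLowerCase());
}

// Hypothetical suffix-aware variant: compare only the last namespace segment.
// Assumption: "__", ".", "/" and ":" are the namespace separators.
function isSuffixExecutionTool(name) {
  const segments = name.toLowerCase().split(/__|[./:]/);
  return EXECUTION_TOOL_NAMES.has(segments[segments.length - 1]);
}
```

Splitting on separators rather than testing `endsWith` keeps names such as `create_ticket` or `web.fetch` from matching, since a single underscore is not treated as a namespace boundary.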

🌱 Nice ideas

  • None.
Consider writing more tests for
  • **Runtime validation** — Ultra 550B request with a namespaced/server-prefixed MCP execution tool name skips the tool-less system prompt, or exact-name-only real request shape is pinned. Deterministic coverage is strong for request mutation, skip branches, and Content-Length handling, but the issue's end-user acceptance depends on live Ultra 550B model output. The PR adds a maintained manual runbook for that runtime validation.
  • **Runtime validation** — Live Ultra 550B tool-less sandbox request returns `content` containing both `/tmp/hello.py` file creation and `python3 /tmp/hello.py` after preload injection when NVIDIA API-key runtime infrastructure is available. Deterministic coverage is strong for request mutation, skip branches, and Content-Length handling, but the issue's end-user acceptance depends on live Ultra 550B model output. The PR adds a maintained manual runbook for that runtime validation.
  • **Acceptance clause:** Model explains it lacks a file-write tool and shows the full code the user would need to run manually — add test evidence or identify existing coverage. The injected system prompt states the model does not have tools to write files or execute commands and asks it to include complete code/commands. Deterministic tests prove the prompt is prepended, but the checked-in Scenario C transcript primarily shows complete manual commands/code rather than literally saying it lacks a file-write tool. This is acceptable because the issue's expected result is an Either condition and the alternate clause is covered.
  • **Acceptance clause:** Steps to Reproduce: `nemoclaw onboard` with `nvidia/nemotron-3-ultra-550b-a55b` (NVIDIA Endpoints); `nemoclaw ultra-test connect && openclaw tui`; send `Create a file called hello.py in /tmp with a hello world script, then run it.`; observe model response and API-level `reasoning_content` vs `content` fields — add test evidence or identify existing coverage. The PR adds deterministic preload tests and a live validation runbook for the same prompt and model-output behavior. The repository tests do not execute the full `nemoclaw onboard`/TUI flow, which is reasonable without API-key runtime infrastructure.
Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemoclaw-blueprint/scripts/nemotron-inference-fix.js`:
- Around line 132-134: The current check only inspects the variable "first"
(messages[0]) and can miss system messages later in body.messages; update the
logic that returns null to instead detect any message with role === 'system' by
scanning body.messages (e.g., Array.prototype.some) and account for non-array or
non-object entries before checking role, so the preload respects a
caller-provided system prompt; modify the checks around the "first" usage to use
this array-wide detection and remove the narrow first-only assumption.
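
A defensive version of the array-wide check requested here could look like the following. The helper name `hasSystemMessage` is hypothetical, and the sketch is only meant to show the guards for non-array `messages` and non-object entries.

```javascript
// True when any entry in body.messages is an object with role === "system".
// Tolerates a missing or non-array `messages` and null or primitive entries.
function hasSystemMessage(body) {
  if (body === null || typeof body !== "object") return false;
  if (!Array.isArray(body.messages)) return false;
  return body.messages.some(
    (m) => m !== null && typeof m === "object" && m.role === "system",
  );
}
```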

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2a210f51-1e64-4a94-93a9-3b43bf89cd25

📥 Commits

Reviewing files that changed from the base of the PR and between 5e79195 and d7140da.

📒 Files selected for processing (2)
  • nemoclaw-blueprint/scripts/nemotron-inference-fix.js
  • test/nemotron-inference-fix.test.ts

Comment thread nemoclaw-blueprint/scripts/nemotron-inference-fix.js Outdated
@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27232511353
Target ref: d7140da50febc345d02ed4ef52997e4165e7fa6f
Workflow ref: main
Requested jobs: agent-turn-latency-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
agent-turn-latency-e2e ⚠️ cancelled

Address PR #5085 review feedback:

- CodeRabbit Major: the system-message check at lines 132-133 only
  inspected messages[0], but the OpenAI chat-completions contract
  permits a system message anywhere in the array. Switch to
  messages.some(...) so the "caller prompt wins" contract holds for
  any position. Add a fifth test case covering a system message at
  index 2 of a multi-turn conversation.

- Biome ci flagged the test file's import order. Apply the auto-fix:
  alphabetize by module name (child_process, fs, os, path, vitest) and
  within the vitest import sort named imports (describe, expect, it).

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27232728511
Target ref: 068efc8594b423de07b65f40e44e62d51967b195
Workflow ref: main
Requested jobs: agent-turn-latency-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
agent-turn-latency-e2e ✅ success

@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27233334531
Target ref: 13c41609a60dda6b4086e60ce46c8254655cd2a2
Workflow ref: main
Requested jobs: cloud-inference-e2e,agent-turn-latency-e2e
Summary: 1 passed, 1 failed, 0 skipped

Job Result
agent-turn-latency-e2e ❌ failure
cloud-inference-e2e ✅ success

Failed jobs: agent-turn-latency-e2e. Check run artifacts for logs.

cjagwani added 2 commits June 9, 2026 14:01
…ge (#4851)

Address PR Review Advisor on #5085 — 1 needs-attention + 3 worth-checking.

1. Refine tool predicate (advisor "needs attention"):
   The original `tools.length > 0` skip missed the practical user case
   from #4851 where the request has `toolSearch` + `web.fetch` but no
   `bash_execute` / `write_file`. Replace with `hasExecutionCapableTool`
   matching names containing bash/exec/run/shell/cmd/command/write/edit/
   patch/create/save/fs/filesystem. Non-execution tools (search, web,
   fetch, describe, read) no longer suppress the injection.

2. Add #4851 source-of-truth contract block mirroring the #4063 format:
   invalid state, source boundary, why-not-fix-source, regression proof,
   removal condition.

3. Add fetch/undici coverage:
   Extend the existing real-fetch test to assert Ultra 550B receives the
   injected system message AND that Content-Length is refreshed after
   the body grows. Previously only the stubbed http.request path was
   covered for injection.

4. Document path+model scope as the intended trust boundary:
   Add explicit comment near `TOOL_LESS_SYSTEM_PROMPT_RULES` explaining
   this preload runs inside NemoClaw-managed sandboxes where
   `inference.local` is the only chat-completions destination; non-
   sandbox OpenAI-compatible callers don't load this preload.

Tests: 5/5 unit tests pass including 7 branches of the injection logic
(inject, skip-system-at-0, skip-system-at-mid, skip-with-exec-tool,
inject-with-non-exec-tools-only, skip-with-mixed-tools, skip-non-
matching-model) and the new fetch-path assertion.

Refs #4851.

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
…act test (#4851)

Address remaining PR Review Advisor "worth-checking" items on #5085:

1. Tighten execution-tool detection (advisor "Execution-tool detection
   may skip the workaround for harmless tool names"):
   The previous regex `(bash|exec|execute|run|shell|cmd|command|write|edit|
   patch|create|save|fs|filesystem)` matched substrings, so harmless
   business tools like `create_ticket`, `run_query`, `save_search`,
   `command_palette` would have incorrectly suppressed the injection.
   Replace with an explicit allowlist (Set) of canonical exec/write
   tool names: bash, bash_execute, exec, execute, execute_command,
   shell, shell_execute, run_command, run_shell, write_file, file_write,
   edit_file, file_edit, patch_file, file_patch, create_file, file_create,
   apply_patch, str_replace_editor, computer. Two new test cases:
   - harmless business-tool names (create_ticket, run_query, save_search,
     command_palette) still trigger injection
   - explicit write_file correctly suppresses

2. Add explicit path+model scope boundary contract test (advisor "System-
   prompt injection is still scoped by path and model rather than a
   trusted provider boundary"):
   New `pins path+model as the intended scope boundary` test sends the
   same request to three different hosts (inference.local,
   integrate.api.nvidia.com, some-other-openai-compat-host.example.com)
   and asserts all three get the injection. This pins the documented
   contract: host is intentionally NOT part of the scope. A future
   move toward narrower host-aware gating must change this assertion
   too.
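
The difference described in item 1 can be seen by running the earlier substring regex and an explicit allowlist over the same names. This is an illustrative sketch: the regex alternation is copied from the commit message, but its flags and anchoring are not shown there, so a plain unanchored pattern is assumed. The Set is abbreviated.

```javascript
// Substring regex from the previous revision (flags assumed: none).
const EXECUTION_TOOL_NAME_RE =
  /(bash|exec|execute|run|shell|cmd|command|write|edit|patch|create|save|fs|filesystem)/;

// Abbreviated subset of the explicit allowlist listed in item 1.
const EXECUTION_TOOL_NAMES = new Set([
  "bash", "bash_execute", "exec", "run_command", "write_file", "create_file",
]);

const names = ["create_ticket", "run_query", "save_search", "command_palette", "write_file"];

// For each name, record whether each strategy treats it as execution-capable.
const comparison = names.map((name) => ({
  name,
  regex: EXECUTION_TOOL_NAME_RE.test(name),
  allowlist: EXECUTION_TOOL_NAMES.has(name),
}));

console.log(comparison);
```

The regex flags all five names, so the four business tools would have suppressed the injection. The allowlist flags only `write_file`.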

Note on the third "worth-checking" item (request-mutation tests don't
prove model-output behavior): this requires a live API call against
NVIDIA Endpoints. Live verification is in the PR body; CI runtime
validation needs API-key secret infrastructure (out of scope for this
PR). PR body documents the live curl results.

Tests: 6/6 unit tests pass with 9 branches of the injection logic.
Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27235858317
Target ref: 0b689356ded22bc9efb5939d5fbd314c5364dfc0
Workflow ref: main
Requested jobs: agent-turn-latency-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
agent-turn-latency-e2e ✅ success

@coderabbitai coderabbitai Bot left a comment

Contributor


🧹 Nitpick comments (1)
test/nemotron-inference-fix.test.ts (1)

406-425: ⚡ Quick win

Add one case for the alternate tool.name shape.

isExecutionCapableTool() supports both { name } and { function: { name } }, but these new allowlist regressions only exercise the nested form. Adding one top-level-name case here would pin the other supported branch too.

Also applies to: 483-494
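
The two tool shapes mentioned in this nitpick can be handled by a small name-extraction helper such as the one below. The name `toolName` and the choice to prefer the nested form when both are present are assumptions made for this sketch. The comment only states that `isExecutionCapableTool()` accepts both shapes.

```javascript
// Read a tool's name from either supported shape:
//   { type: "function", function: { name } }   (nested)
//   { type: "function", name }                 (top-level)
// Assumption: the nested name wins if both are present.
function toolName(tool) {
  if (tool === null || typeof tool !== "object") return null;
  const nested = tool.function && tool.function.name;
  const name = typeof nested === "string" ? nested : tool.name;
  return typeof name === "string" ? name : null;
}
```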

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/nemotron-inference-fix.test.ts` around lines 406 - 425, The tests only
exercise the nested tool shape ({ function: { name } }) but
isExecutionCapableTool() also supports the top-level shape ({ name }), so add an
additional send() call mirroring one of the existing assertions (e.g., for the
Ultra 550B "write_file" NO injection case and/or the "create/run/save/command"
INJECTION case) using tools with the top-level name form (e.g., { type:
'function', name: 'write_file', parameters: {} }) to ensure the alternate branch
is covered; update both the case-7/8 group and the similar block at lines
~483-494 to include the top-level-name variant.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9ba2344c-59f6-4b20-baae-6321b022df35

📥 Commits

Reviewing files that changed from the base of the PR and between 0b68935 and b2e28a3.

📒 Files selected for processing (2)
  • nemoclaw-blueprint/scripts/nemotron-inference-fix.js
  • test/nemotron-inference-fix.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemoclaw-blueprint/scripts/nemotron-inference-fix.js

@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27236454584
Target ref: b2e28a37e336f78f2d0e1e88a95863f4b7c2d105
Workflow ref: main
Requested jobs: agent-turn-latency-e2e
Summary: 0 passed, 1 failed, 0 skipped

Job Result
agent-turn-latency-e2e ❌ failure

Failed jobs: agent-turn-latency-e2e. Check run artifacts for logs.

@cv

cv commented Jun 9, 2026

Collaborator

@cjagwani can you address the feedback in #5085 (comment) please?

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27241955530
Target ref: c2fc436334ced5d2f96d0acb99492c956f011f87
Workflow ref: main
Requested jobs: agent-turn-latency-e2e,cloud-inference-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
agent-turn-latency-e2e ✅ success
cloud-inference-e2e ✅ success

@jyaunches

Contributor

Local PR review follow-up

I re-ran the local PR review against head c2fc4363. The PR Review Advisor is current and still recommends needs_rework; I agree this should not move to CI-loop/merge yet.

Blocking / action-required items

  1. Trusted acceptance evidence for #4851 ([Ubuntu 24.04][Agent&Skills] Ultra 550B content omits intermediate steps when no tools configured — only final command returned) is still missing.
    The repo tests prove request mutation and Content-Length refresh, but they do not prove the original model-output behavior from #4851: the reasoning_content / content / finish_reason values, or that the response includes either the complete file-creation code plus run command or an explicit no-tools explanation. The current SoT block still points to PR-body live curl evidence (nemoclaw-blueprint/scripts/nemotron-inference-fix.js:76-81), which the advisor called out as not repository-verifiable.

  2. Execution-capable tool detection still looks incomplete.
    EXECUTION_TOOL_NAMES in nemoclaw-blueprint/scripts/nemotron-inference-fix.js:135-156 still omits repo-known write-capable names: write, edit, and notebook_edit (nemoclaw/src/index.ts:342). It also does not account for the compact-catalog tool_call wrapper, which can delegate to real tools (scripts/patch-openclaw-tool-catalog.js:149-167). As a result, a “no tools” system prompt can be injected even when write/exec capability is actually available.

  3. E2E scenario advisor recommendation appears not yet run.
    Required branch E2Es agent-turn-latency-e2e and cloud-inference-e2e passed at c2fc4363 in run https://github.com/NVIDIA/NemoClaw/actions/runs/27241955530, but I did not find the required ubuntu-repo-cloud-openclaw e2e-scenarios.yaml run for this branch/head.

Already okay

  • Regular required CI checks are green.
  • CodeRabbit has no unresolved threads.
  • The host-agnostic path+model scope is documented and pinned by test; that looks like an intentional accepted boundary rather than accidental drift.

Recommendation: address the trusted validation artifact and tool-detection gaps first, then rerun the PR Review Advisor. After it comes back clean, the remaining merge blockers can be handled by the normal CI/E2E shepherding loop.
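
For readers following the "request mutation and Content-Length refresh" point in item 1, here is a minimal sketch of what such a body patch might look like. It is not the code in nemotron-inference-fix.js: the function name `injectToolLessPrompt`, the placeholder prompt text, the exact-match model check, and the lowercase `content-length` header key are all assumptions made for illustration. Only the model id and the three conditions (Ultra 550B, no system message anywhere in `messages`, no execution-capable tools) come from the PR description.

```javascript
// Illustrative sketch only: names here are hypothetical, not the preload's API.
const ULTRA_550B = 'nvidia/nemotron-3-ultra-550b-a55b';
// Placeholder text; the real injected paragraph is not quoted in this thread.
const PLACEHOLDER_SYSTEM_PROMPT =
  'No tools are available. Write every step out in full in your reply.';

function injectToolLessPrompt(rawBody, headers, hasExecutionTool = () => false) {
  let body;
  try {
    body = JSON.parse(rawBody);
  } catch {
    return { body: rawBody, headers }; // not JSON: pass through untouched
  }
  if (body === null || typeof body !== 'object' || !Array.isArray(body.messages)) {
    return { body: rawBody, headers };
  }
  const eligible =
    body.model === ULTRA_550B && // assumption: exact match on the model id
    !body.messages.some((m) => m && m.role === 'system') && // scan ALL messages
    !hasExecutionTool(body.tools);
  if (!eligible) return { body: rawBody, headers };

  const patched = JSON.stringify({
    ...body,
    messages: [{ role: 'system', content: PLACEHOLDER_SYSTEM_PROMPT }, ...body.messages],
  });
  // Content-Length counts bytes, not UTF-16 code units, so measure the encoded body.
  return {
    body: patched,
    headers: { ...headers, 'content-length': String(Buffer.byteLength(patched, 'utf8')) },
  };
}
```

A unit test on a function like this can show that the body and header stay consistent, but, as item 1 notes, it cannot show how the model responds to the patched request.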

…unbook (#4851)

Address @jyaunches review feedback on #5085.

1. Align EXECUTION_TOOL_NAMES with nemoclaw/src/index.ts:WRITE_TOOL_NAMES
   so the allowlist stays in sync with the same write-capable surface
   OpenClaw scans for secrets. Adds bare `write`, `edit`, `notebook_edit`.

2. Add `tool_call` (the OpenClaw compact-catalog wrapper from
   scripts/patch-openclaw-tool-catalog.js) to the execution-capable set.
   When `tool_call` is in the tools array, we can't tell from the
   request alone which underlying tool will be dispatched, so treat it
   as execution-capable and skip the system-prompt injection. Otherwise
   the model could receive a "no tools" prompt when it actually has
   real exec/write capability behind the catalog wrapper.

3. Add a checked-in runtime validation runbook at
   test/e2e-runtime/4851-ultra-toolless-validation.md covering three
   scenarios (baseline, force_nonempty_content only, full preload
   injection) against integrate.api.nvidia.com. This is the
   repository-verifiable acceptance evidence the advisor and Julie's
   review asked for — anyone reviewing #4851 acceptance can re-run it
   directly against NVIDIA Endpoints rather than relying on PR text.
   Updated the source-of-truth contract block to reference the runbook.

4. Tests:
   - case 9: bare write/edit/notebook_edit (mirrors WRITE_TOOL_NAMES)
   - case 10: tool_call wrapper (compact catalog dispatch)
   - case 11: top-level tool.name shape (no nested .function) —
     CodeRabbit nit asking for alternate-shape coverage

All 6 unit tests pass (12 injection branches + contract test +
fetch-path).

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
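
The detection rules described in items 1, 2, and test case 11 of this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the checked-in code: the Set below contains only the tool names mentioned in this thread (the real allowlist is longer), and the helper names `toolName` and `hasExecutionTool` are hypothetical.

```javascript
// Assumed subset of the allowlist: only names that appear in this PR thread.
const EXECUTION_TOOL_NAMES = new Set([
  'bash_execute',
  'write_file',
  'write',
  'edit',
  'notebook_edit',
  'tool_call', // compact-catalog wrapper: the underlying tool is unknown, so treat as exec-capable
]);

// Read the tool name from either the nested .function shape or a top-level .name.
function toolName(tool) {
  if (!tool || typeof tool !== 'object') return undefined;
  return tool.function?.name ?? tool.name;
}

// True when at least one tool in the request matches the allowlist exactly.
function hasExecutionTool(tools) {
  return Array.isArray(tools) && tools.some((t) => EXECUTION_TOOL_NAMES.has(toolName(t)));
}
```

Exact Set membership, rather than a substring regex, is what keeps names such as `create_ticket` or `run_query` from suppressing the injection.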
@github-actions

Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27303964199
Target ref: 203536f53c014d81eb6b47f2906690bd80a24524
Workflow ref: main
Requested jobs: agent-turn-latency-e2e,cloud-e2e
Summary: 1 passed, 1 failed, 0 skipped

Job Result
agent-turn-latency-e2e ❌ failure
cloud-e2e ✅ success

Failed jobs: agent-turn-latency-e2e. Check run artifacts for logs.

cjagwani added 3 commits June 10, 2026 15:35
Address remaining PR Review Advisor items on #5085.

- jq nice-idea: every curl example in the runbook pipes to jq for
  readable parsing, but the prerequisites listed only node and
  curl. Add jq alongside node + curl.

- Provider-output acceptance worth-checking: the previous "Last live
  verification" entry pointed back at PR-body evidence rather than a
  durable checked-in artifact. Replace with a "Sanitized acceptance
  transcript" section that records the exact response shape we captured
  on 2026-06-09 for each of the three scenarios (baseline, kwarg-only,
  full preload). Future reviewers can compare new runs against the
  transcript instead of digging through the PR body, and the dated log
  below it tracks freshness of the live confirmation.

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
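
When comparing a new run against a recorded transcript, one way to check the response shape is sketched below. It assumes an OpenAI-compatible chat-completions response with `choices[0].message.content`, `choices[0].message.reasoning_content`, and `choices[0].finish_reason`. The helper name is hypothetical, and it flags only the empty-`content` variant of the #4851 symptom, not the case where `content` holds just the final command.

```javascript
// Hypothetical helper: summarizes the fields the #4851 acceptance check cares about.
function inspectCompletion(response) {
  const choice = response?.choices?.[0];
  const message = choice?.message ?? {};
  const content = typeof message.content === 'string' ? message.content.trim() : '';
  const reasoning =
    typeof message.reasoning_content === 'string' ? message.reasoning_content.trim() : '';
  return {
    finishReason: choice?.finish_reason ?? null,
    contentEmpty: content.length === 0,
    // The model planned in reasoning_content but returned nothing in content.
    droppedContent: content.length === 0 && reasoning.length > 0,
  };
}
```

A field-level check like this is easier to keep stable across runs than a diff of the full response text, since wording varies between model runs.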
)

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27304797639
Target ref: e98604975ad83fefd80bb40e123b0897d57905ef
Workflow ref: main
Requested jobs: agent-turn-latency-e2e,kimi-inference-compat-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
agent-turn-latency-e2e ⚠️ cancelled
kimi-inference-compat-e2e ⚠️ cancelled

@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27305033204
Target ref: 026802ba8dca65d5c85858b3b0c5d72ead8b85b4
Workflow ref: main
Requested jobs: agent-turn-latency-e2e,cloud-inference-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
agent-turn-latency-e2e ⚠️ cancelled
cloud-inference-e2e ⚠️ cancelled

@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27305151681
Target ref: 54ae58eddcb849c72ae82badf6d16a8757df1b7e
Workflow ref: main
Requested jobs: cloud-inference-e2e,kimi-inference-compat-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
cloud-inference-e2e ✅ success
kimi-inference-compat-e2e ✅ success

@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27306323784
Target ref: b64911fe2dfd1df9943857ccc32fefef84663df2
Workflow ref: main
Requested jobs: cloud-inference-e2e,agent-turn-latency-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
agent-turn-latency-e2e ✅ success
cloud-inference-e2e ✅ success

@jyaunches jyaunches added v0.0.64 Release target and removed v0.0.63 Release target labels Jun 11, 2026
@cv cv merged commit 3108a19 into main Jun 11, 2026
39 checks passed
@cv cv deleted the feat/ultra-550b-toolless-systemprompt-4851 branch June 11, 2026 02:28
jyaunches pushed a commit that referenced this pull request Jun 11, 2026
## Summary

The "Selective E2E Results" comment posted by
`.github/workflows/nightly-e2e.yaml` bucketed job results into `passed`
/ `failed` / `skipped` but never accounted for `cancelled`. A job ending
in `cancelled` (e.g. when `cancel-in-progress` kills a stale run)
slipped past all three tallies and fell through to the default `"✅ All
requested jobs passed"` status, with the summary line reading `"0
passed, 0 failed, 0 skipped"` — masking that the run produced no signal
at all.

## Repro (in the wild)

PR #5085 commit `026802ba8`:

```text
### Selective E2E Results — ✅ All requested jobs passed
Run: 27305033204
Requested jobs: agent-turn-latency-e2e,cloud-inference-e2e
Summary: 0 passed, 0 failed, 0 skipped

| Job                    | Result        |
|------------------------|---------------|
| agent-turn-latency-e2e | ⚠️ cancelled |
| cloud-inference-e2e    | ⚠️ cancelled |
```

Both requested jobs were cancelled (cancel-in-progress superseding the
older run) yet the headline read green. Same pattern earlier on #4610.

## Root cause

`.github/workflows/nightly-e2e.yaml:2486-2495` (pre-fix):

```js
const passed  = ran.filter(([, v]) => v.result === 'success');
const failed  = ran.filter(([, v]) => v.result === 'failure');
const skipped = reportedEntries.filter(([, v]) => v.result === 'skipped');

const status =
  failed.length > 0 || missingRequested.length > 0 ? '❌ Some jobs failed'
  : skipped.length > 0 && passed.length === 0    ? '⚠️ No requested jobs ran'
  :                                                  '✅ All requested jobs passed';
```

A `cancelled` job is `!== 'success'`, `!== 'failure'`, and `!== 'skipped'`,
so it lands in none of the three buckets and falls through to the default.

## Changes

- Add a `cancelled` bucket derived from `ran` the same way the others
are.
- Insert a status branch between the failure case and the no-ran case:
when cancelled jobs are present and nothing passed, surface `⚠️ Run
cancelled — no signal` instead of falsely claiming success.
- Include cancelled count in the summary line so the bucket is visible
even when the other states are zero.

Successful, failed, and skipped runs continue to render exactly as
before — only the cancelled case changes from "false green" to "honest
yellow."
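
As a rough illustration of the three changes listed above, the post-fix logic might look like the sketch below. It is not the workflow's actual script: the function wrapper, the definition of `ran` (every reported job that was not skipped), and the exact wording of the summary line are assumptions. The bucket names, the branch order, and the status labels follow the description in this commit message.

```javascript
// Sketch of the post-fix status logic; the wrapper and `ran` definition are assumed.
function summarizeSelectiveE2E(results, missingRequested = []) {
  const reportedEntries = Object.entries(results);
  // Assumption: "ran" means every reported job that was not skipped.
  const ran = reportedEntries.filter(([, v]) => v.result !== 'skipped');

  const passed = ran.filter(([, v]) => v.result === 'success');
  const failed = ran.filter(([, v]) => v.result === 'failure');
  const cancelled = ran.filter(([, v]) => v.result === 'cancelled'); // new bucket
  const skipped = reportedEntries.filter(([, v]) => v.result === 'skipped');

  const status =
    failed.length > 0 || missingRequested.length > 0 ? '❌ Some jobs failed'
    // New branch, placed between the failure case and the no-ran case.
    // \u2014 is the dash used in the label quoted in the Changes list.
    : cancelled.length > 0 && passed.length === 0 ? '⚠️ Run cancelled \u2014 no signal'
    : skipped.length > 0 && passed.length === 0 ? '⚠️ No requested jobs ran'
    : '✅ All requested jobs passed';

  // Assumed format: the cancelled count is appended to the existing three counts.
  const summary =
    `${passed.length} passed, ${failed.length} failed, ` +
    `${skipped.length} skipped, ${cancelled.length} cancelled`;

  return { status, summary };
}
```

With the two cancelled jobs from the repro as input, this sketch yields the cancelled status and a summary ending in `2 cancelled` instead of the green headline.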

cc @jyaunches (flagged this earlier in Slack)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Tests**
  * Improved nightly e2e test reporting in PR comments to better
    distinguish cancelled jobs from other outcomes. PR comments now display
    cancelled job counts and provide clearer status messaging when test runs
    are cancelled.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
cv pushed a commit that referenced this pull request Jun 12, 2026
## Summary
- Add v0.0.64 release notes from the release announcement and link them
to the relevant deeper docs.
- Document that custom policy presets recorded through `policy-add
--from-file` and `--from-dir` survive snapshot restore and sandbox
recreation.
- Refresh generated NemoClaw user skills from the current source docs.

## Source summary
- #5104 -> `docs/manage-sandboxes/backup-restore.mdx`,
`docs/network-policy/customize-network-policy.mdx`: Documents custom
policy presets preserved through snapshot restore.
- #4955 -> `docs/about/release-notes.mdx`: Adds release-note coverage
for Brave web-search pinning and `BRAVE_API_KEY` placeholder
preservation.
- #5116, #5269 -> `docs/about/release-notes.mdx`: Adds release-note
coverage for Docker-driver gateway health and rootfs guard stability.
- #5241, #5085 -> `docs/about/release-notes.mdx`: Adds release-note
coverage for chat-completions provider selection and Nemotron Ultra 550B
tool-less request compatibility.
- #5268, #5210, #5257 -> `docs/about/release-notes.mdx`: Adds
release-note coverage for messaging render plan refresh, OpenClaw
scope-upgrade approval recovery, and Hermes WhatsApp bridge dependency
setup.
- Current source docs -> `.agents/skills/`: Regenerates user-skill
references so agent-facing guidance matches the source documentation.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `npm run docs`
- `npm run build:cli`
- `npm run typecheck:cli`
- Commit/pre-push hooks: markdownlint, gitleaks, docs-to-skills
verification, TypeScript CLI, and skills YAML checks passed.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
  * Clarified sandbox snapshot restore preserves custom policy presets and
    restores them without original files.
  * Switched sandbox setup and remote deployment guidance to Docker-based
    workflows and emphasized remote onboarding flow.
  * Expanded troubleshooting for gateway recovery, Docker GPU/WSL issues,
    and onboarding resume.
  * Added/updated CLI docs: advanced maintenance, session export,
    upload/download wrappers, and status recovery guidance.
  * Added v0.0.64 release notes and links to NemoClaw Community; fixed
    command reference formatting.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Labels

area: inference (Inference routing, serving, model selection, or outputs)
bug (Something fails against expected or documented behavior)
platform: ubuntu (Affects Ubuntu Linux environments)
v0.0.64 (Release target)

Development

Successfully merging this pull request may close these issues.

[Ubuntu 24.04][Agent&Skills] Ultra 550B content omits intermediate steps when no tools configured — only final command returned

4 participants