fix(inference): inject tool-less system prompt for Ultra 550B (#4851)#5085
Conversation
When `nvidia/nemotron-3-ultra-550b-a55b` is asked to perform a multi-step task (e.g. "create a file then run it") and the request has no system message and no tools, the model plans all steps in `reasoning_content` but silently drops intermediate steps from `content`. The reporter saw content with only the final run command; in my repro content was empty. chat_template_kwargs.force_nonempty_content does not help — verified by direct curl to NVIDIA Endpoints with and without that kwarg. Extends the existing nemotron-inference-fix preload to also inject a one-paragraph system message when the model matches Ultra 550B AND the caller supplied no system message AND no tools. Scoped narrowly so it never overrides caller intent or interferes with tool-using flows. Live-verified end-to-end through the preload via Node fetch to NVIDIA Endpoints: - Without fix: content = 1 char (empty) - With this fix: content = 501 chars including heredoc + run command - Caller-system: caller's "Respond only with 'OK'" preserved - With-tools: no injection, model uses tool path normally Tests cover the four branches (inject, skip-with-system, skip-with-tools, skip-for-other-model) and all four kwargs regression tests still pass. Refs #4851 (NVB#6272828 tracked upstream). Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThe PR extends the Nemotron inference preload's request-body mutation pipeline to prepend a model-specific system prompt for ChangesUltra 550B System Message Injection
Sequence Diagram(s)sequenceDiagram
participant Client
participant FetchWrapper
participant patchJsonBody
participant ToolLessRuleEngine
Client->>FetchWrapper: POST /v1/chat/completions (body)
FetchWrapper->>patchJsonBody: patchJsonBody(body)
patchJsonBody->>ToolLessRuleEngine: evaluate model, messages, tools
ToolLessRuleEngine-->>patchJsonBody: decision (inject / skip)
patchJsonBody->>patchJsonBody: applyChatTemplateKwargs()
patchJsonBody->>patchJsonBody: applyToolLessSystemPrompt() (if applicable)
patchJsonBody-->>FetchWrapper: patchedBody or null
FetchWrapper->>Client: proceed with modified or original request
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
Vitest E2E Scenario RecommendationRequired Vitest E2E scenarios: Dispatch required Vitest E2E scenarios:
Full Vitest E2E advisor summaryVitest E2E Scenario AdvisorBase: Required Vitest E2E scenarios
Optional Vitest E2E scenarios
Relevant changed files
|
PR Review AdvisorFindings: 0 needs attention, 1 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Consider writing more tests for
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nemoclaw-blueprint/scripts/nemotron-inference-fix.js`:
- Around line 132-134: The current check only inspects the variable "first"
(messages[0]) and can miss system messages later in body.messages; update the
logic that returns null to instead detect any message with role === 'system' by
scanning body.messages (e.g., Array.prototype.some) and account for non-array or
non-object entries before checking role, so the preload respects a
caller-provided system prompt; modify the checks around the "first" usage to use
this array-wide detection and remove the narrow first-only assumption.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2a210f51-1e64-4a94-93a9-3b43bf89cd25
📒 Files selected for processing (2)
nemoclaw-blueprint/scripts/nemotron-inference-fix.jstest/nemotron-inference-fix.test.ts
Selective E2E Results — ✅ All requested jobs passedRun: 27232511353
|
Address PR #5085 review feedback: - CodeRabbit Major: the system-message check at lines 132-133 only inspected messages[0], but the OpenAI chat-completions contract permits a system message anywhere in the array. Switch to messages.some(...) so the "caller prompt wins" contract holds for any position. Add a fifth test case covering a system message at index 2 of a multi-turn conversation. - Biome ci flagged the test file's import order. Apply the auto-fix: alphabetize by module name (child_process, fs, os, path, vitest) and within the vitest import sort named imports (describe, expect, it). Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 27232728511
|
Selective E2E Results — ❌ Some jobs failedRun: 27233334531
|
…ge (#4851) Address PR Review Advisor on #5085 — 1 needs-attention + 3 worth-checking. 1. Refine tool predicate (advisor "needs attention"): The original `tools.length > 0` skip missed the practical user case from #4851 where the request has `toolSearch` + `web.fetch` but no `bash_execute` / `write_file`. Replace with `hasExecutionCapableTool` matching names containing bash/exec/run/shell/cmd/command/write/edit/ patch/create/save/fs/filesystem. Non-execution tools (search, web, fetch, describe, read) no longer suppress the injection. 2. Add #4851 source-of-truth contract block mirroring the #4063 format: invalid state, source boundary, why-not-fix-source, regression proof, removal condition. 3. Add fetch/undici coverage: Extend the existing real-fetch test to assert Ultra 550B receives the injected system message AND that Content-Length is refreshed after the body grows. Previously only the stubbed http.request path was covered for injection. 4. Document path+model scope as the intended trust boundary: Add explicit comment near `TOOL_LESS_SYSTEM_PROMPT_RULES` explaining this preload runs inside NemoClaw-managed sandboxes where `inference.local` is the only chat-completions destination; non- sandbox OpenAI-compatible callers don't load this preload. Tests: 5/5 unit tests pass including 7 branches of the injection logic (inject, skip-system-at-0, skip-system-at-mid, skip-with-exec-tool, inject-with-non-exec-tools-only, skip-with-mixed-tools, skip-non- matching-model) and the new fetch-path assertion. Refs #4851. Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
…act test (#4851) Address remaining PR Review Advisor "worth-checking" items on #5085: 1. Tighten execution-tool detection (advisor "Execution-tool detection may skip the workaround for harmless tool names"): The previous regex `(bash|exec|execute|run|shell|cmd|command|write|edit| patch|create|save|fs|filesystem)` matched substrings, so harmless business tools like `create_ticket`, `run_query`, `save_search`, `command_palette` would have incorrectly suppressed the injection. Replace with an explicit allowlist (Set) of canonical exec/write tool names: bash, bash_execute, exec, execute, execute_command, shell, shell_execute, run_command, run_shell, write_file, file_write, edit_file, file_edit, patch_file, file_patch, create_file, file_create, apply_patch, str_replace_editor, computer. Two new test cases: - harmless business-tool names (create_ticket, run_query, save_search, command_palette) still trigger injection - explicit write_file correctly suppresses 2. Add explicit path+model scope boundary contract test (advisor "System- prompt injection is still scoped by path and model rather than a trusted provider boundary"): New `pins path+model as the intended scope boundary` test sends the same request to three different hosts (inference.local, integrate.api.nvidia.com, some-other-openai-compat-host.example.com) and asserts all three get the injection. This pins the documented contract: host is intentionally NOT part of the scope. A future move toward narrower host-aware gating must change this assertion too. Note on the third "worth-checking" item (request-mutation tests don't prove model-output behavior): this requires a live API call against NVIDIA Endpoints. Live verification is in the PR body; CI runtime validation needs API-key secret infrastructure (out of scope for this PR). PR body documents the live curl results. Tests: 6/6 unit tests pass with 9 branches of the injection logic. Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 27235858317
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/nemotron-inference-fix.test.ts (1)
406-425: ⚡ Quick winAdd one case for the alternate
tool.nameshape.
isExecutionCapableTool()supports both{ name }and{ function: { name } }, but these new allowlist regressions only exercise the nested form. Adding one top-level-name case here would pin the other supported branch too.Also applies to: 483-494
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/nemotron-inference-fix.test.ts` around lines 406 - 425, The tests only exercise the nested tool shape ({ function: { name } }) but isExecutionCapableTool() also supports the top-level shape ({ name }), so add an additional send() call mirroring one of the existing assertions (e.g., for the Ultra 550B "write_file" NO injection case and/or the "create/run/save/command" INJECTION case) using tools with the top-level name form (e.g., { type: 'function', name: 'write_file', parameters: {} }) to ensure the alternate branch is covered; update both the case-7/8 group and the similar block at lines ~483-494 to include the top-level-name variant.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/nemotron-inference-fix.test.ts`:
- Around line 406-425: The tests only exercise the nested tool shape ({
function: { name } }) but isExecutionCapableTool() also supports the top-level
shape ({ name }), so add an additional send() call mirroring one of the existing
assertions (e.g., for the Ultra 550B "write_file" NO injection case and/or the
"create/run/save/command" INJECTION case) using tools with the top-level name
form (e.g., { type: 'function', name: 'write_file', parameters: {} }) to ensure
the alternate branch is covered; update both the case-7/8 group and the similar
block at lines ~483-494 to include the top-level-name variant.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9ba2344c-59f6-4b20-baae-6321b022df35
📒 Files selected for processing (2)
nemoclaw-blueprint/scripts/nemotron-inference-fix.jstest/nemotron-inference-fix.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- nemoclaw-blueprint/scripts/nemotron-inference-fix.js
Selective E2E Results — ❌ Some jobs failedRun: 27236454584
|
|
@cjagwani can you address the feedback in #5085 (comment) please? |
Selective E2E Results — ✅ All requested jobs passedRun: 27241955530
|
Local PR review follow-upI re-ran the local PR review against head Blocking / action-required items
Already okay
Recommendation: address the trusted validation artifact and tool-detection gaps first, then rerun the PR Review Advisor. After it comes back clean, the remaining merge blockers can be handled by the normal CI/E2E shepherding loop. |
…unbook (#4851) Address @jyaunches review feedback on #5085. 1. Align EXECUTION_TOOL_NAMES with nemoclaw/src/index.ts:WRITE_TOOL_NAMES so the allowlist stays in sync with the same write-capable surface OpenClaw scans for secrets. Adds bare `write`, `edit`, `notebook_edit`. 2. Add `tool_call` (the OpenClaw compact-catalog wrapper from scripts/patch-openclaw-tool-catalog.js) to the execution-capable set. When `tool_call` is in the tools array, we can't tell from the request alone which underlying tool will be dispatched, so treat it as execution-capable and skip the system-prompt injection. Otherwise the model could receive a "no tools" prompt when it actually has real exec/write capability behind the catalog wrapper. 3. Add a checked-in runtime validation runbook at test/e2e-runtime/4851-ultra-toolless-validation.md covering three scenarios (baseline, force_nonempty_content only, full preload injection) against integrate.api.nvidia.com. This is the repository-verifiable acceptance evidence the advisor and Julie's review asked for — anyone reviewing #4851 acceptance can re-run it directly against NVIDIA Endpoints rather than relying on PR text. Updated the source-of-truth contract block to reference the runbook. 4. Tests: - case 9: bare write/edit/notebook_edit (mirrors WRITE_TOOL_NAMES) - case 10: tool_call wrapper (compact catalog dispatch) - case 11: top-level tool.name shape (no nested .function) — CodeRabbit nit asking for alternate-shape coverage All 6 unit tests pass (12 injection branches + contract test + fetch-path). Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Selective E2E Results — ❌ Some jobs failedRun: 27303964199
|
Address remaining PR Review Advisor items on #5085. - jq nice-idea: every curl example in the runbook pipes to jq for readable parsing, but the prerequisites list only listed node and curl. Add jq alongside node + curl. - Provider-output acceptance worth-checking: the previous "Last live verification" entry pointed back at PR-body evidence rather than a durable checked-in artifact. Replace with a "Sanitized acceptance transcript" section that records the exact response shape we captured on 2026-06-09 for each of the three scenarios (baseline, kwarg-only, full preload). Future reviewers can compare new runs against the transcript instead of digging through the PR body, and the dated log below it tracks freshness of the live confirmation. Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 27304797639
|
Selective E2E Results — ✅ All requested jobs passedRun: 27305033204
|
Selective E2E Results — ✅ All requested jobs passedRun: 27305151681
|
Selective E2E Results — ✅ All requested jobs passedRun: 27306323784
|
## Summary The "Selective E2E Results" comment posted by `.github/workflows/nightly-e2e.yaml` bucketed job results into `passed` / `failed` / `skipped` but never accounted for `cancelled`. A job ending in `cancelled` (e.g. when `cancel-in-progress` kills a stale run) slipped past all three tallies and fell through to the default `"✅ All requested jobs passed"` status, with the summary line reading `"0 passed, 0 failed, 0 skipped"` — masking that the run produced no signal at all. ## Repro (in the wild) PR #5085 commit `026802ba8`: ```text ### Selective E2E Results — ✅ All requested jobs passed Run: 27305033204 Requested jobs: agent-turn-latency-e2e,cloud-inference-e2e Summary: 0 passed, 0 failed, 0 skipped | Job | Result | |------------------------|---------------| | agent-turn-latency-e2e |⚠️ cancelled | | cloud-inference-e2e |⚠️ cancelled | ``` Both requested jobs were cancelled (cancel-in-progress superseding the older run) yet the headline read green. Same pattern earlier on #4610. ## Root `.github/workflows/nightly-e2e.yaml:2486-2495` (pre-fix): ```js const passed = ran.filter(([, v]) => v.result === 'success'); const failed = ran.filter(([, v]) => v.result === 'failure'); const skipped = reportedEntries.filter(([, v]) => v.result === 'skipped'); const status = failed.length > 0 || missingRequested.length > 0 ? '❌ Some jobs failed' : skipped.length > 0 && passed.length === 0 ? '⚠️ No requested jobs ran' : '✅ All requested jobs passed'; ``` A `cancelled` job is `!== 'success'`, `!== 'failure'`, `!== 'skipped'` — falls through to the default. ## Changes - Add a `cancelled` bucket derived from `ran` the same way the others are. - Insert a status branch between the failure case and the no-ran case: when cancelled jobs are present and nothing passed, surface `⚠️ Run cancelled — no signal` instead of falsely claiming success. - Include cancelled count in the summary line so the bucket is visible even when the other states are zero. Successful, failed, and skipped runs continue to render exactly as before — only the cancelled case changes from "false green" to "honest yellow." cc @jyaunches (flagged this earlier in slack) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Improved nightly e2e test reporting in PR comments to better distinguish cancelled jobs from other outcomes. PR comments now display cancelled job counts and provide clearer status messaging when test runs are cancelled. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
## Summary - Add v0.0.64 release notes from the release announcement and link them to the relevant deeper docs. - Document that custom policy presets recorded through `policy-add --from-file` and `--from-dir` survive snapshot restore and sandbox recreation. - Refresh generated NemoClaw user skills from the current source docs. ## Source summary - #5104 -> `docs/manage-sandboxes/backup-restore.mdx`, `docs/network-policy/customize-network-policy.mdx`: Documents custom policy presets preserved through snapshot restore. - #4955 -> `docs/about/release-notes.mdx`: Adds release-note coverage for Brave web-search pinning and `BRAVE_API_KEY` placeholder preservation. - #5116, #5269 -> `docs/about/release-notes.mdx`: Adds release-note coverage for Docker-driver gateway health and rootfs guard stability. - #5241, #5085 -> `docs/about/release-notes.mdx`: Adds release-note coverage for chat-completions provider selection and Nemotron Ultra 550B tool-less request compatibility. - #5268, #5210, #5257 -> `docs/about/release-notes.mdx`: Adds release-note coverage for messaging render plan refresh, OpenClaw scope-upgrade approval recovery, and Hermes WhatsApp bridge dependency setup. - Current source docs -> `.agents/skills/`: Regenerates user-skill references so agent-facing guidance matches the source documentation. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `npm run docs` - `npm run build:cli` - `npm run typecheck:cli` - Commit/pre-push hooks: markdownlint, gitleaks, docs-to-skills verification, TypeScript CLI, and skills YAML checks passed. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Clarified sandbox snapshot restore preserves custom policy presets and restores them without original files. * Switched sandbox setup and remote deployment guidance to Docker-based workflows and emphasized remote onboarding flow. * Expanded troubleshooting for gateway recovery, Docker GPU/WSL issues, and onboarding resume. * Added/updated CLI docs: advanced maintenance, session export, upload/download wrappers, and status recovery guidance. * Added v0.0.64 release notes and links to NemoClaw Community; fixed command reference formatting. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
When
nvidia/nemotron-3-ultra-550b-a55bis asked to perform a multi-step task (e.g. "create a file then run it") and the request has no system message and no execution-capable tools, the model plans all steps inreasoning_contentbut silently drops intermediate steps fromcontent. The reporter saw content with only the final run command; in my repro content was empty.chat_template_kwargs.force_nonempty_contentdoes not help — verified by direct curl to NVIDIA Endpoints with and without that kwarg.Extends the existing
nemotron-inference-fixpreload to also inject a one-paragraph system message when the model matches Ultra 550B AND the caller supplied no system message AND no execution-capable tools (bash_execute, write_file, etc. — tight allowlist, not a substring regex). Scoped narrowly so it never overrides caller intent, never interferes with tool-using flows, and doesn't trip on harmless business tools likecreate_ticketorrun_query.Changes
nemoclaw-blueprint/scripts/nemotron-inference-fix.js:TOOL_LESS_SYSTEM_PROMPT_RULES+applyToolLessSystemPromptchained intopatchJsonBodyalongsideapplyChatTemplateKwargsEXECUTION_TOOL_NAMESallowlist (Set of canonical exec/write tool names) to avoid false positives on harmless tool names containing substrings like "create", "run", "save", "command"messages[0]) for an existing system message#4063format (invalid state, source boundary, why-not-fix-source, regression proof, removal condition)test/nemotron-inference-fix.test.ts:(#4851)test with 9 branches: inject, skip-system-at-0, skip-system-at-mid, skip-with-exec-tool, skip-non-matching-model, inject-with-non-exec-tools, skip-with-mixed-tools-containing-exec, inject-with-broad-token-business-tools (create_ticket/run_query/save_search/command_palette), skip-with-write_filepins path+model as the intended scope boundarycontract test asserting all three hosts (inference.local, integrate.api.nvidia.com, some-other-openai-compat-host) get the injection — pins the documented scope contractLive verification (GCP Brev box, direct curl to
integrate.api.nvidia.com)chat_template_kwargs.force_nonempty_contentonly (existing preload)python3 /tmp/hello.py)finish_reason: tool_calls, no injection (correct)Reasoning side stays stable (~184 chars) — the fix lets the model emit what it was already planning.
Scope note (model-output runtime validation)
PR Review Advisor flagged "request-mutation tests do not prove the linked model-output behavior". Live curl above is the acceptance evidence. CI-side runtime model-output validation requires API-key secret infrastructure (not currently set up for this preload's CI test path) and is intentionally out of scope for this PR.
Verification checklist
npm testpasses (6/6 ontest/nemotron-inference-fix.test.ts, 9 injection branches + scope contract test + all kwargs regressions)integrate.api.nvidia.comRefs #4851 (NVB#6272828 tracked upstream).
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Documentation