Skip to content

fix: complete #2731, #2697, and #2727 reliability fixes#2737

Merged
ericksoa merged 9 commits into
NVIDIA:mainfrom
Saibabu7770:fix/2731-ollama-tool-call-fallback
May 8, 2026
Merged

fix: complete #2731, #2697, and #2727 reliability fixes#2737
ericksoa merged 9 commits into
NVIDIA:mainfrom
Saibabu7770:fix/2731-ollama-tool-call-fallback

Conversation

@Saibabu7770

@Saibabu7770 Saibabu7770 commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR fixes three reliability bugs across onboarding, blueprint apply, and sandbox rebuild flows. It improves tool-call validation for local Ollama/Hermes responses, ensures components.policy.additions from blueprint.yaml are actually applied at runtime, and makes rebuild backups resilient when tar returns partial output due to unreadable/root-owned files.

Related Issue

Closes #2731
Closes #2697
Closes #2727

Changes

  • Added robust chat-completions tool-call validation for local Ollama/Hermes, including detection of leaked stringified tool-call JSON in message.content.
  • Updated Local Ollama onboarding validation to require structured chat tool_calls and avoid relying on the /responses path for this provider/runtime combo.
  • Wired blueprint components.policy.additions into actionApply by fetching the live policy, merging additions into network_policies, and applying via openshell policy set --wait.
  • Persisted applied policy_additions in run state metadata (plan.json) for traceability.
  • Updated backup behavior to treat tar exit code 2 as partial success when archive bytes are present, then mark only missing directories as failed.
  • Added/updated regression tests for all three fixes:
    • onboarding tool-call leak and structured tool-call detection
    • runner policy additions merge/apply flow
    • sandbox backup partial-tar handling

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Summary by CodeRabbit

  • New Features

    • Merge and apply blueprint policy additions into sandbox policies; writes and applies a merged policy when additions exist and persists the chosen additions for the run.
  • Improvements

    • Stricter validation for local model selection and OpenAI-like chat-completions probing, requiring structured chat-completions tool-calls and detecting leaked tool-call payloads; skips probing in certain local flows.
  • Tests

    • Expanded tests for policy merge/apply behavior (including empty, invalid-YAML, and merge cases) and for strict tool-calling probe scenarios.

Signed-off-by: Aaron Erickson aerickson@nvidia.com

@copy-pr-bot

copy-pr-bot Bot commented Apr 30, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Two independent changes: (1) onboarding now enforces structured Chat Completions tool-calling for Ollama/local flows and adds helpers/probes to detect structured or leaked tool-call payloads; (2) blueprint components.policy.additions is validated, merged with the current sandbox policy, written to stateDir as merged-policy.yaml, applied via openshell, and persisted into plan.json, with tests covering success, malformed current policy, and empty-additions no-op.

Changes

Ollama tool-calling validation

Layer / File(s) Summary
Probe helper import & re-export
src/lib/onboard.ts, src/lib/inference/onboard-probes.ts
Re-exported hasChatCompletionsToolCall and hasChatCompletionsToolCallLeak so onboarding and tests can use them.
Validation option and flow
src/lib/onboard.ts
validateOpenAiLikeSelection gains requireChatCompletionsToolCalling?: boolean; Ollama local selection paths now use skipResponsesProbe: true and requireChatCompletionsToolCalling: true.
Probe implementation & helpers
src/lib/inference/onboard-probes.ts
Added parsing helpers to detect structured tool_calls and leaked stringified tool-call JSON, probeChatCompletionsToolCalling, sandbox-internal fail-closed behavior, and stricter retry handling.
Tests & types
src/lib/inference/onboard-probes.test.ts, test/onboard.test.ts
Added sandbox-internal probe tests, unit tests for detection helpers, and test-type exports/imports to expose helpers to tests.

Blueprint policy additions application

Layer / File(s) Summary
Policy types & validators
nemoclaw/src/blueprint/runner.ts
Added TypeScript types and predicate helpers to validate components.policy.additions schema (rules/endpoints and enums for methods/protocol/enforcement/tls).
Parse & merge utilities
nemoclaw/src/blueprint/runner.ts
Added parseCurrentPolicy(raw) to extract YAML after the first --- and mergePolicyAdditions(currentPolicyRaw, additions) which preserves other top-level keys, normalizes version, and merges network_policies.
actionApply wiring
nemoclaw/src/blueprint/runner.ts
Create stateDir early; when policy_additions non-empty, fetch current policy, merge, write merged-policy.yaml with 0600 into stateDir, call openshell policy set --policy <file> --wait <sandbox>, and persist policy_additions into plan.json.
Tests
nemoclaw/src/blueprint/runner.test.ts
Tests updated/added: loadBlueprint success for schema-valid additions, loadBlueprint failure for invalid additions, actionApply success merging+apply, actionApply failure on malformed current policy (no set), and actionApply no-op when additions empty.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant Runner as actionApply
  participant Openshell
  participant FS as stateDir
  User->>Runner: run actionApply with blueprint
  Runner->>Runner: create stateDir, compute policyAdditions
  alt policyAdditions present
    Runner->>Openshell: openshell policy get --full <sandbox>
    Openshell-->>Runner: raw policy (--- + YAML)
    Runner->>Runner: parse & merge additions
    Runner->>FS: write merged-policy.yaml (0600)
    Runner->>Openshell: openshell policy set --policy merged-policy.yaml --wait <sandbox>
    Openshell-->>Runner: result
  else empty
    Runner-->Runner: skip openshell policy calls
  end
Loading
sequenceDiagram
  participant Onboard
  participant Probe as probeChatCompletionsToolCalling
  participant Endpoint as ModelEndpoint
  Onboard->>Probe: requireChatCompletionsToolCalling probe
  Probe->>Endpoint: POST /v1/chat/completions (force tool)
  Endpoint-->>Probe: response (choices -> message -> tool_calls / content)
  Probe->>Probe: validate structured tool_calls or detect leaked JSON
  alt structured tool_calls
    Probe-->>Onboard: ok:true
  else leaked or invalid
    Probe-->>Onboard: ok:false, error
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I sniffed the calls that hide in text,
I stitched the policies the blueprint next.
I probe for tools with careful art,
I write the merge and set it apart.
The rabbit hops—persist, apply, restart.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'fix: complete #2731, #2697, and #2727 reliability fixes' accurately summarizes the main changes across onboarding tool-call validation, blueprint policy additions, and sandbox rebuild handling.
Linked Issues check ✅ Passed All three linked issues are addressed: #2731 tool-call leak detection implemented via hasChatCompletionsToolCall/hasChatCompletionsToolCallLeak helpers and probeChatCompletionsToolCalling probe; #2697 policy additions now merged and applied via parseCurrentPolicy/mergePolicyAdditions; #2727 tar partial-failure handling added (though tar exit code 2 resilience not visible in summaries).
Out of Scope Changes check ✅ Passed All changes align with stated objectives: onboarding probe enhancements, policy additions merge/apply, and test coverage. No unrelated alterations detected in the file summaries.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
nemoclaw/src/blueprint/runner.ts (1)

224-230: ⚡ Quick win

Add try-catch around YAML.parse() to handle malformed policy output gracefully.

If openshell policy get --full returns output that passes the initial checks but contains malformed YAML (e.g., truncated output, error messages embedded in the response), YAML.parse() will throw an unhandled exception with a confusing YAMLParseError message. The established pattern in src/lib/policies.ts wraps this in a try-catch and returns a safe fallback.

🛡️ Proposed fix to add error handling
 function parseCurrentPolicy(raw: string): UnknownRecord {
   const sep = raw.indexOf("---");
   const yaml = (sep >= 0 ? raw.slice(sep + 3) : raw).trim();
   if (!yaml) return {};
-  const parsed = YAML.parse(yaml);
-  return isObjectLike(parsed) ? parsed : {};
+  try {
+    const parsed = YAML.parse(yaml);
+    return isObjectLike(parsed) ? parsed : {};
+  } catch {
+    return {};
+  }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemoclaw/src/blueprint/runner.ts` around lines 224 - 230, parseCurrentPolicy
currently calls YAML.parse(yaml) without protection; wrap the parse in a
try-catch inside parseCurrentPolicy so malformed YAML from `openshell policy get
--full` doesn't throw: catch parsing errors, optionally log or debug the error,
and return the safe fallback (an empty object) if parse fails; ensure you still
return parsed when isObjectLike(parsed) is true and keep the same function
signature and behavior otherwise.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard-inference-probes.ts`:
- Around line 62-73: The hasChatCompletionsToolCall detector currently treats
any tool_calls[*].function.name as a valid tool call; tighten validation in
hasChatCompletionsToolCall (which uses parseJsonObject) to require the proper
function-call shape: ensure each call is an object, that call.function is an
object with call.function.type === "function", a non-empty string
call.function.name, and a present arguments field (e.g., "arguments" in call or
typeof call.arguments === "string"/"object" depending on payload expectations)
before returning true; update the tool_calls and call.function checks to enforce
these conditions so malformed payloads no longer produce false positives.

In `@test/onboard.test.ts`:
- Around line 627-659: The test only covers string-valued message.content but
not the array-form content items the detector supports; update the test for
hasChatCompletionsToolCallLeak to include at least one case where
message.content is an array like [{ text: '...'}] containing the stringified
tool-call JSON (expect true) and one array-form normal text item (expect false),
and keep the existing malformed input case; locate and modify the test block
using the hasChatCompletionsToolCallLeak calls to add these array-form
assertions.

---

Nitpick comments:
In `@nemoclaw/src/blueprint/runner.ts`:
- Around line 224-230: parseCurrentPolicy currently calls YAML.parse(yaml)
without protection; wrap the parse in a try-catch inside parseCurrentPolicy so
malformed YAML from `openshell policy get --full` doesn't throw: catch parsing
errors, optionally log or debug the error, and return the safe fallback (an
empty object) if parse fails; ensure you still return parsed when
isObjectLike(parsed) is true and keep the same function signature and behavior
otherwise.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f8816128-c928-4691-bb9e-e4b9b7dff3ad

📥 Commits

Reviewing files that changed from the base of the PR and between 6677d48 and ef604a4.

📒 Files selected for processing (7)
  • nemoclaw/src/blueprint/runner.test.ts
  • nemoclaw/src/blueprint/runner.ts
  • src/lib/onboard-inference-probes.ts
  • src/lib/onboard.ts
  • src/lib/sandbox-state.ts
  • test/onboard.test.ts
  • test/shields.test.ts

Comment thread src/lib/inference/onboard-probes.ts
Comment thread test/onboard.test.ts
@wscurran wscurran added bug Something fails against expected or documented behavior dependencies Pull requests that update a dependency file provider: ollama Ollama local model provider behavior labels Apr 30, 2026
@wscurran

Copy link
Copy Markdown
Contributor

✨ Thanks for submitting this PR that fixes reliability bugs across onboarding, blueprint apply, and sandbox rebuild flows, and improves tool-call validation for local Ollama/Hermes responses.


Related open issues:

…ixes

Improve Ollama chat-completions tool-call validation, apply blueprint policy additions during runner apply, and make rebuild backups tolerate partial tar output from root-owned files.
@Saibabu7770 Saibabu7770 force-pushed the fix/2731-ollama-tool-call-fallback branch from ef604a4 to f632cc0 Compare May 2, 2026 19:36

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
test/onboard.test.ts (1)

890-922: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add array-form message.content leak coverage.

hasChatCompletionsToolCallLeak also handles array-form content items, but this test only covers string content. Please add one positive and one negative array-form case.

Suggested test additions
   it("detects leaked stringified tool-call JSON in chat-completions content", () => {
@@
     ).toBe(true);
+    expect(
+      hasChatCompletionsToolCallLeak(
+        JSON.stringify({
+          choices: [
+            {
+              message: {
+                role: "assistant",
+                content: [
+                  {
+                    type: "text",
+                    text: '{"name":"sessions_send","arguments":{"message":"hello"}}',
+                  },
+                ],
+                tool_calls: null,
+              },
+            },
+          ],
+        }),
+      ),
+    ).toBe(true);
     expect(
       hasChatCompletionsToolCallLeak(
         JSON.stringify({
@@
     ).toBe(false);
+    expect(
+      hasChatCompletionsToolCallLeak(
+        JSON.stringify({
+          choices: [
+            {
+              message: {
+                role: "assistant",
+                content: [{ type: "text", text: "Regular assistant text response." }],
+                tool_calls: null,
+              },
+            },
+          ],
+        }),
+      ),
+    ).toBe(false);
     expect(hasChatCompletionsToolCallLeak("{")).toBe(false);
   });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/onboard.test.ts` around lines 890 - 922, Update the "detects leaked
stringified tool-call JSON in chat-completions content" test to also cover
array-form message.content cases: add one positive case where message.content is
an array (e.g., [{type: "output_text", text: '{ "arguments":{...},
"name":"sessions_send" }'}]) that should make
hasChatCompletionsToolCallLeak(...) return true, and one negative case where
message.content is an array of normal text items that should return false;
locate the test block containing hasChatCompletionsToolCallLeak and add these
two expectations mirroring the existing string-form assertions.
🧹 Nitpick comments (2)
test/onboard.test.ts (1)

85-86: ⚡ Quick win

Fail fast when newly required internals are missing.

Lines 85-86 and 189-190 add new required helpers, but isOnboardTestInternals does not validate these keys. If exports drift, the suite fails later with less actionable errors.

Suggested patch
 function isOnboardTestInternals(
   value: OnboardTestInternalsCandidate,
 ): value is OnboardTestInternals {
   return (
@@
     typeof value.classifySandboxCreateFailure === "function" &&
+    typeof value.hasChatCompletionsToolCall === "function" &&
+    typeof value.hasChatCompletionsToolCallLeak === "function" &&
     typeof value.getDefaultSandboxNameForAgent === "function" &&
@@
   );
 }

Also applies to: 189-190

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/onboard.test.ts` around lines 85 - 86, Update the isOnboardTestInternals
runtime type guard to ensure the newly required helpers are present: validate
that the object has functions named hasChatCompletionsToolCall and
hasChatCompletionsToolCallLeak (accepting optional string|null and returning
boolean) alongside the existing checks so missing exports fail fast; modify the
isOnboardTestInternals implementation to check typeof
internals.hasChatCompletionsToolCall === 'function' and typeof
internals.hasChatCompletionsToolCallLeak === 'function' and include them in the
overall boolean result returned by isOnboardTestInternals.
nemoclaw/src/blueprint/runner.ts (1)

509-510: ⚡ Quick win

Consider restricting file permissions for the merged policy file.

The established pattern in src/lib/policies.ts (context snippet 2) writes policy files with mode: 0o600 to restrict access. The merged policy may contain sensitive network configuration.

🔒 Proposed fix
     const mergedPolicyFile = join(stateDir, "merged-policy.yaml");
-    writeFileSync(mergedPolicyFile, mergePolicyAdditions(currentPolicy.stdout, policyAdditions));
+    writeFileSync(mergedPolicyFile, mergePolicyAdditions(currentPolicy.stdout, policyAdditions), {
+      encoding: "utf-8",
+      mode: 0o600,
+    });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemoclaw/src/blueprint/runner.ts` around lines 509 - 510, The merged policy
file is written without file-permission restrictions; update the write to use a
restricted mode (0o600) so only the owner can read/write the merged policy.
Locate the mergedPolicyFile and the writeFileSync call that writes
mergePolicyAdditions(currentPolicy.stdout, policyAdditions) (in the runner
module) and change the write to include file-mode options (mode: 0o600) or an
equivalent API to set permissions after write, ensuring the merged policy
inherits the same access restrictions used elsewhere (see pattern in
src/lib/policies.ts).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemoclaw/src/blueprint/runner.ts`:
- Around line 224-230: The function parseCurrentPolicy shadows the imported path
separator named sep and calls YAML.parse without error handling; rename the
local separator variable (e.g., sepIndex or idx) to avoid shadowing the imported
sep, wrap the YAML.parse call in a try/catch following the pattern in
src/lib/policies.ts, and return {} (and optionally log the parse error) on parse
failure while still returning parsed only if isObjectLike(parsed) is true;
update references inside parseCurrentPolicy accordingly.

---

Duplicate comments:
In `@test/onboard.test.ts`:
- Around line 890-922: Update the "detects leaked stringified tool-call JSON in
chat-completions content" test to also cover array-form message.content cases:
add one positive case where message.content is an array (e.g., [{type:
"output_text", text: '{ "arguments":{...}, "name":"sessions_send" }'}]) that
should make hasChatCompletionsToolCallLeak(...) return true, and one negative
case where message.content is an array of normal text items that should return
false; locate the test block containing hasChatCompletionsToolCallLeak and add
these two expectations mirroring the existing string-form assertions.

---

Nitpick comments:
In `@nemoclaw/src/blueprint/runner.ts`:
- Around line 509-510: The merged policy file is written without file-permission
restrictions; update the write to use a restricted mode (0o600) so only the
owner can read/write the merged policy. Locate the mergedPolicyFile and the
writeFileSync call that writes mergePolicyAdditions(currentPolicy.stdout,
policyAdditions) (in the runner module) and change the write to include
file-mode options (mode: 0o600) or an equivalent API to set permissions after
write, ensuring the merged policy inherits the same access restrictions used
elsewhere (see pattern in src/lib/policies.ts).

In `@test/onboard.test.ts`:
- Around line 85-86: Update the isOnboardTestInternals runtime type guard to
ensure the newly required helpers are present: validate that the object has
functions named hasChatCompletionsToolCall and hasChatCompletionsToolCallLeak
(accepting optional string|null and returning boolean) alongside the existing
checks so missing exports fail fast; modify the isOnboardTestInternals
implementation to check typeof internals.hasChatCompletionsToolCall ===
'function' and typeof internals.hasChatCompletionsToolCallLeak === 'function'
and include them in the overall boolean result returned by
isOnboardTestInternals.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 82c81299-e984-4adf-a63e-ee24013306c9

📥 Commits

Reviewing files that changed from the base of the PR and between ef604a4 and f632cc0.

📒 Files selected for processing (7)
  • nemoclaw/src/blueprint/runner.test.ts
  • nemoclaw/src/blueprint/runner.ts
  • src/lib/onboard-inference-probes.ts
  • src/lib/onboard.ts
  • src/lib/sandbox-state.ts
  • test/onboard.test.ts
  • test/shields.test.ts
✅ Files skipped from review due to trivial changes (4)
  • test/shields.test.ts
  • src/lib/onboard.ts
  • nemoclaw/src/blueprint/runner.test.ts
  • src/lib/sandbox-state.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard-inference-probes.ts

Comment thread nemoclaw/src/blueprint/runner.ts
Saibabu7770 and others added 4 commits May 2, 2026 15:18
Replays blueprint policy additions application and strict Ollama chat-completions tool-call validation on current main. Drops the rebuild backup portion because NVIDIA#2734 already covers NVIDIA#2727.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
src/lib/inference/onboard-probes.ts (1)

83-94: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Tighten tool_calls validation before treating this probe as a pass.

This still accepts malformed entries like { function: { name: "x" } }, so strict onboarding can false-pass a backend that will fail once dispatch expects a real function call shape. Require type === "function" and a present arguments field before returning true (and ideally don’t stop at only choices[0]).

For the OpenAI-compatible Chat Completions API, what fields are required on each `message.tool_calls[]` entry for a valid function tool call? Please verify whether a valid entry requires `type: "function"` plus `function.name` and `function.arguments`.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/inference/onboard-probes.ts` around lines 83 - 94, The
hasChatCompletionsToolCall probe is too permissive: tighten validation inside
hasChatCompletionsToolCall so it verifies each message.tool_calls entry has type
=== "function", a function object with a non-empty name string, and a present
arguments field (ensure typeof arguments !== "undefined" or is an object/string
as your runtime expects) before returning true; also iterate over all
parsed.choices (not only choices[0]) and check each choice.message.tool_calls
array entries for the stricter shape (use symbols: hasChatCompletionsToolCall,
parsed.choices, message.tool_calls, call.type, call.function.name,
call.function.arguments).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/inference/onboard-probes.ts`:
- Around line 68-81: parseStringifiedToolCall currently only accepts a bare
{name, arguments} object; update it to also detect and unwrap common Chat
Completions wrapper shapes so leaks like {type: "function", function: {...}} or
{tool_calls: [...]} are recognized. Modify parseStringifiedToolCall to
JSON.parse the trimmed string as before, then if the parsed object lacks direct
name/arguments, check for parsed.function (and treat parsed.function as the
candidate), check for parsed.tool_calls being an array and use
parsed.tool_calls[0], and any similar wrapper that contains a nested object with
name (string) and arguments (present). After unwrapping run the same validations
(name is non-empty string, hasOwnProperty "arguments") and return the inner tool
object or null.
- Around line 516-525: The retry branch must use the same structured-tool-call
probe when requireChatCompletionsToolCalling is true; modify the timeout-retry
path so it calls probeChatCompletionsToolCalling(endpointUrl, model, apiKey, {
authMode: options.authMode }) instead of falling back to
runChatCompletionsProbe(...). Locate the retry logic that currently chooses
between probeChatCompletionsToolCalling and runChatCompletionsProbe (using
symbols requireChatCompletionsToolCalling, probeChatCompletionsToolCalling,
runChatCompletionsProbe, authHeader, appendKey) and ensure both initial and
doubled-timeout retry branches preserve the same probe selection based on
options.requireChatCompletionsToolCalling.

---

Duplicate comments:
In `@src/lib/inference/onboard-probes.ts`:
- Around line 83-94: The hasChatCompletionsToolCall probe is too permissive:
tighten validation inside hasChatCompletionsToolCall so it verifies each
message.tool_calls entry has type === "function", a function object with a
non-empty name string, and a present arguments field (ensure typeof arguments
!== "undefined" or is an object/string as your runtime expects) before returning
true; also iterate over all parsed.choices (not only choices[0]) and check each
choice.message.tool_calls array entries for the stricter shape (use symbols:
hasChatCompletionsToolCall, parsed.choices, message.tool_calls, call.type,
call.function.name, call.function.arguments).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8880a1bb-b922-4aa6-a659-046d5a677a5b

📥 Commits

Reviewing files that changed from the base of the PR and between c0f3b90 and 5d0a127.

📒 Files selected for processing (4)
  • nemoclaw/src/blueprint/runner.test.ts
  • nemoclaw/src/blueprint/runner.ts
  • src/lib/inference/onboard-probes.test.ts
  • src/lib/inference/onboard-probes.ts

Comment thread src/lib/inference/onboard-probes.ts
Comment thread src/lib/inference/onboard-probes.ts

@ericksoa ericksoa left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved after reviewing the current head and validating that checks pass.

@ericksoa ericksoa enabled auto-merge (squash) May 8, 2026 17:28
@ericksoa ericksoa disabled auto-merge May 8, 2026 17:30
@ericksoa ericksoa merged commit 97f8a9a into NVIDIA:main May 8, 2026
12 of 13 checks passed
miyoungc added a commit that referenced this pull request May 9, 2026
## Summary
- Bump the docs release metadata to `0.0.38`.
- Document release-prep updates for status policy versions, Local Ollama
validation and cleanup, blueprint policy additions, rebuild backup
handling, and NemoHermes uninstall branding.
- Refresh generated `nemoclaw-user-*` skills from the updated docs.

## Source summary
- #3185 -> `docs/reference/commands.md`: Documents that `nemoclaw <name>
status` displays the gateway active policy version when OpenShell
reports one.
- #3167 -> `docs/reference/commands.md`,
`docs/inference/use-local-inference.md`: Documents uninstall cleanup for
matching Local Ollama auth proxy processes.
- #2737 -> `docs/inference/use-local-inference.md`,
`docs/network-policy/customize-network-policy.md`,
`docs/manage-sandboxes/lifecycle.md`, `docs/reference/commands.md`:
Documents stricter Local Ollama tool-call validation, blueprint policy
additions, and partial rebuild backup handling.
- #3220 -> `docs/reference/commands.md`: Documents NemoHermes-specific
uninstall progress and completion text.
- #3158 -> `.agents/skills/nemoclaw-user-configure-inference/*`:
Refreshes generated user skills from existing
`docs/inference/switch-inference-providers.md` heartbeat documentation.
- #3199 -> `.agents/skills/nemoclaw-user-get-started/SKILL.md`:
Refreshes generated user skills from existing
`docs/get-started/quickstart.md` Model Router wording.

## Skipped
- #3272 and #3268 were already documented by their merged docs updates
on `main`.
- #3154, #3216, #3166, and #3195 have no additional user-facing docs
impact for this release-prep pass.
- No commits matched `docs/.docs-skip`.

## Test plan
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user`
- `make docs`
- `npm run build:cli`
- Commit and pre-push hooks: markdownlint, docs-to-skills verification,
gitleaks, commitlint, skills YAML tests, CLI typecheck


Made with [Cursor](https://cursor.com)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Behavior Changes**
* Rebuild now safely handles partial backups, preserving successfully
captured entries while reporting only unarchived paths
* Uninstall for Local Ollama setups now stops proxy processes before
cleanup
* Local Ollama models require stricter tool-call response validation
during onboarding
* Blueprint policy additions enable custom network policy extensions via
`components.policy.additions`
* New `NEMOCLAW_AGENT_HEARTBEAT_EVERY` configuration controls agent
periodic task frequency
  * Status display now shows active policy version when available

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Cursor <cursoragent@cursor.com>
jyaunches added a commit that referenced this pull request May 13, 2026
…ture (#3363)

<!-- markdownlint-disable MD041 -->
## Summary

Supersedes #3290. Introduces declarative **scenario-based E2E
orchestration** and the **Phase 1 pre-flight infrastructure** that
prepares for wave-by-wave migration of the existing `test/e2e/test-*.sh`
scripts into the new structure. The old legacy scripts remain in place
and unchanged; no E2E workflow is deleted or muted by this PR.

The scope grew beyond the original #3290 title after rebase against
`main` and the addition of Phase 1 helpers, fixtures, parity harness,
and convention lint. Because upstream rules forbid force-push on feature
branches, this is opened as a fresh PR with clean linear history rather
than rewriting #3290 in place.

## Related Issue

Supersedes #3290. No single tracking issue — foundation for the E2E
scenario-matrix refactor discussed across #2737, #3154, and the E2E
test-gap audit.

## Changes

### Part 1 — scenario-based E2E orchestration (commits 1–3, previously
in #3290)

- **Declarative metadata** (`test/e2e/scenarios.yaml`,
`expected-states.yaml`, `suites.yaml`) for initial scenarios: Ubuntu
cloud OpenClaw/Hermes, macOS, WSL, GPU local Ollama, Brev launchable,
no-Docker negative.
- **TypeScript resolver/validator/plan-printer/coverage report** under
`test/e2e/resolver/` (invoked via `tsx`, uses `js-yaml` already in root
`package.json`; unit-tested in the Vitest `cli` project).
- **Shell helpers** under `test/e2e/lib/` that produce a normalized
context at `$E2E_CONTEXT_DIR` (default `.e2e/`). Wraps existing
`sandbox-teardown.sh` and `install-path-refresh.sh`; does not duplicate
them.
- **Entry scripts**: `run-scenario.sh`, `run-suites.sh`,
`coverage-report.sh` with `--plan-only`, `--dry-run` (`E2E_DRY_RUN=1`)
flags.
- **Suite scripts** under
`test/e2e/suites/{smoke,inference,credentials,lifecycle,messaging,onboarding,platform,sandbox,security}/`.
- **New workflow** `.github/workflows/e2e-scenarios.yaml` — manual
(`workflow_dispatch`) with `scenario`, `plan_only`, and `suite_filter`
inputs. Existing workflows are unchanged.
- `.gitignore`: add `.e2e/` runtime context directory.

### Part 2 — Phase 1 pre-flight migration infrastructure (commits 4–7,
new)

- **`test/e2e/MIGRATION.md`** — in-tree tracker mapping each legacy
`test-*.sh` script to its target scenario + suite home. Includes the
reuse-absorption table showing where pair-variant coverage (e.g.
OpenClaw + Hermes) collapses into a single suite.
- **Fixtures** (`test/e2e/lib/fixtures/`): `fake-openai.sh`,
`fake-{telegram,discord,slack}.sh` (built on shared
`_fake-http-stub.sh`), `older-base-image.sh`. Follow the inline-Node
`http.createServer` pattern from `test-messaging-providers.sh`; no new
framework.
- **Helpers** (`test/e2e/lib/`): `logging.sh` (PASS/FAIL/=== Phase
markers), `sandbox-exec.sh` (`e2e_sandbox_exec` /
`e2e_sandbox_exec_stdin`). `env.sh` auto-sources `logging.sh`.
- **Assertion helpers** (`test/e2e/lib/assert/`): `inference-works.sh`,
`no-credentials-leaked.sh`, `policy-preset-applied.sh`,
`messaging-bridge-reachable.sh`.
- **Install splits** (`test/e2e/lib/setup/`):
`install-{repo,curl,ollama,launchable}.sh` + `install.sh` dispatcher, so
each migrated suite can compose the minimal install it needs.
- **Runtime probe wiring**: `run-scenario.sh --validate-only` flag
(mutually exclusive with `--plan-only`) that reads
`$E2E_CONTEXT_DIR/context.env`, runs the expected-state validator, and
emits JSON — skipping install/onboard/suites. `resolver/index.ts` gains
`E2E_PROBE_OVERRIDES_JSON` escape hatch for probe keys with embedded
underscores.
- **Convention lint**: `scripts/e2e/lint-conventions.ts` (6 rules,
Vitest-backed in the `cli` project) blocks new orphan `test-*.sh`
scripts and enforces `parity-map.yaml` completeness.
- **Parity harness**: `scripts/e2e/compare-parity.sh`,
`test/e2e/parity-map.yaml` (seeded with 39 legacy-script placeholder
entries), `.github/workflows/e2e-parity-compare.yaml`. This is the
mechanism that will drive waves 2–12 of the migration — each migrated
suite gets dispatched against the parity-compare workflow until zero
divergence vs. its legacy counterpart.

### What this PR does **not** change

- Existing E2E workflows (`nightly-e2e.yaml`, `macos-e2e.yaml`,
`wsl-e2e.yaml`, `ollama-proxy-e2e.yaml`, `e2e-branch-validation.yaml`,
`sandbox-images-and-e2e.yaml`) run as before.
- Existing `test/e2e/test-*.sh` scripts are **not** deleted. Per-wave
retirement happens in follow-up PRs once `parity-map.yaml` entries
report zero divergence.
- No new behavior in the product itself; this is test-harness
scaffolding only.

### Migration sequencing (follow-up PRs)

Waves 2–12 migrate the 40 legacy `test-*.sh` scripts into the scenario
matrix, one wave per PR, gated by parity-compare zero-divergence. Phase
13 is the merge gate that flips the default CI dispatch from
`nightly-e2e.yaml` to `e2e-parity-compare.yaml`.

## Type of Change

- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification

- [x] `npx prek run --all-files` passes
- [x] `npm test` passes — 3,258 `cli` tests, 0 failures (41 new tests
from Phase 1)
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
(`test/e2e/README.md`, `test/e2e/MIGRATION.md`, `AGENTS.md`)
- [ ] `make docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

### Fork E2E evidence (Phase 1 branch dispatch)

Dispatched `nightly-e2e.yaml` on `jyaunches/NemoClaw` against this
branch (with a temporary local-only repo-gate lift, not part of this
PR). Result:

| Outcome | Count | Notes |
|---|---|---|
| ✅ success | 35 | All CPU E2E jobs green on fork |
| ⏭ skipped by design | 2 | `gpu-e2e`, `gpu-double-onboard-e2e` — need
NVIDIA self-hosted GPU runner |
| ⏸ unrunnable on fork | 1 | `hermes-slack-e2e` — needs
`linux-amd64-cpu4` runner (NVIDIA-internal) |
| ⚠️ pre-existing fork-environment hang | 1 | `hermes-e2e` hangs during
`nemohermes onboard` on fork (not a Phase 1 regression — same legacy
script, no changes to onboard path) |

**35/35 of jobs that could run on the fork pass.** This attests that
Phase 1 infrastructure does not regress any existing E2E coverage.

---
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Introduced scenario-based end-to-end testing framework with
declarative setup profiles and ordered validation suites.

* **Tests**
* Added CI workflows for scenario execution and parity comparison with
legacy tests.
* New validation suites for smoke tests, inference, credentials,
platform checks, and security.

* **Documentation**
* Added E2E testing guide and migration tracker for moving tests to
scenario-based architecture.

* **Chores**
* Added test infrastructure, fixtures, and assertion helpers supporting
multiple platforms and configurations.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3363)

<!-- review_stack_entry_end -->
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
@wscurran wscurran added bug-fix PR fixes a bug or regression chore Build, CI, dependency, or tooling maintenance and removed bug Something fails against expected or documented behavior chore Build, CI, dependency, or tooling maintenance labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression dependencies Pull requests that update a dependency file integration: hermes Hermes integration behavior provider: ollama Ollama local model provider behavior

Projects

None yet

3 participants