fix: complete #2731, #2697, and #2727 reliability fixes#2737
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughTwo independent changes: (1) onboarding now enforces structured Chat Completions tool-calling for Ollama/local flows and adds helpers/probes to detect structured or leaked tool-call payloads; (2) blueprint ChangesOllama tool-calling validation
Blueprint policy additions application
Sequence Diagram(s)sequenceDiagram
participant User
participant Runner as actionApply
participant Openshell
participant FS as stateDir
User->>Runner: run actionApply with blueprint
Runner->>Runner: create stateDir, compute policyAdditions
alt policyAdditions present
Runner->>Openshell: openshell policy get --full <sandbox>
Openshell-->>Runner: raw policy (--- + YAML)
Runner->>Runner: parse & merge additions
Runner->>FS: write merged-policy.yaml (0600)
Runner->>Openshell: openshell policy set --policy merged-policy.yaml --wait <sandbox>
Openshell-->>Runner: result
else empty
Runner-->Runner: skip openshell policy calls
end
sequenceDiagram
participant Onboard
participant Probe as probeChatCompletionsToolCalling
participant Endpoint as ModelEndpoint
Onboard->>Probe: requireChatCompletionsToolCalling probe
Probe->>Endpoint: POST /v1/chat/completions (force tool)
Endpoint-->>Probe: response (choices -> message -> tool_calls / content)
Probe->>Probe: validate structured tool_calls or detect leaked JSON
alt structured tool_calls
Probe-->>Onboard: ok:true
else leaked or invalid
Probe-->>Onboard: ok:false, error
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
nemoclaw/src/blueprint/runner.ts (1)
224-230: ⚡ Quick winAdd try-catch around
YAML.parse()to handle malformed policy output gracefully.If
openshell policy get --fullreturns output that passes the initial checks but contains malformed YAML (e.g., truncated output, error messages embedded in the response),YAML.parse()will throw an unhandled exception with a confusingYAMLParseErrormessage. The established pattern insrc/lib/policies.tswraps this in a try-catch and returns a safe fallback.🛡️ Proposed fix to add error handling
function parseCurrentPolicy(raw: string): UnknownRecord { const sep = raw.indexOf("---"); const yaml = (sep >= 0 ? raw.slice(sep + 3) : raw).trim(); if (!yaml) return {}; - const parsed = YAML.parse(yaml); - return isObjectLike(parsed) ? parsed : {}; + try { + const parsed = YAML.parse(yaml); + return isObjectLike(parsed) ? parsed : {}; + } catch { + return {}; + } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemoclaw/src/blueprint/runner.ts` around lines 224 - 230, parseCurrentPolicy currently calls YAML.parse(yaml) without protection; wrap the parse in a try-catch inside parseCurrentPolicy so malformed YAML from `openshell policy get --full` doesn't throw: catch parsing errors, optionally log or debug the error, and return the safe fallback (an empty object) if parse fails; ensure you still return parsed when isObjectLike(parsed) is true and keep the same function signature and behavior otherwise.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard-inference-probes.ts`:
- Around line 62-73: The hasChatCompletionsToolCall detector currently treats
any tool_calls[*].function.name as a valid tool call; tighten validation in
hasChatCompletionsToolCall (which uses parseJsonObject) to require the proper
function-call shape: ensure each call is an object, that call.function is an
object with call.function.type === "function", a non-empty string
call.function.name, and a present arguments field (e.g., "arguments" in call or
typeof call.arguments === "string"/"object" depending on payload expectations)
before returning true; update the tool_calls and call.function checks to enforce
these conditions so malformed payloads no longer produce false positives.
In `@test/onboard.test.ts`:
- Around line 627-659: The test only covers string-valued message.content but
not the array-form content items the detector supports; update the test for
hasChatCompletionsToolCallLeak to include at least one case where
message.content is an array like [{ text: '...'}] containing the stringified
tool-call JSON (expect true) and one array-form normal text item (expect false),
and keep the existing malformed input case; locate and modify the test block
using the hasChatCompletionsToolCallLeak calls to add these array-form
assertions.
---
Nitpick comments:
In `@nemoclaw/src/blueprint/runner.ts`:
- Around line 224-230: parseCurrentPolicy currently calls YAML.parse(yaml)
without protection; wrap the parse in a try-catch inside parseCurrentPolicy so
malformed YAML from `openshell policy get --full` doesn't throw: catch parsing
errors, optionally log or debug the error, and return the safe fallback (an
empty object) if parse fails; ensure you still return parsed when
isObjectLike(parsed) is true and keep the same function signature and behavior
otherwise.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f8816128-c928-4691-bb9e-e4b9b7dff3ad
📒 Files selected for processing (7)
nemoclaw/src/blueprint/runner.test.tsnemoclaw/src/blueprint/runner.tssrc/lib/onboard-inference-probes.tssrc/lib/onboard.tssrc/lib/sandbox-state.tstest/onboard.test.tstest/shields.test.ts
|
✨ Thanks for submitting this PR that fixes reliability bugs across onboarding, blueprint apply, and sandbox rebuild flows, and improves tool-call validation for local Ollama/Hermes responses. Related open issues: |
…ixes Improve Ollama chat-completions tool-call validation, apply blueprint policy additions during runner apply, and make rebuild backups tolerate partial tar output from root-owned files.
ef604a4 to
f632cc0
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
test/onboard.test.ts (1)
890-922:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd array-form
message.contentleak coverage.
hasChatCompletionsToolCallLeakalso handles array-form content items, but this test only covers string content. Please add one positive and one negative array-form case.Suggested test additions
it("detects leaked stringified tool-call JSON in chat-completions content", () => { @@ ).toBe(true); + expect( + hasChatCompletionsToolCallLeak( + JSON.stringify({ + choices: [ + { + message: { + role: "assistant", + content: [ + { + type: "text", + text: '{"name":"sessions_send","arguments":{"message":"hello"}}', + }, + ], + tool_calls: null, + }, + }, + ], + }), + ), + ).toBe(true); expect( hasChatCompletionsToolCallLeak( JSON.stringify({ @@ ).toBe(false); + expect( + hasChatCompletionsToolCallLeak( + JSON.stringify({ + choices: [ + { + message: { + role: "assistant", + content: [{ type: "text", text: "Regular assistant text response." }], + tool_calls: null, + }, + }, + ], + }), + ), + ).toBe(false); expect(hasChatCompletionsToolCallLeak("{")).toBe(false); });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/onboard.test.ts` around lines 890 - 922, Update the "detects leaked stringified tool-call JSON in chat-completions content" test to also cover array-form message.content cases: add one positive case where message.content is an array (e.g., [{type: "output_text", text: '{ "arguments":{...}, "name":"sessions_send" }'}]) that should make hasChatCompletionsToolCallLeak(...) return true, and one negative case where message.content is an array of normal text items that should return false; locate the test block containing hasChatCompletionsToolCallLeak and add these two expectations mirroring the existing string-form assertions.
🧹 Nitpick comments (2)
test/onboard.test.ts (1)
85-86: ⚡ Quick winFail fast when newly required internals are missing.
Lines 85-86 and 189-190 add new required helpers, but
isOnboardTestInternalsdoes not validate these keys. If exports drift, the suite fails later with less actionable errors.Suggested patch
function isOnboardTestInternals( value: OnboardTestInternalsCandidate, ): value is OnboardTestInternals { return ( @@ typeof value.classifySandboxCreateFailure === "function" && + typeof value.hasChatCompletionsToolCall === "function" && + typeof value.hasChatCompletionsToolCallLeak === "function" && typeof value.getDefaultSandboxNameForAgent === "function" && @@ ); }Also applies to: 189-190
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/onboard.test.ts` around lines 85 - 86, Update the isOnboardTestInternals runtime type guard to ensure the newly required helpers are present: validate that the object has functions named hasChatCompletionsToolCall and hasChatCompletionsToolCallLeak (accepting optional string|null and returning boolean) alongside the existing checks so missing exports fail fast; modify the isOnboardTestInternals implementation to check typeof internals.hasChatCompletionsToolCall === 'function' and typeof internals.hasChatCompletionsToolCallLeak === 'function' and include them in the overall boolean result returned by isOnboardTestInternals.nemoclaw/src/blueprint/runner.ts (1)
509-510: ⚡ Quick winConsider restricting file permissions for the merged policy file.
The established pattern in
src/lib/policies.ts(context snippet 2) writes policy files withmode: 0o600to restrict access. The merged policy may contain sensitive network configuration.🔒 Proposed fix
const mergedPolicyFile = join(stateDir, "merged-policy.yaml"); - writeFileSync(mergedPolicyFile, mergePolicyAdditions(currentPolicy.stdout, policyAdditions)); + writeFileSync(mergedPolicyFile, mergePolicyAdditions(currentPolicy.stdout, policyAdditions), { + encoding: "utf-8", + mode: 0o600, + });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemoclaw/src/blueprint/runner.ts` around lines 509 - 510, The merged policy file is written without file-permission restrictions; update the write to use a restricted mode (0o600) so only the owner can read/write the merged policy. Locate the mergedPolicyFile and the writeFileSync call that writes mergePolicyAdditions(currentPolicy.stdout, policyAdditions) (in the runner module) and change the write to include file-mode options (mode: 0o600) or an equivalent API to set permissions after write, ensuring the merged policy inherits the same access restrictions used elsewhere (see pattern in src/lib/policies.ts).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemoclaw/src/blueprint/runner.ts`:
- Around line 224-230: The function parseCurrentPolicy shadows the imported path
separator named sep and calls YAML.parse without error handling; rename the
local separator variable (e.g., sepIndex or idx) to avoid shadowing the imported
sep, wrap the YAML.parse call in a try/catch following the pattern in
src/lib/policies.ts, and return {} (and optionally log the parse error) on parse
failure while still returning parsed only if isObjectLike(parsed) is true;
update references inside parseCurrentPolicy accordingly.
---
Duplicate comments:
In `@test/onboard.test.ts`:
- Around line 890-922: Update the "detects leaked stringified tool-call JSON in
chat-completions content" test to also cover array-form message.content cases:
add one positive case where message.content is an array (e.g., [{type:
"output_text", text: '{ "arguments":{...}, "name":"sessions_send" }'}]) that
should make hasChatCompletionsToolCallLeak(...) return true, and one negative
case where message.content is an array of normal text items that should return
false; locate the test block containing hasChatCompletionsToolCallLeak and add
these two expectations mirroring the existing string-form assertions.
---
Nitpick comments:
In `@nemoclaw/src/blueprint/runner.ts`:
- Around line 509-510: The merged policy file is written without file-permission
restrictions; update the write to use a restricted mode (0o600) so only the
owner can read/write the merged policy. Locate the mergedPolicyFile and the
writeFileSync call that writes mergePolicyAdditions(currentPolicy.stdout,
policyAdditions) (in the runner module) and change the write to include
file-mode options (mode: 0o600) or an equivalent API to set permissions after
write, ensuring the merged policy inherits the same access restrictions used
elsewhere (see pattern in src/lib/policies.ts).
In `@test/onboard.test.ts`:
- Around line 85-86: Update the isOnboardTestInternals runtime type guard to
ensure the newly required helpers are present: validate that the object has
functions named hasChatCompletionsToolCall and hasChatCompletionsToolCallLeak
(accepting optional string|null and returning boolean) alongside the existing
checks so missing exports fail fast; modify the isOnboardTestInternals
implementation to check typeof internals.hasChatCompletionsToolCall ===
'function' and typeof internals.hasChatCompletionsToolCallLeak === 'function'
and include them in the overall boolean result returned by
isOnboardTestInternals.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 82c81299-e984-4adf-a63e-ee24013306c9
📒 Files selected for processing (7)
nemoclaw/src/blueprint/runner.test.tsnemoclaw/src/blueprint/runner.tssrc/lib/onboard-inference-probes.tssrc/lib/onboard.tssrc/lib/sandbox-state.tstest/onboard.test.tstest/shields.test.ts
✅ Files skipped from review due to trivial changes (4)
- test/shields.test.ts
- src/lib/onboard.ts
- nemoclaw/src/blueprint/runner.test.ts
- src/lib/sandbox-state.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/onboard-inference-probes.ts
Replays blueprint policy additions application and strict Ollama chat-completions tool-call validation on current main. Drops the rebuild backup portion because NVIDIA#2734 already covers NVIDIA#2727.
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/lib/inference/onboard-probes.ts (1)
83-94:⚠️ Potential issue | 🟠 Major | ⚡ Quick winTighten
tool_callsvalidation before treating this probe as a pass.This still accepts malformed entries like
{ function: { name: "x" } }, so strict onboarding can false-pass a backend that will fail once dispatch expects a real function call shape. Requiretype === "function"and a presentargumentsfield before returningtrue(and ideally don’t stop at onlychoices[0]).For the OpenAI-compatible Chat Completions API, what fields are required on each `message.tool_calls[]` entry for a valid function tool call? Please verify whether a valid entry requires `type: "function"` plus `function.name` and `function.arguments`.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/inference/onboard-probes.ts` around lines 83 - 94, The hasChatCompletionsToolCall probe is too permissive: tighten validation inside hasChatCompletionsToolCall so it verifies each message.tool_calls entry has type === "function", a function object with a non-empty name string, and a present arguments field (ensure typeof arguments !== "undefined" or is an object/string as your runtime expects) before returning true; also iterate over all parsed.choices (not only choices[0]) and check each choice.message.tool_calls array entries for the stricter shape (use symbols: hasChatCompletionsToolCall, parsed.choices, message.tool_calls, call.type, call.function.name, call.function.arguments).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/inference/onboard-probes.ts`:
- Around line 68-81: parseStringifiedToolCall currently only accepts a bare
{name, arguments} object; update it to also detect and unwrap common Chat
Completions wrapper shapes so leaks like {type: "function", function: {...}} or
{tool_calls: [...]} are recognized. Modify parseStringifiedToolCall to
JSON.parse the trimmed string as before, then if the parsed object lacks direct
name/arguments, check for parsed.function (and treat parsed.function as the
candidate), check for parsed.tool_calls being an array and use
parsed.tool_calls[0], and any similar wrapper that contains a nested object with
name (string) and arguments (present). After unwrapping run the same validations
(name is non-empty string, hasOwnProperty "arguments") and return the inner tool
object or null.
- Around line 516-525: The retry branch must use the same structured-tool-call
probe when requireChatCompletionsToolCalling is true; modify the timeout-retry
path so it calls probeChatCompletionsToolCalling(endpointUrl, model, apiKey, {
authMode: options.authMode }) instead of falling back to
runChatCompletionsProbe(...). Locate the retry logic that currently chooses
between probeChatCompletionsToolCalling and runChatCompletionsProbe (using
symbols requireChatCompletionsToolCalling, probeChatCompletionsToolCalling,
runChatCompletionsProbe, authHeader, appendKey) and ensure both initial and
doubled-timeout retry branches preserve the same probe selection based on
options.requireChatCompletionsToolCalling.
---
Duplicate comments:
In `@src/lib/inference/onboard-probes.ts`:
- Around line 83-94: The hasChatCompletionsToolCall probe is too permissive:
tighten validation inside hasChatCompletionsToolCall so it verifies each
message.tool_calls entry has type === "function", a function object with a
non-empty name string, and a present arguments field (ensure typeof arguments
!== "undefined" or is an object/string as your runtime expects) before returning
true; also iterate over all parsed.choices (not only choices[0]) and check each
choice.message.tool_calls array entries for the stricter shape (use symbols:
hasChatCompletionsToolCall, parsed.choices, message.tool_calls, call.type,
call.function.name, call.function.arguments).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8880a1bb-b922-4aa6-a659-046d5a677a5b
📒 Files selected for processing (4)
nemoclaw/src/blueprint/runner.test.tsnemoclaw/src/blueprint/runner.tssrc/lib/inference/onboard-probes.test.tssrc/lib/inference/onboard-probes.ts
ericksoa
left a comment
There was a problem hiding this comment.
Approved after reviewing the current head and validating that checks pass.
## Summary - Bump the docs release metadata to `0.0.38`. - Document release-prep updates for status policy versions, Local Ollama validation and cleanup, blueprint policy additions, rebuild backup handling, and NemoHermes uninstall branding. - Refresh generated `nemoclaw-user-*` skills from the updated docs. ## Source summary - #3185 -> `docs/reference/commands.md`: Documents that `nemoclaw <name> status` displays the gateway active policy version when OpenShell reports one. - #3167 -> `docs/reference/commands.md`, `docs/inference/use-local-inference.md`: Documents uninstall cleanup for matching Local Ollama auth proxy processes. - #2737 -> `docs/inference/use-local-inference.md`, `docs/network-policy/customize-network-policy.md`, `docs/manage-sandboxes/lifecycle.md`, `docs/reference/commands.md`: Documents stricter Local Ollama tool-call validation, blueprint policy additions, and partial rebuild backup handling. - #3220 -> `docs/reference/commands.md`: Documents NemoHermes-specific uninstall progress and completion text. - #3158 -> `.agents/skills/nemoclaw-user-configure-inference/*`: Refreshes generated user skills from existing `docs/inference/switch-inference-providers.md` heartbeat documentation. - #3199 -> `.agents/skills/nemoclaw-user-get-started/SKILL.md`: Refreshes generated user skills from existing `docs/get-started/quickstart.md` Model Router wording. ## Skipped - #3272 and #3268 were already documented by their merged docs updates on `main`. - #3154, #3216, #3166, and #3195 have no additional user-facing docs impact for this release-prep pass. - No commits matched `docs/.docs-skip`. ## Test plan - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user` - `make docs` - `npm run build:cli` - Commit and pre-push hooks: markdownlint, docs-to-skills verification, gitleaks, commitlint, skills YAML tests, CLI typecheck Made with [Cursor](https://cursor.com) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Behavior Changes** * Rebuild now safely handles partial backups, preserving successfully captured entries while reporting only unarchived paths * Uninstall for Local Ollama setups now stops proxy processes before cleanup * Local Ollama models require stricter tool-call response validation during onboarding * Blueprint policy additions enable custom network policy extensions via `components.policy.additions` * New `NEMOCLAW_AGENT_HEARTBEAT_EVERY` configuration controls agent periodic task frequency * Status display now shows active policy version when available <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Cursor <cursoragent@cursor.com>
…ture (#3363) <!-- markdownlint-disable MD041 --> ## Summary Supersedes #3290. Introduces declarative **scenario-based E2E orchestration** and the **Phase 1 pre-flight infrastructure** that prepares for wave-by-wave migration of the existing `test/e2e/test-*.sh` scripts into the new structure. The old legacy scripts remain in place and unchanged; no E2E workflow is deleted or muted by this PR. The scope grew beyond the original #3290 title after rebase against `main` and the addition of Phase 1 helpers, fixtures, parity harness, and convention lint. Because upstream rules forbid force-push on feature branches, this is opened as a fresh PR with clean linear history rather than rewriting #3290 in place. ## Related Issue Supersedes #3290. No single tracking issue — foundation for the E2E scenario-matrix refactor discussed across #2737, #3154, and the E2E test-gap audit. ## Changes ### Part 1 — scenario-based E2E orchestration (commits 1–3, previously in #3290) - **Declarative metadata** (`test/e2e/scenarios.yaml`, `expected-states.yaml`, `suites.yaml`) for initial scenarios: Ubuntu cloud OpenClaw/Hermes, macOS, WSL, GPU local Ollama, Brev launchable, no-Docker negative. - **TypeScript resolver/validator/plan-printer/coverage report** under `test/e2e/resolver/` (invoked via `tsx`, uses `js-yaml` already in root `package.json`; unit-tested in the Vitest `cli` project). - **Shell helpers** under `test/e2e/lib/` that produce a normalized context at `$E2E_CONTEXT_DIR` (default `.e2e/`). Wraps existing `sandbox-teardown.sh` and `install-path-refresh.sh`; does not duplicate them. - **Entry scripts**: `run-scenario.sh`, `run-suites.sh`, `coverage-report.sh` with `--plan-only`, `--dry-run` (`E2E_DRY_RUN=1`) flags. - **Suite scripts** under `test/e2e/suites/{smoke,inference,credentials,lifecycle,messaging,onboarding,platform,sandbox,security}/`. - **New workflow** `.github/workflows/e2e-scenarios.yaml` — manual (`workflow_dispatch`) with `scenario`, `plan_only`, and `suite_filter` inputs. Existing workflows are unchanged. - `.gitignore`: add `.e2e/` runtime context directory. ### Part 2 — Phase 1 pre-flight migration infrastructure (commits 4–7, new) - **`test/e2e/MIGRATION.md`** — in-tree tracker mapping each legacy `test-*.sh` script to its target scenario + suite home. Includes the reuse-absorption table showing where pair-variant coverage (e.g. OpenClaw + Hermes) collapses into a single suite. - **Fixtures** (`test/e2e/lib/fixtures/`): `fake-openai.sh`, `fake-{telegram,discord,slack}.sh` (built on shared `_fake-http-stub.sh`), `older-base-image.sh`. Follow the inline-Node `http.createServer` pattern from `test-messaging-providers.sh`; no new framework. - **Helpers** (`test/e2e/lib/`): `logging.sh` (PASS/FAIL/=== Phase markers), `sandbox-exec.sh` (`e2e_sandbox_exec` / `e2e_sandbox_exec_stdin`). `env.sh` auto-sources `logging.sh`. - **Assertion helpers** (`test/e2e/lib/assert/`): `inference-works.sh`, `no-credentials-leaked.sh`, `policy-preset-applied.sh`, `messaging-bridge-reachable.sh`. - **Install splits** (`test/e2e/lib/setup/`): `install-{repo,curl,ollama,launchable}.sh` + `install.sh` dispatcher, so each migrated suite can compose the minimal install it needs. - **Runtime probe wiring**: `run-scenario.sh --validate-only` flag (mutually exclusive with `--plan-only`) that reads `$E2E_CONTEXT_DIR/context.env`, runs the expected-state validator, and emits JSON — skipping install/onboard/suites. `resolver/index.ts` gains `E2E_PROBE_OVERRIDES_JSON` escape hatch for probe keys with embedded underscores. - **Convention lint**: `scripts/e2e/lint-conventions.ts` (6 rules, Vitest-backed in the `cli` project) blocks new orphan `test-*.sh` scripts and enforces `parity-map.yaml` completeness. - **Parity harness**: `scripts/e2e/compare-parity.sh`, `test/e2e/parity-map.yaml` (seeded with 39 legacy-script placeholder entries), `.github/workflows/e2e-parity-compare.yaml`. This is the mechanism that will drive waves 2–12 of the migration — each migrated suite gets dispatched against the parity-compare workflow until zero divergence vs. its legacy counterpart. ### What this PR does **not** change - Existing E2E workflows (`nightly-e2e.yaml`, `macos-e2e.yaml`, `wsl-e2e.yaml`, `ollama-proxy-e2e.yaml`, `e2e-branch-validation.yaml`, `sandbox-images-and-e2e.yaml`) run as before. - Existing `test/e2e/test-*.sh` scripts are **not** deleted. Per-wave retirement happens in follow-up PRs once `parity-map.yaml` entries report zero divergence. - No new behavior in the product itself; this is test-harness scaffolding only. ### Migration sequencing (follow-up PRs) Waves 2–12 migrate the 40 legacy `test-*.sh` scripts into the scenario matrix, one wave per PR, gated by parity-compare zero-divergence. Phase 13 is the merge gate that flips the default CI dispatch from `nightly-e2e.yaml` to `e2e-parity-compare.yaml`. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes — 3,258 `cli` tests, 0 failures (41 new tests from Phase 1) - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes (`test/e2e/README.md`, `test/e2e/MIGRATION.md`, `AGENTS.md`) - [ ] `make docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ### Fork E2E evidence (Phase 1 branch dispatch) Dispatched `nightly-e2e.yaml` on `jyaunches/NemoClaw` against this branch (with a temporary local-only repo-gate lift, not part of this PR). Result: | Outcome | Count | Notes | |---|---|---| | ✅ success | 35 | All CPU E2E jobs green on fork | | ⏭ skipped by design | 2 | `gpu-e2e`, `gpu-double-onboard-e2e` — need NVIDIA self-hosted GPU runner | | ⏸ unrunnable on fork | 1 | `hermes-slack-e2e` — needs `linux-amd64-cpu4` runner (NVIDIA-internal) | |⚠️ pre-existing fork-environment hang | 1 | `hermes-e2e` hangs during `nemohermes onboard` on fork (not a Phase 1 regression — same legacy script, no changes to onboard path) | **35/35 of jobs that could run on the fork pass.** This attests that Phase 1 infrastructure does not regress any existing E2E coverage. --- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Introduced scenario-based end-to-end testing framework with declarative setup profiles and ordered validation suites. * **Tests** * Added CI workflows for scenario execution and parity comparison with legacy tests. * New validation suites for smoke tests, inference, credentials, platform checks, and security. * **Documentation** * Added E2E testing guide and migration tracker for moving tests to scenario-based architecture. * **Chores** * Added test infrastructure, fixtures, and assertion helpers supporting multiple platforms and configurations. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3363) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com> Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Summary
This PR fixes three reliability bugs across onboarding, blueprint apply, and sandbox rebuild flows. It improves tool-call validation for local Ollama/Hermes responses, ensures
components.policy.additionsfromblueprint.yamlare actually applied at runtime, and makes rebuild backups resilient whentarreturns partial output due to unreadable/root-owned files.Related Issue
Closes #2731
Closes #2697
Closes #2727
Changes
message.content.tool_callsand avoid relying on the/responsespath for this provider/runtime combo.components.policy.additionsintoactionApplyby fetching the live policy, merging additions intonetwork_policies, and applying viaopenshell policy set --wait.policy_additionsin run state metadata (plan.json) for traceability.tarexit code2as partial success when archive bytes are present, then mark only missing directories as failed.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Summary by CodeRabbit
New Features
Improvements
Tests
Signed-off-by: Aaron Erickson aerickson@nvidia.com