fix(runtime): reconcile agent identity with provider on sandbox start (#3175) by jason-ma-nv · Pull Request #3319 · NVIDIA/NemoClaw

jason-ma-nv · 2026-05-10T13:05:21Z

Summary

After openshell inference set swaps the gateway's inference provider entry, agents.defaults.model.primary in openclaw.json can drift from models.providers.<key>.models[0].id. The gateway then routes requests to the new model but the OpenClaw agent self-reports the old one. This change reconciles the two on every sandbox start so the agent boots with a consistent identity.

Related Issue

Fixes #3175

Changes

Add reconcile_agent_model_with_provider to scripts/nemoclaw-start.sh. It aligns agents.defaults.model.primary to the inference provider's first model id, recomputes .config-hash, and is a no-op when already in sync, when the file is missing, or when no inference provider is configured.
Wire the new function in after apply_model_override in both the non-root and root entrypoint paths so explicit NEMOCLAW_MODEL_OVERRIDE values still win.
Add test/nemoclaw-start-reconcile.test.ts covering the drift case, the in-sync no-op, the missing-provider no-op, and the missing-keys no-op.
Update existing runPreGatewaySetup and non-root fallback test scaffolds in test/nemoclaw-start.test.ts to stub the new function so they continue to pass.

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes
npm test passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
make docs builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

Notes for reviewers:

The full CLI vitest project (3288 tests) was run manually and passed.
npx prek run --all-files was attempted but the Test (CLI) hook hits a vitest@4 + v8 coverage race condition (unhandled rejection on coverage/cli/.tmp/coverage-<pid>.json) under prek's environment, while the same hook command run directly passes. Reproducible on the existing main branch with no changes; not introduced by this PR.

Signed-off-by: jason jama@nvidia.com

Summary by CodeRabbit

New Features
- Agent model config now auto-synchronizes with the active inference provider on startup.
- New CLI: nemoclaw inference set to switch inference provider/model at runtime and update the running sandbox.
- Host-side sandbox config helpers to write configs and recompute sandbox config hashes.
Tests
- Added comprehensive tests for reconciliation, the inference-set workflow, CLI wiring, and multiple edge cases.
Documentation
- Updated docs to use nemoclaw inference set and describe runtime model/provider switching.

…#3175) When `openshell inference set` swaps the gateway's inference provider entry, agents.defaults.model.primary in openclaw.json can drift from models.providers.<key>.models[0].id. The gateway then routes requests to the new model but the OpenClaw agent self-reports the old one. Add reconcile_agent_model_with_provider in nemoclaw-start.sh and call it after apply_model_override in both the root and non-root entrypoint paths. The reconciler aligns primary to the provider's first model id, recomputes the config hash, and is a no-op when already in sync, so explicit NEMOCLAW_MODEL_OVERRIDE values still win. Signed-off-by: Test User <test@example.com>

copy-pr-bot · 2026-05-10T13:05:25Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-10T13:05:33Z

📝 Walkthrough

Walkthrough

Adds a runtime "inference set" workflow (CLI, action, resolver, sandbox persistence helpers, tests, docs) and a startup reconciliation hook that, when run as root, patches sandbox openclaw.json agent primary model to match the gateway routed model and recomputes the config hash.

Changes

Runtime Inference Set + Startup Reconciliation

Layer / File(s)	Summary
Action core `src/lib/actions/inference-set.ts`	Adds `runInferenceSet`, `patchOpenClawInferenceConfig`, types, validation, OpenShell invocation, sandbox config mutation, registry/session sync, audit logging, and InferenceSetError.
CLI command & registration `src/lib/commands/inference/set.ts`, `src/commands/inference/set.ts`	New `inference:set` oclif command and CLI registration wrapper; parses flags and calls `runInferenceSet`.
Global dispatch & adapter tests `src/lib/cli/oclif-dispatch.ts`, `src/lib/cli/command-registry.test.ts`, `src/lib/commands/global-oclif-command-adapters.test.ts`	Recognizes global `inference set` dispatch, updates command registry expectations, and adds adapter tests wiring `runInferenceSet`.
Provider config resolver `src/lib/inference/config.ts`, `src/lib/onboard/providers.ts`	Adds `SandboxInferenceConfig` and `getSandboxInferenceConfig`, moves provider routing logic out of onboard/providers and re-exports it.
Sandbox persistence helpers `src/lib/sandbox-config.ts`	Exports `AgentConfigTarget`, adds `writeSandboxConfig()` and `recomputeSandboxConfigHash()` plus `shellQuote()` for host-side sandbox config writes and hash recompute.
Entrypoint reconciliation `scripts/nemoclaw-start.sh`	Adds `reconcile_agent_model_with_provider()` that, when run as root, compares `agents.defaults.model.primary` to the gateway provider's model ref and patches openclaw.json, recomputes `.config-hash`, and restores ownership/permissions; invoked after `apply_model_override()` in both non-root and root startup paths (non-root path is a no-op).
Reconciliation tests `test/nemoclaw-start-reconcile.test.ts`	New Vitest harness running the reconciliation function in a sandbox workspace covering drift, already-aligned, missing-provider, missing-keys, and provider-model-without-name scenarios.
Startup harness updates `test/nemoclaw-start.test.ts`	Injects reconcile stub into two existing inlined startup test harnesses to preserve expected function ordering.
Docs & help `docs/**`, `src/lib/actions/root-help.ts`, `docs/reference/commands.md`	Documentation and help text updated to use `nemoclaw inference set` for runtime provider/model switching and to document that it patches the running OpenClaw config and recomputes its hash.
Unit tests for action `src/lib/actions/inference-set.test.ts`	Extensive unit tests for `patchOpenClawInferenceConfig` and `runInferenceSet`, covering rewrites, no-ops, provider specifics, orchestration side effects, and negative paths.

🎯 3 (Moderate) | ⏱️ ~20 minutes

Sequence Diagram(s)

sequenceDiagram
  participant CLI as NemoClaw CLI
  participant Cmd as InferenceSetCommand
  participant Action as runInferenceSet
  participant OpenShell as openshell
  participant Sandbox as Sandbox persistence
  participant Registry as NemoClaw registry

  CLI->>Cmd: user runs "nemoclaw inference set"
  Cmd->>Action: call runInferenceSet(options)
  Action->>OpenShell: runOpenshell inference set (apply gateway route)
  OpenShell-->>Action: exit status
  Action->>Sandbox: writeSandboxConfig(updated openclaw.json)
  Action->>Sandbox: recomputeSandboxConfigHash()
  Sandbox-->>Action: confirm write and hash
  Action->>Registry: update sandbox state and onboard session
  Action-->>Cmd: return InferenceSetResult
  Cmd-->>CLI: display result

"I hop and patch where models roam,
Gateway speaks and sands come home,
Hashes recomputed, permissions right,
Agents awake to the new model's light." 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 15.63% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'fix(runtime): reconcile agent identity with provider on sandbox start' clearly and concisely summarizes the main objective—synchronizing the agent's reported model identity with the gateway's inference provider at runtime startup—without unnecessary details or vague terminology.
Linked Issues check	✅ Passed	The PR comprehensively addresses the objective in issue `#3175` by adding runtime reconciliation logic that ensures the sandbox agent reports the same model identity as the gateway provider after inference changes, without requiring a rebuild.
Out of Scope Changes check	✅ Passed	All changes are scoped to implementing the runtime reconciliation feature and its related CLI infrastructure; documentation updates, tests, and code organization changes directly support the fix without introducing unrelated modifications.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch worktree-fix-3175-reconcile-agent-model

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…oncile-agent-model

coderabbitai

🧹 Nitpick comments (1)

src/lib/sandbox-config.ts (1)
843-843: 💤 Low value

Inconsistent shell escaping: configSet uses unquoted path while writeSandboxConfig quotes it.

The new writeSandboxConfig correctly uses shellQuote(target.configPath) at line 393, but the existing configSet function passes target.configPath directly to the shell command without quoting. While configPath values are currently controlled (e.g., /sandbox/.openclaw/openclaw.json), this inconsistency could become a problem if paths ever contain special characters.

Consider applying the same quoting for consistency:
♻️ Suggested fix
       "-i",
       "--",
       "sh",
       "-c",
-      `cat > ${target.configPath}`,
+      `cat > ${shellQuote(target.configPath)}`,
     ],
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/sandbox-config.ts` at line 843, The configSet function currently
injects target.configPath unquoted into the shell command while
writeSandboxConfig uses shellQuote(target.configPath); update configSet to wrap
the path with shellQuote (use the same helper as writeSandboxConfig) when
constructing the `cat > ${...}` command to ensure paths with spaces/special
chars are safely quoted and to keep behavior consistent between configSet and
writeSandboxConfig.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/lib/sandbox-config.ts`:
- Line 843: The configSet function currently injects target.configPath unquoted
into the shell command while writeSandboxConfig uses
shellQuote(target.configPath); update configSet to wrap the path with shellQuote
(use the same helper as writeSandboxConfig) when constructing the `cat > ${...}`
command to ensure paths with spaces/special chars are safely quoted and to keep
behavior consistent between configSet and writeSandboxConfig.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c169b42e-f2e7-4c4a-83e8-565b4800955b

📥 Commits

Reviewing files that changed from the base of the PR and between 3c87b2c and 048212f.

📒 Files selected for processing (21)

docs/inference/switch-inference-providers.md
docs/inference/use-local-inference.md
docs/manage-sandboxes/lifecycle.md
docs/reference/cli-selection-guide.md
docs/reference/commands.md
docs/security/best-practices.md
scripts/nemoclaw-start.sh
src/commands/inference/set.ts
src/lib/actions/inference-set.test.ts
src/lib/actions/inference-set.ts
src/lib/actions/root-help.ts
src/lib/cli/command-registry.test.ts
src/lib/cli/oclif-dispatch.test.ts
src/lib/cli/oclif-dispatch.ts
src/lib/commands/global-oclif-command-adapters.test.ts
src/lib/commands/inference/set.ts
src/lib/inference/config.ts
src/lib/onboard/providers.ts
src/lib/sandbox-config.ts
test/nemoclaw-start-reconcile.test.ts
test/nemoclaw-start.test.ts

✅ Files skipped from review due to trivial changes (5)

src/commands/inference/set.ts
docs/manage-sandboxes/lifecycle.md
docs/security/best-practices.md
src/lib/cli/oclif-dispatch.test.ts
docs/reference/commands.md

🚧 Files skipped from review as they are similar to previous changes (1)

test/nemoclaw-start.test.ts

wscurran · 2026-05-11T14:48:48Z

✨ Related open issues:

#3175 [macOS][Inference] openshell inference set changes gateway model but sandbox agent still reports old model

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

coderabbitai

🧹 Nitpick comments (1)

src/lib/sandbox-config.ts (1)

368-373: 💤 Low value

Consider eliminating the unnecessary temp file round-trip.

The function writes the config to a temp file and immediately reads it back only to pipe it to stdin. The serialized content could be used directly, removing the temp file overhead entirely.

♻️ Suggested simplification

 function writeSandboxConfig(
   sandboxName: string,
   target: AgentConfigTarget,
   config: ConfigObject,
 ): void {
-  const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "nemoclaw-config-"));
-  const tmpFile = path.join(tmpDir, target.configFile);
+  const content = serializeConfig(config, target.format);
   try {
-    fs.writeFileSync(tmpFile, serializeConfig(config, target.format), { mode: 0o600 });
-
-    const content = fs.readFileSync(tmpFile, "utf-8");
     dockerExecFileSync(
       [
         "exec",
         "-i",
         K3S_CONTAINER,
         ...
       ],
       { input: content, stdio: ["pipe", "pipe", "pipe"], timeout: 15000 },
     );
     // ... chown block unchanged ...
-  } finally {
-    try {
-      fs.unlinkSync(tmpFile);
-      fs.rmdirSync(tmpDir);
-    } catch {
-      // Best effort.
-    }
   }
+  catch (err) {
+    throw err;
+  }
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/sandbox-config.ts` around lines 368 - 373, Avoid the temp-file round
trip: remove creation of tmpDir/tmpFile and the writeFileSync/readFileSync calls
and instead call serializeConfig(config, target.format) once, store that string
(e.g., serialized) and pass it directly to the code that currently reads the
file to pipe to stdin; update any permission/cleanup logic tied to
tmpDir/tmpFile and ensure references to tmpFile/tmpDir are removed (look for
serializeConfig, config, target.format, target.configFile usage to locate the
code to change).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/lib/sandbox-config.ts`:
- Around line 368-373: Avoid the temp-file round trip: remove creation of
tmpDir/tmpFile and the writeFileSync/readFileSync calls and instead call
serializeConfig(config, target.format) once, store that string (e.g.,
serialized) and pass it directly to the code that currently reads the file to
pipe to stdin; update any permission/cleanup logic tied to tmpDir/tmpFile and
ensure references to tmpFile/tmpDir are removed (look for serializeConfig,
config, target.format, target.configFile usage to locate the code to change).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3f2af97e-2cdb-4660-b7b0-a9fa9246b6d5

📥 Commits

Reviewing files that changed from the base of the PR and between 048212f and 98f92ad.

📒 Files selected for processing (1)

src/lib/sandbox-config.ts

ericksoa

Approved after replay fixes and green CI on head 98f92ad.

…3679) ## Summary The `reconcile_agent_model_with_provider` safety net added in #3319 reads both sides of its equality check from the same file (primary vs `providers.inference.models[0].name`). When a user runs `openshell inference set` — which only writes the gateway and leaves `/sandbox/.openclaw/openclaw.json` untouched — both fields stay equal to each other and the reconciler short-circuits as a no-op. This PR makes the reconciler probe the live gateway via `openshell inference get --json` and use that as the source of truth so the file is actually realigned to the routed model on the next sandbox start. ## Related Issue Partial fix for #3175. See the [issue comment](#3175 (comment)) for why a complete fix also requires an OpenShell-side change. ## Changes - `scripts/nemoclaw-start.sh`: `reconcile_agent_model_with_provider` now probes the live gateway and, when a model is returned, treats it as authoritative — aligning both `agents.defaults.model.primary` and `models.providers.inference.models[0].{id,name}` so the file no longer carries a stale entry the gateway can push back on its next reconcile. - Legacy in-file reconcile is preserved as a fallback for environments where the openshell binary is missing or the probe fails (no behavioral change for existing test scaffolds without a stubbed openshell). - `test/nemoclaw-start-reconcile.test.ts`: harness now optionally installs a stubbed `openshell` on PATH and scrubs any inherited host openshell. Adds four cases covering the gateway-mode happy path (the user's repro from #3175), inference-prefix idempotency, gateway-matches-file no-op, and empty-probe fallback. ## Scope This PR does not stop the silent runtime revert reported in the #3175 repro (gateway version 3 → 4 without a sandbox restart). That push-back is owned by openshell-gateway and needs an OpenShell-side change; see the linked issue comment. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [ ] \`npx prek run --all-files\` passes - [x] \`npm test\` passes for the touched test file and the adjacent \`test/nemoclaw-start.test.ts\` (81/81) - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] \`make docs\` builds without warnings (doc changes only) Notes for reviewers: - Local full CLI vitest run had 4 pre-existing failures on this machine unrelated to the diff (`ssrf-parity` build artifact missing, `cli.test.ts` gateway-cleanup timeouts, `sandbox-build-context` umask 775 vs 755). Will rely on PR CI to validate. - The gateway-probe Python block uses `subprocess.run(timeout=3)` to bound the probe and silently falls back on any error. --- Signed-off-by: jason-ma-nv <jama@nvidia.com>  ## Summary by CodeRabbit * **New Features** * Reconciliation now prefers a live gateway probe as the source of truth for model alignment; when available it updates the agent's primary model and the provider's primary model entry, with a legacy in-file fallback if the probe is unavailable. * **Tests** * Expanded tests cover probe-driven updates, no-op when already aligned, fallback behavior, malformed probe output, and deterministic probe stubbing.  --------- Signed-off-by: jason-ma-nv <jama@nvidia.com> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> Signed-off-by: Carlos Villela <cvillela@nvidia.com> Co-authored-by: Charan Jagwani <cjagwani@nvidia.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>

jason-ma-nv self-assigned this May 10, 2026

jason-ma-nv and others added 3 commits May 11, 2026 09:30

Merge branch 'main' into worktree-fix-3175-reconcile-agent-model

fa8f2cc

Merge remote-tracking branch 'origin/main' into worktree-fix-3175-rec…

5cca3f2

…oncile-agent-model

fix(inference): add managed model switch command

048212f

coderabbitai Bot reviewed May 11, 2026

View reviewed changes

wscurran added OpenShell labels May 11, 2026

fix: reuse shared shell quoting helper

98f92ad

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

coderabbitai Bot reviewed May 11, 2026

View reviewed changes

ericksoa approved these changes May 11, 2026

View reviewed changes

ericksoa merged commit 9bc5e7c into main May 11, 2026
17 checks passed

jyaunches mentioned this pull request May 11, 2026

refactor(cli): group remaining architecture modules #3294

Merged

12 tasks

This was referenced May 18, 2026

[Inference] openshell inference set changes gateway model but sandbox agent still reports old model #3175

Closed

fix(runtime): probe live gateway to reconcile agent identity (#3175) #3679

Merged

wscurran added area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression and removed OpenShell labels Jun 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(runtime): reconcile agent identity with provider on sandbox start (#3175)#3319

fix(runtime): reconcile agent identity with provider on sandbox start (#3175)#3319
ericksoa merged 5 commits into
mainfrom
worktree-fix-3175-reconcile-agent-model

jason-ma-nv commented May 10, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

copy-pr-bot Bot commented May 10, 2026

Uh oh!

coderabbitai Bot commented May 10, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

wscurran commented May 11, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

ericksoa left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jason-ma-nv commented May 10, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related Issue

Changes

Type of Change

Verification

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented May 10, 2026

Uh oh!

coderabbitai Bot commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

wscurran commented May 11, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

ericksoa left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jason-ma-nv commented May 10, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 10, 2026 •

edited

Loading