Skip to content

fix(runtime): reconcile agent identity with provider on sandbox start (#3175)#3319

Merged
ericksoa merged 5 commits into
mainfrom
worktree-fix-3175-reconcile-agent-model
May 11, 2026
Merged

fix(runtime): reconcile agent identity with provider on sandbox start (#3175)#3319
ericksoa merged 5 commits into
mainfrom
worktree-fix-3175-reconcile-agent-model

Conversation

@jason-ma-nv

@jason-ma-nv jason-ma-nv commented May 10, 2026

Copy link
Copy Markdown
Contributor

Summary

After openshell inference set swaps the gateway's inference provider entry, agents.defaults.model.primary in openclaw.json can drift from models.providers.<key>.models[0].id. The gateway then routes requests to the new model but the OpenClaw agent self-reports the old one. This change reconciles the two on every sandbox start so the agent boots with a consistent identity.

Related Issue

Fixes #3175

Changes

  • Add reconcile_agent_model_with_provider to scripts/nemoclaw-start.sh. It aligns agents.defaults.model.primary to the inference provider's first model id, recomputes .config-hash, and is a no-op when already in sync, when the file is missing, or when no inference provider is configured.
  • Wire the new function in after apply_model_override in both the non-root and root entrypoint paths so explicit NEMOCLAW_MODEL_OVERRIDE values still win.
  • Add test/nemoclaw-start-reconcile.test.ts covering the drift case, the in-sync no-op, the missing-provider no-op, and the missing-keys no-op.
  • Update existing runPreGatewaySetup and non-root fallback test scaffolds in test/nemoclaw-start.test.ts to stub the new function so they continue to pass.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Notes for reviewers:

  • The full CLI vitest project (3288 tests) was run manually and passed.
  • npx prek run --all-files was attempted but the Test (CLI) hook hits a vitest@4 + v8 coverage race condition (unhandled rejection on coverage/cli/.tmp/coverage-<pid>.json) under prek's environment, while the same hook command run directly passes. Reproducible on the existing main branch with no changes; not introduced by this PR.

Signed-off-by: jason jama@nvidia.com

Summary by CodeRabbit

  • New Features

    • Agent model config now auto-synchronizes with the active inference provider on startup.
    • New CLI: nemoclaw inference set to switch inference provider/model at runtime and update the running sandbox.
    • Host-side sandbox config helpers to write configs and recompute sandbox config hashes.
  • Tests

    • Added comprehensive tests for reconciliation, the inference-set workflow, CLI wiring, and multiple edge cases.
  • Documentation

    • Updated docs to use nemoclaw inference set and describe runtime model/provider switching.

Review Change Stack

…#3175)

When `openshell inference set` swaps the gateway's inference provider
entry, agents.defaults.model.primary in openclaw.json can drift from
models.providers.<key>.models[0].id. The gateway then routes requests
to the new model but the OpenClaw agent self-reports the old one.

Add reconcile_agent_model_with_provider in nemoclaw-start.sh and call
it after apply_model_override in both the root and non-root entrypoint
paths. The reconciler aligns primary to the provider's first model id,
recomputes the config hash, and is a no-op when already in sync, so
explicit NEMOCLAW_MODEL_OVERRIDE values still win.

Signed-off-by: Test User <test@example.com>
@jason-ma-nv jason-ma-nv self-assigned this May 10, 2026
@copy-pr-bot

copy-pr-bot Bot commented May 10, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented May 10, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

Adds a runtime "inference set" workflow (CLI, action, resolver, sandbox persistence helpers, tests, docs) and a startup reconciliation hook that, when run as root, patches sandbox openclaw.json agent primary model to match the gateway routed model and recomputes the config hash.

Changes

Runtime Inference Set + Startup Reconciliation

Layer / File(s) Summary
Action core
src/lib/actions/inference-set.ts
Adds runInferenceSet, patchOpenClawInferenceConfig, types, validation, OpenShell invocation, sandbox config mutation, registry/session sync, audit logging, and InferenceSetError.
CLI command & registration
src/lib/commands/inference/set.ts, src/commands/inference/set.ts
New inference:set oclif command and CLI registration wrapper; parses flags and calls runInferenceSet.
Global dispatch & adapter tests
src/lib/cli/oclif-dispatch.ts, src/lib/cli/command-registry.test.ts, src/lib/commands/global-oclif-command-adapters.test.ts
Recognizes global inference set dispatch, updates command registry expectations, and adds adapter tests wiring runInferenceSet.
Provider config resolver
src/lib/inference/config.ts, src/lib/onboard/providers.ts
Adds SandboxInferenceConfig and getSandboxInferenceConfig, moves provider routing logic out of onboard/providers and re-exports it.
Sandbox persistence helpers
src/lib/sandbox-config.ts
Exports AgentConfigTarget, adds writeSandboxConfig() and recomputeSandboxConfigHash() plus shellQuote() for host-side sandbox config writes and hash recompute.
Entrypoint reconciliation
scripts/nemoclaw-start.sh
Adds reconcile_agent_model_with_provider() that, when run as root, compares agents.defaults.model.primary to the gateway provider's model ref and patches openclaw.json, recomputes .config-hash, and restores ownership/permissions; invoked after apply_model_override() in both non-root and root startup paths (non-root path is a no-op).
Reconciliation tests
test/nemoclaw-start-reconcile.test.ts
New Vitest harness running the reconciliation function in a sandbox workspace covering drift, already-aligned, missing-provider, missing-keys, and provider-model-without-name scenarios.
Startup harness updates
test/nemoclaw-start.test.ts
Injects reconcile stub into two existing inlined startup test harnesses to preserve expected function ordering.
Docs & help
docs/**, src/lib/actions/root-help.ts, docs/reference/commands.md
Documentation and help text updated to use nemoclaw inference set for runtime provider/model switching and to document that it patches the running OpenClaw config and recomputes its hash.
Unit tests for action
src/lib/actions/inference-set.test.ts
Extensive unit tests for patchOpenClawInferenceConfig and runInferenceSet, covering rewrites, no-ops, provider specifics, orchestration side effects, and negative paths.

🎯 3 (Moderate) | ⏱️ ~20 minutes

Sequence Diagram(s)

sequenceDiagram
  participant CLI as NemoClaw CLI
  participant Cmd as InferenceSetCommand
  participant Action as runInferenceSet
  participant OpenShell as openshell
  participant Sandbox as Sandbox persistence
  participant Registry as NemoClaw registry

  CLI->>Cmd: user runs "nemoclaw inference set"
  Cmd->>Action: call runInferenceSet(options)
  Action->>OpenShell: runOpenshell inference set (apply gateway route)
  OpenShell-->>Action: exit status
  Action->>Sandbox: writeSandboxConfig(updated openclaw.json)
  Action->>Sandbox: recomputeSandboxConfigHash()
  Sandbox-->>Action: confirm write and hash
  Action->>Registry: update sandbox state and onboard session
  Action-->>Cmd: return InferenceSetResult
  Cmd-->>CLI: display result
Loading

"I hop and patch where models roam,
Gateway speaks and sands come home,
Hashes recomputed, permissions right,
Agents awake to the new model's light." 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 15.63% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'fix(runtime): reconcile agent identity with provider on sandbox start' clearly and concisely summarizes the main objective—synchronizing the agent's reported model identity with the gateway's inference provider at runtime startup—without unnecessary details or vague terminology.
Linked Issues check ✅ Passed The PR comprehensively addresses the objective in issue #3175 by adding runtime reconciliation logic that ensures the sandbox agent reports the same model identity as the gateway provider after inference changes, without requiring a rebuild.
Out of Scope Changes check ✅ Passed All changes are scoped to implementing the runtime reconciliation feature and its related CLI infrastructure; documentation updates, tests, and code organization changes directly support the fix without introducing unrelated modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch worktree-fix-3175-reconcile-agent-model

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/lib/sandbox-config.ts (1)

843-843: 💤 Low value

Inconsistent shell escaping: configSet uses unquoted path while writeSandboxConfig quotes it.

The new writeSandboxConfig correctly uses shellQuote(target.configPath) at line 393, but the existing configSet function passes target.configPath directly to the shell command without quoting. While configPath values are currently controlled (e.g., /sandbox/.openclaw/openclaw.json), this inconsistency could become a problem if paths ever contain special characters.

Consider applying the same quoting for consistency:

♻️ Suggested fix
       "-i",
       "--",
       "sh",
       "-c",
-      `cat > ${target.configPath}`,
+      `cat > ${shellQuote(target.configPath)}`,
     ],
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/sandbox-config.ts` at line 843, The configSet function currently
injects target.configPath unquoted into the shell command while
writeSandboxConfig uses shellQuote(target.configPath); update configSet to wrap
the path with shellQuote (use the same helper as writeSandboxConfig) when
constructing the `cat > ${...}` command to ensure paths with spaces/special
chars are safely quoted and to keep behavior consistent between configSet and
writeSandboxConfig.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/lib/sandbox-config.ts`:
- Line 843: The configSet function currently injects target.configPath unquoted
into the shell command while writeSandboxConfig uses
shellQuote(target.configPath); update configSet to wrap the path with shellQuote
(use the same helper as writeSandboxConfig) when constructing the `cat > ${...}`
command to ensure paths with spaces/special chars are safely quoted and to keep
behavior consistent between configSet and writeSandboxConfig.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c169b42e-f2e7-4c4a-83e8-565b4800955b

📥 Commits

Reviewing files that changed from the base of the PR and between 3c87b2c and 048212f.

📒 Files selected for processing (21)
  • docs/inference/switch-inference-providers.md
  • docs/inference/use-local-inference.md
  • docs/manage-sandboxes/lifecycle.md
  • docs/reference/cli-selection-guide.md
  • docs/reference/commands.md
  • docs/security/best-practices.md
  • scripts/nemoclaw-start.sh
  • src/commands/inference/set.ts
  • src/lib/actions/inference-set.test.ts
  • src/lib/actions/inference-set.ts
  • src/lib/actions/root-help.ts
  • src/lib/cli/command-registry.test.ts
  • src/lib/cli/oclif-dispatch.test.ts
  • src/lib/cli/oclif-dispatch.ts
  • src/lib/commands/global-oclif-command-adapters.test.ts
  • src/lib/commands/inference/set.ts
  • src/lib/inference/config.ts
  • src/lib/onboard/providers.ts
  • src/lib/sandbox-config.ts
  • test/nemoclaw-start-reconcile.test.ts
  • test/nemoclaw-start.test.ts
✅ Files skipped from review due to trivial changes (5)
  • src/commands/inference/set.ts
  • docs/manage-sandboxes/lifecycle.md
  • docs/security/best-practices.md
  • src/lib/cli/oclif-dispatch.test.ts
  • docs/reference/commands.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/nemoclaw-start.test.ts

@wscurran

Copy link
Copy Markdown
Contributor

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/lib/sandbox-config.ts (1)

368-373: 💤 Low value

Consider eliminating the unnecessary temp file round-trip.

The function writes the config to a temp file and immediately reads it back only to pipe it to stdin. The serialized content could be used directly, removing the temp file overhead entirely.

♻️ Suggested simplification
 function writeSandboxConfig(
   sandboxName: string,
   target: AgentConfigTarget,
   config: ConfigObject,
 ): void {
-  const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "nemoclaw-config-"));
-  const tmpFile = path.join(tmpDir, target.configFile);
+  const content = serializeConfig(config, target.format);
   try {
-    fs.writeFileSync(tmpFile, serializeConfig(config, target.format), { mode: 0o600 });
-
-    const content = fs.readFileSync(tmpFile, "utf-8");
     dockerExecFileSync(
       [
         "exec",
         "-i",
         K3S_CONTAINER,
         ...
       ],
       { input: content, stdio: ["pipe", "pipe", "pipe"], timeout: 15000 },
     );
     // ... chown block unchanged ...
-  } finally {
-    try {
-      fs.unlinkSync(tmpFile);
-      fs.rmdirSync(tmpDir);
-    } catch {
-      // Best effort.
-    }
   }
+  catch (err) {
+    throw err;
+  }
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/sandbox-config.ts` around lines 368 - 373, Avoid the temp-file round
trip: remove creation of tmpDir/tmpFile and the writeFileSync/readFileSync calls
and instead call serializeConfig(config, target.format) once, store that string
(e.g., serialized) and pass it directly to the code that currently reads the
file to pipe to stdin; update any permission/cleanup logic tied to
tmpDir/tmpFile and ensure references to tmpFile/tmpDir are removed (look for
serializeConfig, config, target.format, target.configFile usage to locate the
code to change).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/lib/sandbox-config.ts`:
- Around line 368-373: Avoid the temp-file round trip: remove creation of
tmpDir/tmpFile and the writeFileSync/readFileSync calls and instead call
serializeConfig(config, target.format) once, store that string (e.g.,
serialized) and pass it directly to the code that currently reads the file to
pipe to stdin; update any permission/cleanup logic tied to tmpDir/tmpFile and
ensure references to tmpFile/tmpDir are removed (look for serializeConfig,
config, target.format, target.configFile usage to locate the code to change).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3f2af97e-2cdb-4660-b7b0-a9fa9246b6d5

📥 Commits

Reviewing files that changed from the base of the PR and between 048212f and 98f92ad.

📒 Files selected for processing (1)
  • src/lib/sandbox-config.ts

@ericksoa ericksoa left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved after replay fixes and green CI on head 98f92ad.

@ericksoa ericksoa merged commit 9bc5e7c into main May 11, 2026
17 checks passed
@wscurran wscurran added area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression and removed OpenShell labels Jun 3, 2026
cv added a commit that referenced this pull request Jun 4, 2026
…3679)

## Summary
The `reconcile_agent_model_with_provider` safety net added in #3319
reads both sides of its equality check from the same file (primary vs
`providers.inference.models[0].name`). When a user runs `openshell
inference set` — which only writes the gateway and leaves
`/sandbox/.openclaw/openclaw.json` untouched — both fields stay equal to
each other and the reconciler short-circuits as a no-op. This PR makes
the reconciler probe the live gateway via `openshell inference get
--json` and use that as the source of truth so the file is actually
realigned to the routed model on the next sandbox start.

## Related Issue
Partial fix for #3175. See the [issue
comment](#3175 (comment))
for why a complete fix also requires an OpenShell-side change.

## Changes
- `scripts/nemoclaw-start.sh`: `reconcile_agent_model_with_provider` now
probes the live gateway and, when a model is returned, treats it as
authoritative — aligning both `agents.defaults.model.primary` and
`models.providers.inference.models[0].{id,name}` so the file no longer
carries a stale entry the gateway can push back on its next reconcile.
- Legacy in-file reconcile is preserved as a fallback for environments
where the openshell binary is missing or the probe fails (no behavioral
change for existing test scaffolds without a stubbed openshell).
- `test/nemoclaw-start-reconcile.test.ts`: harness now optionally
installs a stubbed `openshell` on PATH and scrubs any inherited host
openshell. Adds four cases covering the gateway-mode happy path (the
user's repro from #3175), inference-prefix idempotency,
gateway-matches-file no-op, and empty-probe fallback.

## Scope
This PR does not stop the silent runtime revert reported in the #3175
repro (gateway version 3 → 4 without a sandbox restart). That push-back
is owned by openshell-gateway and needs an OpenShell-side change; see
the linked issue comment.

## Type of Change
- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [ ] \`npx prek run --all-files\` passes
- [x] \`npm test\` passes for the touched test file and the adjacent
\`test/nemoclaw-start.test.ts\` (81/81)
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [ ] \`make docs\` builds without warnings (doc changes only)

Notes for reviewers:
- Local full CLI vitest run had 4 pre-existing failures on this machine
unrelated to the diff (`ssrf-parity` build artifact missing,
`cli.test.ts` gateway-cleanup timeouts, `sandbox-build-context` umask
775 vs 755). Will rely on PR CI to validate.
- The gateway-probe Python block uses `subprocess.run(timeout=3)` to
bound the probe and silently falls back on any error.

---
Signed-off-by: jason-ma-nv <jama@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Reconciliation now prefers a live gateway probe as the source of truth
for model alignment; when available it updates the agent's primary model
and the provider's primary model entry, with a legacy in-file fallback
if the probe is unavailable.

* **Tests**
* Expanded tests cover probe-driven updates, no-op when already aligned,
fallback behavior, malformed probe output, and deterministic probe
stubbing.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: jason-ma-nv <jama@nvidia.com>
Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Co-authored-by: Charan Jagwani <cjagwani@nvidia.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Inference] openshell inference set changes gateway model but sandbox agent still reports old model

3 participants