fix(runtime): reconcile agent identity with provider on sandbox start (#3175)#3319
Conversation
…#3175) When `openshell inference set` swaps the gateway's inference provider entry, agents.defaults.model.primary in openclaw.json can drift from models.providers.<key>.models[0].id. The gateway then routes requests to the new model but the OpenClaw agent self-reports the old one. Add reconcile_agent_model_with_provider in nemoclaw-start.sh and call it after apply_model_override in both the root and non-root entrypoint paths. The reconciler aligns primary to the provider's first model id, recomputes the config hash, and is a no-op when already in sync, so explicit NEMOCLAW_MODEL_OVERRIDE values still win. Signed-off-by: Test User <test@example.com>
📝 WalkthroughWalkthroughAdds a runtime "inference set" workflow (CLI, action, resolver, sandbox persistence helpers, tests, docs) and a startup reconciliation hook that, when run as root, patches sandbox openclaw.json agent primary model to match the gateway routed model and recomputes the config hash. ChangesRuntime Inference Set + Startup Reconciliation
🎯 3 (Moderate) | ⏱️ ~20 minutes Sequence Diagram(s)sequenceDiagram
participant CLI as NemoClaw CLI
participant Cmd as InferenceSetCommand
participant Action as runInferenceSet
participant OpenShell as openshell
participant Sandbox as Sandbox persistence
participant Registry as NemoClaw registry
CLI->>Cmd: user runs "nemoclaw inference set"
Cmd->>Action: call runInferenceSet(options)
Action->>OpenShell: runOpenshell inference set (apply gateway route)
OpenShell-->>Action: exit status
Action->>Sandbox: writeSandboxConfig(updated openclaw.json)
Action->>Sandbox: recomputeSandboxConfigHash()
Sandbox-->>Action: confirm write and hash
Action->>Registry: update sandbox state and onboard session
Action-->>Cmd: return InferenceSetResult
Cmd-->>CLI: display result
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/sandbox-config.ts (1)
843-843: 💤 Low valueInconsistent shell escaping:
configSetuses unquoted path whilewriteSandboxConfigquotes it.The new
writeSandboxConfigcorrectly usesshellQuote(target.configPath)at line 393, but the existingconfigSetfunction passestarget.configPathdirectly to the shell command without quoting. WhileconfigPathvalues are currently controlled (e.g.,/sandbox/.openclaw/openclaw.json), this inconsistency could become a problem if paths ever contain special characters.Consider applying the same quoting for consistency:
♻️ Suggested fix
"-i", "--", "sh", "-c", - `cat > ${target.configPath}`, + `cat > ${shellQuote(target.configPath)}`, ],🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/sandbox-config.ts` at line 843, The configSet function currently injects target.configPath unquoted into the shell command while writeSandboxConfig uses shellQuote(target.configPath); update configSet to wrap the path with shellQuote (use the same helper as writeSandboxConfig) when constructing the `cat > ${...}` command to ensure paths with spaces/special chars are safely quoted and to keep behavior consistent between configSet and writeSandboxConfig.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lib/sandbox-config.ts`:
- Line 843: The configSet function currently injects target.configPath unquoted
into the shell command while writeSandboxConfig uses
shellQuote(target.configPath); update configSet to wrap the path with shellQuote
(use the same helper as writeSandboxConfig) when constructing the `cat > ${...}`
command to ensure paths with spaces/special chars are safely quoted and to keep
behavior consistent between configSet and writeSandboxConfig.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c169b42e-f2e7-4c4a-83e8-565b4800955b
📒 Files selected for processing (21)
docs/inference/switch-inference-providers.mddocs/inference/use-local-inference.mddocs/manage-sandboxes/lifecycle.mddocs/reference/cli-selection-guide.mddocs/reference/commands.mddocs/security/best-practices.mdscripts/nemoclaw-start.shsrc/commands/inference/set.tssrc/lib/actions/inference-set.test.tssrc/lib/actions/inference-set.tssrc/lib/actions/root-help.tssrc/lib/cli/command-registry.test.tssrc/lib/cli/oclif-dispatch.test.tssrc/lib/cli/oclif-dispatch.tssrc/lib/commands/global-oclif-command-adapters.test.tssrc/lib/commands/inference/set.tssrc/lib/inference/config.tssrc/lib/onboard/providers.tssrc/lib/sandbox-config.tstest/nemoclaw-start-reconcile.test.tstest/nemoclaw-start.test.ts
✅ Files skipped from review due to trivial changes (5)
- src/commands/inference/set.ts
- docs/manage-sandboxes/lifecycle.md
- docs/security/best-practices.md
- src/lib/cli/oclif-dispatch.test.ts
- docs/reference/commands.md
🚧 Files skipped from review as they are similar to previous changes (1)
- test/nemoclaw-start.test.ts
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/sandbox-config.ts (1)
368-373: 💤 Low valueConsider eliminating the unnecessary temp file round-trip.
The function writes the config to a temp file and immediately reads it back only to pipe it to stdin. The serialized content could be used directly, removing the temp file overhead entirely.
♻️ Suggested simplification
function writeSandboxConfig( sandboxName: string, target: AgentConfigTarget, config: ConfigObject, ): void { - const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "nemoclaw-config-")); - const tmpFile = path.join(tmpDir, target.configFile); + const content = serializeConfig(config, target.format); try { - fs.writeFileSync(tmpFile, serializeConfig(config, target.format), { mode: 0o600 }); - - const content = fs.readFileSync(tmpFile, "utf-8"); dockerExecFileSync( [ "exec", "-i", K3S_CONTAINER, ... ], { input: content, stdio: ["pipe", "pipe", "pipe"], timeout: 15000 }, ); // ... chown block unchanged ... - } finally { - try { - fs.unlinkSync(tmpFile); - fs.rmdirSync(tmpDir); - } catch { - // Best effort. - } } + catch (err) { + throw err; + } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/sandbox-config.ts` around lines 368 - 373, Avoid the temp-file round trip: remove creation of tmpDir/tmpFile and the writeFileSync/readFileSync calls and instead call serializeConfig(config, target.format) once, store that string (e.g., serialized) and pass it directly to the code that currently reads the file to pipe to stdin; update any permission/cleanup logic tied to tmpDir/tmpFile and ensure references to tmpFile/tmpDir are removed (look for serializeConfig, config, target.format, target.configFile usage to locate the code to change).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lib/sandbox-config.ts`:
- Around line 368-373: Avoid the temp-file round trip: remove creation of
tmpDir/tmpFile and the writeFileSync/readFileSync calls and instead call
serializeConfig(config, target.format) once, store that string (e.g.,
serialized) and pass it directly to the code that currently reads the file to
pipe to stdin; update any permission/cleanup logic tied to tmpDir/tmpFile and
ensure references to tmpFile/tmpDir are removed (look for serializeConfig,
config, target.format, target.configFile usage to locate the code to change).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3f2af97e-2cdb-4660-b7b0-a9fa9246b6d5
📒 Files selected for processing (1)
src/lib/sandbox-config.ts
…3679) ## Summary The `reconcile_agent_model_with_provider` safety net added in #3319 reads both sides of its equality check from the same file (primary vs `providers.inference.models[0].name`). When a user runs `openshell inference set` — which only writes the gateway and leaves `/sandbox/.openclaw/openclaw.json` untouched — both fields stay equal to each other and the reconciler short-circuits as a no-op. This PR makes the reconciler probe the live gateway via `openshell inference get --json` and use that as the source of truth so the file is actually realigned to the routed model on the next sandbox start. ## Related Issue Partial fix for #3175. See the [issue comment](#3175 (comment)) for why a complete fix also requires an OpenShell-side change. ## Changes - `scripts/nemoclaw-start.sh`: `reconcile_agent_model_with_provider` now probes the live gateway and, when a model is returned, treats it as authoritative — aligning both `agents.defaults.model.primary` and `models.providers.inference.models[0].{id,name}` so the file no longer carries a stale entry the gateway can push back on its next reconcile. - Legacy in-file reconcile is preserved as a fallback for environments where the openshell binary is missing or the probe fails (no behavioral change for existing test scaffolds without a stubbed openshell). - `test/nemoclaw-start-reconcile.test.ts`: harness now optionally installs a stubbed `openshell` on PATH and scrubs any inherited host openshell. Adds four cases covering the gateway-mode happy path (the user's repro from #3175), inference-prefix idempotency, gateway-matches-file no-op, and empty-probe fallback. ## Scope This PR does not stop the silent runtime revert reported in the #3175 repro (gateway version 3 → 4 without a sandbox restart). That push-back is owned by openshell-gateway and needs an OpenShell-side change; see the linked issue comment. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [ ] \`npx prek run --all-files\` passes - [x] \`npm test\` passes for the touched test file and the adjacent \`test/nemoclaw-start.test.ts\` (81/81) - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] \`make docs\` builds without warnings (doc changes only) Notes for reviewers: - Local full CLI vitest run had 4 pre-existing failures on this machine unrelated to the diff (`ssrf-parity` build artifact missing, `cli.test.ts` gateway-cleanup timeouts, `sandbox-build-context` umask 775 vs 755). Will rely on PR CI to validate. - The gateway-probe Python block uses `subprocess.run(timeout=3)` to bound the probe and silently falls back on any error. --- Signed-off-by: jason-ma-nv <jama@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Reconciliation now prefers a live gateway probe as the source of truth for model alignment; when available it updates the agent's primary model and the provider's primary model entry, with a legacy in-file fallback if the probe is unavailable. * **Tests** * Expanded tests cover probe-driven updates, no-op when already aligned, fallback behavior, malformed probe output, and deterministic probe stubbing. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: jason-ma-nv <jama@nvidia.com> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> Signed-off-by: Carlos Villela <cvillela@nvidia.com> Co-authored-by: Charan Jagwani <cjagwani@nvidia.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Summary
After
openshell inference setswaps the gateway's inference provider entry,agents.defaults.model.primaryinopenclaw.jsoncan drift frommodels.providers.<key>.models[0].id. The gateway then routes requests to the new model but the OpenClaw agent self-reports the old one. This change reconciles the two on every sandbox start so the agent boots with a consistent identity.Related Issue
Fixes #3175
Changes
reconcile_agent_model_with_providertoscripts/nemoclaw-start.sh. It alignsagents.defaults.model.primaryto the inference provider's first model id, recomputes.config-hash, and is a no-op when already in sync, when the file is missing, or when no inference provider is configured.apply_model_overridein both the non-root and root entrypoint paths so explicitNEMOCLAW_MODEL_OVERRIDEvalues still win.test/nemoclaw-start-reconcile.test.tscovering the drift case, the in-sync no-op, the missing-provider no-op, and the missing-keys no-op.runPreGatewaySetupand non-root fallback test scaffolds intest/nemoclaw-start.test.tsto stub the new function so they continue to pass.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Notes for reviewers:
npx prek run --all-fileswas attempted but theTest (CLI)hook hits a vitest@4 + v8 coverage race condition (unhandled rejection oncoverage/cli/.tmp/coverage-<pid>.json) under prek's environment, while the same hook command run directly passes. Reproducible on the existing main branch with no changes; not introduced by this PR.Signed-off-by: jason jama@nvidia.com
Summary by CodeRabbit
New Features
nemoclaw inference setto switch inference provider/model at runtime and update the running sandbox.Tests
Documentation
nemoclaw inference setand describe runtime model/provider switching.