fix(runtime): probe live gateway to reconcile agent identity (#3175) #3679
The reconciler in scripts/nemoclaw-start.sh was meant to be the safety
net for users who run `openshell inference set` instead of `nemoclaw
inference set`. As written it read both sides of its equality check
from the same file (primary vs providers.inference.models[0].name),
so when `openshell inference set` modified only the gateway and left
the sandbox file untouched, the reconciler saw the two file fields
still equal and short-circuited as a no-op.
Probe the live gateway via `openshell inference get --json` and use
it as the source of truth. When the gateway model differs from the
file, align primary AND models[0].{id,name} so the agent identity
and the gateway route stay consistent across the next reconcile
cycle. Fall back to the legacy in-file reconcile when openshell is
unavailable so environments without the probe still resolve drift.
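
The decision flow described above can be sketched in Python. This is a minimal illustration and not the code in scripts/nemoclaw-start.sh: the function names `bare` and `reconcile` are hypothetical, the config paths follow the ones named in the PR description, and writing the bare model name into all three fields is an assumption (the source only states that the id is written bare).

```python
import copy

PREFIX = "inference/"  # assumption: route prefix that may appear on model names


def bare(name):
    """Strip the assumed prefix so 'inference/foo' compares equal to 'foo'."""
    if isinstance(name, str) and name.startswith(PREFIX):
        return name[len(PREFIX):]
    return name


def reconcile(config, gateway_model=None):
    """Return (new_config, changed) without mutating the input.

    gateway_model=None stands for "probe unavailable or unusable", which
    selects the legacy in-file reconcile.
    """
    cfg = copy.deepcopy(config)
    agent = cfg["agents"]["defaults"]["model"]
    first = cfg["models"]["providers"]["inference"]["models"][0]

    if gateway_model:
        # Gateway mode: the live gateway is the source of truth.
        target = bare(gateway_model)
        in_sync = all(
            bare(value) == target
            for value in (agent["primary"], first["id"], first["name"])
        )
        if in_sync:
            return cfg, False
        agent["primary"] = target
        first["id"] = target
        first["name"] = target
        return cfg, True

    # Legacy fallback: align primary to the file's first model and leave
    # models[0] untouched.
    target = bare(first["name"])
    if bare(agent["primary"]) == target:
        return cfg, False
    agent["primary"] = target
    return cfg, True
```

Because the comparison runs on prefix-stripped values, a file that already matches the gateway is left alone even if one field still carries the prefix.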
This is a partial fix for #3175: it closes the agent self-report
drift at the next sandbox restart, but the silent runtime revert
in the user's repro also needs an OpenShell-side fix to stop the
gateway from pushing the sandbox's stale config back onto a runtime
`inference set` change.
Signed-off-by: jason-ma-nv <jama@nvidia.com>
Walkthrough
The PR makes `reconcile_agent_model_with_provider` probe the live gateway (openshell) for the current model, normalize and align the agent and provider model entries when the probe succeeds, and fall back to legacy in-file reconciliation when the probe is unavailable or returns invalid output. Tests and harness support for gateway stubbing are added.
Sprint 5 planning update: we’re organizing this PR as a partial fix for #3175. Relationship:
This PR should be tracked for Sprint 5 review, but it should not auto-close #3175 when merged.
… JSON Adds the missing coverage case from a deep-review pass: today the gateway probe absorbs malformed stdout via Python's `json.loads` → `SystemExit(0)` → empty `gateway_model` → legacy in-file reconcile. That behavior is correct but only protected by the implementation detail of the parser. A future refactor that swaps the parser or changes the exit handling could silently turn the malformed-output case into "do nothing" without any unit test catching it. Extends the test harness with a `gatewayRawOutput` option that lets the stub emit arbitrary stdout (bypassing the JSON-formatted shape), then adds one test that drives an HTML-like response and asserts the legacy-fallback shape (primary aligned to file's first model, models[0] untouched). 10/10 reconciler tests pass. Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
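
The malformed-output path described above can be made explicit rather than left to parser behavior. The sketch below is an assumption-laden illustration, not the script's implementation: the source says the real probe relies on `json.loads` → `SystemExit(0)` → empty `gateway_model`, whereas this version returns `None` to signal "use the legacy fallback". The JSON key `"model"` and the function names are hypothetical, since the output shape of `openshell inference get --json` is not given here.

```python
import json
import subprocess


def parse_gateway_model(stdout):
    """Return the model name from probe stdout, or None if it is unusable."""
    try:
        payload = json.loads(stdout)
    except (json.JSONDecodeError, TypeError):
        # Malformed output (for example an HTML error page) or non-text input.
        return None
    if not isinstance(payload, dict):
        return None
    model = payload.get("model")  # assumption: hypothetical key name
    if isinstance(model, str) and model.strip():
        return model.strip()
    return None


def probe_gateway_model(cmd=("openshell", "inference", "get", "--json"), timeout=3):
    """Run the probe with a bounded timeout; any failure yields None."""
    try:
        result = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.SubprocessError):
        # Missing binary, permission error, or timeout: fall back silently.
        return None
    if result.returncode != 0:
        return None
    return parse_gateway_model(result.stdout)
```

A caller would treat `None` exactly like an empty probe result and run the legacy in-file reconcile.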
cjagwani
left a comment
Approving. Pushed one follow-up test commit (7f6ea8c) to pin the legacy-fallback shape when the gateway probe emits malformed JSON.
The case is already handled correctly today (json.loads → SystemExit(0) → empty gateway_model → legacy in-file reconcile), but only via implementation detail of the parser. A future refactor that swaps how stdout is parsed or how the exit is handled could silently degrade malformed-output to "do nothing" with no test catching it. The new test drives an HTML-like response through the existing harness (extended with a gatewayRawOutput option) and asserts the legacy-fallback config shape.
10/10 reconciler tests pass.
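
The harness idea of a stubbed `openshell` that emits arbitrary stdout can be illustrated in Python. This is a hedged sketch only: the real harness lives in test/nemoclaw-start-reconcile.test.ts and is TypeScript, the helper names below are hypothetical, and the stub relies on a POSIX shebang pointing at the current Python interpreter.

```python
import os
import shutil
import stat
import subprocess
import sys
import tempfile


def install_stub_openshell(directory, raw_output):
    """Write a fake `openshell` that prints raw_output verbatim and ignores its arguments."""
    path = os.path.join(directory, "openshell")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(f"#!{sys.executable}\n")
        fh.write("import sys\n")
        fh.write(f"sys.stdout.write({raw_output!r})\n")
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


def run_probe_with_stub(raw_output):
    """Run the probe command against the stub with the host PATH scrubbed."""
    with tempfile.TemporaryDirectory() as tmp:
        install_stub_openshell(tmp, raw_output)
        # Only the stub directory is searched, so a host openshell cannot leak in.
        exe = shutil.which("openshell", path=tmp)
        env = dict(os.environ, PATH=tmp)
        result = subprocess.run(
            [exe, "inference", "get", "--json"],
            capture_output=True,
            text=True,
            timeout=10,
            env=env,
        )
        return result.stdout
```

Feeding the stub an HTML-like string reproduces the malformed-output case, and feeding it a JSON string reproduces the happy path, without touching a real gateway.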
Heads-up, not blocking this PR:
- `command -v openshell` at the call site is host-namespaced; if a future packaging change ever ships a shim openshell without `inference get --json`, the current absorb-via-SystemExit(0) path covers it, but it is worth a code comment.
- Sync'd-but-non-canonical `id: "inference/foo"` values pass the equality check without rewriting (write path always writes bare). Cosmetic: config stays non-canonical until next true drift.
Scope check vs #3175: PR is intentionally a partial fix per its body and @wscurran's note — closes the "agent self-report stays old after restart" symptom but not the runtime v3→v4 silent revert that lives in the openshell-gateway layer. Issue should stay open.
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
## Summary
- Add the v0.0.59 release notes from the GitHub announcement discussion.
- Refresh local inference and credential-storage guidance for the
current release behavior.
- Regenerate the user skills from the updated Fern docs.
- Tighten release-prep and docs review guidance for generated skills, PR
labels, and shared `$$nemoclaw` command placeholders.
## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" --glob '*.{md,mdx}'`
- `git diff --check`
- `npm run docs` (rerun outside sandbox after sandbox-only `tsx` IPC
permission failure)
- `npm run typecheck:cli`
- Pre-commit hooks during commit passed, including markdownlint,
docs-to-skills verification, gitleaks, commitlint, and skills YAML
tests.
## Source Summary
- #3679, #4437, #4681, #4766, #4772, #4775, #4786 ->
`docs/about/release-notes.mdx`, `docs/reference/commands.mdx`,
`docs/reference/troubleshooting.mdx`: Summarize OpenClaw 2026.5.27
compatibility, runtime path pinning, plugin registry recovery, live
gateway reconciliation, and clearer host-alias/startup diagnostics.
- #4332, #4402, #4769, #4776, #4779 -> `docs/about/release-notes.mdx`,
`docs/inference/inference-options.mdx`,
`docs/inference/use-local-inference.mdx`,
`docs/inference/switch-inference-providers.mdx`: Document the release
inference changes covering Local NIM waits, Hermes Anthropic routing,
Nemotron 3 Ultra, the current Ollama starter fallback, and Spark
managed-vLLM context length.
- #4628, #4652, #4733, #4745 -> `docs/about/release-notes.mdx`,
`docs/security/credential-storage.mdx`,
`docs/manage-sandboxes/messaging-channels.mdx`,
`docs/reference/troubleshooting.mdx`: Capture permission healing,
gateway-stored credential reuse, cross-sandbox messaging credential
conflict checks, and CDI preflight diagnostics.
- #4728, #4737, #4743, #4744, #4782 -> `.agents/skills/nemoclaw-user-*`:
Regenerate the user skill references from the updated source docs.
- Follow-up maintenance ->
`.agents/skills/nemoclaw-contributor-update-docs/SKILL.md`,
`.coderabbit.yaml`: Add release-prep area labels for docs and skills
PRs, and teach docs review guidance that `$$nemoclaw` is the correct
shared command placeholder for examples that work across agent aliases.
Note: the `documentation` label was not present in the repository, so
this PR is labeled with `v0.0.59` only.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Documentation**
* Updated default model for local Ollama inference setup to qwen3.5:9b
* Added Nemotron 3 Ultra 550B as an NVIDIA Endpoints model option
* Clarified credential storage and reuse behavior for post-deployment
(day-two) operations
* Added v0.0.59 release notes covering OpenClaw compatibility, inference
options, Hermes messaging sync, and troubleshooting
* Clarified CLI selection guidance and updated OpenClaw version example
in status output
* Revised release-prep instructions and docs review guidance for CLI
alias usage
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
The `reconcile_agent_model_with_provider` safety net added in #3319 reads both sides of its equality check from the same file (`primary` vs `providers.inference.models[0].name`). When a user runs `openshell inference set`, which only writes the gateway and leaves `/sandbox/.openclaw/openclaw.json` untouched, both fields stay equal to each other and the reconciler short-circuits as a no-op. This PR makes the reconciler probe the live gateway via `openshell inference get --json` and use that as the source of truth so the file is actually realigned to the routed model on the next sandbox start.
Related Issue
Partial fix for #3175. See the issue comment for why a complete fix also requires an OpenShell-side change.
Changes
- `scripts/nemoclaw-start.sh`: `reconcile_agent_model_with_provider` now probes the live gateway and, when a model is returned, treats it as authoritative, aligning both `agents.defaults.model.primary` and `models.providers.inference.models[0].{id,name}` so the file no longer carries a stale entry the gateway can push back on its next reconcile.
- `test/nemoclaw-start-reconcile.test.ts`: harness now optionally installs a stubbed `openshell` on PATH and scrubs any inherited host openshell. Adds four cases covering the gateway-mode happy path (the user's repro from #3175), inference-prefix idempotency, gateway-matches-file no-op, and empty-probe fallback.
Scope
This PR does not stop the silent runtime revert reported in the #3175 repro (gateway version 3 → 4 without a sandbox restart). That push-back is owned by openshell-gateway and needs an OpenShell-side change; see the linked issue comment.
Type of Change
Verification
Notes for reviewers:
- `ssrf-parity` build artifact missing, `cli.test.ts` gateway-cleanup timeouts, `sandbox-build-context` umask 775 vs 755). Will rely on PR CI to validate.
- `subprocess.run(timeout=3)` to bound the probe and silently falls back on any error.
Signed-off-by: jason-ma-nv <jama@nvidia.com>