fix(rebuild): reuse gateway-stored credential when host env is empty#4745
Conversation
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughRebuild and onboarding credential checks now consult the OpenShell gateway: missing env vars no longer abort when the provider is registered in the gateway. Provider CLI arg building omits empty credentials on update. Unit and E2E tests cover gateway-stored credential reuse and missing-provider failures. ChangesGateway-aware credential validation across rebuild and onboarding
Sequence DiagramsequenceDiagram
participant User
participant Rebuild as RebuildPreflight
participant Env as hydrateCredentialEnv
participant Gateway as OpenShellGateway
participant CLI as RecreateCLI
User->>Rebuild: trigger rebuild (non-interactive)
Rebuild->>Env: check credential env (e.g., NVIDIA_API_KEY)
Env-->>Rebuild: missing
Rebuild->>Gateway: providerExistsInGateway(rebuildProvider)
alt provider registered in gateway
Gateway-->>Rebuild: true
Rebuild->>CLI: proceed without env credential (use gateway)
CLI-->>Rebuild: recreate proceeds
else provider not registered
Gateway-->>Rebuild: false
Rebuild->>User: abort with missing-credential message
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
PR Review AdvisorFindings: 0 needs attention, 4 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
E2E Scenario Advisor RecommendationRequired scenario E2E: Dispatch required scenario E2E:
Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 4435-4438: Remove the duplicated explanatory comment blocks about
skipping the env requirement for already-registered providers and condense them
into a single concise reference comment (e.g. "// See `#3895`: skip env
requirement when provider already registered in gateway") near the related logic
in the onboard flow; update the comment adjacent to the upsertProvider /
provider update handling (the code that mentions `upsertProvider`, `provider
update`, gateway rebuild after `channels add`) and delete the repeated blocks
elsewhere (also remove the duplicate at the other location referenced) so
behavior is unchanged but duplicate inline commentary is collapsed to one short
reference.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4f103dda-5332-469b-8a93-80e9f8963a0f
📒 Files selected for processing (6)
src/lib/actions/sandbox/rebuild.tssrc/lib/onboard.tssrc/lib/onboard/providers.test.tssrc/lib/onboard/providers.tstest/e2e/test-channels-add-remove.shtest/rebuild-credential-preflight.test.ts
Selective E2E Results — ✅ All requested jobs passedRun: 26927683622
|
…urces, stub openshell in vitest Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26931170802
|
…IA_API_KEY is unset Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ❌ Some jobs failedRun: 26932186450
|
Selective E2E Results — ❌ Some jobs failedRun: 26932586923
|
…e error Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/e2e/test-hermes-discord-e2e.sh (1)
599-605: ⚡ Quick winDump the rebuild log in the grep-failure branch too.
The new diagnostics only fire when
rebuildexits non-zero. If the command exits 0 but still logsprovider credential not found, this branch fails without printing the captured log, which makes the regression harder to triage in CI.🛠️ Suggested change
if [ "$rebuild_rc" -ne 0 ]; then fail "Hermes rebuild failed with NVIDIA_API_KEY unset (rc=${rebuild_rc})" echo "---- begin rebuild log (${HERMES_REBUILD_LOG}) ----" cat "$HERMES_REBUILD_LOG" 2>/dev/null || true echo "---- end rebuild log ----" elif grep -q "provider credential not found" "$HERMES_REBUILD_LOG"; then fail "REGRESSION — rebuild aborted on missing NVIDIA_API_KEY despite gateway-registered credential" + echo "---- begin rebuild log (${HERMES_REBUILD_LOG}) ----" + cat "$HERMES_REBUILD_LOG" 2>/dev/null || true + echo "---- end rebuild log ----" else pass "Hermes rebuild reused gateway-stored credential without NVIDIA_API_KEY" fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/test-hermes-discord-e2e.sh` around lines 599 - 605, The grep-failure branch (elif grep -q "provider credential not found" "$HERMES_REBUILD_LOG"; then) fails without printing the captured rebuild log; update that branch to dump the same diagnostics as the non-zero exit branch by emitting the "---- begin rebuild log (${HERMES_REBUILD_LOG}) ----" header, cat "$HERMES_REBUILD_LOG" 2>/dev/null || true, and the "---- end rebuild log ----" footer before failing so the rebuild output is available for CI triage.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/e2e/test-hermes-discord-e2e.sh`:
- Around line 599-605: The grep-failure branch (elif grep -q "provider
credential not found" "$HERMES_REBUILD_LOG"; then) fails without printing the
captured rebuild log; update that branch to dump the same diagnostics as the
non-zero exit branch by emitting the "---- begin rebuild log
(${HERMES_REBUILD_LOG}) ----" header, cat "$HERMES_REBUILD_LOG" 2>/dev/null ||
true, and the "---- end rebuild log ----" footer before failing so the rebuild
output is available for CI triage.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f2e006c0-33ff-4836-a5c7-7e7edfaa492d
📒 Files selected for processing (1)
test/e2e/test-hermes-discord-e2e.sh
Selective E2E Results — ❌ Some jobs failedRun: 26932932909
|
… surfaces Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ❌ Some jobs failedRun: 26933035536
|
Selective E2E Results — ❌ Some jobs failedRun: 26933621737
|
…t-owned scratch Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26935129127
|
…cause is fixed Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26935176986
|
Selective E2E Results — ✅ All requested jobs passedRun: 26936023663
|
Selective E2E Results — ✅ All requested jobs passedRun: 26935779178
|
prekshivyas
left a comment
There was a problem hiding this comment.
LGTM — approving. Traced the fix end to end:
- The gate
providerExistsInGatewayreturnsresult.status === 0, so it'strueonly whenopenshell provider get <name>genuinely succeeds. Any ambiguous case — gateway down, provider absent, command error — returnsfalse, so the env-credential check is not skipped. The fix fails safe: it can only bypass the host-env requirement when the provider is verifiably registered, never on a gateway outage. Good design. setupNim's reordering is correct: validate the key when present, otherwise only fail when the provider isn't in the gateway — preserving the invalid-key path and adding the gateway-aware skip.buildProviderArgs/upsertProviderdropping--credential KEYon the update path when the env is empty is the right call (OpenShell rejects an empty--credential), andincludeCredential = action === "create" || credentialValueAvailablekeepscreatecorrect. The "update without --credential preserves the gateway secret" assumption is exercised for real in the two e2e scripts withNVIDIA_API_KEYunset, and CI is green — so it's empirically validated, not just asserted.
Test coverage is strong: arg-builder unit cases (omit/keep/staged), rebuild preflight regressions, and e2e gating in both test-channels-add-remove.sh and test-hermes-discord-e2e.sh with careful env restore.
Non-blocking, take or leave: CodeRabbit's two nits are reasonable — collapse the duplicated explanatory comment in onboard.ts, and dump the rebuild log in the grep-failure branch of test-hermes-discord-e2e.sh (line ~603) so a CI regression is triageable. Neither blocks merge.
## Summary
- Add the v0.0.59 release notes from the GitHub announcement discussion.
- Refresh local inference and credential-storage guidance for the
current release behavior.
- Regenerate the user skills from the updated Fern docs.
- Tighten release-prep and docs review guidance for generated skills, PR
labels, and shared `$$nemoclaw` command placeholders.
## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" --glob '*.{md,mdx}'`
- `git diff --check`
- `npm run docs` (rerun outside sandbox after sandbox-only `tsx` IPC
permission failure)
- `npm run typecheck:cli`
- Pre-commit hooks during commit passed, including markdownlint,
docs-to-skills verification, gitleaks, commitlint, and skills YAML
tests.
## Source Summary
- #3679, #4437, #4681, #4766, #4772, #4775, #4786 ->
`docs/about/release-notes.mdx`, `docs/reference/commands.mdx`,
`docs/reference/troubleshooting.mdx`: Summarize OpenClaw 2026.5.27
compatibility, runtime path pinning, plugin registry recovery, live
gateway reconciliation, and clearer host-alias/startup diagnostics.
- #4332, #4402, #4769, #4776, #4779 -> `docs/about/release-notes.mdx`,
`docs/inference/inference-options.mdx`,
`docs/inference/use-local-inference.mdx`,
`docs/inference/switch-inference-providers.mdx`: Document the release
inference changes covering Local NIM waits, Hermes Anthropic routing,
Nemotron 3 Ultra, the current Ollama starter fallback, and Spark
managed-vLLM context length.
- #4628, #4652, #4733, #4745 -> `docs/about/release-notes.mdx`,
`docs/security/credential-storage.mdx`,
`docs/manage-sandboxes/messaging-channels.mdx`,
`docs/reference/troubleshooting.mdx`: Capture permission healing,
gateway-stored credential reuse, cross-sandbox messaging credential
conflict checks, and CDI preflight diagnostics.
- #4728, #4737, #4743, #4744, #4782 -> `.agents/skills/nemoclaw-user-*`:
Regenerate the user skill references from the updated source docs.
- Follow-up maintenance ->
`.agents/skills/nemoclaw-contributor-update-docs/SKILL.md`,
`.coderabbit.yaml`: Add release-prep area labels for docs and skills
PRs, and teach docs review guidance that `$$nemoclaw` is the correct
shared command placeholder for examples that work across agent aliases.
Note: the `documentation` label was not present in the repository, so
this PR is labeled with `v0.0.59` only.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Documentation**
* Updated default model for local Ollama inference setup to qwen3.5:9b
* Added Nemotron 3 Ultra 550B as an NVIDIA Endpoints model option
* Clarified credential storage and reuse behavior for post-deployment
(day-two) operations
* Added v0.0.59 release notes covering OpenClaw compatibility, inference
options, Hermes messaging sync, and troubleshooting
* Clarified CLI selection guidance and updated OpenClaw version example
in status output
* Revised release-prep instructions and docs review guidance for CLI
alias usage
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…ng onboard (#4598) ## Summary Two preflight cleanup paths assumed the OpenShell gateway and dashboard forward were process-wide singletons. When a second NemoClaw onboard ran with `NEMOCLAW_GATEWAY_PORT=N`, the preflight retired the existing per-port gateway as "legacy" and killed the first sandbox's dashboard SSH forward — leaving the first sandbox unreachable. This PR scopes both cleanups so the second instance starts its own gateway alongside the first instead of replacing it. ## Related Issue Fixes #4422 · Refs #3053 #4422 is the specific SIGKILL-on-second-onboard symptom: a second onboard with `NEMOCLAW_GATEWAY_PORT=N` destroyed the previous instance's per-port gateway and dashboard forward. This PR fixes both preflight cleanups so concurrent onboards no longer step on each other. #3053 is the broader ask — full multi-instance segregation of registry, credentials, snapshots, messaging, and lifecycle behind a configurable `NEMOCLAW_INSTANCE` identity. That work is out of scope here and tracked separately; this PR removes the destructive cross-talk that previously prevented two NemoClaw-managed sandboxes from coexisting at all, but does not yet introduce the instance identity primitive. ## Changes - `src/lib/onboard/machine/handlers/gateway.ts`: skip `retireLegacyGatewayForDockerDriverUpgrade` when `gatewayReuseState === "foreign-active"`. A foreign-active gateway is another sandbox's per-port `nemoclaw-<port>` — not legacy state to retire. Normalises to `"missing"` so the current onboard proceeds with its own per-port gateway alongside. - `src/lib/onboard.ts`: dashboard-port preflight no longer kills an "orphaned SSH port-forward" when `openshell forward list` shows the port is held by another live sandbox. The runtime allocator picks a different dashboard port for this sandbox at create time instead. - `src/lib/onboard/machine/handlers/gateway.test.ts`: unit test for the foreign-active no-retire branch. - `test/e2e/test-concurrent-gateway-ports.sh`: new E2E that onboards two sandboxes (default + `NEMOCLAW_GATEWAY_PORT=18080`), asserts both reach `Ready`, distinct gateway ports (8080 + 18080), distinct dashboard ports (18789 + 18790), and that destroying one leaves the other intact. Each sandbox is queried via its own gateway with `openshell sandbox list -g <gateway-name>` so the global active-gateway pointer does not flip the read. - `.github/workflows/nightly-e2e.yaml`: registers `concurrent-gateway-ports-e2e` in the dispatchable-jobs catalog, `needs` lists, and the advisor comment block. Also documents existing `openclaw-skill-cli-e2e` and `channels-add-remove-e2e` in the catalog so the PR-review E2E advisor surfaces them when relevant changes land — catches up leftover automation from PRs #4766 (#4709 OpenClaw skill CLI) and #4745 (#3895 channels add/remove) where the tests shipped but were never advertised to the advisor. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `npm run docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) --- Signed-off-by: Tinson Lai <tinsonl@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Manage concurrent gateway ports safely across multiple sandboxes on the same host. * **Bug Fixes** * Improved cleanup for orphaned SSH port-forwards that block dashboard ports. * **Tests** * Added E2E test validating concurrent gateway-port scenarios. * Added/updated unit tests for gateway-state and orphaned-forward handling. * **Chores** * Added nightly E2E workflow job for concurrent gateway port testing and integrated it into reporting. * **Documentation** * Expanded nightly E2E job documentation for related tests. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Tinson Lai <tinsonl@nvidia.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Summary
Rebuild preflight aborted with
provider credential not foundwhen the inference provider was already registered in the OpenShell gateway but the credential env var was missing from the host shell (e.g. afternemohermes channels addfrom a fresh terminal). #3030 fixed this only for the Hermes Provider; non-Hermes providers likenvidia-prodstill demanded the host env. Reuse the gateway-held secret when present and drop--credential KEYfromprovider updatecalls when the host env is empty so OpenShell's CLI does not reject the call before reaching the gateway.Related Issue
Fixes #3895
Changes
src/lib/actions/sandbox/rebuild.ts: skip the env-only preflight whenproviderExistsInGateway(rebuildProvider)returns true. The recreate step reuses the gateway-stored credential.src/lib/onboard.ts: gate the non-interactivebuildand remote-provider env checks insetupNimonproviderExistsInGateway(provider)so a gateway-registered provider does not require the host env in recreate flows.src/lib/onboard/providers.ts:upsertProviderdrops--credential KEYfromprovider updatecalls when the host env does not carry a value. OpenShell's CLI rejects--credential KEYwith an empty env (parse_credential_pairs), so the flag had to go for the no-rotation update path to succeed.provider createstill requires the credential.test/rebuild-credential-preflight.test.ts: regression tests fornvidia-prodwith the gateway-registered credential and host env unset, plus the negative gate where both gateway and env are empty. Existing negative tests pinned toproviderRegistered: falseto keep their original intent.src/lib/onboard/providers.test.ts: new tests for theupdatearg builder with and without a staged env value, and for thecreatepath still requiring the credential.test/e2e/test-channels-add-remove.sh: unsetNVIDIA_API_KEYaround the post-add rebuild so the channel-add flow regression-gates the gateway-credential reuse path. Restores the key for the later phases.Type of Change
Verification
Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
New Features
Behavior Change
Tests