Skip to content

fix(rebuild): reuse gateway-stored credential when host env is empty#4745

Merged
cv merged 7 commits into
mainfrom
fix/3895-rebuild-credential-gateway-preflight
Jun 4, 2026
Merged

fix(rebuild): reuse gateway-stored credential when host env is empty#4745
cv merged 7 commits into
mainfrom
fix/3895-rebuild-credential-gateway-preflight

Conversation

@laitingsheng

@laitingsheng laitingsheng commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Summary

Rebuild preflight aborted with provider credential not found when the inference provider was already registered in the OpenShell gateway but the credential env var was missing from the host shell (e.g. after nemohermes channels add from a fresh terminal). #3030 fixed this only for the Hermes Provider; non-Hermes providers like nvidia-prod still demanded the host env. Reuse the gateway-held secret when present and drop --credential KEY from provider update calls when the host env is empty so OpenShell's CLI does not reject the call before reaching the gateway.

Related Issue

Fixes #3895

Changes

  • src/lib/actions/sandbox/rebuild.ts: skip the env-only preflight when providerExistsInGateway(rebuildProvider) returns true. The recreate step reuses the gateway-stored credential.
  • src/lib/onboard.ts: gate the non-interactive build and remote-provider env checks in setupNim on providerExistsInGateway(provider) so a gateway-registered provider does not require the host env in recreate flows.
  • src/lib/onboard/providers.ts: upsertProvider drops --credential KEY from provider update calls when the host env does not carry a value. OpenShell's CLI rejects --credential KEY with an empty env (parse_credential_pairs), so the flag had to go for the no-rotation update path to succeed. provider create still requires the credential.
  • test/rebuild-credential-preflight.test.ts: regression tests for nvidia-prod with the gateway-registered credential and host env unset, plus the negative gate where both gateway and env are empty. Existing negative tests pinned to providerRegistered: false to keep their original intent.
  • src/lib/onboard/providers.test.ts: new tests for the update arg builder with and without a staged env value, and for the create path still requiring the credential.
  • test/e2e/test-channels-add-remove.sh: unset NVIDIA_API_KEY around the post-add rebuild so the channel-add flow regression-gates the gateway-credential reuse path. Restores the key for the later phases.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • `npx prek run --all-files` passes
  • `npm test` passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • `npm run docs` builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai tinsonl@nvidia.com

Summary by CodeRabbit

  • New Features

    • Rebuilds and onboarding can proceed when a provider is already registered in the gateway, avoiding unnecessary aborts for missing credential environment variables.
  • Behavior Change

    • Provider create/update commands now omit empty credentials to prevent failing CLI calls; credential checks will consult the gateway and skip env-missing failures when appropriate.
  • Tests

    • Added unit, integration, and end-to-end tests to validate gateway-aware credential handling and rebuild credential reuse.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7d9189be-f37d-43b8-bf00-bf46f1b4b621

📥 Commits

Reviewing files that changed from the base of the PR and between 1b177de and be52892.

📒 Files selected for processing (1)
  • test/e2e/test-hermes-discord-e2e.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/test-hermes-discord-e2e.sh

📝 Walkthrough

Walkthrough

Rebuild and onboarding credential checks now consult the OpenShell gateway: missing env vars no longer abort when the provider is registered in the gateway. Provider CLI arg building omits empty credentials on update. Unit and E2E tests cover gateway-stored credential reuse and missing-provider failures.

Changes

Gateway-aware credential validation across rebuild and onboarding

Layer / File(s) Summary
Provider arg building & upsertProvider credential handling
src/lib/onboard/providers.ts, src/lib/onboard/providers.test.ts
buildProviderArgs accepts opts.includeCredential; upsertProvider inspects env and omits --credential on updates when no value is present. Tests assert create/update behaviors and credential flag presence/absence.
Onboard non-interactive gateway checks
src/lib/onboard.ts, test/onboard-selection.test.ts
Non-interactive onboarding (build/NVIDIA and remote-provider branches) now calls providerExistsInGateway before aborting on missing credential env; the helper was refactored to an arrow wrapper. Test adds a fake openshell to ensure missing-key path remains testable.
Rebuild sandbox gateway-aware preflight
src/lib/actions/sandbox/rebuild.ts
rebuildSandbox now imports and uses providerExistsInGateway; when hydrateCredentialEnv yields no env value, the code checks the gateway and skips the env-only abort if the provider exists there, otherwise it preserves the previous abort-and-leave-sandbox behavior.
Unit & E2E tests for credential reuse and failure
test/rebuild-credential-preflight.test.ts, test/e2e/test-channels-add-remove.sh, test/e2e/test-hermes-discord-e2e.sh
Rebuild preflight tests updated to cover gateway-registered and gateway-missing scenarios; e2e scripts back up/unset NVIDIA_API_KEY around channels add + rebuild and assert rebuild did not log "provider credential not found".

Sequence Diagram

sequenceDiagram
  participant User
  participant Rebuild as RebuildPreflight
  participant Env as hydrateCredentialEnv
  participant Gateway as OpenShellGateway
  participant CLI as RecreateCLI

  User->>Rebuild: trigger rebuild (non-interactive)
  Rebuild->>Env: check credential env (e.g., NVIDIA_API_KEY)
  Env-->>Rebuild: missing
  Rebuild->>Gateway: providerExistsInGateway(rebuildProvider)
  alt provider registered in gateway
    Gateway-->>Rebuild: true
    Rebuild->>CLI: proceed without env credential (use gateway)
    CLI-->>Rebuild: recreate proceeds
  else provider not registered
    Gateway-->>Rebuild: false
    Rebuild->>User: abort with missing-credential message
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

fix, Sandbox, area: inference

Suggested reviewers

  • cv
  • prekshivyas

Poem

🐰 I hopped through logs and gateway queues,

Secrets stored where envs don't snooze.
Rebuild now looks where the gateway keeps,
No more lost keys in the shell that sleeps.
Hooray — credentials rest, the sandbox leaps!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main fix: allowing rebuild to reuse gateway-stored credentials when the host environment variable is empty, which directly addresses issue #3895.
Linked Issues check ✅ Passed All code changes directly implement the required behavior from #3895: consulting the OpenShell gateway when determining credential availability and allowing rebuild to proceed when the provider is registered in the gateway.
Out of Scope Changes check ✅ Passed All changes are scope-aligned: modifications to rebuild preflight, non-interactive provider selection, credential argument handling, and test coverage are all necessary to implement the gateway credential reuse feature from #3895.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/3895-rebuild-credential-gateway-preflight

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 0 needs attention, 4 worth checking, 0 nice ideas
Since last review: 1 prior item resolved, 1 still applies, 0 new items found

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • Source-of-truth review needed: Gateway-stored credential reuse when host credential env is empty: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `rebuildSandbox()` skips the env-only preflight if `providerExistsInGateway(rebuildProvider, runOpenshell)` returns true.
  • Source-of-truth review needed: Omitting `--credential` on provider update when host env is empty: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `includeCredential = action === "create" || credentialValueAvailable` controls the credential flag, but `buildProviderArgs()` still appends endpoint `--config` arguments.
  • Constrain endpoint config changes when reusing gateway-stored credentials (src/lib/onboard/providers.ts:324): When an existing provider is updated and the host env does not contain the credential, `upsertProvider()` now omits `--credential` but still allows `buildProviderArgs()` to append endpoint config such as `OPENAI_BASE_URL` or `ANTHROPIC_BASE_URL`. That is appropriate for the intended no-rotation rebuild of fixed providers, but for custom or compatible endpoints it can retain the gateway-stored secret while changing where future inference traffic sends that secret.
    • Recommendation: Only use the omit-credential update path when provider endpoint config is unchanged or is a fixed built-in provider endpoint. For custom/OpenAI-compatible and Anthropic-compatible providers, require a fresh host credential before changing base URL, or compare the gateway/current config and reject mutations without credential rotation. Add a regression test for an existing custom provider with empty host env and a changed `NEMOCLAW_ENDPOINT_URL`.
    • Evidence: `credentialValueAvailable` controls `includeCredential`, but `buildProviderArgs()` still appends `--config OPENAI_BASE_URL=${baseUrl}` / `ANTHROPIC_BASE_URL=${baseUrl}` even when `includeCredential` is false. The new tests assert `--credential` is omitted while `OPENAI_BASE_URL=https://integrate.api.nvidia.com/v1\` remains present, but do not cover changed custom endpoints.
  • Document or strengthen the gateway-provider existence fallback (src/lib/actions/sandbox/rebuild.ts:363): The rebuild preflight now treats `openshell provider get <name>` success as enough to skip the missing host credential error. This addresses the reported issue, but `providerExistsInGateway()` is gateway-level only: it does not prove the provider is attached to the target sandbox, still has a usable credential, or has config matching the target sandbox. That weakens the pre-destroy preflight guarantee for corrupted or drifted local registry/session state.
    • Recommendation: Either document the precise fallback contract and removal condition, or strengthen the check when OpenShell exposes a credential-present or sandbox-scoped provider query. Keep the negative tests that prove rebuild still aborts when neither env nor gateway provider exists.
    • Evidence: `providerExistsInGateway()` is documented as querying only the gateway-level registry and `rebuildSandbox()` uses it to skip the env-only preflight when `hydrateCredentialEnv()` returns empty.

🌱 Nice ideas

  • None.
Since last review details

Current findings:

  • Source-of-truth review needed: Gateway-stored credential reuse when host credential env is empty: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `rebuildSandbox()` skips the env-only preflight if `providerExistsInGateway(rebuildProvider, runOpenshell)` returns true.
  • Source-of-truth review needed: Omitting `--credential` on provider update when host env is empty: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `includeCredential = action === "create" || credentialValueAvailable` controls the credential flag, but `buildProviderArgs()` still appends endpoint `--config` arguments.
  • Constrain endpoint config changes when reusing gateway-stored credentials (src/lib/onboard/providers.ts:324): When an existing provider is updated and the host env does not contain the credential, `upsertProvider()` now omits `--credential` but still allows `buildProviderArgs()` to append endpoint config such as `OPENAI_BASE_URL` or `ANTHROPIC_BASE_URL`. That is appropriate for the intended no-rotation rebuild of fixed providers, but for custom or compatible endpoints it can retain the gateway-stored secret while changing where future inference traffic sends that secret.
    • Recommendation: Only use the omit-credential update path when provider endpoint config is unchanged or is a fixed built-in provider endpoint. For custom/OpenAI-compatible and Anthropic-compatible providers, require a fresh host credential before changing base URL, or compare the gateway/current config and reject mutations without credential rotation. Add a regression test for an existing custom provider with empty host env and a changed `NEMOCLAW_ENDPOINT_URL`.
    • Evidence: `credentialValueAvailable` controls `includeCredential`, but `buildProviderArgs()` still appends `--config OPENAI_BASE_URL=${baseUrl}` / `ANTHROPIC_BASE_URL=${baseUrl}` even when `includeCredential` is false. The new tests assert `--credential` is omitted while `OPENAI_BASE_URL=https://integrate.api.nvidia.com/v1\` remains present, but do not cover changed custom endpoints.
  • Document or strengthen the gateway-provider existence fallback (src/lib/actions/sandbox/rebuild.ts:363): The rebuild preflight now treats `openshell provider get <name>` success as enough to skip the missing host credential error. This addresses the reported issue, but `providerExistsInGateway()` is gateway-level only: it does not prove the provider is attached to the target sandbox, still has a usable credential, or has config matching the target sandbox. That weakens the pre-destroy preflight guarantee for corrupted or drifted local registry/session state.
    • Recommendation: Either document the precise fallback contract and removal condition, or strengthen the check when OpenShell exposes a credential-present or sandbox-scoped provider query. Keep the negative tests that prove rebuild still aborts when neither env nor gateway provider exists.
    • Evidence: `providerExistsInGateway()` is documented as querying only the gateway-level registry and `rebuildSandbox()` uses it to skip the env-only preflight when `hydrateCredentialEnv()` returns empty.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@laitingsheng laitingsheng added integration: hermes Hermes integration behavior bug-fix PR fixes a bug or regression labels Jun 4, 2026
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: channels-add-remove-e2e, hermes-discord-e2e
Optional E2E: rebuild-openclaw-e2e, messaging-providers-e2e, credential-migration-e2e

Dispatch hint: channels-add-remove-e2e,hermes-discord-e2e

Auto-dispatched E2E: channels-add-remove-e2e, hermes-discord-e2e via nightly-e2e.yaml at be52892ff0b96fcd1e87f0dfdc7c558b5d4564a2nightly run

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • channels-add-remove-e2e (~75 minutes; requires NVIDIA_API_KEY and messaging E2E secrets/fake-token path): Required because the PR changes the exact OpenClaw channels add/remove + rebuild credential flow and updates this E2E to assert rebuild reuses the gateway-stored nvidia-prod credential when NVIDIA_API_KEY is absent from the host environment.
  • hermes-discord-e2e (~60 minutes; requires NVIDIA_API_KEY and Discord E2E/fake-token path): Required because the PR changes gateway-aware credential handling for rebuild/onboard and adds Hermes Discord coverage for rebuilding with NVIDIA_API_KEY unset, a real assistant flow involving Hermes config, Discord provider placeholders, policy, and sandbox recreation.

Optional E2E

  • rebuild-openclaw-e2e (~60 minutes): Useful adjacent confidence for the generic OpenClaw rebuild path, backup/restore, credential chain after rebuild, and policy preservation, though it does not specifically gate the new host-env-absent gateway credential reuse scenario added here.
  • messaging-providers-e2e (~75 minutes): Useful adjacent coverage for provider creation/update, messaging credential placeholder rewrite, and channel policy/provider attachment across Telegram/Discord/Slack/WeChat/WhatsApp. Optional because the PR-specific add/remove rebuild regression is covered by channels-add-remove-e2e.
  • credential-migration-e2e (~30 minutes): Optional confidence for credential storage/migration and gateway credential behavior because rebuild now treats the gateway as the source of truth when host env credentials are absent.

New E2E recommendations

  • non-NVIDIA remote provider gateway credential reuse (medium): The implementation applies providerExistsInGateway fallback to remote providers beyond nvidia-prod, but the changed E2Es primarily validate NVIDIA-backed OpenClaw/Hermes paths. A hermetic or fake compatible-endpoint E2E that registers a non-NVIDIA provider, unsets its host credential env, then runs non-interactive onboard/rebuild would close this gap.
    • Suggested test: Add an E2E scenario for a gateway-registered OpenAI-compatible/custom remote provider rebuild with the provider-specific host credential unset.

Dispatch hint

  • Workflow: .github/workflows/nightly-e2e.yaml
  • jobs input: channels-add-remove-e2e,hermes-discord-e2e

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: ubuntu-repo-cloud-openclaw, ubuntu-repo-cloud-openclaw-telegram, ubuntu-repo-cloud-hermes-discord
Optional scenario E2E: ubuntu-repo-cloud-openclaw-discord, ubuntu-repo-cloud-openclaw-token-rotation

Dispatch required scenario E2E:

  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw
  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-telegram
  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-hermes-discord

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • ubuntu-repo-cloud-openclaw: Core Ubuntu repo/cloud OpenClaw path exercises NVIDIA provider onboarding, gateway credential registration, credential storage checks, inference route setup, and the rebuild surface changed in src/lib/actions/sandbox/rebuild.ts and src/lib/onboard.ts.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw
  • ubuntu-repo-cloud-openclaw-telegram: Provider upsert and gateway-stored credential reuse changes affect messaging provider attachment paths; this is the closest scenario coverage for the Telegram channels/provider surface touched by the PR.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-telegram
  • ubuntu-repo-cloud-hermes-discord: The PR changes non-interactive NVIDIA credential fallback and rebuild behavior that is explicitly exercised by the Hermes + Discord provider path; this scenario targets the Hermes/messaging regression surface.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-hermes-discord

Optional scenario E2E

  • ubuntu-repo-cloud-openclaw-discord: Adjacent Discord messaging coverage on OpenClaw can help separate generic messaging provider-update behavior from Hermes-specific behavior, but the Hermes Discord scenario is the primary Discord target.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-discord
  • ubuntu-repo-cloud-openclaw-token-rotation: Provider update/credential handling changes are adjacent to token rotation behavior; useful if extra coverage is desired for credential update isolation.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-token-rotation

Relevant changed files

  • src/lib/actions/sandbox/rebuild.ts
  • src/lib/onboard.ts
  • src/lib/onboard/providers.ts

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 4435-4438: Remove the duplicated explanatory comment blocks about
skipping the env requirement for already-registered providers and condense them
into a single concise reference comment (e.g. "// See `#3895`: skip env
requirement when provider already registered in gateway") near the related logic
in the onboard flow; update the comment adjacent to the upsertProvider /
provider update handling (the code that mentions `upsertProvider`, `provider
update`, gateway rebuild after `channels add`) and delete the repeated blocks
elsewhere (also remove the duplicate at the other location referenced) so
behavior is unchanged but duplicate inline commentary is collapsed to one short
reference.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4f103dda-5332-469b-8a93-80e9f8963a0f

📥 Commits

Reviewing files that changed from the base of the PR and between 17734b1 and 5cb702d.

📒 Files selected for processing (6)
  • src/lib/actions/sandbox/rebuild.ts
  • src/lib/onboard.ts
  • src/lib/onboard/providers.test.ts
  • src/lib/onboard/providers.ts
  • test/e2e/test-channels-add-remove.sh
  • test/rebuild-credential-preflight.test.ts

Comment thread src/lib/onboard.ts Outdated
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26927683622
Target ref: 5cb702d71829cc760e9184e14a88f99071f4614e
Workflow ref: main
Requested jobs: channels-add-remove-e2e,rebuild-openclaw-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
channels-add-remove-e2e ✅ success
rebuild-openclaw-e2e ✅ success

…urces, stub openshell in vitest

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26931170802
Target ref: 1a0f851d2bc60655937e3c046596b053688a82ed
Workflow ref: main
Requested jobs: channels-add-remove-e2e,rebuild-openclaw-e2e,cloud-onboard-e2e
Summary: 3 passed, 0 failed, 0 skipped

Job Result
channels-add-remove-e2e ✅ success
cloud-onboard-e2e ✅ success
rebuild-openclaw-e2e ✅ success

…IA_API_KEY is unset

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@laitingsheng laitingsheng added the area: e2e End-to-end tests, nightly failures, or validation infrastructure label Jun 4, 2026
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 26932186450
Target ref: 8132fd630a46fbec802f4d3de4883c71c260fc62
Workflow ref: main
Requested jobs: channels-add-remove-e2e,hermes-discord-e2e
Summary: 1 passed, 1 failed, 0 skipped

Job Result
channels-add-remove-e2e ✅ success
hermes-discord-e2e ❌ failure

Failed jobs: hermes-discord-e2e. Check run artifacts for logs.

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 26932586923
Target ref: fix/3895-rebuild-credential-gateway-preflight
Requested jobs: hermes-discord-e2e
Summary: 0 passed, 1 failed, 0 skipped

Job Result
hermes-discord-e2e ❌ failure

Failed jobs: hermes-discord-e2e. Check run artifacts for logs.

…e error

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/e2e/test-hermes-discord-e2e.sh (1)

599-605: ⚡ Quick win

Dump the rebuild log in the grep-failure branch too.

The new diagnostics only fire when rebuild exits non-zero. If the command exits 0 but still logs provider credential not found, this branch fails without printing the captured log, which makes the regression harder to triage in CI.

🛠️ Suggested change
 if [ "$rebuild_rc" -ne 0 ]; then
   fail "Hermes rebuild failed with NVIDIA_API_KEY unset (rc=${rebuild_rc})"
   echo "---- begin rebuild log (${HERMES_REBUILD_LOG}) ----"
   cat "$HERMES_REBUILD_LOG" 2>/dev/null || true
   echo "---- end rebuild log ----"
 elif grep -q "provider credential not found" "$HERMES_REBUILD_LOG"; then
   fail "REGRESSION — rebuild aborted on missing NVIDIA_API_KEY despite gateway-registered credential"
+  echo "---- begin rebuild log (${HERMES_REBUILD_LOG}) ----"
+  cat "$HERMES_REBUILD_LOG" 2>/dev/null || true
+  echo "---- end rebuild log ----"
 else
   pass "Hermes rebuild reused gateway-stored credential without NVIDIA_API_KEY"
 fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/test-hermes-discord-e2e.sh` around lines 599 - 605, The grep-failure
branch (elif grep -q "provider credential not found" "$HERMES_REBUILD_LOG";
then) fails without printing the captured rebuild log; update that branch to
dump the same diagnostics as the non-zero exit branch by emitting the "----
begin rebuild log (${HERMES_REBUILD_LOG}) ----" header, cat
"$HERMES_REBUILD_LOG" 2>/dev/null || true, and the "---- end rebuild log ----"
footer before failing so the rebuild output is available for CI triage.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/e2e/test-hermes-discord-e2e.sh`:
- Around line 599-605: The grep-failure branch (elif grep -q "provider
credential not found" "$HERMES_REBUILD_LOG"; then) fails without printing the
captured rebuild log; update that branch to dump the same diagnostics as the
non-zero exit branch by emitting the "---- begin rebuild log
(${HERMES_REBUILD_LOG}) ----" header, cat "$HERMES_REBUILD_LOG" 2>/dev/null ||
true, and the "---- end rebuild log ----" footer before failing so the rebuild
output is available for CI triage.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f2e006c0-33ff-4836-a5c7-7e7edfaa492d

📥 Commits

Reviewing files that changed from the base of the PR and between 8132fd6 and 1b177de.

📒 Files selected for processing (1)
  • test/e2e/test-hermes-discord-e2e.sh

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 26932932909
Target ref: fix/3895-rebuild-credential-gateway-preflight
Requested jobs: hermes-discord-e2e
Summary: 0 passed, 1 failed, 0 skipped

Job Result
hermes-discord-e2e ❌ failure

Failed jobs: hermes-discord-e2e. Check run artifacts for logs.

… surfaces

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 26933035536
Target ref: 1b177dec68823cef5324d9929470bc0edd1b112a
Workflow ref: main
Requested jobs: channels-add-remove-e2e,hermes-discord-e2e
Summary: 1 passed, 1 failed, 0 skipped

Job Result
channels-add-remove-e2e ✅ success
hermes-discord-e2e ❌ failure

Failed jobs: hermes-discord-e2e. Check run artifacts for logs.

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 26933621737
Target ref: fix/3895-rebuild-credential-gateway-preflight
Requested jobs: hermes-discord-e2e
Summary: 0 passed, 1 failed, 0 skipped

Job Result
hermes-discord-e2e ❌ failure

Failed jobs: hermes-discord-e2e. Check run artifacts for logs.

…t-owned scratch

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26935129127
Target ref: fix/3895-rebuild-credential-gateway-preflight
Requested jobs: hermes-discord-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
hermes-discord-e2e ✅ success

…cause is fixed

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26935176986
Target ref: 448e06334ff3d5b80998587ee3661f85cf811afe
Workflow ref: main
Requested jobs: channels-add-remove-e2e,hermes-discord-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
channels-add-remove-e2e ⚠️ cancelled
hermes-discord-e2e ✅ success

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26936023663
Target ref: fix/3895-rebuild-credential-gateway-preflight
Requested jobs: hermes-discord-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
hermes-discord-e2e ✅ success

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26935779178
Target ref: be52892ff0b96fcd1e87f0dfdc7c558b5d4564a2
Workflow ref: main
Requested jobs: channels-add-remove-e2e,hermes-discord-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
channels-add-remove-e2e ✅ success
hermes-discord-e2e ✅ success

@laitingsheng laitingsheng added the v0.0.59 Release target label Jun 4, 2026
@prekshivyas prekshivyas self-assigned this Jun 4, 2026

@prekshivyas prekshivyas left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — approving. Traced the fix end to end:

  • The gate providerExistsInGateway returns result.status === 0, so it's true only when openshell provider get <name> genuinely succeeds. Any ambiguous case — gateway down, provider absent, command error — returns false, so the env-credential check is not skipped. The fix fails safe: it can only bypass the host-env requirement when the provider is verifiably registered, never on a gateway outage. Good design.
  • setupNim's reordering is correct: validate the key when present, otherwise only fail when the provider isn't in the gateway — preserving the invalid-key path and adding the gateway-aware skip.
  • buildProviderArgs/upsertProvider dropping --credential KEY on the update path when the env is empty is the right call (OpenShell rejects an empty --credential), and includeCredential = action === "create" || credentialValueAvailable keeps create correct. The "update without --credential preserves the gateway secret" assumption is exercised for real in the two e2e scripts with NVIDIA_API_KEY unset, and CI is green — so it's empirically validated, not just asserted.

Test coverage is strong: arg-builder unit cases (omit/keep/staged), rebuild preflight regressions, and e2e gating in both test-channels-add-remove.sh and test-hermes-discord-e2e.sh with careful env restore.

Non-blocking, take or leave: CodeRabbit's two nits are reasonable — collapse the duplicated explanatory comment in onboard.ts, and dump the rebuild log in the grep-failure branch of test-hermes-discord-e2e.sh (line ~603) so a CI regression is triageable. Neither blocks merge.

@cv cv merged commit f972efe into main Jun 4, 2026
92 checks passed
@cv cv deleted the fix/3895-rebuild-credential-gateway-preflight branch June 4, 2026 17:45
cv pushed a commit that referenced this pull request Jun 5, 2026
## Summary
- Add the v0.0.59 release notes from the GitHub announcement discussion.
- Refresh local inference and credential-storage guidance for the
current release behavior.
- Regenerate the user skills from the updated Fern docs.
- Tighten release-prep and docs review guidance for generated skills, PR
labels, and shared `$$nemoclaw` command placeholders.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" --glob '*.{md,mdx}'`
- `git diff --check`
- `npm run docs` (rerun outside sandbox after sandbox-only `tsx` IPC
permission failure)
- `npm run typecheck:cli`
- Pre-commit hooks during commit passed, including markdownlint,
docs-to-skills verification, gitleaks, commitlint, and skills YAML
tests.

## Source Summary
- #3679, #4437, #4681, #4766, #4772, #4775, #4786 ->
`docs/about/release-notes.mdx`, `docs/reference/commands.mdx`,
`docs/reference/troubleshooting.mdx`: Summarize OpenClaw 2026.5.27
compatibility, runtime path pinning, plugin registry recovery, live
gateway reconciliation, and clearer host-alias/startup diagnostics.
- #4332, #4402, #4769, #4776, #4779 -> `docs/about/release-notes.mdx`,
`docs/inference/inference-options.mdx`,
`docs/inference/use-local-inference.mdx`,
`docs/inference/switch-inference-providers.mdx`: Document the release
inference changes covering Local NIM waits, Hermes Anthropic routing,
Nemotron 3 Ultra, the current Ollama starter fallback, and Spark
managed-vLLM context length.
- #4628, #4652, #4733, #4745 -> `docs/about/release-notes.mdx`,
`docs/security/credential-storage.mdx`,
`docs/manage-sandboxes/messaging-channels.mdx`,
`docs/reference/troubleshooting.mdx`: Capture permission healing,
gateway-stored credential reuse, cross-sandbox messaging credential
conflict checks, and CDI preflight diagnostics.
- #4728, #4737, #4743, #4744, #4782 -> `.agents/skills/nemoclaw-user-*`:
Regenerate the user skill references from the updated source docs.
- Follow-up maintenance ->
`.agents/skills/nemoclaw-contributor-update-docs/SKILL.md`,
`.coderabbit.yaml`: Add release-prep area labels for docs and skills
PRs, and teach docs review guidance that `$$nemoclaw` is the correct
shared command placeholder for examples that work across agent aliases.

Note: the `documentation` label was not present in the repository, so
this PR is labeled with `v0.0.59` only.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
  * Updated default model for local Ollama inference setup to qwen3.5:9b
  * Added Nemotron 3 Ultra 550B as an NVIDIA Endpoints model option
* Clarified credential storage and reuse behavior for post-deployment
(day-two) operations
* Added v0.0.59 release notes covering OpenClaw compatibility, inference
options, Hermes messaging sync, and troubleshooting
* Clarified CLI selection guidance and updated OpenClaw version example
in status output
* Revised release-prep instructions and docs review guidance for CLI
alias usage
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
cv added a commit that referenced this pull request Jun 5, 2026
…ng onboard (#4598)

## Summary

Two preflight cleanup paths assumed the OpenShell gateway and dashboard
forward were process-wide singletons. When a second NemoClaw onboard ran
with `NEMOCLAW_GATEWAY_PORT=N`, the preflight retired the existing
per-port gateway as "legacy" and killed the first sandbox's dashboard
SSH forward — leaving the first sandbox unreachable. This PR scopes both
cleanups so the second instance starts its own gateway alongside the
first instead of replacing it.

## Related Issue

Fixes #4422 · Refs #3053

#4422 is the specific SIGKILL-on-second-onboard symptom: a second
onboard with `NEMOCLAW_GATEWAY_PORT=N` destroyed the previous instance's
per-port gateway and dashboard forward. This PR fixes both preflight
cleanups so concurrent onboards no longer step on each other.

#3053 is the broader ask — full multi-instance segregation of registry,
credentials, snapshots, messaging, and lifecycle behind a configurable
`NEMOCLAW_INSTANCE` identity. That work is out of scope here and tracked
separately; this PR removes the destructive cross-talk that previously
prevented two NemoClaw-managed sandboxes from coexisting at all, but
does not yet introduce the instance identity primitive.

## Changes

- `src/lib/onboard/machine/handlers/gateway.ts`: skip
`retireLegacyGatewayForDockerDriverUpgrade` when `gatewayReuseState ===
"foreign-active"`. A foreign-active gateway is another sandbox's
per-port `nemoclaw-<port>` — not legacy state to retire. Normalises to
`"missing"` so the current onboard proceeds with its own per-port
gateway alongside.
- `src/lib/onboard.ts`: dashboard-port preflight no longer kills an
"orphaned SSH port-forward" when `openshell forward list` shows the port
is held by another live sandbox. The runtime allocator picks a different
dashboard port for this sandbox at create time instead.
- `src/lib/onboard/machine/handlers/gateway.test.ts`: unit test for the
foreign-active no-retire branch.
- `test/e2e/test-concurrent-gateway-ports.sh`: new E2E that onboards two
sandboxes (default + `NEMOCLAW_GATEWAY_PORT=18080`), asserts both reach
`Ready`, distinct gateway ports (8080 + 18080), distinct dashboard ports
(18789 + 18790), and that destroying one leaves the other intact. Each
sandbox is queried via its own gateway with `openshell sandbox list -g
<gateway-name>` so the global active-gateway pointer does not flip the
read.
- `.github/workflows/nightly-e2e.yaml`: registers
`concurrent-gateway-ports-e2e` in the dispatchable-jobs catalog, `needs`
lists, and the advisor comment block. Also documents existing
`openclaw-skill-cli-e2e` and `channels-add-remove-e2e` in the catalog so
the PR-review E2E advisor surfaces them when relevant changes land —
catches up leftover automation from PRs #4766 (#4709 OpenClaw skill CLI)
and #4745 (#3895 channels add/remove) where the tests shipped but were
never advertised to the advisor.

## Type of Change

- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification

- [x] `npx prek run --all-files` passes
- [x] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [ ] `npm run docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Manage concurrent gateway ports safely across multiple sandboxes on
the same host.

* **Bug Fixes**
* Improved cleanup for orphaned SSH port-forwards that block dashboard
ports.

* **Tests**
  * Added E2E test validating concurrent gateway-port scenarios.
* Added/updated unit tests for gateway-state and orphaned-forward
handling.

* **Chores**
* Added nightly E2E workflow job for concurrent gateway port testing and
integrated it into reporting.

* **Documentation**
  * Expanded nightly E2E job documentation for related tests.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: e2e End-to-end tests, nightly failures, or validation infrastructure bug-fix PR fixes a bug or regression integration: hermes Hermes integration behavior v0.0.59 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[macOS][Sandbox] nemohermes rebuild preflight fails with "provider credential not found" despite credential registered in gateway

4 participants