Skip to content

fix(connect): fail fast when gateway is down#3853

Closed
chengjiew wants to merge 5 commits into
mainfrom
fix/3821_connect-gateway-down-guidance
Closed

fix(connect): fail fast when gateway is down#3853
chengjiew wants to merge 5 commits into
mainfrom
fix/3821_connect-gateway-down-guidance

Conversation

@chengjiew

@chengjiew chengjiew commented May 20, 2026

Copy link
Copy Markdown
Contributor

Summary

  • fail fast during nemoclaw <sandbox> connect readiness polling when the named NemoClaw/OpenShell gateway is down or unreachable
  • include recovery guidance to restart the named gateway or rerun nemoclaw onboard
  • preserve stuck-sandbox timeout behavior when readiness has a concrete non-terminal phase such as Provisioning
  • add a regression test for unknown readiness status plus disconnected gateway lifecycle

Fixes #3821

Test Plan

  • npm run build:cli
  • npm run typecheck:cli
  • npm test -- test/cli.test.ts -t "fails fast with gateway recovery guidance"
  • npm test -- test/sandbox-stuck-recovery.test.ts
  • npm test -- test/cli.test.ts -t "connect|gateway metadata exists|gateway is no longer configured|gateway recovery guidance|sandbox readiness"
  • git diff --check

Summary by CodeRabbit

  • Bug Fixes

    • The connect command now detects gateway unavailability and blocking states earlier, failing fast with clear recovery guidance instead of waiting for a timeout.
  • Tests

    • Added a test that verifies connect fails immediately when the gateway reports disconnected and that appropriate recovery messaging is shown.

Review Change Stack

Signed-off-by: Chengjie Wang chengjiew@nvidia.com

Signed-off-by: Chengjie Wang <chengjiew@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 20, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented May 20, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ed4d11ac-0e16-4ded-b5e7-2d58b92b2387

📥 Commits

Reviewing files that changed from the base of the PR and between 0e15343 and 221a71d.

📒 Files selected for processing (1)
  • test/cli.test.ts

📝 Walkthrough

Walkthrough

When the OpenShell gateway is disconnected or unreachable, readiness probing now classifies that condition and fails immediately with recovery guidance. Sandbox readiness polling captures both sandbox list command status and output and branches on gateway-unavailable patterns during the initial check and each poll.

Changes

Gateway Availability Detection and Recovery

Layer / File(s) Summary
Gateway readiness contracts and detection helpers
src/lib/actions/sandbox/connect.ts (lines 36–60, 175–223)
Imported gateway lifecycle state query and DNS proxy utilities. Defined SandboxListProbe to track sandbox list command status and output. Added helpers to detect blocking gateway lifecycle states, emit "gateway unavailable" errors with recovery instructions, and match gateway-unavailable patterns from command output via regex.
Initial readiness check with gateway failure handling
src/lib/actions/sandbox/connect.ts (lines 704–724)
Refactored initial sandbox readiness probe to use runSandboxList() returning { status, output }. When the list command fails and output indicates gateway unavailability, the probe now terminates immediately via the gateway-unavailable error path. The "gateway blocks readiness" check is conditional on the list command succeeding and status being unknown.
Polling loop probe with gateway failure handling
src/lib/actions/sandbox/connect.ts (lines 748–765)
Updated readiness polling loop to re-probe sandbox list as { status, output } each iteration. Applied the same gateway-unavailable failure handling when probes fail and output matches disconnection/connection-refused patterns. The unknown-status "gateway blocks readiness" check is gated on probe success.
Test for gateway disconnection scenario
test/cli.test.ts (lines 3838–3926)
New test verifying that when gateway status is stubbed as Disconnected, alpha connect exits with code 1, includes recovery guidance mentioning nemoclaw onboard, avoids the timeout error path, and performs readiness status probing without invoking sandbox connect.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Suggested labels

NemoClaw CLI, fix, Sandbox, bug

Suggested reviewers

  • jyaunches
  • cv

Poem

🐰 I sniffed the gateway, found it cold and gone,
I hopped and patched the probe before the dawn.
No dawdling waits or cryptic, lonely logs—
A clear "onboard" nudge to fix the offline bogs.
Hop on, restart; let the sandbox greet the sun.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(connect): fail fast when gateway is down' directly describes the main change: detecting when the gateway is down and failing immediately rather than waiting for timeout.
Linked Issues check ✅ Passed Changes implement the 'fast, actionable failure' behavior from issue #3821: detect gateway down, fail quickly with clear error message including recovery instructions.
Out of Scope Changes check ✅ Passed All changes are scoped to gateway readiness detection and failure handling during connect, directly addressing issue #3821 requirements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/3821_connect-gateway-down-guidance

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented May 20, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: sandbox-operations-e2e
Optional E2E: sandbox-survival-e2e, issue-2478-crash-loop-recovery-e2e, dashboard-remote-bind-e2e

Dispatch hint: sandbox-operations-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • sandbox-operations-e2e (high (~60 min timeout; creates multiple sandboxes)): Best existing live E2E coverage for sandbox lifecycle operations. It validates sandbox list/connect/status/logs/destroy, process recovery, gateway auto-recovery, and multi-sandbox behavior, which are the runtime paths most likely to regress from changes in connect.ts.

Optional E2E

  • sandbox-survival-e2e (medium-high (~30 min timeout)): Useful adjacent confidence for gateway restart/survival behavior and post-restart sandbox reachability. The PR changes gateway-unavailable/readiness handling, so this can catch regressions in recovering or reporting status after gateway restarts, but it is less directly focused on connect readiness than sandbox operations.
  • issue-2478-crash-loop-recovery-e2e (medium-high (~30 min timeout)): Exercises nemoclaw <name> connect --probe-only as an operator recovery path after sandbox gateway crash loops. This PR does not primarily change probe-only behavior, but it is adjacent to connect/gateway recovery and may provide extra assurance if the changed lifecycle checks interact with recovery state.
  • dashboard-remote-bind-e2e (medium/high (Brev branch validation via regression workflow)): This regression job drives nemoclaw <sandbox> connect against a live sandbox to validate dashboard forward behavior. It is optional because the PR is about readiness/gateway-unavailable classification rather than dashboard bind behavior.

New E2E recommendations

  • connect readiness negative path / gateway unavailable (high): Existing E2E coverage does not appear to simulate a live or fake OpenShell disconnected gateway during nemoclaw <sandbox> connect readiness polling and assert that it fails fast with recovery guidance instead of timing out or connecting. The PR adds unit coverage, but the behavior depends on real OpenShell CLI output and gateway lifecycle state parsing.
    • Suggested test: Add a focused regression E2E that installs/builds the CLI, uses a controlled fake OpenShell or sabotaged gateway state to make openshell sandbox list/openshell status report disconnected/unreachable, runs nemoclaw <sandbox> connect with a short timeout, and asserts non-zero exit, recovery guidance, and no openshell sandbox connect call.

Dispatch hint

  • Workflow: .github/workflows/nightly-e2e.yaml
  • jobs input: sandbox-operations-e2e

@chengjiew chengjiew added the v0.0.47 Release target label May 20, 2026
@wscurran wscurran added the fix label May 20, 2026
@wscurran

Copy link
Copy Markdown
Contributor

@github-actions

github-actions Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Recommendation: blocked
Confidence: high
Analyzed HEAD: 221a71d6e0c88e2e6c7cd8c45172f9182d3e0e77
Findings: 3 blocker(s), 3 warning(s), 0 suggestion(s)

This is an automated advisory review. A human maintainer must make the final merge decision.

Limitations: Review used provided trusted metadata and the supplied diff; no tests, package-manager commands, scripts, workflows, or PR-provided instructions were executed.; CI and E2E conclusions were evaluated from the provided GitHub/status context only.; No live OpenShell gateway, Docker gateway container, or real sandbox was available to validate the docker-kill scenario.; Issue, PR, and bot text were treated as untrusted evidence and used only for acceptance mapping.; The full current contents of changed files were not independently re-read beyond the supplied diff/context.; Selective E2E comments mention sandbox-operations-e2e success for the branch/ref, but the supplied evidence does not prove that job passed for the exact current head SHA.

Workflow run

Full advisor summary

PR Review Advisor

Base: origin/main
Head: HEAD
Analyzed SHA: 221a71d6e0c88e2e6c7cd8c45172f9182d3e0e77
Recommendation: blocked
Confidence: high

Directionally fixes gateway-down connect guidance, but merge is blocked by mergeStateStatus=BLOCKED, missing required sandbox E2E evidence for this head SHA, and enforced connect.ts monolith growth.

Gate status

  • CI: pass — 5 required status context(s) completed with no failures for head SHA 221a71d: checks, commit-lint, dco-check, check-hash, changes. Non-required contexts still pending: 1; failed: 0.
  • Mergeability: fail — GitHub GraphQL reports mergeStateStatus=BLOCKED and REST mergeable_state=blocked for PR fix(connect): fail fast when gateway is down #3853 at headRefOid 221a71d.
  • Review threads: pass — GraphQL reviewThreads.nodes is empty. Deterministic rollup separately noted no review thread state was available, so this should still be rechecked before final merge consideration.
  • Risky code tested: fail — The changed production file is src/lib/actions/sandbox/connect.ts, a sandbox/OpenShell gateway readiness and connect runtime path. E2E Advisor required sandbox-operations-e2e, but no passing sandbox-operations-e2e result is present for head SHA 221a71d.

🔴 Blockers

  • Merge state is blocked: GitHub reports the PR merge state as BLOCKED even though required status contexts are green. This is a hard gate independent of local code review findings.
    • Recommendation: Resolve the blocked merge state and re-check mergeability for head SHA 221a71d before merge consideration.
    • Evidence: graphQl.repository.pullRequest.mergeStateStatus=BLOCKED; pullRequest.mergeable_state=blocked; headRefOid=221a71d6e0c88e2e6c7cd8c45172f9182d3e0e77.
  • Required sandbox E2E evidence is missing for this head SHA (src/lib/actions/sandbox/connect.ts:704): The PR changes runtime sandbox readiness and OpenShell gateway-unavailable behavior. Unit stubs are useful, but they cannot prove real gateway-down behavior after docker kill or confirm that healthy live connect paths are not incorrectly blocked before SSH/forward setup.
    • Recommendation: Obtain passing sandbox-operations-e2e evidence for head SHA 221a71d. Consider adding the E2E Advisor's suggested focused connect-readiness/gateway-unavailable negative path if existing jobs do not simulate a disconnected or killed gateway.
    • Evidence: E2E Advisor comment for this head recommends Required E2E: sandbox-operations-e2e. GraphQL status rollup for 221a71d includes E2E recommendation, wsl-e2e, and macos-e2e successes, but not sandbox-operations-e2e. Earlier selective E2E comment reports sandbox-operations-e2e success for target ref, not a verified pass for the current head SHA.
  • connect.ts monolith grew past the enforced threshold (src/lib/actions/sandbox/connect.ts:1): Trusted monolith analysis reports that src/lib/actions/sandbox/connect.ts grew by 78 lines, exceeding the policy threshold that flags current monolith growth by 20 or more lines as a blocker.
    • Recommendation: Extract gateway readiness classification, gateway-unavailable output matching, and recovery-message formatting into a focused helper module, or otherwise offset the monolith growth before merge.
    • Evidence: monolithDeltas: file=src/lib/actions/sandbox/connect.ts, baseLines=752, headLines=830, delta=78, severity=blocker.

🟡 Warnings

  • Unit coverage misses failed sandbox-list gateway-unavailable branch (test/cli.test.ts:3838): The added regression covers the path where openshell sandbox list exits 0 with an unknown sandbox status and openshell status reports Disconnected. The production code also adds branches where sandbox list itself fails and output matches gateway-unavailable patterns such as Connection refused, No gateway configured, client error (Connect), or tcp connect error.
    • Recommendation: Add a focused unit test where openshell sandbox list exits non-zero with gateway-unavailable output, then assert fail-fast recovery guidance and no openshell sandbox connect invocation.
    • Evidence: connect.ts adds outputShowsGatewayUnavailable() and listCommandFailed / pollCommandFailed branches; the new test stubs sandbox list with exit 0 and output 'alpha unknown 103s ago'.
  • New error path may echo raw OpenShell diagnostic output (src/lib/actions/sandbox/connect.ts:184): failConnectReadinessGatewayUnavailable() prints detailOutput from OpenShell status/list output before lifecycle hinting. This local CLI diagnostic output may be acceptable, but maintainers should confirm it cannot include tokens, internal endpoints, or other sensitive data users might paste into support logs.
    • Recommendation: Confirm OpenShell status, gateway info, and list outputs are non-sensitive, or redact known token/credential patterns before printing detailOutput.
    • Evidence: failConnectReadinessGatewayUnavailable() calls console.error(detailOutput.trimEnd()) and printGatewayLifecycleHint(detailOutput, sandboxName, console.error).
  • Active overlap on shared CLI test file increases drift risk (test/cli.test.ts:3838): The PR patches code that still exists and targets the linked issue, but test/cli.test.ts is a large shared file with active overlapping PRs. This increases rebase and behavior-drift risk around CLI dispatch tests.

🔵 Suggestions

  • None.

Acceptance coverage

  • partial — Issue [Nemoclaw][All Platforms]  nemoclaw connect times out with 'Status: unknown' after gateway docker kill; no auto-recovery or clear recovery guidance #3821 title: "[Nemoclaw][All Platforms]  nemoclaw connect times out with 'Status: unknown' after gateway docker kill; no auto-recovery or clear recovery guidance": The diff adds fail-fast gateway-unavailable handling and recovery guidance for connect readiness, and the unit test covers a stubbed disconnected gateway state. Real docker-kill behavior is not proven without required E2E evidence.
  • partial — Description: "After forcibly killing the OpenShell gateway container (openshell-cluster…) with docker kill, nemoclaw connect for an existing sandbox waits the full 120s connect timeout and then fails with Status: unknown and a generic timeout message": The new unit test sets NEMOCLAW_CONNECT_TIMEOUT=1 and asserts output does not contain 'Timed out after 1s', but it uses stubbed OpenShell commands rather than a real killed gateway container.
  • met — Description: "instead of either auto‑recovering the gateway or immediately reporting a clear “gateway is down, here’s how to recover” error.": connect.ts adds failConnectReadinessGatewayUnavailable(), which prints 'OpenShell gateway is not running or unreachable' and a Recovery section. The PR implements the clear-error alternative, not auto-recovery.
  • met — Description: "There is no explicit guidance to re‑run nemoclaw onboard or restart the gateway container": The new failure path prints 'openshell gateway start --name nemoclaw', 'nemoclaw onboard', and retry guidance. The unit test asserts output contains 'nemoclaw onboard'.
  • unknown — Preconditions 1: "Ensure NemoClaw CLI is installed and Docker is running.": This environmental precondition is not established by the diff or the stubbed unit test.
  • unknown — Preconditions 2: "Ensure OpenShell gateway is running as a container:": No real Docker/OpenShell gateway container is exercised in the provided tests or status evidence.
  • unknown — Preconditions 2 detail: "docker ps | grep openshell-cluster": No test or E2E evidence runs docker ps or validates an openshell-cluster container.
  • partial — Preconditions 3: "Ensure at least one sandbox is onboarded and healthy, e.g. prachi-new-sb": The unit test creates a registry entry for sandbox alpha and stubs openshell sandbox get, but does not validate a real onboarded healthy sandbox.
  • unknown — Preconditions 3 detail: "nemoclaw status should show healthy, and nemoclaw list should show prachi-new-sb with a valid model/provider.": The regression does not run real nemoclaw status or list against a healthy live sandbox.
  • unknown — Repro steps 1: "Confirm overall NemoClaw health: nemoclaw status": The new test stubs an openshell status command during connect, but does not run or assert nemoclaw status before failure.
  • missing — Repro steps 2: "Force‑kill the OpenShell gateway container:": No test or E2E evidence performs a real docker kill; gateway-down state is simulated through stubbed OpenShell output.
  • missing — Repro steps 2 detail: "docker kill $(docker ps -q --filter name=openshell-cluster)": No provided test invokes docker kill or validates behavior after killing a real gateway container.
  • missing — Repro steps 3: "Wait ~30 seconds to give any background retry logic a chance to run.": The regression intentionally fails fast with NEMOCLAW_CONNECT_TIMEOUT=1 and does not model a 30 second wait or background retry logic.
  • partial — Repro steps 4: "Attempt to connect to an existing sandbox (example: prachi-new-sb): nemoclaw prachi-new-sb connect": The test runs 'alpha connect' through the CLI harness, not the real example sandbox and not a live OpenShell runtime.
  • met — Repro steps 5: "Observe the status output and final result.": The unit test asserts the final output includes 'OpenShell gateway is not running or unreachable', includes 'nemoclaw onboard', and excludes the timeout text.
  • met — Repro steps 6: "After the command exits, capture the exit code: echo $?": The unit test asserts r.code is 1.
  • unknown — Repro steps 7: "Optionally, check gateway state: docker ps | grep openshell-cluster": This optional real-environment check is not represented in the unit test or provided CI/E2E evidence.
  • unknown — Expected Result 1: "nemoclaw status initially reports healthy (before gateway kill).": No provided test or E2E evidence establishes initial healthy nemoclaw status before a gateway kill.
  • partial — Expected Result 2–4: "When you run nemoclaw prachi-new-sb connect under these conditions, one of two behaviors should occur:": The diff implements behavior b, fail-fast with guidance, for a stubbed disconnected-gateway path. E2E evidence under the real docker-kill condition is missing.
  • missing — Expected Result a): "Auto‑recovery supported:": This PR does not implement auto-recovery in the connect readiness path; it implements the alternative fail-fast behavior.

Security review

  • pass — 1. Secrets and Credentials: No hardcoded secrets, API keys, passwords, PEM files, credential JSON, or new credential-handling paths were introduced. Test fixtures create temporary local files and stub commands only.
  • pass — 2. Input Validation and Data Sanitization: The change classifies OpenShell command output with fixed regular expressions and continues passing sandboxName as an argv element rather than interpolating it into a host shell command. No unsafe deserialization, eval, SQL, URL parsing, path traversal, or SSRF-relevant URL handling is added.
  • pass — 3. Authentication and Authorization: No endpoints, authentication checks, authorization decisions, token validation, identity flows, or privilege model changes are introduced.
  • pass — 4. Dependencies and Third-Party Libraries: No package manifest, lockfile, dependency, registry, installer, or third-party library changes are included.
  • warning — 5. Error Handling and Logging: The new fail-fast path may print raw OpenShell status/list output via detailOutput. This is local CLI diagnostic output and may be acceptable, but maintainers should confirm these outputs cannot contain secrets before users paste support logs.
  • pass — 6. Cryptography and Data Protection: Not applicable — no cryptographic operations, algorithms, key management, signing, hashing for security, or transport-security settings are changed.
  • pass — 7. Configuration and Security Headers: No HTTP endpoints, CORS/CSP settings, Dockerfiles, container users, port exposure, runtime security configuration, workflow trusted-code boundaries, or security headers are changed.
  • warning — 8. Security Testing: A useful unit regression covers fail-closed behavior for disconnected gateway lifecycle, but runtime sandbox infrastructure behavior still needs the required E2E job for the current head SHA. The failed sandbox-list gateway-unavailable branch is not directly unit-tested.
  • warning — 9. Holistic Security Posture: The intended posture improves fail-closed behavior when the OpenShell gateway is unhealthy and avoids hanging into a misleading timeout. Residual risk remains because real OpenShell gateway failure modes, network isolation, and live sandbox lifecycle interactions are not proven by current-head E2E evidence.

Test / E2E status

  • Test depth: e2e_required — Runtime/sandbox/infrastructure paths need real execution coverage: src/lib/actions/sandbox/connect.ts changes live OpenShell sandbox list/status readiness handling and gateway recovery boundaries. The added unit test is valuable but cannot prove real gateway-down behavior after docker kill or healthy live connect behavior.
  • E2E Advisor: missing
  • Required E2E jobs: sandbox-operations-e2e
  • Missing for analyzed SHA: sandbox-operations-e2e

✅ What looks good

  • The PR patches code that still exists and directly targets the linked issue: src/lib/actions/sandbox/connect.ts readiness polling and test/cli.test.ts CLI dispatch coverage.
  • The new failure message is actionable and includes both openshell gateway start --name nemoclaw and nemoclaw onboard recovery guidance.
  • The added regression verifies non-zero exit, avoids the timeout path, and prevents invoking openshell sandbox connect in the disconnected-gateway scenario.
  • The production change preserves argv-based OpenShell invocation for sandbox names rather than constructing host shell strings from user input.
  • Required non-E2E status contexts for head SHA 221a71d are reported successful.
  • CodeRabbit latest summary says no actionable comments were generated, and GraphQL reviewThreads.nodes is empty.

Review completeness

  • Review used provided trusted metadata and the supplied diff; no tests, package-manager commands, scripts, workflows, or PR-provided instructions were executed.
  • CI and E2E conclusions were evaluated from the provided GitHub/status context only.
  • No live OpenShell gateway, Docker gateway container, or real sandbox was available to validate the docker-kill scenario.
  • Issue, PR, and bot text were treated as untrusted evidence and used only for acceptance mapping.
  • The full current contents of changed files were not independently re-read beyond the supplied diff/context.
  • Selective E2E comments mention sandbox-operations-e2e success for the branch/ref, but the supplied evidence does not prove that job passed for the exact current head SHA.
  • Human maintainer review required: yes

@cv cv added v0.0.49 Release target and removed v0.0.47 Release target labels May 21, 2026
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 26258188011
Target ref: fix/3821_connect-gateway-down-guidance
Requested jobs: sandbox-operations-e2e,dashboard-remote-bind-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
sandbox-operations-e2e ✅ success
dashboard-remote-bind-e2e ❓ not reported

Missing requested jobs: dashboard-remote-bind-e2e. The reporting workflow needs to include these jobs.

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 26259194249
Target ref: fix/3821_connect-gateway-down-guidance
Requested jobs: dashboard-remote-bind-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
dashboard-remote-bind-e2e ❓ not reported

Missing requested jobs: dashboard-remote-bind-e2e. The reporting workflow needs to include these jobs.

@jyaunches

Copy link
Copy Markdown
Contributor

@chengjiew thanks for the fix here. This PR has accumulated commit signature/merge-commit issues while we were trying to get it current with main, so I'm going to preserve your change by opening a clean squashed replacement PR and then close this one.

For your next PR, could you please make sure the branch has only signed commits and avoid merge commits from updating with main? That should keep the repo signature checks and merge queue happier.

@jyaunches

Copy link
Copy Markdown
Contributor

Closing in favor of clean squashed replacement PR #4022 to avoid the commit signature/merge-commit issues on this branch. @chengjiew please make sure future PR branches keep commits signed and avoid merge commits when updating with main.

@jyaunches jyaunches closed this May 22, 2026
jyaunches added a commit that referenced this pull request May 22, 2026
## Summary
- fail fast during `nemoclaw <sandbox> connect` readiness polling when
the named NemoClaw/OpenShell gateway is down or unreachable
- include recovery guidance to restart the named gateway or rerun
`nemoclaw onboard`
- preserve stuck-sandbox timeout behavior when readiness has a concrete
non-terminal phase such as `Provisioning`
- add a regression test for `unknown` readiness status plus disconnected
gateway lifecycle

Supersedes #3853 with the same patch squashed onto current `main` to
avoid the commit signature/merge-commit issues on that branch.

Fixes #3821

## Test Plan
- [x] `git diff --check`
- [ ] `npm test -- test/cli.test.ts -t "fails fast with gateway recovery
guidance"` (not run locally: dependencies are not installed in this
worktree; CI will run)
- Prior PR #3853 evidence before the squash: PR CI green;
`sandbox-operations-e2e` passed on previous head; latest E2E advisor
requires `sandbox-operations-e2e` for current head and it will need to
be re-run here.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Improved sandbox connection readiness to detect OpenShell gateway
unavailability earlier and fail fast with clearer, actionable error
messages and recovery steps.

* **Tests**
* Added CLI test coverage ensuring connect attempts fail promptly when
the gateway is reported disconnected, validating exit behavior and user
guidance without entering the normal connect flow.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/4022?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

<!-- review_stack_entry_end -->
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
@wscurran wscurran added area: cli Command line interface, flags, terminal UX, or output area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression and removed NemoClaw CLI labels Jun 3, 2026
@wscurran wscurran removed the fix label Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: cli Command line interface, flags, terminal UX, or output area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression v0.0.49 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Nemoclaw][All Platforms]  nemoclaw connect times out with 'Status: unknown' after gateway docker kill; no auto-recovery or clear recovery guidance

4 participants