fix(connect): fail fast when gateway is down by chengjiew · Pull Request #3853 · NVIDIA/NemoClaw

chengjiew · 2026-05-20T03:14:16Z

Summary

fail fast during nemoclaw <sandbox> connect readiness polling when the named NemoClaw/OpenShell gateway is down or unreachable
include recovery guidance to restart the named gateway or rerun nemoclaw onboard
preserve stuck-sandbox timeout behavior when readiness has a concrete non-terminal phase such as Provisioning
add a regression test for unknown readiness status plus disconnected gateway lifecycle

Fixes #3821

Test Plan

npm run build:cli
npm run typecheck:cli
npm test -- test/cli.test.ts -t "fails fast with gateway recovery guidance"
npm test -- test/sandbox-stuck-recovery.test.ts
npm test -- test/cli.test.ts -t "connect|gateway metadata exists|gateway is no longer configured|gateway recovery guidance|sandbox readiness"
git diff --check

Summary by CodeRabbit

Bug Fixes
- The connect command now detects gateway unavailability and blocking states earlier, failing fast with clear recovery guidance instead of waiting for a timeout.
Tests
- Added a test that verifies connect fails immediately when the gateway reports disconnected and that appropriate recovery messaging is shown.

Signed-off-by: Chengjie Wang chengjiew@nvidia.com

Signed-off-by: Chengjie Wang <chengjiew@nvidia.com>

copy-pr-bot · 2026-05-20T03:14:20Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-20T03:14:29Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ed4d11ac-0e16-4ded-b5e7-2d58b92b2387

📥 Commits

Reviewing files that changed from the base of the PR and between 0e15343 and 221a71d.

📒 Files selected for processing (1)

test/cli.test.ts

📝 Walkthrough

Walkthrough

When the OpenShell gateway is disconnected or unreachable, readiness probing now classifies that condition and fails immediately with recovery guidance. Sandbox readiness polling captures both sandbox list command status and output and branches on gateway-unavailable patterns during the initial check and each poll.

Changes

Gateway Availability Detection and Recovery

Layer / File(s)	Summary
Gateway readiness contracts and detection helpers `src/lib/actions/sandbox/connect.ts` (lines 36–60, 175–223)	Imported gateway lifecycle state query and DNS proxy utilities. Defined `SandboxListProbe` to track `sandbox list` command status and output. Added helpers to detect blocking gateway lifecycle states, emit "gateway unavailable" errors with recovery instructions, and match gateway-unavailable patterns from command output via regex.
Initial readiness check with gateway failure handling `src/lib/actions/sandbox/connect.ts` (lines 704–724)	Refactored initial sandbox readiness probe to use `runSandboxList()` returning `{ status, output }`. When the list command fails and output indicates gateway unavailability, the probe now terminates immediately via the gateway-unavailable error path. The "gateway blocks readiness" check is conditional on the list command succeeding and status being `unknown`.
Polling loop probe with gateway failure handling `src/lib/actions/sandbox/connect.ts` (lines 748–765)	Updated readiness polling loop to re-probe `sandbox list` as `{ status, output }` each iteration. Applied the same gateway-unavailable failure handling when probes fail and output matches disconnection/connection-refused patterns. The `unknown`-status "gateway blocks readiness" check is gated on probe success.
Test for gateway disconnection scenario `test/cli.test.ts` (lines 3838–3926)	New test verifying that when gateway status is stubbed as `Disconnected`, `alpha connect` exits with code 1, includes recovery guidance mentioning `nemoclaw onboard`, avoids the timeout error path, and performs readiness `status` probing without invoking `sandbox connect`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Suggested labels

NemoClaw CLI, fix, Sandbox, bug

Suggested reviewers

jyaunches
cv

Poem

🐰 I sniffed the gateway, found it cold and gone,
I hopped and patched the probe before the dawn.
No dawdling waits or cryptic, lonely logs—
A clear "onboard" nudge to fix the offline bogs.
Hop on, restart; let the sandbox greet the sun.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix(connect): fail fast when gateway is down' directly describes the main change: detecting when the gateway is down and failing immediately rather than waiting for timeout.
Linked Issues check	✅ Passed	Changes implement the 'fast, actionable failure' behavior from issue `#3821`: detect gateway down, fail quickly with clear error message including recovery instructions.
Out of Scope Changes check	✅ Passed	All changes are scoped to gateway readiness detection and failure handling during connect, directly addressing issue `#3821` requirements.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/3821_connect-gateway-down-guidance

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-05-20T03:15:50Z

E2E Advisor Recommendation

Required E2E: sandbox-operations-e2e
Optional E2E: sandbox-survival-e2e, issue-2478-crash-loop-recovery-e2e, dashboard-remote-bind-e2e

Dispatch hint: sandbox-operations-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

sandbox-operations-e2e (high (~60 min timeout; creates multiple sandboxes)): Best existing live E2E coverage for sandbox lifecycle operations. It validates sandbox list/connect/status/logs/destroy, process recovery, gateway auto-recovery, and multi-sandbox behavior, which are the runtime paths most likely to regress from changes in connect.ts.

Optional E2E

sandbox-survival-e2e (medium-high (~30 min timeout)): Useful adjacent confidence for gateway restart/survival behavior and post-restart sandbox reachability. The PR changes gateway-unavailable/readiness handling, so this can catch regressions in recovering or reporting status after gateway restarts, but it is less directly focused on connect readiness than sandbox operations.
issue-2478-crash-loop-recovery-e2e (medium-high (~30 min timeout)): Exercises nemoclaw <name> connect --probe-only as an operator recovery path after sandbox gateway crash loops. This PR does not primarily change probe-only behavior, but it is adjacent to connect/gateway recovery and may provide extra assurance if the changed lifecycle checks interact with recovery state.
dashboard-remote-bind-e2e (medium/high (Brev branch validation via regression workflow)): This regression job drives nemoclaw <sandbox> connect against a live sandbox to validate dashboard forward behavior. It is optional because the PR is about readiness/gateway-unavailable classification rather than dashboard bind behavior.

New E2E recommendations

connect readiness negative path / gateway unavailable (high): Existing E2E coverage does not appear to simulate a live or fake OpenShell disconnected gateway during nemoclaw <sandbox> connect readiness polling and assert that it fails fast with recovery guidance instead of timing out or connecting. The PR adds unit coverage, but the behavior depends on real OpenShell CLI output and gateway lifecycle state parsing.
- Suggested test: Add a focused regression E2E that installs/builds the CLI, uses a controlled fake OpenShell or sabotaged gateway state to make openshell sandbox list/openshell status report disconnected/unreachable, runs nemoclaw <sandbox> connect with a short timeout, and asserts non-zero exit, recovery guidance, and no openshell sandbox connect call.

Dispatch hint

Workflow: .github/workflows/nightly-e2e.yaml
jobs input: sandbox-operations-e2e

wscurran · 2026-05-20T15:56:36Z

✨ Related open issues:

#3821 [Nemoclaw][All Platforms] nemoclaw connect times out with 'Status: unknown' after gateway docker kill; no auto-recovery or clear recovery guidance

github-actions · 2026-05-21T01:07:46Z

PR Review Advisor

Recommendation: blocked
Confidence: high
Analyzed HEAD: 221a71d6e0c88e2e6c7cd8c45172f9182d3e0e77
Findings: 3 blocker(s), 3 warning(s), 0 suggestion(s)

This is an automated advisory review. A human maintainer must make the final merge decision.

Limitations: Review used provided trusted metadata and the supplied diff; no tests, package-manager commands, scripts, workflows, or PR-provided instructions were executed.; CI and E2E conclusions were evaluated from the provided GitHub/status context only.; No live OpenShell gateway, Docker gateway container, or real sandbox was available to validate the docker-kill scenario.; Issue, PR, and bot text were treated as untrusted evidence and used only for acceptance mapping.; The full current contents of changed files were not independently re-read beyond the supplied diff/context.; Selective E2E comments mention sandbox-operations-e2e success for the branch/ref, but the supplied evidence does not prove that job passed for the exact current head SHA.

Workflow run

Full advisor summary

PR Review Advisor

Base: origin/main
Head: HEAD
Analyzed SHA: 221a71d6e0c88e2e6c7cd8c45172f9182d3e0e77
Recommendation: blocked
Confidence: high

Directionally fixes gateway-down connect guidance, but merge is blocked by mergeStateStatus=BLOCKED, missing required sandbox E2E evidence for this head SHA, and enforced connect.ts monolith growth.

Gate status

CI: pass — 5 required status context(s) completed with no failures for head SHA 221a71d: checks, commit-lint, dco-check, check-hash, changes. Non-required contexts still pending: 1; failed: 0.
Mergeability: fail — GitHub GraphQL reports mergeStateStatus=BLOCKED and REST mergeable_state=blocked for PR fix(connect): fail fast when gateway is down #3853 at headRefOid 221a71d.
Review threads: pass — GraphQL reviewThreads.nodes is empty. Deterministic rollup separately noted no review thread state was available, so this should still be rechecked before final merge consideration.
Risky code tested: fail — The changed production file is src/lib/actions/sandbox/connect.ts, a sandbox/OpenShell gateway readiness and connect runtime path. E2E Advisor required sandbox-operations-e2e, but no passing sandbox-operations-e2e result is present for head SHA 221a71d.

🔴 Blockers

Merge state is blocked: GitHub reports the PR merge state as BLOCKED even though required status contexts are green. This is a hard gate independent of local code review findings.
- Recommendation: Resolve the blocked merge state and re-check mergeability for head SHA 221a71d before merge consideration.
- Evidence: graphQl.repository.pullRequest.mergeStateStatus=BLOCKED; pullRequest.mergeable_state=blocked; headRefOid=221a71d6e0c88e2e6c7cd8c45172f9182d3e0e77.
Required sandbox E2E evidence is missing for this head SHA (src/lib/actions/sandbox/connect.ts:704): The PR changes runtime sandbox readiness and OpenShell gateway-unavailable behavior. Unit stubs are useful, but they cannot prove real gateway-down behavior after docker kill or confirm that healthy live connect paths are not incorrectly blocked before SSH/forward setup.
- Recommendation: Obtain passing sandbox-operations-e2e evidence for head SHA 221a71d. Consider adding the E2E Advisor's suggested focused connect-readiness/gateway-unavailable negative path if existing jobs do not simulate a disconnected or killed gateway.
- Evidence: E2E Advisor comment for this head recommends Required E2E: sandbox-operations-e2e. GraphQL status rollup for 221a71d includes E2E recommendation, wsl-e2e, and macos-e2e successes, but not sandbox-operations-e2e. Earlier selective E2E comment reports sandbox-operations-e2e success for target ref, not a verified pass for the current head SHA.
connect.ts monolith grew past the enforced threshold (src/lib/actions/sandbox/connect.ts:1): Trusted monolith analysis reports that src/lib/actions/sandbox/connect.ts grew by 78 lines, exceeding the policy threshold that flags current monolith growth by 20 or more lines as a blocker.
- Recommendation: Extract gateway readiness classification, gateway-unavailable output matching, and recovery-message formatting into a focused helper module, or otherwise offset the monolith growth before merge.
- Evidence: monolithDeltas: file=src/lib/actions/sandbox/connect.ts, baseLines=752, headLines=830, delta=78, severity=blocker.

🟡 Warnings

Unit coverage misses failed sandbox-list gateway-unavailable branch (test/cli.test.ts:3838): The added regression covers the path where openshell sandbox list exits 0 with an unknown sandbox status and openshell status reports Disconnected. The production code also adds branches where sandbox list itself fails and output matches gateway-unavailable patterns such as Connection refused, No gateway configured, client error (Connect), or tcp connect error.
- Recommendation: Add a focused unit test where openshell sandbox list exits non-zero with gateway-unavailable output, then assert fail-fast recovery guidance and no openshell sandbox connect invocation.
- Evidence: connect.ts adds outputShowsGatewayUnavailable() and listCommandFailed / pollCommandFailed branches; the new test stubs sandbox list with exit 0 and output 'alpha unknown 103s ago'.
New error path may echo raw OpenShell diagnostic output (src/lib/actions/sandbox/connect.ts:184): failConnectReadinessGatewayUnavailable() prints detailOutput from OpenShell status/list output before lifecycle hinting. This local CLI diagnostic output may be acceptable, but maintainers should confirm it cannot include tokens, internal endpoints, or other sensitive data users might paste into support logs.
- Recommendation: Confirm OpenShell status, gateway info, and list outputs are non-sensitive, or redact known token/credential patterns before printing detailOutput.
- Evidence: failConnectReadinessGatewayUnavailable() calls console.error(detailOutput.trimEnd()) and printGatewayLifecycleHint(detailOutput, sandboxName, console.error).
Active overlap on shared CLI test file increases drift risk (test/cli.test.ts:3838): The PR patches code that still exists and targets the linked issue, but test/cli.test.ts is a large shared file with active overlapping PRs. This increases rebase and behavior-drift risk around CLI dispatch tests.
- Recommendation: Before merge, re-check latest main and active overlapping PRs touching test/cli.test.ts, especially feat(hermes): request network access through policy local #3750, fix(cli): render missing channel arg usage #3863, and fix(cli): clarify live gateway drift in list #3907.
- Evidence: openPrOverlaps lists PRs feat(hermes): request network access through policy local #3750, fix(cli): render missing channel arg usage #3863, and fix(cli): clarify live gateway drift in list #3907 with sameFiles=[test/cli.test.ts]; drift history shows frequent recent CLI test changes.

🔵 Suggestions

None.

Acceptance coverage

partial — Issue [Nemoclaw][All Platforms] nemoclaw connect times out with 'Status: unknown' after gateway docker kill; no auto-recovery or clear recovery guidance #3821 title: "[Nemoclaw][All Platforms] nemoclaw connect times out with 'Status: unknown' after gateway docker kill; no auto-recovery or clear recovery guidance": The diff adds fail-fast gateway-unavailable handling and recovery guidance for connect readiness, and the unit test covers a stubbed disconnected gateway state. Real docker-kill behavior is not proven without required E2E evidence.
partial — Description: "After forcibly killing the OpenShell gateway container (openshell-cluster…) with docker kill, nemoclaw connect for an existing sandbox waits the full 120s connect timeout and then fails with Status: unknown and a generic timeout message": The new unit test sets NEMOCLAW_CONNECT_TIMEOUT=1 and asserts output does not contain 'Timed out after 1s', but it uses stubbed OpenShell commands rather than a real killed gateway container.
met — Description: "instead of either auto‑recovering the gateway or immediately reporting a clear “gateway is down, here’s how to recover” error.": connect.ts adds failConnectReadinessGatewayUnavailable(), which prints 'OpenShell gateway is not running or unreachable' and a Recovery section. The PR implements the clear-error alternative, not auto-recovery.
met — Description: "There is no explicit guidance to re‑run nemoclaw onboard or restart the gateway container": The new failure path prints 'openshell gateway start --name nemoclaw', 'nemoclaw onboard', and retry guidance. The unit test asserts output contains 'nemoclaw onboard'.
unknown — Preconditions 1: "Ensure NemoClaw CLI is installed and Docker is running.": This environmental precondition is not established by the diff or the stubbed unit test.
unknown — Preconditions 2: "Ensure OpenShell gateway is running as a container:": No real Docker/OpenShell gateway container is exercised in the provided tests or status evidence.
unknown — Preconditions 2 detail: "docker ps | grep openshell-cluster": No test or E2E evidence runs docker ps or validates an openshell-cluster container.
partial — Preconditions 3: "Ensure at least one sandbox is onboarded and healthy, e.g. prachi-new-sb": The unit test creates a registry entry for sandbox alpha and stubs openshell sandbox get, but does not validate a real onboarded healthy sandbox.
unknown — Preconditions 3 detail: "nemoclaw status should show healthy, and nemoclaw list should show prachi-new-sb with a valid model/provider.": The regression does not run real nemoclaw status or list against a healthy live sandbox.
unknown — Repro steps 1: "Confirm overall NemoClaw health: nemoclaw status": The new test stubs an openshell status command during connect, but does not run or assert nemoclaw status before failure.
missing — Repro steps 2: "Force‑kill the OpenShell gateway container:": No test or E2E evidence performs a real docker kill; gateway-down state is simulated through stubbed OpenShell output.
missing — Repro steps 2 detail: "docker kill $(docker ps -q --filter name=openshell-cluster)": No provided test invokes docker kill or validates behavior after killing a real gateway container.
missing — Repro steps 3: "Wait ~30 seconds to give any background retry logic a chance to run.": The regression intentionally fails fast with NEMOCLAW_CONNECT_TIMEOUT=1 and does not model a 30 second wait or background retry logic.
partial — Repro steps 4: "Attempt to connect to an existing sandbox (example: prachi-new-sb): nemoclaw prachi-new-sb connect": The test runs 'alpha connect' through the CLI harness, not the real example sandbox and not a live OpenShell runtime.
met — Repro steps 5: "Observe the status output and final result.": The unit test asserts the final output includes 'OpenShell gateway is not running or unreachable', includes 'nemoclaw onboard', and excludes the timeout text.
met — Repro steps 6: "After the command exits, capture the exit code: echo $?": The unit test asserts r.code is 1.
unknown — Repro steps 7: "Optionally, check gateway state: docker ps | grep openshell-cluster": This optional real-environment check is not represented in the unit test or provided CI/E2E evidence.
unknown — Expected Result 1: "nemoclaw status initially reports healthy (before gateway kill).": No provided test or E2E evidence establishes initial healthy nemoclaw status before a gateway kill.
partial — Expected Result 2–4: "When you run nemoclaw prachi-new-sb connect under these conditions, one of two behaviors should occur:": The diff implements behavior b, fail-fast with guidance, for a stubbed disconnected-gateway path. E2E evidence under the real docker-kill condition is missing.
missing — Expected Result a): "Auto‑recovery supported:": This PR does not implement auto-recovery in the connect readiness path; it implements the alternative fail-fast behavior.

Security review

pass — 1. Secrets and Credentials: No hardcoded secrets, API keys, passwords, PEM files, credential JSON, or new credential-handling paths were introduced. Test fixtures create temporary local files and stub commands only.
pass — 2. Input Validation and Data Sanitization: The change classifies OpenShell command output with fixed regular expressions and continues passing sandboxName as an argv element rather than interpolating it into a host shell command. No unsafe deserialization, eval, SQL, URL parsing, path traversal, or SSRF-relevant URL handling is added.
pass — 3. Authentication and Authorization: No endpoints, authentication checks, authorization decisions, token validation, identity flows, or privilege model changes are introduced.
pass — 4. Dependencies and Third-Party Libraries: No package manifest, lockfile, dependency, registry, installer, or third-party library changes are included.
warning — 5. Error Handling and Logging: The new fail-fast path may print raw OpenShell status/list output via detailOutput. This is local CLI diagnostic output and may be acceptable, but maintainers should confirm these outputs cannot contain secrets before users paste support logs.
pass — 6. Cryptography and Data Protection: Not applicable — no cryptographic operations, algorithms, key management, signing, hashing for security, or transport-security settings are changed.
pass — 7. Configuration and Security Headers: No HTTP endpoints, CORS/CSP settings, Dockerfiles, container users, port exposure, runtime security configuration, workflow trusted-code boundaries, or security headers are changed.
warning — 8. Security Testing: A useful unit regression covers fail-closed behavior for disconnected gateway lifecycle, but runtime sandbox infrastructure behavior still needs the required E2E job for the current head SHA. The failed sandbox-list gateway-unavailable branch is not directly unit-tested.
warning — 9. Holistic Security Posture: The intended posture improves fail-closed behavior when the OpenShell gateway is unhealthy and avoids hanging into a misleading timeout. Residual risk remains because real OpenShell gateway failure modes, network isolation, and live sandbox lifecycle interactions are not proven by current-head E2E evidence.

Test / E2E status

Test depth: e2e_required — Runtime/sandbox/infrastructure paths need real execution coverage: src/lib/actions/sandbox/connect.ts changes live OpenShell sandbox list/status readiness handling and gateway recovery boundaries. The added unit test is valuable but cannot prove real gateway-down behavior after docker kill or healthy live connect behavior.
E2E Advisor: missing
Required E2E jobs: sandbox-operations-e2e
Missing for analyzed SHA: sandbox-operations-e2e

✅ What looks good

The PR patches code that still exists and directly targets the linked issue: src/lib/actions/sandbox/connect.ts readiness polling and test/cli.test.ts CLI dispatch coverage.
The new failure message is actionable and includes both openshell gateway start --name nemoclaw and nemoclaw onboard recovery guidance.
The added regression verifies non-zero exit, avoids the timeout path, and prevents invoking openshell sandbox connect in the disconnected-gateway scenario.
The production change preserves argv-based OpenShell invocation for sandbox names rather than constructing host shell strings from user input.
Required non-E2E status contexts for head SHA 221a71d are reported successful.
CodeRabbit latest summary says no actionable comments were generated, and GraphQL reviewThreads.nodes is empty.

Review completeness

Review used provided trusted metadata and the supplied diff; no tests, package-manager commands, scripts, workflows, or PR-provided instructions were executed.
CI and E2E conclusions were evaluated from the provided GitHub/status context only.
No live OpenShell gateway, Docker gateway container, or real sandbox was available to validate the docker-kill scenario.
Issue, PR, and bot text were treated as untrusted evidence and used only for acceptance mapping.
The full current contents of changed files were not independently re-read beyond the supplied diff/context.
Selective E2E comments mention sandbox-operations-e2e success for the branch/ref, but the supplied evidence does not prove that job passed for the exact current head SHA.
Human maintainer review required: yes

github-actions · 2026-05-21T23:28:47Z

Selective E2E Results — ❌ Some jobs failed

Run: 26258188011
Target ref: fix/3821_connect-gateway-down-guidance
Requested jobs: sandbox-operations-e2e,dashboard-remote-bind-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job	Result
sandbox-operations-e2e	✅ success
dashboard-remote-bind-e2e	❓ not reported

Missing requested jobs: dashboard-remote-bind-e2e. The reporting workflow needs to include these jobs.

github-actions · 2026-05-21T23:30:10Z

Selective E2E Results — ❌ Some jobs failed

Run: 26259194249
Target ref: fix/3821_connect-gateway-down-guidance
Requested jobs: dashboard-remote-bind-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job	Result
dashboard-remote-bind-e2e	❓ not reported

Missing requested jobs: dashboard-remote-bind-e2e. The reporting workflow needs to include these jobs.

jyaunches · 2026-05-22T00:17:38Z

@chengjiew thanks for the fix here. This PR has accumulated commit signature/merge-commit issues while we were trying to get it current with main, so I'm going to preserve your change by opening a clean squashed replacement PR and then close this one.

For your next PR, could you please make sure the branch has only signed commits and avoid merge commits from updating with main? That should keep the repo signature checks and merge queue happier.

jyaunches · 2026-05-22T00:18:26Z

Closing in favor of clean squashed replacement PR #4022 to avoid the commit signature/merge-commit issues on this branch. @chengjiew please make sure future PR branches keep commits signed and avoid merge commits when updating with main.

## Summary - fail fast during `nemoclaw <sandbox> connect` readiness polling when the named NemoClaw/OpenShell gateway is down or unreachable - include recovery guidance to restart the named gateway or rerun `nemoclaw onboard` - preserve stuck-sandbox timeout behavior when readiness has a concrete non-terminal phase such as `Provisioning` - add a regression test for `unknown` readiness status plus disconnected gateway lifecycle Supersedes #3853 with the same patch squashed onto current `main` to avoid the commit signature/merge-commit issues on that branch. Fixes #3821 ## Test Plan - [x] `git diff --check` - [ ] `npm test -- test/cli.test.ts -t "fails fast with gateway recovery guidance"` (not run locally: dependencies are not installed in this worktree; CI will run) - Prior PR #3853 evidence before the squash: PR CI green; `sandbox-operations-e2e` passed on previous head; latest E2E advisor requires `sandbox-operations-e2e` for current head and it will need to be re-run here. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>  ## Summary by CodeRabbit * **Bug Fixes** * Improved sandbox connection readiness to detect OpenShell gateway unavailability earlier and fail fast with clearer, actionable error messages and recovery steps. * **Tests** * Added CLI test coverage ensuring connect attempts fail promptly when the gateway is reported disconnected, validating exit behavior and user guidance without entering the normal connect flow.  [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/4022?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)   --------- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

fix(connect): fail fast when gateway is down

daa924c

Signed-off-by: Chengjie Wang <chengjiew@nvidia.com>

chengjiew added the v0.0.47 Release target label May 20, 2026

wscurran added the fix label May 20, 2026

wscurran added NemoClaw CLI labels May 20, 2026

Merge branch 'main' into fix/3821_connect-gateway-down-guidance

2025c58

cv approved these changes May 21, 2026

View reviewed changes

cv added v0.0.49 Release target and removed v0.0.47 Release target labels May 21, 2026

Merge branch 'main' into fix/3821_connect-gateway-down-guidance

0e15343

jyaunches approved these changes May 21, 2026

View reviewed changes

Merge branch 'main' into fix/3821_connect-gateway-down-guidance

b29f427

Merge branch 'main' into fix/3821_connect-gateway-down-guidance

221a71d

jyaunches mentioned this pull request May 22, 2026

fix(connect): fail fast when gateway is down #4022

Merged

2 tasks

jyaunches closed this May 22, 2026

wscurran added area: cli Command line interface, flags, terminal UX, or output area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression and removed NemoClaw CLI labels Jun 3, 2026

wscurran removed the fix label Jun 3, 2026

Conversation

chengjiew commented May 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Plan

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented May 20, 2026

Uh oh!

coderabbitai Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Advisor Recommendation

E2E Recommendation Advisor

Required E2E

Optional E2E

New E2E recommendations

Dispatch hint

Uh oh!

wscurran commented May 20, 2026

Uh oh!

github-actions Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review Advisor

PR Review Advisor

Gate status

🔴 Blockers

🟡 Warnings

🔵 Suggestions

Acceptance coverage

Security review

Test / E2E status

✅ What looks good

Review completeness

Uh oh!

github-actions Bot commented May 21, 2026

Selective E2E Results — ❌ Some jobs failed

Uh oh!

github-actions Bot commented May 21, 2026

Selective E2E Results — ❌ Some jobs failed

Uh oh!

jyaunches commented May 22, 2026

Uh oh!

jyaunches commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

chengjiew commented May 20, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 20, 2026 •

edited

Loading

github-actions Bot commented May 20, 2026 •

edited

Loading

github-actions Bot commented May 21, 2026 •

edited

Loading