Skip to content

fix(onboard): recover from SSH 255 when sandbox was already created#1598

Merged
ericksoa merged 1 commit into
mainfrom
fix/sandbox-create-ssh255-recovery
Apr 8, 2026
Merged

fix(onboard): recover from SSH 255 when sandbox was already created#1598
ericksoa merged 1 commit into
mainfrom
fix/sandbox-create-ssh255-recovery

Conversation

@ericksoa

@ericksoa ericksoa commented Apr 8, 2026

Copy link
Copy Markdown
Contributor

Summary

  • When openshell sandbox create exits with SSH 255 after printing "Created sandbox:", NemoClaw now treats this as recoverable instead of hard-exiting
  • Added a final readyCheck in streamSandboxCreate on non-zero close to catch the race where the sandbox is already Ready when SSH dies
  • In onboard.js, sandbox_create_incomplete failures now fall through to the existing 60s ready-wait loop instead of calling process.exit()

Root cause

Regression from #1516 (create-stream extraction) combined with #1081/#1527 (messaging provider migration forcing sandbox recreation). The create stream returns non-zero after "Created sandbox:" and NemoClaw exits before checking if the sandbox reached Ready state.

Test plan

  • New unit tests: stream recovers when readyCheck is true at close time (SSH 255)
  • New unit test: stream still returns non-zero when readyCheck is false at close time
  • All existing sandbox-create-stream tests pass (7/7)
  • All onboard integration tests pass (83/83)
  • All src unit tests pass (538/538)
  • Manual: trigger SSH 255 during sandbox create and verify recovery

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced sandbox creation error handling with improved failure classification and recovery messaging.
    • Improved sandbox readiness detection to prevent unnecessary creation failures when the sandbox is operational.

When openshell sandbox create exits with SSH 255 after printing
"Created sandbox:", NemoClaw previously hard-exited instead of checking
whether the sandbox reached Ready state. This was a regression from
the create-stream extraction (#1516) combined with the messaging
provider migration path (#1081, #1527) that forces sandbox recreation.

Two fixes:
- streamSandboxCreate: do one final readyCheck on non-zero close to
  catch the race where the sandbox is already Ready when SSH dies.
- onboard.js: when failure is sandbox_create_incomplete, fall through
  to the existing ready-wait loop (60s polling) instead of exiting.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@coderabbitai

coderabbitai Bot commented Apr 8, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: aa92ed04-1fd8-40d9-81a6-e30f370b5bcc

📥 Commits

Reviewing files that changed from the base of the PR and between 4c38bfa and 48b39fb.

📒 Files selected for processing (3)
  • bin/lib/onboard.js
  • src/lib/sandbox-create-stream.test.ts
  • src/lib/sandbox-create-stream.ts

📝 Walkthrough

Walkthrough

This PR enhances sandbox creation error handling in two places: the onboard orchestrator now classifies creation failures and conditionally continues or exits, while the create-stream module adds a final readiness check when the child process exits with a non-zero code to override the error if the sandbox is actually ready.

Changes

Cohort / File(s) Summary
Sandbox Creation Error Recovery
bin/lib/onboard.js, src/lib/sandbox-create-stream.ts
Enhanced error handling with failure classification and readiness checks. Onboard now classifies creation failures and continues on sandbox_create_incomplete. Create-stream adds a synchronous final readiness check on non-zero child process exit to override errors when sandbox is already functional.
Creation Stream Tests
src/lib/sandbox-create-stream.test.ts
Added two test cases covering child process close behavior: one where readiness check succeeds before close (overrides non-zero exit), another where readiness check fails (preserves non-zero exit status).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A sandbox born, but stumbling fast,
Yet readiness checked—it's here at last!
When child processes exit with a frown,
We check one more time before we're down.
Recovery blooms where errors stood before! 🌱

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly references the main fix: recovering from SSH exit code 255 when a sandbox was already created, which is the core objective of this PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/sandbox-create-ssh255-recovery

Comment @coderabbitai help to get the list of available commands and usage tips.

@ericksoa ericksoa merged commit adbea05 into main Apr 8, 2026
9 checks passed
@ericksoa ericksoa deleted the fix/sandbox-create-ssh255-recovery branch April 8, 2026 03:55
gemini2026 pushed a commit to gemini2026/NemoClaw that referenced this pull request Apr 14, 2026
…VIDIA#1598)

## Summary
- When `openshell sandbox create` exits with SSH 255 after printing
"Created sandbox:", NemoClaw now treats this as recoverable instead of
hard-exiting
- Added a final `readyCheck` in `streamSandboxCreate` on non-zero close
to catch the race where the sandbox is already Ready when SSH dies
- In `onboard.js`, `sandbox_create_incomplete` failures now fall through
to the existing 60s ready-wait loop instead of calling `process.exit()`

## Root cause
Regression from NVIDIA#1516 (create-stream extraction) combined with
NVIDIA#1081/NVIDIA#1527 (messaging provider migration forcing sandbox recreation).
The create stream returns non-zero after "Created sandbox:" and NemoClaw
exits before checking if the sandbox reached Ready state.

## Test plan
- [x] New unit tests: stream recovers when readyCheck is true at close
time (SSH 255)
- [x] New unit test: stream still returns non-zero when readyCheck is
false at close time
- [x] All existing `sandbox-create-stream` tests pass (7/7)
- [x] All onboard integration tests pass (83/83)
- [x] All src unit tests pass (538/538)
- [ ] Manual: trigger SSH 255 during sandbox create and verify recovery

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Enhanced sandbox creation error handling with improved failure
classification and recovery messaging.
* Improved sandbox readiness detection to prevent unnecessary creation
failures when the sandbox is operational.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants