Skip to content

fix(onboard): suppress provider cleanup probe output#5030

Merged
cv merged 2 commits into
mainfrom
fix/quiet-onboard-provider-cleanup
Jun 9, 2026
Merged

fix(onboard): suppress provider cleanup probe output#5030
cv merged 2 commits into
mainfrom
fix/quiet-onboard-provider-cleanup

Conversation

@sandl99

@sandl99 sandl99 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

Suppress expected OpenShell provider-cleanup probe output during onboarding so fresh sandbox creation no longer prints repeated sandbox not found errors before the real create step. The cleanup still captures and classifies OpenShell failures internally, preserving the stale-provider recovery behavior added for rebuild and resume flows.

Related Issue

Closes #5025

Changes

  • Add suppressOutput: true to captured provider detach/delete cleanup calls in src/lib/onboard/sandbox-provider-cleanup.ts.
  • Add regression coverage proving tolerated missing-sandbox detach probes are quiet while still attempted for every provider suffix.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Additional verification:

  • npm run build:cli passes
  • npx vitest run test/sandbox-provider-cleanup.test.ts passes
  • npx prek run --all-files was attempted but failed in unrelated existing CLI hook failures/timeouts outside this change.

Signed-off-by: San Dang sdang@nvidia.com

Summary by CodeRabbit

  • Tests

    • Added test coverage for sandbox provider cleanup operations to validate proper handling when sandboxes are missing.
  • Refactor

    • Enhanced command output handling in sandbox provider cleanup processes for improved reliability.

Signed-off-by: San Dang <sdang@nvidia.com>
@sandl99 sandl99 self-assigned this Jun 9, 2026
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0b1d8337-652d-4815-958d-71efec7f5323

📥 Commits

Reviewing files that changed from the base of the PR and between 3048632 and e1baa08.

📒 Files selected for processing (2)
  • src/lib/onboard/sandbox-provider-cleanup.ts
  • test/sandbox-provider-cleanup.test.ts

📝 Walkthrough

Walkthrough

Provider cleanup functions now pass suppressOutput: true when detaching or deleting sandbox providers. The change affects detachSandboxProviders, recoverAttachedProvider, and deleteProviderWithRecovery, and is validated by a new test case for missing sandbox tolerance.

Changes

Output suppression for provider cleanup

Layer / File(s) Summary
Suppress detach and delete command output
src/lib/onboard/sandbox-provider-cleanup.ts
Add suppressOutput: true option to four runOpenshell calls in detachSandboxProviders, recoverAttachedProvider (detach), and deleteProviderWithRecovery (two delete attempts). These calls already ignore errors and pipe stdout/stderr.
Test output suppression with missing sandbox tolerance
test/sandbox-provider-cleanup.test.ts
Add test case verifying that when tolerateMissingSandbox: true is set, detach operations with missing sandboxes do not fail and include both ignoreError: true and suppressOutput: true in the options.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4943: The main PR's change to pass suppressOutput: true through runOpenshell in detachSandboxProviders, recoverAttachedProvider, and deleteProviderWithRecovery is directly aligned with the retrieved PR's sandbox-provider cleanup implementation of those same functions and their detach/delete invocation paths.

Suggested labels

area: sandbox, bug-fix

Poem

🐰 When sandboxes are lost without a trace,
We quietly clean without leaving a face,
Suppress the errors, let silence prevail,
A gentle goodbye with a whisper, not wail! 🌙

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: suppressing provider cleanup probe output during onboarding, which directly addresses the issue of repeated 'sandbox not found' errors.
Linked Issues check ✅ Passed The pull request fully implements the requirements from issue #5025 by suppressing the noisy 'sandbox not found' error output during provider cleanup probes while maintaining internal failure tracking and recovery behavior.
Out of Scope Changes check ✅ Passed All changes are directly related to the objective of suppressing provider cleanup probe output; modifications to sandbox-provider-cleanup.ts and test coverage are appropriately scoped to the issue.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/quiet-onboard-provider-cleanup

Comment @coderabbitai help to get the list of available commands and usage tips.

@sandl99 sandl99 added area: onboarding Onboarding FSM, provider setup, sandbox launch, or first-run flow bug-fix PR fixes a bug or regression VRDC Issues and PRs submitted by NVIDIA VRDC test team. labels Jun 9, 2026
@cv cv added the v0.0.62 Release target label Jun 9, 2026
@sandl99 sandl99 requested a review from laitingsheng June 9, 2026 06:50
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 0 needs attention, 2 worth checking, 0 nice ideas
Since last review: 0 prior items resolved, 1 still applies, 0 new items found

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • Source-of-truth review needed: Provider cleanup quieting around existing OpenShell sandbox/provider lifecycle workaround: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `src/lib/onboard/sandbox-provider-cleanup.ts` documents OpenShell as the source boundary and adds suppression at the cleanup subprocess calls; `test/sandbox-provider-cleanup.test.ts` covers only the tolerated missing-sandbox detach suppression case.
  • Add coverage for all newly quiet provider cleanup paths (test/sandbox-provider-cleanup.test.ts:172): The patch adds `suppressOutput: true` to `detachSandboxProviders()`, `recoverAttachedProvider()`, and both `deleteProviderWithRecovery()` provider-delete attempts, but the added regression test only asserts option forwarding for `detachSandboxProviders()`. This leaves the credential-adjacent provider-delete recovery path unguarded against future regressions, and the linked issue's upgrade/onboard terminal-noise acceptance is only proven at the unit option-forwarding level rather than by runtime terminal-output validation.
    • Recommendation: Add behavior-specific tests asserting `recoverAttachedProvider()` forwards `suppressOutput: true` while still collecting non-tolerated failures, and `deleteProviderWithRecovery()` suppresses the initial delete, recovery detach, and retry delete calls. Consider a targeted runtime/integration validation that a fresh onboard tolerated missing-sandbox cleanup captures OpenShell `NotFound` stderr without writing it to user terminal output.
    • Evidence: Production changes add `suppressOutput: true` at `src/lib/onboard/sandbox-provider-cleanup.ts:118`, `:177`, `:271`, and `:283`. The new test beginning around `test/sandbox-provider-cleanup.test.ts:172` checks only tolerated missing-sandbox detach probes over `SANDBOX_PROVIDER_SUFFIXES`; existing recovery/delete tests assert argv and outcomes but not options.

🌱 Nice ideas

  • None.
Consider writing more tests for
  • **Runtime validation** — recoverAttachedProvider passes suppressOutput true on each recovery detach while still returning non-tolerated failures. This is sandbox/OpenShell infrastructure code where the user-visible bug is terminal output from subprocess stderr. Unit option-forwarding coverage is useful but currently incomplete across changed paths, and runtime-style validation would better prove the linked onboarding noise is gone.
  • **Runtime validation** — deleteProviderWithRecovery passes suppressOutput true on the initial provider delete when no recovery is needed. This is sandbox/OpenShell infrastructure code where the user-visible bug is terminal output from subprocess stderr. Unit option-forwarding coverage is useful but currently incomplete across changed paths, and runtime-style validation would better prove the linked onboarding noise is gone.
  • **Runtime validation** — deleteProviderWithRecovery passes suppressOutput true on initial delete, recovery detach, and retry delete after an attached-sandbox diagnostic. This is sandbox/OpenShell infrastructure code where the user-visible bug is terminal output from subprocess stderr. Unit option-forwarding coverage is useful but currently incomplete across changed paths, and runtime-style validation would better prove the linked onboarding noise is gone.
  • **Runtime validation** — fresh onboard tolerated missing-sandbox provider cleanup captures NotFound stderr without writing it to user terminal output. This is sandbox/OpenShell infrastructure code where the user-visible bug is terminal output from subprocess stderr. Unit option-forwarding coverage is useful but currently incomplete across changed paths, and runtime-style validation would better prove the linked onboarding noise is gone.
  • **Add coverage for all newly quiet provider cleanup paths** — Add behavior-specific tests asserting `recoverAttachedProvider()` forwards `suppressOutput: true` while still collecting non-tolerated failures, and `deleteProviderWithRecovery()` suppresses the initial delete, recovery detach, and retry delete calls. Consider a targeted runtime/integration validation that a fresh onboard tolerated missing-sandbox cleanup captures OpenShell `NotFound` stderr without writing it to user terminal output.
  • **Acceptance clause:** Issue "NotFound" error while onboarding #5025 Description: "After upgrading to v0.0.61 in onboard meessages there are a lot of errors like:" followed by repeated `status: NotFound, message: "sandbox not found"` diagnostics and metadata. — add test evidence or identify existing coverage. The fresh-onboard cleanup path now calls OpenShell with `stdio: ["ignore", "pipe", "pipe"]` and `suppressOutput: true`, and the added unit test verifies tolerated missing-sandbox detach probes are attempted for every provider suffix with `suppressOutput: true`. No runtime terminal-output validation is included.
  • **Acceptance clause:** Issue "NotFound" error while onboarding #5025: "In v0.0.60 hadn't such errors." — add test evidence or identify existing coverage. The patch targets the v0.0.61-era provider cleanup output path by suppressing captured OpenShell command output. The diff does not include cross-version installer/upgrade validation.
  • **Acceptance clause:** Issue "NotFound" error while onboarding #5025 Reproduction Steps 1. `Run curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_INSTALL_TAG=v0.0.60 bash` — add test evidence or identify existing coverage. No installer or cross-version setup validation is added in this diff; changed files are limited to provider cleanup code and its unit test.
Since last review details

Current findings:

  • Source-of-truth review needed: Provider cleanup quieting around existing OpenShell sandbox/provider lifecycle workaround: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `src/lib/onboard/sandbox-provider-cleanup.ts` documents OpenShell as the source boundary and adds suppression at the cleanup subprocess calls; `test/sandbox-provider-cleanup.test.ts` covers only the tolerated missing-sandbox detach suppression case.
  • Add coverage for all newly quiet provider cleanup paths (test/sandbox-provider-cleanup.test.ts:172): The patch adds `suppressOutput: true` to `detachSandboxProviders()`, `recoverAttachedProvider()`, and both `deleteProviderWithRecovery()` provider-delete attempts, but the added regression test only asserts option forwarding for `detachSandboxProviders()`. This leaves the credential-adjacent provider-delete recovery path unguarded against future regressions, and the linked issue's upgrade/onboard terminal-noise acceptance is only proven at the unit option-forwarding level rather than by runtime terminal-output validation.
    • Recommendation: Add behavior-specific tests asserting `recoverAttachedProvider()` forwards `suppressOutput: true` while still collecting non-tolerated failures, and `deleteProviderWithRecovery()` suppresses the initial delete, recovery detach, and retry delete calls. Consider a targeted runtime/integration validation that a fresh onboard tolerated missing-sandbox cleanup captures OpenShell `NotFound` stderr without writing it to user terminal output.
    • Evidence: Production changes add `suppressOutput: true` at `src/lib/onboard/sandbox-provider-cleanup.ts:118`, `:177`, `:271`, and `:283`. The new test beginning around `test/sandbox-provider-cleanup.test.ts:172` checks only tolerated missing-sandbox detach probes over `SANDBOX_PROVIDER_SUFFIXES`; existing recovery/delete tests assert argv and outcomes but not options.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: messaging-providers-e2e, double-onboard-e2e, onboard-repair-e2e
Optional E2E: brave-search-e2e, onboard-resume-e2e, sandbox-operations-e2e

Dispatch hint: messaging-providers-e2e,double-onboard-e2e,onboard-repair-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • messaging-providers-e2e (high (~75 min timeout, live Docker/OpenShell sandbox)): Exercises real onboarding with per-sandbox messaging providers, placeholder credentials, provider attachment, and cleanup paths affected by sandbox-provider-cleanup.ts.
  • double-onboard-e2e (high (~90 min timeout, multiple sandbox lifecycle phases)): Covers repeated onboard/recreate for the same sandbox and explicit destroy cleanup, directly exercising pre-delete provider detach probes added to the recreate/destroy lifecycle.
  • onboard-repair-e2e (medium-high (~60 min timeout, live onboarding repair)): Covers resume/repair after the recorded sandbox is pruned from OpenShell, the path where missing-sandbox detach probes are tolerated and now suppress OpenShell output.

Optional E2E

  • brave-search-e2e (medium-high (live Brave/search onboarding path)): Optional adjacent coverage for the brave-search provider suffix included in SANDBOX_PROVIDER_SUFFIXES; useful if the change owner wants search-provider-specific cleanup confidence.
  • onboard-resume-e2e (medium-high (~60 min timeout, live resume)): Optional companion to onboard-repair-e2e for interrupted onboard resume behavior; less directly targeted than repair because it mostly skips existing sandbox/gateway work.
  • sandbox-operations-e2e (medium-high (~60 min timeout, multi-sandbox operations)): Optional broader destroy/lifecycle confidence for sandbox deletion and registry cleanup, though it does not focus on attached provider recovery.

New E2E recommendations

  • provider cleanup output suppression (medium): Existing live E2Es cover provider lifecycle broadly, but none appears to assert the specific regression this PR targets: tolerated missing-sandbox/provider-detach probes and provider delete recovery should not print raw OpenShell diagnostics while still preserving captured/redacted failure output.
    • Suggested test: Add a targeted live provider-cleanup E2E that creates attached messaging/search providers, prunes or deletes the sandbox to leave stale attachment conditions, reruns onboard/destroy, and asserts cleanup remains quiet for tolerated detach/delete failures while non-tolerated diagnostics are redacted and surfaced.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: messaging-providers-e2e,double-onboard-e2e,onboard-repair-e2e

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: ubuntu-repo-cloud-openclaw
Optional scenario E2E: ubuntu-repo-cloud-openclaw-double-same-provider, ubuntu-repo-cloud-openclaw-telegram

Dispatch required scenario E2E:

  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • ubuntu-repo-cloud-openclaw: The changed onboard sandbox-provider cleanup helper is called during normal cloud OpenClaw onboarding before sandbox creation/deletion. This scenario is the smallest standard Ubuntu repo-checkout path that exercises the onboard flow and smoke-validates the resulting sandbox.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Optional scenario E2E

  • ubuntu-repo-cloud-openclaw-double-same-provider: Optional adjacent coverage for provider replacement/reuse flows, which are closer to the deleteProviderWithRecovery path touched by this helper change.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-double-same-provider
  • ubuntu-repo-cloud-openclaw-telegram: Optional adjacent coverage for a messaging-provider onboarding path; the cleanup helper detaches per-sandbox messaging providers and this scenario validates a concrete messaging provider attachment flow.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-telegram

Relevant changed files

  • src/lib/onboard/sandbox-provider-cleanup.ts

@cv cv enabled auto-merge (squash) June 9, 2026 06:59
@cv cv merged commit 38c03b1 into main Jun 9, 2026
35 checks passed
@cv cv deleted the fix/quiet-onboard-provider-cleanup branch June 9, 2026 07:00
jyaunches pushed a commit that referenced this pull request Jun 10, 2026
## Summary
- Add v0.0.62 release notes from Discussion #5100 and link release
highlights to the relevant docs pages.
- Document the release's GPU sandbox recreation, sandbox-side local
inference verification, and Hermes dashboard port guard in the command
and inference references.
- Refresh generated NemoClaw user skills for the release-prep docs set.

## Source Summary
- #4956 -> `docs/reference/commands.mdx`: Document CDI-first Docker GPU
recreation behavior for Linux Docker-driver sandboxes.
- #5024 -> `docs/inference/use-local-inference.mdx`: Document
sandbox-runtime verification of the `inference.local` local inference
route.
- #5018 -> `docs/reference/commands.mdx`: Document Jetson/Tegra
device-node group propagation for sandbox CUDA initialization.
- #5012, #4763, #4706, #5030, #5015 -> `docs/about/release-notes.mdx`:
Summarize onboarding and recovery reliability fixes, including the
reserved Hermes API port guard.
- #5017 and #5043 -> `docs/about/release-notes.mdx`,
`docs/reference/commands.mdx`: Summarize mutable OpenClaw config
recovery and host-side `agents list` coverage.
- #5010 and #5016 -> `docs/about/release-notes.mdx`: Summarize Hermes
upstream metadata visibility and WhatsApp QR rendering reliability.
- #5045 and prior source docs in the v0.0.62 range -> `.agents/skills/`:
Refresh generated user-skill references from the current docs source.

## Skipped
- #5019 -> skipped for new prose because it touched
`openclaw-sandbox-permissive.yaml`, which matches `docs/.docs-skip`.
Existing source docs remain the source for generated skill
synchronization.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `npm run docs` (passes; Fern reports 0 errors and 1 hidden warning)
- Pre-commit hooks passed during commit, including docs-to-skills
verification, markdown lint, gitleaks, and skills YAML tests.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * Added `nemoclaw <name> agents list` command.
* v0.0.62 release notes added summarizing onboarding and recovery
improvements.

* **Bug Fixes**
* Improved GPU sandbox onboarding reliability (NVIDIA CDI path,
Jetson/Tegra device handling).
* Better local inference verification and recovery for Linux
Docker-driver GPU sandboxes.
  * Quieter/earlier handling of onboarding drift and port collisions.

* **Documentation**
* Expanded GPU passthrough, inference verification, writable paths
(`/dev/pts`), port 8642 restriction, and command examples.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Prekshi Vyas <34834085+prekshivyas@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: onboarding Onboarding FSM, provider setup, sandbox launch, or first-run flow bug-fix PR fixes a bug or regression v0.0.62 Release target VRDC Issues and PRs submitted by NVIDIA VRDC test team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

"NotFound" error while onboarding

2 participants