[DGX Station][Sandbox] nemoclaw rebuild aborts at pre-backup audit when state_dirs contain root-owned subdirs (find exit 1) #4059

## Description

`nemoclaw <name> rebuild` aborts at the pre-backup audit step on a freshly onboarded v0.0.49 sandbox. The audit runs one `find` per entry in `state_dirs` and joins the invocations with `&&`. When the base image bakes root-owned subdirs into a state_dir (e.g. `agents/main`, `extensions/nemoclaw`, `extensions/openclaw-weixin`), the sandbox-user SSH session cannot traverse them, so `find` exits 1 even with `2>/dev/null` (the redirect hides the message, not the exit status) and the whole audit is treated as failed. Rebuild then aborts "to prevent data loss". This blocks every `channels add <slack|telegram|discord|wechat>` flow, because that path requires a rebuild to materialize the channel config into `/sandbox/.openclaw/openclaw.json`.

## Environment

```text
Device:        DGX Station (GB300 + RTX PRO 6000 Blackwell Max-Q, mixed-GPU)
OS:            Ubuntu noble (kernel 6.17.0-1018-nvidia)
Architecture:  aarch64
Node.js:       v22.22.3
npm:           10.9.8
Docker:        29.2.1
OpenShell CLI: 0.0.39
NemoClaw:      v0.0.49
OpenClaw:      2026.4.24 (installed inside sandbox)
```

## Steps to Reproduce

1. Fresh install on DGX Station with Station express recipe env:
   ```bash
   export NEMOCLAW_INSTALL_TAG=v0.0.49
   export NEMOCLAW_PROVIDER=install-vllm
   export NEMOCLAW_NON_INTERACTIVE=1
   export NEMOCLAW_YES=1
   export NEMOCLAW_POLICY_MODE=suggested
   export NEMOCLAW_VLLM_MODEL=qwen3.6-27b
   export NEMOCLAW_SANDBOX_NAME=my-station-assistant
   curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash -s -- --yes-i-accept-third-party-software
   ```
2. Wait for `nemoclaw my-station-assistant status` to show `Phase=Ready`, `Provider=vllm-local`, `Model=Qwen/Qwen3.6-27B-FP8`, `Inference=healthy`.
3. Try to add a Slack channel (or any other messaging channel):
   ```bash
   export SLACK_BOT_TOKEN=xoxb-...
   export SLACK_APP_TOKEN=xapp-...
   nemoclaw my-station-assistant channels add slack
   ```
   Result: the bridge providers `my-station-assistant-slack-bridge` and `my-station-assistant-slack-app` are registered with the OpenShell gateway, the sandbox registry records `messagingChannels=['slack']`, and the `slack` preset is applied. The CLI prints `Rebuild 'my-station-assistant' now to apply? [Y/n]:` and exits.
4. Run rebuild with full instrumentation:
   ```bash
   NEMOCLAW_REBUILD_VERBOSE=1 nemoclaw my-station-assistant rebuild --yes --force --verbose
   ```
5. Rebuild aborts at "Backing up sandbox state..." → "Failed to back up sandbox state. Failed: agents, extensions, workspace, skills, hooks, identity, devices, canvas, cron, memory, telegram, wechat, credentials. Aborting rebuild to prevent data loss."

## Expected Result

- Pre-backup audit gracefully handles unreadable / root-owned subdirs inside state_dirs (auto-whitelist or skip via `find -prune`, OR treat exit-1 from a single `find` invocation as non-fatal when stdout is empty).
- Rebuild proceeds, `openclaw.json` is regenerated with `channels.slack.accounts.default.botToken` populated, and `openclaw message send --channel slack --target channel: --message "..."` from inside the sandbox sends successfully via the configured bot token.
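
The `-prune` variant mentioned in the first bullet could look roughly like the sketch below. It is a minimal local illustration, not the project's implementation: the function name `audit_with_prune` is hypothetical, the skipped subdir is hard-coded rather than read from a manifest, and the predicate uses portable `-print` instead of the `-printf` format the real audit uses.

```shell
#!/bin/sh
# Sketch: skip one known baked subdir with -prune so find never opens it.
# Assumption: the path to skip is known up front (hypothetical argument $2).
audit_with_prune() {
  dir=$1
  skip=$2
  # Report symlinks, multiply-linked files, and special files outside $skip.
  find "$dir" -path "$skip" -prune -o \
    \( -type l -o \( -type f -a -links +1 \) -o \( ! -type f -a ! -type d \) \) -print
}

# Throwaway tree standing in for /sandbox/.openclaw/agents.
root=$(mktemp -d)
mkdir -p "$root/agents/main" "$root/agents/custom"
ln -s /nonexistent "$root/agents/main/baked-link"    # inside the pruned subtree
ln -s /nonexistent "$root/agents/custom/user-link"   # outside it, must be reported

result=$(audit_with_prune "$root/agents" "$root/agents/main")
printf '%s\n' "$result"
```

Because `-prune` stops `find` before it opens the directory, no permission error is raised for the skipped subtree, and only `agents/custom/user-link` is reported.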

## Actual Result

Aborts with "Failed to back up sandbox state" listing every state_dir as failed, even the dirs whose `find` returned exit 0. Root cause is the audit's `&&`-joined per-dir `find` chain: any single non-zero exit kills the audit, marking ALL state_dirs as failed and rolling up to rebuild abort.

Downstream impact:
- `openclaw message send --channel slack ...` from inside the sandbox fails with:
  ```text
  Error: Slack bot token missing for account "default"
  (set channels.slack.accounts.default.botToken or SLACK_BOT_TOKEN for default).
  ```
  even though the gateway provider records the token, because `openclaw.json` was never updated.
- Blocks T5918892 (Slack E2E), T5910682 (Discord E2E), T5910681 (Telegram E2E), T5910683 (WeChat E2E), and any other v0.0.49 manual function_test case that exercises `channels add`.
- Also blocks legitimate sandbox upgrades because the same audit gate fires on every rebuild.

Direct verification of the failing audit pattern from inside the sandbox via `nemoclaw exec`:
```text
find /sandbox/.openclaw/agents \( -type l -o \( -type f -a -links +1 \) -o \( ! -type f -a ! -type d \) \) -printf '%y\t%p\t%l\n' 2>&1
    → find: '/sandbox/.openclaw/agents/main': Permission denied
    → EXIT=1
find /sandbox/.openclaw/extensions \( ... \) ... 2>&1
    → find: '/sandbox/.openclaw/extensions/nemoclaw': Permission denied
    → find: '/sandbox/.openclaw/extensions/openclaw-weixin': Permission denied
    → EXIT=1
find /sandbox/.openclaw/{workspace,skills,hooks,identity,devices,canvas,cron,memory,telegram,wechat,credentials} ...
    → EXIT=0 each (no permission issues)
```

Code path: `src/lib/state/sandbox.ts:1040-1066` builds the audit command by joining the per-dir `find` invocations with ` && `, then runs it via `spawnSync('ssh', ..., auditCmd)`. When any one `find` exits non-zero, the chain stops at that point and returns non-zero, and the function marks every existing state_dir as failed.
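
The effect of the joiner can be shown in isolation. In the sketch below, the hypothetical `audit` function stands in for one per-dir `find`: it prints the dir name and returns the given exit status, where 1 mimics a `find` that hit "Permission denied". Nothing here is taken from `sandbox.ts`.

```shell
#!/bin/sh
# Stand-in for one per-dir find: print the dir, return the given exit status.
audit() {
  printf 'audited %s\n' "$1"
  return "$2"
}

# Joined with &&: the chain stops at the first non-zero exit.
and_rc=0
and_out=$(audit agents 1 && audit extensions 1 && audit workspace 0) || and_rc=$?

# Joined with ;: every command runs, and the status is that of the last one.
seq_rc=0
seq_out=$(audit agents 1; audit extensions 1; audit workspace 0) || seq_rc=$?

printf '&& chain: exit=%s\n%s\n' "$and_rc" "$and_out"
printf '; chain: exit=%s\n%s\n' "$seq_rc" "$seq_out"
```

With `&&`, only `agents` is audited and the chain exits 1. With `;`, all three run and the chain exits 0, because only the last command's status is reported. A `;`-joined chain would therefore still return non-zero if the last state_dir were the unreadable one, so the caller would also need to stop treating the exit status as fatal.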

## Logs

```text
[sandbox-state 2026-05-22T07:09:31.540Z] backupSandboxState: agent=openclaw, stateDirs=[agents,extensions,workspace,skills,hooks,identity,devices,canvas,cron,memory,telegram,wechat,whatsapp,credentials], stateFiles=[]
[sandbox-state 2026-05-22T07:09:31.540Z] policyPresets from registry: [npm,pypi,huggingface,brew,local-inference,slack]
[sandbox-state 2026-05-22T07:09:31.541Z] Getting SSH config via openshell sandbox ssh-config
[sandbox-state 2026-05-22T07:09:31.548Z] SSH config obtained (334 bytes)
[sandbox-state 2026-05-22T07:09:31.548Z] Checking existing dirs via SSH: { [ -d '/sandbox/.openclaw/agents' ] && printf '%s\n' 'agents'; [ -d '/sandbox/.openclaw/extensions'...
[sandbox-state 2026-05-22T07:09:31.798Z] Dir check: exit=0, stdout=agents\ncredentials, stderr=
[sandbox-state 2026-05-22T07:09:31.798Z] Existing dirs in sandbox: [agents,extensions,workspace,skills,hooks,identity,devices,canvas,cron,memory,telegram,wechat,credentials] (13/14)
[sandbox-state 2026-05-22T07:09:31.798Z] Pre-backup audit: checking for symlinks, hard links, and special files
[sandbox-state 2026-05-22T07:09:31.998Z] FAILED: Pre-backup audit command failed — exit 1
[rebuild 2026-05-22T07:09:31.998Z] Backup result: success=false, backed=; files=, failed=agents,extensions,workspace,skills,hooks,identity,devices,canvas,cron,memory,telegram,wechat,credentials; failedFiles=
Failed to back up sandbox state.
Failed: agents, extensions, workspace, skills, hooks, identity, devices, canvas, cron, memory, telegram, wechat, credentials
Aborting rebuild to prevent data loss.
```

## Suggested Fix

`src/lib/state/sandbox.ts:1040-1066` — change the per-state-dir audit `find` chain so a single permission-denied subdir does not abort the audit. Three viable options:

- **(a) Replace ` && ` joiner with `;`** so all `find` invocations run regardless of individual exit codes, and let the existing stdout-parsing logic categorize whatever entries are emitted. The stderr/exit signal becomes informational, not fatal.
- **(b) Per-dir invocation**: spawn `find` separately for each state_dir, capture stdout independently, and fail the audit only if an error unrelated to `find` (e.g. a lost SSH connection) occurs. Permission-denied on a subdir becomes a recorded warning, and the unreadable entries are simply not audited, which is the existing whitelist behaviour for known-safe paths anyway.
- **(c) Auto-whitelist base-image-owned subdirs**: pre-baked subdirs like `agents/<agent_id>`, `extensions/<plugin>` are root-owned by design. Detect them via a baked manifest and `-prune` them out of `find` before running.

Option (a) is the smallest delta and matches the existing tolerance the audit already shows (it parses stdout for violations regardless of stderr). Option (b) is more correct but a bigger change.
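
As a rough illustration of option (b), the sketch below audits each dir on its own, so a non-zero `find` exit produces a warning for that dir only. It rests on several assumptions: it runs locally instead of over SSH, the names `audit_state_dir`, `violations`, and `warnings` are hypothetical, it uses portable `-print` instead of the audit's `-printf` format, and a missing directory stands in for the permission-denied case so the demo behaves the same for any user.

```shell
#!/bin/sh
# Sketch of option (b): one find per state dir, with exit status kept per dir.
violations=""   # newline-separated paths flagged by the audit
warnings=""     # state dirs whose find exited non-zero

audit_state_dir() {
  dir=$1
  rc=0
  # Same predicate as the audit: symlinks, multiply-linked files, special files.
  out=$(find "$dir" \( -type l -o \( -type f -a -links +1 \) \
        -o \( ! -type f -a ! -type d \) \) -print 2>/dev/null) || rc=$?
  # Keep whatever find managed to report, even if it also hit an error.
  if [ -n "$out" ]; then
    violations="${violations}${out}
"
  fi
  # A non-zero exit is recorded for this dir only; other dirs are unaffected.
  if [ "$rc" -ne 0 ]; then
    warnings="${warnings}${dir} (find exit ${rc})
"
  fi
}

root=$(mktemp -d)
mkdir -p "$root/agents" "$root/workspace"
ln -s /nonexistent "$root/workspace/link"

audit_state_dir "$root/agents"
audit_state_dir "$root/missing"     # stand-in for a dir where find exits non-zero
audit_state_dir "$root/workspace"

printf 'violations:\n%s' "$violations"
printf 'warnings:\n%s' "$warnings"
```

The symlink in `workspace` is still reported even though an earlier dir failed. In the real SSH-based flow, a transport failure would need to be distinguished from a `find` failure; `ssh` itself exits with 255 when an error occurs, which is one signal that could be used for that.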

Either way, the rebuild must not refuse to proceed just because `find` cannot read a baked subdir. Today the data-loss guard fires correctly (abort, do not destroy) but the precondition that triggered it is a false positive in every onboard-then-rebuild sequence on v0.0.49.

This is NOT introduced by PR #3925 (the rebuild pre-backup audit code path is unchanged in that PR). PR #3925's separate "ARM64 post-rebuild gateway-unhealthy" regression is tracked under NVBug 6198894.

---
[NVB#6204923](https://nvbugspro.nvidia.com/bug/6204923)

