Description
nemoclaw <name> rebuild aborts at the pre-backup audit step on a freshly-onboarded v0.0.49 sandbox. The audit's find chain over state_dirs is joined by &&; when the base image bakes root-owned subdirs inside a state_dir (e.g. agents/main, extensions/nemoclaw, extensions/openclaw-weixin), the sandbox-user SSH session cannot traverse them, find returns exit 1 even with 2>/dev/null, and the whole audit is treated as failed. Rebuild then aborts "to prevent data loss". This blocks any channels add <slack|telegram|discord|wechat> flow because that path requires a rebuild to materialize the channel config into /sandbox/.openclaw/openclaw.json.
Environment
Device: DGX Station (GB300 + RTX PRO 6000 Blackwell Max-Q, mixed-GPU)
OS: Ubuntu noble (kernel 6.17.0-1018-nvidia)
Architecture: aarch64
Node.js: v22.22.3
npm: 10.9.8
Docker: 29.2.1
OpenShell CLI: 0.0.39
NemoClaw: v0.0.49
OpenClaw: 2026.4.24 (installed inside sandbox)
Steps to Reproduce
- Fresh install on DGX Station with Station express recipe env:
export NEMOCLAW_INSTALL_TAG=v0.0.49
export NEMOCLAW_PROVIDER=install-vllm
export NEMOCLAW_NON_INTERACTIVE=1
export NEMOCLAW_YES=1
export NEMOCLAW_POLICY_MODE=suggested
export NEMOCLAW_VLLM_MODEL=qwen3.6-27b
export NEMOCLAW_SANDBOX_NAME=my-station-assistant
curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash -s -- --yes-i-accept-third-party-software
- Wait for nemoclaw my-station-assistant status to show Phase=Ready, Provider=vllm-local, Model=Qwen/Qwen3.6-27B-FP8, Inference=healthy.
- Try to add Slack channel (or any messaging channel):
export SLACK_BOT_TOKEN=xoxb-...
export SLACK_APP_TOKEN=xapp-...
nemoclaw my-station-assistant channels add slack
Result: bridge providers my-station-assistant-slack-bridge and my-station-assistant-slack-app are registered with OpenShell gateway; sandbox registry's messagingChannels=['slack']; preset slack is applied. CLI prints Rebuild 'my-station-assistant' now to apply? [Y/n]: and exits.
- Run rebuild with full instrumentation:
NEMOCLAW_REBUILD_VERBOSE=1 nemoclaw my-station-assistant rebuild --yes --force --verbose
- Rebuild aborts at "Backing up sandbox state..." → "Failed to back up sandbox state. Failed: agents, extensions, workspace, skills, hooks, identity, devices, canvas, cron, memory, telegram, wechat, credentials. Aborting rebuild to prevent data loss."
Expected Result
- Pre-backup audit gracefully handles unreadable / root-owned subdirs inside state_dirs (auto-whitelist or skip via find -prune, OR treat exit-1 from a single find invocation as non-fatal when stdout is empty).
- Rebuild proceeds, openclaw.json is regenerated with channels.slack.accounts.default.botToken populated, and openclaw message send --channel slack --target channel: --message "..." from inside the sandbox sends successfully via the configured bot token.
Actual Result
Aborts with "Failed to back up sandbox state" listing every state_dir as failed, even the dirs whose find returned exit 0. Root cause is the audit's &&-joined per-dir find chain: any single non-zero exit kills the audit, marking ALL state_dirs as failed and rolling up to rebuild abort.
Downstream impact:
- openclaw message send --channel slack ... from inside the sandbox errors with:
Error: Slack bot token missing for account "default"
(set channels.slack.accounts.default.botToken or SLACK_BOT_TOKEN for default).
This happens even though the gateway provider records the token, because openclaw.json was never updated.
- Blocks T5918892 (Slack E2E), T5910682 (Discord E2E), T5910681 (Telegram E2E), T5910683 (WeChat E2E), and any other v0.0.49 manual function_test case that exercises channels add.
- Also blocks legitimate sandbox upgrades because the same audit gate fires on every rebuild.
Direct verification of the failing audit pattern from inside the sandbox via nemoclaw exec:
find /sandbox/.openclaw/agents \( -type l -o \( -type f -a -links +1 \) -o \( ! -type f -a ! -type d \) \) -printf '%y\t%p\t%l\n' 2>&1
→ find: '/sandbox/.openclaw/agents/main': Permission denied
→ EXIT=1
find /sandbox/.openclaw/extensions \( ... \) ... 2>&1
→ find: '/sandbox/.openclaw/extensions/nemoclaw': Permission denied
→ find: '/sandbox/.openclaw/extensions/openclaw-weixin': Permission denied
→ EXIT=1
find /sandbox/.openclaw/{workspace,skills,hooks,identity,devices,canvas,cron,memory,telegram,wechat,credentials} ...
→ EXIT=0 each (no permission issues)
Code path: src/lib/state/sandbox.ts:1040-1066 builds the audit command by joining per-dir find invocations with &&, then spawnSync('ssh', ..., auditCmd). When any one find exits non-zero, the whole chain returns non-zero, and the function treats every existing state_dir as failed.
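The failure mode described above can be sketched as follows. This is a hypothetical reconstruction, not the actual code in src/lib/state/sandbox.ts: the names buildAuditCommand, shellQuote and AUDIT_PREDICATE are assumptions made for illustration, and only the find predicate is copied from the verification commands above.

```typescript
// Hypothetical sketch; the real builder in src/lib/state/sandbox.ts may differ.
// The predicate matches symlinks, multiply-linked files, and special files.
const AUDIT_PREDICATE =
  "\\( -type l -o \\( -type f -a -links +1 \\) -o \\( ! -type f -a ! -type d \\) \\)";

// Single-quote a path for a POSIX shell.
function shellQuote(s: string): string {
  return "'" + s.replace(/'/g, "'\\''") + "'";
}

// Build one find per state_dir and join them with the given shell operator.
function buildAuditCommand(
  baseDir: string,
  dirs: string[],
  joiner: "&&" | ";",
): string {
  return dirs
    .map(
      (d) =>
        `find ${shellQuote(`${baseDir}/${d}`)} ${AUDIT_PREDICATE} ` +
        `-printf '%y\\t%p\\t%l\\n' 2>/dev/null`,
    )
    .join(` ${joiner} `);
}

// With "&&", the first find that exits non-zero stops the chain, so later
// state_dirs are never audited and the chain as a whole exits non-zero.
console.log(buildAuditCommand("/sandbox/.openclaw", ["agents", "workspace"], "&&"));
```

Note that joining with ; only guarantees that every find runs. The chain's exit status is then that of the last find, so the caller would still have to stop treating a non-zero exit as fatal.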
Logs
[sandbox-state 2026-05-22T07:09:31.540Z] backupSandboxState: agent=openclaw, stateDirs=[agents,extensions,workspace,skills,hooks,identity,devices,canvas,cron,memory,telegram,wechat,whatsapp,credentials], stateFiles=[]
[sandbox-state 2026-05-22T07:09:31.540Z] policyPresets from registry: [npm,pypi,huggingface,brew,local-inference,slack]
[sandbox-state 2026-05-22T07:09:31.541Z] Getting SSH config via openshell sandbox ssh-config
[sandbox-state 2026-05-22T07:09:31.548Z] SSH config obtained (334 bytes)
[sandbox-state 2026-05-22T07:09:31.548Z] Checking existing dirs via SSH: { [ -d '/sandbox/.openclaw/agents' ] && printf '%s\n' 'agents'; [ -d '/sandbox/.openclaw/extensions'...
[sandbox-state 2026-05-22T07:09:31.798Z] Dir check: exit=0, stdout=agents\ncredentials, stderr=
[sandbox-state 2026-05-22T07:09:31.798Z] Existing dirs in sandbox: [agents,extensions,workspace,skills,hooks,identity,devices,canvas,cron,memory,telegram,wechat,credentials] (13/14)
[sandbox-state 2026-05-22T07:09:31.798Z] Pre-backup audit: checking for symlinks, hard links, and special files
[sandbox-state 2026-05-22T07:09:31.998Z] FAILED: Pre-backup audit command failed — exit 1
[rebuild 2026-05-22T07:09:31.998Z] Backup result: success=false, backed=; files=, failed=agents,extensions,workspace,skills,hooks,identity,devices,canvas,cron,memory,telegram,wechat,credentials; failedFiles=
Failed to back up sandbox state.
Failed: agents, extensions, workspace, skills, hooks, identity, devices, canvas, cron, memory, telegram, wechat, credentials
Aborting rebuild to prevent data loss.
Suggested Fix
src/lib/state/sandbox.ts:1040-1066 — change the per-state-dir audit find chain so a single permission-denied subdir does not abort the audit. Three viable options:
- (a) Replace && joiner with ; so all find invocations run regardless of individual exit codes, and let the existing stdout-parsing logic categorize whatever entries are emitted. The stderr/exit signal becomes informational, not fatal.
- (b) Per-dir invoke: spawn find separately for each state_dir, capture stdout independently, and only fail the audit if a non-find-related error (e.g. ssh connection lost) occurs. Permission-denied on a subdir becomes a recorded warning (and the missing entries are simply not audited, which is the existing whitelist behaviour for known-safe paths anyway).
- (c) Auto-whitelist base-image-owned subdirs: pre-baked subdirs like agents/<agent_id>, extensions/<plugin> are root-owned by design. Detect them via a baked manifest and -prune them out of find before running.
Option (a) is the smallest delta and matches the existing tolerance the audit already shows (it parses stdout for violations regardless of stderr). Option (b) is more correct but a bigger change.
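A minimal sketch of how option (b) might classify per-dir results, assuming each find is run separately and its stderr is captured rather than sent to /dev/null. The types DirAudit and AuditVerdict and the function classifyAudit are hypothetical names, not existing NemoClaw code. The sketch also assumes English find messages, and relies on ssh returning 255 for its own errors (spawnSync reports a null status when the child is killed by a signal).

```typescript
// Hypothetical types; one record per state_dir, filled from a per-dir spawnSync.
type DirAudit = { dir: string; status: number | null; stdout: string; stderr: string };
type AuditVerdict = { violations: string[]; warnings: string[]; fatal: string[] };

function classifyAudit(results: DirAudit[]): AuditVerdict {
  const verdict: AuditVerdict = { violations: [], warnings: [], fatal: [] };
  for (const r of results) {
    // Every non-empty stdout line is an entry matched by the audit predicate.
    for (const line of r.stdout.split("\n")) {
      if (line.trim() !== "") verdict.violations.push(line);
    }
    if (r.status === 0) continue;

    const errLines = r.stderr.split("\n").filter((l) => l.trim() !== "");
    const onlyPermissionDenied =
      errLines.length > 0 &&
      errLines.every((l) => l.startsWith("find: ") && l.endsWith(": Permission denied"));

    if (r.status === 1 && onlyPermissionDenied) {
      // Unreadable baked subdir: record a warning, keep auditing other dirs.
      verdict.warnings.push(...errLines);
    } else {
      // Anything else (ssh failure, signal, unexpected stderr) stays fatal.
      verdict.fatal.push(`${r.dir}: exit ${r.status}`);
    }
  }
  return verdict;
}
```

Under this scheme the rebuild would abort only when fatal is non-empty or violations contains entries the existing logic rejects. Permission-denied subdirs end up in warnings instead of failing every state_dir.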
Either way, the rebuild must not refuse to proceed just because find cannot read a baked subdir. Today the data-loss guard fires correctly (abort, do not destroy) but the precondition that triggered it is a false positive in every onboard-then-rebuild sequence on v0.0.49.
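For option (c), keeping find out of baked subdirs could look roughly like the sketch below. The function buildPrunedFind and its parameters are hypothetical, and the list of baked subdirs is assumed to come from a manifest that does not exist today. It uses the standard find idiom of pruning matched paths before evaluating the audit expression.

```typescript
// Hypothetical helper: build a find command that never descends into the
// given baked (root-owned) subdirs of a state_dir.
function buildPrunedFind(
  root: string,
  bakedSubdirs: string[],
  auditExpr: string, // predicate plus action, e.g. "-type l -print"
): string {
  const q = (s: string) => "'" + s.replace(/'/g, "'\\''") + "'";
  if (bakedSubdirs.length === 0) {
    return `find ${q(root)} ${auditExpr}`;
  }
  // \( -path A -o -path B \) -prune -o <audit expression>
  const prune = bakedSubdirs
    .map((sub) => `-path ${q(`${root}/${sub}`)}`)
    .join(" -o ");
  return `find ${q(root)} \\( ${prune} \\) -prune -o ${auditExpr}`;
}

// Example with the subdirs named in this report.
console.log(
  buildPrunedFind(
    "/sandbox/.openclaw/extensions",
    ["nemoclaw", "openclaw-weixin"],
    "-type l -print",
  ),
);
```

Because pruned directories are never opened, find should no longer emit Permission denied for them. The trade-off is that their contents are not audited at all, so the manifest would need to list only paths that are known to be safe.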
This is NOT introduced by PR #3925 (the rebuild pre-backup audit code path is unchanged in that PR). PR #3925's separate "ARM64 post-rebuild gateway-unhealthy" regression is tracked under NVBug 6198894.
NVB#6204923