
nemoclaw rebuild aborts when files in .openclaw-data are root-owned #2727

@camerono

Description

nemoclaw <name> rebuild aborts with:

Failed to back up sandbox state.
Failed: agents, extensions, workspace, skills, hooks, identity, devices, canvas, cron, memory, telegram, credentials
Aborting rebuild to prevent data loss.

The abort happens even though the --name pre-… snapshot path (via nemoclaw <name> snapshot create) succeeds and reports the same "Failed directories" as a non-fatal warning. The fatal-vs-non-fatal divergence between the two code paths is itself a bug, but both share one underlying cause: the backup tar, run over SSH as the sandbox user, fails with Cannot open: Permission denied on individual files inside ${writableDir}/<state-dir> that are owned by root with mode 0600.

Verbatim verbose log (NEMOCLAW_REBUILD_VERBOSE=1):

[sandbox-state ...] Downloading via SSH+tar: tar -cf - -C /sandbox/.openclaw-data agents extensions workspace skills hooks identity devices canvas cron memory telegram credentials
[sandbox-state ...] SSH+tar download: exit=2, stdout=4546560 bytes,
  stderr=tar: agents/main/sessions/sessions.json: Cannot open: Permission denied
         tar: agents/main/agent/models.json: Cannot open: Permission denied
         tar: Exiting with failure status due to previous errors
[rebuild ...] Backup result: success=false, backed=, failed=agents,extensions,workspace,...

tar exited with status 2: it encountered errors but still wrote about 4.5 MB of data to stdout. The code at src/lib/sandbox-state.ts:702 treats any non-zero tar exit as a complete backup failure and marks every existing state dir as failed, not just the dirs that contain the offending files. The rebuild guard at src/nemoclaw.ts:2810 then aborts.
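
One way to avoid marking every state dir as failed is to read tar's stderr and attribute each error to the top-level dir it belongs to. The following is a minimal sketch under stated assumptions, not the code at src/lib/sandbox-state.ts: `classifyTarResult`, `TarOutcome`, and the `complete`/`partial`/`failed` split are hypothetical names, and the parser only recognizes the GNU tar "Cannot open" message shown in the log above. Anything it does not recognize falls back to the current behavior (all dirs failed).

```typescript
// Hypothetical helper for illustration; not NemoClaw's actual implementation.
interface TarOutcome {
  complete: string[];   // state dirs archived with no reported errors
  partial: string[];    // state dirs archived, minus some unreadable files
  failed: string[];     // state dirs we cannot vouch for at all
  unreadable: string[]; // individual paths tar could not open
}

// GNU tar's per-file message, e.g.
// "tar: agents/main/agent/models.json: Cannot open: Permission denied"
const CANNOT_OPEN = /^tar: (.+): Cannot open: .+$/;
const TRAILER = "tar: Exiting with failure status due to previous errors";

function classifyTarResult(
  exitCode: number,
  stderr: string,
  stateDirs: string[],
): TarOutcome {
  const unreadable: string[] = [];
  let unrecognized = false;

  for (const raw of stderr.split("\n")) {
    const line = raw.trim();
    if (line === "" || line === TRAILER) continue;
    const match = CANNOT_OPEN.exec(line);
    if (match) unreadable.push(match[1]);
    else unrecognized = true; // some other error we do not understand
  }

  if (exitCode === 0) {
    return { complete: [...stateDirs], partial: [], failed: [], unreadable };
  }
  // Non-zero exit without a fully understood stderr: stay conservative.
  if (unrecognized || unreadable.length === 0) {
    return { complete: [], partial: [], failed: [...stateDirs], unreadable };
  }

  // Attribute each unreadable path to its top-level state dir.
  const affected = new Set(unreadable.map((p) => p.split("/")[0]));
  return {
    complete: stateDirs.filter((d) => !affected.has(d)),
    partial: stateDirs.filter((d) => affected.has(d)),
    failed: [],
    unreadable,
  };
}
```

Fed the stderr from the log above, this would report agents as partial and the remaining dirs as complete. Whether a partial archive is acceptable for a rebuild is a separate policy decision.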

How the files came to be root-owned in our case: yesterday's diagnostic session used kubectl exec rtfm (which defaults to root in the agent container) to invoke openclaw memory index and openclaw agent --message, and to write a few files. Everything those commands created, running as root inside the sandbox pod, landed at root:root 0600. The sandbox user could still read its own files afterwards, but not those.
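
The permission outcome described above follows from the classic Unix owner/group/other check. The sketch below is illustrative only: `canRead` and `FileOwnerMode` are hypothetical names, uid/gid 1000 for the sandbox user is an assumption, and ACLs and capabilities other than root's override are ignored. The `FileOwnerMode` shape matches the `uid`, `gid`, and `mode` fields of Node's `fs.Stats`, so a stat result could be passed in directly.

```typescript
// Hypothetical helper for illustration; mirrors the basic Unix read check only.
interface FileOwnerMode {
  uid: number;  // owning user id
  gid: number;  // owning group id
  mode: number; // permission bits, e.g. 0o600
}

function canRead(file: FileOwnerMode, uid: number, gids: number[]): boolean {
  if (uid === 0) return true; // root bypasses read permission checks
  if (file.uid === uid) return (file.mode & 0o400) !== 0;         // owner bit
  if (gids.includes(file.gid)) return (file.mode & 0o040) !== 0;  // group bit
  return (file.mode & 0o004) !== 0;                               // other bit
}

// Assumed ids: sandbox user is uid 1000 / gid 1000 (not confirmed by the issue).
const sandboxUid = 1000;
const sandboxGids = [1000];

const rootOwned0600: FileOwnerMode = { uid: 0, gid: 0, mode: 0o600 };
const rootOwned0644: FileOwnerMode = { uid: 0, gid: 0, mode: 0o644 };

console.log(canRead(rootOwned0600, sandboxUid, sandboxGids)); // false
console.log(canRead(rootOwned0644, sandboxUid, sandboxGids)); // true
```

A root:root 0600 file is therefore invisible to the tar process running as the sandbox user, while a root:root 0644 file would still be archived.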

Reproduction Steps

  1. Onboard a sandbox: nemoclaw onboard with any provider.
  2. From the host, exec into the running sandbox pod as root and have a NemoClaw-aware command write into the writable dir:
    docker exec openshell-cluster-nemoclaw kubectl -n openshell exec <sandbox> -- \
      sh -c 'echo "{}" > /sandbox/.openclaw-data/agents/main/sessions/sessions.json'
    
    The file ends up root:root 0644 (or 0600, depending on umask). Only a file the sandbox user cannot read (e.g. 0600) triggers the tar error. For a more realistic repro, run any openclaw subcommand via kubectl-exec, e.g. openclaw memory index, which produces multiple root-owned files in agents/main/, memory/, and workspace/.
  3. Run nemoclaw <name> rebuild --yes. Expected: rebuild proceeds, partial backup succeeds with a warning. Actual: rebuild aborts before the destroy step with the message above.
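
The expected behavior in step 3 could be expressed as a guard that aborts only when nothing was backed up. This is a sketch of one possible policy, not the guard at src/nemoclaw.ts:2810: `decideRebuild`, `BackupSummary`, and `RebuildDecision` are hypothetical names, and the `backed`/`failed` fields are modeled on the "Backup result" log line.

```typescript
// Hypothetical types and policy for illustration; not NemoClaw's actual guard.
interface BackupSummary {
  backed: string[]; // state dirs that were backed up
  failed: string[]; // state dirs that were not
}

type RebuildDecision =
  | { action: "proceed" }
  | { action: "warn"; message: string }
  | { action: "abort"; message: string };

function decideRebuild(summary: BackupSummary): RebuildDecision {
  if (summary.failed.length === 0) return { action: "proceed" };

  const failedList = summary.failed.join(", ");
  if (summary.backed.length === 0) {
    // Nothing was saved: destroying the sandbox now would lose everything.
    return {
      action: "abort",
      message: `Failed: ${failedList}. Aborting rebuild to prevent data loss.`,
    };
  }
  // Some state was saved: surface the gap but let the caller continue.
  return {
    action: "warn",
    message: `Failed directories: ${failedList}. Continuing with partial backup.`,
  };
}
```

Sharing one decision function between the rebuild and snapshot paths would also remove the fatal-vs-non-fatal divergence, since both would treat the same backup result the same way.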

Environment

  • OS: Ubuntu 24.04 (Linux 6.17.0-1014-nvidia aarch64)
  • Hardware: NVIDIA GB10 (DGX Spark)
  • Docker: Engine 27.x
  • Node.js: v22.22.2
  • NemoClaw: v0.0.29
  • OpenShell (cluster): 0.0.36
  • Sandbox image: openshell/sandbox-from:1777485515 (built locally)
  • Tar inside sandbox: GNU tar 1.35

Debug Output

Output of `nemoclaw debug --quick --sandbox <sandbox>` captured 2026-04-29 18:47 UTC. Full 836-line capture archived at [debug-output-2026-04-29-1847.txt](https://github.com/user-attachments/files/27226637/debug-output-2026-04-29-1847.txt). Focused excerpt (the post-recovery healthy state of the sandbox; the *failed* rebuild's tar errors are reproduced under "Logs" below):


$ nemoclaw --version
nemoclaw v0.0.29

═══ System ═══

Linux <host> 6.17.0-1014-nvidia #14-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 17 19:01:40 UTC 2026 aarch64 aarch64 aarch64 GNU/Linux

═══ OpenShell ═══

Server:  https://127.0.0.1:8080  Status: Connected  Version: 0.0.36
Sandbox: <sandbox>  Namespace: openshell  Phase: Ready  Revision: 7

═══ Sandbox Filesystem Policy (excerpt) ═══

filesystem_policy:
  read_only:  [/usr, /lib, /proc, /dev/urandom, /app, /etc, /var/log, /sandbox, /sandbox/.openclaw]
  read_write: [/tmp, /dev/null, /sandbox/.openclaw-data, /sandbox/.nemoclaw]
  process:
    run_as_user: sandbox
    run_as_group: sandbox

═══ Onboard Session ═══

  "provider": "ollama-local",
  "model": "hermes3:8b",
  "endpointUrl": "http://host.openshell.internal:11435/v1",
  "policyPresets": ["npm","pypi","huggingface","brew","brave","local-inference"],
  "failure": null


The failure (this bug) occurs *during* `nemoclaw <sandbox> rebuild`, not in the quiescent state. `nemoclaw debug --quick` shows a healthy sandbox because, on the separate occasion when the bug was first triggered, the rebuild was interrupted before the destroy step. The verbatim verbose-mode trace under "Logs" reproduces the actual failure.

Logs

Verbose mode (`NEMOCLAW_REBUILD_VERBOSE=1`) trace excerpt — see Description.

Checklist

  • I confirmed this bug is reproducible
  • I searched existing issues and this is not a duplicate

Labels

  • area: cli (Command line interface, flags, terminal UX, or output)
  • platform: dgx-spark (Affects DGX Spark hardware or workflows)
