Skip to content

fix(sandbox): restore mutable OpenClaw config perms after raw in-sandbox doctor --fix (#4538)#5017

Merged
cv merged 1 commit into
NVIDIA:mainfrom
yimoj:fix/4538-raw-openclaw-doctor-perms
Jun 9, 2026
Merged

fix(sandbox): restore mutable OpenClaw config perms after raw in-sandbox doctor --fix (#4538)#5017
cv merged 1 commit into
NVIDIA:mainfrom
yimoj:fix/4538-raw-openclaw-doctor-perms

Conversation

@yimoj

@yimoj yimoj commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

A raw connect-shell openclaw doctor --fix (run directly inside a mutable sandbox, outside any NemoClaw wrapper) enforces OpenClaw's single-user 700/600 layout, tightening /sandbox/.openclaw and openclaw.json back from the NemoClaw mutable contract (2770 setgid + group-writable dir, 660 config). That blocks the gateway UID — a member of the sandbox group — from persisting config. This fixes the missed path at the always-on in-sandbox openclaw() guard so the contract is restored after the raw command, even when doctor exits nonzero.

Related Issue

Fixes #4538

Changes

  • Always-on in-sandbox restore. The emitted openclaw() guard function (written into /tmp/nemoclaw-proxy-env.sh, sourced by every interactive/login sandbox shell) now calls a new _nemoclaw_restore_mutable_config_perms helper after every routed openclaw command, re-asserting 2770 dir / 660 config (+ .config-hash). PR fix(sandbox): preserve mutable OpenClaw config perms after doctor --fix (#4538) #4610 only repaired this via host-side wrapper commands (nemoclaw doctor --fix, rebuild, startup), none of which run after the raw in-sandbox path the reporter used.
  • Safe by construction. The helper skips when shields are up (config dir root-owned, so the lock is never weakened), is a no-op fast path when the contract already holds, and strips group-write from the read-only recovery baseline that the recursive chmod could otherwise loosen in rootless mode.
  • Survives nonzero exit. Errexit is dropped around the guarded call (mirroring the existing devices approve branch) so a nonzero doctor --fix exit — the exact reported failure (EACCES on a root-locked shell init file) — cannot abort the guard before the restore runs.
  • Tests. Regression tests for the helper (restore, shields-up no-op, errexit survival, baseline protection) and a real-image reporter-workflow E2E (test/repro-4538-raw-doctor-perms.test.ts) that sources the actual emitted guard, runs raw openclaw doctor --fix in a NemoClaw sandbox image, and asserts the contract is restored. The E2E is gated behind NEMOCLAW_RUN_DOCTOR_PERMS_DOCKER_E2E=1 so it runs where docker + the image are available.

Type of Change

  • Code change (feature, bug fix, or refactor)

Verification

  • npm test passes locally (the only failures were 14 pre-existing, load-induced 5s-timeout flakes in unrelated CLI --help oclif-dispatch tests, e.g. test/cli/snapshot-shields.test.ts; they do not touch this change). All CI checks are green: ShellCheck, static-checks, 5 cli-test-shards, plugin-tests, build-typecheck, CodeQL, macos/wsl-e2e, dco-check, codebase-growth-guardrails.
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed

Reporter-workflow E2E evidence

The exact reported workflow — connect to a mutable sandbox, source the always-on guard (as /etc/bash.bashrc does on connect), run the raw in-sandbox openclaw doctor --fix — was reproduced end-to-end against a real NemoClaw sandbox image (nemoclaw-production:latest), as the sandbox user.

  • Test: test/repro-4538-raw-doctor-perms.test.ts
  • Env / command: NEMOCLAW_RUN_DOCTOR_PERMS_DOCKER_E2E=1 npx vitest run test/repro-4538-raw-doctor-perms.test.ts
=== BEFORE (baseline mutable contract) ===
/sandbox/.openclaw           2770 sandbox:sandbox
/sandbox/.openclaw/openclaw.json  660 sandbox:sandbox
=== reporter cmd: raw `openclaw doctor --fix` ===
(without the fix this leaves: dir 700 sandbox:sandbox, openclaw.json 600 sandbox:sandbox)
=== AFTER (with the always-on openclaw() guard) ===
/sandbox/.openclaw           2770 sandbox:sandbox
/sandbox/.openclaw/openclaw.json  660 sandbox:sandbox
RESULT dir=2770 file=660
PASS: mutable contract (2770/660) restored after raw openclaw doctor --fix

The before-fix regression (dir→700, file→600) was also reproduced directly in the same image without the guard sourced, confirming the gap. Additional non-docker regression tests in the same file cover the restore helper, shields-up no-op, errexit survival on a nonzero doctor --fix, and recovery-baseline protection.


Signed-off-by: Yimo Jiang yimoj@nvidia.com

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds _nemoclaw_restore_mutable_config_perms to the runtime guard and updates the guard's openclaw() wrapper to invoke the restore helper after every in-sandbox openclaw run, preserving exit codes and prior errexit state; includes unit tests and a gated Docker E2E validating permission restoration and baseline invariants.

Changes

Mutable config permission restoration after openclaw doctor --fix

Layer / File(s) Summary
Permission restoration helper and guard wrapper
scripts/nemoclaw-start.sh
Adds _nemoclaw_restore_mutable_config_perms (skips when config dir is root-owned; otherwise normalizes directory setgid/group rwx and file modes and strips group-write from openclaw.json.nemoclaw-baseline) and updates openclaw() to capture exit status, call the helper unconditionally, restore prior errexit state, and return the original exit code.
Unit helpers and tests
test/repro-4538-raw-doctor-perms.test.ts
Adds utilities to extract the guard block and compute POSIX modes, creates a simulated tightened 700/600 tree, and unit-tests: restoring permissions after tightening, noop when shields up (root-owned), preserving nonzero openclaw exit codes while restoring permissions, preventing an errexit gap, and ensuring the baseline remains non-group-writable.
Gated Docker-based E2E validation
test/repro-4538-raw-doctor-perms.test.ts
Optional Docker E2E (enabled via NEMOCLAW_RUN_DOCTOR_PERMS_DOCKER_E2E=1) that injects the guard into a sandbox image, runs openclaw doctor --fix, and asserts /sandbox/.openclaw and /sandbox/.openclaw/openclaw.json end at 2770 and 660.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4715: Related changes to scripts/nemoclaw-start.sh handling of .config-hash and mutable config startup behavior.

Suggested labels

fix, area: sandbox, integration: openclaw, bug-fix

Suggested reviewers

  • cv
  • cjagwani

🐰 A tiny rabbit hopped into the shell,
found doctor fixes that tightened the well.
It planted a guard and adjusted the perms,
so sandbox and gateway can both take turns.
Now configs are safe — hop, patch, and quell.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly summarizes the main change: fixing mutable OpenClaw config permission restoration after raw in-sandbox doctor --fix, directly addressing issue #4538.
Linked Issues check ✅ Passed The PR fully addresses the core coding requirements from issue #4538: it implements permission restoration after openclaw doctor --fix, preserves group-writable contract in mutable mode, and includes comprehensive unit/E2E tests.
Out of Scope Changes check ✅ Passed All changes are directly scoped to the stated objective: shell script modifications for permission restoration and corresponding test coverage; no unrelated alterations present.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

…box doctor --fix (NVIDIA#4538)

A raw connect-shell `openclaw doctor --fix` enforces OpenClaw's
single-user 700/600 layout, tightening /sandbox/.openclaw and
openclaw.json back from the NemoClaw mutable contract (2770 setgid +
group-writable dir, 660 config). That blocks the gateway UID — a member
of the sandbox group — from persisting config. PR NVIDIA#4610 only repaired
this via host-side NemoClaw wrapper commands (`nemoclaw doctor --fix`,
rebuild, startup); none run after the raw in-sandbox path the QA
reporter used, so the gateway stayed broken until the next restart.

Fix the missed path at the always-on in-sandbox guard: the emitted
openclaw() shell function (sourced by every interactive/login sandbox
shell) now re-asserts the contract after every routed openclaw command,
regardless of exit code.

- add _nemoclaw_restore_mutable_config_perms to the guard env; skips
  when shields are up (config dir root-owned), no-op via fast path when
  the contract already holds, and strips group-write from the read-only
  recovery baseline the recursive chmod could otherwise loosen
- drop errexit around the guarded call (mirroring the devices-approve
  branch) so a nonzero `doctor --fix` exit cannot abort the guard before
  the restore runs — the nonzero-exit case is the exact reported failure
- regression tests plus a gated real-image reporter-workflow E2E that
  drives raw `openclaw doctor --fix` and asserts 2770/660 is restored

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
@yimoj yimoj force-pushed the fix/4538-raw-openclaw-doctor-perms branch from bbe4f63 to baf8fe7 Compare June 9, 2026 04:28
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 9, 2026
@yimoj yimoj added the v0.0.62 Release target label Jun 9, 2026
@cv cv merged commit bc4f726 into NVIDIA:main Jun 9, 2026
36 checks passed
jyaunches pushed a commit that referenced this pull request Jun 10, 2026
## Summary
- Add v0.0.62 release notes from Discussion #5100 and link release
highlights to the relevant docs pages.
- Document the release's GPU sandbox recreation, sandbox-side local
inference verification, and Hermes dashboard port guard in the command
and inference references.
- Refresh generated NemoClaw user skills for the release-prep docs set.

## Source Summary
- #4956 -> `docs/reference/commands.mdx`: Document CDI-first Docker GPU
recreation behavior for Linux Docker-driver sandboxes.
- #5024 -> `docs/inference/use-local-inference.mdx`: Document
sandbox-runtime verification of the `inference.local` local inference
route.
- #5018 -> `docs/reference/commands.mdx`: Document Jetson/Tegra
device-node group propagation for sandbox CUDA initialization.
- #5012, #4763, #4706, #5030, #5015 -> `docs/about/release-notes.mdx`:
Summarize onboarding and recovery reliability fixes, including the
reserved Hermes API port guard.
- #5017 and #5043 -> `docs/about/release-notes.mdx`,
`docs/reference/commands.mdx`: Summarize mutable OpenClaw config
recovery and host-side `agents list` coverage.
- #5010 and #5016 -> `docs/about/release-notes.mdx`: Summarize Hermes
upstream metadata visibility and WhatsApp QR rendering reliability.
- #5045 and prior source docs in the v0.0.62 range -> `.agents/skills/`:
Refresh generated user-skill references from the current docs source.

## Skipped
- #5019 -> skipped for new prose because it touched
`openclaw-sandbox-permissive.yaml`, which matches `docs/.docs-skip`.
Existing source docs remain the source for generated skill
synchronization.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `npm run docs` (passes; Fern reports 0 errors and 1 hidden warning)
- Pre-commit hooks passed during commit, including docs-to-skills
verification, markdown lint, gitleaks, and skills YAML tests.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * Added `nemoclaw <name> agents list` command.
* v0.0.62 release notes added summarizing onboarding and recovery
improvements.

* **Bug Fixes**
* Improved GPU sandbox onboarding reliability (NVIDIA CDI path,
Jetson/Tegra device handling).
* Better local inference verification and recovery for Linux
Docker-driver GPU sandboxes.
  * Quieter/earlier handling of onboarding drift and port collisions.

* **Documentation**
* Expanded GPU passthrough, inference verification, writable paths
(`/dev/pts`), port 8642 restriction, and command examples.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Prekshi Vyas <34834085+prekshivyas@users.noreply.github.com>
yimoj added a commit to yimoj/NemoClaw that referenced this pull request Jun 11, 2026
…VIDIA#4538, NVIDIA#4859)

OpenClaw's `doctor` state-integrity check flags the NemoClaw mutable-config
contract — /sandbox/.openclaw (2770 setgid + group-writable) and openclaw.json
(660) — as "permissions are too open" and, with --fix, tightens them to
single-user 700/600. That durably breaks the group-shared contract the gateway
UID (a member of the sandbox group) needs to persist control-UI config writes,
and prints scary "too open"/"Tightened permissions" output that misrepresents
the intended contract.

PR NVIDIA#5017's always-on openclaw() shell guard only restores 2770/660 *after* a
raw in-sandbox `openclaw doctor --fix` exits — so the warning/tighten still
runs, and any path that bypasses the connect-shell guard (stale image, raw
`command openclaw`, the macOS reporter flow in NVIDIA#4859) leaves the tree durably
locked at 700/600.

Fix the root cause at the OpenClaw source via the established Dockerfile-patch
mechanism: a new patch-openclaw-doctor-mutable-perms.mts narrows doctor's
over-open test from group+other bits (`stat.mode & 63`) to OTHER (world) bits
only (`stat.mode & 7`) when running inside an OpenShell sandbox. Group bits —
the gateway's intended shared access — are then tolerated, so the 2770/660
contract is neither warned about nor tightened, while genuinely world-accessible
modes (e.g. 2777/664) are still caught and repaired. Out-of-sandbox behavior is
byte-identical to upstream. The gate keys on OPENSHELL_SANDBOX being set (not
`=== "1"`): OpenShell injects the sandbox name there, verified on 0.0.44.

Because the fix lives in the baked image, it applies regardless of host OS and
whether the connect-shell guard is loaded, resolving both the All-Platforms
NVIDIA#4538 and the macOS NVIDIA#4859 reports. The NVIDIA#5017 guard is kept as defense-in-depth
for sandboxes built before this patch.

E2E (real onboarded sandbox, patched image, OpenClaw 2026.5.27):
- connect-shell `openclaw doctor --fix`: exit 0, 0 scary messages, 0 bashrc
  EACCES, perms stay 2770/660.
- raw `command openclaw doctor --fix` (guard bypassed, NVIDIA#4859 path): exit 0,
  0 scary messages, perms stay 2770/660.
- unpatched baseline: 4 scary messages, perms regress to 700/600.

The patch classifies the compiled doctor state-integrity module by content
signature, requires exactly the two reviewed permission gates, fails closed on
shape drift, and is idempotent and a no-op when absent.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression v0.0.62 Release target

Projects

None yet

3 participants