fix(sandbox): restore mutable OpenClaw config perms after raw in-sandbox doctor --fix (#4538)#5017
Conversation
📝 WalkthroughWalkthroughAdds ChangesMutable config permission restoration after openclaw doctor --fix
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
…box doctor --fix (NVIDIA#4538) A raw connect-shell `openclaw doctor --fix` enforces OpenClaw's single-user 700/600 layout, tightening /sandbox/.openclaw and openclaw.json back from the NemoClaw mutable contract (2770 setgid + group-writable dir, 660 config). That blocks the gateway UID — a member of the sandbox group — from persisting config. PR NVIDIA#4610 only repaired this via host-side NemoClaw wrapper commands (`nemoclaw doctor --fix`, rebuild, startup); none run after the raw in-sandbox path the QA reporter used, so the gateway stayed broken until the next restart. Fix the missed path at the always-on in-sandbox guard: the emitted openclaw() shell function (sourced by every interactive/login sandbox shell) now re-asserts the contract after every routed openclaw command, regardless of exit code. - add _nemoclaw_restore_mutable_config_perms to the guard env; skips when shields are up (config dir root-owned), no-op via fast path when the contract already holds, and strips group-write from the read-only recovery baseline the recursive chmod could otherwise loosen - drop errexit around the guarded call (mirroring the devices-approve branch) so a nonzero `doctor --fix` exit cannot abort the guard before the restore runs — the nonzero-exit case is the exact reported failure - regression tests plus a gated real-image reporter-workflow E2E that drives raw `openclaw doctor --fix` and asserts 2770/660 is restored Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
bbe4f63 to
baf8fe7
Compare
## Summary - Add v0.0.62 release notes from Discussion #5100 and link release highlights to the relevant docs pages. - Document the release's GPU sandbox recreation, sandbox-side local inference verification, and Hermes dashboard port guard in the command and inference references. - Refresh generated NemoClaw user skills for the release-prep docs set. ## Source Summary - #4956 -> `docs/reference/commands.mdx`: Document CDI-first Docker GPU recreation behavior for Linux Docker-driver sandboxes. - #5024 -> `docs/inference/use-local-inference.mdx`: Document sandbox-runtime verification of the `inference.local` local inference route. - #5018 -> `docs/reference/commands.mdx`: Document Jetson/Tegra device-node group propagation for sandbox CUDA initialization. - #5012, #4763, #4706, #5030, #5015 -> `docs/about/release-notes.mdx`: Summarize onboarding and recovery reliability fixes, including the reserved Hermes API port guard. - #5017 and #5043 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Summarize mutable OpenClaw config recovery and host-side `agents list` coverage. - #5010 and #5016 -> `docs/about/release-notes.mdx`: Summarize Hermes upstream metadata visibility and WhatsApp QR rendering reliability. - #5045 and prior source docs in the v0.0.62 range -> `.agents/skills/`: Refresh generated user-skill references from the current docs source. ## Skipped - #5019 -> skipped for new prose because it touched `openclaw-sandbox-permissive.yaml`, which matches `docs/.docs-skip`. Existing source docs remain the source for generated skill synchronization. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `npm run docs` (passes; Fern reports 0 errors and 1 hidden warning) - Pre-commit hooks passed during commit, including docs-to-skills verification, markdown lint, gitleaks, and skills YAML tests. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added `nemoclaw <name> agents list` command. * v0.0.62 release notes added summarizing onboarding and recovery improvements. * **Bug Fixes** * Improved GPU sandbox onboarding reliability (NVIDIA CDI path, Jetson/Tegra device handling). * Better local inference verification and recovery for Linux Docker-driver GPU sandboxes. * Quieter/earlier handling of onboarding drift and port collisions. * **Documentation** * Expanded GPU passthrough, inference verification, writable paths (`/dev/pts`), port 8642 restriction, and command examples. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Prekshi Vyas <34834085+prekshivyas@users.noreply.github.com>
…VIDIA#4538, NVIDIA#4859) OpenClaw's `doctor` state-integrity check flags the NemoClaw mutable-config contract — /sandbox/.openclaw (2770 setgid + group-writable) and openclaw.json (660) — as "permissions are too open" and, with --fix, tightens them to single-user 700/600. That durably breaks the group-shared contract the gateway UID (a member of the sandbox group) needs to persist control-UI config writes, and prints scary "too open"/"Tightened permissions" output that misrepresents the intended contract. PR NVIDIA#5017's always-on openclaw() shell guard only restores 2770/660 *after* a raw in-sandbox `openclaw doctor --fix` exits — so the warning/tighten still runs, and any path that bypasses the connect-shell guard (stale image, raw `command openclaw`, the macOS reporter flow in NVIDIA#4859) leaves the tree durably locked at 700/600. Fix the root cause at the OpenClaw source via the established Dockerfile-patch mechanism: a new patch-openclaw-doctor-mutable-perms.mts narrows doctor's over-open test from group+other bits (`stat.mode & 63`) to OTHER (world) bits only (`stat.mode & 7`) when running inside an OpenShell sandbox. Group bits — the gateway's intended shared access — are then tolerated, so the 2770/660 contract is neither warned about nor tightened, while genuinely world-accessible modes (e.g. 2777/664) are still caught and repaired. Out-of-sandbox behavior is byte-identical to upstream. The gate keys on OPENSHELL_SANDBOX being set (not `=== "1"`): OpenShell injects the sandbox name there, verified on 0.0.44. Because the fix lives in the baked image, it applies regardless of host OS and whether the connect-shell guard is loaded, resolving both the All-Platforms NVIDIA#4538 and the macOS NVIDIA#4859 reports. The NVIDIA#5017 guard is kept as defense-in-depth for sandboxes built before this patch. E2E (real onboarded sandbox, patched image, OpenClaw 2026.5.27): - connect-shell `openclaw doctor --fix`: exit 0, 0 scary messages, 0 bashrc EACCES, perms stay 2770/660. - raw `command openclaw doctor --fix` (guard bypassed, NVIDIA#4859 path): exit 0, 0 scary messages, perms stay 2770/660. - unpatched baseline: 4 scary messages, perms regress to 700/600. The patch classifies the compiled doctor state-integrity module by content signature, requires exactly the two reviewed permission gates, fails closed on shape drift, and is idempotent and a no-op when absent. Signed-off-by: Yimo Jiang <yimoj@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
A raw connect-shell
openclaw doctor --fix(run directly inside a mutable sandbox, outside any NemoClaw wrapper) enforces OpenClaw's single-user 700/600 layout, tightening/sandbox/.openclawandopenclaw.jsonback from the NemoClaw mutable contract (2770 setgid + group-writable dir, 660 config). That blocks the gateway UID — a member of thesandboxgroup — from persisting config. This fixes the missed path at the always-on in-sandboxopenclaw()guard so the contract is restored after the raw command, even whendoctorexits nonzero.Related Issue
Fixes #4538
Changes
openclaw()guard function (written into/tmp/nemoclaw-proxy-env.sh, sourced by every interactive/login sandbox shell) now calls a new_nemoclaw_restore_mutable_config_permshelper after every routed openclaw command, re-asserting2770dir /660config (+.config-hash). PR fix(sandbox): preserve mutable OpenClaw config perms after doctor --fix (#4538) #4610 only repaired this via host-side wrapper commands (nemoclaw doctor --fix, rebuild, startup), none of which run after the raw in-sandbox path the reporter used.chmodcould otherwise loosen in rootless mode.devices approvebranch) so a nonzerodoctor --fixexit — the exact reported failure (EACCES on a root-locked shell init file) — cannot abort the guard before the restore runs.test/repro-4538-raw-doctor-perms.test.ts) that sources the actual emitted guard, runs rawopenclaw doctor --fixin a NemoClaw sandbox image, and asserts the contract is restored. The E2E is gated behindNEMOCLAW_RUN_DOCTOR_PERMS_DOCKER_E2E=1so it runs where docker + the image are available.Type of Change
Verification
npm testpasses locally (the only failures were 14 pre-existing, load-induced 5s-timeout flakes in unrelated CLI--helpoclif-dispatch tests, e.g.test/cli/snapshot-shields.test.ts; they do not touch this change). All CI checks are green: ShellCheck, static-checks, 5 cli-test-shards, plugin-tests, build-typecheck, CodeQL, macos/wsl-e2e, dco-check, codebase-growth-guardrails.Reporter-workflow E2E evidence
The exact reported workflow — connect to a mutable sandbox, source the always-on guard (as
/etc/bash.bashrcdoes on connect), run the raw in-sandboxopenclaw doctor --fix— was reproduced end-to-end against a real NemoClaw sandbox image (nemoclaw-production:latest), as thesandboxuser.test/repro-4538-raw-doctor-perms.test.tsNEMOCLAW_RUN_DOCTOR_PERMS_DOCKER_E2E=1 npx vitest run test/repro-4538-raw-doctor-perms.test.tsThe before-fix regression (dir→700, file→600) was also reproduced directly in the same image without the guard sourced, confirming the gap. Additional non-docker regression tests in the same file cover the restore helper, shields-up no-op, errexit survival on a nonzero
doctor --fix, and recovery-baseline protection.Signed-off-by: Yimo Jiang yimoj@nvidia.com