fix(sandbox): make .openclaw group-writable so OpenClaw config mutations land cleanly (#2681)#2851
Conversation
…2681) Replaces the EACCES-swallow approach proposed in #2693 with proper Unix group permissions. Control-UI toggles (Enable Dreaming, account toggles) now persist in default mode instead of throwing GatewayRequestError or becoming silent no-ops. Implementation of the design for #2681: 1. Keep `gateway` as a separate UID from `sandbox` (security separation preserved) but make it a member of the `sandbox` group via `usermod -aG sandbox gateway` in Dockerfile.base. Dockerfile carries an idempotent stale-base fallback so derived images work even on un-rebuilt sandbox-base:latest tags. 2. /sandbox/.openclaw is now group-writable + setgid in default mode. chmod -R g+w plus find -type d -exec chmod g+s on every dir, applied in both the base image and the derived production image. setgid means files created by gateway-UID inherit group=sandbox so the agent keeps read access. 3. nemoclaw-start.sh: normalize_mutable_config_perms() runs before gateway launch in root mode. Skips when shields are UP (config dir owned by root) so the lock contract is not weakened. Idempotent. 4. shields.ts: shields-down restore is 660/2770 instead of 600/700. applyStateDirLockMode parameterizes the chmod strip — `go-w` when locking, `o-w` when unlocking — so shields-down preserves group-write. 5. Patch 4 (replaceConfigFile EACCES) is intentionally retained as a defensive fallback for older base images and edge cases where the group-write contract is incomplete. No new mutateConfigFile EACCES patch is introduced; the Patch 4b approach in #2693 is rejected. 10 regression guards in test/repro-2681-group-writable.test.ts lock the structural invariants of the new contract. Closes #2681 Supersedes #2693
📝 WalkthroughWalkthroughBuild and runtime changes make Changes
Sequence DiagramsequenceDiagram
participant BuildSystem as Build System
participant DockerImage as Docker Image
participant Startup as Startup Script
participant Filesystem as Filesystem
participant ConfigLogic as Config Logic
BuildSystem->>DockerImage: add `gateway` to `sandbox` (idempotent)
BuildSystem->>Filesystem: chmod -R g+w /sandbox/.openclaw
BuildSystem->>Filesystem: find /sandbox/.openclaw -type d -exec chmod g+s
BuildSystem->>Filesystem: create `.config-hash` with mode 664
Note over Startup,Filesystem: Container starts
Startup->>Startup: check effective UID == 0?
Startup->>Filesystem: stat owner of /sandbox/.openclaw
alt not root-owned
Startup->>Filesystem: normalize_mutable_config_perms() -> chmod -R g+w + setgid on dirs
else root-owned
Startup->>Filesystem: skip normalization (shields-up)
end
ConfigLogic->>Filesystem: applyStateDirLockMode -> chmod dirMode (755 or 2775) and strip writes (go-w / o-w)
ConfigLogic->>Filesystem: unlockAgentConfig -> chmod files 660 and dirs 2770 (setgid)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/shields.ts (1)
259-307:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUnlocking here still leaves existing state trees non-group-writable.
After a shields-up/down round-trip, descendants that were stripped to
755/644by the lock path stay that way: the unlock path only sets the top-level directory to2775and then runschmod -R o-w, which never addsg+wor setgid back on nested dirs/files. That leaves thegatewayuser unable to mutate existing entries underagents/,workspace-*, etc. until a full container restart re-normalizes them.Suggested direction
- kubectlExec(sandboxName, ["chmod", dirMode, dirPath]); - kubectlExec(sandboxName, ["chmod", "-R", writeStrip, dirPath]); + if (isLocking) { + kubectlExec(sandboxName, ["chmod", dirMode, dirPath]); + kubectlExec(sandboxName, ["chmod", "-R", writeStrip, dirPath]); + } else { + kubectlExec(sandboxName, [ + "sh", + "-c", + 'find "$1" -type d -exec chmod 2775 {} + && find "$1" -type f -exec chmod g+w {} + && chmod -R o-w "$1"', + "sh", + dirPath, + ]); + }Mirror the same recursive restore in the
workspace-*shell block too.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/shields.ts` around lines 259 - 307, The workspace-* shell block in applyStateDirLockMode currently only sets the top-level dir mode and strips world-write recursively, leaving nested files/dirs still at 755/644 after an unlock; update that kubectlExec shell script (the workspace-* block in applyStateDirLockMode) to mirror the recursive restore logic used for HIGH_RISK_STATE_DIRS: after setting the top-level dir mode, when performing an unlock (dir_mode "2775"/write_strip "o-w") also run recursive restores to re-add group write and setgid on nested items (e.g., chmod -R g+rwX "$dir" and ensure directories get g+s, e.g. via find "$dir" -type d -exec chmod g+s {} \;) so the gateway (sandbox group) can write to existing workspace/* descendants.
🧹 Nitpick comments (1)
Dockerfile (1)
483-509: Please run the container E2E matrix for this permission contract.The stale-base fallback plus the
g+w/setgid behavior depends on actual image layering and filesystem semantics, so the regex tests here won't tell us whether it survives a real build/rebuild cycle. I'd at least runcloud-e2e,sandbox-survival-e2e,hermes-e2e, andrebuild-openclaw-e2ebefore merge.As per coding guidelines,
Dockerfile: “Layer ordering, permissions, and baked config changes are only testable with a real container build.”🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Dockerfile` around lines 483 - 509, Run the full container E2E matrix to validate the permission contract introduced by the RUN block that adds gateway to sandbox (the id/usermod id -nG check) and the subsequent chown/chmod/find operations on /sandbox/.openclaw; specifically execute cloud-e2e, sandbox-survival-e2e, hermes-e2e, and rebuild-openclaw-e2e against an actual build/rebuild cycle to ensure the g+w + setgid behavior and stale-base fallback survive real image layering and filesystem semantics before merging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/lib/shields.ts`:
- Around line 259-307: The workspace-* shell block in applyStateDirLockMode
currently only sets the top-level dir mode and strips world-write recursively,
leaving nested files/dirs still at 755/644 after an unlock; update that
kubectlExec shell script (the workspace-* block in applyStateDirLockMode) to
mirror the recursive restore logic used for HIGH_RISK_STATE_DIRS: after setting
the top-level dir mode, when performing an unlock (dir_mode "2775"/write_strip
"o-w") also run recursive restores to re-add group write and setgid on nested
items (e.g., chmod -R g+rwX "$dir" and ensure directories get g+s, e.g. via find
"$dir" -type d -exec chmod g+s {} \;) so the gateway (sandbox group) can write
to existing workspace/* descendants.
---
Nitpick comments:
In `@Dockerfile`:
- Around line 483-509: Run the full container E2E matrix to validate the
permission contract introduced by the RUN block that adds gateway to sandbox
(the id/usermod id -nG check) and the subsequent chown/chmod/find operations on
/sandbox/.openclaw; specifically execute cloud-e2e, sandbox-survival-e2e,
hermes-e2e, and rebuild-openclaw-e2e against an actual build/rebuild cycle to
ensure the g+w + setgid behavior and stale-base fallback survive real image
layering and filesystem semantics before merging.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b4ef1efe-3da1-4c54-bfdc-b03ee8f36a35
📒 Files selected for processing (5)
DockerfileDockerfile.basescripts/nemoclaw-start.shsrc/lib/shields.tstest/repro-2681-group-writable.test.ts
Two scope-tightening fixes on top of the group-writable redesign: - chmod 664 .config-hash (was 644). The recursive `chmod -R g+w /sandbox/.openclaw` runs before the .config-hash sha256 is generated, so the file gets its own chmod 644 right after creation — overwriting the group-write that the recursive pass would have applied. Aaron's spec item 3 explicitly calls out "group-writable config/hash files" (plural); 644 violates that for the hash file specifically. - Drop the redundant `-G sandbox` flag from the sandbox useradd. `-g sandbox` already sets sandbox as primary group; the supplementary `-G sandbox` is a no-op that crept in during the initial edit. Plus a new test in test/repro-2681-group-writable.test.ts that asserts .config-hash is chmod 664 (not 644), so the ordering bug can't silently regress. 11/11 pass.
Aaron's spec for #2681 only addressed OpenClaw — gateway UID writing to /sandbox/.openclaw config files. Unifying hermes (640/750) into the same 660/2770 contract was scope I added beyond the spec. Restore the agentName ternary so hermes keeps its historical 640/750 mode at shields-down (it has no separate gateway UID, so the shared- group contract doesn't apply). OpenClaw still uses 660/2770 as the new mutable-default contract. Test updated to assert the ternary shape rather than the unified constants.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/shields.ts (1)
276-299:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUnlocking never restores
g+won previously locked descendants.After a
shields upcycle, the earlierchmod -R go-whas already stripped group write from nested files/directories. The new unlock path only switches tochmod -R o-w, so those descendants stay non-group-writable and thegatewayUID still can't mutate nested workspace/state entries aftershields down. Theworkspace-*shell loop below has the same problem.Possible fix
- kubectlExec(sandboxName, ["chmod", dirMode, dirPath]); - kubectlExec(sandboxName, ["chmod", "-R", writeStrip, dirPath]); + kubectlExec(sandboxName, ["chmod", dirMode, dirPath]); + if (isLocking) { + kubectlExec(sandboxName, ["chmod", "-R", "go-w", dirPath]); + } else { + kubectlExec(sandboxName, ["chmod", "-R", "g+w", dirPath]); + kubectlExec(sandboxName, ["chmod", "-R", "o-w", dirPath]); + }You'd want the same lock/unlock split in the
workspace-*shell block as well.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/shields.ts` around lines 276 - 299, The workspace-* shell loop invoked via kubectlExec inside the for dir in "$config_dir"/workspace-* block only runs chmod -R "$write_strip" (which on unlock uses o-w) and thus never restores group write that was removed earlier by chmod -R go-w; change the shell snippet in that loop so the lock/unlock split mirrors the earlier logic: when locking apply chmod -R go-w, and when unlocking run two ops (chmod -R o-w and then chmod -R g+w) or otherwise explicitly re-add group write (chmod -R g+w) after removing others, so that descendants regain group-write permission and the gateway UID can mutate nested workspace/state entries. Ensure edits are made inside the same kubectlExec call and the for dir in "$config_dir"/workspace-* loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/lib/shields.ts`:
- Around line 276-299: The workspace-* shell loop invoked via kubectlExec inside
the for dir in "$config_dir"/workspace-* block only runs chmod -R "$write_strip"
(which on unlock uses o-w) and thus never restores group write that was removed
earlier by chmod -R go-w; change the shell snippet in that loop so the
lock/unlock split mirrors the earlier logic: when locking apply chmod -R go-w,
and when unlocking run two ops (chmod -R o-w and then chmod -R g+w) or otherwise
explicitly re-add group write (chmod -R g+w) after removing others, so that
descendants regain group-write permission and the gateway UID can mutate nested
workspace/state entries. Ensure edits are made inside the same kubectlExec call
and the for dir in "$config_dir"/workspace-* loop.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 47934bc4-85d6-48ae-af9b-9506b0176d6f
📒 Files selected for processing (2)
src/lib/shields.tstest/repro-2681-group-writable.test.ts
…te everywhere (#2681) CodeRabbit caught: shields-up's `chmod -R go-w` strips group-write from every descendant. The unlock path was just `chmod -R o-w`, which strips world-write but doesn't add group-write back — so after a shields up/down cycle, nested files stay at 644 and the gateway UID can't write them. Contract broken. Fix: unlock uses `g+w,o-w` (re-add group-write while stripping world- write). Lock unchanged at `go-w`. Test updated to assert the new pattern.
ericksoa
left a comment
There was a problem hiding this comment.
Reviewed the permission contract and focused validation for the #2681 fix. The approach looks sound: gateway is added to the sandbox group, the mutable OpenClaw config tree is group-writable/setgid in the image and normalized at root startup, and shields-down restores group-write after a lock/unlock cycle.
Validated locally with:
- npm ci --ignore-scripts
- npm run build:cli
- npm run typecheck:cli
- npx vitest run test/repro-2681-group-writable.test.ts test/shields.test.ts test/nemoclaw-start.test.ts test/sandbox-init.test.ts
- git diff --check f9d21af...HEAD
One live CI job is still failing (test-e2e-gateway-isolation on the non-root CAP_SETPCAP path), but the permission-specific checks in that job passed and I do not see that failure as introduced by this patch.
The runPreGatewaySetup helper stubs every helper the script invokes; #2851 added a normalize_mutable_config_perms call but no stub, so the test bash run hits an undefined function under set -euo pipefail and exits 127.
## Summary Patch 4 in the sandbox `Dockerfile` is a regex-based monkey-patch over OpenClaw's compiled JS that wraps `replaceConfigFile` in an EACCES try/catch suppression. It is source-shape-coupled and has been rewritten three times in eight days chasing OpenClaw refactors: - `fefd69fa2` (#2377) — original literal-string match against [openclaw/openclaw@v2026.4.9](https://github.com/openclaw/openclaw/releases/tag/v2026.4.9). - `5dcb0a9b9` (#2484) — updated the literal string for the restructured write block in [openclaw/openclaw@v2026.4.24](https://github.com/openclaw/openclaw/releases/tag/v2026.4.24). - `e0290e153` (#2876) — hardened to a tolerant whitespace/property-order-aware regex against [openclaw/openclaw@v2026.4.24](https://github.com/openclaw/openclaw/releases/tag/v2026.4.24). #2686 and #3497 are the latest break: in current OpenClaw, the regex no longer finds the function shape and the image build aborts at Step 17/56. Patch 4 is also unnecessary by design. The EACCES it was suppressing does not happen in the supported flows: - **Mutable-default mode** (fresh sandbox, before `nemoclaw shields up`): `openclaw.json` is `chmod 660 sandbox:sandbox` and the gateway UID is in the sandbox group, courtesy of #2851 (closing #2681; superseding the EACCES-swallow attempt in #2693). `openclaw plugins install` writes through normally; no EACCES. - **Shields-up mode** (locked): the entire config tree — file, parent directory, and the `extensions`/`plugins` state dirs from [HIGH_RISK_STATE_DIRS](src/lib/shields/index.ts#L292-L306) — is locked to `root:root` by design. Shields-up exists *to refuse* runtime config and plugin mutations. Suppressing the EACCES masked that refusal and made `openclaw plugins install` appear to succeed silently while only the auto-discovery half landed. The expected lifecycle is **configure-in-mutable-mode → `shields up` → run**. Plugin install attempted while shielded should fail cleanly; that is exactly what happens without Patch 4. This PR therefore deletes Patch 4 entirely. ## Related Issue Resolves #2686 Resolves #3497 Related context: - #2681 — original "make `.openclaw` group-writable" issue, closed by #2851. - #2851 — PR that made mutable-mode plugin install work without an EACCES swallow. - #2693 — closed earlier EACCES-swallow attempt, superseded once #2851 landed. - #2544 — NemoClaw issue tracking the broader "plugin config requires multi-minute rebuild" problem. - openclaw/openclaw#72950 — upstream defect (no env-var or writable-overlay path for `plugins.entries.<id>.config`); the real fix has to land there. ## Changes - `Dockerfile`: drop the Patch 4 block (the `COPY scripts/rcf_patch.py` line and the inline Python invocation + grep guard). - `scripts/rcf_patch.py`: deleted. - `src/lib/sandbox/build-context.ts`: stop staging `scripts/rcf_patch.py` into the sandbox build context. - `test/rcf-patch.test.ts`: deleted. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) Signed-off-by: Tinson Lai <tinsonl@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Removed OpenClaw patching logic from Docker build process and related artifact copies * Updated build context script staging behavior * **Tests** * Enhanced sandbox configuration test suite with environment variable passthrough support * Added version-based conditional patching validation and warning behavior tests <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Summary
Replaces the EACCES-swallow approach proposed in #2693 with proper Unix group permissions. Control-UI toggles in the OpenClaw dashboard (Enable Dreaming, account toggles, etc.) now persist in default mode instead of throwing
GatewayRequestError: EACCESor becoming silent no-ops.Background
OpenClaw 2026.4.24 (landed via #2484) introduced
mutateConfigFileas the new control-UI write path. Patch 4 in the Dockerfile only wraps the legacyreplaceConfigFile(plugin install path), so every config-toggle click in the sandbox dashboard now EACCES'd.#2693 proposed adding "Patch 4b" — a parallel try/catch that swallows the EACCES. That makes toggles non-functional in the sandbox: the user clicks "Enable Dreaming," gets no error, but nothing actually persists. UX improves over the crash; underlying limitation stays.
This PR implements the alternative design Aaron sketched for #2681: rather than wrapping each new write path in EACCES handlers, fix the actual permissions so the writes succeed.
Closes / Supersedes
Implementation (the 6-item spec)
gatewayas a separate UID fromsandbox; add it to thesandboxgroupDockerfile.basesandbox-base:latesttags get the group membership at derived-image build timeDockerfile/sandbox/.openclawgroup-writable + setgid on dirs;.config-hashfile mode 664Dockerfile.base,Dockerfilenormalize_mutable_config_perms()at startup, gated on Shields statescripts/nemoclaw-start.shshields downrestores 660/2770 (group-writable + setgid) for OpenClaw; Hermes left at historical 640/750 (no separate gateway UID, contract doesn't apply)src/lib/shields.tstest/repro-2681-group-writable.test.tsWhy setgid
chmod g+son directories means new files inheritgroup=sandboxregardless of which UID created them. Sogatewaywrites a file → file isgroup=sandbox→ thesandboxuser (also in the group) can still read it. Without setgid, gateway's writes would land withgroup=gatewayand the agent might lose read access on rotation.Patch 4 retention
The existing
Patch 4(replaceConfigFile EACCES swallow) is intentionally retained as a defensive fallback for:No new EACCES swallow patch is added — the
Patch 4bapproach from #2693 is explicitly rejected per spec item #6.Verification
npm run build:clicompiles the changedshields.tstest/repro-2681-group-writable.test.ts— assert structural invariants of the group-writable contract@oclif/coremodule-not-found from in-flight migration; not caused by this PR)Test plan
repro-2681-group-writable.test.tsbuild-sandbox-images(validates the group-membership + setgid Dockerfile changes)test-e2e-sandbox(validates shields lifecycle + onboard flow)test-e2e-gateway-isolation(validates the gateway-as-different-UID still runs cleanly)nemoclaw statusType of Change
AI Disclosure