chore: upgrade OpenClaw from 2026.4.9 to 2026.4.24 #2484
Bump the pinned OpenClaw version across all version-tracking files (Dockerfile.base, blueprint.yaml, manifest.yaml, and version tests) to the latest stable release.
📝 Walkthrough
Updates OpenClaw from version 2026.4.9 to 2026.4.24 across build configuration, manifests, and tests. Introduces a plugin runtime dependencies cache directory with proper permissions and group configuration. Implements the new config-writing API with sandbox error handling for read-only environments.
OpenClaw 2026.4.24 restructured replaceConfigFile to first attempt a single-key include-file mutation (tryWriteSingleTopLevelIncludeMutation) before falling back to writeConfigFile. Both paths can EACCES in the read-only sandbox. Update the pattern match to wrap the entire write block in the OPENSHELL_SANDBOX-gated try/catch.
Capture the SSH-shell environment (HTTP_PROXY, HTTPS_PROXY, NO_PROXY, OPENCLAW_GATEWAY_URL/TOKEN, OPENSHELL_SANDBOX, NVIDIA_API_KEY) before the agent invocation, and bump the failure-message capture from head -3 to head -20 so the full reply (including any gateway/embedded fallback errors) shows in CI logs. Diagnostic-only — no behavior change.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/test-sandbox-operations.sh`:
- Line 282: The diag_env diagnostic line leaks secrets by expanding the token
values; replace the unsafe expansions
`${OPENCLAW_GATEWAY_TOKEN:+set}${OPENCLAW_GATEWAY_TOKEN:-unset}` and the
analogous `NVIDIA_API_KEY` expansion in the sandbox_exec invocation so they
never emit the variable contents, and instead emit only the literal "set" or
"unset"; implement this by checking each variable's presence (e.g., an explicit
conditional or test for non-empty) and printing "set" when present or "unset"
when not, updating the diag_env/sandbox_exec call accordingly to reference
OPENCLAW_GATEWAY_TOKEN and NVIDIA_API_KEY securely.
OpenClaw 2026.4.24 lazy-installs bundled plugin runtime dependencies into ~/.openclaw/plugin-runtime-deps/openclaw-<version>-<hash>/ on first CLI invocation (Jiti-based loader, "lazy provider dependencies" in 2026.4.20+ release notes). NemoClaw locks /sandbox/.openclaw to 444 root:root, so every bundled plugin (nvidia, openai, anthropic, ollama, ...) failed to load with EACCES, leaving `openclaw agent` with zero providers — the exact symptom in TC-SBX-02 (no agent reply, only proxy warnings). Mirror the existing .openclaw-data symlink pattern: create the dir in the writable data tree and symlink it from the immutable config tree. Add to both Dockerfile.base (canonical setup) and Dockerfile (idempotent fixup for stale GHCR bases).
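The symlink pattern described above can be sketched in Python as follows. This is only an illustration of the idea (real directory in the writable tree, symlink in the locked tree), not the Dockerfile commands the PR uses; the helper name `link_writable_dir` and the temporary directory layout are assumptions made for the example.

```python
import os
import tempfile


def link_writable_dir(config_root, data_root, name):
    """Create data_root/name and expose it as config_root/name via a symlink."""
    target = os.path.join(data_root, name)
    link = os.path.join(config_root, name)
    os.makedirs(target, exist_ok=True)
    if os.path.islink(link):
        # Replace an existing link so a rerun (the "idempotent fixup" case) is safe.
        os.unlink(link)
    os.symlink(target, link)
    return link


# Demo in a throwaway directory standing in for /sandbox/.openclaw and
# /sandbox/.openclaw-data (hypothetical stand-ins, not the real paths).
with tempfile.TemporaryDirectory() as root:
    cfg = os.path.join(root, ".openclaw")
    data = os.path.join(root, ".openclaw-data")
    os.makedirs(cfg)
    os.makedirs(data)
    link = link_writable_dir(cfg, data, "plugin-runtime-deps")
    print(link, "->", os.readlink(link))
```

Writes that go through the link land in the data tree, so the config tree itself can stay read-only.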
…load OpenClaw 2026.4.24+ lazy-installs and Jiti-compiles ~50 bundled plugin runtime deps on the first agent invocation in a fresh sandbox. Even with deps pre-cached at build time, the plugin registry bootstrap + provider warmup + LLM round-trip on the first call can exceed the existing 60s SSH timeout (was completing in ~20s on 2026.4.9). Make sandbox_exec_for accept an optional timeout argument (default 60, preserves all other call sites) and have TC-SBX-02 pass 240s. The openclaw agent CLI's own --timeout default is 600s so 240s leaves plenty of headroom for the inference call itself.
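The "optional timeout with a backward-compatible default" idea can be sketched in Python. The real `sandbox_exec_for` is a shell helper whose signature is not shown here, so `exec_with_timeout` below is a hypothetical analogue; returning 124 on timeout is a convention borrowed from `timeout(1)`, chosen for the example.

```python
import subprocess
import sys


def exec_with_timeout(argv, timeout=60):
    """Run argv and return (exit_code, stdout); existing callers keep the 60s default."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        return proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        # The child has already been killed by subprocess.run at this point.
        return 124, ""


# A slow first call can opt into a longer budget without touching other call sites.
code, out = exec_with_timeout([sys.executable, "-c", "print('42')"], timeout=240)
print(code, out.strip())
```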
♻️ Duplicate comments (1)
test/e2e/test-sandbox-operations.sh (1)
286-286: ⚠️ Potential issue | 🔴 Critical
Sensitive values can still be exposed in diagnostics.
Line 286 uses `${OPENCLAW_GATEWAY_TOKEN:+set}${OPENCLAW_GATEWAY_TOKEN:-unset}` (and the same for `NVIDIA_API_KEY`), which includes the secret value when set. This can leak credentials into CI logs.
🔧 Proposed fix
- diag_env=$(sandbox_exec 'echo HTTP_PROXY=${HTTP_PROXY:-unset}; echo HTTPS_PROXY=${HTTPS_PROXY:-unset}; echo NO_PROXY=${NO_PROXY:-unset}; echo OPENCLAW_GATEWAY_URL=${OPENCLAW_GATEWAY_URL:-unset}; echo OPENCLAW_GATEWAY_TOKEN=${OPENCLAW_GATEWAY_TOKEN:+set}${OPENCLAW_GATEWAY_TOKEN:-unset}; echo OPENSHELL_SANDBOX=${OPENSHELL_SANDBOX:-unset}; echo NVIDIA_API_KEY=${NVIDIA_API_KEY:+set}${NVIDIA_API_KEY:-unset}' 2>&1) || true
+ diag_env=$(sandbox_exec 'echo HTTP_PROXY=${HTTP_PROXY:-unset}; echo HTTPS_PROXY=${HTTPS_PROXY:-unset}; echo NO_PROXY=${NO_PROXY:-unset}; echo OPENCLAW_GATEWAY_URL=${OPENCLAW_GATEWAY_URL:-unset}; echo OPENCLAW_GATEWAY_TOKEN=$([ -n "${OPENCLAW_GATEWAY_TOKEN:-}" ] && echo set || echo unset); echo OPENSHELL_SANDBOX=${OPENSHELL_SANDBOX:-unset}; echo NVIDIA_API_KEY=$([ -n "${NVIDIA_API_KEY:-}" ] && echo set || echo unset)' 2>&1) || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/test-sandbox-operations.sh` at line 286, the diagnostic command leaks secret values because `${OPENCLAW_GATEWAY_TOKEN:+set}${OPENCLAW_GATEWAY_TOKEN:-unset}` (and the NVIDIA_API_KEY variant) concatenates "set" with the actual secret; change the diagnostic to print only "set" or "unset" without expanding the value by replacing those expansions with a conditional-only check (e.g., use a single parameter expansion or an explicit test) inside the sandbox_exec invocation so OPENCLAW_GATEWAY_TOKEN and NVIDIA_API_KEY are never interpolated into the logged string.
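The underlying rule (report whether a secret is present, never what it is) can be illustrated with a small Python sketch. It is not the shell fix itself; the grouping into `PLAIN_VARS` and `SECRET_VARS` and the function name are assumptions made for the example, using the variable names from the diagnostic line.

```python
from typing import List, Mapping

# Assumed split: values safe to echo verbatim vs. values that must never be logged.
PLAIN_VARS = ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY", "OPENCLAW_GATEWAY_URL", "OPENSHELL_SANDBOX")
SECRET_VARS = ("OPENCLAW_GATEWAY_TOKEN", "NVIDIA_API_KEY")


def diag_env(env: Mapping[str, str]) -> List[str]:
    """Return NAME=value lines; secrets are reduced to the literal set/unset."""
    lines = []
    for name in PLAIN_VARS:
        # Same spirit as ${VAR:-unset}: an empty value is reported as unset.
        lines.append(f"{name}={env.get(name) or 'unset'}")
    for name in SECRET_VARS:
        # Only presence is reported; the value is never formatted into the line.
        lines.append(f"{name}={'set' if env.get(name) else 'unset'}")
    return lines


print("\n".join(diag_env({"NVIDIA_API_KEY": "example-value"})))
```

For comparison, the flagged shell form `${VAR:+set}${VAR:-unset}` expands to `set` immediately followed by the variable's value when the variable is non-empty, which is why it leaks.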
Add gateway to the sandbox supplementary group and set 2775 (setgid + group-write) on /sandbox/.openclaw-data/plugin-runtime-deps. OpenClaw 2026.4.24+ runs its plugin loader on both the sandbox-side CLI and the gateway side; both paths call withBundledRuntimeDepsInstallRootLock, which mkdirSyncs a lock dir under the install root. The original NemoClaw user-isolation design has gateway and sandbox in different primary groups so the sandbox user cannot tamper with the gateway. Before 2026.4.24 the plugin loader did not need write access from the gateway side; now it does, and EACCES on the lock dir caused the gateway to fail mid-request, leaving the agent CLI hanging silently on the unanswered WebSocket call. Adding gateway to sandbox as a supplementary group preserves the original boundary (sandbox still cannot affect gateway-owned resources) and only opens gateway → sandbox-owned shared cache. Setgid bit ensures new files created by either user inherit the sandbox group. Mirrored in both Dockerfile.base (canonical) and Dockerfile (idempotent fixup for stale GHCR base images).
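As a quick reference for the mode bits mentioned above, the following Python sketch decodes an octal mode into the familiar `ls -l` form using the standard `stat` module. The helper names are made up for the example; it only interprets mode numbers and does not change any permissions.

```python
import stat


def describe_mode(mode: int, is_dir: bool = True) -> str:
    """Render an octal permission mode the way `ls -l` would display it."""
    file_type = stat.S_IFDIR if is_dir else stat.S_IFREG
    return stat.filemode(file_type | mode)


def has_setgid(mode: int) -> bool:
    """True when the setgid bit (the leading 2 in 2775) is present."""
    return bool(mode & stat.S_ISGID)


print(describe_mode(0o2775))                # drwxrwsr-x
print(describe_mode(0o444, is_dir=False))   # -r--r--r--
print(has_setgid(0o2775), has_setgid(0o775))
```

The `s` in the group triplet is the setgid bit: on a directory it makes new entries inherit the directory's group, which is the property the commit relies on.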
🧹 Nitpick comments (1)
Dockerfile (1)
186-187: Pattern matching in minified JS is fragile.
The Python patch uses exact string matching including literal tabs (`\t`) and newlines (`\n`). Minified JavaScript bundles often vary in whitespace formatting between versions or build environments. The assertion `assert old in src` will fail-close (which is good), but consider:
- The pattern assumes specific formatting that may not survive re-minification
- Upstream OpenClaw version bumps could silently change whitespace
The fail-close behavior is correct: the build aborts if the pattern isn't found. However, when this inevitably breaks on a future OpenClaw bump, debugging the exact whitespace mismatch will be tedious.
💡 Alternative: Consider regex-based patching for resilience
A more robust approach would use regex matching that's whitespace-tolerant:
    import re
    pattern = re.compile(
        r'if\s*\(\s*!\s*await\s+tryWriteSingleTopLevelIncludeMutation\s*\(\s*\{[^}]+\}\s*\)\s*\)\s*await\s+writeConfigFile\s*\([^;]+\);',
        re.DOTALL
    )
This would survive minor formatting changes. However, the current exact-match approach is acceptable given the fail-close assertion; just be prepared for patch maintenance on version bumps.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Dockerfile` around lines 186 - 187, The current Python one-liner patches the minified JS by exact string match of the tryWriteSingleTopLevelIncludeMutation/writeConfigFile block (the variables old/new and the assert old in src), which is fragile against whitespace/minification changes; change the script to use a regex-based, whitespace-tolerant search (e.g., compile a pattern that matches the if(!await tryWriteSingleTopLevelIncludeMutation(...)) await writeConfigFile(...) block with \s* and re.DOTALL) and perform a re.sub to inject the new try { ... } catch(...) wrapper, then update the assertion to check the regex matched (or that the file changed) instead of relying on the literal old string.
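A minimal sketch of what such a whitespace-tolerant, fail-closed patch step might look like is shown below. It reuses the pattern from the review comment; the function name, the exact text of the injected `catch` body, and the sample input are assumptions for illustration and are not taken from the PR's Dockerfile.

```python
import re

# Whitespace-tolerant pattern for the write block, as suggested in the review.
WRITE_BLOCK = re.compile(
    r"if\s*\(\s*!\s*await\s+tryWriteSingleTopLevelIncludeMutation\s*\(\s*\{[^}]+\}\s*\)\s*\)"
    r"\s*await\s+writeConfigFile\s*\([^;]+\);",
    re.DOTALL,
)


def wrap_write_block(src):
    """Wrap the single expected write block in a sandbox-gated try/catch."""

    def repl(match):
        # Illustrative wrapper only: rethrow unless running in the sandbox with EACCES.
        return (
            "try { " + match.group(0) + " } catch (err) { "
            'if (!(process.env.OPENSHELL_SANDBOX && err.code === "EACCES")) throw err; }'
        )

    patched, count = WRITE_BLOCK.subn(repl, src)
    if count != 1:
        # Fail closed: zero or multiple matches abort instead of silently skipping.
        raise RuntimeError(f"expected exactly 1 match, found {count}")
    return patched


sample = "if(!await tryWriteSingleTopLevelIncludeMutation({a:1}))await writeConfigFile(x,{b:2});"
print(wrap_write_block(sample))
```

Using a replacement function rather than a replacement string avoids accidental interpretation of backslashes in the injected JavaScript, and checking the substitution count keeps the fail-close property of the original `assert old in src`.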
…che" This reverts commit 521c599.
The _SANDBOX_SAFETY_NET preload was loaded via NODE_OPTIONS=--require into EVERY Node process in the sandbox, including short-lived CLI commands like `openclaw agent`. It installed an unconditional `unhandledRejection` handler that swallows the rejection — designed to keep the long-running gateway alive across non-fatal library bugs. In OpenClaw 2026.4.9 the agent CLI's code paths didn't trip an unhandled rejection, so the swallow was harmless there. In 2026.4.24 the new plugin loader / gateway client path produces an unhandled rejection from `openclaw agent`. Instead of surfacing as an error, the safety net ate it and the awaited Promise never resolved — leaving the CLI hanging silently on a request that should have failed fast. This is the exact symptom in TC-SBX-02: two UNDICI warnings (process startup) followed by minutes of silence with no error output. Gate the swallow to argv[2] === "gateway" so the protection is scoped to its actual purpose (`openclaw gateway run …`). All other CLI commands (agent, doctor, plugins, tui) get default Node behavior — errors surface and short-lived processes exit cleanly with a meaningful exit code.
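The gate itself lives in a Node preload, so the following is only a Python re-expression of the decision it makes, with hypothetical function names. It assumes the Node `process.argv` layout, where index 0 is the interpreter, index 1 is the script, and index 2 is the first CLI argument.

```python
def is_gateway_process(argv):
    """True when the first CLI argument is `gateway` (argv[2] in Node terms)."""
    return len(argv) > 2 and argv[2] == "gateway"


def rejection_policy(argv):
    """Return 'swallow' for the long-running gateway, 'default' for every other command."""
    return "swallow" if is_gateway_process(argv) else "default"


print(rejection_policy(["node", "openclaw", "gateway", "run"]))
print(rejection_policy(["node", "openclaw", "agent"]))
```

With this shape, short-lived commands such as `openclaw agent` keep the runtime's normal failure behavior, while only the gateway opts into swallowing.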
…lure
TC-SBX-02 hangs without surfacing any error: with the safety-net gate fix, errors should now propagate on the agent CLI side, but we see only Node UNDICI warnings then 60s of silence. The remaining hypothesis is that the gateway-side `agent` method handler hits an error that's swallowed by the gateway's still-active safety net (intentional, since it keeps the gateway alive), leaving the client awaiting a response that never comes.
To prove or refute this, the gateway log content during the hang must be visible in the failed test artifact. The test framework captures only the test runner's own log (and the agent CLI's SSH output, which is silent). /tmp/gateway.log inside the sandbox container has the data we need.
Two-part diagnostic, not a behavior change:
1. nemoclaw-start.sh: background-tail /tmp/gateway.log with a [gateway-log:] prefix to PID 1's stderr after gateway launch. Each gateway-log line now appears in the container's stderr stream (and is filterable by prefix). Cleanup: tail PID added to SANDBOX_CHILD_PIDS so cleanup_on_signal reaps it on shutdown. Both root and non-root launch paths covered.
2. nightly-e2e.yaml sandbox-operations-e2e: on failure, run `docker logs` on every test-sbx-* container and upload as a separate artifact (sandbox-operations-docker-logs). The artifact will contain the gateway log content (now mirrored to container stderr) at the time of failure.
This is a NemoClaw-side and workflow-level change (no test changes; the test contract for TC-SBX-02 is unchanged). The runtime diagnostic is permanent but additive; it can be removed once the upstream root cause is identified and fixed.
Ref: #2484
The previous post-failure docker logs capture step ran AFTER the test script's teardown destroyed test sandbox containers — so `docker ps -a` returned no matches and the artifact was empty. Replace with a background `docker logs -f` streamer started before the test runs. As soon as a container appears, its logs stream to a per-container file in docker-logs/. When the container is removed, the stream ends but the file persists on the host. The post-failure artifact upload now captures logs from every container that existed at any point during the test. Combined with the [gateway-log:] mirror in nemoclaw-start.sh, this finally surfaces gateway-side activity (including any sandbox-safety-net swallowed errors) at the time TC-SBX-02 hangs. Ref: #2484
The previous docker-logs streamer hit "configured logging driver does not support reading" for sandbox containers. NemoClaw sandboxes are k3s pods INSIDE the openshell-cluster container, not sibling docker containers — `docker logs` cannot read pod stdio. Switch to `docker exec openshell-cluster-* kubectl logs -f -n openshell <pod> --all-containers` to stream pod logs (which include PID 1's stderr mirror of /tmp/gateway.log via the [gateway-log:] prefix added in nemoclaw-start.sh). Output goes to per-pod files on the host that persist past pod deletion. Ref: #2484
The kubectl-logs streamer also returned empty files because the container log driver in openshell's k3s setup doesn't capture container stdio (same root cause as the docker logs failure). The only working way to read /tmp/gateway.log content from outside the pod is via SSH — which `nemoclaw <sandbox> logs --follow` does internally. Switch the streamer to `nemoclaw <name> logs --follow > docker-logs/sandbox-<name>.log`. The streamer waits for nemoclaw to be installed (test does that in its first phase), polls `nemoclaw list`, and spawns a follower per sandbox. Ref: #2484
The previous `nemoclaw logs --follow` per-sandbox streamer accumulated unbounded output and the artifact upload step never finished within the 60-min job timeout (run 24968594521 was cancelled at 1h+ stuck on Upload sandbox gateway logs). Switch to snapshot mode: every 10s, run `timeout 8 nemoclaw <name> logs` and overwrite docker-logs/sandbox-<name>.log with the result, capped at 256KB. The default `nemoclaw logs` invocation returns ~62 lines (already bounded by /tmp/gateway.log size at snapshot time). When a sandbox is destroyed by the test, the file holds the final pre-destroy snapshot. Ref: #2484
The previous streamer parsed `nemoclaw list` pretty-printed output and
picked up the "Sandboxes:" header line whose first token literally is
"Sandboxes:" (with colon). Tried to create docker-logs/sandbox-Sandboxes:.log
which GitHub artifact upload rejects ("not a valid path: contains colon").
Read the registry json directly (~/.nemoclaw/sandboxes.json) via jq and
only accept names matching strict filename-safe pattern
[a-z0-9_-]+ — defense against future parsing issues too.
Ref: #2484
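The name filtering described above can be sketched in Python. The workflow itself uses `jq`; the registry schema assumed here (a top-level `sandboxes` object keyed by name) is a guess for illustration, since the layout of `~/.nemoclaw/sandboxes.json` is not shown.

```python
import json
import re

# Strict filename-safe pattern from the commit message.
SAFE_NAME = re.compile(r"[a-z0-9_-]+")


def sandbox_names(registry_json):
    """Return filename-safe sandbox names from a registry document.

    Assumes a hypothetical schema: {"sandboxes": {"<name>": {...}, ...}}.
    """
    data = json.loads(registry_json)
    names = data.get("sandboxes", {})
    # fullmatch rejects anything with a colon, slash, dot, or uppercase letter.
    return sorted(name for name in names if SAFE_NAME.fullmatch(name))


doc = json.dumps({"sandboxes": {"test-sbx-a": {}, "Sandboxes:": {}, "../evil": {}}})
print(sandbox_names(doc))
```

Matching the whole name, rather than searching within it, is what keeps a token like `Sandboxes:` from ever becoming part of a log file path.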
The previous snapshot-based streamer (overwriting per-sandbox file every 10s with `nemoclaw logs` output) lost the agent-request events because `nemoclaw logs` returns only the tail of /tmp/gateway.log and the ciao mDNS error spam (~10 errors/sec) buries earlier real events. Switch to a per-sandbox SSH+tail follower that streams /tmp/gateway.log directly (full stream from start), filters the uv_interface_addresses noise inline, and caps each file at 512KB. Spawned once per sandbox via openshell ssh-config. Stop step kills the SSH followers along with the streamer. Ref: #2484
Previous streamer wrote ssh config via mktemp and rm'd it before the backgrounded ssh child connected — ssh hit "Can't open user config file" race. Use a per-sandbox stable path /tmp/sshcfg-<name>.tmp and don't remove it; runner /tmp gets cleaned up at job end anyway. Ref: #2484
The bash -c '...' single-quoted block had apostrophes inside its comments (Can't, `rm`) which prematurely terminated the outer single quote, leaving the rest of the script with unbalanced quotes — bash exited with "unexpected EOF while looking for matching `\"'" within 6 seconds of job start. Reword comments to avoid apostrophes. Ref: #2484
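The quoting failure can be reproduced outside the shell with Python's `shlex`, which follows POSIX-style quoting rules closely enough for this purpose. The sample command strings and the helper name are invented for the illustration.

```python
import shlex


def quotes_balanced(command):
    """Return False when POSIX-style tokenizing hits an unterminated quote."""
    try:
        shlex.split(command)
        return True
    except ValueError:
        return False


# An apostrophe inside the single-quoted script closes the quote early,
# and the final quote character then opens a string that never ends.
broken = "bash -c 'echo start # Can't open config'"
fixed = "bash -c 'echo start # Cannot open config'"
print(quotes_balanced(broken), quotes_balanced(fixed))
```

Inside single quotes nothing is special except the closing quote, so a `#` does not start a comment there and cannot protect an apostrophe that follows it.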
`head -c 524288` blocked waiting for 512KB to arrive through the tail | grep pipe. Most lines are mDNS noise that grep -v drops, so useful content arrives slowly. When the streamer was killed at job end, head had captured zero bytes — final file was just the SSH disconnect message (43b). Drop the head -c cap so output streams freely while the job runs. As safety against runaway file size, trim each log file to its last 5MB at stop time. Real gateway events are interleaved with whatever filtered content remains, so tail-trim keeps the most recent content (which includes the TC-SBX-02 hang window). Ref: #2484
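The stop-time trim can be sketched as a small Python helper. The workflow does this in shell; the function name and the in-place rewrite strategy below are assumptions chosen for the example.

```python
import os
import tempfile


def trim_to_tail(path, max_bytes):
    """Keep only the last max_bytes of the file and return the resulting size."""
    size = os.path.getsize(path)
    if size <= max_bytes:
        return size
    with open(path, "rb") as f:
        f.seek(size - max_bytes)
        tail = f.read()
    with open(path, "wb") as f:
        f.write(tail)
    return len(tail)


# Demo on a temporary file; the workflow's limit would be 5 * 1024 * 1024 bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"old noise\n" * 1000 + b"latest event\n")
    demo_path = tmp.name
print(trim_to_tail(demo_path, 64))
os.remove(demo_path)
```

Trimming once at the end, instead of capping inline with `head -c`, keeps the pipe flowing during the run and still bounds the artifact size.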
The gateway log line "log file: /tmp/openclaw-998/openclaw-2026-04-27.log" revealed that openclaw writes detailed event tracing to a SEPARATE file than /tmp/gateway.log (which only captures the launch redirect of stdout/stderr from nemoclaw-start.sh). The structured log carries the agent-flow events we need; gateway.log silenced after startup because most subsequent events go to the structured log instead. Tail BOTH files in the same SSH session so we capture all gateway-side activity during TC-SBX-02. Glob /tmp/openclaw-*/openclaw-*.log to handle the per-uid stem (e.g. openclaw-998). Ref: #2484
Root cause of TC-SBX-02 hang, now fully traced via the gateway-log
streamer artifact:
The bonjour plugin (mDNS service advertiser) attempts to probe network
interfaces via ciao every few seconds. Inside the sandbox netns,
os.networkInterfaces() throws (no usable interfaces). The ciao guard in
nemoclaw-start.sh monkey-patches os.networkInterfaces to return empty,
but that does not stop ciao from cancelling its outstanding probe with
"CIAO PROBING CANCELLED" — an UNHANDLED Promise rejection (the ciao
guard only catches synchronous uncaughtException, not async).
The sandbox-safety-net swallows the rejection (gateway-only after the
recent gate fix), but the swallow happens during the same event loop
tick as in-flight WebSocket handshakes from the openclaw agent CLI.
Pending WS connections get torn down with code 1006 (abnormal closure):
03:17:39.367 Unhandled promise rejection: CIAO PROBING CANCELLED
03:17:39.387 [gateway/ws] closed before connect ... code=1006
(handshake pending,
durationMs=7)
The agent CLI sees the abrupt close, retries, hits the same race,
eventually times out at the 10s connect-challenge timeout. Test only
sees UNDICI warnings because the CLI's `console.error` failure message
goes to /tmp/openclaw-<uid>/openclaw-<date>.log (the structured event
log), not stdout/stderr — the test framework never sees it.
Why TC-SBX-02 worked on 2026.4.9 but not 2026.4.24: bonjour plugin
loading and probe timing changed in the 2026.4.10–24 range
(Jiti-based plugin loader, lazy provider deps), making the rejection
window overlap WS handshakes more aggressively. On 2026.4.9 the timing
was lucky enough that the rejection never overlapped a real connect.
Fix: set plugins.entries.bonjour.enabled=false in the generated
openclaw.json. mDNS service advertisement is useless inside a sandboxed
netns (no peers to advertise to, no clients to discover the service)
and the only thing it accomplishes here is destabilizing other
gateway connections.
Ref: #2484
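A minimal sketch of how a config generator might apply the `plugins.entries.<id>.enabled=false` setting is shown below. The key path comes from the commit message; the helper name and the surrounding config contents are assumptions, and this is not the actual code from `scripts/generate-openclaw-config.py`.

```python
import json


def disable_plugins(config, plugin_ids):
    """Set plugins.entries.<id>.enabled = False for each id, preserving other keys."""
    entries = config.setdefault("plugins", {}).setdefault("entries", {})
    for plugin_id in plugin_ids:
        entries.setdefault(plugin_id, {})["enabled"] = False
    return config


# Hypothetical starting config; only the bonjour entry is added.
config = {"plugins": {"entries": {"nvidia": {"enabled": True}}}}
print(json.dumps(disable_plugins(config, ["bonjour"]), indent=2))
```

Using `setdefault` at each level means the helper works whether or not the generated config already has a `plugins` section.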
Root cause: bonjour mDNS plugin destabilizes WS connections in sandbox netns
After significant diagnostic plumbing (the openclaw structured event log lives at 19 ms between the unhandled rejection from the
Causal chain
Why this surfaces in 2026.4.24 but not 2026.4.9
Plugin load timing changed in the 2026.4.10–24 range (Jiti-based plugin loader, "lazy provider dependencies" in the release notes). The bonjour rejection window now overlaps WS handshakes more aggressively. On 2026.4.9 the timing was a lucky race; on 2026.4.24 it reliably hits.
Why disable bonjour is the right fix
mDNS service advertisement is structurally useless inside a NemoClaw sandbox:
This is the kind of plugin that exists for the user-laptop deployment story (where mDNS finds your assistant on a home network), not for the headless sandbox case NemoClaw runs.
Fix in this PR
Validation re-run in progress: https://github.com/NVIDIA/NemoClaw/actions/runs/24975221024
Diagnostic infrastructure to remove on green
Once TC-SBX-02 passes, these diagnostic-only commits should be reverted:
These were necessary to find the root cause but add ambient runtime/CI overhead. Cleanup commit will be marked with
Selective E2E Results: ❌ Some jobs failed (Run: 25077451446)
Selective E2E Results: ❌ Some jobs failed (Run: 25078188617)
Selective E2E Results: ❌ Some jobs failed (Run: 25079888582)
Selective E2E Results: ❌ Some jobs failed (Run: 25080430723)
Selective E2E Results: ❌ Some jobs failed (Run: 25080990090)
Selective E2E Results: ❌ Some jobs failed (Run: 25081625450)
Selective E2E Results: ✅ All requested jobs passed (Run: 25082270514)
Selective E2E Results: ✅ All requested jobs passed (Run: 25085223759)
## Summary
Upgrades OpenClaw from **2026.4.9** to **2026.4.24** (latest stable,
CalVer).
### Fixes in this PR
1. **Version bumps** — `Dockerfile.base`,
`nemoclaw-blueprint/blueprint.yaml`, `agents/openclaw/manifest.yaml`,
`src/lib/sandbox-version.test.ts`.
2. **Patch 4 updated** — OpenClaw 2026.4.24 restructured
`replaceConfigFile` to first attempt
`tryWriteSingleTopLevelIncludeMutation` (writes to a `$include` file
like `plugins.json5`) before falling back to `writeConfigFile`. The old
patch matched an exact tab-indented `writeConfigFile(params.nextConfig,
{...})` string that no longer exists. Updated to match the new `if
(!await tryWriteSingleTopLevelIncludeMutation(...)) await
writeConfigFile(...)` block and wrap the entire write path in the
OPENSHELL_SANDBOX-gated EACCES try/catch.
3. **`plugin-runtime-deps` symlink** — OpenClaw 2026.4.24 introduced
lazy plugin runtime-dep installation (Jiti loader). The CLI writes to
`~/.openclaw/plugin-runtime-deps/openclaw-<version>-<hash>/` on first
invocation. NemoClaw locks `/sandbox/.openclaw` to `444 root:root`, so
every bundled provider failed to load with `EACCES`. Fix: created the
dir in the writable `.openclaw-data` tree and symlinked it from the
immutable config tree, mirroring the existing pattern used for `logs`,
`credentials`, `extensions`, etc. Added in both `Dockerfile.base`
(canonical) and `Dockerfile` (idempotent fixup for stale GHCR base).
4. **Selective sandbox safety-net** — `_SANDBOX_SAFETY_NET` (a Node
`--require` preload from `nemoclaw-start.sh`) used to be a catch-all
swallow + `process.exit` interceptor. Rewritten to: (a) gate to gateway
processes only (`OPENSHELL_SANDBOX=1` + `argv[2]==='gateway'`) so CLI
commands keep default Node crash behaviour; (b) match documented
known-benign patterns (currently `ciao`/mDNS — produced when bonjour's
probe state machine cancels itself, since the sandbox netns has no
multicast); (c) for unknown errors, log full stack but keep gateway
alive (gateway is shared infrastructure, user-initiated actions must not
take it down); (d) drop `process.exit` interception entirely. The CIAO
guard's `uncaughtException` listener was similarly gated to gateway
processes — registering one in CLI processes turns Node's default
crash-on-uncaught into silent absorb, which would silently hang
`openclaw agent`.
5. **Disable bonjour and qqbot bundled plugins** — both ship
enabled-by-default in 2026.4.24 and break in the sandbox netns:
- **bonjour**: introduced in 2026.4.15, uses `@homebridge/ciao` for mDNS
announcement. Sandbox netns has no multicast — ciao's probe state
machine fails at startup.
- **qqbot**: has `stageRuntimeDependencies=true`, so its npm deps
(`@tencent-connect/qqbot-connector`, `silk-wasm`, etc.) install on first
load. The sandbox L7 proxy denies the registry URL with `403
policy_denied`, the install retries for ~6 minutes, and while channel
loading is stuck the gateway can't service `openclaw agent` requests.
Both disabled via `plugins.entries.<id>.enabled = false` in
`scripts/generate-openclaw-config.py`.
6. **Build-context fix for `generate-openclaw-config.py`** — main's PR
NVIDIA#2449 (commit `f5ee8a4d`) extracted the inline Python config-generator
from Dockerfile into `scripts/generate-openclaw-config.py` and added
`COPY scripts/generate-openclaw-config.py …` to Dockerfile, but did not
update `src/lib/sandbox-build-context.ts` which curates the optimized
build context for sandbox image builds. Without this, every nightly E2E
job (and any sandbox onboard) fails with `COPY failed: file not found in
build context`. Added the file to `stageOptimizedSandboxBuildContext()`
next to `nemoclaw-start.sh` and added a test assertion so the staging
stays in sync.
### Status
Most recent un-rate-limited run (25015126555 with build-context fix):
**13 of 18 jobs pass**. `sandbox-operations-e2e` still fails — only
TC-SBX-02 (Connect & Chat) within it. All other TC-SBX cases (03, 04,
05, 06, 07, 08, 10, 11, 12) pass on `test-sbx-a`, confirming the gateway
is functional. After the `sandbox-build-context.ts` fix and the qqbot
disable, the failure mode of TC-SBX-02 changed from `SSH command timed
out after 60s` to `Expected '42' in agent reply; reply=''` — same 60-90
second hang but now hitting the test's outer `run_with_timeout` rather
than producing a stack trace. The test drops stderr (`2>/dev/null`), and
the gateway-log streamer/snapshot infrastructure has been unable to
capture `test-sbx-a`'s `/tmp/openclaw-998/openclaw-*.log` reliably (the
post-test openshell state has no active gateway after TC-SBX-06's docker
kill, and the streamer's connection to test-sbx-a races and gets
`Connection refused`). Still root-causing.
### Notable upstream changes (2026.4.9 → 2026.4.24)
- Google Meet bundled plugin, DeepSeek V4 Flash/Pro, realtime voice
loops (Talk/Voice Call/Google Meet), Gemini Live, browser automation
improvements.
- Lighter startup: static model catalogs, manifest-backed model rows,
**lazy provider dependencies** (the new plugin-runtime-deps mechanism —
root cause of fix #3).
- **Breaking:** Plugin SDK tool-result transforms migrated from
`registerEmbeddedExtensionFactory()` to
`registerAgentToolResultMiddleware()` — verified NemoClaw uses neither.
- **Breaking:** Plugin registry migrated from `plugins.installs` config
key to managed `plugins/installs.json` ledger — `openclaw doctor --fix`
migrates automatically.
- Config writes restructured to use single-file `$include` mutations
before falling back to full config write (root cause of fix #2).
- CVE-2026-41349, CVE-2026-22181 fixes; exec-approvals chat enablement
(2026.4.22); cron `jobs-state.json` separation (2026.4.20).
- bonjour mDNS plugin added in 2026.4.15 (root cause of fix #5a).
### User sandbox state migration on rebuild
Existing user sandboxes upgrade via `nemoclaw <name> rebuild`. State
(memory/, workspace/, agents/, extensions/, etc.) is backed up via tar,
sandbox is destroyed and recreated with the new image, state is
restored, `openclaw doctor --fix` runs post-restore.
**Handled automatically:** memory, cron job definitions, plugin
auto-discovery, plugin registry migration. **Existing reset behavior
(not new):** exec-approvals, credentials, device pairing. **New minor
behavior change:** cron runtime state (`jobs-state.json`) absent in
pre-2026.4.20 backups — job execution history resets, jobs may re-fire
once after upgrade.
## Test plan
- [x] CI lint, typecheck, unit tests pass
- [x] Docker base image and sandbox image build with all dist patches
applied
- [x] 13/18 nightly E2E jobs pass cleanly with all six fixes
- [ ] **TC-SBX-02** — root cause for the residual `reply=''` hang under
investigation; the gateway-log capture infrastructure needs to work
reliably post-test before we can read what's happening server-side
- [ ] Manual smoke test via `nemoclaw <sandbox> connect` interactive
flow
- [ ] Rebuild test: existing 2026.4.9 sandbox → rebuild → verify state
preserved (rebuild-openclaw-e2e covers this)
…ons land cleanly (#2681) (#2851) ## Summary Replaces the EACCES-swallow approach proposed in #2693 with proper Unix group permissions. Control-UI toggles in the OpenClaw dashboard (Enable Dreaming, account toggles, etc.) now **persist in default mode** instead of throwing `GatewayRequestError: EACCES` or becoming silent no-ops. ## Background OpenClaw 2026.4.24 (landed via #2484) introduced `mutateConfigFile` as the new control-UI write path. Patch 4 in the Dockerfile only wraps the legacy `replaceConfigFile` (plugin install path), so every config-toggle click in the sandbox dashboard now EACCES'd. #2693 proposed adding "Patch 4b" — a parallel try/catch that swallows the EACCES. That makes toggles non-functional in the sandbox: the user clicks "Enable Dreaming," gets no error, but nothing actually persists. UX improves over the crash; underlying limitation stays. This PR implements the alternative design Aaron sketched for #2681: rather than wrapping each new write path in EACCES handlers, fix the actual permissions so the writes succeed. 
## Closes / Supersedes - Closes #2681 - Supersedes #2693 — thanks @Sanjays2402 for raising the issue and the initial swallow attempt that surfaced the deeper design question ## Implementation (the 6-item spec) | # | Item | File | |---|------|------| | 1 | Keep `gateway` as a separate UID from `sandbox`; add it to the `sandbox` group | `Dockerfile.base` | | 2 | Stale-base fallback so older `sandbox-base:latest` tags get the group membership at derived-image build time | `Dockerfile` | | 3 | `/sandbox/.openclaw` group-writable + setgid on dirs; `.config-hash` file mode 664 | `Dockerfile.base`, `Dockerfile` | | 4 | `normalize_mutable_config_perms()` at startup, gated on Shields state | `scripts/nemoclaw-start.sh` | | 5 | `shields down` restores 660/2770 (group-writable + setgid) for OpenClaw; Hermes left at historical 640/750 (no separate gateway UID, contract doesn't apply) | `src/lib/shields.ts` | | 6 | Tests assert the new invariant: writes succeed in default mode, no new EACCES swallow | `test/repro-2681-group-writable.test.ts` | ## Why setgid `chmod g+s` on directories means new files inherit `group=sandbox` regardless of which UID created them. So `gateway` writes a file → file is `group=sandbox` → the `sandbox` user (also in the group) can still read it. Without setgid, gateway's writes would land with `group=gateway` and the agent might lose read access on rotation. ## Patch 4 retention The existing `Patch 4` (replaceConfigFile EACCES swallow) is **intentionally retained** as a defensive fallback for: - Older base images during the rollout window - Host filesystems that don't honor setgid (rare, but possible on some Windows/WSL2 configurations) - Other write paths in OpenClaw that might surface in future versions No new EACCES swallow patch is added — the `Patch 4b` approach from #2693 is explicitly rejected per spec item #6. 

## Verification

- [x] `npm run build:cli` compiles the changed `shields.ts`
- [x] 11/11 new tests pass in `test/repro-2681-group-writable.test.ts`, asserting structural invariants of the group-writable contract
- [x] 443/443 plugin tests pass
- [x] Pre-existing CLI tests that fail on this branch ALSO fail on pristine main (`@oclif/core` module-not-found from in-flight migration; not caused by this PR)
- [ ] **Brev E2E required**: touches Dockerfile + Dockerfile.base + shields lifecycle. Adaptive matrix: M×DANGER → full Brev sweep before merge

## Test plan

- [x] Unit: 11 structural assertions in `repro-2681-group-writable.test.ts`
- [ ] CI: `build-sandbox-images` (validates the group-membership + setgid Dockerfile changes)
- [ ] CI: `test-e2e-sandbox` (validates shields lifecycle + onboard flow)
- [ ] CI: `test-e2e-gateway-isolation` (validates the gateway-as-different-UID still runs cleanly)
- [ ] Manual repro: onboard, click "Enable Dreaming" in dashboard, verify mutation persists across `nemoclaw status`

## Type of Change

- [x] Code change (feature, bug fix, or refactor)

## AI Disclosure

- [x] AI-assisted (tool: Claude Code)

Patch 4 is a regex-based monkey-patch over OpenClaw's compiled JS that suppresses EACCES inside replaceConfigFile. Its source-shape coupling has broken three times in eight days (#2377, #2484, #2876) chasing upstream refactors; #2686 and #3497 report the latest casualty, where the regex no longer finds the function in 2026.4.24 and the image build fails.

Patch 4 is also unnecessary by design:

* In mutable-default mode, openclaw.json is chmod 660 sandbox:sandbox and the gateway UID is in the sandbox group (#2681), so plugin install writes through without ever hitting EACCES.
* In shields-up mode, the entire config tree (file + parent dir + the plugin/extensions state dirs in HIGH_RISK_STATE_DIRS) is locked to root:root by design; refusing runtime mutations is the whole point of shields-up. Suppressing the EACCES masked that refusal and made the install appear to succeed silently while only the auto-discovery half landed.

The expected flow is configure-in-mutable-mode → shields up → run. Plugin install attempted while shielded should fail cleanly, which is what happens without Patch 4.

Reverts the rcf-shim replacement attempt; the require-hook approach does not catch OpenClaw's ESM named imports anyway (capture-at-import-time semantics).

Resolves #2686
Resolves #3497

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
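
To illustrate the source-shape coupling this commit message describes, here is a minimal sketch of a literal-anchor patcher. Everything in it is an assumption made for illustration: the two JavaScript snippets are simplified stand-ins for the upstream write block (not OpenClaw's actual compiled output), and `wrap_in_eacces_guard`, `ANCHOR`, and `PatchAnchorNotFound` are hypothetical names rather than the contents of `scripts/rcf_patch.py`.

```python
class PatchAnchorNotFound(RuntimeError):
    """Raised when the anchor does not match exactly one line."""


# Hypothetical anchor: a tab-indented call, as a literal-string patch would expect.
ANCHOR = "\tawait writeConfigFile(params.nextConfig,"

# Simplified stand-ins for the function before and after an upstream refactor.
OLD_SOURCE = (
    "async function replaceConfigFile(params) {\n"
    "\tawait writeConfigFile(params.nextConfig, opts);\n"
    "}"
)
NEW_SOURCE = (
    "async function replaceConfigFile(params) {\n"
    "\tif (!await tryWriteSingleTopLevelIncludeMutation(params)) "
    "await writeConfigFile(params.nextConfig, opts);\n"
    "}"
)


def wrap_in_eacces_guard(source: str, anchor: str) -> str:
    """Wrap the single anchored statement in a sandbox-gated EACCES try/catch."""
    lines = source.split("\n")
    hits = [i for i, line in enumerate(lines) if line.startswith(anchor)]
    if len(hits) != 1:
        # Mirrors a build-time guard: refuse to continue if the shape changed.
        raise PatchAnchorNotFound(f"anchor matched {len(hits)} lines, expected 1")
    statement = lines[hits[0]].strip()
    lines[hits[0]] = (
        "\ttry { " + statement + " } catch (e) { "
        "if (!(process.env.OPENSHELL_SANDBOX === '1' && e.code === 'EACCES')) throw e; }"
    )
    return "\n".join(lines)
```

Under these assumptions the patcher succeeds on `OLD_SOURCE` and raises on `NEW_SOURCE`, because the refactored line no longer begins with the anchored call. A tolerant regex pushes the breakage out but does not remove the coupling.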

## Summary

Patch 4 in the sandbox `Dockerfile` is a regex-based monkey-patch over OpenClaw's compiled JS that wraps `replaceConfigFile` in an EACCES try/catch suppression. It is source-shape-coupled and has been rewritten three times in eight days chasing OpenClaw refactors:

- `fefd69fa2` (#2377): original literal-string match against [openclaw/openclaw@v2026.4.9](https://github.com/openclaw/openclaw/releases/tag/v2026.4.9).
- `5dcb0a9b9` (#2484): updated the literal string for the restructured write block in [openclaw/openclaw@v2026.4.24](https://github.com/openclaw/openclaw/releases/tag/v2026.4.24).
- `e0290e153` (#2876): hardened to a tolerant whitespace/property-order-aware regex against [openclaw/openclaw@v2026.4.24](https://github.com/openclaw/openclaw/releases/tag/v2026.4.24).

#2686 and #3497 are the latest break: in current OpenClaw, the regex no longer finds the function shape and the image build aborts at Step 17/56.

Patch 4 is also unnecessary by design. The EACCES it was suppressing does not happen in the supported flows:

- **Mutable-default mode** (fresh sandbox, before `nemoclaw shields up`): `openclaw.json` is `chmod 660 sandbox:sandbox` and the gateway UID is in the sandbox group, courtesy of #2851 (closing #2681; superseding the EACCES-swallow attempt in #2693). `openclaw plugins install` writes through normally; no EACCES.
- **Shields-up mode** (locked): the entire config tree (file, parent directory, and the `extensions`/`plugins` state dirs from [HIGH_RISK_STATE_DIRS](src/lib/shields/index.ts#L292-L306)) is locked to `root:root` by design. Shields-up exists *to refuse* runtime config and plugin mutations. Suppressing the EACCES masked that refusal and made `openclaw plugins install` appear to succeed silently while only the auto-discovery half landed.

The expected lifecycle is **configure-in-mutable-mode → `shields up` → run**. Plugin install attempted while shielded should fail cleanly; that is exactly what happens without Patch 4.

This PR therefore deletes Patch 4 entirely.

## Related Issue

Resolves #2686
Resolves #3497

Related context:

- #2681: original "make `.openclaw` group-writable" issue, closed by #2851.
- #2851: PR that made mutable-mode plugin install work without an EACCES swallow.
- #2693: closed earlier EACCES-swallow attempt, superseded once #2851 landed.
- #2544: NemoClaw issue tracking the broader "plugin config requires multi-minute rebuild" problem.
- openclaw/openclaw#72950: upstream defect (no env-var or writable-overlay path for `plugins.entries.<id>.config`); the real fix has to land there.

## Changes

- `Dockerfile`: drop the Patch 4 block (the `COPY scripts/rcf_patch.py` line and the inline Python invocation + grep guard).
- `scripts/rcf_patch.py`: deleted.
- `src/lib/sandbox/build-context.ts`: stop staging `scripts/rcf_patch.py` into the sandbox build context.
- `test/rcf-patch.test.ts`: deleted.

## Type of Change

- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification

- [x] `npx prek run --all-files` passes
- [x] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Chores**
  * Removed OpenClaw patching logic from Docker build process and related artifact copies
  * Updated build context script staging behavior
* **Tests**
  * Enhanced sandbox configuration test suite with environment variable passthrough support
  * Added version-based conditional patching validation and warning behavior tests

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

Summary
Upgrades OpenClaw from 2026.4.9 to 2026.4.24 (latest stable, CalVer).
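
Because the versions are CalVer strings, a pin check has to compare them numerically rather than lexically. The sketch below is a minimal illustration under that assumption; `parse_calver`, `is_upgrade`, and `pins_agree` are hypothetical helpers and are not taken from `src/lib/sandbox-version.test.ts`.

```python
from typing import Dict, Tuple


def parse_calver(version: str) -> Tuple[int, ...]:
    """Split a dotted CalVer string such as '2026.4.24' into integer parts."""
    return tuple(int(part) for part in version.split("."))


def is_upgrade(old: str, new: str) -> bool:
    """True when `new` is strictly newer than `old`, compared part by part."""
    return parse_calver(new) > parse_calver(old)


def pins_agree(pins: Dict[str, str], expected: str) -> bool:
    """True when every version-tracking file pins the same expected version."""
    return all(version == expected for version in pins.values())
```

A plain string comparison would rank `2026.4.9` above `2026.4.24` because `'9'` sorts after `'2'`, which is why the tuple form is used here.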
Fixes in this PR
- Version pin updated in `Dockerfile.base`, `nemoclaw-blueprint/blueprint.yaml`, `agents/openclaw/manifest.yaml`, `src/lib/sandbox-version.test.ts`.
- OpenClaw 2026.4.24 restructured `replaceConfigFile` to first attempt `tryWriteSingleTopLevelIncludeMutation` (writes to a `$include` file like `plugins.json5`) before falling back to `writeConfigFile`. The old patch matched an exact tab-indented `writeConfigFile(params.nextConfig, {...})` string that no longer exists. Updated to match the new `if (!await tryWriteSingleTopLevelIncludeMutation(...)) await writeConfigFile(...)` block and wrap the entire write path in the OPENSHELL_SANDBOX-gated EACCES try/catch.
- `plugin-runtime-deps` symlink: OpenClaw 2026.4.24 introduced lazy plugin runtime-dep installation (Jiti loader). The CLI writes to `~/.openclaw/plugin-runtime-deps/openclaw-<version>-<hash>/` on first invocation. NemoClaw locks `/sandbox/.openclaw` to `444 root:root`, so every bundled provider failed to load with `EACCES`. Fix: created the dir in the writable `.openclaw-data` tree and symlinked it from the immutable config tree, mirroring the existing pattern used for `logs`, `credentials`, `extensions`, etc. Added in both `Dockerfile.base` (canonical) and `Dockerfile` (idempotent fixup for stale GHCR base).
- `_SANDBOX_SAFETY_NET` (a Node `--require` preload from `nemoclaw-start.sh`) used to be a catch-all swallow + `process.exit` interceptor. Rewritten to: (a) gate to gateway processes only (`OPENSHELL_SANDBOX=1` + `argv[2]==='gateway'`) so CLI commands keep default Node crash behaviour; (b) match documented known-benign patterns (currently `ciao`/mDNS, produced when bonjour's probe state machine cancels itself, since the sandbox netns has no multicast); (c) for unknown errors, log full stack but keep gateway alive (gateway is shared infrastructure, user-initiated actions must not take it down); (d) drop `process.exit` interception entirely. The CIAO guard's `uncaughtException` listener was similarly gated to gateway processes; registering one in CLI processes turns Node's default crash-on-uncaught into silent absorb, which would silently hang `openclaw agent`.
- …`@homebridge/ciao` for mDNS announcement. Sandbox netns has no multicast, so ciao's probe state machine fails at startup.
- …`stageRuntimeDependencies=true`, so its npm deps (`@tencent-connect/qqbot-connector`, `silk-wasm`, etc.) install on first load. The sandbox L7 proxy denies the registry URL with `403 policy_denied`, the install retries for ~6 minutes, and while channel loading is stuck the gateway can't service `openclaw agent` requests. Both disabled via `plugins.entries.<id>.enabled = false` in `scripts/generate-openclaw-config.py`.
- `generate-openclaw-config.py`: main's PR #2449 ("fix: auto-disable device auth for non-loopback URLs (#2341)", commit `f5ee8a4d`) extracted the inline Python config-generator from Dockerfile into `scripts/generate-openclaw-config.py` and added `COPY scripts/generate-openclaw-config.py …` to Dockerfile, but did not update `src/lib/sandbox-build-context.ts`, which curates the optimized build context for sandbox image builds. Without this, every nightly E2E job (and any sandbox onboard) fails with `COPY failed: file not found in build context`. Added the file to `stageOptimizedSandboxBuildContext()` next to `nemoclaw-start.sh` and added a test assertion so the staging stays in sync.

Status
Most recent un-rate-limited run (25015126555 with build-context fix): 13 of 18 jobs pass.
`sandbox-operations-e2e` still fails, but only TC-SBX-02 (Connect & Chat) within it. All other TC-SBX cases (03, 04, 05, 06, 07, 08, 10, 11, 12) pass on `test-sbx-a`, confirming the gateway is functional. After the `sandbox-build-context.ts` fix and the qqbot disable, the failure mode of TC-SBX-02 changed from `SSH command timed out after 60s` to `Expected '42' in agent reply; reply=''`: the same 60-90 second hang, but now hitting the test's outer `run_with_timeout` rather than producing a stack trace. The test drops stderr (`2>/dev/null`), and the gateway-log streamer/snapshot infrastructure has been unable to capture `test-sbx-a`'s `/tmp/openclaw-998/openclaw-*.log` reliably (the post-test openshell state has no active gateway after TC-SBX-06's docker kill, and the streamer's connection to test-sbx-a races and gets `Connection refused`). Still root-causing.

Notable upstream changes (2026.4.9 → 2026.4.24)
- …`registerEmbeddedExtensionFactory()` to `registerAgentToolResultMiddleware()`: verified NemoClaw uses neither.
- …`plugins.installs` config key to managed `plugins/installs.json` ledger: `openclaw doctor --fix` migrates automatically.
- …`$include` mutations before falling back to full config write (root cause of fix #2).
- …`jobs-state.json` separation (2026.4.20).

User sandbox state migration on rebuild
Existing user sandboxes upgrade via `nemoclaw <name> rebuild`. State (memory/, workspace/, agents/, extensions/, etc.) is backed up via tar, the sandbox is destroyed and recreated with the new image, state is restored, and `openclaw doctor --fix` runs post-restore.

- Handled automatically: memory, cron job definitions, plugin auto-discovery, plugin registry migration.
- Existing reset behavior (not new): exec-approvals, credentials, device pairing.
- New minor behavior change: cron runtime state (`jobs-state.json`) is absent in pre-2026.4.20 backups, so job execution history resets and jobs may re-fire once after upgrade.

Test plan
- …`reply=''` hang under investigation; the gateway-log capture infrastructure needs to work reliably post-test before we can read what's happening server-side
- …`nemoclaw <sandbox> connect` interactive flow
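
Returning to the `_SANDBOX_SAFETY_NET` rewrite listed under "Fixes in this PR": its decision logic can be modelled as a small pure function. This is a Python model written for illustration, not the Node preload itself. The function names, the returned labels, and the single regex in `KNOWN_BENIGN` are all assumptions; only the gating condition (`OPENSHELL_SANDBOX=1` plus `argv[2] === 'gateway'`) and the three outcomes come from the description above.

```python
import re
from typing import Mapping, Sequence

# Assumption: one illustrative pattern standing in for the documented
# known-benign list (currently ciao/mDNS errors).
KNOWN_BENIGN = [re.compile(r"ciao|mdns", re.IGNORECASE)]


def is_gateway_process(env: Mapping[str, str], argv: Sequence[str]) -> bool:
    """Gate: only sandboxed gateway processes get the safety net."""
    return (
        env.get("OPENSHELL_SANDBOX") == "1"
        and len(argv) > 2
        and argv[2] == "gateway"
    )


def classify_uncaught(env: Mapping[str, str], argv: Sequence[str], message: str) -> str:
    """Decide what to do with an uncaught error in this process."""
    if not is_gateway_process(env, argv):
        # CLI commands keep the default crash-on-uncaught behaviour.
        return "crash"
    if any(pattern.search(message) for pattern in KNOWN_BENIGN):
        # Known-benign noise, e.g. a cancelled mDNS probe.
        return "swallow"
    # Unknown error: log the full stack but keep the shared gateway alive.
    return "log-and-continue"
```

For example, an `openclaw agent` invocation falls through to `"crash"` even inside the sandbox, which is the property that avoids the silent hang described for CLI processes.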