chore: upgrade OpenClaw from 2026.4.9 to 2026.4.24 by ericksoa · Pull Request #2484 · NVIDIA/NemoClaw

ericksoa · 2026-04-25T22:32:22Z

Summary

Upgrades OpenClaw from 2026.4.9 to 2026.4.24 (latest stable, CalVer).

Fixes in this PR

Version bumps — Dockerfile.base, nemoclaw-blueprint/blueprint.yaml, agents/openclaw/manifest.yaml, src/lib/sandbox-version.test.ts.
Patch 4 updated — OpenClaw 2026.4.24 restructured replaceConfigFile to first attempt tryWriteSingleTopLevelIncludeMutation (writes to a $include file like plugins.json5) before falling back to writeConfigFile. The old patch matched an exact tab-indented writeConfigFile(params.nextConfig, {...}) string that no longer exists. Updated to match the new if (!await tryWriteSingleTopLevelIncludeMutation(...)) await writeConfigFile(...) block and wrap the entire write path in the OPENSHELL_SANDBOX-gated EACCES try/catch.
plugin-runtime-deps symlink — OpenClaw 2026.4.24 introduced lazy plugin runtime-dep installation (Jiti loader). The CLI writes to ~/.openclaw/plugin-runtime-deps/openclaw-<version>-<hash>/ on first invocation. NemoClaw locks /sandbox/.openclaw to 444 root:root, so every bundled provider failed to load with EACCES. Fix: created the dir in the writable .openclaw-data tree and symlinked it from the immutable config tree, mirroring the existing pattern used for logs, credentials, extensions, etc. Added in both Dockerfile.base (canonical) and Dockerfile (idempotent fixup for stale GHCR base).
Selective sandbox safety-net — _SANDBOX_SAFETY_NET (a Node --require preload from nemoclaw-start.sh) used to be a catch-all swallow + process.exit interceptor. Rewritten to: (a) gate to gateway processes only (OPENSHELL_SANDBOX=1 + argv[2]==='gateway') so CLI commands keep default Node crash behaviour; (b) match documented known-benign patterns (currently ciao/mDNS — produced when bonjour's probe state machine cancels itself, since the sandbox netns has no multicast); (c) for unknown errors, log full stack but keep gateway alive (gateway is shared infrastructure, user-initiated actions must not take it down); (d) drop process.exit interception entirely. The CIAO guard's uncaughtException listener was similarly gated to gateway processes — registering one in CLI processes turns Node's default crash-on-uncaught into silent absorb, which would silently hang openclaw agent.
Disable bonjour and qqbot bundled plugins — both ship enabled-by-default in 2026.4.24 and break in the sandbox netns:
- bonjour: introduced in 2026.4.15, uses @homebridge/ciao for mDNS announcement. Sandbox netns has no multicast — ciao's probe state machine fails at startup.
- qqbot: has stageRuntimeDependencies=true, so its npm deps (@tencent-connect/qqbot-connector, silk-wasm, etc.) install on first load. The sandbox L7 proxy denies the registry URL with 403 policy_denied, the install retries for ~6 minutes, and while channel loading is stuck the gateway can't service openclaw agent requests. Both disabled via plugins.entries.<id>.enabled = false in scripts/generate-openclaw-config.py.
Build-context fix for generate-openclaw-config.py: main's PR #2449 ("fix: auto-disable device auth for non-loopback URLs (#2341)", commit f5ee8a4d) extracted the inline Python config-generator from Dockerfile into scripts/generate-openclaw-config.py and added COPY scripts/generate-openclaw-config.py … to Dockerfile, but did not update src/lib/sandbox-build-context.ts, which curates the optimized build context for sandbox image builds. Without this, every nightly E2E job (and any sandbox onboard) fails with COPY failed: file not found in build context. Added the file to stageOptimizedSandboxBuildContext() next to nemoclaw-start.sh and added a test assertion so the staging stays in sync.
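The bonjour/qqbot disable described above comes down to setting plugins.entries.<id>.enabled = false in the generated config. A minimal sketch of that step, assuming the config is a plain nested dict; the helper name `disable_plugins` is hypothetical and is not taken from scripts/generate-openclaw-config.py:

```python
from __future__ import annotations

import json


def disable_plugins(config: dict, plugin_ids: list[str]) -> dict:
    """Set plugins.entries.<id>.enabled = False for each id (hypothetical helper)."""
    entries = config.setdefault("plugins", {}).setdefault("entries", {})
    for plugin_id in plugin_ids:
        # Keep any existing per-plugin settings; only flip the enabled flag.
        entries.setdefault(plugin_id, {})["enabled"] = False
    return config


if __name__ == "__main__":
    cfg = disable_plugins({}, ["bonjour", "qqbot"])
    print(json.dumps(cfg, indent=2))
```

Using setdefault keeps the helper safe to run against a config that already carries other plugin entries.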

Status

Most recent un-rate-limited run (25015126555 with build-context fix): 13 of 18 jobs pass. sandbox-operations-e2e still fails — only TC-SBX-02 (Connect & Chat) within it. All other TC-SBX cases (03, 04, 05, 06, 07, 08, 10, 11, 12) pass on test-sbx-a, confirming the gateway is functional. After the sandbox-build-context.ts fix and the qqbot disable, the failure mode of TC-SBX-02 changed from SSH command timed out after 60s to Expected '42' in agent reply; reply='' — same 60-90 second hang but now hitting the test's outer run_with_timeout rather than producing a stack trace. The test drops stderr (2>/dev/null), and the gateway-log streamer/snapshot infrastructure has been unable to capture test-sbx-a's /tmp/openclaw-998/openclaw-*.log reliably (the post-test openshell state has no active gateway after TC-SBX-06's docker kill, and the streamer's connection to test-sbx-a races and gets Connection refused). Still root-causing.

Notable upstream changes (2026.4.9 → 2026.4.24)

Google Meet bundled plugin, DeepSeek V4 Flash/Pro, realtime voice loops (Talk/Voice Call/Google Meet), Gemini Live, browser automation improvements.
Lighter startup: static model catalogs, manifest-backed model rows, lazy provider dependencies (the new plugin-runtime-deps mechanism, root cause of fix #3).
Breaking: Plugin SDK tool-result transforms migrated from registerEmbeddedExtensionFactory() to registerAgentToolResultMiddleware() — verified NemoClaw uses neither.
Breaking: Plugin registry migrated from plugins.installs config key to managed plugins/installs.json ledger — openclaw doctor --fix migrates automatically.
Config writes restructured to use single-file $include mutations before falling back to full config write (root cause of fix #2).
CVE-2026-41349, CVE-2026-22181 fixes; exec-approvals chat enablement (2026.4.22); cron jobs-state.json separation (2026.4.20).
bonjour mDNS plugin added in 2026.4.15 (root cause of fix #5a).

User sandbox state migration on rebuild

Existing user sandboxes upgrade via nemoclaw <name> rebuild. State (memory/, workspace/, agents/, extensions/, etc.) is backed up via tar, the sandbox is destroyed and recreated with the new image, the state is restored, and openclaw doctor --fix runs post-restore.

Handled automatically: memory, cron job definitions, plugin auto-discovery, plugin registry migration. Existing reset behavior (not new): exec-approvals, credentials, device pairing. New minor behavior change: cron runtime state (jobs-state.json) is absent from pre-2026.4.20 backups, so job execution history resets and jobs may re-fire once after upgrade.

Test plan

CI lint, typecheck, unit tests pass
Docker base image and sandbox image build with all dist patches applied
13/18 nightly E2E jobs pass cleanly with all six fixes
TC-SBX-02 — root cause for the residual reply='' hang under investigation; the gateway-log capture infrastructure needs to work reliably post-test before we can read what's happening server-side
Manual smoke test via nemoclaw <sandbox> connect interactive flow
Rebuild test: existing 2026.4.9 sandbox → rebuild → verify state preserved (rebuild-openclaw-e2e covers this)

Bump the pinned OpenClaw version across all version-tracking files (Dockerfile.base, blueprint.yaml, manifest.yaml, and version tests) to the latest stable release.

copy-pr-bot · 2026-04-25T22:32:25Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

coderabbitai · 2026-04-25T22:32:28Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.


Walkthrough

Updates OpenClaw from version 2026.4.9 to 2026.4.24 across build configuration, manifests, and tests. Introduces plugin runtime dependencies cache directory with proper permissions and group configuration. Implements new config writing API with sandbox error handling for read-only environments.

Changes

Cohort / File(s)	Summary
Version Upgrades `Dockerfile.base`, `agents/openclaw/manifest.yaml`, `nemoclaw-blueprint/blueprint.yaml`	Bump OpenClaw version from 2026.4.9 to 2026.4.24 across build configuration and manifest declarations.
Dockerfile Configuration `Dockerfile`	Implements new OpenClaw 2026.4.24+ config writing via `tryWriteSingleTopLevelIncludeMutation` with `writeConfigFile` fallback. Adds error handling for `EACCES` in sandboxes. Creates `/sandbox/.openclaw-data/plugin-runtime-deps` directory with group-write permissions (setgid/2775) to allow gateway user write access.
Test Updates `src/lib/sandbox-version.test.ts`	Update test fixtures and assertions to expect OpenClaw version 2026.4.24 across mocked agent definitions, version comparisons, and staleness warnings.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Hopping through configs, the version's bumped high,
From point-nine to point-twenty-four in the sky!
Plugin deps find a cache with a gateway's new right,
Sandboxes protected from permission-denied plight.
A safer, stronger OpenClaw, shiny and bright! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title 'chore: upgrade OpenClaw from 2026.4.9 to 2026.4.24' accurately reflects the primary change across the changeset—upgrading the OpenClaw version and updating all related version references.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

OpenClaw 2026.4.24 restructured replaceConfigFile to first attempt a single-key include-file mutation (tryWriteSingleTopLevelIncludeMutation) before falling back to writeConfigFile. Both paths can EACCES in the read-only sandbox. Update the pattern match to wrap the entire write block in the OPENSHELL_SANDBOX-gated try/catch.

olegshilov

lgtm

Capture the SSH-shell environment (HTTP_PROXY, HTTPS_PROXY, NO_PROXY, OPENCLAW_GATEWAY_URL/TOKEN, OPENSHELL_SANDBOX, NVIDIA_API_KEY) before the agent invocation, and bump the failure-message capture from head -3 to head -20 so the full reply (including any gateway/embedded fallback errors) shows in CI logs. Diagnostic-only — no behavior change.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/test-sandbox-operations.sh`:
- Line 282: The diag_env diagnostic line leaks secrets by expanding the token
values; replace the unsafe expansions
`${OPENCLAW_GATEWAY_TOKEN:+set}${OPENCLAW_GATEWAY_TOKEN:-unset}` and the
analogous `NVIDIA_API_KEY` expansion in the sandbox_exec invocation so they
never emit the variable contents, and instead emit only the literal "set" or
"unset"; implement this by checking each variable's presence (e.g., an explicit
conditional or test for non-empty) and printing "set" when present or "unset"
when not, updating the diag_env/sandbox_exec call accordingly to reference
OPENCLAW_GATEWAY_TOKEN and NVIDIA_API_KEY securely.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5161bcbc-13b7-4cd0-8a9d-5d0f0d383403

📥 Commits

Reviewing files that changed from the base of the PR and between 5dcb0a9 and 2aacc51.

📒 Files selected for processing (1)

test/e2e/test-sandbox-operations.sh

OpenClaw 2026.4.24 lazy-installs bundled plugin runtime dependencies into ~/.openclaw/plugin-runtime-deps/openclaw-<version>-<hash>/ on first CLI invocation (Jiti-based loader, "lazy provider dependencies" in 2026.4.20+ release notes). NemoClaw locks /sandbox/.openclaw to 444 root:root, so every bundled plugin (nvidia, openai, anthropic, ollama, ...) failed to load with EACCES, leaving `openclaw agent` with zero providers — the exact symptom in TC-SBX-02 (no agent reply, only proxy warnings). Mirror the existing .openclaw-data symlink pattern: create the dir in the writable data tree and symlink it from the immutable config tree. Add to both Dockerfile.base (canonical setup) and Dockerfile (idempotent fixup for stale GHCR bases).
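The symlink pattern in this commit message can be sketched as follows. This is an illustration only: the helper name `link_into_data_tree` is hypothetical, the demo runs in a temporary directory rather than /sandbox, and the real change lives in the Dockerfiles, not in Python.

```python
import os
import tempfile


def link_into_data_tree(config_dir: str, data_dir: str, name: str) -> str:
    """Create data_dir/name and make config_dir/name a symlink to it."""
    target = os.path.join(data_dir, name)
    link = os.path.join(config_dir, name)
    os.makedirs(target, exist_ok=True)
    if os.path.islink(link):
        # Idempotent: replace a stale link instead of failing on re-run.
        os.unlink(link)
    os.symlink(target, link)
    return link


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        config_dir = os.path.join(root, ".openclaw")
        data_dir = os.path.join(root, ".openclaw-data")
        os.makedirs(config_dir)
        os.makedirs(data_dir)
        link = link_into_data_tree(config_dir, data_dir, "plugin-runtime-deps")
        # A write through the link lands in the writable data tree.
        with open(os.path.join(link, "probe.txt"), "w") as fh:
            fh.write("ok")
        print(os.path.realpath(link))
```

The link has to exist before the config tree is locked down; after that, only the target under the data tree needs to stay writable.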

…load OpenClaw 2026.4.24+ lazy-installs and Jiti-compiles ~50 bundled plugin runtime deps on the first agent invocation in a fresh sandbox. Even with deps pre-cached at build time, the plugin registry bootstrap + provider warmup + LLM round-trip on the first call can exceed the existing 60s SSH timeout (was completing in ~20s on 2026.4.9). Make sandbox_exec_for accept an optional timeout argument (default 60, preserves all other call sites) and have TC-SBX-02 pass 240s. The openclaw agent CLI's own --timeout default is 600s so 240s leaves plenty of headroom for the inference call itself.

coderabbitai

♻️ Duplicate comments (1)

test/e2e/test-sandbox-operations.sh (1)

286-286: ⚠️ Potential issue | 🔴 Critical

Sensitive values can still be exposed in diagnostics.

Line 286 uses ${OPENCLAW_GATEWAY_TOKEN:+set}${OPENCLAW_GATEWAY_TOKEN:-unset} (and the same for NVIDIA_API_KEY), which includes the secret value when set. This can leak credentials into CI logs.

🔧 Proposed fix

-  diag_env=$(sandbox_exec 'echo HTTP_PROXY=${HTTP_PROXY:-unset}; echo HTTPS_PROXY=${HTTPS_PROXY:-unset}; echo NO_PROXY=${NO_PROXY:-unset}; echo OPENCLAW_GATEWAY_URL=${OPENCLAW_GATEWAY_URL:-unset}; echo OPENCLAW_GATEWAY_TOKEN=${OPENCLAW_GATEWAY_TOKEN:+set}${OPENCLAW_GATEWAY_TOKEN:-unset}; echo OPENSHELL_SANDBOX=${OPENSHELL_SANDBOX:-unset}; echo NVIDIA_API_KEY=${NVIDIA_API_KEY:+set}${NVIDIA_API_KEY:-unset}' 2>&1) || true
+  diag_env=$(sandbox_exec 'echo HTTP_PROXY=${HTTP_PROXY:-unset}; echo HTTPS_PROXY=${HTTPS_PROXY:-unset}; echo NO_PROXY=${NO_PROXY:-unset}; echo OPENCLAW_GATEWAY_URL=${OPENCLAW_GATEWAY_URL:-unset}; echo OPENCLAW_GATEWAY_TOKEN=$([ -n "${OPENCLAW_GATEWAY_TOKEN:-}" ] && echo set || echo unset); echo OPENSHELL_SANDBOX=${OPENSHELL_SANDBOX:-unset}; echo NVIDIA_API_KEY=$([ -n "${NVIDIA_API_KEY:-}" ] && echo set || echo unset)' 2>&1) || true
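Why the original expansion leaks can be checked with a small model of the two POSIX parameter expansions involved. The sketch below is Python rather than shell and only mirrors the semantics of ${VAR:+word} and ${VAR:-word}; the function names and the placeholder token are illustrative.

```python
def expand_plus(value, word):
    """Model of ${VAR:+word}: word if VAR is set and non-empty, else empty."""
    return word if value else ""


def expand_minus(value, word):
    """Model of ${VAR:-word}: VAR's value if set and non-empty, else word."""
    return value if value else word


def leaky_status(value):
    # Mirrors ${VAR:+set}${VAR:-unset}: the second expansion yields the value itself.
    return expand_plus(value, "set") + expand_minus(value, "unset")


def safe_status(value):
    # Mirrors the proposed fix: report presence only, never the contents.
    return "set" if value else "unset"


if __name__ == "__main__":
    secret = "example-token"  # placeholder, not a real credential
    print(leaky_status(secret))   # "set" followed by the secret
    print(safe_status(secret))    # presence only
    print(leaky_status(None), safe_status(None))
```

When the variable is unset both forms print "unset", which is why the leak only shows up in runs where the credential is actually present.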

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-sandbox-operations.sh` at line 286, The diagnostic command
leaks secret values because
`${OPENCLAW_GATEWAY_TOKEN:+set}${OPENCLAW_GATEWAY_TOKEN:-unset}` (and the
NVIDIA_API_KEY variant) concatenates "set" with the actual secret; change the
diagnostic to print only "set" or "unset" without expanding the value by
replacing those expansions with a conditional-only check (e.g., use a single
parameter expansion or an explicit test) inside the sandbox_exec invocation so
OPENCLAW_GATEWAY_TOKEN and NVIDIA_API_KEY are never interpolated into the logged
string.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: acfac00c-0120-4ef6-ac19-94ac3a5d1d09

📥 Commits

Reviewing files that changed from the base of the PR and between e1f1be8 and 1e512b1.

📒 Files selected for processing (1)

test/e2e/test-sandbox-operations.sh

Reverts 2aacc51 and 1e512b1. The test contract (run openclaw agent via SSH and assert the reply contains the expected token) stays as-is. Real fix belongs in NemoClaw, not the test harness.

Add gateway to the sandbox supplementary group and set 2775 (setgid + group-write) on /sandbox/.openclaw-data/plugin-runtime-deps. OpenClaw 2026.4.24+ runs its plugin loader on both the sandbox-side CLI and the gateway side; both paths call withBundledRuntimeDepsInstallRootLock, which mkdirSyncs a lock dir under the install root. The original NemoClaw user-isolation design has gateway and sandbox in different primary groups so the sandbox user cannot tamper with the gateway. Before 2026.4.24 the plugin loader did not need write access from the gateway side; now it does, and EACCES on the lock dir caused the gateway to fail mid-request, leaving the agent CLI hanging silently on the unanswered WebSocket call. Adding gateway to sandbox as a supplementary group preserves the original boundary (sandbox still cannot affect gateway-owned resources) and only opens gateway → sandbox-owned shared cache. Setgid bit ensures new files created by either user inherit the sandbox group. Mirrored in both Dockerfile.base (canonical) and Dockerfile (idempotent fixup for stale GHCR base images).
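For readers unfamiliar with the 2775 mode, the sketch below decodes it with Python's standard stat module. It only inspects the numeric mode from the commit message; it does not touch the filesystem, and the helper name `describe_dir_mode` is illustrative.

```python
import stat

MODE = 0o2775  # mode quoted in the commit message


def describe_dir_mode(mode: int) -> dict:
    """Break a directory mode into the bits that matter for the shared cache."""
    return {
        "symbolic": stat.filemode(stat.S_IFDIR | mode),
        "setgid": bool(mode & stat.S_ISGID),       # new entries inherit the dir's group
        "group_write": bool(mode & stat.S_IWGRP),  # group members may create files
        "other_write": bool(mode & stat.S_IWOTH),  # everyone else stays read-only
    }


if __name__ == "__main__":
    for key, value in describe_dir_mode(MODE).items():
        print(f"{key}: {value}")
```

The combination of group-write and setgid is what lets two users in the same group share the directory without either owning the other's files outright.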

coderabbitai

🧹 Nitpick comments (1)

Dockerfile (1)
186-187: Pattern matching in minified JS is fragile.

The Python patch uses exact string matching including literal tabs (\t) and newlines (\n). Minified JavaScript bundles often vary in whitespace formatting between versions or build environments. The assertion assert old in src will fail-close (which is good), but consider:

The pattern assumes specific formatting that may not survive re-minification

Upstream OpenClaw version bumps could silently change whitespace

The fail-close behavior is correct — the build aborts if the pattern isn't found. However, when this inevitably breaks on a future OpenClaw bump, debugging the exact whitespace mismatch will be tedious.
💡 Alternative: Consider regex-based patching for resilience

A more robust approach would use regex matching that's whitespace-tolerant:
import re
pattern = re.compile(
    r'if\s*\(\s*!\s*await\s+tryWriteSingleTopLevelIncludeMutation\s*\(\s*\{[^}]+\}\s*\)\s*\)\s*await\s+writeConfigFile\s*\([^;]+\);',
    re.DOTALL
)
This would survive minor formatting changes. However, the current exact-match approach is acceptable given the fail-close assertion — just be prepared for patch maintenance on version bumps.
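A runnable sketch of the regex-based variant, kept fail-closed like the exact-match patch. The SAMPLE string is a hypothetical stand-in for the bundled JavaScript, and the catch body is an assumption about what an OPENSHELL_SANDBOX-gated EACCES handler could look like; neither is copied from the Dockerfile.

```python
import re

# Hypothetical stand-in for the bundled JS; the real OpenClaw text is not shown here.
SAMPLE = (
    "async function replaceConfigFile(params) {\n"
    "\tif (!await tryWriteSingleTopLevelIncludeMutation({ next: params.nextConfig }))"
    " await writeConfigFile(params.nextConfig, { reason: 'replace' });\n"
    "}\n"
)

WRITE_BLOCK = re.compile(
    r"if\s*\(\s*!\s*await\s+tryWriteSingleTopLevelIncludeMutation\s*\(\s*\{[^}]+\}\s*\)\s*\)"
    r"\s*await\s+writeConfigFile\s*\([^;]+\);",
    re.DOTALL,
)


def wrap_write_block(src: str) -> str:
    """Wrap the matched write block in a sandbox-gated EACCES try/catch."""

    def _wrap(match):
        # Assumed handler shape: swallow EACCES only when OPENSHELL_SANDBOX is set.
        return (
            "try { " + match.group(0) + " } catch (err) { "
            "if (!(process.env.OPENSHELL_SANDBOX && err.code === 'EACCES')) throw err; }"
        )

    patched, count = WRITE_BLOCK.subn(_wrap, src, count=1)
    # Fail closed: abort rather than ship an unpatched bundle.
    if count != 1:
        raise RuntimeError("write block not found; patch needs updating")
    return patched


if __name__ == "__main__":
    print(wrap_write_block(SAMPLE))
```

Passing a function to subn avoids backslash-escaping issues in the replacement text, and checking the substitution count preserves the build-aborting behaviour of the original assert.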
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 186 - 187, The current Python one-liner patches the
minified JS by exact string match of the
tryWriteSingleTopLevelIncludeMutation/writeConfigFile block (the variables
old/new and the assert old in src), which is fragile against
whitespace/minification changes; change the script to use a regex-based,
whitespace-tolerant search (e.g., compile a pattern that matches the if(!await
tryWriteSingleTopLevelIncludeMutation(...)) await writeConfigFile(...) block
with \s* and re.DOTALL) and perform a re.sub to inject the new try { ... }
catch(...) wrapper, then update the assertion to check the regex matched (or
that the file changed) instead of relying on the literal old string.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 26c92d4a-9980-47a5-8dc7-a8dc2fab2065

📥 Commits

Reviewing files that changed from the base of the PR and between 1e512b1 and 521c599.

📒 Files selected for processing (2)

Dockerfile
Dockerfile.base

🚧 Files skipped from review as they are similar to previous changes (1)

Dockerfile.base

…che" This reverts commit 521c599.

The _SANDBOX_SAFETY_NET preload was loaded via NODE_OPTIONS=--require into EVERY Node process in the sandbox, including short-lived CLI commands like `openclaw agent`. It installed an unconditional `unhandledRejection` handler that swallows the rejection, designed to keep the long-running gateway alive across non-fatal library bugs.

In OpenClaw 2026.4.9 the agent CLI's code paths didn't trip an unhandled rejection, so the swallow was harmless there. In 2026.4.24 the new plugin loader / gateway client path produces an unhandled rejection from `openclaw agent`. Instead of surfacing as an error, the safety net ate it and the awaited Promise never resolved, leaving the CLI hanging silently on a request that should have failed fast. This is the exact symptom in TC-SBX-02: two UNDICI warnings (process startup) followed by minutes of silence with no error output.

Gate the swallow to argv[2] === "gateway" so the protection is scoped to its actual purpose (`openclaw gateway run …`). All other CLI commands (agent, doctor, plugins, tui) get default Node behavior: errors surface and short-lived processes exit cleanly with a meaningful exit code.
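The gating decision can be modelled as a small predicate. The real preload is JavaScript loaded through NODE_OPTIONS; the Python below is only a model of the decision logic, with argv standing in for Node's process.argv. The OPENSHELL_SANDBOX=1 check comes from the PR description, and the single benign pattern listed is an assumed example, not the preload's actual list.

```python
# Assumed example of a known-benign pattern; the real list is not shown in this PR page.
KNOWN_BENIGN = ("CIAO PROBING CANCELLED",)


def is_gateway_process(env: dict, argv: list) -> bool:
    """True only for `openclaw gateway ...` running inside the sandbox."""
    # argv mirrors Node's process.argv: [node, script, subcommand, ...].
    return env.get("OPENSHELL_SANDBOX") == "1" and len(argv) > 2 and argv[2] == "gateway"


def rejection_policy(env: dict, argv: list, message: str) -> str:
    """Decide what to do with an unhandled rejection."""
    if not is_gateway_process(env, argv):
        return "default"           # CLI commands keep the runtime's crash behaviour
    if any(pattern in message for pattern in KNOWN_BENIGN):
        return "ignore"            # documented benign noise
    return "log-and-continue"      # unknown error: log the stack, keep the gateway up


if __name__ == "__main__":
    env = {"OPENSHELL_SANDBOX": "1"}
    print(rejection_policy(env, ["node", "openclaw", "agent"], "boom"))
    print(rejection_policy(env, ["node", "openclaw", "gateway", "run"], "boom"))
```

Keeping the gate as a pure function of the environment and argv makes the CLI-versus-gateway split easy to unit test without spawning processes.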

…lure

TC-SBX-02 hangs without surfacing any error: with the safety-net gate fix, errors should now propagate on the agent CLI side, but we see only Node UNDICI warnings then 60s of silence. The remaining hypothesis is that the gateway-side `agent` method handler hits an error that's swallowed by the gateway's still-active safety net (intentional, keeps gateway alive), leaving the client awaiting a response that never comes.

To prove or refute this, the gateway log content during the hang must be visible in the failed test artifact. The test framework captures only the test runner's own log (and the agent CLI's SSH output, which is silent). /tmp/gateway.log inside the sandbox container has the data we need.

Two-part diagnostic, not a behavior change:

1. nemoclaw-start.sh: background-tail /tmp/gateway.log with a [gateway-log:] prefix to PID 1's stderr after gateway launch. Each gateway-log line now appears in the container's stderr stream (and is filterable by prefix). Cleanup: tail PID added to SANDBOX_CHILD_PIDS so cleanup_on_signal reaps it on shutdown. Both root and non-root launch paths covered.
2. nightly-e2e.yaml sandbox-operations-e2e: on failure, run `docker logs` on every test-sbx-* container and upload as a separate artifact (sandbox-operations-docker-logs). The artifact will contain the gateway log content (now mirrored to container stderr) at the time of failure.

This is a NemoClaw-side and workflow-level change (no test changes; the test contract for TC-SBX-02 is unchanged). The runtime diagnostic is permanent but additive; it can be removed once the upstream root cause is identified and fixed.

Ref: #2484

The previous post-failure docker logs capture step ran AFTER the test script's teardown destroyed test sandbox containers — so `docker ps -a` returned no matches and the artifact was empty. Replace with a background `docker logs -f` streamer started before the test runs. As soon as a container appears, its logs stream to a per-container file in docker-logs/. When the container is removed, the stream ends but the file persists on the host. The post-failure artifact upload now captures logs from every container that existed at any point during the test. Combined with the [gateway-log:] mirror in nemoclaw-start.sh, this finally surfaces gateway-side activity (including any sandbox-safety-net swallowed errors) at the time TC-SBX-02 hangs. Ref: #2484

The previous docker-logs streamer hit "configured logging driver does not support reading" for sandbox containers. NemoClaw sandboxes are k3s pods INSIDE the openshell-cluster container, not sibling docker containers — `docker logs` cannot read pod stdio. Switch to `docker exec openshell-cluster-* kubectl logs -f -n openshell <pod> --all-containers` to stream pod logs (which include PID 1's stderr mirror of /tmp/gateway.log via the [gateway-log:] prefix added in nemoclaw-start.sh). Output goes to per-pod files on the host that persist past pod deletion. Ref: #2484

The kubectl-logs streamer also returned empty files because the container log driver in openshell's k3s setup doesn't capture container stdio (same root cause as the docker logs failure). The only working way to read /tmp/gateway.log content from outside the pod is via SSH — which `nemoclaw <sandbox> logs --follow` does internally. Switch the streamer to `nemoclaw <name> logs --follow > docker-logs/sandbox-<name>.log`. The streamer waits for nemoclaw to be installed (test does that in its first phase), polls `nemoclaw list`, and spawns a follower per sandbox. Ref: #2484

The previous `nemoclaw logs --follow` per-sandbox streamer accumulated unbounded output and the artifact upload step never finished within the 60-min job timeout (run 24968594521 was cancelled at 1h+ stuck on Upload sandbox gateway logs). Switch to snapshot mode: every 10s, run `timeout 8 nemoclaw <name> logs` and overwrite docker-logs/sandbox-<name>.log with the result, capped at 256KB. The default `nemoclaw logs` invocation returns ~62 lines (already bounded by /tmp/gateway.log size at snapshot time). When a sandbox is destroyed by the test, the file holds the final pre-destroy snapshot. Ref: #2484

The previous streamer parsed `nemoclaw list` pretty-printed output and picked up the "Sandboxes:" header line whose first token literally is "Sandboxes:" (with colon). Tried to create docker-logs/sandbox-Sandboxes:.log which GitHub artifact upload rejects ("not a valid path: contains colon"). Read the registry json directly (~/.nemoclaw/sandboxes.json) via jq and only accept names matching strict filename-safe pattern [a-z0-9_-]+ — defense against future parsing issues too. Ref: #2484
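The name filtering described here can be sketched as below. The workflow itself uses jq; this Python version only illustrates the strict-pattern check, and the registry layout (a top-level "sandboxes" object keyed by name) is an assumption, not the documented format of ~/.nemoclaw/sandboxes.json.

```python
import json
import re

SAFE_NAME = re.compile(r"[a-z0-9_-]+")


def safe_sandbox_names(registry_text: str) -> list:
    """Return sandbox names that are safe to embed in a log filename."""
    data = json.loads(registry_text)
    # Assumed layout: {"sandboxes": {"<name>": {...}, ...}}
    names = data.get("sandboxes", {})
    # fullmatch rejects colons, slashes, uppercase and anything else outside the set.
    return sorted(n for n in names if isinstance(n, str) and SAFE_NAME.fullmatch(n))


if __name__ == "__main__":
    sample = '{"sandboxes": {"test-sbx-a": {}, "Sandboxes:": {}, "../escape": {}}}'
    print(safe_sandbox_names(sample))
```

Using fullmatch rather than match matters here: a prefix match would accept a name that merely starts with safe characters.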

The previous snapshot-based streamer (overwriting per-sandbox file every 10s with `nemoclaw logs` output) lost the agent-request events because `nemoclaw logs` returns only the tail of /tmp/gateway.log and the ciao mDNS error spam (~10 errors/sec) buries earlier real events. Switch to a per-sandbox SSH+tail follower that streams /tmp/gateway.log directly (full stream from start), filters the uv_interface_addresses noise inline, and caps each file at 512KB. Spawned once per sandbox via openshell ssh-config. Stop step kills the SSH followers along with the streamer. Ref: #2484

Previous streamer wrote ssh config via mktemp and rm'd it before the backgrounded ssh child connected — ssh hit "Can't open user config file" race. Use a per-sandbox stable path /tmp/sshcfg-<name>.tmp and don't remove it; runner /tmp gets cleaned up at job end anyway. Ref: #2484

The bash -c '...' single-quoted block had apostrophes inside its comments (Can't, `rm`) which prematurely terminated the outer single quote, leaving the rest of the script with unbalanced quotes — bash exited with "unexpected EOF while looking for matching `\"'" within 6 seconds of job start. Reword comments to avoid apostrophes. Ref: #2484

`head -c 524288` blocked waiting for 512KB to arrive through the tail | grep pipe. Most lines are mDNS noise that grep -v drops, so useful content arrives slowly. When the streamer was killed at job end, head had captured zero bytes — final file was just the SSH disconnect message (43b). Drop the head -c cap so output streams freely while the job runs. As safety against runaway file size, trim each log file to its last 5MB at stop time. Real gateway events are interleaved with whatever filtered content remains, so tail-trim keeps the most recent content (which includes the TC-SBX-02 hang window). Ref: #2484
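The stop-time trim can be sketched as a keep-the-last-N-bytes rewrite. The workflow does this in shell; the Python helper below, with the hypothetical name `trim_to_tail`, shows the same idea under the assumption that the follower has already stopped writing to the file.

```python
import os
import tempfile


def trim_to_tail(path: str, max_bytes: int) -> int:
    """Rewrite path so it holds only its last max_bytes bytes; return the new size."""
    size = os.path.getsize(path)
    if size <= max_bytes:
        return size  # already within the cap, leave untouched
    with open(path, "rb") as fh:
        fh.seek(size - max_bytes)
        tail = fh.read()
    with open(path, "wb") as fh:
        fh.write(tail)
    return len(tail)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * 100 + b"most recent events")
    print(trim_to_tail(tmp.name, 18))
    with open(tmp.name, "rb") as fh:
        print(fh.read())
    os.unlink(tmp.name)
```

Trimming after the fact, rather than capping inline, avoids the buffering stall described above because nothing in the live pipe waits on a byte count.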

The gateway log line "log file: /tmp/openclaw-998/openclaw-2026-04-27.log" revealed that openclaw writes detailed event tracing to a SEPARATE file than /tmp/gateway.log (which only captures the launch redirect of stdout/stderr from nemoclaw-start.sh). The structured log carries the agent-flow events we need; gateway.log silenced after startup because most subsequent events go to the structured log instead. Tail BOTH files in the same SSH session so we capture all gateway-side activity during TC-SBX-02. Glob /tmp/openclaw-*/openclaw-*.log to handle the per-uid stem (e.g. openclaw-998). Ref: #2484

Root cause of TC-SBX-02 hang, now fully traced via the gateway-log streamer artifact:

The bonjour plugin (mDNS service advertiser) attempts to probe network interfaces via ciao every few seconds. Inside the sandbox netns, os.networkInterfaces() throws (no usable interfaces). The ciao guard in nemoclaw-start.sh monkey-patches os.networkInterfaces to return empty, but that does not stop ciao from cancelling its outstanding probe with "CIAO PROBING CANCELLED", an UNHANDLED Promise rejection (the ciao guard only catches synchronous uncaughtException, not async). The sandbox-safety-net swallows the rejection (gateway-only after the recent gate fix), but the swallow happens during the same event loop tick as in-flight WebSocket handshakes from the openclaw agent CLI. Pending WS connections get torn down with code 1006 (abnormal closure):

03:17:39.367 Unhandled promise rejection: CIAO PROBING CANCELLED
03:17:39.387 [gateway/ws] closed before connect ... code=1006 (handshake pending, durationMs=7)

The agent CLI sees the abrupt close, retries, hits the same race, eventually times out at the 10s connect-challenge timeout. Test only sees UNDICI warnings because the CLI's `console.error` failure message goes to /tmp/openclaw-<uid>/openclaw-<date>.log (the structured event log), not stdout/stderr, so the test framework never sees it.

Why TC-SBX-02 worked on 2026.4.9 but not 2026.4.24: bonjour plugin loading and probe timing changed in the 2026.4.10–24 range (Jiti-based plugin loader, lazy provider deps), making the rejection window overlap WS handshakes more aggressively. On 2026.4.9 the timing was lucky enough that the rejection never overlapped a real connect.

Fix: set plugins.entries.bonjour.enabled=false in the generated openclaw.json. mDNS service advertisement is useless inside a sandboxed netns (no peers to advertise to, no clients to discover the service) and the only thing it accomplishes here is destabilizing other gateway connections.

Ref: #2484

ericksoa · 2026-04-27T03:34:32Z

Root cause: bonjour mDNS plugin destabilizes WS connections in sandbox netns

After significant diagnostic plumbing (the openclaw structured event log lives at /tmp/openclaw-<uid>/openclaw-<date>.log, not /tmp/gateway.log), the gateway-log streamer artifact (workflow sandbox-operations-docker-logs) finally captured the failure window for TC-SBX-02. Smoking-gun timeline from the gateway log:

03:17:39.354  [plugins] bonjour: restarting advertiser (service stuck in probing)
03:17:39.367  [openclaw] Unhandled promise rejection: CIAO PROBING CANCELLED
03:17:39.370  wrote stability bundle (rejection logged)
03:17:39.387  [gateway/ws] closed before connect conn=... code=1006 reason=n/a
              (handshake pending, durationMs=7)

About 20 ms (03:17:39.367 to 03:17:39.387) separate the unhandled rejection from the bonjour plugin and the abrupt WebSocket close.
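The gap can be checked directly from the timestamps in the excerpt above. The following is a minimal sketch that assumes the millisecond precision shown in the log lines is all that is available; the helper name `gap_ms` is hypothetical.

```python
from datetime import datetime, timedelta


def gap_ms(earlier: str, later: str) -> int:
    """Return the whole-millisecond gap between two HH:MM:SS.mmm timestamps."""
    fmt = "%H:%M:%S.%f"  # %f accepts 1 to 6 fractional digits
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    # Integer division by a 1 ms timedelta avoids float rounding.
    return delta // timedelta(milliseconds=1)


# Timestamps copied from the gateway log excerpt.
rejection = "03:17:39.367"
ws_close = "03:17:39.387"
print(gap_ms(rejection, ws_close))
```

At the precision shown, the rejection and the WebSocket close are 20 ms apart; the underlying log may carry finer-grained timestamps than the excerpt does.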

Causal chain

1. The bonjour plugin (mDNS service advertiser) attempts to probe network interfaces every few seconds.
2. The sandbox netns has no usable interfaces → os.networkInterfaces() throws.
3. NemoClaw's ciao guard (in nemoclaw-start.sh) monkey-patches os.networkInterfaces to return empty on failure, BUT that doesn't stop ciao from cancelling its in-flight probe with "CIAO PROBING CANCELLED", which surfaces as an unhandled Promise rejection.
4. The ciao guard only catches synchronous uncaughtException, not async unhandledRejection.
5. The sandbox-safety-net catches the rejection (gateway-only after the earlier gate fix in this PR), but the swallow happens during the same event loop tick as in-flight WebSocket handshakes.
6. Pending WS connections from the openclaw agent CLI get torn down with code 1006 (abnormal closure).
7. The agent CLI retries, hits the same race, and eventually times out at the 10s connect-challenge timeout.
8. The CLI's console.error failure message goes to the openclaw structured log, NOT stdout/stderr. That's why the test only ever saw the two UNDICI warnings followed by 60s of silence.
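The safety-net's role in this chain (gateway-only gating, then matching a known-benign pattern) can be sketched as decision logic. This is an illustration in Python only: the real preload is a Node `--require` script, and the function names, return labels, and pattern list below are assumptions. The only pattern the PR names is ciao/mDNS. The sketch does not model the event-loop race itself.

```python
import re

# Assumption: an illustrative pattern list; the PR only names ciao/mDNS.
KNOWN_BENIGN = [re.compile(r"CIAO PROBING CANCELLED", re.IGNORECASE)]


def is_gateway_process(env: dict, argv: list) -> bool:
    """Mirror the gate described in the PR: OPENSHELL_SANDBOX=1 and argv[2] == 'gateway'.

    argv follows Node's layout: [node, script, subcommand, ...].
    """
    return env.get("OPENSHELL_SANDBOX") == "1" and len(argv) > 2 and argv[2] == "gateway"


def classify_rejection(env: dict, argv: list, message: str) -> str:
    """Return a hypothetical label for how a rejection would be handled."""
    if not is_gateway_process(env, argv):
        return "default"  # CLI processes keep the runtime's normal crash behaviour
    if any(p.search(message) for p in KNOWN_BENIGN):
        return "benign"  # documented pattern: absorb
    return "log-and-continue"  # unknown error: log the stack, keep the gateway alive


gateway_argv = ["node", "openclaw", "gateway"]
print(classify_rejection({"OPENSHELL_SANDBOX": "1"}, gateway_argv, "CIAO PROBING CANCELLED"))
```

Even with a classification like this in place, the timeline shows the handshake still being dropped, which is why the fix removes the source of the rejection rather than refining the handler.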

Why this surfaces in 2026.4.24 but not 2026.4.9

Plugin load timing changed in the 2026.4.10–24 range (Jiti-based plugin loader, "lazy provider dependencies" in the release notes). The bonjour rejection window now overlaps WS handshakes more aggressively. On 2026.4.9 the timing was a lucky race; on 2026.4.24 it reliably hits.

Why disabling bonjour is the right fix

mDNS service advertisement is structurally useless inside a NemoClaw sandbox:

The sandbox netns is isolated — there are no peers on the network to advertise the gateway to
The only way the gateway is reached from outside the sandbox is via the openshell SSH tunnel (nemoclaw <sandbox> connect), which doesn't use mDNS discovery
Internal-to-sandbox callers (the agent CLI, the configure-guard) connect to 127.0.0.1:18789 directly via the openclaw config, not via mDNS lookup
Continuing to load bonjour produces nothing useful and actively destabilizes the gateway every few seconds

This is the kind of plugin that exists for the user-laptop deployment story (where mDNS finds your assistant on a home network), not for the headless sandbox case NemoClaw runs.

Fix in this PR

plugins.entries.bonjour.enabled = false in the generated openclaw.json. Single line in the Dockerfile's Python config generator. Doesn't affect the user-laptop NemoClaw flow (different config path).
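As a rough illustration of that one-line change, the sketch below sets the flag on a config dictionary before it is serialized. It assumes the generator builds the config as a plain Python dict; the helper name `disable_plugin` and the surrounding structure are hypothetical and not taken from the actual generator.

```python
import json


def disable_plugin(config: dict, plugin_id: str) -> dict:
    """Set plugins.entries.<plugin_id>.enabled = False, creating parents as needed."""
    entries = config.setdefault("plugins", {}).setdefault("entries", {})
    entries.setdefault(plugin_id, {})["enabled"] = False
    return config


# Hypothetical starting config; the real generated openclaw.json has more keys.
config = {"plugins": {"entries": {}}}
disable_plugin(config, "bonjour")
print(json.dumps(config, indent=2))
```

Using `setdefault` keeps any other keys already present under `plugins.entries` untouched, so the change stays additive.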

Validation re-run in progress: https://github.com/NVIDIA/NemoClaw/actions/runs/24975221024

Diagnostic infrastructure to remove on green

Once TC-SBX-02 passes, these diagnostic-only commits should be reverted:

[gateway-log:] mirror in nemoclaw-start.sh (PID 1 stderr tail of /tmp/gateway.log)
Start gateway log streamer (background) and related steps in .github/workflows/nightly-e2e.yaml

These were necessary to find the root cause but add ambient runtime/CI overhead. Cleanup commit will be marked with revert(diag): ….

github-actions · 2026-04-28T21:12:35Z

Selective E2E Results — ❌ Some jobs failed

Run: 25077451446
Branch: upgrade/openclaw-2026.4.24
Requested jobs: sandbox-operations-e2e
Summary: 0 passed, 1 failed, 17 skipped

Job	Result
cloud-e2e	⏭️ skipped
deployment-services-e2e	⏭️ skipped
diagnostics-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped
hermes-e2e	⏭️ skipped
inference-routing-e2e	⏭️ skipped
messaging-providers-e2e	⏭️ skipped
network-policy-e2e	⏭️ skipped
overlayfs-autofix-e2e	⏭️ skipped
rebuild-hermes-e2e	⏭️ skipped
rebuild-openclaw-e2e	⏭️ skipped
sandbox-operations-e2e	❌ failure
sandbox-survival-e2e	⏭️ skipped
shields-config-e2e	⏭️ skipped
skip-permissions-e2e	⏭️ skipped
snapshot-commands-e2e	⏭️ skipped
token-rotation-e2e	⏭️ skipped
upgrade-stale-sandbox-e2e	⏭️ skipped

Failed jobs: sandbox-operations-e2e. Check run artifacts for logs.

github-actions · 2026-04-28T21:51:49Z

Selective E2E Results — ❌ Some jobs failed

Run: 25078188617
Branch: upgrade/openclaw-2026.4.24
Requested jobs: sandbox-operations-e2e
Summary: 0 passed, 1 failed, 17 skipped

Job	Result
cloud-e2e	⏭️ skipped
deployment-services-e2e	⏭️ skipped
diagnostics-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped
hermes-e2e	⏭️ skipped
inference-routing-e2e	⏭️ skipped
messaging-providers-e2e	⏭️ skipped
network-policy-e2e	⏭️ skipped
overlayfs-autofix-e2e	⏭️ skipped
rebuild-hermes-e2e	⏭️ skipped
rebuild-openclaw-e2e	⏭️ skipped
sandbox-operations-e2e	❌ failure
sandbox-survival-e2e	⏭️ skipped
shields-config-e2e	⏭️ skipped
skip-permissions-e2e	⏭️ skipped
snapshot-commands-e2e	⏭️ skipped
token-rotation-e2e	⏭️ skipped
upgrade-stale-sandbox-e2e	⏭️ skipped

Failed jobs: sandbox-operations-e2e. Check run artifacts for logs.

github-actions · 2026-04-28T22:13:16Z

Selective E2E Results — ❌ Some jobs failed

Run: 25079888582
Branch: upgrade/openclaw-2026.4.24
Requested jobs: sandbox-operations-e2e
Summary: 0 passed, 1 failed, 17 skipped

Job	Result
cloud-e2e	⏭️ skipped
deployment-services-e2e	⏭️ skipped
diagnostics-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped
hermes-e2e	⏭️ skipped
inference-routing-e2e	⏭️ skipped
messaging-providers-e2e	⏭️ skipped
network-policy-e2e	⏭️ skipped
overlayfs-autofix-e2e	⏭️ skipped
rebuild-hermes-e2e	⏭️ skipped
rebuild-openclaw-e2e	⏭️ skipped
sandbox-operations-e2e	❌ failure
sandbox-survival-e2e	⏭️ skipped
shields-config-e2e	⏭️ skipped
skip-permissions-e2e	⏭️ skipped
snapshot-commands-e2e	⏭️ skipped
token-rotation-e2e	⏭️ skipped
upgrade-stale-sandbox-e2e	⏭️ skipped

Failed jobs: sandbox-operations-e2e. Check run artifacts for logs.

github-actions · 2026-04-28T22:27:55Z

Selective E2E Results — ❌ Some jobs failed

Run: 25080430723
Branch: upgrade/openclaw-2026.4.24
Requested jobs: sandbox-operations-e2e
Summary: 0 passed, 1 failed, 17 skipped

Job	Result
cloud-e2e	⏭️ skipped
deployment-services-e2e	⏭️ skipped
diagnostics-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped
hermes-e2e	⏭️ skipped
inference-routing-e2e	⏭️ skipped
messaging-providers-e2e	⏭️ skipped
network-policy-e2e	⏭️ skipped
overlayfs-autofix-e2e	⏭️ skipped
rebuild-hermes-e2e	⏭️ skipped
rebuild-openclaw-e2e	⏭️ skipped
sandbox-operations-e2e	❌ failure
sandbox-survival-e2e	⏭️ skipped
shields-config-e2e	⏭️ skipped
skip-permissions-e2e	⏭️ skipped
snapshot-commands-e2e	⏭️ skipped
token-rotation-e2e	⏭️ skipped
upgrade-stale-sandbox-e2e	⏭️ skipped

Failed jobs: sandbox-operations-e2e. Check run artifacts for logs.

github-actions · 2026-04-28T22:32:08Z

Selective E2E Results — ❌ Some jobs failed

Run: 25080990090
Branch: upgrade/openclaw-2026.4.24
Requested jobs: sandbox-operations-e2e
Summary: 0 passed, 1 failed, 17 skipped

Job	Result
cloud-e2e	⏭️ skipped
deployment-services-e2e	⏭️ skipped
diagnostics-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped
hermes-e2e	⏭️ skipped
inference-routing-e2e	⏭️ skipped
messaging-providers-e2e	⏭️ skipped
network-policy-e2e	⏭️ skipped
overlayfs-autofix-e2e	⏭️ skipped
rebuild-hermes-e2e	⏭️ skipped
rebuild-openclaw-e2e	⏭️ skipped
sandbox-operations-e2e	❌ failure
sandbox-survival-e2e	⏭️ skipped
shields-config-e2e	⏭️ skipped
skip-permissions-e2e	⏭️ skipped
snapshot-commands-e2e	⏭️ skipped
token-rotation-e2e	⏭️ skipped
upgrade-stale-sandbox-e2e	⏭️ skipped

Failed jobs: sandbox-operations-e2e. Check run artifacts for logs.

github-actions · 2026-04-28T23:00:06Z

Selective E2E Results — ❌ Some jobs failed

Run: 25081625450
Branch: upgrade/openclaw-2026.4.24
Requested jobs: sandbox-operations-e2e
Summary: 0 passed, 1 failed, 17 skipped

Job	Result
cloud-e2e	⏭️ skipped
deployment-services-e2e	⏭️ skipped
diagnostics-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped
hermes-e2e	⏭️ skipped
inference-routing-e2e	⏭️ skipped
messaging-providers-e2e	⏭️ skipped
network-policy-e2e	⏭️ skipped
overlayfs-autofix-e2e	⏭️ skipped
rebuild-hermes-e2e	⏭️ skipped
rebuild-openclaw-e2e	⏭️ skipped
sandbox-operations-e2e	❌ failure
sandbox-survival-e2e	⏭️ skipped
shields-config-e2e	⏭️ skipped
skip-permissions-e2e	⏭️ skipped
snapshot-commands-e2e	⏭️ skipped
token-rotation-e2e	⏭️ skipped
upgrade-stale-sandbox-e2e	⏭️ skipped

Failed jobs: sandbox-operations-e2e. Check run artifacts for logs.

github-actions · 2026-04-28T23:39:57Z

Selective E2E Results — ✅ All requested jobs passed

Run: 25082270514
Branch: upgrade/openclaw-2026.4.24
Requested jobs: sandbox-operations-e2e
Summary: 1 passed, 0 failed, 17 skipped

Job	Result
cloud-e2e	⏭️ skipped
deployment-services-e2e	⏭️ skipped
diagnostics-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped
hermes-e2e	⏭️ skipped
inference-routing-e2e	⏭️ skipped
messaging-providers-e2e	⏭️ skipped
network-policy-e2e	⏭️ skipped
overlayfs-autofix-e2e	⏭️ skipped
rebuild-hermes-e2e	⏭️ skipped
rebuild-openclaw-e2e	⏭️ skipped
sandbox-operations-e2e	✅ success
sandbox-survival-e2e	⏭️ skipped
shields-config-e2e	⏭️ skipped
skip-permissions-e2e	⏭️ skipped
snapshot-commands-e2e	⏭️ skipped
token-rotation-e2e	⏭️ skipped
upgrade-stale-sandbox-e2e	⏭️ skipped

github-actions · 2026-04-29T01:17:49Z

Selective E2E Results — ✅ All requested jobs passed

Run: 25085223759
Branch: upgrade/openclaw-2026.4.24
Requested jobs: sandbox-operations-e2e
Summary: 1 passed, 0 failed, 18 skipped

Job	Result
cloud-e2e	⏭️ skipped
cloud-experimental-e2e	⏭️ skipped
deployment-services-e2e	⏭️ skipped
diagnostics-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped
hermes-e2e	⏭️ skipped
inference-routing-e2e	⏭️ skipped
messaging-providers-e2e	⏭️ skipped
network-policy-e2e	⏭️ skipped
overlayfs-autofix-e2e	⏭️ skipped
rebuild-hermes-e2e	⏭️ skipped
rebuild-openclaw-e2e	⏭️ skipped
sandbox-operations-e2e	✅ success
sandbox-survival-e2e	⏭️ skipped
shields-config-e2e	⏭️ skipped
skip-permissions-e2e	⏭️ skipped
snapshot-commands-e2e	⏭️ skipped
token-rotation-e2e	⏭️ skipped
upgrade-stale-sandbox-e2e	⏭️ skipped

## Summary

Upgrades OpenClaw from **2026.4.9** to **2026.4.24** (latest stable, CalVer).

### Fixes in this PR

1. **Version bumps**: `Dockerfile.base`, `nemoclaw-blueprint/blueprint.yaml`, `agents/openclaw/manifest.yaml`, `src/lib/sandbox-version.test.ts`.
2. **Patch 4 updated**: OpenClaw 2026.4.24 restructured `replaceConfigFile` to first attempt `tryWriteSingleTopLevelIncludeMutation` (writes to a `$include` file like `plugins.json5`) before falling back to `writeConfigFile`. The old patch matched an exact tab-indented `writeConfigFile(params.nextConfig, {...})` string that no longer exists. Updated to match the new `if (!await tryWriteSingleTopLevelIncludeMutation(...)) await writeConfigFile(...)` block and wrap the entire write path in the OPENSHELL_SANDBOX-gated EACCES try/catch.
3. **`plugin-runtime-deps` symlink**: OpenClaw 2026.4.24 introduced lazy plugin runtime-dep installation (Jiti loader). The CLI writes to `~/.openclaw/plugin-runtime-deps/openclaw-<version>-<hash>/` on first invocation. NemoClaw locks `/sandbox/.openclaw` to `444 root:root`, so every bundled provider failed to load with `EACCES`. Fix: created the dir in the writable `.openclaw-data` tree and symlinked it from the immutable config tree, mirroring the existing pattern used for `logs`, `credentials`, `extensions`, etc. Added in both `Dockerfile.base` (canonical) and `Dockerfile` (idempotent fixup for stale GHCR base).
4. **Selective sandbox safety-net**: `_SANDBOX_SAFETY_NET` (a Node `--require` preload from `nemoclaw-start.sh`) used to be a catch-all swallow + `process.exit` interceptor. Rewritten to: (a) gate to gateway processes only (`OPENSHELL_SANDBOX=1` + `argv[2]==='gateway'`) so CLI commands keep default Node crash behaviour; (b) match documented known-benign patterns (currently `ciao`/mDNS, produced when bonjour's probe state machine cancels itself, since the sandbox netns has no multicast); (c) for unknown errors, log full stack but keep gateway alive (gateway is shared infrastructure, user-initiated actions must not take it down); (d) drop `process.exit` interception entirely. The CIAO guard's `uncaughtException` listener was similarly gated to gateway processes: registering one in CLI processes turns Node's default crash-on-uncaught into silent absorb, which would silently hang `openclaw agent`.
5. **Disable bonjour and qqbot bundled plugins**: both ship enabled-by-default in 2026.4.24 and break in the sandbox netns:
   - **bonjour**: introduced in 2026.4.15, uses `@homebridge/ciao` for mDNS announcement. Sandbox netns has no multicast, so ciao's probe state machine fails at startup.
   - **qqbot**: has `stageRuntimeDependencies=true`, so its npm deps (`@tencent-connect/qqbot-connector`, `silk-wasm`, etc.) install on first load. The sandbox L7 proxy denies the registry URL with `403 policy_denied`, the install retries for ~6 minutes, and while channel loading is stuck the gateway can't service `openclaw agent` requests.

   Both disabled via `plugins.entries.<id>.enabled = false` in `scripts/generate-openclaw-config.py`.
6. **Build-context fix for `generate-openclaw-config.py`**: main's PR NVIDIA#2449 (commit `f5ee8a4d`) extracted the inline Python config-generator from Dockerfile into `scripts/generate-openclaw-config.py` and added `COPY scripts/generate-openclaw-config.py …` to Dockerfile, but did not update `src/lib/sandbox-build-context.ts` which curates the optimized build context for sandbox image builds. Without this, every nightly E2E job (and any sandbox onboard) fails with `COPY failed: file not found in build context`. Added the file to `stageOptimizedSandboxBuildContext()` next to `nemoclaw-start.sh` and added a test assertion so the staging stays in sync.

### Status

Most recent un-rate-limited run (25015126555 with build-context fix): **13 of 18 jobs pass**. `sandbox-operations-e2e` still fails, and only TC-SBX-02 (Connect & Chat) within it. All other TC-SBX cases (03, 04, 05, 06, 07, 08, 10, 11, 12) pass on `test-sbx-a`, confirming the gateway is functional.

After the `sandbox-build-context.ts` fix and the qqbot disable, the failure mode of TC-SBX-02 changed from `SSH command timed out after 60s` to `Expected '42' in agent reply; reply=''`: the same 60-90 second hang, but now hitting the test's outer `run_with_timeout` rather than producing a stack trace. The test drops stderr (`2>/dev/null`), and the gateway-log streamer/snapshot infrastructure has been unable to capture `test-sbx-a`'s `/tmp/openclaw-998/openclaw-*.log` reliably (the post-test openshell state has no active gateway after TC-SBX-06's docker kill, and the streamer's connection to test-sbx-a races and gets `Connection refused`). Still root-causing.

### Notable upstream changes (2026.4.9 → 2026.4.24)

- Google Meet bundled plugin, DeepSeek V4 Flash/Pro, realtime voice loops (Talk/Voice Call/Google Meet), Gemini Live, browser automation improvements.
- Lighter startup: static model catalogs, manifest-backed model rows, **lazy provider dependencies** (the new plugin-runtime-deps mechanism, root cause of fix #3).
- **Breaking:** Plugin SDK tool-result transforms migrated from `registerEmbeddedExtensionFactory()` to `registerAgentToolResultMiddleware()`; verified NemoClaw uses neither.
- **Breaking:** Plugin registry migrated from `plugins.installs` config key to managed `plugins/installs.json` ledger; `openclaw doctor --fix` migrates automatically.
- Config writes restructured to use single-file `$include` mutations before falling back to full config write (root cause of fix #2).
- CVE-2026-41349, CVE-2026-22181 fixes; exec-approvals chat enablement (2026.4.22); cron `jobs-state.json` separation (2026.4.20).
- bonjour mDNS plugin added in 2026.4.15 (root cause of fix #5a).

### User sandbox state migration on rebuild

Existing user sandboxes upgrade via `nemoclaw <name> rebuild`. State (memory/, workspace/, agents/, extensions/, etc.) is backed up via tar, sandbox is destroyed and recreated with the new image, state is restored, `openclaw doctor --fix` runs post-restore.

**Handled automatically:** memory, cron job definitions, plugin auto-discovery, plugin registry migration.

**Existing reset behavior (not new):** exec-approvals, credentials, device pairing.

**New minor behavior change:** cron runtime state (`jobs-state.json`) absent in pre-2026.4.20 backups, so job execution history resets and jobs may re-fire once after upgrade.

## Test plan

- [x] CI lint, typecheck, unit tests pass
- [x] Docker base image and sandbox image build with all dist patches applied
- [x] 13/18 nightly E2E jobs pass cleanly with all six fixes
- [ ] **TC-SBX-02**: root cause for the residual `reply=''` hang under investigation; the gateway-log capture infrastructure needs to work reliably post-test before we can read what's happening server-side
- [ ] Manual smoke test via `nemoclaw <sandbox> connect` interactive flow
- [ ] Rebuild test: existing 2026.4.9 sandbox → rebuild → verify state preserved (rebuild-openclaw-e2e covers this)
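The symlink pattern described in fix 3 (a writable directory living in the data tree, linked from the locked config tree) can be sketched as follows. This is an illustration under assumptions: the real change is made in the Dockerfiles, the helper name `link_writable_dir` is hypothetical, and the demo uses a temporary directory rather than `/sandbox`.

```python
import os
import tempfile


def link_writable_dir(config_root: str, data_root: str, name: str) -> str:
    """Create data_root/name and expose it as a symlink at config_root/name.

    Idempotent for an existing correct link; a stale link is replaced.
    A real directory already at the link path raises FileExistsError.
    """
    target = os.path.join(data_root, name)
    link = os.path.join(config_root, name)
    os.makedirs(target, exist_ok=True)
    if os.path.islink(link):
        if os.readlink(link) == target:
            return link
        os.unlink(link)
    os.symlink(target, link)
    return link


with tempfile.TemporaryDirectory() as root:
    config_root = os.path.join(root, ".openclaw")
    data_root = os.path.join(root, ".openclaw-data")
    os.makedirs(config_root)
    link = link_writable_dir(config_root, data_root, "plugin-runtime-deps")
    # A write through the link lands in the writable data tree.
    with open(os.path.join(link, "marker"), "w") as fh:
        fh.write("ok")
    landed = os.path.exists(os.path.join(data_root, "plugin-runtime-deps", "marker"))
    print(landed)
```

Once the link exists, the config tree itself can stay read-only, since writes resolve to the target directory.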

@Sanjays2402

…ons land cleanly (#2681) (#2851)

## Summary

Replaces the EACCES-swallow approach proposed in #2693 with proper Unix group permissions. Control-UI toggles in the OpenClaw dashboard (Enable Dreaming, account toggles, etc.) now **persist in default mode** instead of throwing `GatewayRequestError: EACCES` or becoming silent no-ops.

## Background

OpenClaw 2026.4.24 (landed via #2484) introduced `mutateConfigFile` as the new control-UI write path. Patch 4 in the Dockerfile only wraps the legacy `replaceConfigFile` (plugin install path), so every config-toggle click in the sandbox dashboard now EACCES'd.

#2693 proposed adding "Patch 4b", a parallel try/catch that swallows the EACCES. That makes toggles non-functional in the sandbox: the user clicks "Enable Dreaming," gets no error, but nothing actually persists. UX improves over the crash; underlying limitation stays.

This PR implements the alternative design Aaron sketched for #2681: rather than wrapping each new write path in EACCES handlers, fix the actual permissions so the writes succeed.

## Closes / Supersedes

- Closes #2681
- Supersedes #2693 (thanks @Sanjays2402 for raising the issue and the initial swallow attempt that surfaced the deeper design question)

## Implementation (the 6-item spec)

| # | Item | File |
|---|------|------|
| 1 | Keep `gateway` as a separate UID from `sandbox`; add it to the `sandbox` group | `Dockerfile.base` |
| 2 | Stale-base fallback so older `sandbox-base:latest` tags get the group membership at derived-image build time | `Dockerfile` |
| 3 | `/sandbox/.openclaw` group-writable + setgid on dirs; `.config-hash` file mode 664 | `Dockerfile.base`, `Dockerfile` |
| 4 | `normalize_mutable_config_perms()` at startup, gated on Shields state | `scripts/nemoclaw-start.sh` |
| 5 | `shields down` restores 660/2770 (group-writable + setgid) for OpenClaw; Hermes left at historical 640/750 (no separate gateway UID, contract doesn't apply) | `src/lib/shields.ts` |
| 6 | Tests assert the new invariant: writes succeed in default mode, no new EACCES swallow | `test/repro-2681-group-writable.test.ts` |

## Why setgid

`chmod g+s` on directories means new files inherit `group=sandbox` regardless of which UID created them. So `gateway` writes a file → file is `group=sandbox` → the `sandbox` user (also in the group) can still read it. Without setgid, gateway's writes would land with `group=gateway` and the agent might lose read access on rotation.

## Patch 4 retention

The existing `Patch 4` (replaceConfigFile EACCES swallow) is **intentionally retained** as a defensive fallback for:

- Older base images during the rollout window
- Host filesystems that don't honor setgid (rare, but possible on some Windows/WSL2 configurations)
- Other write paths in OpenClaw that might surface in future versions

No new EACCES swallow patch is added; the `Patch 4b` approach from #2693 is explicitly rejected per spec item #6.

## Verification

- [x] `npm run build:cli` compiles the changed `shields.ts`
- [x] 11/11 new tests pass in `test/repro-2681-group-writable.test.ts`, asserting structural invariants of the group-writable contract
- [x] 443/443 plugin tests pass
- [x] Pre-existing CLI tests that fail on this branch ALSO fail on pristine main (`@oclif/core` module-not-found from in-flight migration; not caused by this PR)
- [ ] **Brev E2E required**: touches Dockerfile + Dockerfile.base + shields lifecycle. Adaptive matrix: M×DANGER → full Brev sweep before merge

## Test plan

- [x] Unit: 11 structural assertions in `repro-2681-group-writable.test.ts`
- [ ] CI: `build-sandbox-images` (validates the group-membership + setgid Dockerfile changes)
- [ ] CI: `test-e2e-sandbox` (validates shields lifecycle + onboard flow)
- [ ] CI: `test-e2e-gateway-isolation` (validates the gateway-as-different-UID still runs cleanly)
- [ ] Manual repro: onboard, click "Enable Dreaming" in dashboard, verify mutation persists across `nemoclaw status`

## Type of Change

- [x] Code change (feature, bug fix, or refactor)

## AI Disclosure

- [x] AI-assisted, tool: Claude Code
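The mode values in the spec (660 for files, 2770 for directories) can be decoded with Python's `stat` module to see what "group-writable + setgid" means at the bit level. This is a sketch for illustration; the constant and function names are hypothetical and not taken from `shields.ts`.

```python
import stat

# Modes quoted in the spec for OpenClaw after `shields down`.
DIR_MODE = 0o2770   # setgid + rwx for owner and group
FILE_MODE = 0o660   # rw for owner and group


def is_group_writable_setgid(mode: int) -> bool:
    """True when a directory mode has both the setgid bit and group write."""
    return bool(mode & stat.S_ISGID) and bool(mode & stat.S_IWGRP)


# Render the modes the way `ls -l` would show them.
print(stat.filemode(stat.S_IFDIR | DIR_MODE))   # drwxrws---
print(stat.filemode(stat.S_IFREG | FILE_MODE))  # -rw-rw----
print(is_group_writable_setgid(DIR_MODE))       # True
print(is_group_writable_setgid(0o750))          # False (the historical Hermes dir mode)
```

The lowercase `s` in the group triplet marks setgid on an executable group bit, which is what makes new files in the directory inherit its group.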

Patch 4 is a regex-based monkey-patch over OpenClaw's compiled JS that suppresses EACCES inside replaceConfigFile. Its source-shape coupling has broken three times in eight days (#2377, #2484, #2876) chasing upstream refactors; #2686 and #3497 report the latest casualty, where the regex no longer finds the function in 2026.4.24 and the image build fails.

Patch 4 is also unnecessary by design:

* In mutable-default mode, openclaw.json is chmod 660 sandbox:sandbox and the gateway UID is in the sandbox group (#2681), so plugin install writes through without ever hitting EACCES.
* In shields-up mode, the entire config tree (file + parent dir + the plugin/extensions state dirs in HIGH_RISK_STATE_DIRS) is locked to root:root by design; refusing runtime mutations is the whole point of shields-up. Suppressing the EACCES masked that refusal and made the install appear to succeed silently while only the auto-discovery half landed.

The expected flow is configure-in-mutable-mode → shields up → run. Plugin install attempted while shielded should fail cleanly, which is what happens without Patch 4.

Reverts the rcf-shim replacement attempt; the require-hook approach does not catch OpenClaw's ESM named imports anyway (capture-at-import-time semantics).

Resolves #2686
Resolves #3497

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

## Summary

Patch 4 in the sandbox `Dockerfile` is a regex-based monkey-patch over OpenClaw's compiled JS that wraps `replaceConfigFile` in an EACCES try/catch suppression. It is source-shape-coupled and has been rewritten three times in eight days chasing OpenClaw refactors:

- `fefd69fa2` (#2377): original literal-string match against [openclaw/openclaw@v2026.4.9](https://github.com/openclaw/openclaw/releases/tag/v2026.4.9).
- `5dcb0a9b9` (#2484): updated the literal string for the restructured write block in [openclaw/openclaw@v2026.4.24](https://github.com/openclaw/openclaw/releases/tag/v2026.4.24).
- `e0290e153` (#2876): hardened to a tolerant whitespace/property-order-aware regex against [openclaw/openclaw@v2026.4.24](https://github.com/openclaw/openclaw/releases/tag/v2026.4.24).

#2686 and #3497 are the latest break: in current OpenClaw, the regex no longer finds the function shape and the image build aborts at Step 17/56.

Patch 4 is also unnecessary by design. The EACCES it was suppressing does not happen in the supported flows:

- **Mutable-default mode** (fresh sandbox, before `nemoclaw shields up`): `openclaw.json` is `chmod 660 sandbox:sandbox` and the gateway UID is in the sandbox group, courtesy of #2851 (closing #2681; superseding the EACCES-swallow attempt in #2693). `openclaw plugins install` writes through normally; no EACCES.
- **Shields-up mode** (locked): the entire config tree (file, parent directory, and the `extensions`/`plugins` state dirs from [HIGH_RISK_STATE_DIRS](src/lib/shields/index.ts#L292-L306)) is locked to `root:root` by design. Shields-up exists *to refuse* runtime config and plugin mutations. Suppressing the EACCES masked that refusal and made `openclaw plugins install` appear to succeed silently while only the auto-discovery half landed.

The expected lifecycle is **configure-in-mutable-mode → `shields up` → run**. Plugin install attempted while shielded should fail cleanly; that is exactly what happens without Patch 4.

This PR therefore deletes Patch 4 entirely.

## Related Issue

Resolves #2686
Resolves #3497

Related context:

- #2681: original "make `.openclaw` group-writable" issue, closed by #2851.
- #2851: PR that made mutable-mode plugin install work without an EACCES swallow.
- #2693: closed earlier EACCES-swallow attempt, superseded once #2851 landed.
- #2544: NemoClaw issue tracking the broader "plugin config requires multi-minute rebuild" problem.
- openclaw/openclaw#72950: upstream defect (no env-var or writable-overlay path for `plugins.entries.<id>.config`); the real fix has to land there.

## Changes

- `Dockerfile`: drop the Patch 4 block (the `COPY scripts/rcf_patch.py` line and the inline Python invocation + grep guard).
- `scripts/rcf_patch.py`: deleted.
- `src/lib/sandbox/build-context.ts`: stop staging `scripts/rcf_patch.py` into the sandbox build context.
- `test/rcf-patch.test.ts`: deleted.

## Type of Change

- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification

- [x] `npx prek run --all-files` passes
- [x] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

## Summary by CodeRabbit

* **Chores**
  * Removed OpenClaw patching logic from Docker build process and related artifact copies
  * Updated build context script staging behavior
* **Tests**
  * Enhanced sandbox configuration test suite with environment variable passthrough support
  * Added version-based conditional patching validation and warning behavior tests

---------

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
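The source-shape coupling that made Patch 4 fragile can be illustrated with a small sketch of a literal-string patcher. Everything here is a stand-in: the `OLD_SHAPE` and `NEW_SHAPE` strings are hypothetical simplifications, not OpenClaw's real compiled output, and `apply_literal_patch` is not the deleted `rcf_patch.py`.

```python
def apply_literal_patch(source: str, needle: str, replacement: str) -> str:
    """Replace the first exact occurrence of needle, failing loudly if absent."""
    if needle not in source:
        # A build guard like this is what turns an upstream refactor into a build failure.
        raise RuntimeError("patch target not found; upstream source shape changed")
    return source.replace(needle, replacement, 1)


# Hypothetical stand-ins for the compiled write path before and after a refactor.
OLD_SHAPE = "\tawait writeConfigFile(params.nextConfig, opts);\n"
NEW_SHAPE = (
    "\tif (!await tryWriteSingleTopLevelIncludeMutation(params)) "
    "await writeConfigFile(params.nextConfig, opts);\n"
)

NEEDLE = "\tawait writeConfigFile(params.nextConfig, opts);"
REPLACEMENT = "\ttry { await writeConfigFile(params.nextConfig, opts); } catch (e) { /* guarded */ }"

patched = apply_literal_patch(OLD_SHAPE, NEEDLE, REPLACEMENT)
print("try {" in patched)

try:
    apply_literal_patch(NEW_SHAPE, NEEDLE, REPLACEMENT)
except RuntimeError as err:
    print(err)
```

The needle depends on exact indentation and statement layout, so a refactor that keeps the same call but wraps it differently is enough to make the match fail.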

chore: upgrade OpenClaw from 2026.4.9 to 2026.4.24

f3b0dbe

Bump the pinned OpenClaw version across all version-tracking files (Dockerfile.base, blueprint.yaml, manifest.yaml, and version tests) to the latest stable release.

ericksoa added 2 commits April 25, 2026 15:32

Merge branch 'main' into upgrade/openclaw-2026.4.24

c1fe5f4

ericksoa marked this pull request as ready for review April 25, 2026 23:23

olegshilov approved these changes Apr 25, 2026

coderabbitai Bot reviewed Apr 26, 2026

Comment thread test/e2e/test-sandbox-operations.sh Outdated

ericksoa added 2 commits April 26, 2026 09:11

coderabbitai Bot reviewed Apr 26, 2026

ericksoa added 2 commits April 26, 2026 10:49

revert: undo test/e2e/test-sandbox-operations.sh timeout + diagnostics

935a9b4

Reverts 2aacc51 and 1e512b1. The test contract (run openclaw agent via SSH and assert the reply contains the expected token) stays as-is. Real fix belongs in NemoClaw, not the test harness.

coderabbitai Bot reviewed Apr 26, 2026

ericksoa added 14 commits April 26, 2026 11:29

Revert "fix: give gateway user write access to plugin-runtime-deps cache"

e685857

This reverts commit 521c599.

test(e2e): resume sandbox onboard after import reset

4de3c67

fix(sandbox): disable inferred thinking for smoke agent

aa6fecc

test(e2e): retry sandbox onboard resume resets

0df3c91

style(e2e): apply sandbox operations shell format

2843fde

fix(sandbox): trim staged OpenClaw runtime deps

2cb8064

ericksoa added 2 commits April 28, 2026 17:44

Merge remote-tracking branch 'origin/main' into upgrade/openclaw-2026.4.24

14278b8

Merge branch 'main' into upgrade/openclaw-2026.4.24

3363b22

cv approved these changes Apr 29, 2026

ericksoa merged commit 65d2fae into main Apr 29, 2026
16 checks passed

camerono mentioned this pull request Apr 29, 2026

v0.0.29 Dockerfile Step 17 patches fail against bundled OpenClaw 2026.4.24 #2689

Closed

2 tasks

cjagwani mentioned this pull request May 1, 2026

fix(sandbox): make .openclaw group-writable so OpenClaw config mutations land cleanly (#2681) #2851

Merged

12 tasks

ericksoa mentioned this pull request May 3, 2026

feat(compat): bump OpenClaw from 2026.4.2 to 2026.4.22 #2384

Closed

3 tasks

laitingsheng mentioned this pull request May 14, 2026

fix(dockerfile): drop Patch 4 entirely #3500

Merged

12 tasks

wscurran added the area: packaging (Packages, images, registries, installers, or distribution), chore (Build, CI, dependency, or tooling maintenance), and platform: container (Affects Docker, containerd, Podman, or images) labels and removed the Docker label Jun 3, 2026

github-actions Bot mentioned this pull request Jun 10, 2026

fix(sandbox): cron preflight inference.local uses trusted env-proxy mode #5129

Merged

12 tasks
