chore: upgrade OpenClaw from 2026.4.9 to 2026.4.24#2484

Merged
ericksoa merged 62 commits into main from upgrade/openclaw-2026.4.24
Apr 29, 2026

@ericksoa ericksoa commented Apr 25, 2026

Contributor

Summary

Upgrades OpenClaw from 2026.4.9 to 2026.4.24 (latest stable, CalVer).

Fixes in this PR

  1. Version bumps: Dockerfile.base, nemoclaw-blueprint/blueprint.yaml, agents/openclaw/manifest.yaml, src/lib/sandbox-version.test.ts.
  2. Patch 4 updated — OpenClaw 2026.4.24 restructured replaceConfigFile to first attempt tryWriteSingleTopLevelIncludeMutation (writes to a $include file like plugins.json5) before falling back to writeConfigFile. The old patch matched an exact tab-indented writeConfigFile(params.nextConfig, {...}) string that no longer exists. Updated to match the new if (!await tryWriteSingleTopLevelIncludeMutation(...)) await writeConfigFile(...) block and wrap the entire write path in the OPENSHELL_SANDBOX-gated EACCES try/catch.
  3. plugin-runtime-deps symlink — OpenClaw 2026.4.24 introduced lazy plugin runtime-dep installation (Jiti loader). The CLI writes to ~/.openclaw/plugin-runtime-deps/openclaw-<version>-<hash>/ on first invocation. NemoClaw locks /sandbox/.openclaw to 444 root:root, so every bundled provider failed to load with EACCES. Fix: created the dir in the writable .openclaw-data tree and symlinked it from the immutable config tree, mirroring the existing pattern used for logs, credentials, extensions, etc. Added in both Dockerfile.base (canonical) and Dockerfile (idempotent fixup for stale GHCR base).
  4. Selective sandbox safety-net: _SANDBOX_SAFETY_NET (a Node --require preload from nemoclaw-start.sh) used to be a catch-all swallow + process.exit interceptor. Rewritten to: (a) gate to gateway processes only (OPENSHELL_SANDBOX=1 + argv[2]==='gateway') so CLI commands keep default Node crash behaviour; (b) match documented known-benign patterns (currently ciao/mDNS, produced when bonjour's probe state machine cancels itself, since the sandbox netns has no multicast); (c) for unknown errors, log full stack but keep gateway alive (gateway is shared infrastructure, user-initiated actions must not take it down); (d) drop process.exit interception entirely. The CIAO guard's uncaughtException listener was similarly gated to gateway processes: registering one in CLI processes turns Node's default crash-on-uncaught into a silent absorb, which would silently hang openclaw agent.
  5. Disable bonjour and qqbot bundled plugins — both ship enabled-by-default in 2026.4.24 and break in the sandbox netns:
    • bonjour: introduced in 2026.4.15, uses @homebridge/ciao for mDNS announcement. Sandbox netns has no multicast — ciao's probe state machine fails at startup.
    • qqbot: has stageRuntimeDependencies=true, so its npm deps (@tencent-connect/qqbot-connector, silk-wasm, etc.) install on first load. The sandbox L7 proxy denies the registry URL with 403 policy_denied, the install retries for ~6 minutes, and while channel loading is stuck the gateway can't service openclaw agent requests.
     Both are disabled via plugins.entries.<id>.enabled = false in scripts/generate-openclaw-config.py.
  6. Build-context fix for generate-openclaw-config.py: main's PR #2449 ("fix: auto-disable device auth for non-loopback URLs (#2341)", commit f5ee8a4d) extracted the inline Python config-generator from Dockerfile into scripts/generate-openclaw-config.py and added COPY scripts/generate-openclaw-config.py … to Dockerfile, but did not update src/lib/sandbox-build-context.ts, which curates the optimized build context for sandbox image builds. With the script missing from that context, every nightly E2E job (and any sandbox onboard) fails with COPY failed: file not found in build context. Added the file to stageOptimizedSandboxBuildContext() next to nemoclaw-start.sh and added a test assertion so the staging stays in sync.
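
Fix 5 comes down to setting plugins.entries.<id>.enabled = false for each affected plugin in the generated config. A minimal Python sketch of that step, assuming the config is a plain nested dict; `disable_bundled_plugins` and the sample config are hypothetical and are not taken from scripts/generate-openclaw-config.py:

```python
import json

# Plugin ids named in this PR as broken inside the sandbox netns.
SANDBOX_DISABLED_PLUGINS = ("bonjour", "qqbot")


def disable_bundled_plugins(config, plugin_ids=SANDBOX_DISABLED_PLUGINS):
    """Return a copy of config with plugins.entries.<id>.enabled set to False."""
    result = json.loads(json.dumps(config))  # deep copy for JSON-like data
    entries = result.setdefault("plugins", {}).setdefault("entries", {})
    for plugin_id in plugin_ids:
        # Keep any other per-plugin settings; only flip the enabled flag.
        entries.setdefault(plugin_id, {})["enabled"] = False
    return result


if __name__ == "__main__":
    sample = {"plugins": {"entries": {"qqbot": {"enabled": True}}}}
    print(json.dumps(disable_bundled_plugins(sample), indent=2))
```

Working on a copy keeps the caller's dict untouched, which makes the step easy to assert on in a unit test.
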

Status

Most recent un-rate-limited run (25015126555 with build-context fix): 13 of 18 jobs pass. sandbox-operations-e2e still fails — only TC-SBX-02 (Connect & Chat) within it. All other TC-SBX cases (03, 04, 05, 06, 07, 08, 10, 11, 12) pass on test-sbx-a, confirming the gateway is functional. After the sandbox-build-context.ts fix and the qqbot disable, the failure mode of TC-SBX-02 changed from SSH command timed out after 60s to Expected '42' in agent reply; reply='' — same 60-90 second hang but now hitting the test's outer run_with_timeout rather than producing a stack trace. The test drops stderr (2>/dev/null), and the gateway-log streamer/snapshot infrastructure has been unable to capture test-sbx-a's /tmp/openclaw-998/openclaw-*.log reliably (the post-test openshell state has no active gateway after TC-SBX-06's docker kill, and the streamer's connection to test-sbx-a races and gets Connection refused). Still root-causing.

Notable upstream changes (2026.4.9 → 2026.4.24)

  • Google Meet bundled plugin, DeepSeek V4 Flash/Pro, realtime voice loops (Talk/Voice Call/Google Meet), Gemini Live, browser automation improvements.
  • Lighter startup: static model catalogs, manifest-backed model rows, lazy provider dependencies (the new plugin-runtime-deps mechanism, root cause of fix #3).
  • Breaking: Plugin SDK tool-result transforms migrated from registerEmbeddedExtensionFactory() to registerAgentToolResultMiddleware() — verified NemoClaw uses neither.
  • Breaking: Plugin registry migrated from plugins.installs config key to managed plugins/installs.json ledger — openclaw doctor --fix migrates automatically.
  • Config writes restructured to use single-file $include mutations before falling back to full config write (root cause of fix #2).
  • CVE-2026-41349, CVE-2026-22181 fixes; exec-approvals chat enablement (2026.4.22); cron jobs-state.json separation (2026.4.20).
  • bonjour mDNS plugin added in 2026.4.15 (root cause of fix #5a).

User sandbox state migration on rebuild

Existing user sandboxes upgrade via nemoclaw <name> rebuild. State (memory/, workspace/, agents/, extensions/, etc.) is backed up via tar, sandbox is destroyed and recreated with the new image, state is restored, openclaw doctor --fix runs post-restore.

Handled automatically: memory, cron job definitions, plugin auto-discovery, plugin registry migration. Existing reset behavior (not new): exec-approvals, credentials, device pairing. New minor behavior change: cron runtime state (jobs-state.json) absent in pre-2026.4.20 backups — job execution history resets, jobs may re-fire once after upgrade.

Test plan

  • CI lint, typecheck, unit tests pass
  • Docker base image and sandbox image build with all dist patches applied
  • 13/18 nightly E2E jobs pass cleanly with all six fixes
  • TC-SBX-02 — root cause for the residual reply='' hang under investigation; the gateway-log capture infrastructure needs to work reliably post-test before we can read what's happening server-side
  • Manual smoke test via nemoclaw <sandbox> connect interactive flow
  • Rebuild test: existing 2026.4.9 sandbox → rebuild → verify state preserved (rebuild-openclaw-e2e covers this)

Bump the pinned OpenClaw version across all version-tracking files
(Dockerfile.base, blueprint.yaml, manifest.yaml, and version tests)
to the latest stable release.

copy-pr-bot Bot commented Apr 25, 2026


Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai Bot commented Apr 25, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Updates OpenClaw from version 2026.4.9 to 2026.4.24 across build configuration, manifests, and tests. Introduces plugin runtime dependencies cache directory with proper permissions and group configuration. Implements new config writing API with sandbox error handling for read-only environments.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Version Upgrades: Dockerfile.base, agents/openclaw/manifest.yaml, nemoclaw-blueprint/blueprint.yaml | Bump OpenClaw version from 2026.4.9 to 2026.4.24 across build configuration and manifest declarations. |
| Dockerfile Configuration: Dockerfile | Implements new OpenClaw 2026.4.24+ config writing via tryWriteSingleTopLevelIncludeMutation with writeConfigFile fallback. Adds error handling for EACCES in sandboxes. Creates /sandbox/.openclaw-data/plugin-runtime-deps directory with group-write permissions (setgid/2775) to allow gateway user write access. |
| Test Updates: src/lib/sandbox-version.test.ts | Update test fixtures and assertions to expect OpenClaw version 2026.4.24 across mocked agent definitions, version comparisons, and staleness warnings. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Hopping through configs, the version's bumped high,
From point-nine to point-twenty-four in the sky!
Plugin deps find a cache with a gateway's new right,
Sandboxes protected from permission-denied plight.
A safer, stronger OpenClaw, shiny and bright! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title 'chore: upgrade OpenClaw from 2026.4.9 to 2026.4.24' accurately reflects the primary change across the changeset: upgrading the OpenClaw version and updating all related version references. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

OpenClaw 2026.4.24 restructured replaceConfigFile to first attempt a
single-key include-file mutation (tryWriteSingleTopLevelIncludeMutation)
before falling back to writeConfigFile. Both paths can EACCES in the
read-only sandbox. Update the pattern match to wrap the entire write
block in the OPENSHELL_SANDBOX-gated try/catch.
@ericksoa ericksoa marked this pull request as ready for review April 25, 2026 23:23

@olegshilov olegshilov left a comment


lgtm

Capture the SSH-shell environment (HTTP_PROXY, HTTPS_PROXY, NO_PROXY,
OPENCLAW_GATEWAY_URL/TOKEN, OPENSHELL_SANDBOX, NVIDIA_API_KEY) before
the agent invocation, and bump the failure-message capture from head -3
to head -20 so the full reply (including any gateway/embedded fallback
errors) shows in CI logs. Diagnostic-only — no behavior change.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/test-sandbox-operations.sh`:
- Line 282: The diag_env diagnostic line leaks secrets by expanding the token
values; replace the unsafe expansions
`${OPENCLAW_GATEWAY_TOKEN:+set}${OPENCLAW_GATEWAY_TOKEN:-unset}` and the
analogous `NVIDIA_API_KEY` expansion in the sandbox_exec invocation so they
never emit the variable contents, and instead emit only the literal "set" or
"unset"; implement this by checking each variable's presence (e.g., an explicit
conditional or test for non-empty) and printing "set" when present or "unset"
when not, updating the diag_env/sandbox_exec call accordingly to reference
OPENCLAW_GATEWAY_TOKEN and NVIDIA_API_KEY securely.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5161bcbc-13b7-4cd0-8a9d-5d0f0d383403

📥 Commits

Reviewing files that changed from the base of the PR and between 5dcb0a9 and 2aacc51.

📒 Files selected for processing (1)
  • test/e2e/test-sandbox-operations.sh

Comment thread test/e2e/test-sandbox-operations.sh Outdated
OpenClaw 2026.4.24 lazy-installs bundled plugin runtime dependencies into
~/.openclaw/plugin-runtime-deps/openclaw-<version>-<hash>/ on first CLI
invocation (Jiti-based loader, "lazy provider dependencies" in 2026.4.20+
release notes). NemoClaw locks /sandbox/.openclaw to 444 root:root, so
every bundled plugin (nvidia, openai, anthropic, ollama, ...) failed to
load with EACCES, leaving `openclaw agent` with zero providers — the
exact symptom in TC-SBX-02 (no agent reply, only proxy warnings).

Mirror the existing .openclaw-data symlink pattern: create the dir in
the writable data tree and symlink it from the immutable config tree.
Add to both Dockerfile.base (canonical setup) and Dockerfile (idempotent
fixup for stale GHCR bases).
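
The symlink pattern described above can be sketched in Python. This is an illustration only: the actual change lives in Dockerfile.base and Dockerfile, and `link_writable_dir` is a hypothetical helper name. It creates the directory in the writable tree, points the config tree at it, and is safe to run twice, which is the property the "idempotent fixup" relies on.

```python
import os
import tempfile


def link_writable_dir(config_root, data_root, name):
    """Create data_root/name and point config_root/name at it via a symlink."""
    target = os.path.join(data_root, name)
    link = os.path.join(config_root, name)
    os.makedirs(target, exist_ok=True)
    if os.path.islink(link):
        if os.readlink(link) == target:
            return link  # already in place, nothing to do
        os.unlink(link)  # stale link pointing elsewhere
    elif os.path.exists(link):
        raise FileExistsError(f"{link} exists and is not a symlink")
    os.symlink(target, link)
    return link


if __name__ == "__main__":
    # Throwaway directories stand in for /sandbox/.openclaw and .openclaw-data.
    root = tempfile.mkdtemp()
    config_root = os.path.join(root, ".openclaw")
    data_root = os.path.join(root, ".openclaw-data")
    os.makedirs(config_root)
    link = link_writable_dir(config_root, data_root, "plugin-runtime-deps")
    link_writable_dir(config_root, data_root, "plugin-runtime-deps")  # no-op
    print(link, "->", os.readlink(link))
```
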
…load

OpenClaw 2026.4.24+ lazy-installs and Jiti-compiles ~50 bundled plugin
runtime deps on the first agent invocation in a fresh sandbox. Even with
deps pre-cached at build time, the plugin registry bootstrap + provider
warmup + LLM round-trip on the first call can exceed the existing 60s
SSH timeout (was completing in ~20s on 2026.4.9).

Make sandbox_exec_for accept an optional timeout argument (default 60,
preserves all other call sites) and have TC-SBX-02 pass 240s. The
openclaw agent CLI's own --timeout default is 600s so 240s leaves
plenty of headroom for the inference call itself.

@coderabbitai coderabbitai Bot left a comment

Contributor

♻️ Duplicate comments (1)
test/e2e/test-sandbox-operations.sh (1)

286-286: ⚠️ Potential issue | 🔴 Critical

Sensitive values can still be exposed in diagnostics.

Line 286 uses ${OPENCLAW_GATEWAY_TOKEN:+set}${OPENCLAW_GATEWAY_TOKEN:-unset} (and the same for NVIDIA_API_KEY), which includes the secret value when set. This can leak credentials into CI logs.

🔧 Proposed fix
-  diag_env=$(sandbox_exec 'echo HTTP_PROXY=${HTTP_PROXY:-unset}; echo HTTPS_PROXY=${HTTPS_PROXY:-unset}; echo NO_PROXY=${NO_PROXY:-unset}; echo OPENCLAW_GATEWAY_URL=${OPENCLAW_GATEWAY_URL:-unset}; echo OPENCLAW_GATEWAY_TOKEN=${OPENCLAW_GATEWAY_TOKEN:+set}${OPENCLAW_GATEWAY_TOKEN:-unset}; echo OPENSHELL_SANDBOX=${OPENSHELL_SANDBOX:-unset}; echo NVIDIA_API_KEY=${NVIDIA_API_KEY:+set}${NVIDIA_API_KEY:-unset}' 2>&1) || true
+  diag_env=$(sandbox_exec 'echo HTTP_PROXY=${HTTP_PROXY:-unset}; echo HTTPS_PROXY=${HTTPS_PROXY:-unset}; echo NO_PROXY=${NO_PROXY:-unset}; echo OPENCLAW_GATEWAY_URL=${OPENCLAW_GATEWAY_URL:-unset}; echo OPENCLAW_GATEWAY_TOKEN=$([ -n "${OPENCLAW_GATEWAY_TOKEN:-}" ] && echo set || echo unset); echo OPENSHELL_SANDBOX=${OPENSHELL_SANDBOX:-unset}; echo NVIDIA_API_KEY=$([ -n "${NVIDIA_API_KEY:-}" ] && echo set || echo unset)' 2>&1) || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-sandbox-operations.sh` at line 286, The diagnostic command
leaks secret values because
`${OPENCLAW_GATEWAY_TOKEN:+set}${OPENCLAW_GATEWAY_TOKEN:-unset}` (and the
NVIDIA_API_KEY variant) concatenates "set" with the actual secret; change the
diagnostic to print only "set" or "unset" without expanding the value by
replacing those expansions with a conditional-only check (e.g., use a single
parameter expansion or an explicit test) inside the sandbox_exec invocation so
OPENCLAW_GATEWAY_TOKEN and NVIDIA_API_KEY are never interpolated into the logged
string.
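
The underlying rule is that a diagnostic may report whether a secret is present but must never format its value. A small Python sketch of the same idea, offered as an illustration rather than a replacement for the shell one-liner; `describe_env` is a hypothetical name, and the variable lists are the ones from the diagnostic above:

```python
import os

SECRET_NAMES = ("OPENCLAW_GATEWAY_TOKEN", "NVIDIA_API_KEY")
PLAIN_NAMES = ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY",
               "OPENCLAW_GATEWAY_URL", "OPENSHELL_SANDBOX")


def describe_env(env=None):
    """Build NAME=value diagnostic lines, masking secrets as set/unset."""
    env = os.environ if env is None else env
    lines = []
    for name in PLAIN_NAMES:
        # Non-secret values are printed as-is; empty counts as unset,
        # matching the shell's ${VAR:-unset}.
        lines.append(f"{name}={env.get(name) or 'unset'}")
    for name in SECRET_NAMES:
        # Only presence is reported; the value itself is never formatted.
        lines.append(f"{name}={'set' if env.get(name) else 'unset'}")
    return lines


if __name__ == "__main__":
    print("\n".join(describe_env()))
```
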

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: acfac00c-0120-4ef6-ac19-94ac3a5d1d09

📥 Commits

Reviewing files that changed from the base of the PR and between e1f1be8 and 1e512b1.

📒 Files selected for processing (1)
  • test/e2e/test-sandbox-operations.sh

Reverts 2aacc51 and 1e512b1. The test contract (run openclaw agent via
SSH and assert the reply contains the expected token) stays as-is. Real
fix belongs in NemoClaw, not the test harness.
Add gateway to the sandbox supplementary group and set 2775 (setgid +
group-write) on /sandbox/.openclaw-data/plugin-runtime-deps. OpenClaw
2026.4.24+ runs its plugin loader on both the sandbox-side CLI and the
gateway side; both paths call withBundledRuntimeDepsInstallRootLock,
which mkdirSyncs a lock dir under the install root.

The original NemoClaw user-isolation design has gateway and sandbox in
different primary groups so the sandbox user cannot tamper with the
gateway. Before 2026.4.24 the plugin loader did not need write access
from the gateway side; now it does, and EACCES on the lock dir caused
the gateway to fail mid-request, leaving the agent CLI hanging silently
on the unanswered WebSocket call.

Adding gateway to sandbox as a supplementary group preserves the
original boundary (sandbox still cannot affect gateway-owned resources)
and only opens gateway → sandbox-owned shared cache. Setgid bit ensures
new files created by either user inherit the sandbox group. Mirrored in
both Dockerfile.base (canonical) and Dockerfile (idempotent fixup for
stale GHCR base images).
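
For readers less used to octal modes, the 2775 value breaks down as follows. A short Python sketch using only the standard `stat` module; `describe_mode` is a hypothetical helper, not part of NemoClaw:

```python
import stat

SHARED_CACHE_MODE = 0o2775  # setgid + rwx for owner and group, r-x for others


def describe_mode(mode):
    """Break a permission mode into the bits that matter for a shared cache dir."""
    return {
        "setgid": bool(mode & stat.S_ISGID),
        "group_write": bool(mode & stat.S_IWGRP),
        "other_write": bool(mode & stat.S_IWOTH),
        "symbolic": stat.filemode(stat.S_IFDIR | mode),
    }


if __name__ == "__main__":
    print(describe_mode(SHARED_CACHE_MODE))
```

Group write is what lets the gateway user (via its supplementary group) create the lock dir, and the setgid bit is what keeps new entries in the sandbox group regardless of which user created them.
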

@coderabbitai coderabbitai Bot left a comment

Contributor

🧹 Nitpick comments (1)
Dockerfile (1)

186-187: Pattern matching in minified JS is fragile.

The Python patch uses exact string matching including literal tabs (\t) and newlines (\n). Minified JavaScript bundles often vary in whitespace formatting between versions or build environments. The assertion assert old in src will fail-close (which is good), but consider:

  1. The pattern assumes specific formatting that may not survive re-minification
  2. Upstream OpenClaw version bumps could silently change whitespace

The fail-close behavior is correct — the build aborts if the pattern isn't found. However, when this inevitably breaks on a future OpenClaw bump, debugging the exact whitespace mismatch will be tedious.

💡 Alternative: Consider regex-based patching for resilience

A more robust approach would use regex matching that's whitespace-tolerant:

import re
pattern = re.compile(
    r'if\s*\(\s*!\s*await\s+tryWriteSingleTopLevelIncludeMutation\s*\(\s*\{[^}]+\}\s*\)\s*\)\s*await\s+writeConfigFile\s*\([^;]+\);',
    re.DOTALL
)

This would survive minor formatting changes. However, the current exact-match approach is acceptable given the fail-close assertion — just be prepared for patch maintenance on version bumps.
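
A fuller sketch of the regex approach, keeping the fail-close behaviour. Everything here is an assumption for illustration: the pattern is a simplified variant of the one suggested above, the injected catch body is hypothetical, and neither is copied from the Dockerfile or the OpenClaw bundle.

```python
import re

# Whitespace-tolerant match for the write block. The argument shapes are
# assumed to contain no semicolons.
WRITE_BLOCK = re.compile(
    r"if\s*\(\s*!\s*await\s+tryWriteSingleTopLevelIncludeMutation\s*\([^;]*?\)\s*\)\s*"
    r"await\s+writeConfigFile\s*\([^;]*?\)\s*;",
    re.DOTALL,
)


def wrap_write_block(src):
    """Wrap the single matching write block in a sandbox-gated EACCES try/catch."""
    matches = WRITE_BLOCK.findall(src)
    # Fail closed: abort unless exactly one block matches.
    if len(matches) != 1:
        raise RuntimeError(f"expected 1 write block, found {len(matches)}")

    def replace(match):
        # A function replacement avoids backslash handling in the JS text.
        return (
            "try { " + match.group(0) + " } catch (err) { "
            "if (!(process.env.OPENSHELL_SANDBOX === '1' && err.code === 'EACCES')) "
            "throw err; }"
        )

    return WRITE_BLOCK.sub(replace, src, count=1)


if __name__ == "__main__":
    sample = (
        "if (!await tryWriteSingleTopLevelIncludeMutation({ nextConfig }))\n"
        "\t\tawait writeConfigFile(nextConfig, { baseHash });"
    )
    print(wrap_write_block(sample))
```

Requiring exactly one match guards against both a missing pattern and an unexpectedly broad one, so a future upstream restructure still aborts the build instead of patching the wrong call.
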

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 186 - 187, The current Python one-liner patches the
minified JS by exact string match of the
tryWriteSingleTopLevelIncludeMutation/writeConfigFile block (the variables
old/new and the assert old in src), which is fragile against
whitespace/minification changes; change the script to use a regex-based,
whitespace-tolerant search (e.g., compile a pattern that matches the if(!await
tryWriteSingleTopLevelIncludeMutation(...)) await writeConfigFile(...) block
with \s* and re.DOTALL) and perform a re.sub to inject the new try { ... }
catch(...) wrapper, then update the assertion to check the regex matched (or
that the file changed) instead of relying on the literal old string.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 26c92d4a-9980-47a5-8dc7-a8dc2fab2065

📥 Commits

Reviewing files that changed from the base of the PR and between 1e512b1 and 521c599.

📒 Files selected for processing (2)
  • Dockerfile
  • Dockerfile.base
🚧 Files skipped from review as they are similar to previous changes (1)
  • Dockerfile.base

ericksoa added 14 commits April 26, 2026 11:29
The _SANDBOX_SAFETY_NET preload was loaded via NODE_OPTIONS=--require into
EVERY Node process in the sandbox, including short-lived CLI commands like
`openclaw agent`. It installed an unconditional `unhandledRejection`
handler that swallows the rejection — designed to keep the long-running
gateway alive across non-fatal library bugs.

In OpenClaw 2026.4.9 the agent CLI's code paths didn't trip an unhandled
rejection, so the swallow was harmless there. In 2026.4.24 the new plugin
loader / gateway client path produces an unhandled rejection from
`openclaw agent`. Instead of surfacing as an error, the safety net ate it
and the awaited Promise never resolved — leaving the CLI hanging silently
on a request that should have failed fast. This is the exact symptom in
TC-SBX-02: two UNDICI warnings (process startup) followed by minutes of
silence with no error output.

Gate the swallow to argv[2] === "gateway" so the protection is scoped to
its actual purpose (`openclaw gateway run …`). All other CLI commands
(agent, doctor, plugins, tui) get default Node behavior — errors surface
and short-lived processes exit cleanly with a meaningful exit code.
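
The gate plus the benign/unknown split from the PR description can be modelled as a pure decision function. The real preload is a Node --require script; this Python sketch only mirrors its logic, with `argv` laid out as in Node (argv[0] is the runtime, argv[1] the script, argv[2] the subcommand). The function names and return labels are hypothetical.

```python
import re

# Known-benign patterns; the PR names ciao/mDNS as the only current entry.
KNOWN_BENIGN = (re.compile(r"CIAO PROBING CANCELLED"),)


def safety_net_active(argv, env):
    """Mirror the gate: sandboxed and running the gateway subcommand."""
    return (
        env.get("OPENSHELL_SANDBOX") == "1"
        and len(argv) > 2
        and argv[2] == "gateway"
    )


def classify(argv, env, message):
    """Decide how a rejection is handled: default, benign, or log-and-continue."""
    if not safety_net_active(argv, env):
        return "default"  # CLI commands keep the runtime's normal crash behaviour
    if any(p.search(message) for p in KNOWN_BENIGN):
        return "benign"  # documented noise, safe to drop
    return "log-and-continue"  # unknown error: log the stack, keep the gateway up


if __name__ == "__main__":
    env = {"OPENSHELL_SANDBOX": "1"}
    print(classify(["node", "openclaw", "gateway", "run"], env, "TypeError: boom"))
    print(classify(["node", "openclaw", "agent"], env, "TypeError: boom"))
```
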
…lure

TC-SBX-02 hangs without surfacing any error: with the safety-net gate fix,
errors should now propagate on the agent CLI side, but we see only Node
UNDICI warnings then 60s of silence. The remaining hypothesis is that the
gateway-side `agent` method handler hits an error that's swallowed by the
gateway's still-active safety net (intentional — keeps gateway alive),
leaving the client awaiting a response that never comes.

To prove or refute this, the gateway log content during the hang must be
visible in the failed test artifact. The test framework captures only the
test runner's own log (and the agent CLI's SSH output, which is silent).
/tmp/gateway.log inside the sandbox container has the data we need.

Two-part diagnostic, not a behavior change:

1. nemoclaw-start.sh: background-tail /tmp/gateway.log with a [gateway-log:]
   prefix to PID 1's stderr after gateway launch. Each gateway-log line now
   appears in the container's stderr stream (and is filterable by prefix).
   Cleanup: tail PID added to SANDBOX_CHILD_PIDS so cleanup_on_signal
   reaps it on shutdown. Both root and non-root launch paths covered.

2. nightly-e2e.yaml sandbox-operations-e2e: on failure, run `docker logs`
   on every test-sbx-* container and upload as a separate artifact
   (sandbox-operations-docker-logs). The artifact will contain the gateway
   log content (now mirrored to container stderr) at the time of failure.

This is a NemoClaw-side and workflow-level change (no test changes — the
test contract for TC-SBX-02 is unchanged). The runtime diagnostic is
permanent but additive; it can be removed once the upstream root cause is
identified and fixed.

Ref: #2484
The previous post-failure docker logs capture step ran AFTER the test
script's teardown destroyed test sandbox containers — so `docker ps -a`
returned no matches and the artifact was empty.

Replace with a background `docker logs -f` streamer started before the
test runs. As soon as a container appears, its logs stream to a
per-container file in docker-logs/. When the container is removed, the
stream ends but the file persists on the host. The post-failure artifact
upload now captures logs from every container that existed at any point
during the test.

Combined with the [gateway-log:] mirror in nemoclaw-start.sh, this
finally surfaces gateway-side activity (including any sandbox-safety-net
swallowed errors) at the time TC-SBX-02 hangs.

Ref: #2484
The previous docker-logs streamer hit "configured logging driver does
not support reading" for sandbox containers. NemoClaw sandboxes are k3s
pods INSIDE the openshell-cluster container, not sibling docker
containers — `docker logs` cannot read pod stdio.

Switch to `docker exec openshell-cluster-* kubectl logs -f -n openshell
<pod> --all-containers` to stream pod logs (which include PID 1's stderr
mirror of /tmp/gateway.log via the [gateway-log:] prefix added in
nemoclaw-start.sh). Output goes to per-pod files on the host that
persist past pod deletion.

Ref: #2484
The kubectl-logs streamer also returned empty files because the container
log driver in openshell's k3s setup doesn't capture container stdio
(same root cause as the docker logs failure). The only working way to
read /tmp/gateway.log content from outside the pod is via SSH — which
`nemoclaw <sandbox> logs --follow` does internally.

Switch the streamer to `nemoclaw <name> logs --follow > docker-logs/sandbox-<name>.log`.
The streamer waits for nemoclaw to be installed (test does that in its
first phase), polls `nemoclaw list`, and spawns a follower per sandbox.

Ref: #2484
The previous `nemoclaw logs --follow` per-sandbox streamer accumulated
unbounded output and the artifact upload step never finished within the
60-min job timeout (run 24968594521 was cancelled at 1h+ stuck on
Upload sandbox gateway logs).

Switch to snapshot mode: every 10s, run `timeout 8 nemoclaw <name> logs`
and overwrite docker-logs/sandbox-<name>.log with the result, capped at
256KB. The default `nemoclaw logs` invocation returns ~62 lines (already
bounded by /tmp/gateway.log size at snapshot time). When a sandbox is
destroyed by the test, the file holds the final pre-destroy snapshot.

Ref: #2484
The previous streamer parsed `nemoclaw list` pretty-printed output and
picked up the "Sandboxes:" header line whose first token literally is
"Sandboxes:" (with colon). Tried to create docker-logs/sandbox-Sandboxes:.log
which GitHub artifact upload rejects ("not a valid path: contains colon").

Read the registry json directly (~/.nemoclaw/sandboxes.json) via jq and
only accept names matching strict filename-safe pattern
[a-z0-9_-]+ — defense against future parsing issues too.
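
The name filter can be sketched in Python. How names are pulled out of ~/.nemoclaw/sandboxes.json is not shown here, so the sketch takes a list of candidates and leaves the registry schema out; `safe_sandbox_names` and `log_path` are hypothetical helpers.

```python
import re

# Strict allow-list: lowercase letters, digits, underscore, hyphen.
SAFE_NAME = re.compile(r"[a-z0-9_-]+")


def safe_sandbox_names(candidates):
    """Keep only names that are safe to embed in a log file name."""
    return [
        name for name in candidates
        if isinstance(name, str) and SAFE_NAME.fullmatch(name)
    ]


def log_path(name):
    """Per-sandbox log path in the layout used by the streamer."""
    return f"docker-logs/sandbox-{name}.log"


if __name__ == "__main__":
    for name in safe_sandbox_names(["test-sbx-a", "Sandboxes:"]):
        print(log_path(name))
```

Using `fullmatch` rather than `search` matters: a partial match would let "Sandboxes:" through on the strength of its lowercase letters.
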

Ref: #2484
The previous snapshot-based streamer (overwriting per-sandbox file every
10s with `nemoclaw logs` output) lost the agent-request events because
`nemoclaw logs` returns only the tail of /tmp/gateway.log and the ciao
mDNS error spam (~10 errors/sec) buries earlier real events.

Switch to a per-sandbox SSH+tail follower that streams /tmp/gateway.log
directly (full stream from start), filters the uv_interface_addresses
noise inline, and caps each file at 512KB. Spawned once per sandbox via
openshell ssh-config.

Stop step kills the SSH followers along with the streamer.

Ref: #2484
Previous streamer wrote ssh config via mktemp and rm'd it before the
backgrounded ssh child connected — ssh hit "Can't open user config file"
race. Use a per-sandbox stable path /tmp/sshcfg-<name>.tmp and don't
remove it; runner /tmp gets cleaned up at job end anyway.

Ref: #2484
The bash -c '...' single-quoted block had apostrophes inside its
comments (Can't, `rm`) which prematurely terminated the outer single
quote, leaving the rest of the script with unbalanced quotes — bash
exited with "unexpected EOF while looking for matching `\"'" within 6
seconds of job start.

Reword comments to avoid apostrophes.

Ref: #2484
`head -c 524288` blocked waiting for 512KB to arrive through the
tail | grep pipe. Most lines are mDNS noise that grep -v drops, so
useful content arrives slowly. When the streamer was killed at job end,
head had captured zero bytes, so the final file was just the SSH disconnect
message (43 bytes).

Drop the head -c cap so output streams freely while the job runs. As
safety against runaway file size, trim each log file to its last 5MB at
stop time. Real gateway events are interleaved with whatever filtered
content remains, so tail-trim keeps the most recent content (which
includes the TC-SBX-02 hang window).
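
The stop-time trim amounts to keeping the last N bytes of each file. A minimal Python sketch of that operation, assuming the trim runs once after the followers are stopped; `trim_to_tail` is a hypothetical name and the workflow's actual shell command is not reproduced here.

```python
import os
import tempfile

MAX_BYTES = 5 * 1024 * 1024  # 5MB cap from the commit message


def trim_to_tail(path, max_bytes=MAX_BYTES):
    """Rewrite path so it holds at most its last max_bytes bytes."""
    size = os.path.getsize(path)
    if size <= max_bytes:
        return size  # already small enough
    with open(path, "rb") as fh:
        fh.seek(size - max_bytes)
        tail = fh.read()
    tmp = path + ".trim"
    with open(tmp, "wb") as fh:
        fh.write(tail)
    os.replace(tmp, path)  # swap in the trimmed copy
    return len(tail)


if __name__ == "__main__":
    demo = os.path.join(tempfile.mkdtemp(), "sandbox-demo.log")
    with open(demo, "wb") as fh:
        fh.write(b"old " * 4 + b"recent")
    print(trim_to_tail(demo, max_bytes=6))
```

A byte-based cut can split the first remaining line, which is acceptable for a diagnostic artifact where only the most recent window matters.
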

Ref: #2484
The gateway log line "log file: /tmp/openclaw-998/openclaw-2026-04-27.log"
revealed that openclaw writes detailed event tracing to a SEPARATE file
than /tmp/gateway.log (which only captures the launch redirect of
stdout/stderr from nemoclaw-start.sh). The structured log carries the
agent-flow events we need; gateway.log goes quiet after startup because
most subsequent events go to the structured log instead.

Tail BOTH files in the same SSH session so we capture all gateway-side
activity during TC-SBX-02. Glob /tmp/openclaw-*/openclaw-*.log to handle
the per-uid stem (e.g. openclaw-998).

Ref: #2484
Root cause of TC-SBX-02 hang, now fully traced via the gateway-log
streamer artifact:

The bonjour plugin (mDNS service advertiser) attempts to probe network
interfaces via ciao every few seconds. Inside the sandbox netns,
os.networkInterfaces() throws (no usable interfaces). The ciao guard in
nemoclaw-start.sh monkey-patches os.networkInterfaces to return empty,
but that does not stop ciao from cancelling its outstanding probe with
"CIAO PROBING CANCELLED" — an UNHANDLED Promise rejection (the ciao
guard only catches synchronous uncaughtException, not async).

The sandbox-safety-net swallows the rejection (gateway-only after the
recent gate fix), but the swallow happens during the same event loop
tick as in-flight WebSocket handshakes from the openclaw agent CLI.
Pending WS connections get torn down with code 1006 (abnormal closure):

  03:17:39.367  Unhandled promise rejection: CIAO PROBING CANCELLED
  03:17:39.387  [gateway/ws] closed before connect ... code=1006
                                                       (handshake pending,
                                                        durationMs=7)

The agent CLI sees the abrupt close, retries, hits the same race, and
eventually times out at the 10s connect-challenge timeout. The test only
sees UNDICI warnings because the CLI's `console.error` failure message
goes to /tmp/openclaw-<uid>/openclaw-<date>.log (the structured event
log), not stdout/stderr, so the test framework never sees it.

Why TC-SBX-02 worked on 2026.4.9 but not 2026.4.24: bonjour plugin
loading and probe timing changed in the 2026.4.10–24 range
(Jiti-based plugin loader, lazy provider deps), making the rejection
window overlap WS handshakes more aggressively. On 2026.4.9 the timing
was lucky enough that the rejection never overlapped a real connect.

Fix: set plugins.entries.bonjour.enabled=false in the generated
openclaw.json. mDNS service advertisement is useless inside a sandboxed
netns (no peers to advertise to, no clients to discover the service)
and the only thing it accomplishes here is destabilizing other
gateway connections.

Ref: #2484
@ericksoa

Contributor Author

Root cause: bonjour mDNS plugin destabilizes WS connections in sandbox netns

After significant diagnostic plumbing (the openclaw structured event log lives at /tmp/openclaw-<uid>/openclaw-<date>.log, not /tmp/gateway.log), the gateway-log streamer artifact (workflow sandbox-operations-docker-logs) finally captured the failure window for TC-SBX-02. Smoking-gun timeline from the gateway log:

03:17:39.354  [plugins] bonjour: restarting advertiser (service stuck in probing)
03:17:39.367  [openclaw] Unhandled promise rejection: CIAO PROBING CANCELLED
03:17:39.370  wrote stability bundle (rejection logged)
03:17:39.387  [gateway/ws] closed before connect conn=... code=1006 reason=n/a
              (handshake pending, durationMs=7)

20 ms between the unhandled rejection from the bonjour plugin (03:17:39.367) and the abrupt WebSocket close (03:17:39.387).

Causal chain

  1. The bonjour plugin (mDNS service advertiser) attempts to probe network interfaces every few seconds
  2. The sandbox netns has no usable interfaces → os.networkInterfaces() throws
  3. NemoClaw's ciao guard (in nemoclaw-start.sh) monkey-patches os.networkInterfaces to return empty on failure — BUT that doesn't stop ciao from cancelling its in-flight probe with "CIAO PROBING CANCELLED", which surfaces as an unhandled Promise rejection
  4. The ciao guard only catches synchronous uncaughtException, not async unhandledRejection
  5. The sandbox-safety-net catches the rejection (gateway-only after the earlier gate fix in this PR), but the swallow happens during the same event loop tick as in-flight WebSocket handshakes
  6. Pending WS connections from the openclaw agent CLI get torn down with code 1006 (abnormal closure)
  7. The agent CLI retries, hits the same race, eventually times out at the 10s connect-challenge timeout
  8. The CLI's console.error failure message goes to the openclaw structured log, NOT stdout/stderr — that's why the test only ever saw the two UNDICI warnings followed by 60s of silence

Why this surfaces in 2026.4.24 but not 2026.4.9

Plugin load timing changed in the 2026.4.10–24 range (Jiti-based plugin loader, "lazy provider dependencies" in the release notes). The bonjour rejection window now overlaps WS handshakes more aggressively. On 2026.4.9 the timing was a lucky race; on 2026.4.24 it reliably hits.

Why disabling bonjour is the right fix

mDNS service advertisement is structurally useless inside a NemoClaw sandbox:

  • The sandbox netns is isolated — there are no peers on the network to advertise the gateway to
  • The only way the gateway is reached from outside the sandbox is via the openshell SSH tunnel (nemoclaw <sandbox> connect), which doesn't use mDNS discovery
  • Internal-to-sandbox callers (the agent CLI, the configure-guard) connect to 127.0.0.1:18789 directly via the openclaw config, not via mDNS lookup
  • Continuing to load bonjour produces nothing useful and actively destabilizes the gateway every few seconds

This is the kind of plugin that exists for the user-laptop deployment story (where mDNS finds your assistant on a home network), not for the headless sandbox case NemoClaw runs.

Fix in this PR

Set plugins.entries.bonjour.enabled = false in the generated openclaw.json: a single line in the Dockerfile's Python config generator. This doesn't affect the user-laptop NemoClaw flow, which uses a different config path.
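
For illustration, the change amounts to something like the sketch below. The generator's real structure is not shown in this thread, so the helper `disable_plugin` and the placeholder `config` dict are assumptions; only the key path `plugins.entries.<id>.enabled` comes from the description above.

```python
import json


def disable_plugin(config: dict, plugin_id: str) -> dict:
    """Set plugins.entries.<plugin_id>.enabled to False, creating parents as needed."""
    entries = config.setdefault("plugins", {}).setdefault("entries", {})
    entries.setdefault(plugin_id, {})["enabled"] = False
    return config


# Placeholder standing in for the rest of the generated openclaw.json.
config = {"plugins": {"entries": {}}}
disable_plugin(config, "bonjour")
print(json.dumps(config, indent=2))
```

Using `setdefault` keeps any other settings already present under the plugin entry instead of overwriting the whole entry.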

Validation re-run in progress: https://github.com/NVIDIA/NemoClaw/actions/runs/24975221024

Diagnostic infrastructure to remove on green

Once TC-SBX-02 passes, these diagnostic-only commits should be reverted:

  • [gateway-log:] mirror in nemoclaw-start.sh (PID 1 stderr tail of /tmp/gateway.log)
  • Start gateway log streamer (background) and related steps in .github/workflows/nightly-e2e.yaml

These were necessary to find the root cause but add ambient runtime/CI overhead. Cleanup commit will be marked with revert(diag): ….

@github-actions

Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 25077451446
Branch: upgrade/openclaw-2026.4.24
Requested jobs: sandbox-operations-e2e
Summary: 0 passed, 1 failed, 17 skipped

Job Result
cloud-e2e ⏭️ skipped
deployment-services-e2e ⏭️ skipped
diagnostics-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-e2e ⏭️ skipped
inference-routing-e2e ⏭️ skipped
messaging-providers-e2e ⏭️ skipped
network-policy-e2e ⏭️ skipped
overlayfs-autofix-e2e ⏭️ skipped
rebuild-hermes-e2e ⏭️ skipped
rebuild-openclaw-e2e ⏭️ skipped
sandbox-operations-e2e ❌ failure
sandbox-survival-e2e ⏭️ skipped
shields-config-e2e ⏭️ skipped
skip-permissions-e2e ⏭️ skipped
snapshot-commands-e2e ⏭️ skipped
token-rotation-e2e ⏭️ skipped
upgrade-stale-sandbox-e2e ⏭️ skipped

Failed jobs: sandbox-operations-e2e. Check run artifacts for logs.

@github-actions

Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 25078188617
Branch: upgrade/openclaw-2026.4.24
Requested jobs: sandbox-operations-e2e
Summary: 0 passed, 1 failed, 17 skipped

Job Result
cloud-e2e ⏭️ skipped
deployment-services-e2e ⏭️ skipped
diagnostics-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-e2e ⏭️ skipped
inference-routing-e2e ⏭️ skipped
messaging-providers-e2e ⏭️ skipped
network-policy-e2e ⏭️ skipped
overlayfs-autofix-e2e ⏭️ skipped
rebuild-hermes-e2e ⏭️ skipped
rebuild-openclaw-e2e ⏭️ skipped
sandbox-operations-e2e ❌ failure
sandbox-survival-e2e ⏭️ skipped
shields-config-e2e ⏭️ skipped
skip-permissions-e2e ⏭️ skipped
snapshot-commands-e2e ⏭️ skipped
token-rotation-e2e ⏭️ skipped
upgrade-stale-sandbox-e2e ⏭️ skipped

Failed jobs: sandbox-operations-e2e. Check run artifacts for logs.

@github-actions

Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 25079888582
Branch: upgrade/openclaw-2026.4.24
Requested jobs: sandbox-operations-e2e
Summary: 0 passed, 1 failed, 17 skipped

Job Result
cloud-e2e ⏭️ skipped
deployment-services-e2e ⏭️ skipped
diagnostics-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-e2e ⏭️ skipped
inference-routing-e2e ⏭️ skipped
messaging-providers-e2e ⏭️ skipped
network-policy-e2e ⏭️ skipped
overlayfs-autofix-e2e ⏭️ skipped
rebuild-hermes-e2e ⏭️ skipped
rebuild-openclaw-e2e ⏭️ skipped
sandbox-operations-e2e ❌ failure
sandbox-survival-e2e ⏭️ skipped
shields-config-e2e ⏭️ skipped
skip-permissions-e2e ⏭️ skipped
snapshot-commands-e2e ⏭️ skipped
token-rotation-e2e ⏭️ skipped
upgrade-stale-sandbox-e2e ⏭️ skipped

Failed jobs: sandbox-operations-e2e. Check run artifacts for logs.

@github-actions

Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 25080430723
Branch: upgrade/openclaw-2026.4.24
Requested jobs: sandbox-operations-e2e
Summary: 0 passed, 1 failed, 17 skipped

Job Result
cloud-e2e ⏭️ skipped
deployment-services-e2e ⏭️ skipped
diagnostics-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-e2e ⏭️ skipped
inference-routing-e2e ⏭️ skipped
messaging-providers-e2e ⏭️ skipped
network-policy-e2e ⏭️ skipped
overlayfs-autofix-e2e ⏭️ skipped
rebuild-hermes-e2e ⏭️ skipped
rebuild-openclaw-e2e ⏭️ skipped
sandbox-operations-e2e ❌ failure
sandbox-survival-e2e ⏭️ skipped
shields-config-e2e ⏭️ skipped
skip-permissions-e2e ⏭️ skipped
snapshot-commands-e2e ⏭️ skipped
token-rotation-e2e ⏭️ skipped
upgrade-stale-sandbox-e2e ⏭️ skipped

Failed jobs: sandbox-operations-e2e. Check run artifacts for logs.

@github-actions

Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 25080990090
Branch: upgrade/openclaw-2026.4.24
Requested jobs: sandbox-operations-e2e
Summary: 0 passed, 1 failed, 17 skipped

Job Result
cloud-e2e ⏭️ skipped
deployment-services-e2e ⏭️ skipped
diagnostics-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-e2e ⏭️ skipped
inference-routing-e2e ⏭️ skipped
messaging-providers-e2e ⏭️ skipped
network-policy-e2e ⏭️ skipped
overlayfs-autofix-e2e ⏭️ skipped
rebuild-hermes-e2e ⏭️ skipped
rebuild-openclaw-e2e ⏭️ skipped
sandbox-operations-e2e ❌ failure
sandbox-survival-e2e ⏭️ skipped
shields-config-e2e ⏭️ skipped
skip-permissions-e2e ⏭️ skipped
snapshot-commands-e2e ⏭️ skipped
token-rotation-e2e ⏭️ skipped
upgrade-stale-sandbox-e2e ⏭️ skipped

Failed jobs: sandbox-operations-e2e. Check run artifacts for logs.

@github-actions

Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 25081625450
Branch: upgrade/openclaw-2026.4.24
Requested jobs: sandbox-operations-e2e
Summary: 0 passed, 1 failed, 17 skipped

Job Result
cloud-e2e ⏭️ skipped
deployment-services-e2e ⏭️ skipped
diagnostics-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-e2e ⏭️ skipped
inference-routing-e2e ⏭️ skipped
messaging-providers-e2e ⏭️ skipped
network-policy-e2e ⏭️ skipped
overlayfs-autofix-e2e ⏭️ skipped
rebuild-hermes-e2e ⏭️ skipped
rebuild-openclaw-e2e ⏭️ skipped
sandbox-operations-e2e ❌ failure
sandbox-survival-e2e ⏭️ skipped
shields-config-e2e ⏭️ skipped
skip-permissions-e2e ⏭️ skipped
snapshot-commands-e2e ⏭️ skipped
token-rotation-e2e ⏭️ skipped
upgrade-stale-sandbox-e2e ⏭️ skipped

Failed jobs: sandbox-operations-e2e. Check run artifacts for logs.

@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25082270514
Branch: upgrade/openclaw-2026.4.24
Requested jobs: sandbox-operations-e2e
Summary: 1 passed, 0 failed, 17 skipped

Job Result
cloud-e2e ⏭️ skipped
deployment-services-e2e ⏭️ skipped
diagnostics-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-e2e ⏭️ skipped
inference-routing-e2e ⏭️ skipped
messaging-providers-e2e ⏭️ skipped
network-policy-e2e ⏭️ skipped
overlayfs-autofix-e2e ⏭️ skipped
rebuild-hermes-e2e ⏭️ skipped
rebuild-openclaw-e2e ⏭️ skipped
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ⏭️ skipped
shields-config-e2e ⏭️ skipped
skip-permissions-e2e ⏭️ skipped
snapshot-commands-e2e ⏭️ skipped
token-rotation-e2e ⏭️ skipped
upgrade-stale-sandbox-e2e ⏭️ skipped

@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25085223759
Branch: upgrade/openclaw-2026.4.24
Requested jobs: sandbox-operations-e2e
Summary: 1 passed, 0 failed, 18 skipped

Job Result
cloud-e2e ⏭️ skipped
cloud-experimental-e2e ⏭️ skipped
deployment-services-e2e ⏭️ skipped
diagnostics-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-e2e ⏭️ skipped
inference-routing-e2e ⏭️ skipped
messaging-providers-e2e ⏭️ skipped
network-policy-e2e ⏭️ skipped
overlayfs-autofix-e2e ⏭️ skipped
rebuild-hermes-e2e ⏭️ skipped
rebuild-openclaw-e2e ⏭️ skipped
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ⏭️ skipped
shields-config-e2e ⏭️ skipped
skip-permissions-e2e ⏭️ skipped
snapshot-commands-e2e ⏭️ skipped
token-rotation-e2e ⏭️ skipped
upgrade-stale-sandbox-e2e ⏭️ skipped

@ericksoa merged commit 65d2fae into main on Apr 29, 2026
16 checks passed
DemianHeyGen pushed a commit to DemianHeyGen/NemoClaw that referenced this pull request Apr 30, 2026
## Summary

Upgrades OpenClaw from **2026.4.9** to **2026.4.24** (latest stable,
CalVer).

### Fixes in this PR

1. **Version bumps** — `Dockerfile.base`,
`nemoclaw-blueprint/blueprint.yaml`, `agents/openclaw/manifest.yaml`,
`src/lib/sandbox-version.test.ts`.
2. **Patch 4 updated** — OpenClaw 2026.4.24 restructured
`replaceConfigFile` to first attempt
`tryWriteSingleTopLevelIncludeMutation` (writes to a `$include` file
like `plugins.json5`) before falling back to `writeConfigFile`. The old
patch matched an exact tab-indented `writeConfigFile(params.nextConfig,
{...})` string that no longer exists. Updated to match the new `if
(!await tryWriteSingleTopLevelIncludeMutation(...)) await
writeConfigFile(...)` block and wrap the entire write path in the
OPENSHELL_SANDBOX-gated EACCES try/catch.
3. **`plugin-runtime-deps` symlink** — OpenClaw 2026.4.24 introduced
lazy plugin runtime-dep installation (Jiti loader). The CLI writes to
`~/.openclaw/plugin-runtime-deps/openclaw-<version>-<hash>/` on first
invocation. NemoClaw locks `/sandbox/.openclaw` to `444 root:root`, so
every bundled provider failed to load with `EACCES`. Fix: created the
dir in the writable `.openclaw-data` tree and symlinked it from the
immutable config tree, mirroring the existing pattern used for `logs`,
`credentials`, `extensions`, etc. Added in both `Dockerfile.base`
(canonical) and `Dockerfile` (idempotent fixup for stale GHCR base).
4. **Selective sandbox safety-net** — `_SANDBOX_SAFETY_NET` (a Node
`--require` preload from `nemoclaw-start.sh`) used to be a catch-all
swallow + `process.exit` interceptor. Rewritten to: (a) gate to gateway
processes only (`OPENSHELL_SANDBOX=1` + `argv[2]==='gateway'`) so CLI
commands keep default Node crash behaviour; (b) match documented
known-benign patterns (currently `ciao`/mDNS — produced when bonjour's
probe state machine cancels itself, since the sandbox netns has no
multicast); (c) for unknown errors, log full stack but keep gateway
alive (gateway is shared infrastructure, user-initiated actions must not
take it down); (d) drop `process.exit` interception entirely. The CIAO
guard's `uncaughtException` listener was similarly gated to gateway
processes — registering one in CLI processes turns Node's default
crash-on-uncaught into silent absorb, which would silently hang
`openclaw agent`.
5. **Disable bonjour and qqbot bundled plugins** — both ship
enabled-by-default in 2026.4.24 and break in the sandbox netns:
- **bonjour**: introduced in 2026.4.15, uses `@homebridge/ciao` for mDNS
announcement. Sandbox netns has no multicast — ciao's probe state
machine fails at startup.
- **qqbot**: has `stageRuntimeDependencies=true`, so its npm deps
(`@tencent-connect/qqbot-connector`, `silk-wasm`, etc.) install on first
load. The sandbox L7 proxy denies the registry URL with `403
policy_denied`, the install retries for ~6 minutes, and while channel
loading is stuck the gateway can't service `openclaw agent` requests.
Both disabled via `plugins.entries.<id>.enabled = false` in
`scripts/generate-openclaw-config.py`.
6. **Build-context fix for `generate-openclaw-config.py`** — main's PR
NVIDIA#2449 (commit `f5ee8a4d`) extracted the inline Python config-generator
from Dockerfile into `scripts/generate-openclaw-config.py` and added
`COPY scripts/generate-openclaw-config.py …` to Dockerfile, but did not
update `src/lib/sandbox-build-context.ts` which curates the optimized
build context for sandbox image builds. Without this, every nightly E2E
job (and any sandbox onboard) fails with `COPY failed: file not found in
build context`. Added the file to `stageOptimizedSandboxBuildContext()`
next to `nemoclaw-start.sh` and added a test assertion so the staging
stays in sync.
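
The gating rules in fix 4 can be modelled as a small decision function. This is a Python model of the decision table only: the real `_SANDBOX_SAFETY_NET` is a Node preload, and the function name, the return labels, and the single pattern in `KNOWN_BENIGN` (taken from the "CIAO PROBING CANCELLED" message quoted in this PR) are assumptions.

```python
import re

# Only the message quoted in this PR; the real preload may match more patterns.
KNOWN_BENIGN = [re.compile(r"CIAO PROBING CANCELLED")]


def safety_net_action(env: dict, argv: list, message: str) -> str:
    """Decide how an uncaught error is handled, following rules (a) to (c)."""
    is_gateway = (
        env.get("OPENSHELL_SANDBOX") == "1"
        and len(argv) > 2
        and argv[2] == "gateway"
    )
    if not is_gateway:
        return "default"  # (a) CLI processes keep the default crash behaviour
    if any(pattern.search(message) for pattern in KNOWN_BENIGN):
        return "swallow"  # (b) documented known-benign pattern
    return "log-and-continue"  # (c) unknown error: log the stack, keep the gateway up
```

Rule (d), dropping the `process.exit` interception, has no counterpart here because the model simply never intercepts exits.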

### Status

Most recent un-rate-limited run (25015126555 with build-context fix):
**13 of 18 jobs pass**. `sandbox-operations-e2e` still fails — only
TC-SBX-02 (Connect & Chat) within it. All other TC-SBX cases (03, 04,
05, 06, 07, 08, 10, 11, 12) pass on `test-sbx-a`, confirming the gateway
is functional. After the `sandbox-build-context.ts` fix and the qqbot
disable, the failure mode of TC-SBX-02 changed from `SSH command timed
out after 60s` to `Expected '42' in agent reply; reply=''` — same 60-90
second hang but now hitting the test's outer `run_with_timeout` rather
than producing a stack trace. The test drops stderr (`2>/dev/null`), and
the gateway-log streamer/snapshot infrastructure has been unable to
capture `test-sbx-a`'s `/tmp/openclaw-998/openclaw-*.log` reliably (the
post-test openshell state has no active gateway after TC-SBX-06's docker
kill, and the streamer's connection to test-sbx-a races and gets
`Connection refused`). Still root-causing.

### Notable upstream changes (2026.4.9 → 2026.4.24)

- Google Meet bundled plugin, DeepSeek V4 Flash/Pro, realtime voice
loops (Talk/Voice Call/Google Meet), Gemini Live, browser automation
improvements.
- Lighter startup: static model catalogs, manifest-backed model rows,
**lazy provider dependencies** (the new plugin-runtime-deps mechanism;
root cause of fix #3).
- **Breaking:** Plugin SDK tool-result transforms migrated from
`registerEmbeddedExtensionFactory()` to
`registerAgentToolResultMiddleware()` — verified NemoClaw uses neither.
- **Breaking:** Plugin registry migrated from `plugins.installs` config
key to managed `plugins/installs.json` ledger — `openclaw doctor --fix`
migrates automatically.
- Config writes restructured to use single-file `$include` mutations
before falling back to full config write (root cause of fix #2).
- CVE-2026-41349, CVE-2026-22181 fixes; exec-approvals chat enablement
(2026.4.22); cron `jobs-state.json` separation (2026.4.20).
- bonjour mDNS plugin added in 2026.4.15 (root cause of fix #5a).

### User sandbox state migration on rebuild

Existing user sandboxes upgrade via `nemoclaw <name> rebuild`. State
(memory/, workspace/, agents/, extensions/, etc.) is backed up via tar,
sandbox is destroyed and recreated with the new image, state is
restored, `openclaw doctor --fix` runs post-restore.
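
A minimal sketch of the backup and restore halves of that flow, assuming a plain gzip tar of named state directories. The function names and arguments are hypothetical, and the real `rebuild` command's implementation is not shown in this thread.

```python
import os
import tarfile


def backup_state(sandbox_home: str, names: list[str], archive: str) -> None:
    """Archive each existing state directory under sandbox_home."""
    with tarfile.open(archive, "w:gz") as tar:
        for name in names:
            path = os.path.join(sandbox_home, name)
            if os.path.exists(path):  # skip state dirs this sandbox never created
                tar.add(path, arcname=name)


def restore_state(archive: str, sandbox_home: str) -> None:
    """Unpack a state archive into the recreated sandbox home."""
    with tarfile.open(archive, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Pythons: refuse absolute paths and links that escape the target.
            tar.extractall(sandbox_home, filter="data")
        else:
            tar.extractall(sandbox_home)
```

Skipping missing directories at backup time matches the behaviour noted below, where state absent from an older backup simply resets after the upgrade.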

**Handled automatically:** memory, cron job definitions, plugin
auto-discovery, plugin registry migration. **Existing reset behavior
(not new):** exec-approvals, credentials, device pairing. **New minor
behavior change:** cron runtime state (`jobs-state.json`) absent in
pre-2026.4.20 backups — job execution history resets, jobs may re-fire
once after upgrade.

## Test plan

- [x] CI lint, typecheck, unit tests pass
- [x] Docker base image and sandbox image build with all dist patches
applied
- [x] 13/18 nightly E2E jobs pass cleanly with all six fixes
- [ ] **TC-SBX-02** — root cause for the residual `reply=''` hang under
investigation; the gateway-log capture infrastructure needs to work
reliably post-test before we can read what's happening server-side
- [ ] Manual smoke test via `nemoclaw <sandbox> connect` interactive
flow
- [ ] Rebuild test: existing 2026.4.9 sandbox → rebuild → verify state
preserved (rebuild-openclaw-e2e covers this)
ericksoa pushed a commit that referenced this pull request May 1, 2026
…ons land cleanly (#2681) (#2851)

## Summary

Replaces the EACCES-swallow approach proposed in #2693 with proper Unix
group permissions. Control-UI toggles in the OpenClaw dashboard (Enable
Dreaming, account toggles, etc.) now **persist in default mode** instead
of throwing `GatewayRequestError: EACCES` or becoming silent no-ops.

## Background

OpenClaw 2026.4.24 (landed via #2484) introduced `mutateConfigFile` as
the new control-UI write path. Patch 4 in the Dockerfile only wraps the
legacy `replaceConfigFile` (plugin install path), so every config-toggle
click in the sandbox dashboard now fails with EACCES.

#2693 proposed adding "Patch 4b" — a parallel try/catch that swallows
the EACCES. That makes toggles non-functional in the sandbox: the user
clicks "Enable Dreaming," gets no error, but nothing actually persists.
UX improves over the crash; underlying limitation stays.

This PR implements the alternative design Aaron sketched for #2681:
rather than wrapping each new write path in EACCES handlers, fix the
actual permissions so the writes succeed.

## Closes / Supersedes

- Closes #2681
- Supersedes #2693 — thanks @Sanjays2402 for raising the issue and the
initial swallow attempt that surfaced the deeper design question

## Implementation (the 6-item spec)

| # | Item | File |
|---|------|------|
| 1 | Keep `gateway` as a separate UID from `sandbox`; add it to the `sandbox` group | `Dockerfile.base` |
| 2 | Stale-base fallback so older `sandbox-base:latest` tags get the group membership at derived-image build time | `Dockerfile` |
| 3 | `/sandbox/.openclaw` group-writable + setgid on dirs; `.config-hash` file mode 664 | `Dockerfile.base`, `Dockerfile` |
| 4 | `normalize_mutable_config_perms()` at startup, gated on Shields state | `scripts/nemoclaw-start.sh` |
| 5 | `shields down` restores 660/2770 (group-writable + setgid) for OpenClaw; Hermes left at historical 640/750 (no separate gateway UID, contract doesn't apply) | `src/lib/shields.ts` |
| 6 | Tests assert the new invariant: writes succeed in default mode, no new EACCES swallow | `test/repro-2681-group-writable.test.ts` |

## Why setgid

`chmod g+s` on directories means new files inherit `group=sandbox`
regardless of which UID created them. So `gateway` writes a file → file
is `group=sandbox` → the `sandbox` user (also in the group) can still
read it. Without setgid, gateway's writes would land with
`group=gateway` and the agent might lose read access on rotation.
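
As a rough illustration of the 660/2770 modes from the spec table, the sketch below applies them to a directory tree. The helper name `make_group_writable` is hypothetical, and the group ownership change (`chown` to `sandbox`) is left out because it needs privileges; this is not the code in `shields.ts` or `nemoclaw-start.sh`.

```python
import os


def make_group_writable(root: str) -> None:
    """Apply mode 2770 to directories and 660 to files under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        # rwxrws---: the setgid bit makes new entries inherit the directory's group.
        os.chmod(dirpath, 0o2770)
        for name in filenames:
            # rw-rw----: owner and group can read and write, others get nothing.
            os.chmod(os.path.join(dirpath, name), 0o660)
```

With the directory group set to `sandbox`, a file created by the `gateway` UID lands with that same group, which is the property the paragraph above relies on.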

## Patch 4 retention

The existing `Patch 4` (replaceConfigFile EACCES swallow) is
**intentionally retained** as a defensive fallback for:

- Older base images during the rollout window
- Host filesystems that don't honor setgid (rare, but possible on some
Windows/WSL2 configurations)
- Other write paths in OpenClaw that might surface in future versions

No new EACCES swallow patch is added — the `Patch 4b` approach from
#2693 is explicitly rejected per spec item #6.

## Verification

- [x] `npm run build:cli` compiles the changed `shields.ts`
- [x] 11/11 new tests pass in `test/repro-2681-group-writable.test.ts` —
assert structural invariants of the group-writable contract
- [x] 443/443 plugin tests pass
- [x] Pre-existing CLI tests that fail on this branch ALSO fail on
pristine main (`@oclif/core` module-not-found from in-flight migration;
not caused by this PR)
- [ ] **Brev E2E required** — touches Dockerfile + Dockerfile.base +
shields lifecycle. Adaptive matrix: M×DANGER → full Brev sweep before
merge

## Test plan

- [x] Unit: 11 structural assertions in
`repro-2681-group-writable.test.ts`
- [ ] CI: `build-sandbox-images` (validates the group-membership +
setgid Dockerfile changes)
- [ ] CI: `test-e2e-sandbox` (validates shields lifecycle + onboard
flow)
- [ ] CI: `test-e2e-gateway-isolation` (validates the
gateway-as-different-UID still runs cleanly)
- [ ] Manual repro: onboard, click "Enable Dreaming" in dashboard,
verify mutation persists across `nemoclaw status`

## Type of Change

- [x] Code change (feature, bug fix, or refactor)

## AI Disclosure

- [x] AI-assisted — tool: Claude Code
laitingsheng added a commit that referenced this pull request May 14, 2026
Patch 4 is a regex-based monkey-patch over OpenClaw's compiled JS that
suppresses EACCES inside replaceConfigFile. Its source-shape coupling
has broken three times in eight days (#2377, #2484, #2876) chasing
upstream refactors; #2686 and #3497 report the latest casualty, where
the regex no longer finds the function in 2026.4.24 and the image build
fails.

Patch 4 is also unnecessary by design:

* In mutable-default mode, openclaw.json is chmod 660 sandbox:sandbox
  and the gateway UID is in the sandbox group (#2681), so plugin
  install writes through without ever hitting EACCES.
* In shields-up mode, the entire config tree (file + parent dir + the
  plugin/extensions state dirs in HIGH_RISK_STATE_DIRS) is locked to
  root:root by design — refusing runtime mutations is the whole point
  of shields-up. Suppressing the EACCES masked that refusal and made
  the install appear to succeed silently while only the auto-discovery
  half landed.

The expected flow is configure-in-mutable-mode → shields up → run.
Plugin install attempted while shielded should fail cleanly, which is
what happens without Patch 4.

Reverts the rcf-shim replacement attempt; the require-hook approach
does not catch OpenClaw's ESM named imports anyway (capture-at-import-
time semantics).

Resolves #2686
Resolves #3497

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
cv pushed a commit that referenced this pull request May 15, 2026
## Summary
Patch 4 in the sandbox `Dockerfile` is a regex-based monkey-patch over
OpenClaw's compiled JS that wraps `replaceConfigFile` in an EACCES
try/catch suppression. It is source-shape-coupled and has been rewritten
three times in eight days chasing OpenClaw refactors:

- `fefd69fa2` (#2377) — original literal-string match against
[openclaw/openclaw@v2026.4.9](https://github.com/openclaw/openclaw/releases/tag/v2026.4.9).
- `5dcb0a9b9` (#2484) — updated the literal string for the restructured
write block in
[openclaw/openclaw@v2026.4.24](https://github.com/openclaw/openclaw/releases/tag/v2026.4.24).
- `e0290e153` (#2876) — hardened to a tolerant
whitespace/property-order-aware regex against
[openclaw/openclaw@v2026.4.24](https://github.com/openclaw/openclaw/releases/tag/v2026.4.24).

#2686 and #3497 are the latest break: in current OpenClaw, the regex no
longer finds the function shape and the image build aborts at Step
17/56.
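
To make the source-shape coupling concrete, the sketch below contrasts a literal match with a whitespace-tolerant regex. The sample strings are hypothetical stand-ins and do not reproduce OpenClaw's compiled output or the deleted `scripts/rcf_patch.py`; only the `writeConfigFile(params.nextConfig, {` fragment comes from the #2484 description.

```python
import re

# Fragment quoted in #2484; everything else about the real dist file is assumed.
LITERAL = "await writeConfigFile(params.nextConfig, {"
TOLERANT = re.compile(r"await\s+writeConfigFile\(\s*params\.nextConfig\s*,\s*\{")


def find_patch_site(source: str, pattern: re.Pattern) -> int:
    """Return the match offset, or raise so the build fails instead of shipping unpatched code."""
    match = pattern.search(source)
    if match is None:
        raise RuntimeError("patch target not found; upstream source shape changed")
    return match.start()


reflowed = "await writeConfigFile(\n  params.nextConfig,\n  {"
print(LITERAL in reflowed)                  # the literal match misses a reflowed call
print(find_patch_site(reflowed, TOLERANT))  # the tolerant regex still finds it
```

A tolerant regex survives reformatting but not a rename or restructure of the call itself, which is the failure mode that makes this kind of patch fragile.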

Patch 4 is also unnecessary by design. The EACCES it was suppressing
does not happen in the supported flows:

- **Mutable-default mode** (fresh sandbox, before `nemoclaw shields
up`): `openclaw.json` is `chmod 660 sandbox:sandbox` and the gateway UID
is in the sandbox group, courtesy of #2851 (closing #2681; superseding
the EACCES-swallow attempt in #2693). `openclaw plugins install` writes
through normally; no EACCES.
- **Shields-up mode** (locked): the entire config tree — file, parent
directory, and the `extensions`/`plugins` state dirs from
[HIGH_RISK_STATE_DIRS](src/lib/shields/index.ts#L292-L306) — is locked
to `root:root` by design. Shields-up exists *to refuse* runtime config
and plugin mutations. Suppressing the EACCES masked that refusal and
made `openclaw plugins install` appear to succeed silently while only
the auto-discovery half landed.

The expected lifecycle is **configure-in-mutable-mode → `shields up` →
run**. Plugin install attempted while shielded should fail cleanly; that
is exactly what happens without Patch 4.

This PR therefore deletes Patch 4 entirely.

## Related Issue
Resolves #2686
Resolves #3497

Related context:

- #2681 — original "make `.openclaw` group-writable" issue, closed by
#2851.
- #2851 — PR that made mutable-mode plugin install work without an
EACCES swallow.
- #2693 — closed earlier EACCES-swallow attempt, superseded once #2851
landed.
- #2544 — NemoClaw issue tracking the broader "plugin config requires
multi-minute rebuild" problem.
- openclaw/openclaw#72950 — upstream defect (no env-var or
writable-overlay path for `plugins.entries.<id>.config`); the real fix
has to land there.

## Changes
- `Dockerfile`: drop the Patch 4 block (the `COPY scripts/rcf_patch.py`
line and the inline Python invocation + grep guard).
- `scripts/rcf_patch.py`: deleted.
- `src/lib/sandbox/build-context.ts`: stop staging
`scripts/rcf_patch.py` into the sandbox build context.
- `test/rcf-patch.test.ts`: deleted.

## Type of Change

- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification

- [x] `npx prek run --all-files` passes
- [x] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
  * Removed OpenClaw patching logic from Docker build process and related artifact copies
  * Updated build context script staging behavior

* **Tests**
  * Enhanced sandbox configuration test suite with environment variable passthrough support
  * Added version-based conditional patching validation and warning behavior tests

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@wscurran wscurran added area: packaging Packages, images, registries, installers, or distribution chore Build, CI, dependency, or tooling maintenance platform: container Affects Docker, containerd, Podman, or images and removed Docker labels Jun 3, 2026

Labels

area: packaging Packages, images, registries, installers, or distribution chore Build, CI, dependency, or tooling maintenance dependencies Pull requests that update a dependency file platform: container Affects Docker, containerd, Podman, or images
