fix(sandbox): fix non-root gateway startup and add crash safety net#2472
Conversation
…og readable Two fixes for the messaging-providers-e2e Phase 7 Slack guard test that has never passed since #2355: 1. Add the Slack channel guard to the proxy-env.sh sourced file so interactive sessions (openshell sandbox connect/exec) see the guard in NODE_OPTIONS. The guard file is installed after proxy-env.sh is written, so use a runtime conditional ([ -f ... ]) in the sourced script. This fixes the misleading diagnostic that showed NODE_OPTIONS without the guard. 2. Change gateway.log permissions from 600 to 644 so E2E diagnostics (openshell sandbox exec -- cat /tmp/gateway.log) can read the log without being the gateway user. The log doesn't contain secrets.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds a global Node.js sandbox safety-net preload and a ciao/mDNS guard preload; reorders NODE_OPTIONS (safety-net first, ciao always, Slack guard conditional); makes rc-file locking best-effort; permits Changes
Sequence Diagram(s)sequenceDiagram
participant Shell as User Shell
participant Start as scripts/nemoclaw-start.sh
participant Node as Node.js Process
participant Safety as Safety-net Preload
participant Ciao as Ciao/mDNS Guard
participant App as Application
Note over Start,Node: Build NODE_OPTIONS (safety-net first,\nciao guard always, slack guard if present)
Shell->>Start: launch sandbox (OPENSHELL_SANDBOX=1)
Start->>Node: exec node with NODE_OPTIONS (preloads)
Node->>Safety: require safety-net preload
Safety->>Node: install uncaughtException / unhandledRejection handlers
Node->>Ciao: require ciao guard preload
Ciao->>Node: monkey-patch os.networkInterfaces() to safe-return {}
Node->>App: load and run application
App-->>Safety: runtime error / unhandled rejection
Safety-->>App: swallow/log error and optionally prevent exit
App->>Node: normal or recovered termination
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 1399-1404: Change the permission of /tmp/gateway.log to match the
validator expectations: replace the chmod 644 call with chmod 600 so the file
created in nemoclaw-start.sh remains owned by gateway:gateway and is only
readable by owner; ensure this aligns with validate_tmp_permissions (the sandbox
tmp-permissions validator) and keep the existing touch and chown/gateway
ownership calls intact.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 22e388c4-ba43-4f37-8cb1-e2b31e8434dd
📒 Files selected for processing (1)
scripts/nemoclaw-start.sh
The guard file doesn't exist in the sandbox even though openclaw.json should contain "slack". Add logging to install_slack_channel_guard when the grep fails (reports file existence/readability) and add E2E diagnostics to check the grep result and container logs for guard skip/install messages.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
scripts/nemoclaw-start.sh (1)
1403-1408:⚠️ Potential issue | 🔴 Critical
chmod 644still conflicts with the tmp-permissions validator.The root path still runs
validate_tmp_permissionsbefore launch, so leaving Line 1408 at644will fail startup if that validator still requires/tmp/gateway.logto stay owner-only. Either keep this file at600, or update the validator and every dependent expectation together.You can verify the mismatch with:
#!/usr/bin/env bash set -euo pipefail echo "== gateway.log permissions in startup scripts ==" rg -n -C2 '/tmp/gateway\.log|chmod 644|chmod 600' \ scripts/nemoclaw-start.sh \ scripts/lib/sandbox-init.sh \ agents/hermes/start.sh echo echo "== validate_tmp_permissions implementation ==" rg -n -A60 -B5 'validate_tmp_permissions\s*\(\)' scripts/lib/sandbox-init.shExpected result:
scripts/nemoclaw-start.shshowschmod 644, whilescripts/lib/sandbox-init.shstill documents or enforces600for/tmp/gateway.log.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/nemoclaw-start.sh` around lines 1403 - 1408, The startup script creates /tmp/gateway.log with chmod 644 which conflicts with the existing validate_tmp_permissions logic (validate_tmp_permissions in scripts/lib/sandbox-init.sh) that expects owner-only perms; change the chmod in the block where /tmp/gateway.log is touched/chowned in scripts/nemoclaw-start.sh from 644 to 600 to match the validator (or alternatively, if you intend to relax the validator, update validate_tmp_permissions and all dependent expectations (including agents/hermes/start.sh and any tests) together), ensuring the file remains owned by gateway:gateway and permission checks in validate_tmp_permissions still pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/test-messaging-providers.sh`:
- Around line 675-676: Replace the grep+head pipeline with grep's match-limit
option: in the command that sets the container_log variable (the line using
nemoclaw "$SANDBOX_NAME" logs ... | grep -i "channel
guard\|slack.*guard\|guard.*skip\|guard.*install" | head -5 || echo "no guard
messages"), remove the pipe to head and add grep -m 5 to limit matches; preserve
the case-insensitive -i and the trailing || echo fallback so container_log still
falls back to "no guard messages" when there are no matches.
---
Duplicate comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 1403-1408: The startup script creates /tmp/gateway.log with chmod
644 which conflicts with the existing validate_tmp_permissions logic
(validate_tmp_permissions in scripts/lib/sandbox-init.sh) that expects
owner-only perms; change the chmod in the block where /tmp/gateway.log is
touched/chowned in scripts/nemoclaw-start.sh from 644 to 600 to match the
validator (or alternatively, if you intend to relax the validator, update
validate_tmp_permissions and all dependent expectations (including
agents/hermes/start.sh and any tests) together), ensuring the file remains owned
by gateway:gateway and permission checks in validate_tmp_permissions still pass.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 93f07b20-4343-495f-8a3d-7cbbac4c1d43
📒 Files selected for processing (2)
scripts/nemoclaw-start.shtest/e2e/test-messaging-providers.sh
nemoclaw logs reads /tmp/gateway.log, not container stderr. The entrypoint guard messages go to stderr (Docker logs). Try openshell sandbox logs and docker logs directly to find guard installation messages.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
test/e2e/test-messaging-providers.sh (1)
675-680:⚠️ Potential issue | 🟡 MinorAvoid false “no guard messages” fallbacks under
pipefail.With
set -o pipefail(Line 58),grep ... | head -10can return non-zero (SIGPIPE ongrep) after matching lines, which incorrectly triggers the|| echo ...fallback. This can hide real guard diagnostics.Suggested fix
- container_log=$(openshell sandbox logs --name "$SANDBOX_NAME" 2>&1 | grep -i "channel guard\|slack.*guard\|guard.*skip\|guard.*install\|\[channels\].*slack\|\[channels\].*guard" | head -10 || echo "no guard messages in openshell logs") + container_log=$(openshell sandbox logs --name "$SANDBOX_NAME" 2>&1 | grep -im 10 "channel guard\|slack.*guard\|guard.*skip\|guard.*install\|\[channels\].*slack\|\[channels\].*guard" || echo "no guard messages in openshell logs") @@ - docker_log=$(docker logs "$container_id" 2>&1 | grep -i "channel guard\|slack.*guard\|\[channels\]" | head -10 || echo "no guard messages in docker logs") + docker_log=$(docker logs "$container_id" 2>&1 | grep -im 10 "channel guard\|slack.*guard\|\[channels\]" || echo "no guard messages in docker logs")#!/usr/bin/env bash set -uo pipefail echo "Repro: grep|head can trigger fallback even when matches exist" out=$( printf 'match\n%.0s' {1..50} \ | grep -i 'match' \ | head -10 \ || echo 'FALLBACK_TRIGGERED' ) printf '%s\n' "$out" echo echo "Control: grep -m avoids SIGPIPE fallback" out2=$( printf 'match\n%.0s' {1..50} \ | grep -im 10 'match' \ || echo 'FALLBACK_TRIGGERED' ) printf '%s\n' "$out2"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/test-messaging-providers.sh` around lines 675 - 680, The fallback is being incorrectly triggered due to SIGPIPE when using "grep ... | head -10" under pipefail; update the two places that set container_log and docker_log (the assignments referencing openshell logs and docker logs) to avoid a piped head: replace the grep|head pipeline with grep -m 10 (use the -m/--max-count option) so grep stops after 10 matches and won't emit SIGPIPE, ensuring the "|| echo 'no guard messages...'" fallback only runs when there truly are no matches.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@test/e2e/test-messaging-providers.sh`:
- Around line 675-680: The fallback is being incorrectly triggered due to
SIGPIPE when using "grep ... | head -10" under pipefail; update the two places
that set container_log and docker_log (the assignments referencing openshell
logs and docker logs) to avoid a piped head: replace the grep|head pipeline with
grep -m 10 (use the -m/--max-count option) so grep stops after 10 matches and
won't emit SIGPIPE, ensuring the "|| echo 'no guard messages...'" fallback only
runs when there truly are no matches.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2dfb655a-0981-42e1-a90b-d1826e713a26
📒 Files selected for processing (1)
test/e2e/test-messaging-providers.sh
List all nemoclaw-* and gateway.log files in /tmp to see exactly what the entrypoint created vs what's missing.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
test/e2e/test-messaging-providers.sh (1)
678-684:⚠️ Potential issue | 🟡 MinorAvoid
grep | headwithpipefailin these log captures.Line 678 and Line 683 can emit fallback text even when matches exist (SIGPIPE from
headmakes the pipeline fail underpipefail), which pollutes diagnostics. This was already raised previously in the PR discussion.Suggested fix
- container_log=$(openshell sandbox logs --name "$SANDBOX_NAME" 2>&1 | grep -i "channel guard\|slack.*guard\|guard.*skip\|guard.*install\|\[channels\].*slack\|\[channels\].*guard" | head -10 || echo "no guard messages in openshell logs") + container_log=$(openshell sandbox logs --name "$SANDBOX_NAME" 2>&1 | grep -im 10 "channel guard\|slack.*guard\|guard.*skip\|guard.*install\|\[channels\].*slack\|\[channels\].*guard" || echo "no guard messages in openshell logs") @@ - container_id=$(openshell sandbox exec --name "$SANDBOX_NAME" -- cat /proc/1/cgroup 2>/dev/null | grep -oP '[a-f0-9]{64}' | head -1 || echo "") + container_id=$(openshell sandbox exec --name "$SANDBOX_NAME" -- cat /proc/1/cgroup 2>/dev/null | grep -oPm 1 '[a-f0-9]{64}' || echo "") @@ - docker_log=$(docker logs "$container_id" 2>&1 | grep -i "channel guard\|slack.*guard\|\[channels\]" | head -10 || echo "no guard messages in docker logs") + docker_log=$(docker logs "$container_id" 2>&1 | grep -im 10 "channel guard\|slack.*guard\|\[channels\]" || echo "no guard messages in docker logs")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/test-messaging-providers.sh` around lines 678 - 684, The pipelines that capture logs (the commands assigning container_log and docker_log and the container_id retrieval) use "grep ... | head -10" which can produce SIGPIPE under pipefail and return the fallback text; change these to avoid piping into head (e.g., use grep's -m 10 to limit matches or use a single-tool solution like awk/sed to select the first 10 matches) so the pipeline won't fail with SIGPIPE. Update the two places that create container_log and docker_log (and any similar openshell/docker log captures) to use grep -m 10 or an equivalent single-process selector to reliably return up to 10 matches without causing a pipe failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/test-messaging-providers.sh`:
- Around line 665-667: The current tmp_files assignment calls openshell sandbox
exec with an unquoted glob (/tmp/nemoclaw-*) so the caller shell may expand it;
change the invocation of openshell sandbox exec (the command that sets
tmp_files) to run a shell inside the sandbox (e.g., use sh -c with a
single-quoted command) so ls -la /tmp/nemoclaw-* /tmp/gateway.log is executed
and expanded inside the sandbox; update the tmp_files variable assignment and
keep the surrounding error fallback and the info " /tmp/nemoclaw-* files:
$tmp_files" unchanged.
---
Duplicate comments:
In `@test/e2e/test-messaging-providers.sh`:
- Around line 678-684: The pipelines that capture logs (the commands assigning
container_log and docker_log and the container_id retrieval) use "grep ... |
head -10" which can produce SIGPIPE under pipefail and return the fallback text;
change these to avoid piping into head (e.g., use grep's -m 10 to limit matches
or use a single-tool solution like awk/sed to select the first 10 matches) so
the pipeline won't fail with SIGPIPE. Update the two places that create
container_log and docker_log (and any similar openshell/docker log captures) to
use grep -m 10 or an equivalent single-process selector to reliably return up to
10 matches without causing a pipe failure.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 35e4ab05-ccaa-495b-af0f-d820d3edfff1
📒 Files selected for processing (1)
test/e2e/test-messaging-providers.sh
The /tmp/nemoclaw-* glob was expanding on the host shell before being passed to openshell sandbox exec, showing host files instead of sandbox files. Wrap in bash -c to expand inside the container.
Write breadcrumb timestamps to /tmp/nemoclaw-entrypoint-trace.log at key points: after proxy-env, before root/non-root branch, before each guard install call. Read the trace in the E2E diagnostic. This will show exactly where the entrypoint stops executing.
Root cause: install_configure_guard() tries to write to /sandbox/.bashrc which is Landlock read-only at runtime (#804). With set -e active, the write failure kills the entrypoint before install_slack_channel_guard and the gateway startup ever run. The proxy fix and nemotron fix work because they're installed at top level (before the root/non-root branch). The Slack guard and gateway startup are inside the branch and never execute. Fix: check file writability before attempting the .bashrc update. If the file is read-only (Landlock), skip it gracefully. Also add 2>/dev/null || true to the cat redirect as defense-in-depth.
lock_rc_files() calls chmod 444 on .bashrc/.profile which fails under Landlock. With set -e this kills the entrypoint — same root cause as the install_configure_guard fix. Add || true so it degrades gracefully.
gateway.log was changed from 600 to 644 for diagnostic readability. Update the validate_tmp_permissions check to expect 644 for gateway.log so it doesn't fail and kill the entrypoint under set -e.
The trace still dies before install_configure_guard. Add per-line traces to identify which of verify_config_integrity, apply_model_override, apply_cors_override, apply_slack_token_override, or token generation is the actual failure point.
The [ -w file ] test checks DAC permissions but cannot detect Landlock enforcement. The sandbox user owns .bashrc (DAC says writable) but Landlock blocks the write at kernel level. Under set -e, the failed write kills the entrypoint before the gateway ever starts. Remove the -w guard entirely and wrap every write operation in || true / continue so Landlock failures are silently skipped.
- Change non-root gateway.log to 644 (matching root path) - Add post-launch diagnostic: check if gateway PID is alive after 3s, dump gateway.log contents to trace file if non-empty
The @homebridge/ciao mDNS library calls os.networkInterfaces() which
throws SystemError (uv_interface_addresses) inside sandboxes with
restricted network namespaces. This crashes the gateway even though
mDNS is not needed for NemoClaw operation.
Add a NODE_OPTIONS preload that:
1. Monkey-patches os.networkInterfaces to return {} on failure
2. Catches the uncaughtException as a fallback for any call sites
that bypass the monkey-patch
Installed unconditionally at top level (same pattern as proxy fix
and nemotron fix) since any sandbox can hit this.
The OpenClaw gateway health monitor kills the entire gateway process when a messaging channel fails to connect within 120s (the channel-connect-grace). With fake/placeholder Slack tokens, the Slack channel auth always fails, and the health monitor kills the gateway after the grace period — even though the Slack guard successfully caught the initial auth error. Set gateway.channelHealthCheckMinutes to 0 in the baked openclaw.json config, which disables the health monitor entirely. In a NemoClaw sandbox, channel health is not critical — inference, chat, and TUI should continue even if a messaging channel is misconfigured.
Replace the global channelHealthCheckMinutes=0 with per-account healthMonitor.enabled=false on each messaging channel. This prevents the health monitor from killing the gateway when a channel has placeholder tokens, while keeping the global health monitor active for inference and other subsystems. OpenClaw supports per-account overrides via accounts.default.healthMonitor.enabled in the channel config.
Any uncaught exception or unhandled rejection from any npm dependency crashes the gateway, killing inference, chat, and TUI. We've been adding per-library guards (proxy fix, Slack guard, ciao guard) but this is whack-a-mole — the next library that does something unexpected in a restricted sandbox will crash the gateway again. Add a global safety net preload (sandbox-safety-net.js) that catches ALL uncaught exceptions and unhandled rejections, logs them, and continues. Only active when OPENSHELL_SANDBOX=1 (set by OpenShell at runtime) — outside a sandbox, normal Node.js crash behavior is preserved. Loaded as the FIRST --require preload so its handlers register before any library code runs. Per-library guards (Slack, ciao) still provide targeted handling with better log messages; the safety net is the last resort for everything else.
OpenClaw installs its own unhandledRejection handler that calls process.exit(1) for non-transient errors. Our safety net catches the rejection first and swallows it, but Node.js delivers the event to ALL listeners — OpenClaw's handler also fires and exits. Monkey-patch process.exit to block exits during the rejection delivery window. A flag (_swallowing) is set during our handler and cleared on the next microtask, so OpenClaw's handler (same tick) hits the intercepted process.exit and the gateway survives.
This test has been failing since March 31 and wastes API tokens on every nightly run without providing actionable signal.
Remove all entrypoint execution traces, /tmp dumps, gateway crash diagnostics, and verbose guard-skip logging added during debugging. Simplify E2E diagnostics to just guard file and NODE_OPTIONS checks.
shfmt reformatted case statement indentation. The nemotron test regex for validate_tmp_permissions was too strict — matched only when _NEMOTRON_FIX_SCRIPT was the second argument, but it's now further in the argument list due to _SANDBOX_SAFETY_NET and _CIAO_GUARD_SCRIPT.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
scripts/nemoclaw-start.sh (2)
1634-1634: Same suggestion: add_SLACK_GUARD_SCRIPTto root-path validation.For consistency with the non-root path recommendation.
♻️ Suggested fix
-validate_tmp_permissions "$_SANDBOX_SAFETY_NET" "$_PROXY_FIX_SCRIPT" "$_NEMOTRON_FIX_SCRIPT" "$_CIAO_GUARD_SCRIPT" +validate_tmp_permissions "$_SANDBOX_SAFETY_NET" "$_PROXY_FIX_SCRIPT" "$_NEMOTRON_FIX_SCRIPT" "$_CIAO_GUARD_SCRIPT" "$_SLACK_GUARD_SCRIPT"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/nemoclaw-start.sh` at line 1634, The call to validate_tmp_permissions is missing the _SLACK_GUARD_SCRIPT argument, so update the invocation of validate_tmp_permissions to include _SLACK_GUARD_SCRIPT alongside the existing arguments (_SANDBOX_SAFETY_NET, _PROXY_FIX_SCRIPT, _NEMOTRON_FIX_SCRIPT, _CIAO_GUARD_SCRIPT) so the root-path validation covers the Slack guard script as well.
1493-1493: Consider adding_SLACK_GUARD_SCRIPTto validation.The Slack guard (
/tmp/nemoclaw-slack-channel-guard.js) is a trust-boundary file loaded viaNODE_OPTIONS, but it's not passed tovalidate_tmp_permissions. Since the validator skips non-existent files, adding it is safe even when Slack isn't configured.♻️ Suggested fix
- validate_tmp_permissions "$_SANDBOX_SAFETY_NET" "$_PROXY_FIX_SCRIPT" "$_NEMOTRON_FIX_SCRIPT" "$_CIAO_GUARD_SCRIPT" + validate_tmp_permissions "$_SANDBOX_SAFETY_NET" "$_PROXY_FIX_SCRIPT" "$_NEMOTRON_FIX_SCRIPT" "$_CIAO_GUARD_SCRIPT" "$_SLACK_GUARD_SCRIPT"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/nemoclaw-start.sh` at line 1493, Call validate_tmp_permissions with the Slack guard variable as well: add _SLACK_GUARD_SCRIPT to the existing invocation that currently passes _SANDBOX_SAFETY_NET, _PROXY_FIX_SCRIPT, _NEMOTRON_FIX_SCRIPT, and _CIAO_GUARD_SCRIPT; this ensures the trust-bound Slack guard file (referenced via NODE_OPTIONS, _SLACK_GUARD_SCRIPT) is validated too — it's safe to include because validate_tmp_permissions already skips non-existent files.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/lib/sandbox-init.sh`:
- Around line 127-132: The permission validator complains because
validate_tmp_permissions expects /tmp/gateway.log mode 644 but several startup
scripts set it to 600; update the startup chmod calls (references:
agent-runtime.ts around the chmod call, src/nemoclaw.ts at the chmod line, and
agents/hermes/start.sh occurrences) to set 644 instead of 600 so the file modes
match the validator, or alternatively modify the validate_tmp_permissions logic
to accept both "600" and "644" for gateway.log; pick one approach and change all
referenced locations consistently (agent-runtime.ts, src/nemoclaw.ts,
agents/hermes/start.sh, or validate_tmp_permissions) so the validator no longer
flags a mismatch.
---
Nitpick comments:
In `@scripts/nemoclaw-start.sh`:
- Line 1634: The call to validate_tmp_permissions is missing the
_SLACK_GUARD_SCRIPT argument, so update the invocation of
validate_tmp_permissions to include _SLACK_GUARD_SCRIPT alongside the existing
arguments (_SANDBOX_SAFETY_NET, _PROXY_FIX_SCRIPT, _NEMOTRON_FIX_SCRIPT,
_CIAO_GUARD_SCRIPT) so the root-path validation covers the Slack guard script as
well.
- Line 1493: Call validate_tmp_permissions with the Slack guard variable as
well: add _SLACK_GUARD_SCRIPT to the existing invocation that currently passes
_SANDBOX_SAFETY_NET, _PROXY_FIX_SCRIPT, _NEMOTRON_FIX_SCRIPT, and
_CIAO_GUARD_SCRIPT; this ensures the trust-bound Slack guard file (referenced
via NODE_OPTIONS, _SLACK_GUARD_SCRIPT) is validated too — it's safe to include
because validate_tmp_permissions already skips non-existent files.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 19af5706-991f-463d-9648-fbac9d3dbad7
📒 Files selected for processing (5)
.github/workflows/nightly-e2e.yamlDockerfilescripts/lib/sandbox-init.shscripts/nemoclaw-start.shtest/e2e/test-messaging-providers.sh
🚧 Files skipped from review as they are similar to previous changes (1)
- test/e2e/test-messaging-providers.sh
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
scripts/lib/sandbox-init.sh (1)
214-214: Keep best-effort behavior, but log lock failures.Line 214 swallows all
chmoderrors silently. That keeps startup resilient, but it also hides hardening regressions. Emit a warning when the lock fails.🛠️ Suggested change
- chmod 444 "$rc_file" 2>/dev/null || true + if ! chmod 444 "$rc_file" 2>/dev/null; then + echo "[SECURITY] Failed to lock ${rc_file} to 444; continuing (best-effort)." >&2 + fi🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/lib/sandbox-init.sh` at line 214, The chmod call currently swallows all errors for "$rc_file" (chmod 444 "$rc_file" 2>/dev/null || true); change it to keep the best-effort behavior but log a warning when the lock fails: run chmod 444 on "$rc_file", capture its exit status, and if it fails emit a warning to stderr (including the rc_file name and the failure status or errno) while still allowing startup to continue; update the block referencing "$rc_file" so failures are visible but non-fatal.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/lib/sandbox-init.sh`:
- Around line 121-132: The trust-boundary mapping in the script is out of sync:
the runtime check in the for-loop sets /tmp/gateway.log expected_perms="644"
(via the case handling for */gateway.log) but the earlier documentation/table
still lists gateway.log as 600; update that trust-boundary map or comment to
state /tmp/gateway.log is mode 644 so the docs match the implemented check
(search for the trust-boundary table/header and the entries referencing
gateway.log and change its mode to 644).
---
Nitpick comments:
In `@scripts/lib/sandbox-init.sh`:
- Line 214: The chmod call currently swallows all errors for "$rc_file" (chmod
444 "$rc_file" 2>/dev/null || true); change it to keep the best-effort behavior
but log a warning when the lock fails: run chmod 444 on "$rc_file", capture its
exit status, and if it fails emit a warning to stderr (including the rc_file
name and the failure status or errno) while still allowing startup to continue;
update the block referencing "$rc_file" so failures are visible but non-fatal.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 17e6f851-bc96-4b7e-b7df-1570d88a71f5
📒 Files selected for processing (3)
scripts/lib/sandbox-init.shscripts/nemoclaw-start.shtest/nemotron-inference-fix.test.ts
✅ Files skipped from review due to trivial changes (1)
- test/nemotron-inference-fix.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- scripts/nemoclaw-start.sh
…Slack guard to validation Same issue as the nemotron test — regex was too strict for the new argument ordering. Also add _SLACK_GUARD_SCRIPT to validate_tmp_permissions calls per CodeRabbit review.
## Summary Refreshes user-facing docs for the last 24 hours of merged NemoClaw history and bumps the docs metadata to 0.0.29, the next version after v0.0.28. The updates are limited to behavior supported by merged PR descriptions and diffs. ## Changes - `docs/reference/commands.md`: documented `nemoclaw <name> policy-add --from-file` and `--from-dir`, including custom preset review guidance, from #2077 / commit `7720b175`. - `docs/deployment/deploy-to-remote-gpu.md`: clarified that non-loopback `CHAT_UI_URL` disables OpenClaw device pairing for remote browser-only deployments, from #2449 / commit `f5ee8a4d`. - `docs/inference/inference-options.md`: documented provider-aware credential retry validation and the NVIDIA-only `nvapi-` prefix check, from #2389 / commit `6f7f0c6d`. - `docs/inference/switch-inference-providers.md`: documented `NEMOCLAW_INFERENCE_INPUTS` for text/image-capable model metadata baked into `openclaw.json`, from #2441 / commit `f4391892`. - `docs/reference/troubleshooting.md`: added the Git certificate verification entry for proxy CA propagation through `GIT_SSL_CAINFO`, `GIT_SSL_CAPATH`, `CURL_CA_BUNDLE`, and `REQUESTS_CA_BUNDLE`, from #2345 / commit `fa0dc1ab`. - `docs/versions1.json` and `docs/project.json`: promoted docs version `0.0.29`; `docs/versions1.json` omits unpublished `0.0.26`, `0.0.27`, and `0.0.28` entries. - `.agents/skills/nemoclaw-user-*`: regenerated derived user skill references from the updated docs. - Reviewed with no extra doc changes: #2575 / `d392ec07`, #2565 / `a3231049`, #1965 / `db1ef3ca`, #1990 / `db665834`, #2495 / `7da86fa3`, #2496 / `3192f4f4`, #2490 / `8c209058`, #2487 / `1f615e2f`, #2483 / `5653d33a`, #2482 / `31c782c0`, #2464 / `23bb5703`, #2472 / `a54f9a34`, and #2437 / `6bc860d7`. - Skipped per docs policy: #2420 / `7b76df6b` touched the experimental sandbox config path listed in `docs/.docs-skip`; #2466 / `cc15689c` touched a skipped term and CI-only sandbox image files. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification <!-- Check each item you ran and confirmed. Leave unchecked items you skipped. --> - [x] `npx prek run --all-files` passes - [ ] `npm test` passes — failed locally in installer-integration tests and one onboard helper timeout; the doc-scoped hook test projects passed under `prek`. - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) — build succeeded, but local Sphinx emitted the existing version-switcher file read message. - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ## AI Disclosure <!-- If an AI agent authored or co-authored this PR, check the box and name the tool. Remove this section for fully human-authored PRs. --> - [x] AI-assisted — tool: Codex --- <!-- DCO sign-off required by CI. Run: git config user.name && git config user.email --> Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Support for custom YAML presets in policy configuration via --from-file and --from-dir. * New build-time inference input option to declare accepted modalities (text or text,image). * **Improvements** * Credential validation now offers interactive recovery: re-enter key, retry, choose another provider, or exit. * Clarified provider-specific API key prefix handling (nvapi- only applies to NVIDIA keys). * **Documentation** * TLS certificate troubleshooting for inspected networks. * Clarified remote dashboard security/device-pairing behavior; command docs updated; docs version bumped. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>
PR NVIDIA#2472 accidentally deleted the entire cloud-experimental-e2e job from nightly-e2e.yaml. Restores Landlock enforcement, API key leak detection, openclaw tui smoke, live chat, and skill injection tests. Fixes NVIDIA#2570 Signed-off-by: Truong Nguyen <tgnguyen@nvidia.com> Made-with: Cursor
PR NVIDIA#2472 accidentally deleted the entire cloud-experimental-e2e job from nightly-e2e.yaml. Restores Landlock enforcement, API key leak detection, openclaw tui smoke, live chat, and skill injection tests. Verified on fork: PASS in 14m 5s. Fixes NVIDIA#2570 Signed-off-by: Truong Nguyen <tgnguyen@nvidia.com> Made-with: Cursor
…#2615) ## Summary Add automated E2E test recommendations to PR reviews and selective job dispatch to the nightly E2E workflow. Closes #2564 (Phases 1–3). ## What changed ### 1. CodeRabbit `path_instructions` for E2E recommendations (`.coderabbit.yaml`) 15 new `path_instructions` entries map sensitive file paths to the nightly E2E jobs that exercise them. When a PR touches a mapped path, CodeRabbit posts a review comment recommending specific jobs and a copy-pasteable `gh workflow run` command. | Path Pattern | Recommended Jobs | |-------------|-----------------| | `scripts/nemoclaw-start.sh`, `scripts/lib/sandbox-init.sh` | `sandbox-survival-e2e`, `sandbox-operations-e2e`, `cloud-e2e` | | `Dockerfile`, `Dockerfile.base` | `cloud-e2e`, `sandbox-survival-e2e`, `hermes-e2e`, `rebuild-openclaw-e2e` | | `nemoclaw-blueprint/scripts/http-proxy-fix.js` | `cloud-e2e`, `inference-routing-e2e` | | `src/lib/onboard.ts` | `cloud-e2e`, `sandbox-operations-e2e`, `rebuild-openclaw-e2e` | | `src/nemoclaw.ts` | `sandbox-survival-e2e`, `sandbox-operations-e2e`, `skip-permissions-e2e` | | `src/lib/cluster-image-patch.ts`, `src/lib/preflight.ts` | `overlayfs-autofix-e2e` | | `src/lib/deploy.ts` | `deployment-services-e2e` | | `src/lib/sandbox-state.ts` | `snapshot-commands-e2e`, `rebuild-openclaw-e2e` | | `src/lib/shields*.ts` | `shields-config-e2e` | | `agents/hermes/**` | `hermes-e2e`, `rebuild-hermes-e2e` | | `nemoclaw-blueprint/policies/**` | `network-policy-e2e`, `skip-permissions-e2e` | | `.github/workflows/nightly-e2e.yaml` | Reminds to add CodeRabbit coverage for new jobs | ### 2. Selective job dispatch (`nightly-e2e.yaml`) Added a `jobs` input to `workflow_dispatch` so maintainers can run a subset of nightly jobs on any branch: ``` gh workflow run nightly-e2e.yaml --ref <branch> -f jobs=sandbox-survival-e2e,sandbox-operations-e2e ``` - All 18 E2E jobs get a conditional guard: unselected jobs are skipped - Empty `jobs` input (or scheduled runs) still runs everything - `notify-on-failure` is unaffected: skipped jobs produce `result: 'skipped'`, not `'failure'` ### 3. Cross-validation test (`test/validate-e2e-coverage.test.ts`) Keeps the mapping up to date as files and jobs evolve: | Assertion | What it catches | |-----------|----------------| | Job names in CodeRabbit match `nightly-e2e.yaml` | Renamed/removed jobs | | Path globs match at least one file on disk | Renamed/deleted source files | | Every nightly job has selective dispatch guard | New jobs added without the `if:` pattern | | Advisory: nightly jobs with no CodeRabbit coverage | New jobs added without `path_instructions` | ## Validation - [x] All 4 cross-validation tests pass locally - [x] Existing `validate-config-schemas` tests still pass - [x] Selective dispatch validated: [run 25052625486](https://github.com/NVIDIA/NemoClaw/actions/runs/25052625486) — triggered with `-f jobs=diagnostics-e2e`, 17/18 jobs correctly skipped - [x] `notify-on-failure` does not false-alarm on selective run — [run 25052625486](https://github.com/NVIDIA/NemoClaw/actions/runs/25052625486) confirmed: `notify-on-failure` was skipped (not triggered) - [ ] CodeRabbit posts recommendations on a PR touching a mapped file (post-merge validation) ## Context - Issue: #2564 - Weekend incident: #2471, #2472, #2482, #2490 - E2E strategy: `cloud-experimental-e2e` removal in #2472 left a coverage gap that would have been flagged by these recommendations <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Expanded review automation to map sensitive paths to targeted nightly E2E jobs and inject instructions for running relevant subsets. * Added manual workflow dispatch allowing selective E2E job execution via a jobs input. * **New Features** * Added a reporting step that, on manual runs, posts a PR comment summarizing passed/failed/skipped E2E jobs. * **Tests** * Added a validation suite that cross-checks review-to-workflow mappings and dispatch guards, warning on uncovered jobs. <!-- end of auto-generated comment: release notes by coderabbit.ai --> ### 4. Substring match fix (`nightly-e2e.yaml`) CodeRabbit review correctly identified that `contains(inputs.jobs, 'cloud-e2e')` performs substring matching — e.g., passing `jobs=e2e` would match every job. All 18 job guards now use delimiter-wrapping: ```yaml contains(format(',{0},', inputs.jobs), ',<job-name>,') ``` This ensures exact token matching within the comma-separated input. The cross-validation test was updated to enforce the new pattern.
PR NVIDIA#2472 accidentally deleted the entire cloud-experimental-e2e job from nightly-e2e.yaml. Restores Landlock enforcement, API key leak detection, openclaw tui smoke, live chat, and skill injection tests. Verified on fork: PASS in 14m 5s. Fixes NVIDIA#2570 Signed-off-by: Truong Nguyen <tgnguyen@nvidia.com> Made-with: Cursor
PR NVIDIA#2472 accidentally deleted the entire cloud-experimental-e2e job from nightly-e2e.yaml. Restores Landlock enforcement, API key leak detection, openclaw tui smoke, live chat, and skill injection tests. Verified on fork: PASS in 14m 5s. Fixes NVIDIA#2570 Signed-off-by: Truong Nguyen <tgnguyen@nvidia.com> Made-with: Cursor
PR NVIDIA#2472 accidentally deleted the entire cloud-experimental-e2e job from nightly-e2e.yaml. Restores Landlock enforcement, API key leak detection, openclaw tui smoke, live chat, and skill injection tests. Verified on fork: PASS in 14m 5s. Fixes NVIDIA#2570 Signed-off-by: Truong Nguyen <tgnguyen@nvidia.com> Made-with: Cursor
## Summary Restore the `cloud-experimental-e2e` job that was accidentally deleted from `nightly-e2e.yaml` in PR #2472. ## Related Issue Fixes #2570 ## Changes Restores the `cloud-experimental-e2e` job that tests: - Landlock read-only enforcement (8 assertions on .bashrc, .profile, .openclaw, .openclaw-data, /tmp) - API key leak detection in process list - `openclaw tui` smoke test inside sandbox - Live chat via `openclaw agent` - Skill injection + agent verification - `inference.local` HTTPS probe The job runs unconditionally (no feature-flag gate). Added to `notify-on-failure` needs list. Removed the old `skip/05-network-policy.sh` step (now covered by the dedicated `network-policy-e2e` job). ## Type of Change - Code change (feature, bug fix, or refactor) ## Verification - YAML validated on fork: all jobs parse correctly - Verified on fork CI: cloud-experimental-e2e PASS in 14m 5s ## AI Disclosure - AI-assisted — tool: Cursor --- Signed-off-by: Truong Nguyen <tgnguyen@nvidia.com> Made with [Cursor](https://cursor.com) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Added nightly cloud experimental end-to-end tests to broaden coverage. * Made the experimental job selectable from the manual job list for targeted runs. * Always-check documentation during these runs for improved QA. * Ensure experimental sandbox is torn down and verified after tests. * Upload an install-log artifact when the experimental job fails to aid troubleshooting. * Include the experimental job in failure notifications and PR reporting so results are tracked. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Truong Nguyen <tgnguyen@nvidia.com>
## Summary Restore the `cloud-experimental-e2e` job that was accidentally deleted from `nightly-e2e.yaml` in PR #2472. ## Related Issue Fixes #2570 ## Changes Restores the `cloud-experimental-e2e` job that tests: - Landlock read-only enforcement (8 assertions on .bashrc, .profile, .openclaw, .openclaw-data, /tmp) - API key leak detection in process list - `openclaw tui` smoke test inside sandbox - Live chat via `openclaw agent` - Skill injection + agent verification - `inference.local` HTTPS probe The job runs unconditionally (no feature-flag gate). Added to `notify-on-failure` needs list. Removed the old `skip/05-network-policy.sh` step (now covered by the dedicated `network-policy-e2e` job). ## Type of Change - Code change (feature, bug fix, or refactor) ## Verification - YAML validated on fork: all jobs parse correctly - Verified on fork CI: cloud-experimental-e2e PASS in 14m 5s ## AI Disclosure - AI-assisted — tool: Cursor --- Signed-off-by: Truong Nguyen <tgnguyen@nvidia.com> Made with [Cursor](https://cursor.com) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Added nightly cloud experimental end-to-end tests to broaden coverage. * Made the experimental job selectable from the manual job list for targeted runs. * Always-check documentation during these runs for improved QA. * Ensure experimental sandbox is torn down and verified after tests. * Upload an install-log artifact when the experimental job fails to aid troubleshooting. * Include the experimental job in failure notifications and PR reporting so results are tracked. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Truong Nguyen <tgnguyen@nvidia.com>
## Summary `scripts/brev-launchable-ci-cpu.sh` is the community install path for Brev users — it bootstraps a VM with Docker, Node.js, OpenShell, and NemoClaw. **That script already exists in the repo but has zero CI coverage.** This PR adds a nightly E2E smoke test that validates the script works end-to-end. This is the long-living safety net for the community install flow. If any regression breaks the launchable script (e.g., the Apr 20–25 Brev outage from #2472/#2482, or the container reachability fallback from #2425), this test catches it before community users are affected. ## Related Issue Closes #2599 Related: #2425 (the `isProxyHealthy()` fallback in PR #2453 — if that regresses, onboard will abort on Brev and this smoke test catches it) ## Changes ### New: `test/e2e/test-launchable-smoke.sh` | Phase | What it validates | |-------|-------------------| | 0 | Pre-cleanup + pre-seed clone directory from checkout | | 1 | Prerequisites (Docker, NVIDIA_API_KEY, network, env vars) | | 2 | Run `brev-launchable-ci-cpu.sh` — the existing community bootstrap script | | 3 | Verify artifacts (nemoclaw, openshell, Node.js, Docker, sentinel file, built outputs) | | 4 | `nemoclaw onboard --non-interactive` with cloud provider | | 5 | Sandbox health (list, status, inference config, gateway) | | 6 | Live inference (direct API, routing via inference.local, openclaw agent 6×7=42) | | 7 | Destroy + cleanup | Key design decisions: - **No BREV_API_TOKEN needed** — the launchable script is a generic Ubuntu bootstrap with zero Brev dependencies, so it runs on standard GitHub-hosted `ubuntu-latest` runners - **Tests current code, not main** — pre-seeds the clone directory from the CI checkout so regressions are caught before reaching community users - **Follows existing E2E conventions** — pass/fail/section helpers, e2e-timeout.sh self-wrap, sandbox-teardown.sh EXIT trap, parse_chat_content() for reasoning models ### Modified: `.github/workflows/nightly-e2e.yaml` - Added `launchable-smoke-e2e` job: `ubuntu-latest`, 30min timeout, `NVIDIA_API_KEY` secret - Uploads install/onboard/test logs as artifacts on failure - Added to `notify-on-failure` needs list ## Validation Triggered via fork dispatch (`jyaunches/NemoClaw` → `sparky-dispatch` → `launchable-smoke`): - **Run:** https://github.com/jyaunches/NemoClaw/actions/runs/25075715342 - **Result:** ✅ 24 passed, 0 failed, 1 skipped (Node.js version — GH runner pre-installs Node 20) - **Runtime:** ~12 minutes ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Checklist - [x] Follows project coding conventions - [x] Tests pass locally or in CI - [x] No secrets/credentials committed <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added an end-to-end smoke test and CI job that validates the community launchable CPU install path (install, onboarding, runtime readiness, and a simple inference check). CI now uploads install/onboard/test logs on failures. * **Chores** * Renamed the branch-validation workflow and corresponding test-suite identifiers for clarity. * Updated E2E test documentation and project configuration names to match the new labeling. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…VIDIA#2472) ## Summary Fixes a 5-day outage where the gateway never started in non-root sandbox mode (Brev Launchable, no-new-privileges containers). Also adds a global safety net preventing any npm library crash from killing the gateway. ### Changes **Entrypoint Landlock tolerance** (`scripts/nemoclaw-start.sh`, `scripts/lib/sandbox-init.sh`) - `install_configure_guard`: wrap all `.bashrc`/`.profile` writes in `|| true` — the `[ -w file ]` test passes (DAC) but Landlock blocks the actual write, crashing the entrypoint under `set -e` - `lock_rc_files`: `|| true` on chmod calls - `validate_tmp_permissions`: expect 644 for gateway.log - Root cause: commit `20407589` (Apr 20) added `install_configure_guard` which writes to Landlock-protected files. Every non-root sandbox since then had a dead gateway. **Global sandbox safety net** (`scripts/nemoclaw-start.sh`) - New `sandbox-safety-net.js` preload — catches ALL uncaught exceptions and unhandled rejections in sandbox mode (OPENSHELL_SANDBOX=1), logs them, and continues - Intercepts `process.exit()` during swallowed rejection delivery so OpenClaw's own handler (which calls `process.exit(1)` for non-transient errors) doesn't kill the gateway - First `--require` preload so handlers register before any library code **ciao network guard** (`scripts/nemoclaw-start.sh`) - Targeted guard for `@homebridge/ciao` mDNS library crash (`os.networkInterfaces()` → `uv_interface_addresses` SystemError in restricted namespaces) - Monkey-patches `os.networkInterfaces` to return `{}` on failure **Slack guard improvements** (`scripts/nemoclaw-start.sh`) - Include Slack guard in `proxy-env.sh` for connect sessions - Gateway.log changed from 600 to 644 for diagnostic readability **Per-channel health monitor disable** (`Dockerfile`) - Set `healthMonitor.enabled: false` on each messaging channel account - Prevents OpenClaw's health monitor from killing the gateway after 120s channel-connect-grace when a channel has placeholder tokens **Remove cloud-experimental-e2e** (`.github/workflows/nightly-e2e.yaml`) - Has been failing since March 31, wastes API tokens ## Test plan - [x] shellcheck clean - [x] 91/91 nemoclaw-start.test.ts pass - [x] Nightly E2E: messaging-providers Phase 7 S1+S2 pass (gateway survives Slack auth failure) - [x] Nightly E2E: sandbox-survival, skip-permissions, cloud-e2e pass <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Improved sandbox crash resilience with new runtime guards, deterministic preload ordering, and selective guard activation to reduce unexpected failures. * Shell rc-file snippet installation and locking are now best-effort so setup proceeds when modifying user rc files fails. * **Chores** * Relaxed /tmp log permissions and extended tmp-permission validation to cover new guard artifacts. * Disabled health monitoring for messaging channel accounts. * **Tests** * Reduced noisy diagnostics in an e2e test, broadened matching in unit tests to avoid false negatives, and removed an experimental nightly CI job. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Refreshes user-facing docs for the last 24 hours of merged NemoClaw history and bumps the docs metadata to 0.0.29, the next version after v0.0.28. The updates are limited to behavior supported by merged PR descriptions and diffs. ## Changes - `docs/reference/commands.md`: documented `nemoclaw <name> policy-add --from-file` and `--from-dir`, including custom preset review guidance, from NVIDIA#2077 / commit `7720b175`. - `docs/deployment/deploy-to-remote-gpu.md`: clarified that non-loopback `CHAT_UI_URL` disables OpenClaw device pairing for remote browser-only deployments, from NVIDIA#2449 / commit `f5ee8a4d`. - `docs/inference/inference-options.md`: documented provider-aware credential retry validation and the NVIDIA-only `nvapi-` prefix check, from NVIDIA#2389 / commit `6f7f0c6d`. - `docs/inference/switch-inference-providers.md`: documented `NEMOCLAW_INFERENCE_INPUTS` for text/image-capable model metadata baked into `openclaw.json`, from NVIDIA#2441 / commit `f4391892`. - `docs/reference/troubleshooting.md`: added the Git certificate verification entry for proxy CA propagation through `GIT_SSL_CAINFO`, `GIT_SSL_CAPATH`, `CURL_CA_BUNDLE`, and `REQUESTS_CA_BUNDLE`, from NVIDIA#2345 / commit `fa0dc1ab`. - `docs/versions1.json` and `docs/project.json`: promoted docs version `0.0.29`; `docs/versions1.json` omits unpublished `0.0.26`, `0.0.27`, and `0.0.28` entries. - `.agents/skills/nemoclaw-user-*`: regenerated derived user skill references from the updated docs. - Reviewed with no extra doc changes: NVIDIA#2575 / `d392ec07`, NVIDIA#2565 / `a3231049`, NVIDIA#1965 / `db1ef3ca`, NVIDIA#1990 / `db665834`, NVIDIA#2495 / `7da86fa3`, NVIDIA#2496 / `3192f4f4`, NVIDIA#2490 / `8c209058`, NVIDIA#2487 / `1f615e2f`, NVIDIA#2483 / `5653d33a`, NVIDIA#2482 / `31c782c0`, NVIDIA#2464 / `23bb5703`, NVIDIA#2472 / `a54f9a34`, and NVIDIA#2437 / `6bc860d7`. - Skipped per docs policy: NVIDIA#2420 / `7b76df6b` touched the experimental sandbox config path listed in `docs/.docs-skip`; NVIDIA#2466 / `cc15689c` touched a skipped term and CI-only sandbox image files. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification <!-- Check each item you ran and confirmed. Leave unchecked items you skipped. --> - [x] `npx prek run --all-files` passes - [ ] `npm test` passes — failed locally in installer-integration tests and one onboard helper timeout; the doc-scoped hook test projects passed under `prek`. - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) — build succeeded, but local Sphinx emitted the existing version-switcher file read message. - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ## AI Disclosure <!-- If an AI agent authored or co-authored this PR, check the box and name the tool. Remove this section for fully human-authored PRs. --> - [x] AI-assisted — tool: Codex --- <!-- DCO sign-off required by CI. Run: git config user.name && git config user.email --> Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Support for custom YAML presets in policy configuration via --from-file and --from-dir. * New build-time inference input option to declare accepted modalities (text or text,image). * **Improvements** * Credential validation now offers interactive recovery: re-enter key, retry, choose another provider, or exit. * Clarified provider-specific API key prefix handling (nvapi- only applies to NVIDIA keys). * **Documentation** * TLS certificate troubleshooting for inspected networks. * Clarified remote dashboard security/device-pairing behavior; command docs updated; docs version bumped. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>
…NVIDIA#2615) ## Summary Add automated E2E test recommendations to PR reviews and selective job dispatch to the nightly E2E workflow. Closes NVIDIA#2564 (Phases 1–3). ## What changed ### 1. CodeRabbit `path_instructions` for E2E recommendations (`.coderabbit.yaml`) 15 new `path_instructions` entries map sensitive file paths to the nightly E2E jobs that exercise them. When a PR touches a mapped path, CodeRabbit posts a review comment recommending specific jobs and a copy-pasteable `gh workflow run` command. | Path Pattern | Recommended Jobs | |-------------|-----------------| | `scripts/nemoclaw-start.sh`, `scripts/lib/sandbox-init.sh` | `sandbox-survival-e2e`, `sandbox-operations-e2e`, `cloud-e2e` | | `Dockerfile`, `Dockerfile.base` | `cloud-e2e`, `sandbox-survival-e2e`, `hermes-e2e`, `rebuild-openclaw-e2e` | | `nemoclaw-blueprint/scripts/http-proxy-fix.js` | `cloud-e2e`, `inference-routing-e2e` | | `src/lib/onboard.ts` | `cloud-e2e`, `sandbox-operations-e2e`, `rebuild-openclaw-e2e` | | `src/nemoclaw.ts` | `sandbox-survival-e2e`, `sandbox-operations-e2e`, `skip-permissions-e2e` | | `src/lib/cluster-image-patch.ts`, `src/lib/preflight.ts` | `overlayfs-autofix-e2e` | | `src/lib/deploy.ts` | `deployment-services-e2e` | | `src/lib/sandbox-state.ts` | `snapshot-commands-e2e`, `rebuild-openclaw-e2e` | | `src/lib/shields*.ts` | `shields-config-e2e` | | `agents/hermes/**` | `hermes-e2e`, `rebuild-hermes-e2e` | | `nemoclaw-blueprint/policies/**` | `network-policy-e2e`, `skip-permissions-e2e` | | `.github/workflows/nightly-e2e.yaml` | Reminds to add CodeRabbit coverage for new jobs | ### 2. Selective job dispatch (`nightly-e2e.yaml`) Added a `jobs` input to `workflow_dispatch` so maintainers can run a subset of nightly jobs on any branch: ``` gh workflow run nightly-e2e.yaml --ref <branch> -f jobs=sandbox-survival-e2e,sandbox-operations-e2e ``` - All 18 E2E jobs get a conditional guard: unselected jobs are skipped - Empty `jobs` input (or scheduled runs) still runs everything - `notify-on-failure` is unaffected: skipped jobs produce `result: 'skipped'`, not `'failure'` ### 3. Cross-validation test (`test/validate-e2e-coverage.test.ts`) Keeps the mapping up to date as files and jobs evolve: | Assertion | What it catches | |-----------|----------------| | Job names in CodeRabbit match `nightly-e2e.yaml` | Renamed/removed jobs | | Path globs match at least one file on disk | Renamed/deleted source files | | Every nightly job has selective dispatch guard | New jobs added without the `if:` pattern | | Advisory: nightly jobs with no CodeRabbit coverage | New jobs added without `path_instructions` | ## Validation - [x] All 4 cross-validation tests pass locally - [x] Existing `validate-config-schemas` tests still pass - [x] Selective dispatch validated: [run 25052625486](https://github.com/NVIDIA/NemoClaw/actions/runs/25052625486) — triggered with `-f jobs=diagnostics-e2e`, 17/18 jobs correctly skipped - [x] `notify-on-failure` does not false-alarm on selective run — [run 25052625486](https://github.com/NVIDIA/NemoClaw/actions/runs/25052625486) confirmed: `notify-on-failure` was skipped (not triggered) - [ ] CodeRabbit posts recommendations on a PR touching a mapped file (post-merge validation) ## Context - Issue: NVIDIA#2564 - Weekend incident: NVIDIA#2471, NVIDIA#2472, NVIDIA#2482, NVIDIA#2490 - E2E strategy: `cloud-experimental-e2e` removal in NVIDIA#2472 left a coverage gap that would have been flagged by these recommendations <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Expanded review automation to map sensitive paths to targeted nightly E2E jobs and inject instructions for running relevant subsets. * Added manual workflow dispatch allowing selective E2E job execution via a jobs input. * **New Features** * Added a reporting step that, on manual runs, posts a PR comment summarizing passed/failed/skipped E2E jobs. * **Tests** * Added a validation suite that cross-checks review-to-workflow mappings and dispatch guards, warning on uncovered jobs. <!-- end of auto-generated comment: release notes by coderabbit.ai --> ### 4. Substring match fix (`nightly-e2e.yaml`) CodeRabbit review correctly identified that `contains(inputs.jobs, 'cloud-e2e')` performs substring matching — e.g., passing `jobs=e2e` would match every job. All 18 job guards now use delimiter-wrapping: ```yaml contains(format(',{0},', inputs.jobs), ',<job-name>,') ``` This ensures exact token matching within the comma-separated input. The cross-validation test was updated to enforce the new pattern.
) ## Summary Restore the `cloud-experimental-e2e` job that was accidentally deleted from `nightly-e2e.yaml` in PR NVIDIA#2472. ## Related Issue Fixes NVIDIA#2570 ## Changes Restores the `cloud-experimental-e2e` job that tests: - Landlock read-only enforcement (8 assertions on .bashrc, .profile, .openclaw, .openclaw-data, /tmp) - API key leak detection in process list - `openclaw tui` smoke test inside sandbox - Live chat via `openclaw agent` - Skill injection + agent verification - `inference.local` HTTPS probe The job runs unconditionally (no feature-flag gate). Added to `notify-on-failure` needs list. Removed the old `skip/05-network-policy.sh` step (now covered by the dedicated `network-policy-e2e` job). ## Type of Change - Code change (feature, bug fix, or refactor) ## Verification - YAML validated on fork: all jobs parse correctly - Verified on fork CI: cloud-experimental-e2e PASS in 14m 5s ## AI Disclosure - AI-assisted — tool: Cursor --- Signed-off-by: Truong Nguyen <tgnguyen@nvidia.com> Made with [Cursor](https://cursor.com) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Added nightly cloud experimental end-to-end tests to broaden coverage. * Made the experimental job selectable from the manual job list for targeted runs. * Always-check documentation during these runs for improved QA. * Ensure experimental sandbox is torn down and verified after tests. * Upload an install-log artifact when the experimental job fails to aid troubleshooting. * Include the experimental job in failure notifications and PR reporting so results are tracked. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Truong Nguyen <tgnguyen@nvidia.com>
## Summary `scripts/brev-launchable-ci-cpu.sh` is the community install path for Brev users — it bootstraps a VM with Docker, Node.js, OpenShell, and NemoClaw. **That script already exists in the repo but has zero CI coverage.** This PR adds a nightly E2E smoke test that validates the script works end-to-end. This is the long-living safety net for the community install flow. If any regression breaks the launchable script (e.g., the Apr 20–25 Brev outage from NVIDIA#2472/NVIDIA#2482, or the container reachability fallback from NVIDIA#2425), this test catches it before community users are affected. ## Related Issue Closes NVIDIA#2599 Related: NVIDIA#2425 (the `isProxyHealthy()` fallback in PR NVIDIA#2453 — if that regresses, onboard will abort on Brev and this smoke test catches it) ## Changes ### New: `test/e2e/test-launchable-smoke.sh` | Phase | What it validates | |-------|-------------------| | 0 | Pre-cleanup + pre-seed clone directory from checkout | | 1 | Prerequisites (Docker, NVIDIA_API_KEY, network, env vars) | | 2 | Run `brev-launchable-ci-cpu.sh` — the existing community bootstrap script | | 3 | Verify artifacts (nemoclaw, openshell, Node.js, Docker, sentinel file, built outputs) | | 4 | `nemoclaw onboard --non-interactive` with cloud provider | | 5 | Sandbox health (list, status, inference config, gateway) | | 6 | Live inference (direct API, routing via inference.local, openclaw agent 6×7=42) | | 7 | Destroy + cleanup | Key design decisions: - **No BREV_API_TOKEN needed** — the launchable script is a generic Ubuntu bootstrap with zero Brev dependencies, so it runs on standard GitHub-hosted `ubuntu-latest` runners - **Tests current code, not main** — pre-seeds the clone directory from the CI checkout so regressions are caught before reaching community users - **Follows existing E2E conventions** — pass/fail/section helpers, e2e-timeout.sh self-wrap, sandbox-teardown.sh EXIT trap, parse_chat_content() for reasoning models ### Modified: `.github/workflows/nightly-e2e.yaml` - Added `launchable-smoke-e2e` job: `ubuntu-latest`, 30min timeout, `NVIDIA_API_KEY` secret - Uploads install/onboard/test logs as artifacts on failure - Added to `notify-on-failure` needs list ## Validation Triggered via fork dispatch (`jyaunches/NemoClaw` → `sparky-dispatch` → `launchable-smoke`): - **Run:** https://github.com/jyaunches/NemoClaw/actions/runs/25075715342 - **Result:** ✅ 24 passed, 0 failed, 1 skipped (Node.js version — GH runner pre-installs Node 20) - **Runtime:** ~12 minutes ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Checklist - [x] Follows project coding conventions - [x] Tests pass locally or in CI - [x] No secrets/credentials committed <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added an end-to-end smoke test and CI job that validates the community launchable CPU install path (install, onboarding, runtime readiness, and a simple inference check). CI now uploads install/onboard/test logs on failures. * **Chores** * Renamed the branch-validation workflow and corresponding test-suite identifiers for clarity. * Updated E2E test documentation and project configuration names to match the new labeling. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- markdownlint-disable MD041 --> ## Summary Adds the `test-non-root-sandbox-smoke` test from #2571 — a PR-gate job that runs the production image under `-security-opt no-new-privileges` to catch #2472 and #2482 regressions, without OpenShell, NVIDIA_API_KEY, or live inference. ## Related Issue Part of #2571 ## Changes - New `test/e2e-non-root-smoke.sh` (host-side bash, no `openshell`/`nemoclaw` CLI required): - **Test 1** — entrypoint setup chain completes cleanly under `--security-opt no-new-privileges` (regression guard for # 2472; passes a `true` command via the entrypoint's `NEMOCLAW_CMD` exec path so the gateway-launch branch is bypassed and we don't need the OpenShell-managed runtime). - **Test 2** — kernel confirms `NoNewPrivs=1` inside the container (defends the test itself against silent typos in the docker flag). - New job `test-non-root-sandbox-smoke` in `.github/workflows/pr-self-hosted.yaml` — `linux-amd64-cpu4`, `timeout-minutes: 5`, `needs: build-sandbox-images`, reuses the existing `isolation-image` artifact. - Expected results: ``` my-machine@ab1-cdf40-30:~/NemoClaw$ # Run script bash test/e2e-non-root-smoke.sh TEST: 1. Entrypoint setup chain completes under --security-opt no-new-privileges PASS: entrypoint exited 0 under no-new-privileges (#2472 setup chain healthy) TEST: 2. Kernel confirms NoNewPrivs=1 inside container (defends against silent flag typos) PASS: kernel confirms NoNewPrivs=1 ======================================== Results: 2 passed, 0 failed ======================================== ``` - Upcoming plans: - **Test 3** — `openclaw tui` does not error with "Missing gateway auth token" inside a login shell under the same constraint (regression guard for # 2482) after PR #2485 is merged ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification <!-- Check each item you ran and confirmed. Leave unchecked items you skipped. Doc-only changes do not require npm test unless you ran it. --> - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [ ] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) --- <!-- DCO sign-off required by CI. Run: git config user.name && git config user.email --> Signed-off-by: Hung Le <hple@nvidia.com>
Summary
Fixes a 5-day outage where the gateway never started in non-root sandbox mode (Brev Launchable, no-new-privileges containers). Also adds a global safety net preventing any npm library crash from killing the gateway.
Changes
Entrypoint Landlock tolerance (
scripts/nemoclaw-start.sh,scripts/lib/sandbox-init.sh)install_configure_guard: wrap all.bashrc/.profilewrites in|| true— the[ -w file ]test passes (DAC) but Landlock blocks the actual write, crashing the entrypoint underset -elock_rc_files:|| trueon chmod callsvalidate_tmp_permissions: expect 644 for gateway.log20407589(Apr 20) addedinstall_configure_guardwhich writes to Landlock-protected files. Every non-root sandbox since then had a dead gateway.Global sandbox safety net (
scripts/nemoclaw-start.sh)sandbox-safety-net.jspreload — catches ALL uncaught exceptions and unhandled rejections in sandbox mode (OPENSHELL_SANDBOX=1), logs them, and continuesprocess.exit()during swallowed rejection delivery so OpenClaw's own handler (which callsprocess.exit(1)for non-transient errors) doesn't kill the gateway--requirepreload so handlers register before any library codeciao network guard (
scripts/nemoclaw-start.sh)@homebridge/ciaomDNS library crash (os.networkInterfaces()→uv_interface_addressesSystemError in restricted namespaces)os.networkInterfacesto return{}on failureSlack guard improvements (
scripts/nemoclaw-start.sh)proxy-env.shfor connect sessionsPer-channel health monitor disable (
Dockerfile)healthMonitor.enabled: falseon each messaging channel accountRemove cloud-experimental-e2e (
.github/workflows/nightly-e2e.yaml)Test plan
Summary by CodeRabbit
Bug Fixes
Chores
Tests