fix(recovery): harden gateway recovery preload chain (#2478) by ericksoa · Pull Request #2558 · NVIDIA/NemoClaw

ericksoa · 2026-04-27T19:27:08Z

Summary

Hardens the gateway recovery script so the NODE_OPTIONS preload chain (sandbox safety-net, ciao networkInterfaces guard, slack guard, http-proxy fix, ws-proxy fix, nemotron fix) actually survives gateway respawn. The pre-fix path silently swallowed .bashrc sourcing failures with 2>/dev/null and never asserted that NODE_OPTIONS contained the safety-net preload. When /tmp/nemoclaw-proxy-env.sh was missing, or the env did not propagate through the gosu gateway boundary, the respawned gateway came up without any guards. Any library that threw during init (ciao mDNS being the documented trigger in #2478) then kept the gateway crash-looping indefinitely, because the health monitor restarted it each time.

Related Issue

Closes #2478

Changes

src/lib/agent-runtime.ts:buildRecoveryScript — explicitly sources /tmp/nemoclaw-proxy-env.sh (single source of truth for NODE_OPTIONS guards), drops the silencing 2>/dev/null on .bashrc sourcing, asserts NODE_OPTIONS contains the safety-net preload, surfaces a [gateway-recovery] WARNING line to gateway.log when the env file is missing or guards are absent.
src/nemoclaw.ts:recoverSandboxProcesses — mirrors the same hardening in the inline OpenClaw recovery path.
src/lib/agent-runtime.test.ts — 5 new unit tests under #2478 hardened library-guard preload chain lock in the contract: explicit source, warn-on-missing, no silencing, NODE_OPTIONS check, sourcing-before-launch ordering.
test/e2e/test-issue-2478-crash-loop-recovery.sh — long-running regression e2e: 5 crash-recover cycles asserting the preload chain survives each respawn (verified via /proc/<pid>/environ), a negative case where proxy-env.sh is removed and the recovery surfaces a [gateway-recovery] WARNING instead of silently launching, and a 5-minute idle soak that catches crash-loop signatures by sampling the gateway PID every 15s. Marked STAYS_IN_PR_UNTIL_SHIP — delete this file before merge once a clean soak run lands on a real Spark/Brev instance.
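
The source-then-assert step described in the list above can be sketched as follows. This is a minimal illustration, not the script that buildRecoveryScript actually emits: the function name `check_preload_chain`, the flag variables, and the exact warning wording are assumptions, and the env file path is passed in so the logic can be exercised outside a sandbox.

```sh
#!/bin/sh
# Hypothetical sketch of the hardened recovery preamble; not the real script.

# check_preload_chain ENV_FILE
# Sources ENV_FILE if it is readable, then verifies that NODE_OPTIONS mentions
# the safety-net preload. Warnings go to stderr and nothing is silenced.
check_preload_chain() {
  env_file=$1
  _PE_MISSING=0
  _GUARDS_MISSING=0

  if [ -r "$env_file" ]; then
    # Explicit source: the env file is the single source of truth.
    . "$env_file"
  else
    _PE_MISSING=1
    echo "[gateway-recovery] WARNING: $env_file missing (#2478)" >&2
  fi

  case "${NODE_OPTIONS:-}" in
    *nemoclaw-sandbox-safety-net*) ;;
    *)
      _GUARDS_MISSING=1
      echo "[gateway-recovery] WARNING: NODE_OPTIONS missing safety-net preload" >&2
      ;;
  esac
}
```

A caller would run `check_preload_chain /tmp/nemoclaw-proxy-env.sh` before launching the gateway. The sketch only warns and never aborts, which matches the PR's stated goal of making a degraded launch observable rather than blocking it.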

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes
npm test passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

Summary by CodeRabbit

Bug Fixes
- Recovery now explicitly sources the proxy env, warns when the env file or safety‑net preload are missing, appends those warnings to the gateway log (preserving earlier messages), and rate‑limits repeated network‑interface error logs while reporting suppressed counts.
Tests
- Added a long-running E2E regression that exercises repeated crash/restart cycles, negative recovery scenarios, soak testing, and verification that guard behavior persists.
Chores
- Added a nightly E2E job for the recovery test, updated staging comments, and extended a test runner timeout and review guidance.

The gateway recovery script (recoverSandboxProcesses + buildRecoveryScript) sourced ~/.bashrc with `2>/dev/null` and never asserted that the resulting NODE_OPTIONS contained the sandbox safety-net preload before launching openclaw gateway run. When proxy-env.sh was missing, .bashrc had been tampered with, or shell env failed to propagate through the gosu boundary, the respawned gateway came up naked. Any library that threw during init (ciao mDNS networkInterfaces being the documented trigger) crashed the gateway forever in a health-monitor restart loop.

Recovery now explicitly sources /tmp/nemoclaw-proxy-env.sh (the single source of truth for NODE_OPTIONS preload guards), surfaces a [gateway-recovery] WARNING line when the file is missing or when the safety-net preload is absent from NODE_OPTIONS, and drops the silencing 2>/dev/null on .bashrc sourcing so real failures stay observable in gateway.log.

Mirrors the change in src/lib/agent-runtime.ts (non-OpenClaw agents) and the inline OpenClaw recovery in src/nemoclaw.ts. Adds a regression test suite locking in the contract: explicit source, warn-on-missing, no silencing, NODE_OPTIONS check, ordering before launch.

Adds a long-running e2e (test/e2e/test-issue-2478-crash-loop-recovery.sh) that crash-recovers the gateway 5x while asserting the preload chain survives each respawn, exercises the negative case where proxy-env.sh is removed, and soaks for 5 minutes to catch crash-loop signatures by sampling the gateway PID. The e2e file is marked STAYS_IN_PR_UNTIL_SHIP and should be removed once a clean soak run lands on a real Spark/Brev instance.

Closes #2478

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

coderabbitai · 2026-04-27T19:27:21Z

Walkthrough

Recovery/runtime scripts now deterministically source /tmp/nemoclaw-proxy-env.sh (warn if missing), stop suppressing ~/.bashrc errors, verify NODE_OPTIONS contains the nemoclaw-sandbox-safety-net preload (warn if absent), append warnings into /tmp/gateway.log, and launch the gateway with append-mode logging. The PR also adds unit and E2E tests plus a nightly CI job for issue #2478.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Recovery Script / Runtime: `src/lib/agent-runtime.ts`, `src/nemoclaw.ts` | Deterministic sourcing of `/tmp/nemoclaw-proxy-env.sh` (track `_PE_MISSING`), stop silencing `~/.bashrc` errors, inspect `NODE_OPTIONS` for `nemoclaw-sandbox-safety-net` (track `_GUARDS_MISSING`), emit `[gateway-recovery] WARNING` to stderr and append into `/tmp/gateway.log`, and launch gateway with append redirection (`>>`). |
| Unit Tests: `src/lib/agent-runtime.test.ts` | Expanded regression tests for #2478 asserting explicit proxy-env sourcing, missing-file warnings referencing #2478, absence of `2>/dev/null` suppression, safety-net preload checks/messages, ordering (sourcing before gateway launch), and persistence of warnings in `/tmp/gateway.log`. |
| End-to-End Test: `test/e2e/test-issue-2478-crash-loop-recovery.sh` | New long-running E2E script that provisions a sandbox, validates `NODE_OPTIONS` via `/proc/<pid>/environ`, runs repeated gateway crash/recovery cycles (including negative scenario removing `/tmp/nemoclaw-proxy-env.sh`), asserts recovery warnings in `gateway.log`, restores and byte-verifies the env file, performs soak testing, and summarizes results. |
| CI Workflow: `.github/workflows/nightly-e2e.yaml` | Added nightly job `issue-2478-crash-loop-recovery-e2e` to run the new E2E script on `ubuntu-latest` with a dedicated sandbox (`e2e-2478`) and failure-only install-log artifact upload. |
| Sandbox Build Staging (comment): `src/lib/sandbox-build-context.ts` | Minor inline comment update describing `generate-openclaw-config.py`; no behavior changes. |
| Start Script Logging Rate-limit: `scripts/nemoclaw-start.sh` | Rate-limits repeated `os.networkInterfaces()` failure logs: emit first failure immediately, then at most one message per 5 minutes while counting suppressed occurrences for later reporting. |
| PR Review Mapping: `.coderabbit.yaml` | Added `reviews.path_instructions` entry linking the new E2E script to the `issue-2478-crash-loop-recovery-e2e` nightly job and providing review guidance. |
| Test Timeout: `test/no-direct-credential-env.test.ts` | Increased ESLint spawn timeout from 30s to 60s for `src/lib/onboard.ts` invocation. |

Sequence Diagram(s)

sequenceDiagram
    participant Trigger as Recovery Trigger
    participant Script as recovery script
    participant ProxyEnv as /tmp/nemoclaw-proxy-env.sh
    participant Bashrc as ~/.bashrc
    participant Inspector as NODE_OPTIONS inspector
    participant Logger as /tmp/gateway.log
    participant Gateway as gateway process

    Trigger->>Script: invoke recovery
    Script->>ProxyEnv: attempt to source (set _PE_MISSING if unreadable)
    alt proxy env missing
        Script->>Logger: touch & append "[gateway-recovery] WARNING: proxy-env missing (`#2478`)"
        Script->>Script: persist warning via writing $_W
        Script->>Gateway: write warning to stderr
    end
    Script->>Bashrc: source ~/.bashrc (no stderr suppression)
    Script->>Inspector: inspect NODE_OPTIONS for "nemoclaw-sandbox-safety-net"
    alt safety-net missing
        Script->>Logger: append "[gateway-recovery] WARNING: NODE_OPTIONS missing safety-net preload"
        Script->>Gateway: write warning to stderr
    end
    Script->>Gateway: exec gateway run >> /tmp/gateway.log

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(recovery): harden gateway recovery preload chain (#2478)' directly and clearly summarizes the main change: hardening the gateway recovery process to ensure the NODE_OPTIONS preload chain survives respawn, addressing issue #2478. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

test/e2e/test-issue-2478-crash-loop-recovery.sh (1)

316-326: File restore may fail silently due to permission issues.

The restore logic writes to /tmp/nemoclaw-proxy-env.sh.restore as the sandbox user, but the original file was root-owned with mode 444. The mv on line 324 will fail if the sandbox user lacks write permission to replace the original.

The comment acknowledges this ("best-effort"), but consider adding explicit feedback:

 sandbox_exec sh -c "cat > /tmp/nemoclaw-proxy-env.sh.restore <<'REPL'
 $SNAPSHOT
 REPL
 chmod 444 /tmp/nemoclaw-proxy-env.sh.restore 2>/dev/null || true
 mv /tmp/nemoclaw-proxy-env.sh.restore /tmp/nemoclaw-proxy-env.sh 2>/dev/null || true
 " >/dev/null
-info "proxy-env.sh restored (best-effort; soak phase will tolerate degraded state)"
+# Verify restore succeeded
+if sandbox_exec sh -c "[ -r /tmp/nemoclaw-proxy-env.sh ] && grep -q 'NODE_OPTIONS' /tmp/nemoclaw-proxy-env.sh" >/dev/null 2>&1; then
+  info "proxy-env.sh restored successfully"
+else
+  info "proxy-env.sh restore failed (best-effort); soak phase will tolerate degraded state"
+fi

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-issue-2478-crash-loop-recovery.sh` around lines 316 - 326, The
restore can fail silently because sandbox_exec redirects output and ignores
errors from chmod/mv; update the block that calls sandbox_exec so failures are
captured and reported: stop swallowing stdout/stderr (remove the final
>/dev/null) or capture the sandbox_exec exit code and log a clear error via
info/error if it is non-zero, and inside the sandbox_exec command ensure the
mv/chmod failures are propagated (e.g. check their exits and write an
explanatory message to stderr) and, as a fallback, attempt a privileged
move/chown when mv fails; reference the sandbox_exec invocation and the
/tmp/nemoclaw-proxy-env.sh.restore -> /tmp/nemoclaw-proxy-env.sh mv/chmod steps
so the script logs explicit failure details instead of silently continuing.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/test-issue-2478-crash-loop-recovery.sh`:
- Line 329: The command using sandbox_exec with $(pgrep -f 'openclaw gateway
run' >/dev/null 2>&1 && echo true || echo true) is a no-op; replace it with a
real existence check or a blocking wait. Update the invocation around
sandbox_exec to directly test pgrep's exit status (use pgrep -f 'openclaw
gateway run' >/dev/null 2>&1 && proceed || fail) or implement a retry/wait loop
that repeatedly calls pgrep and only continues when the gateway process is
found, referencing the existing sandbox_exec wrapper and the pgrep -f 'openclaw
gateway run' check so the script fails or waits appropriately instead of always
executing a no-op.
- Around line 247-249: The sandboxConnect handler in src/nemoclaw.ts currently
ignores unknown flags and only recognizes --dangerously-skip-permissions; add
explicit support for a --probe-only flag in the parsing logic used by the
sandboxConnect function so that when --probe-only is passed the command performs
a non-interactive probe (no prompts, no provisioning or long-lived side effects)
and returns an exit status indicating reachability; update the branch that
handles connect (and any helper like sandboxConnect or parseConnectArgs) to
detect --probe-only, short-circuit the interactive flow (skip permission prompts
and provisioning), run only the lightweight connectivity checks the tests
expect, and ensure the command returns non-zero on failure and zero on success
so the e2e test lines using nemoclaw ... connect --probe-only behave correctly.
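
For the pgrep no-op flagged in the first inline comment, a blocking wait could look roughly like the sketch below. The helper name `wait_until` is hypothetical and is not taken from the test script; it simply retries an arbitrary command and reports failure when the budget runs out, so the caller can no longer fall through on `echo true`.

```sh
#!/bin/sh
# wait_until SECONDS CMD [ARGS...]
# Re-runs CMD once per second until it succeeds or SECONDS elapse.
# Returns 0 on success and 1 on timeout. (Hypothetical helper name.)
wait_until() {
  remaining=$1
  shift
  while [ "$remaining" -gt 0 ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
    remaining=$((remaining - 1))
  done
  # Final attempt after the last sleep, so SECONDS=0 still checks once.
  "$@" >/dev/null 2>&1 && return 0
  return 1
}
```

A call such as `wait_until 30 pgrep -f 'openclaw gateway run' || exit 1` would then fail the test when the gateway never appears, instead of always succeeding.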

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ac9aeebf-94f8-4fa7-abc6-a392f2fe9c59

📥 Commits

Reviewing files that changed from the base of the PR and between db1ef3c and 48e2d9b.

📒 Files selected for processing (4)

src/lib/agent-runtime.test.ts
src/lib/agent-runtime.ts
src/nemoclaw.ts
test/e2e/test-issue-2478-crash-loop-recovery.sh

…UNTIL_SHIP) Wires the new long-running e2e from this PR into the nightly e2e workflow so the recovery hardening for #2478 gets a real soak run on a runner with NVIDIA endpoints reachable. Removed in the same commit that deletes the test file before merge. Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 172-195: The new job issue-2478-crash-loop-recovery-e2e must be
added to the notify-on-failure job's needs list so the notifier waits for it and
creates an auto-issue when this job fails. Update the notify-on-failure job to
include "issue-2478-crash-loop-recovery-e2e" in its needs array alongside the
other job names, so that the notifier runs after this new E2E job and its
artifacts and logs are available by then.

ℹ️ Review info

⚙️ Run configuration

Run ID: 662b99f7-a4fe-4723-bcf9-97f9e9b8ce8c

📥 Commits

Reviewing files that changed from the base of the PR and between 48e2d9b and 054ed59.

📒 Files selected for processing (1)

.github/workflows/nightly-e2e.yaml

#2449 added scripts/generate-openclaw-config.py and a COPY for it in the Dockerfile, but did not update stageOptimizedSandboxBuildContext to include the new file. The optimized build context is missing the script, so every sandbox image build that uses the optimized path fails at Dockerfile step 21/56 with `COPY failed: file not found in build context`. This breaks all dispatched nightly-e2e jobs since 2026-04-26 22:21 EDT. Pulling this fix into the #2478 PR rather than a standalone PR so the recovery hardening test on this branch can actually run. Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

…very-preload-chain

The runner hit "FAIL: Initial gateway missing safety-net preload" with "sh: 1: cannot open /proc/831/environ: No such file" — the test grabbed a PID that was either wrong (greedy pgrep -f match) or gone by the time we tried to read its environ. Add a diagnostic snapshot (all matching processes, ps -ef openclaw filter, /proc/<pid>/{cmdline,status,listing}) that fires before exit so the next failure tells us which of those hypotheses is right before we change the matching logic. Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

test/e2e/test-issue-2478-crash-loop-recovery.sh (1)

347-363: Verify the env-file restore before starting the soak.

This block suppresses restore failures, then immediately enters Phase 5 without re-checking the guard chain. If proxy-env.sh did not come back cleanly, the soak is no longer exercising the healthy recovery path this test is supposed to validate.

✅ Suggested hardening

-sandbox_exec sh -c "cat > /tmp/nemoclaw-proxy-env.sh.restore <<'REPL'
+if ! sandbox_exec sh -c "cat > /tmp/nemoclaw-proxy-env.sh.restore <<'REPL'
 $SNAPSHOT
 REPL
-chmod 444 /tmp/nemoclaw-proxy-env.sh.restore 2>/dev/null || true
-mv /tmp/nemoclaw-proxy-env.sh.restore /tmp/nemoclaw-proxy-env.sh 2>/dev/null || true
-" >/dev/null
-info "proxy-env.sh restored (best-effort; soak phase will tolerate degraded state)"
+chmod 444 /tmp/nemoclaw-proxy-env.sh.restore &&
+mv /tmp/nemoclaw-proxy-env.sh.restore /tmp/nemoclaw-proxy-env.sh
+" >/dev/null; then
+  fail "Failed to restore /tmp/nemoclaw-proxy-env.sh before soak"
+  exit 1
+fi
+info "proxy-env.sh restored"
 
 ...
 SOAK_START_PID="$(wait_for_gateway_up 30)"
 if [ -z "$SOAK_START_PID" ]; then
   fail "Gateway not up entering soak phase"
   exit 1
 fi
+SOAK_NODE_OPTIONS="$(gateway_node_options "$SOAK_START_PID")"
+if ! echo "$SOAK_NODE_OPTIONS" | grep -q 'nemoclaw-sandbox-safety-net'; then
+  fail "Gateway entered soak without safety-net preload after restore"
+  exit 1
+fi
 pass "Gateway healthy entering soak (pid=$SOAK_START_PID)"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-issue-2478-crash-loop-recovery.sh` around lines 347 - 363, The
restore of the proxy env file is being done best-effort and failures are
suppressed; change it to verify the restore succeeded before starting the soak:
after calling sandbox_exec to write the snapshot, check the command exit status
and then validate that /tmp/nemoclaw-proxy-env.sh exists and is readable (and
optionally that its content matches $SNAPSHOT or a checksum) before proceeding
to call wait_for_gateway_up; if the verification fails, call fail (or exit
non‑zero) instead of continuing into the soak phase so that the test exercises
the healthy-recovery path; reference sandbox_exec, $SNAPSHOT,
/tmp/nemoclaw-proxy-env.sh, wait_for_gateway_up and fail when updating the
logic.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/test-issue-2478-crash-loop-recovery.sh`:
- Around line 100-107: The helpers are accidentally treating stderr as
structured PID/data because sandbox_exec currently redirects stderr into stdout;
stop folding transport/errors into the data path by removing the global "2>&1"
redirection from sandbox_exec (so callers can observe failures) and harden
callers like gateway_pid and gateway_node_options to validate output (e.g., only
accept numeric PIDs or parse expected option formats) before returning; also
ensure wait_for_gateway_up checks for empty/invalid values (not just truthy
strings) from gateway_pid/gateway_node_options so an error message won't be
misinterpreted as a PID.
- Line 409: The cleanup command currently at the end of the script only runs on
the happy path and leaks sandboxes on early exits; instead, register an EXIT
trap immediately after creating the sandbox that runs a cleanup function which
checks NEMOCLAW_E2E_KEEP_SANDBOX and calls "nemoclaw $SANDBOX_NAME destroy
--yes" (or a no-op if the variable is set), ensuring the trap uses the same
suppression redirection (>/dev/null 2>&1 || true) and is idempotent; add a named
cleanup function (e.g., cleanup_sandbox) and "trap cleanup_sandbox EXIT" near
where SANDBOX_NAME is created, and you can keep or remove the existing Phase 6
line to avoid duplicate destroys.
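
The EXIT-trap cleanup suggested in the last inline comment might be sketched like this. The `CLEANED_UP` flag and the default sandbox name are assumptions added for illustration; the destroy command and the `NEMOCLAW_E2E_KEEP_SANDBOX` switch are the ones named in the comment.

```sh
#!/bin/sh
# Hypothetical sketch: register cleanup right after the sandbox is created so
# that early exits do not leak it. The default name below is an assumption.
SANDBOX_NAME="${SANDBOX_NAME:-e2e-2478}"
CLEANED_UP=0

cleanup_sandbox() {
  # Idempotent: a second call is a no-op.
  [ "$CLEANED_UP" -eq 1 ] && return 0
  CLEANED_UP=1
  # Leave the sandbox in place when the caller asked to keep it.
  if [ -n "${NEMOCLAW_E2E_KEEP_SANDBOX:-}" ]; then
    return 0
  fi
  nemoclaw "$SANDBOX_NAME" destroy --yes >/dev/null 2>&1 || true
}

trap cleanup_sandbox EXIT
```

Because the function guards itself with a flag, an explicit call at the end of the happy path and the trap firing afterwards result in a single destroy.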

ℹ️ Review info

⚙️ Run configuration

Run ID: d07fb5f7-617a-4c99-933c-b41762a0b824

📥 Commits

Reviewing files that changed from the base of the PR and between 9c4418e and b94bd40.

📒 Files selected for processing (1)

test/e2e/test-issue-2478-crash-loop-recovery.sh

Last run revealed two distinct issues:

1. `pgrep -f 'openclaw gateway run'` matched its own `sh -c "pgrep ..."` wrapper argv, returning the wrapper's PID. The wrapper exited immediately after writing stdout, so `/proc/<pid>/environ` was already gone. Switch to `pgrep -fo '[o]penclaw gateway run'`: the bracket trick prevents self-match, and `-o` picks the oldest matching process (the long-lived gateway, not transient launchers).
2. ps/pgrep showed zero openclaw processes in the sandbox after onboard reported success. Could be the gateway crashed silently, or `openshell sandbox exec` is landing in a different container than the gateway.

Extend diagnostics to dump:

- exec context (whoami, hostname, pid namespace)
- full process tree (ps auxf)
- /tmp/gateway.log content + listing
- nemoclaw status output
- openshell sandbox info (containers in pod)

Also call gateway_diagnostics on the "Gateway never came up after onboard" path so we see the data when there's no PID at all.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

Diagnostic from the previous run showed the gateway is alive and the fix is working: the ciao guard fires 19+ times in 90s and the gateway stays up. The test never picked up its PID, though, because the real argv is `openclaw-gateway` (single token, after re-exec), not the launcher's `openclaw gateway run`.

Match either form via `[o]penclaw[ -]gateway`. The bracket prevents pgrep self-match: the wrapper argv contains the literal pattern text `[o]penclaw`, which the regex (an `o` followed directly by `penclaw`) does not match. `[ -]` accepts both space (early launcher) and dash (post-rename gateway).

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
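
The matching rule can be checked without any real processes. In the sketch below, `grep -E` stands in for `pgrep -f`, which also treats its pattern as an extended regular expression matched against the full command line; the helper name `matches_gateway` is hypothetical.

```sh
#!/bin/sh
# The pattern from the commit message above.
PATTERN='[o]penclaw[ -]gateway'

# matches_gateway CMDLINE: succeeds when CMDLINE matches PATTERN.
# grep -E is used in place of pgrep so no live process is needed.
matches_gateway() {
  printf '%s\n' "$1" | grep -Eq -- "$PATTERN"
}

matches_gateway 'openclaw gateway run' && echo "launcher form: match"
matches_gateway 'openclaw-gateway' && echo "re-exec form: match"
# The wrapper's own argv carries the literal pattern text, which the regex
# itself does not match.
matches_gateway "sh -c pgrep -f '$PATTERN'" || echo "wrapper argv: no match"
```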

Last run got past pgrep matching the right gateway PID (288), but reading /proc/288/environ failed with EACCES. Linux's kernel.yama.ptrace_scope=1 (default) refuses /proc/<pid>/environ reads from non-ancestor processes even at the same UID, and `openshell sandbox exec` spawns a new process tree, not a child of the gateway.

Switch to a stronger three-signal verification that doesn't need to read environ:

1. /tmp/nemoclaw-proxy-env.sh contains the safety-net + ciao guard NODE_OPTIONS exports (the source of truth recovery sources).
2. gateway.log contains "[guard] os.networkInterfaces() failed:". That line is *only* emitted by our preload code, so its presence proves the preload actually executed inside the gateway's Node process. This is stronger evidence than a NODE_OPTIONS snapshot.
3. The gateway PID is alive after the guard activations, proving the guard prevented a crash.

The preload activation log line appears within seconds because bonjour mDNS retries advertising eagerly; helper waits up to 30s for the signature to accrue.

Drops the redundant per-cycle ciao-guard assertion (the new gateway_guards_active() helper already covers it).

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
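
As a rough illustration of the three signals, the sketch below checks each one in turn and returns a distinct status for the first that fails. It is not the actual gateway_guards_active() helper: the function name `guards_active`, the parameters, and the return codes are assumptions, it checks only the safety-net string in the env file, and it omits the 30s wait.

```sh
#!/bin/sh
# guards_active ENV_FILE LOG_FILE PID
# Hypothetical stand-in for gateway_guards_active(); the real helper runs
# inside the sandbox and polls for the log signature.
guards_active() {
  env_file=$1
  log_file=$2
  pid=$3

  # Signal 1: the env file still exports the safety-net preload.
  grep -q 'nemoclaw-sandbox-safety-net' "$env_file" 2>/dev/null || return 1

  # Signal 2: the preload logged a guard activation, which shows it ran
  # inside the gateway's Node process.
  grep -qF '[guard] os.networkInterfaces() failed:' "$log_file" 2>/dev/null || return 2

  # Signal 3: the gateway is still alive after those activations.
  kill -0 "$pid" 2>/dev/null || return 3
}
```

Distinct return codes make it easier to tell from a CI log which of the three signals was missing.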

Phase 3 Cycle 1 hung for 13+ minutes until GitHub Actions killed the job at the 30-min budget (exit 124). Cause: `nemoclaw <name> connect --probe-only` invokes `connect` with a flag that does not exist; with stdin/stdout swallowed by the test redirects, the connect TUI blocks waiting for input that never arrives.

Switch to `nemoclaw <name> status`. That command is non-interactive and is the actual entry point that calls checkAndRecoverSandboxProcesses() → recoverSandboxProcesses(), the hardened path #2478 changes. Wrap with `timeout 60` so any future hang fails fast instead of eating the whole job budget.

Also update the post-kill pgrep verification to use the same `[o]penclaw[ -]gateway` pattern as gateway_pid().

Phases 0-2 PASS on the previous run, including the new "Initial gateway has guard chain active" assertion, confirming the recovery hardening works end-to-end. Phase 3 was just the test harness lying.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
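
The fail-fast wrapper can be sketched as below. The helper name `run_bounded` is hypothetical, and detaching stdin with `</dev/null` is an extra precaution assumed here rather than something the commit states; the exit status 124 on expiry is the documented behavior of GNU coreutils `timeout`.

```sh
#!/bin/sh
# run_bounded SECONDS CMD [ARGS...]
# Runs CMD with stdin detached and a hard time limit. GNU coreutils `timeout`
# exits with status 124 when the limit is reached.
run_bounded() {
  limit=$1
  shift
  timeout "$limit" "$@" </dev/null
}
```

In the test this would read `run_bounded 60 nemoclaw "$SANDBOX_NAME" status`, with a status of 124 treated as a hang rather than an ordinary failure.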

…2478)

The recovery script's missing-proxy-env.sh and missing-safety-net warnings went to stderr only, but executeSandboxCommand captures stderr without surfacing it, so a sysadmin tailing /tmp/gateway.log to debug a crash-loop saw the crash output but not the explanation. Defer warning emission until AFTER the script touches+chmods a fresh gateway.log, then write each warning to both stderr and gateway.log so it lands where it's actually discoverable.

Mirrored across src/lib/agent-runtime.ts (non-OpenClaw agents) and the inline OpenClaw recovery in src/nemoclaw.ts. Adds a unit-test assertion that warnings end up in /tmp/gateway.log and are deferred past the touch.

Also fix the e2e Phase 4 proxy-env.sh restore. The previous heredoc-via-shell approach hit a 444-mode mv that silently failed, leaving the gateway without guards through the entire soak phase (Phase 5 saw 19/20 empty PID samples). Restore via stdin pipe → temp file → mv -f → chmod 444, with size verification, and assert the restored gateway has guards active before entering the soak.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
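
The deferral described above could be sketched as a queue-then-flush pair. The function names and the chmod mode are assumptions made for illustration; the idea shown is only that warnings are collected first and written to both stderr and the log once the log file has been reset.

```sh
#!/bin/sh
# Hypothetical sketch: queue warnings, then flush them only after the log
# file has been recreated, writing each one to stderr and to the log.
_W=""

queue_warning() {
  # Accumulate one warning per line.
  _W="${_W}[gateway-recovery] WARNING: $1
"
}

flush_warnings() {
  log_file=$1
  rm -f "$log_file"
  touch "$log_file"
  chmod 644 "$log_file"   # mode is an assumption, not taken from the PR
  if [ -n "$_W" ]; then
    # tee -a appends to the log; its stdout is redirected to stderr.
    printf '%s' "$_W" | tee -a "$log_file" >&2
  fi
}
```

Flushing after the reset matters because a warning written before `rm -f` would be deleted along with the old log.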

Phase 4 was failing for two distinct reasons even after the previous fix:

1. `kill -9 $prev_pid` only killed the openclaw-gateway child; the parent `openclaw` launcher's watchdog likely respawned the gateway before `nemoclaw status` ran the recovery script. Result: recovery code path never executed, so the [gateway-recovery] WARNING was never emitted. Switch to `pkill -9 -f '[o]penclaw'` to nuke the whole tree.
2. `printf '%s' "$SNAPSHOT" | sandbox_exec sh -c 'cat > file'` left an empty file because `openshell sandbox exec` doesn't pipe caller stdin through to the subshell. Encode the snapshot as base64 and decode inside the sandbox so the data lives in the command argv, sidestepping the stdin gap entirely.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
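
A minimal sketch of the base64-in-argv idea follows. The helper name `restore_via_argv` is hypothetical, and a plain `sh -c` with stdin closed stands in for `openshell sandbox exec`; the point is that the payload travels inside the command string, so nothing depends on stdin being forwarded.

```sh
#!/bin/sh
# restore_via_argv SRC DEST
# Hypothetical sketch. DEST must not contain single quotes, since it is
# embedded in a single-quoted command string.
restore_via_argv() {
  src=$1
  dest=$2
  # tr strips the line wraps that base64 adds, leaving one argv-safe token
  # (the base64 alphabet has no quote characters).
  b64=$(base64 < "$src" | tr -d '\n')
  sh -c "printf '%s' '$b64' | base64 -d > '$dest'" </dev/null
}
```

This suits small files such as an env script; very large payloads could exceed the system's argument-length limit.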

The recovery script writes the [gateway-recovery] WARNING to gateway.log moments before launching the gateway. The launch then redirected with `>` (truncate), which wiped the warning before anyone (sysadmin or e2e test) could read it. Switch to `>>` (append). The earlier `rm -f /tmp/gateway.log; touch /tmp/gateway.log` reset already ensures we start clean per recovery cycle, so append is safe and preserves the warning that explains why the gateway is about to crash.

Also: e2e Phase 4 proxy-env.sh restore was leaving an empty file even with base64-via-argv. Add `wc -c` inline so we can see what the restore actually produced if it fails again, and collapse the multi-line sh -c to a single chained command which is more reliable across openshell sandbox exec versions.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
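
The difference between the two redirections is easy to reproduce in isolation. In this sketch an `echo` stands in for the gateway launch, and the helper name `demo_redirect` is hypothetical.

```sh
#!/bin/sh
# demo_redirect MODE LOG_FILE
# Writes a warning, then "launches" a stand-in gateway (echo) whose output is
# redirected in either truncate or append mode.
demo_redirect() {
  mode=$1
  log_file=$2
  rm -f "$log_file"
  touch "$log_file"
  echo "[gateway-recovery] WARNING: proxy-env missing" >> "$log_file"
  if [ "$mode" = truncate ]; then
    echo "gateway starting" > "$log_file"    # wipes the warning
  else
    echo "gateway starting" >> "$log_file"   # keeps the warning
  fi
}
```

After `demo_redirect truncate` the log holds only the launch line, while `demo_redirect append` leaves the warning above it.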

The ciao networkInterfaces guard preload (#2340) prevents the gateway crash, but the bonjour watchdog inside ciao retries advertising every few seconds, so the guard's failure message ends up flooding sandbox logs with hundreds of identical lines per hour:

    [guard] os.networkInterfaces() failed: ... — returning empty (mDNS disabled)
    [guard] os.networkInterfaces() failed: ... — returning empty (mDNS disabled)
    ...

Operators see the same "actionable" message a thousand times. Log on first failure (operator gets the explanation), then suppress repeats within a 5-minute window and replay one summary line afterward with a "[N suppressed in last ~5min, M total]" suffix so volume is still observable. Same prefix, so existing log scrapers and the e2e regression test continue to match. Closes GitHub issue #2611.

Also fix the e2e snapshot/restore byte-faithfulness. The previous `SNAPSHOT="$(cat ...)"` capture went through bash command substitution which strips trailing newlines, so the restored file was 2 bytes longer than the captured size and Phase 4's verification reported a spurious mismatch even though the restore was correct. Capture the snapshot as base64 from inside the sandbox (lossless round-trip) and verify size against the original sandbox-side wc -c.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
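
The rate-limiting policy can be sketched independently of the real guard, which lives in the preload code rather than in a shell function. The version below is written in shell only to illustrate the windowing logic; the function name, the variables, and the injectable clock argument are all assumptions.

```sh
#!/bin/sh
# Hypothetical sketch of the policy: emit the first failure, suppress repeats
# inside the window, then emit one line carrying the suppressed count.
WINDOW=300      # seconds, i.e. the 5-minute window
last_emit=""
suppressed=0
total=0

# log_guard_failure MESSAGE NOW
# NOW is the current time in seconds, passed in so the logic is testable.
log_guard_failure() {
  msg=$1
  now=$2
  total=$((total + 1))
  if [ -z "$last_emit" ]; then
    echo "[guard] $msg"
    last_emit=$now
  elif [ $((now - last_emit)) -ge "$WINDOW" ]; then
    if [ "$suppressed" -gt 0 ]; then
      echo "[guard] $msg [$suppressed suppressed in last ~5min, $total total]"
    else
      echo "[guard] $msg"
    fi
    last_emit=$now
    suppressed=0
  else
    suppressed=$((suppressed + 1))
  fi
}
```

Keeping the `[guard]` prefix on the summary line means a scraper matching on that prefix sees both the first report and the later summaries.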

coderabbitai

♻️ Duplicate comments (1)

.github/workflows/nightly-e2e.yaml (1)
678-701: ⚠️ Potential issue | 🟡 Minor

Add issue-2478-crash-loop-recovery-e2e to notify-on-failure.

This job is still missing from the notifier’s needs list, so a failure here will not be included in the auto-issue path or waited on before the notifier runs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-e2e.yaml around lines 678 - 701, The
notify-on-failure job's needs list is missing the
issue-2478-crash-loop-recovery-e2e job; update the notify-on-failure job
(notify-on-failure) to include "issue-2478-crash-loop-recovery-e2e" in its needs
array so the notifier waits on and includes failures from that job in the
auto-issue path.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ae379a80-1cdd-4f11-a3d5-1e10ef63b728

📥 Commits

Reviewing files that changed from the base of the PR and between 9c4418e and 2658673.

📒 Files selected for processing (6)

.github/workflows/nightly-e2e.yaml
scripts/nemoclaw-start.sh
src/lib/agent-runtime.test.ts
src/lib/agent-runtime.ts
src/nemoclaw.ts
test/e2e/test-issue-2478-crash-loop-recovery.sh

✅ Files skipped from review due to trivial changes (1)

src/lib/agent-runtime.ts

🚧 Files skipped from review as they are similar to previous changes (2)

src/nemoclaw.ts
test/e2e/test-issue-2478-crash-loop-recovery.sh

A NemoClaw user reported on issue #2478 that the pre-fix ciao crash made `https://inference.local/v1/models` go silent: the user's deployed model "disappeared" even though the gateway process state was variously alive/dead. The recovery hardening keeps the gateway process up, but nothing in the e2e was asserting that the user-facing inference API actually stays reachable across kill/respawn.

Add `gateway_serves_inference()` (curl /v1/models from inside the sandbox, accept any OpenAI-compatible response shape) and call it in:

- Phase 2: initial gateway must serve before we trust later phases.
- Phase 3: after every kill/respawn cycle (5x), the new gateway must serve, not just exist.
- Phase 5 entry + every 60s during the 300s soak, with a final pass/fail on the failure rate.

Strongest-possible reply to the user's report: end-to-end on a runner, gateway + inference endpoint both stay healthy across 5 kill/recover cycles and a 5-minute idle soak.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
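The two decisions in this probe, whether a response counts as a model list and whether the soak passes overall, can be sketched in TypeScript. Both helpers are hypothetical; the real check is the shell function `gateway_serves_inference()`. What "any OpenAI-compatible response shape" accepts and which failure rate is tolerated are assumptions here, since the commit message does not spell them out.

```typescript
// Hypothetical helpers; the real check lives in the e2e shell script.

// Assumption: "OpenAI-compatible" is taken to mean a JSON object whose `data`
// field is an array, as in the OpenAI list-models response.
function looksLikeModelList(body: string): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(body);
  } catch {
    return false; // not JSON at all, e.g. an HTML error page or an empty reply
  }
  if (typeof parsed !== "object" || parsed === null) return false;
  return Array.isArray((parsed as { data?: unknown }).data);
}

// Final pass/fail over the soak samples. The threshold is an assumption; the
// source does not state which failure rate the test tolerates.
function soakPasses(samples: boolean[], maxFailureRate: number): boolean {
  if (samples.length === 0) return false; // no probes means nothing was verified
  const failures = samples.filter((ok) => !ok).length;
  return failures / samples.length <= maxFailureRate;
}
```

Checking the response body rather than only the process state is what separates "the gateway exists" from "the gateway serves", which is the gap the user report exposed.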

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

coderabbitai

🧹 Nitpick comments (1)

src/nemoclaw.ts (1)

289-314: Consider collapsing this fallback onto the shared recovery-script builder.

This block now mirrors src/lib/agent-runtime.ts almost verbatim. Keeping the warning strings, ordering, and NODE_OPTIONS checks in two places makes the next preload-chain change easy to miss in one path.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 289 - 314, This duplicates the recovery-launch
shell block found in src/lib/agent-runtime.ts (see buildRecoveryScript); instead
of maintaining two nearly identical arrays, refactor src/nemoclaw.ts to
import/use the shared builder (buildRecoveryScript) or a common helper that
returns the recovery script/command list, and replace the inline commands (the
array containing checks for /tmp/nemoclaw-proxy-env.sh, ~/.bashrc, NODE_OPTIONS
guard check, warnings, OPENCLAW lookup and nohup launch) with a call to that
shared builder so the warning strings, ordering, and NODE_OPTIONS logic are
maintained in one place (also update any callers that expect the current array
shape to use the unified output, and ensure compatibility with
executeSandboxCommand if used).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 85febb49-5051-4324-9f7d-123a7a5e9579

📥 Commits

Reviewing files that changed from the base of the PR and between 9c4418e and 4f64a74.

📒 Files selected for processing (6)

.github/workflows/nightly-e2e.yaml
scripts/nemoclaw-start.sh
src/lib/agent-runtime.test.ts
src/lib/agent-runtime.ts
src/nemoclaw.ts
test/e2e/test-issue-2478-crash-loop-recovery.sh

🚧 Files skipped from review as they are similar to previous changes (1)

.github/workflows/nightly-e2e.yaml

Brings in the selective nightly-e2e dispatch infrastructure from #2615 (adds workflow_dispatch.inputs.jobs filter + per-job guard) and the test/validate-e2e-coverage.test.ts cross-check that asserts every nightly job carries the canonical guard. Updates the issue-2478-crash-loop-recovery-e2e job to use the same guard pattern as every other job, and adds it to the inputs.jobs description "Valid:" list.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

.github/workflows/nightly-e2e.yaml (1)

197-228: ⚠️ Potential issue | 🟠 Major

Add the new #2478 job to notify-on-failure.needs.

Line 201 introduces issue-2478-crash-loop-recovery-e2e, but notify-on-failure (Line 777-Line 797) still excludes it. Failures from this job can be missed by notifier sequencing/reporting.

🔧 Proposed patch

   notify-on-failure:
     runs-on: ubuntu-latest
     needs:
       [
         cloud-e2e,
         messaging-providers-e2e,
         token-rotation-e2e,
         sandbox-survival-e2e,
+        issue-2478-crash-loop-recovery-e2e,
         hermes-e2e,
         skip-permissions-e2e,
         sandbox-operations-e2e,
         inference-routing-e2e,
         network-policy-e2e,
         deployment-services-e2e,
         diagnostics-e2e,
         snapshot-commands-e2e,
         shields-config-e2e,
         rebuild-openclaw-e2e,
         upgrade-stale-sandbox-e2e,
         rebuild-hermes-e2e,
         overlayfs-autofix-e2e,
         gpu-e2e,
         gpu-double-onboard-e2e,
       ]

As per coding guidelines: "Keeping notify-on-failure updated: include the new job in the needs list so failures trigger the failure issue/reporting."
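A check of this kind is easy to automate. The TypeScript sketch below is a hypothetical illustration, not the repository's test/validate-e2e-coverage.test.ts: `missingFromNeeds` is an invented name, and it assumes the job names and the `needs` array have already been read out of the workflow file.

```typescript
// Hypothetical cross-check; not the repository's validate-e2e-coverage test.
// Returns every e2e job that the notifier's `needs` list does not mention.
function missingFromNeeds(e2eJobs: string[], needs: string[]): string[] {
  const listed = new Set(needs);
  return e2eJobs.filter((job) => !listed.has(job));
}

// Sample input using job names that appear in the patch above.
const sampleJobs = ["cloud-e2e", "sandbox-survival-e2e", "issue-2478-crash-loop-recovery-e2e"];
const sampleNeeds = ["cloud-e2e", "sandbox-survival-e2e"];

console.log(missingFromNeeds(sampleJobs, sampleNeeds));
// prints [ 'issue-2478-crash-loop-recovery-e2e' ]
```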

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-e2e.yaml around lines 197 - 228, The new workflow
job issue-2478-crash-loop-recovery-e2e is not included in the needs list for the
notify-on-failure job; update the notify-on-failure job's needs array to include
"issue-2478-crash-loop-recovery-e2e" so failures from that job trigger the
notifier sequencing/reporting (locate notify-on-failure and add the exact job
name to its needs list).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 197-228: The workflow adds a new E2E job
issue-2478-crash-loop-recovery-e2e that runs
test/e2e/test-issue-2478-crash-loop-recovery.sh but .coderabbit.yaml lacks a
corresponding path_instructions entry; open .coderabbit.yaml and add a
path_instructions mapping for the job name issue-2478-crash-loop-recovery-e2e
that points to the test file path
(test/e2e/test-issue-2478-crash-loop-recovery.sh) following the existing format
of other E2E jobs so the cross-validation test
(test/validate-e2e-coverage.test.ts) recognizes the new job.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 08d47945-9f7b-4274-9018-2fecb3d84d05

📥 Commits

Reviewing files that changed from the base of the PR and between 4f64a74 and 2f248bb.

📒 Files selected for processing (1)

.github/workflows/nightly-e2e.yaml

Per project coding standard (enforced by test/validate-e2e-coverage.test.ts), every nightly E2E job needs a corresponding .coderabbit.yaml path_instructions entry. Add one keyed on the test file path so the mapping disappears in the same removal commit when the STAYS_IN_PR_UNTIL_SHIP test is deleted before merge.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

coderabbitai

🧹 Nitpick comments (1)

src/nemoclaw.ts (1)

292-319: Recommend centralizing this fallback script to avoid drift with agent-runtime.ts.

This block now mirrors src/lib/agent-runtime.ts:52-123 very closely. Consolidating into one shared builder would reduce future regression risk when guard-chain behavior changes again.

♻️ Refactor direction

-  const script =
-    agentScript ||
-    [
-      // inline OpenClaw recovery script...
-    ].join(" ");
+  const script =
+    agentScript ||
+    agentRuntime.buildOpenclawRecoveryScript({
+      dashboardPort: DASHBOARD_PORT,
+      gatewayLogPath: "/tmp/gateway.log",
+    });

// src/lib/agent-runtime.ts
export function buildOpenclawRecoveryScript(opts: {
  dashboardPort: number;
  gatewayLogPath: string;
}): string {
  // single source of truth for OpenClaw fallback recovery chain
}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 292 - 319, The recovery-script block in
src/nemoclaw.ts duplicates logic from src/lib/agent-runtime.ts; extract it into
a single exported builder function e.g.
buildOpenclawRecoveryScript({dashboardPort, gatewayLogPath}) in
src/lib/agent-runtime.ts that returns the full joined script string (preserving
all checks, touch/chmod, warnings, nohup/GPID logic), then import and call that
function from src/nemoclaw.ts (pass DASHBOARD_PORT and "/tmp/gateway.log" or
configurable gatewayLogPath) replacing the inline array.join(" ") block; ensure
the new function is exported and tests/uses still behave the same (keep variable
names like OPENCLAW, GPID, and the exact warning texts).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 03f4cfe7-5e6f-407f-8fb9-c9bf66f54aa9

📥 Commits

Reviewing files that changed from the base of the PR and between 28c6626 and e73710a.

📒 Files selected for processing (1)

src/nemoclaw.ts

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

…very-preload-chain Signed-off-by: Aaron Erickson <aerickson@nvidia.com> # Conflicts: # .github/workflows/nightly-e2e.yaml

…VIDIA#2558)

## Summary

Hardens the gateway recovery script so the NODE_OPTIONS preload chain (sandbox safety-net, ciao networkInterfaces guard, slack guard, http-proxy fix, ws-proxy fix, nemotron fix) actually survives gateway respawn. The pre-fix path silently swallowed `.bashrc` sourcing failures with `2>/dev/null` and never asserted that NODE_OPTIONS contained the safety-net preload — so when `/tmp/nemoclaw-proxy-env.sh` was missing or env did not propagate through the `gosu gateway` boundary, the respawned gateway came up naked and any library that threw during init (ciao mDNS being the documented trigger in NVIDIA#2478) crash-looped the gateway forever under health-monitor restart cadence.

## Related Issue

Closes NVIDIA#2478

## Changes

- `src/lib/agent-runtime.ts:buildRecoveryScript` — explicitly sources `/tmp/nemoclaw-proxy-env.sh` (single source of truth for NODE_OPTIONS guards), drops the silencing `2>/dev/null` on `.bashrc` sourcing, asserts NODE_OPTIONS contains the safety-net preload, surfaces a `[gateway-recovery] WARNING` line to gateway.log when the env file is missing or guards are absent.
- `src/nemoclaw.ts:recoverSandboxProcesses` — mirrors the same hardening in the inline OpenClaw recovery path.
- `src/lib/agent-runtime.test.ts` — 5 new unit tests under `NVIDIA#2478 hardened library-guard preload chain` lock in the contract: explicit source, warn-on-missing, no silencing, NODE_OPTIONS check, sourcing-before-launch ordering.
- `test/e2e/test-issue-2478-crash-loop-recovery.sh` — long-running regression e2e: 5 crash-recover cycles asserting the preload chain survives each respawn (verified via `/proc/<pid>/environ`), a negative case where `proxy-env.sh` is removed and the recovery surfaces a `[gateway-recovery] WARNING` instead of silently launching, and a 5-minute idle soak that catches crash-loop signatures by sampling the gateway PID every 15s. Marked `STAYS_IN_PR_UNTIL_SHIP` — delete this file before merge once a clean soak run lands on a real Spark/Brev instance.

## Type of Change

- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification

- [ ] ``npx prek run --all-files`` passes
- [ ] ``npm test`` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes

---

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

## Summary by CodeRabbit

* **Bug Fixes**
  * Recovery now explicitly sources the proxy env, warns when the env file or safety‑net preload are missing, appends those warnings to the gateway log (preserving earlier messages), and rate‑limits repeated network‑interface error logs while reporting suppressed counts.
* **Tests**
  * Added a long-running E2E regression that exercises repeated crash/restart cycles, negative recovery scenarios, soak testing, and verification that guard behavior persists.
* **Chores**
  * Added a nightly E2E job for the recovery test, updated staging comments, and extended a test runner timeout and review guidance.

---------

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
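The warn-instead-of-launching-silently contract from the Changes list can be sketched as a small pure function. This TypeScript is a hedged illustration, not the contents of `buildRecoveryScript`, which emits shell. `RecoveryEnv`, `recoveryWarnings`, and `safetyNetMarker` are hypothetical names, and the warning wording after the `[gateway-recovery] WARNING` prefix is invented for the example.

```typescript
// Hypothetical sketch of the hardened checks; the real script is shell.
interface RecoveryEnv {
  envFileExists: boolean; // whether /tmp/nemoclaw-proxy-env.sh was found
  nodeOptions: string; // value of NODE_OPTIONS after sourcing
}

// `safetyNetMarker` is an assumed substring identifying the safety-net
// preload inside NODE_OPTIONS; the source does not name the actual file.
function recoveryWarnings(env: RecoveryEnv, safetyNetMarker: string): string[] {
  const warnings: string[] = [];
  if (!env.envFileExists) {
    // Missing env file: say so instead of discarding the failure.
    warnings.push("[gateway-recovery] WARNING: /tmp/nemoclaw-proxy-env.sh is missing");
  }
  if (!env.nodeOptions.includes(safetyNetMarker)) {
    // Guards absent: the gateway would come up without its preload chain.
    warnings.push("[gateway-recovery] WARNING: NODE_OPTIONS lacks the safety-net preload");
  }
  return warnings;
}
```

In this sketch the function only reports; it does not block the launch, matching the description above in which the recovery surfaces a warning in gateway.log rather than refusing to start the gateway.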

ericksoa self-assigned this Apr 27, 2026

coderabbitai Bot reviewed Apr 27, 2026

Comment thread test/e2e/test-issue-2478-crash-loop-recovery.sh Outdated

Comment thread test/e2e/test-issue-2478-crash-loop-recovery.sh Outdated

coderabbitai Bot reviewed Apr 27, 2026

Comment thread .github/workflows/nightly-e2e.yaml

ericksoa added 3 commits April 27, 2026 13:02

Merge remote-tracking branch 'origin/main' into fix/2478-gateway-reco…

9c4418e

…very-preload-chain

coderabbitai Bot reviewed Apr 28, 2026

Comment thread test/e2e/test-issue-2478-crash-loop-recovery.sh

Comment thread test/e2e/test-issue-2478-crash-loop-recovery.sh

ericksoa added 9 commits April 27, 2026 18:20

Merge branch 'main' into fix/2478-gateway-recovery-preload-chain

13a16a1

wscurran added bug Something fails against expected or documented behavior Platform: DGX Spark provider: nvidia NVIDIA inference endpoint, NIM, or NVIDIA provider behavior labels Apr 28, 2026

Merge branch 'main' into fix/2478-gateway-recovery-preload-chain

2658673

coderabbitai Bot reviewed Apr 28, 2026

ericksoa added the v0.0.29 label Apr 28, 2026

style(e2e/2478): apply shfmt spacing in case pattern

4f64a74

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

coderabbitai Bot reviewed Apr 28, 2026

Comment thread .github/workflows/nightly-e2e.yaml

ericksoa added 2 commits April 28, 2026 10:12

Merge branch 'main' into fix/2478-gateway-recovery-preload-chain

e73710a

coderabbitai Bot reviewed Apr 28, 2026

ericksoa added 3 commits April 28, 2026 13:29

test(wsl): extend credential lint subprocess timeout

315e1cf

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

Merge remote-tracking branch 'origin/main' into fix/2478-gateway-reco…

9cc1d25

…very-preload-chain Signed-off-by: Aaron Erickson <aerickson@nvidia.com> # Conflicts: # .github/workflows/nightly-e2e.yaml

Merge branch 'main' into fix/2478-gateway-recovery-preload-chain

6e29b8b

cv approved these changes Apr 28, 2026

ericksoa merged commit cf10693 into main Apr 28, 2026
13 checks passed

coderabbitai Bot mentioned this pull request May 2, 2026

fix(recovery): add connect probe recovery path #2646

Merged

prekshivyas mentioned this pull request May 8, 2026

[Linux][Policy] sandbox logs repeatedly emit os.networkInterfaces guard errors after NVIDIA endpoint onboard #2611

Closed

wscurran added area: cli Command line interface, flags, terminal UX, or output bug-fix PR fixes a bug or regression platform: dgx-spark Affects DGX Spark hardware or workflows and removed Platform: DGX Spark bug Something fails against expected or documented behavior labels Jun 3, 2026
