fix(recovery): harden gateway recovery preload chain (#2478) #2558

Merged
ericksoa merged 23 commits into main from fix/2478-gateway-recovery-preload-chain on Apr 28, 2026

@ericksoa ericksoa commented Apr 27, 2026

Summary

Hardens the gateway recovery script so the NODE_OPTIONS preload chain (sandbox safety-net, ciao networkInterfaces guard, slack guard, http-proxy fix, ws-proxy fix, nemotron fix) actually survives gateway respawn. The pre-fix path silently swallowed .bashrc sourcing failures with 2>/dev/null and never asserted that NODE_OPTIONS contained the safety-net preload — so when /tmp/nemoclaw-proxy-env.sh was missing or env did not propagate through the gosu gateway boundary, the respawned gateway came up naked and any library that threw during init (ciao mDNS being the documented trigger in #2478) crash-looped the gateway forever under health-monitor restart cadence.

Related Issue

Closes #2478

Changes

  • src/lib/agent-runtime.ts:buildRecoveryScript — explicitly sources /tmp/nemoclaw-proxy-env.sh (single source of truth for NODE_OPTIONS guards), drops the silencing 2>/dev/null on .bashrc sourcing, asserts NODE_OPTIONS contains the safety-net preload, surfaces a [gateway-recovery] WARNING line to gateway.log when the env file is missing or guards are absent.
  • src/nemoclaw.ts:recoverSandboxProcesses — mirrors the same hardening in the inline OpenClaw recovery path.
  • src/lib/agent-runtime.test.ts — 5 new unit tests under #2478 hardened library-guard preload chain lock in the contract: explicit source, warn-on-missing, no silencing, NODE_OPTIONS check, sourcing-before-launch ordering.
  • test/e2e/test-issue-2478-crash-loop-recovery.sh — long-running regression e2e: 5 crash-recover cycles asserting the preload chain survives each respawn (verified via /proc/<pid>/environ), a negative case where proxy-env.sh is removed and the recovery surfaces a [gateway-recovery] WARNING instead of silently launching, and a 5-minute idle soak that catches crash-loop signatures by sampling the gateway PID every 15s. Marked STAYS_IN_PR_UNTIL_SHIP — delete this file before merge once a clean soak run lands on a real Spark/Brev instance.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

Summary by CodeRabbit

  • Bug Fixes

    • Recovery now explicitly sources the proxy env, warns when the env file or safety‑net preload are missing, appends those warnings to the gateway log (preserving earlier messages), and rate‑limits repeated network‑interface error logs while reporting suppressed counts.
  • Tests

    • Added a long-running E2E regression that exercises repeated crash/restart cycles, negative recovery scenarios, soak testing, and verification that guard behavior persists.
  • Chores

    • Added a nightly E2E job for the recovery test, updated staging comments, and extended a test runner timeout and review guidance.

The gateway recovery script (recoverSandboxProcesses + buildRecoveryScript)
sourced ~/.bashrc with `2>/dev/null` and never asserted that the resulting
NODE_OPTIONS contained the sandbox safety-net preload before launching
openclaw gateway run. When proxy-env.sh was missing, .bashrc had been
tampered with, or shell env failed to propagate through the gosu boundary,
the respawned gateway came up naked. Any library that threw during init
(ciao mDNS networkInterfaces being the documented trigger) crashed the
gateway forever in a health-monitor restart loop.

Recovery now explicitly sources /tmp/nemoclaw-proxy-env.sh — the single
source of truth for NODE_OPTIONS preload guards — surfaces a
[gateway-recovery] WARNING line when the file is missing or when the
safety-net preload is absent from NODE_OPTIONS, and drops the silencing
2>/dev/null on .bashrc sourcing so real failures stay observable in
gateway.log.
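
The hardened preamble described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script that buildRecoveryScript actually generates: the function name `recovery_preamble` and the `PROXY_ENV` / `GATEWAY_LOG` variables are hypothetical and exist only so the sketch can run outside a sandbox. The `~/.bashrc` sourcing step and the gateway launch itself are omitted.

```shell
# Sketch only. Paths are parameterized (an assumption of this example) so it
# can be exercised against temporary files instead of a real sandbox.
PROXY_ENV="${PROXY_ENV:-/tmp/nemoclaw-proxy-env.sh}"
GATEWAY_LOG="${GATEWAY_LOG:-/tmp/gateway.log}"

recovery_preamble() {
  _PE_MISSING=0
  _GUARDS_MISSING=0

  # Source the env file explicitly and record a missing file instead of hiding it.
  if [ -r "$PROXY_ENV" ]; then
    . "$PROXY_ENV"
  else
    _PE_MISSING=1
  fi

  # Check that the safety-net preload made it into NODE_OPTIONS.
  case "${NODE_OPTIONS:-}" in
    *nemoclaw-sandbox-safety-net*) ;;
    *) _GUARDS_MISSING=1 ;;
  esac

  # Start a fresh log, then write each warning to both stderr and the log.
  rm -f "$GATEWAY_LOG"
  touch "$GATEWAY_LOG"
  if [ "$_PE_MISSING" = 1 ]; then
    echo "[gateway-recovery] WARNING: $PROXY_ENV missing (#2478)" \
      | tee -a "$GATEWAY_LOG" >&2
  fi
  if [ "$_GUARDS_MISSING" = 1 ]; then
    echo "[gateway-recovery] WARNING: NODE_OPTIONS missing safety-net preload" \
      | tee -a "$GATEWAY_LOG" >&2
  fi
}
```

In this sketch the launch still proceeds after a warning; the point is only that the degraded state is recorded where an operator will look.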

Mirrors the change in src/lib/agent-runtime.ts (non-OpenClaw agents) and
the inline OpenClaw recovery in src/nemoclaw.ts. Adds a regression test
suite locking in the contract: explicit source, warn-on-missing, no
silencing, NODE_OPTIONS check, ordering before launch.

Adds a long-running e2e (test/e2e/test-issue-2478-crash-loop-recovery.sh)
that crash-recovers the gateway 5x while asserting the preload chain
survives each respawn, exercises the negative case where proxy-env.sh
is removed, and soaks for 5 minutes to catch crash-loop signatures by
sampling the gateway PID. The e2e file is marked STAYS_IN_PR_UNTIL_SHIP
and should be removed once a clean soak run lands on a real Spark/Brev
instance.

Closes #2478

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa ericksoa self-assigned this Apr 27, 2026

coderabbitai Bot commented Apr 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Walkthrough

Recovery/runtime scripts now deterministically source /tmp/nemoclaw-proxy-env.sh (warn if missing), stop suppressing ~/.bashrc errors, verify NODE_OPTIONS contains the nemoclaw-sandbox-safety-net preload (warn if absent), append warnings into /tmp/gateway.log, change gateway launch logging to append, and add unit, E2E tests plus a nightly CI job for issue #2478.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Recovery Script / Runtime: src/lib/agent-runtime.ts, src/nemoclaw.ts | Deterministic sourcing of /tmp/nemoclaw-proxy-env.sh (track _PE_MISSING), stop silencing ~/.bashrc errors, inspect NODE_OPTIONS for nemoclaw-sandbox-safety-net (track _GUARDS_MISSING), emit [gateway-recovery] WARNING to stderr and append into /tmp/gateway.log, and launch gateway with append redirection (>>). |
| Unit Tests: src/lib/agent-runtime.test.ts | Expanded regression tests for #2478 asserting explicit proxy-env sourcing, missing-file warnings referencing #2478, absence of 2>/dev/null suppression, safety-net preload checks/messages, ordering (sourcing before gateway launch), and persistence of warnings in /tmp/gateway.log. |
| End-to-End Test: test/e2e/test-issue-2478-crash-loop-recovery.sh | New long-running E2E script that provisions a sandbox, validates NODE_OPTIONS via /proc/<pid>/environ, runs repeated gateway crash/recovery cycles (including negative scenario removing /tmp/nemoclaw-proxy-env.sh), asserts recovery warnings in gateway.log, restores and byte-verifies the env file, performs soak testing, and summarizes results. |
| CI Workflow: .github/workflows/nightly-e2e.yaml | Added nightly job issue-2478-crash-loop-recovery-e2e to run the new E2E script on ubuntu-latest with a dedicated sandbox (e2e-2478) and failure-only install-log artifact upload. |
| Sandbox Build Staging (comment): src/lib/sandbox-build-context.ts | Minor inline comment update describing generate-openclaw-config.py; no behavior changes. |
| Start Script Logging Rate-limit: scripts/nemoclaw-start.sh | Rate-limits repeated os.networkInterfaces() failure logs: emit first failure immediately, then at most one message per 5 minutes while counting suppressed occurrences for later reporting. |
| PR Review Mapping: .coderabbit.yaml | Added reviews.path_instructions entry linking the new E2E script to the issue-2478-crash-loop-recovery-e2e nightly job and providing review guidance. |
| Test Timeout: test/no-direct-credential-env.test.ts | Increased ESLint spawn timeout from 30s to 60s for src/lib/onboard.ts invocation. |

Sequence Diagram(s)

sequenceDiagram
    participant Trigger as Recovery Trigger
    participant Script as recovery script
    participant ProxyEnv as /tmp/nemoclaw-proxy-env.sh
    participant Bashrc as ~/.bashrc
    participant Inspector as NODE_OPTIONS inspector
    participant Logger as /tmp/gateway.log
    participant Gateway as gateway process

    Trigger->>Script: invoke recovery
    Script->>ProxyEnv: attempt to source (set _PE_MISSING if unreadable)
    alt proxy env missing
        Script->>Logger: touch & append "[gateway-recovery] WARNING: proxy-env missing (`#2478`)"
        Script->>Script: persist warning via writing $_W
        Script->>Gateway: write warning to stderr
    end
    Script->>Bashrc: source ~/.bashrc (no stderr suppression)
    Script->>Inspector: inspect NODE_OPTIONS for "nemoclaw-sandbox-safety-net"
    alt safety-net missing
        Script->>Logger: append "[gateway-recovery] WARNING: NODE_OPTIONS missing safety-net preload"
        Script->>Gateway: write warning to stderr
    end
    Script->>Gateway: exec gateway run >> /tmp/gateway.log

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I nudged the env, I checked the net,

If safety's gone I leave a warning set.
I append my notes where gateway logs sleep,
So recoveries wake and no secrets keep.
Hop, patch, persist — then back to my peep.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(recovery): harden gateway recovery preload chain (#2478)' directly and clearly summarizes the main change: hardening the gateway recovery process to ensure the NODE_OPTIONS preload chain survives respawn, addressing issue #2478. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
test/e2e/test-issue-2478-crash-loop-recovery.sh (1)

316-326: File restore may fail silently due to permission issues.

The restore logic writes to /tmp/nemoclaw-proxy-env.sh.restore as the sandbox user, but the original file was root-owned with mode 444. The mv on line 324 will fail if the sandbox user lacks write permission to replace the original.

The comment acknowledges this ("best-effort"), but consider adding explicit feedback:

 sandbox_exec sh -c "cat > /tmp/nemoclaw-proxy-env.sh.restore <<'REPL'
 $SNAPSHOT
 REPL
 chmod 444 /tmp/nemoclaw-proxy-env.sh.restore 2>/dev/null || true
 mv /tmp/nemoclaw-proxy-env.sh.restore /tmp/nemoclaw-proxy-env.sh 2>/dev/null || true
 " >/dev/null
-info "proxy-env.sh restored (best-effort; soak phase will tolerate degraded state)"
+# Verify restore succeeded
+if sandbox_exec sh -c "[ -r /tmp/nemoclaw-proxy-env.sh ] && grep -q 'NODE_OPTIONS' /tmp/nemoclaw-proxy-env.sh" >/dev/null 2>&1; then
+  info "proxy-env.sh restored successfully"
+else
+  info "proxy-env.sh restore failed (best-effort); soak phase will tolerate degraded state"
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-issue-2478-crash-loop-recovery.sh` around lines 316 - 326, The
restore can fail silently because sandbox_exec redirects output and ignores
errors from chmod/mv; update the block that calls sandbox_exec so failures are
captured and reported: stop swallowing stdout/stderr (remove the final
>/dev/null) or capture the sandbox_exec exit code and log a clear error via
info/error if it is non-zero, and inside the sandbox_exec command ensure the
mv/chmod failures are propagated (e.g. check their exits and write an
explanatory message to stderr) and, as a fallback, attempt a privileged
move/chown when mv fails; reference the sandbox_exec invocation and the
/tmp/nemoclaw-proxy-env.sh.restore -> /tmp/nemoclaw-proxy-env.sh mv/chmod steps
so the script logs explicit failure details instead of silently continuing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/test-issue-2478-crash-loop-recovery.sh`:
- Line 329: The command using sandbox_exec with $(pgrep -f 'openclaw gateway
run' >/dev/null 2>&1 && echo true || echo true) is a no-op; replace it with a
real existence check or a blocking wait. Update the invocation around
sandbox_exec to directly test pgrep's exit status (use pgrep -f 'openclaw
gateway run' >/dev/null 2>&1 && proceed || fail) or implement a retry/wait loop
that repeatedly calls pgrep and only continues when the gateway process is
found, referencing the existing sandbox_exec wrapper and the pgrep -f 'openclaw
gateway run' check so the script fails or waits appropriately instead of always
executing a no-op.
- Around line 247-249: The sandboxConnect handler in src/nemoclaw.ts currently
ignores unknown flags and only recognizes --dangerously-skip-permissions; add
explicit support for a --probe-only flag in the parsing logic used by the
sandboxConnect function so that when --probe-only is passed the command performs
a non-interactive probe (no prompts, no provisioning or long-lived side effects)
and returns an exit status indicating reachability; update the branch that
handles connect (and any helper like sandboxConnect or parseConnectArgs) to
detect --probe-only, short-circuit the interactive flow (skip permission prompts
and provisioning), run only the lightweight connectivity checks the tests
expect, and ensure the command returns non-zero on failure and zero on success
so the e2e test lines using nemoclaw ... connect --probe-only behave correctly.

📥 Commits

Reviewing files that changed from the base of the PR and between db1ef3c and 48e2d9b.

📒 Files selected for processing (4)
  • src/lib/agent-runtime.test.ts
  • src/lib/agent-runtime.ts
  • src/nemoclaw.ts
  • test/e2e/test-issue-2478-crash-loop-recovery.sh

…UNTIL_SHIP)

Wires the new long-running e2e from this PR into the nightly e2e workflow
so the recovery hardening for #2478 gets a real soak run on a runner with
NVIDIA endpoints reachable. Removed in the same commit that deletes the
test file before merge.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 172-195: The new job issue-2478-crash-loop-recovery-e2e must be
added to the notify-on-failure job's needs list so the notifier waits for it and
creates an auto-issue when this job fails; update the notify-on-failure job (the
job that defines needs between lines referencing notify-on-failure) to include
"issue-2478-crash-loop-recovery-e2e" in its needs array alongside the other job
names, ensuring the notifier will depend on and run after this new E2E job and
that artifacts/logs are available when it runs.

📥 Commits

Reviewing files that changed from the base of the PR and between 48e2d9b and 054ed59.

📒 Files selected for processing (1)
  • .github/workflows/nightly-e2e.yaml

#2449 added scripts/generate-openclaw-config.py and a COPY for it in the
Dockerfile, but did not update stageOptimizedSandboxBuildContext to include
the new file. The optimized build context is missing the script, so every
sandbox image build that uses the optimized path fails at Dockerfile step
21/56 with `COPY failed: file not found in build context`. This breaks
all dispatched nightly-e2e jobs since 2026-04-26 22:21 EDT.

Pulling this fix into the #2478 PR rather than a standalone PR so the
recovery hardening test on this branch can actually run.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The runner hit "FAIL: Initial gateway missing safety-net preload" with
"sh: 1: cannot open /proc/831/environ: No such file" — the test grabbed
a PID that was either wrong (greedy pgrep -f match) or gone by the time
we tried to read its environ. Add a diagnostic snapshot (all matching
processes, ps -ef openclaw filter, /proc/<pid>/{cmdline,status,listing})
that fires before exit so the next failure tells us which of those
hypotheses is right before we change the matching logic.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
test/e2e/test-issue-2478-crash-loop-recovery.sh (1)

347-363: Verify the env-file restore before starting the soak.

This block suppresses restore failures, then immediately enters Phase 5 without re-checking the guard chain. If proxy-env.sh did not come back cleanly, the soak is no longer exercising the healthy recovery path this test is supposed to validate.

✅ Suggested hardening
-sandbox_exec sh -c "cat > /tmp/nemoclaw-proxy-env.sh.restore <<'REPL'
+if ! sandbox_exec sh -c "cat > /tmp/nemoclaw-proxy-env.sh.restore <<'REPL'
 $SNAPSHOT
 REPL
-chmod 444 /tmp/nemoclaw-proxy-env.sh.restore 2>/dev/null || true
-mv /tmp/nemoclaw-proxy-env.sh.restore /tmp/nemoclaw-proxy-env.sh 2>/dev/null || true
-" >/dev/null
-info "proxy-env.sh restored (best-effort; soak phase will tolerate degraded state)"
+chmod 444 /tmp/nemoclaw-proxy-env.sh.restore &&
+mv /tmp/nemoclaw-proxy-env.sh.restore /tmp/nemoclaw-proxy-env.sh
+" >/dev/null; then
+  fail "Failed to restore /tmp/nemoclaw-proxy-env.sh before soak"
+  exit 1
+fi
+info "proxy-env.sh restored"
 
 ...
 SOAK_START_PID="$(wait_for_gateway_up 30)"
 if [ -z "$SOAK_START_PID" ]; then
   fail "Gateway not up entering soak phase"
   exit 1
 fi
+SOAK_NODE_OPTIONS="$(gateway_node_options "$SOAK_START_PID")"
+if ! echo "$SOAK_NODE_OPTIONS" | grep -q 'nemoclaw-sandbox-safety-net'; then
+  fail "Gateway entered soak without safety-net preload after restore"
+  exit 1
+fi
 pass "Gateway healthy entering soak (pid=$SOAK_START_PID)"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-issue-2478-crash-loop-recovery.sh` around lines 347 - 363, The
restore of the proxy env file is being done best-effort and failures are
suppressed; change it to verify the restore succeeded before starting the soak:
after calling sandbox_exec to write the snapshot, check the command exit status
and then validate that /tmp/nemoclaw-proxy-env.sh exists and is readable (and
optionally that its content matches $SNAPSHOT or a checksum) before proceeding
to call wait_for_gateway_up; if the verification fails, call fail (or exit
non‑zero) instead of continuing into the soak phase so that the test exercises
the healthy-recovery path; reference sandbox_exec, $SNAPSHOT,
/tmp/nemoclaw-proxy-env.sh, wait_for_gateway_up and fail when updating the
logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/test-issue-2478-crash-loop-recovery.sh`:
- Around line 100-107: The helpers are accidentally treating stderr as
structured PID/data because sandbox_exec currently redirects stderr into stdout;
stop folding transport/errors into the data path by removing the global "2>&1"
redirection from sandbox_exec (so callers can observe failures) and harden
callers like gateway_pid and gateway_node_options to validate output (e.g., only
accept numeric PIDs or parse expected option formats) before returning; also
ensure wait_for_gateway_up checks for empty/invalid values (not just truthy
strings) from gateway_pid/gateway_node_options so an error message won't be
misinterpreted as a PID.
- Line 409: The cleanup command currently at the end of the script only runs on
the happy path and leaks sandboxes on early exits; instead, register an EXIT
trap immediately after creating the sandbox that runs a cleanup function which
checks NEMOCLAW_E2E_KEEP_SANDBOX and calls "nemoclaw $SANDBOX_NAME destroy
--yes" (or a no-op if the variable is set), ensuring the trap uses the same
suppression redirection (>/dev/null 2>&1 || true) and is idempotent; add a named
cleanup function (e.g., cleanup_sandbox) and "trap cleanup_sandbox EXIT" near
where SANDBOX_NAME is created, and you can keep or remove the existing Phase 6
line to avoid duplicate destroys.

📥 Commits

Reviewing files that changed from the base of the PR and between 9c4418e and b94bd40.

📒 Files selected for processing (1)
  • test/e2e/test-issue-2478-crash-loop-recovery.sh

Last run revealed two distinct issues:

1. `pgrep -f 'openclaw gateway run'` matched its own `sh -c "pgrep ..."`
   wrapper argv, returning the wrapper's PID. The wrapper exited
   immediately after writing stdout, so `/proc/<pid>/environ` was already
   gone. Switch to `pgrep -fo '[o]penclaw gateway run'` — bracket trick
   prevents self-match, `-o` picks the oldest matching process (the
   long-lived gateway, not transient launchers).

2. ps/pgrep showed zero openclaw processes in the sandbox after onboard
   reported success. Could be the gateway crashed silently, or
   `openshell sandbox exec` is landing in a different container than the
   gateway. Extend diagnostics to dump:
   - exec context (whoami, hostname, pid namespace)
   - full process tree (ps auxf)
   - /tmp/gateway.log content + listing
   - nemoclaw status output
   - openshell sandbox info (containers in pod)

Also call gateway_diagnostics on the "Gateway never came up after
onboard" path so we see the data when there's no PID at all.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Diagnostic from the previous run showed the gateway is alive and the
fix is working — the ciao guard fires 19+ times in 90s and the gateway
stays up — but the test never picked up its PID because the real argv
is `openclaw-gateway` (single token, after re-exec), not the launcher's
`openclaw gateway run`. Match either form via `[o]penclaw[ -]gateway`.

Bracket prevents pgrep self-match (the wrapper argv contains the literal
pattern, but `[o]` doesn't match `[`); `[ -]` accepts both space (early
launcher) and dash (post-rename gateway).
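
The matching behavior described above can be checked without a running gateway, because `pgrep -f` applies an extended regular expression to each process's full command line. The sketch below uses `grep -E` as a stand-in for that matching step; the `matches` helper and the sample argv strings are illustrative assumptions, not code from the e2e script.

```shell
# Pattern from the commit message: [o] blocks self-match, [ -] accepts both argv forms.
pattern='[o]penclaw[ -]gateway'

# Hypothetical helper: does a given command line match the pattern?
matches() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

for argv in \
  'openclaw gateway run' \
  'openclaw-gateway' \
  "sh -c pgrep -fo '[o]penclaw[ -]gateway'"
do
  if matches "$argv"; then
    echo "match:    $argv"
  else
    echo "no match: $argv"
  fi
done
```

The third sample is the wrapper's own command line. It contains the pattern text literally, so the character after `o` is `]` rather than `p`, and the regex finds no `openclaw` substring to match.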

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Last run got past pgrep matching the right gateway PID (288), but
reading /proc/288/environ failed with EACCES — Linux's
kernel.yama.ptrace_scope=1 (default) refuses /proc/<pid>/environ
reads from non-ancestor processes even at the same UID, and
`openshell sandbox exec` spawns a new process tree, not a child of
the gateway.

Switch to a stronger three-signal verification that doesn't need to
read environ:

  1. /tmp/nemoclaw-proxy-env.sh contains the safety-net + ciao guard
     NODE_OPTIONS exports (the source of truth recovery sources).
  2. gateway.log contains "[guard] os.networkInterfaces() failed:" —
     that line is *only* emitted by our preload code, so its presence
     proves the preload actually executed inside the gateway's Node
     process. This is stronger evidence than a NODE_OPTIONS snapshot.
  3. The gateway PID is alive after the guard activations, proving
     the guard prevented a crash.

The preload activation log line appears within seconds because bonjour
mDNS retries advertising eagerly; helper waits up to 30s for the
signature to accrue.

Drops the redundant per-cycle ciao-guard assertion (the new
gateway_guards_active() helper already covers it).
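
A rough sketch of the three-signal check follows. It is not the `gateway_guards_active()` helper from this PR, whose body is not shown in this thread: the function name `check_guard_signals`, its arguments, and its return codes are hypothetical. For brevity it only looks for the safety-net string in the env file, and it assumes the caller is allowed to signal the gateway PID.

```shell
# Hypothetical helper: returns 0 when all three signals hold, otherwise the
# number of the first signal that failed.
check_guard_signals() {
  env_file="$1"
  log_file="$2"
  pid="$3"

  # Signal 1: the env file that recovery sources carries the safety-net preload.
  grep -q 'nemoclaw-sandbox-safety-net' "$env_file" 2>/dev/null || return 1

  # Signal 2: the preload logged a guard activation, so it ran inside the
  # gateway's Node process.
  grep -qF '[guard] os.networkInterfaces() failed:' "$log_file" 2>/dev/null || return 2

  # Signal 3: the gateway process is still alive after those activations.
  # kill -0 also fails for a live process owned by another user.
  kill -0 "$pid" 2>/dev/null || return 3

  return 0
}
```

A caller would poll this for up to the 30 seconds mentioned above, since the guard line only appears once the mDNS retry has fired at least once.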

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Phase 3 Cycle 1 hung for 13+ minutes until GitHub Actions killed the
job at the 30-min budget (exit 124). Cause: `nemoclaw <name> connect
--probe-only` invokes `connect` with a flag that does not exist; with
stdin/stdout swallowed by the test redirects, the connect TUI blocks
waiting for input that never arrives.

Switch to `nemoclaw <name> status`. That command is non-interactive
and is the actual entry point that calls
checkAndRecoverSandboxProcesses() → recoverSandboxProcesses() — the
hardened path #2478 changes. Wrap with `timeout 60` so any future
hang fails fast instead of eating the whole job budget. Also update
the post-kill pgrep verification to use the same `[o]penclaw[ -]gateway`
pattern as gateway_pid().
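
The fail-fast wrapping can be sketched as below. The `run_bounded` name is a hypothetical helper for illustration; the sketch assumes GNU coreutils `timeout`, which exits with status 124 when the time limit is hit.

```shell
# Hypothetical helper: run a command with a time limit and no interactive stdin.
run_bounded() {
  limit="$1"
  shift
  rc=0
  # stdin comes from /dev/null so a prompt sees EOF instead of blocking.
  timeout "$limit" "$@" </dev/null || rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "command timed out after ${limit}s: $*" >&2
  fi
  return "$rc"
}
```

In the e2e script the equivalent call would look like `run_bounded 60 nemoclaw "$SANDBOX_NAME" status`, so a hang costs at most a minute rather than the whole job budget.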

Phases 0-2 PASS on the previous run, including the new
"Initial gateway has guard chain active" assertion — confirming the
recovery hardening works end-to-end. Phase 3 was just the test
harness lying.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…2478)

The recovery script's missing-proxy-env.sh and missing-safety-net warnings
went to stderr only, but executeSandboxCommand captures stderr without
surfacing it — so a sysadmin tailing /tmp/gateway.log to debug a crash-loop
saw the crash output but not the explanation. Defer warning emission until
AFTER the script touches+chmods a fresh gateway.log, then write each
warning to both stderr and gateway.log so it lands where it's actually
discoverable.

Mirrored across src/lib/agent-runtime.ts (non-OpenClaw agents) and the
inline OpenClaw recovery in src/nemoclaw.ts. Adds a unit-test assertion
that warnings end up in /tmp/gateway.log and are deferred past the touch.

Also fix the e2e Phase 4 proxy-env.sh restore. The previous heredoc-via-
shell approach hit a 444-mode mv that silently failed, leaving the gateway
without guards through the entire soak phase (Phase 5 saw 19/20 empty PID
samples). Restore via stdin pipe → temp file → mv -f → chmod 444, with
size verification, and assert the restored gateway has guards active before
entering the soak.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Phase 4 was failing for two distinct reasons even after the previous fix:

1. `kill -9 $prev_pid` only killed the openclaw-gateway child; the parent
   `openclaw` launcher's watchdog likely respawned the gateway before
   `nemoclaw status` ran the recovery script. Result: recovery code path
   never executed, so the [gateway-recovery] WARNING was never emitted.
   Switch to `pkill -9 -f '[o]penclaw'` to nuke the whole tree.

2. `printf '%s' "$SNAPSHOT" | sandbox_exec sh -c 'cat > file'` left an empty
   file because `openshell sandbox exec` doesn't pipe caller stdin through
   to the subshell. Encode the snapshot as base64 and decode inside the
   sandbox so the data lives in the command argv, sidestepping the stdin
   gap entirely.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The recovery script writes the [gateway-recovery] WARNING to gateway.log
moments before launching the gateway. The launch then redirected with
`>` (truncate), which wiped the warning before anyone — sysadmin or
e2e test — could read it. Switch to `>>` (append). The earlier
`rm -f /tmp/gateway.log; touch /tmp/gateway.log` reset already ensures
we start clean per recovery cycle, so append is safe and preserves the
warning that explains why the gateway is about to crash.
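
The difference between the two redirections is easy to reproduce with a scratch file. The snippet below is a standalone demonstration with made-up log lines, not an excerpt from the recovery script.

```shell
log="$(mktemp)"

# Truncating launch: the warning written just before it is wiped.
echo '[gateway-recovery] WARNING: example warning' > "$log"
echo 'gateway: starting' > "$log"
truncated_warnings="$(grep -c 'gateway-recovery' "$log" || true)"

# Appending launch: reset the log once, then append, and the warning survives.
: > "$log"
echo '[gateway-recovery] WARNING: example warning' >> "$log"
echo 'gateway: starting' >> "$log"
appended_warnings="$(grep -c 'gateway-recovery' "$log" || true)"

echo "truncate: $truncated_warnings warning(s), append: $appended_warnings warning(s)"
```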

Also: e2e Phase 4 proxy-env.sh restore was leaving an empty file even
with base64-via-argv. Add `wc -c` inline so we can see what the restore
actually produced if it fails again, and collapse the multi-line sh -c
to a single chained command which is more reliable across openshell
sandbox exec versions.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
The ciao networkInterfaces guard preload (#2340) prevents the gateway
crash, but the bonjour watchdog inside ciao retries advertising every
few seconds, so the guard's failure message ends up flooding sandbox
logs with hundreds of identical lines per hour:

  [guard] os.networkInterfaces() failed: ... — returning empty (mDNS disabled)
  [guard] os.networkInterfaces() failed: ... — returning empty (mDNS disabled)
  ...

Operators see the same "actionable" message a thousand times. Log on
first failure (operator gets the explanation), then suppress repeats
within a 5-minute window and replay one summary line afterward with a
"[N suppressed in last ~5min, M total]" suffix so volume is still
observable. Same prefix, so existing log scrapers and the e2e
regression test continue to match. Closes GitHub issue #2611.
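
The suppression policy can be sketched independently of the guard itself. The real guard is a Node preload whose code is not shown here; the shell function below only illustrates the windowing logic, and `guard_log`, its variables, and the shortened message are hypothetical. The timestamp is passed in as an argument so the behaviour can be checked without waiting five minutes.

```shell
#!/bin/sh
WINDOW_SECS=300      # 5-minute suppression window
window_start=""      # empty until the first failure is logged
suppressed=0         # repeats swallowed in the current window
total=0              # every failure seen so far

# guard_log NOW MESSAGE, where NOW is a time in seconds supplied by the caller.
guard_log() {
  now="$1"; msg="$2"
  total=$((total + 1))
  if [ -z "$window_start" ]; then
    # First failure: log it in full so the operator gets the explanation.
    window_start="$now"
    echo "[guard] $msg"
  elif [ $((now - window_start)) -ge "$WINDOW_SECS" ]; then
    # Window elapsed: emit one summary line, then start a new window.
    echo "[guard] $msg [$suppressed suppressed in last ~5min, $total total]"
    window_start="$now"
    suppressed=0
  else
    # Inside the window: count the repeat but print nothing.
    suppressed=$((suppressed + 1))
  fi
}

out="$(mktemp)"
guard_log 0   "os.networkInterfaces() failed" >> "$out"
guard_log 5   "os.networkInterfaces() failed" >> "$out"
guard_log 10  "os.networkInterfaces() failed" >> "$out"
guard_log 300 "os.networkInterfaces() failed" >> "$out"
cat "$out"
```

Four failures produce two lines: the first message, then a summary ending in `[2 suppressed in last ~5min, 4 total]`. Both lines keep the `[guard]` prefix, which is what lets prefix-based scrapers keep matching.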

Also fix the e2e snapshot/restore byte-faithfulness. The previous
`SNAPSHOT="$(cat ...)"` capture went through bash command substitution
which strips trailing newlines, so the restored file was 2 bytes longer
than the captured size and Phase 4's verification reported a spurious
mismatch even though the restore was correct. Capture the snapshot as
base64 from inside the sandbox (lossless round-trip) and verify size
against the original sandbox-side wc -c.
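
The newline stripping is standard shell behaviour and can be demonstrated with temporary files. This is a sketch with made-up file contents, assuming a `base64` that accepts `-d`; it is not the e2e script's code.

```shell
#!/bin/sh
src="$(mktemp)"; via_subst="$(mktemp)"; via_b64="$(mktemp)"
printf 'export A=1\n\n\n' > "$src"      # 10 characters plus three trailing newlines

# Command substitution drops every trailing newline from the captured text.
snapshot="$(cat "$src")"
printf '%s' "$snapshot" > "$via_subst"

# Base64 text carries the newlines inside the encoding, so the round trip is lossless.
snapshot_b64="$(base64 < "$src" | tr -d '\n')"
printf '%s' "$snapshot_b64" | base64 -d > "$via_b64"

echo "original:      $(wc -c < "$src") bytes"
echo "via \$(cat):    $(wc -c < "$via_subst") bytes"
echo "via base64:    $(wc -c < "$via_b64") bytes"
```

The sizes come out as 13, 10, and 13 bytes: only the base64 path reproduces the original file exactly.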

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@wscurran wscurran added bug Something fails against expected or documented behavior Platform: DGX Spark provider: nvidia NVIDIA inference endpoint, NIM, or NVIDIA provider behavior labels Apr 28, 2026

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
.github/workflows/nightly-e2e.yaml (1)

678-701: ⚠️ Potential issue | 🟡 Minor

Add issue-2478-crash-loop-recovery-e2e to notify-on-failure.

This job is still missing from the notifier’s needs list, so a failure here will not be included in the auto-issue path or waited on before the notifier runs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-e2e.yaml around lines 678 - 701, The
notify-on-failure job's needs list is missing the
issue-2478-crash-loop-recovery-e2e job; update the notify-on-failure job
(notify-on-failure) to include "issue-2478-crash-loop-recovery-e2e" in its needs
array so the notifier waits on and includes failures from that job in the
auto-issue path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ae379a80-1cdd-4f11-a3d5-1e10ef63b728

📥 Commits

Reviewing files that changed from the base of the PR and between 9c4418e and 2658673.

📒 Files selected for processing (6)
  • .github/workflows/nightly-e2e.yaml
  • scripts/nemoclaw-start.sh
  • src/lib/agent-runtime.test.ts
  • src/lib/agent-runtime.ts
  • src/nemoclaw.ts
  • test/e2e/test-issue-2478-crash-loop-recovery.sh
✅ Files skipped from review due to trivial changes (1)
  • src/lib/agent-runtime.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/nemoclaw.ts
  • test/e2e/test-issue-2478-crash-loop-recovery.sh

A NemoClaw user reported on issue #2478 that the pre-fix ciao crash made
`https://inference.local/v1/models` go silent — the user's deployed
model "disappeared" even though the gateway process state was variously
alive/dead. The recovery hardening keeps the gateway process up, but
nothing in the e2e was asserting that the user-facing inference API
actually stays reachable across kill/respawn.

Add `gateway_serves_inference()` (curl /v1/models from inside the
sandbox, accept any OpenAI-compatible response shape) and call it in:

  - Phase 2: initial gateway must serve before we trust later phases.
  - Phase 3: after every kill/respawn cycle (5x), the new gateway must
    serve, not just exist.
  - Phase 5 entry + every 60s during the 300s soak, with a final
    pass/fail on the failure rate.

Strongest-possible reply to the user's report: end-to-end on a runner,
gateway + inference endpoint both stay healthy across 5 kill/recover
cycles and a 5-minute idle soak.
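
A probe along these lines could look like the sketch below. Only the function name `gateway_serves_inference` and the URL come from this PR. The body of the function and the `looks_like_models_response` helper are assumptions, and the shape check (a `data` or `models` array) is just one plausible reading of "any OpenAI-compatible response shape".

```shell
#!/bin/sh
# Hypothetical shape check: accept a JSON body that carries a "data" or
# "models" array, and reject anything else (HTML error pages, empty bodies).
looks_like_models_response() {
  printf '%s' "$1" | grep -Eq '"(data|models)"[[:space:]]*:[[:space:]]*\['
}

# Sketch of the probe. In the e2e test this would run inside the sandbox;
# here it is only defined, not called.
gateway_serves_inference() {
  body="$(curl -sS --max-time 10 https://inference.local/v1/models 2>/dev/null)" || return 1
  looks_like_models_response "$body"
}

# Exercising the shape check alone, with no network involved:
if looks_like_models_response '{"object":"list","data":[{"id":"example-model"}]}'; then
  echo "models body accepted"
fi
if ! looks_like_models_response '<html>502 Bad Gateway</html>'; then
  echo "error page rejected"
fi
```

Checking the body as well as curl's exit status matters because a proxy error page can come back as a successful transfer while the inference API is down.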

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
src/nemoclaw.ts (1)

289-314: Consider collapsing this fallback onto the shared recovery-script builder.

This block now mirrors src/lib/agent-runtime.ts almost verbatim. Keeping the warning strings, ordering, and NODE_OPTIONS checks in two places makes the next preload-chain change easy to miss in one path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 289 - 314, This duplicates the recovery-launch
shell block found in src/lib/agent-runtime.ts (see buildRecoveryScript); instead
of maintaining two nearly identical arrays, refactor src/nemoclaw.ts to
import/use the shared builder (buildRecoveryScript) or a common helper that
returns the recovery script/command list, and replace the inline commands (the
array containing checks for /tmp/nemoclaw-proxy-env.sh, ~/.bashrc, NODE_OPTIONS
guard check, warnings, OPENCLAW lookup and nohup launch) with a call to that
shared builder so the warning strings, ordering, and NODE_OPTIONS logic are
maintained in one place (also update any callers that expect the current array
shape to use the unified output, and ensure compatibility with
executeSandboxCommand if used).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 85febb49-5051-4324-9f7d-123a7a5e9579

📥 Commits

Reviewing files that changed from the base of the PR and between 9c4418e and 4f64a74.

📒 Files selected for processing (6)
  • .github/workflows/nightly-e2e.yaml
  • scripts/nemoclaw-start.sh
  • src/lib/agent-runtime.test.ts
  • src/lib/agent-runtime.ts
  • src/nemoclaw.ts
  • test/e2e/test-issue-2478-crash-loop-recovery.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/nightly-e2e.yaml

Brings in the selective nightly-e2e dispatch infrastructure from #2615
(adds workflow_dispatch.inputs.jobs filter + per-job guard) and the
test/validate-e2e-coverage.test.ts cross-check that asserts every
nightly job carries the canonical guard.

Updates the issue-2478-crash-loop-recovery-e2e job to use the same
guard pattern as every other job, and adds it to the inputs.jobs
description "Valid:" list.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
.github/workflows/nightly-e2e.yaml (1)

197-228: ⚠️ Potential issue | 🟠 Major

Add the new #2478 job to notify-on-failure.needs.

Line 201 introduces issue-2478-crash-loop-recovery-e2e, but notify-on-failure (Line 777-Line 797) still excludes it. Failures from this job can be missed by notifier sequencing/reporting.

🔧 Proposed patch
   notify-on-failure:
     runs-on: ubuntu-latest
     needs:
       [
         cloud-e2e,
         messaging-providers-e2e,
         token-rotation-e2e,
         sandbox-survival-e2e,
+        issue-2478-crash-loop-recovery-e2e,
         hermes-e2e,
         skip-permissions-e2e,
         sandbox-operations-e2e,
         inference-routing-e2e,
         network-policy-e2e,
         deployment-services-e2e,
         diagnostics-e2e,
         snapshot-commands-e2e,
         shields-config-e2e,
         rebuild-openclaw-e2e,
         upgrade-stale-sandbox-e2e,
         rebuild-hermes-e2e,
         overlayfs-autofix-e2e,
         gpu-e2e,
         gpu-double-onboard-e2e,
       ]

As per coding guidelines: "Keeping notify-on-failure updated: include the new job in the needs list so failures trigger the failure issue/reporting."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-e2e.yaml around lines 197 - 228, The new workflow
job issue-2478-crash-loop-recovery-e2e is not included in the needs list for the
notify-on-failure job; update the notify-on-failure job's needs array to include
"issue-2478-crash-loop-recovery-e2e" so failures from that job trigger the
notifier sequencing/reporting (locate notify-on-failure and add the exact job
name to its needs list).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 197-228: The workflow adds a new E2E job
issue-2478-crash-loop-recovery-e2e that runs
test/e2e/test-issue-2478-crash-loop-recovery.sh but .coderabbit.yaml lacks a
corresponding path_instructions entry; open .coderabbit.yaml and add a
path_instructions mapping for the job name issue-2478-crash-loop-recovery-e2e
that points to the test file path
(test/e2e/test-issue-2478-crash-loop-recovery.sh) following the existing format
of other E2E jobs so the cross-validation test
(test/validate-e2e-coverage.test.ts) recognizes the new job.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 08d47945-9f7b-4274-9018-2fecb3d84d05

📥 Commits

Reviewing files that changed from the base of the PR and between 4f64a74 and 2f248bb.

📒 Files selected for processing (1)
  • .github/workflows/nightly-e2e.yaml

Per project coding standard (enforced by test/validate-e2e-coverage.test.ts),
every nightly E2E job needs a corresponding .coderabbit.yaml
path_instructions entry. Add one keyed on the test file path so the
mapping disappears in the same removal commit when the
STAYS_IN_PR_UNTIL_SHIP test is deleted before merge.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
src/nemoclaw.ts (1)

292-319: Recommend centralizing this fallback script to avoid drift with agent-runtime.ts.

This block now mirrors src/lib/agent-runtime.ts:52-123 very closely. Consolidating into one shared builder would reduce future regression risk when guard-chain behavior changes again.

♻️ Refactor direction
-  const script =
-    agentScript ||
-    [
-      // inline OpenClaw recovery script...
-    ].join(" ");
+  const script =
+    agentScript ||
+    agentRuntime.buildOpenclawRecoveryScript({
+      dashboardPort: DASHBOARD_PORT,
+      gatewayLogPath: "/tmp/gateway.log",
+    });
// src/lib/agent-runtime.ts
export function buildOpenclawRecoveryScript(opts: {
  dashboardPort: number;
  gatewayLogPath: string;
}): string {
  // single source of truth for OpenClaw fallback recovery chain
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 292 - 319, The recovery-script block in
src/nemoclaw.ts duplicates logic from src/lib/agent-runtime.ts; extract it into
a single exported builder function e.g.
buildOpenclawRecoveryScript({dashboardPort, gatewayLogPath}) in
src/lib/agent-runtime.ts that returns the full joined script string (preserving
all checks, touch/chmod, warnings, nohup/GPID logic), then import and call that
function from src/nemoclaw.ts (pass DASHBOARD_PORT and "/tmp/gateway.log" or
configurable gatewayLogPath) replacing the inline array.join(" ") block; ensure
the new function is exported and tests/uses still behave the same (keep variable
names like OPENCLAW, GPID, and the exact warning texts).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 03f4cfe7-5e6f-407f-8fb9-c9bf66f54aa9

📥 Commits

Reviewing files that changed from the base of the PR and between 28c6626 and e73710a.

📒 Files selected for processing (1)
  • src/nemoclaw.ts

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…very-preload-chain

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

# Conflicts:
#	.github/workflows/nightly-e2e.yaml
@ericksoa ericksoa merged commit cf10693 into main Apr 28, 2026
13 checks passed
DemianHeyGen pushed a commit to DemianHeyGen/NemoClaw that referenced this pull request Apr 30, 2026
…VIDIA#2558)

## Summary

Hardens the gateway recovery script so the NODE_OPTIONS preload chain
(sandbox safety-net, ciao networkInterfaces guard, slack guard,
http-proxy fix, ws-proxy fix, nemotron fix) actually survives gateway
respawn. The pre-fix path silently swallowed `.bashrc` sourcing failures
with `2>/dev/null` and never asserted that NODE_OPTIONS contained the
safety-net preload — so when `/tmp/nemoclaw-proxy-env.sh` was missing or
env did not propagate through the `gosu gateway` boundary, the respawned
gateway came up naked and any library that threw during init (ciao mDNS
being the documented trigger in NVIDIA#2478) crash-looped the gateway forever
under health-monitor restart cadence.

## Related Issue

Closes NVIDIA#2478

## Changes

- `src/lib/agent-runtime.ts:buildRecoveryScript` — explicitly sources
`/tmp/nemoclaw-proxy-env.sh` (single source of truth for NODE_OPTIONS
guards), drops the silencing `2>/dev/null` on `.bashrc` sourcing,
asserts NODE_OPTIONS contains the safety-net preload, surfaces a
`[gateway-recovery] WARNING` line to gateway.log when the env file is
missing or guards are absent.
- `src/nemoclaw.ts:recoverSandboxProcesses` — mirrors the same hardening
in the inline OpenClaw recovery path.
- `src/lib/agent-runtime.test.ts` — 5 new unit tests under `NVIDIA#2478
hardened library-guard preload chain` lock in the contract: explicit
source, warn-on-missing, no silencing, NODE_OPTIONS check,
sourcing-before-launch ordering.
- `test/e2e/test-issue-2478-crash-loop-recovery.sh` — long-running
regression e2e: 5 crash-recover cycles asserting the preload chain
survives each respawn (verified via `/proc/<pid>/environ`), a negative
case where `proxy-env.sh` is removed and the recovery surfaces a
`[gateway-recovery] WARNING` instead of silently launching, and a
5-minute idle soak that catches crash-loop signatures by sampling the
gateway PID every 15s. Marked `STAYS_IN_PR_UNTIL_SHIP` — delete this
file before merge once a clean soak run lands on a real Spark/Brev
instance.

## Type of Change

- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification

- [ ] ``npx prek run --all-files`` passes
- [ ] ``npm test`` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes

---
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Recovery now explicitly sources the proxy env, warns when the env file
or safety‑net preload are missing, appends those warnings to the gateway
log (preserving earlier messages), and rate‑limits repeated
network‑interface error logs while reporting suppressed counts.

* **Tests**
* Added a long-running E2E regression that exercises repeated
crash/restart cycles, negative recovery scenarios, soak testing, and
verification that guard behavior persists.

* **Chores**
* Added a nightly E2E job for the recovery test, updated staging
comments, and extended a test runner timeout and review guidance.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@wscurran wscurran added area: cli Command line interface, flags, terminal UX, or output bug-fix PR fixes a bug or regression platform: dgx-spark Affects DGX Spark hardware or workflows and removed Platform: DGX Spark bug Something fails against expected or documented behavior labels Jun 3, 2026
Labels

area: cli Command line interface, flags, terminal UX, or output bug-fix PR fixes a bug or regression platform: dgx-spark Affects DGX Spark hardware or workflows provider: nvidia NVIDIA inference endpoint, NIM, or NVIDIA provider behavior

Development

Successfully merging this pull request may close these issues.

[DGX Spark] Gateway crash loop on startup: @homebridge/ciao networkInterfaces() returns EPERM in OpenShell sandbox

3 participants