
fix(sandbox): auto-respawn gateway when it exits unexpectedly (#2757)#3409

Merged. cv merged 7 commits into main from fix/2757-gateway-respawn-loop on May 15, 2026.

Conversation

@cjagwani

@cjagwani cjagwani commented May 12, 2026

Contributor

Summary

  • Wraps the terminal `wait "$GATEWAY_PID"` in scripts/nemoclaw-start.sh (both non-root and root/step-down branches) in a respawn loop, so that an unexpected gateway death no longer ends PID 1 and causes Docker to reap the sandbox container.
  • Adds a 60s-window respawn-count guard: after 5 respawns in <60s, logs a CRITICAL line so a crashing gateway surfaces in /tmp/gateway.log rather than being masked.
  • Preserves existing cleanup_on_signal shutdown semantics — clean exit (rc=0) still drops PID 1, SIGTERM/SIGINT still trigger the existing handler.

Closes #2757.

Root cause

The bug report blamed src/lib/agent-runtime.ts for missing supervisor logic, but that file was moved to src/lib/agent/runtime.ts (#3191) and the gateway launch is correct: `nohup ... &` followed by `wait "$GATEWAY_PID"`. The real cause sits one layer down. `wait` unblocks the moment the gateway dies, PID 1 exits, and Docker reaps the container by design (scripts/nemoclaw-start.sh is the entrypoint). NemoClaw also doesn't pass --restart= when OpenShell creates the sandbox, so neither layer recovers.
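
The exit code in the QA report follows from how `wait` reports a signalled child. A minimal sketch, using a background `sleep` as a hypothetical stand-in for the gateway (the real launch line is not reproduced here):

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the gateway process: a long-running background sleep.
sleep 30 &
GATEWAY_PID=$!

# Simulate the unexpected death described in the bug report.
kill -9 "$GATEWAY_PID"

# wait returns as soon as the child is gone and reports 128 + signal number.
RC=0
wait "$GATEWAY_PID" || RC=$?
echo "gateway exited with rc=$RC"   # SIGKILL is signal 9, so rc=137
```

If the script ends right after that `wait`, as the pre-patch entrypoint did, PID 1 exits along with it.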

Verification

Reproduced locally in Ubuntu 24.04 via a synthetic entrypoint mirroring lines 2240-2268 of this file:

| Test | Result |
| --- | --- |
| Without patch, `kill -9 $GATEWAY_PID` | Container exited (exitCode=137, restartCount=0). Matches QA report. |
| With patch + same kill | Loop sees rc=137, sleeps 2s, relaunches. Container stays running, gateway gets new PID. `nemoclaw status` → healthy. |

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • `npx prek run --all-files` passes (shellcheck clean on the touched file)
  • `npm test` passes (no JS/TS touched; not run)
  • Tests added or updated for new or changed behavior — see below
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes (n/a — internal entrypoint behavior)
  • `make docs` builds without warnings (doc changes only)

Test plan

Manual repro mirrors the QA acceptance criteria:

  1. `nemoclaw onboard --name my-assistant --non-interactive`
  2. `docker exec pgrep -af "openclaw gateway"` → note PID
  3. `docker exec kill -9 `
  4. Wait 5s
  5. `nemoclaw my-assistant status` → expect HEALTHY (no `connect` needed)

I did not add an automated E2E test for the kill-and-respawn flow in this PR (scope kept minimal per #2757's acceptance criteria); happy to follow up with one if reviewers want — would slot into `test/e2e/test-sandbox-survival.sh`.

Notes for reviewers

  • Both branches of the entrypoint (non-root at L2021, root/step-down at L2240) get the same loop. The root branch uses `"${STEP_DOWN_PREFIX_GATEWAY[@]}"` to preserve the gateway-user UID separation on respawn.
  • `SANDBOX_WAIT_PID` is reassigned on each respawn so `cleanup_on_signal` (in `scripts/lib/sandbox-init.sh`) waits on the live PID during shutdown.
  • `SANDBOX_CHILD_PIDS` accumulates respawn PIDs; the trap kills them all with `2>/dev/null || true` so stale entries don't break shutdown.
  • Tier-3 follow-up (have `nemoclaw status` also call `checkAndRecoverSandboxProcesses`, currently only `connect` does) is logged as a separate quick-win — not in this PR's scope.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Gateway service now auto-restarts if it exits unexpectedly, improving availability and reducing manual intervention.
    • Added safeguards and enhanced logging to detect and emit a critical alert when frequent restart attempts occur within a short window, preventing runaway restart loops.

The sandbox entrypoint (scripts/nemoclaw-start.sh) ends with
`wait "$GATEWAY_PID"`. When the gateway process dies, that wait
unblocks, PID 1 exits, and Docker reaps the entire sandbox container.
NemoClaw also doesn't pass any --restart policy when OpenShell creates
the sandbox, so no automatic recovery happens — users have to run
`nemoclaw connect` to bring the sandbox back.

Wrap the terminal wait in a respawn loop on both branches (non-root
and root/step-down). The loop:
  - exits cleanly on rc=0 (graceful gateway shutdown) so SIGTERM /
    SIGINT shutdown via cleanup_on_signal is unaffected;
  - sleeps 2s and relaunches the gateway on any non-zero exit (kill,
    OOM, crash);
  - tracks respawn count in a 60s sliding window and logs a CRITICAL
    line after 5 respawns to surface a crashing gateway rather than
    silently masking it;
  - updates SANDBOX_WAIT_PID and appends to SANDBOX_CHILD_PIDS so the
    existing signal-cleanup path sees the new pid.

Closes #2757.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
@coderabbitai

coderabbitai Bot commented May 12, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3a2ff7d7-6800-4d52-9b24-96a0a4ea2042

📥 Commits

Reviewing files that changed from the base of the PR and between d42db5b and 34b84b6.

📒 Files selected for processing (1)
  • scripts/nemoclaw-start.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/nemoclaw-start.sh

Walkthrough

Entrypoint now supervises the gateway and auto-respawns it on non-zero exits. Both non-root and root paths wait for GATEWAY_PID, track respawn timestamps in a 60s sliding window, emit a critical log at 5+ respawns, sleep 2s, and restart the gateway via nohup while updating PID tracking variables.
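
The control flow described in this walkthrough can be sketched as follows. This is not the code from scripts/nemoclaw-start.sh: `fake_gateway`, `STATE_FILE`, `RESPAWN_DELAY` and `RESPAWNS` are hypothetical names, the gateway is replaced by a function that fails twice and then exits cleanly, and the sliding-window alarm and `SANDBOX_*` bookkeeping are left out.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical stand-in for `openclaw gateway run`: fails twice, then exits 0.
STATE_FILE="$(mktemp)"
echo 0 > "$STATE_FILE"
fake_gateway() {
  local n
  n=$(( $(cat "$STATE_FILE") + 1 ))
  echo "$n" > "$STATE_FILE"
  [ "$n" -ge 3 ]   # non-zero exit on attempts 1 and 2
}

RESPAWN_DELAY="${RESPAWN_DELAY:-0}"   # the PR sleeps 2s; 0 keeps the demo fast
RESPAWNS=0

fake_gateway &
GATEWAY_PID=$!
while :; do
  RC=0
  wait "$GATEWAY_PID" || RC=$?        # guarded so errexit cannot end the script
  if [ "$RC" -eq 0 ]; then
    break                             # clean exit: stop supervising
  fi
  RESPAWNS=$((RESPAWNS + 1))
  echo "gateway exited rc=$RC; respawn #$RESPAWNS" >&2
  sleep "$RESPAWN_DELAY"
  fake_gateway &
  GATEWAY_PID=$!
done
rm -f "$STATE_FILE"
echo "respawns=$RESPAWNS"
```

The loop only ever leaves through the `rc=0` branch, which is why a clean gateway shutdown still lets PID 1 exit.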

Changes

Gateway Auto-Respawn with Stability Monitoring

| Layer / File(s) | Summary |
| --- | --- |
| Gateway respawn loop and stability tracking (`scripts/nemoclaw-start.sh`) | Replaces single-shot wait+exit in both non-root (lines ~2182–2217) and root (lines ~2410–2445) paths with a supervision loop: wait for GATEWAY_PID, exit only on rc=0, otherwise record a timestamp in a 60‑second sliding window, emit a critical log when respawns reach 5+ within the window, sleep 2s, relaunch openclaw gateway run via nohup, update GATEWAY_PID and SANDBOX_WAIT_PID, and append the new child PID to SANDBOX_CHILD_PIDS. |

Sequence Diagram(s)

sequenceDiagram
  participant Entrypoint as scripts/nemoclaw-start.sh
  participant Gateway as "openclaw gateway (nohup)"
  participant RespawnLoop as "respawn loop"
  participant Tracker as "SANDBOX tracking"

  Entrypoint->>Gateway: launch `openclaw gateway run` (nohup)
  Gateway-->>Entrypoint: exit (rc)
  Entrypoint->>RespawnLoop: report rc
  RespawnLoop->>Tracker: record timestamp, update PID lists
  RespawnLoop-->>Entrypoint: decide (rc==0 ? exit : restart)
  RespawnLoop->>Gateway: restart `openclaw gateway run` (nohup) after sleep

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

Docker, v0.0.40

Poem

🐰
I mind the gateway through the night,
I count the stumbles, keep it right.
Sixty ticks to mark each fall,
Five quick knocks — I shout a call.
Then sleep, restart, and watch it run.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding auto-respawn functionality for the gateway when it exits unexpectedly. |
| Linked Issues check | ✅ Passed | The PR successfully implements all primary requirements from issue #2757: auto-respawn of gateway on unexpected exit, preservation of clean shutdown semantics, and 60-second respawn supervision with crash detection. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to the linked issue #2757: the modifications wrap the gateway wait in a respawn loop with respawn-count guards and PID management, with no unrelated alterations. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.

@github-actions

github-actions Bot commented May 12, 2026

Contributor

E2E Advisor Recommendation

Required E2E: sandbox-operations-e2e, issue-2478-crash-loop-recovery-e2e
Optional E2E: sandbox-survival-e2e, cloud-onboard-e2e

Dispatch hint: sandbox-operations-e2e,issue-2478-crash-loop-recovery-e2e

Auto-dispatched E2E: sandbox-operations-e2e, issue-2478-crash-loop-recovery-e2e via nightly-e2e.yaml at 0a07f1007c4c55d2f221e7f9a892e47dabb477fd (nightly run)

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • sandbox-operations-e2e (high): Directly exercises sandbox lifecycle and process recovery. TC-SBX-08 kills the OpenClaw gateway process inside the sandbox and verifies status/SSH recovery; TC-SBX-06 also validates gateway recovery behavior after a Docker-level gateway disruption. This is the closest existing merge-blocking coverage for the new PID 1 gateway respawn loop.
  • issue-2478-crash-loop-recovery-e2e (high): Exercises repeated OpenClaw gateway kills and recovery while checking that gateway guard/preload state and inference service availability survive respawns. The PR specifically changes crash-loop handling and repeated respawn behavior, so this regression test is high-signal even though it was originally created for #2478 ([DGX Spark] Gateway crash loop on startup: @homebridge/ciao networkInterfaces() returns EPERM in OpenShell sandbox).

Optional E2E

  • sandbox-survival-e2e (medium): Useful adjacent confidence for sandbox persistence and live inference across gateway stop/start recovery. It is less direct than the process-kill tests because it focuses on OpenShell gateway/runtime restart survival rather than the OpenClaw gateway PID 1 wait/respawn loop.
  • cloud-onboard-e2e (medium): Smoke coverage that a freshly onboarded OpenClaw sandbox still starts cleanly, applies entrypoint setup, exposes inference.local, and does not regress installer/onboarding flows after the entrypoint change.

New E2E recommendations

  • gateway auto-respawn (high): Existing tests kill the gateway and verify broad recovery, but none appear to assert the exact behavior required by #2757: PID 1 must not exit, the sandbox pod/container must remain running, the gateway must be automatically relaunched without a nemoclaw status/connect recovery trigger, and the respawn burst warning should appear after the configured crash threshold.
    • Suggested test: Add a focused gateway-auto-respawn E2E that onboards a sandbox, records the sandbox pod/container identity and gateway PID, kills only the OpenClaw gateway process, waits without invoking NemoClaw recovery commands, then asserts the sandbox is still alive, a new gateway PID is serving health/inference, logs contain the respawn message, and repeated kills emit the CRITICAL >=5-in-60s warning.
  • privilege-separated respawn ownership (medium): The root path must respawn through STEP_DOWN_PREFIX_GATEWAY so the replacement gateway keeps the gateway UID rather than running as root or sandbox. Current broad tests may not explicitly verify post-respawn UID/ownership.
    • Suggested test: Extend the focused gateway auto-respawn test to check the UID/user of the initial and respawned gateway process in the default root/setpriv sandbox image.

Dispatch hint

  • Workflow: E2E / Nightly
  • jobs input: sandbox-operations-e2e,issue-2478-crash-loop-recovery-e2e

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 2024-2039: The respawn alarm uses a fixed RESPAWN_WINDOW_START so
it doesn't implement a sliding 60s window; change the logic in the main loop
that waits on GATEWAY_PID to record each respawn timestamp (e.g., push $(date
+%s) into a small in-memory list/array), prune timestamps older than 60 seconds
on each respawn, compute RESPAWN_COUNT as the length of the remaining
timestamps, and trigger the critical alarm when that length >= 5; apply the same
replacement for the duplicate logic referenced around the other occurrence.
Ensure you update references to RESPAWN_COUNT and RESPAWN_WINDOW_START to use
the new timestamps list and pruning approach so the check truly reflects the
last 60s sliding window.
- Around line 2027-2028: The respawn loops call wait "$GATEWAY_PID" while set -e
is enabled, so a non-zero exit from wait will cause the script to exit instead
of continuing the respawn logic; wrap each wait call (both occurrences
referenced as wait "$GATEWAY_PID") so it is guarded from errexit (for example
use an explicit conditional that captures the exit code like: disable errexit
around the wait or append || true and then capture RC), then set RC from the
captured exit code and let the existing respawn logic run; update both
occurrences in the respawn loops to ensure the script doesn’t exit on wait
failures.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b756e269-0010-43e1-a11a-089be669bd2a

📥 Commits

Reviewing files that changed from the base of the PR and between edb7478 and 087e9c3.

📒 Files selected for processing (1)
  • scripts/nemoclaw-start.sh

Comment thread scripts/nemoclaw-start.sh Outdated
Comment thread scripts/nemoclaw-start.sh Outdated
@cjagwani cjagwani self-assigned this May 12, 2026
Two fixes from CodeRabbit review of the gateway respawn loop:

1. Guard `wait` from errexit. scripts/nemoclaw-start.sh runs under
   `set -euo pipefail` (line 33), so a bare `wait "$GATEWAY_PID"`
   returning non-zero (e.g. 137 from kill -9) terminates PID 1 before
   the respawn logic ever runs — defeating the fix entirely. Replaced
   with `RC=0; wait "$GATEWAY_PID" || RC=$?` in both branches so
   errexit can't fire on the wait.

2. Implement a true sliding 60s window for the crash-loop alarm. The
   previous logic anchored the window at startup/reset, so bursts of
   five crashes spanning the boundary (e.g. at 9s/22s/35s/49s/62s)
   never triggered the alarm even though all five happened within
   53s of each other. Now tracks individual timestamps in
   RESPAWN_TIMES, prunes anything older than 60s each iteration,
   counts the remainder.
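
The sliding window in item 2 can be illustrated with the crash times quoted above. `RESPAWN_TIMES` is the name used in this commit; `WINDOW`, `THRESHOLD`, `record_respawn`, `RESPAWN_COUNT` and `ALARM_AT` are hypothetical, and storing the timestamps as a space-separated string is an assumption made to keep the sketch short.

```shell
#!/usr/bin/env bash
set -eu

WINDOW=60        # seconds
THRESHOLD=5      # respawns inside WINDOW that trigger the alarm
RESPAWN_TIMES="" # space-separated timestamps (assumed representation)
RESPAWN_COUNT=0

# Record a respawn at time $1, drop entries that are WINDOW seconds old or
# older, and leave the surviving count in RESPAWN_COUNT.
record_respawn() {
  now="$1"
  kept=""
  RESPAWN_COUNT=0
  for t in $RESPAWN_TIMES "$now"; do
    if [ $((now - t)) -lt "$WINDOW" ]; then
      kept="${kept:+$kept }$t"
      RESPAWN_COUNT=$((RESPAWN_COUNT + 1))
    fi
  done
  RESPAWN_TIMES="$kept"
}

# Crash times from the commit message; a live loop would pass "$(date +%s)".
ALARM_AT=""
for crash in 9 22 35 49 62; do
  record_respawn "$crash"
  if [ "$RESPAWN_COUNT" -ge "$THRESHOLD" ] && [ -z "$ALARM_AT" ]; then
    ALARM_AT="$crash"
    echo "CRITICAL: $RESPAWN_COUNT gateway respawns within ${WINDOW}s (t=${crash}s)" >&2
  fi
done
echo "alarm fired at t=${ALARM_AT:-never}"
```

At t=62 the oldest entry (t=9) is 53s old and still inside the window, so all five crashes are counted. A counter that resets 60s after startup would have restarted at 1 on that same crash.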

Re-verified in a synthetic Docker harness with `set -euo pipefail`:
respawn fires correctly on kill -9, and CRITICAL alarm fires on the
5th kill when spaced ~13s apart over a 53s span — the exact case the
old fixed-window logic missed.
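
The errexit behavior from item 1 can be checked in isolation. A minimal sketch, assuming only a POSIX `sh` on the path; `BARE_OUT`, `BARE_RC` and `GUARDED_OUT` are hypothetical names, and the child that exits with 137 stands in for a gateway killed with `kill -9`.

```shell
#!/usr/bin/env bash
set -eu

# Bare wait, run in a separate shell so the demo itself survives: errexit
# fires on the non-zero status and the echo is never reached.
BARE_RC=0
BARE_OUT="$(sh -c 'set -e; (exit 137) & wait "$!"; echo "respawn logic reached"')" || BARE_RC=$?

# Guarded wait: being the left side of || exempts wait from errexit, and RC
# keeps the child's status for the respawn logic to inspect.
(exit 137) &
GATEWAY_PID=$!
RC=0
wait "$GATEWAY_PID" || RC=$?
GUARDED_OUT="respawn logic reached (rc=$RC)"

echo "bare:    rc=$BARE_RC output='$BARE_OUT'"
echo "guarded: output='$GUARDED_OUT'"
```

The bare variant prints nothing and exits with the child's status, which is the failure mode the first fix removes.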

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
@wscurran wscurran added fix and removed v0.0.40 labels May 12, 2026
@wscurran

Contributor

@jyaunches

Contributor

Nit / defensive suggestion — feel free to defer to a follow-up.

I walked the signal-flow for this loop while researching #3263 (re-enable gpu-e2e nightly, where we're seeing destroy-phase hangs on a different code path). For the standard termination paths the analysis matches your PR description — SIGTERM to PID 1 fires cleanup_on_signal, exit is called, and the while :; is never re-entered. ✅

The one edge case that's not strictly covered is gateway termination that does not flow through PID 1 — e.g., a future OpenShell agent-container model with shareProcessNamespace, a sidecar that decides to kill just the gateway, or any scenario where the gateway process gets a SIGTERM the entrypoint script never sees. In that case the respawn loop would fight the termination instead of cooperating with it.

Cheap belt-and-suspenders:

# In cleanup_on_signal (scripts/lib/sandbox-init.sh), before the kill loop:
export SANDBOX_SHUTTING_DOWN=1

…and inside the respawn loop, just after wait returns non-zero:

[ -n "${SANDBOX_SHUTTING_DOWN:-}" ] && exit "$RC"

Costs ~2 lines, makes "we are intentionally tearing down" explicit, and removes any signal-propagation surprises later. Happy to defer this to a follow-up if it's out of scope here — flagging because it's adjacent to ongoing destroy-path hardening work (#3263, #2562 PR-4 audit).

@github-actions

Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 25803630630
Branch: fix/2757-gateway-respawn-loop
Requested jobs: issue-2478-crash-loop-recovery-e2e,sandbox-survival-e2e,gateway-health-honest-e2e,sandbox-operations-e2e,openshell-gateway-upgrade-e2e,onboard-repair-e2e,device-auth-health-e2e
Summary: 6 passed, 1 failed, 0 skipped

| Job | Result |
| --- | --- |
| device-auth-health-e2e | ✅ success |
| gateway-health-honest-e2e | ❌ failure |
| issue-2478-crash-loop-recovery-e2e | ✅ success |
| onboard-repair-e2e | ✅ success |
| openshell-gateway-upgrade-e2e | ✅ success |
| sandbox-operations-e2e | ✅ success |
| sandbox-survival-e2e | ✅ success |

Failed jobs: gateway-health-honest-e2e. Check run artifacts for logs.

@jyaunches

Contributor

Review notes after main merge (34b84b6e)

Thanks for updating this branch with main. The current E2E Advisor run is green on the merged head: https://github.com/NVIDIA/NemoClaw/actions/runs/25807647131

The updated advisor recommendation is:

  • Required: issue-2478-crash-loop-recovery-e2e, sandbox-survival-e2e
  • Optional: gateway-health-honest-e2e, test-e2e-sandbox, test-e2e-gateway-isolation

I dispatched the required selective nightly jobs against the current head (34b84b6e): https://github.com/NVIDIA/NemoClaw/actions/runs/25808017515

🔴 Blocker

  1. E2E validation still pending for current HEAD
    • The previous selective nightly run was against stale head d42db5b3; after merging main, the PR head is now 34b84b6e.
    • Please wait for the required selective nightly run above to pass before merging.
    • Required jobs: issue-2478-crash-loop-recovery-e2e, sandbox-survival-e2e.

🟡 Warnings

  1. scripts/nemoclaw-start.sh:2182 and scripts/nemoclaw-start.sh:2410 — duplicated supervision loop

    • The root and non-root branches now carry nearly identical respawn-loop logic.
    • Not blocking, but please consider extracting a small helper in a follow-up so the sliding-window, logging, and set -e handling cannot drift between branches.
  2. scripts/nemoclaw-start.sh:2215 and scripts/nemoclaw-start.sh:2443 - SANDBOX_CHILD_PIDS grows with every crash

    • In a persistent crash loop, old gateway PIDs are appended forever even though they have already exited.
    • Low practical risk, but a follow-up could keep only live/static children plus the current gateway PID, or prune dead PIDs before appending.
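
One possible shape for the pruning follow-up in warning 2, offered as a sketch rather than a proposed patch. `track_child_pid` is a hypothetical helper, and treating `SANDBOX_CHILD_PIDS` as a space-separated string is an assumption; the real variable may be an array.

```shell
#!/usr/bin/env bash
set -eu

# Space-separated PID list (assumed representation).
SANDBOX_CHILD_PIDS=""

# Keep only PIDs that still exist, then append the new one.
track_child_pid() {
  new_pid="$1"
  live=""
  for pid in $SANDBOX_CHILD_PIDS; do
    # kill -0 sends no signal; it only checks that the process exists.
    if kill -0 "$pid" 2>/dev/null; then
      live="${live:+$live }$pid"
    fi
  done
  SANDBOX_CHILD_PIDS="${live:+$live }$new_pid"
}

sleep 30 &
OLD_PID=$!
track_child_pid "$OLD_PID"

kill -9 "$OLD_PID"
wait "$OLD_PID" || true      # reap it, as the respawn loop's wait already does

sleep 30 &
NEW_PID=$!
track_child_pid "$NEW_PID"
echo "tracked: $SANDBOX_CHILD_PIDS"

kill "$NEW_PID" 2>/dev/null || true
wait "$NEW_PID" 2>/dev/null || true
```

The check is best-effort: a recycled PID would pass `kill -0`, which the existing `2>/dev/null || true` cleanup already tolerates.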

✅ Good

@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25808017515
Branch: fix/2757-gateway-respawn-loop
Requested jobs: issue-2478-crash-loop-recovery-e2e,sandbox-survival-e2e
Summary: 2 passed, 0 failed, 0 skipped

| Job | Result |
| --- | --- |
| issue-2478-crash-loop-recovery-e2e | ✅ success |
| sandbox-survival-e2e | ✅ success |

@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25899546631
Target ref: 43e73cc1daf1af4913654af02a0caf5d4a83549a
Workflow ref: main
Requested jobs: sandbox-operations-e2e,issue-2478-crash-loop-recovery-e2e
Summary: 2 passed, 0 failed, 0 skipped

| Job | Result |
| --- | --- |
| issue-2478-crash-loop-recovery-e2e | ✅ success |
| sandbox-operations-e2e | ✅ success |

@cv cv enabled auto-merge (squash) May 15, 2026 13:52
@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25921436131
Target ref: a27947c2865c5faa4a770ec91a000b39a1c8d7a3
Workflow ref: main
Requested jobs: sandbox-operations-e2e,issue-2478-crash-loop-recovery-e2e,cloud-e2e
Summary: 0 passed, 0 failed, 0 skipped

| Job | Result |
| --- | --- |
| cloud-e2e | ⚠️ cancelled |
| issue-2478-crash-loop-recovery-e2e | ⚠️ cancelled |
| sandbox-operations-e2e | ⚠️ cancelled |

@cv cv merged commit b41687f into main May 15, 2026
20 checks passed
@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25921618323
Target ref: 0a07f1007c4c55d2f221e7f9a892e47dabb477fd
Workflow ref: main
Requested jobs: sandbox-operations-e2e,issue-2478-crash-loop-recovery-e2e
Summary: 2 passed, 0 failed, 0 skipped

| Job | Result |
| --- | --- |
| issue-2478-crash-loop-recovery-e2e | ✅ success |
| sandbox-operations-e2e | ✅ success |

@miyoungc miyoungc mentioned this pull request May 16, 2026
12 tasks
@wscurran wscurran added area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery) and bug-fix (PR fixes a bug or regression) and removed fix labels Jun 3, 2026

Labels

  • area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery)
  • bug-fix (PR fixes a bug or regression)


Development

Successfully merging this pull request may close these issues.

[Station][Recovery] openclaw gateway killed — no auto-respawn; parent daemon exits with child; recovery requires nemoclaw connect

4 participants