fix(sandbox): auto-respawn gateway when it exits unexpectedly (#2757)#3409
Conversation
The sandbox entrypoint (scripts/nemoclaw-start.sh) ends with
`wait "$GATEWAY_PID"`. When the gateway process dies, that wait
unblocks, PID 1 exits, and Docker reaps the entire sandbox container.
NemoClaw also doesn't pass any --restart policy when OpenShell creates
the sandbox, so no automatic recovery happens — users have to run
`nemoclaw connect` to bring the sandbox back.
Wrap the terminal wait in a respawn loop on both branches (non-root
and root/step-down). The loop:
- exits cleanly on rc=0 (graceful gateway shutdown) so SIGTERM /
SIGINT shutdown via cleanup_on_signal is unaffected;
- sleeps 2s and relaunches the gateway on any non-zero exit (kill,
OOM, crash);
- tracks respawn count in a 60s sliding window and logs a CRITICAL
line after 5 respawns to surface a crashing gateway rather than
silently masking it;
- updates SANDBOX_WAIT_PID and appends to SANDBOX_CHILD_PIDS so the
existing signal-cleanup path sees the new pid.
Closes #2757.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughEntrypoint now supervises the gateway and auto-respawns it on non-zero exits. Both non-root and root paths wait for GATEWAY_PID, track respawn timestamps in a 60s sliding window, emit a critical log at 5+ respawns, sleep 2s, and restart the gateway via nohup while updating PID tracking variables. ChangesGateway Auto-Respawn with Stability Monitoring
Sequence Diagram(s)sequenceDiagram
participant Entrypoint as scripts/nemoclaw-start.sh
participant Gateway as "openclaw gateway (nohup)"
participant RespawnLoop as "respawn loop"
participant Tracker as "SANDBOX tracking"
Entrypoint->>Gateway: launch `openclaw gateway run` (nohup)
Gateway-->>Entrypoint: exit (rc)
Entrypoint->>RespawnLoop: report rc
RespawnLoop->>Tracker: record timestamp, update PID lists
RespawnLoop-->>Entrypoint: decide (rc==0 ? exit : restart)
RespawnLoop->>Gateway: restart `openclaw gateway run` (nohup) after sleep
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsStopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 2024-2039: The respawn alarm uses a fixed RESPAWN_WINDOW_START so
it doesn't implement a sliding 60s window; change the logic in the main loop
that waits on GATEWAY_PID to record each respawn timestamp (e.g., push $(date
+%s) into a small in-memory list/array), prune timestamps older than 60 seconds
on each respawn, compute RESPAWN_COUNT as the length of the remaining
timestamps, and trigger the critical alarm when that length >= 5; apply the same
replacement for the duplicate logic referenced around the other occurrence.
Ensure you update references to RESPAWN_COUNT and RESPAWN_WINDOW_START to use
the new timestamps list and pruning approach so the check truly reflects the
last 60s sliding window.
- Around line 2027-2028: The respawn loops call wait "$GATEWAY_PID" while set -e
is enabled, so a non-zero exit from wait will cause the script to exit instead
of continuing the respawn logic; wrap each wait call (both occurrences
referenced as wait "$GATEWAY_PID") so it is guarded from errexit (for example
use an explicit conditional that captures the exit code like: disable errexit
around the wait or append || true and then capture RC), then set RC from the
captured exit code and let the existing respawn logic run; update both
occurrences in the respawn loops to ensure the script doesn’t exit on wait
failures.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b756e269-0010-43e1-a11a-089be669bd2a
📒 Files selected for processing (1)
scripts/nemoclaw-start.sh
Two fixes from CodeRabbit review of the gateway respawn loop: 1. Guard `wait` from errexit. scripts/nemoclaw-start.sh runs under `set -euo pipefail` (line 33), so a bare `wait "$GATEWAY_PID"` returning non-zero (e.g. 137 from kill -9) terminates PID 1 before the respawn logic ever runs — defeating the fix entirely. Replaced with `RC=0; wait "$GATEWAY_PID" || RC=$?` in both branches so errexit can't fire on the wait. 2. Implement a true sliding 60s window for the crash-loop alarm. The previous logic anchored the window at startup/reset, so bursts of five crashes spanning the boundary (e.g. at 9s/22s/35s/49s/62s) never triggered the alarm even though all five happened within 53s of each other. Now tracks individual timestamps in RESPAWN_TIMES, prunes anything older than 60s each iteration, counts the remainder. Re-verified in a synthetic Docker harness with `set -euo pipefail`: respawn fires correctly on kill -9, and CRITICAL alarm fires on the 5th kill when spaced ~13s apart over a 53s span — the exact case the old fixed-window logic missed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
|
Nit / defensive suggestion — feel free to defer to a follow-up. I walked the signal-flow for this loop while researching #3263 (re-enable The one edge case that's not strictly covered is gateway termination that does not flow through PID 1 — e.g., a future OpenShell agent-container model with Cheap belt-and-suspenders: # In cleanup_on_signal (scripts/lib/sandbox-init.sh), before the kill loop:
export SANDBOX_SHUTTING_DOWN=1…and inside the respawn loop, just after [ -n "${SANDBOX_SHUTTING_DOWN:-}" ] && exit "$RC"Costs ~2 lines, makes "we are intentionally tearing down" explicit, and removes any signal-propagation surprises later. Happy to defer this to a follow-up if it's out of scope here — flagging because it's adjacent to ongoing destroy-path hardening work (#3263, #2562 PR-4 audit). |
Selective E2E Results — ❌ Some jobs failedRun: 25803630630
|
Review notes after main merge (
|
Selective E2E Results — ✅ All requested jobs passedRun: 25808017515
|
Selective E2E Results — ✅ All requested jobs passedRun: 25899546631
|
Selective E2E Results — ✅ All requested jobs passedRun: 25921436131
|
Selective E2E Results — ✅ All requested jobs passedRun: 25921618323
|
Summary
wait "\$GATEWAY_PID"inscripts/nemoclaw-start.sh(both non-root and root/step-down branches) in a respawn loop so unexpected gateway death no longer drops PID 1 and reaps the sandbox container.CRITICALline so a crashing gateway surfaces in/tmp/gateway.lograther than being masked.cleanup_on_signalshutdown semantics — clean exit (rc=0) still drops PID 1, SIGTERM/SIGINT still trigger the existing handler.Closes #2757.
Root cause
The bug report blamed
src/lib/agent-runtime.tsfor missing supervisor logic, but that file was moved tosrc/lib/agent/runtime.ts(#3191) and the gateway launch is correct —nohup ... &followed bywait "\$GATEWAY_PID". The real cause sits one layer down:waitunblocks the moment the gateway dies, PID 1 exits, and Docker reaps the container by design (scripts/nemoclaw-start.shis the entrypoint). NemoClaw also doesn't pass--restart=when OpenShell creates the sandbox, so neither layer recovers.Verification
Reproduced locally in Ubuntu 24.04 via a synthetic entrypoint mirroring lines 2240-2268 of this file:
kill -9 \$GATEWAY_PIDexited(exitCode=137, restartCount=0). Matches QA report.running, gateway gets new PID.nemoclaw status→ healthy.Type of Change
Verification
Test plan
Manual repro mirrors the QA acceptance criteria:
I did not add an automated E2E test for the kill-and-respawn flow in this PR (scope kept minimal per #2757's acceptance criteria); happy to follow up with one if reviewers want — would slot into `test/e2e/test-sandbox-survival.sh`.
Notes for reviewers
🤖 Generated with Claude Code
Summary by CodeRabbit