fix(recovery): keep gateway guard preloads active after respawn by ericksoa · Pull Request #2777 · NVIDIA/NemoClaw

ericksoa · 2026-04-30T21:24:40Z

Summary

Treat the re-execed openclaw-gateway child as a gateway process for the sandbox safety-net and ciao guard preloads.
Emit deterministic preload-loaded markers from the actual child process so the crash-loop e2e no longer relies on incidental os.networkInterfaces() failures.
Tighten recovery checks so missing safety-net or ciao preloads in NODE_OPTIONS produce an observable gateway-recovery warning.

Root cause

The failing issue-2478-crash-loop-recovery-e2e run on main did recover the gateway: cycle 1 respawned openclaw-gateway as pid 708. The failure was the guard-chain proof after respawn. The guard scripts only considered process.argv[2] === "gateway" to be a gateway invocation, but the live recovered process is the re-execed openclaw-gateway child. The e2e also waited for the ciao guard's os.networkInterfaces() failure log, which is an incidental library call and did not fire after the recovered gateway came up.

Validation

npm ci --ignore-scripts
npm run build:cli
npm test -- test/nemoclaw-start.test.ts src/lib/agent-runtime.test.ts
bash -n scripts/nemoclaw-start.sh
bash -n test/e2e/test-issue-2478-crash-loop-recovery.sh

Note: local commit/push hooks were skipped for the final commit/push because the full CLI coverage hook entered a recursive/long-running fixture path; the focused validation above passed.

Summary by CodeRabbit

Bug Fixes
- Recovery now validates two distinct gateway preload guards (safety-net and ciao) via gateway-process detection; both must be present to be considered OK.
- Guard handlers are registered only for detected gateway process flavors, and clearer stderr loader markers and updated warning text are emitted.
- Backward-compatible fallback retained for legacy guard signatures.
Tests
- Added tests verifying preload detection, loader markers, argv/title gateway classification, and gateway vs non-gateway behavior.

copy-pr-bot · 2026-04-30T21:24:43Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-04-30T21:24:51Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7e53eaae-44b5-4a62-8364-105344b0d94b

📥 Commits

Reviewing files that changed from the base of the PR and between 240537e and 6504d5b.

📒 Files selected for processing (1)

src/nemoclaw.ts

🚧 Files skipped from review as they are similar to previous changes (1)

src/nemoclaw.ts

📝 Walkthrough

Walkthrough

Sandbox and ciao preload logic now detect gateway context via a computed _gatewayProcess (executable/title/basename heuristics) instead of only process.argv[2] === 'gateway'. Each preload emits stderr loader markers when active; recovery and tests require both preloads in NODE_OPTIONS.

Changes

Cohort / File(s)	Summary
Preload Script Core Logic `scripts/nemoclaw-start.sh`	Compute `_gatewayProcess` via executable/title/basename heuristics; sandbox safety-net and ciao guard gate gateway-only behavior on that value and write stderr loader markers when active.
Recovery & Runtime Guard Checks `src/lib/agent-runtime.ts`, `src/nemoclaw.ts`	Recovery script logic now checks for both `nemoclaw-sandbox-safety-net` and `nemoclaw-ciao-network-guard` in `NODE_OPTIONS` before clearing `_GUARDS_MISSING`; warning text updated to reflect either preload missing.
Tests & E2E Scripts `src/lib/agent-runtime.test.ts`, `test/nemoclaw-start.test.ts`, `test/e2e/test-issue-2478-crash-loop-recovery.sh`	Added test helpers to extract/run heredoc preload scripts and new tests asserting loader stderr markers for gateway flavors; e2e guard detection updated to look for deterministic preload log markers with legacy fallback.

Sequence Diagram

sequenceDiagram
    participant Launcher as Launcher (openclaw gateway)
    participant Child as Child Process (openclaw-gateway)
    participant SafetyNet as Safety-Net Preload
    participant CiaoGuard as Ciao Guard Preload
    participant Runtime as Runtime

    Launcher->>Child: Re-exec (NODE_OPTIONS includes preloads)
    Child->>SafetyNet: Node loads safety-net preload
    SafetyNet->>SafetyNet: Compute _gatewayProcess (argv/title/basename)
    alt Gateway detected
        SafetyNet->>Child: Emit stderr "sandbox-safety-net loaded (...)"
        SafetyNet->>Runtime: Enable gateway-only guard behavior
    else Not a gateway
        SafetyNet->>SafetyNet: Early-return (no gateway behavior)
    end

    Child->>CiaoGuard: Node loads ciao-network-guard preload
    CiaoGuard->>CiaoGuard: Check _gatewayProcess truthiness
    alt Gateway detected
        CiaoGuard->>Child: Emit stderr "ciao-network-guard loaded (...)"
        CiaoGuard->>Runtime: Register uncaughtException handler
    else Not a gateway
        CiaoGuard->>CiaoGuard: Skip handler registration
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I sniffed the argv, checked the name,
Two little guards now know their game.
They stamp the log when they take their place,
Stay silent when not in gateway space—
A hop, a check, and steady pace.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly addresses the main objective: ensuring gateway guard preloads remain active after process respawn, which is the core fix for issue `#2478`.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/issue-2478-guard-marker

_{Review rate limit: 8/10 reviews remaining, refill in 10 minutes and 58 seconds.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-04-30T21:25:07Z

Selective E2E Results — ✅ All requested jobs passed

Run: 25190082955
Branch: fix/issue-2478-guard-marker
Requested jobs: issue-2478-crash-loop-recovery-e2e
Summary: 0 passed, 0 failed, 21 skipped

Job	Result
cloud-e2e	⏭️ skipped
cloud-inference-e2e	⏭️ skipped
cloud-onboard-e2e	⏭️ skipped
deployment-services-e2e	⏭️ skipped
diagnostics-e2e	⏭️ skipped
docs-validation-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped
hermes-e2e	⏭️ skipped
inference-routing-e2e	⏭️ skipped
messaging-providers-e2e	⏭️ skipped
network-policy-e2e	⏭️ skipped
overlayfs-autofix-e2e	⏭️ skipped
rebuild-hermes-e2e	⏭️ skipped
rebuild-openclaw-e2e	⏭️ skipped
sandbox-operations-e2e	⏭️ skipped
sandbox-survival-e2e	⏭️ skipped
shields-config-e2e	⏭️ skipped
skill-agent-e2e	⏭️ skipped
snapshot-commands-e2e	⏭️ skipped
token-rotation-e2e	⏭️ skipped
upgrade-stale-sandbox-e2e	⏭️ skipped

coderabbitai

🧹 Nitpick comments (1)

src/nemoclaw.ts (1)
317-324: Run the two recovery-focused E2E lanes before merge.

Since this changes restart/recovery behavior, validate with sandbox-survival-e2e and sandbox-operations-e2e in CI to catch post-respawn integration regressions.

As per coding guidelines: src/nemoclaw.ts E2E test recommendation includes sandbox-survival-e2e and sandbox-operations-e2e.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 317 - 324, This change to restart/recovery
behavior in src/nemoclaw.ts (the shell startup/recovery command array that
includes checks for NODE_OPTIONS, curl healthcheck to DASHBOARD_PORT, and temp
log/lock cleanup) requires running the two recovery-focused E2E lanes before
merging: trigger sandbox-survival-e2e and sandbox-operations-e2e for this
PR/branch in CI and attach their green runs to the review; if your CI config
requires it, add those two jobs to the PR pipeline or re-run the workflow that
includes them so their successful results are present prior to merge.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/nemoclaw.ts`:
- Around line 317-324: This change to restart/recovery behavior in
src/nemoclaw.ts (the shell startup/recovery command array that includes checks
for NODE_OPTIONS, curl healthcheck to DASHBOARD_PORT, and temp log/lock cleanup)
requires running the two recovery-focused E2E lanes before merging: trigger
sandbox-survival-e2e and sandbox-operations-e2e for this PR/branch in CI and
attach their green runs to the review; if your CI config requires it, add those
two jobs to the PR pipeline or re-run the workflow that includes them so their
successful results are present prior to merge.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 73818e87-2af2-4d85-9b52-a588458e3627

📥 Commits

Reviewing files that changed from the base of the PR and between cad2e10 and 75a6bd9.

📒 Files selected for processing (6)

scripts/nemoclaw-start.sh
src/lib/agent-runtime.test.ts
src/lib/agent-runtime.ts
src/nemoclaw.ts
test/e2e/test-issue-2478-crash-loop-recovery.sh
test/nemoclaw-start.test.ts

github-actions · 2026-04-30T21:37:21Z

Selective E2E Results — ✅ All requested jobs passed

Run: 25190568453
Branch: fix/issue-2478-guard-marker
Requested jobs: issue-2478-crash-loop-recovery-e2e
Summary: 0 passed, 0 failed, 21 skipped

Job	Result
cloud-e2e	⏭️ skipped
cloud-inference-e2e	⏭️ skipped
cloud-onboard-e2e	⏭️ skipped
deployment-services-e2e	⏭️ skipped
diagnostics-e2e	⏭️ skipped
docs-validation-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped
hermes-e2e	⏭️ skipped
inference-routing-e2e	⏭️ skipped
messaging-providers-e2e	⏭️ skipped
network-policy-e2e	⏭️ skipped
overlayfs-autofix-e2e	⏭️ skipped
rebuild-hermes-e2e	⏭️ skipped
rebuild-openclaw-e2e	⏭️ skipped
sandbox-operations-e2e	⏭️ skipped
sandbox-survival-e2e	⏭️ skipped
shields-config-e2e	⏭️ skipped
skill-agent-e2e	⏭️ skipped
snapshot-commands-e2e	⏭️ skipped
token-rotation-e2e	⏭️ skipped
upgrade-stale-sandbox-e2e	⏭️ skipped

coderabbitai

🧹 Nitpick comments (1)

test/nemoclaw-start.test.ts (1)

38-42: ⚡ Quick win

Escape the marker parameter before using it in RegExp constructor.

While the current call sites use safe string literals, the marker parameter should be escaped to prevent issues if it contains regex metacharacters. This is a defensive coding improvement.

🛡️ Proposed fix

 function startScriptHeredoc(src: string, marker: string): string {
-  const match = src.match(new RegExp(`<<'${marker}'\\n([\\s\\S]*?)\\n${marker}`));
+  const escapedMarker = marker.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
+  const match = src.match(new RegExp(`<<'${escapedMarker}'\\n([\\s\\S]*?)\\n${escapedMarker}`));
   expect(match).toBeTruthy();
   return match![1];
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/nemoclaw-start.test.ts` around lines 38 - 42, The startScriptHeredoc
function builds a RegExp from the marker parameter without escaping it; escape
marker before constructing the RegExp to avoid treating regex metacharacters as
special. Update startScriptHeredoc to create an escapedMarker (e.g. via a small
escapeRegExp helper or marker.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')) and use
that escapedMarker when building the RegExp passed to src.match so the heredoc
boundary is matched literally.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/nemoclaw-start.test.ts`:
- Around line 38-42: The startScriptHeredoc function builds a RegExp from the
marker parameter without escaping it; escape marker before constructing the
RegExp to avoid treating regex metacharacters as special. Update
startScriptHeredoc to create an escapedMarker (e.g. via a small escapeRegExp
helper or marker.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')) and use that
escapedMarker when building the RegExp passed to src.match so the heredoc
boundary is matched literally.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 33554ae1-5d30-467b-9063-4e81737a76a4

📥 Commits

Reviewing files that changed from the base of the PR and between 75a6bd9 and cf5d8d9.

📒 Files selected for processing (2)

scripts/nemoclaw-start.sh
test/nemoclaw-start.test.ts

🚧 Files skipped from review as they are similar to previous changes (1)

scripts/nemoclaw-start.sh

github-actions · 2026-04-30T21:48:27Z

Selective E2E Results — ✅ All requested jobs passed

Run: 25190997787
Branch: fix/issue-2478-guard-marker
Requested jobs: issue-2478-crash-loop-recovery-e2e
Summary: 0 passed, 0 failed, 21 skipped

Job	Result
cloud-e2e	⏭️ skipped
cloud-inference-e2e	⏭️ skipped
cloud-onboard-e2e	⏭️ skipped
deployment-services-e2e	⏭️ skipped
diagnostics-e2e	⏭️ skipped
docs-validation-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped
hermes-e2e	⏭️ skipped
inference-routing-e2e	⏭️ skipped
messaging-providers-e2e	⏭️ skipped
network-policy-e2e	⏭️ skipped
overlayfs-autofix-e2e	⏭️ skipped
rebuild-hermes-e2e	⏭️ skipped
rebuild-openclaw-e2e	⏭️ skipped
sandbox-operations-e2e	⏭️ skipped
sandbox-survival-e2e	⏭️ skipped
shields-config-e2e	⏭️ skipped
skill-agent-e2e	⏭️ skipped
snapshot-commands-e2e	⏭️ skipped
token-rotation-e2e	⏭️ skipped
upgrade-stale-sandbox-e2e	⏭️ skipped

github-actions · 2026-04-30T22:46:14Z

Selective E2E Results — ❌ Some jobs failed

Run: 25191894511
Branch: fix/issue-2478-guard-marker
Requested jobs: all (no filter)
Summary: 19 passed, 1 failed, 1 skipped

Job	Result
cloud-e2e	✅ success
cloud-inference-e2e	✅ success
cloud-onboard-e2e	✅ success
deployment-services-e2e	✅ success
diagnostics-e2e	✅ success
docs-validation-e2e	❌ failure
gpu-e2e	⏭️ skipped
hermes-e2e	✅ success
inference-routing-e2e	✅ success
messaging-providers-e2e	✅ success
network-policy-e2e	✅ success
overlayfs-autofix-e2e	✅ success
rebuild-hermes-e2e	✅ success
rebuild-openclaw-e2e	✅ success
sandbox-operations-e2e	✅ success
sandbox-survival-e2e	✅ success
shields-config-e2e	✅ success
skill-agent-e2e	✅ success
snapshot-commands-e2e	✅ success
token-rotation-e2e	✅ success
upgrade-stale-sandbox-e2e	✅ success

Failed jobs: docs-validation-e2e. Check run artifacts for logs.

fix(recovery): keep gateway guard preloads active after respawn

75a6bd9

coderabbitai Bot reviewed Apr 30, 2026

View reviewed changes

fix(recovery): identify re-execed gateway process title

cf5d8d9

coderabbitai Bot reviewed Apr 30, 2026

View reviewed changes

test(e2e): accept gateway guard preload markers

240537e

ericksoa added v0.0.32 bug Something fails against expected or documented behavior labels Apr 30, 2026

ericksoa self-assigned this Apr 30, 2026

ericksoa added 2 commits April 30, 2026 16:24

Merge branch 'main' into fix/issue-2478-guard-marker

6504d5b

Merge branch 'main' into fix/issue-2478-guard-marker

1a64adb

cv approved these changes May 1, 2026

View reviewed changes

ericksoa merged commit fdc405b into main May 1, 2026
9 of 10 checks passed

wscurran added bug-fix PR fixes a bug or regression and removed bug Something fails against expected or documented behavior labels Jun 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(recovery): keep gateway guard preloads active after respawn#2777

fix(recovery): keep gateway guard preloads active after respawn#2777
ericksoa merged 5 commits into
mainfrom
fix/issue-2478-guard-marker

ericksoa commented Apr 30, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

copy-pr-bot Bot commented Apr 30, 2026

Uh oh!

coderabbitai Bot commented Apr 30, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented Apr 30, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

github-actions Bot commented Apr 30, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

github-actions Bot commented Apr 30, 2026

Uh oh!

github-actions Bot commented Apr 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ericksoa commented Apr 30, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Validation

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented Apr 30, 2026

Uh oh!

coderabbitai Bot commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented Apr 30, 2026

Selective E2E Results — ✅ All requested jobs passed

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Apr 30, 2026

Selective E2E Results — ✅ All requested jobs passed

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Apr 30, 2026

Selective E2E Results — ✅ All requested jobs passed

Uh oh!

github-actions Bot commented Apr 30, 2026

Selective E2E Results — ❌ Some jobs failed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ericksoa commented Apr 30, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 30, 2026 •

edited

Loading