fix(sandbox): tie gateway healthcheck marker to launch site, not env hint (#4710)#5116
Conversation
…hint (#4710) The early conditional gating mark_in_container_gateway on OPENSHELL_DRIVERS=docker (PR #4748) never fires inside the sandbox container because OpenShell 0.0.44 does not export OPENSHELL_DRIVERS into the sandbox env. hulynn's 2026-06-08 re-verification on v0.0.60 confirmed the symptom is unchanged from v0.0.57: marker file is created on every NEMOCLAW_CMD-empty container, so the Dockerfile HEALTHCHECK short-circuit (`[ -f /tmp/nemoclaw-gateway-local ] || exit 0`) is unreachable for docker-driver sandboxes and the container is marked (unhealthy) on every fresh onboard. Fix: drop the early env-gated marker write and call mark_in_container_gateway() directly before each `openclaw gateway run --port ...` launch (both the non-root and root-mode paths). The marker is now true-by-construction: it exists if-and-only-if this container is about to start the gateway, regardless of how the deployment-mode signal is (or isn't) plumbed through the sandbox env. Restart-loop fallback launches inherit the marker from the first launch — the function is idempotent (`: >file`) so calling it again on retry is a no-op. Verified by two new regression tests in test/nemoclaw-start.test.ts: - 'ties the in-container gateway healthcheck marker to the gateway launch site (#4503, #4710)': asserts (a) the region immediately after the function definition contains no early call and no OPENSHELL_DRIVERS conditional, and (b) every `openclaw gateway run --port` invocation is preceded by mark_in_container_gateway within 6 lines (or sits inside a restart-loop body). - 'mark_in_container_gateway writes the marker file idempotently (#4710)': behavioral check on the helper itself. The two existing launch-block tests now stub `mark_in_container_gateway() { :; }` in their preamble to keep the extracted shell snippet self-contained. Test count: 120 passing (vs. 119 baseline; 19 pre-existing failures unchanged). Note: a separate OpenClaw-side concern surfaced in @Dongni-Yang's 2026-06-04 thread (in-sandbox gateway loses its HTTP listener while the process stays alive) is out of scope here — this fix only addresses the marker-file regression hulynn re-confirmed in #4710 for docker-driver sandboxes where the gateway legitimately runs on the host. Standalone deployments where the gateway runs in-container are unaffected: the marker is still created (just at the launch site instead of at startup), so the existing pgrep healthcheck behavior is preserved bit-for-bit. Fixes #4710. Signed-off-by: Shawn Xie <shaxie@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughTie the HEALTHCHECK marker to the actual gateway launch: add host-delivery detection, move the marker helper earlier, call it immediately before starting the gateway (both entrypoint modes), and add tests that extract/execute the helper, verify timing/skip behavior, and check idempotency. Adjust test imports and reduce the test file-size budget. ChangesGateway marker placement and validation
Sequence Diagram(s)sequenceDiagram
participant ComponentA
participant ComponentB
ComponentA->>ComponentB: observable interaction
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/nemoclaw-start.test.ts`:
- Around line 331-346: The file test/nemoclaw-start.test.ts exceeds the legacy
size budget; split out the new marker-focused test and trim verbose comments:
move the test with the title starting "ties the in-container gateway healthcheck
marker to the gateway launch site (`#4503`, `#4710`)" and any closely-related helper
functions/fixtures it depends on into a new test file (e.g.,
test/nemoclaw-gateway-marker.test.ts), shorten or relocate the long inline
commentary (the large block around the NEMOCLAW_CMD checks) into a small summary
or separate docs, and update/keep any imports or shared fixtures referenced by
that test so it runs standalone; after moving, run the size-budget check to
confirm test/nemoclaw-start.test.ts is below the limit and that the new file
contains the marker-focused tests formerly at lines ~403-435.
- Around line 383-399: The current fallback sets inRestartLoop true if any
"while|until|for" appears within 60 lines, which can falsely exempt non-loop
launches; update the detection used around launchLineNumbers and
mark_in_container_gateway so it only treats a launch as inside a restart loop
when a loop header actually encloses the launch: locate the nearest loop header
upstream (e.g., the matching "while|until|for" token found when computing
upstream), ensure you stop at function boundaries, then verify the loop body
scope/indentation spans the launch index (or that the launch line occurs after
the loop header but before the loop’s end) before setting inRestartLoop true;
change the logic that sets inRestartLoop to perform this scoped containment
check rather than the current simple regex existence test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 794633e9-7c20-4f79-87e1-c3ea4c81376a
📒 Files selected for processing (2)
scripts/nemoclaw-start.shtest/nemoclaw-start.test.ts
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
Vitest E2E Scenario RecommendationRequired Vitest E2E scenarios: Dispatch required Vitest E2E scenarios:
Full Vitest E2E advisor summaryVitest E2E Scenario AdvisorBase: Required Vitest E2E scenarios
Optional Vitest E2E scenarios
Relevant changed files
|
PR Review AdvisorFindings: 0 needs attention, 4 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Consider writing more tests for
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
|
✨ Related open issues: |
… file The #4710 marker test asserted on the start-script source text (regex .not.toMatch against a source window), tripping the source-shape budget gate (1 > 0). Replace it with a behavior test: a fake `openclaw` records whether the /tmp/nemoclaw-gateway-local marker exists at the instant `gateway run` fires, for both the non-root and root step-down launch forms — proving the marker is tied to the launch site, not an env hint. The PR also pushed test/nemoclaw-start.test.ts past its 5289-line legacy size budget. Move both #4710 marker tests into a dedicated test/nemoclaw-start-gateway-marker.test.ts (164 lines, under the 1500 default) and ratchet the legacy budget for the original file down to 5234, per the test-file-size-budget legacy-ratchet rule. No production change; scripts/nemoclaw-start.sh fix is unchanged. Signed-off-by: Shawn Xie <shaxie@nvidia.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/nemoclaw-start-gateway-marker.test.ts (1)
136-145: ⚡ Quick winUnify function extraction to avoid brittle brace parsing.
Line 137 uses the first
}after the function start, which is fragile ifmark_in_container_gatewayever gains${...}or additional blocks/comments. Reuse the already robustextractShellFunctionFromSourcehelper here too.Suggested diff
- const fnStart = src.indexOf("mark_in_container_gateway() {"); - const fnEnd = src.indexOf("}", fnStart); - if (fnStart === -1 || fnEnd === -1) { - throw new Error("mark_in_container_gateway function not found"); - } const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "nemoclaw-gw-marker-")); const markerPath = path.join(tmpDir, "nemoclaw-gateway-local"); - const fnSrc = src - .slice(fnStart, fnEnd + 1) - .replaceAll("/tmp/nemoclaw-gateway-local", markerPath); + const fnSrc = extractShellFunctionFromSource(src, "mark_in_container_gateway") + .replaceAll("/tmp/nemoclaw-gateway-local", markerPath);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/nemoclaw-start-gateway-marker.test.ts` around lines 136 - 145, The test currently finds the function by searching for "mark_in_container_gateway() {" and the next "}" using fnStart/fnEnd which is brittle; instead call the existing extractShellFunctionFromSource helper to robustly extract the full mark_in_container_gateway function body, then replace the "/tmp/nemoclaw-gateway-local" path in that returned string and assign to fnSrc; update references to fnStart/fnEnd so extractShellFunctionFromSource is used to produce the function source (keeping tmpDir and markerPath logic the same).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/nemoclaw-start-gateway-marker.test.ts`:
- Around line 136-145: The test currently finds the function by searching for
"mark_in_container_gateway() {" and the next "}" using fnStart/fnEnd which is
brittle; instead call the existing extractShellFunctionFromSource helper to
robustly extract the full mark_in_container_gateway function body, then replace
the "/tmp/nemoclaw-gateway-local" path in that returned string and assign to
fnSrc; update references to fnStart/fnEnd so extractShellFunctionFromSource is
used to produce the function source (keeping tmpDir and markerPath logic the
same).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d7a9432a-416d-4821-b929-c5be9474434f
📒 Files selected for processing (3)
ci/test-file-size-budget.jsontest/nemoclaw-start-gateway-marker.test.tstest/nemoclaw-start.test.ts
✅ Files skipped from review due to trivial changes (1)
- ci/test-file-size-budget.json
🚧 Files skipped from review as they are similar to previous changes (1)
- test/nemoclaw-start.test.ts
|
🌿 Preview your docs: https://nvidia-preview-pr-5116.docs.buildwithfern.com/nemoclaw |
Selective E2E Results — ❌ Some jobs failedRun: 27325717417
|
Selective E2E Results — ❌ Some jobs failedRun: 27326029419
|
Selective E2E Results — ❌ Some jobs failedRun: 27326403367
|
Selective E2E Results — ✅ All requested jobs passedRun: 27326894410
|
## Summary - Add v0.0.64 release notes from the release announcement and link them to the relevant deeper docs. - Document that custom policy presets recorded through `policy-add --from-file` and `--from-dir` survive snapshot restore and sandbox recreation. - Refresh generated NemoClaw user skills from the current source docs. ## Source summary - #5104 -> `docs/manage-sandboxes/backup-restore.mdx`, `docs/network-policy/customize-network-policy.mdx`: Documents custom policy presets preserved through snapshot restore. - #4955 -> `docs/about/release-notes.mdx`: Adds release-note coverage for Brave web-search pinning and `BRAVE_API_KEY` placeholder preservation. - #5116, #5269 -> `docs/about/release-notes.mdx`: Adds release-note coverage for Docker-driver gateway health and rootfs guard stability. - #5241, #5085 -> `docs/about/release-notes.mdx`: Adds release-note coverage for chat-completions provider selection and Nemotron Ultra 550B tool-less request compatibility. - #5268, #5210, #5257 -> `docs/about/release-notes.mdx`: Adds release-note coverage for messaging render plan refresh, OpenClaw scope-upgrade approval recovery, and Hermes WhatsApp bridge dependency setup. - Current source docs -> `.agents/skills/`: Regenerates user-skill references so agent-facing guidance matches the source documentation. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `npm run docs` - `npm run build:cli` - `npm run typecheck:cli` - Commit/pre-push hooks: markdownlint, gitleaks, docs-to-skills verification, TypeScript CLI, and skills YAML checks passed. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Clarified sandbox snapshot restore preserves custom policy presets and restores them without original files. * Switched sandbox setup and remote deployment guidance to Docker-based workflows and emphasized remote onboarding flow. * Expanded troubleshooting for gateway recovery, Docker GPU/WSL issues, and onboarding resume. * Added/updated CLI docs: advanced maintenance, session export, upload/download wrappers, and status recovery guidance. * Added v0.0.64 release notes and links to NemoClaw Community; fixed command reference formatting. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
mark_in_container_gatewayonOPENSHELL_DRIVERS=docker(added in fix: skip gateway marker for docker-driver sandboxes #4748) never fires inside the sandbox container, because OpenShell 0.0.44 does not exportOPENSHELL_DRIVERSinto the sandbox env. @hulynn's 2026-06-08 re-verification on v0.0.60 confirms the symptom is unchanged: marker is created on every empty-NEMOCLAW_CMDcontainer, the Dockerfile HEALTHCHECK short-circuit ([ -f /tmp/nemoclaw-gateway-local ] || exit 0) is unreachable for docker-driver sandboxes, and the container is marked(unhealthy)on every fresh onboard.mark_in_container_gateway()directly before eachopenclaw gateway run --port ...launch (both non-root and root-mode paths). The marker is now true-by-construction: it exists if-and-only-if this container is about to start the gateway, regardless of whether the deployment-mode signal is plumbed through the sandbox env.This is hulynn's fix-direction (c) from #4710:
Related Issue
Fixes #4710. Supersedes the no-op env-var guard introduced by #4748.
Changes
scripts/nemoclaw-start.sh: remove the earlycase ",${OPENSHELL_DRIVERS:-},"block and the_NEMOCLAW_DOCKER_DRIVERgate that wrapped the earlymark_in_container_gatewaycall; addmark_in_container_gatewayimmediately before both first-launch invocations ofopenclaw gateway run --port "${_DASHBOARD_PORT}"(non-root at line ~3324, root-mode at line ~3550). Restart-loop fallback launches inherit the marker from the first launch — the function is idempotent (: >file), so calling it again on retry is a no-op.scripts/nemoclaw-start.sh: rewrite the function-definition comment block to explain the launch-site invariant and the [Ubuntu 24.04][Sandbox] Docker-driver HEALTHCHECK always (unhealthy) — marker file always created #4710 root cause, so future regressions cannot reintroduce an env-gated form without removing the warning.test/nemoclaw-start.test.ts: replace the old startup-snippet test with two new structural tests:ties the in-container gateway healthcheck marker to the gateway launch site (#4503, #4710)— asserts (a) the region immediately after the function definition contains no earlymark_in_container_gatewaycall and noOPENSHELL_DRIVERSconditional; (b) everyopenclaw gateway run --portinvocation in the script is preceded (within 6 lines) bymark_in_container_gateway, OR sits inside a restart-loop body that inherits from the first launch. The first invariant blocks a regression back to the env-gated form; the second locks the new contract.mark_in_container_gateway writes the marker file idempotently (#4710)— behavioral check on the helper itself.test/nemoclaw-start.test.ts: the two existing launch-block tests (registers child PIDs ...andlaunches the root gateway through gosu ...) now stubmark_in_container_gateway() { :; }in their preamble so the extracted shell snippet remains self-contained after the new call is introduced.Type of Change
Verification
npx vitest run --project cli test/nemoclaw-start.test.ts— 120 passing (vs. 119 baseline before this PR; 19 pre-existing failures onmainare unchanged and unrelated to this change).scripts/nemoclaw-start.sh: confirmed both first-launch sites (lines 3328 + 3553 after edit) havemark_in_container_gatewayimmediately above thenohup ... "$OPENCLAW" gateway run --port ...invocation; the two restart-loop fallbacks at lines ~3382 / ~3639 sit insidewhile/untilblocks downstream of those first launches and reuse the marker that's already there.Scope note
A separate OpenClaw-side concern surfaced in @Dongni-Yang's 2026-06-04 thread (in-sandbox gateway loses its HTTP listener while the process stays alive) is out of scope for this PR — that fix has to live in the OpenClaw gateway codebase. This PR only addresses the marker-file regression hulynn re-confirmed in #4710 for docker-driver sandboxes where the gateway legitimately runs on the host. Standalone deployments where the gateway runs in-container are unaffected: the marker is still created (just at the launch site instead of at startup), so the existing pgrep healthcheck behavior is preserved bit-for-bit.
Signed-off-by: Shawn Xie shaxie@nvidia.com
🤖 Generated with Claude Code
Summary by CodeRabbit
Chores
Tests