Skip to content

fix(sandbox): tie gateway healthcheck marker to launch site, not env hint (#4710)#5116

Merged
cv merged 9 commits into
mainfrom
fix/healthcheck-marker-gateway-launch-4710
Jun 11, 2026
Merged

fix(sandbox): tie gateway healthcheck marker to launch site, not env hint (#4710)#5116
cv merged 9 commits into
mainfrom
fix/healthcheck-marker-gateway-launch-4710

Conversation

@nvshaxie

@nvshaxie nvshaxie commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Summary

  • The early conditional gating mark_in_container_gateway on OPENSHELL_DRIVERS=docker (added in fix: skip gateway marker for docker-driver sandboxes #4748) never fires inside the sandbox container, because OpenShell 0.0.44 does not export OPENSHELL_DRIVERS into the sandbox env. @hulynn's 2026-06-08 re-verification on v0.0.60 confirms the symptom is unchanged: marker is created on every empty-NEMOCLAW_CMD container, the Dockerfile HEALTHCHECK short-circuit ([ -f /tmp/nemoclaw-gateway-local ] || exit 0) is unreachable for docker-driver sandboxes, and the container is marked (unhealthy) on every fresh onboard.
  • This PR drops the early env-gated write and calls mark_in_container_gateway() directly before each openclaw gateway run --port ... launch (both non-root and root-mode paths). The marker is now true-by-construction: it exists if-and-only-if this container is about to start the gateway, regardless of whether the deployment-mode signal is plumbed through the sandbox env.

This is hulynn's fix-direction (c) from #4710:

Move mark_in_container_gateway into the gateway-launching codepath itself, so it only runs when this container will actually start the gateway.

Related Issue

Fixes #4710. Supersedes the no-op env-var guard introduced by #4748.

Changes

  • scripts/nemoclaw-start.sh: remove the early case ",${OPENSHELL_DRIVERS:-}," block and the _NEMOCLAW_DOCKER_DRIVER gate that wrapped the early mark_in_container_gateway call; add mark_in_container_gateway immediately before both first-launch invocations of openclaw gateway run --port "${_DASHBOARD_PORT}" (non-root at line ~3324, root-mode at line ~3550). Restart-loop fallback launches inherit the marker from the first launch — the function is idempotent (: >file), so calling it again on retry is a no-op.
  • scripts/nemoclaw-start.sh: rewrite the function-definition comment block to explain the launch-site invariant and the [Ubuntu 24.04][Sandbox] Docker-driver HEALTHCHECK always (unhealthy) — marker file always created #4710 root cause, so future regressions cannot reintroduce an env-gated form without removing the warning.
  • test/nemoclaw-start.test.ts: replace the old startup-snippet test with two new structural tests:
    • ties the in-container gateway healthcheck marker to the gateway launch site (#4503, #4710) — asserts (a) the region immediately after the function definition contains no early mark_in_container_gateway call and no OPENSHELL_DRIVERS conditional; (b) every openclaw gateway run --port invocation in the script is preceded (within 6 lines) by mark_in_container_gateway, OR sits inside a restart-loop body that inherits from the first launch. The first invariant blocks a regression back to the env-gated form; the second locks the new contract.
    • mark_in_container_gateway writes the marker file idempotently (#4710) — behavioral check on the helper itself.
  • test/nemoclaw-start.test.ts: the two existing launch-block tests (registers child PIDs ... and launches the root gateway through gosu ...) now stub mark_in_container_gateway() { :; } in their preamble so the extracted shell snippet remains self-contained after the new call is introduced.

Type of Change

  • Code change (feature, bug fix, or refactor)

Verification

  • npx vitest run --project cli test/nemoclaw-start.test.ts120 passing (vs. 119 baseline before this PR; 19 pre-existing failures on main are unchanged and unrelated to this change).
  • Manual diff review of scripts/nemoclaw-start.sh: confirmed both first-launch sites (lines 3328 + 3553 after edit) have mark_in_container_gateway immediately above the nohup ... "$OPENCLAW" gateway run --port ... invocation; the two restart-loop fallbacks at lines ~3382 / ~3639 sit inside while/until blocks downstream of those first launches and reuse the marker that's already there.

Scope note

A separate OpenClaw-side concern surfaced in @Dongni-Yang's 2026-06-04 thread (in-sandbox gateway loses its HTTP listener while the process stays alive) is out of scope for this PR — that fix has to live in the OpenClaw gateway codebase. This PR only addresses the marker-file regression hulynn re-confirmed in #4710 for docker-driver sandboxes where the gateway legitimately runs on the host. Standalone deployments where the gateway runs in-container are unaffected: the marker is still created (just at the launch site instead of at startup), so the existing pgrep healthcheck behavior is preserved bit-for-bit.

Signed-off-by: Shawn Xie shaxie@nvidia.com

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Chores

    • Improved container health-status handling so the in-container gateway liveness marker is only created when the gateway actually runs inside the container; host-delivered gateways skip marker creation and in-container path.
    • Adjusted test file-size budget for the related tests.
  • Tests

    • Added/updated tests verifying marker creation, idempotence, and presence at gateway start across launch modes; added host-delivery skip scenarios and harness stubs to isolate marker behavior.

…hint (#4710)

The early conditional gating mark_in_container_gateway on
OPENSHELL_DRIVERS=docker (PR #4748) never fires inside the sandbox
container because OpenShell 0.0.44 does not export OPENSHELL_DRIVERS
into the sandbox env. hulynn's 2026-06-08 re-verification on v0.0.60
confirmed the symptom is unchanged from v0.0.57: marker file is created
on every NEMOCLAW_CMD-empty container, so the Dockerfile HEALTHCHECK
short-circuit (`[ -f /tmp/nemoclaw-gateway-local ] || exit 0`) is
unreachable for docker-driver sandboxes and the container is marked
(unhealthy) on every fresh onboard.

Fix: drop the early env-gated marker write and call
mark_in_container_gateway() directly before each
`openclaw gateway run --port ...` launch (both the non-root and
root-mode paths). The marker is now true-by-construction: it exists
if-and-only-if this container is about to start the gateway, regardless
of how the deployment-mode signal is (or isn't) plumbed through the
sandbox env. Restart-loop fallback launches inherit the marker from the
first launch — the function is idempotent (`: >file`) so calling it
again on retry is a no-op.

Verified by two new regression tests in test/nemoclaw-start.test.ts:
- 'ties the in-container gateway healthcheck marker to the gateway
  launch site (#4503, #4710)': asserts (a) the region immediately after
  the function definition contains no early call and no
  OPENSHELL_DRIVERS conditional, and (b) every `openclaw gateway run
  --port` invocation is preceded by mark_in_container_gateway within
  6 lines (or sits inside a restart-loop body).
- 'mark_in_container_gateway writes the marker file idempotently
  (#4710)': behavioral check on the helper itself.

The two existing launch-block tests now stub
`mark_in_container_gateway() { :; }` in their preamble to keep the
extracted shell snippet self-contained. Test count: 120 passing
(vs. 119 baseline; 19 pre-existing failures unchanged).

Note: a separate OpenClaw-side concern surfaced in @Dongni-Yang's
2026-06-04 thread (in-sandbox gateway loses its HTTP listener while
the process stays alive) is out of scope here — this fix only
addresses the marker-file regression hulynn re-confirmed in #4710 for
docker-driver sandboxes where the gateway legitimately runs on the
host. Standalone deployments where the gateway runs in-container are
unaffected: the marker is still created (just at the launch site
instead of at startup), so the existing pgrep healthcheck behavior is
preserved bit-for-bit.

Fixes #4710.

Signed-off-by: Shawn Xie <shaxie@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 10, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nvshaxie nvshaxie requested review from cv and ericksoa June 10, 2026 06:52
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Tie the HEALTHCHECK marker to the actual gateway launch: add host-delivery detection, move the marker helper earlier, call it immediately before starting the gateway (both entrypoint modes), and add tests that extract/execute the helper, verify timing/skip behavior, and check idempotency. Adjust test imports and reduce the test file-size budget.

Changes

Gateway marker placement and validation

Layer / File(s) Summary
Marker function and host-delivery helpers
scripts/nemoclaw-start.sh
mark_in_container_gateway() moved earlier; adds gateway_delivered_by_openshell_host() and skip_in_container_gateway_for_host_delivery() to gate marker creation.
Invoke marker at gateway start (non-root & root)
scripts/nemoclaw-start.sh
Calls skip_in_container_gateway_for_host_delivery and then mark_in_container_gateway immediately before openclaw gateway run in the non-root background path and the root step-down path.
Tests: extract-run behavioral probe and idempotency
test/nemoclaw-start-gateway-marker.test.ts
Extracts the shell mark_in_container_gateway function, runs behavioral probes with a fake openclaw to assert marker exists at invocation (two launch forms), tests host-delivered skip (no marker, no openclaw run), and adds a direct idempotency test (marker created, empty, persists across calls).
Test harness stub, imports, and budget
test/nemoclaw-start.test.ts, ci/test-file-size-budget.json
Stubs mark_in_container_gateway() as a no-op in the gateway-launch harness, reorders Node/Vitest imports, adds skip_in_container_gateway_for_host_delivery() stub to diagnostics scaffolding, and reduces legacyMaxLines for the affected test file.

Sequence Diagram(s)

sequenceDiagram
  participant ComponentA
  participant ComponentB
  ComponentA->>ComponentB: observable interaction
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰
I nudge a tiny file in place,
A hollow mark to show the space.
Called just before the gateway wakes,
It stays quite small for healthchecks' sakes.
A hop, a test, a quiet trace.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: shifting the gateway healthcheck marker from environment-based hinting to direct launch-site detection, addressing the core regression in issue #4710.
Linked Issues check ✅ Passed The PR fully implements all coding requirements from #4710: marker created only at gateway launch sites, removed early env-gated write, added idempotent verification tests, and prevented regressions via structural guards.
Out of Scope Changes check ✅ Passed All changes are scoped to the marker-file regression: script logic, test coverage updates, and file-size budget adjustment. OpenClaw listener issues remain explicitly out of scope.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/healthcheck-marker-gateway-launch-4710

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/nemoclaw-start.test.ts`:
- Around line 331-346: The file test/nemoclaw-start.test.ts exceeds the legacy
size budget; split out the new marker-focused test and trim verbose comments:
move the test with the title starting "ties the in-container gateway healthcheck
marker to the gateway launch site (`#4503`, `#4710`)" and any closely-related helper
functions/fixtures it depends on into a new test file (e.g.,
test/nemoclaw-gateway-marker.test.ts), shorten or relocate the long inline
commentary (the large block around the NEMOCLAW_CMD checks) into a small summary
or separate docs, and update/keep any imports or shared fixtures referenced by
that test so it runs standalone; after moving, run the size-budget check to
confirm test/nemoclaw-start.test.ts is below the limit and that the new file
contains the marker-focused tests formerly at lines ~403-435.
- Around line 383-399: The current fallback sets inRestartLoop true if any
"while|until|for" appears within 60 lines, which can falsely exempt non-loop
launches; update the detection used around launchLineNumbers and
mark_in_container_gateway so it only treats a launch as inside a restart loop
when a loop header actually encloses the launch: locate the nearest loop header
upstream (e.g., the matching "while|until|for" token found when computing
upstream), ensure you stop at function boundaries, then verify the loop body
scope/indentation spans the launch index (or that the launch line occurs after
the loop header but before the loop’s end) before setting inRestartLoop true;
change the logic that sets inRestartLoop to perform this scoped containment
check rather than the current simple regex existence test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 794633e9-7c20-4f79-87e1-c3ea4c81376a

📥 Commits

Reviewing files that changed from the base of the PR and between 56cbecf and ecf648f.

📒 Files selected for processing (2)
  • scripts/nemoclaw-start.sh
  • test/nemoclaw-start.test.ts

Comment thread test/nemoclaw-start.test.ts Outdated
Comment thread test/nemoclaw-start.test.ts Outdated
@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: cloud-onboard-e2e, issue-2478-crash-loop-recovery-e2e
Optional E2E: sandbox-survival-e2e, gateway-health-honest-e2e

Dispatch hint: cloud-onboard-e2e,issue-2478-crash-loop-recovery-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • cloud-onboard-e2e (high): Runs a real cloud OpenClaw onboard against the current Docker-driver path. This is the most direct existing E2E to catch a sandbox container being marked unhealthy or failing readiness because the in-container gateway marker is wrong when the gateway is host-owned.
  • issue-2478-crash-loop-recovery-e2e (high): Exercises real OpenClaw gateway startup, liveness, and crash/recovery behavior inside an onboarded sandbox. The PR modifies the exact entrypoint launch sites that start openclaw gateway run, so this should block merge for gateway lifecycle regressions.

Optional E2E

  • sandbox-survival-e2e (high): Additional confidence for sandbox lifecycle after gateway stop/start and recovery. Adjacent to the marker/healthcheck behavior, but less targeted than onboard and crash-loop recovery.
  • gateway-health-honest-e2e (medium): Adjacent regression guard for gateway health honesty when the Docker-driver gateway fails. Useful because this PR changes health/readiness signaling assumptions, though it focuses host gateway startup rather than the in-container marker contract.

New E2E recommendations

  • Docker-driver sandbox healthcheck marker contract (high): Existing unit tests validate marker placement in extracted shell snippets, and broad onboard E2E may catch failures indirectly, but there is no dedicated live E2E that onboards a Docker-driver sandbox, inspects the real container for absence of /tmp/nemoclaw-gateway-local, and verifies Docker health remains healthy while the host-owned gateway is serving.
    • Suggested test: Add a live Docker-driver gateway-marker-healthcheck E2E that onboards OpenClaw, docker execs the sandbox container to assert /tmp/nemoclaw-gateway-local is absent, verifies docker inspect health does not fail on curl exit 7, and confirms host gateway health/status is ready.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: cloud-onboard-e2e,issue-2478-crash-loop-recovery-e2e

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Recommendation

Required Vitest E2E scenarios: ubuntu-repo-cloud-openclaw
Optional Vitest E2E scenarios: ubuntu-repo-docker-post-reboot-recovery

Dispatch required Vitest E2E scenarios:

  • gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Workflow run

Full Vitest E2E advisor summary

Vitest E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required Vitest E2E scenarios

  • ubuntu-repo-cloud-openclaw: scripts/nemoclaw-start.sh changes the sandbox entrypoint gateway-marker behavior used by the Ubuntu Docker cloud OpenClaw path; this live-supported scenario exercises the standard Docker-driver OpenClaw sandbox startup and health/state validation surface.
    • Dispatch: gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Optional Vitest E2E scenarios

  • ubuntu-repo-docker-post-reboot-recovery: Optional adjacent lifecycle coverage for Docker-driver recovery/status behavior, which is related to gateway/sandbox health handling but not the primary changed entrypoint launch contract.
    • Dispatch: gh workflow run e2e-vitest-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-docker-post-reboot-recovery

Relevant changed files

  • scripts/nemoclaw-start.sh

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 0 needs attention, 4 worth checking, 0 nice ideas
Since last review: 1 prior item resolved, 3 still apply, 0 new items found

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • Source-of-truth review needed: scripts/nemoclaw-start.sh marker timing and Dockerfile absent-marker healthcheck shortcut: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `scripts/nemoclaw-start.sh` defines `mark_in_container_gateway` near line 217 and calls it at first gateway launch sites near lines 3329 and 3554; `Dockerfile` exits healthy when curl exit is 7 and the marker is absent; docs now say env hints are not authoritative.
  • Absent marker still conflates host-owned gateway with local pre-launch startup (scripts/nemoclaw-start.sh:217): Moving `/tmp/nemoclaw-gateway-local` creation to the two `openclaw gateway run` launch sites fixes the unreliable `OPENSHELL_DRIVERS` heuristic, but it also means standalone/local-serving startup has no marker until all setup before that launch site completes. The Dockerfile healthcheck still exits healthy on curl exit 7 when the marker is absent, so a local startup hang or failure before `mark_in_container_gateway` can be masked as healthy rather than treated as a local gateway boot failure.
    • Recommendation: Introduce a distinct source of truth for gateway ownership or local-serving intent before long setup work starts, or add a separate host-owned marker/state so the Dockerfile can distinguish host-delivered gateways from local pre-launch stalls. Align the script, Dockerfile, and docs comments with that state model and add a regression for the chosen pre-launch health boundary.
    • Evidence: `mark_in_container_gateway` writes `/tmp/nemoclaw-gateway-local` at `scripts/nemoclaw-start.sh:217` and is called only immediately before the non-root and root first-launch `openclaw gateway run` invocations at about `scripts/nemoclaw-start.sh:3329` and `scripts/nemoclaw-start.sh:3554`. The unchanged `Dockerfile` healthcheck has `[ -f /tmp/nemoclaw-gateway-local ] || exit 0` after curl exit 7. The changed script comment defines absence as “this entrypoint has not launched a gateway,” while the Dockerfile comment treats absence as outside-container gateway delivery.
  • Runtime evidence is still missing for the [Ubuntu 24.04][Sandbox] Docker-driver HEALTHCHECK always (unhealthy) — marker file always created #4710 Docker-driver health contract (test/nemoclaw-start-gateway-marker.test.ts:47): Issue [Ubuntu 24.04][Sandbox] Docker-driver HEALTHCHECK always (unhealthy) — marker file always created #4710's observable expected result is that Docker-driver sandboxes leave `/tmp/nemoclaw-gateway-local` absent and Docker reports `Status="healthy"` with `FailingStreak=0`. The added tests verify the marker helper, env hints, and extracted launch blocks, and existing Dockerfile unit coverage proves absent marker plus curl exit 7 exits healthy. They do not exercise a Docker-driver-shaped entrypoint/container lifecycle to prove the marker stays absent and Docker health reaches the requested state end-to-end.
    • Recommendation: Add or identify targeted runtime/integration validation that starts a Docker-driver-shaped sandbox/entrypoint, confirms `/tmp/nemoclaw-gateway-local` is absent, and confirms the Docker HEALTHCHECK result for curl exit 7 is healthy without invoking `pgrep` and with `FailingStreak=0`. Do not rely only on PR prose or external workflow status.
    • Evidence: `test/nemoclaw-start-gateway-marker.test.ts` constructs shell snippets and fake `openclaw` binaries; `test/nemoclaw-start.test.ts` extracts launch blocks and now verifies marker presence at real launch sites; `test/sandbox-provisioning.test.ts` covers the Dockerfile health command in isolation, not an actual Docker-driver entrypoint/container path. Issue [Ubuntu 24.04][Sandbox] Docker-driver HEALTHCHECK always (unhealthy) — marker file always created #4710 expected: “Specifically at step 4, the marker file should NOT exist; step 3 should show `Status="healthy"` with `FailingStreak=0`.”
  • Healthcheck shortcut can still mask local gateway startup failures (scripts/nemoclaw-start.sh:3329): This is security-relevant sandbox lifecycle observability rather than a confirmed sandbox escape. The PR improves the previous policy by not trusting sandbox env hints such as `OPENSHELL_DRIVERS`, but the container health configuration still treats marker absence as healthy on connection-refused. Because the marker is now only written at the launch site, local-serving pre-launch failures can be indistinguishable from host-owned gateway delivery.
    • Recommendation: Make the healthcheck consume an authoritative ownership/intent signal instead of overloading marker absence. Add negative coverage showing standalone/local-serving pre-launch failure does not pass health via the absent-marker shortcut.
    • Evidence: Security categories 7 and 9 are warnings: `scripts/nemoclaw-start.sh` calls `mark_in_container_gateway` immediately before gateway launch, while the unchanged `Dockerfile` exits 0 whenever curl exit is 7 and `/tmp/nemoclaw-gateway-local` is absent.

🌱 Nice ideas

  • None.
Consider writing more tests for
  • **Runtime validation** — Docker-driver-shaped entrypoint/container leaves `/tmp/nemoclaw-gateway-local` absent.. This PR changes sandbox/container lifecycle and Docker health semantics. The unit-style shell snippets and extracted launch-block tests are useful, but the key acceptance criteria are observable runtime behaviors involving the actual entrypoint, marker file, Docker-driver topology, and Docker HEALTHCHECK state.
  • **Runtime validation** — Docker-driver-shaped container with curl exit 7 reports Docker health `Status="healthy"` and `FailingStreak=0` without invoking `pgrep`.. This PR changes sandbox/container lifecycle and Docker health semantics. The unit-style shell snippets and extracted launch-block tests are useful, but the key acceptance criteria are observable runtime behaviors involving the actual entrypoint, marker file, Docker-driver topology, and Docker HEALTHCHECK state.
  • **Runtime validation** — Standalone/local-serving startup that fails or stalls before `openclaw gateway run` does not pass health solely because `/tmp/nemoclaw-gateway-local` is absent, or explicitly declares local-serving intent before long setup.. This PR changes sandbox/container lifecycle and Docker health semantics. The unit-style shell snippets and extracted launch-block tests are useful, but the key acceptance criteria are observable runtime behaviors involving the actual entrypoint, marker file, Docker-driver topology, and Docker HEALTHCHECK state.
  • **Runtime validation** — Runtime topology validation records whether the dashboard gateway listener/process is host-owned or in-container for Docker-driver so conflicting issue comments cannot regress into masking a real listener failure.. This PR changes sandbox/container lifecycle and Docker health semantics. The unit-style shell snippets and extracted launch-block tests are useful, but the key acceptance criteria are observable runtime behaviors involving the actual entrypoint, marker file, Docker-driver topology, and Docker HEALTHCHECK state.
  • **Runtime validation** — Full production launch path reached with `OPENSHELL_DRIVERS=docker`, missing `OPENSHELL_DRIVERS`, and OpenShell sleep-command env still writes the marker when it actually starts `openclaw gateway run`.. This PR changes sandbox/container lifecycle and Docker health semantics. The unit-style shell snippets and extracted launch-block tests are useful, but the key acceptance criteria are observable runtime behaviors involving the actual entrypoint, marker file, Docker-driver topology, and Docker HEALTHCHECK state.
  • **Runtime evidence is still missing for the [Ubuntu 24.04][Sandbox] Docker-driver HEALTHCHECK always (unhealthy) — marker file always created #4710 Docker-driver health contract** — Add or identify targeted runtime/integration validation that starts a Docker-driver-shaped sandbox/entrypoint, confirms `/tmp/nemoclaw-gateway-local` is absent, and confirms the Docker HEALTHCHECK result for curl exit 7 is healthy without invoking `pgrep` and with `FailingStreak=0`. Do not rely only on PR prose or external workflow status.
  • **Acceptance clause:** Issue [Ubuntu 24.04][Sandbox] Docker-driver HEALTHCHECK always (unhealthy) — marker file always created #4710 Description: "marker file `/tmp/nemoclaw-gateway-local` present = gateway runs in THIS container (standalone deployments); marker absent = gateway runs on host (OpenShell Docker-driver, [Ubuntu 24.04][Sandbox][GitHub Issue #4503] sandbox Docker HEALTHCHECK always (unhealthy) — pgrep openclaw-gateway runs inside container but gateway runs on host #4503) — in-container probe cannot observe it, so HEALTHCHECK should exit 0 and defer to OpenShell host-side delivery-chain monitoring." — add test evidence or identify existing coverage. The marker is now written at the two in-container `openclaw gateway run` launch sites, and Dockerfile healthcheck coverage shows absent marker plus curl exit 7 exits healthy. Remaining gap: marker absence also covers local pre-launch state before the launch site, not only host-owned gateway delivery.
  • **Acceptance clause:** Issue [Ubuntu 24.04][Sandbox] Docker-driver HEALTHCHECK always (unhealthy) — marker file always created #4710 Expected Result: "Docker HEALTHCHECK reports `(healthy)` for Docker-driver sandboxes because the gateway lives on the host and the in-container probe cannot observe it." — add test evidence or identify existing coverage. `test/sandbox-provisioning.test.ts` verifies the Dockerfile command exits healthy when curl returns 7 and the marker is absent, without consulting `pgrep`. The PR does not include runtime/entrypoint evidence that a Docker-driver sandbox leaves the marker absent and reaches Docker healthy state.
Since last review details

Current findings:

  • Source-of-truth review needed: scripts/nemoclaw-start.sh marker timing and Dockerfile absent-marker healthcheck shortcut: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `scripts/nemoclaw-start.sh` defines `mark_in_container_gateway` near line 217 and calls it at first gateway launch sites near lines 3329 and 3554; `Dockerfile` exits healthy when curl exit is 7 and the marker is absent; docs now say env hints are not authoritative.
  • Absent marker still conflates host-owned gateway with local pre-launch startup (scripts/nemoclaw-start.sh:217): Moving `/tmp/nemoclaw-gateway-local` creation to the two `openclaw gateway run` launch sites fixes the unreliable `OPENSHELL_DRIVERS` heuristic, but it also means standalone/local-serving startup has no marker until all setup before that launch site completes. The Dockerfile healthcheck still exits healthy on curl exit 7 when the marker is absent, so a local startup hang or failure before `mark_in_container_gateway` can be masked as healthy rather than treated as a local gateway boot failure.
    • Recommendation: Introduce a distinct source of truth for gateway ownership or local-serving intent before long setup work starts, or add a separate host-owned marker/state so the Dockerfile can distinguish host-delivered gateways from local pre-launch stalls. Align the script, Dockerfile, and docs comments with that state model and add a regression for the chosen pre-launch health boundary.
    • Evidence: `mark_in_container_gateway` writes `/tmp/nemoclaw-gateway-local` at `scripts/nemoclaw-start.sh:217` and is called only immediately before the non-root and root first-launch `openclaw gateway run` invocations at about `scripts/nemoclaw-start.sh:3329` and `scripts/nemoclaw-start.sh:3554`. The unchanged `Dockerfile` healthcheck has `[ -f /tmp/nemoclaw-gateway-local ] || exit 0` after curl exit 7. The changed script comment defines absence as “this entrypoint has not launched a gateway,” while the Dockerfile comment treats absence as outside-container gateway delivery.
  • Runtime evidence is still missing for the [Ubuntu 24.04][Sandbox] Docker-driver HEALTHCHECK always (unhealthy) — marker file always created #4710 Docker-driver health contract (test/nemoclaw-start-gateway-marker.test.ts:47): Issue [Ubuntu 24.04][Sandbox] Docker-driver HEALTHCHECK always (unhealthy) — marker file always created #4710's observable expected result is that Docker-driver sandboxes leave `/tmp/nemoclaw-gateway-local` absent and Docker reports `Status="healthy"` with `FailingStreak=0`. The added tests verify the marker helper, env hints, and extracted launch blocks, and existing Dockerfile unit coverage proves absent marker plus curl exit 7 exits healthy. They do not exercise a Docker-driver-shaped entrypoint/container lifecycle to prove the marker stays absent and Docker health reaches the requested state end-to-end.
    • Recommendation: Add or identify targeted runtime/integration validation that starts a Docker-driver-shaped sandbox/entrypoint, confirms `/tmp/nemoclaw-gateway-local` is absent, and confirms the Docker HEALTHCHECK result for curl exit 7 is healthy without invoking `pgrep` and with `FailingStreak=0`. Do not rely only on PR prose or external workflow status.
    • Evidence: `test/nemoclaw-start-gateway-marker.test.ts` constructs shell snippets and fake `openclaw` binaries; `test/nemoclaw-start.test.ts` extracts launch blocks and now verifies marker presence at real launch sites; `test/sandbox-provisioning.test.ts` covers the Dockerfile health command in isolation, not an actual Docker-driver entrypoint/container path. Issue [Ubuntu 24.04][Sandbox] Docker-driver HEALTHCHECK always (unhealthy) — marker file always created #4710 expected: “Specifically at step 4, the marker file should NOT exist; step 3 should show `Status="healthy"` with `FailingStreak=0`.”
  • Healthcheck shortcut can still mask local gateway startup failures (scripts/nemoclaw-start.sh:3329): This is security-relevant sandbox lifecycle observability rather than a confirmed sandbox escape. The PR improves the previous policy by not trusting sandbox env hints such as `OPENSHELL_DRIVERS`, but the container health configuration still treats marker absence as healthy on connection-refused. Because the marker is now only written at the launch site, local-serving pre-launch failures can be indistinguishable from host-owned gateway delivery.
    • Recommendation: Make the healthcheck consume an authoritative ownership/intent signal instead of overloading marker absence. Add negative coverage showing standalone/local-serving pre-launch failure does not pass health via the absent-marker shortcut.
    • Evidence: Security categories 7 and 9 are warnings: `scripts/nemoclaw-start.sh` calls `mark_in_container_gateway` immediately before gateway launch, while the unchanged `Dockerfile` exits 0 whenever curl exit is 7 and `/tmp/nemoclaw-gateway-local` is absent.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@wscurran wscurran added area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression labels Jun 10, 2026
@wscurran

Copy link
Copy Markdown
Contributor

… file

The #4710 marker test asserted on the start-script source text (regex
.not.toMatch against a source window), tripping the source-shape budget
gate (1 > 0). Replace it with a behavior test: a fake `openclaw` records
whether the /tmp/nemoclaw-gateway-local marker exists at the instant
`gateway run` fires, for both the non-root and root step-down launch
forms — proving the marker is tied to the launch site, not an env hint.

The PR also pushed test/nemoclaw-start.test.ts past its 5289-line legacy
size budget. Move both #4710 marker tests into a dedicated
test/nemoclaw-start-gateway-marker.test.ts (164 lines, under the 1500
default) and ratchet the legacy budget for the original file down to
5234, per the test-file-size-budget legacy-ratchet rule.

No production change; scripts/nemoclaw-start.sh fix is unchanged.

Signed-off-by: Shawn Xie <shaxie@nvidia.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/nemoclaw-start-gateway-marker.test.ts (1)

136-145: ⚡ Quick win

Unify function extraction to avoid brittle brace parsing.

Line 137 uses the first } after the function start, which is fragile if mark_in_container_gateway ever gains ${...} or additional blocks/comments. Reuse the already robust extractShellFunctionFromSource helper here too.

Suggested diff
-    const fnStart = src.indexOf("mark_in_container_gateway() {");
-    const fnEnd = src.indexOf("}", fnStart);
-    if (fnStart === -1 || fnEnd === -1) {
-      throw new Error("mark_in_container_gateway function not found");
-    }
     const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "nemoclaw-gw-marker-"));
     const markerPath = path.join(tmpDir, "nemoclaw-gateway-local");
-    const fnSrc = src
-      .slice(fnStart, fnEnd + 1)
-      .replaceAll("/tmp/nemoclaw-gateway-local", markerPath);
+    const fnSrc = extractShellFunctionFromSource(src, "mark_in_container_gateway")
+      .replaceAll("/tmp/nemoclaw-gateway-local", markerPath);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/nemoclaw-start-gateway-marker.test.ts` around lines 136 - 145, The test
currently finds the function by searching for "mark_in_container_gateway() {"
and the next "}" using fnStart/fnEnd which is brittle; instead call the existing
extractShellFunctionFromSource helper to robustly extract the full
mark_in_container_gateway function body, then replace the
"/tmp/nemoclaw-gateway-local" path in that returned string and assign to fnSrc;
update references to fnStart/fnEnd so extractShellFunctionFromSource is used to
produce the function source (keeping tmpDir and markerPath logic the same).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/nemoclaw-start-gateway-marker.test.ts`:
- Around line 136-145: The test currently finds the function by searching for
"mark_in_container_gateway() {" and the next "}" using fnStart/fnEnd which is
brittle; instead call the existing extractShellFunctionFromSource helper to
robustly extract the full mark_in_container_gateway function body, then replace
the "/tmp/nemoclaw-gateway-local" path in that returned string and assign to
fnSrc; update references to fnStart/fnEnd so extractShellFunctionFromSource is
used to produce the function source (keeping tmpDir and markerPath logic the
same).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d7a9432a-416d-4821-b929-c5be9474434f

📥 Commits

Reviewing files that changed from the base of the PR and between ecf648f and 8935f94.

📒 Files selected for processing (3)
  • ci/test-file-size-budget.json
  • test/nemoclaw-start-gateway-marker.test.ts
  • test/nemoclaw-start.test.ts
✅ Files skipped from review due to trivial changes (1)
  • ci/test-file-size-budget.json
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/nemoclaw-start.test.ts

@github-actions

Copy link
Copy Markdown
Contributor

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27325717417
Target ref: f34bb51c859f7e2247e04fc9898d9c7ab756cfd0
Workflow ref: main
Requested jobs: cloud-onboard-e2e,sandbox-survival-e2e
Summary: 0 passed, 2 failed, 0 skipped

Job Result
cloud-onboard-e2e ❌ failure
sandbox-survival-e2e ❌ failure

Failed jobs: cloud-onboard-e2e, sandbox-survival-e2e. Check run artifacts for logs.

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27326029419
Target ref: c698c883d8bb0c9c22cad27b71678ba06cd9b8d8
Workflow ref: main
Requested jobs: cloud-e2e,sandbox-survival-e2e
Summary: 0 passed, 2 failed, 0 skipped

Job Result
cloud-e2e ❌ failure
sandbox-survival-e2e ❌ failure

Failed jobs: cloud-e2e, sandbox-survival-e2e. Check run artifacts for logs.

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 27326403367
Target ref: 29effb0bdb140766082601f75ada11f723dd7d12
Workflow ref: main
Requested jobs: cloud-e2e,sandbox-survival-e2e
Summary: 0 passed, 2 failed, 0 skipped

Job Result
cloud-e2e ❌ failure
sandbox-survival-e2e ❌ failure

Failed jobs: cloud-e2e, sandbox-survival-e2e. Check run artifacts for logs.

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27326894410
Target ref: cfe3b49efa3f55a5a141923db0e6310f19746f4a
Workflow ref: main
Requested jobs: cloud-onboard-e2e,issue-2478-crash-loop-recovery-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ✅ success

@cv cv merged commit 71278a6 into main Jun 11, 2026
37 checks passed
@cv cv deleted the fix/healthcheck-marker-gateway-launch-4710 branch June 11, 2026 06:17
@cv cv added the v0.0.64 Release target label Jun 12, 2026
cv pushed a commit that referenced this pull request Jun 12, 2026
## Summary
- Add v0.0.64 release notes from the release announcement and link them
to the relevant deeper docs.
- Document that custom policy presets recorded through `policy-add
--from-file` and `--from-dir` survive snapshot restore and sandbox
recreation.
- Refresh generated NemoClaw user skills from the current source docs.

## Source summary
- #5104 -> `docs/manage-sandboxes/backup-restore.mdx`,
`docs/network-policy/customize-network-policy.mdx`: Documents custom
policy presets preserved through snapshot restore.
- #4955 -> `docs/about/release-notes.mdx`: Adds release-note coverage
for Brave web-search pinning and `BRAVE_API_KEY` placeholder
preservation.
- #5116, #5269 -> `docs/about/release-notes.mdx`: Adds release-note
coverage for Docker-driver gateway health and rootfs guard stability.
- #5241, #5085 -> `docs/about/release-notes.mdx`: Adds release-note
coverage for chat-completions provider selection and Nemotron Ultra 550B
tool-less request compatibility.
- #5268, #5210, #5257 -> `docs/about/release-notes.mdx`: Adds
release-note coverage for messaging render plan refresh, OpenClaw
scope-upgrade approval recovery, and Hermes WhatsApp bridge dependency
setup.
- Current source docs -> `.agents/skills/`: Regenerates user-skill
references so agent-facing guidance matches the source documentation.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `npm run docs`
- `npm run build:cli`
- `npm run typecheck:cli`
- Commit/pre-push hooks: markdownlint, gitleaks, docs-to-skills
verification, TypeScript CLI, and skills YAML checks passed.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
* Clarified sandbox snapshot restore preserves custom policy presets and
restores them without original files.
* Switched sandbox setup and remote deployment guidance to Docker-based
workflows and emphasized remote onboarding flow.
* Expanded troubleshooting for gateway recovery, Docker GPU/WSL issues,
and onboarding resume.
* Added/updated CLI docs: advanced maintenance, session export,
upload/download wrappers, and status recovery guidance.
* Added v0.0.64 release notes and links to NemoClaw Community; fixed
command reference formatting.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression v0.0.64 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Ubuntu 24.04][Sandbox] Docker-driver HEALTHCHECK always (unhealthy) — marker file always created

3 participants