Skip to content

fix: skip gateway marker for docker-driver sandboxes#4748

Merged
cv merged 3 commits into
NVIDIA:mainfrom
glenn-agent:fix/4710-docker-driver-healthcheck-marker
Jun 5, 2026
Merged

fix: skip gateway marker for docker-driver sandboxes#4748
cv merged 3 commits into
NVIDIA:mainfrom
glenn-agent:fix/4710-docker-driver-healthcheck-marker

Conversation

@glenn-agent

@glenn-agent glenn-agent commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #4710.

Validation

  • npm test -- --run test/nemoclaw-start.test.ts -t "gateway healthcheck marker"
  • git diff --check
  • git verify-commit HEAD

Note: running the full test/nemoclaw-start.test.ts file locally currently fails in unrelated baseline fixture cases because nemoclaw/node_modules/json5 is absent in this checkout; the targeted marker regression passes.

Signed-off-by: Glenn-Agent glenn_agent@163.com

Summary by CodeRabbit

Bug Fixes

  • Fixed gateway healthcheck behavior to correctly handle different driver modes, ensuring proper health status detection when running under Docker drivers versus standard configurations.

Signed-off-by: Glenn-Agent <glenn_agent@163.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 4, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Warning

Review limit reached

@cv, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 7 minutes and 12 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1be72f4d-b8b1-4906-902a-4ef7f97d1072

📥 Commits

Reviewing files that changed from the base of the PR and between 1b6b1b3 and 343ca83.

📒 Files selected for processing (2)
  • scripts/nemoclaw-start.sh
  • test/nemoclaw-start.test.ts
📝 Walkthrough

Walkthrough

This PR fixes a bug where the /tmp/nemoclaw-gateway-local marker file is unconditionally created in Docker-driver sandboxes, preventing the Docker HEALTHCHECK from correctly detecting that the gateway runs on the host. The fix adds driver detection to gate marker creation, and extends test coverage to validate Docker-driver and mixed-driver execution paths.

Changes

Docker-driver gateway marker fix

Layer / File(s) Summary
Driver detection and marker gating logic
scripts/nemoclaw-start.sh
Detects whether the container runs under an OpenShell docker driver via OPENSHELL_DRIVERS, sets _NEMOCLAW_DOCKER_DRIVER accordingly, and gates marker creation to only proceed when NEMOCLAW_CMD is empty AND the driver is not docker.
Test refactoring and Docker-driver scenario coverage
test/nemoclaw-start.test.ts
Refactors runScenario helper to accept optional environment variables, enabling simulation of different OPENSHELL_DRIVERS values; updates test documentation to clarify Docker HEALTHCHECK behavior; adds assertions for docker and mixed docker/vm driver modes verifying marker remains absent in those paths.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4600: Both PRs change scripts/nemoclaw-start.sh's /tmp/nemoclaw-gateway-local marker emission logic, directly affecting Docker HEALTHCHECK behavior tested here.

Suggested labels

fix, Docker, Sandbox, platform: container

Suggested reviewers

  • jyaunches
  • cv

Poem

🐰 A marker file that knew no bounds,
Now checks the driver, looks around!
When docker drives, the gate stays clear,
The health check shines, the path is clear. 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly identifies the main fix: preventing the gateway marker from being created in Docker-driver sandboxes, which directly addresses issue #4710.
Linked Issues check ✅ Passed Changes correctly implement the fix for #4710: detecting Docker-driver mode via OPENSHELL_DRIVERS, gating marker creation accordingly, and extending tests to verify the behavior.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing issue #4710: modifying the marker logic and updating related tests. No unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
scripts/nemoclaw-start.sh (1)

218-221: ⚡ Quick win

Harden docker-driver detection against whitespace/case in OPENSHELL_DRIVERS

scripts/nemoclaw-start.sh uses a raw, case-sensitive match (case ",${OPENSHELL_DRIVERS:-}," in *,docker,*) ...) with no normalization. This repo’s own/tested values are always "docker" or "vm,docker" (no spaces), so the current behavior aligns with in-repo expectations; however, if OPENSHELL_DRIVERS ever comes in as "vm, docker" or "Docker", the match will fail and marker gating could be wrong—normalize (trim, lowercase, remove whitespace) before the case.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/nemoclaw-start.sh` around lines 218 - 221, Normalize
OPENSHELL_DRIVERS before the case match so whitespace and case won’t break
detection: create a normalized var (derived from OPENSHELL_DRIVERS) that is
lowercased and has all whitespace removed or trimmed around commas, then run the
existing case on that normalized value to set _NEMOCLAW_DOCKER_DRIVER; update
the snippet that currently uses case ",${OPENSHELL_DRIVERS:-}," in to reference
the normalized variable and keep the same pattern matching (e.g., look for
",docker,") so both "vm, docker" and "Docker" will be detected correctly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 218-221: Normalize OPENSHELL_DRIVERS before the case match so
whitespace and case won’t break detection: create a normalized var (derived from
OPENSHELL_DRIVERS) that is lowercased and has all whitespace removed or trimmed
around commas, then run the existing case on that normalized value to set
_NEMOCLAW_DOCKER_DRIVER; update the snippet that currently uses case
",${OPENSHELL_DRIVERS:-}," in to reference the normalized variable and keep the
same pattern matching (e.g., look for ",docker,") so both "vm, docker" and
"Docker" will be detected correctly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 85063e10-623d-4e68-b646-0925e5e317c9

📥 Commits

Reviewing files that changed from the base of the PR and between 17734b1 and 1b6b1b3.

📒 Files selected for processing (2)
  • scripts/nemoclaw-start.sh
  • test/nemoclaw-start.test.ts

@cv cv merged commit 3c0340a into NVIDIA:main Jun 5, 2026
17 checks passed
@wscurran wscurran added area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression labels Jun 5, 2026
@wscurran

wscurran commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

✨ Thanks for submitting this detailed PR about skipping the gateway marker for Docker-driver sandboxes, which improves the healthcheck behavior in these environments. This proposes a bug fix for the sandbox that affects the gateway marker detection.


Related open issues:

miyoungc added a commit that referenced this pull request Jun 6, 2026
## Summary
- Adds the `v0.0.60` section to `docs/about/release-notes.mdx` using the
dev announcement from discussion #4877.
- Fills the source-doc gaps found during release-prep review across
inference, policy tiers, command behavior, security boundaries, Hermes
dashboard/tooling, runtime context, and troubleshooting.
- Refreshes generated agent skills under `.agents/skills/` from the
current Fern docs output and upgrades Fern from `5.44.3` to `5.45.0`.

## Source summary
- #4037 -> `docs/reference/architecture.mdx`,
`docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents
system-only runtime context that stays out of visible chat.
- #4875 -> `docs/reference/architecture.mdx`,
`docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents
try-first sandbox network/filesystem guidance and clearer failure
classification.
- #4788 -> `docs/security/best-practices.mdx`,
`docs/about/release-notes.mdx`: Documents shared OpenClaw
device-approval policy for startup and connect.
- #4768 -> `docs/reference/network-policies.mdx`,
`docs/network-policy/integration-policy-examples.mdx`,
`docs/get-started/quickstart.mdx`,
`docs/get-started/quickstart-hermes.mdx`, `docs/reference/commands.mdx`:
Documents `weather`, `public-reference`, and Hermes managed-tool gateway
preset behavior.
- #3788 and #4864 -> `docs/reference/network-policies.mdx`,
`docs/reference/commands.mdx`: Documents non-interactive policy-tier
fail-fast behavior and interactive prompt fallback.
- #4756 and #4866 -> `docs/reference/commands.mdx`: Documents env-aware
default sandbox resolution for `list`, `status`, and `tunnel` commands.
- #4320 -> `docs/reference/commands.mdx`: Documents `$$nemoclaw tunnel
status` behavior.
- #4328 -> `docs/reference/commands.mdx`: Documents line-scoped policy
preset descriptions in `policy-list`.
- #4580 and #4748 -> `docs/reference/architecture.mdx`: Documents
package-managed OpenShell gateway service and Docker-driver
gateway-marker behavior.
- #4598 -> `docs/manage-sandboxes/lifecycle.mdx`: Documents concurrent
gateway/dashboard cleanup isolation by sandbox name and port.
- #4777 -> `docs/reference/troubleshooting.mdx`: Documents Docker GPU
patch rollback behavior.
- #4610 -> `docs/reference/troubleshooting.mdx`,
`docs/reference/commands.mdx`: Keeps mutable OpenClaw config permission
guidance aligned and removes skipped experimental wording.
- #4868 -> `docs/reference/commands.mdx`: Keeps `.dockerignore` handling
for custom `onboard --from <Dockerfile>` contexts in generated skills.
- #4870 -> `docs/reference/commands.mdx`,
`docs/manage-sandboxes/runtime-controls.mdx`: Documents
`NEMOCLAW_MINIMAL_BOOTSTRAP` and generated skill coverage.
- #4641 -> `docs/inference/inference-options.mdx`,
`docs/reference/troubleshooting.mdx`: Documents local NVIDIA NIM
platform-digest pulls and served-model id adoption.
- #4810 and #4867 -> `docs/inference/inference-options.mdx`: Documents
stable NGC managed-vLLM image lineage and DGX Station DeepSeek V4 Flash
coverage.
- #4852 -> `docs/inference/use-local-inference.mdx`,
`docs/reference/troubleshooting.mdx`: Documents Ollama model fit
filtering, 16K context floor, cold-load retry, and failed-model
exclusion.
- #4847 -> `docs/inference/switch-inference-providers.mdx`: Documents
API-family sync, Hermes `api_mode`, and Bedrock Runtime exception.
- #4800 -> `docs/inference/tool-calling-reliability.mdx`: Documents
Nemotron managed-inference native tool-search fallback.
- #4333 -> `docs/inference/switch-inference-providers.mdx`: Documents
interactive multimodal input prompting.
- #4086 -> `docs/reference/troubleshooting.mdx`: Keeps proxy bypass
normalization in generated troubleshooting coverage.
- #4811 and #4855 -> `docs/get-started/quickstart-hermes.mdx`: Documents
prebuilt Hermes dashboard assets and TUI recovery without runtime
rebuilds.
- #4854 -> `docs/inference/switch-inference-providers.mdx`,
`docs/reference/commands.mdx`: Documents Hermes proxy API-key
placeholder preservation during inference switches.
- #4248 -> `docs/manage-sandboxes/messaging-channels.mdx`,
`.agents/skills/`: Keeps messaging enrollment behavior aligned with
manifest-hook implementation.
- #4771 -> `docs/security/best-practices.mdx`,
`docs/security/credential-storage.mdx`: Documents Hermes
placeholder-only secret boundary for sandbox-visible runtime files.
- #4787 -> `docs/security/best-practices.mdx`,
`docs/about/release-notes.mdx`: Documents expanded memory scanner
examples for OpenAI project keys and Slack app-level tokens.
- #4848 -> `docs/reference/commands.mdx`: Documents OpenClaw skill
install mirroring into the agent home directory.
- #4790 -> `docs/about/release-notes.mdx`: Uses the prior release-prep
structure and generated `.agents/skills/` refresh as the template for
this release.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ skills/
--prefix nemoclaw-user --doc-platform fern-mdx --dry-run`
- `npm run docs`
- `git diff --check`
- skip-term scan across `docs/`, `.agents/skills/`, and `skills/`
- `npm run build:cli`
- `npm run typecheck:cli`
- Commit and pre-push hook suites, including markdownlint, gitleaks,
env-var docs gate, docs-to-skills verification, and skills YAML tests

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **New Features**
* DeepSeek-V4-Flash now available as default inference model for DGX
Station.
* Hermes dashboard improved with dedicated port and OAuth-authenticated
tool gateway selection.
* Added weather and public-reference policy presets for expanded agent
capabilities.
* Enhanced Ollama model selection with GPU memory filtering and
automatic retry for timeouts.

* **Bug Fixes**
  * Improved policy tier validation to prevent invalid configurations.
* Better sandbox cleanup scoping by port to prevent conflicts across
deployments.
  * Added GPU patch failure recovery with automatic rollback.

* **Documentation**
* Expanded troubleshooting guides for inference, security, and sandbox
lifecycle.
  * Added .dockerignore best practices for custom deployments.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>
cv added a commit that referenced this pull request Jun 11, 2026
…hint (#4710) (#5116)

## Summary
- The early conditional gating `mark_in_container_gateway` on
`OPENSHELL_DRIVERS=docker` (added in #4748) never fires inside the
sandbox container, because OpenShell 0.0.44 does not export
`OPENSHELL_DRIVERS` into the sandbox env. @hulynn's 2026-06-08
re-verification on v0.0.60 confirms the symptom is unchanged: marker is
created on every empty-`NEMOCLAW_CMD` container, the Dockerfile
HEALTHCHECK short-circuit (`[ -f /tmp/nemoclaw-gateway-local ] || exit
0`) is unreachable for docker-driver sandboxes, and the container is
marked `(unhealthy)` on every fresh onboard.
- This PR drops the early env-gated write and calls
`mark_in_container_gateway()` directly before each `openclaw gateway run
--port ...` launch (both non-root and root-mode paths). The marker is
now true-by-construction: it exists if-and-only-if this container is
about to start the gateway, regardless of whether the deployment-mode
signal is plumbed through the sandbox env.

This is hulynn's fix-direction (c) from #4710:
> Move mark_in_container_gateway into the gateway-launching codepath
itself, so it only runs when this container will actually start the
gateway.

## Related Issue
Fixes #4710. Supersedes the no-op env-var guard introduced by #4748.

## Changes
- `scripts/nemoclaw-start.sh`: remove the early `case
",${OPENSHELL_DRIVERS:-},"` block and the `_NEMOCLAW_DOCKER_DRIVER` gate
that wrapped the early `mark_in_container_gateway` call; add
`mark_in_container_gateway` immediately before both first-launch
invocations of `openclaw gateway run --port "${_DASHBOARD_PORT}"`
(non-root at line ~3324, root-mode at line ~3550). Restart-loop fallback
launches inherit the marker from the first launch — the function is
idempotent (`: >file`), so calling it again on retry is a no-op.
- `scripts/nemoclaw-start.sh`: rewrite the function-definition comment
block to explain the launch-site invariant and the #4710 root cause, so
future regressions cannot reintroduce an env-gated form without removing
the warning.
- `test/nemoclaw-start.test.ts`: replace the old startup-snippet test
with two new structural tests:
- **`ties the in-container gateway healthcheck marker to the gateway
launch site (#4503, #4710)`** — asserts (a) the region immediately after
the function definition contains no early `mark_in_container_gateway`
call and no `OPENSHELL_DRIVERS` conditional; (b) every `openclaw gateway
run --port` invocation in the script is preceded (within 6 lines) by
`mark_in_container_gateway`, OR sits inside a restart-loop body that
inherits from the first launch. The first invariant blocks a regression
back to the env-gated form; the second locks the new contract.
- **`mark_in_container_gateway writes the marker file idempotently
(#4710)`** — behavioral check on the helper itself.
- `test/nemoclaw-start.test.ts`: the two existing launch-block tests
(`registers child PIDs ...` and `launches the root gateway through gosu
...`) now stub `mark_in_container_gateway() { :; }` in their preamble so
the extracted shell snippet remains self-contained after the new call is
introduced.

## Type of Change
- [x] Code change (feature, bug fix, or refactor)

## Verification
- `npx vitest run --project cli test/nemoclaw-start.test.ts` — **120
passing** (vs. 119 baseline before this PR; 19 pre-existing failures on
`main` are unchanged and unrelated to this change).
- Manual diff review of `scripts/nemoclaw-start.sh`: confirmed both
first-launch sites (lines 3328 + 3553 after edit) have
`mark_in_container_gateway` immediately above the `nohup ... "$OPENCLAW"
gateway run --port ...` invocation; the two restart-loop fallbacks at
lines ~3382 / ~3639 sit inside `while`/`until` blocks downstream of
those first launches and reuse the marker that's already there.

## Scope note
A separate OpenClaw-side concern surfaced in @Dongni-Yang's 2026-06-04
thread (in-sandbox gateway loses its HTTP listener while the process
stays alive) is out of scope for this PR — that fix has to live in the
OpenClaw gateway codebase. This PR only addresses the **marker-file
regression hulynn re-confirmed in #4710** for docker-driver sandboxes
where the gateway legitimately runs on the host. Standalone deployments
where the gateway runs in-container are unaffected: the marker is still
created (just at the launch site instead of at startup), so the existing
pgrep healthcheck behavior is preserved bit-for-bit.

Signed-off-by: Shawn Xie <shaxie@nvidia.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
* Improved container health-status handling so the in-container gateway
liveness marker is only created when the gateway actually runs inside
the container; host-delivered gateways skip marker creation and
in-container path.
  * Adjusted test file-size budget for the related tests.

* **Tests**
* Added/updated tests verifying marker creation, idempotence, and
presence at gateway start across launch modes; added host-delivery skip
scenarios and harness stubs to isolate marker behavior.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Shawn Xie <shaxie@nvidia.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression v0.0.60 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Ubuntu 24.04][Sandbox] Docker-driver HEALTHCHECK always (unhealthy) — marker file always created

3 participants