Skip to content

fix(sandbox): stop false Docker unhealthy and add paused-container hint (#4503, #4495)#4600

Merged
cv merged 1 commit into
NVIDIA:mainfrom
yimoj:fix/4503-docker-driver-health-status
Jun 1, 2026
Merged

fix(sandbox): stop false Docker unhealthy and add paused-container hint (#4503, #4495)#4600
cv merged 1 commit into
NVIDIA:mainfrom
yimoj:fix/4503-docker-driver-health-status

Conversation

@yimoj

@yimoj yimoj commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes two Docker-driver sandbox container-state issues that mislead operators: a HEALTHCHECK that falsely marks functional Docker-driver sandboxes (unhealthy), and a Phase: Error for a paused container with no recovery guidance. Both share the Docker-driver container-state / health-signal / status-UX axis.

Related Issue

Fixes #4503
Fixes #4495

Changes

#4503 — stop false Docker (unhealthy)

#4495 — paused-container recovery hint without rewriting the phase

  • src/lib/actions/sandbox/docker-health.ts adds getSandboxDockerRuntime, which resolves the docker-driver container once and reads both .State.Health.Status and .State.Paused.
  • src/lib/actions/sandbox/status.ts keeps OpenShell's authoritative Phase: Error verbatim but, when the resolved container is paused, prints an actionable docker unpause <container> recovery hint and skips the misleading rebuild suggestion. dockerPaused is exposed in the structured --json report. The authoritative phase-mapping fix itself belongs upstream in OpenShell.

Regression coverage

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Notes on verification: targeted tests for all changed code pass (docker-health 21, sandbox-provisioning 28, the new start-script marker test, and the paused-status cli test). The Dockerfile HEALTHCHECK fix was reproduced and verified end-to-end with a faithful bash fixture across 8 scenarios (curl exit 0/7/22/28 × marker present/absent × process/log present/absent). The remaining failures in a full cli-project run are environmental — load-induced timeouts on CLI-spawning oclif/help tests and one umask-dependent directory mode-bits test in config-sync.ts (a file this PR does not touch); all pass when run in isolation. CI runs these on dedicated runners.

🤖 Generated with Claude Code

Signed-off-by: Yimo Jiang yimoj@nvidia.com

Summary by CodeRabbit

  • New Features

    • Status reports (including JSON) now detect and expose Docker paused state and show a paused-container recovery hint with a docker unpause <container> suggestion.
  • Bug Fixes

    • Healthcheck logic updated to avoid false restarts when the dashboard listener is unreachable due to host/network delivery by honoring a local marker and handling specific probe exit codes.
  • Tests

    • Added tests covering paused detection, healthcheck marker behavior, and updated status output.

@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 47545e1d-1940-4fc1-a03a-8e8d4004e723

📥 Commits

Reviewing files that changed from the base of the PR and between 330af4e and f3e4696.

📒 Files selected for processing (8)
  • Dockerfile
  • scripts/nemoclaw-start.sh
  • src/lib/actions/sandbox/docker-health.test.ts
  • src/lib/actions/sandbox/docker-health.ts
  • src/lib/actions/sandbox/status.ts
  • test/cli.test.ts
  • test/nemoclaw-start.test.ts
  • test/sandbox-provisioning.test.ts
🚧 Files skipped from review as they are similar to previous changes (8)
  • scripts/nemoclaw-start.sh
  • test/nemoclaw-start.test.ts
  • test/sandbox-provisioning.test.ts
  • Dockerfile
  • test/cli.test.ts
  • src/lib/actions/sandbox/status.ts
  • src/lib/actions/sandbox/docker-health.ts
  • src/lib/actions/sandbox/docker-health.test.ts

📝 Walkthrough

Walkthrough

Adds a marker-gated Docker HEALTHCHECK and creates the marker only when serving the in-container gateway; introduces getSandboxDockerRuntime() returning health+paused+containerName; status reporting exposes dockerPaused and prints docker unpause <name> recovery hints; tests validate marker, healthcheck, and paused-container behavior.

Changes

Paused-Container Status & Healthcheck Marker

Layer / File(s) Summary
Healthcheck marker system (Dockerfile & startup)
Dockerfile, scripts/nemoclaw-start.sh
Dockerfile HEALTHCHECK now checks /tmp/nemoclaw-gateway-local when curl exits with code 7: if the marker is absent the probe exits success; if present it falls back to the existing in-container process/log checks. Startup writes the marker only on the gateway-serving path (NEMOCLAW_CMD empty).
Docker runtime inspection (paused state)
src/lib/actions/sandbox/docker-health.ts, src/lib/actions/sandbox/docker-health.test.ts
Adds SandboxDockerRuntime and getSandboxDockerRuntime() which read Docker health and .State.Paused independently; introduces dockerInspectPaused dependency and normalizePausedState; tests mock paused inspection and validate parsing and error-handling.
Status report generation and paused hints
src/lib/actions/sandbox/status.ts
SandboxStatusReport gets dockerPaused; showSandboxStatus reuses the runtime view and prints a paused-container recovery hint (docker unpause <container>) when the sandbox phase is Error and the driver container is paused; Docker health display uses dockerRuntime.health.
Integration tests: marker, healthcheck, and status
test/cli.test.ts, test/nemoclaw-start.test.ts, test/sandbox-provisioning.test.ts
Tests added to assert marker creation only on gateway-serving startup, HEALTHCHECK treats curl exit 7 as healthy when the marker is absent, and status preserves Phase: Error while adding a paused-container hint and setting dockerPaused: true in JSON.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

NemoClaw CLI, bug

Suggested reviewers

  • cv
  • cjagwani

"🐰
I tuck a marker in /tmp with care,
So healthchecks won't despair.
If containers nap, just show a clue—
docker unpause <name> will do.
Hops and fixes, now we're aware."

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 36.36% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically addresses the two main fixes: stopping false Docker unhealthy reports and adding paused-container hints, with direct issue references.
Linked Issues check ✅ Passed The PR fully addresses both linked issues: #4503 (marker-based HEALTHCHECK to prevent false unhealthy when gateway runs on host) and #4495 (paused-container detection and recovery hints).
Out of Scope Changes check ✅ Passed All changes are directly scoped to addressing #4503 and #4495; no unrelated modifications or feature creep detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@yimoj yimoj added v0.0.60 Release target v0.0.56 Release target and removed v0.0.60 Release target labels Jun 1, 2026
@wscurran wscurran added Docker platform: ubuntu Affects Ubuntu Linux environments labels Jun 1, 2026
@cv cv added v0.0.57 Release target and removed v0.0.56 Release target labels Jun 1, 2026
@prekshivyas prekshivyas self-assigned this Jun 1, 2026
…nt (NVIDIA#4503, NVIDIA#4495)

Docker-driver sandbox container state could mislead operators in two ways.

`pgrep 'openclaw[ -]gateway'` on curl exit 7, marking functional
Docker-driver sandboxes unhealthy when the OpenClaw gateway runs outside
this container's namespace (on the host) so no in-container process or
listener exists, even though `nemoclaw status`/OpenShell report Ready.

nemoclaw-start now drops a `/tmp/nemoclaw-gateway-local` marker early on
the gateway-serving path. The HEALTHCHECK keeps the strict local liveness
check (pgrep + non-empty gateway log) only when that marker is present
(standalone/Compose NVIDIA#1430 and the NVIDIA#3975 forwarded-port shape); when it is
absent the gateway is delivered out-of-namespace, so an in-container probe
cannot prove failure and the check reports healthy, deferring to
NemoClaw/OpenShell host-side delivery-chain monitoring. Writing the marker
early ensures a slow in-container startup is governed by the strict check
rather than masked as healthy. Genuine bad signals (curl timeout 28, HTTP
4xx/5xx 22) still report unhealthy.

`Phase: Error` even though the sandbox is intact. NemoClaw no longer just
prints the raw phase: `getSandboxDockerRuntime` now also reads
`.State.Paused`, and `status` adds an actionable `docker unpause <name>`
recovery hint (and skips the misleading rebuild suggestion) while keeping
OpenShell's authoritative phase verbatim. `dockerPaused` is exposed in the
structured `--json` report. The phase-mapping fix itself belongs upstream
in OpenShell.

Regression coverage: marker-aware healthcheck cases including the NVIDIA#4503
out-of-namespace path (test/sandbox-provisioning.test.ts), the early-marker
gateway-serving behavior (test/nemoclaw-start.test.ts),
`getSandboxDockerRuntime` paused parsing (docker-health.test.ts), and the
paused-container status hint plus dockerPaused report field
(test/cli.test.ts).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
@yimoj yimoj force-pushed the fix/4503-docker-driver-health-status branch from 330af4e to f3e4696 Compare June 1, 2026 22:04
@cv cv merged commit 33f7285 into NVIDIA:main Jun 1, 2026
21 checks passed

@prekshivyas prekshivyas left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving — reviewed the full diff against #4503 and #4495; well-scoped and the logic holds.

#4503 (false (unhealthy)): the marker gate [ -f /tmp/nemoclaw-gateway-local ] || exit 0 sits correctly on the curl-exit-7 path only. Connection-refused with the marker absent (OpenShell docker-driver, gateway out of this namespace) now reports healthy instead of driving Docker off a signal it cannot observe; genuine bad signals (curl exit 28 wedged, 22 HTTP 4xx/5xx) still exit 1. The marker is dropped early in nemoclaw-start.sh on the gateway-serving path (empty NEMOCLAW_CMD) and stays absent for one-shot commands, so a slow/hung boot stays governed by the strict pgrep+log check rather than masked as healthy.

#4495 (paused container): getSandboxDockerRuntime reads .State.Paused alongside health, and status prints a docker unpause hint while keeping OpenShell's authoritative Phase: Error verbatim — the upstream phase-mapping fix correctly stays out of scope. dockerPaused is surfaced in --json.

Test coverage is thorough (8 new docker-health cases, marker-drop, paused-status, JSON field). All gates green, no unresolved threads.

Non-blocking: getSandboxDockerRuntime resolves twice per status invocation (report builder + printer) = one redundant docker inspect. Worth folding into the snapshot later, not a blocker.

@wscurran wscurran added area: packaging Packages, images, registries, installers, or distribution area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression platform: container Affects Docker, containerd, Podman, or images and removed area: packaging Packages, images, registries, installers, or distribution labels Jun 3, 2026
cv pushed a commit that referenced this pull request Jun 3, 2026
## Summary
- Add the missing `v0.0.57` release-notes section with links to the
detailed docs pages for command, inference, onboarding, messaging,
status, installer, and policy changes.
- Remove public references to docs-skip terms from source docs and
regenerate the NemoClaw user skills from the current Fern MDX docs.
- Carry forward generated references for the per-agent documentation
split, including Hermes-specific reference files.

## Source summary
- #4615 and #4653 -> `docs/about/release-notes.mdx`,
`docs/reference/commands.mdx`: Release notes now cover host-side
`sessions` and `agents` commands plus `NEMOCLAW_EXTRA_AGENTS_JSON`
secondary-agent baking.
- #4163, #4204, #4611, #4619, and #4676 ->
`docs/about/release-notes.mdx`,
`docs/inference/use-local-inference.mdx`: Release notes now cover
managed vLLM progress/readiness, DGX Spark model default changes, local
Ollama streaming usage, and inference route divergence warnings.
- #4267, #4601, #4609, #4642, #4645, and #4661 ->
`docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release
notes now cover UFW auto-remediation, local-inference reachability
gates, gateway reuse/binding, cancel rollback, and policy selection
persistence.
- #4577, #4582, #4607, and #4660 -> `docs/about/release-notes.mdx`,
`docs/manage-sandboxes/messaging-channels.mdx`: Release notes now cover
Slack validation, atomic `channels add`, WhatsApp QR diagnostics, and
Slack placeholder normalization.
- #4388, #4600, #4646, and #4647 -> `docs/about/release-notes.mdx`,
`docs/reference/commands.mdx`: Release notes now cover status failure
layers, paused-container hints, Docker-driver doctor behavior, and
non-destructive stale-registry recovery.
- #4569, #4579, and #4678 -> `docs/about/release-notes.mdx`,
`docs/manage-sandboxes/lifecycle.mdx`,
`docs/network-policy/integration-policy-examples.mdx`: Release notes now
cover installer tag pinning, PyPI `uv` policy access, and observable
Jira validation.
- #4632 -> `.agents/skills/`: Regenerated user skills from the current
per-agent docs source, including newly generated Hermes reference files.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" docs --glob "*.mdx"`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" .agents/skills --glob "*.md"`
- `npm run docs`
- `npm run build:cli`
- Commit hooks: markdownlint, docs-to-skills verification, gitleaks,
skills YAML, commitlint

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Restructured documentation to clearly distinguish OpenClaw and Hermes
agent variants throughout user guides.
* Enhanced security, credential storage, and deployment guidance with
clearer setup flows.
  * Added Hermes plugin installation and ecosystem documentation.
* Improved workspace, messaging, and policy management references with
variant-specific command examples.
  * Refined troubleshooting and CLI reference sections for clarity.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression platform: container Affects Docker, containerd, Podman, or images platform: ubuntu Affects Ubuntu Linux environments v0.0.57 Release target

Projects

None yet

4 participants