fix(sandbox): stop false Docker unhealthy and add paused-container hint (#4503, #4495)#4600
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (8)
🚧 Files skipped from review as they are similar to previous changes (8)
📝 WalkthroughWalkthroughAdds a marker-gated Docker HEALTHCHECK and creates the marker only when serving the in-container gateway; introduces ChangesPaused-Container Status & Healthcheck Marker
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
…nt (NVIDIA#4503, NVIDIA#4495) Docker-driver sandbox container state could mislead operators in two ways. `pgrep 'openclaw[ -]gateway'` on curl exit 7, marking functional Docker-driver sandboxes unhealthy when the OpenClaw gateway runs outside this container's namespace (on the host) so no in-container process or listener exists, even though `nemoclaw status`/OpenShell report Ready. nemoclaw-start now drops a `/tmp/nemoclaw-gateway-local` marker early on the gateway-serving path. The HEALTHCHECK keeps the strict local liveness check (pgrep + non-empty gateway log) only when that marker is present (standalone/Compose NVIDIA#1430 and the NVIDIA#3975 forwarded-port shape); when it is absent the gateway is delivered out-of-namespace, so an in-container probe cannot prove failure and the check reports healthy, deferring to NemoClaw/OpenShell host-side delivery-chain monitoring. Writing the marker early ensures a slow in-container startup is governed by the strict check rather than masked as healthy. Genuine bad signals (curl timeout 28, HTTP 4xx/5xx 22) still report unhealthy. `Phase: Error` even though the sandbox is intact. NemoClaw no longer just prints the raw phase: `getSandboxDockerRuntime` now also reads `.State.Paused`, and `status` adds an actionable `docker unpause <name>` recovery hint (and skips the misleading rebuild suggestion) while keeping OpenShell's authoritative phase verbatim. `dockerPaused` is exposed in the structured `--json` report. The phase-mapping fix itself belongs upstream in OpenShell. Regression coverage: marker-aware healthcheck cases including the NVIDIA#4503 out-of-namespace path (test/sandbox-provisioning.test.ts), the early-marker gateway-serving behavior (test/nemoclaw-start.test.ts), `getSandboxDockerRuntime` paused parsing (docker-health.test.ts), and the paused-container status hint plus dockerPaused report field (test/cli.test.ts). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
330af4e to
f3e4696
Compare
prekshivyas
left a comment
There was a problem hiding this comment.
Approving — reviewed the full diff against #4503 and #4495; well-scoped and the logic holds.
#4503 (false (unhealthy)): the marker gate [ -f /tmp/nemoclaw-gateway-local ] || exit 0 sits correctly on the curl-exit-7 path only. Connection-refused with the marker absent (OpenShell docker-driver, gateway out of this namespace) now reports healthy instead of driving Docker off a signal it cannot observe; genuine bad signals (curl exit 28 wedged, 22 HTTP 4xx/5xx) still exit 1. The marker is dropped early in nemoclaw-start.sh on the gateway-serving path (empty NEMOCLAW_CMD) and stays absent for one-shot commands, so a slow/hung boot stays governed by the strict pgrep+log check rather than masked as healthy.
#4495 (paused container): getSandboxDockerRuntime reads .State.Paused alongside health, and status prints a docker unpause hint while keeping OpenShell's authoritative Phase: Error verbatim — the upstream phase-mapping fix correctly stays out of scope. dockerPaused is surfaced in --json.
Test coverage is thorough (8 new docker-health cases, marker-drop, paused-status, JSON field). All gates green, no unresolved threads.
Non-blocking: getSandboxDockerRuntime resolves twice per status invocation (report builder + printer) = one redundant docker inspect. Worth folding into the snapshot later, not a blocker.
## Summary - Add the missing `v0.0.57` release-notes section with links to the detailed docs pages for command, inference, onboarding, messaging, status, installer, and policy changes. - Remove public references to docs-skip terms from source docs and regenerate the NemoClaw user skills from the current Fern MDX docs. - Carry forward generated references for the per-agent documentation split, including Hermes-specific reference files. ## Source summary - #4615 and #4653 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover host-side `sessions` and `agents` commands plus `NEMOCLAW_EXTRA_AGENTS_JSON` secondary-agent baking. - #4163, #4204, #4611, #4619, and #4676 -> `docs/about/release-notes.mdx`, `docs/inference/use-local-inference.mdx`: Release notes now cover managed vLLM progress/readiness, DGX Spark model default changes, local Ollama streaming usage, and inference route divergence warnings. - #4267, #4601, #4609, #4642, #4645, and #4661 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover UFW auto-remediation, local-inference reachability gates, gateway reuse/binding, cancel rollback, and policy selection persistence. - #4577, #4582, #4607, and #4660 -> `docs/about/release-notes.mdx`, `docs/manage-sandboxes/messaging-channels.mdx`: Release notes now cover Slack validation, atomic `channels add`, WhatsApp QR diagnostics, and Slack placeholder normalization. - #4388, #4600, #4646, and #4647 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover status failure layers, paused-container hints, Docker-driver doctor behavior, and non-destructive stale-registry recovery. - #4569, #4579, and #4678 -> `docs/about/release-notes.mdx`, `docs/manage-sandboxes/lifecycle.mdx`, `docs/network-policy/integration-policy-examples.mdx`: Release notes now cover installer tag pinning, PyPI `uv` policy access, and observable Jira validation. - #4632 -> `.agents/skills/`: Regenerated user skills from the current per-agent docs source, including newly generated Hermes reference files. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" docs --glob "*.mdx"` - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" .agents/skills --glob "*.md"` - `npm run docs` - `npm run build:cli` - Commit hooks: markdownlint, docs-to-skills verification, gitleaks, skills YAML, commitlint <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Restructured documentation to clearly distinguish OpenClaw and Hermes agent variants throughout user guides. * Enhanced security, credential storage, and deployment guidance with clearer setup flows. * Added Hermes plugin installation and ecosystem documentation. * Improved workspace, messaging, and policy management references with variant-specific command examples. * Refined troubleshooting and CLI reference sections for clarity. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Fixes two Docker-driver sandbox container-state issues that mislead operators: a HEALTHCHECK that falsely marks functional Docker-driver sandboxes
(unhealthy), and aPhase: Errorfor a paused container with no recovery guidance. Both share the Docker-driver container-state / health-signal / status-UX axis.Related Issue
Fixes #4503
Fixes #4495
Changes
#4503 — stop false Docker
(unhealthy)scripts/nemoclaw-start.shdrops a/tmp/nemoclaw-gateway-localmarker early on the gateway-serving path (before the long startup work), signalling that this container runs the in-container OpenClaw gateway.DockerfileHEALTHCHECK keeps the strict local liveness fallback (pgrep 'openclaw[ -]gateway'+ non-empty gateway log) on curl exit 7 only when the marker is present (standalone/Compose No Dockerfile HEALTHCHECK — Unhealthy Containers Not Detected in Standalone Docker Deployments - IssueFinder - SN 08 #1430 and the [DGX Spark][Upgrade] post-rebuild gateway HTTP 0 + container Docker-unhealthy on aarch64 (regression vs main) #3975 forwarded-port shape). When the marker is absent, the gateway is delivered outside this container's namespace (OpenShell docker-driver deployments run it on the host), so an in-container probe cannot prove failure — the check reports healthy and defers to NemoClaw/OpenShell host-side delivery-chain monitoring. Writing the marker early ensures a slow in-container startup is governed by the strict check, not masked as healthy. Genuine bad signals (curl timeout exit 28, HTTP 4xx/5xx exit 22) still report unhealthy.#4495 — paused-container recovery hint without rewriting the phase
src/lib/actions/sandbox/docker-health.tsaddsgetSandboxDockerRuntime, which resolves the docker-driver container once and reads both.State.Health.Statusand.State.Paused.src/lib/actions/sandbox/status.tskeeps OpenShell's authoritativePhase: Errorverbatim but, when the resolved container is paused, prints an actionabledocker unpause <container>recovery hint and skips the misleading rebuild suggestion.dockerPausedis exposed in the structured--jsonreport. The authoritative phase-mapping fix itself belongs upstream in OpenShell.Regression coverage
test/sandbox-provisioning.test.ts: marker-aware healthcheck cases, including the [Ubuntu 24.04][Sandbox][GitHub Issue #4503] sandbox Docker HEALTHCHECK always (unhealthy) — pgrep openclaw-gateway runs inside container but gateway runs on host #4503 out-of-namespace path (curl exit 7 + no in-container gateway + marker absent → healthy) and that wedged/HTTP-error signals still fail.test/nemoclaw-start.test.ts: the marker is dropped on the gateway-serving path and stays absent for one-shot commands.src/lib/actions/sandbox/docker-health.test.ts:getSandboxDockerRuntimepaused parsing and resolution contract.test/cli.test.ts: status surfaces the paused hint without rewritingPhase: Error, plus thedockerPausedreport field.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesnpm run docsbuilds without warnings (doc changes only)Notes on verification: targeted tests for all changed code pass (docker-health 21, sandbox-provisioning 28, the new start-script marker test, and the paused-status cli test). The Dockerfile HEALTHCHECK fix was reproduced and verified end-to-end with a faithful bash fixture across 8 scenarios (curl exit 0/7/22/28 × marker present/absent × process/log present/absent). The remaining failures in a full
cli-project run are environmental — load-induced timeouts on CLI-spawning oclif/help tests and one umask-dependent directory mode-bits test inconfig-sync.ts(a file this PR does not touch); all pass when run in isolation. CI runs these on dedicated runners.🤖 Generated with Claude Code
Signed-off-by: Yimo Jiang yimoj@nvidia.com
Summary by CodeRabbit
New Features
docker unpause <container>suggestion.Bug Fixes
Tests