fix(docker): align in-container healthcheck with delivery chain on OpenShell-managed runtimes#4180
Conversation
…enShell-managed runtimes (NVIDIA#3975) The Dockerfile HEALTHCHECK shipped only an in-container curl http://127.0.0.1:${port}/health probe. On runtime shapes where the dashboard port lives in a different network namespace (DGX Spark / aarch64 with OpenShell-managed forwarding), that probe gets a connection-refused error even though the OpenShell host forward and NemoClaw status report the sandbox as Ready. Docker then marks the container unhealthy, contradicting the delivery chain. Make the HEALTHCHECK a layered probe: 1. Direct curl /health — definitive fast path; preserves the existing standalone/Compose health signal. 2. ONLY on curl exit 7 ("Couldn't connect"), fall back to verifying the OpenClaw gateway process is alive in this container (via pgrep --ignore-ancestors so it cannot self-match the healthcheck shell) AND that /tmp/gateway.log is non-empty (proves the gateway came up). Connect timeouts (exit 28) and HTTP 4xx/5xx still fail, so wedged or broken gateways are still restarted. 3. The process pattern matches both 'openclaw gateway run' and 'openclaw-gateway' — the same launcher/re-exec variant set the host-side stop script in tunnel/services.ts already uses. Surface the raw Docker health state alongside the host-side delivery chain in nemoclaw status so an older image without the fallback can still tell users when Docker's signal disagrees with reality, with a pointer to this issue. The diagnostic is intentionally read-only and does not auto-downgrade Docker's signal because the existing in-sandbox probe shares the same 127.0.0.1 endpoint and cannot independently confirm the mismatch. Regression coverage: - New healthcheck unit fixtures in test/sandbox-provisioning.test.ts cover the fast-path success, the namespace-mismatch fallback, the wedged-server timeout (must remain unhealthy), HTTP error (must remain unhealthy), missing/empty gateway log, and gateway-down. - New src/lib/actions/sandbox/docker-health.test.ts covers driver detection, container name resolution including the prefix-disambiguation rule that protects sandboxes whose names are prefixes of others (e.g. my vs my-assistant, my-assistant vs my-assistant-prod). Fixes NVIDIA#3975. Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughAdds a two-stage Dockerfile HEALTHCHECK (HTTP /health probe with connection-refused process/log fallback), a new getSandboxDockerHealth module that resolves and normalizes Docker container health, CLI status output for Docker health, and unit/regression tests validating the fallback semantics. ChangesDocker Health Detection and Display
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/lib/actions/sandbox/docker-health.test.ts (1)
91-98: ⚡ Quick winMake this case actually test “suffixed candidate first”
The test title says exact vs suffixed candidate ordering, but current
psNamesonly covers prefix disambiguation against another sandbox name. Use a same-sandbox suffixed + exact pair to validate the intended behavior.Suggested test update
it("prefers an exact-name match even when the docker ps listing returns a suffixed candidate first", () => { const deps = fixture({ - psNames: "openshell-my-assistant\nopenshell-my\n", + psNames: "openshell-my-12ab\nopenshell-my\n", healthRaw: "healthy", - knownSandboxes: ["my", "my-assistant"], + knownSandboxes: ["my"], }); expect(getSandboxDockerHealth("my", deps).containerName).toBe("openshell-my"); });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/actions/sandbox/docker-health.test.ts` around lines 91 - 98, Update the test to actually simulate a same-sandbox suffixed candidate appearing before the exact name: in the spec that currently calls getSandboxDockerHealth("my", deps) make deps.psNames start with a suffixed container for the same sandbox (e.g. "openshell-my-1\nopenshell-my\n" or "openshell-my-v2\nopenshell-my\n") and set knownSandboxes to only include "my" (remove "my-assistant"); keep the expectation expect(getSandboxDockerHealth("my", deps).containerName).toBe("openshell-my") to assert the exact-name is preferred even when the suffixed candidate appears first.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/actions/sandbox/docker-health.ts`:
- Around line 74-84: The loop that picks an owned candidate can choose a
suffixed container (openshell-<sandbox>-<id>) over the exact container
(openshell-<sandbox>) because it only sorts by length; update the selection in
the for loop that iterates over candidates to prefer an exact stripped === name
match before accepting starts-with matches: for each candidate, compute stripped
and then look up owner by first checking for knownSandboxes entry where stripped
=== name, and only if none found fall back to the startsWith(`${name}-`) test
(or adjust the filter/sort to prioritize exact matches), using the existing
variables candidates, knownSandboxes, sandboxName and owner to locate and return
the exact container when present.
---
Nitpick comments:
In `@src/lib/actions/sandbox/docker-health.test.ts`:
- Around line 91-98: Update the test to actually simulate a same-sandbox
suffixed candidate appearing before the exact name: in the spec that currently
calls getSandboxDockerHealth("my", deps) make deps.psNames start with a suffixed
container for the same sandbox (e.g. "openshell-my-1\nopenshell-my\n" or
"openshell-my-v2\nopenshell-my\n") and set knownSandboxes to only include "my"
(remove "my-assistant"); keep the expectation
expect(getSandboxDockerHealth("my", deps).containerName).toBe("openshell-my") to
assert the exact-name is preferred even when the suffixed candidate appears
first.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: dd406233-2c48-4656-b9cf-064d5a4934f9
📒 Files selected for processing (5)
Dockerfilesrc/lib/actions/sandbox/docker-health.test.tssrc/lib/actions/sandbox/docker-health.tssrc/lib/actions/sandbox/status.tstest/sandbox-provisioning.test.ts
…IDIA#3975) Address CodeRabbit feedback on NVIDIA#4180: when both `openshell-<name>` and `openshell-<name>-<id>` containers coexist, the longest-known-name heuristic still returns the suffixed one if it appears earlier in `docker ps`, because the runtime suffix is not itself a registered sandbox so the resolver falls back to `<name>` and matches our query. Short-circuit on exact-name match before walking the candidate list. Add a regression test covering the docker-runtime-suffixed sibling case. Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
Summary
The Dockerfile HEALTHCHECK ran an in-container
curl http://127.0.0.1:${port}/health. On runtime shapes where the dashboard port lives in a different network namespace (DGX Spark / aarch64 with OpenShell-managed forwarding), that probe gets connection-refused even though the OpenShell host forward andnemoclaw statusshow the sandbox as Ready, so Docker marks the container unhealthy and contradicts the actual delivery chain (#3975).Related Issue
Fixes #3975.
Changes
pgrep --ignore-ancestors -f 'openclaw[ -]gateway'— the same launcher/re-exec variant set the host stop script intunnel/services.tsalready uses, and--ignore-ancestorsprevents the healthcheck shell from self-matching) AND that/tmp/gateway.logis non-empty (proves the gateway came up). Curl timeouts (exit 28) and HTTP 4xx/5xx still fail, so wedged or broken gateways are still restarted.nemoclaw statusso older images without the fallback can still tell users when Docker's signal disagrees with the delivery chain, with a pointer back to this issue. The new lookup resolves sandbox container names using the longest-matching registered sandbox name so it cannot accidentally surface another sandbox's health (e.g. formyvsmy-assistant, ormy-assistantvsmy-assistant-prod).test/sandbox-provisioning.test.ts(fast-path success, namespace-mismatch fallback, wedged-server timeout stays unhealthy, HTTP error stays unhealthy, missing log stays unhealthy) and unit tests insrc/lib/actions/sandbox/docker-health.test.ts(driver detection, exact-match preference, prefix disambiguation).Type of Change
Verification
npm testpasses for the changed slice (test/sandbox-provisioning.test.ts,src/lib/actions/sandbox/*) — fullnpm testhad pre-existing failures insrc/lib/inference/local.test.tsandtest/cli.test.tsthat reproduce onupstream/mainand are unrelated to this changecodex reviewclean on the diff after iteration (only[P3]about the untrackedissue-*.mdscratch note remained, which is not part of the commit)make docsbuilds without warnings (doc changes only) — N/ANote: DGX Spark / aarch64 hardware was not available locally, so the bug was reproduced at the unit/fixture level by demonstrating that the existing
curl 127.0.0.1:${port}/healthhealthcheck command exits 7 when no listener is bound — the exact shape the reporter saw — while the host-side delivery chain stays reachable. The new tests assert the layered probe handles that case without masking real failure modes.Signed-off-by: Yimo Jiang yimoj@nvidia.com
Summary by CodeRabbit
New Features
Tests