Skip to content

fix(docker): align in-container healthcheck with delivery chain on OpenShell-managed runtimes#4180

Merged
cv merged 2 commits into
NVIDIA:mainfrom
yimoj:fix/3975-docker-healthcheck-endpoint
May 25, 2026
Merged

fix(docker): align in-container healthcheck with delivery chain on OpenShell-managed runtimes#4180
cv merged 2 commits into
NVIDIA:mainfrom
yimoj:fix/3975-docker-healthcheck-endpoint

Conversation

@yimoj

@yimoj yimoj commented May 25, 2026

Copy link
Copy Markdown
Contributor

Summary

The Dockerfile HEALTHCHECK ran an in-container curl http://127.0.0.1:${port}/health. On runtime shapes where the dashboard port lives in a different network namespace (DGX Spark / aarch64 with OpenShell-managed forwarding), that probe gets connection-refused even though the OpenShell host forward and nemoclaw status show the sandbox as Ready, so Docker marks the container unhealthy and contradicts the actual delivery chain (#3975).

Related Issue

Fixes #3975.

Changes

  • Dockerfile: layered HEALTHCHECK. Fast path remains the direct curl. ONLY on curl exit 7 ("Couldn't connect") we now fall back to verifying the OpenClaw gateway process is alive (pgrep --ignore-ancestors -f 'openclaw[ -]gateway' — the same launcher/re-exec variant set the host stop script in tunnel/services.ts already uses, and --ignore-ancestors prevents the healthcheck shell from self-matching) AND that /tmp/gateway.log is non-empty (proves the gateway came up). Curl timeouts (exit 28) and HTTP 4xx/5xx still fail, so wedged or broken gateways are still restarted.
  • src/lib/actions/sandbox/status.ts + new docker-health.ts: surface the raw Docker health state in nemoclaw status so older images without the fallback can still tell users when Docker's signal disagrees with the delivery chain, with a pointer back to this issue. The new lookup resolves sandbox container names using the longest-matching registered sandbox name so it cannot accidentally surface another sandbox's health (e.g. for my vs my-assistant, or my-assistant vs my-assistant-prod).
  • Tests: regression coverage in test/sandbox-provisioning.test.ts (fast-path success, namespace-mismatch fallback, wedged-server timeout stays unhealthy, HTTP error stays unhealthy, missing log stays unhealthy) and unit tests in src/lib/actions/sandbox/docker-health.test.ts (driver detection, exact-match preference, prefix disambiguation).

Type of Change

  • Code change (feature, bug fix, or refactor)

Verification

  • npm test passes for the changed slice (test/sandbox-provisioning.test.ts, src/lib/actions/sandbox/*) — full npm test had pre-existing failures in src/lib/inference/local.test.ts and test/cli.test.ts that reproduce on upstream/main and are unrelated to this change
  • Tests added for new and changed behavior
  • No secrets, API keys, or credentials committed
  • codex review clean on the diff after iteration (only [P3] about the untracked issue-*.md scratch note remained, which is not part of the commit)
  • make docs builds without warnings (doc changes only) — N/A

Note: DGX Spark / aarch64 hardware was not available locally, so the bug was reproduced at the unit/fixture level by demonstrating that the existing curl 127.0.0.1:${port}/health healthcheck command exits 7 when no listener is bound — the exact shape the reporter saw — while the host-side delivery chain stays reachable. The new tests assert the layered probe handles that case without masking real failure modes.


Signed-off-by: Yimo Jiang yimoj@nvidia.com

Summary by CodeRabbit

  • New Features

    • Improved container health checking with dynamic port detection and a fallback verification when the primary HTTP probe fails.
    • Status command now shows a dedicated Docker health line for active sandboxes with clearer healthy/starting/unhealthy distinctions.
  • Tests

    • Added extensive tests covering healthcheck behavior, fallback logic, container selection, and reported states.

Review Change Stack

…enShell-managed runtimes (NVIDIA#3975)

The Dockerfile HEALTHCHECK shipped only an in-container
curl http://127.0.0.1:${port}/health probe. On runtime shapes where
the dashboard port lives in a different network namespace (DGX Spark
/ aarch64 with OpenShell-managed forwarding), that probe gets a
connection-refused error even though the OpenShell host forward and
NemoClaw status report the sandbox as Ready. Docker then marks the
container unhealthy, contradicting the delivery chain.

Make the HEALTHCHECK a layered probe:

  1. Direct curl /health — definitive fast path; preserves the
     existing standalone/Compose health signal.
  2. ONLY on curl exit 7 ("Couldn't connect"), fall back to verifying
     the OpenClaw gateway process is alive in this container (via
     pgrep --ignore-ancestors so it cannot self-match the healthcheck
     shell) AND that /tmp/gateway.log is non-empty (proves the
     gateway came up). Connect timeouts (exit 28) and HTTP 4xx/5xx
     still fail, so wedged or broken gateways are still restarted.
  3. The process pattern matches both 'openclaw gateway run' and
     'openclaw-gateway' — the same launcher/re-exec variant set the
     host-side stop script in tunnel/services.ts already uses.

Surface the raw Docker health state alongside the host-side delivery
chain in nemoclaw status so an older image without the fallback can
still tell users when Docker's signal disagrees with reality, with a
pointer to this issue. The diagnostic is intentionally read-only and
does not auto-downgrade Docker's signal because the existing
in-sandbox probe shares the same 127.0.0.1 endpoint and cannot
independently confirm the mismatch.

Regression coverage:
- New healthcheck unit fixtures in test/sandbox-provisioning.test.ts
  cover the fast-path success, the namespace-mismatch fallback, the
  wedged-server timeout (must remain unhealthy), HTTP error (must
  remain unhealthy), missing/empty gateway log, and gateway-down.
- New src/lib/actions/sandbox/docker-health.test.ts covers driver
  detection, container name resolution including the
  prefix-disambiguation rule that protects sandboxes whose names are
  prefixes of others (e.g. my vs my-assistant, my-assistant vs
  my-assistant-prod).

Fixes NVIDIA#3975.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
@coderabbitai

coderabbitai Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 484c698d-a825-4e0c-b44f-d9f83da31f66

📥 Commits

Reviewing files that changed from the base of the PR and between 225ff1c and fa24903.

📒 Files selected for processing (2)
  • src/lib/actions/sandbox/docker-health.test.ts
  • src/lib/actions/sandbox/docker-health.ts

📝 Walkthrough

Walkthrough

Adds a two-stage Dockerfile HEALTHCHECK (HTTP /health probe with connection-refused process/log fallback), a new getSandboxDockerHealth module that resolves and normalizes Docker container health, CLI status output for Docker health, and unit/regression tests validating the fallback semantics.

Changes

Docker Health Detection and Display

Layer / File(s) Summary
Docker health detection module and tests
src/lib/actions/sandbox/docker-health.ts, src/lib/actions/sandbox/docker-health.test.ts
Defines DockerHealthState and SandboxDockerHealth; resolves the correct openshell container name among candidates, disambiguates prefix collisions, reads State.Health.Status via docker inspect, normalizes values to the exported union, and returns none/unknown appropriately. Unit tests cover normalization, selection logic, missing containers, and inspect errors.
Dockerfile two-stage HEALTHCHECK with fallback
Dockerfile
Replaces single curl -sf probe with port-resolution and conditional logic: HTTP /health success => healthy; curl exit code 7 (connection refused) => fallback to pgrep process check and non-empty gateway log; timeouts (code 28) and other curl errors fail without fallback. Includes inline documentation.
Status command Docker health output
src/lib/actions/sandbox/status.ts
Reorganizes imports and adds Docker health output when the sandbox is present; prints color-coded states (healthy, starting, unhealthy) and guidance for unhealthy noting possible mismatched in-container vs delivery-chain signals (ref: #3975).
Dockerfile HEALTHCHECK fallback regression tests
test/sandbox-provisioning.test.ts
Adds regression suite extracting the production HEALTHCHECK, running it with per-test temp logs and stubbed curl/pgrep exit codes; asserts when fallback should and should not occur (connection-refused with running process, timeout/no fallback, HTTP error/no fallback, missing log failures).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

Docker, OpenShell, fix, Sandbox, v0.0.50

Suggested reviewers

  • ericksoa
  • cv

Poem

🐰 I nibbled logs and chased a curl's tail,
When connections failed I followed the trail,
A process found, a tiny log's light—
Now health checks hop home through the night.
thumps a happy foot 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: aligning in-container Docker healthcheck with delivery chain on OpenShell-managed runtimes, which is the core focus of the PR.
Linked Issues check ✅ Passed The PR directly addresses #3975 by implementing a layered HEALTHCHECK with fallback logic to resolve false-unhealthy signals due to namespace mismatches, and surfacing Docker health state in status reporting.
Out of Scope Changes check ✅ Passed All changes (Dockerfile HEALTHCHECK, docker-health utilities, status display, and regression tests) are directly scoped to resolving the namespace-mismatch healthcheck issue and preventing false-unhealthy states.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/lib/actions/sandbox/docker-health.test.ts (1)

91-98: ⚡ Quick win

Make this case actually test “suffixed candidate first”

The test title says exact vs suffixed candidate ordering, but current psNames only covers prefix disambiguation against another sandbox name. Use a same-sandbox suffixed + exact pair to validate the intended behavior.

Suggested test update
   it("prefers an exact-name match even when the docker ps listing returns a suffixed candidate first", () => {
     const deps = fixture({
-      psNames: "openshell-my-assistant\nopenshell-my\n",
+      psNames: "openshell-my-12ab\nopenshell-my\n",
       healthRaw: "healthy",
-      knownSandboxes: ["my", "my-assistant"],
+      knownSandboxes: ["my"],
     });
     expect(getSandboxDockerHealth("my", deps).containerName).toBe("openshell-my");
   });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/actions/sandbox/docker-health.test.ts` around lines 91 - 98, Update
the test to actually simulate a same-sandbox suffixed candidate appearing before
the exact name: in the spec that currently calls getSandboxDockerHealth("my",
deps) make deps.psNames start with a suffixed container for the same sandbox
(e.g. "openshell-my-1\nopenshell-my\n" or "openshell-my-v2\nopenshell-my\n") and
set knownSandboxes to only include "my" (remove "my-assistant"); keep the
expectation expect(getSandboxDockerHealth("my",
deps).containerName).toBe("openshell-my") to assert the exact-name is preferred
even when the suffixed candidate appears first.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/actions/sandbox/docker-health.ts`:
- Around line 74-84: The loop that picks an owned candidate can choose a
suffixed container (openshell-<sandbox>-<id>) over the exact container
(openshell-<sandbox>) because it only sorts by length; update the selection in
the for loop that iterates over candidates to prefer an exact stripped === name
match before accepting starts-with matches: for each candidate, compute stripped
and then look up owner by first checking for knownSandboxes entry where stripped
=== name, and only if none found fall back to the startsWith(`${name}-`) test
(or adjust the filter/sort to prioritize exact matches), using the existing
variables candidates, knownSandboxes, sandboxName and owner to locate and return
the exact container when present.

---

Nitpick comments:
In `@src/lib/actions/sandbox/docker-health.test.ts`:
- Around line 91-98: Update the test to actually simulate a same-sandbox
suffixed candidate appearing before the exact name: in the spec that currently
calls getSandboxDockerHealth("my", deps) make deps.psNames start with a suffixed
container for the same sandbox (e.g. "openshell-my-1\nopenshell-my\n" or
"openshell-my-v2\nopenshell-my\n") and set knownSandboxes to only include "my"
(remove "my-assistant"); keep the expectation
expect(getSandboxDockerHealth("my", deps).containerName).toBe("openshell-my") to
assert the exact-name is preferred even when the suffixed candidate appears
first.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: dd406233-2c48-4656-b9cf-064d5a4934f9

📥 Commits

Reviewing files that changed from the base of the PR and between 50c208b and 225ff1c.

📒 Files selected for processing (5)
  • Dockerfile
  • src/lib/actions/sandbox/docker-health.test.ts
  • src/lib/actions/sandbox/docker-health.ts
  • src/lib/actions/sandbox/status.ts
  • test/sandbox-provisioning.test.ts

Comment thread src/lib/actions/sandbox/docker-health.ts
…IDIA#3975)

Address CodeRabbit feedback on NVIDIA#4180: when both `openshell-<name>` and
`openshell-<name>-<id>` containers coexist, the longest-known-name
heuristic still returns the suffixed one if it appears earlier in
`docker ps`, because the runtime suffix is not itself a registered
sandbox so the resolver falls back to `<name>` and matches our query.

Short-circuit on exact-name match before walking the candidate list.
Add a regression test covering the docker-runtime-suffixed sibling
case.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression v0.0.51 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[DGX Spark][Upgrade] post-rebuild gateway HTTP 0 + container Docker-unhealthy on aarch64 (regression vs main)

3 participants