fix(onboard): probe gateway container before trusting stale health metadata#2032
Conversation
…tadata (Fixes #2020) The openshell CLI reports stale "Connected" status from its local state file even after the Docker container has been removed. This causes onboard step 2 to skip gateway startup, leading to "Connection refused" in step 4. Add a Docker container liveness probe (verifyGatewayContainerRunning) that checks the actual container state before trusting a "healthy" reuse decision in both the preflight port check and the main onboard flow. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughOnboarding now performs a Docker liveness probe when cached gateway metadata reports "healthy". If the gateway container is missing, the flow stops dashboard forwarding, destroys the gateway, clears the local registry, and downgrades reuse state to "missing"; "unknown" logs a warning and keeps "healthy". Changes
Sequence DiagramsequenceDiagram
participant Onboard as Onboard
participant State as GatewayState
participant Docker as DockerEngine
participant Forward as DashboardForward
participant Registry as LocalRegistry
Onboard->>State: compute gatewayReuseState
State-->>Onboard: "healthy"
Onboard->>Docker: docker inspect --type container\nopenshell-cluster-${GATEWAY_NAME} -> {{.State.Running}}
Docker-->>Onboard: "running" / "missing" / "unknown"
alt missing
Onboard->>Forward: stop forwarding
Onboard->>Onboard: log cleanup / mark stale
Onboard->>Docker: destroy gateway
Onboard->>Registry: clear local registry
Onboard->>State: set gatewayReuseState = "missing"
else running
Onboard->>State: keep gatewayReuseState = "healthy"
else unknown
Onboard->>Onboard: log warning and proceed with cached "healthy"
end
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
test/gateway-liveness-probe.test.ts (1)
62-70: Scope the purity check toisGatewayHealthy()instead of the whole file.These assertions will fail if any unrelated helper in
gateway-state.tsever usesdocker,spawn, orexec, even whenisGatewayHealthy()stays pure. Narrowing the check to that function body will make the test less noisy.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/gateway-liveness-probe.test.ts` around lines 62 - 70, The test currently scans the entire gateway-state.ts file; instead extract only the isGatewayHealthy() function body and run the purity assertions against that snippet. Locate isGatewayHealthy in gateway-state.ts (handle both "function isGatewayHealthy(...){...}" and "export const isGatewayHealthy = (...) => { ... }" forms), capture the function body with a regex, then assert that the captured body does not contain "docker", "spawn", or "exec" instead of scanning the whole file.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2374-2379: The stale-gateway branches currently call
runOpenshell(["gateway","destroy",...]) directly and bypass the shared cleanup
logic; replace that direct destroy call with the shared destroyGateway()
function (which performs Docker volume cleanup), keep the existing
runOpenshell(["forward","stop",...], {ignoreError:true}) if needed, then call
registry.clearAll() and set gatewayReuseState = "missing" as before; apply the
same replacement in the other identical stale-metadata branch that checks
verifyGatewayContainerRunning() (the other occurrence noted in the review).
In `@test/gateway-liveness-probe.test.ts`:
- Around line 38-43: The current regex used in mainFlowProbe can match across
the earlier preflight() block; tighten it so the sequence occurs inside the
onboard function by requiring "onboard" between getGatewayReuseState and
canReuseHealthyGateway. Update the content.match pattern to include "onboard"
(for example require 'onboard\\s*\\(' or 'function onboard' before the
verifyGatewayContainerRunning\\(\\) call) so that the match ensures
verifyGatewayContainerRunning() is invoked specifically within onboard() and not
from preflight(); target symbols: getGatewayReuseState,
verifyGatewayContainerRunning(), onboard(, and const canReuseHealthyGateway.
---
Nitpick comments:
In `@test/gateway-liveness-probe.test.ts`:
- Around line 62-70: The test currently scans the entire gateway-state.ts file;
instead extract only the isGatewayHealthy() function body and run the purity
assertions against that snippet. Locate isGatewayHealthy in gateway-state.ts
(handle both "function isGatewayHealthy(...){...}" and "export const
isGatewayHealthy = (...) => { ... }" forms), capture the function body with a
regex, then assert that the captured body does not contain "docker", "spawn", or
"exec" instead of scanning the whole file.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 9e6e5e92-90e8-4d61-a656-832a0bcd1622
📒 Files selected for processing (2)
src/lib/onboard.tstest/gateway-liveness-probe.test.ts
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/lib/onboard.ts (1)
2374-2378:⚠️ Potential issue | 🟠 MajorUse
destroyGateway()in both stale-metadata cleanup branches.These branches still destroy the gateway directly, so they skip the Docker volume cleanup in
destroyGateway(). That can leaveopenshell-cluster-nemoclaw*volumes behind after stale-metadata recovery and make the nextgateway startfail.Proposed fix
- runOpenshell(["gateway", "destroy", "-g", GATEWAY_NAME], { ignoreError: true }); + destroyGateway();Also applies to: 5925-5929
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 2374 - 2378, The stale-metadata cleanup branches call openshell directly and skip Docker volume cleanup; replace the direct destroys with a call to destroyGateway() so volume cleanup runs. Specifically, in the branch that checks gatewayReuseState === "healthy" and !verifyGatewayContainerRunning() (and the analogous stale branch further down), remove the direct runOpenshell(["gateway","destroy",...]) and registry.clearAll() sequence and instead call destroyGateway() after stopping forwards so the openshell-cluster-nemoclaw* volumes are removed; keep the runOpenshell(["forward","stop", String(DASHBOARD_PORT)], { ignoreError: true }) step but use destroyGateway() to perform the destroy+cleanup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 128-135: The verifyGatewayContainerRunning function currently
discards stderr and treats any non-zero exit as "not running", which conflates a
missing container with transient Docker daemon errors; modify
verifyGatewayContainerRunning to capture stderr from the run(...) call (instead
of redirecting to /dev/null), inspect stderr for specific signals ("No such
object:" → container missing) and explicitly handle at least three outcomes
(running, missing, unknown/error) — e.g., return an enum/string or distinct
booleans — and update callers to only perform stale-metadata cleanup when stderr
indicates "No such object" (or the explicit "missing" result) while treating
daemon/connection errors as transient (unknown) so cleanup is not triggered.
Ensure you reference the containerName variable and the run invocation when
making the change.
---
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 2374-2378: The stale-metadata cleanup branches call openshell
directly and skip Docker volume cleanup; replace the direct destroys with a call
to destroyGateway() so volume cleanup runs. Specifically, in the branch that
checks gatewayReuseState === "healthy" and !verifyGatewayContainerRunning() (and
the analogous stale branch further down), remove the direct
runOpenshell(["gateway","destroy",...]) and registry.clearAll() sequence and
instead call destroyGateway() after stopping forwards so the
openshell-cluster-nemoclaw* volumes are removed; keep the
runOpenshell(["forward","stop", String(DASHBOARD_PORT)], { ignoreError: true })
step but use destroyGateway() to perform the destroy+cleanup.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 3112519c-a634-4659-bb22-c03fb6961dc3
📒 Files selected for processing (1)
src/lib/onboard.ts
Address CodeRabbit review feedback: - Use destroyGateway() instead of inline gateway destroy calls so Docker volume cleanup is not skipped when recovering from stale metadata - Scope test regex to onboard() function to prevent false-positive match from the preflight block - Scope gateway-state.ts purity check to isGatewayHealthy() body only Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/lib/onboard.ts (1)
128-135:⚠️ Potential issue | 🟠 MajorDifferentiate
missingfrom Docker runtime failures before cleanupLine 131/Line 134 currently treat any non-zero
docker inspectas “container not running.” That also matches daemon/socket failures, which can incorrectly trigger destructive stale cleanup in Line 2374 and Line 5925.🛠️ Suggested fix (tri-state probe)
-function verifyGatewayContainerRunning() { +function getGatewayContainerState() { const containerName = `openshell-cluster-${GATEWAY_NAME}`; const result = run( - `docker inspect --type container --format '{{.State.Running}}' ${containerName} 2>/dev/null`, + `docker inspect --type container --format '{{.State.Running}}' ${containerName}`, { ignoreError: true, suppressOutput: true }, ); - return result.status === 0 && (result.stdout || "").trim() === "true"; + const stdout = String(result.stdout || "").trim().toLowerCase(); + const stderr = String(result.stderr || "").trim().toLowerCase(); + if (result.status === 0) return stdout === "true" ? "running" : "stopped"; + if (stderr.includes("no such object")) return "missing"; + return "unknown"; }Then only run stale-metadata cleanup when state is
"missing"(or explicitly"stopped"if intended), and keep"unknown"non-destructive.#!/bin/bash # Verify current behavior conflates all inspect failures and drives cleanup from a boolean. rg -n --type=ts -C3 'function verifyGatewayContainerRunning|docker inspect --type container --format.*State\.Running|2>/dev/null|return result\.status === 0' src/lib/onboard.ts # Verify both call sites perform cleanup when helper returns false. rg -n --type=ts -C4 'gatewayReuseState === "healthy" && !verifyGatewayContainerRunning\(\)|destroyGateway\(\)|registry\.clearAll\(\)|gatewayReuseState = "missing"' src/lib/onboard.ts🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 128 - 135, verifyGatewayContainerRunning currently returns a boolean and treats any non-zero docker inspect as "not running", which conflates missing containers with Docker/daemon errors; change it into a tri-state probe (e.g., return "healthy" | "missing" | "unknown" or an object { state: "healthy"|"missing"|"unknown", info?: string }) in verifyGatewayContainerRunning by running the inspect command and: if status === 0 and stdout.trim() === "true" -> "healthy"; if status !== 0 and stderr contains Docker's "No such object" (or equivalent) -> "missing"; otherwise -> "unknown" (non-destructive). Then update call sites that currently do gatewayReuseState === "healthy" && !verifyGatewayContainerRunning(...) and the cleanup code paths that call destroyGateway()/registry.clearAll()/set gatewayReuseState = "missing" to only perform destructive cleanup when the probe returns "missing" (or explicitly "stopped" if that is intended), leaving "unknown" to skip destructive actions and surface an error for investigation.
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
2374-2380: Extract stale-gateway cleanup into one helperThe two new stale-metadata branches duplicate the same cleanup sequence. Centralizing this reduces drift risk and keeps preflight/onboard behavior consistent.
♻️ Suggested refactor
+function cleanupStaleGatewayMetadata() { + console.log(" Gateway metadata is stale (container not running). Cleaning up..."); + runOpenshell(["forward", "stop", String(DASHBOARD_PORT)], { ignoreError: true }); + destroyGateway(); + registry.clearAll(); + console.log(" ✓ Stale gateway metadata cleaned up"); +} ... - if (gatewayReuseState === "healthy" && !verifyGatewayContainerRunning()) { - console.log(" Gateway metadata is stale (container not running). Cleaning up..."); - runOpenshell(["forward", "stop", String(DASHBOARD_PORT)], { ignoreError: true }); - destroyGateway(); - registry.clearAll(); - gatewayReuseState = "missing"; - console.log(" ✓ Stale gateway metadata cleaned up"); - } + if (gatewayReuseState === "healthy" && !verifyGatewayContainerRunning()) { + cleanupStaleGatewayMetadata(); + gatewayReuseState = "missing"; + }Also applies to: 5925-5932
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 2374 - 2380, Duplicate stale-gateway cleanup logic should be extracted into a single helper to avoid drift: create a function (e.g., cleanupStaleGateway or handleStaleGateway) that performs the sequence currently duplicated — logging the "stale" message, calling runOpenshell(["forward", "stop", String(DASHBOARD_PORT)], { ignoreError: true }), calling destroyGateway(), calling registry.clearAll(), setting gatewayReuseState = "missing", and logging the completion message — then replace both duplicated branches (the block guarded by gatewayReuseState === "healthy" && !verifyGatewayContainerRunning() and the similar block at the other location) with calls to that helper (keep existing condition checks and only invoke the new helper).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 128-135: verifyGatewayContainerRunning currently returns a boolean
and treats any non-zero docker inspect as "not running", which conflates missing
containers with Docker/daemon errors; change it into a tri-state probe (e.g.,
return "healthy" | "missing" | "unknown" or an object { state:
"healthy"|"missing"|"unknown", info?: string }) in verifyGatewayContainerRunning
by running the inspect command and: if status === 0 and stdout.trim() === "true"
-> "healthy"; if status !== 0 and stderr contains Docker's "No such object" (or
equivalent) -> "missing"; otherwise -> "unknown" (non-destructive). Then update
call sites that currently do gatewayReuseState === "healthy" &&
!verifyGatewayContainerRunning(...) and the cleanup code paths that call
destroyGateway()/registry.clearAll()/set gatewayReuseState = "missing" to only
perform destructive cleanup when the probe returns "missing" (or explicitly
"stopped" if that is intended), leaving "unknown" to skip destructive actions
and surface an error for investigation.
---
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 2374-2380: Duplicate stale-gateway cleanup logic should be
extracted into a single helper to avoid drift: create a function (e.g.,
cleanupStaleGateway or handleStaleGateway) that performs the sequence currently
duplicated — logging the "stale" message, calling runOpenshell(["forward",
"stop", String(DASHBOARD_PORT)], { ignoreError: true }), calling
destroyGateway(), calling registry.clearAll(), setting gatewayReuseState =
"missing", and logging the completion message — then replace both duplicated
branches (the block guarded by gatewayReuseState === "healthy" &&
!verifyGatewayContainerRunning() and the similar block at the other location)
with calls to that helper (keep existing condition checks and only invoke the
new helper).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 78042da6-8ee4-4857-bd3b-c58817a55189
📒 Files selected for processing (2)
src/lib/onboard.tstest/gateway-liveness-probe.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- test/gateway-liveness-probe.test.ts
verifyGatewayContainerRunning() now returns a tri-state
("running" | "missing" | "unknown") instead of a boolean.
Only triggers stale-metadata cleanup when the container is
confirmed missing (stderr contains "No such object"), not when
Docker is temporarily unavailable. This prevents destroying a
healthy gateway during transient Docker failures.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/gateway-liveness-probe.test.ts (1)
48-60: Add an assertion for thedocker inspect => "false"path.These checks confirm tri-state strings exist, but they don’t guard the specific regression where
.State.Runningisfalseand gets treated as"unknown". Add coverage for that path so stopped containers can’t silently bypass cleanup logic.✅ Example assertion to add
it("returns tri-state: running, missing, or unknown", () => { // The helper must distinguish container-removed from Docker-unavailable expect(content).toContain('return "running"'); expect(content).toContain('return "missing"'); expect(content).toContain('return "unknown"'); + // Also ensure successful inspect with "false" is handled explicitly. + expect(content).toMatch(/running\s*===\s*"false"|trim\(\)\s*===\s*"false"/); });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/gateway-liveness-probe.test.ts` around lines 48 - 60, Add a focused assertion that verifies the probe treats a docker inspect result with .State.Running === false as the "missing" path (not "unknown"); locate the test block referencing content and containerState (tests named "returns tri-state: running, missing, or unknown" and "only downgrades to 'missing' when container is confirmed missing") and add an assertion that ensures the generated probe string contains the conditional or return handling for the ".State.Running === false" case (e.g., a check for '.State.Running === false' or the code path that maps that to containerState === "missing") so the stopped-container path is explicitly covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 142-149: The current inspect-result handling treats a successful
docker inspect with stdout "false" as "unknown", allowing cached "healthy"
reuse; change the branch that checks result.status and (result.stdout ||
"").trim() so that when stdout === "false" it returns a non-reusable state
(e.g., "stopped" or "not_running") instead of falling through to "unknown".
Locate the block with the condition if (result.status === 0 && (result.stdout ||
"").trim() === "true") and the stderr handling using (result.stderr ||
"").toString(), and add an explicit check for stdout === "false" returning the
chosen non-reusable state so callers won't reuse a stopped container.
---
Nitpick comments:
In `@test/gateway-liveness-probe.test.ts`:
- Around line 48-60: Add a focused assertion that verifies the probe treats a
docker inspect result with .State.Running === false as the "missing" path (not
"unknown"); locate the test block referencing content and containerState (tests
named "returns tri-state: running, missing, or unknown" and "only downgrades to
'missing' when container is confirmed missing") and add an assertion that
ensures the generated probe string contains the conditional or return handling
for the ".State.Running === false" case (e.g., a check for '.State.Running ===
false' or the code path that maps that to containerState === "missing") so the
stopped-container path is explicitly covered.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 7ab930d3-c66e-4fb8-a7ee-2912c1606336
📒 Files selected for processing (2)
src/lib/onboard.tstest/gateway-liveness-probe.test.ts
When docker inspect succeeds (exit 0) but .State.Running is "false", the container exists but is stopped — treat this as "missing" so stale-metadata cleanup is triggered instead of falling through to "unknown" and keeping cached "healthy" reuse. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
2391-2405: Extract the stale-gateway reconciliation into one helper.The probe, warning, cleanup, and state downgrade are duplicated in both
preflight()andonboard(). Keeping that flow in one place will make the next behavior change much less likely to land in only one path.♻️ Refactor sketch
+function reconcileHealthyGatewayReuseState(gatewayReuseState) { + if (gatewayReuseState !== "healthy") return gatewayReuseState; + + const containerState = verifyGatewayContainerRunning(); + if (containerState === "missing") { + console.log(" Gateway metadata is stale (container not running). Cleaning up..."); + runOpenshell(["forward", "stop", String(DASHBOARD_PORT)], { ignoreError: true }); + destroyGateway(); + registry.clearAll(); + console.log(" ✓ Stale gateway metadata cleaned up"); + return "missing"; + } + + if (containerState === "unknown") { + console.log( + " Warning: could not verify gateway container state (Docker may be unavailable). Proceeding with cached health status.", + ); + } + + return gatewayReuseState; +}- let gatewayReuseState = getGatewayReuseState(gatewayStatus, gwInfo, activeGatewayInfo); - - if (gatewayReuseState === "healthy") { - const containerState = verifyGatewayContainerRunning(); - ... - } + let gatewayReuseState = reconcileHealthyGatewayReuseState( + getGatewayReuseState(gatewayStatus, gwInfo, activeGatewayInfo), + );Also applies to: 5947-5961
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 2391 - 2405, Extract the duplicated stale-gateway reconciliation logic into a single helper (e.g., reconcileStaleGateway or ensureGatewayRunning) and call it from both preflight() and onboard(); the helper should call verifyGatewayContainerRunning(), emit the same warning when containerState === "unknown", perform the cleanup steps (runOpenshell(["forward","stop", String(DASHBOARD_PORT)], { ignoreError: true }), destroyGateway(), registry.clearAll()) and downgrade gatewayReuseState to "missing" when containerState === "missing", and return the final state if callers need it—replace the duplicated block in both locations with a call to that helper so behavior remains identical and centralized.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 2391-2405: Extract the duplicated stale-gateway reconciliation
logic into a single helper (e.g., reconcileStaleGateway or ensureGatewayRunning)
and call it from both preflight() and onboard(); the helper should call
verifyGatewayContainerRunning(), emit the same warning when containerState ===
"unknown", perform the cleanup steps (runOpenshell(["forward","stop",
String(DASHBOARD_PORT)], { ignoreError: true }), destroyGateway(),
registry.clearAll()) and downgrade gatewayReuseState to "missing" when
containerState === "missing", and return the final state if callers need
it—replace the duplicated block in both locations with a call to that helper so
behavior remains identical and centralized.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 1f591805-1621-41ab-ae77-4a5456a4d287
📒 Files selected for processing (1)
src/lib/onboard.ts
…obe (#2053) ## Summary One-line fix — verifyGatewayContainerRunning() crashes with TypeError on second sandbox onboard because result.stdout is a Buffer, not a string. ## Related Issue Regression from PR #2032. Detailed issue at [2054](#2054) ## Changes - Wrap result.stdout with String() before calling .trim() in verifyGatewayContainerRunning() ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes ## AI Disclosure - [x] AI-assisted — tool: Cursor --- Signed-off-by: Truong Nguyen <tgnguyen@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Improved reliability of gateway container status verification by enhancing string type handling during status checks. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Truong Nguyen <tgnguyen@nvidia.com>
Summary
nemoclaw onboardstep 2 skipped gateway startup when openshell CLI metadata reported "healthy" but the Docker container had been removed, causing "Connection refused" in step 4verifyGatewayContainerRunning()helper that runsdocker inspectto check actual container state before trusting a "healthy" reuse decision"missing"and cleaning up stale metadata when the container is deadTest plan
test/gateway-liveness-probe.test.tsvalidates the probe helper exists, is called in both code paths, downgrades state correctly, and confirmsgateway-state.tsremains pure (no I/O)test/gateway-state.test.tspasses unchanged (pure classifier tests)docker rm openshell-cluster-nemoclaw), runnemoclaw onboard— should detect stale metadata and start a fresh gateway instead of skippingnemoclaw onboardshould still reuse it (happy path unchanged)🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests