Skip to content

fix(onboard): probe gateway container before trusting stale health metadata#2032

Merged
ericksoa merged 9 commits into
mainfrom
fix/2020-gateway-liveness-probe
Apr 17, 2026
Merged

fix(onboard): probe gateway container before trusting stale health metadata#2032
ericksoa merged 9 commits into
mainfrom
fix/2020-gateway-liveness-probe

Conversation

@ericksoa

@ericksoa ericksoa commented Apr 17, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Fixes [All Platform][Onboard]Provider setup fails with "Connection refused" despite gateway showing healthy #2020: nemoclaw onboard step 2 skipped gateway startup when openshell CLI metadata reported "healthy" but the Docker container had been removed, causing "Connection refused" in step 4
  • Adds verifyGatewayContainerRunning() helper that runs docker inspect to check actual container state before trusting a "healthy" reuse decision
  • Applies the probe in both the preflight port check and the main onboard gateway-reuse flow, downgrading to "missing" and cleaning up stale metadata when the container is dead

Test plan

  • New test/gateway-liveness-probe.test.ts validates the probe helper exists, is called in both code paths, downgrades state correctly, and confirms gateway-state.ts remains pure (no I/O)
  • Existing test/gateway-state.test.ts passes unchanged (pure classifier tests)
  • Manual: remove gateway container (docker rm openshell-cluster-nemoclaw), run nemoclaw onboard — should detect stale metadata and start a fresh gateway instead of skipping
  • Manual: with a healthy gateway running, nemoclaw onboard should still reuse it (happy path unchanged)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Improved gateway reliability: added a runtime liveness check before reusing a gateway; stops forwarding, clears stale local state, and prevents reuse when the gateway is missing. Unclear probe results emit a warning and retain the cached healthy status.
  • Tests

    • Added tests validating the liveness check, the ordering of recovery/cleanup actions, and behavior for running/missing/unknown probe outcomes; also verifies health-check helpers are free of side-effecting runtime probes.

…tadata (Fixes #2020)

The openshell CLI reports stale "Connected" status from its local state file
even after the Docker container has been removed. This causes onboard step 2
to skip gateway startup, leading to "Connection refused" in step 4.

Add a Docker container liveness probe (verifyGatewayContainerRunning) that
checks the actual container state before trusting a "healthy" reuse decision
in both the preflight port check and the main onboard flow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Apr 17, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: bb9d6517-b71e-4fc0-9f31-484ace59d1f8

📥 Commits

Reviewing files that changed from the base of the PR and between e020edb and c065318.

📒 Files selected for processing (1)
  • src/lib/onboard.ts
✅ Files skipped from review due to trivial changes (1)
  • src/lib/onboard.ts

📝 Walkthrough

Walkthrough

Onboarding now performs a Docker liveness probe when cached gateway metadata reports "healthy". If the gateway container is missing, the flow stops dashboard forwarding, destroys the gateway, clears the local registry, and downgrades reuse state to "missing"; "unknown" logs a warning and keeps "healthy".

Changes

Cohort / File(s) Summary
Gateway liveness & onboarding
src/lib/onboard.ts
Add verifyGatewayContainerRunning() which runs docker inspect --type container for openshell-cluster-${GATEWAY_NAME} and returns `"running"
Tests & state-safety assertions
test/gateway-liveness-probe.test.ts, src/lib/gateway-state.ts
Add test test/gateway-liveness-probe.test.ts asserting presence, tri-state behavior, and correct probe ordering in preflight/onboard; verifies downgrade-to-"missing" paths (stop forwarding, destroyGateway(), registry clear). Also asserts src/lib/gateway-state.ts contains no Docker/process-spawning side effects.

Sequence Diagram

sequenceDiagram
    participant Onboard as Onboard
    participant State as GatewayState
    participant Docker as DockerEngine
    participant Forward as DashboardForward
    participant Registry as LocalRegistry

    Onboard->>State: compute gatewayReuseState
    State-->>Onboard: "healthy"
    Onboard->>Docker: docker inspect --type container\nopenshell-cluster-${GATEWAY_NAME} -> {{.State.Running}}
    Docker-->>Onboard: "running" / "missing" / "unknown"
    alt missing
        Onboard->>Forward: stop forwarding
        Onboard->>Onboard: log cleanup / mark stale
        Onboard->>Docker: destroy gateway
        Onboard->>Registry: clear local registry
        Onboard->>State: set gatewayReuseState = "missing"
    else running
        Onboard->>State: keep gatewayReuseState = "healthy"
    else unknown
        Onboard->>Onboard: log warning and proceed with cached "healthy"
    end
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 I sniffed the rails where containers hide,

A missing pawprint sent me to pry.
I stopped the forward, swept the stale away,
Cleared the registry so new hops may play.
Fresh gateway dawns — let the sandbox sigh.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding a Docker container liveness probe to verify gateway health before reusing cached metadata.
Linked Issues check ✅ Passed The code changes directly address issue #2020 by implementing Docker container liveness verification via verifyGatewayContainerRunning() and applying the probe in both preflight and onboard flows to prevent reusing missing/stopped gateways.
Out of Scope Changes check ✅ Passed All changes are scoped to the gateway container liveness probe feature: the helper function, integration into reuse logic, and a test file validating the implementation and verifying gateway-state.ts remains pure.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/2020-gateway-liveness-probe

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
test/gateway-liveness-probe.test.ts (1)

62-70: Scope the purity check to isGatewayHealthy() instead of the whole file.

These assertions will fail if any unrelated helper in gateway-state.ts ever uses docker, spawn, or exec, even when isGatewayHealthy() stays pure. Narrowing the check to that function body will make the test less noisy.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/gateway-liveness-probe.test.ts` around lines 62 - 70, The test currently
scans the entire gateway-state.ts file; instead extract only the
isGatewayHealthy() function body and run the purity assertions against that
snippet. Locate isGatewayHealthy in gateway-state.ts (handle both "function
isGatewayHealthy(...){...}" and "export const isGatewayHealthy = (...) => { ...
}" forms), capture the function body with a regex, then assert that the captured
body does not contain "docker", "spawn", or "exec" instead of scanning the whole
file.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2374-2379: The stale-gateway branches currently call
runOpenshell(["gateway","destroy",...]) directly and bypass the shared cleanup
logic; replace that direct destroy call with the shared destroyGateway()
function (which performs Docker volume cleanup), keep the existing
runOpenshell(["forward","stop",...], {ignoreError:true}) if needed, then call
registry.clearAll() and set gatewayReuseState = "missing" as before; apply the
same replacement in the other identical stale-metadata branch that checks
verifyGatewayContainerRunning() (the other occurrence noted in the review).

In `@test/gateway-liveness-probe.test.ts`:
- Around line 38-43: The current regex used in mainFlowProbe can match across
the earlier preflight() block; tighten it so the sequence occurs inside the
onboard function by requiring "onboard" between getGatewayReuseState and
canReuseHealthyGateway. Update the content.match pattern to include "onboard"
(for example require 'onboard\\s*\\(' or 'function onboard' before the
verifyGatewayContainerRunning\\(\\) call) so that the match ensures
verifyGatewayContainerRunning() is invoked specifically within onboard() and not
from preflight(); target symbols: getGatewayReuseState,
verifyGatewayContainerRunning(), onboard(, and const canReuseHealthyGateway.

---

Nitpick comments:
In `@test/gateway-liveness-probe.test.ts`:
- Around line 62-70: The test currently scans the entire gateway-state.ts file;
instead extract only the isGatewayHealthy() function body and run the purity
assertions against that snippet. Locate isGatewayHealthy in gateway-state.ts
(handle both "function isGatewayHealthy(...){...}" and "export const
isGatewayHealthy = (...) => { ... }" forms), capture the function body with a
regex, then assert that the captured body does not contain "docker", "spawn", or
"exec" instead of scanning the whole file.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9e6e5e92-90e8-4d61-a656-832a0bcd1622

📥 Commits

Reviewing files that changed from the base of the PR and between 1487f5b and a5798a6.

📒 Files selected for processing (2)
  • src/lib/onboard.ts
  • test/gateway-liveness-probe.test.ts

Comment thread src/lib/onboard.ts Outdated
Comment thread test/gateway-liveness-probe.test.ts

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/lib/onboard.ts (1)

2374-2378: ⚠️ Potential issue | 🟠 Major

Use destroyGateway() in both stale-metadata cleanup branches.

These branches still destroy the gateway directly, so they skip the Docker volume cleanup in destroyGateway(). That can leave openshell-cluster-nemoclaw* volumes behind after stale-metadata recovery and make the next gateway start fail.

Proposed fix
-    runOpenshell(["gateway", "destroy", "-g", GATEWAY_NAME], { ignoreError: true });
+    destroyGateway();

Also applies to: 5925-5929

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2374 - 2378, The stale-metadata cleanup
branches call openshell directly and skip Docker volume cleanup; replace the
direct destroys with a call to destroyGateway() so volume cleanup runs.
Specifically, in the branch that checks gatewayReuseState === "healthy" and
!verifyGatewayContainerRunning() (and the analogous stale branch further down),
remove the direct runOpenshell(["gateway","destroy",...]) and
registry.clearAll() sequence and instead call destroyGateway() after stopping
forwards so the openshell-cluster-nemoclaw* volumes are removed; keep the
runOpenshell(["forward","stop", String(DASHBOARD_PORT)], { ignoreError: true })
step but use destroyGateway() to perform the destroy+cleanup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 128-135: The verifyGatewayContainerRunning function currently
discards stderr and treats any non-zero exit as "not running", which conflates a
missing container with transient Docker daemon errors; modify
verifyGatewayContainerRunning to capture stderr from the run(...) call (instead
of redirecting to /dev/null), inspect stderr for specific signals ("No such
object:" → container missing) and explicitly handle at least three outcomes
(running, missing, unknown/error) — e.g., return an enum/string or distinct
booleans — and update callers to only perform stale-metadata cleanup when stderr
indicates "No such object" (or the explicit "missing" result) while treating
daemon/connection errors as transient (unknown) so cleanup is not triggered.
Ensure you reference the containerName variable and the run invocation when
making the change.

---

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 2374-2378: The stale-metadata cleanup branches call openshell
directly and skip Docker volume cleanup; replace the direct destroys with a call
to destroyGateway() so volume cleanup runs. Specifically, in the branch that
checks gatewayReuseState === "healthy" and !verifyGatewayContainerRunning() (and
the analogous stale branch further down), remove the direct
runOpenshell(["gateway","destroy",...]) and registry.clearAll() sequence and
instead call destroyGateway() after stopping forwards so the
openshell-cluster-nemoclaw* volumes are removed; keep the
runOpenshell(["forward","stop", String(DASHBOARD_PORT)], { ignoreError: true })
step but use destroyGateway() to perform the destroy+cleanup.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 3112519c-a634-4659-bb22-c03fb6961dc3

📥 Commits

Reviewing files that changed from the base of the PR and between a5798a6 and 716af9e.

📒 Files selected for processing (1)
  • src/lib/onboard.ts

Comment thread src/lib/onboard.ts
Address CodeRabbit review feedback:
- Use destroyGateway() instead of inline gateway destroy calls so Docker
  volume cleanup is not skipped when recovering from stale metadata
- Scope test regex to onboard() function to prevent false-positive match
  from the preflight block
- Scope gateway-state.ts purity check to isGatewayHealthy() body only

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/lib/onboard.ts (1)

128-135: ⚠️ Potential issue | 🟠 Major

Differentiate missing from Docker runtime failures before cleanup

Line 131/Line 134 currently treat any non-zero docker inspect as “container not running.” That also matches daemon/socket failures, which can incorrectly trigger destructive stale cleanup in Line 2374 and Line 5925.

🛠️ Suggested fix (tri-state probe)
-function verifyGatewayContainerRunning() {
+function getGatewayContainerState() {
   const containerName = `openshell-cluster-${GATEWAY_NAME}`;
   const result = run(
-    `docker inspect --type container --format '{{.State.Running}}' ${containerName} 2>/dev/null`,
+    `docker inspect --type container --format '{{.State.Running}}' ${containerName}`,
     { ignoreError: true, suppressOutput: true },
   );
-  return result.status === 0 && (result.stdout || "").trim() === "true";
+  const stdout = String(result.stdout || "").trim().toLowerCase();
+  const stderr = String(result.stderr || "").trim().toLowerCase();
+  if (result.status === 0) return stdout === "true" ? "running" : "stopped";
+  if (stderr.includes("no such object")) return "missing";
+  return "unknown";
 }

Then only run stale-metadata cleanup when state is "missing" (or explicitly "stopped" if intended), and keep "unknown" non-destructive.

#!/bin/bash
# Verify current behavior conflates all inspect failures and drives cleanup from a boolean.
rg -n --type=ts -C3 'function verifyGatewayContainerRunning|docker inspect --type container --format.*State\.Running|2>/dev/null|return result\.status === 0' src/lib/onboard.ts

# Verify both call sites perform cleanup when helper returns false.
rg -n --type=ts -C4 'gatewayReuseState === "healthy" && !verifyGatewayContainerRunning\(\)|destroyGateway\(\)|registry\.clearAll\(\)|gatewayReuseState = "missing"' src/lib/onboard.ts
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 128 - 135, verifyGatewayContainerRunning
currently returns a boolean and treats any non-zero docker inspect as "not
running", which conflates missing containers with Docker/daemon errors; change
it into a tri-state probe (e.g., return "healthy" | "missing" | "unknown" or an
object { state: "healthy"|"missing"|"unknown", info?: string }) in
verifyGatewayContainerRunning by running the inspect command and: if status ===
0 and stdout.trim() === "true" -> "healthy"; if status !== 0 and stderr contains
Docker's "No such object" (or equivalent) -> "missing"; otherwise -> "unknown"
(non-destructive). Then update call sites that currently do gatewayReuseState
=== "healthy" && !verifyGatewayContainerRunning(...) and the cleanup code paths
that call destroyGateway()/registry.clearAll()/set gatewayReuseState = "missing"
to only perform destructive cleanup when the probe returns "missing" (or
explicitly "stopped" if that is intended), leaving "unknown" to skip destructive
actions and surface an error for investigation.
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)

2374-2380: Extract stale-gateway cleanup into one helper

The two new stale-metadata branches duplicate the same cleanup sequence. Centralizing this reduces drift risk and keeps preflight/onboard behavior consistent.

♻️ Suggested refactor
+function cleanupStaleGatewayMetadata() {
+  console.log("  Gateway metadata is stale (container not running). Cleaning up...");
+  runOpenshell(["forward", "stop", String(DASHBOARD_PORT)], { ignoreError: true });
+  destroyGateway();
+  registry.clearAll();
+  console.log("  ✓ Stale gateway metadata cleaned up");
+}
...
-  if (gatewayReuseState === "healthy" && !verifyGatewayContainerRunning()) {
-    console.log("  Gateway metadata is stale (container not running). Cleaning up...");
-    runOpenshell(["forward", "stop", String(DASHBOARD_PORT)], { ignoreError: true });
-    destroyGateway();
-    registry.clearAll();
-    gatewayReuseState = "missing";
-    console.log("  ✓ Stale gateway metadata cleaned up");
-  }
+  if (gatewayReuseState === "healthy" && !verifyGatewayContainerRunning()) {
+    cleanupStaleGatewayMetadata();
+    gatewayReuseState = "missing";
+  }

Also applies to: 5925-5932

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2374 - 2380, Duplicate stale-gateway cleanup
logic should be extracted into a single helper to avoid drift: create a function
(e.g., cleanupStaleGateway or handleStaleGateway) that performs the sequence
currently duplicated — logging the "stale" message, calling
runOpenshell(["forward", "stop", String(DASHBOARD_PORT)], { ignoreError: true
}), calling destroyGateway(), calling registry.clearAll(), setting
gatewayReuseState = "missing", and logging the completion message — then replace
both duplicated branches (the block guarded by gatewayReuseState === "healthy"
&& !verifyGatewayContainerRunning() and the similar block at the other location)
with calls to that helper (keep existing condition checks and only invoke the
new helper).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 128-135: verifyGatewayContainerRunning currently returns a boolean
and treats any non-zero docker inspect as "not running", which conflates missing
containers with Docker/daemon errors; change it into a tri-state probe (e.g.,
return "healthy" | "missing" | "unknown" or an object { state:
"healthy"|"missing"|"unknown", info?: string }) in verifyGatewayContainerRunning
by running the inspect command and: if status === 0 and stdout.trim() === "true"
-> "healthy"; if status !== 0 and stderr contains Docker's "No such object" (or
equivalent) -> "missing"; otherwise -> "unknown" (non-destructive). Then update
call sites that currently do gatewayReuseState === "healthy" &&
!verifyGatewayContainerRunning(...) and the cleanup code paths that call
destroyGateway()/registry.clearAll()/set gatewayReuseState = "missing" to only
perform destructive cleanup when the probe returns "missing" (or explicitly
"stopped" if that is intended), leaving "unknown" to skip destructive actions
and surface an error for investigation.

---

Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 2374-2380: Duplicate stale-gateway cleanup logic should be
extracted into a single helper to avoid drift: create a function (e.g.,
cleanupStaleGateway or handleStaleGateway) that performs the sequence currently
duplicated — logging the "stale" message, calling runOpenshell(["forward",
"stop", String(DASHBOARD_PORT)], { ignoreError: true }), calling
destroyGateway(), calling registry.clearAll(), setting gatewayReuseState =
"missing", and logging the completion message — then replace both duplicated
branches (the block guarded by gatewayReuseState === "healthy" &&
!verifyGatewayContainerRunning() and the similar block at the other location)
with calls to that helper (keep existing condition checks and only invoke the
new helper).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 78042da6-8ee4-4857-bd3b-c58817a55189

📥 Commits

Reviewing files that changed from the base of the PR and between 716af9e and 7028a04.

📒 Files selected for processing (2)
  • src/lib/onboard.ts
  • test/gateway-liveness-probe.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/gateway-liveness-probe.test.ts

ericksoa and others added 2 commits April 17, 2026 12:37
verifyGatewayContainerRunning() now returns a tri-state
("running" | "missing" | "unknown") instead of a boolean.
Only triggers stale-metadata cleanup when the container is
confirmed missing (stderr contains "No such object"), not when
Docker is temporarily unavailable. This prevents destroying a
healthy gateway during transient Docker failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
test/gateway-liveness-probe.test.ts (1)

48-60: Add an assertion for the docker inspect => "false" path.

These checks confirm tri-state strings exist, but they don’t guard the specific regression where .State.Running is false and gets treated as "unknown". Add coverage for that path so stopped containers can’t silently bypass cleanup logic.

✅ Example assertion to add
   it("returns tri-state: running, missing, or unknown", () => {
     // The helper must distinguish container-removed from Docker-unavailable
     expect(content).toContain('return "running"');
     expect(content).toContain('return "missing"');
     expect(content).toContain('return "unknown"');
+    // Also ensure successful inspect with "false" is handled explicitly.
+    expect(content).toMatch(/running\s*===\s*"false"|trim\(\)\s*===\s*"false"/);
   });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/gateway-liveness-probe.test.ts` around lines 48 - 60, Add a focused
assertion that verifies the probe treats a docker inspect result with
.State.Running === false as the "missing" path (not "unknown"); locate the test
block referencing content and containerState (tests named "returns tri-state:
running, missing, or unknown" and "only downgrades to 'missing' when container
is confirmed missing") and add an assertion that ensures the generated probe
string contains the conditional or return handling for the ".State.Running ===
false" case (e.g., a check for '.State.Running === false' or the code path that
maps that to containerState === "missing") so the stopped-container path is
explicitly covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 142-149: The current inspect-result handling treats a successful
docker inspect with stdout "false" as "unknown", allowing cached "healthy"
reuse; change the branch that checks result.status and (result.stdout ||
"").trim() so that when stdout === "false" it returns a non-reusable state
(e.g., "stopped" or "not_running") instead of falling through to "unknown".
Locate the block with the condition if (result.status === 0 && (result.stdout ||
"").trim() === "true") and the stderr handling using (result.stderr ||
"").toString(), and add an explicit check for stdout === "false" returning the
chosen non-reusable state so callers won't reuse a stopped container.

---

Nitpick comments:
In `@test/gateway-liveness-probe.test.ts`:
- Around line 48-60: Add a focused assertion that verifies the probe treats a
docker inspect result with .State.Running === false as the "missing" path (not
"unknown"); locate the test block referencing content and containerState (tests
named "returns tri-state: running, missing, or unknown" and "only downgrades to
'missing' when container is confirmed missing") and add an assertion that
ensures the generated probe string contains the conditional or return handling
for the ".State.Running === false" case (e.g., a check for '.State.Running ===
false' or the code path that maps that to containerState === "missing") so the
stopped-container path is explicitly covered.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 7ab930d3-c66e-4fb8-a7ee-2912c1606336

📥 Commits

Reviewing files that changed from the base of the PR and between 7028a04 and e020edb.

📒 Files selected for processing (2)
  • src/lib/onboard.ts
  • test/gateway-liveness-probe.test.ts

Comment thread src/lib/onboard.ts
ericksoa and others added 2 commits April 17, 2026 13:06
When docker inspect succeeds (exit 0) but .State.Running is "false",
the container exists but is stopped — treat this as "missing" so
stale-metadata cleanup is triggered instead of falling through to
"unknown" and keeping cached "healthy" reuse.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/lib/onboard.ts (1)

2391-2405: Extract the stale-gateway reconciliation into one helper.

The probe, warning, cleanup, and state downgrade are duplicated in both preflight() and onboard(). Keeping that flow in one place will make the next behavior change much less likely to land in only one path.

♻️ Refactor sketch
+function reconcileHealthyGatewayReuseState(gatewayReuseState) {
+  if (gatewayReuseState !== "healthy") return gatewayReuseState;
+
+  const containerState = verifyGatewayContainerRunning();
+  if (containerState === "missing") {
+    console.log("  Gateway metadata is stale (container not running). Cleaning up...");
+    runOpenshell(["forward", "stop", String(DASHBOARD_PORT)], { ignoreError: true });
+    destroyGateway();
+    registry.clearAll();
+    console.log("  ✓ Stale gateway metadata cleaned up");
+    return "missing";
+  }
+
+  if (containerState === "unknown") {
+    console.log(
+      "  Warning: could not verify gateway container state (Docker may be unavailable). Proceeding with cached health status.",
+    );
+  }
+
+  return gatewayReuseState;
+}
- let gatewayReuseState = getGatewayReuseState(gatewayStatus, gwInfo, activeGatewayInfo);
-
- if (gatewayReuseState === "healthy") {
-   const containerState = verifyGatewayContainerRunning();
-   ...
- }
+ let gatewayReuseState = reconcileHealthyGatewayReuseState(
+   getGatewayReuseState(gatewayStatus, gwInfo, activeGatewayInfo),
+ );

Also applies to: 5947-5961

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2391 - 2405, Extract the duplicated
stale-gateway reconciliation logic into a single helper (e.g.,
reconcileStaleGateway or ensureGatewayRunning) and call it from both preflight()
and onboard(); the helper should call verifyGatewayContainerRunning(), emit the
same warning when containerState === "unknown", perform the cleanup steps
(runOpenshell(["forward","stop", String(DASHBOARD_PORT)], { ignoreError: true
}), destroyGateway(), registry.clearAll()) and downgrade gatewayReuseState to
"missing" when containerState === "missing", and return the final state if
callers need it—replace the duplicated block in both locations with a call to
that helper so behavior remains identical and centralized.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 2391-2405: Extract the duplicated stale-gateway reconciliation
logic into a single helper (e.g., reconcileStaleGateway or ensureGatewayRunning)
and call it from both preflight() and onboard(); the helper should call
verifyGatewayContainerRunning(), emit the same warning when containerState ===
"unknown", perform the cleanup steps (runOpenshell(["forward","stop",
String(DASHBOARD_PORT)], { ignoreError: true }), destroyGateway(),
registry.clearAll()) and downgrade gatewayReuseState to "missing" when
containerState === "missing", and return the final state if callers need
it—replace the duplicated block in both locations with a call to that helper so
behavior remains identical and centralized.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 1f591805-1621-41ab-ae77-4a5456a4d287

📥 Commits

Reviewing files that changed from the base of the PR and between e020edb and 8c65164.

📒 Files selected for processing (1)
  • src/lib/onboard.ts

@ericksoa ericksoa merged commit 26a8f7a into main Apr 17, 2026
13 checks passed
ericksoa pushed a commit that referenced this pull request Apr 18, 2026
…obe (#2053)

## Summary

One-line fix — verifyGatewayContainerRunning() crashes with TypeError on
second sandbox onboard because result.stdout is a Buffer, not a string.

## Related Issue

Regression from PR #2032.
Detailed issue at [2054](#2054)

## Changes

- Wrap result.stdout with String() before calling .trim() in
verifyGatewayContainerRunning()

## Type of Change

- [x] Code change (feature, bug fix, or refactor)

## Verification

- [x] `npx prek run --all-files` passes
- [x] `npm test` passes

## AI Disclosure

- [x] AI-assisted — tool: Cursor

---

Signed-off-by: Truong Nguyen <tgnguyen@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved reliability of gateway container status verification by
enhancing string type handling during status checks.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Truong Nguyen <tgnguyen@nvidia.com>
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[All Platform][Onboard]Provider setup fails with "Connection refused" despite gateway showing healthy

3 participants