fix(onboard): require host HTTP readiness before reusing the gateway#3312
Conversation
Closes #3258 Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughAdds host HTTP readiness probes and configurable polling to onboarding gateway reuse/startup/recovery flows; exports isGatewayHttpReady, waitForGatewayHttpReady, and getGatewayReuseHealthWaitConfig; integrates checks into preflight/startup/recover/resume, adds unit tests, and documents the behavior and env vars. ChangesGateway HTTP Readiness & Reuse Health Verification
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
4074-4109: Run the restart-sensitive onboarding E2Es on this change.The new unit coverage is useful, but this touches core gateway reuse/start behavior in
src/lib/onboard.ts; I'd still run the onboard/lifecycle E2Es that exercise Docker/Colima restart timing before merge.As per coding guidelines,
src/lib/onboard.ts: "This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow."Also applies to: 4341-4457
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 4074 - 4109, This change touches gateway reuse/start logic around waitForGatewayHttpReady, gatewayReuseState, runOpenshell, destroyGateway, and registry.clearAll — run the restart-sensitive onboarding/lifecycle E2E suites that simulate Docker/Colima daemon restarts to verify the three branches (HTTP-ready despite Docker unavailable, HTTP-unready with Docker unavailable, and container-running-but-HTTP-unready) behave as intended; if any E2E fails, adjust the logic so that waitForGatewayHttpReady is awaited in the same way the E2Es expect, preserve the non-destructive "unknown" path (do not set gatewayReuseState="missing" when Docker is unavailable), and ensure the recreate path still calls runOpenshell(["forward","stop",String(DASHBOARD_PORT)], {ignoreError:true}), destroyGateway(), and registry.clearAll() only when Docker is known-functional.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 4451-4457: recoverGatewayRuntime currently treats "openshell
status" alone as success in two places; update both return sites so they also
await isGatewayHttpReady() before considering recovery complete: when checking
status.includes("Connected") && isSelectedGateway(status) and inside the
recovery poll loop after restart, only set process.env.OPENSHELL_GATEWAY =
GATEWAY_NAME and return true if (await isGatewayHttpReady()) also succeeds;
ensure you await the readiness call and keep the same guards
(status.includes("Connected") and isSelectedGateway) around setting the env and
returning.
In `@test/gateway-http-reuse-wait.test.ts`:
- Around line 221-312: The tests are asserting source text instead of runtime
behavior and the unknownBlockRegex doesn't include the else-if branch, so update
the suite to import and exercise the compiled runtime (dist/lib/onboard.js) and
replace regex matches with unit tests that stub/spy the key functions
(waitForGatewayHttpReady, isGatewayHttpReady, destroyGateway, registry.clearAll)
and assert observed calls and gatewayReuseState transitions when calling
preflight, onboard, and startGatewayWithOptions; specifically, simulate
containerState values ("running", "unknown") and HTTP success/failure via the
waitForGatewayHttpReady stub to verify the two downgrade behaviors
(destroy+missing for running+HTTP-fail and missing-without-destroy for
unknown+HTTP-fail) and ensure the startGatewayWithOptions path requires both
isGatewayHealthy and await isGatewayHttpReady before logging "Reusing existing
gateway".
- Around line 143-145: The test "returns false on connection refused" is flaky
because it probes port 1; update the test to deterministically get an ephemeral
localhost port, close it, then call isGatewayHttpReady against
http://127.0.0.1:<port>/; specifically, in the test create a temporary server
(e.g., via net.createServer or http.createServer), listen(0) to get
address().port, close the server to free the port, then await
isGatewayHttpReady(2000, `http://127.0.0.1:${port}/`) and expect false —
reference the test case name and the isGatewayHttpReady call so you can locate
and replace the hardcoded "127.0.0.1:1" probe with this ephemeral-port pattern.
---
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 4074-4109: This change touches gateway reuse/start logic around
waitForGatewayHttpReady, gatewayReuseState, runOpenshell, destroyGateway, and
registry.clearAll — run the restart-sensitive onboarding/lifecycle E2E suites
that simulate Docker/Colima daemon restarts to verify the three branches
(HTTP-ready despite Docker unavailable, HTTP-unready with Docker unavailable,
and container-running-but-HTTP-unready) behave as intended; if any E2E fails,
adjust the logic so that waitForGatewayHttpReady is awaited in the same way the
E2Es expect, preserve the non-destructive "unknown" path (do not set
gatewayReuseState="missing" when Docker is unavailable), and ensure the recreate
path still calls runOpenshell(["forward","stop",String(DASHBOARD_PORT)],
{ignoreError:true}), destroyGateway(), and registry.clearAll() only when Docker
is known-functional.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1299fd53-dccc-462e-b3b4-206dad77ebdc
📒 Files selected for processing (2)
src/lib/onboard.tstest/gateway-http-reuse-wait.test.ts
…e-text test asserts Address CodeRabbit review on #3258: - Extend host HTTP readiness gate to both return sites in recoverGatewayRuntime() so the recovery path cannot return success on stale CLI metadata while http://localhost:8080/ is still 502. - Replace port-1 connection-refused probe with an ephemeral-port bind+close helper for deterministic local closure. - Drop the source-text integration assertions; runtime behaviour is covered by the live HTTP server tests against isGatewayHttpReady and the injectable-probe tests for waitForGatewayHttpReady. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
end-of-file-fixer hook in CI rejects extra trailing newline. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Mention the new gateway-reuse HTTP readiness probe added in #3258 and the two override knobs (NEMOCLAW_REUSE_HEALTH_POLL_COUNT, NEMOCLAW_REUSE_HEALTH_POLL_INTERVAL) in the onboard preflight section. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Centralise the validation guards in waitForGatewayHttpReady so env-derived defaults and caller-supplied options are normalised in one place — Math.max(1, attempts) and Math.max(0, intervalSeconds). The config function is now just an env reader. Restructure the wait loop to always probe at least once before any sleep, eliminating the trailing-attempt guard and naturally enforcing the minimum. Validate the per-probe timeoutMs in isGatewayHttpReady too: a non-positive or non-finite value falls back to the safe default rather than tearing the request down immediately. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Document that NEMOCLAW_REUSE_HEALTH_POLL_COUNT is floored at 1 (so the probe always runs at least once) and the interval is floored at 0. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
test/gateway-http-reuse-wait.test.ts (1)
253-270: ⚡ Quick winAdd an explicit negative-interval clamp test.
You already verify attempt clamping; adding one case for
intervalSeconds < 0would fully lock down the documented normalization behavior.Suggested test addition
describe("waitForGatewayHttpReady (`#3258`)", () => { @@ it("always probes at least once even when maxAttempts is 0 or negative", async () => { @@ }); + + it("clamps negative intervalSeconds to 0", async () => { + let calls = 0; + const sleeps: number[] = []; + const result = await waitForGatewayHttpReady({ + probe: async () => { + calls += 1; + return false; + }, + sleeper: (s: number) => sleeps.push(s), + maxAttempts: 3, + intervalSeconds: -5, + }); + expect(result).toBe(false); + expect(calls).toBe(3); + expect(sleeps).toEqual([0, 0]); + }); });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/gateway-http-reuse-wait.test.ts` around lines 253 - 270, Add a new test alongside the existing "always probes at least once..." case that passes a negative intervalSeconds to waitForGatewayHttpReady and asserts the same normalization behavior: the probe is invoked exactly once, result is false, and no sleeps occur. Use the same structure/variables as the current test (calls counter, sleeps array, probe async that returns false, sleeper that pushes to sleeps) but set intervalSeconds to a negative value (e.g., -5) and keep maxAttempts at 0 (or another bad value) to verify the interval clamping logic in waitForGatewayHttpReady.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 4087-4108: The code is incorrectly downgrading gatewayReuseState
from "unknown" to "missing" which causes the orphan-cleanup path to remove live
containers; change this to a separate non-destructive state/flag (e.g., set
gatewayReuseState = "no_reuse_no_cleanup" or set a boolean like
skipOrphanCleanup = true) inside the branch where waitForGatewayHttpReady() is
false, and update any later logic that performs orphan cleanup (the code that
checks gatewayReuseState for "missing" or runs cleanup) to only treat the
original "missing" value as eligible for stop/remove; ensure functions/variables
referenced include gatewayReuseState and waitForGatewayHttpReady and propagate
the new state/flag through the preflight flow so reuse is prevented but
orphan-cleanup is suppressed.
- Around line 3583-3593: getGatewayReuseHealthWaitConfig/opts values can be
non-finite and Math.max won't sanitize NaN/Infinity, so clamp and coerce
maxAttempts and intervalSeconds to finite numbers before the poll loop: compute
a finiteMaxAttempts = Number.isFinite(opts.maxAttempts) ? Math.max(1,
Math.trunc(opts.maxAttempts)) : Math.max(1, config.count) and a
finiteIntervalSeconds = Number.isFinite(opts.intervalSeconds) ? Math.max(0,
opts.intervalSeconds) : Math.max(0, config.interval); then use finiteMaxAttempts
and finiteIntervalSeconds in place of maxAttempts/intervalSeconds when calling
sleeper and in the for loop (keep the initial probe() call and reference probe
and sleeper as before).
---
Nitpick comments:
In `@test/gateway-http-reuse-wait.test.ts`:
- Around line 253-270: Add a new test alongside the existing "always probes at
least once..." case that passes a negative intervalSeconds to
waitForGatewayHttpReady and asserts the same normalization behavior: the probe
is invoked exactly once, result is false, and no sleeps occur. Use the same
structure/variables as the current test (calls counter, sleeps array, probe
async that returns false, sleeper that pushes to sleeps) but set intervalSeconds
to a negative value (e.g., -5) and keep maxAttempts at 0 (or another bad value)
to verify the interval clamping logic in waitForGatewayHttpReady.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f8fb97bd-0c6e-4a36-95fb-73bf8ad255b3
📒 Files selected for processing (3)
docs/reference/commands.mdsrc/lib/onboard.tstest/gateway-http-reuse-wait.test.ts
✅ Files skipped from review due to trivial changes (1)
- docs/reference/commands.md
…guard NaN/Infinity Address two CodeRabbit findings on #3258: * Don't downgrade reuseState to "missing" in the preflight unknown+ HTTP-fail branch. The orphan-cleanup that runs on "missing" can destroy a possibly-live gateway when Docker's transient inspect failure coincides with an HTTP warm-up miss — exactly the case the branch is supposed to protect. The main-onboard branch (which has no orphan-cleanup) and the startGatewayWithOptions HTTP gate together still ensure the actual reuse decision goes through HTTP verification. * Add Number.isFinite guards in waitForGatewayHttpReady so caller- supplied Infinity/NaN don't bypass the Math.max clamps and either hang the loop or feed NaN into the sleeper. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
cv
left a comment
There was a problem hiding this comment.
Can you extract this new code out of onboard.ts, please? This file is already too big.
…onboard/ Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
… unknown and gateway HTTP is unready Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
E2E Advisor RecommendationRequired E2E: None Full advisor summaryPi Semantic E2E AdvisorFailed: pi exited with status 1; see /home/runner/work/NemoClaw/NemoClaw/artifacts/e2e-advisor/e2e-advisor-pi-raw-output.txt |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard/gateway-http-readiness.ts`:
- Around line 116-133: The probe() calls (used via const probe = opts.probe ??
(() => isGatewayHttpReady())) can reject and currently abort the readiness loop;
change both the initial await probe() and the subsequent await probe() inside
the for loop to catch any thrown errors and treat them as a failed probe (return
false for that attempt) so retries continue; implement this by wrapping each
await probe() in a try/catch (or use a helper like safeProbe()) and return true
only on a successful resolved true, otherwise continue the loop and ultimately
return false after attempts are exhausted; reference probe(),
isGatewayHttpReady(), maxAttempts, intervalSeconds and sleeper to locate the
spots to update.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e2cd2c59-2ceb-4641-a4f7-7ea7fce979a0
📒 Files selected for processing (3)
docs/reference/commands.mdsrc/lib/onboard.tssrc/lib/onboard/gateway-http-readiness.ts
✅ Files skipped from review due to trivial changes (1)
- docs/reference/commands.md
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/onboard.ts
…ttpReady Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…x md lint Address PR review findings: 1. Merge origin/main to pick up PR #3312 (isGatewayHttpReady from src/lib/onboard/gateway-http-readiness.ts). Replace our ad-hoc verifyDockerDriverGatewayListening (TCP probe) with a call to the shared isGatewayHttpReady helper — HTTP is strictly stronger than a raw TCP connect and is already the de-facto standard used by every other gateway-reuse decision site in onboard.ts after #3312. Docker-driver and K3s paths now converge on the same probe. 2. Drop the parallel test/gateway-tcp-probe.test.ts (9 tests for a helper that no longer exists). The shared helper's behavior is already covered by test/gateway-http-reuse-wait.test.ts (21 tests). Replace with test/gateway-health-honest-integration.test.ts (6 source-shape tests) guarding the #3111 integration pattern. 3. Fix 7 markdownlint errors in ACCEPTANCE.md (MD040, MD022, MD031) — the 'checks' job flagged these; now passes locally. 4. Update ACCEPTANCE.md to reflect the simpler design: Phase 1 is 'reuse existing helper' rather than 'add new TCP probe helper', and the refactoring alignment table records #3312 as 'adopted' rather than 'to be coordinated'. Behavior of the fix is unchanged: - child.once('exit') listener tracks zombies (detached children that process.kill(pid, 0) would falsely report as alive) - poll loop guard is childExited || !isPidAlive(childPid) - healthy log gated on isGatewayHttpReady() after isGatewayHealthy() - user-visible exit code/signal surfaced on failure Related: #3111 Coverage guard: #3362 — gateway-health-honest-e2e
…iness Addresses the failing openshell-gateway-upgrade-e2e surfaced by the e2e-advisor on PR #3378. ## Problem The previous commit converged the Docker-driver gateway's readiness probe onto isGatewayHttpReady from ./onboard/gateway-http-readiness (introduced by #3312). That broke openshell-gateway-upgrade-e2e: Starting OpenShell gateway... [PASS] Stale gateway is healthy with supervisor:0.0.36 [INFO] Invoking NemoClaw gateway start path; it must restart the stale process Docker-driver gateway failed to start. <-- 60s timeout The gateway log showed the process was actually serving correctly: POST /openshell.v1.OpenShell/Health response status=200 latency_ms=0 POST /openshell.v1.OpenShell/Health response status=200 latency_ms=0 GET / response status=404 latency_ms=0 ## Root cause isGatewayHttpReady does GET /, and only accepts {200, 401}. The K3s-path gateway answers GET / via a dispatcher catch-all. The Docker-driver gateway, by contrast, only serves routes under /openshell.v1.OpenShell/* and returns 404 for GET /. So the probe always fails on the Docker- driver path regardless of whether the gateway is up. ## Fix Revert to a plain TCP probe on the gateway port for the Docker-driver path. Add a small isDockerDriverGatewayTcpReady(port, timeoutMs, host) helper next to isPidAlive. This is semantic-free — it just asks "is anyone listening" — which is exactly what's needed to detect the #3111 failure mode (crashed binary → nothing listening) without making assumptions about HTTP route shape. isGatewayHttpReady is left in place for all K3s-path call sites; the two paths use different probes because the two gateway types expose different HTTP surfaces. TODO(#3213) filed for future convergence. ## Tests test/gateway-health-honest-integration.test.ts now has 10 tests: 5 behavioural: listening / closed / timeout / min-timeout / no-throw 5 source-shape: child-exit tracking, poll-loop guard, TCP probe call site, regression guard against reintroducing the HTTP probe in the Docker-driver loop, exit-details surfaced ## Expected CI result - gateway-health-honest-e2e: GREEN (already was — this was never affected) - openshell-gateway-upgrade-e2e: FLIPS BACK to GREEN (the breakage this commit addresses) Related: #3111 Broke in: 29bba3e (refactor(onboard): converge on shared isGatewayHttpReady)
Addresses PR review feedback — onboard.ts is the God Object being decomposed, and growing its exports is an anti-pattern. Follow the pattern established by PR #3312 (which added src/lib/onboard/gateway-http-readiness.ts) and create a peer module for the Docker-driver path. New file: src/lib/onboard/gateway-tcp-readiness.ts Exports: isGatewayTcpReady(port, timeoutMs, host) Model: mirrors gateway-http-readiness.ts structure exactly — default port from GATEWAY_PORT, minimum-timeout clamp, pure async, never throws. New file: src/lib/onboard/gateway-tcp-readiness.test.ts Co-located behavioural tests per the established pattern (see temp-files.test.ts, legacy-command.test.ts, etc.) 6 tests: listening / closed / timeout / min-timeout / no-throw / default-port. Imports via ES module, not via dist/ indirection. onboard.ts net change: Before: +88 lines (+48 helper + integration) After: +22 lines (2 require lines + integration only) Related to: #3111
…odules
Addresses PR review: 'This PR is adding lines to onboard.ts — which is
now an antipattern' and the onboard-entrypoint-budget CI check.
src/lib/onboard.ts is the God Object being decomposed; growing it is
explicitly flagged as "away from target architecture" in the repo's
review guidelines, and the onboard-entrypoint-budget workflow fails the
PR if the entrypoint grows net-positive.
## Changes
Extract three focused modules under src/lib/onboard/ that collectively
replace the inline logic added in earlier commits on this branch:
- src/lib/onboard/gateway-tcp-readiness.ts
isGatewayTcpReady(port, timeoutMs, host): plain TCP liveness probe
for the Docker-driver gateway. Peer of gateway-http-readiness.ts
(introduced by #3312). Uses a minimum-timeout clamp so a regression
that drops Math.max(50, timeoutMs) is caught by tests.
- src/lib/onboard/child-exit-tracker.ts
trackChildExit(child): attaches a one-shot 'exit' listener to a
ChildProcess and returns a ChildExitState view. Zombie-safe: fires
reliably for detached children that isPidAlive(pid, 0) would
falsely report as alive (#3111). Also exposes describeExit() —
'exited with code 127' / 'killed by signal SIGKILL' — so callers
don't have to reformat the signal/code themselves.
- src/lib/onboard/docker-driver-gateway-failure.ts
reportDockerDriverGatewayStartFailure(logPath, childExit, opts):
the full diagnostic bundle emitted on startup failure — 'failed to
start' header, exit descriptor, redacted log tail, troubleshooting
footer. Honors exitOnFailure (production) vs. throw-only (recovery).
Each module has co-located behavioural tests (6 + 5 + 6 tests = 17),
using direct ES imports instead of the dist/ indirection that
test/onboard.test.ts uses. Source-shape guards in
test/gateway-health-honest-integration.test.ts enforce that the
three modules stay extracted (the entrypoint cannot reintroduce the
inline implementations).
## Result
src/lib/onboard.ts: 15 additions, 21 deletions — net -6 lines.
src/lib/onboard/**: 3 new modules, 17 new unit tests, full behavioural
+ signal-vs-code coverage on all three helpers.
Both E2E tests remain green on this branch:
- gateway-health-honest-e2e (#3111 coverage guard)
- openshell-gateway-upgrade-e2e (regression target the earlier
HTTP-probe attempt broke; TCP probe is the correct fit)
Related: #3111
Aligns with architecture goal tracked by onboard-entrypoint-budget CI.
…eporting healthy (#3111) (#3378) ## Summary `nemoclaw onboard` was printing `✓ Docker-driver gateway is healthy` even when the openshell-gateway binary crashed immediately on startup, then failing the next step with `Connection refused`. The reported trigger on Ubuntu 22.04 was a GLIBC 2.38/2.39 mismatch in the shipped binary, but the underlying NemoClaw bug is **platform-independent** — any reason the binary fails to start (missing shared lib, CDI spec error, port conflict, permissions, corrupted download) surfaces the same false-positive. This PR fixes the false-positive at the caller site in `startDockerDriverGateway()`, without modifying the shared `isGatewayHealthy()` (which is pinned to pure-function status by the #2020 follow-up test). Fixes #3111. Closes the gap covered by the failing E2E test added in #3362 — `gateway-health-honest-e2e` flips from 🔴 red on `main` to 🟢 green on this branch. ## Root cause `startDockerDriverGateway()` in `src/lib/onboard.ts` spawned the gateway binary with `spawn(..., { detached: true })` + `child.unref()`, then polled using two checks that could both lie: 1. **`isPidAlive(childPid)`** uses `process.kill(pid, 0)` which returns `true` for **zombies**. Since the parent Node process never `wait()`s on the detached child, crashed children linger as zombies and `isPidAlive` reports them as alive. 2. **`isGatewayHealthy(status, gwInfo, activeGwInfo)`** is a pure string match over openshell CLI output. `isGatewayConnected` in `src/lib/state/gateway.ts` matches on `"Server Status"` — the **table header** that `openshell status` always prints. On a crashed gateway, the header is still emitted and the body contains `× client error (Connect) tcp connect error Connection refused` — but `isGatewayConnected` returns true anyway. Smoking gun from the red-on-main run [25698031380](https://github.com/NVIDIA/NemoClaw/actions/runs/25698031380): ``` [DIAG] openshell status: Server Status Gateway: nemoclaw Server: http://127.0.0.1:8080 Error: × client error (Connect) ├─▶ tcp connect error ╰─▶ Connection refused (os error 111) ``` ## Changes **`src/lib/onboard.ts`** — three coordinated changes in `startDockerDriverGateway`: 1. **Track child-exit via the ChildProcess `'exit'` event**, not just `isPidAlive`. A `child.once('exit', ...)` listener flips a `childExited` flag that the poll loop consults alongside `isPidAlive`. This catches zombies that `isPidAlive` misses and also captures the exit code/signal for the failure message. 2. **Add `verifyDockerDriverGatewayListening(port, timeoutMs)`** — a TCP connect probe to `127.0.0.1:${GATEWAY_PORT}` using `node:net` with a socket timeout. Resolves boolean, never throws. This is the Docker-driver path equivalent of `verifyGatewayContainerRunning` (added for #2020 on the K3s path). 3. **Gate the "healthy" log on the TCP probe**: the poll loop now only logs `✓ Docker-driver gateway is healthy` after `isGatewayHealthy` **AND** a successful TCP connect. On probe failure the loop keeps polling — the binary may still be binding its listener. The `childExited` check at the top of the loop terminates us if the process actually died. Also improves the final failure message to include **how** the gateway exited (signal vs. exit code) so users don't have to `tail` the gateway log. ## What this PR does NOT change `isGatewayHealthy` in `src/lib/state/gateway.ts` is left untouched. The #2020 follow-up test at `test/gateway-liveness-probe.test.ts:74` pins it to pure-function status ("no I/O, no docker, no spawn, no exec"). Fix at the caller, not the shared helper — same pattern as #2020. ## Tests - **New unit test:** `test/gateway-tcp-probe.test.ts` (9 tests) - `verifyDockerDriverGatewayListening` resolves true for listening ports - resolves false for closed ports (Connection refused) - resolves false on timeout (non-routable RFC 1918 host) - enforces minimum 50 ms timeout - never throws - **source-shape guards** (regexes over `src/lib/onboard.ts`): child-exit tracking present, `childExited || !isPidAlive` check at poll-loop top, TCP probe called before the "healthy" log, exit details surfaced in the failure message - **E2E acceptance gate:** `gateway-health-honest-e2e` (from #3362) — red on main, expected green on this branch. Will dispatch via nightly-e2e selective run once PR opens. - **Existing tests:** full `vitest` suite passes (CLI); `test/gateway-state.test.ts` (47 tests) and `test/gateway-liveness-probe.test.ts` (7 tests) still green — no behavior change in the shared helpers. ## Refactoring alignment Noted in `ACCEPTANCE.md` (also in this branch): - **#2562** (unified timeout abstraction) — `TODO(#2562)` on the TCP probe helper's timeout logic, for mechanical adoption later. - **#3213** (unified advisory registry) — `TODO(#3213)` on the failure-message format so it can migrate to structured advisories later. - **PR #3312** (laitingsheng — `isGatewayHttpReady` for K3s path) — same pattern, different surface. Easy to converge once both land; can extract a shared "gateway liveness probe" module if desired. - **PR #3306** (cv — `gateway-bootstrap.ts` extraction) — doesn't touch `startDockerDriverGateway`; no conflict expected. ## Related - Fixes #3111 - Coverage guard: #3362 (`gateway-health-honest-e2e`) - Cross-reference patterns: #2020 (K3s path equivalent), #3312 (HTTP readiness for K3s reuse) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added a TCP readiness check during gateway startup to ensure the service is actually listening before reporting healthy. * **Bug Fixes** * Reduced false “alive” reports for the gateway by detecting early child-process termination. * Improved startup failure messages to include how the gateway process terminated (signal vs exit code). * **Tests** * Added integration and unit tests to validate startup health and TCP probe behavior. [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3378) <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Summary
nemoclaw onboarddecides whether to reuse the existing OpenShell gateway based onopenshellCLI metadata. After a Docker daemon restart (colima stop && colima start, system suspend/resume, etc.), that metadata can still report "healthy" while the gateway upstream is still warming up —http://127.0.0.1:${GATEWAY_PORT}/returns 502 until k3s is ready. Onboard then skips startup and step 4 fails with "Connection refused". This PR adds a host-level HTTP readiness probe at every reuse decision so we wait for actual readiness, or recreate the gateway when it never becomes ready.Related Issue
Closes #3258.
Changes
isGatewayHttpReady(timeoutMs, url)— single async probe ofhttp://127.0.0.1:${GATEWAY_PORT}/, accepts only the alive whitelist{200, 401}(mirrorsverify-deployment.ts:GATEWAY_ALIVE_CODES). A non-positive or non-finitetimeoutMsfalls back to the safe default rather than tearing the request down immediately. Theurlparameter is overridable so unit tests can drive a realhttp.Server.waitForGatewayHttpReady(opts)— async polling wrapper with injectableprobe/sleeper. Defaults fromgetGatewayReuseHealthWaitConfig(), overridable viaNEMOCLAW_REUSE_HEALTH_POLL_COUNT(default 6) andNEMOCLAW_REUSE_HEALTH_POLL_INTERVAL(default 5 seconds). The loop probes once before any sleep, then loops with sleep-then-probe — soMath.max(1, attempts)andMath.max(0, interval)normalise both env-derived defaults and caller-supplied options in one place, and the probe always runs at least once.preflight()andonboard():containerState === "missing"— unchanged (existing destroy + recreate path).containerState === "running"— wait for HTTP; on failure, destroy and downgrade to"missing"so the recreate path takes over.containerState === "unknown"— preserve the [All Platform][Onboard]Provider setup fails with "Connection refused" despite gateway showing healthy #2020 invariant by NOT callingdestroyGateway()here (Docker may itself be unavailable). Probe HTTP for an early signal; on failure, downgrade reuse state to"missing"so the regular gateway-start path runs and surfaces a clearer error.startGatewayWithOptions(): the early-return reuse path nowawait isGatewayHttpReady()before printing "✓ Reusing existing gateway". Without this, a fresh OpenShell CLI snapshot taken insidestartGatewayWithOptionscould re-trigger reuse with stale-but-CLI-healthy metadata, bypassing the upstream probe.isGatewayHealthy(...)ANDisGatewayHttpReady()before declaring success — the CLI metadata alone is not sufficient evidence that the upstream is serving.recoverGatewayRuntime()so its two return sites also requireisGatewayHttpReady()alongsideopenshell status+isSelectedGateway. Closes the same stale-metadata gap on the recovery path (CodeRabbit follow-up).maxAttemptsis0or negative, non-positive timeouts fall back to the default), retry/exhaust behaviour with injectable probes, and live-fire HTTP status-code semantics (200, 401 → ready; 502, 404, 403, connection refused → not ready). The connection-refused case binds an ephemeral port and immediately closes it for deterministic local closure.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests