Skip to content

fix(onboard): require host HTTP readiness before reusing the gateway#3312

Merged
cv merged 11 commits into
mainfrom
fix/3258-wait-cluster-health-on-reuse
May 12, 2026
Merged

fix(onboard): require host HTTP readiness before reusing the gateway#3312
cv merged 11 commits into
mainfrom
fix/3258-wait-cluster-health-on-reuse

Conversation

@laitingsheng

@laitingsheng laitingsheng commented May 9, 2026

Copy link
Copy Markdown
Contributor

Summary

nemoclaw onboard decides whether to reuse the existing OpenShell gateway based on openshell CLI metadata. After a Docker daemon restart (colima stop && colima start, system suspend/resume, etc.), that metadata can still report "healthy" while the gateway upstream is still warming up — http://127.0.0.1:${GATEWAY_PORT}/ returns 502 until k3s is ready. Onboard then skips startup and step 4 fails with "Connection refused". This PR adds a host-level HTTP readiness probe at every reuse decision so we wait for actual readiness, or recreate the gateway when it never becomes ready.

Related Issue

Closes #3258.

Changes

  • Add isGatewayHttpReady(timeoutMs, url) — single async probe of http://127.0.0.1:${GATEWAY_PORT}/, accepts only the alive whitelist {200, 401} (mirrors verify-deployment.ts:GATEWAY_ALIVE_CODES). A non-positive or non-finite timeoutMs falls back to the safe default rather than tearing the request down immediately. The url parameter is overridable so unit tests can drive a real http.Server.
  • Add waitForGatewayHttpReady(opts) — async polling wrapper with injectable probe/sleeper. Defaults from getGatewayReuseHealthWaitConfig(), overridable via NEMOCLAW_REUSE_HEALTH_POLL_COUNT (default 6) and NEMOCLAW_REUSE_HEALTH_POLL_INTERVAL (default 5 seconds). The loop probes once before any sleep, then loops with sleep-then-probe — so Math.max(1, attempts) and Math.max(0, interval) normalise both env-derived defaults and caller-supplied options in one place, and the probe always runs at least once.
  • Apply the wait at both gateway-reuse decision sites in preflight() and onboard():
    • containerState === "missing" — unchanged (existing destroy + recreate path).
    • containerState === "running" — wait for HTTP; on failure, destroy and downgrade to "missing" so the recreate path takes over.
    • containerState === "unknown" — preserve the [All Platform][Onboard]Provider setup fails with "Connection refused" despite gateway showing healthy #2020 invariant by NOT calling destroyGateway() here (Docker may itself be unavailable). Probe HTTP for an early signal; on failure, downgrade reuse state to "missing" so the regular gateway-start path runs and surfaces a clearer error.
  • Add a final HTTP gate inside startGatewayWithOptions(): the early-return reuse path now await isGatewayHttpReady() before printing "✓ Reusing existing gateway". Without this, a fresh OpenShell CLI snapshot taken inside startGatewayWithOptions could re-trigger reuse with stale-but-CLI-healthy metadata, bypassing the upstream probe.
  • Tighten the post-start health-poll loop to require BOTH isGatewayHealthy(...) AND isGatewayHttpReady() before declaring success — the CLI metadata alone is not sufficient evidence that the upstream is serving.
  • Extend the same gate to recoverGatewayRuntime() so its two return sites also require isGatewayHttpReady() alongside openshell status + isSelectedGateway. Closes the same stale-metadata gap on the recovery path (CodeRabbit follow-up).
  • Document the two new override knobs in docs/reference/commands.md under the onboard preflight section.
  • Add test/gateway-http-reuse-wait.test.ts covering env-var defaults, validation clamps (probe runs even when maxAttempts is 0 or negative, non-positive timeouts fall back to the default), retry/exhaust behaviour with injectable probes, and live-fire HTTP status-code semantics (200, 401 → ready; 502, 404, 403, connection refused → not ready). The connection-refused case binds an ephemeral port and immediately closes it for deterministic local closure.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai tinsonl@nvidia.com

Summary by CodeRabbit

  • New Features

    • Onboarding now probes the gateway's local HTTP endpoint and requires HTTP readiness alongside metadata before reusing a gateway.
  • Bug Fixes

    • Preflight no longer performs destructive cleanup when container state is unknown but HTTP is unreachable; resume aborts with an actionable error. If a running gateway's HTTP is unresponsive, onboarding stops forwarding, recreates the gateway, and clears reuse state.
  • Documentation

    • Added guidance and env tuning for reuse health poll count and interval.
  • Tests

    • Expanded tests for HTTP readiness, retry behavior, edge cases, and non-destructive preflight paths.

Review Change Stack

Closes #3258

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 9, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented May 9, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 62690094-2426-4c60-9a73-1f4b40b8f2d1

📥 Commits

Reviewing files that changed from the base of the PR and between 172efa8 and e4ca4fa.

📒 Files selected for processing (2)
  • src/lib/onboard/gateway-http-readiness.ts
  • test/gateway-http-reuse-wait.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/lib/onboard/gateway-http-readiness.ts
  • test/gateway-http-reuse-wait.test.ts

📝 Walkthrough

Walkthrough

Adds host HTTP readiness probes and configurable polling to onboarding gateway reuse/startup/recovery flows; exports isGatewayHttpReady, waitForGatewayHttpReady, and getGatewayReuseHealthWaitConfig; integrates checks into preflight/startup/recover/resume, adds unit tests, and documents the behavior and env vars.

Changes

Gateway HTTP Readiness & Reuse Health Verification

Layer / File(s) Summary
HTTP Readiness Core Helpers & Exports
src/lib/onboard/gateway-http-readiness.ts
Defines alive status codes and probe timeout, env parsing and getGatewayReuseHealthWaitConfig(), isGatewayHttpReady(), and waitForGatewayHttpReady() with injectable probe/sleeper.
Test Module Scaffolding
test/gateway-http-reuse-wait.test.ts
Adds Vitest scaffold importing compiled onboard module and test helpers for closed-local ports and ephemeral status servers.
Configuration & Defaults Tests
test/gateway-http-reuse-wait.test.ts
Validates getGatewayReuseHealthWaitConfig defaults, env var overrides, non-finite fallbacks, and unclamped env reads.
HTTP Readiness Semantic Tests
test/gateway-http-reuse-wait.test.ts
Tests isGatewayHttpReady returns true for 200/401, false for 502/404/403/connection-refused, and handles timeout input fallbacks.
Wait & Retry Control Flow Tests
test/gateway-http-reuse-wait.test.ts
Validates waitForGatewayHttpReady retry/sleep behavior, maxAttempts/interval edge cases, single-probe and at-least-once semantics.
Onboard Imports
src/lib/onboard.ts
Imports readiness helpers and reuse-wait config into main onboarding flow.
Preflight Gateway Reuse Checks
src/lib/onboard.ts
When cached metadata is healthy but Docker verification is 'unknown', call waitForGatewayHttpReady to decide proceed-with-warning vs prevent reuse; when Docker reports 'running' but HTTP unreachable, stop forwarding, destroy stale gateway, clear reuse state to 'missing', and clear local gateway state.
Startup Reuse Gates & Post-Start Polling
src/lib/onboard.ts
Require isGatewayHttpReady() in addition to CLI metadata for final reuse and as part of the post-start health polling success condition.
Recovery Path HTTP Verification
src/lib/onboard.ts
Require host HTTP readiness for recoverGatewayRuntime 'already connected/selected' shortcut and its polling success condition.
Resume Onboard Reuse Checks
src/lib/onboard.ts
Applies the same preflight HTTP-readiness decision logic on resume onboarding for 'unknown' Docker verification and 'running'-but-HTTP-unreachable cases.
Exports for Testing
src/lib/onboard.ts
Exports getGatewayReuseHealthWaitConfig, isGatewayHttpReady, and waitForGatewayHttpReady for unit tests.
Documentation
docs/reference/commands.md
Documents gateway reuse probe behavior and env vars NEMOCLAW_REUSE_HEALTH_POLL_COUNT (default 6, min 1) and NEMOCLAW_REUSE_HEALTH_POLL_INTERVAL (default 5s, min 0).
Onboard Regression Test
test/gateway-liveness-probe.test.ts
Adds test ensuring onboard aborts without destructive fallback when container state is 'unknown' and HTTP probe fails.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I sniffed the gateway by the door,
A cached "healthy" meant nothing more.
I poked the port with careful cheer,
If HTTP sleeps, we stop and steer.
Now reuse waits till the host is sure.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 77.78% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(onboard): require host HTTP readiness before reusing the gateway' is specific and directly reflects the main change: adding HTTP readiness verification before gateway reuse.
Linked Issues check ✅ Passed The PR fully addresses issue #3258 objectives: adds HTTP readiness probing (isGatewayHttpReady), implements retry logic with configurable env vars, applies the check at reuse decision sites, gates final reuse on HTTP status, and tightens health polling.
Out of Scope Changes check ✅ Passed All changes are directly scoped to the linked issue: HTTP readiness probing, retry logic, onboard flow changes, documentation, and tests—no unrelated functionality has been introduced.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/3258-wait-cluster-health-on-reuse

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
src/lib/onboard.ts (1)

4074-4109: Run the restart-sensitive onboarding E2Es on this change.

The new unit coverage is useful, but this touches core gateway reuse/start behavior in src/lib/onboard.ts; I'd still run the onboard/lifecycle E2Es that exercise Docker/Colima restart timing before merge.

As per coding guidelines, src/lib/onboard.ts: "This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow."

Also applies to: 4341-4457

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 4074 - 4109, This change touches gateway
reuse/start logic around waitForGatewayHttpReady, gatewayReuseState,
runOpenshell, destroyGateway, and registry.clearAll — run the restart-sensitive
onboarding/lifecycle E2E suites that simulate Docker/Colima daemon restarts to
verify the three branches (HTTP-ready despite Docker unavailable, HTTP-unready
with Docker unavailable, and container-running-but-HTTP-unready) behave as
intended; if any E2E fails, adjust the logic so that waitForGatewayHttpReady is
awaited in the same way the E2Es expect, preserve the non-destructive "unknown"
path (do not set gatewayReuseState="missing" when Docker is unavailable), and
ensure the recreate path still calls
runOpenshell(["forward","stop",String(DASHBOARD_PORT)], {ignoreError:true}),
destroyGateway(), and registry.clearAll() only when Docker is known-functional.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 4451-4457: recoverGatewayRuntime currently treats "openshell
status" alone as success in two places; update both return sites so they also
await isGatewayHttpReady() before considering recovery complete: when checking
status.includes("Connected") && isSelectedGateway(status) and inside the
recovery poll loop after restart, only set process.env.OPENSHELL_GATEWAY =
GATEWAY_NAME and return true if (await isGatewayHttpReady()) also succeeds;
ensure you await the readiness call and keep the same guards
(status.includes("Connected") and isSelectedGateway) around setting the env and
returning.

In `@test/gateway-http-reuse-wait.test.ts`:
- Around line 221-312: The tests are asserting source text instead of runtime
behavior and the unknownBlockRegex doesn't include the else-if branch, so update
the suite to import and exercise the compiled runtime (dist/lib/onboard.js) and
replace regex matches with unit tests that stub/spy the key functions
(waitForGatewayHttpReady, isGatewayHttpReady, destroyGateway, registry.clearAll)
and assert observed calls and gatewayReuseState transitions when calling
preflight, onboard, and startGatewayWithOptions; specifically, simulate
containerState values ("running", "unknown") and HTTP success/failure via the
waitForGatewayHttpReady stub to verify the two downgrade behaviors
(destroy+missing for running+HTTP-fail and missing-without-destroy for
unknown+HTTP-fail) and ensure the startGatewayWithOptions path requires both
isGatewayHealthy and await isGatewayHttpReady before logging "Reusing existing
gateway".
- Around line 143-145: The test "returns false on connection refused" is flaky
because it probes port 1; update the test to deterministically get an ephemeral
localhost port, close it, then call isGatewayHttpReady against
http://127.0.0.1:<port>/; specifically, in the test create a temporary server
(e.g., via net.createServer or http.createServer), listen(0) to get
address().port, close the server to free the port, then await
isGatewayHttpReady(2000, `http://127.0.0.1:${port}/`) and expect false —
reference the test case name and the isGatewayHttpReady call so you can locate
and replace the hardcoded "127.0.0.1:1" probe with this ephemeral-port pattern.

---

Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 4074-4109: This change touches gateway reuse/start logic around
waitForGatewayHttpReady, gatewayReuseState, runOpenshell, destroyGateway, and
registry.clearAll — run the restart-sensitive onboarding/lifecycle E2E suites
that simulate Docker/Colima daemon restarts to verify the three branches
(HTTP-ready despite Docker unavailable, HTTP-unready with Docker unavailable,
and container-running-but-HTTP-unready) behave as intended; if any E2E fails,
adjust the logic so that waitForGatewayHttpReady is awaited in the same way the
E2Es expect, preserve the non-destructive "unknown" path (do not set
gatewayReuseState="missing" when Docker is unavailable), and ensure the recreate
path still calls runOpenshell(["forward","stop",String(DASHBOARD_PORT)],
{ignoreError:true}), destroyGateway(), and registry.clearAll() only when Docker
is known-functional.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1299fd53-dccc-462e-b3b4-206dad77ebdc

📥 Commits

Reviewing files that changed from the base of the PR and between c4aaec3 and c2228b3.

📒 Files selected for processing (2)
  • src/lib/onboard.ts
  • test/gateway-http-reuse-wait.test.ts

Comment thread src/lib/onboard.ts
Comment thread test/gateway-http-reuse-wait.test.ts Outdated
Comment thread test/gateway-http-reuse-wait.test.ts Outdated
…e-text test asserts

Address CodeRabbit review on #3258:
- Extend host HTTP readiness gate to both return sites in
  recoverGatewayRuntime() so the recovery path cannot return success on
  stale CLI metadata while http://localhost:8080/ is still 502.
- Replace port-1 connection-refused probe with an ephemeral-port
  bind+close helper for deterministic local closure.
- Drop the source-text integration assertions; runtime behaviour is
  covered by the live HTTP server tests against isGatewayHttpReady and
  the injectable-probe tests for waitForGatewayHttpReady.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
end-of-file-fixer hook in CI rejects extra trailing newline.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Mention the new gateway-reuse HTTP readiness probe added in #3258 and
the two override knobs (NEMOCLAW_REUSE_HEALTH_POLL_COUNT,
NEMOCLAW_REUSE_HEALTH_POLL_INTERVAL) in the onboard preflight section.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Centralise the validation guards in waitForGatewayHttpReady so
env-derived defaults and caller-supplied options are normalised in one
place — Math.max(1, attempts) and Math.max(0, intervalSeconds). The
config function is now just an env reader.

Restructure the wait loop to always probe at least once before any
sleep, eliminating the trailing-attempt guard and naturally enforcing
the minimum.

Validate the per-probe timeoutMs in isGatewayHttpReady too: a
non-positive or non-finite value falls back to the safe default rather
than tearing the request down immediately.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Document that NEMOCLAW_REUSE_HEALTH_POLL_COUNT is floored at 1 (so the
probe always runs at least once) and the interval is floored at 0.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@laitingsheng laitingsheng marked this pull request as ready for review May 9, 2026 11:28

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
test/gateway-http-reuse-wait.test.ts (1)

253-270: ⚡ Quick win

Add an explicit negative-interval clamp test.

You already verify attempt clamping; adding one case for intervalSeconds < 0 would fully lock down the documented normalization behavior.

Suggested test addition
 describe("waitForGatewayHttpReady (`#3258`)", () => {
@@
   it("always probes at least once even when maxAttempts is 0 or negative", async () => {
@@
   });
+
+  it("clamps negative intervalSeconds to 0", async () => {
+    let calls = 0;
+    const sleeps: number[] = [];
+    const result = await waitForGatewayHttpReady({
+      probe: async () => {
+        calls += 1;
+        return false;
+      },
+      sleeper: (s: number) => sleeps.push(s),
+      maxAttempts: 3,
+      intervalSeconds: -5,
+    });
+    expect(result).toBe(false);
+    expect(calls).toBe(3);
+    expect(sleeps).toEqual([0, 0]);
+  });
 });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/gateway-http-reuse-wait.test.ts` around lines 253 - 270, Add a new test
alongside the existing "always probes at least once..." case that passes a
negative intervalSeconds to waitForGatewayHttpReady and asserts the same
normalization behavior: the probe is invoked exactly once, result is false, and
no sleeps occur. Use the same structure/variables as the current test (calls
counter, sleeps array, probe async that returns false, sleeper that pushes to
sleeps) but set intervalSeconds to a negative value (e.g., -5) and keep
maxAttempts at 0 (or another bad value) to verify the interval clamping logic in
waitForGatewayHttpReady.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 4087-4108: The code is incorrectly downgrading gatewayReuseState
from "unknown" to "missing" which causes the orphan-cleanup path to remove live
containers; change this to a separate non-destructive state/flag (e.g., set
gatewayReuseState = "no_reuse_no_cleanup" or set a boolean like
skipOrphanCleanup = true) inside the branch where waitForGatewayHttpReady() is
false, and update any later logic that performs orphan cleanup (the code that
checks gatewayReuseState for "missing" or runs cleanup) to only treat the
original "missing" value as eligible for stop/remove; ensure functions/variables
referenced include gatewayReuseState and waitForGatewayHttpReady and propagate
the new state/flag through the preflight flow so reuse is prevented but
orphan-cleanup is suppressed.
- Around line 3583-3593: getGatewayReuseHealthWaitConfig/opts values can be
non-finite and Math.max won't sanitize NaN/Infinity, so clamp and coerce
maxAttempts and intervalSeconds to finite numbers before the poll loop: compute
a finiteMaxAttempts = Number.isFinite(opts.maxAttempts) ? Math.max(1,
Math.trunc(opts.maxAttempts)) : Math.max(1, config.count) and a
finiteIntervalSeconds = Number.isFinite(opts.intervalSeconds) ? Math.max(0,
opts.intervalSeconds) : Math.max(0, config.interval); then use finiteMaxAttempts
and finiteIntervalSeconds in place of maxAttempts/intervalSeconds when calling
sleeper and in the for loop (keep the initial probe() call and reference probe
and sleeper as before).

---

Nitpick comments:
In `@test/gateway-http-reuse-wait.test.ts`:
- Around line 253-270: Add a new test alongside the existing "always probes at
least once..." case that passes a negative intervalSeconds to
waitForGatewayHttpReady and asserts the same normalization behavior: the probe
is invoked exactly once, result is false, and no sleeps occur. Use the same
structure/variables as the current test (calls counter, sleeps array, probe
async that returns false, sleeper that pushes to sleeps) but set intervalSeconds
to a negative value (e.g., -5) and keep maxAttempts at 0 (or another bad value)
to verify the interval clamping logic in waitForGatewayHttpReady.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f8fb97bd-0c6e-4a36-95fb-73bf8ad255b3

📥 Commits

Reviewing files that changed from the base of the PR and between 0a847e5 and dd2dc9b.

📒 Files selected for processing (3)
  • docs/reference/commands.md
  • src/lib/onboard.ts
  • test/gateway-http-reuse-wait.test.ts
✅ Files skipped from review due to trivial changes (1)
  • docs/reference/commands.md

Comment thread src/lib/onboard.ts Outdated
Comment thread src/lib/onboard.ts Outdated
…guard NaN/Infinity

Address two CodeRabbit findings on #3258:

* Don't downgrade reuseState to "missing" in the preflight unknown+
  HTTP-fail branch. The orphan-cleanup that runs on "missing" can
  destroy a possibly-live gateway when Docker's transient inspect
  failure coincides with an HTTP warm-up miss — exactly the case the
  branch is supposed to protect. The main-onboard branch (which has no
  orphan-cleanup) and the startGatewayWithOptions HTTP gate together
  still ensure the actual reuse decision goes through HTTP verification.
* Add Number.isFinite guards in waitForGatewayHttpReady so caller-
  supplied Infinity/NaN don't bypass the Math.max clamps and either
  hang the loop or feed NaN into the sleeper.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

@cv cv left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you extract this new code out of onboard.ts, please? This file is already too big.

…onboard/

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
… unknown and gateway HTTP is unready

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions

github-actions Bot commented May 12, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: None
Optional E2E: None

Workflow run

Full advisor summary

Pi Semantic E2E Advisor

Failed: pi exited with status 1; see /home/runner/work/NemoClaw/NemoClaw/artifacts/e2e-advisor/e2e-advisor-pi-raw-output.txt

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard/gateway-http-readiness.ts`:
- Around line 116-133: The probe() calls (used via const probe = opts.probe ??
(() => isGatewayHttpReady())) can reject and currently abort the readiness loop;
change both the initial await probe() and the subsequent await probe() inside
the for loop to catch any thrown errors and treat them as a failed probe (return
false for that attempt) so retries continue; implement this by wrapping each
await probe() in a try/catch (or use a helper like safeProbe()) and return true
only on a successful resolved true, otherwise continue the loop and ultimately
return false after attempts are exhausted; reference probe(),
isGatewayHttpReady(), maxAttempts, intervalSeconds and sleeper to locate the
spots to update.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e2cd2c59-2ceb-4641-a4f7-7ea7fce979a0

📥 Commits

Reviewing files that changed from the base of the PR and between 373b888 and 172efa8.

📒 Files selected for processing (3)
  • docs/reference/commands.md
  • src/lib/onboard.ts
  • src/lib/onboard/gateway-http-readiness.ts
✅ Files skipped from review due to trivial changes (1)
  • docs/reference/commands.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard.ts

Comment thread src/lib/onboard/gateway-http-readiness.ts
…ttpReady

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@cv cv merged commit f1a8a72 into main May 12, 2026
25 of 26 checks passed
jyaunches added a commit that referenced this pull request May 12, 2026
…x md lint

Address PR review findings:

1. Merge origin/main to pick up PR #3312 (isGatewayHttpReady from
   src/lib/onboard/gateway-http-readiness.ts). Replace our ad-hoc
   verifyDockerDriverGatewayListening (TCP probe) with a call to the
   shared isGatewayHttpReady helper — HTTP is strictly stronger than
   a raw TCP connect and is already the de-facto standard used by
   every other gateway-reuse decision site in onboard.ts after #3312.
   Docker-driver and K3s paths now converge on the same probe.

2. Drop the parallel test/gateway-tcp-probe.test.ts (9 tests for a
   helper that no longer exists). The shared helper's behavior is
   already covered by test/gateway-http-reuse-wait.test.ts (21 tests).
   Replace with test/gateway-health-honest-integration.test.ts (6
   source-shape tests) guarding the #3111 integration pattern.

3. Fix 7 markdownlint errors in ACCEPTANCE.md (MD040, MD022, MD031) —
   the 'checks' job flagged these; now passes locally.

4. Update ACCEPTANCE.md to reflect the simpler design: Phase 1 is
   'reuse existing helper' rather than 'add new TCP probe helper',
   and the refactoring alignment table records #3312 as 'adopted'
   rather than 'to be coordinated'.

Behavior of the fix is unchanged:
  - child.once('exit') listener tracks zombies (detached children
    that process.kill(pid, 0) would falsely report as alive)
  - poll loop guard is childExited || !isPidAlive(childPid)
  - healthy log gated on isGatewayHttpReady() after isGatewayHealthy()
  - user-visible exit code/signal surfaced on failure

Related: #3111
Coverage guard: #3362 — gateway-health-honest-e2e
jyaunches added a commit that referenced this pull request May 12, 2026
…iness

Addresses the failing openshell-gateway-upgrade-e2e surfaced by the
e2e-advisor on PR #3378.

## Problem

The previous commit converged the Docker-driver gateway's readiness
probe onto isGatewayHttpReady from ./onboard/gateway-http-readiness
(introduced by #3312). That broke openshell-gateway-upgrade-e2e:

    Starting OpenShell gateway...
    [PASS] Stale gateway is healthy with supervisor:0.0.36
    [INFO] Invoking NemoClaw gateway start path; it must restart the stale process
    Docker-driver gateway failed to start.    <-- 60s timeout

The gateway log showed the process was actually serving correctly:
    POST /openshell.v1.OpenShell/Health response status=200 latency_ms=0
    POST /openshell.v1.OpenShell/Health response status=200 latency_ms=0
    GET  /                                  response status=404 latency_ms=0

## Root cause

isGatewayHttpReady does GET /, and only accepts {200, 401}. The K3s-path
gateway answers GET / via a dispatcher catch-all. The Docker-driver
gateway, by contrast, only serves routes under /openshell.v1.OpenShell/*
and returns 404 for GET /. So the probe always fails on the Docker-
driver path regardless of whether the gateway is up.

## Fix

Revert to a plain TCP probe on the gateway port for the Docker-driver
path. Add a small isDockerDriverGatewayTcpReady(port, timeoutMs, host)
helper next to isPidAlive. This is semantic-free — it just asks "is
anyone listening" — which is exactly what's needed to detect the #3111
failure mode (crashed binary → nothing listening) without making
assumptions about HTTP route shape.

isGatewayHttpReady is left in place for all K3s-path call sites; the
two paths use different probes because the two gateway types expose
different HTTP surfaces. TODO(#3213) filed for future convergence.

## Tests

test/gateway-health-honest-integration.test.ts now has 10 tests:
  5 behavioural: listening / closed / timeout / min-timeout / no-throw
  5 source-shape: child-exit tracking, poll-loop guard, TCP probe call
    site, regression guard against reintroducing the HTTP probe in the
    Docker-driver loop, exit-details surfaced

## Expected CI result

  - gateway-health-honest-e2e: GREEN (already was — this was never affected)
  - openshell-gateway-upgrade-e2e: FLIPS BACK to GREEN (the breakage this
    commit addresses)

Related: #3111
Broke in: 29bba3e (refactor(onboard): converge on shared isGatewayHttpReady)
jyaunches added a commit that referenced this pull request May 12, 2026
Addresses PR review feedback — onboard.ts is the God Object being
decomposed, and growing its exports is an anti-pattern. Follow the
pattern established by PR #3312 (which added
src/lib/onboard/gateway-http-readiness.ts) and create a peer module
for the Docker-driver path.

New file: src/lib/onboard/gateway-tcp-readiness.ts
  Exports: isGatewayTcpReady(port, timeoutMs, host)
  Model: mirrors gateway-http-readiness.ts structure exactly —
         default port from GATEWAY_PORT, minimum-timeout clamp,
         pure async, never throws.

New file: src/lib/onboard/gateway-tcp-readiness.test.ts
  Co-located behavioural tests per the established pattern
  (see temp-files.test.ts, legacy-command.test.ts, etc.)
  6 tests: listening / closed / timeout / min-timeout / no-throw /
  default-port. Imports via ES module, not via dist/ indirection.

onboard.ts net change:
  Before: +88 lines (+48 helper + integration)
  After:  +22 lines (2 require lines + integration only)

Related to: #3111
jyaunches added a commit that referenced this pull request May 12, 2026
…odules

Addresses PR review: 'This PR is adding lines to onboard.ts — which is
now an antipattern' and the onboard-entrypoint-budget CI check.

src/lib/onboard.ts is the God Object being decomposed; growing it is
explicitly flagged as "away from target architecture" in the repo's
review guidelines, and the onboard-entrypoint-budget workflow fails the
PR if the entrypoint grows net-positive.

## Changes

Extract three focused modules under src/lib/onboard/ that collectively
replace the inline logic added in earlier commits on this branch:

- src/lib/onboard/gateway-tcp-readiness.ts
    isGatewayTcpReady(port, timeoutMs, host): plain TCP liveness probe
    for the Docker-driver gateway. Peer of gateway-http-readiness.ts
    (introduced by #3312). Uses a minimum-timeout clamp so a regression
    that drops Math.max(50, timeoutMs) is caught by tests.

- src/lib/onboard/child-exit-tracker.ts
    trackChildExit(child): attaches a one-shot 'exit' listener to a
    ChildProcess and returns a ChildExitState view. Zombie-safe: fires
    reliably for detached children that isPidAlive(pid, 0) would
    falsely report as alive (#3111). Also exposes describeExit() —
    'exited with code 127' / 'killed by signal SIGKILL' — so callers
    don't have to reformat the signal/code themselves.

- src/lib/onboard/docker-driver-gateway-failure.ts
    reportDockerDriverGatewayStartFailure(logPath, childExit, opts):
    the full diagnostic bundle emitted on startup failure — 'failed to
    start' header, exit descriptor, redacted log tail, troubleshooting
    footer. Honors exitOnFailure (production) vs. throw-only (recovery).

Each module has co-located behavioural tests (6 + 5 + 6 tests = 17),
using direct ES imports instead of the dist/ indirection that
test/onboard.test.ts uses. Source-shape guards in
test/gateway-health-honest-integration.test.ts enforce that the
three modules stay extracted (the entrypoint cannot reintroduce the
inline implementations).

## Result

  src/lib/onboard.ts: 15 additions, 21 deletions — net -6 lines.
  src/lib/onboard/**: 3 new modules, 17 new unit tests, full behavioural
      + signal-vs-code coverage on all three helpers.

Both E2E tests remain green on this branch:
  - gateway-health-honest-e2e (#3111 coverage guard)
  - openshell-gateway-upgrade-e2e (regression target the earlier
    HTTP-probe attempt broke; TCP probe is the correct fit)

Related: #3111
Aligns with architecture goal tracked by onboard-entrypoint-budget CI.
cv added a commit that referenced this pull request May 12, 2026
…eporting healthy (#3111) (#3378)

## Summary

`nemoclaw onboard` was printing `✓ Docker-driver gateway is healthy`
even when the openshell-gateway binary crashed immediately on startup,
then failing the next step with `Connection refused`. The reported
trigger on Ubuntu 22.04 was a GLIBC 2.38/2.39 mismatch in the shipped
binary, but the underlying NemoClaw bug is **platform-independent** —
any reason the binary fails to start (missing shared lib, CDI spec
error, port conflict, permissions, corrupted download) surfaces the same
false-positive.

This PR fixes the false-positive at the caller site in
`startDockerDriverGateway()`, without modifying the shared
`isGatewayHealthy()` (which is pinned to pure-function status by the
#2020 follow-up test).

Fixes #3111.
Closes the gap covered by the failing E2E test added in #3362 —
`gateway-health-honest-e2e` flips from 🔴 red on `main` to 🟢 green on
this branch.

## Root cause

`startDockerDriverGateway()` in `src/lib/onboard.ts` spawned the gateway
binary with `spawn(..., { detached: true })` + `child.unref()`, then
polled using two checks that could both lie:

1. **`isPidAlive(childPid)`** uses `process.kill(pid, 0)` which returns
`true` for **zombies**. Since the parent Node process never `wait()`s on
the detached child, crashed children linger as zombies and `isPidAlive`
reports them as alive.

2. **`isGatewayHealthy(status, gwInfo, activeGwInfo)`** is a pure string
match over openshell CLI output. `isGatewayConnected` in
`src/lib/state/gateway.ts` matches on `"Server Status"` — the **table
header** that `openshell status` always prints. On a crashed gateway,
the header is still emitted and the body contains `× client error
(Connect) tcp connect error Connection refused` — but
`isGatewayConnected` returns true anyway.

Smoking gun from the red-on-main run
[25698031380](https://github.com/NVIDIA/NemoClaw/actions/runs/25698031380):

```
[DIAG] openshell status: Server Status
  Gateway: nemoclaw
  Server: http://127.0.0.1:8080
Error:   × client error (Connect)
  ├─▶ tcp connect error
  ╰─▶ Connection refused (os error 111)
```

## Changes

**`src/lib/onboard.ts`** — three coordinated changes in
`startDockerDriverGateway`:

1. **Track child-exit via the ChildProcess `'exit'` event**, not just
`isPidAlive`. A `child.once('exit', ...)` listener flips a `childExited`
flag that the poll loop consults alongside `isPidAlive`. This catches
zombies that `isPidAlive` misses and also captures the exit code/signal
for the failure message.

2. **Add `verifyDockerDriverGatewayListening(port, timeoutMs)`** — a TCP
connect probe to `127.0.0.1:${GATEWAY_PORT}` using `node:net` with a
socket timeout. Resolves boolean, never throws. This is the
Docker-driver path equivalent of `verifyGatewayContainerRunning` (added
for #2020 on the K3s path).

3. **Gate the "healthy" log on the TCP probe**: the poll loop now only
logs `✓ Docker-driver gateway is healthy` after `isGatewayHealthy`
**AND** a successful TCP connect. On probe failure the loop keeps
polling — the binary may still be binding its listener. The
`childExited` check at the top of the loop terminates us if the process
actually died.

Also improves the final failure message to include **how** the gateway
exited (signal vs. exit code) so users don't have to `tail` the gateway
log.

## What this PR does NOT change

`isGatewayHealthy` in `src/lib/state/gateway.ts` is left untouched. The
#2020 follow-up test at `test/gateway-liveness-probe.test.ts:74` pins it
to pure-function status ("no I/O, no docker, no spawn, no exec"). Fix at
the caller, not the shared helper — same pattern as #2020.

## Tests

- **New unit test:** `test/gateway-tcp-probe.test.ts` (9 tests)
- `verifyDockerDriverGatewayListening` resolves true for listening ports
  - resolves false for closed ports (Connection refused)
  - resolves false on timeout (non-routable RFC 1918 host)
  - enforces minimum 50 ms timeout
  - never throws
- **source-shape guards** (regexes over `src/lib/onboard.ts`):
child-exit tracking present, `childExited || !isPidAlive` check at
poll-loop top, TCP probe called before the "healthy" log, exit details
surfaced in the failure message
- **E2E acceptance gate:** `gateway-health-honest-e2e` (from #3362) —
red on main, expected green on this branch. Will dispatch via
nightly-e2e selective run once PR opens.
- **Existing tests:** full `vitest` suite passes (CLI);
`test/gateway-state.test.ts` (47 tests) and
`test/gateway-liveness-probe.test.ts` (7 tests) still green — no
behavior change in the shared helpers.

## Refactoring alignment

Noted in `ACCEPTANCE.md` (also in this branch):

- **#2562** (unified timeout abstraction) — `TODO(#2562)` on the TCP
probe helper's timeout logic, for mechanical adoption later.
- **#3213** (unified advisory registry) — `TODO(#3213)` on the
failure-message format so it can migrate to structured advisories later.
- **PR #3312** (laitingsheng — `isGatewayHttpReady` for K3s path) — same
pattern, different surface. Easy to converge once both land; can extract
a shared "gateway liveness probe" module if desired.
- **PR #3306** (cv — `gateway-bootstrap.ts` extraction) — doesn't touch
`startDockerDriverGateway`; no conflict expected.

## Related

- Fixes #3111
- Coverage guard: #3362 (`gateway-health-honest-e2e`)
- Cross-reference patterns: #2020 (K3s path equivalent), #3312 (HTTP
readiness for K3s reuse)


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added a TCP readiness check during gateway startup to ensure the
service is actually listening before reporting healthy.
* **Bug Fixes**
* Reduced false “alive” reports for the gateway by detecting early
child-process termination.
* Improved startup failure messages to include how the gateway process
terminated (signal vs exit code).
* **Tests**
* Added integration and unit tests to validate startup health and TCP
probe behavior.

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3378)
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression

Projects

None yet

3 participants