refactor(onboard): extract gateway bootstrap repair helpers#3306
Conversation
This reverts commit d113704.
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
…ture-layout # Conflicts: # src/lib/actions/sandbox/destroy.ts # src/lib/inference/health.ts
…d-selection-drift
…d-web-search-support
…ard-web-search-config
…rd-web-search-verify
…rd-gateway-bootstrap
…eporting healthy (#3111) (#3378) ## Summary `nemoclaw onboard` was printing `✓ Docker-driver gateway is healthy` even when the openshell-gateway binary crashed immediately on startup, then failing the next step with `Connection refused`. The reported trigger on Ubuntu 22.04 was a GLIBC 2.38/2.39 mismatch in the shipped binary, but the underlying NemoClaw bug is **platform-independent** — any reason the binary fails to start (missing shared lib, CDI spec error, port conflict, permissions, corrupted download) surfaces the same false-positive. This PR fixes the false-positive at the caller site in `startDockerDriverGateway()`, without modifying the shared `isGatewayHealthy()` (which is pinned to pure-function status by the #2020 follow-up test). Fixes #3111. Closes the gap covered by the failing E2E test added in #3362 — `gateway-health-honest-e2e` flips from 🔴 red on `main` to 🟢 green on this branch. ## Root cause `startDockerDriverGateway()` in `src/lib/onboard.ts` spawned the gateway binary with `spawn(..., { detached: true })` + `child.unref()`, then polled using two checks that could both lie: 1. **`isPidAlive(childPid)`** uses `process.kill(pid, 0)` which returns `true` for **zombies**. Since the parent Node process never `wait()`s on the detached child, crashed children linger as zombies and `isPidAlive` reports them as alive. 2. **`isGatewayHealthy(status, gwInfo, activeGwInfo)`** is a pure string match over openshell CLI output. `isGatewayConnected` in `src/lib/state/gateway.ts` matches on `"Server Status"` — the **table header** that `openshell status` always prints. On a crashed gateway, the header is still emitted and the body contains `× client error (Connect) tcp connect error Connection refused` — but `isGatewayConnected` returns true anyway. Smoking gun from the red-on-main run [25698031380](https://github.com/NVIDIA/NemoClaw/actions/runs/25698031380): ``` [DIAG] openshell status: Server Status Gateway: nemoclaw Server: http://127.0.0.1:8080 Error: × client error (Connect) ├─▶ tcp connect error ╰─▶ Connection refused (os error 111) ``` ## Changes **`src/lib/onboard.ts`** — three coordinated changes in `startDockerDriverGateway`: 1. **Track child-exit via the ChildProcess `'exit'` event**, not just `isPidAlive`. A `child.once('exit', ...)` listener flips a `childExited` flag that the poll loop consults alongside `isPidAlive`. This catches zombies that `isPidAlive` misses and also captures the exit code/signal for the failure message. 2. **Add `verifyDockerDriverGatewayListening(port, timeoutMs)`** — a TCP connect probe to `127.0.0.1:${GATEWAY_PORT}` using `node:net` with a socket timeout. Resolves boolean, never throws. This is the Docker-driver path equivalent of `verifyGatewayContainerRunning` (added for #2020 on the K3s path). 3. **Gate the "healthy" log on the TCP probe**: the poll loop now only logs `✓ Docker-driver gateway is healthy` after `isGatewayHealthy` **AND** a successful TCP connect. On probe failure the loop keeps polling — the binary may still be binding its listener. The `childExited` check at the top of the loop terminates us if the process actually died. Also improves the final failure message to include **how** the gateway exited (signal vs. exit code) so users don't have to `tail` the gateway log. ## What this PR does NOT change `isGatewayHealthy` in `src/lib/state/gateway.ts` is left untouched. The #2020 follow-up test at `test/gateway-liveness-probe.test.ts:74` pins it to pure-function status ("no I/O, no docker, no spawn, no exec"). Fix at the caller, not the shared helper — same pattern as #2020. ## Tests - **New unit test:** `test/gateway-tcp-probe.test.ts` (9 tests) - `verifyDockerDriverGatewayListening` resolves true for listening ports - resolves false for closed ports (Connection refused) - resolves false on timeout (non-routable RFC 1918 host) - enforces minimum 50 ms timeout - never throws - **source-shape guards** (regexes over `src/lib/onboard.ts`): child-exit tracking present, `childExited || !isPidAlive` check at poll-loop top, TCP probe called before the "healthy" log, exit details surfaced in the failure message - **E2E acceptance gate:** `gateway-health-honest-e2e` (from #3362) — red on main, expected green on this branch. Will dispatch via nightly-e2e selective run once PR opens. - **Existing tests:** full `vitest` suite passes (CLI); `test/gateway-state.test.ts` (47 tests) and `test/gateway-liveness-probe.test.ts` (7 tests) still green — no behavior change in the shared helpers. ## Refactoring alignment Noted in `ACCEPTANCE.md` (also in this branch): - **#2562** (unified timeout abstraction) — `TODO(#2562)` on the TCP probe helper's timeout logic, for mechanical adoption later. - **#3213** (unified advisory registry) — `TODO(#3213)` on the failure-message format so it can migrate to structured advisories later. - **PR #3312** (laitingsheng — `isGatewayHttpReady` for K3s path) — same pattern, different surface. Easy to converge once both land; can extract a shared "gateway liveness probe" module if desired. - **PR #3306** (cv — `gateway-bootstrap.ts` extraction) — doesn't touch `startDockerDriverGateway`; no conflict expected. ## Related - Fixes #3111 - Coverage guard: #3362 (`gateway-health-honest-e2e`) - Cross-reference patterns: #2020 (K3s path equivalent), #3312 (HTTP readiness for K3s reuse) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added a TCP readiness check during gateway startup to ensure the service is actually listening before reporting healthy. * **Bug Fixes** * Reduced false “alive” reports for the gateway by detecting early child-process termination. * Improved startup failure messages to include how the gateway process terminated (signal vs exit code). * **Tests** * Added integration and unit tests to validate startup health and TCP probe behavior. [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3378) <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Carlos Villela <cvillela@nvidia.com>
…ort, function or class' Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
## Summary Extract dashboard access URL and guidance helpers out of the large onboarding module. This continues the onboarding cleanup stack by isolating dashboard chain resolution, URL/token handling, WSL host detection, and guidance text generation. ## Changes - Add `src/lib/onboard/dashboard-access.ts` for dashboard forward target/port resolution, start command generation, authenticated URL construction, display redaction, WSL host lookup, access entries, and guidance lines. - Update `src/lib/onboard.ts` to delegate dashboard access helpers through wrappers that preserve existing defaults and exports. - Add unit tests covering forward target/port derivation, OpenShell command construction, token redaction, WSL host detection, WSL access entries, and WSL/empty-access guidance. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) --- Signed-off-by: Carlos Villela <cvillela@nvidia.com> --------- Co-authored-by: cjagwani <cjagwani@nvidia.com>
…fy' into pr-3306 Signed-off-by: Aaron Erickson <aerickson@nvidia.com> # Conflicts: # src/lib/onboard.ts
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryPi Semantic E2E AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
Selective E2E Results — ❌ Some jobs failedRun: 25780109390
|
ericksoa
left a comment
There was a problem hiding this comment.
Approved after resolving the merge conflict and validating the updated head. Normal PR checks are green. Focused nightly signal passed for sandbox-survival-e2e, onboard-resume-e2e, and onboard-repair-e2e; openshell-gateway-upgrade-e2e hit the known stale source-layout check handled separately by #3438.
Summary
Extract OpenShell gateway bootstrap secret repair helpers out of the large onboarding module. This continues the onboarding cleanup stack by isolating the repair plan, generated Kubernetes secret script, cluster command wrappers, and repair orchestration behind a focused helper module.
Changes
src/lib/onboard/gateway-bootstrap.tsfor bootstrap secret repair planning, script generation, missing-secret detection, healthcheck execution, and repair orchestration.src/lib/onboard.tsto use the extracted gateway bootstrap helpers while preserving existing exports and call sites.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Signed-off-by: Carlos Villela cvillela@nvidia.com