Skip to content

refactor(onboard): extract gateway bootstrap repair helpers#3306

Merged
ericksoa merged 93 commits into
mainfrom
refactor/onboard-gateway-bootstrap
May 13, 2026
Merged

refactor(onboard): extract gateway bootstrap repair helpers#3306
ericksoa merged 93 commits into
mainfrom
refactor/onboard-gateway-bootstrap

Conversation

@cv

@cv cv commented May 9, 2026

Copy link
Copy Markdown
Collaborator

Summary

Extract OpenShell gateway bootstrap secret repair helpers out of the large onboarding module. This continues the onboarding cleanup stack by isolating the repair plan, generated Kubernetes secret script, cluster command wrappers, and repair orchestration behind a focused helper module.

Changes

  • Add src/lib/onboard/gateway-bootstrap.ts for bootstrap secret repair planning, script generation, missing-secret detection, healthcheck execution, and repair orchestration.
  • Update src/lib/onboard.ts to use the extracted gateway bootstrap helpers while preserving existing exports and call sites.
  • Add unit tests covering repair-plan normalization, no-op scripts, targeted repair scripts, missing-secret parsing, and successful repair behavior.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Carlos Villela cvillela@nvidia.com

@cv cv self-assigned this May 9, 2026
@coderabbitai

coderabbitai Bot commented May 9, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 21577653-8316-4292-bab0-1a751ccade48

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch refactor/onboard-gateway-bootstrap

Comment @coderabbitai help to get the list of available commands and usage tips.

@cv cv marked this pull request as draft May 9, 2026 03:24
@copy-pr-bot

copy-pr-bot Bot commented May 9, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@copy-pr-bot

copy-pr-bot Bot commented May 9, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cv cv removed the v0.0.39 label May 12, 2026
cv added a commit that referenced this pull request May 12, 2026
…eporting healthy (#3111) (#3378)

## Summary

`nemoclaw onboard` was printing `✓ Docker-driver gateway is healthy`
even when the openshell-gateway binary crashed immediately on startup,
then failing the next step with `Connection refused`. The reported
trigger on Ubuntu 22.04 was a GLIBC 2.38/2.39 mismatch in the shipped
binary, but the underlying NemoClaw bug is **platform-independent** —
any reason the binary fails to start (missing shared lib, CDI spec
error, port conflict, permissions, corrupted download) surfaces the same
false-positive.

This PR fixes the false-positive at the caller site in
`startDockerDriverGateway()`, without modifying the shared
`isGatewayHealthy()` (which is pinned to pure-function status by the
#2020 follow-up test).

Fixes #3111.
Closes the gap covered by the failing E2E test added in #3362 —
`gateway-health-honest-e2e` flips from 🔴 red on `main` to 🟢 green on
this branch.

## Root cause

`startDockerDriverGateway()` in `src/lib/onboard.ts` spawned the gateway
binary with `spawn(..., { detached: true })` + `child.unref()`, then
polled using two checks that could both lie:

1. **`isPidAlive(childPid)`** uses `process.kill(pid, 0)` which returns
`true` for **zombies**. Since the parent Node process never `wait()`s on
the detached child, crashed children linger as zombies and `isPidAlive`
reports them as alive.

2. **`isGatewayHealthy(status, gwInfo, activeGwInfo)`** is a pure string
match over openshell CLI output. `isGatewayConnected` in
`src/lib/state/gateway.ts` matches on `"Server Status"` — the **table
header** that `openshell status` always prints. On a crashed gateway,
the header is still emitted and the body contains `× client error
(Connect) tcp connect error Connection refused` — but
`isGatewayConnected` returns true anyway.

Smoking gun from the red-on-main run
[25698031380](https://github.com/NVIDIA/NemoClaw/actions/runs/25698031380):

```
[DIAG] openshell status: Server Status
  Gateway: nemoclaw
  Server: http://127.0.0.1:8080
Error:   × client error (Connect)
  ├─▶ tcp connect error
  ╰─▶ Connection refused (os error 111)
```

## Changes

**`src/lib/onboard.ts`** — three coordinated changes in
`startDockerDriverGateway`:

1. **Track child-exit via the ChildProcess `'exit'` event**, not just
`isPidAlive`. A `child.once('exit', ...)` listener flips a `childExited`
flag that the poll loop consults alongside `isPidAlive`. This catches
zombies that `isPidAlive` misses and also captures the exit code/signal
for the failure message.

2. **Add `verifyDockerDriverGatewayListening(port, timeoutMs)`** — a TCP
connect probe to `127.0.0.1:${GATEWAY_PORT}` using `node:net` with a
socket timeout. Resolves boolean, never throws. This is the
Docker-driver path equivalent of `verifyGatewayContainerRunning` (added
for #2020 on the K3s path).

3. **Gate the "healthy" log on the TCP probe**: the poll loop now only
logs `✓ Docker-driver gateway is healthy` after `isGatewayHealthy`
**AND** a successful TCP connect. On probe failure the loop keeps
polling — the binary may still be binding its listener. The
`childExited` check at the top of the loop terminates us if the process
actually died.

Also improves the final failure message to include **how** the gateway
exited (signal vs. exit code) so users don't have to `tail` the gateway
log.

## What this PR does NOT change

`isGatewayHealthy` in `src/lib/state/gateway.ts` is left untouched. The
#2020 follow-up test at `test/gateway-liveness-probe.test.ts:74` pins it
to pure-function status ("no I/O, no docker, no spawn, no exec"). Fix at
the caller, not the shared helper — same pattern as #2020.

## Tests

- **New unit test:** `test/gateway-tcp-probe.test.ts` (9 tests)
- `verifyDockerDriverGatewayListening` resolves true for listening ports
  - resolves false for closed ports (Connection refused)
  - resolves false on timeout (non-routable RFC 1918 host)
  - enforces minimum 50 ms timeout
  - never throws
- **source-shape guards** (regexes over `src/lib/onboard.ts`):
child-exit tracking present, `childExited || !isPidAlive` check at
poll-loop top, TCP probe called before the "healthy" log, exit details
surfaced in the failure message
- **E2E acceptance gate:** `gateway-health-honest-e2e` (from #3362) —
red on main, expected green on this branch. Will dispatch via
nightly-e2e selective run once PR opens.
- **Existing tests:** full `vitest` suite passes (CLI);
`test/gateway-state.test.ts` (47 tests) and
`test/gateway-liveness-probe.test.ts` (7 tests) still green — no
behavior change in the shared helpers.

## Refactoring alignment

Noted in `ACCEPTANCE.md` (also in this branch):

- **#2562** (unified timeout abstraction) — `TODO(#2562)` on the TCP
probe helper's timeout logic, for mechanical adoption later.
- **#3213** (unified advisory registry) — `TODO(#3213)` on the
failure-message format so it can migrate to structured advisories later.
- **PR #3312** (laitingsheng — `isGatewayHttpReady` for K3s path) — same
pattern, different surface. Easy to converge once both land; can extract
a shared "gateway liveness probe" module if desired.
- **PR #3306** (cv — `gateway-bootstrap.ts` extraction) — doesn't touch
`startDockerDriverGateway`; no conflict expected.

## Related

- Fixes #3111
- Coverage guard: #3362 (`gateway-health-honest-e2e`)
- Cross-reference patterns: #2020 (K3s path equivalent), #3312 (HTTP
readiness for K3s reuse)


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added a TCP readiness check during gateway startup to ensure the
service is actually listening before reporting healthy.
* **Bug Fixes**
* Reduced false “alive” reports for the gateway by detecting early
child-process termination.
* Improved startup failure messages to include how the gateway process
terminated (signal vs exit code).
* **Tests**
* Added integration and unit tests to validate startup health and TCP
probe behavior.

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3378)
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>
cjagwani

This comment was marked as outdated.

@cjagwani cjagwani marked this pull request as ready for review May 13, 2026 03:42

@cjagwani cjagwani left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @cv — branch has merge conflicts against main, mind resolving when you get a chance? approval at 20cf1ab stands.

cjagwani and others added 4 commits May 12, 2026 21:02
## Summary
Extract dashboard access URL and guidance helpers out of the large
onboarding module. This continues the onboarding cleanup stack by
isolating dashboard chain resolution, URL/token handling, WSL host
detection, and guidance text generation.

## Changes
- Add `src/lib/onboard/dashboard-access.ts` for dashboard forward
target/port resolution, start command generation, authenticated URL
construction, display redaction, WSL host lookup, access entries, and
guidance lines.
- Update `src/lib/onboard.ts` to delegate dashboard access helpers
through wrappers that preserve existing defaults and exports.
- Add unit tests covering forward target/port derivation, OpenShell
command construction, token redaction, WSL host detection, WSL access
entries, and WSL/empty-access guidance.

## Type of Change
- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes
- [x] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---
Signed-off-by: Carlos Villela <cvillela@nvidia.com>

---------

Co-authored-by: cjagwani <cjagwani@nvidia.com>
…fy' into pr-3306

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

# Conflicts:
#	src/lib/onboard.ts
@github-actions

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: cloud-e2e, sandbox-survival-e2e, dashboard-remote-bind-e2e
Optional E2E: onboard-repair-e2e, double-onboard-e2e, cloud-onboard-e2e

Dispatch hint: cloud-e2e,sandbox-survival-e2e

Workflow run

Full advisor summary

Pi Semantic E2E Advisor

Base: origin/refactor/onboard-web-search-verify
Head: HEAD
Confidence: medium

Required E2E

  • cloud-e2e: Full onboard path exercises both refactored modules: gateway bootstrap secret detection/repair runs inside the gateway-start health-poll loop, and dashboard access URL/forward-target derivation runs at end-of-onboard for printDashboard and openshell forward start. A wiring regression in the dependency-injected helpers would surface here.
  • sandbox-survival-e2e: Sandbox survival across gateway restarts drives the recovery-poll loop in onboard.ts that calls repairGatewayBootstrapSecrets() and gatewayClusterHealthcheckPassed() — exactly the helpers extracted into gateway-bootstrap.ts. This is the closest existing E2E to the gateway-bootstrap repair path.
  • dashboard-remote-bind-e2e: Asserts the OpenShell forward bind target produced by getDashboardForwardTarget/getDashboardForwardStartCommand resolves to 0.0.0.0: for non-loopback chat UI URLs. The forward target is now produced by dashboard-access.ts; a regression would silently revert the bind to 127.0.0.1 and break remote dashboard access.

Optional E2E

  • onboard-repair-e2e: Exercises onboard --resume repair flow which re-enters the gateway start path; useful confidence check that the extracted gateway-bootstrap helpers stay correctly wired during repair.
  • double-onboard-e2e: Re-running onboard hits both the gateway start (repair helpers) and printDashboard (dashboard-access helpers) paths a second time; confirms idempotent behavior of the extracted modules.
  • cloud-onboard-e2e: Public installer onboard path; another execution of the full onboarding sequence including dashboard URL display.

New E2E recommendations

  • onboard/gateway-bootstrap (medium): There is no E2E that explicitly deletes one or more of the openshell-{server-tls,server-client-ca,client-tls,ssh-handshake} secrets from the openshell namespace and then asserts onboard regenerates them via repairGatewayBootstrapSecrets and converges to a healthy gateway. Existing sandbox-survival/onboard-repair tests exercise restart but not the missing-secret repair branch end-to-end.
    • Suggested test: Add test/e2e/test-gateway-bootstrap-secret-repair.sh: after a successful onboard, kubectl -n openshell delete secret openshell-server-tls openshell-ssh-handshake inside openshell-cluster-nemoclaw, re-run onboard --resume (or trigger gateway restart), and assert (a) onboard logs the 'OpenShell bootstrap secrets missing: ... Repairing...' line, (b) the secrets are recreated with the expected SAN/EKU, and (c) the cluster healthcheck passes and the gateway becomes Connected.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: cloud-e2e,sandbox-survival-e2e

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 25780109390
Branch: refactor/onboard-gateway-bootstrap
Requested jobs: sandbox-survival-e2e,openshell-gateway-upgrade-e2e,onboard-repair-e2e,onboard-resume-e2e
Summary: 3 passed, 1 failed, 0 skipped

Job Result
onboard-repair-e2e ✅ success
onboard-resume-e2e ✅ success
openshell-gateway-upgrade-e2e ❌ failure
sandbox-survival-e2e ✅ success

Failed jobs: openshell-gateway-upgrade-e2e. Check run artifacts for logs.

@cv cv changed the base branch from refactor/onboard-web-search-verify to main May 13, 2026 05:42

@ericksoa ericksoa left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved after resolving the merge conflict and validating the updated head. Normal PR checks are green. Focused nightly signal passed for sandbox-survival-e2e, onboard-resume-e2e, and onboard-repair-e2e; openshell-gateway-upgrade-e2e hit the known stale source-layout check handled separately by #3438.

@ericksoa ericksoa enabled auto-merge (squash) May 13, 2026 05:44
@ericksoa ericksoa disabled auto-merge May 13, 2026 05:45
@ericksoa ericksoa merged commit 600d47b into main May 13, 2026
64 of 66 checks passed
@cv cv deleted the refactor/onboard-gateway-bootstrap branch May 27, 2026 21:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

refactor PR restructures code without intended behavior change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants