Skip to content

fix(onboard): remove sandbox bridge reachability probe#3458

Merged
ericksoa merged 1 commit into
mainfrom
revert/sandbox-bridge-reachability-probe
May 13, 2026
Merged

fix(onboard): remove sandbox bridge reachability probe#3458
ericksoa merged 1 commit into
mainfrom
revert/sandbox-bridge-reachability-probe

Conversation

@ericksoa

@ericksoa ericksoa commented May 13, 2026

Copy link
Copy Markdown
Contributor

Summary

Why

The #3441 probe blocks onboarding with false "host firewall" remediation on hosts where the helper container cannot resolve host.openshell.internal. The probe did not match real OpenShell sandbox host alias wiring and the surgical alias fix did not stabilize the nightly path, so this reverts the risky preflight to get main back to good.

Validation

  • npm run build:cli
  • npm run typecheck:cli
  • npx vitest run test/gateway-liveness-probe.test.ts
  • npx @biomejs/biome lint src/lib/onboard.ts test/gateway-liveness-probe.test.ts
  • git diff --check HEAD~1..HEAD

Summary by CodeRabbit

  • Refactor
    • Docker gateway health validation updated to use TCP readiness probe and OpenShell CLI metadata, removing previous sandbox bridge reachability verification from the startup process.

@ericksoa ericksoa self-assigned this May 13, 2026
@coderabbitai

coderabbitai Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

The PR removes Docker-network-based sandbox bridge reachability verification from the gateway readiness flow. Docker-driver gateway health is now determined solely by OpenShell CLI "healthy" metadata plus a TCP readiness probe, eliminating the separate gateway-sandbox-reachability module and its test coverage.

Changes

Sandbox Bridge Reachability Removal

Layer / File(s) Summary
Module and exports removal
src/lib/onboard/gateway-sandbox-reachability.ts, src/lib/onboard/gateway-sandbox-reachability.test.ts
Complete removal of sandbox bridge TCP reachability probe, CLI error formatting function, verify-or-exit helper, and associated types/interfaces; deletes all 116 Vitest cases covering reachability scenarios and error messaging.
Onboard flow refactoring
src/lib/onboard.ts
Removed import and three call sites of verifySandboxBridgeGatewayReachableOrExit from Docker-driver gateway reuse and startup paths; added inline comments clarifying success gating on OpenShell health metadata and TCP readiness probe instead.
Test modernization
test/gateway-liveness-probe.test.ts
Removed Docker-driver liveness test for sandbox bridge preflight verification; added regression test asserting isGatewayHealthy() remains pure (no Docker or process execution primitives in function body).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#3441: Inverse change—adds verifySandboxBridgeGatewayReachableOrExit and wires it into gateway startup; this PR removes that same helper and consolidates to TCP readiness.
  • NVIDIA/NemoClaw#3378: Also refactors Docker-driver "healthy" gating in src/lib/onboard.ts to use OpenShell health plus TCP probe; this PR extends that by dropping the older bridge check entirely.

Suggested labels

NemoClaw CLI, fix, Docker, OpenShell, v0.0.40

Suggested reviewers

  • jyaunches
  • prekshivyas

Poem

🐰 The bridge grew quiet, its checks fell away,
TCP probes now lead the gateway's day,
One truth remains: the health to measure,
Pure functions: our greatest treasure.
Simpler paths, clearer skies await! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: removing the sandbox bridge reachability probe from the onboarding flow, which aligns with the changeset that removes the probe module, tests, and related call sites.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch revert/sandbox-bridge-reachability-probe

Comment @coderabbitai help to get the list of available commands and usage tips.

@ericksoa ericksoa changed the title revert(onboard): remove sandbox bridge reachability probe fix(onboard): remove sandbox bridge reachability probe May 13, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/lib/onboard.ts (1)

4438-4573: Run the Docker-driver onboarding E2Es for this readiness-path revert.

This touches the core gateway reuse/start health gate, so sandbox-operations-e2e and openshell-gateway-upgrade-e2e are the highest-signal checks to run before merge.

As per coding guidelines, src/lib/onboard.ts "contains core onboarding logic" and the recommended selective checks here include sandbox-operations-e2e and openshell-gateway-upgrade-e2e.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 4438 - 4573, The PR changed the gateway
reuse/start health path in startDockerDriverGateway, so before merging run the
high-signal E2E suites to validate behavior and catch regressions: run
sandbox-operations-e2e and openshell-gateway-upgrade-e2e against the updated
startDockerDriverGateway path (exercise branches that reuse existing gateway,
portListener flow, restart-on-drift, and the final health polling) and confirm
reportDockerDriverGatewayStartFailure and related logic
(registerDockerDriverGatewayEndpoint, isGatewayHealthy,
isDockerDriverGatewayHttpReady/isGatewayTcpReady) behave as expected; if
failures appear, reproduce locally, add targeted fixes or retries in
startDockerDriverGateway (or adjust health checks) and re-run the two E2E suites
until green.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 4438-4573: The PR changed the gateway reuse/start health path in
startDockerDriverGateway, so before merging run the high-signal E2E suites to
validate behavior and catch regressions: run sandbox-operations-e2e and
openshell-gateway-upgrade-e2e against the updated startDockerDriverGateway path
(exercise branches that reuse existing gateway, portListener flow,
restart-on-drift, and the final health polling) and confirm
reportDockerDriverGatewayStartFailure and related logic
(registerDockerDriverGatewayEndpoint, isGatewayHealthy,
isDockerDriverGatewayHttpReady/isGatewayTcpReady) behave as expected; if
failures appear, reproduce locally, add targeted fixes or retries in
startDockerDriverGateway (or adjust health checks) and re-run the two E2E suites
until green.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 18e1a384-733b-40e4-9f25-9f4c0e4baf0c

📥 Commits

Reviewing files that changed from the base of the PR and between c517d62 and eebb7ef.

📒 Files selected for processing (4)
  • src/lib/onboard.ts
  • src/lib/onboard/gateway-sandbox-reachability.test.ts
  • src/lib/onboard/gateway-sandbox-reachability.ts
  • test/gateway-liveness-probe.test.ts
💤 Files with no reviewable changes (3)
  • src/lib/onboard/gateway-sandbox-reachability.ts
  • src/lib/onboard/gateway-sandbox-reachability.test.ts
  • test/gateway-liveness-probe.test.ts

@github-actions

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: gateway-health-honest-e2e, cloud-onboard-e2e
Optional E2E: sandbox-survival-e2e, sandbox-operations-e2e, double-onboard-e2e

Dispatch hint: gateway-health-honest-e2e

Workflow run

Full advisor summary

Pi Semantic E2E Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • gateway-health-honest-e2e: This is the canonical regression test for honest gateway health reporting in startDockerDriverGateway — the exact function this PR mutates. It sabotages the gateway binary and asserts onboard does not falsely log 'Docker-driver gateway is healthy'. Must pass to confirm removing the bridge-reachability probe did not break the remaining health gating (TCP readiness + child-exit tracker).
  • cloud-onboard-e2e: Public installer + full non-interactive onboard on ubuntu-latest exercises the Docker-driver gateway happy path end-to-end (startDockerDriverGateway → ready return → sandbox create). Validates that the now-probe-less ready return still produces a working sandbox on a clean host.

Optional E2E

  • sandbox-survival-e2e: Exercises gateway stop/start recovery, which re-enters startDockerDriverGateway through the reuse and port-listener adoption branches that used to call the deleted probe. Useful confidence check that the reuse paths still behave correctly without it.
  • sandbox-operations-e2e: Covers cross-sandbox network isolation and sandbox→gateway traffic on the openshell-docker bridge — the exact packet path the deleted probe was diagnosing. Will catch a real connectivity regression if one slipped in alongside the probe removal.
  • double-onboard-e2e: Re-enters the Docker-driver gateway reuse branch (existing healthy gateway adoption) which previously called verifySandboxBridgeGatewayReachableOrExit before logging '✓ Reusing existing Docker-driver gateway'. Confidence check for the reuse path.

New E2E recommendations

  • onboard/sandbox-bridge-reachability (medium): With verifySandboxBridgeGatewayReachableOrExit removed, NemoClaw no longer surfaces an actionable error (with the 'sudo ufw allow from to any port 8080 proto tcp' hint) when a host firewall blocks the openshell-docker bridge → host:GATEWAY_PORT path. There is no remaining E2E that exercises the firewall-blocked-bridge UX — the regression-e2e gateway-health-honest test only sabotages the gateway binary itself, not host packet filtering. A future user with default-deny ufw will silently complete onboard and only fail at first sandbox use.
    • Suggested test: Add an onboard-firewall-bridge-e2e job that, on a Linux runner, completes onboard, then injects an iptables/ufw rule dropping NEW TCP from the openshell-docker bridge subnet to host port GATEWAY_PORT, runs a sandbox-side TCP probe to host.openshell.internal:GATEWAY_PORT, and asserts the user gets a clear, actionable diagnostic (or, if this PR intentionally drops that contract, asserts the documented post-onboard diagnostic command catches it). Cleans up the firewall rule on teardown.

Dispatch hint

  • Workflow: regression-e2e.yaml
  • jobs input: gateway-health-honest-e2e

@ericksoa ericksoa merged commit 09e1793 into main May 13, 2026
67 of 70 checks passed
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants