Skip to content

fix(onboard): fail fast when Docker is unreachable during gateway start#4128

Merged
cv merged 6 commits into
mainfrom
fix/2347-onboard-docker-hang-v2
Jun 3, 2026
Merged

fix(onboard): fail fast when Docker is unreachable during gateway start#4128
cv merged 6 commits into
mainfrom
fix/2347-onboard-docker-hang-v2

Conversation

@chengjiew

@chengjiew chengjiew commented May 23, 2026

Copy link
Copy Markdown
Contributor

Summary

Follow-up to closed PR #2386 for #2347, rebased onto current main.

This keeps the #2347 fix targeted to gateway startup failures where Docker/Colima is unreachable:

  • classify Docker-daemon-down openshell gateway start output and abort the retry/health-poll loop immediately
  • print platform-specific recovery guidance (colima start / sudo systemctl start docker) instead of waiting several minutes and ending with generic gateway diagnostics
  • cover both the legacy openshell gateway start path and the current Docker-driver gateway failure reporter
  • address previous review feedback by anchoring the Failed to create Docker client matcher so historical/explanatory text is not misclassified
  • remove the unnecessary onboard.ts re-export of classifyGatewayStartFailure
  • keep new gateway-failure handling in src/lib/onboard/ so the top-level src/lib/onboard.ts entrypoint stays net-neutral

Fixes #2347.

Verification

  • npm run build:cli
  • npx vitest run src/lib/validation.test.ts src/lib/onboard/docker-driver-gateway-failure.test.ts test/onboard.test.ts test/gateway-final-failure-cleanup.test.ts

Note: the normal pre-push hook reached the full CLI coverage step and appeared to hang in vitest --coverage with 0% CPU after the earlier hook checks and TypeScript CLI passed, so the branch push was completed with --no-verify after the focused build/test verification above.

Signed-off-by: Chengjie Wang chengjiew@nvidia.com

Summary by CodeRabbit

  • Bug Fixes

    • Improved detection of Docker daemon connectivity failures during gateway startup, enabling faster failure reporting without health polling delays.
  • New Features

    • Added platform-specific recovery guidance (Docker Desktop, Colima, or systemd) when Docker daemon is unreachable.
  • Tests

    • Added test coverage for Docker unreachable scenarios and recovery messaging.

…chable (#2347)

When Colima/Docker is stopped, `openshell gateway start` prints
"Failed to create Docker client. Socket not found: /var/run/docker.sock"
and exits non-zero, but onboard then kept polling the gateway for health
for up to ~15 minutes (3 attempts × ~300s ARM64 health wait) before
surfacing a generic "openshell doctor logs" message.

Add `classifyGatewayStartFailure(output)` so onboard recognizes the
docker-daemon-down signatures (macOS Colima stopped and Linux
Cannot-connect-to-daemon) and aborts the retry loop via
`pRetry.AbortError` instead. On abort, print an actionable
platform-specific remediation (`colima start` on macOS,
`sudo systemctl start docker` on Linux) in place of the openshell
troubleshooting dump, so the user can recover immediately.

Complements the preflight detection in PR #2372 (#2348) — that PR
blocks the issue at step [1/8] when Docker is already down before
onboard runs; this change handles the same failure if the daemon dies
mid-onboard or preflight is skipped, and replaces a 15-minute bounded
hang with an immediate, clear error.

Signed-off-by: Chengjie Wang <chengjiew@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 23, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented May 23, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

The PR detects when Docker daemon is unreachable during gateway startup, classifies this condition, prints platform-specific recovery instructions, and exits onboarding early instead of hanging on health polling. It introduces a failure classifier, recovery message helpers, integrates them into the gateway retry loop and final handler, and adds E2E coverage and CI wiring.

Changes

Docker daemon unreachability detection and early-exit onboarding

Layer / File(s) Summary
Failure classification contract
src/lib/validation.ts
GatewayStartFailure type models docker_unreachable or unknown failures; classifyGatewayStartFailure detects Docker daemon unreachable patterns (missing socket on macOS, connection failures on Linux, generic client creation errors) via regex matching.
Failure reporting and Docker recovery helpers
src/lib/onboard/gateway-start-failure.ts, src/lib/onboard/gateway-start-failure.test.ts
reportLegacyGatewayStartResultFailure sanitizes output, strips ANSI, and classifies the failure; printDockerDaemonRecovery prints platform-aware recovery commands (colima start on macOS, systemctl on Linux, generic fallback). Tests verify ANSI-stripping, docker_unreachable classification, and recovery messaging.
Docker driver diagnostic integration
src/lib/onboard/docker-driver-gateway-failure.ts, src/lib/onboard/docker-driver-gateway-failure.test.ts
Integrates classifyGatewayStartFailure into diagnostic path; when docker_unreachable, prints Docker-specific guidance. Test suite mocks process.exit safely, refactors console.error assembly, and adds coverage for Docker unreachability in gateway logs.
Gateway retry loop and final-handler integration
src/lib/onboard.ts
Imports recovery helpers; adds dockerUnreachable option to FinalGatewayStartFailureOptions; in startGatewayWithOptions retry loop, detects docker_unreachable via reportLegacyGatewayStartResultFailure and aborts retries with pRetry.AbortError; passes dockerUnreachable=true to handleFinalGatewayStartFailure for early Docker recovery exit. Consolidates stale-gateway cleanup in startup block.
End-to-end integration test
test/onboard.test.ts
Integration test spawns startGateway with fake openshell returning Docker-client/socket errors; asserts exit 1, stderr contains Docker recovery guidance (colima start on macOS), and stdout omits health-polling logs.
CI regression E2E job
.github/workflows/regression-e2e.yaml
Adds docker-unreachable-gateway-start-e2e job selection, output flag, and conditional execution; job runs E2E bash test and uploads failure artifacts.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

fix, Docker, onboarding, OpenShell, NemoClaw CLI

Suggested reviewers

  • cv
  • prekshivyas

Poem

Docker daemon gone dark, but no fear today! 🐇
We catch it early with classification's way,
Platform-aware recovery guides the user home,
No more hanging health checks—just swift repairs, no roam! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'fix(onboard): fail fast when Docker is unreachable during gateway start' accurately and concisely describes the main change: detecting Docker unreachability during gateway startup and exiting early instead of hanging indefinitely.
Linked Issues check ✅ Passed The PR fully addresses issue #2347: it detects Docker-daemon failures, aborts the retry/health-poll loop immediately, and emits platform-specific recovery guidance (colima start on macOS, systemctl start docker on Linux) instead of hanging.
Out of Scope Changes check ✅ Passed All changes are scoped to fixing the Docker-unreachable hang: new failure classification in validation.ts, platform-aware recovery messaging in gateway-start-failure.ts, early-exit logic in onboard.ts, comprehensive tests, and E2E workflow additions. No unrelated changes detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/2347-onboard-docker-hang-v2

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented May 23, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: docker-unreachable-gateway-start-e2e, gateway-health-honest-e2e
Optional E2E: onboard-negative-paths-e2e, onboard-inference-smoke-e2e

Dispatch hint: docker-unreachable-gateway-start-e2e,gateway-health-honest-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • docker-unreachable-gateway-start-e2e (low; timeout 15 minutes; hermetic fake openshell on ubuntu-latest): Direct coverage for the changed [NemoClaw][macOS][Onboard] nemoclaw onboard does not auto-restart Docker (Colima) when daemon is stopped — hangs indefinitely at gateway health #2347 fallback: simulates openshell gateway start failing because Docker/Colima is unreachable and verifies onboard exits before health polling with actionable Docker recovery guidance. Also validates the newly added regression workflow job and script wiring.
  • gateway-health-honest-e2e (low-to-medium; timeout 20 minutes; hermetic gateway binary sabotage): Required adjacent guard because Docker-driver gateway failure reporting was modified. This existing regression verifies a failed Docker-driver gateway is not reported healthy and that onboarding exits non-zero on gateway startup failure.

Optional E2E

  • onboard-negative-paths-e2e (medium): Broader confidence for onboarding failure-path UX and non-interactive abort behavior after changes to gateway-start error classification and recovery messaging.
  • onboard-inference-smoke-e2e (low-to-medium; timeout 15 minutes; mock inference route): Useful smoke coverage that the modified onboarding gateway step still allows the happy path to proceed through provider/model route validation and does not regress later real onboarding flow behavior.

New E2E recommendations

  • docker-driver-gateway-docker-unreachable (medium): The PR adds Docker-unreachable recovery guidance to the Docker-driver gateway failure reporter, but the new E2E focuses on the legacy gateway-start path with macOS/Colima semantics. A focused Docker-driver gateway E2E would guard the Linux Docker-driver path end-to-end when the gateway log contains Docker daemon unreachable signatures.
    • Suggested test: docker-driver-gateway-docker-unreachable-e2e

Dispatch hint

  • Workflow: .github/workflows/regression-e2e.yaml
  • jobs input: docker-unreachable-gateway-start-e2e,gateway-health-honest-e2e

@github-actions

github-actions Bot commented May 23, 2026

Copy link
Copy Markdown
Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: ubuntu-no-docker-preflight-negative, ubuntu-repo-cloud-openclaw
Optional scenario E2E: macos-repo-cloud-openclaw

Dispatch required scenario E2E:

  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-no-docker-preflight-negative
  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: medium

Required scenario E2E

  • ubuntu-no-docker-preflight-negative: Onboarding failure classification and Docker-unreachable recovery messaging changed; this is the closest dispatchable scenario contract for Docker-missing negative onboarding behavior and forbidden side effects.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-no-docker-preflight-negative
  • ubuntu-repo-cloud-openclaw: Core onboard gateway-start flow and shared validation code changed; run the primary Ubuntu repo/current cloud OpenClaw scenario to verify normal gateway startup, sandbox creation, and baseline onboarding still work.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Optional scenario E2E

  • macos-repo-cloud-openclaw: The new recovery guidance has macOS/Colima-specific messaging, but macOS is a special-runner scenario with Docker-dependent suites skipped on hosted macOS, so treat it as adjacent optional coverage.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=macos-repo-cloud-openclaw

Relevant changed files

  • src/lib/onboard.ts
  • src/lib/onboard/docker-driver-gateway-failure.ts
  • src/lib/onboard/gateway-start-failure.ts
  • src/lib/validation.ts

@github-actions

github-actions Bot commented May 23, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 1 needs attention, 3 worth checking, 0 nice ideas
Since last review: 1 prior item resolved, 2 still apply, 1 new item found

Review findings

🛠️ Needs attention

🔎 Worth checking

  • Source-of-truth review needed: Docker-unreachable gateway-start output classification: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `src/lib/validation.ts` documents the `docker_unreachable` kind and `test/e2e/test-docker-unreachable-gateway-start.sh` calls this a gateway-start fallback, but no removal condition is present.
  • New E2E job inherits write-scoped workflow permissions while running checked-out code (.github/workflows/regression-e2e.yaml:195): The workflow has `checks: write` and `pull-requests: write` at workflow scope, and the newly added `docker-unreachable-gateway-start-e2e` job checks out the repository and runs branch code via the shell E2E/build path. This PR does not broaden the existing workflow permissions, but it adds another branch-code execution surface under those write-scoped defaults. The job also uses `actions/setup-node@v6` by tag rather than a full commit SHA, unlike checkout and upload-artifact in the same job.
    • Recommendation: Set job-level least-privilege permissions for `docker-unreachable-gateway-start-e2e` if it only needs to build and test, add `persist-credentials: false` to checkout for jobs that do not need repository credentials, and pin `actions/setup-node` to a full commit SHA or document the accepted policy exception.
    • Evidence: Workflow-level permissions are `contents: read`, `checks: write`, and `pull-requests: write`; the new job begins at `docker-unreachable-gateway-start-e2e` and uses checkout plus `actions/setup-node@v6` before running `bash test/e2e/test-docker-unreachable-gateway-start.sh`.
  • Gateway-start fallback still lacks a clear source-fix/removal contract (src/lib/validation.ts:101): The localized fallback now identifies the invalid state and has useful regression tests, but the source-of-truth contract is still incomplete. The code documents that Docker-unreachable output cannot recover via health polling, yet it does not state when this output-classification workaround should be removed or what upstream/source change would make it obsolete.
    • Recommendation: Document the source boundary and removal condition near the classifier or gateway-start call site, for example: this remains until OpenShell returns a structured daemon-unreachable exit reason or NemoClaw's gateway API/preflight can make the mid-start Docker-daemon-down state impossible.
    • Evidence: `classifyGatewayStartFailure()` special-cases textual OpenShell/Docker output and comments explain why retrying health polling cannot recover. Tests cover the fallback, but no removal condition is recorded.

🌱 Nice ideas

  • None.
Since last review details

Current findings:

  • Source-of-truth review needed: Docker-unreachable gateway-start output classification: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `src/lib/validation.ts` documents the `docker_unreachable` kind and `test/e2e/test-docker-unreachable-gateway-start.sh` calls this a gateway-start fallback, but no removal condition is present.
  • Linked issue recovery alternatives remain only partially implemented (src/lib/onboard.ts:2226): Issue [NemoClaw][macOS][Onboard] nemoclaw onboard does not auto-restart Docker (Colima) when daemon is stopped — hangs indefinitely at gateway health #2347's expected behavior requires onboard to detect stopped Docker and either auto-restart Docker and continue, or exit at [1/8] with a clear Docker-start message. This patch handles a narrower gateway-start fallback: after entering step [2/8], it classifies Docker-unreachable `openshell gateway start` output, aborts before health polling, and prints recovery guidance. That fixes the hang symptom for this fallback path, but it does not implement auto-restart and the newly tested path is not the literal [1/8] exit alternative.
  • New E2E job inherits write-scoped workflow permissions while running checked-out code (.github/workflows/regression-e2e.yaml:195): The workflow has `checks: write` and `pull-requests: write` at workflow scope, and the newly added `docker-unreachable-gateway-start-e2e` job checks out the repository and runs branch code via the shell E2E/build path. This PR does not broaden the existing workflow permissions, but it adds another branch-code execution surface under those write-scoped defaults. The job also uses `actions/setup-node@v6` by tag rather than a full commit SHA, unlike checkout and upload-artifact in the same job.
    • Recommendation: Set job-level least-privilege permissions for `docker-unreachable-gateway-start-e2e` if it only needs to build and test, add `persist-credentials: false` to checkout for jobs that do not need repository credentials, and pin `actions/setup-node` to a full commit SHA or document the accepted policy exception.
    • Evidence: Workflow-level permissions are `contents: read`, `checks: write`, and `pull-requests: write`; the new job begins at `docker-unreachable-gateway-start-e2e` and uses checkout plus `actions/setup-node@v6` before running `bash test/e2e/test-docker-unreachable-gateway-start.sh`.
  • Gateway-start fallback still lacks a clear source-fix/removal contract (src/lib/validation.ts:101): The localized fallback now identifies the invalid state and has useful regression tests, but the source-of-truth contract is still incomplete. The code documents that Docker-unreachable output cannot recover via health polling, yet it does not state when this output-classification workaround should be removed or what upstream/source change would make it obsolete.
    • Recommendation: Document the source boundary and removal condition near the classifier or gateway-start call site, for example: this remains until OpenShell returns a structured daemon-unreachable exit reason or NemoClaw's gateway API/preflight can make the mid-start Docker-daemon-down state impossible.
    • Evidence: `classifyGatewayStartFailure()` special-cases textual OpenShell/Docker output and comments explain why retrying health polling cannot recover. Tests cover the fallback, but no removal condition is recorded.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

Signed-off-by: Chengjie Wang <chengjiew@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/lib/onboard.ts (1)

2473-2504: Run the onboarding E2E matrix for this fast-fail path before merge.

Because this touches core onboarding gateway-start control flow, run the recommended E2E jobs on this branch to confirm no regressions across resume/rebuild/channel lifecycles.

#!/bin/bash
gh workflow run nightly-e2e.yaml --ref <branch> -f jobs=cloud-e2e,sandbox-operations-e2e,rebuild-openclaw-e2e,channels-stop-start-e2e,messaging-compatible-endpoint-e2e,bedrock-runtime-compatible-anthropic-e2e,hermes-discord-e2e,hermes-slack-e2e,openshell-gateway-upgrade-e2e

As per coding guidelines: src/lib/onboard.ts changes affect the full sandbox creation/configuration flow and should run the listed E2E suites.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 2473 - 2504, This change short-circuits
gateway start when classifyGatewayStartFailure(...) returns docker_unreachable
inside the pRetry block that calls streamGatewayStart(...), so before merging
run the full onboarding E2E matrix to validate resume/rebuild/channel lifecycles
(recommended jobs: cloud-e2e, sandbox-operations-e2e, rebuild-openclaw-e2e,
channels-stop-start-e2e, messaging-compatible-endpoint-e2e,
bedrock-runtime-compatible-anthropic-e2e, hermes-discord-e2e, hermes-slack-e2e,
openshell-gateway-upgrade-e2e) against this branch to ensure no regressions in
the gateway start/health logic that uses dockerUnreachable and pRetry.AbortError
handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 1234-1246: The Docker-unreachable guidance block in onboard.ts
(the if (dockerUnreachable) { … } branch that calls printError and exitProcess)
must be extracted into a helper module (e.g.,
src/lib/onboard/docker-driver-gateway-failure.ts) and replaced with a single
call from onboard.ts; create an exported function (e.g.,
handleDockerGatewayFailure or showDockerGatewayAdvice) that accepts any needed
context (platform and exit wrapper) and contains the printError text and
exitProcess call, then import and invoke that function from aboard's
dockerUnreachable branch to keep onboard.ts within the entrypoint budget.

---

Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 2473-2504: This change short-circuits gateway start when
classifyGatewayStartFailure(...) returns docker_unreachable inside the pRetry
block that calls streamGatewayStart(...), so before merging run the full
onboarding E2E matrix to validate resume/rebuild/channel lifecycles (recommended
jobs: cloud-e2e, sandbox-operations-e2e, rebuild-openclaw-e2e,
channels-stop-start-e2e, messaging-compatible-endpoint-e2e,
bedrock-runtime-compatible-anthropic-e2e, hermes-discord-e2e, hermes-slack-e2e,
openshell-gateway-upgrade-e2e) against this branch to ensure no regressions in
the gateway start/health logic that uses dockerUnreachable and pRetry.AbortError
handling.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 69f4000e-445a-4633-a4f4-e0a9c70e13c4

📥 Commits

Reviewing files that changed from the base of the PR and between 638bccd and 1d338f9.

📒 Files selected for processing (6)
  • src/lib/onboard.ts
  • src/lib/onboard/docker-driver-gateway-failure.test.ts
  • src/lib/onboard/docker-driver-gateway-failure.ts
  • src/lib/validation.test.ts
  • src/lib/validation.ts
  • test/onboard.test.ts

Comment thread src/lib/onboard.ts

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/lib/onboard.ts (1)

2463-2470: Run the onboarding E2Es for this failure-path change.

This reroutes a core gateway-start error path inside src/lib/onboard.ts; I'd run at least cloud-e2e, sandbox-operations-e2e, and openshell-gateway-upgrade-e2e before merge to catch platform/runtime regressions that focused unit tests won't see.

As per coding guidelines src/lib/onboard.ts: "This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow." and "E2E test recommendation: ..."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 2463 - 2470, The change reroutes a core
gateway-start failure path in onboard logic by using
reportLegacyGatewayStartResultFailure and setting dockerUnreachable then
throwing pRetry.AbortError; run full E2E suites (cloud-e2e,
sandbox-operations-e2e, openshell-gateway-upgrade-e2e) to validate no
platform/runtime regressions, reproduce the docker_unreachable branch (e.g.,
simulate gateway start failure) and confirm downstream flows that read
dockerUnreachable or catch pRetry.AbortError behave as before; if any test
fails, revert or adjust the handling around
reportLegacyGatewayStartResultFailure, dockerUnreachable, or the thrown
pRetry.AbortError to restore original behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard/gateway-start-failure.ts`:
- Around line 14-22: The failure classifier is being fed raw output (output)
while logs are stripped of ANSI sequences; normalize the output the same way you
normalize lines before calling classifyGatewayStartFailure so colorized Docker
errors aren't missed. Create a cleanedOutput by running
String(output||"").replace(ANSI_RE, "") (and optionally compactText/trim as used
when building lines) and pass that cleanedOutput into
classifyGatewayStartFailure(output) (e.g.,
classifyGatewayStartFailure(cleanedOutput)) so classifyGatewayStartFailure sees
the same ANSI-free text the logger uses.

---

Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 2463-2470: The change reroutes a core gateway-start failure path
in onboard logic by using reportLegacyGatewayStartResultFailure and setting
dockerUnreachable then throwing pRetry.AbortError; run full E2E suites
(cloud-e2e, sandbox-operations-e2e, openshell-gateway-upgrade-e2e) to validate
no platform/runtime regressions, reproduce the docker_unreachable branch (e.g.,
simulate gateway start failure) and confirm downstream flows that read
dockerUnreachable or catch pRetry.AbortError behave as before; if any test
fails, revert or adjust the handling around
reportLegacyGatewayStartResultFailure, dockerUnreachable, or the thrown
pRetry.AbortError to restore original behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9ea1b130-b7c3-4063-90d4-b73a895602b3

📥 Commits

Reviewing files that changed from the base of the PR and between 1d338f9 and cccedd2.

📒 Files selected for processing (3)
  • src/lib/onboard.ts
  • src/lib/onboard/docker-driver-gateway-failure.ts
  • src/lib/onboard/gateway-start-failure.ts

Comment thread src/lib/onboard/gateway-start-failure.ts Outdated
Signed-off-by: Chengjie Wang <chengjiew@nvidia.com>
@chengjiew

Copy link
Copy Markdown
Contributor Author

Follow-up on the advisor item about #2347 scope: this PR implements the explicit fail-fast/remediation path when Docker is unreachable during gateway start. It does not try to auto-start Docker from inside onboard, which keeps the host-runtime boundary explicit. I dispatched the required negative-path E2E against this branch: https://github.com/NVIDIA/NemoClaw/actions/runs/26434415072

@chengjiew chengjiew added the v0.0.50 Release target label May 26, 2026
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26459374200
Target ref: 62821373cf8e29550c311b21bc925a08afb67ba3
Workflow ref: main
Requested jobs: onboard-negative-paths-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
onboard-negative-paths-e2e ✅ success

@cv cv added v0.0.52 Release target v0.0.53 Release target and removed v0.0.50 Release target v0.0.52 Release target labels May 26, 2026

@cv cv left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes based on the PR Review Advisor rubric.

This is a useful improvement because it avoids the long health-poll hang once openshell gateway start reports Docker is unreachable. However, the linked #2347 acceptance is still only partially covered: the issue asks for detecting stopped Docker before gateway startup / at preflight or auto-restarting Docker. This patch still reaches step [2/8] gateway startup first and does not auto-restart.

Please either add the preflight Docker-daemon check / auto-restart path, or explicitly narrow the issue scope so this PR only claims the fast-fail-after-gateway-start behavior. Also move the new classifyGatewayStartFailure tests out of the already-large src/lib/validation.test.ts into a focused test file to avoid growing that monolith further.

@ericksoa ericksoa added v0.0.55 and removed v0.0.53 Release target labels May 27, 2026
@jyaunches jyaunches added R2 v0.0.56 Release target and removed v0.0.55 labels May 29, 2026
@cv cv added v0.0.57 Release target and removed v0.0.56 Release target labels Jun 1, 2026
@cv cv self-assigned this Jun 2, 2026
@cv cv added v0.0.58 Release target and removed v0.0.57 Release target labels Jun 3, 2026
@wscurran wscurran added area: cli Command line interface, flags, terminal UX, or output area: packaging Packages, images, registries, installers, or distribution bug-fix PR fixes a bug or regression platform: container Affects Docker, containerd, Podman, or images and removed area: packaging Packages, images, registries, installers, or distribution NemoClaw CLI labels Jun 3, 2026

@cv cv left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the scope clarification. I agree #2372 covers the normal preflight/[1/8] Docker-daemon-down path, so this PR should focus on the gateway-start fallback when Docker becomes unreachable after preflight or preflight is skipped/cached.

I pushed 20fee19 to address the remaining hygiene/coverage items:

  • moved the classifyGatewayStartFailure tests out of src/lib/validation.test.ts into the focused gateway-start failure test file
  • added a regression E2E script plus regression-e2e dispatch job for Docker-unreachable gateway start, asserting we abort before health polling and print Docker recovery guidance

I also dispatched the relevant E2E runs:

Approving from the code-review side, pending CI/E2E results.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/regression-e2e.yaml:
- Around line 203-209: In the docker-unreachable-gateway-start-e2e job, update
the "Checkout" step to include persist-credentials: false and replace the loose
actions/setup-node@v6 reference in the "Setup Node" step with the exact pinned
ref used by the sibling strict-tool-call-probe-e2e job (i.e., pin
actions/setup-node to the same commit/tag used there) so both checkout and
setup-node match the hardened pattern.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 96c49391-9ff1-49eb-825e-4bcc585d2826

📥 Commits

Reviewing files that changed from the base of the PR and between 6282137 and 20fee19.

📒 Files selected for processing (1)
  • .github/workflows/regression-e2e.yaml

Comment on lines +203 to +209
- name: Checkout
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

- name: Setup Node
uses: actions/setup-node@v6
with:
node-version: "22"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

What is the current recommended commit SHA to pin actions/setup-node v6?

💡 Result:

Current recommended commit SHA to pin actions/setup-node v6 (as of 2026-06-03) is: 48b55a011bda9f5d6aeb4c2d9c7362e8dae4041e. Rationale (how this was determined): GitHub’s actions/setup-node repository shows the latest v6 release as v6.4.0 [1][2][3]. The full commit SHA for that tagged release is shown in the v6.3.0...v6.4.0 comparison page, which lists the v6.4.0 head commit as 48b55a011bda9f5d6aeb4c2d9c7362e8dae4041e [4].

Citations:


Harden docker-unreachable-gateway-start-e2e’s checkout/setup-node to match the sibling job.

This job (lines 203-209) doesn’t set persist-credentials: false on actions/checkout or pin actions/setup-node@v6; mirror the hardened pattern used by strict-tool-call-probe-e2e for supply-chain hygiene.

🔒 Proposed hardening
       - name: Checkout
         uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+        with:
+          persist-credentials: false
 
       - name: Setup Node
-        uses: actions/setup-node@v6
+        uses: actions/setup-node@48b55a011bda9f5d6aeb4c2d9c7362e8dae4041e # v6
         with:
           node-version: "22"
🧰 Tools
🪛 zizmor (1.25.2)

[warning] 203-204: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)


[error] 207-207: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/regression-e2e.yaml around lines 203 - 209, In the
docker-unreachable-gateway-start-e2e job, update the "Checkout" step to include
persist-credentials: false and replace the loose actions/setup-node@v6 reference
in the "Setup Node" step with the exact pinned ref used by the sibling
strict-tool-call-probe-e2e job (i.e., pin actions/setup-node to the same
commit/tag used there) so both checkout and setup-node match the hardened
pattern.

@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26911105116
Target ref: 20fee199715ea4a376527d057d1b2238c8a33818
Workflow ref: main
Requested jobs: cloud-e2e,onboard-negative-paths-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
cloud-e2e ✅ success
onboard-negative-paths-e2e ✅ success

@cv cv merged commit b036c27 into main Jun 3, 2026
34 checks passed
@cv cv deleted the fix/2347-onboard-docker-hang-v2 branch June 3, 2026 20:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: cli Command line interface, flags, terminal UX, or output bug-fix PR fixes a bug or regression platform: container Affects Docker, containerd, Podman, or images v0.0.58 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[NemoClaw][macOS][Onboard] nemoclaw onboard does not auto-restart Docker (Colima) when daemon is stopped — hangs indefinitely at gateway health

5 participants