fix(onboard): fail fast when Docker is unreachable during gateway start#4128
Conversation
…chable (#2347) When Colima/Docker is stopped, `openshell gateway start` prints "Failed to create Docker client. Socket not found: /var/run/docker.sock" and exits non-zero, but onboard then kept polling the gateway for health for up to ~15 minutes (3 attempts × ~300s ARM64 health wait) before surfacing a generic "openshell doctor logs" message. Add `classifyGatewayStartFailure(output)` so onboard recognizes the docker-daemon-down signatures (macOS Colima stopped and Linux Cannot-connect-to-daemon) and aborts the retry loop via `pRetry.AbortError` instead. On abort, print an actionable platform-specific remediation (`colima start` on macOS, `sudo systemctl start docker` on Linux) in place of the openshell troubleshooting dump, so the user can recover immediately. Complements the preflight detection in PR #2372 (#2348) — that PR blocks the issue at step [1/8] when Docker is already down before onboard runs; this change handles the same failure if the daemon dies mid-onboard or preflight is skipped, and replaces a 15-minute bounded hang with an immediate, clear error. Signed-off-by: Chengjie Wang <chengjiew@nvidia.com>
📝 WalkthroughWalkthroughThe PR detects when Docker daemon is unreachable during gateway startup, classifies this condition, prints platform-specific recovery instructions, and exits onboarding early instead of hanging on health polling. It introduces a failure classifier, recovery message helpers, integrates them into the gateway retry loop and final handler, and adds E2E coverage and CI wiring. ChangesDocker daemon unreachability detection and early-exit onboarding
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
E2E Scenario Advisor RecommendationRequired scenario E2E: Dispatch required scenario E2E:
Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
PR Review AdvisorFindings: 1 needs attention, 3 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
Signed-off-by: Chengjie Wang <chengjiew@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
2473-2504: Run the onboarding E2E matrix for this fast-fail path before merge.Because this touches core onboarding gateway-start control flow, run the recommended E2E jobs on this branch to confirm no regressions across resume/rebuild/channel lifecycles.
#!/bin/bash gh workflow run nightly-e2e.yaml --ref <branch> -f jobs=cloud-e2e,sandbox-operations-e2e,rebuild-openclaw-e2e,channels-stop-start-e2e,messaging-compatible-endpoint-e2e,bedrock-runtime-compatible-anthropic-e2e,hermes-discord-e2e,hermes-slack-e2e,openshell-gateway-upgrade-e2eAs per coding guidelines:
src/lib/onboard.tschanges affect the full sandbox creation/configuration flow and should run the listed E2E suites.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 2473 - 2504, This change short-circuits gateway start when classifyGatewayStartFailure(...) returns docker_unreachable inside the pRetry block that calls streamGatewayStart(...), so before merging run the full onboarding E2E matrix to validate resume/rebuild/channel lifecycles (recommended jobs: cloud-e2e, sandbox-operations-e2e, rebuild-openclaw-e2e, channels-stop-start-e2e, messaging-compatible-endpoint-e2e, bedrock-runtime-compatible-anthropic-e2e, hermes-discord-e2e, hermes-slack-e2e, openshell-gateway-upgrade-e2e) against this branch to ensure no regressions in the gateway start/health logic that uses dockerUnreachable and pRetry.AbortError handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 1234-1246: The Docker-unreachable guidance block in onboard.ts
(the if (dockerUnreachable) { … } branch that calls printError and exitProcess)
must be extracted into a helper module (e.g.,
src/lib/onboard/docker-driver-gateway-failure.ts) and replaced with a single
call from onboard.ts; create an exported function (e.g.,
handleDockerGatewayFailure or showDockerGatewayAdvice) that accepts any needed
context (platform and exit wrapper) and contains the printError text and
exitProcess call, then import and invoke that function from aboard's
dockerUnreachable branch to keep onboard.ts within the entrypoint budget.
---
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 2473-2504: This change short-circuits gateway start when
classifyGatewayStartFailure(...) returns docker_unreachable inside the pRetry
block that calls streamGatewayStart(...), so before merging run the full
onboarding E2E matrix to validate resume/rebuild/channel lifecycles (recommended
jobs: cloud-e2e, sandbox-operations-e2e, rebuild-openclaw-e2e,
channels-stop-start-e2e, messaging-compatible-endpoint-e2e,
bedrock-runtime-compatible-anthropic-e2e, hermes-discord-e2e, hermes-slack-e2e,
openshell-gateway-upgrade-e2e) against this branch to ensure no regressions in
the gateway start/health logic that uses dockerUnreachable and pRetry.AbortError
handling.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 69f4000e-445a-4633-a4f4-e0a9c70e13c4
📒 Files selected for processing (6)
src/lib/onboard.tssrc/lib/onboard/docker-driver-gateway-failure.test.tssrc/lib/onboard/docker-driver-gateway-failure.tssrc/lib/validation.test.tssrc/lib/validation.tstest/onboard.test.ts
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
2463-2470: Run the onboarding E2Es for this failure-path change.This reroutes a core gateway-start error path inside
src/lib/onboard.ts; I'd run at leastcloud-e2e,sandbox-operations-e2e, andopenshell-gateway-upgrade-e2ebefore merge to catch platform/runtime regressions that focused unit tests won't see.As per coding guidelines
src/lib/onboard.ts: "This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow." and "E2E test recommendation: ..."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 2463 - 2470, The change reroutes a core gateway-start failure path in onboard logic by using reportLegacyGatewayStartResultFailure and setting dockerUnreachable then throwing pRetry.AbortError; run full E2E suites (cloud-e2e, sandbox-operations-e2e, openshell-gateway-upgrade-e2e) to validate no platform/runtime regressions, reproduce the docker_unreachable branch (e.g., simulate gateway start failure) and confirm downstream flows that read dockerUnreachable or catch pRetry.AbortError behave as before; if any test fails, revert or adjust the handling around reportLegacyGatewayStartResultFailure, dockerUnreachable, or the thrown pRetry.AbortError to restore original behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard/gateway-start-failure.ts`:
- Around line 14-22: The failure classifier is being fed raw output (output)
while logs are stripped of ANSI sequences; normalize the output the same way you
normalize lines before calling classifyGatewayStartFailure so colorized Docker
errors aren't missed. Create a cleanedOutput by running
String(output||"").replace(ANSI_RE, "") (and optionally compactText/trim as used
when building lines) and pass that cleanedOutput into
classifyGatewayStartFailure(output) (e.g.,
classifyGatewayStartFailure(cleanedOutput)) so classifyGatewayStartFailure sees
the same ANSI-free text the logger uses.
---
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 2463-2470: The change reroutes a core gateway-start failure path
in onboard logic by using reportLegacyGatewayStartResultFailure and setting
dockerUnreachable then throwing pRetry.AbortError; run full E2E suites
(cloud-e2e, sandbox-operations-e2e, openshell-gateway-upgrade-e2e) to validate
no platform/runtime regressions, reproduce the docker_unreachable branch (e.g.,
simulate gateway start failure) and confirm downstream flows that read
dockerUnreachable or catch pRetry.AbortError behave as before; if any test
fails, revert or adjust the handling around
reportLegacyGatewayStartResultFailure, dockerUnreachable, or the thrown
pRetry.AbortError to restore original behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9ea1b130-b7c3-4063-90d4-b73a895602b3
📒 Files selected for processing (3)
src/lib/onboard.tssrc/lib/onboard/docker-driver-gateway-failure.tssrc/lib/onboard/gateway-start-failure.ts
Signed-off-by: Chengjie Wang <chengjiew@nvidia.com>
|
Follow-up on the advisor item about #2347 scope: this PR implements the explicit fail-fast/remediation path when Docker is unreachable during gateway start. It does not try to auto-start Docker from inside onboard, which keeps the host-runtime boundary explicit. I dispatched the required negative-path E2E against this branch: https://github.com/NVIDIA/NemoClaw/actions/runs/26434415072 |
Selective E2E Results — ✅ All requested jobs passedRun: 26459374200
|
cv
left a comment
There was a problem hiding this comment.
Requesting changes based on the PR Review Advisor rubric.
This is a useful improvement because it avoids the long health-poll hang once openshell gateway start reports Docker is unreachable. However, the linked #2347 acceptance is still only partially covered: the issue asks for detecting stopped Docker before gateway startup / at preflight or auto-restarting Docker. This patch still reaches step [2/8] gateway startup first and does not auto-restart.
Please either add the preflight Docker-daemon check / auto-restart path, or explicitly narrow the issue scope so this PR only claims the fast-fail-after-gateway-start behavior. Also move the new classifyGatewayStartFailure tests out of the already-large src/lib/validation.test.ts into a focused test file to avoid growing that monolith further.
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
cv
left a comment
There was a problem hiding this comment.
Thanks for the scope clarification. I agree #2372 covers the normal preflight/[1/8] Docker-daemon-down path, so this PR should focus on the gateway-start fallback when Docker becomes unreachable after preflight or preflight is skipped/cached.
I pushed 20fee19 to address the remaining hygiene/coverage items:
- moved the classifyGatewayStartFailure tests out of src/lib/validation.test.ts into the focused gateway-start failure test file
- added a regression E2E script plus regression-e2e dispatch job for Docker-unreachable gateway start, asserting we abort before health polling and print Docker recovery guidance
I also dispatched the relevant E2E runs:
- regression-e2e: https://github.com/NVIDIA/NemoClaw/actions/runs/26911103174
- nightly selective E2E: https://github.com/NVIDIA/NemoClaw/actions/runs/26911105116
- scenario E2E: https://github.com/NVIDIA/NemoClaw/actions/runs/26911106558
Approving from the code-review side, pending CI/E2E results.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/regression-e2e.yaml:
- Around line 203-209: In the docker-unreachable-gateway-start-e2e job, update
the "Checkout" step to include persist-credentials: false and replace the loose
actions/setup-node@v6 reference in the "Setup Node" step with the exact pinned
ref used by the sibling strict-tool-call-probe-e2e job (i.e., pin
actions/setup-node to the same commit/tag used there) so both checkout and
setup-node match the hardened pattern.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 96c49391-9ff1-49eb-825e-4bcc585d2826
📒 Files selected for processing (1)
.github/workflows/regression-e2e.yaml
| - name: Checkout | ||
| uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 | ||
|
|
||
| - name: Setup Node | ||
| uses: actions/setup-node@v6 | ||
| with: | ||
| node-version: "22" |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
What is the current recommended commit SHA to pin actions/setup-node v6?
💡 Result:
Current recommended commit SHA to pin actions/setup-node v6 (as of 2026-06-03) is: 48b55a011bda9f5d6aeb4c2d9c7362e8dae4041e. Rationale (how this was determined): GitHub’s actions/setup-node repository shows the latest v6 release as v6.4.0 [1][2][3]. The full commit SHA for that tagged release is shown in the v6.3.0...v6.4.0 comparison page, which lists the v6.4.0 head commit as 48b55a011bda9f5d6aeb4c2d9c7362e8dae4041e [4].
Citations:
- 1: https://github.com/actions/setup-node/releases/tag/v6.4.0
- 2: https://github.com/actions/setup-node/releases
- 3: https://github.com/actions/setup-node
- 4: actions/setup-node@v6.3.0...v6.4.0
Harden docker-unreachable-gateway-start-e2e’s checkout/setup-node to match the sibling job.
This job (lines 203-209) doesn’t set persist-credentials: false on actions/checkout or pin actions/setup-node@v6; mirror the hardened pattern used by strict-tool-call-probe-e2e for supply-chain hygiene.
🔒 Proposed hardening
- name: Checkout
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+ with:
+ persist-credentials: false
- name: Setup Node
- uses: actions/setup-node@v6
+ uses: actions/setup-node@48b55a011bda9f5d6aeb4c2d9c7362e8dae4041e # v6
with:
node-version: "22"🧰 Tools
🪛 zizmor (1.25.2)
[warning] 203-204: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false
(artipacked)
[error] 207-207: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)
(unpinned-uses)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/regression-e2e.yaml around lines 203 - 209, In the
docker-unreachable-gateway-start-e2e job, update the "Checkout" step to include
persist-credentials: false and replace the loose actions/setup-node@v6 reference
in the "Setup Node" step with the exact pinned ref used by the sibling
strict-tool-call-probe-e2e job (i.e., pin actions/setup-node to the same
commit/tag used there) so both checkout and setup-node match the hardened
pattern.
Selective E2E Results — ✅ All requested jobs passedRun: 26911105116
|
Summary
Follow-up to closed PR #2386 for #2347, rebased onto current
main.This keeps the #2347 fix targeted to gateway startup failures where Docker/Colima is unreachable:
openshell gateway startoutput and abort the retry/health-poll loop immediatelycolima start/sudo systemctl start docker) instead of waiting several minutes and ending with generic gateway diagnosticsopenshell gateway startpath and the current Docker-driver gateway failure reporterFailed to create Docker clientmatcher so historical/explanatory text is not misclassifiedonboard.tsre-export ofclassifyGatewayStartFailuresrc/lib/onboard/so the top-levelsrc/lib/onboard.tsentrypoint stays net-neutralFixes #2347.
Verification
npm run build:clinpx vitest run src/lib/validation.test.ts src/lib/onboard/docker-driver-gateway-failure.test.ts test/onboard.test.ts test/gateway-final-failure-cleanup.test.tsNote: the normal pre-push hook reached the full CLI coverage step and appeared to hang in
vitest --coveragewith 0% CPU after the earlier hook checks and TypeScript CLI passed, so the branch push was completed with--no-verifyafter the focused build/test verification above.Signed-off-by: Chengjie Wang chengjiew@nvidia.com
Summary by CodeRabbit
Bug Fixes
New Features
Tests