Skip to content

fix(verify): retry gateway and dashboard probes; correct gateway-log hint#3576

Merged
cv merged 6 commits into
mainfrom
fix/post-onboard-verify-retry
May 15, 2026
Merged

fix(verify): retry gateway and dashboard probes; correct gateway-log hint#3576
cv merged 6 commits into
mainfrom
fix/post-onboard-verify-retry

Conversation

@laitingsheng

@laitingsheng laitingsheng commented May 15, 2026

Copy link
Copy Markdown
Contributor

Summary

Post-onboard verification was surfacing scary "Deployment verification found issues" warnings in two distinct cases: (1) on slower hosts the gateway and host port forward had not finished coming up before the probes fired, and (2) the step [8/8] policy-apply step restarts the sandbox container but leaves gateway start to the next nemoclaw connect, so the in-image verify never sees a running gateway. The gateway-failure hint also pointed users at the wrong log location, compounding the confusion.

Related Issue

Fixes #3563
Fixes #3569
Fixes #3573

Changes

  • Make verifyDeployment async and retry the two blocking probes (gateway + dashboard) with a 1/2/5/7/10 s backoff before declaring failure.
  • Call checkAndRecoverSandboxProcesses after policy-apply and before verifyDeployment in onboard.ts so the post-policy sandbox restart never leaves the gateway down by the time the verify probes run.
  • Surface a corrected gateway-failure hint that references both the in-sandbox log via nemoclaw <name> logs and the host-side OpenShell gateway log path.
  • Update the caller in onboard.ts to await the new signature.
  • Add tests for late-arriving gateway and dashboard recovery, retry-budget exhaustion, and the new hint shape.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • `npx prek run --all-files` passes
  • `npm test` passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • `make docs` builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai tinsonl@nvidia.com

Summary by CodeRabbit

  • New Features

    • Configurable retry behavior for gateway and dashboard checks during post-deployment verification.
    • Automatic process recovery is performed before running verification.
    • Gateway diagnostics now point to in-sandbox and host-side logs for troubleshooting.
  • Improvements

    • Verification runs asynchronously and completes before diagnostics are formatted and the dashboard is shown.
  • Tests

    • Verification tests updated to cover retry behavior and async call patterns.

Review Change Stack

…hint

Post-onboard verification raced the createSandbox return, so on slower
hosts the gateway and dashboard probes printed scary "Deployment
verification found issues" warnings even though the sandbox came up a
few seconds later and `nemoclaw status` reported healthy.

The follow-up hint also pointed users at /tmp/gateway.log inside the
sandbox — that file is never written; the gateway runs on the host.

- verifyDeployment is now async and retries the two blocking probes
  with a 1/2/5/7/10 s backoff, mirroring the TOCTOU handling added
  in #3313 for the forward-start path.
- Gateway-fail hint now points at the host-side log paths the wizard
  already resolves in onboard/sandbox-create-failure.ts.

Fixes #3563

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 15, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 999fbf32-f13e-4b1f-a996-03d2ad956255

📥 Commits

Reviewing files that changed from the base of the PR and between ce7d36a and 067ee12.

📒 Files selected for processing (1)
  • src/lib/onboard.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard.ts

📝 Walkthrough

Walkthrough

Post-deployment verification is made async with configurable retry delays and injectable sleep; gateway/dashboard probes are wrapped with retry logic and gateway failure hints now point to in-sandbox and host logs. Onboarding calls processRecovery before verification and now awaits verifyDeployment. Tests updated to async and cover retry scenarios.

Changes

Post-Onboarding Deployment Verification with Retry

Layer / File(s) Summary
Retry configuration and diagnostic helpers
src/lib/verify-deployment.ts
VerifyDeploymentOptions adds retryDelaysMs and sleep; default retry delays and buildGatewayLogHint(sandboxName) introduced for structured gateway failure hints.
Retry-wrapped gateway and dashboard probing
src/lib/verify-deployment.ts
Gateway and dashboard probes refactored into single-attempt helpers and async retry wrappers (verifyGatewayInSandbox, verifyDashboardFromHost) that honor configured delays and injected sleep.
Async verification entry point
src/lib/verify-deployment.ts
verifyDeployment is now async, accepts optional VerifyDeploymentOptions, derives retry behavior, awaits retry-enabled probes, and uses the new gateway log hint in diagnostics.
Onboarding integration and sandbox exec helper
src/lib/onboard.ts, src/lib/onboard/sandbox-verification-exec.ts
Onboarding imports the shared executeSandboxCommandForVerification, calls processRecovery.checkAndRecoverSandboxProcesses(...) before verification, and awaits verifyDeployment(...). New sandbox-verification-exec.ts provides the spawnSync-based exec helper.
Test suite updates and retry tests
src/lib/verify-deployment.test.ts
Tests updated to async/await and pass NO_RETRY where appropriate; diagnostic expectations adjusted; new tests cover retry-based recovery and exhaustion scenarios.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

  • #3563, #3569, #3573

Suggested labels

NemoClaw CLI, Sandbox, OpenShell, enhancement: testing

Suggested reviewers

  • jyaunches
  • ericksoa
  • prekshivyas

Poem

🐰 I hopped through retries, soft and spry,
Awaited probes beneath the sky,
Logs now point where answers hide,
Onboard waits, no false alarms abide,
A little rabbit winks—deploys pass by!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 45.45% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly summarizes the main changes: adding retry logic to gateway and dashboard probes and correcting the gateway-log hint.
Linked Issues check ✅ Passed The PR addresses all three linked issues by implementing retry logic with backoff for gateway/dashboard probes, correcting the gateway-log hint to reference both in-sandbox and host logs, and adding sandbox process recovery before verification.
Out of Scope Changes check ✅ Passed All changes directly address the linked issues: async/await conversion for retry support, backoff delays, process recovery integration, hint correction, and corresponding test updates—no out-of-scope modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/post-onboard-verify-retry

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: cloud-onboard-e2e, device-auth-health-e2e
Optional E2E: sandbox-survival-e2e, diagnostics-e2e, cloud-e2e

Dispatch hint: cloud-onboard-e2e,device-auth-health-e2e

Auto-dispatched E2E: cloud-onboard-e2e, device-auth-health-e2e via nightly-e2e.yaml at 067ee121caf13a4cf7e8d1373ef68a8eb162502enightly run

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • cloud-onboard-e2e (medium (~45 min)): Exercises the non-interactive cloud onboard path end-to-end with policy presets, sandbox creation, final deployment verification, sandbox health checks, security checks, and inference.local. This is the closest required coverage for the changed post-policy process recovery plus async verifyDeployment path.
  • device-auth-health-e2e (medium (~30 min)): Specifically validates onboard with device auth enabled, gateway health detection, host dashboard port-forward liveness, process recovery after a gateway kill, and that onboard logs include deployment verification output. This targets the verifyDeployment health/diagnostic behavior changed here.

Optional E2E

  • sandbox-survival-e2e (medium (~30 min)): Useful adjacent confidence for sandbox survival across gateway restarts and inference after gateway stop/start. The PR touches process recovery before verification, but this is broader than the final onboard verification regression risk.
  • diagnostics-e2e (medium (~45 min)): Optional because verifyDeployment diagnostics and gateway log hints changed. This job checks broader diagnostic/debug behavior and credential-safe artifacts, but does not appear to directly assert the new verifyDeployment hint text.
  • cloud-e2e (medium (~45 min)): Broad full user journey coverage for install, onboard, sandbox verification, live inference, and CLI operations. Valuable if maintainers want maximum confidence beyond the more targeted cloud-onboard/device-auth-health runs.

New E2E recommendations

  • post-policy deployment verification retry (medium): Existing E2Es cover successful onboard and device-auth health, but there is no clearly named live E2E that deliberately delays/lazily starts the gateway or dashboard forward after policy-apply restart and asserts verifyDeployment retries instead of emitting a false gateway/dashboard failure.
    • Suggested test: Add or extend an E2E to simulate delayed gateway/forward readiness after policy application and assert onboard eventually prints successful deployment verification.
  • verifyDeployment diagnostic hints (low): The new gateway failure hint references both nemoclaw <sandbox> logs and host OpenShell gateway log paths. Unit tests cover the string, but no E2E appears to force a final verification failure and assert the user-facing diagnostic guidance in a real onboard log.
    • Suggested test: Add an E2E negative-path check that forces gateway verification failure after onboard and validates the final diagnostic includes both in-sandbox and host-side log guidance without leaking secrets.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: cloud-onboard-e2e,device-auth-health-e2e

Address Codex feedback: the gateway probe runs inside the sandbox and
hits the OpenClaw gateway, which writes to /tmp/gateway.log via
agent/runtime.ts:133. The previous revision of this fix replaced the
old "check /tmp/gateway.log inside the sandbox" hint with host paths
only — accurate for createSandbox-never-came-up cases but wrong for the
common case where the in-sandbox gateway is the one that failed.

Surface both layers now: tell users to inspect the in-sandbox log via
`nemoclaw <name> logs` (the documented CLI built around
buildSandboxOpenclawGatewayLogsArgs) and fall back to the host-side
OpenShell gateway log when the sandbox itself never came up. Update
the regression test to pin both references and to forbid the bare
old-style "Check /tmp/gateway.log inside the sandbox" wording.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@laitingsheng laitingsheng marked this pull request as ready for review May 15, 2026 11:23
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25915249599
Target ref: feaff3a2ebc0edfd8a92c1c78cd2589e9c540281
Workflow ref: main
Requested jobs: cloud-onboard-e2e,device-auth-health-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ✅ success
device-auth-health-e2e ✅ success

@laitingsheng laitingsheng marked this pull request as draft May 15, 2026 13:24
@laitingsheng laitingsheng marked this pull request as ready for review May 15, 2026 14:44
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25924157026
Target ref: ce7d36a845fe054f64078f25d9c9296531e62073
Workflow ref: main
Requested jobs: cloud-e2e,cloud-onboard-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
cloud-e2e ✅ success
cloud-onboard-e2e ✅ success

@cv cv enabled auto-merge (squash) May 15, 2026 15:18
@cv cv merged commit ecd2708 into main May 15, 2026
23 checks passed
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25925777160
Target ref: 067ee121caf13a4cf7e8d1373ef68a8eb162502e
Workflow ref: main
Requested jobs: cloud-onboard-e2e,device-auth-health-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ✅ success
device-auth-health-e2e ✅ success

@wscurran wscurran added bug-fix PR fixes a bug or regression and removed fix labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression

Projects

None yet

3 participants