fix(verify): retry gateway and dashboard probes; correct gateway-log hint#3576
Conversation
…hint Post-onboard verification raced the createSandbox return, so on slower hosts the gateway and dashboard probes printed scary "Deployment verification found issues" warnings even though the sandbox came up a few seconds later and `nemoclaw status` reported healthy. The follow-up hint also pointed users at /tmp/gateway.log inside the sandbox — that file is never written; the gateway runs on the host. - verifyDeployment is now async and retries the two blocking probes with a 1/2/5/7/10 s backoff, mirroring the TOCTOU handling added in #3313 for the forward-start path. - Gateway-fail hint now points at the host-side log paths the wizard already resolves in onboard/sandbox-create-failure.ts. Fixes #3563 Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughPost-deployment verification is made async with configurable retry delays and injectable sleep; gateway/dashboard probes are wrapped with retry logic and gateway failure hints now point to in-sandbox and host logs. Onboarding calls processRecovery before verification and now awaits verifyDeployment. Tests updated to async and cover retry scenarios. ChangesPost-Onboarding Deployment Verification with Retry
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
Address Codex feedback: the gateway probe runs inside the sandbox and hits the OpenClaw gateway, which writes to /tmp/gateway.log via agent/runtime.ts:133. The previous revision of this fix replaced the old "check /tmp/gateway.log inside the sandbox" hint with host paths only — accurate for createSandbox-never-came-up cases but wrong for the common case where the in-sandbox gateway is the one that failed. Surface both layers now: tell users to inspect the in-sandbox log via `nemoclaw <name> logs` (the documented CLI built around buildSandboxOpenclawGatewayLogsArgs) and fall back to the host-side OpenShell gateway log when the sandbox itself never came up. Update the regression test to pin both references and to forbid the bare old-style "Check /tmp/gateway.log inside the sandbox" wording. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 25915249599
|
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…board/ Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 25924157026
|
Selective E2E Results — ✅ All requested jobs passedRun: 25925777160
|
Summary
Post-onboard verification was surfacing scary "Deployment verification found issues" warnings in two distinct cases: (1) on slower hosts the gateway and host port forward had not finished coming up before the probes fired, and (2) the step [8/8] policy-apply step restarts the sandbox container but leaves gateway start to the next
nemoclaw connect, so the in-image verify never sees a running gateway. The gateway-failure hint also pointed users at the wrong log location, compounding the confusion.Related Issue
Fixes #3563
Fixes #3569
Fixes #3573
Changes
verifyDeploymentasync and retry the two blocking probes (gateway + dashboard) with a 1/2/5/7/10 s backoff before declaring failure.checkAndRecoverSandboxProcessesafter policy-apply and beforeverifyDeploymentinonboard.tsso the post-policy sandbox restart never leaves the gateway down by the time the verify probes run.nemoclaw <name> logsand the host-side OpenShell gateway log path.onboard.tstoawaitthe new signature.Type of Change
Verification
Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
New Features
Improvements
Tests