fix(onboard): warm up scope upgrade so first agent run avoids fallback#5323
Conversation
NVIDIA#4504 was reopened: on a fresh onboard the device is auto-paired with only `operator.pairing`, but the first `openclaw agent --agent main` run immediately requests `operator.write`. That upgrade is not pre-approved, so the gateway rejects with `1008 pairing required` and the agent silently falls back to embedded mode (exit 0). The connect-time approval pass added earlier cannot help the first run because the upgrade request does not exist until that run happens — after finalization's approval sweep already found nothing pending. OpenClaw's `devices` CLI only exposes list/get/approve, so there is no way to pre-grant a scope the device has not yet requested. Instead, warm up the upgrade during finalization: run a bounded throwaway in-sandbox `openclaw agent --agent main -m "ping"` to provoke the exact `operator.write` request, poll `devices list` until that allowlisted upgrade is pending, then let the existing approval pass clear it — so `operator.write` is persisted before handoff and the user's first real run connects via the gateway with no fallback. - New best-effort leaf `auto-pair-warmup.ts` (`runSandboxScopeWarmupRun`, 30s outer cap, ~10s bounded pending-upgrade poll, swallows all errors). - New `warmupScopeUpgrade` finalization dep, run after process recovery and before the approval pass (order is load-bearing: provoke -> approve). - Builds on the merged NVIDIA#4763 approval-pass/recover wiring; degrades cleanly to that behavior if the gateway never registers the upgrade. Closes NVIDIA#4504 Signed-off-by: Tony Luo <xialuo@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
💤 Files with no reviewable changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds a best-effort sandbox scope-upgrade warm-up that runs an in‑sandbox probe and polls device scopes before finalization scope approval; wires the helper into finalization deps and adds unit and integration tests for behavior and ordering. ChangesSandbox scope-upgrade warm-up integration
Sequence Diagram(s)sequenceDiagram
participant Host
participant OpenShell
participant Sandbox
participant OpenClaw
participant PythonParser
Host->>OpenShell: sandbox exec "wrapped warm-up script"
OpenShell->>Sandbox: run wrapped script
Sandbox->>OpenClaw: openclaw agent --agent main -m "ping"
OpenClaw-->>Sandbox: agent response
loop poll attempts
Sandbox->>OpenClaw: openclaw devices list --json (timeout)
OpenClaw-->>Sandbox: JSON devices
Sandbox->>PythonParser: parse JSON for operator.write in scopes/requestedScopes
PythonParser-->>Sandbox: found? true/false
end
Sandbox-->>Host: exit 0 (best-effort, errors swallowed)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/actions/sandbox/auto-pair-warmup.ts`:
- Around line 132-136: The current call to getOpenshellBinary() is unsafe
because that helper calls process.exit on failure and will abort the whole
process; instead avoid invoking getOpenshellBinary() here and invoke the CLI in
a non-terminating way so ENOENT/timeouts remain catchable by the surrounding
try/catch. Concretely, replace the direct getOpenshellBinary() call used with
spawnSync in auto-pair-warmup.ts and call the openshell binary by name (e.g.,
'openshell') or use a local non-fatal resolver so that spawnSync('openshell',
...) runs and any ENOENT/timeout errors are thrown into the existing try/catch;
keep using spawnSync and preserve the current swallow-and-degrade behavior
rather than calling the terminating getOpenshellBinary().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 016b820c-be47-48f1-987a-4cbe0aaa3ea7
📒 Files selected for processing (6)
src/lib/actions/sandbox/auto-pair-warmup.test.tssrc/lib/actions/sandbox/auto-pair-warmup.tssrc/lib/onboard/finalization-deps.tssrc/lib/onboard/machine/handlers/finalization.test.tssrc/lib/onboard/machine/handlers/finalization.tstest/helpers/onboard-final-flow-phases.ts
|
@coderabbitai full review |
✅ Action performedFull review finished. |
runSandboxScopeWarmupRun resolved the CLI via getOpenshellBinary(), which process.exit(1)s when OpenShell cannot be found. That fail-fast escapes the function's try/catch and would turn the best-effort warm-up into a hard onboard-finalization exit. Resolve with resolveOpenshell() (returns null) and no-op when the CLI is absent, preserving the never-throw/never-block contract. Signed-off-by: Tony Luo <xialuo@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@coderabbitai addressed ( |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/lib/actions/sandbox/auto-pair-warmup.ts (1)
132-136:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't resolve OpenShell through the fatal helper in this best-effort path.
Line 136 still goes through
getOpenshellBinary(), and that helper exits the whole process when resolution fails. That bypasses thetry/catchhere, so a missing/corrupt OpenShell install turns this documented swallow-and-degrade warm-up into a hard onboard abort.Minimal fix
- const { getOpenshellBinary } = - require("../../adapters/openshell/runtime") as typeof import("../../adapters/openshell/runtime"); try { spawnSync( - getOpenshellBinary(), + "openshell", [🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/actions/sandbox/auto-pair-warmup.ts` around lines 132 - 136, The warm-up currently calls getOpenshellBinary() which uses the fatal helper and can exit the process; change this to a non-fatal resolution so the warm-up can silently degrade: add (or use) a safe resolver like tryGetOpenshellBinary in the "../../adapters/openshell/runtime" module that returns the binary path or null/undefined without calling fatal, then replace the direct getOpenshellBinary() call in auto-pair-warmup.ts with tryGetOpenshellBinary() and only call spawnSync when a path is returned (log a debug message and skip spawnSync when null).
🧹 Nitpick comments (1)
src/lib/actions/sandbox/auto-pair-warmup.test.ts (1)
63-72: ⚡ Quick winKeep this unit case fully mocked/in-process.
Line 71 spawns a real
shprocess and depends on host shell utilities, so this test is no longer isolated to the TS contract. Please either mock the execution boundary here or move this round-trip check to the existing integration harness the file comment already references.As per coding guidelines, "Mock external dependencies in unit tests; do not call real NVIDIA APIs in unit tests."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/actions/sandbox/auto-pair-warmup.test.ts` around lines 63 - 72, The test currently calls the real shell via spawnSync which makes the unit test non-isolated; change the test to mock the execution boundary instead of invoking the host shell: in this test replace direct use of child_process.spawnSync (or the top-level spawnSync import) with a Jest/ Sinon stub that returns a simulated { status: 0, stdout: "...", stderr: "" } for the wrapped script, and keep the same assertions against result.status and any stdout/stderr expectations; alternatively, move this “round-trip” scenario into the integration harness referenced in the file comment if you need to exercise a real shell. Ensure you reference wrapSandboxShellScript and stub/mock spawnSync consistently so the test remains fully in-process and isolated.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/actions/sandbox/auto-pair-warmup.ts`:
- Around line 49-59: The poll retry budget can consume the entire outer
WARMUP_TIMEOUT_MS because WARMUP_POLL_ATTEMPTS (10) ×
(WARMUP_POLL_LIST_TIMEOUT_S (2s) + 1s sleep) ≈ 30s, leaving no headroom for
startup or the provoke agent run; update the logic so the worst-case poll
duration stays safely below WARMUP_TIMEOUT_MS by either reducing
WARMUP_POLL_ATTEMPTS or making it derived from WARMUP_TIMEOUT_MS and
WARMUP_POLL_LIST_TIMEOUT_S (e.g. attempts = Math.floor((WARMUP_TIMEOUT_MS -
HEADROOM_MS) / ((WARMUP_POLL_LIST_TIMEOUT_S*1000) + SLEEP_MS))), or stop doing
the unconditional 1s sleep after the final attempt; adjust the constants or loop
in auto-pair-warmup.ts (referencing WARMUP_POLL_ATTEMPTS,
WARMUP_POLL_LIST_TIMEOUT_S, WARMUP_TIMEOUT_MS and the sleep between retries) so
the computed worst-case poll time plus headroom is always < WARMUP_TIMEOUT_MS.
---
Duplicate comments:
In `@src/lib/actions/sandbox/auto-pair-warmup.ts`:
- Around line 132-136: The warm-up currently calls getOpenshellBinary() which
uses the fatal helper and can exit the process; change this to a non-fatal
resolution so the warm-up can silently degrade: add (or use) a safe resolver
like tryGetOpenshellBinary in the "../../adapters/openshell/runtime" module that
returns the binary path or null/undefined without calling fatal, then replace
the direct getOpenshellBinary() call in auto-pair-warmup.ts with
tryGetOpenshellBinary() and only call spawnSync when a path is returned (log a
debug message and skip spawnSync when null).
---
Nitpick comments:
In `@src/lib/actions/sandbox/auto-pair-warmup.test.ts`:
- Around line 63-72: The test currently calls the real shell via spawnSync which
makes the unit test non-isolated; change the test to mock the execution boundary
instead of invoking the host shell: in this test replace direct use of
child_process.spawnSync (or the top-level spawnSync import) with a Jest/ Sinon
stub that returns a simulated { status: 0, stdout: "...", stderr: "" } for the
wrapped script, and keep the same assertions against result.status and any
stdout/stderr expectations; alternatively, move this “round-trip” scenario into
the integration harness referenced in the file comment if you need to exercise a
real shell. Ensure you reference wrapSandboxShellScript and stub/mock spawnSync
consistently so the test remains fully in-process and isolated.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4bb07985-601c-47da-8890-2004f65badf9
📒 Files selected for processing (6)
src/lib/actions/sandbox/auto-pair-warmup.test.tssrc/lib/actions/sandbox/auto-pair-warmup.tssrc/lib/onboard/finalization-deps.tssrc/lib/onboard/machine/handlers/finalization.test.tssrc/lib/onboard/machine/handlers/finalization.tstest/helpers/onboard-final-flow-phases.ts
|
Tip For best results, initiate chat on the files or code changes.
🐇 ✨ |
The pending-upgrade poll used WARMUP_POLL_ATTEMPTS=10 with a 2s list timeout and an unconditional 1s sleep per iteration, so the poll alone could consume ~30s — the entire WARMUP_TIMEOUT_MS — leaving no headroom for shell startup or the throwaway agent run, exactly when devices list is slow. Drop to 5 attempts and skip the trailing sleep (worst case 5x2 + 4x1 = 14s), comfortably under the 30s outer cap. The comment now states the real worst case. Signed-off-by: Tony Luo <xialuo@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@coderabbitai addressed both findings:
|
|
@coderabbitai full review |
✅ Action performedFull review finished. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/actions/sandbox/auto-pair-warmup.ts`:
- Around line 77-118: The poll loop currently only returns success when a
pending device request includes 'operator.write', which delays re-onboard when
the scope is already granted; update the Python probe (the OPENCLAW 'devices
list --json' handler) to treat already-granted scopes as terminal success too by
inspecting the response for granted/active devices and their scopes in addition
to 'pending' (check keys like 'granted' or device entries under the top-level
response and examine 'scopes'/'requestedScopes' for 'operator.write'), and
short-circuit to sys.exit(0) when any device already has 'operator.write' (also
treat empty/no pending upgrade as success when no upgrade is needed); ensure the
check still falls back to the existing pending-scan behavior if no
granted/active entries are present.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2ea9d4bb-26ee-4df4-ab8d-c195fd1d7e3b
📒 Files selected for processing (6)
src/lib/actions/sandbox/auto-pair-warmup.test.tssrc/lib/actions/sandbox/auto-pair-warmup.tssrc/lib/onboard/finalization-deps.tssrc/lib/onboard/machine/handlers/finalization.test.tssrc/lib/onboard/machine/handlers/finalization.tstest/helpers/onboard-final-flow-phases.ts
|
Tip For best results, initiate chat on the files or code changes.
The never-throw / never-block-finalization contract is clean on both fronts. Thanks for the thorough follow-through! 🐇 ✨ |
…nted The pending-upgrade poll only succeeded on a PENDING operator.write request. On a re-onboard where operator.write is already granted there is no new pending request, so the poll burned its full budget before finalization continued, contradicting the idempotent no-op contract. Scan every device collection the devices-list response exposes (not just pending) and treat operator.write present anywhere — pending or already granted — as terminal success, so the already-paired path returns immediately. Signed-off-by: Tony Luo <xialuo@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@coderabbitai addressed ( On the regression test: the warm-up leaf's spawn path can't be exercised in-process (the lazy |
|
@coderabbitai full review |
✅ Action performedFull review finished. |
|
Tip For best results, initiate chat on the files or code changes.
The warm-up contract — best-effort, never-throw, idempotent, bounded — holds cleanly across all the cases now. Nice work threading this through. 🐇 ✨ |
|
@coderabbitai review |
✅ Action performedReview finished.
|
|
✨ Related open issues: |
Summary
Reopened #4504. The earlier fix (#4763) corrected the approval-pass mechanics (timeout, full env-triplet strip,
recover/finalization wiring) but the reporter verified on v0.0.63 that the very firstopenclaw agentrun after a fresh onboard still silently falls back to embedded mode (exit 0).Root cause (verified against the reporter's gateway log)
On a fresh onboard the device is auto-paired with only
operator.pairing. The firstopenclaw agent --agent mainrun immediately requestsoperator.write, which is not pre-approved → gateway rejects with1008 pairing required→ silent embedded fallback. The connect-time approval pass cannot help the first run: the upgrade request does not exist until that run happens, after finalization's approval sweep already found nothing pending. OpenClaw'sdevicesCLI exposes onlylist/get/approve, so there is no way to pre-grant a scope the device has not yet requested.Fix — warm up the upgrade during finalization
Run a bounded, best-effort throwaway in-sandbox
openclaw agent --agent main -m "ping"during onboard finalization to provoke the exactoperator.writescope-upgrade request, polldevices listuntil that allowlisted upgrade is pending, then let the existing approval pass clear it.operator.writeis persisted before handoff, so the user's first real run connects via the gateway with no fallback.src/lib/actions/sandbox/auto-pair-warmup.ts(runSandboxScopeWarmupRun): 30s outerspawnSynccap, ~10s bounded pending-upgrade poll, all errors swallowed (never throws / never blocks onboard).warmupScopeUpgradefinalization dep, called after process recovery and before the existing approval pass — order is load-bearing (provoke → approve).recoverwiring; if the gateway never registers the upgrade within the budget it degrades cleanly to that behavior (one fallback, thenrecoverfixes it) — no regression.Testing
npx vitest --project clion affected files: 25 pass — finalization ordering (recover → warmup → approval), best-effort/non-blocking, agent-agnostic; broad regression 1659 pass on the v2 base.npm run typecheck:cliclean for the changed files;npm run build:cliclean.src/lib/onboard.tsunchanged (dep flows through the existingfinalizationHandlerDepsspread) →codebase-growth-guardrailsneutral.Note
The long-term home for this is OpenClaw pre-approving the full operator scope set at pairing time; this warm-up is the NemoClaw-side mitigation until then.
🤖 Generated with Claude Code
Signed-off-by: Tony Luo xialuo@nvidia.com
Summary by CodeRabbit
New Features
Tests