Summary
Extract a rebuild-specific sandbox recreate path from the monolithic onboard() function, and introduce a canonical credential resolution pattern to eliminate an entire class of bugs where credentials are resolved inconsistently.
This issue captures architectural improvements identified during the investigation of #2273 (non-atomic rebuild). The immediate bug is being fixed in a separate PR, but the underlying architecture that made the bug possible — and will produce similar bugs in the future — needs structural attention.
Context: What #2273 Revealed
While investigating why nemoclaw rebuild destroys a sandbox and then fails to recreate it (leaving the user with no working sandbox), a deep audit of the rebuild → onboard → credential resolution chain uncovered five systemic issues, two of which require architectural work beyond a point fix:
The Credential Resolution Inconsistency
The codebase has two competing patterns for resolving provider credentials:
| Pattern |
Location |
Checks process.env? |
Checks credentials.json? |
getCredential(key) |
src/lib/credentials.ts |
✅ |
✅ |
process.env[key] (direct) |
src/lib/onboard.ts non-interactive path |
✅ |
❌ |
A bridge function hydrateCredentialEnv() exists at line 670 of onboard.ts to unify these — it calls getCredential() and populates process.env. It's used correctly in some paths (resume path at line 6580, setupInference() at line 4811) but bypassed in the non-interactive provider selection (lines 4238, 4274).
This means the same credential-not-found bug exists for every remote provider (NVIDIA, OpenAI, Anthropic, Gemini, custom, Anthropic-compatible) — not just the NVIDIA endpoint reported in #2273. The immediate fix addresses all six, but there's no structural guard preventing a seventh provider from being added with the same mistake.
The Monolith-as-Subroutine Problem
onboard() is a 476-line orchestrator inside a 6,906-line file (src/lib/onboard.ts). It was designed as a standalone, user-facing CLI flow with its own:
- Lock management (
acquireOnboardLock / releaseOnboardLock)
- Session state machine (8 steps with
pending → in_progress → complete/failed transitions)
- 87
process.exit() calls for error handling
- Preflight validation (Docker, OpenShell, usage notice consent)
But sandboxRebuild() in src/nemoclaw.ts calls it as a subroutine:
await onboard({ resume: true, nonInteractive: true, recreateSandbox: true, agent: rebuildAgent });
This creates severe impedance mismatch:
- Rebuild can't catch failures: Any of onboard's 87
process.exit() calls kills the entire process — rebuild has no opportunity to rollback the sandbox destruction.
- Rebuild manipulates onboard's internals: Before calling onboard, rebuild writes directly to the session file to set
sandboxName, resumable, status, and agent — reverse-engineering onboard's step machine from the outside.
- Session is a global singleton:
~/.nemoclaw/onboard-session.json is shared across all sandboxes. Rebuild assumes the session belongs to the sandbox being rebuilt, but it may have been overwritten by a subsequent onboard of a different sandbox.
- Rebuild validates things it doesn't need: The onboard flow checks usage notice consent, runs preflight (Docker availability, OpenShell version), and manages the gateway — none of which rebuild needs since the gateway is already running.
This fragile coupling has already produced bugs:
Proposed Work
Part A: Extract Rebuild-Specific Recreate Path
Goal: Replace the onboard() call in sandboxRebuild() with a focused function that does only what rebuild needs.
The subset rebuild actually requires from onboard:
- Hydrate credentials from saved storage
- Create the sandbox via OpenShell
- Configure the inference provider and model
- Run agent-specific setup (e.g.,
openclaw doctor --fix)
What rebuild does NOT need:
- Lock management (rebuild has its own flow control)
- Session state persistence (rebuild manages its own backup/restore lifecycle)
- Preflight checks (gateway is already running, Docker is already available)
- Usage notice consent (already accepted)
- Provider selection UI (provider is known from the registry/session)
Approach:
- Extract the sandbox-creation, inference-configuration, and agent-setup logic from
onboard() into composable functions that throw on error instead of calling process.exit().
- Create a
recreateSandbox() function (or similar) that composes these primitives and is designed to be called from rebuild.
- The new function should accept explicit parameters (sandbox name, provider, model, credential env) rather than reading them from the global session file.
sandboxRebuild() calls this new function wrapped in try/catch, enabling rollback on failure.
Impact: Eliminates the entire class of bugs caused by:
process.exit() in library code called as a subroutine
- Implicit session state coupling between rebuild and onboard
- Global session file as a communication channel
Part B: Canonical Credential Resolution + Lint Guard
Goal: Make it structurally impossible to introduce a new credential check that only reads process.env without also checking credentials.json.
Approach:
- Introduce
resolveProviderCredential(envName) as the single canonical entry point for all provider credential resolution. It calls getCredential() internally and optionally hydrates process.env.
- Add an ESLint rule (or a custom lint check in the pre-commit hooks) that flags direct
process.env.NVIDIA_API_KEY, process.env.OPENAI_API_KEY, process.env[credentialEnv], etc. in onboard.ts provider-selection contexts. The rule should suggest using resolveProviderCredential() instead.
- Add a parametric test that iterates over every entry in
REMOTE_PROVIDER_CONFIG and verifies that the non-interactive credential resolution path for each provider uses getCredential() (or the canonical resolver) rather than process.env directly.
Impact: Prevents regression. When a new provider is added to REMOTE_PROVIDER_CONFIG, the parametric test automatically covers it. The lint rule catches the mistake at development time.
Why This Reduces Fragility
The current architecture has three fragility vectors that this work addresses:
-
Error handling fragility: process.exit() in library code means callers can't compose safely. Extracting throw-based functions lets rebuild (and future callers like upgrade-sandboxes) handle errors, rollback, retry, or report — rather than dying silently.
-
State coupling fragility: Rebuild communicates with onboard through a mutable global file, relying on onboard's internal step-machine being in the right state. Explicit parameter passing eliminates this implicit contract.
-
Credential consistency fragility: Two resolution patterns with no enforcement means every new provider or code path is an opportunity to reintroduce the bug. A canonical function + lint rule + parametric test makes the correct pattern the path of least resistance.
Relationship to #2273
The immediate fix for #2273 (in a separate PR) addresses:
- Using
getCredential() / hydrateCredentialEnv() in the non-interactive provider selection for all providers
- Adding preflight validation to rebuild (check credentials before destroying)
- Adding rollback on recreate failure (restore from backup if onboard fails)
This issue captures the deeper structural work that prevents the entire class of bugs from recurring.
Acceptance Criteria
Summary
Extract a rebuild-specific sandbox recreate path from the monolithic
onboard()function, and introduce a canonical credential resolution pattern to eliminate an entire class of bugs where credentials are resolved inconsistently.This issue captures architectural improvements identified during the investigation of #2273 (non-atomic rebuild). The immediate bug is being fixed in a separate PR, but the underlying architecture that made the bug possible — and will produce similar bugs in the future — needs structural attention.
Context: What #2273 Revealed
While investigating why
nemoclaw rebuilddestroys a sandbox and then fails to recreate it (leaving the user with no working sandbox), a deep audit of the rebuild → onboard → credential resolution chain uncovered five systemic issues, two of which require architectural work beyond a point fix:The Credential Resolution Inconsistency
The codebase has two competing patterns for resolving provider credentials:
process.env?credentials.json?getCredential(key)src/lib/credentials.tsprocess.env[key](direct)src/lib/onboard.tsnon-interactive pathA bridge function
hydrateCredentialEnv()exists at line 670 ofonboard.tsto unify these — it callsgetCredential()and populatesprocess.env. It's used correctly in some paths (resume path at line 6580,setupInference()at line 4811) but bypassed in the non-interactive provider selection (lines 4238, 4274).This means the same credential-not-found bug exists for every remote provider (NVIDIA, OpenAI, Anthropic, Gemini, custom, Anthropic-compatible) — not just the NVIDIA endpoint reported in #2273. The immediate fix addresses all six, but there's no structural guard preventing a seventh provider from being added with the same mistake.
The Monolith-as-Subroutine Problem
onboard()is a 476-line orchestrator inside a 6,906-line file (src/lib/onboard.ts). It was designed as a standalone, user-facing CLI flow with its own:acquireOnboardLock/releaseOnboardLock)pending→in_progress→complete/failedtransitions)process.exit()calls for error handlingBut
sandboxRebuild()insrc/nemoclaw.tscalls it as a subroutine:This creates severe impedance mismatch:
process.exit()calls kills the entire process — rebuild has no opportunity to rollback the sandbox destruction.sandboxName,resumable,status, andagent— reverse-engineering onboard's step machine from the outside.~/.nemoclaw/onboard-session.jsonis shared across all sandboxes. Rebuild assumes the session belongs to the sandbox being rebuilt, but it may have been overwritten by a subsequent onboard of a different sandbox.This fragile coupling has already produced bugs:
nemoclaw rebuildbuilds the wrong sandbox type and deletes existing sandbox! #2201: Rebuild built the wrong sandbox type because it picked up a staleagentfield from a different sandbox's session.process.envinstead of saved credentials, and rebuild couldn't catch the resultingprocess.exit(1)to rollback.Proposed Work
Part A: Extract Rebuild-Specific Recreate Path
Goal: Replace the
onboard()call insandboxRebuild()with a focused function that does only what rebuild needs.The subset rebuild actually requires from onboard:
openclaw doctor --fix)What rebuild does NOT need:
Approach:
onboard()into composable functions that throw on error instead of callingprocess.exit().recreateSandbox()function (or similar) that composes these primitives and is designed to be called from rebuild.sandboxRebuild()calls this new function wrapped in try/catch, enabling rollback on failure.Impact: Eliminates the entire class of bugs caused by:
process.exit()in library code called as a subroutinePart B: Canonical Credential Resolution + Lint Guard
Goal: Make it structurally impossible to introduce a new credential check that only reads
process.envwithout also checkingcredentials.json.Approach:
resolveProviderCredential(envName)as the single canonical entry point for all provider credential resolution. It callsgetCredential()internally and optionally hydratesprocess.env.process.env.NVIDIA_API_KEY,process.env.OPENAI_API_KEY,process.env[credentialEnv], etc. inonboard.tsprovider-selection contexts. The rule should suggest usingresolveProviderCredential()instead.REMOTE_PROVIDER_CONFIGand verifies that the non-interactive credential resolution path for each provider usesgetCredential()(or the canonical resolver) rather thanprocess.envdirectly.Impact: Prevents regression. When a new provider is added to
REMOTE_PROVIDER_CONFIG, the parametric test automatically covers it. The lint rule catches the mistake at development time.Why This Reduces Fragility
The current architecture has three fragility vectors that this work addresses:
Error handling fragility:
process.exit()in library code means callers can't compose safely. Extracting throw-based functions lets rebuild (and future callers likeupgrade-sandboxes) handle errors, rollback, retry, or report — rather than dying silently.State coupling fragility: Rebuild communicates with onboard through a mutable global file, relying on onboard's internal step-machine being in the right state. Explicit parameter passing eliminates this implicit contract.
Credential consistency fragility: Two resolution patterns with no enforcement means every new provider or code path is an opportunity to reintroduce the bug. A canonical function + lint rule + parametric test makes the correct pattern the path of least resistance.
Relationship to #2273
The immediate fix for #2273 (in a separate PR) addresses:
getCredential()/hydrateCredentialEnv()in the non-interactive provider selection for all providersThis issue captures the deeper structural work that prevents the entire class of bugs from recurring.
Acceptance Criteria
sandboxRebuild()no longer callsonboard()directly — it uses a focused recreate function that throws on errorresolveProviderCredential()(or equivalent) is the single canonical path for provider credential resolution inonboard.tsprocess.envaccess for provider credential keys in provider-selection contextsREMOTE_PROVIDER_CONFIGto verify consistent credential resolutionupgrade-sandboxes(which also callssandboxRebuild) benefits from the same improvements