Skip to content

Architectural improvement: Extract rebuild recreate path from onboard monolith and canonicalize credential resolution #2306

@jyaunches

Description

@jyaunches

Summary

Extract a rebuild-specific sandbox recreate path from the monolithic onboard() function, and introduce a canonical credential resolution pattern to eliminate an entire class of bugs where credentials are resolved inconsistently.

This issue captures architectural improvements identified during the investigation of #2273 (non-atomic rebuild). The immediate bug is being fixed in a separate PR, but the underlying architecture that made the bug possible — and will produce similar bugs in the future — needs structural attention.

Context: What #2273 Revealed

While investigating why nemoclaw rebuild destroys a sandbox and then fails to recreate it (leaving the user with no working sandbox), a deep audit of the rebuild → onboard → credential resolution chain uncovered five systemic issues, two of which require architectural work beyond a point fix:

The Credential Resolution Inconsistency

The codebase has two competing patterns for resolving provider credentials:

Pattern Location Checks process.env? Checks credentials.json?
getCredential(key) src/lib/credentials.ts
process.env[key] (direct) src/lib/onboard.ts non-interactive path

A bridge function hydrateCredentialEnv() exists at line 670 of onboard.ts to unify these — it calls getCredential() and populates process.env. It's used correctly in some paths (resume path at line 6580, setupInference() at line 4811) but bypassed in the non-interactive provider selection (lines 4238, 4274).

This means the same credential-not-found bug exists for every remote provider (NVIDIA, OpenAI, Anthropic, Gemini, custom, Anthropic-compatible) — not just the NVIDIA endpoint reported in #2273. The immediate fix addresses all six, but there's no structural guard preventing a seventh provider from being added with the same mistake.

The Monolith-as-Subroutine Problem

onboard() is a 476-line orchestrator inside a 6,906-line file (src/lib/onboard.ts). It was designed as a standalone, user-facing CLI flow with its own:

  • Lock management (acquireOnboardLock / releaseOnboardLock)
  • Session state machine (8 steps with pendingin_progresscomplete/failed transitions)
  • 87 process.exit() calls for error handling
  • Preflight validation (Docker, OpenShell, usage notice consent)

But sandboxRebuild() in src/nemoclaw.ts calls it as a subroutine:

await onboard({ resume: true, nonInteractive: true, recreateSandbox: true, agent: rebuildAgent });

This creates severe impedance mismatch:

  • Rebuild can't catch failures: Any of onboard's 87 process.exit() calls kills the entire process — rebuild has no opportunity to rollback the sandbox destruction.
  • Rebuild manipulates onboard's internals: Before calling onboard, rebuild writes directly to the session file to set sandboxName, resumable, status, and agent — reverse-engineering onboard's step machine from the outside.
  • Session is a global singleton: ~/.nemoclaw/onboard-session.json is shared across all sandboxes. Rebuild assumes the session belongs to the sandbox being rebuilt, but it may have been overwritten by a subsequent onboard of a different sandbox.
  • Rebuild validates things it doesn't need: The onboard flow checks usage notice consent, runs preflight (Docker availability, OpenShell version), and manages the gateway — none of which rebuild needs since the gateway is already running.

This fragile coupling has already produced bugs:

Proposed Work

Part A: Extract Rebuild-Specific Recreate Path

Goal: Replace the onboard() call in sandboxRebuild() with a focused function that does only what rebuild needs.

The subset rebuild actually requires from onboard:

  1. Hydrate credentials from saved storage
  2. Create the sandbox via OpenShell
  3. Configure the inference provider and model
  4. Run agent-specific setup (e.g., openclaw doctor --fix)

What rebuild does NOT need:

  • Lock management (rebuild has its own flow control)
  • Session state persistence (rebuild manages its own backup/restore lifecycle)
  • Preflight checks (gateway is already running, Docker is already available)
  • Usage notice consent (already accepted)
  • Provider selection UI (provider is known from the registry/session)

Approach:

  • Extract the sandbox-creation, inference-configuration, and agent-setup logic from onboard() into composable functions that throw on error instead of calling process.exit().
  • Create a recreateSandbox() function (or similar) that composes these primitives and is designed to be called from rebuild.
  • The new function should accept explicit parameters (sandbox name, provider, model, credential env) rather than reading them from the global session file.
  • sandboxRebuild() calls this new function wrapped in try/catch, enabling rollback on failure.

Impact: Eliminates the entire class of bugs caused by:

  • process.exit() in library code called as a subroutine
  • Implicit session state coupling between rebuild and onboard
  • Global session file as a communication channel

Part B: Canonical Credential Resolution + Lint Guard

Goal: Make it structurally impossible to introduce a new credential check that only reads process.env without also checking credentials.json.

Approach:

  1. Introduce resolveProviderCredential(envName) as the single canonical entry point for all provider credential resolution. It calls getCredential() internally and optionally hydrates process.env.
  2. Add an ESLint rule (or a custom lint check in the pre-commit hooks) that flags direct process.env.NVIDIA_API_KEY, process.env.OPENAI_API_KEY, process.env[credentialEnv], etc. in onboard.ts provider-selection contexts. The rule should suggest using resolveProviderCredential() instead.
  3. Add a parametric test that iterates over every entry in REMOTE_PROVIDER_CONFIG and verifies that the non-interactive credential resolution path for each provider uses getCredential() (or the canonical resolver) rather than process.env directly.

Impact: Prevents regression. When a new provider is added to REMOTE_PROVIDER_CONFIG, the parametric test automatically covers it. The lint rule catches the mistake at development time.

Why This Reduces Fragility

The current architecture has three fragility vectors that this work addresses:

  1. Error handling fragility: process.exit() in library code means callers can't compose safely. Extracting throw-based functions lets rebuild (and future callers like upgrade-sandboxes) handle errors, rollback, retry, or report — rather than dying silently.

  2. State coupling fragility: Rebuild communicates with onboard through a mutable global file, relying on onboard's internal step-machine being in the right state. Explicit parameter passing eliminates this implicit contract.

  3. Credential consistency fragility: Two resolution patterns with no enforcement means every new provider or code path is an opportunity to reintroduce the bug. A canonical function + lint rule + parametric test makes the correct pattern the path of least resistance.

Relationship to #2273

The immediate fix for #2273 (in a separate PR) addresses:

  • Using getCredential() / hydrateCredentialEnv() in the non-interactive provider selection for all providers
  • Adding preflight validation to rebuild (check credentials before destroying)
  • Adding rollback on recreate failure (restore from backup if onboard fails)

This issue captures the deeper structural work that prevents the entire class of bugs from recurring.

Acceptance Criteria

  • sandboxRebuild() no longer calls onboard() directly — it uses a focused recreate function that throws on error
  • The recreate function accepts explicit parameters (no session file manipulation)
  • Rebuild can catch recreate failures and attempt rollback
  • resolveProviderCredential() (or equivalent) is the single canonical path for provider credential resolution in onboard.ts
  • A lint rule or pre-commit check flags direct process.env access for provider credential keys in provider-selection contexts
  • A parametric test covers every entry in REMOTE_PROVIDER_CONFIG to verify consistent credential resolution
  • upgrade-sandboxes (which also calls sandboxRebuild) benefits from the same improvements

Metadata

Metadata

Assignees

Labels

area: architectureArchitecture, design debt, major refactors, or maintainabilityrefactorPR restructures code without intended behavior change

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions