Architectural improvement: Extract rebuild recreate path from onboard monolith and canonicalize credential resolution

## Summary

Extract a rebuild-specific sandbox recreate path from the monolithic `onboard()` function, and introduce a canonical credential resolution pattern to eliminate an entire class of bugs where credentials are resolved inconsistently.

This issue captures architectural improvements identified during the investigation of #2273 (non-atomic rebuild). The immediate bug is being fixed in a separate PR, but the underlying architecture that made the bug possible — and will produce similar bugs in the future — needs structural attention.

## Context: What #2273 Revealed

While investigating why `nemoclaw rebuild` destroys a sandbox and then fails to recreate it (leaving the user with no working sandbox), a deep audit of the rebuild → onboard → credential resolution chain uncovered **five systemic issues**, two of which require architectural work beyond a point fix:

### The Credential Resolution Inconsistency

The codebase has two competing patterns for resolving provider credentials:

| Pattern | Location | Checks `process.env`? | Checks `credentials.json`? |
|---------|----------|----------------------|---------------------------|
| `getCredential(key)` | `src/lib/credentials.ts` | ✅ | ✅ |
| `process.env[key]` (direct) | `src/lib/onboard.ts` non-interactive path | ✅ | ❌ |

A bridge function `hydrateCredentialEnv()` exists at line 670 of `onboard.ts` to unify these — it calls `getCredential()` and populates `process.env`. It's used correctly in some paths (resume path at line 6580, `setupInference()` at line 4811) but **bypassed** in the non-interactive provider selection (lines 4238, 4274).

This means the same credential-not-found bug exists for **every remote provider** (NVIDIA, OpenAI, Anthropic, Gemini, custom, Anthropic-compatible) — not just the NVIDIA endpoint reported in #2273. The immediate fix addresses all six, but there's no structural guard preventing a seventh provider from being added with the same mistake.

### The Monolith-as-Subroutine Problem

`onboard()` is a 476-line orchestrator inside a 6,906-line file (`src/lib/onboard.ts`). It was designed as a standalone, user-facing CLI flow with its own:
- Lock management (`acquireOnboardLock` / `releaseOnboardLock`)
- Session state machine (8 steps with `pending` → `in_progress` → `complete`/`failed` transitions)
- 87 `process.exit()` calls for error handling
- Preflight validation (Docker, OpenShell, usage notice consent)

But `sandboxRebuild()` in `src/nemoclaw.ts` calls it as a subroutine:

```typescript
await onboard({ resume: true, nonInteractive: true, recreateSandbox: true, agent: rebuildAgent });
```

This creates severe impedance mismatch:
- **Rebuild can't catch failures**: Any of onboard's 87 `process.exit()` calls kills the entire process — rebuild has no opportunity to rollback the sandbox destruction.
- **Rebuild manipulates onboard's internals**: Before calling onboard, rebuild writes directly to the session file to set `sandboxName`, `resumable`, `status`, and `agent` — reverse-engineering onboard's step machine from the outside.
- **Session is a global singleton**: `~/.nemoclaw/onboard-session.json` is shared across all sandboxes. Rebuild assumes the session belongs to the sandbox being rebuilt, but it may have been overwritten by a subsequent onboard of a different sandbox.
- **Rebuild validates things it doesn't need**: The onboard flow checks usage notice consent, runs preflight (Docker availability, OpenShell version), and manages the gateway — none of which rebuild needs since the gateway is already running.

This fragile coupling has already produced bugs:
- **#2201**: Rebuild built the wrong sandbox type because it picked up a stale `agent` field from a different sandbox's session.
- **#2273**: Non-interactive onboard checked `process.env` instead of saved credentials, and rebuild couldn't catch the resulting `process.exit(1)` to rollback.

## Proposed Work

### Part A: Extract Rebuild-Specific Recreate Path

**Goal**: Replace the `onboard()` call in `sandboxRebuild()` with a focused function that does only what rebuild needs.

The subset rebuild actually requires from onboard:
1. Hydrate credentials from saved storage
2. Create the sandbox via OpenShell
3. Configure the inference provider and model
4. Run agent-specific setup (e.g., `openclaw doctor --fix`)

What rebuild does NOT need:
- Lock management (rebuild has its own flow control)
- Session state persistence (rebuild manages its own backup/restore lifecycle)
- Preflight checks (gateway is already running, Docker is already available)
- Usage notice consent (already accepted)
- Provider selection UI (provider is known from the registry/session)

**Approach**:
- Extract the sandbox-creation, inference-configuration, and agent-setup logic from `onboard()` into composable functions that **throw on error** instead of calling `process.exit()`.
- Create a `recreateSandbox()` function (or similar) that composes these primitives and is designed to be called from rebuild.
- The new function should accept explicit parameters (sandbox name, provider, model, credential env) rather than reading them from the global session file.
- `sandboxRebuild()` calls this new function wrapped in try/catch, enabling rollback on failure.

**Impact**: Eliminates the entire class of bugs caused by:
- `process.exit()` in library code called as a subroutine
- Implicit session state coupling between rebuild and onboard
- Global session file as a communication channel

### Part B: Canonical Credential Resolution + Lint Guard

**Goal**: Make it structurally impossible to introduce a new credential check that only reads `process.env` without also checking `credentials.json`.

**Approach**:
1. Introduce `resolveProviderCredential(envName)` as the single canonical entry point for all provider credential resolution. It calls `getCredential()` internally and optionally hydrates `process.env`.
2. Add an ESLint rule (or a custom lint check in the pre-commit hooks) that flags direct `process.env.NVIDIA_API_KEY`, `process.env.OPENAI_API_KEY`, `process.env[credentialEnv]`, etc. in `onboard.ts` provider-selection contexts. The rule should suggest using `resolveProviderCredential()` instead.
3. Add a **parametric test** that iterates over every entry in `REMOTE_PROVIDER_CONFIG` and verifies that the non-interactive credential resolution path for each provider uses `getCredential()` (or the canonical resolver) rather than `process.env` directly.

**Impact**: Prevents regression. When a new provider is added to `REMOTE_PROVIDER_CONFIG`, the parametric test automatically covers it. The lint rule catches the mistake at development time.

## Why This Reduces Fragility

The current architecture has three fragility vectors that this work addresses:

1. **Error handling fragility**: `process.exit()` in library code means callers can't compose safely. Extracting throw-based functions lets rebuild (and future callers like `upgrade-sandboxes`) handle errors, rollback, retry, or report — rather than dying silently.

2. **State coupling fragility**: Rebuild communicates with onboard through a mutable global file, relying on onboard's internal step-machine being in the right state. Explicit parameter passing eliminates this implicit contract.

3. **Credential consistency fragility**: Two resolution patterns with no enforcement means every new provider or code path is an opportunity to reintroduce the bug. A canonical function + lint rule + parametric test makes the correct pattern the path of least resistance.

## Relationship to #2273

The immediate fix for #2273 (in a separate PR) addresses:
- Using `getCredential()` / `hydrateCredentialEnv()` in the non-interactive provider selection for all providers
- Adding preflight validation to rebuild (check credentials before destroying)
- Adding rollback on recreate failure (restore from backup if onboard fails)

This issue captures the deeper structural work that prevents the entire class of bugs from recurring.

## Acceptance Criteria

- [ ] `sandboxRebuild()` no longer calls `onboard()` directly — it uses a focused recreate function that throws on error
- [ ] The recreate function accepts explicit parameters (no session file manipulation)
- [ ] Rebuild can catch recreate failures and attempt rollback
- [ ] `resolveProviderCredential()` (or equivalent) is the single canonical path for provider credential resolution in `onboard.ts`
- [ ] A lint rule or pre-commit check flags direct `process.env` access for provider credential keys in provider-selection contexts
- [ ] A parametric test covers every entry in `REMOTE_PROVIDER_CONFIG` to verify consistent credential resolution
- [ ] `upgrade-sandboxes` (which also calls `sandboxRebuild`) benefits from the same improvements


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Architectural improvement: Extract rebuild recreate path from onboard monolith and canonicalize credential resolution #2306

Summary

Context: What #2273 Revealed

The Credential Resolution Inconsistency

The Monolith-as-Subroutine Problem

Proposed Work

Part A: Extract Rebuild-Specific Recreate Path

Part B: Canonical Credential Resolution + Lint Guard

Why This Reduces Fragility

Relationship to #2273

Acceptance Criteria

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Pattern	Location	Checks `process.env`?	Checks `credentials.json`?
`getCredential(key)`	`src/lib/credentials.ts`	✅	✅
`process.env[key]` (direct)	`src/lib/onboard.ts` non-interactive path	✅	❌

Architectural improvement: Extract rebuild recreate path from onboard monolith and canonicalize credential resolution #2306

Description

Summary

Context: What #2273 Revealed

The Credential Resolution Inconsistency

The Monolith-as-Subroutine Problem

Proposed Work

Part A: Extract Rebuild-Specific Recreate Path

Part B: Canonical Credential Resolution + Lint Guard

Why This Reduces Fragility

Relationship to #2273

Acceptance Criteria

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions