refactor(cli): migrate onboard-session.js to TypeScript by cv · Pull Request #1272 · NVIDIA/NemoClaw

cv · 2026-04-01T20:13:00Z

Summary

Convert bin/lib/onboard-session.js (440 lines) to src/lib/onboard-session.ts
Typed interfaces: Session, StepState, SessionFailure, LockResult, SessionUpdates, LockInfo, SessionMetadata
Full migration including fs I/O (session persistence, file locking) and pure helpers (redaction, validation, normalization)
Co-locate tests (14 tests) with updated cache-clearing for dist/ path

Stacked on #1240. 615 CLI tests pass. Coverage ratchet passes.

Relates to #924 (shell consolidation).

🤖 Generated with Claude Code

Summary by CodeRabbit

Refactor
- Reorganized onboarding session management module structure and improved distribution of the codebase. Session tracking, step-level state management, lock handling, and sensitive data redaction functionality have been consolidated and optimized for better maintainability.

…ript modules Move ~210 lines of pure, side-effect-free functions from the 3,800-line onboard.js into five typed TypeScript modules under src/lib/: - gateway-state.ts: gateway/sandbox state classification - validation.ts: failure classification, API key validation, model ID checks - url-utils.ts: URL normalization, text compaction, env formatting - build-context.ts: Docker build context filtering, recovery hints - dashboard.ts: dashboard URL resolution and construction Infrastructure: - tsconfig.src.json compiles src/ → dist/ as CJS (dist/ already gitignored) - tsconfig.cli.json updated to type-check src/ - npm run build:cli added to package.json - Pre-commit test-cli hook builds TS and includes dist/lib/ in coverage onboard.js imports from compiled dist/lib/ output. All 542 CLI tests pass. No user-facing behavior changes. Closes #1237. Relates to #924 (shell consolidation). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

56 new tests across three co-located test files: - src/lib/validation.test.ts: classifyValidationFailure, classifyApplyFailure, classifySandboxCreateFailure, validateNvidiaApiKeyValue, isSafeModelId - src/lib/dashboard.test.ts: resolveDashboardForwardTarget, buildControlUiUrls - src/lib/url-utils.test.ts: compactText, stripEndpointSuffix, normalizeProviderBaseUrl, isLoopbackHostname, formatEnvAssignment, parsePolicyPresetEnv vitest.config.ts updated to include src/**/*.test.ts in the CLI project. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The existing callsite in printDashboard calls buildControlUiUrls() with no arguments when no token is available. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The tsc-js pre-push hook (jsconfig.json) type-checks bin/lib/onboard.js which requires dist/lib/ to exist. CI was not running npm run build:cli before the prek checks, causing TS2307 "Cannot find module" errors. - Add npm run build:cli step to .github/actions/basic-checks - Update tsc-js hook to run build:cli before tsc Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…law into cv/extract-onboard-pure-fns

## Summary - Convert `bin/lib/preflight.js` (357 lines) to `src/lib/preflight.ts` with full type definitions - Typed interfaces for all opts objects and return types: `PortProbeResult`, `MemoryInfo`, `SwapResult`, `CheckPortOpts`, `GetMemoryInfoOpts`, `EnsureSwapOpts` - Extract `parseLsofLines` helper to reduce duplication in `checkPortAvailable` - Incorporate #1227 fix: `sudo` → `sudo -n` (non-interactive) for lsof retry - `bin/lib/preflight.js` becomes a thin re-export shim — existing consumers unaffected - Co-locate tests: `test/preflight.test.js` → `src/lib/preflight.test.ts` - Add real net probe tests (EADDRINUSE detection on occupied ports) - Fix all co-located test imports to use `dist/` paths for coverage attribution - Add targeted dashboard/validation branch tests to maintain ratchet Stacked on #1240. Not touched by any #924 blocker PR. ## Test plan - [x] 612 CLI tests pass (601 existing + 11 new) - [x] `tsc -p tsconfig.src.json` compiles cleanly - [x] `tsc -p tsconfig.cli.json` type-checks cleanly - [x] `tsc -p jsconfig.json` type-checks cleanly (the pre-push check that caught the union issue) - [x] Coverage ratchet passes Relates to #924 (shell consolidation). Supersedes #1227. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Convert bin/lib/onboard-session.js (440 lines) to src/lib/onboard-session.ts with typed interfaces for Session, StepState, SessionFailure, LockResult, SessionUpdates, and related types. Full migration — all functions including fs I/O (session persistence, file-based locking) and pure helpers (redaction, validation, normalization). Co-locates tests. Cache-clearing pattern updated for dist/ path. 615 CLI tests pass. Coverage ratchet passes. Relates to #924 (shell consolidation). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-04-01T20:13:59Z

📝 Walkthrough

Walkthrough

Reorganized session management code from a JavaScript implementation in bin/lib/ to a new TypeScript implementation in src/lib/, with the original bin/ file converted to a thin re-export shim of the compiled dist/ module. Updated tests to handle module caching for both entrypoints.

Changes

Cohort / File(s)	Summary
Shim Re-export `bin/lib/onboard-session.js`	Replaced entire implementation with single re-export from compiled dist module; all previously defined constants, functions, and helpers now sourced from build output.
Test Module Loading `src/lib/onboard-session.test.ts`	Updated module cache resolution to handle both shim and dist paths; switched test to require compiled module from dist while maintaining cache cleanup for both entrypoints.
Session Management Implementation `src/lib/onboard-session.ts`	New TypeScript module providing file-backed session persistence at `~/.nemoclaw/onboard-session.json`, including session lifecycle (create/load/save/clear), file-based locking with process liveness detection, step-level state tracking, atomic updates, and sensitive data redaction utilities.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 Hoppy refactoring times,
From JavaScript to TypeScript rhymes,
Sessions saved with lock and key,
Sensitive data we shall be,
Built from source to dist they go—
Onboarding flows now steal the show! 🔐✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'refactor(cli): migrate onboard-session.js to TypeScript' clearly and specifically describes the primary change—converting a JavaScript module to TypeScript.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch cv/migrate-onboard-session-ts

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/lib/onboard-session.test.ts (1)

19-35: ⚠️ Potential issue | 🟡 Minor

The shared temp HOME makes these lock tests order-dependent.

These hooks reuse one tmpDir for the whole file, but releaseOnboardLock() intentionally leaves malformed lock files behind. After the malformed-lock cases, later tests inherit that state because the hooks only clear the module cache and the session file. Recreate tmpDir per test, or wipe it in the hooks.

One simple cleanup option

 beforeEach(() => {
+  fs.rmSync(tmpDir, { recursive: true, force: true });
+  fs.mkdirSync(tmpDir, { recursive: true, mode: 0o700 });
   process.env.HOME = tmpDir;
   delete require.cache[shimPath];
   delete require.cache[distPath];
   session = require("../../dist/lib/onboard-session");
   session.clearSession();
   session.releaseOnboardLock();
 });
 
 afterEach(() => {
   delete require.cache[shimPath];
   delete require.cache[distPath];
+  fs.rmSync(tmpDir, { recursive: true, force: true });
   if (originalHome === undefined) {

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard-session.test.ts` around lines 19 - 35, The tests reuse a
single tmpDir across the file which causes order-dependent failures because
session.releaseOnboardLock() leaves malformed lock files; update the
beforeEach/afterEach hooks to create a fresh temporary directory for each test
(or delete all files inside tmpDir) and set process.env.HOME to that fresh
directory, ensuring you still delete require.cache for shimPath and distPath and
call session.clearSession() and session.releaseOnboardLock() as before; target
the beforeEach/afterEach functions and symbols tmpDir, process.env.HOME,
session.clearSession(), and session.releaseOnboardLock() when applying the
change.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard-session.js`:
- Line 7: Replace the shim export in bin/lib/onboard-session.js (currently
"module.exports = require('../../dist/lib/onboard-session')") with the actual
CommonJS implementation so the CLI can run without a build; locate the logic in
dist/lib/onboard-session (or the source TS/JS equivalent), convert/inline that
implementation into bin/lib/onboard-session.js as a plain CommonJS module
(exports or module.exports) and remove any dependency on dist/, ensuring
bin/nemoclaw.js can require('./lib/onboard-session') in a fresh checkout/npm
install.

In `@src/lib/onboard-session.ts`:
- Around line 204-251: normalizeSession is too complex and currently bypasses
the complexity rule; split its responsibilities into small helpers. Create
helper functions such as coerceSessionFields(data: Record<string, unknown>) ->
Partial<Session> to handle top-level field coercion and redaction (use
redactUrl, sanitizeFailure), normalizeMetadata(metadata: unknown) ->
SessionMetadata|undefined, and normalizeSteps(steps: unknown, currentSteps:
Session['steps']) -> Session['steps'] to validate and populate individual step
entries (use validateStep and redactSensitiveText). Replace the inline logic in
normalizeSession to call these helpers, remove the eslint-disable-next-line
complexity waiver, and keep behavior identical (set normalized.resumable and
normalized.status as before). Ensure the new helpers are exported or file-scoped
so tests/other code can be updated if needed.
- Around line 336-365: The current stale-lock cleanup unlinks LOCK_FILE directly
which allows a race where a concurrent process recreates the lock and the slower
process then deletes it; change to an atomic stale-claim approach: when you
detect a stale lock (parseLockFile(...) returned existing and
isProcessAlive(...) is false), write a temporary file (e.g.,
`${LOCK_FILE}.${process.pid}.tmp`) with your PID/startedAt/command and then
attempt to atomically create the real lock using fs.linkSync(tempPath,
LOCK_FILE) (this call will succeed only if LOCK_FILE does not exist); if
linkSync succeeds you have acquired the lock (remove tempPath and return
acquired=true), if it fails because LOCK_FILE now exists remove tempPath and
continue retrying; do not call fs.unlinkSync(LOCK_FILE) unconditionally—only
remove stale files after successfully claiming the lock or when you can safely
determine nobody else replaced it. Use parseLockFile, LOCK_FILE, isProcessAlive,
and fs.linkSync/fs.writeFileSync/fs.unlinkSync to implement this.
- Around line 227-229: The metadata assignment currently copies gatewayName
regardless of type, allowing non-string values to persist; update the
normalization in the metadata block (used by loadSession/createSession flows) to
only set metadata.gatewayName when isObject(d.metadata) is true and (d.metadata
as Record<string, unknown>).gatewayName is a string (typeof === 'string'),
otherwise leave metadata undefined or omit gatewayName; ensure you reference the
existing isObject check and the SessionMetadata type so only string gatewayName
values are propagated.

---

Outside diff comments:
In `@src/lib/onboard-session.test.ts`:
- Around line 19-35: The tests reuse a single tmpDir across the file which
causes order-dependent failures because session.releaseOnboardLock() leaves
malformed lock files; update the beforeEach/afterEach hooks to create a fresh
temporary directory for each test (or delete all files inside tmpDir) and set
process.env.HOME to that fresh directory, ensuring you still delete
require.cache for shimPath and distPath and call session.clearSession() and
session.releaseOnboardLock() as before; target the beforeEach/afterEach
functions and symbols tmpDir, process.env.HOME, session.clearSession(), and
session.releaseOnboardLock() when applying the change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 356d71e7-e42a-411b-862f-93c6eab61708

📥 Commits

Reviewing files that changed from the base of the PR and between 2804b74 and 3e67453.

📒 Files selected for processing (3)

bin/lib/onboard-session.js
src/lib/onboard-session.test.ts
src/lib/onboard-session.ts

coderabbitai · 2026-04-01T21:09:09Z

-  summarizeForDebug,
-  updateSession,
-};
+module.exports = require("../../dist/lib/onboard-session");


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "Tracked onboard-session entrypoints:" git ls-files -- bin/lib/onboard-session.js dist/lib/onboard-session.js src/lib/onboard-session.ts echo echo "Packaging/build hooks that mention bin/dist:" rg -n -C2 '"(bin|files|prepare|prepack|postinstall|build:cli)"|dist/lib/onboard-session|bin/lib/onboard-session' \ package.json README.md .npmignore .gitignore 2>/dev/null || true

Repository: NVIDIA/NemoClaw

Length of output: 1400

Remove the shim and keep the implementation in bin/lib/onboard-session.js.

The shim at line 7 delegates to dist/lib/onboard-session.js, which is not tracked in git and not included in the npm package. This breaks the bin/ contract that requires working without a build step—bin/nemoclaw.js will fail to load this module in a clean checkout or fresh npm install.

Either restore the implementation directly in bin/lib/onboard-session.js (as CommonJS) or restructure to avoid the dist/ dependency in the CLI path.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard-session.js` at line 7, Replace the shim export in bin/lib/onboard-session.js (currently "module.exports = require('../../dist/lib/onboard-session')") with the actual CommonJS implementation so the CLI can run without a build; locate the logic in dist/lib/onboard-session (or the source TS/JS equivalent), convert/inline that implementation into bin/lib/onboard-session.js as a plain CommonJS module (exports or module.exports) and remove any dependency on dist/, ensuring bin/nemoclaw.js can require('./lib/onboard-session') in a fresh checkout/npm install.

coderabbitai · 2026-04-01T21:09:09Z

+// eslint-disable-next-line complexity
+export function normalizeSession(data: unknown): Session | null {
+  if (!isObject(data) || (data as Record<string, unknown>).version !== SESSION_VERSION) return null;
+  const d = data as Record<string, unknown>;
+  const normalized = createSession({
+    sessionId: typeof d.sessionId === "string" ? d.sessionId : undefined,
+    mode: typeof d.mode === "string" ? d.mode : undefined,
+    startedAt: typeof d.startedAt === "string" ? d.startedAt : undefined,
+    updatedAt: typeof d.updatedAt === "string" ? d.updatedAt : undefined,
+    sandboxName: typeof d.sandboxName === "string" ? d.sandboxName : null,
+    provider: typeof d.provider === "string" ? d.provider : null,
+    model: typeof d.model === "string" ? d.model : null,
+    endpointUrl: typeof d.endpointUrl === "string" ? redactUrl(d.endpointUrl) : null,
+    credentialEnv: typeof d.credentialEnv === "string" ? d.credentialEnv : null,
+    preferredInferenceApi:
+      typeof d.preferredInferenceApi === "string" ? d.preferredInferenceApi : null,
+    nimContainer: typeof d.nimContainer === "string" ? d.nimContainer : null,
+    policyPresets: Array.isArray(d.policyPresets)
+      ? (d.policyPresets as unknown[]).filter((value) => typeof value === "string") as string[]
+      : null,
+    lastStepStarted: typeof d.lastStepStarted === "string" ? d.lastStepStarted : null,
+    lastCompletedStep: typeof d.lastCompletedStep === "string" ? d.lastCompletedStep : null,
+    failure: sanitizeFailure(d.failure as Record<string, unknown> | null),
+    metadata: isObject(d.metadata)
+      ? ({ gatewayName: (d.metadata as Record<string, unknown>).gatewayName } as SessionMetadata)
+      : undefined,
+  } as Partial<Session>);
+  normalized.resumable = d.resumable !== false;
+  normalized.status = typeof d.status === "string" ? d.status : normalized.status;
+
+  if (isObject(d.steps)) {
+    for (const [name, step] of Object.entries(d.steps as Record<string, unknown>)) {
+      if (
+        Object.prototype.hasOwnProperty.call(normalized.steps, name) &&
+        validateStep(step)
+      ) {
+        const s = step as Record<string, unknown>;
+        normalized.steps[name] = {
+          status: s.status as string,
+          startedAt: typeof s.startedAt === "string" ? s.startedAt : null,
+          completedAt: typeof s.completedAt === "string" ? s.completedAt : null,
+          error: redactSensitiveText(s.error),
+        };
+      }
+    }
+  }
+
+  return normalized;


🛠️ Refactor suggestion | 🟠 Major

Split normalizeSession instead of waiving the complexity check.

This deserializer is already doing schema coercion, redaction, defaults, and step normalization in one place. Please extract smaller helpers rather than landing the new complexity waiver here.

As per coding guidelines, **/*.{js,ts,tsx}: Maintain cyclomatic complexity limit of 20 (target: ratchet down to 15) for all functions.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard-session.ts` around lines 204 - 251, normalizeSession is too complex and currently bypasses the complexity rule; split its responsibilities into small helpers. Create helper functions such as coerceSessionFields(data: Record<string, unknown>) -> Partial<Session> to handle top-level field coercion and redaction (use redactUrl, sanitizeFailure), normalizeMetadata(metadata: unknown) -> SessionMetadata|undefined, and normalizeSteps(steps: unknown, currentSteps: Session['steps']) -> Session['steps'] to validate and populate individual step entries (use validateStep and redactSensitiveText). Replace the inline logic in normalizeSession to call these helpers, remove the eslint-disable-next-line complexity waiver, and keep behavior identical (set normalized.resumable and normalized.status as before). Ensure the new helpers are exported or file-scoped so tests/other code can be updated if needed.

coderabbitai · 2026-04-01T21:09:09Z

+    metadata: isObject(d.metadata)
+      ? ({ gatewayName: (d.metadata as Record<string, unknown>).gatewayName } as SessionMetadata)
+      : undefined,


⚠️ Potential issue | 🟡 Minor

Validate metadata.gatewayName before copying it.

A persisted payload like { "metadata": { "gatewayName": 123 } } currently survives normalization. Because createSession() copies truthy gatewayName values through, loadSession() can return a Session whose metadata.gatewayName is not a string.

Suggested fix

- metadata: isObject(d.metadata) - ? ({ gatewayName: (d.metadata as Record<string, unknown>).gatewayName } as SessionMetadata) - : undefined, + metadata: + isObject(d.metadata) && + typeof (d.metadata as Record<string, unknown>).gatewayName === "string" + ? { gatewayName: (d.metadata as Record<string, unknown>).gatewayName as string } + : undefined,

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard-session.ts` around lines 227 - 229, The metadata assignment currently copies gatewayName regardless of type, allowing non-string values to persist; update the normalization in the metadata block (used by loadSession/createSession flows) to only set metadata.gatewayName when isObject(d.metadata) is true and (d.metadata as Record<string, unknown>).gatewayName is a string (typeof === 'string'), otherwise leave metadata undefined or omit gatewayName; ensure you reference the existing isObject check and the SessionMetadata type so only string gatewayName values are propagated.

coderabbitai · 2026-04-01T21:09:09Z

+      let existing: LockInfo | null;
+      try {
+        existing = parseLockFile(fs.readFileSync(LOCK_FILE, "utf8"));
+      } catch (readError: unknown) {
+        if ((readError as NodeJS.ErrnoException)?.code === "ENOENT") {
+          continue;
+        }
+        throw readError;
+      }
+      if (!existing) {
+        continue;
+      }
+      if (existing && isProcessAlive(existing.pid)) {
+        return {
+          acquired: false,
+          lockFile: LOCK_FILE,
+          stale: false,
+          holderPid: existing.pid,
+          holderStartedAt: existing.startedAt,
+          holderCommand: existing.command,
+        };
+      }
+
+      try {
+        fs.unlinkSync(LOCK_FILE);
+      } catch (unlinkError: unknown) {
+        if ((unlinkError as NodeJS.ErrnoException)?.code !== "ENOENT") {
+          throw unlinkError;
+        }
+      }


⚠️ Potential issue | 🔴 Critical

The stale-lock cleanup can let two processes acquire the lock.

If two contenders read the same stale file, the slower one can reach Line 360 after the faster one has already recreated LOCK_FILE, unlink that fresh lock, and then win on its next retry. That breaks the mutual-exclusion guarantee. Please use an atomic stale-lock claim strategy here, or switch this path to a proven lockfile implementation.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard-session.ts` around lines 336 - 365, The current stale-lock cleanup unlinks LOCK_FILE directly which allows a race where a concurrent process recreates the lock and the slower process then deletes it; change to an atomic stale-claim approach: when you detect a stale lock (parseLockFile(...) returned existing and isProcessAlive(...) is false), write a temporary file (e.g., `${LOCK_FILE}.${process.pid}.tmp`) with your PID/startedAt/command and then attempt to atomically create the real lock using fs.linkSync(tempPath, LOCK_FILE) (this call will succeed only if LOCK_FILE does not exist); if linkSync succeeds you have acquired the lock (remove tempPath and return acquired=true), if it fails because LOCK_FILE now exists remove tempPath and continue retrying; do not call fs.unlinkSync(LOCK_FILE) unconditionally—only remove stale files after successfully claiming the lock or when you can safely determine nobody else replaced it. Use parseLockFile, LOCK_FILE, isProcessAlive, and fs.linkSync/fs.writeFileSync/fs.unlinkSync to implement this.

ericksoa

Faithful migration of the largest module in the series. Pre-existing issues tracked separately:

#1281 — stale-lock race condition in acquireOnboardLock
#1282 — normalizeSession cyclomatic complexity
#1283 — metadata.gatewayName not type-validated
#1284 — lock tests share tmpDir causing order dependence

5 behind main — needs a merge.

## Summary - Convert `bin/lib/onboard-session.js` (440 lines) to `src/lib/onboard-session.ts` - Typed interfaces: `Session`, `StepState`, `SessionFailure`, `LockResult`, `SessionUpdates`, `LockInfo`, `SessionMetadata` - Full migration including fs I/O (session persistence, file locking) and pure helpers (redaction, validation, normalization) - Co-locate tests (14 tests) with updated cache-clearing for dist/ path Stacked on #1240. 615 CLI tests pass. Coverage ratchet passes. Relates to #924 (shell consolidation). 🤖 Generated with [Claude Code](https://claude.com/claude-code)  ## Summary by CodeRabbit * **Refactor** * Reorganized onboarding session management module structure and improved distribution of the codebase. Session tracking, step-level state management, lock handling, and sensitive data redaction functionality have been consolidated and optimized for better maintainability.  --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

## Summary - Convert `bin/lib/onboard-session.js` (440 lines) to `src/lib/onboard-session.ts` - Typed interfaces: `Session`, `StepState`, `SessionFailure`, `LockResult`, `SessionUpdates`, `LockInfo`, `SessionMetadata` - Full migration including fs I/O (session persistence, file locking) and pure helpers (redaction, validation, normalization) - Co-locate tests (14 tests) with updated cache-clearing for dist/ path Stacked on NVIDIA#1240. 615 CLI tests pass. Coverage ratchet passes. Relates to NVIDIA#924 (shell consolidation). 🤖 Generated with [Claude Code](https://claude.com/claude-code)  ## Summary by CodeRabbit * **Refactor** * Reorganized onboarding session management module structure and improved distribution of the codebase. Session tracking, step-level state management, lock handling, and sensitive data redaction functionality have been consolidated and optimized for better maintainability.  --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The acquireOnboardLock stale-cleanup path read a stale lock, decided the holder was dead, and unconditionally unlinked LOCK_FILE before retrying writeFileSync(wx). Two concurrent processes that both observe the same stale lock will both try to clean it up — and the slower of the two can unlink the *fresh* lock the faster process just claimed, breaking mutual exclusion: both processes end up holding 'their' lock and onboard runs in parallel against the same shared session state. Reported as NVIDIA#1281, originally surfaced by CodeRabbit on NVIDIA#1272. Fix: capture the stale file's inode via fs.statSync({ bigint: true }) at the same time we read its contents, then in a new unlinkIfInodeMatches() helper, re-stat right before fs.unlinkSync and bail if the inode has changed. The dual stat-then-unlink is the only portable POSIX primitive Node exposes for this — there is no atomic "unlink-if-inode" syscall — so a sufficiently unlucky race can still slip through. The window is orders of magnitude smaller than the unconditional unlink it replaces, and the outer retry loop will detect a wrong unlink on its next writeFileSync(wx) attempt because either we re-create the file or we observe a new lock with a different inode. Also bumps MAX_ATTEMPTS from 2 to 5 because the inode-verified cleanup can take a few more spins under contention before one cleaner wins. Adds a regression test that simulates the race deterministically by wrapping fs.statSync so the first stat succeeds against the original stale inode, then atomically swaps the lock file (unlink + recreate) to give it a new inode before unlinkIfInodeMatches re-stats it. The test asserts the fresh claim survives the race and is the file on disk after acquireOnboardLock returns. Verified by stashing the source fix and re-running: the new test fails on the unguarded code as expected, and passes with the inode guard in place. Closes NVIDIA#1281

## Summary - Convert `bin/lib/onboard-session.js` (440 lines) to `src/lib/onboard-session.ts` - Typed interfaces: `Session`, `StepState`, `SessionFailure`, `LockResult`, `SessionUpdates`, `LockInfo`, `SessionMetadata` - Full migration including fs I/O (session persistence, file locking) and pure helpers (redaction, validation, normalization) - Co-locate tests (14 tests) with updated cache-clearing for dist/ path Stacked on NVIDIA#1240. 615 CLI tests pass. Coverage ratchet passes. Relates to NVIDIA#924 (shell consolidation). 🤖 Generated with [Claude Code](https://claude.com/claude-code)  ## Summary by CodeRabbit * **Refactor** * Reorganized onboarding session management module structure and improved distribution of the codebase. Session tracking, step-level state management, lock handling, and sensitive data redaction functionality have been consolidated and optimized for better maintainability.  --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The acquireOnboardLock stale-cleanup path read a stale lock, decided the holder was dead, and unconditionally unlinked LOCK_FILE before retrying writeFileSync(wx). Two concurrent processes that both observe the same stale lock will both try to clean it up — and the slower of the two can unlink the *fresh* lock the faster process just claimed, breaking mutual exclusion: both processes end up holding 'their' lock and onboard runs in parallel against the same shared session state. Reported as NVIDIA#1281, originally surfaced by CodeRabbit on NVIDIA#1272. Fix: capture the stale file's inode via fs.statSync({ bigint: true }) at the same time we read its contents, then in a new unlinkIfInodeMatches() helper, re-stat right before fs.unlinkSync and bail if the inode has changed. The dual stat-then-unlink is the only portable POSIX primitive Node exposes for this — there is no atomic "unlink-if-inode" syscall — so a sufficiently unlucky race can still slip through. The window is orders of magnitude smaller than the unconditional unlink it replaces, and the outer retry loop will detect a wrong unlink on its next writeFileSync(wx) attempt because either we re-create the file or we observe a new lock with a different inode. Also bumps MAX_ATTEMPTS from 2 to 5 because the inode-verified cleanup can take a few more spins under contention before one cleaner wins. Adds a regression test that simulates the race deterministically by wrapping fs.statSync so the first stat succeeds against the original stale inode, then atomically swaps the lock file (unlink + recreate) to give it a new inode before unlinkIfInodeMatches re-stats it. The test asserts the fresh claim survives the race and is the file on disk after acquireOnboardLock returns. Verified by stashing the source fix and re-running: the new test fails on the unguarded code as expected, and passes with the inode guard in place. Closes NVIDIA#1281

…#1656) ## Summary Closes #1281. `acquireOnboardLock` in `src/lib/onboard-session.ts` had a stale-cleanup path that read a stale lock, decided the holder was dead, and **unconditionally** called `fs.unlinkSync(LOCK_FILE)` before retrying `writeFileSync(wx)`. Two concurrent processes that both observe the same stale lock both try to clean it up — and the slower of the two can unlink the *fresh* lock the faster process just claimed. Both processes then end up holding 'their' lock and onboard runs in parallel against the same shared session state. Originally surfaced by CodeRabbit on #1272. ## Race walkthrough ``` A: writeFile(wx) → EEXIST (stale lock present) B: writeFile(wx) → EEXIST (same stale lock) A: read stale, isProcessAlive(stale.pid) → false A: unlinkSync(LOCK_FILE) ← deletes stale A: loop, writeFile(wx) → SUCCESS, A holds the lock B: read stale (still has its own copy from before) B: isProcessAlive(stale.pid) → false (B is reasoning about the OLD pid, not the fresh lock A just wrote) B: unlinkSync(LOCK_FILE) ← DELETES A's FRESH LOCK B: loop, writeFile(wx) → SUCCESS, B also "holds" the lock ``` After this, both A and B believe they own the lock and proceed with onboard. ## Fix Capture the stale file's **inode** via `fs.statSync({ bigint: true })` at the same time we read its contents. Then in a new `unlinkIfInodeMatches()` helper, re-stat right before `fs.unlinkSync` and bail if the inode has changed. The dual stat-then-unlink is the only portable POSIX primitive Node exposes for this — there is no atomic "unlink-if-inode" syscall — so a sufficiently unlucky race can still slip through. **However** the window is orders of magnitude smaller than the unconditional unlink it replaces, and the outer retry loop will detect a wrong unlink on its next `writeFileSync(wx)` attempt because either we re-create the file or we observe a new lock with a different inode and retry. Also bumps `MAX_ATTEMPTS` from 2 to 5 because the inode-verified cleanup can take a few more spins under contention before one cleaner wins. ## Behavior preserved - Malformed lock files are still left on disk (the existing "treats unreadable or transient lock contents as a retry" test still passes — I explicitly do NOT call `unlinkIfInodeMatches` on malformed locks, only on parseable-but-stale ones). - A holder PID that is still alive is still reported correctly to the caller. - `releaseOnboardLock()` semantics unchanged. ## Test plan - [x] Added regression test `regression #1281: stale-cleanup race does not unlink a fresh lock claimed by another process` that simulates the race deterministically by wrapping `fs.statSync` so the first stat (inside acquireOnboardLock) succeeds against the original stale inode, then atomically swaps the lock file (unlink + recreate) to give it a new inode before `unlinkIfInodeMatches` re-stats it. The test asserts the fresh claim survives the race and is the file on disk after `acquireOnboardLock` returns. - [x] **Negative case verified**: stashed the source fix, re-ran the test against the unguarded code. The new regression test correctly **fails** because the unconditional `unlinkSync` deletes the fresh claim and writes a new one with a different command string. The test's `expect(onDisk.command).toContain("fresh claim from concurrent process")` then fails with the actual content showing the wrong winner. - [x] **Positive case verified**: re-applied the fix, re-ran. All 14 lock-related tests pass. - [x] No regression on existing lock tests (`acquires and releases the onboard lock`, `replaces a stale onboard lock`, `treats unreadable or transient lock contents as a retry, not a stale lock`, `ignores malformed lock files when releasing the onboard lock`). - [x] One pre-existing Windows-only POSIX-permissions test (`creates and persists a session with restrictive permissions` checking `mode & 0o777 === 0o600`) fails on this branch with the same failure as `main` — this is the umask/POSIX-mode-on-Windows issue and is unrelated to the lock fix. ## Why an inode check rather than the issue's suggested `linkSync` approach The issue body suggests using `fs.linkSync(temp, LOCK_FILE)` which fails atomically with EEXIST if the target exists. That's a valid alternative, but: 1. The current `writeFileSync(LOCK_FILE, payload, { flag: "wx" })` already provides the same atomic-create-or-fail semantics — there's no need to introduce a tempfile + link dance for the *initial* claim. The race is specifically in the *cleanup* of stale locks, not the initial claim. 2. Inode-comparison preserves the existing happy-path code (single-syscall `writeFileSync(wx)`) and only adds the check on the rare stale-cleanup branch. 3. Tempfile + linkSync would add a second possible failure mode (tempfile leak under crash) without solving the cleanup race any more cleanly. Both approaches are correct; this one is the smaller surgical change.  ## Summary by CodeRabbit * **Tests** * Added a regression test that reproduces a stale-lock race and verifies correct behavior without cross-test side effects. * **Bug Fixes** * Increased retry attempts for lock acquisition to reduce transient failures. * More robust, atomic lock creation to avoid partial writes. * Safer stale-lock cleanup using on-disk verification before removal. * More reliable lock release that verifies ownership before removing lock files.  Signed-off-by: ColinM-sys <cmcdonough@50words.com> --------- Signed-off-by: ColinM-sys <cmcdonough@50words.com> Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-authored-by: Prekshi Vyas <34834085+prekshivyas@users.noreply.github.com> Co-authored-by: Prekshi Vyas <prekshiv@nvidia.com>

cv and others added 8 commits April 1, 2026 01:05

fix: make buildControlUiUrls token parameter optional

eec919d

The existing callsite in printDashboard calls buildControlUiUrls() with no arguments when no token is available. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge branch 'main' into cv/extract-onboard-pure-fns

eb16b9e

Merge branch 'cv/extract-onboard-pure-fns' of github.com:NVIDIA/NemoC…

d3e2c33

…law into cv/extract-onboard-pure-fns

cv changed the base branch from cv/extract-onboard-pure-fns to main April 1, 2026 20:33

wscurran added the status: triage label Apr 1, 2026

Merge branch 'main' into cv/migrate-onboard-session-ts

3e67453

wscurran added NemoClaw CLI refactor PR restructures code without intended behavior change and removed status: triage labels Apr 1, 2026

coderabbitai Bot reviewed Apr 1, 2026

View reviewed changes

Merge branch 'main' into cv/migrate-onboard-session-ts

5a818e9

cv enabled auto-merge (squash) April 1, 2026 22:20

cv and others added 2 commits April 1, 2026 15:23

merge: update branch with main

91c2f66

Merge branch 'main' into cv/migrate-onboard-session-ts

58ef479

ericksoa approved these changes Apr 2, 2026

View reviewed changes

cv merged commit 4f9e9ce into main Apr 2, 2026
4 checks passed

cv deleted the cv/migrate-onboard-session-ts branch April 2, 2026 00:53

cv mentioned this pull request Apr 2, 2026

refactor(cli): migrate remaining unblocked CJS modules to TypeScript #1298

Closed

ColinM-sys mentioned this pull request Apr 9, 2026

fix(cli): close stale-lock cleanup race in acquireOnboardLock (#1281) #1656

Merged

5 tasks

ericksoa mentioned this pull request May 1, 2026

[SECURITY] No per-task lifecycle tracking or graceful abort — stuck tasks require sandbox destruction #801

Closed

2 tasks

wscurran added area: cli Command line interface, flags, terminal UX, or output feature PR adds or expands user-visible functionality and removed NemoClaw CLI feature PR adds or expands user-visible functionality labels Jun 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(cli): migrate onboard-session.js to TypeScript#1272

refactor(cli): migrate onboard-session.js to TypeScript#1272
cv merged 12 commits into
mainfrom
cv/migrate-onboard-session-ts

cv commented Apr 1, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 1, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 1, 2026

Uh oh!

coderabbitai Bot Apr 1, 2026

Uh oh!

coderabbitai Bot Apr 1, 2026

Uh oh!

coderabbitai Bot Apr 1, 2026

Uh oh!

ericksoa left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

cv commented Apr 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

ericksoa left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

cv commented Apr 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 1, 2026 •

edited

Loading