refactor(cli): migrate onboard-session.js to TypeScript#1272
Conversation
…ript modules Move ~210 lines of pure, side-effect-free functions from the 3,800-line onboard.js into five typed TypeScript modules under src/lib/: - gateway-state.ts: gateway/sandbox state classification - validation.ts: failure classification, API key validation, model ID checks - url-utils.ts: URL normalization, text compaction, env formatting - build-context.ts: Docker build context filtering, recovery hints - dashboard.ts: dashboard URL resolution and construction Infrastructure: - tsconfig.src.json compiles src/ → dist/ as CJS (dist/ already gitignored) - tsconfig.cli.json updated to type-check src/ - npm run build:cli added to package.json - Pre-commit test-cli hook builds TS and includes dist/lib/ in coverage onboard.js imports from compiled dist/lib/ output. All 542 CLI tests pass. No user-facing behavior changes. Closes #1237. Relates to #924 (shell consolidation). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
56 new tests across three co-located test files: - src/lib/validation.test.ts: classifyValidationFailure, classifyApplyFailure, classifySandboxCreateFailure, validateNvidiaApiKeyValue, isSafeModelId - src/lib/dashboard.test.ts: resolveDashboardForwardTarget, buildControlUiUrls - src/lib/url-utils.test.ts: compactText, stripEndpointSuffix, normalizeProviderBaseUrl, isLoopbackHostname, formatEnvAssignment, parsePolicyPresetEnv vitest.config.ts updated to include src/**/*.test.ts in the CLI project. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The existing callsite in printDashboard calls buildControlUiUrls() with no arguments when no token is available. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The tsc-js pre-push hook (jsconfig.json) type-checks bin/lib/onboard.js which requires dist/lib/ to exist. CI was not running npm run build:cli before the prek checks, causing TS2307 "Cannot find module" errors. - Add npm run build:cli step to .github/actions/basic-checks - Update tsc-js hook to run build:cli before tsc Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…law into cv/extract-onboard-pure-fns
## Summary - Convert `bin/lib/preflight.js` (357 lines) to `src/lib/preflight.ts` with full type definitions - Typed interfaces for all opts objects and return types: `PortProbeResult`, `MemoryInfo`, `SwapResult`, `CheckPortOpts`, `GetMemoryInfoOpts`, `EnsureSwapOpts` - Extract `parseLsofLines` helper to reduce duplication in `checkPortAvailable` - Incorporate #1227 fix: `sudo` → `sudo -n` (non-interactive) for lsof retry - `bin/lib/preflight.js` becomes a thin re-export shim — existing consumers unaffected - Co-locate tests: `test/preflight.test.js` → `src/lib/preflight.test.ts` - Add real net probe tests (EADDRINUSE detection on occupied ports) - Fix all co-located test imports to use `dist/` paths for coverage attribution - Add targeted dashboard/validation branch tests to maintain ratchet Stacked on #1240. Not touched by any #924 blocker PR. ## Test plan - [x] 612 CLI tests pass (601 existing + 11 new) - [x] `tsc -p tsconfig.src.json` compiles cleanly - [x] `tsc -p tsconfig.cli.json` type-checks cleanly - [x] `tsc -p jsconfig.json` type-checks cleanly (the pre-push check that caught the union issue) - [x] Coverage ratchet passes Relates to #924 (shell consolidation). Supersedes #1227. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Convert bin/lib/onboard-session.js (440 lines) to src/lib/onboard-session.ts with typed interfaces for Session, StepState, SessionFailure, LockResult, SessionUpdates, and related types. Full migration — all functions including fs I/O (session persistence, file-based locking) and pure helpers (redaction, validation, normalization). Co-locates tests. Cache-clearing pattern updated for dist/ path. 615 CLI tests pass. Coverage ratchet passes. Relates to #924 (shell consolidation). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughReorganized session management code from a JavaScript implementation in Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard-session.test.ts (1)
19-35:⚠️ Potential issue | 🟡 MinorThe shared temp HOME makes these lock tests order-dependent.
These hooks reuse one
tmpDirfor the whole file, butreleaseOnboardLock()intentionally leaves malformed lock files behind. After the malformed-lock cases, later tests inherit that state because the hooks only clear the module cache and the session file. RecreatetmpDirper test, or wipe it in the hooks.One simple cleanup option
beforeEach(() => { + fs.rmSync(tmpDir, { recursive: true, force: true }); + fs.mkdirSync(tmpDir, { recursive: true, mode: 0o700 }); process.env.HOME = tmpDir; delete require.cache[shimPath]; delete require.cache[distPath]; session = require("../../dist/lib/onboard-session"); session.clearSession(); session.releaseOnboardLock(); }); afterEach(() => { delete require.cache[shimPath]; delete require.cache[distPath]; + fs.rmSync(tmpDir, { recursive: true, force: true }); if (originalHome === undefined) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard-session.test.ts` around lines 19 - 35, The tests reuse a single tmpDir across the file which causes order-dependent failures because session.releaseOnboardLock() leaves malformed lock files; update the beforeEach/afterEach hooks to create a fresh temporary directory for each test (or delete all files inside tmpDir) and set process.env.HOME to that fresh directory, ensuring you still delete require.cache for shimPath and distPath and call session.clearSession() and session.releaseOnboardLock() as before; target the beforeEach/afterEach functions and symbols tmpDir, process.env.HOME, session.clearSession(), and session.releaseOnboardLock() when applying the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard-session.js`:
- Line 7: Replace the shim export in bin/lib/onboard-session.js (currently
"module.exports = require('../../dist/lib/onboard-session')") with the actual
CommonJS implementation so the CLI can run without a build; locate the logic in
dist/lib/onboard-session (or the source TS/JS equivalent), convert/inline that
implementation into bin/lib/onboard-session.js as a plain CommonJS module
(exports or module.exports) and remove any dependency on dist/, ensuring
bin/nemoclaw.js can require('./lib/onboard-session') in a fresh checkout/npm
install.
In `@src/lib/onboard-session.ts`:
- Around line 204-251: normalizeSession is too complex and currently bypasses
the complexity rule; split its responsibilities into small helpers. Create
helper functions such as coerceSessionFields(data: Record<string, unknown>) ->
Partial<Session> to handle top-level field coercion and redaction (use
redactUrl, sanitizeFailure), normalizeMetadata(metadata: unknown) ->
SessionMetadata|undefined, and normalizeSteps(steps: unknown, currentSteps:
Session['steps']) -> Session['steps'] to validate and populate individual step
entries (use validateStep and redactSensitiveText). Replace the inline logic in
normalizeSession to call these helpers, remove the eslint-disable-next-line
complexity waiver, and keep behavior identical (set normalized.resumable and
normalized.status as before). Ensure the new helpers are exported or file-scoped
so tests/other code can be updated if needed.
- Around line 336-365: The current stale-lock cleanup unlinks LOCK_FILE directly
which allows a race where a concurrent process recreates the lock and the slower
process then deletes it; change to an atomic stale-claim approach: when you
detect a stale lock (parseLockFile(...) returned existing and
isProcessAlive(...) is false), write a temporary file (e.g.,
`${LOCK_FILE}.${process.pid}.tmp`) with your PID/startedAt/command and then
attempt to atomically create the real lock using fs.linkSync(tempPath,
LOCK_FILE) (this call will succeed only if LOCK_FILE does not exist); if
linkSync succeeds you have acquired the lock (remove tempPath and return
acquired=true), if it fails because LOCK_FILE now exists remove tempPath and
continue retrying; do not call fs.unlinkSync(LOCK_FILE) unconditionally—only
remove stale files after successfully claiming the lock or when you can safely
determine nobody else replaced it. Use parseLockFile, LOCK_FILE, isProcessAlive,
and fs.linkSync/fs.writeFileSync/fs.unlinkSync to implement this.
- Around line 227-229: The metadata assignment currently copies gatewayName
regardless of type, allowing non-string values to persist; update the
normalization in the metadata block (used by loadSession/createSession flows) to
only set metadata.gatewayName when isObject(d.metadata) is true and (d.metadata
as Record<string, unknown>).gatewayName is a string (typeof === 'string'),
otherwise leave metadata undefined or omit gatewayName; ensure you reference the
existing isObject check and the SessionMetadata type so only string gatewayName
values are propagated.
---
Outside diff comments:
In `@src/lib/onboard-session.test.ts`:
- Around line 19-35: The tests reuse a single tmpDir across the file which
causes order-dependent failures because session.releaseOnboardLock() leaves
malformed lock files; update the beforeEach/afterEach hooks to create a fresh
temporary directory for each test (or delete all files inside tmpDir) and set
process.env.HOME to that fresh directory, ensuring you still delete
require.cache for shimPath and distPath and call session.clearSession() and
session.releaseOnboardLock() as before; target the beforeEach/afterEach
functions and symbols tmpDir, process.env.HOME, session.clearSession(), and
session.releaseOnboardLock() when applying the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 356d71e7-e42a-411b-862f-93c6eab61708
📒 Files selected for processing (3)
bin/lib/onboard-session.jssrc/lib/onboard-session.test.tssrc/lib/onboard-session.ts
| summarizeForDebug, | ||
| updateSession, | ||
| }; | ||
| module.exports = require("../../dist/lib/onboard-session"); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Tracked onboard-session entrypoints:"
git ls-files -- bin/lib/onboard-session.js dist/lib/onboard-session.js src/lib/onboard-session.ts
echo
echo "Packaging/build hooks that mention bin/dist:"
rg -n -C2 '"(bin|files|prepare|prepack|postinstall|build:cli)"|dist/lib/onboard-session|bin/lib/onboard-session' \
package.json README.md .npmignore .gitignore 2>/dev/null || trueRepository: NVIDIA/NemoClaw
Length of output: 1400
Remove the shim and keep the implementation in bin/lib/onboard-session.js.
The shim at line 7 delegates to dist/lib/onboard-session.js, which is not tracked in git and not included in the npm package. This breaks the bin/ contract that requires working without a build step—bin/nemoclaw.js will fail to load this module in a clean checkout or fresh npm install.
Either restore the implementation directly in bin/lib/onboard-session.js (as CommonJS) or restructure to avoid the dist/ dependency in the CLI path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bin/lib/onboard-session.js` at line 7, Replace the shim export in
bin/lib/onboard-session.js (currently "module.exports =
require('../../dist/lib/onboard-session')") with the actual CommonJS
implementation so the CLI can run without a build; locate the logic in
dist/lib/onboard-session (or the source TS/JS equivalent), convert/inline that
implementation into bin/lib/onboard-session.js as a plain CommonJS module
(exports or module.exports) and remove any dependency on dist/, ensuring
bin/nemoclaw.js can require('./lib/onboard-session') in a fresh checkout/npm
install.
| // eslint-disable-next-line complexity | ||
| export function normalizeSession(data: unknown): Session | null { | ||
| if (!isObject(data) || (data as Record<string, unknown>).version !== SESSION_VERSION) return null; | ||
| const d = data as Record<string, unknown>; | ||
| const normalized = createSession({ | ||
| sessionId: typeof d.sessionId === "string" ? d.sessionId : undefined, | ||
| mode: typeof d.mode === "string" ? d.mode : undefined, | ||
| startedAt: typeof d.startedAt === "string" ? d.startedAt : undefined, | ||
| updatedAt: typeof d.updatedAt === "string" ? d.updatedAt : undefined, | ||
| sandboxName: typeof d.sandboxName === "string" ? d.sandboxName : null, | ||
| provider: typeof d.provider === "string" ? d.provider : null, | ||
| model: typeof d.model === "string" ? d.model : null, | ||
| endpointUrl: typeof d.endpointUrl === "string" ? redactUrl(d.endpointUrl) : null, | ||
| credentialEnv: typeof d.credentialEnv === "string" ? d.credentialEnv : null, | ||
| preferredInferenceApi: | ||
| typeof d.preferredInferenceApi === "string" ? d.preferredInferenceApi : null, | ||
| nimContainer: typeof d.nimContainer === "string" ? d.nimContainer : null, | ||
| policyPresets: Array.isArray(d.policyPresets) | ||
| ? (d.policyPresets as unknown[]).filter((value) => typeof value === "string") as string[] | ||
| : null, | ||
| lastStepStarted: typeof d.lastStepStarted === "string" ? d.lastStepStarted : null, | ||
| lastCompletedStep: typeof d.lastCompletedStep === "string" ? d.lastCompletedStep : null, | ||
| failure: sanitizeFailure(d.failure as Record<string, unknown> | null), | ||
| metadata: isObject(d.metadata) | ||
| ? ({ gatewayName: (d.metadata as Record<string, unknown>).gatewayName } as SessionMetadata) | ||
| : undefined, | ||
| } as Partial<Session>); | ||
| normalized.resumable = d.resumable !== false; | ||
| normalized.status = typeof d.status === "string" ? d.status : normalized.status; | ||
|
|
||
| if (isObject(d.steps)) { | ||
| for (const [name, step] of Object.entries(d.steps as Record<string, unknown>)) { | ||
| if ( | ||
| Object.prototype.hasOwnProperty.call(normalized.steps, name) && | ||
| validateStep(step) | ||
| ) { | ||
| const s = step as Record<string, unknown>; | ||
| normalized.steps[name] = { | ||
| status: s.status as string, | ||
| startedAt: typeof s.startedAt === "string" ? s.startedAt : null, | ||
| completedAt: typeof s.completedAt === "string" ? s.completedAt : null, | ||
| error: redactSensitiveText(s.error), | ||
| }; | ||
| } | ||
| } | ||
| } | ||
|
|
||
| return normalized; |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Split normalizeSession instead of waiving the complexity check.
This deserializer is already doing schema coercion, redaction, defaults, and step normalization in one place. Please extract smaller helpers rather than landing the new complexity waiver here.
As per coding guidelines, **/*.{js,ts,tsx}: Maintain cyclomatic complexity limit of 20 (target: ratchet down to 15) for all functions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/onboard-session.ts` around lines 204 - 251, normalizeSession is too
complex and currently bypasses the complexity rule; split its responsibilities
into small helpers. Create helper functions such as coerceSessionFields(data:
Record<string, unknown>) -> Partial<Session> to handle top-level field coercion
and redaction (use redactUrl, sanitizeFailure), normalizeMetadata(metadata:
unknown) -> SessionMetadata|undefined, and normalizeSteps(steps: unknown,
currentSteps: Session['steps']) -> Session['steps'] to validate and populate
individual step entries (use validateStep and redactSensitiveText). Replace the
inline logic in normalizeSession to call these helpers, remove the
eslint-disable-next-line complexity waiver, and keep behavior identical (set
normalized.resumable and normalized.status as before). Ensure the new helpers
are exported or file-scoped so tests/other code can be updated if needed.
| metadata: isObject(d.metadata) | ||
| ? ({ gatewayName: (d.metadata as Record<string, unknown>).gatewayName } as SessionMetadata) | ||
| : undefined, |
There was a problem hiding this comment.
Validate metadata.gatewayName before copying it.
A persisted payload like { "metadata": { "gatewayName": 123 } } currently survives normalization. Because createSession() copies truthy gatewayName values through, loadSession() can return a Session whose metadata.gatewayName is not a string.
Suggested fix
- metadata: isObject(d.metadata)
- ? ({ gatewayName: (d.metadata as Record<string, unknown>).gatewayName } as SessionMetadata)
- : undefined,
+ metadata:
+ isObject(d.metadata) &&
+ typeof (d.metadata as Record<string, unknown>).gatewayName === "string"
+ ? { gatewayName: (d.metadata as Record<string, unknown>).gatewayName as string }
+ : undefined,🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/onboard-session.ts` around lines 227 - 229, The metadata assignment
currently copies gatewayName regardless of type, allowing non-string values to
persist; update the normalization in the metadata block (used by
loadSession/createSession flows) to only set metadata.gatewayName when
isObject(d.metadata) is true and (d.metadata as Record<string,
unknown>).gatewayName is a string (typeof === 'string'), otherwise leave
metadata undefined or omit gatewayName; ensure you reference the existing
isObject check and the SessionMetadata type so only string gatewayName values
are propagated.
| let existing: LockInfo | null; | ||
| try { | ||
| existing = parseLockFile(fs.readFileSync(LOCK_FILE, "utf8")); | ||
| } catch (readError: unknown) { | ||
| if ((readError as NodeJS.ErrnoException)?.code === "ENOENT") { | ||
| continue; | ||
| } | ||
| throw readError; | ||
| } | ||
| if (!existing) { | ||
| continue; | ||
| } | ||
| if (existing && isProcessAlive(existing.pid)) { | ||
| return { | ||
| acquired: false, | ||
| lockFile: LOCK_FILE, | ||
| stale: false, | ||
| holderPid: existing.pid, | ||
| holderStartedAt: existing.startedAt, | ||
| holderCommand: existing.command, | ||
| }; | ||
| } | ||
|
|
||
| try { | ||
| fs.unlinkSync(LOCK_FILE); | ||
| } catch (unlinkError: unknown) { | ||
| if ((unlinkError as NodeJS.ErrnoException)?.code !== "ENOENT") { | ||
| throw unlinkError; | ||
| } | ||
| } |
There was a problem hiding this comment.
The stale-lock cleanup can let two processes acquire the lock.
If two contenders read the same stale file, the slower one can reach Line 360 after the faster one has already recreated LOCK_FILE, unlink that fresh lock, and then win on its next retry. That breaks the mutual-exclusion guarantee. Please use an atomic stale-lock claim strategy here, or switch this path to a proven lockfile implementation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/onboard-session.ts` around lines 336 - 365, The current stale-lock
cleanup unlinks LOCK_FILE directly which allows a race where a concurrent
process recreates the lock and the slower process then deletes it; change to an
atomic stale-claim approach: when you detect a stale lock (parseLockFile(...)
returned existing and isProcessAlive(...) is false), write a temporary file
(e.g., `${LOCK_FILE}.${process.pid}.tmp`) with your PID/startedAt/command and
then attempt to atomically create the real lock using fs.linkSync(tempPath,
LOCK_FILE) (this call will succeed only if LOCK_FILE does not exist); if
linkSync succeeds you have acquired the lock (remove tempPath and return
acquired=true), if it fails because LOCK_FILE now exists remove tempPath and
continue retrying; do not call fs.unlinkSync(LOCK_FILE) unconditionally—only
remove stale files after successfully claiming the lock or when you can safely
determine nobody else replaced it. Use parseLockFile, LOCK_FILE, isProcessAlive,
and fs.linkSync/fs.writeFileSync/fs.unlinkSync to implement this.
ericksoa
left a comment
There was a problem hiding this comment.
Faithful migration of the largest module in the series. Pre-existing issues tracked separately:
- #1281 — stale-lock race condition in acquireOnboardLock
- #1282 — normalizeSession cyclomatic complexity
- #1283 — metadata.gatewayName not type-validated
- #1284 — lock tests share tmpDir causing order dependence
5 behind main — needs a merge.
## Summary - Convert `bin/lib/onboard-session.js` (440 lines) to `src/lib/onboard-session.ts` - Typed interfaces: `Session`, `StepState`, `SessionFailure`, `LockResult`, `SessionUpdates`, `LockInfo`, `SessionMetadata` - Full migration including fs I/O (session persistence, file locking) and pure helpers (redaction, validation, normalization) - Co-locate tests (14 tests) with updated cache-clearing for dist/ path Stacked on #1240. 615 CLI tests pass. Coverage ratchet passes. Relates to #924 (shell consolidation). 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Refactor** * Reorganized onboarding session management module structure and improved distribution of the codebase. Session tracking, step-level state management, lock handling, and sensitive data redaction functionality have been consolidated and optimized for better maintainability. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary - Convert `bin/lib/onboard-session.js` (440 lines) to `src/lib/onboard-session.ts` - Typed interfaces: `Session`, `StepState`, `SessionFailure`, `LockResult`, `SessionUpdates`, `LockInfo`, `SessionMetadata` - Full migration including fs I/O (session persistence, file locking) and pure helpers (redaction, validation, normalization) - Co-locate tests (14 tests) with updated cache-clearing for dist/ path Stacked on NVIDIA#1240. 615 CLI tests pass. Coverage ratchet passes. Relates to NVIDIA#924 (shell consolidation). 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Refactor** * Reorganized onboarding session management module structure and improved distribution of the codebase. Session tracking, step-level state management, lock handling, and sensitive data redaction functionality have been consolidated and optimized for better maintainability. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The acquireOnboardLock stale-cleanup path read a stale lock, decided the holder was dead, and unconditionally unlinked LOCK_FILE before retrying writeFileSync(wx). Two concurrent processes that both observe the same stale lock will both try to clean it up — and the slower of the two can unlink the *fresh* lock the faster process just claimed, breaking mutual exclusion: both processes end up holding 'their' lock and onboard runs in parallel against the same shared session state. Reported as NVIDIA#1281, originally surfaced by CodeRabbit on NVIDIA#1272. Fix: capture the stale file's inode via fs.statSync({ bigint: true }) at the same time we read its contents, then in a new unlinkIfInodeMatches() helper, re-stat right before fs.unlinkSync and bail if the inode has changed. The dual stat-then-unlink is the only portable POSIX primitive Node exposes for this — there is no atomic "unlink-if-inode" syscall — so a sufficiently unlucky race can still slip through. The window is orders of magnitude smaller than the unconditional unlink it replaces, and the outer retry loop will detect a wrong unlink on its next writeFileSync(wx) attempt because either we re-create the file or we observe a new lock with a different inode. Also bumps MAX_ATTEMPTS from 2 to 5 because the inode-verified cleanup can take a few more spins under contention before one cleaner wins. Adds a regression test that simulates the race deterministically by wrapping fs.statSync so the first stat succeeds against the original stale inode, then atomically swaps the lock file (unlink + recreate) to give it a new inode before unlinkIfInodeMatches re-stats it. The test asserts the fresh claim survives the race and is the file on disk after acquireOnboardLock returns. Verified by stashing the source fix and re-running: the new test fails on the unguarded code as expected, and passes with the inode guard in place. Closes NVIDIA#1281
## Summary - Convert `bin/lib/onboard-session.js` (440 lines) to `src/lib/onboard-session.ts` - Typed interfaces: `Session`, `StepState`, `SessionFailure`, `LockResult`, `SessionUpdates`, `LockInfo`, `SessionMetadata` - Full migration including fs I/O (session persistence, file locking) and pure helpers (redaction, validation, normalization) - Co-locate tests (14 tests) with updated cache-clearing for dist/ path Stacked on NVIDIA#1240. 615 CLI tests pass. Coverage ratchet passes. Relates to NVIDIA#924 (shell consolidation). 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Refactor** * Reorganized onboarding session management module structure and improved distribution of the codebase. Session tracking, step-level state management, lock handling, and sensitive data redaction functionality have been consolidated and optimized for better maintainability. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The acquireOnboardLock stale-cleanup path read a stale lock, decided the holder was dead, and unconditionally unlinked LOCK_FILE before retrying writeFileSync(wx). Two concurrent processes that both observe the same stale lock will both try to clean it up — and the slower of the two can unlink the *fresh* lock the faster process just claimed, breaking mutual exclusion: both processes end up holding 'their' lock and onboard runs in parallel against the same shared session state. Reported as NVIDIA#1281, originally surfaced by CodeRabbit on NVIDIA#1272. Fix: capture the stale file's inode via fs.statSync({ bigint: true }) at the same time we read its contents, then in a new unlinkIfInodeMatches() helper, re-stat right before fs.unlinkSync and bail if the inode has changed. The dual stat-then-unlink is the only portable POSIX primitive Node exposes for this — there is no atomic "unlink-if-inode" syscall — so a sufficiently unlucky race can still slip through. The window is orders of magnitude smaller than the unconditional unlink it replaces, and the outer retry loop will detect a wrong unlink on its next writeFileSync(wx) attempt because either we re-create the file or we observe a new lock with a different inode. Also bumps MAX_ATTEMPTS from 2 to 5 because the inode-verified cleanup can take a few more spins under contention before one cleaner wins. Adds a regression test that simulates the race deterministically by wrapping fs.statSync so the first stat succeeds against the original stale inode, then atomically swaps the lock file (unlink + recreate) to give it a new inode before unlinkIfInodeMatches re-stats it. The test asserts the fresh claim survives the race and is the file on disk after acquireOnboardLock returns. Verified by stashing the source fix and re-running: the new test fails on the unguarded code as expected, and passes with the inode guard in place. Closes NVIDIA#1281
The acquireOnboardLock stale-cleanup path read a stale lock, decided the holder was dead, and unconditionally unlinked LOCK_FILE before retrying writeFileSync(wx). Two concurrent processes that both observe the same stale lock will both try to clean it up — and the slower of the two can unlink the *fresh* lock the faster process just claimed, breaking mutual exclusion: both processes end up holding 'their' lock and onboard runs in parallel against the same shared session state. Reported as NVIDIA#1281, originally surfaced by CodeRabbit on NVIDIA#1272. Fix: capture the stale file's inode via fs.statSync({ bigint: true }) at the same time we read its contents, then in a new unlinkIfInodeMatches() helper, re-stat right before fs.unlinkSync and bail if the inode has changed. The dual stat-then-unlink is the only portable POSIX primitive Node exposes for this — there is no atomic "unlink-if-inode" syscall — so a sufficiently unlucky race can still slip through. The window is orders of magnitude smaller than the unconditional unlink it replaces, and the outer retry loop will detect a wrong unlink on its next writeFileSync(wx) attempt because either we re-create the file or we observe a new lock with a different inode. Also bumps MAX_ATTEMPTS from 2 to 5 because the inode-verified cleanup can take a few more spins under contention before one cleaner wins. Adds a regression test that simulates the race deterministically by wrapping fs.statSync so the first stat succeeds against the original stale inode, then atomically swaps the lock file (unlink + recreate) to give it a new inode before unlinkIfInodeMatches re-stats it. The test asserts the fresh claim survives the race and is the file on disk after acquireOnboardLock returns. Verified by stashing the source fix and re-running: the new test fails on the unguarded code as expected, and passes with the inode guard in place. Closes NVIDIA#1281
The acquireOnboardLock stale-cleanup path read a stale lock, decided the holder was dead, and unconditionally unlinked LOCK_FILE before retrying writeFileSync(wx). Two concurrent processes that both observe the same stale lock will both try to clean it up — and the slower of the two can unlink the *fresh* lock the faster process just claimed, breaking mutual exclusion: both processes end up holding 'their' lock and onboard runs in parallel against the same shared session state. Reported as NVIDIA#1281, originally surfaced by CodeRabbit on NVIDIA#1272. Fix: capture the stale file's inode via fs.statSync({ bigint: true }) at the same time we read its contents, then in a new unlinkIfInodeMatches() helper, re-stat right before fs.unlinkSync and bail if the inode has changed. The dual stat-then-unlink is the only portable POSIX primitive Node exposes for this — there is no atomic "unlink-if-inode" syscall — so a sufficiently unlucky race can still slip through. The window is orders of magnitude smaller than the unconditional unlink it replaces, and the outer retry loop will detect a wrong unlink on its next writeFileSync(wx) attempt because either we re-create the file or we observe a new lock with a different inode. Also bumps MAX_ATTEMPTS from 2 to 5 because the inode-verified cleanup can take a few more spins under contention before one cleaner wins. Adds a regression test that simulates the race deterministically by wrapping fs.statSync so the first stat succeeds against the original stale inode, then atomically swaps the lock file (unlink + recreate) to give it a new inode before unlinkIfInodeMatches re-stats it. The test asserts the fresh claim survives the race and is the file on disk after acquireOnboardLock returns. Verified by stashing the source fix and re-running: the new test fails on the unguarded code as expected, and passes with the inode guard in place. Closes NVIDIA#1281
…#1656) ## Summary Closes #1281. `acquireOnboardLock` in `src/lib/onboard-session.ts` had a stale-cleanup path that read a stale lock, decided the holder was dead, and **unconditionally** called `fs.unlinkSync(LOCK_FILE)` before retrying `writeFileSync(wx)`. Two concurrent processes that both observe the same stale lock both try to clean it up — and the slower of the two can unlink the *fresh* lock the faster process just claimed. Both processes then end up holding 'their' lock and onboard runs in parallel against the same shared session state. Originally surfaced by CodeRabbit on #1272. ## Race walkthrough ``` A: writeFile(wx) → EEXIST (stale lock present) B: writeFile(wx) → EEXIST (same stale lock) A: read stale, isProcessAlive(stale.pid) → false A: unlinkSync(LOCK_FILE) ← deletes stale A: loop, writeFile(wx) → SUCCESS, A holds the lock B: read stale (still has its own copy from before) B: isProcessAlive(stale.pid) → false (B is reasoning about the OLD pid, not the fresh lock A just wrote) B: unlinkSync(LOCK_FILE) ← DELETES A's FRESH LOCK B: loop, writeFile(wx) → SUCCESS, B also "holds" the lock ``` After this, both A and B believe they own the lock and proceed with onboard. ## Fix Capture the stale file's **inode** via `fs.statSync({ bigint: true })` at the same time we read its contents. Then in a new `unlinkIfInodeMatches()` helper, re-stat right before `fs.unlinkSync` and bail if the inode has changed. The dual stat-then-unlink is the only portable POSIX primitive Node exposes for this — there is no atomic "unlink-if-inode" syscall — so a sufficiently unlucky race can still slip through. **However** the window is orders of magnitude smaller than the unconditional unlink it replaces, and the outer retry loop will detect a wrong unlink on its next `writeFileSync(wx)` attempt because either we re-create the file or we observe a new lock with a different inode and retry. Also bumps `MAX_ATTEMPTS` from 2 to 5 because the inode-verified cleanup can take a few more spins under contention before one cleaner wins. ## Behavior preserved - Malformed lock files are still left on disk (the existing "treats unreadable or transient lock contents as a retry" test still passes — I explicitly do NOT call `unlinkIfInodeMatches` on malformed locks, only on parseable-but-stale ones). - A holder PID that is still alive is still reported correctly to the caller. - `releaseOnboardLock()` semantics unchanged. ## Test plan - [x] Added regression test `regression #1281: stale-cleanup race does not unlink a fresh lock claimed by another process` that simulates the race deterministically by wrapping `fs.statSync` so the first stat (inside acquireOnboardLock) succeeds against the original stale inode, then atomically swaps the lock file (unlink + recreate) to give it a new inode before `unlinkIfInodeMatches` re-stats it. The test asserts the fresh claim survives the race and is the file on disk after `acquireOnboardLock` returns. - [x] **Negative case verified**: stashed the source fix, re-ran the test against the unguarded code. The new regression test correctly **fails** because the unconditional `unlinkSync` deletes the fresh claim and writes a new one with a different command string. The test's `expect(onDisk.command).toContain("fresh claim from concurrent process")` then fails with the actual content showing the wrong winner. - [x] **Positive case verified**: re-applied the fix, re-ran. All 14 lock-related tests pass. - [x] No regression on existing lock tests (`acquires and releases the onboard lock`, `replaces a stale onboard lock`, `treats unreadable or transient lock contents as a retry, not a stale lock`, `ignores malformed lock files when releasing the onboard lock`). - [x] One pre-existing Windows-only POSIX-permissions test (`creates and persists a session with restrictive permissions` checking `mode & 0o777 === 0o600`) fails on this branch with the same failure as `main` — this is the umask/POSIX-mode-on-Windows issue and is unrelated to the lock fix. ## Why an inode check rather than the issue's suggested `linkSync` approach The issue body suggests using `fs.linkSync(temp, LOCK_FILE)` which fails atomically with EEXIST if the target exists. That's a valid alternative, but: 1. The current `writeFileSync(LOCK_FILE, payload, { flag: "wx" })` already provides the same atomic-create-or-fail semantics — there's no need to introduce a tempfile + link dance for the *initial* claim. The race is specifically in the *cleanup* of stale locks, not the initial claim. 2. Inode-comparison preserves the existing happy-path code (single-syscall `writeFileSync(wx)`) and only adds the check on the rare stale-cleanup branch. 3. Tempfile + linkSync would add a second possible failure mode (tempfile leak under crash) without solving the cleanup race any more cleanly. Both approaches are correct; this one is the smaller surgical change. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Added a regression test that reproduces a stale-lock race and verifies correct behavior without cross-test side effects. * **Bug Fixes** * Increased retry attempts for lock acquisition to reduce transient failures. * More robust, atomic lock creation to avoid partial writes. * Safer stale-lock cleanup using on-disk verification before removal. * More reliable lock release that verifies ownership before removing lock files. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: ColinM-sys <cmcdonough@50words.com> --------- Signed-off-by: ColinM-sys <cmcdonough@50words.com> Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-authored-by: Prekshi Vyas <34834085+prekshivyas@users.noreply.github.com> Co-authored-by: Prekshi Vyas <prekshiv@nvidia.com>
Summary
bin/lib/onboard-session.js(440 lines) tosrc/lib/onboard-session.tsSession,StepState,SessionFailure,LockResult,SessionUpdates,LockInfo,SessionMetadataStacked on #1240. 615 CLI tests pass. Coverage ratchet passes.
Relates to #924 (shell consolidation).
🤖 Generated with Claude Code
Summary by CodeRabbit