Skip to content

refactor(cli): migrate onboard-session.js to TypeScript#1272

Merged
cv merged 12 commits into
mainfrom
cv/migrate-onboard-session-ts
Apr 2, 2026
Merged

refactor(cli): migrate onboard-session.js to TypeScript#1272
cv merged 12 commits into
mainfrom
cv/migrate-onboard-session-ts

Conversation

@cv

@cv cv commented Apr 1, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Convert bin/lib/onboard-session.js (440 lines) to src/lib/onboard-session.ts
  • Typed interfaces: Session, StepState, SessionFailure, LockResult, SessionUpdates, LockInfo, SessionMetadata
  • Full migration including fs I/O (session persistence, file locking) and pure helpers (redaction, validation, normalization)
  • Co-locate tests (14 tests) with updated cache-clearing for dist/ path

Stacked on #1240. 615 CLI tests pass. Coverage ratchet passes.

Relates to #924 (shell consolidation).

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Refactor
    • Reorganized onboarding session management module structure and improved distribution of the codebase. Session tracking, step-level state management, lock handling, and sensitive data redaction functionality have been consolidated and optimized for better maintainability.

cv and others added 8 commits April 1, 2026 01:05
…ript modules

Move ~210 lines of pure, side-effect-free functions from the 3,800-line
onboard.js into five typed TypeScript modules under src/lib/:

- gateway-state.ts: gateway/sandbox state classification
- validation.ts: failure classification, API key validation, model ID checks
- url-utils.ts: URL normalization, text compaction, env formatting
- build-context.ts: Docker build context filtering, recovery hints
- dashboard.ts: dashboard URL resolution and construction

Infrastructure:
- tsconfig.src.json compiles src/ → dist/ as CJS (dist/ already gitignored)
- tsconfig.cli.json updated to type-check src/
- npm run build:cli added to package.json
- Pre-commit test-cli hook builds TS and includes dist/lib/ in coverage

onboard.js imports from compiled dist/lib/ output. All 542 CLI tests pass.
No user-facing behavior changes.

Closes #1237. Relates to #924 (shell consolidation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
56 new tests across three co-located test files:

- src/lib/validation.test.ts: classifyValidationFailure, classifyApplyFailure,
  classifySandboxCreateFailure, validateNvidiaApiKeyValue, isSafeModelId
- src/lib/dashboard.test.ts: resolveDashboardForwardTarget, buildControlUiUrls
- src/lib/url-utils.test.ts: compactText, stripEndpointSuffix,
  normalizeProviderBaseUrl, isLoopbackHostname, formatEnvAssignment,
  parsePolicyPresetEnv

vitest.config.ts updated to include src/**/*.test.ts in the CLI project.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The existing callsite in printDashboard calls buildControlUiUrls()
with no arguments when no token is available.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The tsc-js pre-push hook (jsconfig.json) type-checks bin/lib/onboard.js
which requires dist/lib/ to exist. CI was not running npm run build:cli
before the prek checks, causing TS2307 "Cannot find module" errors.

- Add npm run build:cli step to .github/actions/basic-checks
- Update tsc-js hook to run build:cli before tsc

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary

- Convert `bin/lib/preflight.js` (357 lines) to `src/lib/preflight.ts`
with full type definitions
- Typed interfaces for all opts objects and return types:
`PortProbeResult`, `MemoryInfo`, `SwapResult`, `CheckPortOpts`,
`GetMemoryInfoOpts`, `EnsureSwapOpts`
- Extract `parseLsofLines` helper to reduce duplication in
`checkPortAvailable`
- Incorporate #1227 fix: `sudo` → `sudo -n` (non-interactive) for lsof
retry
- `bin/lib/preflight.js` becomes a thin re-export shim — existing
consumers unaffected
- Co-locate tests: `test/preflight.test.js` →
`src/lib/preflight.test.ts`
- Add real net probe tests (EADDRINUSE detection on occupied ports)
- Fix all co-located test imports to use `dist/` paths for coverage
attribution
- Add targeted dashboard/validation branch tests to maintain ratchet

Stacked on #1240. Not touched by any #924 blocker PR.

## Test plan

- [x] 612 CLI tests pass (601 existing + 11 new)
- [x] `tsc -p tsconfig.src.json` compiles cleanly
- [x] `tsc -p tsconfig.cli.json` type-checks cleanly
- [x] `tsc -p jsconfig.json` type-checks cleanly (the pre-push check
that caught the union issue)
- [x] Coverage ratchet passes

Relates to #924 (shell consolidation). Supersedes #1227.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Convert bin/lib/onboard-session.js (440 lines) to src/lib/onboard-session.ts
with typed interfaces for Session, StepState, SessionFailure, LockResult,
SessionUpdates, and related types.

Full migration — all functions including fs I/O (session persistence,
file-based locking) and pure helpers (redaction, validation, normalization).

Co-locates tests. Cache-clearing pattern updated for dist/ path.

615 CLI tests pass. Coverage ratchet passes.

Relates to #924 (shell consolidation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Apr 1, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

Reorganized session management code from a JavaScript implementation in bin/lib/ to a new TypeScript implementation in src/lib/, with the original bin/ file converted to a thin re-export shim of the compiled dist/ module. Updated tests to handle module caching for both entrypoints.

Changes

Cohort / File(s) Summary
Shim Re-export
bin/lib/onboard-session.js
Replaced entire implementation with single re-export from compiled dist module; all previously defined constants, functions, and helpers now sourced from build output.
Test Module Loading
src/lib/onboard-session.test.ts
Updated module cache resolution to handle both shim and dist paths; switched test to require compiled module from dist while maintaining cache cleanup for both entrypoints.
Session Management Implementation
src/lib/onboard-session.ts
New TypeScript module providing file-backed session persistence at ~/.nemoclaw/onboard-session.json, including session lifecycle (create/load/save/clear), file-based locking with process liveness detection, step-level state tracking, atomic updates, and sensitive data redaction utilities.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 Hoppy refactoring times,
From JavaScript to TypeScript rhymes,
Sessions saved with lock and key,
Sensitive data we shall be,
Built from source to dist they go—
Onboarding flows now steal the show! 🔐✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'refactor(cli): migrate onboard-session.js to TypeScript' clearly and specifically describes the primary change—converting a JavaScript module to TypeScript.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch cv/migrate-onboard-session-ts

Comment @coderabbitai help to get the list of available commands and usage tips.

@cv cv changed the base branch from cv/extract-onboard-pure-fns to main April 1, 2026 20:33
@wscurran wscurran added NemoClaw CLI refactor PR restructures code without intended behavior change and removed status: triage labels Apr 1, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/onboard-session.test.ts (1)

19-35: ⚠️ Potential issue | 🟡 Minor

The shared temp HOME makes these lock tests order-dependent.

These hooks reuse one tmpDir for the whole file, but releaseOnboardLock() intentionally leaves malformed lock files behind. After the malformed-lock cases, later tests inherit that state because the hooks only clear the module cache and the session file. Recreate tmpDir per test, or wipe it in the hooks.

One simple cleanup option
 beforeEach(() => {
+  fs.rmSync(tmpDir, { recursive: true, force: true });
+  fs.mkdirSync(tmpDir, { recursive: true, mode: 0o700 });
   process.env.HOME = tmpDir;
   delete require.cache[shimPath];
   delete require.cache[distPath];
   session = require("../../dist/lib/onboard-session");
   session.clearSession();
   session.releaseOnboardLock();
 });
 
 afterEach(() => {
   delete require.cache[shimPath];
   delete require.cache[distPath];
+  fs.rmSync(tmpDir, { recursive: true, force: true });
   if (originalHome === undefined) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard-session.test.ts` around lines 19 - 35, The tests reuse a
single tmpDir across the file which causes order-dependent failures because
session.releaseOnboardLock() leaves malformed lock files; update the
beforeEach/afterEach hooks to create a fresh temporary directory for each test
(or delete all files inside tmpDir) and set process.env.HOME to that fresh
directory, ensuring you still delete require.cache for shimPath and distPath and
call session.clearSession() and session.releaseOnboardLock() as before; target
the beforeEach/afterEach functions and symbols tmpDir, process.env.HOME,
session.clearSession(), and session.releaseOnboardLock() when applying the
change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard-session.js`:
- Line 7: Replace the shim export in bin/lib/onboard-session.js (currently
"module.exports = require('../../dist/lib/onboard-session')") with the actual
CommonJS implementation so the CLI can run without a build; locate the logic in
dist/lib/onboard-session (or the source TS/JS equivalent), convert/inline that
implementation into bin/lib/onboard-session.js as a plain CommonJS module
(exports or module.exports) and remove any dependency on dist/, ensuring
bin/nemoclaw.js can require('./lib/onboard-session') in a fresh checkout/npm
install.

In `@src/lib/onboard-session.ts`:
- Around line 204-251: normalizeSession is too complex and currently bypasses
the complexity rule; split its responsibilities into small helpers. Create
helper functions such as coerceSessionFields(data: Record<string, unknown>) ->
Partial<Session> to handle top-level field coercion and redaction (use
redactUrl, sanitizeFailure), normalizeMetadata(metadata: unknown) ->
SessionMetadata|undefined, and normalizeSteps(steps: unknown, currentSteps:
Session['steps']) -> Session['steps'] to validate and populate individual step
entries (use validateStep and redactSensitiveText). Replace the inline logic in
normalizeSession to call these helpers, remove the eslint-disable-next-line
complexity waiver, and keep behavior identical (set normalized.resumable and
normalized.status as before). Ensure the new helpers are exported or file-scoped
so tests/other code can be updated if needed.
- Around line 336-365: The current stale-lock cleanup unlinks LOCK_FILE directly
which allows a race where a concurrent process recreates the lock and the slower
process then deletes it; change to an atomic stale-claim approach: when you
detect a stale lock (parseLockFile(...) returned existing and
isProcessAlive(...) is false), write a temporary file (e.g.,
`${LOCK_FILE}.${process.pid}.tmp`) with your PID/startedAt/command and then
attempt to atomically create the real lock using fs.linkSync(tempPath,
LOCK_FILE) (this call will succeed only if LOCK_FILE does not exist); if
linkSync succeeds you have acquired the lock (remove tempPath and return
acquired=true), if it fails because LOCK_FILE now exists remove tempPath and
continue retrying; do not call fs.unlinkSync(LOCK_FILE) unconditionally—only
remove stale files after successfully claiming the lock or when you can safely
determine nobody else replaced it. Use parseLockFile, LOCK_FILE, isProcessAlive,
and fs.linkSync/fs.writeFileSync/fs.unlinkSync to implement this.
- Around line 227-229: The metadata assignment currently copies gatewayName
regardless of type, allowing non-string values to persist; update the
normalization in the metadata block (used by loadSession/createSession flows) to
only set metadata.gatewayName when isObject(d.metadata) is true and (d.metadata
as Record<string, unknown>).gatewayName is a string (typeof === 'string'),
otherwise leave metadata undefined or omit gatewayName; ensure you reference the
existing isObject check and the SessionMetadata type so only string gatewayName
values are propagated.

---

Outside diff comments:
In `@src/lib/onboard-session.test.ts`:
- Around line 19-35: The tests reuse a single tmpDir across the file which
causes order-dependent failures because session.releaseOnboardLock() leaves
malformed lock files; update the beforeEach/afterEach hooks to create a fresh
temporary directory for each test (or delete all files inside tmpDir) and set
process.env.HOME to that fresh directory, ensuring you still delete
require.cache for shimPath and distPath and call session.clearSession() and
session.releaseOnboardLock() as before; target the beforeEach/afterEach
functions and symbols tmpDir, process.env.HOME, session.clearSession(), and
session.releaseOnboardLock() when applying the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 356d71e7-e42a-411b-862f-93c6eab61708

📥 Commits

Reviewing files that changed from the base of the PR and between 2804b74 and 3e67453.

📒 Files selected for processing (3)
  • bin/lib/onboard-session.js
  • src/lib/onboard-session.test.ts
  • src/lib/onboard-session.ts

summarizeForDebug,
updateSession,
};
module.exports = require("../../dist/lib/onboard-session");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Tracked onboard-session entrypoints:"
git ls-files -- bin/lib/onboard-session.js dist/lib/onboard-session.js src/lib/onboard-session.ts

echo
echo "Packaging/build hooks that mention bin/dist:"
rg -n -C2 '"(bin|files|prepare|prepack|postinstall|build:cli)"|dist/lib/onboard-session|bin/lib/onboard-session' \
  package.json README.md .npmignore .gitignore 2>/dev/null || true

Repository: NVIDIA/NemoClaw

Length of output: 1400


Remove the shim and keep the implementation in bin/lib/onboard-session.js.

The shim at line 7 delegates to dist/lib/onboard-session.js, which is not tracked in git and not included in the npm package. This breaks the bin/ contract that requires working without a build step—bin/nemoclaw.js will fail to load this module in a clean checkout or fresh npm install.

Either restore the implementation directly in bin/lib/onboard-session.js (as CommonJS) or restructure to avoid the dist/ dependency in the CLI path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard-session.js` at line 7, Replace the shim export in
bin/lib/onboard-session.js (currently "module.exports =
require('../../dist/lib/onboard-session')") with the actual CommonJS
implementation so the CLI can run without a build; locate the logic in
dist/lib/onboard-session (or the source TS/JS equivalent), convert/inline that
implementation into bin/lib/onboard-session.js as a plain CommonJS module
(exports or module.exports) and remove any dependency on dist/, ensuring
bin/nemoclaw.js can require('./lib/onboard-session') in a fresh checkout/npm
install.

Comment on lines +204 to +251
// eslint-disable-next-line complexity
export function normalizeSession(data: unknown): Session | null {
if (!isObject(data) || (data as Record<string, unknown>).version !== SESSION_VERSION) return null;
const d = data as Record<string, unknown>;
const normalized = createSession({
sessionId: typeof d.sessionId === "string" ? d.sessionId : undefined,
mode: typeof d.mode === "string" ? d.mode : undefined,
startedAt: typeof d.startedAt === "string" ? d.startedAt : undefined,
updatedAt: typeof d.updatedAt === "string" ? d.updatedAt : undefined,
sandboxName: typeof d.sandboxName === "string" ? d.sandboxName : null,
provider: typeof d.provider === "string" ? d.provider : null,
model: typeof d.model === "string" ? d.model : null,
endpointUrl: typeof d.endpointUrl === "string" ? redactUrl(d.endpointUrl) : null,
credentialEnv: typeof d.credentialEnv === "string" ? d.credentialEnv : null,
preferredInferenceApi:
typeof d.preferredInferenceApi === "string" ? d.preferredInferenceApi : null,
nimContainer: typeof d.nimContainer === "string" ? d.nimContainer : null,
policyPresets: Array.isArray(d.policyPresets)
? (d.policyPresets as unknown[]).filter((value) => typeof value === "string") as string[]
: null,
lastStepStarted: typeof d.lastStepStarted === "string" ? d.lastStepStarted : null,
lastCompletedStep: typeof d.lastCompletedStep === "string" ? d.lastCompletedStep : null,
failure: sanitizeFailure(d.failure as Record<string, unknown> | null),
metadata: isObject(d.metadata)
? ({ gatewayName: (d.metadata as Record<string, unknown>).gatewayName } as SessionMetadata)
: undefined,
} as Partial<Session>);
normalized.resumable = d.resumable !== false;
normalized.status = typeof d.status === "string" ? d.status : normalized.status;

if (isObject(d.steps)) {
for (const [name, step] of Object.entries(d.steps as Record<string, unknown>)) {
if (
Object.prototype.hasOwnProperty.call(normalized.steps, name) &&
validateStep(step)
) {
const s = step as Record<string, unknown>;
normalized.steps[name] = {
status: s.status as string,
startedAt: typeof s.startedAt === "string" ? s.startedAt : null,
completedAt: typeof s.completedAt === "string" ? s.completedAt : null,
error: redactSensitiveText(s.error),
};
}
}
}

return normalized;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Split normalizeSession instead of waiving the complexity check.

This deserializer is already doing schema coercion, redaction, defaults, and step normalization in one place. Please extract smaller helpers rather than landing the new complexity waiver here.

As per coding guidelines, **/*.{js,ts,tsx}: Maintain cyclomatic complexity limit of 20 (target: ratchet down to 15) for all functions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard-session.ts` around lines 204 - 251, normalizeSession is too
complex and currently bypasses the complexity rule; split its responsibilities
into small helpers. Create helper functions such as coerceSessionFields(data:
Record<string, unknown>) -> Partial<Session> to handle top-level field coercion
and redaction (use redactUrl, sanitizeFailure), normalizeMetadata(metadata:
unknown) -> SessionMetadata|undefined, and normalizeSteps(steps: unknown,
currentSteps: Session['steps']) -> Session['steps'] to validate and populate
individual step entries (use validateStep and redactSensitiveText). Replace the
inline logic in normalizeSession to call these helpers, remove the
eslint-disable-next-line complexity waiver, and keep behavior identical (set
normalized.resumable and normalized.status as before). Ensure the new helpers
are exported or file-scoped so tests/other code can be updated if needed.

Comment on lines +227 to +229
metadata: isObject(d.metadata)
? ({ gatewayName: (d.metadata as Record<string, unknown>).gatewayName } as SessionMetadata)
: undefined,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Validate metadata.gatewayName before copying it.

A persisted payload like { "metadata": { "gatewayName": 123 } } currently survives normalization. Because createSession() copies truthy gatewayName values through, loadSession() can return a Session whose metadata.gatewayName is not a string.

Suggested fix
-    metadata: isObject(d.metadata)
-      ? ({ gatewayName: (d.metadata as Record<string, unknown>).gatewayName } as SessionMetadata)
-      : undefined,
+    metadata:
+      isObject(d.metadata) &&
+      typeof (d.metadata as Record<string, unknown>).gatewayName === "string"
+        ? { gatewayName: (d.metadata as Record<string, unknown>).gatewayName as string }
+        : undefined,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard-session.ts` around lines 227 - 229, The metadata assignment
currently copies gatewayName regardless of type, allowing non-string values to
persist; update the normalization in the metadata block (used by
loadSession/createSession flows) to only set metadata.gatewayName when
isObject(d.metadata) is true and (d.metadata as Record<string,
unknown>).gatewayName is a string (typeof === 'string'), otherwise leave
metadata undefined or omit gatewayName; ensure you reference the existing
isObject check and the SessionMetadata type so only string gatewayName values
are propagated.

Comment on lines +336 to +365
let existing: LockInfo | null;
try {
existing = parseLockFile(fs.readFileSync(LOCK_FILE, "utf8"));
} catch (readError: unknown) {
if ((readError as NodeJS.ErrnoException)?.code === "ENOENT") {
continue;
}
throw readError;
}
if (!existing) {
continue;
}
if (existing && isProcessAlive(existing.pid)) {
return {
acquired: false,
lockFile: LOCK_FILE,
stale: false,
holderPid: existing.pid,
holderStartedAt: existing.startedAt,
holderCommand: existing.command,
};
}

try {
fs.unlinkSync(LOCK_FILE);
} catch (unlinkError: unknown) {
if ((unlinkError as NodeJS.ErrnoException)?.code !== "ENOENT") {
throw unlinkError;
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

The stale-lock cleanup can let two processes acquire the lock.

If two contenders read the same stale file, the slower one can reach Line 360 after the faster one has already recreated LOCK_FILE, unlink that fresh lock, and then win on its next retry. That breaks the mutual-exclusion guarantee. Please use an atomic stale-lock claim strategy here, or switch this path to a proven lockfile implementation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard-session.ts` around lines 336 - 365, The current stale-lock
cleanup unlinks LOCK_FILE directly which allows a race where a concurrent
process recreates the lock and the slower process then deletes it; change to an
atomic stale-claim approach: when you detect a stale lock (parseLockFile(...)
returned existing and isProcessAlive(...) is false), write a temporary file
(e.g., `${LOCK_FILE}.${process.pid}.tmp`) with your PID/startedAt/command and
then attempt to atomically create the real lock using fs.linkSync(tempPath,
LOCK_FILE) (this call will succeed only if LOCK_FILE does not exist); if
linkSync succeeds you have acquired the lock (remove tempPath and return
acquired=true), if it fails because LOCK_FILE now exists remove tempPath and
continue retrying; do not call fs.unlinkSync(LOCK_FILE) unconditionally—only
remove stale files after successfully claiming the lock or when you can safely
determine nobody else replaced it. Use parseLockFile, LOCK_FILE, isProcessAlive,
and fs.linkSync/fs.writeFileSync/fs.unlinkSync to implement this.

@cv cv enabled auto-merge (squash) April 1, 2026 22:20

@ericksoa ericksoa left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Faithful migration of the largest module in the series. Pre-existing issues tracked separately:

  • #1281 — stale-lock race condition in acquireOnboardLock
  • #1282 — normalizeSession cyclomatic complexity
  • #1283 — metadata.gatewayName not type-validated
  • #1284 — lock tests share tmpDir causing order dependence

5 behind main — needs a merge.

@cv cv merged commit 4f9e9ce into main Apr 2, 2026
4 checks passed
@cv cv deleted the cv/migrate-onboard-session-ts branch April 2, 2026 00:53
laitingsheng pushed a commit that referenced this pull request Apr 2, 2026
## Summary
- Convert `bin/lib/onboard-session.js` (440 lines) to
`src/lib/onboard-session.ts`
- Typed interfaces: `Session`, `StepState`, `SessionFailure`,
`LockResult`, `SessionUpdates`, `LockInfo`, `SessionMetadata`
- Full migration including fs I/O (session persistence, file locking)
and pure helpers (redaction, validation, normalization)
- Co-locate tests (14 tests) with updated cache-clearing for dist/ path

Stacked on #1240. 615 CLI tests pass. Coverage ratchet passes.

Relates to #924 (shell consolidation).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
* Reorganized onboarding session management module structure and
improved distribution of the codebase. Session tracking, step-level
state management, lock handling, and sensitive data redaction
functionality have been consolidated and optimized for better
maintainability.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
lakamsani pushed a commit to lakamsani/NemoClaw that referenced this pull request Apr 4, 2026
## Summary
- Convert `bin/lib/onboard-session.js` (440 lines) to
`src/lib/onboard-session.ts`
- Typed interfaces: `Session`, `StepState`, `SessionFailure`,
`LockResult`, `SessionUpdates`, `LockInfo`, `SessionMetadata`
- Full migration including fs I/O (session persistence, file locking)
and pure helpers (redaction, validation, normalization)
- Co-locate tests (14 tests) with updated cache-clearing for dist/ path

Stacked on NVIDIA#1240. 615 CLI tests pass. Coverage ratchet passes.

Relates to NVIDIA#924 (shell consolidation).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
* Reorganized onboarding session management module structure and
improved distribution of the codebase. Session tracking, step-level
state management, lock handling, and sensitive data redaction
functionality have been consolidated and optimized for better
maintainability.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ColinM-sys added a commit to ColinM-sys/NemoClaw that referenced this pull request Apr 9, 2026
The acquireOnboardLock stale-cleanup path read a stale lock, decided
the holder was dead, and unconditionally unlinked LOCK_FILE before
retrying writeFileSync(wx). Two concurrent processes that both observe
the same stale lock will both try to clean it up — and the slower of
the two can unlink the *fresh* lock the faster process just claimed,
breaking mutual exclusion: both processes end up holding 'their' lock
and onboard runs in parallel against the same shared session state.

Reported as NVIDIA#1281, originally surfaced by CodeRabbit on NVIDIA#1272.

Fix: capture the stale file's inode via fs.statSync({ bigint: true })
at the same time we read its contents, then in a new
unlinkIfInodeMatches() helper, re-stat right before fs.unlinkSync and
bail if the inode has changed. The dual stat-then-unlink is the only
portable POSIX primitive Node exposes for this — there is no atomic
"unlink-if-inode" syscall — so a sufficiently unlucky race can still
slip through. The window is orders of magnitude smaller than the
unconditional unlink it replaces, and the outer retry loop will detect
a wrong unlink on its next writeFileSync(wx) attempt because either
we re-create the file or we observe a new lock with a different inode.

Also bumps MAX_ATTEMPTS from 2 to 5 because the inode-verified cleanup
can take a few more spins under contention before one cleaner wins.

Adds a regression test that simulates the race deterministically by
wrapping fs.statSync so the first stat succeeds against the original
stale inode, then atomically swaps the lock file (unlink + recreate)
to give it a new inode before unlinkIfInodeMatches re-stats it. The
test asserts the fresh claim survives the race and is the file on
disk after acquireOnboardLock returns.

Verified by stashing the source fix and re-running: the new test
fails on the unguarded code as expected, and passes with the inode
guard in place.

Closes NVIDIA#1281
gemini2026 pushed a commit to gemini2026/NemoClaw that referenced this pull request Apr 14, 2026
## Summary
- Convert `bin/lib/onboard-session.js` (440 lines) to
`src/lib/onboard-session.ts`
- Typed interfaces: `Session`, `StepState`, `SessionFailure`,
`LockResult`, `SessionUpdates`, `LockInfo`, `SessionMetadata`
- Full migration including fs I/O (session persistence, file locking)
and pure helpers (redaction, validation, normalization)
- Co-locate tests (14 tests) with updated cache-clearing for dist/ path

Stacked on NVIDIA#1240. 615 CLI tests pass. Coverage ratchet passes.

Relates to NVIDIA#924 (shell consolidation).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
* Reorganized onboarding session management module structure and
improved distribution of the codebase. Session tracking, step-level
state management, lock handling, and sensitive data redaction
functionality have been consolidated and optimized for better
maintainability.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ColinM-sys added a commit to ColinM-sys/NemoClaw that referenced this pull request Apr 15, 2026
The acquireOnboardLock stale-cleanup path read a stale lock, decided
the holder was dead, and unconditionally unlinked LOCK_FILE before
retrying writeFileSync(wx). Two concurrent processes that both observe
the same stale lock will both try to clean it up — and the slower of
the two can unlink the *fresh* lock the faster process just claimed,
breaking mutual exclusion: both processes end up holding 'their' lock
and onboard runs in parallel against the same shared session state.

Reported as NVIDIA#1281, originally surfaced by CodeRabbit on NVIDIA#1272.

Fix: capture the stale file's inode via fs.statSync({ bigint: true })
at the same time we read its contents, then in a new
unlinkIfInodeMatches() helper, re-stat right before fs.unlinkSync and
bail if the inode has changed. The dual stat-then-unlink is the only
portable POSIX primitive Node exposes for this — there is no atomic
"unlink-if-inode" syscall — so a sufficiently unlucky race can still
slip through. The window is orders of magnitude smaller than the
unconditional unlink it replaces, and the outer retry loop will detect
a wrong unlink on its next writeFileSync(wx) attempt because either
we re-create the file or we observe a new lock with a different inode.

Also bumps MAX_ATTEMPTS from 2 to 5 because the inode-verified cleanup
can take a few more spins under contention before one cleaner wins.

Adds a regression test that simulates the race deterministically by
wrapping fs.statSync so the first stat succeeds against the original
stale inode, then atomically swaps the lock file (unlink + recreate)
to give it a new inode before unlinkIfInodeMatches re-stats it. The
test asserts the fresh claim survives the race and is the file on
disk after acquireOnboardLock returns.

Verified by stashing the source fix and re-running: the new test
fails on the unguarded code as expected, and passes with the inode
guard in place.

Closes NVIDIA#1281
ColinM-sys added a commit to ColinM-sys/NemoClaw that referenced this pull request Apr 15, 2026
The acquireOnboardLock stale-cleanup path read a stale lock, decided
the holder was dead, and unconditionally unlinked LOCK_FILE before
retrying writeFileSync(wx). Two concurrent processes that both observe
the same stale lock will both try to clean it up — and the slower of
the two can unlink the *fresh* lock the faster process just claimed,
breaking mutual exclusion: both processes end up holding 'their' lock
and onboard runs in parallel against the same shared session state.

Reported as NVIDIA#1281, originally surfaced by CodeRabbit on NVIDIA#1272.

Fix: capture the stale file's inode via fs.statSync({ bigint: true })
at the same time we read its contents, then in a new
unlinkIfInodeMatches() helper, re-stat right before fs.unlinkSync and
bail if the inode has changed. The dual stat-then-unlink is the only
portable POSIX primitive Node exposes for this — there is no atomic
"unlink-if-inode" syscall — so a sufficiently unlucky race can still
slip through. The window is orders of magnitude smaller than the
unconditional unlink it replaces, and the outer retry loop will detect
a wrong unlink on its next writeFileSync(wx) attempt because either
we re-create the file or we observe a new lock with a different inode.

Also bumps MAX_ATTEMPTS from 2 to 5 because the inode-verified cleanup
can take a few more spins under contention before one cleaner wins.

Adds a regression test that simulates the race deterministically by
wrapping fs.statSync so the first stat succeeds against the original
stale inode, then atomically swaps the lock file (unlink + recreate)
to give it a new inode before unlinkIfInodeMatches re-stats it. The
test asserts the fresh claim survives the race and is the file on
disk after acquireOnboardLock returns.

Verified by stashing the source fix and re-running: the new test
fails on the unguarded code as expected, and passes with the inode
guard in place.

Closes NVIDIA#1281
ColinM-sys added a commit to ColinM-sys/NemoClaw that referenced this pull request Apr 16, 2026
The acquireOnboardLock stale-cleanup path read a stale lock, decided
the holder was dead, and unconditionally unlinked LOCK_FILE before
retrying writeFileSync(wx). Two concurrent processes that both observe
the same stale lock will both try to clean it up — and the slower of
the two can unlink the *fresh* lock the faster process just claimed,
breaking mutual exclusion: both processes end up holding 'their' lock
and onboard runs in parallel against the same shared session state.

Reported as NVIDIA#1281, originally surfaced by CodeRabbit on NVIDIA#1272.

Fix: capture the stale file's inode via fs.statSync({ bigint: true })
at the same time we read its contents, then in a new
unlinkIfInodeMatches() helper, re-stat right before fs.unlinkSync and
bail if the inode has changed. The dual stat-then-unlink is the only
portable POSIX primitive Node exposes for this — there is no atomic
"unlink-if-inode" syscall — so a sufficiently unlucky race can still
slip through. The window is orders of magnitude smaller than the
unconditional unlink it replaces, and the outer retry loop will detect
a wrong unlink on its next writeFileSync(wx) attempt because either
we re-create the file or we observe a new lock with a different inode.

Also bumps MAX_ATTEMPTS from 2 to 5 because the inode-verified cleanup
can take a few more spins under contention before one cleaner wins.

Adds a regression test that simulates the race deterministically by
wrapping fs.statSync so the first stat succeeds against the original
stale inode, then atomically swaps the lock file (unlink + recreate)
to give it a new inode before unlinkIfInodeMatches re-stats it. The
test asserts the fresh claim survives the race and is the file on
disk after acquireOnboardLock returns.

Verified by stashing the source fix and re-running: the new test
fails on the unguarded code as expected, and passes with the inode
guard in place.

Closes NVIDIA#1281
ericksoa pushed a commit that referenced this pull request Apr 16, 2026
…#1656)

## Summary

Closes #1281.

`acquireOnboardLock` in `src/lib/onboard-session.ts` had a stale-cleanup
path that read a stale lock, decided the holder was dead, and
**unconditionally** called `fs.unlinkSync(LOCK_FILE)` before retrying
`writeFileSync(wx)`.

Two concurrent processes that both observe the same stale lock both try
to clean it up — and the slower of the two can unlink the *fresh* lock
the faster process just claimed. Both processes then end up holding
'their' lock and onboard runs in parallel against the same shared
session state.

Originally surfaced by CodeRabbit on #1272.

## Race walkthrough

```
A: writeFile(wx) → EEXIST (stale lock present)
B: writeFile(wx) → EEXIST (same stale lock)
A: read stale, isProcessAlive(stale.pid) → false
A: unlinkSync(LOCK_FILE)                                     ← deletes stale
A: loop, writeFile(wx) → SUCCESS, A holds the lock
B: read stale (still has its own copy from before)
B: isProcessAlive(stale.pid) → false (B is reasoning about
   the OLD pid, not the fresh lock A just wrote)
B: unlinkSync(LOCK_FILE)                                     ← DELETES A's FRESH LOCK
B: loop, writeFile(wx) → SUCCESS, B also "holds" the lock
```

After this, both A and B believe they own the lock and proceed with
onboard.

## Fix

Capture the stale file's **inode** via `fs.statSync({ bigint: true })`
at the same time we read its contents. Then in a new
`unlinkIfInodeMatches()` helper, re-stat right before `fs.unlinkSync`
and bail if the inode has changed.

The dual stat-then-unlink is the only portable POSIX primitive Node
exposes for this — there is no atomic "unlink-if-inode" syscall — so a
sufficiently unlucky race can still slip through. **However** the window
is orders of magnitude smaller than the unconditional unlink it
replaces, and the outer retry loop will detect a wrong unlink on its
next `writeFileSync(wx)` attempt because either we re-create the file or
we observe a new lock with a different inode and retry.

Also bumps `MAX_ATTEMPTS` from 2 to 5 because the inode-verified cleanup
can take a few more spins under contention before one cleaner wins.

## Behavior preserved

- Malformed lock files are still left on disk (the existing "treats
unreadable or transient lock contents as a retry" test still passes — I
explicitly do NOT call `unlinkIfInodeMatches` on malformed locks, only
on parseable-but-stale ones).
- A holder PID that is still alive is still reported correctly to the
caller.
- `releaseOnboardLock()` semantics unchanged.

## Test plan

- [x] Added regression test `regression #1281: stale-cleanup race does
not unlink a fresh lock claimed by another process` that simulates the
race deterministically by wrapping `fs.statSync` so the first stat
(inside acquireOnboardLock) succeeds against the original stale inode,
then atomically swaps the lock file (unlink + recreate) to give it a new
inode before `unlinkIfInodeMatches` re-stats it. The test asserts the
fresh claim survives the race and is the file on disk after
`acquireOnboardLock` returns.
- [x] **Negative case verified**: stashed the source fix, re-ran the
test against the unguarded code. The new regression test correctly
**fails** because the unconditional `unlinkSync` deletes the fresh claim
and writes a new one with a different command string. The test's
`expect(onDisk.command).toContain("fresh claim from concurrent
process")` then fails with the actual content showing the wrong winner.
- [x] **Positive case verified**: re-applied the fix, re-ran. All 14
lock-related tests pass.
- [x] No regression on existing lock tests (`acquires and releases the
onboard lock`, `replaces a stale onboard lock`, `treats unreadable or
transient lock contents as a retry, not a stale lock`, `ignores
malformed lock files when releasing the onboard lock`).
- [x] One pre-existing Windows-only POSIX-permissions test (`creates and
persists a session with restrictive permissions` checking `mode & 0o777
=== 0o600`) fails on this branch with the same failure as `main` — this
is the umask/POSIX-mode-on-Windows issue and is unrelated to the lock
fix.

## Why an inode check rather than the issue's suggested `linkSync`
approach

The issue body suggests using `fs.linkSync(temp, LOCK_FILE)` which fails
atomically with EEXIST if the target exists. That's a valid alternative,
but:

1. The current `writeFileSync(LOCK_FILE, payload, { flag: "wx" })`
already provides the same atomic-create-or-fail semantics — there's no
need to introduce a tempfile + link dance for the *initial* claim. The
race is specifically in the *cleanup* of stale locks, not the initial
claim.
2. Inode-comparison preserves the existing happy-path code
(single-syscall `writeFileSync(wx)`) and only adds the check on the rare
stale-cleanup branch.
3. Tempfile + linkSync would add a second possible failure mode
(tempfile leak under crash) without solving the cleanup race any more
cleanly.

Both approaches are correct; this one is the smaller surgical change.


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Added a regression test that reproduces a stale-lock race and verifies
correct behavior without cross-test side effects.

* **Bug Fixes**
* Increased retry attempts for lock acquisition to reduce transient
failures.
  * More robust, atomic lock creation to avoid partial writes.
  * Safer stale-lock cleanup using on-disk verification before removal.
* More reliable lock release that verifies ownership before removing
lock files.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: ColinM-sys <cmcdonough@50words.com>

---------

Signed-off-by: ColinM-sys <cmcdonough@50words.com>
Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Co-authored-by: Prekshi Vyas <34834085+prekshivyas@users.noreply.github.com>
Co-authored-by: Prekshi Vyas <prekshiv@nvidia.com>
@wscurran wscurran added area: cli Command line interface, flags, terminal UX, or output feature PR adds or expands user-visible functionality and removed NemoClaw CLI feature PR adds or expands user-visible functionality labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: cli Command line interface, flags, terminal UX, or output refactor PR restructures code without intended behavior change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants