fix(onboard): diagnose Ollama auth-proxy port conflicts and recover startup by yimoj · Pull Request #5040 · NVIDIA/NemoClaw

yimoj · 2026-06-09T10:46:26Z

Summary

Local Ollama onboarding aborted at provider setup with a generic Ollama auth proxy failed to start on :11435 — no conflict owner, no retry, no remediation. This hardens the host-side auth-proxy startup with precise port-conflict diagnostics, retry/backoff, and ownership verification, so the reporter's workflow either completes or fails with actionable detail.

Related Issue

Fixes #4820

Changes

Port-conflict diagnostics: before/around proxy spawn, detect who owns :11435 and print the exact PID + (redacted) command line with actionable remediation — stop the process, or export NEMOCLAW_OLLAMA_PROXY_PORT so every NemoClaw command uses the same port.
IPv4-scoped pre-check: the proxy binds IPv4 0.0.0.0, so the pre-start abort only fires on an IPv4 conflict — an IPv6-only listener that can't actually conflict no longer triggers a false abort. A dual-stack :::PORT blocker (which does block IPv4) is still caught and named via the post-spawn EADDRINUSE path (broad lsof scope).
Retry/backoff: replaced the single 2s readiness probe with a bounded poll loop so slow hosts and the post-systemd-restart window can settle.
Ownership verification: startup is successful only when the listener behaves like our proxy (rejects unauthenticated → 401, accepts the freshly-minted token → non-401). Prevents persisting a token for a foreign listener when lsof is unavailable or loses the EADDRINUSE race.
Clean EADDRINUSE handling in scripts/ollama-auth-proxy.js (exit 1 instead of an uncaught crash).
127.0.0.1 health probe (not localhost) so an IPv6-only squatter can't misdirect the check; this matches the file's other proxy probes.
Tests (test/ollama-proxy-startup.test.ts): port conflict, stale-proxy reclaim, slow start, spawn failure.

No src/lib/onboard.ts changes (growth-neutral). No documented behavior changes — the remediation aligns with the existing NEMOCLAW_OLLAMA_PROXY_PORT docs.

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npm run typecheck:cli passes
Targeted npm test passes: ollama-proxy-startup, ollama-proxy-recovery, onboard-ollama-autostart, ollama-tools-capability, sandbox-connect-inference, onboard (pre-existing unrelated failures on this host are environment timing flakes, confirmed present on the clean base)
Tests added for new behavior
No secrets, API keys, or credentials committed
Reproduced end-to-end through the compiled CLI and verified the fix.

E2E evidence (host-side `startOllamaAuthProxy()` via the compiled `dist`)

Built the CLI (npm run build:cli) and drove the real host-side proxy startup against live listeners on a Linux host with Ollama on 127.0.0.1:11434.

Before the fix — squatting a non-NemoClaw process on :11435, then running the Ollama provider path produced the reporter's generic error:

$ node -e 'http.createServer().listen(11435,"0.0.0.0")' &   # squatter
$ node -e 'require("./dist/lib/inference/ollama/proxy.js").startOllamaAuthProxy()'
  Error: Ollama auth proxy failed to start on :11435
  Containers will not be able to reach Ollama without the proxy.
  Check if port 11435 is already in use: lsof -ti :11435
# returned: false  (no owner, no retry, no remediation)

After the fix — same harness, three scenarios:

# 1) Foreign IPv4 listener on :11435 → named conflict + remediation, aborts
  Error: Ollama auth proxy cannot start — port 11435 is already in use by another process.
    PID 2514860: node /tmp/v4.js
  Resolve the conflict, then re-run onboarding:
    • Stop the process above (e.g. kill 2514860), or
    • Choose a free proxy port and export it so every NemoClaw command
      uses the same value (add it to your shell profile to persist):
        export NEMOCLAW_OLLAMA_PROXY_PORT=<port>
  Containers will not be able to reach Ollama without the proxy.
# returned: false

# 2) IPv6-only listener on ::1:11435 (does NOT block our IPv4 0.0.0.0 bind)
#    → proxy starts, coexists; lsof shows our IPv4 *:11435 + the ::1 squatter
# returned: true

# 3) Happy path (port free) → proxy starts and binds 0.0.0.0:11435
# returned: true

Dual-stack :::11435 blockers (which lsof reports as IPv6 but do block our IPv4 bind) are caught post-spawn via the EADDRINUSE path and named with the broad lsof scope.

CI on this head (aa7065844) is green: 39 success / 0 failure (see the PR Checks tab — build-typecheck, cli-test-shards, static-checks, codebase-growth-guardrails all pass).

Signed-off-by: Yimo Jiang yimoj@nvidia.com

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Prevented proxy startup crashes on listen failures; improved diagnostics, clearer conflict messages, and safe exit/reclaim flows. Hardened readiness checks to avoid false positives by requiring token-specific verification and IPv4-scoped preflight.
Tests
- Added comprehensive startup scenarios (IPv4/IPv6 listeners, foreign-owner conflicts, delayed binds, spawn failures, reclaim) and standardized simulated proxy readiness across onboarding tests.
Chores
- Adjusted test file-size budget.

coderabbitai · 2026-06-09T10:46:40Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8134ecec-ed4d-49a4-848a-7b69480ecb0d

📥 Commits

Reviewing files that changed from the base of the PR and between 3df45b1 and 355e627.

📒 Files selected for processing (5)

ci/test-file-size-budget.json
scripts/ollama-auth-proxy.js
src/lib/inference/ollama/proxy.ts
test/ollama-proxy-startup.test.ts
test/onboard-selection.test.ts

💤 Files with no reviewable changes (1)

test/onboard-selection.test.ts

🚧 Files skipped from review as they are similar to previous changes (3)

scripts/ollama-auth-proxy.js
src/lib/inference/ollama/proxy.ts
test/ollama-proxy-startup.test.ts

📝 Walkthrough

Walkthrough

Hardens Ollama auth-proxy startup: adds an HTTP server error handler, IPv4-scoped port-owner inspection with redacted cmdlines, staged spawn with token-bound readiness checks against 127.0.0.1, comprehensive startup tests, and standardized fake curl shims for onboarding tests.

Changes

Ollama Auth-Proxy Startup Reliability

Layer / File(s)	Summary
Proxy server error handling `scripts/ollama-auth-proxy.js`	Adds HTTP server error handler that logs `EADDRINUSE` port-conflict messages distinctly from other startup failures and exits with a non-zero code.
Imports and redact enablement `src/lib/inference/ollama/proxy.ts`	Adds `redact` import to support sanitizing foreign process command-line output in diagnostics.
Port inspection and conflict reporting `src/lib/inference/ollama/proxy.ts`	Inspects TCP listeners (IPv4-scoped preflight then broader), excludes NemoClaw’s own auth-proxy PID, redacts foreign owners’ `ps` argv, and emits structured conflict/remediation messages.
Staged startup flow and token-verified readiness `src/lib/inference/ollama/proxy.ts`	Reworks `startOllamaAuthProxy()` to clear stale proxies, run IPv4 preflight with early conflict abort, generate a fresh token, spawn a detached proxy, poll with backoff against `http://127.0.0.1:<port>/v1/models`, and require token acceptance plus unauthenticated rejection before signaling readiness.
Comprehensive startup scenario tests `test/ollama-proxy-startup.test.ts`	Vitest suite runs the compiled proxy in a child process with injected mocks and asserts scenarios: foreign-port conflict diagnostics, successful bind, IPv6-only ignored by IPv4 preflight, retry-on-late-bind, spawn-failure, and reclaiming a stale NemoClaw proxy PID; extends child→parent `StartupResult` with probe counts and remediation commands.
Test curl shims updated `test/onboard-selection.test.ts`, inline shims	Removes `writeOllamaToolCallingCurl`; updates `writeAlwaysOkCurl` to parse `--config`, capture request URL, and return `401` for unauthenticated `:11435/*` requests; multiple onboarding tests switched to the standardized shim.
CI test file-size budget `ci/test-file-size-budget.json`	Updates `legacyMaxLines` for `test/onboard-selection.test.ts` from `6922` to `6891`.

Sequence Diagram

sequenceDiagram
  participant StartOllamaAuthProxy
  participant LsofPs as lsof/ps
  participant SpawnedAuthProxy
  participant HTTPProbe

  StartOllamaAuthProxy->>LsofPs: Inspect who owns proxy TCP port (IPv4)
  alt Port bound by foreign PID
    LsofPs-->>StartOllamaAuthProxy: Return owning PID + cmdline
    StartOllamaAuthProxy->>StartOllamaAuthProxy: Redact command, print conflict & remediation
    StartOllamaAuthProxy-->>StartOllamaAuthProxy: Exit failure
  else Port free
    StartOllamaAuthProxy->>SpawnedAuthProxy: Spawn proxy with generated token
    StartOllamaAuthProxy->>StartOllamaAuthProxy: Backoff polling loop
    StartOllamaAuthProxy->>HTTPProbe: Probe http://127.0.0.1:<port>/v1/models with token and without
    alt Token accepted && unauthenticated rejected
      HTTPProbe-->>StartOllamaAuthProxy: Confirmed auth-proxy owns port
      StartOllamaAuthProxy-->>StartOllamaAuthProxy: Startup success
    else Token checks fail
      HTTPProbe-->>StartOllamaAuthProxy: Ownership not confirmed
      StartOllamaAuthProxy-->>StartOllamaAuthProxy: Retry until timeout/failure
    end
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I sniffed the socket, scrubbed the trace,

redacted the command that hid its face.
I spun a secret token, nudged loopback’s door,
probed, retried softly, counted once, then four.
Hop — the proxy woke and answered once more.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main changes: diagnosing Ollama auth-proxy port conflicts and recovering from startup failures.
Linked Issues check	✅ Passed	All coding objectives from issue `#4820` are met: port conflict diagnostics with PID/command output, IPv4-scoped pre-check avoiding false IPv6 positives, retry/backoff for slow hosts, behavioral ownership verification, and clean EADDRINUSE handling.
Out of Scope Changes check	✅ Passed	All changes directly address the linked issue: proxy error handling, port conflict diagnostics, startup resilience, ownership verification, and corresponding test/harness updates are all in scope.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

test/onboard-selection.test.ts (1)
973-997: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Consolidate duplicated Ollama curl shims to unblock CI size budget.

The file is currently failing CI size guardrails; repeated inline curl shims (same --config + :11435 unauthenticated 401 behavior) are the main growth driver. Reuse writeOllamaToolCallingCurl(fakeBin) (or a small parameterized variant) in these blocks instead of inlining each script body.
Suggested direction
-    fs.writeFileSync(
-      path.join(fakeBin, "curl"),
-      `#!/usr/bin/env bash
-... repeated auth-proxy shim ...
-`,
-      { mode: 0o755 },
-    );
+    writeOllamaToolCallingCurl(fakeBin);
If a specific test needs custom behavior, keep a tiny per-test override helper layered on top of writeOllamaToolCallingCurl rather than copying the full script again.
Also applies to: 1212-1240, 2054-2078, 2162-2186, 2282-2306, 2410-2434, 2535-2559, 3645-3669, 3742-3766, 5217-5244, 5566-5594, 5757-5784, 5952-5979, 6140-6167, 6616-6643, 6783-6810, 6939-6966
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/onboard-selection.test.ts` around lines 973 - 997, Replace the many
inline curl shim script bodies written with fs.writeFileSync in the tests by
calling the shared helper writeOllamaToolCallingCurl(fakeBin) (or create a small
parameterized variant writeOllamaToolCallingCurl(fakeBin, opts) that accepts
overrides such as body or unauthenticated-port) instead of copying the full
script; locate the duplicated blocks that reference fakeBin, the inline script
using OLLAMA_CHAT_COMPLETIONS_TOOL_CALL_RESPONSE and the :11435 unauthenticated
401 logic, and refactor each to call the helper, adding a tiny per-test override
wrapper only when a test needs custom behavior.
Source: Pipeline failures

🧹 Nitpick comments (1)

src/lib/inference/ollama/proxy.ts (1)
295-301: 💤 Low value

Minor: timeout error message understates actual wait time.

Each polling iteration may wait up to ~2s (1s waitForPort + 1s sleep), so 12 attempts can take ~24s worst-case, not 12s as the message implies. Consider either adjusting the message or using a cumulative elapsed-time check instead of attempt count.

This is cosmetic—the longer actual timeout is beneficial for slow hosts per the PR objective.
💡 Suggested clarification
-  console.error(
-    `  Error: Ollama auth proxy did not become ready on :${OLLAMA_PROXY_PORT} within ${PROXY_START_ATTEMPTS}s.`,
-  );
+  console.error(
+    `  Error: Ollama auth proxy did not become ready on :${OLLAMA_PROXY_PORT} after ${PROXY_START_ATTEMPTS} attempts.`,
+  );
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/inference/ollama/proxy.ts` around lines 295 - 301, The timeout
message under the Ollama proxy startup (around the block that logs error and
returns false) understates real wait time because each loop uses waitForPort +
sleep; update the message in the function that checks the proxy (referencing
OLLAMA_PROXY_PORT and PROXY_START_ATTEMPTS) to either report the cumulative
elapsed time (capture start time before the loop and compute elapsedMs to
include in the error) or multiply attempts by ~2s to reflect worst-case
(~${PROXY_START_ATTEMPTS * 2}s); adjust the process of computing the timeout
text accordingly in the same function so logs accurately reflect how long we
actually waited.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@test/onboard-selection.test.ts`:
- Around line 973-997: Replace the many inline curl shim script bodies written
with fs.writeFileSync in the tests by calling the shared helper
writeOllamaToolCallingCurl(fakeBin) (or create a small parameterized variant
writeOllamaToolCallingCurl(fakeBin, opts) that accepts overrides such as body or
unauthenticated-port) instead of copying the full script; locate the duplicated
blocks that reference fakeBin, the inline script using
OLLAMA_CHAT_COMPLETIONS_TOOL_CALL_RESPONSE and the :11435 unauthenticated 401
logic, and refactor each to call the helper, adding a tiny per-test override
wrapper only when a test needs custom behavior.

---

Nitpick comments:
In `@src/lib/inference/ollama/proxy.ts`:
- Around line 295-301: The timeout message under the Ollama proxy startup
(around the block that logs error and returns false) understates real wait time
because each loop uses waitForPort + sleep; update the message in the function
that checks the proxy (referencing OLLAMA_PROXY_PORT and PROXY_START_ATTEMPTS)
to either report the cumulative elapsed time (capture start time before the loop
and compute elapsedMs to include in the error) or multiply attempts by ~2s to
reflect worst-case (~${PROXY_START_ATTEMPTS * 2}s); adjust the process of
computing the timeout text accordingly in the same function so logs accurately
reflect how long we actually waited.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c3c7089d-9f1a-423e-bc6b-17d8cec5954b

📥 Commits

Reviewing files that changed from the base of the PR and between cb9719e and b890909.

📒 Files selected for processing (4)

scripts/ollama-auth-proxy.js
src/lib/inference/ollama/proxy.ts
test/ollama-proxy-startup.test.ts
test/onboard-selection.test.ts

🚧 Files skipped from review as they are similar to previous changes (2)

scripts/ollama-auth-proxy.js
test/ollama-proxy-startup.test.ts

yimoj · 2026-06-09T11:42:45Z

@coderabbitai resume

The duplicated Ollama curl shims flagged in the outside-diff comment have been consolidated into the shared writeOllamaToolCallingCurl(fakeBin) helper (now models the real auth proxy's unauthenticated-401 behavior), and test/onboard-selection.test.ts is back under its size budget (6929 lines, budget ratcheted). Please re-review the final state.

coderabbitai · 2026-06-09T11:43:01Z

@yimoj Thanks for the update! Resuming reviews and kicking off a fresh pass over the final state of the PR.

✅ Action performed

Reviews resumed.

Full review finished.

yimoj · 2026-06-09T11:48:13Z

@coderabbitai review

Note: review #4458158009 was on the superseded commit b890909cf. Its outside-diff suggestion to consolidate the duplicated Ollama curl shims is already implemented on the current head aa7065844 — the inline shims now call the shared writeOllamaToolCallingCurl(fakeBin) helper and test/onboard-selection.test.ts is back under its size budget (6929 lines). Please post a fresh review of the current head.

coderabbitai · 2026-06-09T11:48:27Z

@yimoj Understood — I'll kick off a fresh full review against the current head (aa7065844). The consolidation of the Ollama curl shims into writeOllamaToolCallingCurl(fakeBin) and the restored size budget for test/onboard-selection.test.ts (6929 lines) are noted so those points won't be re-raised.

✅ Action performed

Full review finished.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

test/onboard-selection.test.ts (1)
25-46: ⚡ Quick win

Consider consolidating remaining inline curl implementations.

Four inline fake curl scripts remain (hunks at lines 25-46, 2005-2024, 3500-3519, 3597-3616) with logic that appears functionally identical to the shared writeOllamaToolCallingCurl(fakeBin) helper used elsewhere in this file.

Per the PR objectives ("Consolidated duplicated Ollama curl shims into a shared writeOllamaToolCallingCurl(fakeBin) helper"), consider refactoring these remaining inline implementations to also call the shared helper. This would reduce duplication, ensure consistency, and simplify future maintenance.

If these inline versions must remain separate for test-specific reasons, please document why they differ from the shared helper.

Also applies to: 2005-2024, 3500-3519, 3597-3616
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/onboard-selection.test.ts` around lines 25 - 46, In
test/onboard-selection.test.ts there are four remaining inline fake curl script
implementations that duplicate the logic in the shared helper
writeOllamaToolCallingCurl(fakeBin); replace each inline script with a call to
writeOllamaToolCallingCurl(fakeBin) (passing the same fakeBin used by the
original inline fakes) to remove duplication and ensure consistent behavior, and
if any inline version must keep different behavior, add a short comment above it
explaining why it diverges from writeOllamaToolCallingCurl and what specific
differences are required for that test.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/ollama-proxy-startup.test.ts`:
- Around line 213-253: The test currently sets reclaimed true on any mocked
sleep and never asserts killCommands, so change the mocked reclaim logic to set
reclaimed only when the actual expected kill command targeting PID 4242 is
pushed to killCommands (e.g., when the string/command equals the expected kill
for PID 4242), and add an assertion after the run that killCommands contains
that expected command (so the test fails if the reclaim/kill action was
skipped). Ensure references to the test variables killCommands and reclaimed and
the PID 4242 are used so the check is tied directly to the emitted kill command
rather than any sleep timing.
- Around line 120-180: The tests currently vary the mocked curl responses but do
not assert that startOllamaAuthProxy() actually issued both readiness probes;
update the test to verify the curl mock was called twice and that one call
matched the unauthenticated probe (expected 401) and the other matched the
authenticated probe (expected 200). Concretely, after invoking
startOllamaAuthProxy() assert the curl mock (the mocked exec/spawn used in the
test) call count is 2 and inspect the call arguments (or calledWith entries) to
confirm one request targeted the unauthenticated endpoint/headers and one
targeted the authenticated endpoint/headers so regressions that skip the 401
check will fail.

In `@test/onboard-selection.test.ts`:
- Around line 2018-2023: The fake curl handlers in
test/onboard-selection.test.ts write to $outfile unconditionally which crashes
when -o is not passed; update each fake curl implementation (the occurrences
that reference the outfile variable at or near lines 2022, 3517, 3614 and the
ranges 3513-3518, 3610-3615) to guard the redirection by checking that outfile
is non-empty before writing the body (use the same safe pattern used in the
earlier hunk around lines 44-45), i.e. only perform the `> "$outfile"` write
when outfile is set; alternatively, if you prefer, assert in the tests that
callers always pass -o and document that precondition.

---

Nitpick comments:
In `@test/onboard-selection.test.ts`:
- Around line 25-46: In test/onboard-selection.test.ts there are four remaining
inline fake curl script implementations that duplicate the logic in the shared
helper writeOllamaToolCallingCurl(fakeBin); replace each inline script with a
call to writeOllamaToolCallingCurl(fakeBin) (passing the same fakeBin used by
the original inline fakes) to remove duplication and ensure consistent behavior,
and if any inline version must keep different behavior, add a short comment
above it explaining why it diverges from writeOllamaToolCallingCurl and what
specific differences are required for that test.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7c2179a4-466e-446b-98a8-3823aed15d9a

📥 Commits

Reviewing files that changed from the base of the PR and between 784f573 and aa70658.

📒 Files selected for processing (5)

ci/test-file-size-budget.json
scripts/ollama-auth-proxy.js
src/lib/inference/ollama/proxy.ts
test/ollama-proxy-startup.test.ts
test/onboard-selection.test.ts

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/ollama-proxy-startup.test.ts`:
- Around line 95-99: The runner.runCapture stub is too generic for lsof checks
and will incorrectly accept IPv6-only listeners; update the stub used in the
tests (runner.runCapture) to distinguish IPv4-scoped lsof queries from
generic/IPv6 ones by matching the IPv4-specific query string (e.g., the "-i4" or
"127.0.0.1:11435" style token) and only return the "2222" owner for that IPv4
query, add a separate case that returns an IPv6-only owner string (e.g., showing
an "[::1]:11435" or other IPv6 output) to simulate an IPv6 listener, and keep
the existing ps -p 2222 response ("/usr/bin/python3 -m http.server 11435") so
startup still succeeds when appropriate; apply the same change pattern to the
other runCapture stubs referenced in the comment.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1f2527fb-42e7-44ee-927b-9de69c7d63e9

📥 Commits

Reviewing files that changed from the base of the PR and between 166d6ae and 3eb100a.

📒 Files selected for processing (5)

ci/test-file-size-budget.json
scripts/ollama-auth-proxy.js
src/lib/inference/ollama/proxy.ts
test/ollama-proxy-startup.test.ts
test/onboard-selection.test.ts

✅ Files skipped from review due to trivial changes (1)

ci/test-file-size-budget.json

🚧 Files skipped from review as they are similar to previous changes (3)

scripts/ollama-auth-proxy.js
test/onboard-selection.test.ts
src/lib/inference/ollama/proxy.ts

wscurran · 2026-06-09T15:28:36Z

✨
Related open issues:

#4820 [Ubuntu 24.04][Onboard] local Ollama onboard fails auth proxy failed to start on :11435

yimoj · 2026-06-10T02:45:09Z

@coderabbitai resume

coderabbitai · 2026-06-10T02:45:15Z

✅ Action performed

Reviews resumed.

…tartup Local Ollama onboarding aborted at provider setup with a generic "Ollama auth proxy failed to start on :11435" — no conflict owner, no retry, and no remediation — when the port was occupied or the proxy was slow to bind. - Pre-check the proxy port (IPv4-scoped, since the proxy binds 0.0.0.0) and, on a real conflict, name the owning PID + command and tell the user to stop it or export NEMOCLAW_OLLAMA_PROXY_PORT for all NemoClaw commands - Poll for readiness with backoff (slow hosts / post-systemd-restart window) instead of a single 2s probe, and verify the listener is our proxy holding the token before reporting success (rejects unauth 401, accepts the token) so we never persist a token for a foreign listener - Diagnose a failed spawn distinctly: name a dual-stack IPv6 blocker via a broad lsof scope, else report "exited during startup" - Handle EADDRINUSE in the proxy script so it exits cleanly instead of crashing with an uncaught exception - Probe the proxy via 127.0.0.1 (not localhost) so an IPv6-only squatter cannot misdirect the health check; redact owner command lines before printing them - Add startup tests: port conflict, stale-proxy reclaim, slow start, spawn failure Fixes NVIDIA#4820 Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

prekshivyas

Reviewed (code + security checklist). Hardens host-side startOllamaAuthProxy() / ollama-auth-proxy.js with named port-conflict diagnostics, IPv4-scoped preflight, retry/backoff readiness polling, and token-based ownership verification so the Ollama onboarding path either succeeds or fails with actionable detail (#4820).

✅ Approve. Security-conscious throughout: token is crypto.randomBytes(24), persisted only after provider confirmation; lsof/ps/curl invoked via spawnSync array args (no shell); curl pinned to 127.0.0.1 with --max-time; PIDs parseInt+positive-guarded before any kill, and kill is gated on isOllamaProxyProcess (never kills foreign PIDs). The readiness proof requires both unauth→401 and token→non-401 — a genuine TOCTOU hardening so a token is never persisted for a foreign listener that won an EADDRINUSE race. Security: all categories pass.

Non-blocking nits:

proxy.ts:195 — timeout message says "within ${PROXY_START_ATTEMPTS}s" (≈12s), but each unbound iteration is ~1s probe + 1s sleep, so real worst case is ~24s. Reword to "after N attempts" or compute the true budget.
proxy.ts:96 — comment implies foreign argv is fully redacted; redact() is best-effort (pattern/URL based), so an opaque --token=<x> could still print. Soften wording.

Tests strong: ollama-proxy-startup.test.ts covers all terminal paths and asserts both halves of the readiness proof run.

## Summary - Add the v0.0.63 release-note section using the published development note as source context. - Update source docs for sandbox recovery, OpenClaw config restore safety, managed vLLM selection, Slack Socket Mode conflict handling, and host diagnostics. - Refresh generated `nemoclaw-user-*` skills from the updated Fern MDX docs. - Update the release-doc refresh skill so post-release docs for version `n` look up the matching announcement discussion and use the `n+1` patch release label. - Fix CLI/docs parity by avoiding a `--from <Dockerfile>` flag mention inside the `upgrade-sandboxes` command section. ## Source summary - #5034 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document safer stale-sandbox recovery through `rebuild --yes` before recreating from scratch. - #5091 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document Docker-driver post-reboot recovery from OpenShell container labels. - #5101, #5174, #5177 -> `docs/manage-sandboxes/backup-restore.mdx`, `docs/about/release-notes.mdx`: Document OpenClaw `openclaw.json` preservation, merge behavior, and fail-safe restore handling. - #5102 -> `docs/reference/commands.mdx`, `docs/reference/commands-nemohermes.mdx`, `docs/manage-sandboxes/lifecycle.mdx`, `docs/about/release-notes.mdx`: Document `upgrade-sandboxes` image-fingerprint drift detection. - #4201 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document the installer diagnostic for unexpected Docker daemon access outside the `docker` group. - #5038 -> `docs/inference/inference-options.mdx`, `docs/inference/use-local-inference.mdx`, `docs/about/release-notes.mdx`: Document the interactive managed-vLLM model picker and non-interactive override behavior. - #5040, #5041 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document Ollama auth-proxy recovery and host DNS preflight diagnostics. - #4986, #5039 -> `docs/manage-sandboxes/messaging-channels.mdx`, `docs/about/release-notes.mdx`: Document Slack validation and duplicate Slack Socket Mode sandbox handling. - #4981, #5168 -> `docs/about/release-notes.mdx`: Capture Hermes gateway secret-guard and wrapped-argv startup hardening in the release surface. - Follow-up -> `.agents/skills/nemoclaw-contributor-update-docs/SKILL.md`: Record the post-release docs workflow, discussion-announcement lookup, and next-patch release label rule. - Follow-up -> `docs/reference/commands.mdx`, `docs/reference/commands-nemohermes.mdx`: Reword custom Dockerfile sandbox text so CLI parity does not treat `--from` as an `upgrade-sandboxes` flag. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `npm run docs` - `npm run build:cli` - `bash test/e2e/e2e-cloud-experimental/check-docs.sh --only-cli` - Skip-term scan for `docs/.docs-skip` blocked terms across generated user skills  ## Summary by CodeRabbit * **Documentation** * Enhanced local inference setup with interactive model selection prompts and environment variable overrides * Improved sandbox upgrade detection using build fingerprints and version checks * Clarified configuration restore behavior preserving user settings during rebuild/restore * Added gateway authentication as fifth security layer * Expanded Slack messaging validation with live credential checking * Enhanced troubleshooting guidance for Docker access, DNS issues, and sandbox recovery * Updated release notes for v0.0.63 featuring sandbox recovery and inference improvements

yimoj force-pushed the fix/4820-ollama-proxy-startup branch 4 times, most recently from b890909 to aa70658 Compare June 9, 2026 11:36

coderabbitai Bot reviewed Jun 9, 2026

View reviewed changes

Comment thread test/ollama-proxy-startup.test.ts

Comment thread test/ollama-proxy-startup.test.ts

Comment thread test/onboard-selection.test.ts Outdated

yimoj force-pushed the fix/4820-ollama-proxy-startup branch 2 times, most recently from 166d6ae to 3eb100a Compare June 9, 2026 12:00

coderabbitai Bot reviewed Jun 9, 2026

View reviewed changes

Comment thread test/ollama-proxy-startup.test.ts

yimoj force-pushed the fix/4820-ollama-proxy-startup branch from 3eb100a to a23086d Compare June 9, 2026 12:12

wscurran added area: local-models Local model providers, downloads, launch, or connectivity area: onboarding Onboarding FSM, provider setup, sandbox launch, or first-run flow bug-fix PR fixes a bug or regression provider: ollama Ollama local model provider behavior labels Jun 9, 2026

yimoj force-pushed the fix/4820-ollama-proxy-startup branch 2 times, most recently from e19dd45 to bd1226f Compare June 10, 2026 02:33

yimoj force-pushed the fix/4820-ollama-proxy-startup branch from bd1226f to 6c23d4b Compare June 10, 2026 02:45

yimoj added the v0.0.63 Release target label Jun 10, 2026

yimoj force-pushed the fix/4820-ollama-proxy-startup branch 2 times, most recently from d76897b to 3df45b1 Compare June 10, 2026 06:40

yimoj force-pushed the fix/4820-ollama-proxy-startup branch from 3df45b1 to 355e627 Compare June 10, 2026 08:17

prekshivyas self-assigned this Jun 10, 2026

prekshivyas approved these changes Jun 10, 2026

View reviewed changes

cv merged commit 2cde63f into NVIDIA:main Jun 10, 2026
34 checks passed

miyoungc mentioned this pull request Jun 11, 2026

docs: refresh v0.0.63 release docs #5244

Merged

Conversation

yimoj commented Jun 9, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related Issue

Changes

Type of Change

Verification

E2E evidence (host-side startOllamaAuthProxy() via the compiled dist)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

yimoj commented Jun 9, 2026

Uh oh!

coderabbitai Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yimoj commented Jun 9, 2026

Uh oh!

coderabbitai Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

wscurran commented Jun 9, 2026

Uh oh!

yimoj commented Jun 10, 2026

Uh oh!

coderabbitai Bot commented Jun 10, 2026

Uh oh!

prekshivyas left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

yimoj commented Jun 9, 2026 •

edited by coderabbitai Bot

Loading

E2E evidence (host-side `startOllamaAuthProxy()` via the compiled `dist`)

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading