Skip to content

fix(onboard): diagnose Ollama auth-proxy port conflicts and recover startup#5040

Merged
cv merged 1 commit into
NVIDIA:mainfrom
yimoj:fix/4820-ollama-proxy-startup
Jun 10, 2026
Merged

fix(onboard): diagnose Ollama auth-proxy port conflicts and recover startup#5040
cv merged 1 commit into
NVIDIA:mainfrom
yimoj:fix/4820-ollama-proxy-startup

Conversation

@yimoj

@yimoj yimoj commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

Local Ollama onboarding aborted at provider setup with a generic Ollama auth proxy failed to start on :11435 — no conflict owner, no retry, no remediation. This hardens the host-side auth-proxy startup with precise port-conflict diagnostics, retry/backoff, and ownership verification, so the reporter's workflow either completes or fails with actionable detail.

Related Issue

Fixes #4820

Changes

  • Port-conflict diagnostics: before/around proxy spawn, detect who owns :11435 and print the exact PID + (redacted) command line with actionable remediation — stop the process, or export NEMOCLAW_OLLAMA_PROXY_PORT so every NemoClaw command uses the same port.
  • IPv4-scoped pre-check: the proxy binds IPv4 0.0.0.0, so the pre-start abort only fires on an IPv4 conflict — an IPv6-only listener that can't actually conflict no longer triggers a false abort. A dual-stack :::PORT blocker (which does block IPv4) is still caught and named via the post-spawn EADDRINUSE path (broad lsof scope).
  • Retry/backoff: replaced the single 2s readiness probe with a bounded poll loop so slow hosts and the post-systemd-restart window can settle.
  • Ownership verification: startup is successful only when the listener behaves like our proxy (rejects unauthenticated → 401, accepts the freshly-minted token → non-401). Prevents persisting a token for a foreign listener when lsof is unavailable or loses the EADDRINUSE race.
  • Clean EADDRINUSE handling in scripts/ollama-auth-proxy.js (exit 1 instead of an uncaught crash).
  • 127.0.0.1 health probe (not localhost) so an IPv6-only squatter can't misdirect the check; this matches the file's other proxy probes.
  • Tests (test/ollama-proxy-startup.test.ts): port conflict, stale-proxy reclaim, slow start, spawn failure.

No src/lib/onboard.ts changes (growth-neutral). No documented behavior changes — the remediation aligns with the existing NEMOCLAW_OLLAMA_PROXY_PORT docs.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npm run typecheck:cli passes
  • Targeted npm test passes: ollama-proxy-startup, ollama-proxy-recovery, onboard-ollama-autostart, ollama-tools-capability, sandbox-connect-inference, onboard (pre-existing unrelated failures on this host are environment timing flakes, confirmed present on the clean base)
  • Tests added for new behavior
  • No secrets, API keys, or credentials committed
  • Reproduced end-to-end through the compiled CLI and verified the fix.

E2E evidence (host-side startOllamaAuthProxy() via the compiled dist)

Built the CLI (npm run build:cli) and drove the real host-side proxy startup against live listeners on a Linux host with Ollama on 127.0.0.1:11434.

Before the fix — squatting a non-NemoClaw process on :11435, then running the Ollama provider path produced the reporter's generic error:

$ node -e 'http.createServer().listen(11435,"0.0.0.0")' &   # squatter
$ node -e 'require("./dist/lib/inference/ollama/proxy.js").startOllamaAuthProxy()'
  Error: Ollama auth proxy failed to start on :11435
  Containers will not be able to reach Ollama without the proxy.
  Check if port 11435 is already in use: lsof -ti :11435
# returned: false  (no owner, no retry, no remediation)

After the fix — same harness, three scenarios:

# 1) Foreign IPv4 listener on :11435 → named conflict + remediation, aborts
  Error: Ollama auth proxy cannot start — port 11435 is already in use by another process.
    PID 2514860: node /tmp/v4.js
  Resolve the conflict, then re-run onboarding:
    • Stop the process above (e.g. kill 2514860), or
    • Choose a free proxy port and export it so every NemoClaw command
      uses the same value (add it to your shell profile to persist):
        export NEMOCLAW_OLLAMA_PROXY_PORT=<port>
  Containers will not be able to reach Ollama without the proxy.
# returned: false

# 2) IPv6-only listener on ::1:11435 (does NOT block our IPv4 0.0.0.0 bind)
#    → proxy starts, coexists; lsof shows our IPv4 *:11435 + the ::1 squatter
# returned: true

# 3) Happy path (port free) → proxy starts and binds 0.0.0.0:11435
# returned: true

Dual-stack :::11435 blockers (which lsof reports as IPv6 but do block our IPv4 bind) are caught post-spawn via the EADDRINUSE path and named with the broad lsof scope.

CI on this head (aa7065844) is green: 39 success / 0 failure (see the PR Checks tab — build-typecheck, cli-test-shards, static-checks, codebase-growth-guardrails all pass).


Signed-off-by: Yimo Jiang yimoj@nvidia.com

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Prevented proxy startup crashes on listen failures; improved diagnostics, clearer conflict messages, and safe exit/reclaim flows. Hardened readiness checks to avoid false positives by requiring token-specific verification and IPv4-scoped preflight.
  • Tests

    • Added comprehensive startup scenarios (IPv4/IPv6 listeners, foreign-owner conflicts, delayed binds, spawn failures, reclaim) and standardized simulated proxy readiness across onboarding tests.
  • Chores

    • Adjusted test file-size budget.

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8134ecec-ed4d-49a4-848a-7b69480ecb0d

📥 Commits

Reviewing files that changed from the base of the PR and between 3df45b1 and 355e627.

📒 Files selected for processing (5)
  • ci/test-file-size-budget.json
  • scripts/ollama-auth-proxy.js
  • src/lib/inference/ollama/proxy.ts
  • test/ollama-proxy-startup.test.ts
  • test/onboard-selection.test.ts
💤 Files with no reviewable changes (1)
  • test/onboard-selection.test.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • scripts/ollama-auth-proxy.js
  • src/lib/inference/ollama/proxy.ts
  • test/ollama-proxy-startup.test.ts

📝 Walkthrough

Walkthrough

Hardens Ollama auth-proxy startup: adds an HTTP server error handler, IPv4-scoped port-owner inspection with redacted cmdlines, staged spawn with token-bound readiness checks against 127.0.0.1, comprehensive startup tests, and standardized fake curl shims for onboarding tests.

Changes

Ollama Auth-Proxy Startup Reliability

Layer / File(s) Summary
Proxy server error handling
scripts/ollama-auth-proxy.js
Adds HTTP server error handler that logs EADDRINUSE port-conflict messages distinctly from other startup failures and exits with a non-zero code.
Imports and redact enablement
src/lib/inference/ollama/proxy.ts
Adds redact import to support sanitizing foreign process command-line output in diagnostics.
Port inspection and conflict reporting
src/lib/inference/ollama/proxy.ts
Inspects TCP listeners (IPv4-scoped preflight then broader), excludes NemoClaw’s own auth-proxy PID, redacts foreign owners’ ps argv, and emits structured conflict/remediation messages.
Staged startup flow and token-verified readiness
src/lib/inference/ollama/proxy.ts
Reworks startOllamaAuthProxy() to clear stale proxies, run IPv4 preflight with early conflict abort, generate a fresh token, spawn a detached proxy, poll with backoff against http://127.0.0.1:<port>/v1/models, and require token acceptance plus unauthenticated rejection before signaling readiness.
Comprehensive startup scenario tests
test/ollama-proxy-startup.test.ts
Vitest suite runs the compiled proxy in a child process with injected mocks and asserts scenarios: foreign-port conflict diagnostics, successful bind, IPv6-only ignored by IPv4 preflight, retry-on-late-bind, spawn-failure, and reclaiming a stale NemoClaw proxy PID; extends child→parent StartupResult with probe counts and remediation commands.
Test curl shims updated
test/onboard-selection.test.ts, inline shims
Removes writeOllamaToolCallingCurl; updates writeAlwaysOkCurl to parse --config, capture request URL, and return 401 for unauthenticated :11435/* requests; multiple onboarding tests switched to the standardized shim.
CI test file-size budget
ci/test-file-size-budget.json
Updates legacyMaxLines for test/onboard-selection.test.ts from 6922 to 6891.

Sequence Diagram

sequenceDiagram
  participant StartOllamaAuthProxy
  participant LsofPs as lsof/ps
  participant SpawnedAuthProxy
  participant HTTPProbe

  StartOllamaAuthProxy->>LsofPs: Inspect who owns proxy TCP port (IPv4)
  alt Port bound by foreign PID
    LsofPs-->>StartOllamaAuthProxy: Return owning PID + cmdline
    StartOllamaAuthProxy->>StartOllamaAuthProxy: Redact command, print conflict & remediation
    StartOllamaAuthProxy-->>StartOllamaAuthProxy: Exit failure
  else Port free
    StartOllamaAuthProxy->>SpawnedAuthProxy: Spawn proxy with generated token
    StartOllamaAuthProxy->>StartOllamaAuthProxy: Backoff polling loop
    StartOllamaAuthProxy->>HTTPProbe: Probe http://127.0.0.1:<port>/v1/models with token and without
    alt Token accepted && unauthenticated rejected
      HTTPProbe-->>StartOllamaAuthProxy: Confirmed auth-proxy owns port
      StartOllamaAuthProxy-->>StartOllamaAuthProxy: Startup success
    else Token checks fail
      HTTPProbe-->>StartOllamaAuthProxy: Ownership not confirmed
      StartOllamaAuthProxy-->>StartOllamaAuthProxy: Retry until timeout/failure
    end
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I sniffed the socket, scrubbed the trace,

redacted the command that hid its face.
I spun a secret token, nudged loopback’s door,
probed, retried softly, counted once, then four.
Hop — the proxy woke and answered once more.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: diagnosing Ollama auth-proxy port conflicts and recovering from startup failures.
Linked Issues check ✅ Passed All coding objectives from issue #4820 are met: port conflict diagnostics with PID/command output, IPv4-scoped pre-check avoiding false IPv6 positives, retry/backoff for slow hosts, behavioral ownership verification, and clean EADDRINUSE handling.
Out of Scope Changes check ✅ Passed All changes directly address the linked issue: proxy error handling, port conflict diagnostics, startup resilience, ownership verification, and corresponding test/harness updates are all in scope.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@yimoj yimoj force-pushed the fix/4820-ollama-proxy-startup branch 4 times, most recently from b890909 to aa70658 Compare June 9, 2026 11:36

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/onboard-selection.test.ts (1)

973-997: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Consolidate duplicated Ollama curl shims to unblock CI size budget.

The file is currently failing CI size guardrails; repeated inline curl shims (same --config + :11435 unauthenticated 401 behavior) are the main growth driver. Reuse writeOllamaToolCallingCurl(fakeBin) (or a small parameterized variant) in these blocks instead of inlining each script body.

Suggested direction
-    fs.writeFileSync(
-      path.join(fakeBin, "curl"),
-      `#!/usr/bin/env bash
-... repeated auth-proxy shim ...
-`,
-      { mode: 0o755 },
-    );
+    writeOllamaToolCallingCurl(fakeBin);

If a specific test needs custom behavior, keep a tiny per-test override helper layered on top of writeOllamaToolCallingCurl rather than copying the full script again.

Also applies to: 1212-1240, 2054-2078, 2162-2186, 2282-2306, 2410-2434, 2535-2559, 3645-3669, 3742-3766, 5217-5244, 5566-5594, 5757-5784, 5952-5979, 6140-6167, 6616-6643, 6783-6810, 6939-6966

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/onboard-selection.test.ts` around lines 973 - 997, Replace the many
inline curl shim script bodies written with fs.writeFileSync in the tests by
calling the shared helper writeOllamaToolCallingCurl(fakeBin) (or create a small
parameterized variant writeOllamaToolCallingCurl(fakeBin, opts) that accepts
overrides such as body or unauthenticated-port) instead of copying the full
script; locate the duplicated blocks that reference fakeBin, the inline script
using OLLAMA_CHAT_COMPLETIONS_TOOL_CALL_RESPONSE and the :11435 unauthenticated
401 logic, and refactor each to call the helper, adding a tiny per-test override
wrapper only when a test needs custom behavior.

Source: Pipeline failures

🧹 Nitpick comments (1)
src/lib/inference/ollama/proxy.ts (1)

295-301: 💤 Low value

Minor: timeout error message understates actual wait time.

Each polling iteration may wait up to ~2s (1s waitForPort + 1s sleep), so 12 attempts can take ~24s worst-case, not 12s as the message implies. Consider either adjusting the message or using a cumulative elapsed-time check instead of attempt count.

This is cosmetic—the longer actual timeout is beneficial for slow hosts per the PR objective.

💡 Suggested clarification
-  console.error(
-    `  Error: Ollama auth proxy did not become ready on :${OLLAMA_PROXY_PORT} within ${PROXY_START_ATTEMPTS}s.`,
-  );
+  console.error(
+    `  Error: Ollama auth proxy did not become ready on :${OLLAMA_PROXY_PORT} after ${PROXY_START_ATTEMPTS} attempts.`,
+  );
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/inference/ollama/proxy.ts` around lines 295 - 301, The timeout
message under the Ollama proxy startup (around the block that logs error and
returns false) understates real wait time because each loop uses waitForPort +
sleep; update the message in the function that checks the proxy (referencing
OLLAMA_PROXY_PORT and PROXY_START_ATTEMPTS) to either report the cumulative
elapsed time (capture start time before the loop and compute elapsedMs to
include in the error) or multiply attempts by ~2s to reflect worst-case
(~${PROXY_START_ATTEMPTS * 2}s); adjust the process of computing the timeout
text accordingly in the same function so logs accurately reflect how long we
actually waited.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@test/onboard-selection.test.ts`:
- Around line 973-997: Replace the many inline curl shim script bodies written
with fs.writeFileSync in the tests by calling the shared helper
writeOllamaToolCallingCurl(fakeBin) (or create a small parameterized variant
writeOllamaToolCallingCurl(fakeBin, opts) that accepts overrides such as body or
unauthenticated-port) instead of copying the full script; locate the duplicated
blocks that reference fakeBin, the inline script using
OLLAMA_CHAT_COMPLETIONS_TOOL_CALL_RESPONSE and the :11435 unauthenticated 401
logic, and refactor each to call the helper, adding a tiny per-test override
wrapper only when a test needs custom behavior.

---

Nitpick comments:
In `@src/lib/inference/ollama/proxy.ts`:
- Around line 295-301: The timeout message under the Ollama proxy startup
(around the block that logs error and returns false) understates real wait time
because each loop uses waitForPort + sleep; update the message in the function
that checks the proxy (referencing OLLAMA_PROXY_PORT and PROXY_START_ATTEMPTS)
to either report the cumulative elapsed time (capture start time before the loop
and compute elapsedMs to include in the error) or multiply attempts by ~2s to
reflect worst-case (~${PROXY_START_ATTEMPTS * 2}s); adjust the process of
computing the timeout text accordingly in the same function so logs accurately
reflect how long we actually waited.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c3c7089d-9f1a-423e-bc6b-17d8cec5954b

📥 Commits

Reviewing files that changed from the base of the PR and between cb9719e and b890909.

📒 Files selected for processing (4)
  • scripts/ollama-auth-proxy.js
  • src/lib/inference/ollama/proxy.ts
  • test/ollama-proxy-startup.test.ts
  • test/onboard-selection.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • scripts/ollama-auth-proxy.js
  • test/ollama-proxy-startup.test.ts

@yimoj

yimoj commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai resume

The duplicated Ollama curl shims flagged in the outside-diff comment have been consolidated into the shared writeOllamaToolCallingCurl(fakeBin) helper (now models the real auth proxy's unauthenticated-401 behavior), and test/onboard-selection.test.ts is back under its size budget (6929 lines, budget ratcheted). Please re-review the final state.

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

@yimoj Thanks for the update! Resuming reviews and kicking off a fresh pass over the final state of the PR.

✅ Action performed

Reviews resumed.


Full review finished.

@yimoj

yimoj commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

Note: review #4458158009 was on the superseded commit b890909cf. Its outside-diff suggestion to consolidate the duplicated Ollama curl shims is already implemented on the current head aa7065844 — the inline shims now call the shared writeOllamaToolCallingCurl(fakeBin) helper and test/onboard-selection.test.ts is back under its size budget (6929 lines). Please post a fresh review of the current head.

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

@yimoj Understood — I'll kick off a fresh full review against the current head (aa7065844). The consolidation of the Ollama curl shims into writeOllamaToolCallingCurl(fakeBin) and the restored size budget for test/onboard-selection.test.ts (6929 lines) are noted so those points won't be re-raised.

✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
test/onboard-selection.test.ts (1)

25-46: ⚡ Quick win

Consider consolidating remaining inline curl implementations.

Four inline fake curl scripts remain (hunks at lines 25-46, 2005-2024, 3500-3519, 3597-3616) with logic that appears functionally identical to the shared writeOllamaToolCallingCurl(fakeBin) helper used elsewhere in this file.

Per the PR objectives ("Consolidated duplicated Ollama curl shims into a shared writeOllamaToolCallingCurl(fakeBin) helper"), consider refactoring these remaining inline implementations to also call the shared helper. This would reduce duplication, ensure consistency, and simplify future maintenance.

If these inline versions must remain separate for test-specific reasons, please document why they differ from the shared helper.

Also applies to: 2005-2024, 3500-3519, 3597-3616

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/onboard-selection.test.ts` around lines 25 - 46, In
test/onboard-selection.test.ts there are four remaining inline fake curl script
implementations that duplicate the logic in the shared helper
writeOllamaToolCallingCurl(fakeBin); replace each inline script with a call to
writeOllamaToolCallingCurl(fakeBin) (passing the same fakeBin used by the
original inline fakes) to remove duplication and ensure consistent behavior, and
if any inline version must keep different behavior, add a short comment above it
explaining why it diverges from writeOllamaToolCallingCurl and what specific
differences are required for that test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/ollama-proxy-startup.test.ts`:
- Around line 213-253: The test currently sets reclaimed true on any mocked
sleep and never asserts killCommands, so change the mocked reclaim logic to set
reclaimed only when the actual expected kill command targeting PID 4242 is
pushed to killCommands (e.g., when the string/command equals the expected kill
for PID 4242), and add an assertion after the run that killCommands contains
that expected command (so the test fails if the reclaim/kill action was
skipped). Ensure references to the test variables killCommands and reclaimed and
the PID 4242 are used so the check is tied directly to the emitted kill command
rather than any sleep timing.
- Around line 120-180: The tests currently vary the mocked curl responses but do
not assert that startOllamaAuthProxy() actually issued both readiness probes;
update the test to verify the curl mock was called twice and that one call
matched the unauthenticated probe (expected 401) and the other matched the
authenticated probe (expected 200). Concretely, after invoking
startOllamaAuthProxy() assert the curl mock (the mocked exec/spawn used in the
test) call count is 2 and inspect the call arguments (or calledWith entries) to
confirm one request targeted the unauthenticated endpoint/headers and one
targeted the authenticated endpoint/headers so regressions that skip the 401
check will fail.

In `@test/onboard-selection.test.ts`:
- Around line 2018-2023: The fake curl handlers in
test/onboard-selection.test.ts write to $outfile unconditionally which crashes
when -o is not passed; update each fake curl implementation (the occurrences
that reference the outfile variable at or near lines 2022, 3517, 3614 and the
ranges 3513-3518, 3610-3615) to guard the redirection by checking that outfile
is non-empty before writing the body (use the same safe pattern used in the
earlier hunk around lines 44-45), i.e. only perform the `> "$outfile"` write
when outfile is set; alternatively, if you prefer, assert in the tests that
callers always pass -o and document that precondition.

---

Nitpick comments:
In `@test/onboard-selection.test.ts`:
- Around line 25-46: In test/onboard-selection.test.ts there are four remaining
inline fake curl script implementations that duplicate the logic in the shared
helper writeOllamaToolCallingCurl(fakeBin); replace each inline script with a
call to writeOllamaToolCallingCurl(fakeBin) (passing the same fakeBin used by
the original inline fakes) to remove duplication and ensure consistent behavior,
and if any inline version must keep different behavior, add a short comment
above it explaining why it diverges from writeOllamaToolCallingCurl and what
specific differences are required for that test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7c2179a4-466e-446b-98a8-3823aed15d9a

📥 Commits

Reviewing files that changed from the base of the PR and between 784f573 and aa70658.

📒 Files selected for processing (5)
  • ci/test-file-size-budget.json
  • scripts/ollama-auth-proxy.js
  • src/lib/inference/ollama/proxy.ts
  • test/ollama-proxy-startup.test.ts
  • test/onboard-selection.test.ts

Comment thread test/ollama-proxy-startup.test.ts
Comment thread test/ollama-proxy-startup.test.ts
Comment thread test/onboard-selection.test.ts Outdated
@yimoj yimoj force-pushed the fix/4820-ollama-proxy-startup branch 2 times, most recently from 166d6ae to 3eb100a Compare June 9, 2026 12:00

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/ollama-proxy-startup.test.ts`:
- Around line 95-99: The runner.runCapture stub is too generic for lsof checks
and will incorrectly accept IPv6-only listeners; update the stub used in the
tests (runner.runCapture) to distinguish IPv4-scoped lsof queries from
generic/IPv6 ones by matching the IPv4-specific query string (e.g., the "-i4" or
"127.0.0.1:11435" style token) and only return the "2222" owner for that IPv4
query, add a separate case that returns an IPv6-only owner string (e.g., showing
an "[::1]:11435" or other IPv6 output) to simulate an IPv6 listener, and keep
the existing ps -p 2222 response ("/usr/bin/python3 -m http.server 11435") so
startup still succeeds when appropriate; apply the same change pattern to the
other runCapture stubs referenced in the comment.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1f2527fb-42e7-44ee-927b-9de69c7d63e9

📥 Commits

Reviewing files that changed from the base of the PR and between 166d6ae and 3eb100a.

📒 Files selected for processing (5)
  • ci/test-file-size-budget.json
  • scripts/ollama-auth-proxy.js
  • src/lib/inference/ollama/proxy.ts
  • test/ollama-proxy-startup.test.ts
  • test/onboard-selection.test.ts
✅ Files skipped from review due to trivial changes (1)
  • ci/test-file-size-budget.json
🚧 Files skipped from review as they are similar to previous changes (3)
  • scripts/ollama-auth-proxy.js
  • test/onboard-selection.test.ts
  • src/lib/inference/ollama/proxy.ts

Comment thread test/ollama-proxy-startup.test.ts
@yimoj yimoj force-pushed the fix/4820-ollama-proxy-startup branch from 3eb100a to a23086d Compare June 9, 2026 12:12
@wscurran wscurran added area: local-models Local model providers, downloads, launch, or connectivity area: onboarding Onboarding FSM, provider setup, sandbox launch, or first-run flow bug-fix PR fixes a bug or regression provider: ollama Ollama local model provider behavior labels Jun 9, 2026
@wscurran

wscurran commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

@yimoj yimoj force-pushed the fix/4820-ollama-proxy-startup branch 2 times, most recently from e19dd45 to bd1226f Compare June 10, 2026 02:33
@yimoj

yimoj commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai resume

@yimoj yimoj force-pushed the fix/4820-ollama-proxy-startup branch from bd1226f to 6c23d4b Compare June 10, 2026 02:45
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Reviews resumed.

@yimoj yimoj added the v0.0.63 Release target label Jun 10, 2026
@yimoj yimoj force-pushed the fix/4820-ollama-proxy-startup branch 2 times, most recently from d76897b to 3df45b1 Compare June 10, 2026 06:40
…tartup

Local Ollama onboarding aborted at provider setup with a generic
"Ollama auth proxy failed to start on :11435" — no conflict owner, no
retry, and no remediation — when the port was occupied or the proxy was
slow to bind.

- Pre-check the proxy port (IPv4-scoped, since the proxy binds 0.0.0.0)
  and, on a real conflict, name the owning PID + command and tell the user
  to stop it or export NEMOCLAW_OLLAMA_PROXY_PORT for all NemoClaw commands
- Poll for readiness with backoff (slow hosts / post-systemd-restart
  window) instead of a single 2s probe, and verify the listener is our
  proxy holding the token before reporting success (rejects unauth 401,
  accepts the token) so we never persist a token for a foreign listener
- Diagnose a failed spawn distinctly: name a dual-stack IPv6 blocker via a
  broad lsof scope, else report "exited during startup"
- Handle EADDRINUSE in the proxy script so it exits cleanly instead of
  crashing with an uncaught exception
- Probe the proxy via 127.0.0.1 (not localhost) so an IPv6-only squatter
  cannot misdirect the health check; redact owner command lines before
  printing them
- Add startup tests: port conflict, stale-proxy reclaim, slow start,
  spawn failure

Fixes NVIDIA#4820

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
@yimoj yimoj force-pushed the fix/4820-ollama-proxy-startup branch from 3df45b1 to 355e627 Compare June 10, 2026 08:17
@prekshivyas prekshivyas self-assigned this Jun 10, 2026

@prekshivyas prekshivyas left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed (code + security checklist). Hardens host-side startOllamaAuthProxy() / ollama-auth-proxy.js with named port-conflict diagnostics, IPv4-scoped preflight, retry/backoff readiness polling, and token-based ownership verification so the Ollama onboarding path either succeeds or fails with actionable detail (#4820).

Approve. Security-conscious throughout: token is crypto.randomBytes(24), persisted only after provider confirmation; lsof/ps/curl invoked via spawnSync array args (no shell); curl pinned to 127.0.0.1 with --max-time; PIDs parseInt+positive-guarded before any kill, and kill is gated on isOllamaProxyProcess (never kills foreign PIDs). The readiness proof requires both unauth→401 and token→non-401 — a genuine TOCTOU hardening so a token is never persisted for a foreign listener that won an EADDRINUSE race. Security: all categories pass.

Non-blocking nits:

  • proxy.ts:195 — timeout message says "within ${PROXY_START_ATTEMPTS}s" (≈12s), but each unbound iteration is ~1s probe + 1s sleep, so real worst case is ~24s. Reword to "after N attempts" or compute the true budget.
  • proxy.ts:96 — comment implies foreign argv is fully redacted; redact() is best-effort (pattern/URL based), so an opaque --token=<x> could still print. Soften wording.

Tests strong: ollama-proxy-startup.test.ts covers all terminal paths and asserts both halves of the readiness proof run.

@cv cv merged commit 2cde63f into NVIDIA:main Jun 10, 2026
34 checks passed
miyoungc added a commit that referenced this pull request Jun 11, 2026
## Summary
- Add the v0.0.63 release-note section using the published development
note as source context.
- Update source docs for sandbox recovery, OpenClaw config restore
safety, managed vLLM selection, Slack Socket Mode conflict handling, and
host diagnostics.
- Refresh generated `nemoclaw-user-*` skills from the updated Fern MDX
docs.
- Update the release-doc refresh skill so post-release docs for version
`n` look up the matching announcement discussion and use the `n+1` patch
release label.
- Fix CLI/docs parity by avoiding a `--from <Dockerfile>` flag mention
inside the `upgrade-sandboxes` command section.

## Source summary
- #5034 -> `docs/reference/troubleshooting.mdx`,
`docs/about/release-notes.mdx`: Document safer stale-sandbox recovery
through `rebuild --yes` before recreating from scratch.
- #5091 -> `docs/reference/troubleshooting.mdx`,
`docs/about/release-notes.mdx`: Document Docker-driver post-reboot
recovery from OpenShell container labels.
- #5101, #5174, #5177 -> `docs/manage-sandboxes/backup-restore.mdx`,
`docs/about/release-notes.mdx`: Document OpenClaw `openclaw.json`
preservation, merge behavior, and fail-safe restore handling.
- #5102 -> `docs/reference/commands.mdx`,
`docs/reference/commands-nemohermes.mdx`,
`docs/manage-sandboxes/lifecycle.mdx`, `docs/about/release-notes.mdx`:
Document `upgrade-sandboxes` image-fingerprint drift detection.
- #4201 -> `docs/reference/troubleshooting.mdx`,
`docs/about/release-notes.mdx`: Document the installer diagnostic for
unexpected Docker daemon access outside the `docker` group.
- #5038 -> `docs/inference/inference-options.mdx`,
`docs/inference/use-local-inference.mdx`,
`docs/about/release-notes.mdx`: Document the interactive managed-vLLM
model picker and non-interactive override behavior.
- #5040, #5041 -> `docs/reference/troubleshooting.mdx`,
`docs/about/release-notes.mdx`: Document Ollama auth-proxy recovery and
host DNS preflight diagnostics.
- #4986, #5039 -> `docs/manage-sandboxes/messaging-channels.mdx`,
`docs/about/release-notes.mdx`: Document Slack validation and duplicate
Slack Socket Mode sandbox handling.
- #4981, #5168 -> `docs/about/release-notes.mdx`: Capture Hermes gateway
secret-guard and wrapped-argv startup hardening in the release surface.
- Follow-up ->
`.agents/skills/nemoclaw-contributor-update-docs/SKILL.md`: Record the
post-release docs workflow, discussion-announcement lookup, and
next-patch release label rule.
- Follow-up -> `docs/reference/commands.mdx`,
`docs/reference/commands-nemohermes.mdx`: Reword custom Dockerfile
sandbox text so CLI parity does not treat `--from` as an
`upgrade-sandboxes` flag.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `npm run docs`
- `npm run build:cli`
- `bash test/e2e/e2e-cloud-experimental/check-docs.sh --only-cli`
- Skip-term scan for `docs/.docs-skip` blocked terms across generated
user skills

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Enhanced local inference setup with interactive model selection
prompts and environment variable overrides
* Improved sandbox upgrade detection using build fingerprints and
version checks
* Clarified configuration restore behavior preserving user settings
during rebuild/restore
  * Added gateway authentication as fifth security layer
  * Expanded Slack messaging validation with live credential checking
* Enhanced troubleshooting guidance for Docker access, DNS issues, and
sandbox recovery
* Updated release notes for v0.0.63 featuring sandbox recovery and
inference improvements

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: local-models Local model providers, downloads, launch, or connectivity area: onboarding Onboarding FSM, provider setup, sandbox launch, or first-run flow bug-fix PR fixes a bug or regression provider: ollama Ollama local model provider behavior v0.0.63 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Ubuntu 24.04][Onboard] local Ollama onboard fails auth proxy failed to start on :11435

4 participants