fix(onboard): diagnose Ollama auth-proxy port conflicts and recover startup#5040
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (5)
💤 Files with no reviewable changes (1)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughHardens Ollama auth-proxy startup: adds an HTTP server error handler, IPv4-scoped port-owner inspection with redacted cmdlines, staged spawn with token-bound readiness checks against 127.0.0.1, comprehensive startup tests, and standardized fake curl shims for onboarding tests. ChangesOllama Auth-Proxy Startup Reliability
Sequence DiagramsequenceDiagram
participant StartOllamaAuthProxy
participant LsofPs as lsof/ps
participant SpawnedAuthProxy
participant HTTPProbe
StartOllamaAuthProxy->>LsofPs: Inspect who owns proxy TCP port (IPv4)
alt Port bound by foreign PID
LsofPs-->>StartOllamaAuthProxy: Return owning PID + cmdline
StartOllamaAuthProxy->>StartOllamaAuthProxy: Redact command, print conflict & remediation
StartOllamaAuthProxy-->>StartOllamaAuthProxy: Exit failure
else Port free
StartOllamaAuthProxy->>SpawnedAuthProxy: Spawn proxy with generated token
StartOllamaAuthProxy->>StartOllamaAuthProxy: Backoff polling loop
StartOllamaAuthProxy->>HTTPProbe: Probe http://127.0.0.1:<port>/v1/models with token and without
alt Token accepted && unauthenticated rejected
HTTPProbe-->>StartOllamaAuthProxy: Confirmed auth-proxy owns port
StartOllamaAuthProxy-->>StartOllamaAuthProxy: Startup success
else Token checks fail
HTTPProbe-->>StartOllamaAuthProxy: Ownership not confirmed
StartOllamaAuthProxy-->>StartOllamaAuthProxy: Retry until timeout/failure
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
b890909 to
aa70658
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/onboard-selection.test.ts (1)
973-997:⚠️ Potential issue | 🟠 Major | ⚡ Quick winConsolidate duplicated Ollama curl shims to unblock CI size budget.
The file is currently failing CI size guardrails; repeated inline curl shims (same
--config+:11435unauthenticated401behavior) are the main growth driver. ReusewriteOllamaToolCallingCurl(fakeBin)(or a small parameterized variant) in these blocks instead of inlining each script body.Suggested direction
- fs.writeFileSync( - path.join(fakeBin, "curl"), - `#!/usr/bin/env bash -... repeated auth-proxy shim ... -`, - { mode: 0o755 }, - ); + writeOllamaToolCallingCurl(fakeBin);If a specific test needs custom behavior, keep a tiny per-test override helper layered on top of
writeOllamaToolCallingCurlrather than copying the full script again.Also applies to: 1212-1240, 2054-2078, 2162-2186, 2282-2306, 2410-2434, 2535-2559, 3645-3669, 3742-3766, 5217-5244, 5566-5594, 5757-5784, 5952-5979, 6140-6167, 6616-6643, 6783-6810, 6939-6966
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/onboard-selection.test.ts` around lines 973 - 997, Replace the many inline curl shim script bodies written with fs.writeFileSync in the tests by calling the shared helper writeOllamaToolCallingCurl(fakeBin) (or create a small parameterized variant writeOllamaToolCallingCurl(fakeBin, opts) that accepts overrides such as body or unauthenticated-port) instead of copying the full script; locate the duplicated blocks that reference fakeBin, the inline script using OLLAMA_CHAT_COMPLETIONS_TOOL_CALL_RESPONSE and the :11435 unauthenticated 401 logic, and refactor each to call the helper, adding a tiny per-test override wrapper only when a test needs custom behavior.Source: Pipeline failures
🧹 Nitpick comments (1)
src/lib/inference/ollama/proxy.ts (1)
295-301: 💤 Low valueMinor: timeout error message understates actual wait time.
Each polling iteration may wait up to ~2s (1s
waitForPort+ 1ssleep), so 12 attempts can take ~24s worst-case, not 12s as the message implies. Consider either adjusting the message or using a cumulative elapsed-time check instead of attempt count.This is cosmetic—the longer actual timeout is beneficial for slow hosts per the PR objective.
💡 Suggested clarification
- console.error( - ` Error: Ollama auth proxy did not become ready on :${OLLAMA_PROXY_PORT} within ${PROXY_START_ATTEMPTS}s.`, - ); + console.error( + ` Error: Ollama auth proxy did not become ready on :${OLLAMA_PROXY_PORT} after ${PROXY_START_ATTEMPTS} attempts.`, + );🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/inference/ollama/proxy.ts` around lines 295 - 301, The timeout message under the Ollama proxy startup (around the block that logs error and returns false) understates real wait time because each loop uses waitForPort + sleep; update the message in the function that checks the proxy (referencing OLLAMA_PROXY_PORT and PROXY_START_ATTEMPTS) to either report the cumulative elapsed time (capture start time before the loop and compute elapsedMs to include in the error) or multiply attempts by ~2s to reflect worst-case (~${PROXY_START_ATTEMPTS * 2}s); adjust the process of computing the timeout text accordingly in the same function so logs accurately reflect how long we actually waited.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@test/onboard-selection.test.ts`:
- Around line 973-997: Replace the many inline curl shim script bodies written
with fs.writeFileSync in the tests by calling the shared helper
writeOllamaToolCallingCurl(fakeBin) (or create a small parameterized variant
writeOllamaToolCallingCurl(fakeBin, opts) that accepts overrides such as body or
unauthenticated-port) instead of copying the full script; locate the duplicated
blocks that reference fakeBin, the inline script using
OLLAMA_CHAT_COMPLETIONS_TOOL_CALL_RESPONSE and the :11435 unauthenticated 401
logic, and refactor each to call the helper, adding a tiny per-test override
wrapper only when a test needs custom behavior.
---
Nitpick comments:
In `@src/lib/inference/ollama/proxy.ts`:
- Around line 295-301: The timeout message under the Ollama proxy startup
(around the block that logs error and returns false) understates real wait time
because each loop uses waitForPort + sleep; update the message in the function
that checks the proxy (referencing OLLAMA_PROXY_PORT and PROXY_START_ATTEMPTS)
to either report the cumulative elapsed time (capture start time before the loop
and compute elapsedMs to include in the error) or multiply attempts by ~2s to
reflect worst-case (~${PROXY_START_ATTEMPTS * 2}s); adjust the process of
computing the timeout text accordingly in the same function so logs accurately
reflect how long we actually waited.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c3c7089d-9f1a-423e-bc6b-17d8cec5954b
📒 Files selected for processing (4)
scripts/ollama-auth-proxy.jssrc/lib/inference/ollama/proxy.tstest/ollama-proxy-startup.test.tstest/onboard-selection.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- scripts/ollama-auth-proxy.js
- test/ollama-proxy-startup.test.ts
|
@coderabbitai resume The duplicated Ollama curl shims flagged in the outside-diff comment have been consolidated into the shared |
|
✅ Action performedReviews resumed. Full review finished. |
|
@coderabbitai review Note: review #4458158009 was on the superseded commit |
|
✅ Action performedFull review finished. |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
test/onboard-selection.test.ts (1)
25-46: ⚡ Quick winConsider consolidating remaining inline curl implementations.
Four inline fake curl scripts remain (hunks at lines 25-46, 2005-2024, 3500-3519, 3597-3616) with logic that appears functionally identical to the shared
writeOllamaToolCallingCurl(fakeBin)helper used elsewhere in this file.Per the PR objectives ("Consolidated duplicated Ollama curl shims into a shared writeOllamaToolCallingCurl(fakeBin) helper"), consider refactoring these remaining inline implementations to also call the shared helper. This would reduce duplication, ensure consistency, and simplify future maintenance.
If these inline versions must remain separate for test-specific reasons, please document why they differ from the shared helper.
Also applies to: 2005-2024, 3500-3519, 3597-3616
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/onboard-selection.test.ts` around lines 25 - 46, In test/onboard-selection.test.ts there are four remaining inline fake curl script implementations that duplicate the logic in the shared helper writeOllamaToolCallingCurl(fakeBin); replace each inline script with a call to writeOllamaToolCallingCurl(fakeBin) (passing the same fakeBin used by the original inline fakes) to remove duplication and ensure consistent behavior, and if any inline version must keep different behavior, add a short comment above it explaining why it diverges from writeOllamaToolCallingCurl and what specific differences are required for that test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/ollama-proxy-startup.test.ts`:
- Around line 213-253: The test currently sets reclaimed true on any mocked
sleep and never asserts killCommands, so change the mocked reclaim logic to set
reclaimed only when the actual expected kill command targeting PID 4242 is
pushed to killCommands (e.g., when the string/command equals the expected kill
for PID 4242), and add an assertion after the run that killCommands contains
that expected command (so the test fails if the reclaim/kill action was
skipped). Ensure references to the test variables killCommands and reclaimed and
the PID 4242 are used so the check is tied directly to the emitted kill command
rather than any sleep timing.
- Around line 120-180: The tests currently vary the mocked curl responses but do
not assert that startOllamaAuthProxy() actually issued both readiness probes;
update the test to verify the curl mock was called twice and that one call
matched the unauthenticated probe (expected 401) and the other matched the
authenticated probe (expected 200). Concretely, after invoking
startOllamaAuthProxy() assert the curl mock (the mocked exec/spawn used in the
test) call count is 2 and inspect the call arguments (or calledWith entries) to
confirm one request targeted the unauthenticated endpoint/headers and one
targeted the authenticated endpoint/headers so regressions that skip the 401
check will fail.
In `@test/onboard-selection.test.ts`:
- Around line 2018-2023: The fake curl handlers in
test/onboard-selection.test.ts write to $outfile unconditionally which crashes
when -o is not passed; update each fake curl implementation (the occurrences
that reference the outfile variable at or near lines 2022, 3517, 3614 and the
ranges 3513-3518, 3610-3615) to guard the redirection by checking that outfile
is non-empty before writing the body (use the same safe pattern used in the
earlier hunk around lines 44-45), i.e. only perform the `> "$outfile"` write
when outfile is set; alternatively, if you prefer, assert in the tests that
callers always pass -o and document that precondition.
---
Nitpick comments:
In `@test/onboard-selection.test.ts`:
- Around line 25-46: In test/onboard-selection.test.ts there are four remaining
inline fake curl script implementations that duplicate the logic in the shared
helper writeOllamaToolCallingCurl(fakeBin); replace each inline script with a
call to writeOllamaToolCallingCurl(fakeBin) (passing the same fakeBin used by
the original inline fakes) to remove duplication and ensure consistent behavior,
and if any inline version must keep different behavior, add a short comment
above it explaining why it diverges from writeOllamaToolCallingCurl and what
specific differences are required for that test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7c2179a4-466e-446b-98a8-3823aed15d9a
📒 Files selected for processing (5)
ci/test-file-size-budget.jsonscripts/ollama-auth-proxy.jssrc/lib/inference/ollama/proxy.tstest/ollama-proxy-startup.test.tstest/onboard-selection.test.ts
166d6ae to
3eb100a
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/ollama-proxy-startup.test.ts`:
- Around line 95-99: The runner.runCapture stub is too generic for lsof checks
and will incorrectly accept IPv6-only listeners; update the stub used in the
tests (runner.runCapture) to distinguish IPv4-scoped lsof queries from
generic/IPv6 ones by matching the IPv4-specific query string (e.g., the "-i4" or
"127.0.0.1:11435" style token) and only return the "2222" owner for that IPv4
query, add a separate case that returns an IPv6-only owner string (e.g., showing
an "[::1]:11435" or other IPv6 output) to simulate an IPv6 listener, and keep
the existing ps -p 2222 response ("/usr/bin/python3 -m http.server 11435") so
startup still succeeds when appropriate; apply the same change pattern to the
other runCapture stubs referenced in the comment.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1f2527fb-42e7-44ee-927b-9de69c7d63e9
📒 Files selected for processing (5)
ci/test-file-size-budget.jsonscripts/ollama-auth-proxy.jssrc/lib/inference/ollama/proxy.tstest/ollama-proxy-startup.test.tstest/onboard-selection.test.ts
✅ Files skipped from review due to trivial changes (1)
- ci/test-file-size-budget.json
🚧 Files skipped from review as they are similar to previous changes (3)
- scripts/ollama-auth-proxy.js
- test/onboard-selection.test.ts
- src/lib/inference/ollama/proxy.ts
3eb100a to
a23086d
Compare
e19dd45 to
bd1226f
Compare
|
@coderabbitai resume |
bd1226f to
6c23d4b
Compare
✅ Action performedReviews resumed. |
d76897b to
3df45b1
Compare
…tartup Local Ollama onboarding aborted at provider setup with a generic "Ollama auth proxy failed to start on :11435" — no conflict owner, no retry, and no remediation — when the port was occupied or the proxy was slow to bind. - Pre-check the proxy port (IPv4-scoped, since the proxy binds 0.0.0.0) and, on a real conflict, name the owning PID + command and tell the user to stop it or export NEMOCLAW_OLLAMA_PROXY_PORT for all NemoClaw commands - Poll for readiness with backoff (slow hosts / post-systemd-restart window) instead of a single 2s probe, and verify the listener is our proxy holding the token before reporting success (rejects unauth 401, accepts the token) so we never persist a token for a foreign listener - Diagnose a failed spawn distinctly: name a dual-stack IPv6 blocker via a broad lsof scope, else report "exited during startup" - Handle EADDRINUSE in the proxy script so it exits cleanly instead of crashing with an uncaught exception - Probe the proxy via 127.0.0.1 (not localhost) so an IPv6-only squatter cannot misdirect the health check; redact owner command lines before printing them - Add startup tests: port conflict, stale-proxy reclaim, slow start, spawn failure Fixes NVIDIA#4820 Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
3df45b1 to
355e627
Compare
prekshivyas
left a comment
There was a problem hiding this comment.
Reviewed (code + security checklist). Hardens host-side startOllamaAuthProxy() / ollama-auth-proxy.js with named port-conflict diagnostics, IPv4-scoped preflight, retry/backoff readiness polling, and token-based ownership verification so the Ollama onboarding path either succeeds or fails with actionable detail (#4820).
✅ Approve. Security-conscious throughout: token is crypto.randomBytes(24), persisted only after provider confirmation; lsof/ps/curl invoked via spawnSync array args (no shell); curl pinned to 127.0.0.1 with --max-time; PIDs parseInt+positive-guarded before any kill, and kill is gated on isOllamaProxyProcess (never kills foreign PIDs). The readiness proof requires both unauth→401 and token→non-401 — a genuine TOCTOU hardening so a token is never persisted for a foreign listener that won an EADDRINUSE race. Security: all categories pass.
Non-blocking nits:
proxy.ts:195— timeout message says "within${PROXY_START_ATTEMPTS}s" (≈12s), but each unbound iteration is ~1s probe + 1s sleep, so real worst case is ~24s. Reword to "after N attempts" or compute the true budget.proxy.ts:96— comment implies foreign argv is fully redacted;redact()is best-effort (pattern/URL based), so an opaque--token=<x>could still print. Soften wording.
Tests strong: ollama-proxy-startup.test.ts covers all terminal paths and asserts both halves of the readiness proof run.
## Summary - Add the v0.0.63 release-note section using the published development note as source context. - Update source docs for sandbox recovery, OpenClaw config restore safety, managed vLLM selection, Slack Socket Mode conflict handling, and host diagnostics. - Refresh generated `nemoclaw-user-*` skills from the updated Fern MDX docs. - Update the release-doc refresh skill so post-release docs for version `n` look up the matching announcement discussion and use the `n+1` patch release label. - Fix CLI/docs parity by avoiding a `--from <Dockerfile>` flag mention inside the `upgrade-sandboxes` command section. ## Source summary - #5034 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document safer stale-sandbox recovery through `rebuild --yes` before recreating from scratch. - #5091 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document Docker-driver post-reboot recovery from OpenShell container labels. - #5101, #5174, #5177 -> `docs/manage-sandboxes/backup-restore.mdx`, `docs/about/release-notes.mdx`: Document OpenClaw `openclaw.json` preservation, merge behavior, and fail-safe restore handling. - #5102 -> `docs/reference/commands.mdx`, `docs/reference/commands-nemohermes.mdx`, `docs/manage-sandboxes/lifecycle.mdx`, `docs/about/release-notes.mdx`: Document `upgrade-sandboxes` image-fingerprint drift detection. - #4201 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document the installer diagnostic for unexpected Docker daemon access outside the `docker` group. - #5038 -> `docs/inference/inference-options.mdx`, `docs/inference/use-local-inference.mdx`, `docs/about/release-notes.mdx`: Document the interactive managed-vLLM model picker and non-interactive override behavior. - #5040, #5041 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document Ollama auth-proxy recovery and host DNS preflight diagnostics. - #4986, #5039 -> `docs/manage-sandboxes/messaging-channels.mdx`, `docs/about/release-notes.mdx`: Document Slack validation and duplicate Slack Socket Mode sandbox handling. - #4981, #5168 -> `docs/about/release-notes.mdx`: Capture Hermes gateway secret-guard and wrapped-argv startup hardening in the release surface. - Follow-up -> `.agents/skills/nemoclaw-contributor-update-docs/SKILL.md`: Record the post-release docs workflow, discussion-announcement lookup, and next-patch release label rule. - Follow-up -> `docs/reference/commands.mdx`, `docs/reference/commands-nemohermes.mdx`: Reword custom Dockerfile sandbox text so CLI parity does not treat `--from` as an `upgrade-sandboxes` flag. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `npm run docs` - `npm run build:cli` - `bash test/e2e/e2e-cloud-experimental/check-docs.sh --only-cli` - Skip-term scan for `docs/.docs-skip` blocked terms across generated user skills <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Enhanced local inference setup with interactive model selection prompts and environment variable overrides * Improved sandbox upgrade detection using build fingerprints and version checks * Clarified configuration restore behavior preserving user settings during rebuild/restore * Added gateway authentication as fifth security layer * Expanded Slack messaging validation with live credential checking * Enhanced troubleshooting guidance for Docker access, DNS issues, and sandbox recovery * Updated release notes for v0.0.63 featuring sandbox recovery and inference improvements <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Local Ollama onboarding aborted at provider setup with a generic
Ollama auth proxy failed to start on :11435— no conflict owner, no retry, no remediation. This hardens the host-side auth-proxy startup with precise port-conflict diagnostics, retry/backoff, and ownership verification, so the reporter's workflow either completes or fails with actionable detail.Related Issue
Fixes #4820
Changes
:11435and print the exact PID + (redacted) command line with actionable remediation — stop the process, or exportNEMOCLAW_OLLAMA_PROXY_PORTso every NemoClaw command uses the same port.0.0.0.0, so the pre-start abort only fires on an IPv4 conflict — an IPv6-only listener that can't actually conflict no longer triggers a false abort. A dual-stack:::PORTblocker (which does block IPv4) is still caught and named via the post-spawnEADDRINUSEpath (broad lsof scope).lsofis unavailable or loses theEADDRINUSErace.EADDRINUSEhandling inscripts/ollama-auth-proxy.js(exit 1 instead of an uncaught crash).127.0.0.1health probe (notlocalhost) so an IPv6-only squatter can't misdirect the check; this matches the file's other proxy probes.test/ollama-proxy-startup.test.ts): port conflict, stale-proxy reclaim, slow start, spawn failure.No
src/lib/onboard.tschanges (growth-neutral). No documented behavior changes — the remediation aligns with the existingNEMOCLAW_OLLAMA_PROXY_PORTdocs.Type of Change
Verification
npm run typecheck:clipassesnpm testpasses:ollama-proxy-startup,ollama-proxy-recovery,onboard-ollama-autostart,ollama-tools-capability,sandbox-connect-inference,onboard(pre-existing unrelated failures on this host are environment timing flakes, confirmed present on the clean base)E2E evidence (host-side
startOllamaAuthProxy()via the compileddist)Built the CLI (
npm run build:cli) and drove the real host-side proxy startup against live listeners on a Linux host with Ollama on127.0.0.1:11434.Before the fix — squatting a non-NemoClaw process on
:11435, then running the Ollama provider path produced the reporter's generic error:After the fix — same harness, three scenarios:
Dual-stack
:::11435blockers (which lsof reports as IPv6 but do block our IPv4 bind) are caught post-spawn via the EADDRINUSE path and named with the broad lsof scope.CI on this head (
aa7065844) is green: 39 success / 0 failure (see the PR Checks tab —build-typecheck,cli-test-shards,static-checks,codebase-growth-guardrailsall pass).Signed-off-by: Yimo Jiang yimoj@nvidia.com
🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests
Chores