fix(onboard): retry and fallback for container reachability check#2453
Conversation
) Add timeout, retry with backoff, host-side fallback, and improved diagnostics to the container reachability check during onboarding. On Brev cloud GPU instances, Docker's --add-host host-gateway does not route correctly, causing a false-negative failure that blocks onboarding even though the Ollama auth proxy is healthy on the host. The sandbox uses a different networking stack (k3s CoreDNS + NodeHosts) and would succeed. Changes: - Add --connect-timeout 5 --max-time 10 to container curl command - Retry the container check up to 3 times with 2s backoff - Collect HTTP status and host-gateway resolution diagnostics on failure - Fall back to host-side health check before exiting fatally - Replace misleading error messages with actionable diagnostics - Add diagnostic field to ValidationResult interface Fixes #2425
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis pull request enhances local inference provider validation with retry logic, explicit timeouts, and diagnostic gathering capabilities. Container reachability checks now retry up to 3 times with 2-second delays. The onboarding flow is updated to tolerate certain validation failures by checking host-side reachability and auth proxy health, with a new HTTP-based health check for the Ollama proxy. Ollama proxy lifecycle functions are reorganized into a dedicated module. Changes
Sequence Diagram(s)sequenceDiagram
participant Client as NemoClaw CLI
participant Onboard as setupInference
participant Validator as validateLocalProvider
participant HostCheck as Host Health Check
participant ProxyCheck as Proxy Health Check
Client->>Onboard: setupInference(provider='ollama-local')
Onboard->>Validator: validateLocalProvider(provider, runCaptureImpl, sleepFn)
loop Retry up to 3 times
Validator->>Validator: Docker run with timeouts<br/>(--connect-timeout 5, --max-time 10)
alt Container reachable
Validator-->>Onboard: ValidationResult { success: true }
else Container unreachable
Validator->>Validator: sleep 2 seconds
end
end
alt All retries failed
Validator->>Validator: Gather diagnostics<br/>(HTTP status, /etc/hosts)
Validator-->>Onboard: ValidationResult { success: false, diagnostic }
Onboard->>ProxyCheck: isProxyHealthy()
ProxyCheck->>ProxyCheck: HTTP probe /api/tags
alt Proxy healthy
ProxyCheck-->>Onboard: true
Onboard->>Client: Downgrade to warning<br/>Continue setup
else Proxy unhealthy
ProxyCheck-->>Onboard: false
Onboard->>Client: Log error<br/>Exit setup
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Review rate limit: 9/10 reviews remaining, refill in 6 minutes. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 5517-5545: The current bypass logic relies on a ps check
(isOllamaProxyProcess(loadPersistedProxyPid())) and an Ollama /api/tags health
check (getLocalProviderHealthCheck + runCapture), which can miss a wedged/stale
auth proxy; instead, before deciding to continue or exit, call
ensureOllamaAuthProxy() (to start/restart the proxy if needed) and then perform
an actual request against the proxy health endpoint (not Ollama's /api/tags)
using the same runCapture helper; replace the proxyAlive/hostResponding logic
with a probe that first invokes ensureOllamaAuthProxy(), then hits the proxy
health URL and only treats the proxy as healthy if that probe returns success
(fall back to existing diagnostic logging and process.exit(1) if the proxy
endpoint remains unhealthy).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 76683991-0e05-4c63-abbb-78954b4ed90d
📒 Files selected for processing (3)
src/lib/local-inference.test.tssrc/lib/local-inference.tssrc/lib/onboard.ts
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/lib/onboard.ts (1)
5521-5533:⚠️ Potential issue | 🟠 MajorProbe the auth proxy endpoint, not just the Ollama process.
This still keys the bypass off “PID exists + Ollama
/api/tagsresponds”. A wedgedollama-auth-proxy.json portOLLAMA_PROXY_PORTwill pass this gate, and the laterensureOllamaAuthProxy()call also only checks the PID, so onboarding can continue with a dead proxy. Probe the proxy endpoint you actually hand to OpenShell afterensureOllamaAuthProxy()before downgrading this to a warning.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 5521 - 5533, The current bypass only checks PID + host provider health (getLocalProviderHealthCheck/runCapture) but not the actual auth-proxy HTTP endpoint, so replace or augment hostResponding to perform an HTTP probe against the real auth proxy URL/port used by OpenShell (the same endpoint that ensureOllamaAuthProxy registers, e.g. http://localhost:OLLAMA_PROXY_PORT/<health-or-api-path>) and treat success/failure from that probe when deciding whether to downgrade to a warning; keep the existing isOllamaProxyProcess(loadPersistedProxyPid()) check but require the proxy HTTP probe to succeed (using the same path used by ensureOllamaAuthProxy) before allowing the bypass.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 5475-5476: The health check currently interprets the string result
of runCapture() as a boolean (hostResponding = !!runCapture(...)), which can be
truthy even when curl failed; instead call runCapture and inspect its exit
status (e.g., the returned object's exitCode or status) and set hostResponding =
true only when that exit code === 0 (or use the appropriate runCapture variant
that returns an exit/status code). Update the hostCheck/hostResponding
assignment using getLocalProviderHealthCheck and runCapture so you check the
command's exit code, and apply the same change to the other occurrence at the
region around the other hostResponding usage (the block referenced at lines
~5522-5523).
---
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 5521-5533: The current bypass only checks PID + host provider
health (getLocalProviderHealthCheck/runCapture) but not the actual auth-proxy
HTTP endpoint, so replace or augment hostResponding to perform an HTTP probe
against the real auth proxy URL/port used by OpenShell (the same endpoint that
ensureOllamaAuthProxy registers, e.g.
http://localhost:OLLAMA_PROXY_PORT/<health-or-api-path>) and treat
success/failure from that probe when deciding whether to downgrade to a warning;
keep the existing isOllamaProxyProcess(loadPersistedProxyPid()) check but
require the proxy HTTP probe to succeed (using the same path used by
ensureOllamaAuthProxy) before allowing the bypass.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c8c5e1a9-256d-4d6b-b46f-d800642ac37e
📒 Files selected for processing (1)
src/lib/onboard.ts
…roxyHealthy (#767) Extract the entire Ollama auth proxy subsystem (token persistence, PID tracking, process spawn/kill, health probing) from onboard.ts into a dedicated ollama-proxy.ts module. Add isProxyHealthy() which performs an actual HTTP probe against the proxy endpoint (not just a PID check), addressing CodeRabbit's review feedback about wedged/stale proxy processes. Simplify the setupInference fallback to use isProxyHealthy() instead of inline PID + host check logic. - Reduces onboard.ts by 161 lines (7910 → 7749) - New ollama-proxy.ts: 223 lines - Advances #767 (split onboarding provider logic)
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (2)
src/lib/onboard.ts (2)
5360-5384:⚠️ Potential issue | 🟠 MajorRepair or reprobe the Ollama auth proxy before aborting.
This branch exits on
isProxyHealthy()beforeensureOllamaAuthProxy()runs at Line 5389. A stale or missing proxy that is recoverable here still becomes a hard false-negative.Suggested fix
+ let proxyReady = false; if (!validation.ok) { // The container reachability check uses Docker's --add-host host-gateway, // which may not work on all Docker configurations (e.g., Brev, rootless). // The real sandbox uses k3s CoreDNS + NodeHosts — a different path. // Probe the actual auth proxy endpoint before giving up. - if (!isWsl() && isProxyHealthy()) { + proxyReady = !isWsl() && ensureOllamaAuthProxy() && isProxyHealthy(); + if (proxyReady) { console.warn(` ⚠ ${validation.message}`); if (validation.diagnostic) { console.warn(` Diagnostic: ${validation.diagnostic}`); @@ - if (!isWsl()) { - ensureOllamaAuthProxy(); + if (!isWsl()) { + if (!proxyReady) ensureOllamaAuthProxy(); const proxyToken = getOllamaProxyToken();🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 5360 - 5384, The code exits immediately when isProxyHealthy() is false even though a recoverable proxy could be created; modify the branch handling around isProxyHealthy() so that when the proxy appears unhealthy you call ensureOllamaAuthProxy() (and recheck with isProxyHealthy()) before logging the fatal error and calling process.exit(1); in practice, invoke ensureOllamaAuthProxy() when !isProxyHealthy(), handle any thrown error or returned failure, re-evaluate validation (and validation.diagnostic) and only log the error and call process.exit(1) if the reprobe/recovery still reports an unhealthy proxy.
5318-5319:⚠️ Potential issue | 🟠 MajorUse curl exit status for the vLLM host-health bypass.
Line 5319 still converts
runCapture()output into a boolean. A failedcurl -sfcan still leave captured text behind, which would incorrectly downgrade a real host failure into the warning-and-continue path.Suggested fix
- const hostResponding = hostCheck ? !!runCapture(hostCheck, { ignoreError: true }) : false; + const hostResponding = hostCheck + ? run(hostCheck, { ignoreError: true, suppressOutput: true }).status === 0 + : false;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 5318 - 5319, The host-health check currently sets hostResponding by coercing runCapture(...) output to boolean, which can be non-empty even when curl failed; update the logic in onboard.ts where getLocalProviderHealthCheck(provider) is assigned to hostCheck so that you inspect the command exit status instead of the captured stdout. Call the runner in a way that returns the exit code (e.g., use the run variant or runCapture result that includes an exitCode/exitStatus) and set hostResponding = hostCheck ? (result.exitCode === 0) : false, keeping the ignoreError behavior but basing success on the process exit code; reference getLocalProviderHealthCheck, runCapture, hostCheck, and hostResponding to locate and change the code.
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
5316-5384: Extract the local-provider fallback path into a helper.These new vLLM/Ollama branches duplicate the same diagnostic-and-bypass flow inside an already complexity-suppressed function. Pulling that decision into a small helper would keep
setupInference()easier to maintain.As per coding guidelines,
{src,nemoclaw/src,scripts}/**/*.{js,ts,tsx}: TypeScript files must maintain cyclomatic complexity limit of 20, ratcheting down to 15. Enforced by ESLint.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 5316 - 5384, The duplicated diagnostic-and-bypass logic in the vllm-local and ollama-local branches should be refactored into a small helper to reduce cyclomatic complexity: create a helper (e.g., ensureLocalProviderOrExit or handleLocalProviderFallback) that takes the provider name and the validation result and encapsulates the existing checks (for vllm-local: getLocalProviderHealthCheck + runCapture; for ollama-local: isWsl + isProxyHealthy) and the three logging paths (warn + optional diagnostic + bypass message, or error + optional diagnostic + platform-specific hint + process.exit). Replace the duplicated blocks around validateLocalProvider in setupInference (or where validateLocalProvider is called) to call this helper before continuing to upsertProvider / runOpenshell.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/ollama-proxy.ts`:
- Around line 145-165: The code sets ollamaProxyToken before verifying the proxy
is alive; update startOllamaAuthProxy so it generates a local proxyToken, calls
spawnOllamaAuthProxy(pid), waits and checks isOllamaProxyProcess(pid), and only
if the liveness check passes assign ollamaProxyToken = proxyToken (and persist
if needed); if the check fails return false without mutating ollamaProxyToken so
getOllamaProxyToken/ensureOllamaAuthProxy won’t return a stale token; apply the
same change pattern where ollamaProxyToken is currently set early (references:
startOllamaAuthProxy, spawnOllamaAuthProxy, isOllamaProxyProcess,
getOllamaProxyToken, ensureOllamaAuthProxy).
- Around line 207-222: In isProxyHealthy(), don't early-return false when
isOllamaProxyProcess(pid) is falsy; instead store that boolean (e.g.,
hasValidPid = isOllamaProxyProcess(pid)), always perform the HTTP probe built
with loadPersistedProxyToken() and runCapture(..., { ignoreError: true }), and
then return true if the probe produced output, otherwise return hasValidPid (so
a successful HTTP probe wins even if the PID file is missing/stale). Update
references in the function to use the new hasValidPid local variable and remove
the early return.
---
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 5360-5384: The code exits immediately when isProxyHealthy() is
false even though a recoverable proxy could be created; modify the branch
handling around isProxyHealthy() so that when the proxy appears unhealthy you
call ensureOllamaAuthProxy() (and recheck with isProxyHealthy()) before logging
the fatal error and calling process.exit(1); in practice, invoke
ensureOllamaAuthProxy() when !isProxyHealthy(), handle any thrown error or
returned failure, re-evaluate validation (and validation.diagnostic) and only
log the error and call process.exit(1) if the reprobe/recovery still reports an
unhealthy proxy.
- Around line 5318-5319: The host-health check currently sets hostResponding by
coercing runCapture(...) output to boolean, which can be non-empty even when
curl failed; update the logic in onboard.ts where
getLocalProviderHealthCheck(provider) is assigned to hostCheck so that you
inspect the command exit status instead of the captured stdout. Call the runner
in a way that returns the exit code (e.g., use the run variant or runCapture
result that includes an exitCode/exitStatus) and set hostResponding = hostCheck
? (result.exitCode === 0) : false, keeping the ignoreError behavior but basing
success on the process exit code; reference getLocalProviderHealthCheck,
runCapture, hostCheck, and hostResponding to locate and change the code.
---
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 5316-5384: The duplicated diagnostic-and-bypass logic in the
vllm-local and ollama-local branches should be refactored into a small helper to
reduce cyclomatic complexity: create a helper (e.g., ensureLocalProviderOrExit
or handleLocalProviderFallback) that takes the provider name and the validation
result and encapsulates the existing checks (for vllm-local:
getLocalProviderHealthCheck + runCapture; for ollama-local: isWsl +
isProxyHealthy) and the three logging paths (warn + optional diagnostic + bypass
message, or error + optional diagnostic + platform-specific hint +
process.exit). Replace the duplicated blocks around validateLocalProvider in
setupInference (or where validateLocalProvider is called) to call this helper
before continuing to upsertProvider / runOpenshell.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: cd2c8e18-d861-4413-b058-63a9501e1fe0
📒 Files selected for processing (3)
src/lib/ollama-proxy.tssrc/lib/onboard.tssrc/nemoclaw.ts
✅ Files skipped from review due to trivial changes (1)
- src/nemoclaw.ts
- startOllamaAuthProxy: set in-memory token only after confirming proxy is alive, preventing stale tokens on failed starts - isProxyHealthy: don't early-return false on PID check failure; always perform HTTP probe as the authoritative signal, falling back to PID only when the probe transiently fails - ollama-local branch: call ensureOllamaAuthProxy() before isProxyHealthy() so a stale/missing proxy is recovered before we decide to abort onboarding - vllm-local branch: use run() exit status instead of coercing runCapture() output to boolean for host health bypass Signed-off-by: J. Yaunches <jyaunches@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/lib/ollama-proxy.ts (1)
145-168:⚠️ Potential issue | 🟠 MajorFailed restarts can still publish a dead proxy token.
Line 188 assigns
ollamaProxyTokenbefore the restarted child is proven alive, and the failed-start path still leaves the newly persisted PID behind. If the proxy exits immediately,getOllamaProxyToken()can keep returning a token for a proxy that is no longer serving.Suggested fix
export function startOllamaAuthProxy(): boolean { // eslint-disable-next-line `@typescript-eslint/no-require-imports` const crypto = require("crypto"); killStaleProxy(); + ollamaProxyToken = null; const proxyToken = crypto.randomBytes(24).toString("hex"); // Don't persist yet — wait until provider is confirmed in setupInference. // If the user backs out to a different provider, the token stays in memory // only and is discarded. const pid = spawnOllamaAuthProxy(proxyToken); sleepSeconds(1); if (!isOllamaProxyProcess(pid)) { + clearPersistedProxyPid(); console.error(` Error: Ollama auth proxy failed to start on :${OLLAMA_PROXY_PORT}`); console.error(` Containers will not be able to reach Ollama without the proxy.`); console.error( ` Check if port ${OLLAMA_PROXY_PORT} is already in use: lsof -ti :${OLLAMA_PROXY_PORT}`, ); @@ export function ensureOllamaAuthProxy(): void { // Try to load persisted token first — if none, this isn't an Ollama setup. const token = loadPersistedProxyToken(); if (!token) return; @@ // Proxy not running — restart it with the persisted token. killStaleProxy(); - ollamaProxyToken = token; - spawnOllamaAuthProxy(token); + const restartedPid = spawnOllamaAuthProxy(token); sleepSeconds(1); + if (isOllamaProxyProcess(restartedPid)) { + ollamaProxyToken = token; + } else { + clearPersistedProxyPid(); + ollamaProxyToken = null; + } }Also applies to: 175-191
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/ollama-proxy.ts` around lines 145 - 168, In startOllamaAuthProxy(), avoid persisting the new proxy PID or setting ollamaProxyToken until the spawned process is verified alive: generate proxyToken, call spawnOllamaAuthProxy(proxyToken) but do NOT persist the PID inside spawn/killStaleProxy yet; after sleepSeconds and a successful isOllamaProxyProcess(pid) check, then persist the PID (wherever you currently store it) and assign ollamaProxyToken; if the check fails, ensure you clear any persisted PID left by the restart path and do NOT set ollamaProxyToken so getOllamaProxyToken() cannot return a token for a dead proxy (update spawnOllamaAuthProxy/killStaleProxy behavior if they currently persist immediately).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 5318-5323: The host-side vLLM fallback probe can hang because
getLocalProviderHealthCheck("vllm-local") returns a plain "curl -sf ..."
command; update the probe to enforce a timeout by either (A) modifying
getLocalProviderHealthCheck to include curl timing flags (e.g. --max-time and/or
--connect-timeout) for the "vllm-local" case so the returned hostCheck command
always contains a bounded timeout, or (B) if you prefer to keep
getLocalProviderHealthCheck unchanged, detect the vllm-local provider before
calling run() and wrap the hostCheck invocation with a timeout tool (e.g. prefix
with timeout or add equivalent flags) so run(hostCheck, { ignoreError: true,
suppressOutput: true }).status is decided within a bounded time; ensure the
unique symbols involved are hostCheck, hostResponding,
getLocalProviderHealthCheck, and run.
---
Duplicate comments:
In `@src/lib/ollama-proxy.ts`:
- Around line 145-168: In startOllamaAuthProxy(), avoid persisting the new proxy
PID or setting ollamaProxyToken until the spawned process is verified alive:
generate proxyToken, call spawnOllamaAuthProxy(proxyToken) but do NOT persist
the PID inside spawn/killStaleProxy yet; after sleepSeconds and a successful
isOllamaProxyProcess(pid) check, then persist the PID (wherever you currently
store it) and assign ollamaProxyToken; if the check fails, ensure you clear any
persisted PID left by the restart path and do NOT set ollamaProxyToken so
getOllamaProxyToken() cannot return a token for a dead proxy (update
spawnOllamaAuthProxy/killStaleProxy behavior if they currently persist
immediately).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8259b168-d6ca-45c0-bb97-4fcbe5766478
📒 Files selected for processing (2)
src/lib/ollama-proxy.tssrc/lib/onboard.ts
| const hostCheck = getLocalProviderHealthCheck(provider); | ||
| // Use run() and check exit status rather than coercing runCapture() output | ||
| // to boolean — curl -sf can leave output even on failure in edge cases. | ||
| const hostResponding = hostCheck | ||
| ? run(hostCheck, { ignoreError: true, suppressOutput: true }).status === 0 | ||
| : false; |
There was a problem hiding this comment.
Bound the host-side vLLM fallback probe.
getLocalProviderHealthCheck("vllm-local") is still a plain curl -sf ... command, so this new fallback can hang indefinitely if localhost accepts the connection but never responds. That reintroduces the step-4 stall on the warning/continue path. Please add a timeout here (or inject curl timing flags) before deciding whether to bypass validation.
Suggested fix
const hostResponding = hostCheck
- ? run(hostCheck, { ignoreError: true, suppressOutput: true }).status === 0
+ ? run(hostCheck, {
+ ignoreError: true,
+ suppressOutput: true,
+ timeout: 10_000,
+ }).status === 0
: false;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/onboard.ts` around lines 5318 - 5323, The host-side vLLM fallback
probe can hang because getLocalProviderHealthCheck("vllm-local") returns a plain
"curl -sf ..." command; update the probe to enforce a timeout by either (A)
modifying getLocalProviderHealthCheck to include curl timing flags (e.g.
--max-time and/or --connect-timeout) for the "vllm-local" case so the returned
hostCheck command always contains a bounded timeout, or (B) if you prefer to
keep getLocalProviderHealthCheck unchanged, detect the vllm-local provider
before calling run() and wrap the hostCheck invocation with a timeout tool (e.g.
prefix with timeout or add equivalent flags) so run(hostCheck, { ignoreError:
true, suppressOutput: true }).status is decided within a bounded time; ensure
the unique symbols involved are hostCheck, hostResponding,
getLocalProviderHealthCheck, and run.
## Summary `scripts/brev-launchable-ci-cpu.sh` is the community install path for Brev users — it bootstraps a VM with Docker, Node.js, OpenShell, and NemoClaw. **That script already exists in the repo but has zero CI coverage.** This PR adds a nightly E2E smoke test that validates the script works end-to-end. This is the long-living safety net for the community install flow. If any regression breaks the launchable script (e.g., the Apr 20–25 Brev outage from #2472/#2482, or the container reachability fallback from #2425), this test catches it before community users are affected. ## Related Issue Closes #2599 Related: #2425 (the `isProxyHealthy()` fallback in PR #2453 — if that regresses, onboard will abort on Brev and this smoke test catches it) ## Changes ### New: `test/e2e/test-launchable-smoke.sh` | Phase | What it validates | |-------|-------------------| | 0 | Pre-cleanup + pre-seed clone directory from checkout | | 1 | Prerequisites (Docker, NVIDIA_API_KEY, network, env vars) | | 2 | Run `brev-launchable-ci-cpu.sh` — the existing community bootstrap script | | 3 | Verify artifacts (nemoclaw, openshell, Node.js, Docker, sentinel file, built outputs) | | 4 | `nemoclaw onboard --non-interactive` with cloud provider | | 5 | Sandbox health (list, status, inference config, gateway) | | 6 | Live inference (direct API, routing via inference.local, openclaw agent 6×7=42) | | 7 | Destroy + cleanup | Key design decisions: - **No BREV_API_TOKEN needed** — the launchable script is a generic Ubuntu bootstrap with zero Brev dependencies, so it runs on standard GitHub-hosted `ubuntu-latest` runners - **Tests current code, not main** — pre-seeds the clone directory from the CI checkout so regressions are caught before reaching community users - **Follows existing E2E conventions** — pass/fail/section helpers, e2e-timeout.sh self-wrap, sandbox-teardown.sh EXIT trap, parse_chat_content() for reasoning models ### Modified: `.github/workflows/nightly-e2e.yaml` - Added `launchable-smoke-e2e` job: `ubuntu-latest`, 30min timeout, `NVIDIA_API_KEY` secret - Uploads install/onboard/test logs as artifacts on failure - Added to `notify-on-failure` needs list ## Validation Triggered via fork dispatch (`jyaunches/NemoClaw` → `sparky-dispatch` → `launchable-smoke`): - **Run:** https://github.com/jyaunches/NemoClaw/actions/runs/25075715342 - **Result:** ✅ 24 passed, 0 failed, 1 skipped (Node.js version — GH runner pre-installs Node 20) - **Runtime:** ~12 minutes ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Checklist - [x] Follows project coding conventions - [x] Tests pass locally or in CI - [x] No secrets/credentials committed <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added an end-to-end smoke test and CI job that validates the community launchable CPU install path (install, onboarding, runtime readiness, and a simple inference check). CI now uploads install/onboard/test logs on failures. * **Chores** * Renamed the branch-validation workflow and corresponding test-suite identifiers for clarity. * Updated E2E test documentation and project configuration names to match the new labeling. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
- Adopt main's onboard-ollama-proxy.ts (CJS) as canonical module - Add isProxyHealthy() from PR's ollama-proxy.ts into onboard-ollama-proxy.ts - Remove PR's ollama-proxy.ts (superseded by onboard-ollama-proxy.ts) - Update imports in onboard.ts and nemoclaw.ts to use onboard-ollama-proxy - Take main's getRequestedSandboxNameHint(opts) signature - Include main's hydrateCredentialEnv, gateway-token-command additions
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/lib/onboard.ts (2)
5146-5151:⚠️ Potential issue | 🟠 MajorBound the host-side vLLM fallback probe.
getLocalProviderHealthCheck("vllm-local")still resolves to a plaincurl -sf ..., so this warning/continue path can hang indefinitely when localhost accepts the connection but never responds. That reintroduces the Step 4 stall this PR is trying to eliminate.Suggested fix
const hostResponding = hostCheck - ? run(hostCheck, { ignoreError: true, suppressOutput: true }).status === 0 + ? run(hostCheck, { + ignoreError: true, + suppressOutput: true, + timeout: 10_000, + }).status === 0 : false;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 5146 - 5151, The host health probe for getLocalProviderHealthCheck("vllm-local") can hang because the returned curl command lacks a request timeout; update the probe so hostCheck is bounded (e.g., include curl's --max-time N or wrap the command with the system timeout utility) so run(hostCheck, { ignoreError: true, suppressOutput: true }) will return promptly instead of hanging; modify getLocalProviderHealthCheck (or the value assigned to hostCheck) to inject a short timeout (e.g., 1–3s) for vllm-local while keeping the existing run(...) usage and error handling.
5209-5240:⚠️ Potential issue | 🟠 MajorRequire a real HTTP success before bypassing Ollama validation.
proxyReadycomes fromisProxyHealthy(), but that helper still falls back to PID liveness when the HTTP probe fails. This branch can therefore warn-and-continue, and then skip the laterensureOllamaAuthProxy()call, even though the proxy endpoint is not actually serving yet. Keep PID state diagnostic-only and gate the bypass on an actual proxy response.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 5209 - 5240, The current bypass uses proxyReady = isProxyHealthy(), but isProxyHealthy() may return true based on PID liveness rather than an actual HTTP probe, allowing the code path that warns-and-continues to run even when the proxy endpoint isn't serving; change the logic so the bypass only happens when an actual HTTP success is observed: either update isProxyHealthy() to distinguish PID-only liveness from an HTTP-serving result (e.g., return a boolean/enum that indicates HTTP success) or add an explicit HTTP probe call (e.g., probeOllamaProxyHttp()) after isProxyHealthy(), and only set/use proxyReady = true when the HTTP probe succeeds; keep PID-based info as diagnostic text but do not gate ensureOllamaAuthProxy() or skip shutdown unless the HTTP probe confirms the proxy is serving. Ensure references include the proxyReady variable, isProxyHealthy(), ensureOllamaAuthProxy(), and getOllamaProxyToken() so reviewers can find the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard-ollama-proxy.ts`:
- Around line 256-268: The health probe uses loadPersistedProxyToken() which can
be null during onboarding even though a valid in-memory token exists; update the
probe logic to prefer the in-memory proxy token (e.g., the runtime variable that
holds the current token) before falling back to loadPersistedProxyToken(), then
build probeCmd with that chosen token (still falling back to no-Auth when none
exists) so the Authorization header uses the in-memory token when available;
keep references to probeCmd, proxyUrl, runCapture, and hasValidPid so the
authenticated HTTP probe is attempted first and only then fallback to PID.
---
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 5146-5151: The host health probe for
getLocalProviderHealthCheck("vllm-local") can hang because the returned curl
command lacks a request timeout; update the probe so hostCheck is bounded (e.g.,
include curl's --max-time N or wrap the command with the system timeout utility)
so run(hostCheck, { ignoreError: true, suppressOutput: true }) will return
promptly instead of hanging; modify getLocalProviderHealthCheck (or the value
assigned to hostCheck) to inject a short timeout (e.g., 1–3s) for vllm-local
while keeping the existing run(...) usage and error handling.
- Around line 5209-5240: The current bypass uses proxyReady = isProxyHealthy(),
but isProxyHealthy() may return true based on PID liveness rather than an actual
HTTP probe, allowing the code path that warns-and-continues to run even when the
proxy endpoint isn't serving; change the logic so the bypass only happens when
an actual HTTP success is observed: either update isProxyHealthy() to
distinguish PID-only liveness from an HTTP-serving result (e.g., return a
boolean/enum that indicates HTTP success) or add an explicit HTTP probe call
(e.g., probeOllamaProxyHttp()) after isProxyHealthy(), and only set/use
proxyReady = true when the HTTP probe succeeds; keep PID-based info as
diagnostic text but do not gate ensureOllamaAuthProxy() or skip shutdown unless
the HTTP probe confirms the proxy is serving. Ensure references include the
proxyReady variable, isProxyHealthy(), ensureOllamaAuthProxy(), and
getOllamaProxyToken() so reviewers can find the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7c78189c-5ea8-4bdd-bb35-77656a56fc5b
📒 Files selected for processing (3)
src/lib/onboard-ollama-proxy.tssrc/lib/onboard.tssrc/nemoclaw.ts
✅ Files skipped from review due to trivial changes (1)
- src/nemoclaw.ts
| const token = loadPersistedProxyToken(); | ||
| const probeCmd = token | ||
| ? ["curl", "-sf", "--connect-timeout", "3", "--max-time", "5", | ||
| "-H", `Authorization: Bearer ${token}`, proxyUrl] | ||
| : ["curl", "-sf", "--connect-timeout", "3", "--max-time", "5", proxyUrl]; | ||
|
|
||
| const output = runCapture(probeCmd, { ignoreError: true }); | ||
| if (output) return true; | ||
|
|
||
| // HTTP probe failed — fall back to PID as a weaker signal. | ||
| // This covers edge cases where the probe transiently fails but the | ||
| // process is confirmed alive. | ||
| return hasValidPid; |
There was a problem hiding this comment.
Use in-memory token for health probe to avoid misclassifying proxy health
Line 256 uses loadPersistedProxyToken(), which can be null during onboarding even when a valid in-memory token exists. That makes the probe unauthenticated and can incorrectly fall back to PID-only health (or fail when PID is stale).
💡 Proposed fix
- const token = loadPersistedProxyToken();
+ const token = getOllamaProxyToken();🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/onboard-ollama-proxy.ts` around lines 256 - 268, The health probe
uses loadPersistedProxyToken() which can be null during onboarding even though a
valid in-memory token exists; update the probe logic to prefer the in-memory
proxy token (e.g., the runtime variable that holds the current token) before
falling back to loadPersistedProxyToken(), then build probeCmd with that chosen
token (still falling back to no-Auth when none exists) so the Authorization
header uses the in-memory token when available; keep references to probeCmd,
proxyUrl, runCapture, and hasValidPid so the authenticated HTTP probe is
attempted first and only then fallback to PID.
Selective E2E Results — ✅ All requested jobs passedRun: 25132934106
|
## Summary Refreshes the daily docs from NemoClaw commits merged in the past 24 hours and advances the docs metadata from 0.0.29 to 0.0.31, the next version after tag v0.0.30. The updates cover documented behavior gaps found in the merged PRs listed below. ## Related Issue None. ## Changes - `docs/versions1.json` and `docs/project.json`: bump the preferred docs version to `0.0.31` for daily release preparation after latest tag `v0.0.30`. - `docs/reference/commands.md`: document non-interactive Brave Search validation fallback from #2511 / 9bfe30b, missing `--from <Dockerfile>` path validation from #2597 / 7186834, and `logs` reading OpenShell audit events from #2590 / e225dfb. - `docs/inference/use-local-inference.md`: document local inference reachability retry and host-side fallback from #2453 / 9dbe855, plus compatible-endpoint timeout coverage from #2583 / b4ef3db. - `docs/reference/troubleshooting.md`: document source-install shim fallback from #2520 / 01a177c, TLS gateway trust recovery from #1936 / 24725d2, compatible-endpoint timeout coverage from #2583 / b4ef3db, local reachability diagnostics from #2453 / 9dbe855, and host proxy `NO_PROXY` injection from #2662 / b4df07e. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [x] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) Additional verification: - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --dry-run` passed. - `git diff --check` passed. - Pre-push hooks passed through markdownlint, docs-to-skills, JSON checks, gitleaks, and version sync before `Test (skills YAML)` failed because this fresh worktree lacked `vitest/config`. - `npx prek run --all-files` could not run from the fresh worktree because `npx prek` resolved to a missing `prek@*` package; downloading `@j178/prek` was not approved. - `npm test` could not complete from the fresh worktree because dependencies and compiled `dist/lib/*` artifacts were absent. ## AI Disclosure - [x] AI-assisted — tool: OpenAI Codex --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Version updated to 0.0.31 * Local inference onboarding now includes retry logic for container reachability checks * Web search setup failure handling clarified with fallback guidance * Dockerfile path validation timing documented * Logging behavior clarified for concurrent stream reading * New TLS/certificate troubleshooting section added * Install path and proxy configuration troubleshooting updated <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>
## Summary `scripts/brev-launchable-ci-cpu.sh` is the community install path for Brev users — it bootstraps a VM with Docker, Node.js, OpenShell, and NemoClaw. **That script already exists in the repo but has zero CI coverage.** This PR adds a nightly E2E smoke test that validates the script works end-to-end. This is the long-living safety net for the community install flow. If any regression breaks the launchable script (e.g., the Apr 20–25 Brev outage from NVIDIA#2472/NVIDIA#2482, or the container reachability fallback from NVIDIA#2425), this test catches it before community users are affected. ## Related Issue Closes NVIDIA#2599 Related: NVIDIA#2425 (the `isProxyHealthy()` fallback in PR NVIDIA#2453 — if that regresses, onboard will abort on Brev and this smoke test catches it) ## Changes ### New: `test/e2e/test-launchable-smoke.sh` | Phase | What it validates | |-------|-------------------| | 0 | Pre-cleanup + pre-seed clone directory from checkout | | 1 | Prerequisites (Docker, NVIDIA_API_KEY, network, env vars) | | 2 | Run `brev-launchable-ci-cpu.sh` — the existing community bootstrap script | | 3 | Verify artifacts (nemoclaw, openshell, Node.js, Docker, sentinel file, built outputs) | | 4 | `nemoclaw onboard --non-interactive` with cloud provider | | 5 | Sandbox health (list, status, inference config, gateway) | | 6 | Live inference (direct API, routing via inference.local, openclaw agent 6×7=42) | | 7 | Destroy + cleanup | Key design decisions: - **No BREV_API_TOKEN needed** — the launchable script is a generic Ubuntu bootstrap with zero Brev dependencies, so it runs on standard GitHub-hosted `ubuntu-latest` runners - **Tests current code, not main** — pre-seeds the clone directory from the CI checkout so regressions are caught before reaching community users - **Follows existing E2E conventions** — pass/fail/section helpers, e2e-timeout.sh self-wrap, sandbox-teardown.sh EXIT trap, parse_chat_content() for reasoning models ### Modified: `.github/workflows/nightly-e2e.yaml` - Added `launchable-smoke-e2e` job: `ubuntu-latest`, 30min timeout, `NVIDIA_API_KEY` secret - Uploads install/onboard/test logs as artifacts on failure - Added to `notify-on-failure` needs list ## Validation Triggered via fork dispatch (`jyaunches/NemoClaw` → `sparky-dispatch` → `launchable-smoke`): - **Run:** https://github.com/jyaunches/NemoClaw/actions/runs/25075715342 - **Result:** ✅ 24 passed, 0 failed, 1 skipped (Node.js version — GH runner pre-installs Node 20) - **Runtime:** ~12 minutes ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Checklist - [x] Follows project coding conventions - [x] Tests pass locally or in CI - [x] No secrets/credentials committed <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added an end-to-end smoke test and CI job that validates the community launchable CPU install path (install, onboarding, runtime readiness, and a simple inference check). CI now uploads install/onboard/test logs on failures. * **Chores** * Renamed the branch-validation workflow and corresponding test-suite identifiers for clarity. * Updated E2E test documentation and project configuration names to match the new labeling. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…IDIA#2453) ## Summary On Brev cloud GPU instances, `nemoclaw onboard` with Local Ollama fails at step 4/8 because Docker's `--add-host host-gateway` does not route correctly, causing a false-negative container reachability check. This PR adds timeout, retry with backoff, host-side fallback, and improved diagnostics so onboarding succeeds when the proxy is demonstrably healthy on the host. Additionally, this PR extracts the entire Ollama auth proxy subsystem from the `onboard.ts` monolith into a dedicated `ollama-proxy.ts` module, reducing `onboard.ts` by 161 lines and advancing issue NVIDIA#767. ## Related Issue Fixes NVIDIA#2425 Ref NVIDIA#767 ## Changes - Add `--connect-timeout 5 --max-time 10` to the container reachability curl command in `src/lib/local-inference.ts` to prevent indefinite hangs - Add retry loop (3 attempts, 2s backoff) around the container check in `validateLocalProvider()` - Collect HTTP status code and host-gateway resolution diagnostics when retries are exhausted - Add optional `diagnostic` field to `ValidationResult` interface - Extract Ollama auth proxy subsystem from `src/lib/onboard.ts` into new `src/lib/ollama-proxy.ts` — token persistence, PID tracking, process spawn/kill, health probing (161 lines removed from onboard.ts) - Add `isProxyHealthy()` in `ollama-proxy.ts` that performs an actual HTTP probe against the proxy endpoint, not just a PID check — addresses CodeRabbit feedback about wedged/stale proxy processes - Simplify `setupInference()` fallback in `onboard.ts` to use `isProxyHealthy()` instead of inline PID + host check logic - Add host-side fallback for `vllm-local` using the existing health check endpoint - Replace misleading "Ensure the Ollama auth proxy is running" error message with actionable "Docker container reachability check failed" message - Add 5 new unit tests in `src/lib/local-inference.test.ts` covering retry success, retry exhaustion with diagnostics, diagnostic fallback, sleep verification, and no-retry on host failure ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ## AI Disclosure - [x] AI-assisted — tool: Claude Code (pi agent) --- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Container connectivity checks now retry up to 3 times (with 2-second intervals) instead of single attempt. * Explicit timeout settings for Docker connectivity checks. * Enhanced diagnostic reporting when validation fails, including HTTP status codes and network information. * **Bug Fixes** * Improved handling of local inference provider validation failures with automatic host-side health checks. * Graceful fallback behavior allowing setup to continue when host services are healthy despite validation issues. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: J. Yaunches <jyaunches@nvidia.com>
## Summary Refreshes the daily docs from NemoClaw commits merged in the past 24 hours and advances the docs metadata from 0.0.29 to 0.0.31, the next version after tag v0.0.30. The updates cover documented behavior gaps found in the merged PRs listed below. ## Related Issue None. ## Changes - `docs/versions1.json` and `docs/project.json`: bump the preferred docs version to `0.0.31` for daily release preparation after latest tag `v0.0.30`. - `docs/reference/commands.md`: document non-interactive Brave Search validation fallback from NVIDIA#2511 / 9bfe30b, missing `--from <Dockerfile>` path validation from NVIDIA#2597 / 7186834, and `logs` reading OpenShell audit events from NVIDIA#2590 / e225dfb. - `docs/inference/use-local-inference.md`: document local inference reachability retry and host-side fallback from NVIDIA#2453 / 9dbe855, plus compatible-endpoint timeout coverage from NVIDIA#2583 / b4ef3db. - `docs/reference/troubleshooting.md`: document source-install shim fallback from NVIDIA#2520 / 01a177c, TLS gateway trust recovery from NVIDIA#1936 / 24725d2, compatible-endpoint timeout coverage from NVIDIA#2583 / b4ef3db, local reachability diagnostics from NVIDIA#2453 / 9dbe855, and host proxy `NO_PROXY` injection from NVIDIA#2662 / b4df07e. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [x] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) Additional verification: - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --dry-run` passed. - `git diff --check` passed. - Pre-push hooks passed through markdownlint, docs-to-skills, JSON checks, gitleaks, and version sync before `Test (skills YAML)` failed because this fresh worktree lacked `vitest/config`. - `npx prek run --all-files` could not run from the fresh worktree because `npx prek` resolved to a missing `prek@*` package; downloading `@j178/prek` was not approved. - `npm test` could not complete from the fresh worktree because dependencies and compiled `dist/lib/*` artifacts were absent. ## AI Disclosure - [x] AI-assisted — tool: OpenAI Codex --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Version updated to 0.0.31 * Local inference onboarding now includes retry logic for container reachability checks * Web search setup failure handling clarified with fallback guidance * Dockerfile path validation timing documented * Logging behavior clarified for concurrent stream reading * New TLS/certificate troubleshooting section added * Install path and proxy configuration troubleshooting updated <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>
Summary
On Brev cloud GPU instances,
nemoclaw onboardwith Local Ollama fails at step 4/8 because Docker's--add-host host-gatewaydoes not route correctly, causing a false-negative container reachability check. This PR adds timeout, retry with backoff, host-side fallback, and improved diagnostics so onboarding succeeds when the proxy is demonstrably healthy on the host.Additionally, this PR extracts the entire Ollama auth proxy subsystem from the
onboard.tsmonolith into a dedicatedollama-proxy.tsmodule, reducingonboard.tsby 161 lines and advancing issue #767.Related Issue
Fixes #2425
Ref #767
Changes
--connect-timeout 5 --max-time 10to the container reachability curl command insrc/lib/local-inference.tsto prevent indefinite hangsvalidateLocalProvider()diagnosticfield toValidationResultinterfacesrc/lib/onboard.tsinto newsrc/lib/ollama-proxy.ts— token persistence, PID tracking, process spawn/kill, health probing (161 lines removed from onboard.ts)isProxyHealthy()inollama-proxy.tsthat performs an actual HTTP probe against the proxy endpoint, not just a PID check — addresses CodeRabbit feedback about wedged/stale proxy processessetupInference()fallback inonboard.tsto useisProxyHealthy()instead of inline PID + host check logicvllm-localusing the existing health check endpointsrc/lib/local-inference.test.tscovering retry success, retry exhaustion with diagnostics, diagnostic fallback, sleep verification, and no-retry on host failureType of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)AI Disclosure
Signed-off-by: Julie Yaunches jyaunches@nvidia.com
Summary by CodeRabbit
New Features
Bug Fixes