fix(onboard): retry and fallback for container reachability check by jyaunches · Pull Request #2453 · NVIDIA/NemoClaw

jyaunches · 2026-04-24T20:56:09Z

Summary

On Brev cloud GPU instances, nemoclaw onboard with Local Ollama fails at step 4/8 because Docker's --add-host host-gateway does not route correctly, causing a false-negative container reachability check. This PR adds timeout, retry with backoff, host-side fallback, and improved diagnostics so onboarding succeeds when the proxy is demonstrably healthy on the host.

Additionally, this PR extracts the entire Ollama auth proxy subsystem from the onboard.ts monolith into a dedicated ollama-proxy.ts module, reducing onboard.ts by 161 lines and advancing issue #767.

Related Issue

Fixes #2425
Ref #767

Changes

Add --connect-timeout 5 --max-time 10 to the container reachability curl command in src/lib/local-inference.ts to prevent indefinite hangs
Add retry loop (3 attempts, 2s backoff) around the container check in validateLocalProvider()
Collect HTTP status code and host-gateway resolution diagnostics when retries are exhausted
Add optional diagnostic field to ValidationResult interface
Extract Ollama auth proxy subsystem from src/lib/onboard.ts into new src/lib/ollama-proxy.ts — token persistence, PID tracking, process spawn/kill, health probing (161 lines removed from onboard.ts)
Add isProxyHealthy() in ollama-proxy.ts that performs an actual HTTP probe against the proxy endpoint, not just a PID check — addresses CodeRabbit feedback about wedged/stale proxy processes
Simplify setupInference() fallback in onboard.ts to use isProxyHealthy() instead of inline PID + host check logic
Add host-side fallback for vllm-local using the existing health check endpoint
Replace misleading "Ensure the Ollama auth proxy is running" error message with actionable "Docker container reachability check failed" message
Add 5 new unit tests in src/lib/local-inference.test.ts covering retry success, retry exhaustion with diagnostics, diagnostic fallback, sleep verification, and no-retry on host failure

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes
npm test passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
make docs builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

AI Disclosure

AI-assisted — tool: Claude Code (pi agent)

Signed-off-by: Julie Yaunches jyaunches@nvidia.com

Summary by CodeRabbit

New Features
- Container connectivity checks now retry up to 3 times (with 2-second intervals) instead of single attempt.
- Explicit timeout settings for Docker connectivity checks.
- Enhanced diagnostic reporting when validation fails, including HTTP status codes and network information.
Bug Fixes
- Improved handling of local inference provider validation failures with automatic host-side health checks.
- Graceful fallback behavior allowing setup to continue when host services are healthy despite validation issues.

) Add timeout, retry with backoff, host-side fallback, and improved diagnostics to the container reachability check during onboarding. On Brev cloud GPU instances, Docker's --add-host host-gateway does not route correctly, causing a false-negative failure that blocks onboarding even though the Ollama auth proxy is healthy on the host. The sandbox uses a different networking stack (k3s CoreDNS + NodeHosts) and would succeed. Changes: - Add --connect-timeout 5 --max-time 10 to container curl command - Retry the container check up to 3 times with 2s backoff - Collect HTTP status and host-gateway resolution diagnostics on failure - Fall back to host-side health check before exiting fatally - Replace misleading error messages with actionable diagnostics - Add diagnostic field to ValidationResult interface Fixes #2425

coderabbitai · 2026-04-24T20:56:20Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

This pull request enhances local inference provider validation with retry logic, explicit timeouts, and diagnostic gathering capabilities. Container reachability checks now retry up to 3 times with 2-second delays. The onboarding flow is updated to tolerate certain validation failures by checking host-side reachability and auth proxy health, with a new HTTP-based health check for the Ollama proxy. Ollama proxy lifecycle functions are reorganized into a dedicated module.

Changes

Cohort / File(s)	Summary
Container Validation Logic `src/lib/local-inference.ts`, `src/lib/local-inference.test.ts`	Container reachability checks now retry up to 3 times with 2-second delays between attempts. Docker commands include explicit timeouts (`--connect-timeout 5`, `--max-time 10`). `validateLocalProvider` signature updated to accept injectable `sleepFn` parameter. `ValidationResult` interface extended with optional `diagnostic` field. New diagnostic helpers gather HTTP status codes and host mappings; test coverage expanded for retry behavior, diagnostic reporting, and failure scenarios.
Ollama Proxy Health & Lifecycle `src/lib/onboard-ollama-proxy.ts`, `src/nemoclaw.ts`	New `isProxyHealthy()` function validates proxy availability via HTTP probe to `/api/tags`; degrades to process-based signal on HTTP failure. `killStaleProxy` function exported. Ollama proxy lifecycle imports relocated to dedicated `onboard-ollama-proxy` module in `nemoclaw.ts`.
Onboarding Flow Tolerances `src/lib/onboard.ts`	`setupInference` now tolerates validation failures for local providers by checking host-side reachability and auth proxy health instead of immediately aborting. For `vllm-local`: downgrades failed validation to warning if host health check succeeds. For `ollama-local`: proactively ensures auth proxy is running and healthy; if healthy, downgrades validation failure to warning and continues; otherwise logs as error and exits. Conditional fallback logic applies provider-specific recovery strategies.

Sequence Diagram(s)

sequenceDiagram
    participant Client as NemoClaw CLI
    participant Onboard as setupInference
    participant Validator as validateLocalProvider
    participant HostCheck as Host Health Check
    participant ProxyCheck as Proxy Health Check

    Client->>Onboard: setupInference(provider='ollama-local')
    Onboard->>Validator: validateLocalProvider(provider, runCaptureImpl, sleepFn)
    
    loop Retry up to 3 times
        Validator->>Validator: Docker run with timeouts<br/>(--connect-timeout 5, --max-time 10)
        alt Container reachable
            Validator-->>Onboard: ValidationResult { success: true }
        else Container unreachable
            Validator->>Validator: sleep 2 seconds
        end
    end
    
    alt All retries failed
        Validator->>Validator: Gather diagnostics<br/>(HTTP status, /etc/hosts)
        Validator-->>Onboard: ValidationResult { success: false, diagnostic }
        
        Onboard->>ProxyCheck: isProxyHealthy()
        ProxyCheck->>ProxyCheck: HTTP probe /api/tags
        alt Proxy healthy
            ProxyCheck-->>Onboard: true
            Onboard->>Client: Downgrade to warning<br/>Continue setup
        else Proxy unhealthy
            ProxyCheck-->>Onboard: false
            Onboard->>Client: Log error<br/>Exit setup
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 Hop, retry, and try once more!
With timeouts set and health checks sure,
The auth proxy stands at the door,
Container bridges now endure! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 45.45% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: adding retry logic and fallback mechanisms for container reachability checks to fix onboarding failures.
Linked Issues check	✅ Passed	The PR addresses all coding requirements from issue `#2425`: retry container reachability checks, capture diagnostics, add timeouts, implement host-side fallbacks, and improve error messaging.
Out of Scope Changes check	✅ Passed	The PR refactors Ollama auth proxy code into a new module and reorganizes imports in nemoclaw.ts, which is within scope of improving the fallback mechanism per the PR objectives.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch pr/issue-2425

_{Review rate limit: 9/10 reviews remaining, refill in 6 minutes.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 5517-5545: The current bypass logic relies on a ps check
(isOllamaProxyProcess(loadPersistedProxyPid())) and an Ollama /api/tags health
check (getLocalProviderHealthCheck + runCapture), which can miss a wedged/stale
auth proxy; instead, before deciding to continue or exit, call
ensureOllamaAuthProxy() (to start/restart the proxy if needed) and then perform
an actual request against the proxy health endpoint (not Ollama's /api/tags)
using the same runCapture helper; replace the proxyAlive/hostResponding logic
with a probe that first invokes ensureOllamaAuthProxy(), then hits the proxy
health URL and only treats the proxy as healthy if that probe returns success
(fall back to existing diagnostic logging and process.exit(1) if the proxy
endpoint remains unhealthy).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 76683991-0e05-4c63-abbb-78954b4ed90d

📥 Commits

Reviewing files that changed from the base of the PR and between 260b237 and 7385d6e.

📒 Files selected for processing (3)

src/lib/local-inference.test.ts
src/lib/local-inference.ts
src/lib/onboard.ts

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

src/lib/onboard.ts (1)
5521-5533: ⚠️ Potential issue | 🟠 Major

Probe the auth proxy endpoint, not just the Ollama process.

This still keys the bypass off “PID exists + Ollama /api/tags responds”. A wedged ollama-auth-proxy.js on port OLLAMA_PROXY_PORT will pass this gate, and the later ensureOllamaAuthProxy() call also only checks the PID, so onboarding can continue with a dead proxy. Probe the proxy endpoint you actually hand to OpenShell after ensureOllamaAuthProxy() before downgrading this to a warning.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 5521 - 5533, The current bypass only checks
PID + host provider health (getLocalProviderHealthCheck/runCapture) but not the
actual auth-proxy HTTP endpoint, so replace or augment hostResponding to perform
an HTTP probe against the real auth proxy URL/port used by OpenShell (the same
endpoint that ensureOllamaAuthProxy registers, e.g.
http://localhost:OLLAMA_PROXY_PORT/<health-or-api-path>) and treat
success/failure from that probe when deciding whether to downgrade to a warning;
keep the existing isOllamaProxyProcess(loadPersistedProxyPid()) check but
require the proxy HTTP probe to succeed (using the same path used by
ensureOllamaAuthProxy) before allowing the bypass.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 5475-5476: The health check currently interprets the string result
of runCapture() as a boolean (hostResponding = !!runCapture(...)), which can be
truthy even when curl failed; instead call runCapture and inspect its exit
status (e.g., the returned object's exitCode or status) and set hostResponding =
true only when that exit code === 0 (or use the appropriate runCapture variant
that returns an exit/status code). Update the hostCheck/hostResponding
assignment using getLocalProviderHealthCheck and runCapture so you check the
command's exit code, and apply the same change to the other occurrence at the
region around the other hostResponding usage (the block referenced at lines
~5522-5523).

---

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 5521-5533: The current bypass only checks PID + host provider
health (getLocalProviderHealthCheck/runCapture) but not the actual auth-proxy
HTTP endpoint, so replace or augment hostResponding to perform an HTTP probe
against the real auth proxy URL/port used by OpenShell (the same endpoint that
ensureOllamaAuthProxy registers, e.g.
http://localhost:OLLAMA_PROXY_PORT/<health-or-api-path>) and treat
success/failure from that probe when deciding whether to downgrade to a warning;
keep the existing isOllamaProxyProcess(loadPersistedProxyPid()) check but
require the proxy HTTP probe to succeed (using the same path used by
ensureOllamaAuthProxy) before allowing the bypass.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c8c5e1a9-256d-4d6b-b46f-d800642ac37e

📥 Commits

Reviewing files that changed from the base of the PR and between 7385d6e and 06f77ed.

📒 Files selected for processing (1)

src/lib/onboard.ts

…roxyHealthy (#767) Extract the entire Ollama auth proxy subsystem (token persistence, PID tracking, process spawn/kill, health probing) from onboard.ts into a dedicated ollama-proxy.ts module. Add isProxyHealthy() which performs an actual HTTP probe against the proxy endpoint (not just a PID check), addressing CodeRabbit's review feedback about wedged/stale proxy processes. Simplify the setupInference fallback to use isProxyHealthy() instead of inline PID + host check logic. - Reduces onboard.ts by 161 lines (7910 → 7749) - New ollama-proxy.ts: 223 lines - Advances #767 (split onboarding provider logic)

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (2)

src/lib/onboard.ts (2)

5360-5384: ⚠️ Potential issue | 🟠 Major

Repair or reprobe the Ollama auth proxy before aborting.

This branch exits on isProxyHealthy() before ensureOllamaAuthProxy() runs at Line 5389. A stale or missing proxy that is recoverable here still becomes a hard false-negative.

Suggested fix

+    let proxyReady = false;
     if (!validation.ok) {
       // The container reachability check uses Docker's --add-host host-gateway,
       // which may not work on all Docker configurations (e.g., Brev, rootless).
       // The real sandbox uses k3s CoreDNS + NodeHosts — a different path.
       // Probe the actual auth proxy endpoint before giving up.
-      if (!isWsl() && isProxyHealthy()) {
+      proxyReady = !isWsl() && ensureOllamaAuthProxy() && isProxyHealthy();
+      if (proxyReady) {
         console.warn(`  ⚠ ${validation.message}`);
         if (validation.diagnostic) {
           console.warn(`  Diagnostic: ${validation.diagnostic}`);
@@
-    if (!isWsl()) {
-      ensureOllamaAuthProxy();
+    if (!isWsl()) {
+      if (!proxyReady) ensureOllamaAuthProxy();
       const proxyToken = getOllamaProxyToken();

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 5360 - 5384, The code exits immediately when
isProxyHealthy() is false even though a recoverable proxy could be created;
modify the branch handling around isProxyHealthy() so that when the proxy
appears unhealthy you call ensureOllamaAuthProxy() (and recheck with
isProxyHealthy()) before logging the fatal error and calling process.exit(1); in
practice, invoke ensureOllamaAuthProxy() when !isProxyHealthy(), handle any
thrown error or returned failure, re-evaluate validation (and
validation.diagnostic) and only log the error and call process.exit(1) if the
reprobe/recovery still reports an unhealthy proxy.

5318-5319: ⚠️ Potential issue | 🟠 Major

Use curl exit status for the vLLM host-health bypass.

Line 5319 still converts runCapture() output into a boolean. A failed curl -sf can still leave captured text behind, which would incorrectly downgrade a real host failure into the warning-and-continue path.

Suggested fix

-      const hostResponding = hostCheck ? !!runCapture(hostCheck, { ignoreError: true }) : false;
+      const hostResponding = hostCheck
+        ? run(hostCheck, { ignoreError: true, suppressOutput: true }).status === 0
+        : false;

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 5318 - 5319, The host-health check currently
sets hostResponding by coercing runCapture(...) output to boolean, which can be
non-empty even when curl failed; update the logic in onboard.ts where
getLocalProviderHealthCheck(provider) is assigned to hostCheck so that you
inspect the command exit status instead of the captured stdout. Call the runner
in a way that returns the exit code (e.g., use the run variant or runCapture
result that includes an exitCode/exitStatus) and set hostResponding = hostCheck
? (result.exitCode === 0) : false, keeping the ignoreError behavior but basing
success on the process exit code; reference getLocalProviderHealthCheck,
runCapture, hostCheck, and hostResponding to locate and change the code.

🧹 Nitpick comments (1)

src/lib/onboard.ts (1)
5316-5384: Extract the local-provider fallback path into a helper.

These new vLLM/Ollama branches duplicate the same diagnostic-and-bypass flow inside an already complexity-suppressed function. Pulling that decision into a small helper would keep setupInference() easier to maintain.

As per coding guidelines, {src,nemoclaw/src,scripts}/**/*.{js,ts,tsx}: TypeScript files must maintain cyclomatic complexity limit of 20, ratcheting down to 15. Enforced by ESLint.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 5316 - 5384, The duplicated
diagnostic-and-bypass logic in the vllm-local and ollama-local branches should
be refactored into a small helper to reduce cyclomatic complexity: create a
helper (e.g., ensureLocalProviderOrExit or handleLocalProviderFallback) that
takes the provider name and the validation result and encapsulates the existing
checks (for vllm-local: getLocalProviderHealthCheck + runCapture; for
ollama-local: isWsl + isProxyHealthy) and the three logging paths (warn +
optional diagnostic + bypass message, or error + optional diagnostic +
platform-specific hint + process.exit). Replace the duplicated blocks around
validateLocalProvider in setupInference (or where validateLocalProvider is
called) to call this helper before continuing to upsertProvider / runOpenshell.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/ollama-proxy.ts`:
- Around line 145-165: The code sets ollamaProxyToken before verifying the proxy
is alive; update startOllamaAuthProxy so it generates a local proxyToken, calls
spawnOllamaAuthProxy(pid), waits and checks isOllamaProxyProcess(pid), and only
if the liveness check passes assign ollamaProxyToken = proxyToken (and persist
if needed); if the check fails return false without mutating ollamaProxyToken so
getOllamaProxyToken/ensureOllamaAuthProxy won’t return a stale token; apply the
same change pattern where ollamaProxyToken is currently set early (references:
startOllamaAuthProxy, spawnOllamaAuthProxy, isOllamaProxyProcess,
getOllamaProxyToken, ensureOllamaAuthProxy).
- Around line 207-222: In isProxyHealthy(), don't early-return false when
isOllamaProxyProcess(pid) is falsy; instead store that boolean (e.g.,
hasValidPid = isOllamaProxyProcess(pid)), always perform the HTTP probe built
with loadPersistedProxyToken() and runCapture(..., { ignoreError: true }), and
then return true if the probe produced output, otherwise return hasValidPid (so
a successful HTTP probe wins even if the PID file is missing/stale). Update
references in the function to use the new hasValidPid local variable and remove
the early return.

---

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 5360-5384: The code exits immediately when isProxyHealthy() is
false even though a recoverable proxy could be created; modify the branch
handling around isProxyHealthy() so that when the proxy appears unhealthy you
call ensureOllamaAuthProxy() (and recheck with isProxyHealthy()) before logging
the fatal error and calling process.exit(1); in practice, invoke
ensureOllamaAuthProxy() when !isProxyHealthy(), handle any thrown error or
returned failure, re-evaluate validation (and validation.diagnostic) and only
log the error and call process.exit(1) if the reprobe/recovery still reports an
unhealthy proxy.
- Around line 5318-5319: The host-health check currently sets hostResponding by
coercing runCapture(...) output to boolean, which can be non-empty even when
curl failed; update the logic in onboard.ts where
getLocalProviderHealthCheck(provider) is assigned to hostCheck so that you
inspect the command exit status instead of the captured stdout. Call the runner
in a way that returns the exit code (e.g., use the run variant or runCapture
result that includes an exitCode/exitStatus) and set hostResponding = hostCheck
? (result.exitCode === 0) : false, keeping the ignoreError behavior but basing
success on the process exit code; reference getLocalProviderHealthCheck,
runCapture, hostCheck, and hostResponding to locate and change the code.

---

Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 5316-5384: The duplicated diagnostic-and-bypass logic in the
vllm-local and ollama-local branches should be refactored into a small helper to
reduce cyclomatic complexity: create a helper (e.g., ensureLocalProviderOrExit
or handleLocalProviderFallback) that takes the provider name and the validation
result and encapsulates the existing checks (for vllm-local:
getLocalProviderHealthCheck + runCapture; for ollama-local: isWsl +
isProxyHealthy) and the three logging paths (warn + optional diagnostic + bypass
message, or error + optional diagnostic + platform-specific hint +
process.exit). Replace the duplicated blocks around validateLocalProvider in
setupInference (or where validateLocalProvider is called) to call this helper
before continuing to upsertProvider / runOpenshell.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cd2c8e18-d861-4413-b058-63a9501e1fe0

📥 Commits

Reviewing files that changed from the base of the PR and between 06f77ed and a154279.

📒 Files selected for processing (3)

src/lib/ollama-proxy.ts
src/lib/onboard.ts
src/nemoclaw.ts

✅ Files skipped from review due to trivial changes (1)

src/nemoclaw.ts

- startOllamaAuthProxy: set in-memory token only after confirming proxy is alive, preventing stale tokens on failed starts - isProxyHealthy: don't early-return false on PID check failure; always perform HTTP probe as the authoritative signal, falling back to PID only when the probe transiently fails - ollama-local branch: call ensureOllamaAuthProxy() before isProxyHealthy() so a stale/missing proxy is recovered before we decide to abort onboarding - vllm-local branch: use run() exit status instead of coercing runCapture() output to boolean for host health bypass Signed-off-by: J. Yaunches <jyaunches@nvidia.com>

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

src/lib/ollama-proxy.ts (1)

145-168: ⚠️ Potential issue | 🟠 Major

Failed restarts can still publish a dead proxy token.

Line 188 assigns ollamaProxyToken before the restarted child is proven alive, and the failed-start path still leaves the newly persisted PID behind. If the proxy exits immediately, getOllamaProxyToken() can keep returning a token for a proxy that is no longer serving.

Suggested fix

 export function startOllamaAuthProxy(): boolean {
   // eslint-disable-next-line `@typescript-eslint/no-require-imports`
   const crypto = require("crypto");
   killStaleProxy();
+  ollamaProxyToken = null;

   const proxyToken = crypto.randomBytes(24).toString("hex");
   // Don't persist yet — wait until provider is confirmed in setupInference.
   // If the user backs out to a different provider, the token stays in memory
   // only and is discarded.
   const pid = spawnOllamaAuthProxy(proxyToken);
   sleepSeconds(1);
   if (!isOllamaProxyProcess(pid)) {
+    clearPersistedProxyPid();
     console.error(`  Error: Ollama auth proxy failed to start on :${OLLAMA_PROXY_PORT}`);
     console.error(`  Containers will not be able to reach Ollama without the proxy.`);
     console.error(
       `  Check if port ${OLLAMA_PROXY_PORT} is already in use: lsof -ti :${OLLAMA_PROXY_PORT}`,
     );
@@
 export function ensureOllamaAuthProxy(): void {
   // Try to load persisted token first — if none, this isn't an Ollama setup.
   const token = loadPersistedProxyToken();
   if (!token) return;
@@
   // Proxy not running — restart it with the persisted token.
   killStaleProxy();
-  ollamaProxyToken = token;
-  spawnOllamaAuthProxy(token);
+  const restartedPid = spawnOllamaAuthProxy(token);
   sleepSeconds(1);
+  if (isOllamaProxyProcess(restartedPid)) {
+    ollamaProxyToken = token;
+  } else {
+    clearPersistedProxyPid();
+    ollamaProxyToken = null;
+  }
 }

Also applies to: 175-191

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/ollama-proxy.ts` around lines 145 - 168, In startOllamaAuthProxy(),
avoid persisting the new proxy PID or setting ollamaProxyToken until the spawned
process is verified alive: generate proxyToken, call
spawnOllamaAuthProxy(proxyToken) but do NOT persist the PID inside
spawn/killStaleProxy yet; after sleepSeconds and a successful
isOllamaProxyProcess(pid) check, then persist the PID (wherever you currently
store it) and assign ollamaProxyToken; if the check fails, ensure you clear any
persisted PID left by the restart path and do NOT set ollamaProxyToken so
getOllamaProxyToken() cannot return a token for a dead proxy (update
spawnOllamaAuthProxy/killStaleProxy behavior if they currently persist
immediately).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 5318-5323: The host-side vLLM fallback probe can hang because
getLocalProviderHealthCheck("vllm-local") returns a plain "curl -sf ..."
command; update the probe to enforce a timeout by either (A) modifying
getLocalProviderHealthCheck to include curl timing flags (e.g. --max-time and/or
--connect-timeout) for the "vllm-local" case so the returned hostCheck command
always contains a bounded timeout, or (B) if you prefer to keep
getLocalProviderHealthCheck unchanged, detect the vllm-local provider before
calling run() and wrap the hostCheck invocation with a timeout tool (e.g. prefix
with timeout or add equivalent flags) so run(hostCheck, { ignoreError: true,
suppressOutput: true }).status is decided within a bounded time; ensure the
unique symbols involved are hostCheck, hostResponding,
getLocalProviderHealthCheck, and run.

---

Duplicate comments:
In `@src/lib/ollama-proxy.ts`:
- Around line 145-168: In startOllamaAuthProxy(), avoid persisting the new proxy
PID or setting ollamaProxyToken until the spawned process is verified alive:
generate proxyToken, call spawnOllamaAuthProxy(proxyToken) but do NOT persist
the PID inside spawn/killStaleProxy yet; after sleepSeconds and a successful
isOllamaProxyProcess(pid) check, then persist the PID (wherever you currently
store it) and assign ollamaProxyToken; if the check fails, ensure you clear any
persisted PID left by the restart path and do NOT set ollamaProxyToken so
getOllamaProxyToken() cannot return a token for a dead proxy (update
spawnOllamaAuthProxy/killStaleProxy behavior if they currently persist
immediately).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8259b168-d6ca-45c0-bb97-4fcbe5766478

📥 Commits

Reviewing files that changed from the base of the PR and between a154279 and d89b961.

📒 Files selected for processing (2)

src/lib/ollama-proxy.ts
src/lib/onboard.ts

coderabbitai · 2026-04-28T03:07:47Z

+      const hostCheck = getLocalProviderHealthCheck(provider);
+      // Use run() and check exit status rather than coercing runCapture() output
+      // to boolean — curl -sf can leave output even on failure in edge cases.
+      const hostResponding = hostCheck
+        ? run(hostCheck, { ignoreError: true, suppressOutput: true }).status === 0
+        : false;


⚠️ Potential issue | 🟠 Major

Bound the host-side vLLM fallback probe.

getLocalProviderHealthCheck("vllm-local") is still a plain curl -sf ... command, so this new fallback can hang indefinitely if localhost accepts the connection but never responds. That reintroduces the step-4 stall on the warning/continue path. Please add a timeout here (or inject curl timing flags) before deciding whether to bypass validation.

Suggested fix

const hostResponding = hostCheck - ? run(hostCheck, { ignoreError: true, suppressOutput: true }).status === 0 + ? run(hostCheck, { + ignoreError: true, + suppressOutput: true, + timeout: 10_000, + }).status === 0 : false;

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 5318 - 5323, The host-side vLLM fallback probe can hang because getLocalProviderHealthCheck("vllm-local") returns a plain "curl -sf ..." command; update the probe to enforce a timeout by either (A) modifying getLocalProviderHealthCheck to include curl timing flags (e.g. --max-time and/or --connect-timeout) for the "vllm-local" case so the returned hostCheck command always contains a bounded timeout, or (B) if you prefer to keep getLocalProviderHealthCheck unchanged, detect the vllm-local provider before calling run() and wrap the hostCheck invocation with a timeout tool (e.g. prefix with timeout or add equivalent flags) so run(hostCheck, { ignoreError: true, suppressOutput: true }).status is decided within a bounded time; ensure the unique symbols involved are hostCheck, hostResponding, getLocalProviderHealthCheck, and run.

## Summary `scripts/brev-launchable-ci-cpu.sh` is the community install path for Brev users — it bootstraps a VM with Docker, Node.js, OpenShell, and NemoClaw. **That script already exists in the repo but has zero CI coverage.** This PR adds a nightly E2E smoke test that validates the script works end-to-end. This is the long-living safety net for the community install flow. If any regression breaks the launchable script (e.g., the Apr 20–25 Brev outage from #2472/#2482, or the container reachability fallback from #2425), this test catches it before community users are affected. ## Related Issue Closes #2599 Related: #2425 (the `isProxyHealthy()` fallback in PR #2453 — if that regresses, onboard will abort on Brev and this smoke test catches it) ## Changes ### New: `test/e2e/test-launchable-smoke.sh` | Phase | What it validates | |-------|-------------------| | 0 | Pre-cleanup + pre-seed clone directory from checkout | | 1 | Prerequisites (Docker, NVIDIA_API_KEY, network, env vars) | | 2 | Run `brev-launchable-ci-cpu.sh` — the existing community bootstrap script | | 3 | Verify artifacts (nemoclaw, openshell, Node.js, Docker, sentinel file, built outputs) | | 4 | `nemoclaw onboard --non-interactive` with cloud provider | | 5 | Sandbox health (list, status, inference config, gateway) | | 6 | Live inference (direct API, routing via inference.local, openclaw agent 6×7=42) | | 7 | Destroy + cleanup | Key design decisions: - **No BREV_API_TOKEN needed** — the launchable script is a generic Ubuntu bootstrap with zero Brev dependencies, so it runs on standard GitHub-hosted `ubuntu-latest` runners - **Tests current code, not main** — pre-seeds the clone directory from the CI checkout so regressions are caught before reaching community users - **Follows existing E2E conventions** — pass/fail/section helpers, e2e-timeout.sh self-wrap, sandbox-teardown.sh EXIT trap, parse_chat_content() for reasoning models ### Modified: `.github/workflows/nightly-e2e.yaml` - Added `launchable-smoke-e2e` job: `ubuntu-latest`, 30min timeout, `NVIDIA_API_KEY` secret - Uploads install/onboard/test logs as artifacts on failure - Added to `notify-on-failure` needs list ## Validation Triggered via fork dispatch (`jyaunches/NemoClaw` → `sparky-dispatch` → `launchable-smoke`): - **Run:** https://github.com/jyaunches/NemoClaw/actions/runs/25075715342 - **Result:** ✅ 24 passed, 0 failed, 1 skipped (Node.js version — GH runner pre-installs Node 20) - **Runtime:** ~12 minutes ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Checklist - [x] Follows project coding conventions - [x] Tests pass locally or in CI - [x] No secrets/credentials committed  ## Summary by CodeRabbit * **New Features** * Added an end-to-end smoke test and CI job that validates the community launchable CPU install path (install, onboarding, runtime readiness, and a simple inference check). CI now uploads install/onboard/test logs on failures. * **Chores** * Renamed the branch-validation workflow and corresponding test-suite identifiers for clarity. * Updated E2E test documentation and project configuration names to match the new labeling.

- Adopt main's onboard-ollama-proxy.ts (CJS) as canonical module - Add isProxyHealthy() from PR's ollama-proxy.ts into onboard-ollama-proxy.ts - Remove PR's ollama-proxy.ts (superseded by onboard-ollama-proxy.ts) - Update imports in onboard.ts and nemoclaw.ts to use onboard-ollama-proxy - Take main's getRequestedSandboxNameHint(opts) signature - Include main's hydrateCredentialEnv, gateway-token-command additions

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

src/lib/onboard.ts (2)

5146-5151: ⚠️ Potential issue | 🟠 Major

Bound the host-side vLLM fallback probe.

getLocalProviderHealthCheck("vllm-local") still resolves to a plain curl -sf ..., so this warning/continue path can hang indefinitely when localhost accepts the connection but never responds. That reintroduces the Step 4 stall this PR is trying to eliminate.

Suggested fix

       const hostResponding = hostCheck
-        ? run(hostCheck, { ignoreError: true, suppressOutput: true }).status === 0
+        ? run(hostCheck, {
+            ignoreError: true,
+            suppressOutput: true,
+            timeout: 10_000,
+          }).status === 0
         : false;

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 5146 - 5151, The host health probe for
getLocalProviderHealthCheck("vllm-local") can hang because the returned curl
command lacks a request timeout; update the probe so hostCheck is bounded (e.g.,
include curl's --max-time N or wrap the command with the system timeout utility)
so run(hostCheck, { ignoreError: true, suppressOutput: true }) will return
promptly instead of hanging; modify getLocalProviderHealthCheck (or the value
assigned to hostCheck) to inject a short timeout (e.g., 1–3s) for vllm-local
while keeping the existing run(...) usage and error handling.

5209-5240: ⚠️ Potential issue | 🟠 Major

Require a real HTTP success before bypassing Ollama validation.

proxyReady comes from isProxyHealthy(), but that helper still falls back to PID liveness when the HTTP probe fails. This branch can therefore warn-and-continue, and then skip the later ensureOllamaAuthProxy() call, even though the proxy endpoint is not actually serving yet. Keep PID state diagnostic-only and gate the bypass on an actual proxy response.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 5209 - 5240, The current bypass uses
proxyReady = isProxyHealthy(), but isProxyHealthy() may return true based on PID
liveness rather than an actual HTTP probe, allowing the code path that
warns-and-continues to run even when the proxy endpoint isn't serving; change
the logic so the bypass only happens when an actual HTTP success is observed:
either update isProxyHealthy() to distinguish PID-only liveness from an
HTTP-serving result (e.g., return a boolean/enum that indicates HTTP success) or
add an explicit HTTP probe call (e.g., probeOllamaProxyHttp()) after
isProxyHealthy(), and only set/use proxyReady = true when the HTTP probe
succeeds; keep PID-based info as diagnostic text but do not gate
ensureOllamaAuthProxy() or skip shutdown unless the HTTP probe confirms the
proxy is serving. Ensure references include the proxyReady variable,
isProxyHealthy(), ensureOllamaAuthProxy(), and getOllamaProxyToken() so
reviewers can find the change.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard-ollama-proxy.ts`:
- Around line 256-268: The health probe uses loadPersistedProxyToken() which can
be null during onboarding even though a valid in-memory token exists; update the
probe logic to prefer the in-memory proxy token (e.g., the runtime variable that
holds the current token) before falling back to loadPersistedProxyToken(), then
build probeCmd with that chosen token (still falling back to no-Auth when none
exists) so the Authorization header uses the in-memory token when available;
keep references to probeCmd, proxyUrl, runCapture, and hasValidPid so the
authenticated HTTP probe is attempted first and only then fallback to PID.

---

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 5146-5151: The host health probe for
getLocalProviderHealthCheck("vllm-local") can hang because the returned curl
command lacks a request timeout; update the probe so hostCheck is bounded (e.g.,
include curl's --max-time N or wrap the command with the system timeout utility)
so run(hostCheck, { ignoreError: true, suppressOutput: true }) will return
promptly instead of hanging; modify getLocalProviderHealthCheck (or the value
assigned to hostCheck) to inject a short timeout (e.g., 1–3s) for vllm-local
while keeping the existing run(...) usage and error handling.
- Around line 5209-5240: The current bypass uses proxyReady = isProxyHealthy(),
but isProxyHealthy() may return true based on PID liveness rather than an actual
HTTP probe, allowing the code path that warns-and-continues to run even when the
proxy endpoint isn't serving; change the logic so the bypass only happens when
an actual HTTP success is observed: either update isProxyHealthy() to
distinguish PID-only liveness from an HTTP-serving result (e.g., return a
boolean/enum that indicates HTTP success) or add an explicit HTTP probe call
(e.g., probeOllamaProxyHttp()) after isProxyHealthy(), and only set/use
proxyReady = true when the HTTP probe succeeds; keep PID-based info as
diagnostic text but do not gate ensureOllamaAuthProxy() or skip shutdown unless
the HTTP probe confirms the proxy is serving. Ensure references include the
proxyReady variable, isProxyHealthy(), ensureOllamaAuthProxy(), and
getOllamaProxyToken() so reviewers can find the change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7c78189c-5ea8-4bdd-bb35-77656a56fc5b

📥 Commits

Reviewing files that changed from the base of the PR and between d89b961 and edb6a44.

📒 Files selected for processing (3)

src/lib/onboard-ollama-proxy.ts
src/lib/onboard.ts
src/nemoclaw.ts

✅ Files skipped from review due to trivial changes (1)

src/nemoclaw.ts

coderabbitai · 2026-04-29T20:37:35Z

+  const token = loadPersistedProxyToken();
+  const probeCmd = token
+    ? ["curl", "-sf", "--connect-timeout", "3", "--max-time", "5",
+       "-H", `Authorization: Bearer ${token}`, proxyUrl]
+    : ["curl", "-sf", "--connect-timeout", "3", "--max-time", "5", proxyUrl];
+
+  const output = runCapture(probeCmd, { ignoreError: true });
+  if (output) return true;
+
+  // HTTP probe failed — fall back to PID as a weaker signal.
+  // This covers edge cases where the probe transiently fails but the
+  // process is confirmed alive.
+  return hasValidPid;


⚠️ Potential issue | 🟠 Major

Use in-memory token for health probe to avoid misclassifying proxy health

Line 256 uses loadPersistedProxyToken(), which can be null during onboarding even when a valid in-memory token exists. That makes the probe unauthenticated and can incorrectly fall back to PID-only health (or fail when PID is stale).

💡 Proposed fix

- const token = loadPersistedProxyToken(); + const token = getOllamaProxyToken();

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard-ollama-proxy.ts` around lines 256 - 268, The health probe uses loadPersistedProxyToken() which can be null during onboarding even though a valid in-memory token exists; update the probe logic to prefer the in-memory proxy token (e.g., the runtime variable that holds the current token) before falling back to loadPersistedProxyToken(), then build probeCmd with that chosen token (still falling back to no-Auth when none exists) so the Authorization header uses the in-memory token when available; keep references to probeCmd, proxyUrl, runCapture, and hasValidPid so the authenticated HTTP probe is attempted first and only then fallback to PID.

github-actions · 2026-04-29T21:09:36Z

Selective E2E Results — ✅ All requested jobs passed

Run: 25132934106
Branch: pr/issue-2425
Requested jobs: cloud-e2e,cloud-onboard-e2e,sandbox-operations-e2e
Summary: 3 passed, 0 failed, 19 skipped

Job	Result
cloud-e2e	✅ success
cloud-inference-e2e	⏭️ skipped
cloud-onboard-e2e	✅ success
deployment-services-e2e	⏭️ skipped
diagnostics-e2e	⏭️ skipped
docs-validation-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped
hermes-e2e	⏭️ skipped
inference-routing-e2e	⏭️ skipped
messaging-providers-e2e	⏭️ skipped
network-policy-e2e	⏭️ skipped
overlayfs-autofix-e2e	⏭️ skipped
rebuild-hermes-e2e	⏭️ skipped
rebuild-openclaw-e2e	⏭️ skipped
sandbox-operations-e2e	✅ success
sandbox-survival-e2e	⏭️ skipped
shields-config-e2e	⏭️ skipped
skill-agent-e2e	⏭️ skipped
skip-permissions-e2e	⏭️ skipped
snapshot-commands-e2e	⏭️ skipped
token-rotation-e2e	⏭️ skipped
upgrade-stale-sandbox-e2e	⏭️ skipped

## Summary Refreshes the daily docs from NemoClaw commits merged in the past 24 hours and advances the docs metadata from 0.0.29 to 0.0.31, the next version after tag v0.0.30. The updates cover documented behavior gaps found in the merged PRs listed below. ## Related Issue None. ## Changes - `docs/versions1.json` and `docs/project.json`: bump the preferred docs version to `0.0.31` for daily release preparation after latest tag `v0.0.30`. - `docs/reference/commands.md`: document non-interactive Brave Search validation fallback from #2511 / 9bfe30b, missing `--from <Dockerfile>` path validation from #2597 / 7186834, and `logs` reading OpenShell audit events from #2590 / e225dfb. - `docs/inference/use-local-inference.md`: document local inference reachability retry and host-side fallback from #2453 / 9dbe855, plus compatible-endpoint timeout coverage from #2583 / b4ef3db. - `docs/reference/troubleshooting.md`: document source-install shim fallback from #2520 / 01a177c, TLS gateway trust recovery from #1936 / 24725d2, compatible-endpoint timeout coverage from #2583 / b4ef3db, local reachability diagnostics from #2453 / 9dbe855, and host proxy `NO_PROXY` injection from #2662 / b4df07e. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [x] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) Additional verification: - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --dry-run` passed. - `git diff --check` passed. - Pre-push hooks passed through markdownlint, docs-to-skills, JSON checks, gitleaks, and version sync before `Test (skills YAML)` failed because this fresh worktree lacked `vitest/config`. - `npx prek run --all-files` could not run from the fresh worktree because `npx prek` resolved to a missing `prek@*` package; downloading `@j178/prek` was not approved. - `npm test` could not complete from the fresh worktree because dependencies and compiled `dist/lib/*` artifacts were absent. ## AI Disclosure - [x] AI-assisted — tool: OpenAI Codex --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>  ## Summary by CodeRabbit * **Documentation** * Version updated to 0.0.31 * Local inference onboarding now includes retry logic for container reachability checks * Web search setup failure handling clarified with fallback guidance * Dockerfile path validation timing documented * Logging behavior clarified for concurrent stream reading * New TLS/certificate troubleshooting section added * Install path and proxy configuration troubleshooting updated  Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

## Summary `scripts/brev-launchable-ci-cpu.sh` is the community install path for Brev users — it bootstraps a VM with Docker, Node.js, OpenShell, and NemoClaw. **That script already exists in the repo but has zero CI coverage.** This PR adds a nightly E2E smoke test that validates the script works end-to-end. This is the long-living safety net for the community install flow. If any regression breaks the launchable script (e.g., the Apr 20–25 Brev outage from NVIDIA#2472/NVIDIA#2482, or the container reachability fallback from NVIDIA#2425), this test catches it before community users are affected. ## Related Issue Closes NVIDIA#2599 Related: NVIDIA#2425 (the `isProxyHealthy()` fallback in PR NVIDIA#2453 — if that regresses, onboard will abort on Brev and this smoke test catches it) ## Changes ### New: `test/e2e/test-launchable-smoke.sh` | Phase | What it validates | |-------|-------------------| | 0 | Pre-cleanup + pre-seed clone directory from checkout | | 1 | Prerequisites (Docker, NVIDIA_API_KEY, network, env vars) | | 2 | Run `brev-launchable-ci-cpu.sh` — the existing community bootstrap script | | 3 | Verify artifacts (nemoclaw, openshell, Node.js, Docker, sentinel file, built outputs) | | 4 | `nemoclaw onboard --non-interactive` with cloud provider | | 5 | Sandbox health (list, status, inference config, gateway) | | 6 | Live inference (direct API, routing via inference.local, openclaw agent 6×7=42) | | 7 | Destroy + cleanup | Key design decisions: - **No BREV_API_TOKEN needed** — the launchable script is a generic Ubuntu bootstrap with zero Brev dependencies, so it runs on standard GitHub-hosted `ubuntu-latest` runners - **Tests current code, not main** — pre-seeds the clone directory from the CI checkout so regressions are caught before reaching community users - **Follows existing E2E conventions** — pass/fail/section helpers, e2e-timeout.sh self-wrap, sandbox-teardown.sh EXIT trap, parse_chat_content() for reasoning models ### Modified: `.github/workflows/nightly-e2e.yaml` - Added `launchable-smoke-e2e` job: `ubuntu-latest`, 30min timeout, `NVIDIA_API_KEY` secret - Uploads install/onboard/test logs as artifacts on failure - Added to `notify-on-failure` needs list ## Validation Triggered via fork dispatch (`jyaunches/NemoClaw` → `sparky-dispatch` → `launchable-smoke`): - **Run:** https://github.com/jyaunches/NemoClaw/actions/runs/25075715342 - **Result:** ✅ 24 passed, 0 failed, 1 skipped (Node.js version — GH runner pre-installs Node 20) - **Runtime:** ~12 minutes ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Checklist - [x] Follows project coding conventions - [x] Tests pass locally or in CI - [x] No secrets/credentials committed  ## Summary by CodeRabbit * **New Features** * Added an end-to-end smoke test and CI job that validates the community launchable CPU install path (install, onboarding, runtime readiness, and a simple inference check). CI now uploads install/onboard/test logs on failures. * **Chores** * Renamed the branch-validation workflow and corresponding test-suite identifiers for clarity. * Updated E2E test documentation and project configuration names to match the new labeling.

…IDIA#2453) ## Summary On Brev cloud GPU instances, `nemoclaw onboard` with Local Ollama fails at step 4/8 because Docker's `--add-host host-gateway` does not route correctly, causing a false-negative container reachability check. This PR adds timeout, retry with backoff, host-side fallback, and improved diagnostics so onboarding succeeds when the proxy is demonstrably healthy on the host. Additionally, this PR extracts the entire Ollama auth proxy subsystem from the `onboard.ts` monolith into a dedicated `ollama-proxy.ts` module, reducing `onboard.ts` by 161 lines and advancing issue NVIDIA#767. ## Related Issue Fixes NVIDIA#2425 Ref NVIDIA#767 ## Changes - Add `--connect-timeout 5 --max-time 10` to the container reachability curl command in `src/lib/local-inference.ts` to prevent indefinite hangs - Add retry loop (3 attempts, 2s backoff) around the container check in `validateLocalProvider()` - Collect HTTP status code and host-gateway resolution diagnostics when retries are exhausted - Add optional `diagnostic` field to `ValidationResult` interface - Extract Ollama auth proxy subsystem from `src/lib/onboard.ts` into new `src/lib/ollama-proxy.ts` — token persistence, PID tracking, process spawn/kill, health probing (161 lines removed from onboard.ts) - Add `isProxyHealthy()` in `ollama-proxy.ts` that performs an actual HTTP probe against the proxy endpoint, not just a PID check — addresses CodeRabbit feedback about wedged/stale proxy processes - Simplify `setupInference()` fallback in `onboard.ts` to use `isProxyHealthy()` instead of inline PID + host check logic - Add host-side fallback for `vllm-local` using the existing health check endpoint - Replace misleading "Ensure the Ollama auth proxy is running" error message with actionable "Docker container reachability check failed" message - Add 5 new unit tests in `src/lib/local-inference.test.ts` covering retry success, retry exhaustion with diagnostics, diagnostic fallback, sleep verification, and no-retry on host failure ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ## AI Disclosure - [x] AI-assisted — tool: Claude Code (pi agent) --- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>  ## Summary by CodeRabbit * **New Features** * Container connectivity checks now retry up to 3 times (with 2-second intervals) instead of single attempt. * Explicit timeout settings for Docker connectivity checks. * Enhanced diagnostic reporting when validation fails, including HTTP status codes and network information. * **Bug Fixes** * Improved handling of local inference provider validation failures with automatic host-side health checks. * Graceful fallback behavior allowing setup to continue when host services are healthy despite validation issues.  --------- Signed-off-by: J. Yaunches <jyaunches@nvidia.com>

## Summary Refreshes the daily docs from NemoClaw commits merged in the past 24 hours and advances the docs metadata from 0.0.29 to 0.0.31, the next version after tag v0.0.30. The updates cover documented behavior gaps found in the merged PRs listed below. ## Related Issue None. ## Changes - `docs/versions1.json` and `docs/project.json`: bump the preferred docs version to `0.0.31` for daily release preparation after latest tag `v0.0.30`. - `docs/reference/commands.md`: document non-interactive Brave Search validation fallback from NVIDIA#2511 / 9bfe30b, missing `--from <Dockerfile>` path validation from NVIDIA#2597 / 7186834, and `logs` reading OpenShell audit events from NVIDIA#2590 / e225dfb. - `docs/inference/use-local-inference.md`: document local inference reachability retry and host-side fallback from NVIDIA#2453 / 9dbe855, plus compatible-endpoint timeout coverage from NVIDIA#2583 / b4ef3db. - `docs/reference/troubleshooting.md`: document source-install shim fallback from NVIDIA#2520 / 01a177c, TLS gateway trust recovery from NVIDIA#1936 / 24725d2, compatible-endpoint timeout coverage from NVIDIA#2583 / b4ef3db, local reachability diagnostics from NVIDIA#2453 / 9dbe855, and host proxy `NO_PROXY` injection from NVIDIA#2662 / b4df07e. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [x] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) Additional verification: - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --dry-run` passed. - `git diff --check` passed. - Pre-push hooks passed through markdownlint, docs-to-skills, JSON checks, gitleaks, and version sync before `Test (skills YAML)` failed because this fresh worktree lacked `vitest/config`. - `npx prek run --all-files` could not run from the fresh worktree because `npx prek` resolved to a missing `prek@*` package; downloading `@j178/prek` was not approved. - `npm test` could not complete from the fresh worktree because dependencies and compiled `dist/lib/*` artifacts were absent. ## AI Disclosure - [x] AI-assisted — tool: OpenAI Codex --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>  ## Summary by CodeRabbit * **Documentation** * Version updated to 0.0.31 * Local inference onboarding now includes retry logic for container reachability checks * Web search setup failure handling clarified with fallback guidance * Dockerfile path validation timing documented * Logging behavior clarified for concurrent stream reading * New TLS/certificate troubleshooting section added * Install path and proxy configuration troubleshooting updated  Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

jyaunches self-assigned this Apr 24, 2026

coderabbitai Bot reviewed Apr 24, 2026

View reviewed changes

Comment thread src/lib/onboard.ts Outdated

Merge remote-tracking branch 'origin/main' into pr/issue-2425

06f77ed

coderabbitai Bot reviewed Apr 25, 2026

View reviewed changes

Comment thread src/lib/onboard.ts Outdated

coderabbitai Bot reviewed Apr 25, 2026

View reviewed changes

Comment thread src/lib/ollama-proxy.ts Outdated

Comment thread src/lib/ollama-proxy.ts Outdated

jyaunches added the v0.0.27 label Apr 27, 2026

cv added v0.0.28 and removed v0.0.27 labels Apr 27, 2026

coderabbitai Bot reviewed Apr 28, 2026

View reviewed changes

cv added v0.0.29 and removed v0.0.28 labels Apr 28, 2026

jyaunches mentioned this pull request Apr 29, 2026

test(e2e): add Brev launchable install-flow smoke test #2677

Merged

4 tasks

coderabbitai Bot reviewed Apr 29, 2026

View reviewed changes

cv approved these changes Apr 29, 2026

View reviewed changes

cv merged commit 9dbe855 into main Apr 29, 2026
51 checks passed

miyoungc mentioned this pull request Apr 30, 2026

docs: refresh daily docs for 0.0.31 #2724

Merged

13 tasks

wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026

jyaunches deleted the pr/issue-2425 branch June 12, 2026 13:53

Conversation

jyaunches commented Apr 24, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related Issue

Changes

Type of Change

Verification

AI Disclosure

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Apr 29, 2026

Selective E2E Results — ✅ All requested jobs passed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jyaunches commented Apr 24, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 24, 2026 •

edited

Loading