Skip to content

fix(onboard): retry and fallback for container reachability check#2453

Merged
cv merged 5 commits into
mainfrom
pr/issue-2425
Apr 29, 2026
Merged

fix(onboard): retry and fallback for container reachability check#2453
cv merged 5 commits into
mainfrom
pr/issue-2425

Conversation

@jyaunches

@jyaunches jyaunches commented Apr 24, 2026

Copy link
Copy Markdown
Contributor

Summary

On Brev cloud GPU instances, nemoclaw onboard with Local Ollama fails at step 4/8 because Docker's --add-host host-gateway does not route correctly, causing a false-negative container reachability check. This PR adds timeout, retry with backoff, host-side fallback, and improved diagnostics so onboarding succeeds when the proxy is demonstrably healthy on the host.

Additionally, this PR extracts the entire Ollama auth proxy subsystem from the onboard.ts monolith into a dedicated ollama-proxy.ts module, reducing onboard.ts by 161 lines and advancing issue #767.

Related Issue

Fixes #2425
Ref #767

Changes

  • Add --connect-timeout 5 --max-time 10 to the container reachability curl command in src/lib/local-inference.ts to prevent indefinite hangs
  • Add retry loop (3 attempts, 2s backoff) around the container check in validateLocalProvider()
  • Collect HTTP status code and host-gateway resolution diagnostics when retries are exhausted
  • Add optional diagnostic field to ValidationResult interface
  • Extract Ollama auth proxy subsystem from src/lib/onboard.ts into new src/lib/ollama-proxy.ts — token persistence, PID tracking, process spawn/kill, health probing (161 lines removed from onboard.ts)
  • Add isProxyHealthy() in ollama-proxy.ts that performs an actual HTTP probe against the proxy endpoint, not just a PID check — addresses CodeRabbit feedback about wedged/stale proxy processes
  • Simplify setupInference() fallback in onboard.ts to use isProxyHealthy() instead of inline PID + host check logic
  • Add host-side fallback for vllm-local using the existing health check endpoint
  • Replace misleading "Ensure the Ollama auth proxy is running" error message with actionable "Docker container reachability check failed" message
  • Add 5 new unit tests in src/lib/local-inference.test.ts covering retry success, retry exhaustion with diagnostics, diagnostic fallback, sleep verification, and no-retry on host failure

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

AI Disclosure

  • AI-assisted — tool: Claude Code (pi agent)

Signed-off-by: Julie Yaunches jyaunches@nvidia.com

Summary by CodeRabbit

  • New Features

    • Container connectivity checks now retry up to 3 times (with 2-second intervals) instead of single attempt.
    • Explicit timeout settings for Docker connectivity checks.
    • Enhanced diagnostic reporting when validation fails, including HTTP status codes and network information.
  • Bug Fixes

    • Improved handling of local inference provider validation failures with automatic host-side health checks.
    • Graceful fallback behavior allowing setup to continue when host services are healthy despite validation issues.

)

Add timeout, retry with backoff, host-side fallback, and improved
diagnostics to the container reachability check during onboarding.

On Brev cloud GPU instances, Docker's --add-host host-gateway does not
route correctly, causing a false-negative failure that blocks onboarding
even though the Ollama auth proxy is healthy on the host. The sandbox
uses a different networking stack (k3s CoreDNS + NodeHosts) and would
succeed.

Changes:
- Add --connect-timeout 5 --max-time 10 to container curl command
- Retry the container check up to 3 times with 2s backoff
- Collect HTTP status and host-gateway resolution diagnostics on failure
- Fall back to host-side health check before exiting fatally
- Replace misleading error messages with actionable diagnostics
- Add diagnostic field to ValidationResult interface

Fixes #2425
@jyaunches jyaunches self-assigned this Apr 24, 2026
@coderabbitai

coderabbitai Bot commented Apr 24, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This pull request enhances local inference provider validation with retry logic, explicit timeouts, and diagnostic gathering capabilities. Container reachability checks now retry up to 3 times with 2-second delays. The onboarding flow is updated to tolerate certain validation failures by checking host-side reachability and auth proxy health, with a new HTTP-based health check for the Ollama proxy. Ollama proxy lifecycle functions are reorganized into a dedicated module.

Changes

Cohort / File(s) Summary
Container Validation Logic
src/lib/local-inference.ts, src/lib/local-inference.test.ts
Container reachability checks now retry up to 3 times with 2-second delays between attempts. Docker commands include explicit timeouts (--connect-timeout 5, --max-time 10). validateLocalProvider signature updated to accept injectable sleepFn parameter. ValidationResult interface extended with optional diagnostic field. New diagnostic helpers gather HTTP status codes and host mappings; test coverage expanded for retry behavior, diagnostic reporting, and failure scenarios.
Ollama Proxy Health & Lifecycle
src/lib/onboard-ollama-proxy.ts, src/nemoclaw.ts
New isProxyHealthy() function validates proxy availability via HTTP probe to /api/tags; degrades to process-based signal on HTTP failure. killStaleProxy function exported. Ollama proxy lifecycle imports relocated to dedicated onboard-ollama-proxy module in nemoclaw.ts.
Onboarding Flow Tolerances
src/lib/onboard.ts
setupInference now tolerates validation failures for local providers by checking host-side reachability and auth proxy health instead of immediately aborting. For vllm-local: downgrades failed validation to warning if host health check succeeds. For ollama-local: proactively ensures auth proxy is running and healthy; if healthy, downgrades validation failure to warning and continues; otherwise logs as error and exits. Conditional fallback logic applies provider-specific recovery strategies.

Sequence Diagram(s)

sequenceDiagram
    participant Client as NemoClaw CLI
    participant Onboard as setupInference
    participant Validator as validateLocalProvider
    participant HostCheck as Host Health Check
    participant ProxyCheck as Proxy Health Check

    Client->>Onboard: setupInference(provider='ollama-local')
    Onboard->>Validator: validateLocalProvider(provider, runCaptureImpl, sleepFn)
    
    loop Retry up to 3 times
        Validator->>Validator: Docker run with timeouts<br/>(--connect-timeout 5, --max-time 10)
        alt Container reachable
            Validator-->>Onboard: ValidationResult { success: true }
        else Container unreachable
            Validator->>Validator: sleep 2 seconds
        end
    end
    
    alt All retries failed
        Validator->>Validator: Gather diagnostics<br/>(HTTP status, /etc/hosts)
        Validator-->>Onboard: ValidationResult { success: false, diagnostic }
        
        Onboard->>ProxyCheck: isProxyHealthy()
        ProxyCheck->>ProxyCheck: HTTP probe /api/tags
        alt Proxy healthy
            ProxyCheck-->>Onboard: true
            Onboard->>Client: Downgrade to warning<br/>Continue setup
        else Proxy unhealthy
            ProxyCheck-->>Onboard: false
            Onboard->>Client: Log error<br/>Exit setup
        end
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 Hop, retry, and try once more!
With timeouts set and health checks sure,
The auth proxy stands at the door,
Container bridges now endure! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 45.45% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: adding retry logic and fallback mechanisms for container reachability checks to fix onboarding failures.
Linked Issues check ✅ Passed The PR addresses all coding requirements from issue #2425: retry container reachability checks, capture diagnostics, add timeouts, implement host-side fallbacks, and improve error messaging.
Out of Scope Changes check ✅ Passed The PR refactors Ollama auth proxy code into a new module and reorganizes imports in nemoclaw.ts, which is within scope of improving the fallback mechanism per the PR objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch pr/issue-2425

Review rate limit: 9/10 reviews remaining, refill in 6 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 5517-5545: The current bypass logic relies on a ps check
(isOllamaProxyProcess(loadPersistedProxyPid())) and an Ollama /api/tags health
check (getLocalProviderHealthCheck + runCapture), which can miss a wedged/stale
auth proxy; instead, before deciding to continue or exit, call
ensureOllamaAuthProxy() (to start/restart the proxy if needed) and then perform
an actual request against the proxy health endpoint (not Ollama's /api/tags)
using the same runCapture helper; replace the proxyAlive/hostResponding logic
with a probe that first invokes ensureOllamaAuthProxy(), then hits the proxy
health URL and only treats the proxy as healthy if that probe returns success
(fall back to existing diagnostic logging and process.exit(1) if the proxy
endpoint remains unhealthy).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 76683991-0e05-4c63-abbb-78954b4ed90d

📥 Commits

Reviewing files that changed from the base of the PR and between 260b237 and 7385d6e.

📒 Files selected for processing (3)
  • src/lib/local-inference.test.ts
  • src/lib/local-inference.ts
  • src/lib/onboard.ts

Comment thread src/lib/onboard.ts Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/lib/onboard.ts (1)

5521-5533: ⚠️ Potential issue | 🟠 Major

Probe the auth proxy endpoint, not just the Ollama process.

This still keys the bypass off “PID exists + Ollama /api/tags responds”. A wedged ollama-auth-proxy.js on port OLLAMA_PROXY_PORT will pass this gate, and the later ensureOllamaAuthProxy() call also only checks the PID, so onboarding can continue with a dead proxy. Probe the proxy endpoint you actually hand to OpenShell after ensureOllamaAuthProxy() before downgrading this to a warning.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 5521 - 5533, The current bypass only checks
PID + host provider health (getLocalProviderHealthCheck/runCapture) but not the
actual auth-proxy HTTP endpoint, so replace or augment hostResponding to perform
an HTTP probe against the real auth proxy URL/port used by OpenShell (the same
endpoint that ensureOllamaAuthProxy registers, e.g.
http://localhost:OLLAMA_PROXY_PORT/<health-or-api-path>) and treat
success/failure from that probe when deciding whether to downgrade to a warning;
keep the existing isOllamaProxyProcess(loadPersistedProxyPid()) check but
require the proxy HTTP probe to succeed (using the same path used by
ensureOllamaAuthProxy) before allowing the bypass.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 5475-5476: The health check currently interprets the string result
of runCapture() as a boolean (hostResponding = !!runCapture(...)), which can be
truthy even when curl failed; instead call runCapture and inspect its exit
status (e.g., the returned object's exitCode or status) and set hostResponding =
true only when that exit code === 0 (or use the appropriate runCapture variant
that returns an exit/status code). Update the hostCheck/hostResponding
assignment using getLocalProviderHealthCheck and runCapture so you check the
command's exit code, and apply the same change to the other occurrence at the
region around the other hostResponding usage (the block referenced at lines
~5522-5523).

---

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 5521-5533: The current bypass only checks PID + host provider
health (getLocalProviderHealthCheck/runCapture) but not the actual auth-proxy
HTTP endpoint, so replace or augment hostResponding to perform an HTTP probe
against the real auth proxy URL/port used by OpenShell (the same endpoint that
ensureOllamaAuthProxy registers, e.g.
http://localhost:OLLAMA_PROXY_PORT/<health-or-api-path>) and treat
success/failure from that probe when deciding whether to downgrade to a warning;
keep the existing isOllamaProxyProcess(loadPersistedProxyPid()) check but
require the proxy HTTP probe to succeed (using the same path used by
ensureOllamaAuthProxy) before allowing the bypass.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c8c5e1a9-256d-4d6b-b46f-d800642ac37e

📥 Commits

Reviewing files that changed from the base of the PR and between 7385d6e and 06f77ed.

📒 Files selected for processing (1)
  • src/lib/onboard.ts

Comment thread src/lib/onboard.ts Outdated
…roxyHealthy (#767)

Extract the entire Ollama auth proxy subsystem (token persistence, PID
tracking, process spawn/kill, health probing) from onboard.ts into a
dedicated ollama-proxy.ts module.

Add isProxyHealthy() which performs an actual HTTP probe against the
proxy endpoint (not just a PID check), addressing CodeRabbit's review
feedback about wedged/stale proxy processes.

Simplify the setupInference fallback to use isProxyHealthy() instead of
inline PID + host check logic.

- Reduces onboard.ts by 161 lines (7910 → 7749)
- New ollama-proxy.ts: 223 lines
- Advances #767 (split onboarding provider logic)

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
src/lib/onboard.ts (2)

5360-5384: ⚠️ Potential issue | 🟠 Major

Repair or reprobe the Ollama auth proxy before aborting.

This branch exits on isProxyHealthy() before ensureOllamaAuthProxy() runs at Line 5389. A stale or missing proxy that is recoverable here still becomes a hard false-negative.

Suggested fix
+    let proxyReady = false;
     if (!validation.ok) {
       // The container reachability check uses Docker's --add-host host-gateway,
       // which may not work on all Docker configurations (e.g., Brev, rootless).
       // The real sandbox uses k3s CoreDNS + NodeHosts — a different path.
       // Probe the actual auth proxy endpoint before giving up.
-      if (!isWsl() && isProxyHealthy()) {
+      proxyReady = !isWsl() && ensureOllamaAuthProxy() && isProxyHealthy();
+      if (proxyReady) {
         console.warn(`  ⚠ ${validation.message}`);
         if (validation.diagnostic) {
           console.warn(`  Diagnostic: ${validation.diagnostic}`);
@@
-    if (!isWsl()) {
-      ensureOllamaAuthProxy();
+    if (!isWsl()) {
+      if (!proxyReady) ensureOllamaAuthProxy();
       const proxyToken = getOllamaProxyToken();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 5360 - 5384, The code exits immediately when
isProxyHealthy() is false even though a recoverable proxy could be created;
modify the branch handling around isProxyHealthy() so that when the proxy
appears unhealthy you call ensureOllamaAuthProxy() (and recheck with
isProxyHealthy()) before logging the fatal error and calling process.exit(1); in
practice, invoke ensureOllamaAuthProxy() when !isProxyHealthy(), handle any
thrown error or returned failure, re-evaluate validation (and
validation.diagnostic) and only log the error and call process.exit(1) if the
reprobe/recovery still reports an unhealthy proxy.

5318-5319: ⚠️ Potential issue | 🟠 Major

Use curl exit status for the vLLM host-health bypass.

Line 5319 still converts runCapture() output into a boolean. A failed curl -sf can still leave captured text behind, which would incorrectly downgrade a real host failure into the warning-and-continue path.

Suggested fix
-      const hostResponding = hostCheck ? !!runCapture(hostCheck, { ignoreError: true }) : false;
+      const hostResponding = hostCheck
+        ? run(hostCheck, { ignoreError: true, suppressOutput: true }).status === 0
+        : false;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 5318 - 5319, The host-health check currently
sets hostResponding by coercing runCapture(...) output to boolean, which can be
non-empty even when curl failed; update the logic in onboard.ts where
getLocalProviderHealthCheck(provider) is assigned to hostCheck so that you
inspect the command exit status instead of the captured stdout. Call the runner
in a way that returns the exit code (e.g., use the run variant or runCapture
result that includes an exitCode/exitStatus) and set hostResponding = hostCheck
? (result.exitCode === 0) : false, keeping the ignoreError behavior but basing
success on the process exit code; reference getLocalProviderHealthCheck,
runCapture, hostCheck, and hostResponding to locate and change the code.
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)

5316-5384: Extract the local-provider fallback path into a helper.

These new vLLM/Ollama branches duplicate the same diagnostic-and-bypass flow inside an already complexity-suppressed function. Pulling that decision into a small helper would keep setupInference() easier to maintain.

As per coding guidelines, {src,nemoclaw/src,scripts}/**/*.{js,ts,tsx}: TypeScript files must maintain cyclomatic complexity limit of 20, ratcheting down to 15. Enforced by ESLint.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 5316 - 5384, The duplicated
diagnostic-and-bypass logic in the vllm-local and ollama-local branches should
be refactored into a small helper to reduce cyclomatic complexity: create a
helper (e.g., ensureLocalProviderOrExit or handleLocalProviderFallback) that
takes the provider name and the validation result and encapsulates the existing
checks (for vllm-local: getLocalProviderHealthCheck + runCapture; for
ollama-local: isWsl + isProxyHealthy) and the three logging paths (warn +
optional diagnostic + bypass message, or error + optional diagnostic +
platform-specific hint + process.exit). Replace the duplicated blocks around
validateLocalProvider in setupInference (or where validateLocalProvider is
called) to call this helper before continuing to upsertProvider / runOpenshell.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/ollama-proxy.ts`:
- Around line 145-165: The code sets ollamaProxyToken before verifying the proxy
is alive; update startOllamaAuthProxy so it generates a local proxyToken, calls
spawnOllamaAuthProxy(pid), waits and checks isOllamaProxyProcess(pid), and only
if the liveness check passes assign ollamaProxyToken = proxyToken (and persist
if needed); if the check fails return false without mutating ollamaProxyToken so
getOllamaProxyToken/ensureOllamaAuthProxy won’t return a stale token; apply the
same change pattern where ollamaProxyToken is currently set early (references:
startOllamaAuthProxy, spawnOllamaAuthProxy, isOllamaProxyProcess,
getOllamaProxyToken, ensureOllamaAuthProxy).
- Around line 207-222: In isProxyHealthy(), don't early-return false when
isOllamaProxyProcess(pid) is falsy; instead store that boolean (e.g.,
hasValidPid = isOllamaProxyProcess(pid)), always perform the HTTP probe built
with loadPersistedProxyToken() and runCapture(..., { ignoreError: true }), and
then return true if the probe produced output, otherwise return hasValidPid (so
a successful HTTP probe wins even if the PID file is missing/stale). Update
references in the function to use the new hasValidPid local variable and remove
the early return.

---

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 5360-5384: The code exits immediately when isProxyHealthy() is
false even though a recoverable proxy could be created; modify the branch
handling around isProxyHealthy() so that when the proxy appears unhealthy you
call ensureOllamaAuthProxy() (and recheck with isProxyHealthy()) before logging
the fatal error and calling process.exit(1); in practice, invoke
ensureOllamaAuthProxy() when !isProxyHealthy(), handle any thrown error or
returned failure, re-evaluate validation (and validation.diagnostic) and only
log the error and call process.exit(1) if the reprobe/recovery still reports an
unhealthy proxy.
- Around line 5318-5319: The host-health check currently sets hostResponding by
coercing runCapture(...) output to boolean, which can be non-empty even when
curl failed; update the logic in onboard.ts where
getLocalProviderHealthCheck(provider) is assigned to hostCheck so that you
inspect the command exit status instead of the captured stdout. Call the runner
in a way that returns the exit code (e.g., use the run variant or runCapture
result that includes an exitCode/exitStatus) and set hostResponding = hostCheck
? (result.exitCode === 0) : false, keeping the ignoreError behavior but basing
success on the process exit code; reference getLocalProviderHealthCheck,
runCapture, hostCheck, and hostResponding to locate and change the code.

---

Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 5316-5384: The duplicated diagnostic-and-bypass logic in the
vllm-local and ollama-local branches should be refactored into a small helper to
reduce cyclomatic complexity: create a helper (e.g., ensureLocalProviderOrExit
or handleLocalProviderFallback) that takes the provider name and the validation
result and encapsulates the existing checks (for vllm-local:
getLocalProviderHealthCheck + runCapture; for ollama-local: isWsl +
isProxyHealthy) and the three logging paths (warn + optional diagnostic + bypass
message, or error + optional diagnostic + platform-specific hint +
process.exit). Replace the duplicated blocks around validateLocalProvider in
setupInference (or where validateLocalProvider is called) to call this helper
before continuing to upsertProvider / runOpenshell.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cd2c8e18-d861-4413-b058-63a9501e1fe0

📥 Commits

Reviewing files that changed from the base of the PR and between 06f77ed and a154279.

📒 Files selected for processing (3)
  • src/lib/ollama-proxy.ts
  • src/lib/onboard.ts
  • src/nemoclaw.ts
✅ Files skipped from review due to trivial changes (1)
  • src/nemoclaw.ts

Comment thread src/lib/ollama-proxy.ts Outdated
Comment thread src/lib/ollama-proxy.ts Outdated
@cv cv added v0.0.28 and removed v0.0.27 labels Apr 27, 2026
- startOllamaAuthProxy: set in-memory token only after confirming
  proxy is alive, preventing stale tokens on failed starts
- isProxyHealthy: don't early-return false on PID check failure;
  always perform HTTP probe as the authoritative signal, falling
  back to PID only when the probe transiently fails
- ollama-local branch: call ensureOllamaAuthProxy() before
  isProxyHealthy() so a stale/missing proxy is recovered before
  we decide to abort onboarding
- vllm-local branch: use run() exit status instead of coercing
  runCapture() output to boolean for host health bypass

Signed-off-by: J. Yaunches <jyaunches@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/lib/ollama-proxy.ts (1)

145-168: ⚠️ Potential issue | 🟠 Major

Failed restarts can still publish a dead proxy token.

Line 188 assigns ollamaProxyToken before the restarted child is proven alive, and the failed-start path still leaves the newly persisted PID behind. If the proxy exits immediately, getOllamaProxyToken() can keep returning a token for a proxy that is no longer serving.

Suggested fix
 export function startOllamaAuthProxy(): boolean {
   // eslint-disable-next-line `@typescript-eslint/no-require-imports`
   const crypto = require("crypto");
   killStaleProxy();
+  ollamaProxyToken = null;

   const proxyToken = crypto.randomBytes(24).toString("hex");
   // Don't persist yet — wait until provider is confirmed in setupInference.
   // If the user backs out to a different provider, the token stays in memory
   // only and is discarded.
   const pid = spawnOllamaAuthProxy(proxyToken);
   sleepSeconds(1);
   if (!isOllamaProxyProcess(pid)) {
+    clearPersistedProxyPid();
     console.error(`  Error: Ollama auth proxy failed to start on :${OLLAMA_PROXY_PORT}`);
     console.error(`  Containers will not be able to reach Ollama without the proxy.`);
     console.error(
       `  Check if port ${OLLAMA_PROXY_PORT} is already in use: lsof -ti :${OLLAMA_PROXY_PORT}`,
     );
@@
 export function ensureOllamaAuthProxy(): void {
   // Try to load persisted token first — if none, this isn't an Ollama setup.
   const token = loadPersistedProxyToken();
   if (!token) return;
@@
   // Proxy not running — restart it with the persisted token.
   killStaleProxy();
-  ollamaProxyToken = token;
-  spawnOllamaAuthProxy(token);
+  const restartedPid = spawnOllamaAuthProxy(token);
   sleepSeconds(1);
+  if (isOllamaProxyProcess(restartedPid)) {
+    ollamaProxyToken = token;
+  } else {
+    clearPersistedProxyPid();
+    ollamaProxyToken = null;
+  }
 }

Also applies to: 175-191

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/ollama-proxy.ts` around lines 145 - 168, In startOllamaAuthProxy(),
avoid persisting the new proxy PID or setting ollamaProxyToken until the spawned
process is verified alive: generate proxyToken, call
spawnOllamaAuthProxy(proxyToken) but do NOT persist the PID inside
spawn/killStaleProxy yet; after sleepSeconds and a successful
isOllamaProxyProcess(pid) check, then persist the PID (wherever you currently
store it) and assign ollamaProxyToken; if the check fails, ensure you clear any
persisted PID left by the restart path and do NOT set ollamaProxyToken so
getOllamaProxyToken() cannot return a token for a dead proxy (update
spawnOllamaAuthProxy/killStaleProxy behavior if they currently persist
immediately).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 5318-5323: The host-side vLLM fallback probe can hang because
getLocalProviderHealthCheck("vllm-local") returns a plain "curl -sf ..."
command; update the probe to enforce a timeout by either (A) modifying
getLocalProviderHealthCheck to include curl timing flags (e.g. --max-time and/or
--connect-timeout) for the "vllm-local" case so the returned hostCheck command
always contains a bounded timeout, or (B) if you prefer to keep
getLocalProviderHealthCheck unchanged, detect the vllm-local provider before
calling run() and wrap the hostCheck invocation with a timeout tool (e.g. prefix
with timeout or add equivalent flags) so run(hostCheck, { ignoreError: true,
suppressOutput: true }).status is decided within a bounded time; ensure the
unique symbols involved are hostCheck, hostResponding,
getLocalProviderHealthCheck, and run.

---

Duplicate comments:
In `@src/lib/ollama-proxy.ts`:
- Around line 145-168: In startOllamaAuthProxy(), avoid persisting the new proxy
PID or setting ollamaProxyToken until the spawned process is verified alive:
generate proxyToken, call spawnOllamaAuthProxy(proxyToken) but do NOT persist
the PID inside spawn/killStaleProxy yet; after sleepSeconds and a successful
isOllamaProxyProcess(pid) check, then persist the PID (wherever you currently
store it) and assign ollamaProxyToken; if the check fails, ensure you clear any
persisted PID left by the restart path and do NOT set ollamaProxyToken so
getOllamaProxyToken() cannot return a token for a dead proxy (update
spawnOllamaAuthProxy/killStaleProxy behavior if they currently persist
immediately).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8259b168-d6ca-45c0-bb97-4fcbe5766478

📥 Commits

Reviewing files that changed from the base of the PR and between a154279 and d89b961.

📒 Files selected for processing (2)
  • src/lib/ollama-proxy.ts
  • src/lib/onboard.ts

Comment thread src/lib/onboard.ts
Comment on lines +5318 to +5323
const hostCheck = getLocalProviderHealthCheck(provider);
// Use run() and check exit status rather than coercing runCapture() output
// to boolean — curl -sf can leave output even on failure in edge cases.
const hostResponding = hostCheck
? run(hostCheck, { ignoreError: true, suppressOutput: true }).status === 0
: false;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Bound the host-side vLLM fallback probe.

getLocalProviderHealthCheck("vllm-local") is still a plain curl -sf ... command, so this new fallback can hang indefinitely if localhost accepts the connection but never responds. That reintroduces the step-4 stall on the warning/continue path. Please add a timeout here (or inject curl timing flags) before deciding whether to bypass validation.

Suggested fix
       const hostResponding = hostCheck
-        ? run(hostCheck, { ignoreError: true, suppressOutput: true }).status === 0
+        ? run(hostCheck, {
+            ignoreError: true,
+            suppressOutput: true,
+            timeout: 10_000,
+          }).status === 0
         : false;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 5318 - 5323, The host-side vLLM fallback
probe can hang because getLocalProviderHealthCheck("vllm-local") returns a plain
"curl -sf ..." command; update the probe to enforce a timeout by either (A)
modifying getLocalProviderHealthCheck to include curl timing flags (e.g.
--max-time and/or --connect-timeout) for the "vllm-local" case so the returned
hostCheck command always contains a bounded timeout, or (B) if you prefer to
keep getLocalProviderHealthCheck unchanged, detect the vllm-local provider
before calling run() and wrap the hostCheck invocation with a timeout tool (e.g.
prefix with timeout or add equivalent flags) so run(hostCheck, { ignoreError:
true, suppressOutput: true }).status is decided within a bounded time; ensure
the unique symbols involved are hostCheck, hostResponding,
getLocalProviderHealthCheck, and run.

@cv cv added v0.0.29 and removed v0.0.28 labels Apr 28, 2026
jyaunches added a commit that referenced this pull request Apr 29, 2026
## Summary

`scripts/brev-launchable-ci-cpu.sh` is the community install path for
Brev users — it bootstraps a VM with Docker, Node.js, OpenShell, and
NemoClaw. **That script already exists in the repo but has zero CI
coverage.** This PR adds a nightly E2E smoke test that validates the
script works end-to-end.

This is the long-living safety net for the community install flow. If
any regression breaks the launchable script (e.g., the Apr 20–25 Brev
outage from #2472/#2482, or the container reachability fallback from
#2425), this test catches it before community users are affected.

## Related Issue

Closes #2599
Related: #2425 (the `isProxyHealthy()` fallback in PR #2453 — if that
regresses, onboard will abort on Brev and this smoke test catches it)

## Changes

### New: `test/e2e/test-launchable-smoke.sh`

| Phase | What it validates |
|-------|-------------------|
| 0 | Pre-cleanup + pre-seed clone directory from checkout |
| 1 | Prerequisites (Docker, NVIDIA_API_KEY, network, env vars) |
| 2 | Run `brev-launchable-ci-cpu.sh` — the existing community bootstrap
script |
| 3 | Verify artifacts (nemoclaw, openshell, Node.js, Docker, sentinel
file, built outputs) |
| 4 | `nemoclaw onboard --non-interactive` with cloud provider |
| 5 | Sandbox health (list, status, inference config, gateway) |
| 6 | Live inference (direct API, routing via inference.local, openclaw
agent 6×7=42) |
| 7 | Destroy + cleanup |

Key design decisions:
- **No BREV_API_TOKEN needed** — the launchable script is a generic
Ubuntu bootstrap with zero Brev dependencies, so it runs on standard
GitHub-hosted `ubuntu-latest` runners
- **Tests current code, not main** — pre-seeds the clone directory from
the CI checkout so regressions are caught before reaching community
users
- **Follows existing E2E conventions** — pass/fail/section helpers,
e2e-timeout.sh self-wrap, sandbox-teardown.sh EXIT trap,
parse_chat_content() for reasoning models

### Modified: `.github/workflows/nightly-e2e.yaml`

- Added `launchable-smoke-e2e` job: `ubuntu-latest`, 30min timeout,
`NVIDIA_API_KEY` secret
- Uploads install/onboard/test logs as artifacts on failure
- Added to `notify-on-failure` needs list

## Validation

Triggered via fork dispatch (`jyaunches/NemoClaw` → `sparky-dispatch` →
`launchable-smoke`):
- **Run:**
https://github.com/jyaunches/NemoClaw/actions/runs/25075715342
- **Result:** ✅ 24 passed, 0 failed, 1 skipped (Node.js version — GH
runner pre-installs Node 20)
- **Runtime:** ~12 minutes

## Type of Change
- [x] Code change (feature, bug fix, or refactor)

## Checklist
- [x] Follows project coding conventions
- [x] Tests pass locally or in CI
- [x] No secrets/credentials committed


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added an end-to-end smoke test and CI job that validates the community
launchable CPU install path (install, onboarding, runtime readiness, and
a simple inference check). CI now uploads install/onboard/test logs on
failures.

* **Chores**
* Renamed the branch-validation workflow and corresponding test-suite
identifiers for clarity.
* Updated E2E test documentation and project configuration names to
match the new labeling.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
- Adopt main's onboard-ollama-proxy.ts (CJS) as canonical module
- Add isProxyHealthy() from PR's ollama-proxy.ts into onboard-ollama-proxy.ts
- Remove PR's ollama-proxy.ts (superseded by onboard-ollama-proxy.ts)
- Update imports in onboard.ts and nemoclaw.ts to use onboard-ollama-proxy
- Take main's getRequestedSandboxNameHint(opts) signature
- Include main's hydrateCredentialEnv, gateway-token-command additions

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
src/lib/onboard.ts (2)

5146-5151: ⚠️ Potential issue | 🟠 Major

Bound the host-side vLLM fallback probe.

getLocalProviderHealthCheck("vllm-local") still resolves to a plain curl -sf ..., so this warning/continue path can hang indefinitely when localhost accepts the connection but never responds. That reintroduces the Step 4 stall this PR is trying to eliminate.

Suggested fix
       const hostResponding = hostCheck
-        ? run(hostCheck, { ignoreError: true, suppressOutput: true }).status === 0
+        ? run(hostCheck, {
+            ignoreError: true,
+            suppressOutput: true,
+            timeout: 10_000,
+          }).status === 0
         : false;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 5146 - 5151, The host health probe for
getLocalProviderHealthCheck("vllm-local") can hang because the returned curl
command lacks a request timeout; update the probe so hostCheck is bounded (e.g.,
include curl's --max-time N or wrap the command with the system timeout utility)
so run(hostCheck, { ignoreError: true, suppressOutput: true }) will return
promptly instead of hanging; modify getLocalProviderHealthCheck (or the value
assigned to hostCheck) to inject a short timeout (e.g., 1–3s) for vllm-local
while keeping the existing run(...) usage and error handling.

5209-5240: ⚠️ Potential issue | 🟠 Major

Require a real HTTP success before bypassing Ollama validation.

proxyReady comes from isProxyHealthy(), but that helper still falls back to PID liveness when the HTTP probe fails. This branch can therefore warn-and-continue, and then skip the later ensureOllamaAuthProxy() call, even though the proxy endpoint is not actually serving yet. Keep PID state diagnostic-only and gate the bypass on an actual proxy response.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 5209 - 5240, The current bypass uses
proxyReady = isProxyHealthy(), but isProxyHealthy() may return true based on PID
liveness rather than an actual HTTP probe, allowing the code path that
warns-and-continues to run even when the proxy endpoint isn't serving; change
the logic so the bypass only happens when an actual HTTP success is observed:
either update isProxyHealthy() to distinguish PID-only liveness from an
HTTP-serving result (e.g., return a boolean/enum that indicates HTTP success) or
add an explicit HTTP probe call (e.g., probeOllamaProxyHttp()) after
isProxyHealthy(), and only set/use proxyReady = true when the HTTP probe
succeeds; keep PID-based info as diagnostic text but do not gate
ensureOllamaAuthProxy() or skip shutdown unless the HTTP probe confirms the
proxy is serving. Ensure references include the proxyReady variable,
isProxyHealthy(), ensureOllamaAuthProxy(), and getOllamaProxyToken() so
reviewers can find the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard-ollama-proxy.ts`:
- Around line 256-268: The health probe uses loadPersistedProxyToken() which can
be null during onboarding even though a valid in-memory token exists; update the
probe logic to prefer the in-memory proxy token (e.g., the runtime variable that
holds the current token) before falling back to loadPersistedProxyToken(), then
build probeCmd with that chosen token (still falling back to no-Auth when none
exists) so the Authorization header uses the in-memory token when available;
keep references to probeCmd, proxyUrl, runCapture, and hasValidPid so the
authenticated HTTP probe is attempted first and only then fallback to PID.

---

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 5146-5151: The host health probe for
getLocalProviderHealthCheck("vllm-local") can hang because the returned curl
command lacks a request timeout; update the probe so hostCheck is bounded (e.g.,
include curl's --max-time N or wrap the command with the system timeout utility)
so run(hostCheck, { ignoreError: true, suppressOutput: true }) will return
promptly instead of hanging; modify getLocalProviderHealthCheck (or the value
assigned to hostCheck) to inject a short timeout (e.g., 1–3s) for vllm-local
while keeping the existing run(...) usage and error handling.
- Around line 5209-5240: The current bypass uses proxyReady = isProxyHealthy(),
but isProxyHealthy() may return true based on PID liveness rather than an actual
HTTP probe, allowing the code path that warns-and-continues to run even when the
proxy endpoint isn't serving; change the logic so the bypass only happens when
an actual HTTP success is observed: either update isProxyHealthy() to
distinguish PID-only liveness from an HTTP-serving result (e.g., return a
boolean/enum that indicates HTTP success) or add an explicit HTTP probe call
(e.g., probeOllamaProxyHttp()) after isProxyHealthy(), and only set/use
proxyReady = true when the HTTP probe succeeds; keep PID-based info as
diagnostic text but do not gate ensureOllamaAuthProxy() or skip shutdown unless
the HTTP probe confirms the proxy is serving. Ensure references include the
proxyReady variable, isProxyHealthy(), ensureOllamaAuthProxy(), and
getOllamaProxyToken() so reviewers can find the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7c78189c-5ea8-4bdd-bb35-77656a56fc5b

📥 Commits

Reviewing files that changed from the base of the PR and between d89b961 and edb6a44.

📒 Files selected for processing (3)
  • src/lib/onboard-ollama-proxy.ts
  • src/lib/onboard.ts
  • src/nemoclaw.ts
✅ Files skipped from review due to trivial changes (1)
  • src/nemoclaw.ts

Comment on lines +256 to +268
const token = loadPersistedProxyToken();
const probeCmd = token
? ["curl", "-sf", "--connect-timeout", "3", "--max-time", "5",
"-H", `Authorization: Bearer ${token}`, proxyUrl]
: ["curl", "-sf", "--connect-timeout", "3", "--max-time", "5", proxyUrl];

const output = runCapture(probeCmd, { ignoreError: true });
if (output) return true;

// HTTP probe failed — fall back to PID as a weaker signal.
// This covers edge cases where the probe transiently fails but the
// process is confirmed alive.
return hasValidPid;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Use in-memory token for health probe to avoid misclassifying proxy health

Line 256 uses loadPersistedProxyToken(), which can be null during onboarding even when a valid in-memory token exists. That makes the probe unauthenticated and can incorrectly fall back to PID-only health (or fail when PID is stale).

💡 Proposed fix
-  const token = loadPersistedProxyToken();
+  const token = getOllamaProxyToken();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard-ollama-proxy.ts` around lines 256 - 268, The health probe
uses loadPersistedProxyToken() which can be null during onboarding even though a
valid in-memory token exists; update the probe logic to prefer the in-memory
proxy token (e.g., the runtime variable that holds the current token) before
falling back to loadPersistedProxyToken(), then build probeCmd with that chosen
token (still falling back to no-Auth when none exists) so the Authorization
header uses the in-memory token when available; keep references to probeCmd,
proxyUrl, runCapture, and hasValidPid so the authenticated HTTP probe is
attempted first and only then fallback to PID.

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25132934106
Branch: pr/issue-2425
Requested jobs: cloud-e2e,cloud-onboard-e2e,sandbox-operations-e2e
Summary: 3 passed, 0 failed, 19 skipped

Job Result
cloud-e2e ✅ success
cloud-inference-e2e ⏭️ skipped
cloud-onboard-e2e ✅ success
deployment-services-e2e ⏭️ skipped
diagnostics-e2e ⏭️ skipped
docs-validation-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-e2e ⏭️ skipped
inference-routing-e2e ⏭️ skipped
messaging-providers-e2e ⏭️ skipped
network-policy-e2e ⏭️ skipped
overlayfs-autofix-e2e ⏭️ skipped
rebuild-hermes-e2e ⏭️ skipped
rebuild-openclaw-e2e ⏭️ skipped
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ⏭️ skipped
shields-config-e2e ⏭️ skipped
skill-agent-e2e ⏭️ skipped
skip-permissions-e2e ⏭️ skipped
snapshot-commands-e2e ⏭️ skipped
token-rotation-e2e ⏭️ skipped
upgrade-stale-sandbox-e2e ⏭️ skipped

@cv cv merged commit 9dbe855 into main Apr 29, 2026
51 checks passed
@miyoungc miyoungc mentioned this pull request Apr 30, 2026
13 tasks
miyoungc added a commit that referenced this pull request Apr 30, 2026
## Summary
Refreshes the daily docs from NemoClaw commits merged in the past 24
hours and advances the docs metadata from 0.0.29 to 0.0.31, the next
version after tag v0.0.30.
The updates cover documented behavior gaps found in the merged PRs
listed below.

## Related Issue
None.

## Changes
- `docs/versions1.json` and `docs/project.json`: bump the preferred docs
version to `0.0.31` for daily release preparation after latest tag
`v0.0.30`.
- `docs/reference/commands.md`: document non-interactive Brave Search
validation fallback from #2511 / 9bfe30b, missing `--from <Dockerfile>`
path validation from #2597 / 7186834, and `logs` reading OpenShell
audit events from #2590 / e225dfb.
- `docs/inference/use-local-inference.md`: document local inference
reachability retry and host-side fallback from #2453 / 9dbe855, plus
compatible-endpoint timeout coverage from #2583 / b4ef3db.
- `docs/reference/troubleshooting.md`: document source-install shim
fallback from #2520 / 01a177c, TLS gateway trust recovery from #1936 /
24725d2, compatible-endpoint timeout coverage from #2583 / b4ef3db,
local reachability diagnostics from #2453 / 9dbe855, and host proxy
`NO_PROXY` injection from #2662 / b4df07e.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification
- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

Additional verification:
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --dry-run` passed.
- `git diff --check` passed.
- Pre-push hooks passed through markdownlint, docs-to-skills, JSON
checks, gitleaks, and version sync before `Test (skills YAML)` failed
because this fresh worktree lacked `vitest/config`.
- `npx prek run --all-files` could not run from the fresh worktree
because `npx prek` resolved to a missing `prek@*` package; downloading
`@j178/prek` was not approved.
- `npm test` could not complete from the fresh worktree because
dependencies and compiled `dist/lib/*` artifacts were absent.

## AI Disclosure
- [x] AI-assisted — tool: OpenAI Codex

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
  * Version updated to 0.0.31
* Local inference onboarding now includes retry logic for container
reachability checks
  * Web search setup failure handling clarified with fallback guidance
  * Dockerfile path validation timing documented
  * Logging behavior clarified for concurrent stream reading
  * New TLS/certificate troubleshooting section added
  * Install path and proxy configuration troubleshooting updated

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>
DemianHeyGen pushed a commit to DemianHeyGen/NemoClaw that referenced this pull request Apr 30, 2026
## Summary

`scripts/brev-launchable-ci-cpu.sh` is the community install path for
Brev users — it bootstraps a VM with Docker, Node.js, OpenShell, and
NemoClaw. **That script already exists in the repo but has zero CI
coverage.** This PR adds a nightly E2E smoke test that validates the
script works end-to-end.

This is the long-living safety net for the community install flow. If
any regression breaks the launchable script (e.g., the Apr 20–25 Brev
outage from NVIDIA#2472/NVIDIA#2482, or the container reachability fallback from
NVIDIA#2425), this test catches it before community users are affected.

## Related Issue

Closes NVIDIA#2599
Related: NVIDIA#2425 (the `isProxyHealthy()` fallback in PR NVIDIA#2453 — if that
regresses, onboard will abort on Brev and this smoke test catches it)

## Changes

### New: `test/e2e/test-launchable-smoke.sh`

| Phase | What it validates |
|-------|-------------------|
| 0 | Pre-cleanup + pre-seed clone directory from checkout |
| 1 | Prerequisites (Docker, NVIDIA_API_KEY, network, env vars) |
| 2 | Run `brev-launchable-ci-cpu.sh` — the existing community bootstrap
script |
| 3 | Verify artifacts (nemoclaw, openshell, Node.js, Docker, sentinel
file, built outputs) |
| 4 | `nemoclaw onboard --non-interactive` with cloud provider |
| 5 | Sandbox health (list, status, inference config, gateway) |
| 6 | Live inference (direct API, routing via inference.local, openclaw
agent 6×7=42) |
| 7 | Destroy + cleanup |

Key design decisions:
- **No BREV_API_TOKEN needed** — the launchable script is a generic
Ubuntu bootstrap with zero Brev dependencies, so it runs on standard
GitHub-hosted `ubuntu-latest` runners
- **Tests current code, not main** — pre-seeds the clone directory from
the CI checkout so regressions are caught before reaching community
users
- **Follows existing E2E conventions** — pass/fail/section helpers,
e2e-timeout.sh self-wrap, sandbox-teardown.sh EXIT trap,
parse_chat_content() for reasoning models

### Modified: `.github/workflows/nightly-e2e.yaml`

- Added `launchable-smoke-e2e` job: `ubuntu-latest`, 30min timeout,
`NVIDIA_API_KEY` secret
- Uploads install/onboard/test logs as artifacts on failure
- Added to `notify-on-failure` needs list

## Validation

Triggered via fork dispatch (`jyaunches/NemoClaw` → `sparky-dispatch` →
`launchable-smoke`):
- **Run:**
https://github.com/jyaunches/NemoClaw/actions/runs/25075715342
- **Result:** ✅ 24 passed, 0 failed, 1 skipped (Node.js version — GH
runner pre-installs Node 20)
- **Runtime:** ~12 minutes

## Type of Change
- [x] Code change (feature, bug fix, or refactor)

## Checklist
- [x] Follows project coding conventions
- [x] Tests pass locally or in CI
- [x] No secrets/credentials committed


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added an end-to-end smoke test and CI job that validates the community
launchable CPU install path (install, onboarding, runtime readiness, and
a simple inference check). CI now uploads install/onboard/test logs on
failures.

* **Chores**
* Renamed the branch-validation workflow and corresponding test-suite
identifiers for clarity.
* Updated E2E test documentation and project configuration names to
match the new labeling.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
DemianHeyGen pushed a commit to DemianHeyGen/NemoClaw that referenced this pull request Apr 30, 2026
…IDIA#2453)

## Summary

On Brev cloud GPU instances, `nemoclaw onboard` with Local Ollama fails
at step 4/8 because Docker's `--add-host host-gateway` does not route
correctly, causing a false-negative container reachability check. This
PR adds timeout, retry with backoff, host-side fallback, and improved
diagnostics so onboarding succeeds when the proxy is demonstrably
healthy on the host.

Additionally, this PR extracts the entire Ollama auth proxy subsystem
from the `onboard.ts` monolith into a dedicated `ollama-proxy.ts`
module, reducing `onboard.ts` by 161 lines and advancing issue NVIDIA#767.

## Related Issue

Fixes NVIDIA#2425
Ref NVIDIA#767

## Changes

- Add `--connect-timeout 5 --max-time 10` to the container reachability
curl command in `src/lib/local-inference.ts` to prevent indefinite hangs
- Add retry loop (3 attempts, 2s backoff) around the container check in
`validateLocalProvider()`
- Collect HTTP status code and host-gateway resolution diagnostics when
retries are exhausted
- Add optional `diagnostic` field to `ValidationResult` interface
- Extract Ollama auth proxy subsystem from `src/lib/onboard.ts` into new
`src/lib/ollama-proxy.ts` — token persistence, PID tracking, process
spawn/kill, health probing (161 lines removed from onboard.ts)
- Add `isProxyHealthy()` in `ollama-proxy.ts` that performs an actual
HTTP probe against the proxy endpoint, not just a PID check — addresses
CodeRabbit feedback about wedged/stale proxy processes
- Simplify `setupInference()` fallback in `onboard.ts` to use
`isProxyHealthy()` instead of inline PID + host check logic
- Add host-side fallback for `vllm-local` using the existing health
check endpoint
- Replace misleading "Ensure the Ollama auth proxy is running" error
message with actionable "Docker container reachability check failed"
message
- Add 5 new unit tests in `src/lib/local-inference.test.ts` covering
retry success, retry exhaustion with diagnostics, diagnostic fallback,
sleep verification, and no-retry on host failure

## Type of Change
- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes
- [x] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

## AI Disclosure
- [x] AI-assisted — tool: Claude Code (pi agent)

---
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Container connectivity checks now retry up to 3 times (with 2-second
intervals) instead of single attempt.
  * Explicit timeout settings for Docker connectivity checks.
* Enhanced diagnostic reporting when validation fails, including HTTP
status codes and network information.

* **Bug Fixes**
* Improved handling of local inference provider validation failures with
automatic host-side health checks.
* Graceful fallback behavior allowing setup to continue when host
services are healthy despite validation issues.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: J. Yaunches <jyaunches@nvidia.com>
DemianHeyGen pushed a commit to DemianHeyGen/NemoClaw that referenced this pull request Apr 30, 2026
## Summary
Refreshes the daily docs from NemoClaw commits merged in the past 24
hours and advances the docs metadata from 0.0.29 to 0.0.31, the next
version after tag v0.0.30.
The updates cover documented behavior gaps found in the merged PRs
listed below.

## Related Issue
None.

## Changes
- `docs/versions1.json` and `docs/project.json`: bump the preferred docs
version to `0.0.31` for daily release preparation after latest tag
`v0.0.30`.
- `docs/reference/commands.md`: document non-interactive Brave Search
validation fallback from NVIDIA#2511 / 9bfe30b, missing `--from <Dockerfile>`
path validation from NVIDIA#2597 / 7186834, and `logs` reading OpenShell
audit events from NVIDIA#2590 / e225dfb.
- `docs/inference/use-local-inference.md`: document local inference
reachability retry and host-side fallback from NVIDIA#2453 / 9dbe855, plus
compatible-endpoint timeout coverage from NVIDIA#2583 / b4ef3db.
- `docs/reference/troubleshooting.md`: document source-install shim
fallback from NVIDIA#2520 / 01a177c, TLS gateway trust recovery from NVIDIA#1936 /
24725d2, compatible-endpoint timeout coverage from NVIDIA#2583 / b4ef3db,
local reachability diagnostics from NVIDIA#2453 / 9dbe855, and host proxy
`NO_PROXY` injection from NVIDIA#2662 / b4df07e.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification
- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

Additional verification:
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --dry-run` passed.
- `git diff --check` passed.
- Pre-push hooks passed through markdownlint, docs-to-skills, JSON
checks, gitleaks, and version sync before `Test (skills YAML)` failed
because this fresh worktree lacked `vitest/config`.
- `npx prek run --all-files` could not run from the fresh worktree
because `npx prek` resolved to a missing `prek@*` package; downloading
`@j178/prek` was not approved.
- `npm test` could not complete from the fresh worktree because
dependencies and compiled `dist/lib/*` artifacts were absent.

## AI Disclosure
- [x] AI-assisted — tool: OpenAI Codex

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
  * Version updated to 0.0.31
* Local inference onboarding now includes retry logic for container
reachability checks
  * Web search setup failure handling clarified with fallback guidance
  * Dockerfile path validation timing documented
  * Logging behavior clarified for concurrent stream reading
  * New TLS/certificate troubleshooting section added
  * Install path and proxy configuration troubleshooting updated

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026
@jyaunches jyaunches deleted the pr/issue-2425 branch June 12, 2026 13:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Brev] fail to onboard local ollama

3 participants