Skip to content

fix(sandbox): recover stale inference route on connect#3444

Merged
cv merged 1 commit into
NVIDIA:mainfrom
yimoj:fix/3390-ollama-route-recovery
May 15, 2026
Merged

fix(sandbox): recover stale inference route on connect#3444
cv merged 1 commit into
NVIDIA:mainfrom
yimoj:fix/3390-ollama-route-recovery

Conversation

@yimoj

@yimoj yimoj commented May 13, 2026

Copy link
Copy Markdown
Contributor

Summary

Recover stale inference.local routes during sandbox connect by revalidating the in-sandbox route after DNS repair and resetting the managed OpenShell inference route when provider/model still match but the route remains broken. Local Ollama providers now get host and auth-proxy health diagnostics before NemoClaw opens an SSH session into a sandbox with known-broken inference.

Related Issue

Fixes #3390

Changes

  • Re-probes inference.local during connect and --probe-only, including BROKEN 000 responses from failed sandbox curl probes.
  • Re-runs openshell inference set --provider <provider> --model <model> --no-verify after DNS repair when the matching managed route is still unhealthy.
  • Validates local provider host health and the Ollama auth proxy before route reset, then exits with precise diagnostics if the route remains broken.
  • Adds focused connect regression coverage for route reset, final failure blocking, host Ollama diagnostics, WSL behavior, and BROKEN 000 handling.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Verification notes:

  • npm run build:cli passed.
  • npm run typecheck:cli passed.
  • npm run lint passed with an unrelated upstream warning in src/lib/onboard/child-exit-tracker.test.ts.
  • npx vitest run test/sandbox-connect-inference.test.ts passed.
  • npm test -- --run test/cli.test.ts passed during local review.
  • npm test was run locally but remains red due unrelated local installer/source-checkout and Dockerfile fetch-guard permission failures; latest main GitHub checks were green for those suites before this PR.
  • Live Ollama sandbox E2E was attempted through the worktree CLI after OpenShell setup; sandbox creation was blocked by host permission error writing under /opt/nemoclaw-blueprint/..., so the regression is covered by the focused connect tests.
  • codex review -c sandbox_mode="danger-full-access" --uncommitted found no correctness issues.

Signed-off-by: Yimo Jiang yimoj@nvidia.com

Summary by CodeRabbit

  • Bug Fixes

    • More reliable sandbox connect: improved detection, recovery, and abort-on-failure when local inference routes remain broken.
  • Improvements

    • Structured health/repair statuses for inference routes and Ollama auth-proxy with clearer failure detail (including reachable non-2xx responses).
    • Additional verification of local inference dependencies and conditional handling when proxy state is absent; tighter timeout/operation handling.
  • Tests

    • Expanded coverage for probe/curl behaviors, env forwarding, proxy-token scenarios, WSL cases, and repair edge cases.

Review Change Stack

@yimoj yimoj self-assigned this May 13, 2026
@coderabbitai

coderabbitai Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Refactors sandbox connect inference-route verification/repair, adds token-aware Ollama auth-proxy probes, returns structured probe/repair results, adds an ensure-or-exit wrapper, moves recovery after sandbox Ready, and expands tests for recovery paths and env forwarding.

Changes

Inference Route Recovery Flow

Layer / File(s) Summary
Ollama auth proxy health probe
src/lib/inference/ollama/proxy.ts
New exported probeOllamaAuthProxyHealth() and internal curl-with-token helpers that probe /v1/models and /api/tags, returning structured { ok, endpoint, detail }.
Inference probe and repair helpers
src/lib/actions/sandbox/connect.ts
Adds typed probe results, bounded OpenShell-based probe, repair routine returning { healthy, repairAttempted, detail }, dependency validation (including optional Ollama proxy check), and a reset helper to set the managed route back to the sandbox's persisted provider/model.
Ensure function refactoring and exit wrapper
src/lib/actions/sandbox/connect.ts
ensureSandboxInferenceRoute now returns a structured ensure result; ensureSandboxInferenceRouteOrExit exits the process when routeHealthy === false.
Connect flow integration
src/lib/actions/sandbox/connect.ts
Replaces prior ensure call sites with ensureSandboxInferenceRouteOrExit, introduces mutable sb, and runs inference-route swap/recovery after sandbox Ready and before SSH.
Test fixture and harness expansion
test/sandbox-connect-inference.test.ts
Fixture supports configurable fake curl outputs and inference-probe sequences, optional Ollama proxy state writing, fake curl/ps executables, state.json telemetry, and runConnect(envOverride).
Proxy recovery test env assertions
test/ollama-proxy-recovery.test.ts
Captures curl invocation env in tests and asserts proxy env propagation (HTTP_PROXY, NO_PROXY) while omitting sensitive vars like NVIDIA_API_KEY; adds reachable non-2xx probe response test.
Test cases for inference recovery scenarios
test/sandbox-connect-inference.test.ts
Adds tests for DNS-proxy repair with 000 probe responses, route reset on failed repair, abort-before-connect when still broken, host Ollama diagnosis before reset, and WSL-specific auth-proxy token behavior.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#3472: Modifies src/lib/inference/ollama/proxy.ts for auth-proxy probing and exported probe helpers; closely related to the probe changes here.

Suggested labels

NemoClaw CLI

Suggested reviewers

  • ericksoa

Poem

🐰 I sniffed the curl and traced the net,

A token found where headers met.
I probed, I patched, I reset the route,
Then watched the sandbox hum and boot.
Hooray — inference answers now.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 13.64% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(sandbox): recover stale inference route on connect' directly describes the main change: automatic recovery of stale inference routes during sandbox connect, which is the primary objective of the PR.
Linked Issues check ✅ Passed The PR directly addresses all coding requirements from issue #3390: automatic inference route recovery during connect, DNS-proxy repair validation, route reset after repair, host Ollama diagnostics, and proper blocking on unrecoverable failures.
Out of Scope Changes check ✅ Passed All changes are scoped to inference route recovery: connect flow refactoring, Ollama proxy health probing, test coverage for recovery scenarios, and environment variable handling for diagnostics—all directly supporting issue #3390 requirements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread src/lib/inference/ollama/proxy.ts Fixed
@yimoj yimoj force-pushed the fix/3390-ollama-route-recovery branch from 99d5daa to d3c57b1 Compare May 13, 2026 07:19

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/inference/ollama/proxy.ts`:
- Around line 312-347: The current logic treats reachable non-2xx HTTP responses
(e.g., 403/404) as "not reachable"; update the branching in the result handling
(the block using result, status, endpoint and OLLAMA_PORT) so that if
Number.isFinite(status) and status is >=300 and <500 you return ok: false with a
detail explicitly stating the proxy is reachable on the endpoint but returned
HTTP {status} (e.g., "Ollama auth proxy is reachable on {endpoint} but returned
HTTP {status}; check auth/route/config"), while keeping the existing 2xx success
branch, the 401-specific message, the >=500 backend-error message, and the final
unreachable/failure fallback.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4767a2e6-fd5b-4695-b4b7-df4412b630a4

📥 Commits

Reviewing files that changed from the base of the PR and between 99d5daa and d3c57b1.

📒 Files selected for processing (3)
  • src/lib/actions/sandbox/connect.ts
  • src/lib/inference/ollama/proxy.ts
  • test/sandbox-connect-inference.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/sandbox-connect-inference.test.ts
  • src/lib/actions/sandbox/connect.ts

Comment thread src/lib/inference/ollama/proxy.ts
@yimoj yimoj force-pushed the fix/3390-ollama-route-recovery branch 2 times, most recently from fb1ec30 to 918189f Compare May 13, 2026 07:40
@yimoj yimoj added the v0.0.41 label May 13, 2026
@wscurran wscurran added bug Something fails against expected or documented behavior fix labels May 13, 2026
@wscurran

Copy link
Copy Markdown
Contributor

@cv cv added v0.0.42 and removed v0.0.41 labels May 14, 2026
@jyaunches jyaunches self-requested a review May 14, 2026 01:16

@jyaunches jyaunches left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Blockers

  1. src/lib/actions/sandbox/connect.ts:240-255 — Ollama proxy recovery checks health before starting/recovering the proxy

    • verifyLocalInferenceRouteDependencies() calls probeLocalProviderHealth(provider) before ensureOllamaAuthProxy().
    • For non-WSL ollama-local, the local provider health endpoint is the auth proxy. If the proxy PID is stale or the proxy is down — one of the core cases this PR claims to recover — the health probe fails and returns false before ensureOllamaAuthProxy() gets a chance to restart it.
    • Suggested fix: for ollama-local && !isWsl(), call ensureOllamaAuthProxy() before probing local provider health, then verify probeOllamaAuthProxyHealth().

    Example structure:

    if (isOllamaLocal) {
      findReachableOllamaHost();
      if (!isWsl()) {
        ensureOllamaAuthProxy();
      }
    }
    
    const localHealth = probeLocalProviderHealth(provider);
    // ...

🟡 Warnings

  1. src/lib/actions/sandbox/connect.ts:373-375 — recovery failures are swallowed broadly

    • The catch { /* non-fatal */ } now wraps more than best-effort route switching; it also covers the new fail-closed recovery path.
    • If a bug/throw occurs during reset or health probing, routeHealthy becomes null and connect proceeds.
    • Consider narrowing the catch or logging the detail when route recovery was supposed to be authoritative.
  2. CI pending

    • macos-e2e, wsl-e2e, checks, and CodeQL were still pending when reviewed.
    • This should not merge until those finish green for 26cc92a24048b08f567a56dbba202adbb95df25e or a newer reviewed HEAD.

🔵 Suggestions

  1. src/lib/actions/sandbox/connect.ts — extract route-recovery helpers later

    • connect.ts grew by ~162 lines. A focused inference-route-recovery.ts module would keep connect orchestration simpler.
  2. src/lib/inference/ollama/proxy.ts — keep curl wrapper endpoint-constrained

    • The current callers use fixed loopback URLs, which is good. If this helper becomes more general later, add endpoint validation rather than relying on comments.

@cv cv added v0.0.43 and removed v0.0.42 labels May 14, 2026
@yimoj yimoj force-pushed the fix/3390-ollama-route-recovery branch from 26cc92a to 491b14b Compare May 15, 2026 04:21
Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
@yimoj yimoj force-pushed the fix/3390-ollama-route-recovery branch from 491b14b to 6d9954b Compare May 15, 2026 04:35
@yimoj yimoj requested a review from jyaunches May 15, 2026 05:09
@wscurran wscurran added area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression and removed fix bug Something fails against expected or documented behavior labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Ubuntu 24.04][Inference] inference.local unreachable in sandbox after Ollama host restart

5 participants