fix(sandbox): recover stale inference route on connect#3444
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughRefactors sandbox connect inference-route verification/repair, adds token-aware Ollama auth-proxy probes, returns structured probe/repair results, adds an ensure-or-exit wrapper, moves recovery after sandbox Ready, and expands tests for recovery paths and env forwarding. ChangesInference Route Recovery Flow
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
99d5daa to
d3c57b1
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/inference/ollama/proxy.ts`:
- Around line 312-347: The current logic treats reachable non-2xx HTTP responses
(e.g., 403/404) as "not reachable"; update the branching in the result handling
(the block using result, status, endpoint and OLLAMA_PORT) so that if
Number.isFinite(status) and status is >=300 and <500 you return ok: false with a
detail explicitly stating the proxy is reachable on the endpoint but returned
HTTP {status} (e.g., "Ollama auth proxy is reachable on {endpoint} but returned
HTTP {status}; check auth/route/config"), while keeping the existing 2xx success
branch, the 401-specific message, the >=500 backend-error message, and the final
unreachable/failure fallback.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4767a2e6-fd5b-4695-b4b7-df4412b630a4
📒 Files selected for processing (3)
src/lib/actions/sandbox/connect.tssrc/lib/inference/ollama/proxy.tstest/sandbox-connect-inference.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- test/sandbox-connect-inference.test.ts
- src/lib/actions/sandbox/connect.ts
fb1ec30 to
918189f
Compare
jyaunches
left a comment
There was a problem hiding this comment.
🔴 Blockers
-
src/lib/actions/sandbox/connect.ts:240-255— Ollama proxy recovery checks health before starting/recovering the proxyverifyLocalInferenceRouteDependencies()callsprobeLocalProviderHealth(provider)beforeensureOllamaAuthProxy().- For non-WSL
ollama-local, the local provider health endpoint is the auth proxy. If the proxy PID is stale or the proxy is down — one of the core cases this PR claims to recover — the health probe fails and returnsfalsebeforeensureOllamaAuthProxy()gets a chance to restart it. - Suggested fix: for
ollama-local && !isWsl(), callensureOllamaAuthProxy()before probing local provider health, then verifyprobeOllamaAuthProxyHealth().
Example structure:
if (isOllamaLocal) { findReachableOllamaHost(); if (!isWsl()) { ensureOllamaAuthProxy(); } } const localHealth = probeLocalProviderHealth(provider); // ...
🟡 Warnings
-
src/lib/actions/sandbox/connect.ts:373-375— recovery failures are swallowed broadly- The
catch { /* non-fatal */ }now wraps more than best-effort route switching; it also covers the new fail-closed recovery path. - If a bug/throw occurs during reset or health probing,
routeHealthybecomesnulland connect proceeds. - Consider narrowing the catch or logging the detail when route recovery was supposed to be authoritative.
- The
-
CI pending
macos-e2e,wsl-e2e,checks, and CodeQL were still pending when reviewed.- This should not merge until those finish green for
26cc92a24048b08f567a56dbba202adbb95df25eor a newer reviewed HEAD.
🔵 Suggestions
-
src/lib/actions/sandbox/connect.ts— extract route-recovery helpers laterconnect.tsgrew by ~162 lines. A focusedinference-route-recovery.tsmodule would keep connect orchestration simpler.
-
src/lib/inference/ollama/proxy.ts— keep curl wrapper endpoint-constrained- The current callers use fixed loopback URLs, which is good. If this helper becomes more general later, add endpoint validation rather than relying on comments.
26cc92a to
491b14b
Compare
Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
491b14b to
6d9954b
Compare
Summary
Recover stale
inference.localroutes during sandbox connect by revalidating the in-sandbox route after DNS repair and resetting the managed OpenShell inference route when provider/model still match but the route remains broken. Local Ollama providers now get host and auth-proxy health diagnostics before NemoClaw opens an SSH session into a sandbox with known-broken inference.Related Issue
Fixes #3390
Changes
inference.localduring connect and--probe-only, includingBROKEN 000responses from failed sandbox curl probes.openshell inference set --provider <provider> --model <model> --no-verifyafter DNS repair when the matching managed route is still unhealthy.BROKEN 000handling.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Verification notes:
npm run build:clipassed.npm run typecheck:clipassed.npm run lintpassed with an unrelated upstream warning insrc/lib/onboard/child-exit-tracker.test.ts.npx vitest run test/sandbox-connect-inference.test.tspassed.npm test -- --run test/cli.test.tspassed during local review.npm testwas run locally but remains red due unrelated local installer/source-checkout and Dockerfile fetch-guard permission failures; latestmainGitHub checks were green for those suites before this PR./opt/nemoclaw-blueprint/..., so the regression is covered by the focused connect tests.codex review -c sandbox_mode="danger-full-access" --uncommittedfound no correctness issues.Signed-off-by: Yimo Jiang yimoj@nvidia.com
Summary by CodeRabbit
Bug Fixes
Improvements
Tests