Skip to content

fix(onboard): limit gateway GPU inspect to legacy gateways#3479

Merged
cv merged 8 commits into
mainfrom
fix/gateway-gpu-passthrough-check
May 15, 2026
Merged

fix(onboard): limit gateway GPU inspect to legacy gateways#3479
cv merged 8 commits into
mainfrom
fix/gateway-gpu-passthrough-check

Conversation

@zyang-dev

@zyang-dev zyang-dev commented May 13, 2026

Copy link
Copy Markdown
Contributor

Summary

Limit the reused gateway GPU passthrough check to legacy OpenShell gateway containers. This avoids falsely aborting Docker-driver/package-managed gateway reuse because openshell-cluster-* HostConfig.DeviceRequests is not a reliable GPU signal on that path.

Changes

  • Added a helper that scopes the legacy gateway GPU inspect to healthy legacy gateways with GPU passthrough enabled.
  • Skipped the legacy openshell-cluster-nemoclaw Docker DeviceRequests check for Docker-driver/package-managed gateways.
  • Added regression coverage for the gateway GPU reuse check decision.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: zyang-dev 267119621+zyang-dev@users.noreply.github.com

Summary by CodeRabbit

  • New Features

    • Refined legacy GPU passthrough checks so GPU inspection runs only for specific gateway states and capability combinations, reducing unnecessary checks.
    • Added OS/architecture-aware platform detection to better determine Docker-driver gateway eligibility.
  • Tests

    • Added tests validating the GPU passthrough inspection predicate across gateway states and configurations.

Review Change Stack

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
@zyang-dev zyang-dev self-assigned this May 13, 2026
@coderabbitai

coderabbitai Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 80013aa9-c6c3-4393-9f0f-6a82c5c7a9d7

📥 Commits

Reviewing files that changed from the base of the PR and between 1b08af8 and 3171a55.

📒 Files selected for processing (2)
  • src/lib/onboard.ts
  • test/onboard.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/onboard.test.ts
  • src/lib/onboard.ts

📝 Walkthrough

Walkthrough

Adds two platform-detection helpers and a predicate that centralizes legacy gateway GPU-passthrough inspection logic; integrates these into src/lib/onboard.ts (removing the local helper) and adds a unit test for the predicate's truth table.

Changes

Legacy gateway GPU passthrough and platform detection

Layer / File(s) Summary
Platform detection helpers
src/lib/onboard/docker-driver-platform.ts
New module exporting isLinuxDockerDriverGatewayEnabled(platform?: NodeJS.Platform, arch?: NodeJS.Architecture) and isLinuxDockerDriverGatewayPlatform(platform?: NodeJS.Platform) with defaults to process.platform/arch.
GPU passthrough predicate
src/lib/onboard/gateway-gpu-passthrough.ts
New exported shouldInspectLegacyGatewayGpuPassthrough(gatewayReuseState: GatewayReuseState, gpuPassthrough: boolean, dockerDriverGatewayEnabled: boolean, gatewayLifecycleCommandsSupported: boolean): boolean that returns true only when reuse state is "healthy", gpuPassthrough is enabled, Docker-driver gateway is disabled, and lifecycle commands are supported.
onboard integration
src/lib/onboard.ts
Replaces local platform helper with the imported isLinuxDockerDriverGatewayEnabled, imports shouldInspectLegacyGatewayGpuPassthrough, and uses it to gate the Docker DeviceRequests inspection for legacy reusable gateways (previous inline condition removed).
Unit test
test/onboard.test.ts
Imports shouldInspectLegacyGatewayGpuPassthrough and GatewayReuseState; adds a Vitest case asserting expected boolean outcomes across reuse-state / flag permutations.

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#3383: Modifies Docker-driver gateway platform-detection logic that can affect gateway flow selection and GPU-passthrough checks.

Suggested labels

OpenShell, NemoClaw CLI, Sandbox

Suggested reviewers

  • ericksoa
  • cv
  • jyaunches

Poem

🐰 I hopped through code with gentle paws,
A tiny gate that checks GPU laws,
Healthy state and flags aligned,
Docker-driver off — probe designed,
Tests blink green, the journey's cause.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: limiting the gateway GPU passthrough inspection to legacy gateways, which is the core purpose of this fix.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/gateway-gpu-passthrough-check

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@zyang-dev zyang-dev added v0.0.41 bug Something fails against expected or documented behavior labels May 13, 2026
@github-actions

github-actions Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: gpu-double-onboard-e2e, double-onboard-e2e
Optional E2E: gpu-e2e, macos-e2e, onboard-resume-e2e

Dispatch hint: gpu-double-onboard-e2e,double-onboard-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • gpu-double-onboard-e2e (high): Most targeted existing coverage: performs real local Ollama GPU onboarding followed by a second onboard on a fresh GPU runner, exercising reusable gateway behavior with GPU passthrough where the changed legacy-container inspection could incorrectly fail Docker-driver gateways.
  • double-onboard-e2e (high): Covers repeat onboard and shared gateway lifecycle/reuse on Linux Docker-driver gateway paths, validating that the extracted platform helper and gateway reuse checks still support real user lifecycle recovery.

Optional E2E

  • gpu-e2e (high): Useful adjacent confidence for first-run GPU local Ollama onboarding and sandbox inference. Less targeted than gpu-double-onboard-e2e because the changed passthrough inspection is primarily on reusable gateway paths.
  • macos-e2e (medium): Optional platform confidence because Docker-driver gateway enablement includes darwin/arm64. The changed code is mostly Linux/GPU reuse focused, but this can catch import/platform regressions on macOS.
  • onboard-resume-e2e (medium): Optional coverage for the resume branch of onboard gateway reuse. The changed helper is evaluated before resume/reuse decisions, but existing GPU double-onboard is the stronger targeted check.

New E2E recommendations

  • legacy gateway GPU passthrough (medium): Existing E2E coverage appears strongest for Docker-driver GPU re-onboard. Add a focused regression that creates or simulates a healthy legacy openshell-cluster gateway with GPU requested and verifies the DeviceRequests-based failure/recovery path still runs only for legacy gateways.
    • Suggested test: legacy-gateway-gpu-passthrough-reuse-e2e

Dispatch hint

  • Workflow: .github/workflows/nightly-e2e.yaml
  • jobs input: gpu-double-onboard-e2e,double-onboard-e2e

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 1323-1337: The helper function
shouldInspectLegacyGatewayGpuPassthrough and its preceding comment are adding
net lines to onboard.ts; move the function (and the comment if still relevant)
into an existing onboarding submodule (e.g., the module that groups gateway
helpers), export it from there with the same name, and replace the
implementation in src/lib/onboard.ts with a thin import and re-export (or an
export { shouldInspectLegacyGatewayGpuPassthrough } from "./path/to/submodule")
so the public API is unchanged while reducing onboard.ts's line count.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0d280037-927e-4d54-a5e3-585b4e1e5f09

📥 Commits

Reviewing files that changed from the base of the PR and between 8c4bc1c and d99e2d7.

📒 Files selected for processing (2)
  • src/lib/onboard.ts
  • test/onboard.test.ts

Comment thread src/lib/onboard.ts Outdated
@zyang-dev zyang-dev marked this pull request as draft May 13, 2026 20:59
@copy-pr-bot

copy-pr-bot Bot commented May 13, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
@zyang-dev zyang-dev marked this pull request as ready for review May 13, 2026 22:15
@zyang-dev zyang-dev added fix and removed bug Something fails against expected or documented behavior labels May 13, 2026
@cv cv added v0.0.42 and removed v0.0.41 labels May 14, 2026
cv added 2 commits May 14, 2026 19:56
…hrough-check

Signed-off-by: Carlos Villela <cvillela@nvidia.com>

# Conflicts:
#	src/lib/onboard.ts
#	test/onboard.test.ts
@cv cv enabled auto-merge (squash) May 15, 2026 02:57
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ⚠️ No requested jobs ran

Run: 25897763054
Target ref: 117057e4f404bddf9618155b230157b8c8fa9e77
Workflow ref: main
Requested jobs: gpu-double-onboard-e2e,gpu-e2e
Summary: 0 passed, 0 failed, 2 skipped

Job Result
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped

cv added 2 commits May 14, 2026 20:03
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
…hrough-check

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
@cv cv merged commit 76d24c7 into main May 15, 2026
24 checks passed
@coderabbitai coderabbitai Bot mentioned this pull request May 16, 2026
15 tasks
@zyang-dev zyang-dev deleted the fix/gateway-gpu-passthrough-check branch June 2, 2026 00:15
@wscurran wscurran added bug-fix PR fixes a bug or regression and removed fix labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants