fix(ci): switch Brev E2E from Nebius to GCP for reliability by cjagwani · Pull Request #1450 · NVIDIA/NemoClaw

cjagwani · 2026-04-03T18:19:25Z

Summary

- Add `--provider gcp` to both `brev search cpu` calls (launchable and bare-instance paths) in `brev-e2e.test.js`
- Add `BREV_PROVIDER: gcp` env var to the `e2e-brev.yaml` workflow
- Expose `BREV_PROVIDER` as a configurable env var (default: `gcp`) so the provider can be overridden without code changes
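
As a rough illustration of the env-driven provider selection summarized above, the search command could be assembled along these lines. This is a minimal sketch, not the code in `brev-e2e.test.js`: the helper name `buildSearchCommand` is hypothetical, and the position of `--provider` among the other flags is an assumption. The flags themselves are the ones quoted in this PR.

```javascript
// Hypothetical helper (not taken from brev-e2e.test.js).
// Reads the provider from an env-like object and defaults to "gcp".
function buildSearchCommand(env = process.env) {
  const provider = env.BREV_PROVIDER || "gcp";
  return [
    "brev", "search", "cpu",
    "--min-vcpu", "4",
    "--min-ram", "16",
    "--provider", provider, // flag position is an assumption
    "--sort", "price",
  ].join(" ");
}

console.log(buildSearchCommand({}));
console.log(buildSearchCommand({ BREV_PROVIDER: "aws" }));
```

Passing the environment in as a parameter keeps the default and the override path easy to check without touching the real `process.env`.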

Problem

The E2E workflow uses `brev search cpu --min-vcpu 4 --min-ram 16 --sort price` to select the cheapest available instance. This consistently lands on **Nebius** machines because they're the cheapest provider in Brev's marketplace. However, these instances have been unreliable:

- **Slow provisioning**: ~17 min of the 29 min total run time is spent waiting for the instance to become reachable
- **Intermittent failures**: Julie's agent flagged instability with Nebius machines, and multiple recent workflow runs failed
- **No explicit provider selection**: the sort-by-price default silently chose the cheapest (and least reliable) option

From the most recent successful run (23950138412):

| Phase | Duration |
|-------|----------|
| Instance provisioning + SSH wait | ~17 min |
| Code sync + bootstrap | ~1 min |
| Actual tests (sandbox creation + inference) | ~11 min |
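
To make the proportions in the table concrete, the snippet below sums the phases and computes each phase's share of the run. It is only an arithmetic sketch: the durations are the approximate figures from the table, and the names `phases` and `sharePercent` are illustrative.

```javascript
// Phase durations in minutes, copied from the table above (approximate values).
const phases = [
  { name: "Instance provisioning + SSH wait", minutes: 17 },
  { name: "Code sync + bootstrap", minutes: 1 },
  { name: "Actual tests (sandbox creation + inference)", minutes: 11 },
];

// Total run time across all phases.
const total = phases.reduce((sum, p) => sum + p.minutes, 0);

// Share of the total, rounded to the nearest whole percent.
function sharePercent(minutes) {
  return Math.round((minutes / total) * 100);
}

for (const p of phases) {
  console.log(`${p.name}: ${p.minutes} min (${sharePercent(p.minutes)}%)`);
}
console.log(`Total: ${total} min`);
```

With these figures the total comes to 29 min, and provisioning plus SSH wait accounts for roughly 59% of the run.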

Solution

Pin the provider to GCP via `--provider gcp`. GCP instances (`n2d-standard-4`) are:

- **$0.13/hr**: comparable cost to Nebius
- **7 min max boot time** (per `brev search`) vs the ~17 min observed on Nebius
- **More reliable** per Alec Fong's (Brev team) recommendation

Test plan

- [ ] Trigger `e2e-brev` workflow manually on this branch and verify it selects a GCP instance and passes
- [ ] Compare run time against the Nebius baseline (~29 min)
- [ ] Verify `BREV_PROVIDER` env var override works (e.g., set to `aws` to test fallback)

Fixes #1420

🤖 Generated with Claude Code

Summary by CodeRabbit

Tests
- Enhanced E2E tests to support selecting instances by cloud provider and minimum disk size via environment-driven configuration.
- Improved test logging to surface chosen provider and minimum disk requirements during instance selection.
Chores
- Updated CI workflow environment to set the provider for E2E runs, ensuring consistent provider-specific testing.

Nebius instances selected by `brev search cpu --sort price` are the cheapest but have been unreliable, causing flaky CI runs due to slow provisioning (~17 min wait) and intermittent machine failures. GCP instances (n2d-standard-4 at $0.13/hr) offer comparable specs with significantly better reliability and a max boot time of 7 minutes.

Adds `--provider gcp` to both the launchable and bare-instance `brev search` commands, with a configurable `BREV_PROVIDER` env var (default: gcp) so the provider can be overridden without code changes.

Closes #1420

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-04-03T18:25:11Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 774fb6e2-0b46-4508-9ebd-3a3c2e2169d0

📥 Commits

Reviewing files that changed from the base of the PR and between 14f8859 and caba3c8.

📒 Files selected for processing (1)

test/e2e/brev-e2e.test.js

🚧 Files skipped from review as they are similar to previous changes (1)

test/e2e/brev-e2e.test.js

📝 Walkthrough

Set GCP as the Brev provider for E2E by adding `BREV_PROVIDER: gcp` to the CI workflow and making the test script pass `--provider ${BREV_PROVIDER}` and `--min-disk` when selecting instances.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| CI Workflow Configuration `.github/workflows/e2e-brev.yaml` | Added `BREV_PROVIDER: gcp` env var to the "Run ephemeral Brev E2E" step so the workflow sets the provider to GCP. |
| E2E Test Implementation `test/e2e/brev-e2e.test.js` | Introduced `BREV_PROVIDER` and `BREV_MIN_DISK` env vars (defaults), logged provider and min-disk in creation paths, and added `--provider ${BREV_PROVIDER}` and `--min-disk ${BREV_MIN_DISK}` to `brev search cpu` |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 I sniff the logs and hop with glee,

From Nebius fields to GCP,
Flags set, disks sized, the tests take flight,
A rabbit's cheer for smoother CI night.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change (switching Brev E2E provider from Nebius to GCP) and is concise and clear without extraneous details. |
| Linked Issues check | ✅ Passed | Changes fully implement issue `#1420` requirements: `--provider gcp` added to both brev search calls, `BREV_PROVIDER` env var configured in workflow, and `--min-disk` added to prevent Docker extraction failures. |
| Out of Scope Changes check | ✅ Passed | All changes directly address the linked issue objectives (adding GCP provider selection and minimum disk size requirements) with no extraneous modifications beyond the scope. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


coderabbitai

🧹 Nitpick comments (1)

.github/workflows/e2e-brev.yaml (1)
161-161: Provider configuration aligns with the test file.

The hardcoded gcp value ensures CI stability, matching the PR objective to pin the provider for reliability.

Optional enhancement: Consider exposing BREV_PROVIDER as a workflow_dispatch input (defaulting to gcp) to allow manual overrides for testing other providers without code changes. The test file already supports this via the env var.
💡 Optional: Add workflow input for provider override

Add a new input in the workflow_dispatch section:
      brev_provider:
        description: "Cloud provider for Brev instance (gcp, aws, etc.)"
        required: false
        default: "gcp"
Then update line 161 to:
-          BREV_PROVIDER: gcp
+          BREV_PROVIDER: ${{ inputs.brev_provider || 'gcp' }}
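
The `inputs.brev_provider || 'gcp'` expression in the suggestion above relies on an empty input falling through to the default, which is what would happen on runs where no `brev_provider` input is supplied. The JavaScript sketch below mirrors that fallback for illustration only; `resolveProvider` is a hypothetical name, and GitHub Actions evaluates the real expression itself rather than running any such function.

```javascript
// Illustration only: mirrors the `inputs.brev_provider || 'gcp'` fallback.
// `resolveProvider` is a hypothetical name, not part of the workflow or tests.
function resolveProvider(input) {
  // Empty string, undefined, and null are all falsy, so they fall back to "gcp".
  return input || "gcp";
}

console.log(resolveProvider("aws"));
console.log(resolveProvider(""));
console.log(resolveProvider(undefined));
```

An explicit value such as `aws` is passed through unchanged, while a missing or empty value resolves to `gcp`.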
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/e2e-brev.yaml at line 161, The workflow currently
hardcodes BREV_PROVIDER: gcp; to allow manual overrides without changing code,
add a workflow_dispatch input (e.g., brev_provider) with default "gcp" in the
workflow_dispatch section and then set the environment variable BREV_PROVIDER to
use that input (referencing the workflow_dispatch input name brev_provider and
the BREV_PROVIDER env var in the job's env block) so manual runs can override
the provider while keeping gcp as the default.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6fae374f-a3ba-48dd-a9c4-ea602e3b7c6f

📥 Commits

Reviewing files that changed from the base of the PR and between f4a01cf and 14f8859.

📒 Files selected for processing (2)

.github/workflows/e2e-brev.yaml
test/e2e/brev-e2e.test.js

github-actions · 2026-04-03T18:33:42Z

❌ Brev E2E (full): FAILED on branch fix/e2e-brev-gcp-provider — See logs

GCP instances default to 10GB disk, which is insufficient: Docker image extraction fails with "no space left on device" during sandbox creation. Set --min-disk 50 to ensure enough space for Docker images and build artifacts.

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
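
A minimal sketch of how the disk requirement from this commit could sit alongside the provider flag. `BREV_PROVIDER` and `BREV_MIN_DISK` are the env vars named in the review walkthrough; the helper name `buildSearchArgs`, the log wording, the flag order, and the default of `50` for `BREV_MIN_DISK` are assumptions drawn from the commit message rather than from the actual test file.

```javascript
// Hypothetical helper showing how both env-driven flags could be combined.
// The "50" default is an assumption based on the commit message above.
function buildSearchArgs(env = process.env) {
  const provider = env.BREV_PROVIDER || "gcp";
  const minDisk = String(env.BREV_MIN_DISK || "50");
  // Surface the chosen values so a failed run shows what was requested.
  console.log(`Selecting instance: provider=${provider}, min-disk=${minDisk}`);
  return [
    "search", "cpu",
    "--min-vcpu", "4",
    "--min-ram", "16",
    "--min-disk", minDisk,
    "--provider", provider,
    "--sort", "price",
  ];
}

console.log(["brev", ...buildSearchArgs({})].join(" "));
```

Returning the arguments as an array rather than a single string is a design choice of this sketch: it would let a caller hand them to something like Node's `child_process.execFileSync` without shell quoting concerns.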

github-actions · 2026-04-03T18:58:51Z

✅ Brev E2E (full): PASSED on branch fix/e2e-brev-gcp-provider — See logs

github-actions · 2026-04-03T19:26:00Z

✅ Brev E2E (full): PASSED on branch fix/e2e-brev-gcp-provider — See logs


@zyang-dev

…ty (#1470)

## Summary
- unify installer and onboarding host detection around shared TypeScript preflight logic
- move `deploy` behavior into TypeScript, thin the Brev compatibility wrapper, and harden Brev readiness handling
- demote or remove legacy platform-specific setup paths (`setup-spark`, `brev-setup.sh`) in favor of the canonical installer + onboard flow
- update docs, CLI help, and Brev E2E coverage to match the new behavior

## What Changed
- added shared host assessment and remediation planning in `src/lib/preflight.ts`
- wired installer and onboard flows to the same host preflight decisions
- changed Podman handling from hard block to unsupported-runtime warning
- migrated deploy logic into `src/lib/deploy.ts`
- updated `nemoclaw deploy` to use the authenticated Brev CLI, current Brev create flags, explicit GCP provider default, stricter readiness checks, and standard installer/onboard flow
- removed `scripts/setup-spark.sh` and reduced `scripts/brev-setup.sh` to a deprecated compatibility wrapper
- updated README/docs/help text and hardened the Brev E2E cleanup path

## Validation
- `npm run build:cli`
- targeted Vitest coverage for `src/lib/preflight.test.ts`, `src/lib/deploy.test.ts`, `test/install-preflight.test.js`, `test/cli.test.js`, `test/runner.test.js`
- live Brev validation with `TEST_SUITE=deploy-cli` on `cpu-e2.4vcpu-16gb`
- confirmed successful end-to-end remote deploy after waiting for Brev `status=RUNNING`, `build_status=COMPLETED`, `shell_status=READY`

## Related Issues
- Fixes #1377
- Addresses #1330
- Addresses #1390
- Related to #1404

## Credit / Prior Work
This branch builds on ideas and prior work from:
- #1368 by @zyang-dev for simplifying Spark setup and removing the old cgroup workaround
- #1395 and #1468 by @kjw3 for the thin installer/bootstrap direction and installer path reliability
- #1450 by @cjagwani for switching Brev flows toward GCP for reliability
- #1383 by @13ernkastel for the current Brev create flag compatibility work
- #1364 by @WuKongAI-CMU for deploy sync-path fixes
- #1362 and #1266 by @jyaunches for the Brev E2E/launchable infrastructure direction
- issue ideas from #1377 and #1404 by @zNeill, #1330 by @Marcelo5444, and #1390 by @ericksoa

## Summary by CodeRabbit
* **New Features**
  * Improved host diagnostics with actionable remediation guidance surfaced during installer/onboard preflight.
* **Improvements**
  * macOS (Intel) now recommends Docker Desktop; DGX Spark guidance now uses the standard installer + `nemoclaw onboard`.
  * Preflight output shows detected runtime and WSL notes; installer prints remediation actions and will skip onboarding on blocking issues.
* **Deprecations**
  * `nemoclaw deploy`, `nemoclaw setup-spark`, and the legacy bootstrap wrapper are now deprecated compatibility paths.
* **Documentation**
  * Quickstart, troubleshooting, and command reference updated to reflect installer+onboard flow and deprecation guidance.
* **Tests**
  * Added/updated tests covering preflight, deploy compatibility, CLI aliases, and deploy e2e scenarios.


coderabbitai Bot reviewed Apr 3, 2026

View reviewed changes

cv approved these changes Apr 3, 2026

View reviewed changes

brandonpelfrey approved these changes Apr 3, 2026

View reviewed changes

brandonpelfrey merged commit 648ab5f into main Apr 3, 2026
11 checks passed

kjw3 mentioned this pull request Apr 4, 2026

refactor(installer): unify host preflight and thin deploy compatibility #1470

Merged

cjagwani self-assigned this Apr 6, 2026

cjagwani added the CI/CD, platform: brev (Affects Brev hosted development environments), and security (Potential vulnerability, unsafe behavior, or access risk) labels Apr 6, 2026

wscurran added the area: ci (CI workflows, checks, release automation, or GitHub Actions) and chore (Build, CI, dependency, or tooling maintenance) labels and removed the CI/CD label Jun 3, 2026

brandonpelfrey merged 2 commits into main from fix/e2e-brev-gcp-provider

4 participants
