fix(ci): switch Brev E2E from Nebius to GCP for reliability #1450

Merged: brandonpelfrey merged 2 commits into main from fix/e2e-brev-gcp-provider on Apr 3, 2026

cjagwani commented Apr 3, 2026

Summary

  • Add --provider gcp to both brev search cpu calls (launchable and bare-instance paths) in brev-e2e.test.js
  • Add BREV_PROVIDER: gcp env var to the e2e-brev.yaml workflow
  • Expose BREV_PROVIDER as a configurable env var (default: gcp) so the provider can be overridden without code changes
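
The provider selection described above could be wired roughly as follows. This is a minimal sketch, not the code from brev-e2e.test.js: the helper name `buildSearchArgs` is hypothetical, and only the flags (`--min-vcpu`, `--min-ram`, `--provider`, `--sort`) and the `BREV_PROVIDER` variable with its `gcp` default come from this PR.

```javascript
// Hypothetical helper for illustration; not the code from brev-e2e.test.js.
// Builds the argument list for `brev search cpu` with a provider filter.
function buildSearchArgs(env = process.env) {
  // Fall back to "gcp" when BREV_PROVIDER is unset or empty.
  const provider = env.BREV_PROVIDER || "gcp";
  return [
    "search", "cpu",
    "--min-vcpu", "4",
    "--min-ram", "16",
    "--provider", provider,
    "--sort", "price",
  ];
}

// Default provider, then an override taken from the environment object.
console.log(["brev", ...buildSearchArgs({})].join(" "));
console.log(["brev", ...buildSearchArgs({ BREV_PROVIDER: "aws" })].join(" "));
```

Passing the environment as a parameter keeps the default and the override easy to check without touching the real process environment.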

Problem

The E2E workflow uses brev search cpu --min-vcpu 4 --min-ram 16 --sort price to select the cheapest available instance. This consistently lands on Nebius machines because they're the cheapest provider in Brev's marketplace. However, these instances have been unreliable:

  • Slow provisioning: ~17 min of the 29 min total run time is spent waiting for the instance to become reachable
  • Intermittent failures: Julie's agent flagged instability with Nebius machines, and multiple recent workflow runs failed
  • No explicit provider selection: the sort-by-price default silently chose the cheapest (and least reliable) option

From the most recent successful run (23950138412):

| Phase | Duration |
|-------|----------|
| Instance provisioning + SSH wait | ~17 min |
| Code sync + bootstrap | ~1 min |
| Actual tests (sandbox creation + inference) | ~11 min |
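
As a quick check of these figures, the sketch below sums the three phases and prints each one's share of the run. The durations are the approximate values reported above, so the percentages are rough; provisioning works out to about 59% of the ~29 min total.

```javascript
// Phase durations in minutes, copied from the breakdown above (approximate values).
const phases = {
  "Instance provisioning + SSH wait": 17,
  "Code sync + bootstrap": 1,
  "Actual tests (sandbox creation + inference)": 11,
};

// Sum all phases to get the total run time.
const totalMinutes = Object.values(phases).reduce((sum, m) => sum + m, 0);

// Print each phase with its rounded share of the total.
for (const [name, minutes] of Object.entries(phases)) {
  const share = Math.round((minutes / totalMinutes) * 100);
  console.log(`${name}: ${minutes} min (${share}%)`);
}
console.log(`Total: ${totalMinutes} min`);
```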

Solution

Pin the provider to GCP via --provider gcp. GCP instances (n2d-standard-4) are:

  • $0.13/hr — comparable cost to Nebius
  • 7 min max boot time (per brev search) vs the ~17 min observed on Nebius
  • More reliable per Alec Fong's (Brev team) recommendation

Test plan

  • Trigger e2e-brev workflow manually on this branch — verify it selects a GCP instance and passes
  • Compare run time against the Nebius baseline (~29 min)
  • Verify BREV_PROVIDER env var override works (e.g., set to aws to test fallback)

Fixes #1420

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests

    • Enhanced E2E tests to support selecting instances by cloud provider and minimum disk size via environment-driven configuration.
    • Improved test logging to surface chosen provider and minimum disk requirements during instance selection.
  • Chores

    • Updated CI workflow environment to set the provider for E2E runs, ensuring consistent provider-specific testing.

Nebius instances selected by `brev search cpu --sort price` are the cheapest
but have been unreliable — causing flaky CI runs due to slow provisioning
(~17 min wait) and intermittent machine failures. GCP instances (n2d-standard-4
at $0.13/hr) offer comparable specs with significantly better reliability and
a max boot time of 7 minutes.

Adds `--provider gcp` to both the launchable and bare-instance `brev search`
commands, with a configurable `BREV_PROVIDER` env var (default: gcp) so the
provider can be overridden without code changes.

Closes #1420

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented Apr 3, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 774fb6e2-0b46-4508-9ebd-3a3c2e2169d0

📥 Commits

Reviewing files that changed from the base of the PR and between 14f8859 and caba3c8.

📒 Files selected for processing (1)
  • test/e2e/brev-e2e.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/brev-e2e.test.js

Walkthrough

Set GCP as the Brev provider for E2E by adding BREV_PROVIDER: gcp to the CI workflow and making the test script pass --provider ${BREV_PROVIDER} and --min-disk when selecting instances.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| CI Workflow Configuration: .github/workflows/e2e-brev.yaml | Added BREV_PROVIDER: gcp env var to the "Run ephemeral Brev E2E" step so the workflow sets the provider to GCP. |
| E2E Test Implementation: test/e2e/brev-e2e.test.js | Introduced BREV_PROVIDER and BREV_MIN_DISK env vars (defaults), logged provider and min-disk in creation paths, and added --provider ${BREV_PROVIDER} and --min-disk ${BREV_MIN_DISK} to brev search cpu |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 I sniff the logs and hop with glee,

From Nebius fields to GCP,
Flags set, disks sized, the tests take flight,
A rabbit's cheer for smoother CI night.

🚥 Pre-merge checks | ✅ 5 passed

| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change (switching Brev E2E provider from Nebius to GCP) and is concise and clear without extraneous details. |
| Linked Issues check | ✅ Passed | Changes fully implement issue #1420 requirements: --provider gcp added to both brev search calls, BREV_PROVIDER env var configured in workflow, and --min-disk added to prevent Docker extraction failures. |
| Out of Scope Changes check | ✅ Passed | All changes directly address the linked issue objectives (adding GCP provider selection and minimum disk size requirements) with no extraneous modifications beyond the scope. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
.github/workflows/e2e-brev.yaml (1)

161-161: Provider configuration aligns with the test file.

The hardcoded gcp value ensures CI stability, matching the PR objective to pin the provider for reliability.

Optional enhancement: Consider exposing BREV_PROVIDER as a workflow_dispatch input (defaulting to gcp) to allow manual overrides for testing other providers without code changes. The test file already supports this via the env var.

💡 Optional: Add workflow input for provider override

Add a new input in the workflow_dispatch section:

      brev_provider:
        description: "Cloud provider for Brev instance (gcp, aws, etc.)"
        required: false
        default: "gcp"

Then update line 161 to:

-          BREV_PROVIDER: gcp
+          BREV_PROVIDER: ${{ inputs.brev_provider || 'gcp' }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/e2e-brev.yaml at line 161, The workflow currently
hardcodes BREV_PROVIDER: gcp; to allow manual overrides without changing code,
add a workflow_dispatch input (e.g., brev_provider) with default "gcp" in the
workflow_dispatch section and then set the environment variable BREV_PROVIDER to
use that input (referencing the workflow_dispatch input name brev_provider and
the BREV_PROVIDER env var in the job's env block) so manual runs can override
the provider while keeping gcp as the default.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6fae374f-a3ba-48dd-a9c4-ea602e3b7c6f

📥 Commits

Reviewing files that changed from the base of the PR and between f4a01cf and 14f8859.

📒 Files selected for processing (2)
  • .github/workflows/e2e-brev.yaml
  • test/e2e/brev-e2e.test.js

github-actions Bot commented Apr 3, 2026

Brev E2E (full): FAILED on branch fix/e2e-brev-gcp-provider (see logs)

GCP instances default to 10GB disk which is insufficient — Docker image
extraction fails with "no space left on device" during sandbox creation.
Set --min-disk 50 to ensure enough space for Docker images and build
artifacts.

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
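
The disk-size fix in this commit could look roughly like the sketch below. It is an illustration under stated assumptions, not the code from brev-e2e.test.js: the helper names `resolveMinDisk` and `diskArgs` are hypothetical, the value is assumed to be in GB, and the default of 50 is taken from the commit message above.

```javascript
// Hypothetical helpers for illustration; not the code from brev-e2e.test.js.
// Reads BREV_MIN_DISK (assumed to be in GB) and falls back to 50 when unset.
function resolveMinDisk(env = process.env) {
  const raw = env.BREV_MIN_DISK || "50";
  const gb = Number.parseInt(raw, 10);
  // Reject values that do not parse to a positive integer.
  if (!Number.isInteger(gb) || gb <= 0) {
    throw new Error(`BREV_MIN_DISK must be a positive integer (GB), got "${raw}"`);
  }
  return gb;
}

// Produces the extra arguments to append to the `brev search cpu` call.
function diskArgs(env = process.env) {
  return ["--min-disk", String(resolveMinDisk(env))];
}

console.log(diskArgs({}).join(" "));
```

Validating the value up front turns a misconfigured variable into an immediate, readable error instead of a late "no space left on device" failure during sandbox creation.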

github-actions Bot commented Apr 3, 2026

Brev E2E (full): PASSED on branch fix/e2e-brev-gcp-provider (see logs)

brandonpelfrey merged commit 648ab5f into main on Apr 3, 2026
11 checks passed

github-actions Bot commented Apr 3, 2026

Brev E2E (full): PASSED on branch fix/e2e-brev-gcp-provider (see logs)

lakamsani pushed a commit to lakamsani/NemoClaw that referenced this pull request Apr 4, 2026

## Summary

- Add `--provider gcp` to both `brev search cpu` calls (launchable and
bare-instance paths) in `brev-e2e.test.js`
- Add `BREV_PROVIDER: gcp` env var to the `e2e-brev.yaml` workflow
- Expose `BREV_PROVIDER` as a configurable env var (default: `gcp`) so
the provider can be overridden without code changes

## Problem

The E2E workflow uses `brev search cpu --min-vcpu 4 --min-ram 16 --sort
price` to select the cheapest available instance. This consistently
lands on **Nebius** machines because they're the cheapest provider in
Brev's marketplace. However, these instances have been unreliable:

- **Slow provisioning**: ~17 min of the 29 min total run time is spent
waiting for the instance to become reachable
- **Intermittent failures**: Julie's agent flagged instability with
Nebius machines, and multiple recent workflow runs failed
- **No explicit provider selection**: the sort-by-price default silently
chose the cheapest (and least reliable) option

From the most recent successful run
([23950138412](https://github.com/NVIDIA/NemoClaw/actions/runs/23950138412)):
| Phase | Duration |
|-------|----------|
| Instance provisioning + SSH wait | ~17 min |
| Code sync + bootstrap | ~1 min |
| Actual tests (sandbox creation + inference) | ~11 min |

## Solution

Pin the provider to GCP via `--provider gcp`. GCP instances
(`n2d-standard-4`) are:
- **$0.13/hr** — comparable cost to Nebius
- **7 min max boot time** (per `brev search`) vs the ~17 min observed on
Nebius
- **More reliable** per Alec Fong's (Brev team) recommendation

## Test plan

- [ ] Trigger `e2e-brev` workflow manually on this branch — verify it
selects a GCP instance and passes
- [ ] Compare run time against the Nebius baseline (~29 min)
- [ ] Verify `BREV_PROVIDER` env var override works (e.g., set to `aws`
to test fallback)

Fixes NVIDIA#1420

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
  * Enhanced E2E tests to support selecting instances by cloud provider
    and minimum disk size via environment-driven configuration.
  * Improved test logging to surface chosen provider and minimum disk
    requirements during instance selection.

* **Chores**
  * Updated CI workflow environment to set the provider for E2E runs,
    ensuring consistent provider-specific testing.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ericksoa pushed a commit that referenced this pull request Apr 4, 2026
…ty (#1470)

## Summary
- unify installer and onboarding host detection around shared TypeScript
preflight logic
- move `deploy` behavior into TypeScript, thin the Brev compatibility
wrapper, and harden Brev readiness handling
- demote or remove legacy platform-specific setup paths (`setup-spark`,
`brev-setup.sh`) in favor of the canonical installer + onboard flow
- update docs, CLI help, and Brev E2E coverage to match the new behavior

## What Changed
- added shared host assessment and remediation planning in
`src/lib/preflight.ts`
- wired installer and onboard flows to the same host preflight decisions
- changed Podman handling from hard block to unsupported-runtime warning
- migrated deploy logic into `src/lib/deploy.ts`
- updated `nemoclaw deploy` to use the authenticated Brev CLI, current
Brev create flags, explicit GCP provider default, stricter readiness
checks, and standard installer/onboard flow
- removed `scripts/setup-spark.sh` and reduced `scripts/brev-setup.sh`
to a deprecated compatibility wrapper
- updated README/docs/help text and hardened the Brev E2E cleanup path

## Validation
- `npm run build:cli`
- targeted Vitest coverage for `src/lib/preflight.test.ts`,
`src/lib/deploy.test.ts`, `test/install-preflight.test.js`,
`test/cli.test.js`, `test/runner.test.js`
- live Brev validation with `TEST_SUITE=deploy-cli` on
`cpu-e2.4vcpu-16gb`
- confirmed successful end-to-end remote deploy after waiting for Brev
`status=RUNNING`, `build_status=COMPLETED`, `shell_status=READY`
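
The three readiness conditions named in the last validation item can be expressed as a single predicate. This is a sketch only: the object shape, the helper names `isBrevReady` and `waitForReady`, and the polling defaults are assumptions for illustration, and how these fields are actually read from Brev is not shown in this thread.

```javascript
// Hypothetical shape: an object carrying the three status fields named above.
function isBrevReady(instance) {
  return (
    instance.status === "RUNNING" &&
    instance.build_status === "COMPLETED" &&
    instance.shell_status === "READY"
  );
}

// Polls an injected async fetcher until the instance is ready or attempts run out.
// The attempt count and delay are placeholder defaults, not values from the PR.
async function waitForReady(fetchStatus, { attempts = 60, delayMs = 10000 } = {}) {
  for (let i = 0; i < attempts; i++) {
    if (isBrevReady(await fetchStatus())) return true;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return false;
}
```

Requiring all three fields guards against treating an instance as usable when it is running but its build or shell is not yet complete.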

## Related Issues
- Fixes #1377
- Addresses #1330
- Addresses #1390
- Related to #1404

## Credit / Prior Work
This branch builds on ideas and prior work from:
- #1368 by @zyang-dev for simplifying Spark setup and removing the old
cgroup workaround
- #1395 and #1468 by @kjw3 for the thin installer/bootstrap direction
and installer path reliability
- #1450 by @cjagwani for switching Brev flows toward GCP for reliability
- #1383 by @13ernkastel for the current Brev create flag compatibility
work
- #1364 by @WuKongAI-CMU for deploy sync-path fixes
- #1362 and #1266 by @jyaunches for the Brev E2E/launchable
infrastructure direction
- issue ideas from #1377 and #1404 by @zNeill, #1330 by @Marcelo5444,
and #1390 by @ericksoa


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * Improved host diagnostics with actionable remediation guidance
    surfaced during installer/onboard preflight.

* **Improvements**
  * macOS (Intel) now recommends Docker Desktop; DGX Spark guidance now
    uses the standard installer + `nemoclaw onboard`.
  * Preflight output shows detected runtime and WSL notes; installer
    prints remediation actions and will skip onboarding on blocking issues.

* **Deprecations**
  * `nemoclaw deploy`, `nemoclaw setup-spark`, and the legacy bootstrap
    wrapper are now deprecated compatibility paths.

* **Documentation**
  * Quickstart, troubleshooting, and command reference updated to reflect
    installer+onboard flow and deprecation guidance.

* **Tests**
  * Added/updated tests covering preflight, deploy compatibility, CLI
    aliases, and deploy e2e scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
cjagwani self-assigned this Apr 6, 2026
cjagwani added labels Apr 6, 2026: CI/CD; platform: brev (Affects Brev hosted development environments); security (Potential vulnerability, unsafe behavior, or access risk)
gemini2026 pushed a commit to gemini2026/NemoClaw that referenced this pull request Apr 14, 2026
gemini2026 pushed a commit to gemini2026/NemoClaw that referenced this pull request Apr 14, 2026
…ty (NVIDIA#1470)

wscurran added labels Jun 3, 2026: area: ci (CI workflows, checks, release automation, or GitHub Actions); chore (Build, CI, dependency, or tooling maintenance). Removed label: CI/CD

Labels

  • area: ci (CI workflows, checks, release automation, or GitHub Actions)
  • chore (Build, CI, dependency, or tooling maintenance)
  • platform: brev (Affects Brev hosted development environments)
  • security (Potential vulnerability, unsafe behavior, or access risk)

Development

Successfully merging this pull request may close these issues.

ci: switch Brev E2E instances from Nebius to GCP for reliability

4 participants