fix(ci): switch Brev E2E from Nebius to GCP for reliability #1450
Nebius instances selected by `brev search cpu --sort price` are the cheapest but have been unreliable, causing flaky CI runs due to slow provisioning (~17 min wait) and intermittent machine failures. GCP instances (n2d-standard-4 at $0.13/hr) offer comparable specs with significantly better reliability and a max boot time of 7 minutes.

Adds `--provider gcp` to both the launchable and bare-instance `brev search` commands, with a configurable `BREV_PROVIDER` env var (default: gcp) so the provider can be overridden without code changes.

Closes #1420

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
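As a rough illustration of the override described in this commit message, the provider lookup could be sketched as below. This is not the code from `brev-e2e.test.js`: the helper name `buildSearchArgs` and the flag ordering are assumptions made for the example; only the flags themselves and the `BREV_PROVIDER` variable with its `gcp` default come from the PR.

```javascript
// Sketch only: buildSearchArgs is a hypothetical helper, not the PR's actual code.
// It assembles the arguments for `brev search cpu` with a provider override.
function buildSearchArgs(env = process.env) {
  // BREV_PROVIDER overrides the provider; unset or empty falls back to gcp.
  const provider = env.BREV_PROVIDER || "gcp";
  return [
    "search", "cpu",
    "--min-vcpu", "4",
    "--min-ram", "16",
    "--provider", provider, // flag position is an assumption
    "--sort", "price",
  ];
}

// Default run versus an explicit override.
console.log(["brev", ...buildSearchArgs({})].join(" "));
console.log(["brev", ...buildSearchArgs({ BREV_PROVIDER: "aws" })].join(" "));
```

Passing the environment in as a parameter keeps the helper easy to exercise without mutating `process.env`.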
🧹 Nitpick comments (1)

.github/workflows/e2e-brev.yaml (1)

161-161: Provider configuration aligns with the test file.

The hardcoded `gcp` value ensures CI stability, matching the PR objective to pin the provider for reliability.

Optional enhancement: Consider exposing `BREV_PROVIDER` as a `workflow_dispatch` input (defaulting to `gcp`) to allow manual overrides for testing other providers without code changes. The test file already supports this via the env var.

💡 Optional: Add workflow input for provider override

Add a new input in the `workflow_dispatch` section:

    brev_provider:
      description: "Cloud provider for Brev instance (gcp, aws, etc.)"
      required: false
      default: "gcp"

Then update line 161 to:

    - BREV_PROVIDER: gcp
    + BREV_PROVIDER: ${{ inputs.brev_provider || 'gcp' }}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.github/workflows/e2e-brev.yaml at line 161, The workflow currently hardcodes BREV_PROVIDER: gcp; to allow manual overrides without changing code, add a workflow_dispatch input (e.g., brev_provider) with default "gcp" in the workflow_dispatch section and then set the environment variable BREV_PROVIDER to use that input (referencing the workflow_dispatch input name brev_provider and the BREV_PROVIDER env var in the job's env block) so manual runs can override the provider while keeping gcp as the default.
📒 Files selected for processing (2)
- .github/workflows/e2e-brev.yaml
- test/e2e/brev-e2e.test.js
❌ Brev E2E (full): FAILED on branch |
GCP instances default to a 10 GB disk, which is insufficient: Docker image extraction fails with "no space left on device" during sandbox creation. Set `--min-disk 50` to ensure enough space for Docker images and build artifacts.

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
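One way the disk requirement could be made configurable is sketched below. The environment variable name `BREV_MIN_DISK` and the helper `minDiskArgs` are hypothetical, and the value is assumed to be in GB; only the `--min-disk` flag and the value 50 come from the commit message above.

```javascript
// Sketch only: BREV_MIN_DISK and minDiskArgs are hypothetical names.
// Returns the `--min-disk` arguments, defaulting to 50 (assumed GB).
function minDiskArgs(env = process.env, fallbackGb = 50) {
  const raw = env.BREV_MIN_DISK;
  const gb = raw === undefined || raw === "" ? fallbackGb : Number(raw);
  // Reject non-numeric, fractional, zero, or negative values early,
  // rather than failing later with "no space left on device".
  if (!Number.isInteger(gb) || gb <= 0) {
    throw new Error(`BREV_MIN_DISK must be a positive integer, got "${raw}"`);
  }
  return ["--min-disk", String(gb)];
}

console.log(minDiskArgs({}).join(" "));
console.log(minDiskArgs({ BREV_MIN_DISK: "100" }).join(" "));
```

Validating the value up front means a bad setting fails before an instance is provisioned, instead of minutes into sandbox creation.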
✅ Brev E2E (full): PASSED on branch |
✅ Brev E2E (full): PASSED on branch |
## Summary

- Add `--provider gcp` to both `brev search cpu` calls (launchable and bare-instance paths) in `brev-e2e.test.js`
- Add `BREV_PROVIDER: gcp` env var to the `e2e-brev.yaml` workflow
- Expose `BREV_PROVIDER` as a configurable env var (default: `gcp`) so the provider can be overridden without code changes

## Problem

The E2E workflow uses `brev search cpu --min-vcpu 4 --min-ram 16 --sort price` to select the cheapest available instance. This consistently lands on **Nebius** machines because they're the cheapest provider in Brev's marketplace. However, these instances have been unreliable:

- **Slow provisioning**: ~17 min of the 29 min total run time is spent waiting for the instance to become reachable
- **Intermittent failures**: Julie's agent flagged instability with Nebius machines, and multiple recent workflow runs failed
- **No explicit provider selection**: the sort-by-price default silently chose the cheapest (and least reliable) option

From the most recent successful run ([23950138412](https://github.com/NVIDIA/NemoClaw/actions/runs/23950138412)):

| Phase | Duration |
|-------|----------|
| Instance provisioning + SSH wait | ~17 min |
| Code sync + bootstrap | ~1 min |
| Actual tests (sandbox creation + inference) | ~11 min |

## Solution

Pin the provider to GCP via `--provider gcp`. GCP instances (`n2d-standard-4`) are:

- **$0.13/hr**: comparable cost to Nebius
- **7 min max boot time** (per `brev search`) vs the ~17 min observed on Nebius
- **More reliable** per Alec Fong's (Brev team) recommendation

## Test plan

- [ ] Trigger `e2e-brev` workflow manually on this branch and verify it selects a GCP instance and passes
- [ ] Compare run time against the Nebius baseline (~29 min)
- [ ] Verify `BREV_PROVIDER` env var override works (e.g., set to `aws` to test fallback)

Fixes NVIDIA#1420

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Tests**
  * Enhanced E2E tests to support selecting instances by cloud provider and minimum disk size via environment-driven configuration.
  * Improved test logging to surface chosen provider and minimum disk requirements during instance selection.
* **Chores**
  * Updated CI workflow environment to set the provider for E2E runs, ensuring consistent provider-specific testing.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ty (#1470)

## Summary

- unify installer and onboarding host detection around shared TypeScript preflight logic
- move `deploy` behavior into TypeScript, thin the Brev compatibility wrapper, and harden Brev readiness handling
- demote or remove legacy platform-specific setup paths (`setup-spark`, `brev-setup.sh`) in favor of the canonical installer + onboard flow
- update docs, CLI help, and Brev E2E coverage to match the new behavior

## What Changed

- added shared host assessment and remediation planning in `src/lib/preflight.ts`
- wired installer and onboard flows to the same host preflight decisions
- changed Podman handling from hard block to unsupported-runtime warning
- migrated deploy logic into `src/lib/deploy.ts`
- updated `nemoclaw deploy` to use the authenticated Brev CLI, current Brev create flags, explicit GCP provider default, stricter readiness checks, and standard installer/onboard flow
- removed `scripts/setup-spark.sh` and reduced `scripts/brev-setup.sh` to a deprecated compatibility wrapper
- updated README/docs/help text and hardened the Brev E2E cleanup path

## Validation

- `npm run build:cli`
- targeted Vitest coverage for `src/lib/preflight.test.ts`, `src/lib/deploy.test.ts`, `test/install-preflight.test.js`, `test/cli.test.js`, `test/runner.test.js`
- live Brev validation with `TEST_SUITE=deploy-cli` on `cpu-e2.4vcpu-16gb`
- confirmed successful end-to-end remote deploy after waiting for Brev `status=RUNNING`, `build_status=COMPLETED`, `shell_status=READY`

## Related Issues

- Fixes #1377
- Addresses #1330
- Addresses #1390
- Related to #1404

## Credit / Prior Work

This branch builds on ideas and prior work from:

- #1368 by @zyang-dev for simplifying Spark setup and removing the old cgroup workaround
- #1395 and #1468 by @kjw3 for the thin installer/bootstrap direction and installer path reliability
- #1450 by @cjagwani for switching Brev flows toward GCP for reliability
- #1383 by @13ernkastel for the current Brev create flag compatibility work
- #1364 by @WuKongAI-CMU for deploy sync-path fixes
- #1362 and #1266 by @jyaunches for the Brev E2E/launchable infrastructure direction
- issue ideas from #1377 and #1404 by @zNeill, #1330 by @Marcelo5444, and #1390 by @ericksoa

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Improved host diagnostics with actionable remediation guidance surfaced during installer/onboard preflight.
* **Improvements**
  * macOS (Intel) now recommends Docker Desktop; DGX Spark guidance now uses the standard installer + `nemoclaw onboard`.
  * Preflight output shows detected runtime and WSL notes; installer prints remediation actions and will skip onboarding on blocking issues.
* **Deprecations**
  * `nemoclaw deploy`, `nemoclaw setup-spark`, and the legacy bootstrap wrapper are now deprecated compatibility paths.
* **Documentation**
  * Quickstart, troubleshooting, and command reference updated to reflect installer+onboard flow and deprecation guidance.
* **Tests**
  * Added/updated tests covering preflight, deploy compatibility, CLI aliases, and deploy e2e scenarios.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
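The readiness condition named in the Validation notes (`status=RUNNING`, `build_status=COMPLETED`, `shell_status=READY`) could be expressed roughly as follows. This is a sketch, not the logic in `src/lib/deploy.ts`: the object shape, the helper names `isInstanceReady` and `waitUntilReady`, and the polling parameters are all assumptions for illustration.

```javascript
// Sketch only: the instance object shape and both helper names are assumptions.
// An instance counts as ready only when all three fields reach their final state.
function isInstanceReady(instance) {
  return (
    instance.status === "RUNNING" &&
    instance.build_status === "COMPLETED" &&
    instance.shell_status === "READY"
  );
}

// Polls a caller-supplied async fetchInstance() until the instance is ready.
// attempts and delayMs are illustrative defaults, not values from the PR.
async function waitUntilReady(fetchInstance, { attempts = 60, delayMs = 5000 } = {}) {
  for (let i = 0; i < attempts; i++) {
    const instance = await fetchInstance();
    if (isInstanceReady(instance)) return instance;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`instance not ready after ${attempts} attempts`);
}
```

Requiring all three fields guards against treating an instance as usable when it reports `RUNNING` while its build or shell is still coming up.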