ci: safely split slow CLI coverage suites #4892

@cv

Problem Statement

The cli-tests job in CI / Pull Request currently takes close to six minutes, with nearly all of that time spent in the Run CLI coverage step rather than checkout, dependency install, or build setup.

From PR #4887's passing run (CI / Pull Request, run 27055172682, job 79858083592):

  • Install dependencies: about 11 seconds
  • Build TypeScript plugin: effectively negligible
  • Run CLI coverage: about 5 minutes 36 seconds
  • Inside that shell step, clean + npm run build:cli + sourcemap validation took about 10 seconds, leaving roughly 5 minutes 25 seconds for npx vitest run --project cli --coverage ...

The cli Vitest project is broad: it currently includes about 505 test files and about 6,500 test cases. A local run without coverage still took about 4 minutes 28 seconds, and that run also hit local environment-sensitive failures. The cost appears to come from the existing subprocess-heavy and timeout/retry-heavy CLI test surface, plus V8 coverage overhead.

The biggest local timing hotspots were:

  • test/cli.test.ts
  • test/sandbox-connect-inference.test.ts
  • test/nemoclaw-start.test.ts
  • test/policies.test.ts
  • test/onboard-selection.test.ts
  • test/onboard.test.ts
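
A ranking like this can be derived from Vitest's JSON reporter output (for example, a run with --reporter=json). The sketch below is illustrative only. It assumes the report follows the Jest-style shape, with a testResults array whose entries carry name, startTime, and endTime. The sample data is hypothetical, not the measured timings.

```typescript
// Minimal shape of the fields this sketch reads. Assumed to match the
// Jest-style JSON written by Vitest's `json` reporter.
interface FileResult {
  name: string;      // test file path
  startTime: number; // epoch milliseconds
  endTime: number;   // epoch milliseconds
}

interface JsonReport {
  testResults: FileResult[];
}

interface FileTiming {
  file: string;
  seconds: number;
}

// Return the `limit` slowest test files, slowest first.
function slowestFiles(report: JsonReport, limit: number): FileTiming[] {
  return report.testResults
    .map((r) => ({ file: r.name, seconds: (r.endTime - r.startTime) / 1000 }))
    .sort((a, b) => b.seconds - a.seconds)
    .slice(0, limit);
}

// Hypothetical sample data, not real measurements.
const sampleReport: JsonReport = {
  testResults: [
    { name: "test/a.test.ts", startTime: 0, endTime: 1500 },
    { name: "test/b.test.ts", startTime: 0, endTime: 42000 },
    { name: "test/c.test.ts", startTime: 1000, endTime: 9000 },
  ],
};

console.log(slowestFiles(sampleReport, 2));
```

Per-file durations overlap when Vitest runs files in parallel, so the sum of these values will not equal the wall-clock time of the step.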

This is not a correctness regression from PR #4887, but the workflow split made the long-running CLI coverage job more visible.

Proposed Design

Split or shard the slow CLI coverage work only after preserving the current behavior contract. The goal should be lower wall-clock time without changing which checks run, which tests are selected, or how coverage ratchets are enforced.

Suggested safe rollout:

  1. Add measurement before changing behavior.

    • Split the current Run CLI coverage shell block into separately timed CI steps, or emit explicit timing markers for build, sourcemap validation, Vitest, and coverage ratchet.
    • Optionally upload Vitest JSON timing output as an artifact for the cli project.
    • Keep this as a no-behavior-change PR so baseline timing is visible on GitHub Actions.
  2. Classify the expensive CLI test surface by ownership and failure mode.

    • Identify subprocess-heavy CLI command tests, retry/timeout simulation tests, and lower-level unit-style CLI tests.
    • Candidate heavy suites based on current timing: test/cli.test.ts, test/sandbox-connect-inference.test.ts, and test/nemoclaw-start.test.ts.
    • Avoid moving tests based only on filename convenience; preserve the current semantic coverage of CLI command behavior, sandbox lifecycle handling, inference preflights, and retry/timeout behavior.
  3. Introduce explicit Vitest projects or CI shards.

    • Example shape: keep fast root/unit-style CLI tests in one project/job and move subprocess/timeout-heavy integration-style CLI tests into another project/job.
    • Alternatively use Vitest sharding if test selection remains deterministic and easy to audit.
    • Ensure every test currently selected by --project cli is selected exactly once unless a duplicate is intentionally removed.
  4. Preserve aggregate coverage semantics.

    • Do not enforce separate per-shard coverage ratchets that could hide aggregate regressions.
    • Produce coverage reports from each shard and merge them before running the existing scripts/check-coverage-ratchet.ts checks, or otherwise prove the ratchet sees the same aggregate coverage data it sees today.
    • Keep both CLI coverage threshold files in force: ci/coverage-threshold-cli-summary.json and ci/coverage-threshold-cli-files.json.
  5. Prove equivalence before deleting the old path.

    • On the migration PR, run the old monolithic npx vitest run --project cli --coverage ... path and the proposed split path at least once, then compare:
      • test file selection
      • total test count
      • failed/skipped test behavior
      • coverage summary
      • coverage ratchet result
    • After equivalence is demonstrated, remove the duplicate monolithic run so CI does not stay permanently more expensive.
  6. Keep required-check behavior explicit.

    • Update the final PR aggregate checks job so all split CLI jobs are required when code changes are present.
    • Confirm docs-only PR routing is unchanged and does not start the CLI jobs.
    • Add or update workflow contract tests so the final aggregate cannot pass if one CLI shard is skipped, renamed, or omitted unexpectedly.
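
The contract in step 6 could be checked with logic along these lines. This is a sketch under assumptions: the job names are hypothetical placeholders, and the result strings mirror the values GitHub Actions exposes as needs.<job>.result (success, failure, cancelled, skipped).

```typescript
// Result values as GitHub Actions reports them in `needs.<job>.result`.
type JobResult = "success" | "failure" | "cancelled" | "skipped";

// Hypothetical shard job names; the real workflow would define its own.
const REQUIRED_CLI_JOBS: readonly string[] = [
  "cli-tests-fast",
  "cli-tests-heavy",
  "cli-coverage-merge",
];

// Returns one message per violated requirement.
// An empty array means the gate passes.
function cliGateFailures(
  results: Record<string, JobResult | undefined>,
  codeChanged: boolean,
  required: readonly string[] = REQUIRED_CLI_JOBS,
): string[] {
  // Docs-only PRs are not expected to run CLI jobs, so nothing is required.
  if (!codeChanged) return [];

  const failures: string[] = [];
  for (const job of required) {
    const result = results[job];
    if (result === undefined) {
      // Catches a shard that was renamed or dropped from the aggregate.
      failures.push(`${job}: missing from results (renamed or omitted?)`);
    } else if (result !== "success") {
      // "skipped" is treated as a failure when code changed.
      failures.push(`${job}: expected success, got ${result}`);
    }
  }
  return failures;
}

console.log(
  cliGateFailures({ "cli-tests-fast": "success", "cli-tests-heavy": "skipped" }, true),
);
```

Treating a skipped or missing job as a failure is the important part: a plain "no job failed" condition would let the aggregate pass when a shard never ran.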

Alternatives Considered

  • Only optimize individual slow tests. This may still be worthwhile, especially for timeout/retry simulations, but it will take longer to pay down and does not reduce wall-clock time as reliably as parallelizing independent work.
  • Shard by test count alone. This is simpler, but can make ownership and coverage debugging harder if related CLI command tests move between shards unpredictably.
  • Split coverage thresholds per shard. This is risky because a shard-local threshold can pass while aggregate CLI coverage behavior changes.
  • Exclude slow suites from PR CI. This should not be done; the current behavior checks are valuable and should remain part of the code-change gate.
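
The risk with per-shard thresholds can be shown with a toy model. The sketch below assumes a simplified coverage representation (file, then line, then hit count), which is not necessarily the format scripts/check-coverage-ratchet.ts consumes. The file name and numbers are hypothetical.

```typescript
// file path -> line number (as a string key) -> hit count
type LineHits = Record<string, Record<string, number>>;

// Merge shard reports by summing hit counts per file and line.
function mergeLineHits(shards: LineHits[]): LineHits {
  const merged: LineHits = {};
  for (const shard of shards) {
    for (const [file, lines] of Object.entries(shard)) {
      const target = merged[file] ?? (merged[file] = {});
      for (const [line, hits] of Object.entries(lines)) {
        target[line] = (target[line] ?? 0) + hits;
      }
    }
  }
  return merged;
}

// Percentage of known lines with at least one hit.
function linePct(hits: LineHits): number {
  let total = 0;
  let covered = 0;
  for (const lines of Object.values(hits)) {
    for (const count of Object.values(lines)) {
      total += 1;
      if (count > 0) covered += 1;
    }
  }
  return total === 0 ? 100 : (covered / total) * 100;
}

// Hypothetical data: two shards that each cover half of one file.
const shardA: LineHits = { "src/example.ts": { 1: 1, 2: 1, 3: 0, 4: 0 } };
const shardBBefore: LineHits = { "src/example.ts": { 1: 0, 2: 0, 3: 1, 4: 1 } };
// After a change, shard B covers the same lines as shard A.
const shardBAfter: LineHits = { "src/example.ts": { 1: 1, 2: 1, 3: 0, 4: 0 } };

console.log(linePct(shardBBefore), linePct(shardBAfter));
console.log(linePct(mergeLineHits([shardA, shardBBefore])));
console.log(linePct(mergeLineHits([shardA, shardBAfter])));
```

In this toy case shard B reports the same percentage before and after, so a shard-local threshold would still pass, while the merged figure drops because lines 3 and 4 are no longer covered by any shard.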

Acceptance Criteria

  • The split keeps all checks and tests that the current cli-tests job performs for code PRs.
  • Each existing cli project test is either selected exactly once or explicitly documented as a deliberate duplicate removal.
  • Aggregate CLI coverage ratchets still run against equivalent merged coverage data.
  • The final PR aggregate checks job requires every new CLI shard/job for code changes.
  • Docs-only PR behavior remains unchanged.
  • A migration PR includes before/after timing for the same commit or equivalent commit pair.
  • Wall-clock time for the CLI coverage portion is reduced without weakening the gate.
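
The "selected exactly once" criterion lends itself to a mechanical check. The following is a minimal sketch, assuming the file lists for the monolithic run and for each shard have already been collected as arrays of paths. The function name, the documentedRemovals parameter, and the sample paths are all hypothetical.

```typescript
interface PartitionReport {
  missing: string[];    // expected files that no shard selects
  duplicated: string[]; // files selected by more than one shard
  unexpected: string[]; // files a shard selects that are not expected
}

// Compare the monolithic selection with the per-shard selections.
// Files in `documentedRemovals` are dropped from the expected set, so a
// shard that still selects one of them is reported as unexpected.
function checkPartition(
  baseline: readonly string[],
  shards: readonly (readonly string[])[],
  documentedRemovals: readonly string[] = [],
): PartitionReport {
  const counts = new Map<string, number>();
  for (const shard of shards) {
    // A file listed twice inside one shard still counts once for that shard.
    for (const file of Array.from(new Set(shard))) {
      counts.set(file, (counts.get(file) ?? 0) + 1);
    }
  }

  const removed = new Set(documentedRemovals);
  const expected = new Set(baseline.filter((f) => !removed.has(f)));

  return {
    missing: Array.from(expected).filter((f) => !counts.has(f)).sort(),
    duplicated: Array.from(counts)
      .filter(([, n]) => n > 1)
      .map(([f]) => f)
      .sort(),
    unexpected: Array.from(counts.keys()).filter((f) => !expected.has(f)).sort(),
  };
}

// Hypothetical file lists.
const report = checkPartition(
  ["test/a.test.ts", "test/b.test.ts", "test/c.test.ts", "test/d.test.ts"],
  [["test/a.test.ts", "test/b.test.ts"], ["test/b.test.ts", "test/c.test.ts"]],
);
console.log(report);
```

A migration PR could fail when any of the three arrays is non-empty, which makes accidental drops and double runs visible in review.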

Category

Testing

Checklist

  • I searched existing issues and this is not a duplicate
  • This is a design proposal, not a "please build this" request

Metadata

Labels

  • area: ci (CI workflows, checks, release automation, or GitHub Actions)
  • area: cli (Command line interface, flags, terminal UX, or output)
  • area: performance (Latency, throughput, resource use, benchmarks, or scaling)