perf(onboard): parallelize safe onboard waits with cancellation handling #3775

@wscurran

Problem

#2001 identifies sequential onboard orchestration as a contributor to latency. Some onboard work appears to run serially even when the phases do not strictly depend on each other.

Potential overlap opportunities include:

  • DNS pre-resolution while the user is answering provider prompts
  • Gateway startup overlapping with interactive provider or messaging prompts
  • Sandbox-ready and dashboard-ready checks running together
  • Other readiness checks that can safely run under the same deadline budget
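
As a rough illustration of the last two bullets, the sketch below runs several readiness checks concurrently under one shared deadline. It leaves the fixed-interval polling itself untouched and only changes how the polls are composed. All names (`Check`, `pollUntilReady`, `waitAllReady`) and the interval and budget values are hypothetical and are not taken from the NemoClaw codebase.

```typescript
// Hypothetical sketch: names and timings are assumptions, not NemoClaw APIs.
type Check = () => Promise<boolean>;

// Polls `check` at a fixed interval until it returns true or the signal aborts.
async function pollUntilReady(
  check: Check,
  intervalMs: number,
  signal: AbortSignal,
): Promise<void> {
  while (true) {
    if (signal.aborted) throw signal.reason;
    if (await check()) return;
    await new Promise<void>((resolve, reject) => {
      if (signal.aborted) {
        reject(signal.reason);
        return;
      }
      const onAbort = () => {
        clearTimeout(timer);
        reject(signal.reason);
      };
      const timer = setTimeout(() => {
        signal.removeEventListener("abort", onAbort);
        resolve();
      }, intervalMs);
      signal.addEventListener("abort", onAbort, { once: true });
    });
  }
}

// Runs every check concurrently; all of them share one deadline budget.
async function waitAllReady(
  checks: Check[],
  budgetMs: number,
  intervalMs = 50,
): Promise<void> {
  const controller = new AbortController();
  const deadline = setTimeout(
    () => controller.abort(new Error(`readiness checks exceeded ${budgetMs} ms budget`)),
    budgetMs,
  );
  try {
    await Promise.all(
      checks.map((check) =>
        pollUntilReady(check, intervalMs, controller.signal).catch((err: unknown) => {
          // The first failure cancels sibling polls and becomes the single reported error.
          if (!controller.signal.aborted) controller.abort(err);
          throw controller.signal.reason;
        }),
      ),
    );
  } finally {
    clearTimeout(deadline);
  }
}
```

With this shape, total wait time is bounded by the slowest check or the budget, whichever comes first, rather than by the sum of the individual waits.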

This carries more risk than simple wait-loop cleanup because it changes control flow. If background work fails while the user is mid-prompt, NemoClaw needs clean cancellation and one clear error instead of a confusing half-onboard state.

Scope

Parallelize only dependency-safe onboard waits and setup work, with explicit cancellation and failure propagation.

Candidate areas:

  • Identify onboard phases that can safely overlap.
  • Add cancellation handling with AbortController or an equivalent mechanism.
  • Ensure background gateway/provider setup failures stop dependent foreground work.
  • Ensure user cancellation aborts background work cleanly.
  • Keep resume behavior stable.
  • Preserve existing output expectations where E2E tests parse step messages.
  • Add tests for success, background failure, user cancellation, and resume behavior.
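
One possible shape for the cancellation items above, sketched with `AbortController`: a foreground task (for example, interactive prompts) and a background task (for example, gateway startup) share one signal, the first failure aborts the other side, and an optional caller-supplied signal stands in for user cancellation. `Task`, `runOverlapped`, and `abortableDelay` are hypothetical names used only for illustration. How real prompts and gateway startup would observe the signal is an open design question that this sketch does not answer.

```typescript
// Hypothetical sketch: none of these names come from the NemoClaw codebase.
type Task<T> = (signal: AbortSignal) => Promise<T>;

// Stand-in for cancellable work such as a prompt or a startup wait.
function abortableDelay(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal.aborted) {
      reject(signal.reason);
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      reject(signal.reason);
    };
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

async function runOverlapped<F, B>(
  foreground: Task<F>, // e.g. provider or messaging prompts
  background: Task<B>, // e.g. gateway startup
  userCancel?: AbortSignal, // e.g. wired to Ctrl-C by the caller
): Promise<{ foreground: F; background: B }> {
  const controller = new AbortController();
  const onUserCancel = () => controller.abort(userCancel?.reason);
  if (userCancel?.aborted) onUserCancel();
  else userCancel?.addEventListener("abort", onUserCancel, { once: true });

  // The first rejection becomes the abort reason. Later rejections are side
  // effects of that abort, so they are rethrown as the same root cause.
  const guard = <T>(p: Promise<T>): Promise<T> =>
    p.catch((err: unknown) => {
      if (!controller.signal.aborted) controller.abort(err);
      throw controller.signal.reason;
    });

  try {
    const [f, b] = await Promise.all([
      guard(foreground(controller.signal)),
      guard(background(controller.signal)),
    ]);
    return { foreground: f, background: b };
  } finally {
    userCancel?.removeEventListener("abort", onUserCancel);
  }
}
```

Because every rejection is rewritten to the original abort reason, the caller sees one error regardless of which side notices the failure first. The same entry point can be exercised by tests for the success, background-failure, and user-cancellation cases.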

Expected Behavior

NemoClaw should reduce avoidable onboard wall-clock time by overlapping independent work while preserving clear sequencing where real dependencies exist.

If a background task fails, the user should see one actionable error and dependent work should stop cleanly.

If the user cancels onboard, background tasks should be aborted and cleanup should be predictable.

Related Work

This issue should benefit from #3769 because trace artifacts can show which onboard phases are safe and valuable to overlap. It should also account for #3768 because deadline-based wait helpers may provide the cancellation/deadline structure used here.

Acceptance Criteria

  • Only dependency-safe onboard phases are parallelized.
  • Background failure while prompts are active produces one clear actionable error.
  • User cancellation stops background work cleanly.
  • Resume behavior remains compatible with existing onboard session state.
  • Existing E2E output expectations are preserved or deliberately updated.
  • Tests cover success, background failure, prompt cancellation, and resume cases.
  • #2001 (perf: investigate and reduce networking latency during onboard and validation) is updated with implementation notes or trace evidence.

Non-goals

  • Replacing fixed readiness polling loops.
  • Adding profiling trace output.
  • Changing provider validation timeout policy.
  • Reducing DNS/TCP/TLS overhead inside individual provider probes.
  • Changing provider support policy.

Labels

  • area: cli (command line interface, flags, terminal UX, or output)
  • area: performance (latency, throughput, resource use, benchmarks, or scaling)