perf(onboard): replace remaining fixed readiness polls with deadline-based waits #3768

@wscurran

Problem

#2001 identifies a broader onboarding latency pattern: several readiness paths still rely on fixed poll counts, fixed intervals, or platform-specific timeout widening. PR #2492 addresses the gateway startup health loop, but the same waiting pattern still exists in other onboard/readiness flows.

These waits make fast systems pay unnecessary delay and make slow systems fail based on hardcoded loop assumptions rather than a clear deadline.
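To make the cost concrete, here is a small model of a fixed check-then-sleep loop. It is only an illustration: the function name and the numbers (30 polls at 2s, matching a "within 60s" style budget) are assumptions, not values taken from the NemoClaw code, and probe duration is ignored.

```python
from typing import Optional


def fixed_poll_detect_time(
    ready_at_s: float, count: int, interval_s: float
) -> Optional[float]:
    """Model a fixed N x interval loop (hypothetical helper).

    Probes happen at 0, interval, 2*interval, ... for `count` attempts.
    Returns the time of the first probe that sees the service ready,
    or None if the loop gives up first.
    """
    for attempt in range(count):
        probe_time = attempt * interval_s
        if probe_time >= ready_at_s:
            return probe_time
    return None


# A service that becomes ready after 0.3s is not noticed until the next
# probe at 2.0s, and one that needs 59s is reported as failed because the
# last probe runs at 58s.
fast = fixed_poll_detect_time(0.3, count=30, interval_s=2.0)
slow = fixed_poll_detect_time(59.0, count=30, interval_s=2.0)
```

In this model the effective budget is an accident of count and interval rather than a stated deadline, which is the behavior this issue proposes to replace.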

Scope

Extend the deadline-based wait pattern beyond gateway startup to the remaining readiness paths called out in #2001.

Candidate areas:

  • Sandbox readiness polling
  • Dashboard readiness polling
  • Gateway recovery polling
  • Agent gateway polling
  • Sandbox create stream readiness polling, if it uses the same fixed-interval pattern
  • Any error messages that hardcode stale timeout values like "within 60s"

Expected Behavior

Readiness checks should:

  • Start with fast polling so successful systems continue quickly
  • Back off up to a capped interval
  • Respect one clear deadline budget
  • Report the actual deadline used when timing out
  • Preserve or deprecate existing health-poll env vars safely if they are externally used
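The behavior listed above could be sketched roughly as follows. This is not the helper from #2492; `wait_until_ready`, `ReadinessTimeout`, and all default values are hypothetical names and numbers chosen for illustration. The clock and sleep functions are injectable so tests can run without real delays.

```python
import time
from typing import Callable


class ReadinessTimeout(Exception):
    """Raised when the readiness condition is not met before the deadline."""


def wait_until_ready(
    check: Callable[[], bool],
    deadline_s: float,
    initial_interval_s: float = 0.1,  # assumed default: start with fast polling
    max_interval_s: float = 2.0,      # assumed default: cap for the backoff
    backoff: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    what: str = "service",
) -> int:
    """Poll `check` until it returns True or `deadline_s` has elapsed.

    Returns the number of attempts made. At least one probe always runs,
    so a zero deadline still performs a single check.
    """
    start = clock()
    interval = initial_interval_s
    attempts = 0
    while True:
        attempts += 1
        if check():
            return attempts
        remaining = deadline_s - (clock() - start)
        if remaining <= 0:
            # Report the deadline that was actually used, not a hardcoded value.
            raise ReadinessTimeout(
                f"{what} not ready within {deadline_s:g}s ({attempts} attempts)"
            )
        # Never sleep past the deadline; then grow the interval up to the cap.
        sleep(min(interval, remaining))
        interval = min(interval * backoff, max_interval_s)
```

A caller might use it as `wait_until_ready(lambda: probe_gateway(), deadline_s=60, what="gateway")`, where `probe_gateway` stands in for whatever readiness probe the path already has. Using a monotonic clock keeps the budget stable if the wall clock is adjusted during onboarding.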

Related Work

This issue should build on #2492 if it lands. If #2492 is superseded, this issue should reuse the replacement wait helper instead.

Acceptance Criteria

  • Remaining onboard/readiness polling loops use deadline-based waits instead of fixed N x interval loops.
  • Fast systems exit as soon as the readiness condition is met.
  • Slow systems receive the full configured deadline budget.
  • Existing tests that use NEMOCLAW_HEALTH_POLL_COUNT or NEMOCLAW_HEALTH_POLL_INTERVAL are updated or remain compatible through deprecated aliases.
  • Timeout messages include the actual deadline used.
  • New or updated tests cover immediate success, retry success, timeout, and zero/short-deadline test behavior.
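One way the compatibility criterion might be met is to derive the deadline from the old variables when no new setting is present. The sketch below is an assumption throughout: `NEMOCLAW_READY_DEADLINE_S` is a hypothetical variable name, the defaults are placeholders, and the interval is assumed to be in seconds. Only `NEMOCLAW_HEALTH_POLL_COUNT` and `NEMOCLAW_HEALTH_POLL_INTERVAL` come from this issue.

```python
import os
import warnings
from typing import Mapping

# Placeholder defaults for illustration only; the real values may differ.
DEFAULT_DEADLINE_S = 60.0
DEFAULT_POLL_COUNT = 30
DEFAULT_POLL_INTERVAL_S = 2.0


def resolve_deadline_s(env: Mapping[str, str] = os.environ) -> float:
    """Pick the readiness deadline, honoring the legacy variables as aliases."""
    explicit = env.get("NEMOCLAW_READY_DEADLINE_S")  # hypothetical new variable
    if explicit:
        return float(explicit)

    count = env.get("NEMOCLAW_HEALTH_POLL_COUNT")
    interval = env.get("NEMOCLAW_HEALTH_POLL_INTERVAL")
    if count or interval:
        warnings.warn(
            "NEMOCLAW_HEALTH_POLL_COUNT/INTERVAL are deprecated; "
            "their product is used as the deadline budget",
            DeprecationWarning,
            stacklevel=2,
        )
        # Preserve the old total budget: count x interval.
        return float(count or DEFAULT_POLL_COUNT) * float(
            interval or DEFAULT_POLL_INTERVAL_S
        )

    return DEFAULT_DEADLINE_S
```

Passing the environment as a mapping lets tests cover the default, legacy, and explicit cases with plain dictionaries, without mutating `os.environ`.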

Non-goals

  • Adaptive provider-validation timeout calibration.
  • DNS/TCP/TLS probe optimization.
  • Onboard orchestration parallelization.
  • Profiling trace output.

Metadata

Labels

  • area: cli (Command line interface, flags, terminal UX, or output)
  • area: inference (Inference routing, serving, model selection, or outputs)
  • area: performance (Latency, throughput, resource use, benchmarks, or scaling)