perf(onboard): operationalize profiling traces for latency diagnosis #3769

@wscurran

Problem

#2001 identifies several onboard and inference-validation latency risks, but further optimization work needs real timing data before changing timeout behavior, probe strategy, or onboarding orchestration.

PR #3070 adds opt-in profiling traces, but Sprint 5 needs a clear issue to track landing that work, confirming the trace coverage, and using the resulting artifacts to guide the next #2001 breakouts.

Scope

Operationalize onboard profiling traces for latency diagnosis.

Candidate areas:

  • Onboard phase timing
  • Gateway startup and recovery waits
  • Sandbox readiness waits
  • Dashboard readiness waits
  • Policy application
  • Inference validation probes
  • Curl probe execution
  • CI/E2E trace artifact collection
  • Maintainer workflow for collecting traces from slow local or remote runs
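The areas above share one shape: a named phase with a start, an end, and an outcome. As a minimal sketch of that shape (not the implementation in #3070), the recorder below times a block of work and keeps one record per phase. `TraceRecorder`, the span names, and the `duration_ms`/`status`/`attrs` fields are all hypothetical and chosen only for illustration.

```python
import json
import time
from contextlib import contextmanager


class TraceRecorder:
    """Hypothetical in-memory span recorder; names and fields are assumptions."""

    def __init__(self, clock=time.perf_counter):
        # A monotonic clock avoids negative durations if wall time is adjusted.
        self._clock = clock
        self.spans = []

    @contextmanager
    def span(self, name, **attrs):
        """Time the enclosed block and record it, even if it raises."""
        start = self._clock()
        status = "ok"
        try:
            yield
        except BaseException:
            status = "error"
            raise
        finally:
            end = self._clock()
            self.spans.append({
                "name": name,
                "duration_ms": round((end - start) * 1000.0, 3),
                "status": status,
                "attrs": attrs,
            })

    def to_jsonl(self):
        """Serialize spans as one JSON object per line."""
        return "\n".join(json.dumps(s, sort_keys=True) for s in self.spans)


if __name__ == "__main__":
    recorder = TraceRecorder()
    with recorder.span("gateway_startup_wait", attempt=1):
        time.sleep(0.01)  # stand-in for a real readiness wait
    print(recorder.to_jsonl())
```

Recording the span in `finally` means a wait that times out or fails still shows up in the trace, which is usually the run a maintainer most wants to inspect.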

Expected Behavior

When profiling is enabled, maintainers should be able to identify which onboard or validation phase dominates a slow run without exposing secrets, provider credentials, or raw API keys.

Trace output should be useful both locally and in CI/E2E artifacts.
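To make "identify which phase dominates" concrete, here is a rough sketch of a summarizer a maintainer might run over a collected trace. It assumes a JSON Lines trace where each record has a `name` and a `duration_ms` field; that format and the function name `dominant_phase` are assumptions, since the issue does not specify the artifact layout.

```python
import json
from collections import defaultdict


def dominant_phase(jsonl_text):
    """Return (name, total_ms, share_of_total) for the slowest phase, or None.

    Assumes one JSON object per line with "name" and "duration_ms" keys
    (a hypothetical trace format, not confirmed by the issue).
    """
    totals = defaultdict(float)
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the artifact
        record = json.loads(line)
        totals[record["name"]] += float(record["duration_ms"])
    if not totals:
        return None
    grand_total = sum(totals.values())
    name, total_ms = max(totals.items(), key=lambda kv: kv[1])
    share = total_ms / grand_total if grand_total else 0.0
    return name, total_ms, share


if __name__ == "__main__":
    sample = "\n".join([
        '{"name": "gateway_startup_wait", "duration_ms": 4000}',
        '{"name": "inference_probe", "duration_ms": 1000}',
        '{"name": "gateway_startup_wait", "duration_ms": 3000}',
    ])
    print(dominant_phase(sample))
```

This sums durations per name, so it treats spans as non-overlapping. If the real traces nest spans (for example, a curl probe inside an inference probe), the shares would need to be computed per nesting level instead.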

Related Work

This issue should track landing or refreshing #3070, then using the first trace artifacts to update #2001 with evidence-backed latency findings.

Acceptance Criteria

  • NEMOCLAW_TRACE_FILE and/or NEMOCLAW_TRACE_DIR produce usable trace artifacts.
  • Trace coverage includes onboard phases, gateway waits, sandbox/dashboard readiness waits, policy application, inference probes, and curl probe execution.
  • CI/E2E jobs that exercise onboard collect trace artifacts where useful.
  • Trace metadata is scrubbed of secrets, provider credentials, and raw API keys.
  • A maintainer can inspect a slow run and identify the dominant latency phase.
  • #2001 (perf: investigate and reduce networking latency during onboard and validation) is updated with trace-based findings before deeper timeout calibration or orchestration parallelization work proceeds.
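Two of the criteria above, the output location and the scrubbing of metadata, can be sketched as follows. Only the variable names NEMOCLAW_TRACE_FILE and NEMOCLAW_TRACE_DIR come from this issue. The precedence between them, the generated file name, the key patterns, and the helper names `resolve_trace_path` and `scrub` are assumptions made for illustration and may not match #3070.

```python
import os
import re
import time

# Assumed list of key fragments that mark a value as sensitive.
SENSITIVE_KEY = re.compile(
    r"(api[_-]?key|token|secret|password|authorization|credential)",
    re.IGNORECASE,
)


def resolve_trace_path(env=None, now=None, pid=None):
    """Pick a trace output path, or None when profiling is not enabled.

    Assumption: an explicit file wins over a directory.
    """
    env = os.environ if env is None else env
    explicit = env.get("NEMOCLAW_TRACE_FILE")
    if explicit:
        return explicit
    directory = env.get("NEMOCLAW_TRACE_DIR")
    if directory:
        # Hypothetical naming scheme: UTC timestamp plus process id.
        stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime(now))
        pid = os.getpid() if pid is None else pid
        return os.path.join(directory, f"onboard-{stamp}-{pid}.jsonl")
    return None


def scrub(value, key=""):
    """Return a copy of value with sensitive-looking keys redacted."""
    if SENSITIVE_KEY.search(key):
        return "[REDACTED]"
    if isinstance(value, dict):
        return {k: scrub(v, str(k)) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [scrub(v, key) for v in value]
    return value


if __name__ == "__main__":
    print(resolve_trace_path({"NEMOCLAW_TRACE_DIR": "traces"}))
    print(scrub({"provider": "example", "api_key": "not-a-real-key"}))
```

Key-based redaction is deliberately conservative: it will also redact harmless keys that contain "token", and it will not catch a secret embedded in a value such as a URL query string. An allowlist of known-safe metadata keys would be a stricter alternative.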

Non-goals

  • Changing timeout values.
  • Replacing polling loops.
  • Adding adaptive provider-validation timeouts.
  • Reducing DNS/TCP/TLS overhead.
  • Parallelizing onboard orchestration.

Metadata

Labels

  • area: cli (Command line interface, flags, terminal UX, or output)
  • area: observability (Logging, metrics, tracing, diagnostics, or debug output)
  • area: performance (Latency, throughput, resource use, benchmarks, or scaling)