Use embedded etcd for scale test cluster #624

Merged
shayasoolin merged 1 commit into ai-dynamo:main from shayasoolin:fix-scale-ci-embedded-etcd on May 19, 2026

@shayasoolin (Contributor)

What type of PR is this?

/kind bug

What this PR does / why we need it:

This updates the k3d cluster created by the e2e infra manager to start the single k3s server with embedded etcd:

--k3s-arg --cluster-init@server:0
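
As a rough illustration of where this flag lands, the sketch below assembles an argument list for `k3d cluster create` that includes the new `--k3s-arg`. The helper name `build_k3d_create_args`, the cluster name, and the agent count are hypothetical and are not taken from `cluster.py`; only the `--k3s-arg --cluster-init@server:0` pair comes from this PR.

```python
from typing import List


def build_k3d_create_args(name: str, agents: int, embedded_etcd: bool = True) -> List[str]:
    """Return an argv list for `k3d cluster create` (hypothetical helper)."""
    args = [
        "k3d", "cluster", "create", name,
        "--servers", "1",          # same single-server topology as before
        "--agents", str(agents),   # agent count is an illustrative parameter
    ]
    if embedded_etcd:
        # The node filter `server:0` applies the k3s flag to the first (and only) server.
        args += ["--k3s-arg", "--cluster-init@server:0"]
    return args


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(build_k3d_create_args("scale-test", agents=2)))
```

Returning a list rather than a shell string keeps the `@server:0` node filter from needing any quoting if the list is later passed to `subprocess.run`.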

The scale test generates enough pod and status churn that the default single-server k3s datastore (SQLite accessed through kine) can lag under CI load. In the failed scale CI runs, the API server showed symptoms such as stale resource versions, compacted revisions, handler timeouts, and list/watch calls stalling while the scale test waited for pods to become ready.

--cluster-init@server:0 keeps the same single-server k3d topology, but bootstraps the k3s server with embedded etcd as the datastore instead of SQLite/kine. This gives the scale test a datastore path that handles the high watch/list/update churn more reliably.
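
One way to sanity-check that a cluster came up with embedded etcd is to look for the `node-role.kubernetes.io/etcd` label on the server node. This is a minimal sketch, not part of this PR: it assumes k3s sets that label to `"true"` on servers running embedded etcd (worth confirming for the k3s version in use), and it assumes the input is the parsed output of `kubectl get nodes -o json`. The helper name and the sample node names are hypothetical.

```python
from typing import Any, Dict, List

# Assumption: k3s labels embedded-etcd servers with this role label.
ETCD_ROLE_LABEL = "node-role.kubernetes.io/etcd"


def etcd_server_nodes(node_list: Dict[str, Any]) -> List[str]:
    """Return the names of nodes that carry the etcd role label."""
    names = []
    for item in node_list.get("items", []):
        meta = item.get("metadata", {})
        if meta.get("labels", {}).get(ETCD_ROLE_LABEL) == "true":
            names.append(meta.get("name", ""))
    return names


# Hand-written sample in the shape of `kubectl get nodes -o json`; not real output.
SAMPLE = {
    "items": [
        {"metadata": {"name": "k3d-example-server-0",
                      "labels": {ETCD_ROLE_LABEL: "true"}}},
        {"metadata": {"name": "k3d-example-agent-0", "labels": {}}},
    ]
}

if __name__ == "__main__":
    print(etcd_server_nodes(SAMPLE))
```

An empty result on a single-server cluster would suggest the server is still on the default SQLite/kine datastore.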

Which issue(s) this PR fixes:

Related to #550

Special notes for your reviewer:

This is intentionally scoped to cluster creation. It does not change the scale test workload, node count, controller settings, or workflow behavior.

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

copy-pr-bot Bot commented May 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread operator/hack/infra_manager/cluster.py
@shayasoolin shayasoolin merged commit bebf846 into ai-dynamo:main May 19, 2026
15 of 16 checks passed

3 participants