Use embedded etcd for scale test cluster by shayasoolin · Pull Request #624 · ai-dynamo/grove

shayasoolin · 2026-05-19T11:01:43Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

This updates the k3d cluster created by the e2e infra manager to start the single k3s server with embedded etcd:

--k3s-arg --cluster-init@server:0

The scale test creates enough pod and status churn that the default single-server k3s datastore, sqlite through kine, can lag under CI load. In the failed scale CI runs, the API server showed symptoms such as stale resource versions, compacted revisions, handler timeouts, and list/watch calls stalling while the scale test waited for pods to become ready.

--cluster-init@server:0 keeps the same single-server k3d topology, but bootstraps the k3s server with embedded etcd as the datastore instead of sqlite/kine. This gives the scale test a datastore path that handles the high watch/list/update churn more reliably.
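As a rough illustration of the change described above, the sketch below assembles a `k3d cluster create` command line that keeps a single server and passes `--cluster-init` to it. The function name `build_k3d_create_args` and its parameters are hypothetical and are not taken from `cluster.py`; only the `k3d` flags (`--servers`, `--agents`, `--k3s-arg`) and the `@server:0` node filter are real CLI syntax.

```python
from typing import List


def build_k3d_create_args(
    cluster_name: str,
    agents: int,
    embedded_etcd: bool = True,
) -> List[str]:
    """Build an argv list for `k3d cluster create` (hypothetical helper)."""
    args = [
        "k3d", "cluster", "create", cluster_name,
        "--servers", "1",          # topology stays single-server
        "--agents", str(agents),
    ]
    if embedded_etcd:
        # Forward --cluster-init to the first (and only) server so k3s
        # bootstraps embedded etcd instead of the default sqlite/kine store.
        args += ["--k3s-arg", "--cluster-init@server:0"]
    return args


if __name__ == "__main__":
    # Print the command instead of running it; the agent count is arbitrary.
    print(" ".join(build_k3d_create_args("scale-test", agents=3)))
```

The resulting list could be handed to `subprocess.run(args, check=True)`. Keeping the flag behind a boolean is only a design choice for this sketch; the PR itself applies the argument unconditionally to the cluster it creates.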

Which issue(s) this PR fixes:

Related to #550

Special notes for your reviewer:

This is intentionally scoped to cluster creation. It does not change the scale test workload, node count, controller settings, or workflow behavior.

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

copy-pr-bot · 2026-05-19T11:01:47Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Use embedded etcd for scale test cluster

475fdda

shayasoolin requested review from Ronkahn21, danbar2, gflarity, sanjaychatterjee and unmarshall as code owners May 19, 2026 11:01

Ronkahn21 approved these changes May 19, 2026

danbar2 reviewed May 19, 2026

Comment thread operator/hack/infra_manager/cluster.py

danbar2 approved these changes May 19, 2026

shayasoolin merged commit bebf846 into ai-dynamo:main May 19, 2026
15 of 16 checks passed

shayasoolin merged 1 commit into ai-dynamo:main from shayasoolin:fix-scale-ci-embedded-etcd
