Skip to content

perf(cluster): parallelize node provisioning, join info, source/dest check#660

Merged
ArangoGutierrez merged 1 commit intoNVIDIA:mainfrom
ArangoGutierrez:perf/parallel-provisioning
Feb 13, 2026
Merged

perf(cluster): parallelize node provisioning, join info, source/dest check#660
ArangoGutierrez merged 1 commit intoNVIDIA:mainfrom
ArangoGutierrez:perf/parallel-provisioning

Conversation

@ArangoGutierrez
Copy link
Collaborator

Summary

  • Parallelize provisionBaseOnAllNodes with errgroup (N nodes concurrently instead of sequential)
  • Combine 3 SSH sessions into 1 script in extractJoinInfo
  • Parallelize disableSourceDestCheck API calls with errgroup

Audit Findings

Changes

  • pkg/provisioner/cluster.go: Parallel provisionBaseOnAllNodes + combined extractJoinInfo
  • pkg/provider/aws/cluster.go: Parallel disableSourceDestCheck

Test plan

  • gofmt — no formatting issues
  • go build — compiles
  • go test ./pkg/... — all tests pass

Copilot AI review requested due to automatic review settings February 13, 2026 07:46
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves Holodeck multinode cluster provisioning performance by parallelizing node-level operations and reducing SSH round trips during kubeadm join credential extraction.

Changes:

  • Parallelize base dependency provisioning and Kubernetes prerequisite installation across nodes using errgroup.
  • Collapse join token / CA hash / (optional) cert upload into a single SSH session/script in extractJoinInfo.
  • Parallelize AWS ModifyNetworkInterfaceAttribute calls to disable Source/Destination Check across instances.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
pkg/provisioner/cluster.go Parallelizes provisioning phases and consolidates join credential extraction into one SSH script/session.
pkg/provider/aws/cluster.go Runs source/dest check disabling concurrently for cluster instances.

` + fmt.Sprintf(`if [ "%t" = "true" ]; then
CERTKEY=$(sudo kubeadm init phase upload-certs --upload-certs 2>/dev/null | tail -1)
fi`, cp.isHAEnabled()) + `
printf '%%s\n%%s\n%%s' "$TOKEN" "$HASH" "$CERTKEY"
Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the join-credentials SSH script, the printf format string uses %%s sequences. In shell printf, %% escapes to a literal %, so this will print literal %s text and will not output the token/hash/certkey values, causing joins to fail. Use %s (single %) for the format specifiers so the variables are actually printed.

Suggested change
printf '%%s\n%%s\n%%s' "$TOKEN" "$HASH" "$CERTKEY"
printf '%s\n%s\n%s' "$TOKEN" "$HASH" "$CERTKEY"

Copilot uses AI. Check for mistakes.
}
node := node // capture loop variable
g2.Go(func() error {
return cp.installK8sPrereqs(node)
Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

provisionBaseOnAllNodes Phase 2 now returns cp.installK8sPrereqs(node) errors directly from an errgroup goroutine, so failures like "failed to start script" / "failed to run K8s prereq script" can lose the node name/IP context that was previously added by the caller. Wrap the returned error with the node identity (or include the node identity inside installK8sPrereqs for all error paths) so parallel failures remain debuggable.

Suggested change
return cp.installK8sPrereqs(node)
if err := cp.installK8sPrereqs(node); err != nil {
return fmt.Errorf("install K8s prerequisites on %s (%s) failed: %w", node.Name, node.PublicIP, err)
}
return nil

Copilot uses AI. Check for mistakes.
@ArangoGutierrez ArangoGutierrez force-pushed the perf/parallel-provisioning branch 2 times, most recently from 713cedc to b802204 Compare February 13, 2026 17:59
@copy-pr-bot
Copy link

copy-pr-bot bot commented Feb 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…check

provisionBaseOnAllNodes now provisions all nodes concurrently using
errgroup (was sequential — N*5min). extractJoinInfo combines 3 SSH
sessions into 1 script. disableSourceDestCheck runs API calls in
parallel instead of sequentially.

Audit findings NVIDIA#20 (MEDIUM), NVIDIA#21 (MEDIUM), NVIDIA#25 (MEDIUM).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez ArangoGutierrez force-pushed the perf/parallel-provisioning branch from b802204 to 9ae13ff Compare February 13, 2026 18:09
@coveralls
Copy link

Pull Request Test Coverage Report for Build 21997580498

Details

  • 0 of 65 (0.0%) changed or added relevant lines in 2 files are covered.
  • 3 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.02%) to 48.115%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/provider/aws/cluster.go 0 19 0.0%
pkg/provisioner/cluster.go 0 46 0.0%
Files with Coverage Reduction New Missed Lines %
pkg/provider/aws/cluster.go 1 0.6%
pkg/provisioner/cluster.go 2 13.8%
Totals Coverage Status
Change from base Build 21997070791: -0.02%
Covered Lines: 2591
Relevant Lines: 5385

💛 - Coveralls

@ArangoGutierrez ArangoGutierrez merged commit 6207ab0 into NVIDIA:main Feb 13, 2026
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants