perf(cluster): parallelize node provisioning, join info, source/dest check#660
Conversation
There was a problem hiding this comment.
Pull request overview
This PR improves Holodeck multinode cluster provisioning performance by parallelizing node-level operations and reducing SSH round trips during kubeadm join credential extraction.
Changes:
- Parallelize base dependency provisioning and Kubernetes prerequisite installation across nodes using
errgroup. - Collapse join token / CA hash / (optional) cert upload into a single SSH session/script in
extractJoinInfo. - Parallelize AWS
ModifyNetworkInterfaceAttributecalls to disable Source/Destination Check across instances.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
pkg/provisioner/cluster.go |
Parallelizes provisioning phases and consolidates join credential extraction into one SSH script/session. |
pkg/provider/aws/cluster.go |
Runs source/dest check disabling concurrently for cluster instances. |
| ` + fmt.Sprintf(`if [ "%t" = "true" ]; then | ||
| CERTKEY=$(sudo kubeadm init phase upload-certs --upload-certs 2>/dev/null | tail -1) | ||
| fi`, cp.isHAEnabled()) + ` | ||
| printf '%%s\n%%s\n%%s' "$TOKEN" "$HASH" "$CERTKEY" |
There was a problem hiding this comment.
In the join-credentials SSH script, the printf format string uses %%s sequences. In shell printf, %% escapes to a literal %, so this will print literal %s text and will not output the token/hash/certkey values, causing joins to fail. Use %s (single %) for the format specifiers so the variables are actually printed.
| printf '%%s\n%%s\n%%s' "$TOKEN" "$HASH" "$CERTKEY" | |
| printf '%s\n%s\n%s' "$TOKEN" "$HASH" "$CERTKEY" |
| } | ||
| node := node // capture loop variable | ||
| g2.Go(func() error { | ||
| return cp.installK8sPrereqs(node) |
There was a problem hiding this comment.
provisionBaseOnAllNodes Phase 2 now returns cp.installK8sPrereqs(node) errors directly from an errgroup goroutine, so failures like "failed to start script" / "failed to run K8s prereq script" can lose the node name/IP context that was previously added by the caller. Wrap the returned error with the node identity (or include the node identity inside installK8sPrereqs for all error paths) so parallel failures remain debuggable.
| return cp.installK8sPrereqs(node) | |
| if err := cp.installK8sPrereqs(node); err != nil { | |
| return fmt.Errorf("install K8s prerequisites on %s (%s) failed: %w", node.Name, node.PublicIP, err) | |
| } | |
| return nil |
713cedc to
b802204
Compare
…check provisionBaseOnAllNodes now provisions all nodes concurrently using errgroup (was sequential — N*5min). extractJoinInfo combines 3 SSH sessions into 1 script. disableSourceDestCheck runs API calls in parallel instead of sequentially. Audit findings NVIDIA#20 (MEDIUM), NVIDIA#21 (MEDIUM), NVIDIA#25 (MEDIUM). Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
b802204 to
9ae13ff
Compare
Pull Request Test Coverage Report for Build 21997580498Details
💛 - Coveralls |
Summary
provisionBaseOnAllNodeswith errgroup (N nodes concurrently instead of sequential)extractJoinInfodisableSourceDestCheckAPI calls with errgroupAudit Findings
Changes
pkg/provisioner/cluster.go: Parallel provisionBaseOnAllNodes + combined extractJoinInfopkg/provider/aws/cluster.go: Parallel disableSourceDestCheckTest plan
gofmt— no formatting issuesgo build— compilesgo test ./pkg/...— all tests pass