nvidia arm64 & GPU operator test #583

Merged
jepio merged 15 commits into flatcar-master from kola-nvidia-arm64-test on Mar 14, 2025

jepio commented Feb 27, 2025

  • Add SkipFunc implementation for skipping test on unsupported instance types
  • Add GPU operator test (includes nvidia-runtime sysext test)
  • Add Arm64 support to both tests
  • Add AWS support

jepio requested a review from Copilot February 27, 2025 18:41

Copilot AI left a comment

PR Overview

This pull request adds support for NVIDIA GPU testing by introducing a SkipFunc for unsupported instance types, adding a GPU operator test (including an NVIDIA runtime sysext test), and extending support to the ARM64 architecture and AWS platform.

  • Introduces skipOnNonGpu to conditionally skip tests on unsupported instances.
  • Adds a new test (cl.misc.nvidia.operator) with a complete GPU operator installation and validation workflow.
  • Updates existing NVIDIA installation test to incorporate ARM64 support via template configuration.
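
The skip logic described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the PR: the function name `skipOnNonGPU`, its two-argument signature, and the instance-name heuristics (AWS families starting with `g` or `p`, Azure N-series sizes) are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// skipOnNonGPU reports whether a GPU test should be skipped for the given
// platform and instance type. Hypothetical helper: the real SkipFunc in kola
// may take different arguments and use different rules.
func skipOnNonGPU(platform, instanceType string) bool {
	switch platform {
	case "aws":
		// Assumption: AWS GPU instance families start with "g" or "p".
		return !(strings.HasPrefix(instanceType, "g") || strings.HasPrefix(instanceType, "p"))
	case "azure":
		// Assumption: Azure GPU sizes are N-series, e.g. Standard_NC6s_v3.
		size := strings.TrimPrefix(instanceType, "Standard_")
		return !strings.HasPrefix(size, "N")
	default:
		// Any other platform is treated as having no GPU.
		return true
	}
}

func main() {
	cases := []struct{ platform, instanceType string }{
		{"aws", "g5g.xlarge"},
		{"aws", "t3.medium"},
		{"azure", "Standard_NC6s_v3"},
		{"azure", "Standard_D2s_v3"},
	}
	for _, c := range cases {
		fmt.Printf("%s %s skip=%v\n", c.platform, c.instanceType, skipOnNonGPU(c.platform, c.instanceType))
	}
}
```

Keeping the decision in a pure function like this makes it easy to check without provisioning any machines.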

Reviewed Changes

File: kola/tests/misc/nvidia.go
Description: Added new constants, skip logic, GPU operator test implementation, and expanded platform/architecture support

Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (2)

kola/tests/misc/nvidia.go:162

  • The multi-line helm installation command uses backticks, which preserve literal newlines. Verify that the shell execution handles these newlines as intended, or consider converting it to a single-line command.
_ = c.MustSSH(m, `curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
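
To make the concern concrete: a Go raw string literal keeps both the backslash and the newline, and a POSIX shell then treats an unquoted backslash-newline pair as a line continuation and drops it. The sketch below only models that one rule; `joinContinuations` is a hypothetical helper for illustration, it ignores shell quoting, and it is not part of kola.

```go
package main

import (
	"fmt"
	"strings"
)

// joinContinuations mimics how a POSIX shell handles an unquoted
// backslash followed by a newline: both characters are removed.
// Hypothetical helper; quoting rules are deliberately ignored.
func joinContinuations(cmd string) string {
	return strings.ReplaceAll(cmd, "\\\n", "")
}

func main() {
	// The raw string keeps the backslash and the newline verbatim.
	cmd := `echo one \
two`
	fmt.Printf("as written in Go: %q\n", cmd)
	fmt.Printf("as seen by shell: %q\n", joinContinuations(cmd))
}
```

Under this rule the multi-line form reaches the shell as a single logical command, provided every line break is preceded by a backslash.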

kola/tests/misc/nvidia.go:101

  • The SSH check in waitForNvidiaDriver only looks for the substring 'active (exited)', which may be too specific if the nvidia service enters other valid states. Consider broadening the check or adding comments to clarify the expected state.
out, err := c.SSH(*m, "systemctl status nvidia.service")
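
A polling check of this kind might look like the sketch below. It is an illustration under stated assumptions, not the PR's `waitForNvidiaDriver`: `statusFunc` stands in for running `systemctl status nvidia.service` over SSH, and `waitForActive`, the attempt count, and the delay are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// statusFunc is a hypothetical stand-in for running
// `systemctl status nvidia.service` on the test machine.
type statusFunc func() (string, error)

// waitForActive polls status until its output contains "active (exited)"
// or the attempts run out. The command error is only remembered for the
// final message, since the status command may fail while the unit is
// still starting.
func waitForActive(status statusFunc, attempts int, delay time.Duration) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		out, err := status()
		lastErr = err
		if strings.Contains(out, "active (exited)") {
			return nil
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("unit not active after %d attempts (last error: %v)", attempts, lastErr)
}

func main() {
	// Fake status source: the unit needs two polls before it is ready.
	states := []string{"activating (start)", "activating (start)", "active (exited)"}
	calls := 0
	fake := func() (string, error) {
		s := states[calls]
		calls++
		return s, nil
	}
	err := waitForActive(fake, 5, time.Millisecond)
	fmt.Println("error:", err, "polls:", calls)
}
```

Broadening the check, as the comment suggests, would amount to matching a set of accepted states instead of the single substring.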

jepio force-pushed the kola-nvidia-arm64-test branch 2 times, most recently from 9e7301d to dbb49cb Compare March 5, 2025 18:20
jepio marked this pull request as ready for review March 5, 2025 18:22
jepio requested a review from a team March 5, 2025 18:22
jepio force-pushed the kola-nvidia-arm64-test branch from dbb49cb to 119cd04 Compare March 6, 2025 19:05
jepio added 14 commits March 7, 2025 12:48
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
This relies on the nvidia-runtime sysext from the bakery.

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
So that it doesn't look like a subtest which messes with the retry logic in
scripts.

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
Instead of a particular output, which only matches a single GPU type.

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
The driver version for arm64 has been changed in Flatcar, so we can rely on the
default now.

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
jepio force-pushed the kola-nvidia-arm64-test branch from 119cd04 to 2480322 Compare March 7, 2025 12:37

krnowak left a comment

I think that the PR is fine as it is. I have some ideas below about moving the version numbers to constants to make it easier to bump them when the need arises. This could very well be done in a follow-up PR, which could probably also add some automation. Up to you.
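
The constants idea could be sketched like this. Everything named here is hypothetical: `gpuOperatorVersion` holds a placeholder value rather than the version used in the test, and `gpuOperatorInstallCmd` is an illustrative helper. The helm flags follow the commonly documented GPU operator install command, and whether the test uses exactly this command is an assumption.

```go
package main

import "fmt"

// Hypothetical placeholder: the real test pins its own version.
const gpuOperatorVersion = "v0.0.0-example"

// gpuOperatorInstallCmd builds the helm command from a version string so
// that a bump only touches the constant above.
func gpuOperatorInstallCmd(version string) string {
	return fmt.Sprintf(
		"helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --version=%s",
		version,
	)
}

func main() {
	fmt.Println(gpuOperatorInstallCmd(gpuOperatorVersion))
}
```

With the version isolated in one constant, any later automation only needs to rewrite a single line.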

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
jepio merged commit b246a42 into flatcar-master Mar 14, 2025
2 checks passed
jepio deleted the kola-nvidia-arm64-test branch March 14, 2025 13:01