Adding GKE sample for running nvidia-bug-report #4741

Merged: bytetwin merged 1 commit into GoogleCloudPlatform:develop from raushan2016:develop on Oct 13, 2025

@raushan2016 (Member) commented Oct 9, 2025

Changes

  • Adding an example of how to run nvidia-bug-report as a pod on GKE nodes that use the COS image.
  • Renaming the example from GCE-specific to GCP-level so that both the GCE and GKE examples are available for the COS image.

Why

  • Brings GKE nodes using the COS image to parity with the existing example of running nvidia-bug-report on a GCE VM.
  • Running nvidia-bug-report can be challenging because it requires some kernel modules as well as SSH access to the GKE node. This approach is Kubernetes-native: it runs the tool as a Kubernetes Pod, using the same image that was previously used to run it as a Docker container on GCE nodes with the COS image.
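
As a rough illustration of the pod-based approach described above, the sketch below builds a Pod manifest in Python and prints it as JSON. It is not the manifest from this PR: the function name `build_bug_report_pod`, the placeholder image and node name, the privileged security context, and pinning via `nodeName` are all assumptions made for illustration. Only the idea of running the tool as a pod that requests GPUs comes from the PR.

```python
import json


def build_bug_report_pod(image: str, node_name: str, gpu_count: int = 1) -> dict:
    """Return a Pod manifest (as a dict) pinned to a single node.

    Hypothetical helper: the image, node name, and privileged setting are
    illustrative assumptions, not values taken from this PR.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "nvidia-bug-report"},
        "spec": {
            # Schedule directly onto the node being debugged (assumption).
            "nodeName": node_name,
            # Run once and exit instead of restarting after completion.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "nvidia-bug-report",
                    "image": image,
                    # Assumed: host-level access is needed for kernel modules.
                    "securityContext": {"privileged": True},
                    # Request GPUs through the standard extended resource name.
                    "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = build_bug_report_pod("IMAGE_PLACEHOLDER", "NODE_NAME_PLACEHOLDER")
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`, after replacing the placeholders with real values.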

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@SwarnaBharathiMantena (Contributor) commented

I do not see an edit on cluster-toolkit-writers.json. Do you want to push another commit?

@bytetwin (Collaborator) commented Oct 9, 2025

> I do not see an edit on cluster-toolkit-writers.json. Do you want to push another commit?

I don't see a need for you to be added to cluster-toolkit-writers.json. It's mainly for core toolkit team members. Since all of us work on a fork model, it's not needed.

@SwarnaBharathiMantena added the release-improvements label Oct 9, 2025

@SwarnaBharathiMantena (Contributor) commented

> > I do not see an edit on cluster-toolkit-writers.json. Do you want to push another commit?

> I don't see a need for you to be added to cluster-toolkit-writers.json. It's mainly for core toolkit team members. Since all of us work on a fork model, it's not needed.

I removed the "- Added myself to the cluster-toolkit-writers.json to get access to add labels in future." detail from the Changes comment above.

Comment thread on community/cos-nvidia-bug-report/bug-report-pod.yaml (outdated)

@bytetwin (Collaborator) commented

/gcbrun

@bytetwin (Collaborator) left a comment

LGTM. Can you squash the commits into one before merging?

Adding GKE sample for running nvidia-bug-report

Fix the number of GPUs in the pod spec

@raushan2016 (Member, Author) commented

> LGTM. Can you squash the commits into one before merging?

Done

@bytetwin enabled auto-merge October 11, 2025 06:20

@bytetwin (Collaborator) commented

@SwarnaBharathiMantena - We need another approval since it's an external contribution.

@SwarnaBharathiMantena (Contributor) left a comment

LGTM

@bytetwin merged commit f4d58a2 into GoogleCloudPlatform:develop Oct 13, 2025
11 of 65 checks passed

Labels

release-improvements: Added to release notes under the "Improvements" heading.

3 participants