Ignore errors getting optional device attributes by elezar · Pull Request #1356 · NVIDIA/k8s-device-plugin

elezar · 2025-08-12T08:41:45Z

On certain systems (e.g. iGPU-based systems), the NVML nvmlDeviceGetMemoryInfo API
is not supported and returns an error. In these cases we ignore
these errors and log a warning instead. This means that:

For the GPU Device Plugin, no memory limits will be enforced for MPS partitioning.
For GFD, no nvidia.com/gpu.memory label will be generated.
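
The behaviour described above can be sketched roughly as follows. This is only an illustration of treating a device attribute as optional, not the code in this PR: the `memoryReader` interface, `fakeDevice`, and `optionalTotalMemory` are hypothetical names, and the standard `log` package stands in for klog.

```go
package main

import (
	"errors"
	"log"
)

// errNotSupported simulates the error seen on platforms where the memory query is unavailable.
var errNotSupported = errors.New("not supported")

// memoryReader is a hypothetical stand-in for the plugin's device type;
// only the one method this sketch needs is modeled.
type memoryReader interface {
	GetTotalMemory() (uint64, error)
}

// fakeDevice simulates a device. When err is set, GetTotalMemory fails with it.
type fakeDevice struct {
	total uint64
	err   error
}

func (d fakeDevice) GetTotalMemory() (uint64, error) {
	if d.err != nil {
		return 0, d.err
	}
	return d.total, nil
}

// optionalTotalMemory returns the total memory and true on success.
// On failure it logs a warning and returns (0, false) instead of propagating the error.
func optionalTotalMemory(d memoryReader) (uint64, bool) {
	total, err := d.GetTotalMemory()
	if err != nil {
		log.Printf("WARNING: ignoring error getting device memory: %v", err)
		return 0, false
	}
	return total, true
}

func main() {
	devices := []memoryReader{
		fakeDevice{total: 16 << 30},
		fakeDevice{err: errNotSupported},
	}
	for i, d := range devices {
		if total, ok := optionalTotalMemory(d); ok {
			log.Printf("device %d: %d bytes of memory", i, total)
		} else {
			log.Printf("device %d: memory unknown, skipping memory-dependent features", i)
		}
	}
}
```

Returning an explicit `ok` flag rather than a bare zero lets callers distinguish "memory unknown" from "zero bytes" when deciding whether to apply a limit or emit a label.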

ArangoGutierrez

lgtm

ArangoGutierrez · 2025-08-13T13:50:42Z

internal/resource/nvml-device.go

+		if err != nil {
+			return 0, err
+		}
+		return *memInfo.MemTotal / (1024 * 1024), nil


Could we add a comment detailing why 1024 * 1024?

The magic numbers 1024 * 1024 should be extracted to a named constant like bytesToMebibytes

I actually like this review comment. Moving that to a constant will make it self-explanatory
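
A minimal sketch of what that extraction might look like. The names `bytesPerMiB` and `bytesToMiB` are illustrative assumptions and are not taken from the repository.

```go
package main

import "fmt"

// bytesPerMiB is the number of bytes in one mebibyte (2^20 bytes).
const bytesPerMiB uint64 = 1024 * 1024

// bytesToMiB converts a byte count to whole mebibytes.
// Integer division truncates any partial mebibyte.
func bytesToMiB(b uint64) uint64 {
	return b / bytesPerMiB
}

func main() {
	// 8 GiB expressed in bytes.
	fmt.Println(bytesToMiB(8 * 1024 * 1024 * 1024)) // prints 8192
}
```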

cmd/mps-control-daemon/mps/device.go

tariq1890 · 2025-08-13T16:28:19Z

internal/resource/nvml-device.go

 // GetTotalMemoryMiB returns the total memory on a device in mebibytes (2^20 bytes)
 func (d nvmlDevice) GetTotalMemoryMiB() (uint64, error) {
 	info, ret := d.GetMemoryInfo()
+	if ret == nvml.ERROR_NOT_SUPPORTED {


Can we factor this out into a new method?

Yes, I will. I have removed this from this changeset, though, and will create a follow-up to re-add it properly.

cdesiniotis · 2025-08-13T20:56:27Z

internal/rm/devices.go

 	computeCapability, err := d.GetComputeCapability()
 	if err != nil {
-		return nil, fmt.Errorf("error getting device compute capability: %w", err)
+		klog.Warningf("Ignoring error getting device compute capability: %w", err)


While using %w is syntactically valid, I would argue wrapping errors is no longer required / necessary as we are not returning the error.

cmd/mps-control-daemon/mps/device.go

cdesiniotis · 2025-08-13T21:13:08Z

internal/resource/nvml-device.go

 // GetTotalMemoryMiB returns the total memory on a device in mebibytes (2^20 bytes)
 func (d nvmlDevice) GetTotalMemoryMiB() (uint64, error) {
 	info, ret := d.GetMemoryInfo()
+	if ret == nvml.ERROR_NOT_SUPPORTED {


Question -- is there an additional check we can perform before falling back to reading /proc/meminfo? Specifically, is there any way to identify we are on this type of SOC system where the total system memory is representative of the GPU memory?

I asked whether there is a specific API and got the following response:

Maybe we could use the return code from nvmlDeviceGetMemoryInfo to determine if we need to query /proc/meminfo instead since that would return NOT SUPPORTED for iGPU platforms?

I'm happy to pull this change out of this PR and handle that as a follow up. I think correct labels and MPS memory partitioning is a lower priority than not failing.

I have updated this PR to only skip errors and not update the mechanism used to extract memory information.
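
For reference, the /proc/meminfo fallback discussed in this thread (and deferred to a follow-up) could look roughly like the sketch below. It is not part of the merged change. The `memTotalMiB` helper and the sample content are assumptions made for illustration, and the sketch treats the `kB` unit in the `MemTotal` line as 1024-byte units.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// sampleMeminfo is made-up content in the layout of /proc/meminfo.
const sampleMeminfo = "MemTotal:       16384000 kB\nMemFree:         1234567 kB\n"

// memTotalMiB extracts the MemTotal value from meminfo-style text
// and converts it from kB to MiB.
func memTotalMiB(meminfo string) (uint64, error) {
	for _, line := range strings.Split(meminfo, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 || fields[0] != "MemTotal:" {
			continue
		}
		kib, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return 0, fmt.Errorf("parsing MemTotal: %w", err)
		}
		return kib / 1024, nil
	}
	return 0, errors.New("MemTotal not found")
}

func main() {
	mib, err := memTotalMiB(sampleMeminfo)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(mib) // prints 16000
}
```

In a real fallback the text would come from reading /proc/meminfo, and, as raised above, only after confirming that system memory is representative of GPU memory on the platform.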

Copilot

Pull Request Overview

This PR modifies error handling for optional GPU device attributes by changing fatal errors to warnings when retrieving device memory information fails. This addresses compatibility issues on systems like iGPU-based systems where the NVML memory information API is not supported.

Convert fatal errors to warnings when getting device memory information
Add logging dependency to handle warnings appropriately
Allow graceful degradation instead of complete failure

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
internal/rm/devices.go	Changes fatal error to warning when device memory retrieval fails during device building
internal/lm/resource.go	Changes fatal error to warning when device memory retrieval fails during GPU resource labeling

Copilot · 2025-08-15T08:25:24Z

internal/rm/devices.go

 	totalMemory, err := d.GetTotalMemory()
 	if err != nil {
-		return nil, fmt.Errorf("error getting device memory: %w", err)
+		klog.Warningf("Ignoring error getting device memory: %w", err)


The %w verb is used with klog.Warningf, but klog formatting doesn't support error wrapping. Use %v instead to properly format the error message.

Suggested change

klog.Warningf("Ignoring error getting device memory: %w", err)

klog.Warningf("Ignoring error getting device memory: %v", err)
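
To illustrate the difference, here is a small sketch using only the standard library. It assumes, as the comment above suggests, that a Printf-style logger formats its arguments with the `fmt` package; `fmt.Sprintf` stands in for the logger, and `render`, `wrap`, and `wrapsNotSupported` are hypothetical helpers.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotSupported = errors.New("not supported")

// render formats err with the given verb the way a Printf-style logger would.
// The format string is built at run time so that both verbs can be compared.
func render(verb string, err error) string {
	return fmt.Sprintf("Ignoring error getting device memory: "+verb, err)
}

// wrap shows the one place where %w is meaningful: fmt.Errorf.
func wrap(err error) error {
	return fmt.Errorf("error getting device memory: %w", err)
}

// wrapsNotSupported reports whether err wraps errNotSupported.
func wrapsNotSupported(err error) bool {
	return errors.Is(err, errNotSupported)
}

func main() {
	// %w is not a formatting verb outside fmt.Errorf, so fmt emits a
	// "%!w(...)" bad-verb marker instead of the error text.
	fmt.Println(render("%w", errNotSupported))

	// %v prints the error message as intended.
	fmt.Println(render("%v", errNotSupported))

	// With fmt.Errorf, %w wraps the error so errors.Is can find it.
	fmt.Println(wrapsNotSupported(wrap(errNotSupported))) // prints true
}
```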

internal/lm/resource.go

On certain systems, the NVML nvmlDeviceGetMemoryInfo API is not supported and returns an error. In these cases we ignore these errors and log a warning instead. This means that:

* For the GPU Device Plugin, no memory limits will be enforced for MPS partitioning.
* For GFD, no nvidia.com/gpu.memory label will be generated.

Signed-off-by: Evan Lezar <elezar@nvidia.com>

elezar requested review from ArangoGutierrez, cdesiniotis and tariq1890 August 12, 2025 08:41

elezar added this to the v0.18.0 milestone Aug 12, 2025

elezar self-assigned this Aug 12, 2025

ArangoGutierrez requested a review from Copilot August 12, 2025 09:17

ArangoGutierrez approved these changes Aug 12, 2025

elezar force-pushed the update-for-spark branch 2 times, most recently from 8c3540c to 2ad2ad7 Compare August 13, 2025 10:46

elezar added the must-backport label Aug 13, 2025

ArangoGutierrez requested a review from Copilot August 13, 2025 13:44

ArangoGutierrez reviewed Aug 13, 2025

tariq1890 reviewed Aug 13, 2025

cdesiniotis reviewed Aug 13, 2025

ArangoGutierrez self-requested a review August 14, 2025 08:55

elezar force-pushed the update-for-spark branch from 2ad2ad7 to d8f217d Compare August 15, 2025 08:08

elezar modified the milestones: v0.18.0, v0.17.x Aug 15, 2025

elezar requested review from cdesiniotis, Copilot and tariq1890 August 15, 2025 08:24

Copilot AI reviewed Aug 15, 2025

elezar force-pushed the update-for-spark branch from d8f217d to e3323ce Compare August 15, 2025 08:26

cdesiniotis approved these changes Aug 18, 2025

elezar merged commit c494b79 into NVIDIA:main Aug 19, 2025
15 of 16 checks passed

elezar mentioned this pull request Aug 19, 2025

Ignore errors getting device memory using NVML #1374

Merged
