Skip to content

Ignore errors getting optional device attributes#1356

Merged
elezar merged 1 commit intoNVIDIA:mainfrom
elezar:update-for-spark
Aug 19, 2025
Merged

Ignore errors getting optional device attributes#1356
elezar merged 1 commit intoNVIDIA:mainfrom
elezar:update-for-spark

Conversation

@elezar
Copy link
Member

@elezar elezar commented Aug 12, 2025

On certain systems (e.g. iGPU-based systems), the NVML nvmlDeviceGetMemoryInformation API
is not supported and returns an error. In these cases we ignore
these errors and log a warning instead. This means that:

  • For the GPU Device Plugin, memory limits will be enforced for MPS partioning.
  • For GFD, no nvidia.com/gpu.memory label will be generated.

@elezar elezar added this to the v0.18.0 milestone Aug 12, 2025
@elezar elezar self-assigned this Aug 12, 2025

This comment was marked as outdated.

Copy link
Collaborator

@ArangoGutierrez ArangoGutierrez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@elezar elezar force-pushed the update-for-spark branch 2 times, most recently from 8c3540c to 2ad2ad7 Compare August 13, 2025 10:46

This comment was marked as outdated.

if err != nil {
return 0, err
}
return *memInfo.MemTotal / (1024 * 1024), nil
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a comment detailing why 1042 * 1024

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The magic numbers 1024 * 1024 should be extracted to a named constant like bytesToMebibytes

I actually like this review comment. Moving that to a constant will make it self-explanatory

// GetTotalMemoryMiB returns the total memory on a device in mebibytes (2^20 bytes)
func (d nvmlDevice) GetTotalMemoryMiB() (uint64, error) {
info, ret := d.GetMemoryInfo()
if ret == nvml.ERROR_NOT_SUPPORTED {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we factor this out into a new method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will. I have removed this from this changeset though and will created a follow up to readd this properly.

computeCapability, err := d.GetComputeCapability()
if err != nil {
return nil, fmt.Errorf("error getting device compute capability: %w", err)
klog.Warningf("Ignoring error getting device compute capability: %w", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While using %w is syntactically valid, I would argue wrapping errors is no longer required / necessary as we are not returning the error.

// GetTotalMemoryMiB returns the total memory on a device in mebibytes (2^20 bytes)
func (d nvmlDevice) GetTotalMemoryMiB() (uint64, error) {
info, ret := d.GetMemoryInfo()
if ret == nvml.ERROR_NOT_SUPPORTED {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question -- is there an additional check we can perform before falling back to reading /proc/meminfo? Specifically, is there any way to identify we are on this type of SOC system where the total system memory is representative of the GPU memory?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked whether there is a specific API and got the following response:

Maybe we could use the return code from nvmlDeviceGetMemoryInfo to determine if we need to query /proc/meminfo instead since that would return NOT SUPPORTED for iGPU platforms?

I'm happy to pull this change out of this PR and handle that as a follow up. I think correct labels and MPS memory partitioning is a lower priority than not failing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated this PR to only skip errors and not update the mechansim used to extract memory information.

@ArangoGutierrez ArangoGutierrez self-requested a review August 14, 2025 08:55
@elezar elezar modified the milestones: v0.18.0, v0.17.x Aug 15, 2025
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR modifies error handling for optional GPU device attributes by changing fatal errors to warnings when retrieving device memory information fails. This addresses compatibility issues on systems like iGPU-based systems where the NVML memory information API is not supported.

  • Convert fatal errors to warnings when getting device memory information
  • Add logging dependency to handle warnings appropriately
  • Allow graceful degradation instead of complete failure

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
internal/rm/devices.go Changes fatal error to warning when device memory retrieval fails during device building
internal/lm/resource.go Changes fatal error to warning when device memory retrieval fails during GPU resource labeling

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.

totalMemory, err := d.GetTotalMemory()
if err != nil {
return nil, fmt.Errorf("error getting device memory: %w", err)
klog.Warningf("Ignoring error getting device memory: %w", err)
Copy link

Copilot AI Aug 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The %w verb is used with klog.Warningf, but klog formatting doesn't support error wrapping. Use %v instead to properly format the error message.

Suggested change
klog.Warningf("Ignoring error getting device memory: %w", err)
klog.Warningf("Ignoring error getting device memory: %v", err)

Copilot uses AI. Check for mistakes.
On certain systems, the NVML nvmlDeviceGetMemoryInformation API
is not supported and returns an error. In these cases we ignore
these errors and log a warning instead. This means that:

* For the GPU Device Plugin, memory limits will be enforced for
  MPS partioning.
* For GFD, no nvidia.com/gpu.memory label will be generated.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar merged commit c494b79 into NVIDIA:main Aug 19, 2025
15 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants