Ignore errors getting optional device attributes#1356
Conversation
8c3540c to
2ad2ad7
Compare
internal/resource/nvml-device.go
Outdated
| if err != nil { | ||
| return 0, err | ||
| } | ||
| return *memInfo.MemTotal / (1024 * 1024), nil |
There was a problem hiding this comment.
Could we add a comment detailing why 1042 * 1024
There was a problem hiding this comment.
The magic numbers 1024 * 1024 should be extracted to a named constant like bytesToMebibytes
I actually like this review comment. Moving that to a constant will make it self-explanatory
internal/resource/nvml-device.go
Outdated
| // GetTotalMemoryMiB returns the total memory on a device in mebibytes (2^20 bytes) | ||
| func (d nvmlDevice) GetTotalMemoryMiB() (uint64, error) { | ||
| info, ret := d.GetMemoryInfo() | ||
| if ret == nvml.ERROR_NOT_SUPPORTED { |
There was a problem hiding this comment.
Can we factor this out into a new method?
There was a problem hiding this comment.
Yes, I will. I have removed this from this changeset though and will created a follow up to readd this properly.
internal/rm/devices.go
Outdated
| computeCapability, err := d.GetComputeCapability() | ||
| if err != nil { | ||
| return nil, fmt.Errorf("error getting device compute capability: %w", err) | ||
| klog.Warningf("Ignoring error getting device compute capability: %w", err) |
There was a problem hiding this comment.
While using %w is syntactically valid, I would argue wrapping errors is no longer required / necessary as we are not returning the error.
internal/resource/nvml-device.go
Outdated
| // GetTotalMemoryMiB returns the total memory on a device in mebibytes (2^20 bytes) | ||
| func (d nvmlDevice) GetTotalMemoryMiB() (uint64, error) { | ||
| info, ret := d.GetMemoryInfo() | ||
| if ret == nvml.ERROR_NOT_SUPPORTED { |
There was a problem hiding this comment.
Question -- is there an additional check we can perform before falling back to reading /proc/meminfo? Specifically, is there any way to identify we are on this type of SOC system where the total system memory is representative of the GPU memory?
There was a problem hiding this comment.
I asked whether there is a specific API and got the following response:
Maybe we could use the return code from nvmlDeviceGetMemoryInfo to determine if we need to query /proc/meminfo instead since that would return NOT SUPPORTED for iGPU platforms?
I'm happy to pull this change out of this PR and handle that as a follow up. I think correct labels and MPS memory partitioning is a lower priority than not failing.
There was a problem hiding this comment.
I have updated this PR to only skip errors and not update the mechansim used to extract memory information.
2ad2ad7 to
d8f217d
Compare
There was a problem hiding this comment.
Pull Request Overview
This PR modifies error handling for optional GPU device attributes by changing fatal errors to warnings when retrieving device memory information fails. This addresses compatibility issues on systems like iGPU-based systems where the NVML memory information API is not supported.
- Convert fatal errors to warnings when getting device memory information
- Add logging dependency to handle warnings appropriately
- Allow graceful degradation instead of complete failure
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| internal/rm/devices.go | Changes fatal error to warning when device memory retrieval fails during device building |
| internal/lm/resource.go | Changes fatal error to warning when device memory retrieval fails during GPU resource labeling |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.
internal/rm/devices.go
Outdated
| totalMemory, err := d.GetTotalMemory() | ||
| if err != nil { | ||
| return nil, fmt.Errorf("error getting device memory: %w", err) | ||
| klog.Warningf("Ignoring error getting device memory: %w", err) |
There was a problem hiding this comment.
The %w verb is used with klog.Warningf, but klog formatting doesn't support error wrapping. Use %v instead to properly format the error message.
| klog.Warningf("Ignoring error getting device memory: %w", err) | |
| klog.Warningf("Ignoring error getting device memory: %v", err) |
On certain systems, the NVML nvmlDeviceGetMemoryInformation API is not supported and returns an error. In these cases we ignore these errors and log a warning instead. This means that: * For the GPU Device Plugin, memory limits will be enforced for MPS partioning. * For GFD, no nvidia.com/gpu.memory label will be generated. Signed-off-by: Evan Lezar <elezar@nvidia.com>
d8f217d to
e3323ce
Compare
On certain systems (e.g. iGPU-based systems), the NVML nvmlDeviceGetMemoryInformation API
is not supported and returns an error. In these cases we ignore
these errors and log a warning instead. This means that: