Skip to content

support for new mig profiles#64

Merged
cdesiniotis merged 2 commits intoNVIDIA:mainfrom
varunrsekar:new-mig-profiles
Jul 31, 2025
Merged

support for new mig profiles#64
cdesiniotis merged 2 commits intoNVIDIA:mainfrom
varunrsekar:new-mig-profiles

Conversation

@varunrsekar
Copy link
Contributor

At the bare minimum, this change is needed to run device.NewMigProfile for the new MIG profiles (9-16)

@varunrsekar varunrsekar marked this pull request as ready for review July 10, 2025 06:11
@varunrsekar
Copy link
Contributor Author

@klueska / @elezar Can you please review this?

Copy link
Contributor

@cdesiniotis cdesiniotis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @varunrsekar can we add unit tests to confirm the new profiles get parsed correctly?

Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com>
@varunrsekar
Copy link
Contributor Author

varunrsekar commented Jul 29, 2025

LGTM. @varunrsekar can we add unit tests to confirm the new profiles get parsed correctly?

@cdesiniotis Thanks for the review. Earlier, I had missed adding the *_NO_ME MIG profiles. Incorporating it has made the code very ugly. Can you please re-review?

Comment on lines +240 to +241
sort.Strings(attrs1)
sort.Strings(attrs2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that this is different from the previous implementation, but this sorts the attributes in place meaning that if a device has multiple attributes the profile name may change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should revisit what it means to have multiple attributes. Currently no MIG profiles support it. So hard to imagine what the expectation is.

// Split by +/- to separate out attributes.
split := strings.SplitN(profile, "+", 2)
negsplit := strings.SplitN(profile, "-", 2)
// Make sure we don't get both positive and negative attributes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a requirement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@elezar I've mostly tried to fit into the existing implementation.

But looking at the existing profiles, we don't have multi-attribute profiles yet.

+me profiles: Include at least one media engine (NVDEC, NVENC, NVJPG, or OFA).

+gfx: Adds support for graphics APIs (new in GB20X).

+me.all: Allocates all available media engines to this instance (does not include graphics support).

-me: Excludes all media engines for pure compute workloads.

I would've preferred the attributes to me named +me, +me.all, -me etc instead so we don't have this differentiation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But looking at the existing profiles, we don't have multi-attribute profiles yet.

When we do have multiple attributes, would we expect the + (or -) to be included in the name? Does that mean that we would have +me,+gfx or +me,gfx, for example?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't have to decide this. We should just adopt whatever the output of nvidia-smi shows for these profile names.

cdesiniotis
cdesiniotis previously approved these changes Jul 30, 2025
Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com>
@tariq1890
Copy link
Contributor

@varunrsekar Thanks for this PR. I have a minor comment. In the fmt.Errorf/fmt.Printf statements, can we use the exact formatting characters of the variables we want to print?

%w for errors
%q for quoted strings
%s for other strings
%v for structs

return nil, fmt.Errorf("attribute begins with a number")
}
if a[0] == '.' || a[len(a)-1] == '.' {
return nil, fmt.Errorf("attribute begins/ends with a dot")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, fmt.Errorf("attribute begins/ends with a dot")
return nil, fmt.Errorf("attribute %s begins/ends with a period/dot", a)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tariq1890 The new errors are following existing pattern. The fmt.Errorf/fmt.Printf fixes you requested can be addressed in a follow-up and keep the focus on the new profiles in this PR.

Copy link
Contributor

@tariq1890 tariq1890 Jul 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I won't block the PR on this comment. But generally, I prefer making improvments on the lines we touch regardless of the conventions followed in the existing code. This practice helps make incremental improvements possible

@cdesiniotis cdesiniotis merged commit 0309bc0 into NVIDIA:main Jul 31, 2025
4 checks passed
@varunrsekar varunrsekar deleted the new-mig-profiles branch July 31, 2025 23:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants