Allow Orin CUDA forward compat root to be specified#1614
Conversation
b102e2a to
728cbe5
Compare
Pull Request Test Coverage Report for Build 22056881579Details
💛 - Coveralls |
728cbe5 to
9cf3975
Compare
a2807ae to
b9061b5
Compare
cdesiniotis
left a comment
There was a problem hiding this comment.
Left a couple of questions. Changes look reasonable to me otherwise.
| l.csv.Files = csv.DefaultFileList() | ||
| } | ||
| if l.csv.CompatContainerRoot == "" { | ||
| l.csv.CompatContainerRoot = defaultOrinCompatContainerRoot |
There was a problem hiding this comment.
Question -- will the compat folder path change on future architectures, like Thor? How do you envision us handling that?
There was a problem hiding this comment.
With Thor systems (and beyond) the same compat libraries that are used for dGPU-based systems are used. Thus the only edge case we have to deal with is Orin system that are currently under support and require different compat packages.
|
|
||
| return slices.Contains(h.Driver, driverMajor) | ||
| if hostDriverMajor != 0 && len(h.Driver) > 0 { | ||
| return slices.Contains(h.Driver, hostDriverMajor) |
There was a problem hiding this comment.
Question -- I thought we only use the compat libs when the major version is greater than the host driver's major version. Am I missing something here? As currently implemented it appears as if UseCompat() returns true if the major versions are equal.
There was a problem hiding this comment.
That was the initial basic check, yes. This change extends the basic heuristics to actually check the versions specified in the ELF header of the libcuda.so included in the compat folder. The header includes a list of driver major versions that the library can provide forward compatibility for. That is what this check tries to implement.
In the case of Orin-based systems there is no equivalent for the host driver version, and the CUDA version is compared.
There was a problem hiding this comment.
Does this mean that our enable-cuda-compat hook now supports both CUDA forward compatibility and CUDA minor version (aka enhanced) compatibility?
There was a problem hiding this comment.
I am assuming this does allow us to support CUDA enhanced compatibility. Below is an example exemplifying this:
Host driver: 590.44.01 / CUDA 13.1.0
CUDA compat in container: 590.48.01 / CUDA 13.1.1
Assuming that 590 is present in the list of driver major versions in the ELF header of the libcuda.so in the container, the compat libraries would be used and this exercises CUDA enhanced compatibility (right?).
Taking another example, what is the expected / current behavior for the below?
Host driver: 590.48.01 (CUDA version: 13.1.1)
CUDA compat version in container: 590.44.01 / CUDA 13.1.0
Should the host driver libraries or the compat libraries be used? Does it even matter in this example as both the CUDA major and minor versions are the same? IIUC our current enable-cuda-compat hook would decide that the compat libraries should be used (for the same reasons as the first example).
There was a problem hiding this comment.
I have found a t least one example where the logic as implemented here does not work as expected (see #1697). For that reason (and as discussed offline), I think it best to revert this until we have a better defintion of the expectations.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows the container compat root for nvgpu (e.g. Orin) systems to be specified either as the nvidia-container-runtime.modes.csv.compat-container-root option in the config.toml file, or with the --csv.compat-container-root (NVIDIA_CTK_CDI_GENERATE_CSV_COMPAT_CONTAINER_ROOT) option when generating CDI specifications. A WithCSVCompatContainerRoot option is also exposed in the nvcdi API. Note that this option is only relevant when nvgpu devices are detected. Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a --host-cuda-version to enable CUDA forward compat checks on Orin (nvgpu)-based systems where no meaningful host driver version is available. Signed-off-by: Evan Lezar <elezar@nvidia.com>
b9061b5 to
8ccda8a
Compare
This change allows the CUDA forward compat root used for Orin-based systems to be specified as a config option or as a flag to the
nvidia-ctk cdi generatecommand.