Allow Orin CUDA forward compat root to be specified #1614

Merged
elezar merged 4 commits into NVIDIA:main from elezar:configure-nvgpu-compat
Feb 18, 2026

Member

@elezar elezar commented Jan 27, 2026

This change allows the CUDA forward compat root used for Orin-based systems to be specified as a config option or as a flag to the nvidia-ctk cdi generate command.

@elezar elezar force-pushed the configure-nvgpu-compat branch 3 times, most recently from b102e2a to 728cbe5 Compare January 27, 2026 13:55

coveralls commented Jan 27, 2026

Pull Request Test Coverage Report for Build 22056881579

Details

  • 39 of 105 (37.14%) changed or added relevant lines in 10 files are covered.
  • 6 unchanged lines in 5 files lost coverage.
  • Overall coverage decreased (-0.01%) to 39.464%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| internal/modifier/csv.go | 0 | 1 | 0.0% |
| internal/modifier/gated.go | 0 | 1 | 0.0% |
| pkg/nvcdi/driver-nvml.go | 0 | 1 | 0.0% |
| pkg/nvcdi/lib.go | 0 | 1 | 0.0% |
| pkg/nvcdi/options.go | 0 | 6 | 0.0% |
| cmd/nvidia-cdi-hook/cudacompat/cudacompat.go | 6 | 14 | 42.86% |
| cmd/nvidia-cdi-hook/cudacompat/cuda-elf-header.go | 0 | 12 | 0.0% |
| internal/discover/compat_libs.go | 0 | 15 | 0.0% |
| pkg/nvcdi/lib-csv.go | 32 | 53 | 60.38% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| cmd/nvidia-cdi-hook/cudacompat/cudacompat.go | 1 | 44.6% |
| cmd/nvidia-cdi-hook/cudacompat/cuda-elf-header.go | 1 | 55.71% |
| internal/discover/compat_libs.go | 1 | 0.0% |
| pkg/nvcdi/options.go | 1 | 11.65% |
| pkg/nvcdi/lib-csv.go | 2 | 71.93% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 22056798804 | -0.01% |
| Covered Lines | 5768 |
| Relevant Lines | 14616 |

💛 - Coveralls

@elezar elezar added this to the v1.19.0 milestone Jan 27, 2026
@elezar elezar added the tegra label Jan 27, 2026
@elezar elezar force-pushed the configure-nvgpu-compat branch from 728cbe5 to 9cf3975 Compare January 28, 2026 16:54
@elezar elezar force-pushed the configure-nvgpu-compat branch 2 times, most recently from a2807ae to b9061b5 Compare February 6, 2026 15:10
Contributor

@cdesiniotis cdesiniotis left a comment

Left a couple of questions. Changes look reasonable to me otherwise.

l.csv.Files = csv.DefaultFileList()
}
if l.csv.CompatContainerRoot == "" {
l.csv.CompatContainerRoot = defaultOrinCompatContainerRoot
Contributor

Question -- will the compat folder path change on future architectures, like Thor? How do you envision us handling that?

Member Author

With Thor systems (and beyond), the same compat libraries that are used for dGPU-based systems are used. Thus the only edge case we have to deal with is Orin systems that are currently under support and require different compat packages.


return slices.Contains(h.Driver, driverMajor)
if hostDriverMajor != 0 && len(h.Driver) > 0 {
return slices.Contains(h.Driver, hostDriverMajor)
Contributor

Question -- I thought we only use the compat libs when the major version is greater than the host driver's major version. Am I missing something here? As currently implemented it appears as if UseCompat() returns true if the major versions are equal.

Member Author

That was the initial basic check, yes. This change extends the basic heuristics to actually check the versions specified in the ELF header of the libcuda.so included in the compat folder. The header includes a list of driver major versions that the library can provide forward compatibility for. That is what this check tries to implement.

In the case of Orin-based systems there is no equivalent for the host driver version, so the CUDA version is compared instead.
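As a rough illustration of the membership check described above, here is a minimal sketch. The `compatHeader` type, the `useCompat` method name, the example version numbers, and the `false` fallback when no information is available are all assumptions made for illustration; this is not the code from this PR.

```go
package main

import (
	"fmt"
	"slices"
)

// compatHeader is a hypothetical stand-in for the data read from the ELF
// header of the libcuda.so found in the compat folder.
type compatHeader struct {
	// Driver lists the driver major versions for which the library can
	// provide forward compatibility.
	Driver []int
}

// useCompat reports whether the compat libraries would be selected for the
// given host driver major version. Returning false when the host version or
// the list is missing is an assumption of this sketch.
func (h compatHeader) useCompat(hostDriverMajor int) bool {
	if hostDriverMajor != 0 && len(h.Driver) > 0 {
		return slices.Contains(h.Driver, hostDriverMajor)
	}
	return false
}

func main() {
	// Illustrative values only.
	h := compatHeader{Driver: []int{535, 550, 570}}
	fmt.Println(h.useCompat(550)) // true: 550 is in the list
	fmt.Println(h.useCompat(590)) // false: 590 is not in the list
}
```

Note that a pure membership test says nothing about whether the compat library is newer or older than the host driver; it only asks whether the host driver major appears in the list.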

Contributor

@cdesiniotis cdesiniotis Feb 25, 2026

Does this mean that our enable-cuda-compat hook now supports both CUDA forward compatibility and CUDA minor version (aka enhanced) compatibility?

Contributor

@cdesiniotis cdesiniotis Feb 25, 2026

I am assuming this does allow us to support CUDA enhanced compatibility. Here is an example:

Host driver: 590.44.01 / CUDA 13.1.0
CUDA compat in container: 590.48.01 / CUDA 13.1.1

Assuming that 590 is present in the list of driver major versions in the ELF header of the libcuda.so in the container, the compat libraries would be used and this exercises CUDA enhanced compatibility (right?).

Taking another example, what is the expected / current behavior for the below?

Host driver: 590.48.01 (CUDA version: 13.1.1)
CUDA compat version in container: 590.44.01 / CUDA 13.1.0

Should the host driver libraries or the compat libraries be used? Does it even matter in this example as both the CUDA major and minor versions are the same? IIUC our current enable-cuda-compat hook would decide that the compat libraries should be used (for the same reasons as the first example).
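To make the two scenarios concrete, the following sketch assumes the check is a pure membership test on the host driver major version and that the ELF header list contains 590. The `driverMajor` and `membershipCheck` helpers are hypothetical names, not functions from the toolkit.

```go
package main

import (
	"fmt"
	"slices"
	"strconv"
	"strings"
)

// driverMajor extracts the major component from a driver version string
// such as "590.44.01".
func driverMajor(version string) (int, error) {
	major, _, _ := strings.Cut(version, ".")
	return strconv.Atoi(major)
}

// membershipCheck reports whether the host driver major version appears in
// the list of supported driver majors (assumed to come from the ELF header).
func membershipCheck(supported []int, hostDriver string) (bool, error) {
	major, err := driverMajor(hostDriver)
	if err != nil {
		return false, err
	}
	return slices.Contains(supported, major), nil
}

func main() {
	supported := []int{590} // assumed contents of the ELF header list

	// The two host drivers from the examples above.
	for _, host := range []string{"590.44.01", "590.48.01"} {
		use, err := membershipCheck(supported, host)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("host driver %s: use compat = %t\n", host, use)
	}
}
```

Under these assumptions both hosts reduce to major version 590, so the decision is the same in both scenarios regardless of which side carries the newer minor release.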

Member Author

I have found at least one example where the logic as implemented here does not work as expected (see #1697). For that reason (and as discussed offline), I think it best to revert this until we have a better definition of the expectations.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change allows the container compat root for nvgpu (e.g. Orin) systems
to be specified either as the
nvidia-container-runtime.modes.csv.compat-container-root option in the
config.toml file, or with the --csv.compat-container-root
(NVIDIA_CTK_CDI_GENERATE_CSV_COMPAT_CONTAINER_ROOT) option when generating
CDI specifications.

A WithCSVCompatContainerRoot option is also exposed in the nvcdi API.

Note that this option is only relevant when nvgpu devices are detected.
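
For readers unfamiliar with the pattern, an option like this combined with a default fallback might look roughly as follows. This is a sketch only: the `lib`, `csvOptions`, `Option`, and `newLib` names, the option's signature, and the placeholder default path are assumptions for illustration and are not taken from the nvcdi package.

```go
package main

import "fmt"

// placeholderCompatRoot is a made-up value used only in this sketch; the
// real default used for Orin systems is not shown here.
const placeholderCompatRoot = "/placeholder/compat/root"

// csvOptions and lib are simplified, hypothetical stand-ins.
type csvOptions struct {
	CompatContainerRoot string
}

type lib struct {
	csv csvOptions
}

// Option is a functional option applied when constructing a lib.
type Option func(*lib)

// WithCSVCompatContainerRoot sets the compat container root (assumed signature).
func WithCSVCompatContainerRoot(root string) Option {
	return func(l *lib) {
		l.csv.CompatContainerRoot = root
	}
}

// newLib applies the options and falls back to the default when the
// compat container root was left unset.
func newLib(opts ...Option) *lib {
	l := &lib{}
	for _, opt := range opts {
		opt(l)
	}
	if l.csv.CompatContainerRoot == "" {
		l.csv.CompatContainerRoot = placeholderCompatRoot
	}
	return l
}

func main() {
	fmt.Println(newLib().csv.CompatContainerRoot)
	fmt.Println(newLib(WithCSVCompatContainerRoot("/custom/compat")).csv.CompatContainerRoot)
}
```

With this shape, callers that do not care about the compat root get the default, while an explicitly supplied value always wins.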

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a --host-cuda-version flag to enable CUDA forward compat
checks on Orin (nvgpu)-based systems where no meaningful host driver
version is available.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the configure-nvgpu-compat branch from b9061b5 to 8ccda8a Compare February 16, 2026 09:19
@elezar elezar requested a review from cdesiniotis February 17, 2026 07:34
@elezar elezar merged commit 7f2d76b into NVIDIA:main Feb 18, 2026
16 checks passed
@elezar elezar deleted the configure-nvgpu-compat branch February 18, 2026 10:40