Use versioned flavor of get driver entrypoint function by ptrendx · Pull Request #1835 · NVIDIA/TransformerEngine

ptrendx · 2025-05-30T20:57:24Z

Description

Fixes the issue with cuStreamGetCtx pointing to cuStreamGetCtx_v2 in the CUDA 13 drivers.

Summary (mostly) by copilot:
This pull request updates the transformer_engine/common/util/cuda_driver.cpp and transformer_engine/common/util/cuda_driver.h files to enhance compatibility with different CUDA versions. The changes introduce a mechanism to query driver entry points based on the CUDA version, improving flexibility in handling CUDA driver symbols.

Enhancements for CUDA version compatibility:

Updated get_symbol function in transformer_engine/common/util/cuda_driver.cpp: Refactored the function to support querying driver entry points using either a versioned or non-versioned mechanism. The function now accepts a cuda_version parameter and dynamically resolves the appropriate entry point function (cudaGetDriverEntryPoint or cudaGetDriverEntryPointByVersion).
Modified get_symbol function declaration in transformer_engine/common/util/cuda_driver.h: Added an optional cuda_version parameter with a default value of 12010 (our oldest supported version) to allow backward compatibility while enabling version-specific queries.
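The dispatch described in the two items above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `get_symbol_sketch`, `UnversionedLookup`, `VersionedLookup` and `kFirstVersionedRuntime` are hypothetical names, and the two callbacks are simplified stand-ins for cudaGetDriverEntryPoint and cudaGetDriverEntryPointByVersion, whose real signatures take more arguments. The 12050 threshold assumes the versioned API first appears in CUDA 12.5, as one of the commit messages in this PR states.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the two CUDA runtime lookups named above.
// Their signatures are simplified for illustration and are NOT the real ones.
using UnversionedLookup = std::function<void*(const char* symbol)>;
using VersionedLookup = std::function<void*(const char* symbol, int cuda_version)>;

// Assumption: the versioned lookup exists from CUDA 12.5 (encoded as 12050) onward.
constexpr int kFirstVersionedRuntime = 12050;

// Resolve `symbol`, preferring the versioned lookup when the runtime provides it.
// `cuda_version` defaults to 12010, the oldest supported version per the PR text.
void* get_symbol_sketch(const char* symbol, int runtime_version,
                        const UnversionedLookup& unversioned,
                        const VersionedLookup& versioned,
                        int cuda_version = 12010) {
  assert(symbol != nullptr);
  void* entry = nullptr;
  if (runtime_version >= kFirstVersionedRuntime && versioned) {
    // Ask for the entry point as it existed in `cuda_version`.
    entry = versioned(symbol, cuda_version);
  } else if (unversioned) {
    // Older runtimes only offer the non-versioned query.
    entry = unversioned(symbol);
  }
  if (entry == nullptr) {
    throw std::runtime_error(std::string("could not resolve driver symbol: ") + symbol);
  }
  return entry;
}
```

Keeping the requested version as a defaulted trailing parameter means existing call sites compile unchanged, while a caller that needs a newer driver symbol can pass a higher version explicitly.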

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

ptrendx · 2025-05-30T20:57:43Z

/te-ci

timmoon10

LGTM


flx42 · 2025-06-03T17:46:48Z

I verified that it works!


ptrendx · 2025-06-03T20:12:39Z

@timmoon10 sorry for the churn, I was looking at how others dealt with this issue and found this issue from cutlass: NVIDIA/cutlass#2079 - since we, just like them, link against libcudart.so.12, the check for CUDA 12.5 during the compilation is not enough and we actually need to dynamically load the symbols. Fortunately, since we already link against libcudart, we don't need to try to find the lib by name (so at least there is that). @flx42 Could you verify this new version?
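The point about loading the symbols dynamically can be illustrated with a small probe. This is a sketch under assumptions, not the code merged in this PR: `find_loaded_symbol`, `EntryPointKind` and `detect_entry_point_kind` are hypothetical names. It relies on the POSIX `dlsym` call with `RTLD_DEFAULT`, which searches the libraries already loaded into the process, so no library file name is needed when libcudart is already linked.

```cpp
#include <cassert>
#include <dlfcn.h>

// Look up `name` among the symbols already loaded into this process.
// RTLD_DEFAULT searches the global scope, so no library path is required.
// Returns nullptr when the symbol is absent.
void* find_loaded_symbol(const char* name) {
  assert(name != nullptr);
  dlerror();  // clear any stale error state before the lookup
  return dlsym(RTLD_DEFAULT, name);
}

// Which runtime lookup function the running process actually exposes.
enum class EntryPointKind { None, Unversioned, Versioned };

// Decide at run time rather than at compile time: the libcudart that gets
// loaded may be older or newer than the toolkit used to build the library.
EntryPointKind detect_entry_point_kind() {
  if (find_loaded_symbol("cudaGetDriverEntryPointByVersion") != nullptr) {
    return EntryPointKind::Versioned;
  }
  if (find_loaded_symbol("cudaGetDriverEntryPoint") != nullptr) {
    return EntryPointKind::Unversioned;
  }
  return EntryPointKind::None;
}
```

In a process that has no CUDA runtime loaded, both probes return a null pointer and the result is `EntryPointKind::None`, so a caller would need to report an error in that case. On older glibc versions the program may also need to be linked with `-ldl`.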

ptrendx · 2025-06-03T20:16:42Z

/te-ci


flx42 · 2025-06-04T00:20:36Z

> @timmoon10 sorry for the churn, I was looking at how others dealt with this issue and found this issue from cutlass: NVIDIA/cutlass#2079 - since we, just like them, link against libcudart.so.12, the check for CUDA 12.5 during the compilation is not enough and we actually need to dynamically load the symbols. Fortunately, since we already link against libcudart, we don't need to try to find the lib by name (so at least there is that). @flx42 Could you verify this new version?

Still works fine!

ptrendx · 2025-06-04T18:24:15Z

/te-ci

Use versioned flavor of get driver entrypoint function

59f1466

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

ptrendx requested a review from timmoon10 May 30, 2025 20:57

timmoon10 previously approved these changes May 30, 2025


flx42 added a commit to flx42/TransformerEngine that referenced this pull request May 31, 2025

Use versioned flavor of get driver entrypoint function

9eb811d

NVIDIA#1835

Update the check to call the versioned API starting with CUDA 12.5 where it was added

31a6359

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

ptrendx dismissed timmoon10’s stale review via 31a6359 June 2, 2025 21:42

timmoon10 previously approved these changes Jun 2, 2025


Dynamically find entrypoint functions

16c6fd8

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

ptrendx dismissed timmoon10’s stale review via 16c6fd8 June 3, 2025 20:09

[pre-commit.ci] auto fixes from pre-commit.com hooks

eaba7a7

for more information, see https://pre-commit.ci

ptrendx added 2 commits June 3, 2025 13:22

Error checking

ee2afbb

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

Lint fix

89f0666

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

timmoon10 approved these changes Jun 4, 2025


ptrendx merged commit 557f0cb into NVIDIA:main Jun 5, 2025
34 of 37 checks passed

nWEIdia mentioned this pull request Jun 11, 2025

[Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 pytorch/pytorch#154097

Closed

ptrendx merged 6 commits into NVIDIA:main from ptrendx:pr_entrypoint
