
Initial support Blackwell GPU arch#26820

Merged
asmorkalov merged 2 commits into opencv:4.x from johnnynunez:patch-1 on Jan 25, 2025

Conversation

@johnnynunez
Contributor

@johnnynunez johnnynunez commented Jan 22, 2025

10.0: Blackwell B100/B200
12.0: Blackwell RTX 50 series

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

10.0 blackwell b100/b200
12.0 blackwell rtx50
@johnnynunez johnnynunez changed the title from "initial support blackwell" to "initial support blackwell codegen" on Jan 22, 2025
@asmorkalov asmorkalov added this to the 4.12.0 milestone Jan 22, 2025
@asmorkalov
Contributor

cc @cudawarped

@cudawarped
Contributor

@johnnynunez Which version of the CUDA toolkit are you using to compile for Blackwell? The latest version I have access to (12.6 Update 3) does not support compute capability 10.0. I assume Nvidia will release version 13.0 to coincide with the release of the first Blackwell cards.

Isn't Blackwell compute capability 10.0? Where does 12.0 come from?


@johnnynunez
Contributor Author

johnnynunez commented Jan 22, 2025

@johnnynunez Which version of the CUDA toolkit are you using to compile for Blackwell? The latest version I have access to (12.6 Update 3) does not support compute capability 10.0. I assume Nvidia will release version 13.0 to coincide with the release of the first Blackwell cards.

Isn't Blackwell compute capability 10.0? Where does 12.0 come from?

Hello,
B100 support is coming with CUDA 12.7.
RTX 50 support is coming with CUDA 12.8.

Also, Thor is compute capability 10.1.
I have an RTX 5090 card.

@cudawarped
Contributor

@johnnynunez So you haven't tested that this works on Blackwell 🤯? I would wait until a version of the CUDA toolkit is released that supports compute capability 10.0, to be 100% sure your changes don't break anything. I would also remove compute capability 12.0.

Note: if you add 10.1 for Thor you will also want to update this filter:

ocv_filter_available_architecture(${nvcc_executable} __cuda_arch_bin
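In spirit, that filter keeps only the requested architectures that the detected nvcc can actually emit code for. A rough Python sketch of the idea (function and variable names here are illustrative, not OpenCV's actual CMake code):

```python
# Illustrative sketch of what an architecture filter does conceptually:
# keep only the requested compute capabilities that the installed
# compiler actually supports. Names are hypothetical, not OpenCV's CMake.

def filter_available_architectures(requested, supported_by_nvcc):
    """Return the subset of requested arch strings nvcc can target."""
    supported = set(supported_by_nvcc)
    return [arch for arch in requested if arch in supported]

# Example: a toolchain that does not yet know about 10.0 or 12.0
requested = ["8.6", "9.0", "10.0", "12.0"]
supported = ["5.0", "6.1", "7.5", "8.6", "8.9", "9.0"]
print(filter_available_architectures(requested, supported))  # ['8.6', '9.0']
```

Without such a filter entry for a new capability, the build would silently drop (or fail on) the new architecture when configured with an older toolkit.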

@johnnynunez
Contributor Author

@johnnynunez So you haven't tested that this works on Blackwell 🤯? I would wait until a version of the CUDA toolkit is released that supports compute capability 10.0, to be 100% sure your changes don't break anything. I would also remove compute capability 12.0.

Note: if you add 10.1 for Thor you will also want to update this filter:

ocv_filter_available_architecture(${nvcc_executable} __cuda_arch_bin

But why remove 12.0?
My 5090 reports 12.0. I think that DIGITS will be 11.0.

@johnnynunez
Contributor Author

@johnnynunez So you haven't tested that this works on Blackwell 🤯? I would wait until a version of the CUDA toolkit is released that supports compute capability 10.0, to be 100% sure your changes don't break anything. I would also remove compute capability 12.0.
Note: if you add 10.1 for Thor you will also want to update this filter:

ocv_filter_available_architecture(${nvcc_executable} __cuda_arch_bin

But why remove 12.0? My 5090 reports 12.0. I think that DIGITS will be 11.0.

Well, today the NDA was lifted.

@johnnynunez
Contributor Author

@johnnynunez So you haven't tested that this works on Blackwell 🤯? I would wait until a version of the CUDA toolkit is released that supports compute capability 10.0, to be 100% sure your changes don't break anything. I would also remove compute capability 12.0.
Note: if you add 10.1 for Thor you will also want to update this filter:

ocv_filter_available_architecture(${nvcc_executable} __cuda_arch_bin

But why remove 12.0? My 5090 reports 12.0. I think that DIGITS will be 11.0.

I added support in PyTorch, XLA, etc.

@cudawarped
Contributor

cudawarped commented Jan 22, 2025

But why remove 12.0?
My 5090 reports 12.0. I think that DIGITS will be 11.0.

Can you provide me with a source to indicate that your 5090 will be compute capability 12.0?

I added support in PyTorch, XLA, etc.

PyTorch merged compute capabilities 10.0 and 12.0 without any build testing (I guess it's a Python-first library)? I can't see a reason for not waiting, especially in OpenCV, where you can manually select the compute capability using combinations of CUDA_ARCH_BIN and CUDA_ARCH_PTX.

@johnnynunez
Contributor Author

johnnynunez commented Jan 22, 2025

But why remove 12.0?
My 5090 reports 12.0. I think that DIGITS will be 11.0.

Can you provide me with a source to indicate that your 5090 will be compute capability 12.0?

I added support in PyTorch, XLA, etc.

PyTorch merged compute capabilities 10.0 and 12.0 without any build testing (I guess it's a Python-first library)? I can't see a reason for not waiting, especially in OpenCV, where you can manually select the compute capability using combinations of CUDA_ARCH_BIN and CUDA_ARCH_PTX.

[screenshot]

More references:
pytorch/pytorch#145270 (my PyTorch PR)
Dao-AILab/flash-attention#1436

@cudawarped
Contributor

@johnnynunez 🤯 Nvidia is jumping three compute capabilities in a single generation; that's going to be really confusing, considering they previously used one compute version per generation.

What is the output from

nvidia-smi --query-gpu=compute_cap --format=csv

If it says

compute_cap
12.0

Do you think it's possible that the driver is outputting the wrong info?

@asmorkalov Either way I would suggest it would be better to wait until more info is available before merging this PR.
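For reference, the `compute_cap` query above prints a CSV header followed by one row per GPU. A small hedged Python helper for turning that text into (major, minor) pairs (the sample string below is the output format reported in this thread, not live driver output):

```python
def parse_compute_caps(csv_text):
    """Parse `nvidia-smi --query-gpu=compute_cap --format=csv` output
    into a list of (major, minor) integer pairs, one per GPU."""
    lines = [ln.strip() for ln in csv_text.strip().splitlines()]
    caps = []
    for line in lines[1:]:          # skip the 'compute_cap' header row
        major, minor = line.split(".")
        caps.append((int(major), int(minor)))
    return caps

# Sample output shape, as reported in this thread for an RTX 5090:
sample = "compute_cap\n12.0\n"
print(parse_compute_caps(sample))   # [(12, 0)]
```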

@johnnynunez
Contributor Author

nvidia-smi --query-gpu=compute_cap --format=csv

It's okay; Flash Attention v4 is coming for Blackwell as well. They have 100 and 120.
I totally agree, haha, but Nvidia now has a lot of products:
upcoming ARM laptops, desktop GPUs, DIGITS, Jetson, and data centers.

@johnnynunez
Contributor Author

@johnnynunez 🤯 Nvidia is jumping three compute capabilities in a single generation; that's going to be really confusing, considering they previously used one compute version per generation.

What is the output from

nvidia-smi --query-gpu=compute_cap --format=csv

If it says

compute_cap
12.0

Do you think it's possible that the driver is outputting the wrong info?

@asmorkalov Either way I would suggest it would be better to wait until more info is available before merging this PR.

I'll share it in the next few hours, because I'm not at home. But we have the press-release driver 571.86.

@asmorkalov
Contributor

@cudawarped Thanks a lot for the analysis. I'll review the hardware specs and get back soon.

@asmorkalov asmorkalov self-requested a review January 22, 2025 09:56

@johnnynunez
Contributor Author

But why remove 12.0?
My 5090 reports 12.0. I think that DIGITS will be 11.0.

Can you provide me with a source to indicate that your 5090 will be compute capability 12.0?

I added support in PyTorch, XLA, etc.

PyTorch merged compute capabilities 10.0 and 12.0 without any build testing (I guess it's a Python-first library)? I can't see a reason for not waiting, especially in OpenCV, where you can manually select the compute capability using combinations of CUDA_ARCH_BIN and CUDA_ARCH_PTX.

[screenshot]

The new drivers are showing CUDA 12.8 and the same 12.0 compute capability.

@johnnynunez
Contributor Author

johnnynunez commented Jan 23, 2025

More references:
NVIDIA/cccl#3493

10.0: B100/B200
10.0a: ARM laptops, DIGITS?
10.1: Thor
10.1a: ARM laptops, DIGITS?
12.0: RTX 50
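Those capability numbers are what ultimately get turned into nvcc code-generation flags; a hedged sketch of the translation (the handling of the family-specific `a` suffix here follows the usual `sm_XXXa` naming and is an assumption, not taken from OpenCV's build files):

```python
def gencode_flag(cap):
    """Translate a compute capability string like '10.1' or '10.0a'
    into an nvcc -gencode flag emitting real binary (SASS) code."""
    sm = cap.replace(".", "")           # '10.1' -> '101', '10.0a' -> '100a'
    return f"-gencode=arch=compute_{sm},code=sm_{sm}"

for cap in ["10.0", "10.1", "12.0"]:
    print(gencode_flag(cap))
# e.g. the first line is: -gencode=arch=compute_100,code=sm_100
```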

@cudawarped
Contributor

@johnnynunez I still suggest we wait; I can't see any downside for OpenCV, because the compute capability can be manually selected.

I still wouldn't be surprised if the 12 is the version of the CUDA toolkit, returned because the real compute capability is not valid: the CUDA toolkit used by PyTorch, and the one used to compile nvidia-smi, pre-date compute capabilities >= 9.0.

@johnnynunez
Contributor Author

@johnnynunez I still suggest we wait; I can't see any downside for OpenCV, because the compute capability can be manually selected.

I still wouldn't be surprised if the 12 is the version of the CUDA toolkit, returned because the real compute capability is not valid: the CUDA toolkit used by PyTorch, and the one used to compile nvidia-smi, pre-date compute capabilities >= 9.0.

Yeah! Totally agree.

@johnnynunez
Contributor Author

CUDA 12.8 is out.
[screenshot]

@cudawarped
Contributor

@johnnynunez Compute capability 12.0 looks to be official. Consumer cards look to have fewer resident threads per SM.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html?highlight=compute%2520capability#features-and-technical-specifications

@cudawarped
Contributor

@asmorkalov Builds on both Windows 11 and Ubuntu 22.04 (WSL) with CUDA Toolkit 12.8, using the default architecture selection (CUDA_ARCH_BIN and CUDA_ARCH_PTX not specified):

--   NVIDIA CUDA:                   YES (ver 12.8, CUFFT CUBLAS NVCUVID NVCUVENC)
--     NVIDIA GPU arch:             50 52 60 61 70 75 80 86 89 90 100 120
--     NVIDIA PTX archs:            120
--
--   cuDNN:                         YES (ver 9.7.0)

and -DCUDA_GENERATION=Blackwell


--   NVIDIA CUDA:                   YES (ver 12.8, CUFFT CUBLAS NVCUVID NVCUVENC)
--     NVIDIA GPU arch:             100 120
--     NVIDIA PTX archs:
--
--   cuDNN:                         YES (ver 9.7.0)

@johnnynunez
Contributor Author

johnnynunez commented Jan 24, 2025

@asmorkalov Builds on both Windows 11 and Ubuntu 22.04 (WSL) with CUDA Toolkit 12.8, using the default architecture selection (CUDA_ARCH_BIN and CUDA_ARCH_PTX not specified):

--   NVIDIA CUDA:                   YES (ver 12.8, CUFFT CUBLAS NVCUVID NVCUVENC)
--     NVIDIA GPU arch:             50 52 60 61 70 75 80 86 89 90 100 120
--     NVIDIA PTX archs:            120
--
--   cuDNN:                         YES (ver 9.7.0)

and -DCUDA_GENERATION=Blackwell


--   NVIDIA CUDA:                   YES (ver 12.8, CUFFT CUBLAS NVCUVID NVCUVENC)
--     NVIDIA GPU arch:             100 120
--     NVIDIA PTX archs:
--
--   cuDNN:                         YES (ver 9.7.0)

Yeah, I compiled PyTorch, xFormers, etc., and OpenCV with my RTX 5090, and it works. I just couldn't comment on anything because of the NDA, but it was lifted yesterday.

@johnnynunez
Contributor Author

@asmorkalov feel free to merge! thanks

@asmorkalov asmorkalov self-assigned this Jan 25, 2025
@asmorkalov asmorkalov changed the title from "initial support blackwell codegen" to "Initial support Blackwell GPU arch" on Jan 25, 2025
@asmorkalov asmorkalov merged commit 4b2a33a into opencv:4.x Jan 25, 2025
@asmorkalov asmorkalov mentioned this pull request Feb 19, 2025
NanQin555 pushed a commit to NanQin555/opencv that referenced this pull request Feb 24, 2025