Enable cuda4dnn on hardware without support for __half #16218
alalek merged 5 commits into opencv:master
Conversation
The capability would be known at compile-time because you decide at compile-time whether the half-precision kernels are instantiated or not. The CUDA backend creates computation nodes for each supported layer. Here is what the backend nodes look like:

opencv/modules/dnn/src/op_cuda.hpp, lines 169 to 201 in 89d3f95

This will attempt to instantiate the half-precision specializations. I am confused how you have managed to build. I'll check in a few hours.
I do not fully agree, see below.
What I did is make sure nvcc (device code) doesn't compile the __half-related code when compiling for a CC < 5.3. I ran my CMake with such a configuration. That's the reason why I said you don't know at compile time whether FP16 will be supported or not; you need context info to resolve this.
Can you try with just
Sure, what do you expect?
I just did; it compiles and works.
Force-pushed 0fc2d14 to 349278e
@asmorkalov I updated this PR to handle the latest changes in the .cu files. Let me know if and how I can help.
There is a new transpose kernel in permute.cu, which was added along with the copy kernel. All the FP16 tests fail when there is no FP16 support. These should be disabled at runtime if possible, as they are not really failures.
Do you know how to test compute capability at runtime?
You might have to use the CUDA device properties for this; you will need the compute capability of the selected device. There is another issue which is unrelated to this PR: if you build for 7.5 only, the kernels will not run on a 6.x device.
In that case (built for 7.5 but run on 6.x), can't you have just-in-time compilation?
CUDA support in OpenCV provides two options:

* Ship PTX: the CUDA runtime generates the binary for the device from the PTX (generated by the compiler) at runtime. This generation incurs a cost at runtime, but the generated binary is cached.
* Pre-build the device binaries at compile time: this avoids the initialization cost but increases the size of the binaries.

Currently, these are the only two mechanisms supported by the CUDA backend. Runtime compilation is something I have planned for the future (other ideas can be found here). It's quite complex and non-trivial to implement JIT, especially with the current template-based kernels.
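The two options map onto OpenCV's CMake build flags; the values below are examples, not a recommendation:

```shell
# Illustrative CMake invocation (the flags are real OpenCV build options,
# the CC values are examples). CUDA_ARCH_BIN pre-builds device binaries for
# the listed CCs; CUDA_ARCH_PTX ships PTX that the driver JIT-compiles
# (and caches) for unlisted or newer devices.
cmake -DWITH_CUDA=ON -DOPENCV_DNN_CUDA=ON \
      -DCUDA_ARCH_BIN="5.0 6.1 7.5" \
      -DCUDA_ARCH_PTX="7.5" \
      ..
```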
Can you try again with the latest commit? BTW, I must be stupid, but I can't find how to compile and run the tests. I have replaced:

```diff
--D_BUILD_TESTS=ON
+-DBUILD_TESTS=ON
```
@alalek sorry this is just a typo in my comment, but not in my command line.
@Nefast You need to have the ts module (which in turn requires the videoio module) to run the tests. I have often got
I confirm the runtime check works and doesn't show the FP16 target on my GeForce 960.
@YashasSamaga Do you confirm that the tests now pass? opencv/modules/dnn/test/test_common.impl.hpp, line 410 in 43a91f8
@JulienMaille I don't own a device with CC 5.2 or below (I'm back at college and don't have one here); I'd have to borrow one from someone. I think it's sufficient if you could upload the test output. Those are tags which are used to mark the tests. They are specifically skip tags, which cause the tests marked with any of them to be skipped.
@YashasSamaga What I don't understand with the code I've linked is that it looks like, if CUDA is present, then we set the flag to skip CUDA tests.
@JulienMaille Registering skip tags and applying skip tags are different. Applying a skip tag is what causes the test to be skipped.
@JulienMaille Please rebase onto master. #16230 added new half-precision kernels, which are used in the FP16 target only. A check is needed for the transpose kernel here: opencv/modules/dnn/src/cuda/permute.cu, lines 130 to 133 in c8419ff
Force-pushed c8419ff to 6750ab6
Done, rebasing was enough. Do you think this can be merged soon?
@JulienMaille The CI bot reports a build error:
Force-pushed 78edbd0 to 4b5340d
@JulienMaille Please take a look at CI again. The build is still broken: https://pullrequest.opencv.org/buildbot/builders/precommit_linux64/builds/24320/steps/compile%20release/logs/stdio
Force-pushed 4b5340d to f0df8ce
@asmorkalov I forgot to remove an extra #endif; it is fixed now.
@YashasSamaga Probably a stupid question, but does the cuDNN module rely on cuBLAS? (I'm really surprised by the size of the DLLs that have to be redistributed and am trying to squeeze out anything useless.)
@JulienMaille cuDNN does not require cuBLAS, but the CUDA backend requires cuBLAS for GEMM.
👍
Force-pushed d5fb32e to 9c24ca2
|
👍 @alalek Please take a look and merge.
Clearly, there is one trivial mistake; I cannot open a PR for now. But please refer to: https://github.com/jiapei100/opencv/blob/master/modules/dnn/src/cuda/math.hpp
@alalek jiapei100 is right; the #endif was including a float operation.
@JulienMaille Sure! Feel free to prepare a new PR with the fix.
Please backport (or forward-port) to the OpenCV 4.2 tag. Greetings! For example, my hardware setup doesn't have the CC 5.3 feature (NVIDIA RTX 2060, CUDA 10.2).
…gpus Enable cuda4dnn on hardware without support for __half

* Enable cuda4dnn on hardware without support for half (i.e. compute capability < 5.3); update CMakeLists.txt; lowered minimum CC to 3.0
* UPD: added ifdef on new copy kernel
* added fp16 support detection at runtime
* Clarified #if condition on atomicAdd definition
* More explicit CMake error message
i.e. hardware with compute capability < 5.3

It compiles/links fine and I was able to run some inference on my GeForce 960!

Right now I limited support to CC 5.2+, but I suppose we can go lower. What do you think? 5.0? 4.0?
@YashasSamaga said
But the compute capability has to be queried at runtime on the selected device in order to tell whether FP16 is supported, correct?