
Enable cuda4dnn on hardware without support for __half #16218

Merged
alalek merged 5 commits into opencv:master from JulienMaille:cuda-dnn-for-older-gpus
Jan 15, 2020

Conversation

@JulienMaille
Contributor

@JulienMaille JulienMaille commented Dec 21, 2019

i.e. hardware with compute capability < 5.3

It compiles/links fine and I was able to run some inference on my GeForce 960!
Right now I limited support to CC 5.2+, but I suppose we can go lower. What do you think? 5.0? 4.0?

@YashasSamaga said

Some checks in dnn.cpp to identify use of DNN_TARGET_CUDA_FP16 when half precision is disabled.

But compute capability has to be queried at runtime on the selected device in order to tell if FP16 is supported or not, correct?

force_builders=Custom,docs
buildworker:Custom=linux-4
docker_image:Custom=ubuntu-cuda:18.04

build_image:Custom Mac=openvino-2019r3.0
build_image:Custom Win=openvino-2019r3.0
test_opencl:Custom Win=OFF
test_modules:Custom Mac=dnn,java,python3

@YashasSamaga
Contributor

YashasSamaga commented Dec 21, 2019

The DNN_TARGET_CUDA_FP16 option must exist in the enumeration irrespective of whether that target is supported; this is required to maintain ABI compatibility. Hence, a user who has built the module without FP16 support can still set the target to DNN_TARGET_CUDA_FP16. Since this target isn't supported on the device, there should be an error or a warning, followed by a fallback to DNN_TARGET_CUDA.
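As a sketch of that fallback behaviour (the enum values mirror the public DNN target enumeration, but the helper name and warning text are purely illustrative, not the actual dnn.cpp code):

```cpp
#include <iostream>

// Target constants mirroring the DNN enumeration (illustrative subset)
enum Target { DNN_TARGET_CUDA = 6, DNN_TARGET_CUDA_FP16 = 7 };

// If the user requests FP16 but the build/device lacks support,
// warn and fall back to the full-precision CUDA target.
inline Target resolveCudaTarget(Target requested, bool halfPrecisionSupported) {
    if (requested == DNN_TARGET_CUDA_FP16 && !halfPrecisionSupported) {
        std::cerr << "DNN_TARGET_CUDA_FP16 is not supported; "
                     "falling back to DNN_TARGET_CUDA\n";
        return DNN_TARGET_CUDA;
    }
    return requested;
}
```

With logic of this shape, requesting FP16 on an unsupported build would log a warning and silently degrade to the full-precision target instead of failing.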

The capability would be known at compile-time because you decide at compile-time whether the half precision kernels would be instantiated or not.

The CUDA backend would create computation nodes for each supported layer in initCUDABackend(). This would invoke initCUDA() method on every layer which will create and return a node. This node is constructed using a helper template called make_cuda_node which automatically instantiates the correct node template based on the target.

The backend nodes take the form given below where T is float or half depending on the target.

template <class T>
class SomeComputeNode : public CUDABackendNode;

Here is what make_cuda_node does:

/** @brief utility function which creates CUDA node of correct type from `targetId`
 *
 * CUDA operation nodes take the type of data they operate on as a template parameter.
 * For example, ConcatOp<float> is an operation node which concats tensors of `float` type
 * into a tensor of `float` type.
 *
 * This utility function aids the creation of nodes of different types and eliminates the
 * need for CUDA target constants (`DNN_TARGET_XXX`) to appear in the operation code, which
 * reduces coupling between modules.
 *
 * Example:
 * template <class T>
 * class ConcatOp : public CUDABackendNode;
 *
 * // returns a cv::Ptr to a ConcatOp<half> object
 * auto node = make_cuda_node<ConcatOp>(DNN_TARGET_CUDA_FP16, axis);
 *
 * // returns a cv::Ptr to a ConcatOp<float> object
 * auto node = make_cuda_node<ConcatOp>(DNN_TARGET_CUDA, axis);
 */
template <template <class> class NodeType, class ...Args>
cv::Ptr<BackendNode> make_cuda_node(int targetId, Args&& ...args) {
    switch (targetId)
    {
    case DNN_TARGET_CUDA_FP16:
        return Ptr<BackendNode>(new NodeType<half>(std::forward<Args>(args)...));
    case DNN_TARGET_CUDA:
        return Ptr<BackendNode>(new NodeType<float>(std::forward<Args>(args)...));
    default:
        CV_Assert(IS_DNN_CUDA_TARGET(targetId));
    }
    return Ptr<BackendNode>();
}

This will attempt to instantiate half backend nodes, which in turn will attempt to invoke half-precision kernels. Since the half-precision kernels were not instantiated, this should lead to a truckload of linker errors.

I am confused about how you managed to build. I'll check in a few hours.

@JulienMaille
Contributor Author

JulienMaille commented Dec 21, 2019

The capability would be known at compile-time because you decide at compile-time whether the half precision kernels would be instantiated or not.

I do not fully agree, see below.

I am confused how you have managed to build. I'll check in a few hours.

What I did is make sure nvcc (device code) doesn't compile the __half-related code when compiling for CC < 5.3.
gcc/msvc (host code) still compiles all the code handling __half.

I ran my CMake with CUDA_ARCH_BIN="5.2 5.3 6.0 6.1 7.0 7.5", so in the end I still support __half when it is available.

That's why I said you don't know at compile time whether FP16 will be supported; you need device info at runtime to resolve this.
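For reference, a CMake invocation along these lines produces such a multi-architecture build (WITH_CUDA, OPENCV_DNN_CUDA, and CUDA_ARCH_BIN are real OpenCV build options; the source path is illustrative):

```shell
# Build device binaries for CC 5.2 through 7.5; nvcc only instantiates the
# __half kernels for the architectures that actually support them (>= 5.3).
cmake -DWITH_CUDA=ON \
      -DOPENCV_DNN_CUDA=ON \
      -DCUDA_ARCH_BIN="5.2 5.3 6.0 6.1 7.0 7.5" \
      ../opencv
```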

@YashasSamaga
Contributor

Can you try with just 5.2 in CUDA_ARCH_BIN?

@JulienMaille
Contributor Author

Sure, what do you expect?

@JulienMaille
Contributor Author

Can you try with just 5.2 in CUDA_ARCH_BIN?

I just did, it compiles and works

@asmorkalov asmorkalov self-assigned this Dec 23, 2019
@asmorkalov asmorkalov added category: gpu/cuda (contrib) OpenCV 4.0+: moved to opencv_contrib category: dnn labels Dec 23, 2019
@JulienMaille JulienMaille force-pushed the cuda-dnn-for-older-gpus branch from 0fc2d14 to 349278e Compare December 29, 2019 10:41
@JulienMaille
Contributor Author

@asmorkalov I updated this PR to handle latest changes in .cu files. Let me know if and how I can help.

@YashasSamaga
Contributor

YashasSamaga commented Dec 29, 2019

There is a new transpose kernel in permute.cu which was added along with the copy kernel.

All the FP16 tests fail when there is no FP16 support (used CUDA_ARCH_BIN as 5.2). An example failure:

[ RUN      ] Test_ONNX_nets.ResNet50v1/1, where GetParam() = CUDA/CUDA_FP16
unknown file: Failure
C++ exception with description "OpenCV(4.2.0-dev) /FakePath/execution.hpp:52: error: (-217:Gpu API call) invalid device function in function 'make_policy'
" thrown in the test body.
[  FAILED  ] Test_ONNX_nets.ResNet50v1/1, where GetParam() = CUDA/CUDA_FP16 (271 ms)

These should be disabled at runtime if possible as they are not really failures.

@JulienMaille
Contributor Author

Do you know how to test compute capability at runtime?

@YashasSamaga
Contributor

YashasSamaga commented Dec 29, 2019

You might have to use cudaDeviceGetAttribute and obtain the attributes corresponding to cudaDevAttrComputeCapabilityMajor and cudaDevAttrComputeCapabilityMinor. You have to combine them as major.minor to get the compute capability.

You will need the cuda_runtime.h header, whose inclusion would have to be guarded by #ifdef HAVE_CUDA. You can get the device id on which the tests will run using cudaGetDevice.
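The decision itself is simple once the two attributes are in hand; a minimal sketch of that logic (the helper names are illustrative, and in a real build `major`/`minor` would come from `cudaDeviceGetAttribute` on the device returned by `cudaGetDevice`):

```cpp
// Combine the two queried attributes into a single comparable number,
// e.g. major 5, minor 3 -> 53.
constexpr int computeCapability(int major, int minor) {
    return major * 10 + minor;
}

// Native __half arithmetic requires compute capability 5.3 or higher.
constexpr bool supportsFp16(int major, int minor) {
    return computeCapability(major, minor) >= 53;
}
```

A check like this lets the tests (and target enumeration) exclude FP16 at runtime on, say, a CC 5.2 device even when the build itself includes the half-precision kernels.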

There is another issue which is unrelated to this PR: if you build for 7.5 only and try to run on a 6.1 GPU, all the CUDA tests would fail because there is no kernel PTX or binaries available for that GPU.

@JulienMaille
Contributor Author

In that case (built for 7.5 but ran on 6.x) can't you have just in time compilation?

@YashasSamaga
Contributor

YashasSamaga commented Dec 29, 2019

CUDA support in OpenCV provides two options:

  • you can build binaries for various architectures
  • you can generate PTX for various virtual architectures

The CUDA runtime generates the binary for the device from the PTX (generated by the compiler) at runtime. This generation incurs a cost at runtime but the generated binary is cached. If the binaries are pre-built at compile-time, you can avoid this initialization cost but this would increase the size of the binaries.

Currently, these are the only two mechanisms supported by the CUDA backend. Runtime compilation is something I have planned for the future (other ideas can be found here). It's quite complex and non-trivial to implement JIT, especially with the current template-based kernels.

@JulienMaille
Contributor Author

JulienMaille commented Dec 29, 2019

All the FP16 tests fail when there is no FP16 support (used CUDA_ARCH_BIN as 5.2). An example failure:
These should be disabled at runtime if possible as they are not really failures.

Can you try again with the latest commit? BTW I must be stupid, but I can't find how to compile and run the tests. I replaced -DBUILD_TESTS=OFF with -DBUILD_TESTS=ON but I don't see them in the generated solution.

@alalek
Member

alalek commented Dec 29, 2019

--D_BUILD_TESTS=ON
+-DBUILD_TESTS=ON

@JulienMaille
Contributor Author

JulienMaille commented Dec 29, 2019

@alalek sorry this is just a typo in my comment, but not in my command line.
When I build RUN_TESTS I get

No tests were found!!!

@alalek
Member

alalek commented Dec 29, 2019

Build opencv_test_dnn target.
Run ./bin/opencv_test_dnn binary.
Also you should specify environment variables for tests:

  • OPENCV_TEST_DATA_PATH=<opencv_extra>/testdata (clone "opencv_extra" repository)
  • and optionally OPENCV_DNN_TEST_DATA_PATH (need to download 5+Gb)

@YashasSamaga
Contributor

YashasSamaga commented Dec 30, 2019

@Nefast You need to have the ts module (which in turn requires the videoio module) to run the tests. I have often gotten "No tests were found" when I forgot to enable these modules.

@JulienMaille
Contributor Author

I confirm the runtime check works and doesn't show the FP16 target on my GeForce 960.

@JulienMaille
Contributor Author

@YashasSamaga Do you confirm you now pass the tests?
I've been looking at the code and there's a lot of stuff I don't understand like this:

CV_TEST_TAG_DNN_SKIP_CUDA, CV_TEST_TAG_DNN_SKIP_CUDA_FP32, CV_TEST_TAG_DNN_SKIP_CUDA_FP16

@YashasSamaga
Contributor

YashasSamaga commented Jan 4, 2020

@JulienMaille I don't own a device with CC 5.2 or below (back at college and I don't have one here). I have to borrow it from someone. I think it's sufficient if you could upload the output of opencv_test_dnn and opencv_perf_dnn.

Those are tags which are used to mark the tests. They are specifically skip tags which cause the tests which are marked with any of them to be skipped.

@JulienMaille
Contributor Author

@YashasSamaga What I don't understand about the code I've linked is that it looks like, if CUDA is present, we set the flag to skip the CUDA tests.

@YashasSamaga
Contributor

@JulienMaille Registering skip tags and applying skip tags are different. Applying a skip tag is what causes the test to be skipped.
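A toy model of that two-step mechanism (an illustrative sketch only, not the actual `ts` module API):

```cpp
#include <set>
#include <string>

// Tags registered for this test run; registering alone skips nothing.
static std::set<std::string> registeredSkipTags;

void registerSkipTag(const std::string& tag) {
    registeredSkipTags.insert(tag);
}

// A test that applies a tag is skipped only if that tag was registered.
bool shouldSkip(const std::string& appliedTag) {
    return registeredSkipTags.count(appliedTag) != 0;
}
```

So seeing a tag such as CV_TEST_TAG_DNN_SKIP_CUDA_FP16 being registered when CUDA is present only makes skipping *possible*; an individual test is skipped only when it applies that tag itself.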

@YashasSamaga
Contributor

YashasSamaga commented Jan 8, 2020

@JulienMaille Please rebase onto master. #16230 added new half-precision kernels. These kernels are used in FP16 target only.

Need a check for the transpose kernel here:

template void transpose(const Stream&, Span<__half>, View<__half>, std::size_t, std::size_t);
template void transpose(const Stream&, Span<float>, View<float>, std::size_t, std::size_t);

@JulienMaille JulienMaille force-pushed the cuda-dnn-for-older-gpus branch from c8419ff to 6750ab6 Compare January 8, 2020 20:04
@JulienMaille
Contributor Author

Done, rebasing was enough. Do you think this can be merged soon?

@asmorkalov
Contributor

@JulienMaille CI bot reports build error:

/build/precommit_linux64/opencv/modules/dnn/src/dnn.cpp:138:2: error: #endif without #if
 #endif

@JulienMaille JulienMaille force-pushed the cuda-dnn-for-older-gpus branch from 78edbd0 to 4b5340d Compare January 10, 2020 10:54
@asmorkalov
Contributor

@JulienMaille JulienMaille force-pushed the cuda-dnn-for-older-gpus branch from 4b5340d to f0df8ce Compare January 10, 2020 14:16
@JulienMaille
Contributor Author

@asmorkalov Forgot to remove an extra #endif, it is fixed now

@JulienMaille
Contributor Author

@YashasSamaga Probably a stupid question, but does the cudnn module rely on cublas? (I'm really surprised by the size of the DLLs that have to be redistributed, and I'm trying to squeeze out anything useless.)

@YashasSamaga
Contributor

@JulienMaille cuDNN does not require cuBLAS but the CUDA backend requires cuBLAS for GEMM.

@asmorkalov
Contributor

👍

@JulienMaille JulienMaille force-pushed the cuda-dnn-for-older-gpus branch 2 times, most recently from d5fb32e to 9c24ca2 Compare January 14, 2020 21:16
@asmorkalov
Contributor

👍 @alalek Please take a look and merge.

Member

@alalek alalek left a comment


Looks good to me 👍

@alalek alalek merged commit 4e2ef8c into opencv:master Jan 15, 2020
@JulienMaille JulienMaille deleted the cuda-dnn-for-older-gpus branch January 15, 2020 16:20
@jiapei100

jiapei100 commented Jan 15, 2020

Clearly, there is one trivial mistake; I cannot open a PR for now. But please refer to: https://github.com/jiapei100/opencv/blob/master/modules/dnn/src/cuda/math.hpp, around line 135.

@JulienMaille JulienMaille restored the cuda-dnn-for-older-gpus branch January 15, 2020 20:03
@JulienMaille
Contributor Author

@alalek jiapei100 is right, the #endif was including a float operation.
Correction here, shall I rebase and create a new PR?
JulienMaille@3d3ed03
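The shape of the fix can be sketched like this (the names and guard macro are illustrative stand-ins for the `__CUDA_ARCH__ >= 530` check in math.hpp; the point is that the float overload must sit outside the guard):

```cpp
// Stand-in for the real device-side check: #if __CUDA_ARCH__ >= 530
#define HAS_NATIVE_HALF 0

#if HAS_NATIVE_HALF
// Half-precision overload: compiled only where __half is supported.
half clampValue(half x, half lo, half hi);
#endif

// The float overload must stay OUTSIDE the guard; the misplaced #endif
// was accidentally hiding a float operation on older architectures.
inline float clampValue(float x, float lo, float hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}
```

With the guard drawn too wide, the float overload disappears on CC < 5.3 builds and every caller of it fails to compile or link.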

@alalek
Member

alalek commented Jan 16, 2020

@JulienMaille Sure! Feel free to prepare new PR with fix (add relates #16218 into description to add GitHub cross-link).

@sl1pkn07

sl1pkn07 commented Feb 13, 2020

please backport (or forward-port) to the opencv 4.2 tag

greetings

for example, my hardware setup doesn't have the CUDA 5.3 feature:

--     NVIDIA GPU arch:             30 35 37 50 52 60 61 70 75

(NVIDIA RTX 2060, CUDA 10.2)

a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023
…gpus

Enable cuda4dnn on hardware without support for __half

* Enable cuda4dnn on hardware without support for half (i.e. compute capability < 5.3)

Update CMakeLists.txt

Lowered minimum CC to 3.0

* UPD: added ifdef on new copy kernel

* added fp16 support detection at runtime

* Clarified #if condition on atomicAdd definition

* More explicit CMake error message


6 participants