
Resolve uncovered CUDA dnn layer #24080

Merged
asmorkalov merged 7 commits into opencv:4.x from dkurt:dnn_cuda_layers on Aug 3, 2023

Conversation

@dkurt
Member

@dkurt dkurt commented Jul 31, 2023

Pull Request Readiness Checklist

  • Gelu activation layer on CUDA
  • Try to relax GEMM from ONNX

resolves #24064

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@dkurt dkurt force-pushed the dnn_cuda_layers branch from d4f4c74 to 2f61118 Compare August 1, 2023 10:26
@dkurt
Member Author

dkurt commented Aug 1, 2023

Seems like the LayerNorm changes became bigger and wider in scope. It makes sense to move them to a separate PR.

@dkurt dkurt marked this pull request as ready for review August 1, 2023 13:19
@dkurt dkurt requested a review from WanliZhong August 1, 2023 17:14
@dkurt
Member Author

dkurt commented Aug 1, 2023

@WanliZhong, @fengyuentau, please review the part about GEMM. I have switched is_matmul to false when the weights are just 2D.

@dkurt dkurt requested a review from fengyuentau August 1, 2023 17:15
@asmorkalov
Contributor

@dkurt Thanks a lot for the patch. I experimented with the code. The matmul change is covered by accuracy tests and works well, but it is not covered by performance tests. Could you add a performance test to check for performance regressions before merge?

@asmorkalov asmorkalov added the pr: needs test New functionality requires minimal tests set label Aug 2, 2023
@asmorkalov asmorkalov self-requested a review August 2, 2023 10:59
Contributor

@asmorkalov asmorkalov left a comment


👍

@dkurt
Member Author

dkurt commented Aug 2, 2023

@asmorkalov, added a performance test. To compare, replace lp.set("is_matmul", weights.dims > 2); with lp.set("is_matmul", true); on the same branch. This is the only change made in the ONNX importer.

Geometric mean (ms)

Name of Test                                                         always  matmul    multidim vs always
                                                                     matmul  multidim  matmul (x-factor)
fc::Layer_FullyConnected::([5, 16, 512, 128], 256, false, OCV/CPU)  17.941  16.875     1.06   
fc::Layer_FullyConnected::([5, 16, 512, 128], 256, true, OCV/CPU)   19.657  19.652     1.00   
fc::Layer_FullyConnected::([5, 16, 512, 128], 512, false, OCV/CPU)  35.543  33.532     1.06   
fc::Layer_FullyConnected::([5, 16, 512, 128], 512, true, OCV/CPU)   39.381  39.283     1.00   
fc::Layer_FullyConnected::([5, 16, 512, 128], 1024, false, OCV/CPU) 71.357  68.120     1.05   
fc::Layer_FullyConnected::([5, 16, 512, 128], 1024, true, OCV/CPU)  80.729  81.594     0.99   
fc::Layer_FullyConnected::([5, 512, 384, 0], 256, false, OCV/CPU)   3.217   3.152      1.02   
fc::Layer_FullyConnected::([5, 512, 384, 0], 256, true, OCV/CPU)    3.326   3.301      1.01   
fc::Layer_FullyConnected::([5, 512, 384, 0], 512, false, OCV/CPU)   6.435   6.414      1.00   
fc::Layer_FullyConnected::([5, 512, 384, 0], 512, true, OCV/CPU)    6.718   6.741      1.00   
fc::Layer_FullyConnected::([5, 512, 384, 0], 1024, false, OCV/CPU)  17.074  17.059     1.00   
fc::Layer_FullyConnected::([5, 512, 384, 0], 1024, true, OCV/CPU)   17.356  17.388     1.00
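The importer-side change being benchmarked above can be sketched as a tiny predicate (a hedged illustration; needsMatMulBroadcast is a hypothetical name, not an OpenCV function):

```cpp
#include <cassert>

// Hypothetical helper mirroring the importer change discussed above:
// an ONNX Gemm/MatMul node only needs the matmul-broadcast path when its
// weight tensor has more than two dimensions; plain 2D weights can take
// the ordinary fully-connected (inner product) path.
static bool needsMatMulBroadcast(int weightsDims)
{
    return weightsDims > 2; // mirrors lp.set("is_matmul", weights.dims > 2)
}
```

With this heuristic, the common 2D-weights case avoids the per-matrix GEMM loop entirely, which is where the measured speedup for the "multidim" column comes from.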

Member

@fengyuentau fengyuentau left a comment


The Gemm part looks good to me. IIRC, is_matmul was added by Wanli and is used to deal with some optimization concerns. @WanliZhong Could you add some details here?

@dkurt
Member Author

dkurt commented Aug 3, 2023

@fengyuentau, is_matmul is not about optimization but about enabling matmul broadcast. It reuses the same implementation: #22828

if (isMatMul)
{
    // Broadcasted matrix multiplication: run one GEMM per matrix in the
    // batch, cycling through the stacked weight matrices.
    int matNum = input[0].total(0, inp1Dim - 2);
    int rowMatMul = oriMat.size[oriMat.dims - 2];
    Mat srcMatTmp = input[0].reshape(1, matNum);
    Mat dstMatTmp = output[0].reshape(1, matNum);
    int outerSize = input[0].size[inp1Dim - 2];
    int rowStart = -rowMatMul;
    for (int n = 0; n < matNum; ++n)
    {
        Mat srcMat = srcMatTmp.row(n).reshape(1, outerSize);
        Mat dstMat = dstMatTmp.row(n).reshape(1, outerSize);
        // Advance to the next stacked weight matrix; the modulo broadcasts
        // the weights when there are fewer weight matrices than inputs.
        rowStart = (rowStart + rowMatMul) % weightsMat.rows;
        Mat weiMat = weightsMat.rowRange(rowStart, rowStart + rowMatMul);
        const int nstripes = getNumThreads();
        FullyConnected::run(srcMat, weiMat, biasMat, dstMat, activ.get(), nstripes);
    }
}
else
{
    // Plain inner product: flatten everything after `axis` and run a single
    // GEMM against the whole weight matrix.
    int axisCan = normalize_axis(axis, inp1Dim);
    int outerSize = input[0].total(0, axisCan);
    for (size_t i = 0; i < input.size(); i++)
    {
        Mat srcMat = input[i].reshape(1, outerSize);
        Mat dstMat = output[i].reshape(1, outerSize);
        const int nstripes = getNumThreads();
        FullyConnected::run(srcMat, weightsMat, biasMat, dstMat, activ.get(), nstripes);
    }
}

Member

@WanliZhong WanliZhong left a comment


When the matrix is 2D, the implementation is the same whether or not is_matmul is true. Someday, matrix multiplication should be separated from the inner product implementation. :-)
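This 2D equivalence can be checked with the row-cycling arithmetic from the isMatMul branch quoted earlier (a minimal sketch; weightRowStart is a hypothetical helper, assuming the iterative rowStart update is equivalent to this closed form):

```cpp
#include <cassert>

// Closed form of the iterative update in the isMatMul branch: rowStart
// begins at -rowMatMul and advances by rowMatMul modulo the total number
// of weight rows, so batch element n reads weight rows
// [n * rowMatMul % weightRows, n * rowMatMul % weightRows + rowMatMul).
static int weightRowStart(int n, int rowMatMul, int weightRows)
{
    return (n * rowMatMul) % weightRows;
}
```

With 2D weights there is a single weight matrix (rowMatMul == weightRows), so weightRowStart is 0 for every n and each batch element multiplies the full weight matrix, which is exactly what the non-matmul branch does.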

Contributor

@asmorkalov asmorkalov left a comment


👍

@asmorkalov asmorkalov added category: gpu/cuda (contrib) OpenCV 4.0+: moved to opencv_contrib and removed pr: needs test New functionality requires minimal tests set labels Aug 3, 2023
@asmorkalov asmorkalov self-assigned this Aug 3, 2023
@asmorkalov asmorkalov merged commit 96f23e3 into opencv:4.x Aug 3, 2023
@asmorkalov asmorkalov added this to the 4.9.0 milestone Aug 3, 2023
@asmorkalov asmorkalov mentioned this pull request Aug 7, 2023
thewoz pushed a commit to thewoz/opencv that referenced this pull request Jan 4, 2024
Resolve uncovered CUDA dnn layer opencv#24080
thewoz pushed a commit to thewoz/opencv that referenced this pull request May 29, 2024
Resolve uncovered CUDA dnn layer opencv#24080


Development

Successfully merging this pull request may close these issues.

Doesnt use cuda for specific layer

4 participants