dnn: refactor ONNX MatMul with fastGemm #24694
Conversation
The previous performance results of …
All todo items are checked! |
@asmorkalov, this PR looks good to me. It needs to be merged in order to merge the other important PR, #24476. |
@dkurt Please join the review too. |
```cpp
int total_tiles = m_tiles * n_tiles;
// ...
auto fn = [&](const Range &r) {
    char* packed_a = (char*)(use_stackbuff ? alloca(buff_size) : malloc(buff_size));
```
OpenCV's AutoBuffer makes sense here: https://docs.opencv.org/4.x/d8/dd0/classcv_1_1AutoBuffer.html. No risk of memory leaks, and it has built-in logic for the alloca case.
AutoBuffer's documentation says something like "if a temporary buffer is usually small (a few K's of memory)". A typical `buff_size` here would be

```
FAST_GEMM_F32_PACKED_STRIDE_K * (FAST_GEMM_F32_MC + FAST_GEMM_F32_NC) * 4 / 1024
= 64 * (144 + 72) * 4 / 1024 = 54 KB
```

Is 54 KB still considered to be "a few KBs"?
```cpp
int total_tiles = m_tiles * n_tiles;
// ...
auto fn = [&](const Range &r) {
    char* packed_a = (char*)(use_stackbuff ? alloca(buff_size) : malloc(buff_size));
```
The same idea for AutoBuffer.
```cpp
half **dev_C_slices = 0;
cudaMalloc((void**)&dev_A_slices, batch_count * sizeof(half*));
cudaMalloc((void**)&dev_B_slices, batch_count * sizeof(half*));
cudaMalloc((void**)&dev_C_slices, batch_count * sizeof(half*));
cudaMemcpy(dev_A_slices, A_slices, batch_count * sizeof(half*), cudaMemcpyHostToDevice);
cudaMemcpy(dev_B_slices, B_slices, batch_count * sizeof(half*), cudaMemcpyHostToDevice);
cudaMemcpy(dev_C_slices, C_slices, batch_count * sizeof(half*), cudaMemcpyHostToDevice);

CUDA4DNN_CHECK_CUBLAS(cublasHgemmBatched(handle.get(), opa, opb, iM, iN, iK, &alpha, dev_A_slices, ilda, dev_B_slices, ildb, &beta, dev_C_slices, ildc, batch_count));

cudaFree(dev_A_slices);
cudaFree(dev_B_slices);
cudaFree(dev_C_slices);
```
An optional optimization with streams is possible. E.g. create a stream, use cudaMemcpyAsync and cublasSetStream(). It reduces the number of CPU-GPU syncs.
Do we have examples demonstrating how to use these two APIs?
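Not OpenCV code, just a hedged sketch of the suggested stream variant, reusing the variable names from the diff above (`handle`, `batch_count`, `dev_*_slices`, and the GEMM parameters) and assuming the same `cudaMalloc` setup:

```cpp
// Queue the three host-to-device pointer-table copies and the batched GEMM
// on one stream, leaving a single synchronization point at the end.
cudaStream_t stream;
cudaStreamCreate(&stream);
cublasSetStream(handle.get(), stream);  // cuBLAS work now goes to `stream`

cudaMemcpyAsync(dev_A_slices, A_slices, batch_count * sizeof(half*),
                cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(dev_B_slices, B_slices, batch_count * sizeof(half*),
                cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(dev_C_slices, C_slices, batch_count * sizeof(half*),
                cudaMemcpyHostToDevice, stream);

// The GEMM is ordered after the async copies on the same stream.
CUDA4DNN_CHECK_CUBLAS(cublasHgemmBatched(handle.get(), opa, opb, iM, iN, iK,
                                         &alpha, dev_A_slices, ilda,
                                         dev_B_slices, ildb, &beta,
                                         dev_C_slices, ildc, batch_count));

cudaStreamSynchronize(stream);          // single CPU-GPU sync
cudaFree(dev_A_slices);
cudaFree(dev_B_slices);
cudaFree(dev_C_slices);
cudaStreamDestroy(stream);
```

One caveat: `cudaMemcpyAsync` only truly overlaps when the host buffers are pinned (`cudaHostAlloc`/`cudaHostRegister`); with pageable host memory the copies still serialize with respect to the host.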
By the way, Linux-RISC-V-Clang seems to have trouble starting jobs.
```cpp
cudaMalloc((void**)&dev_A_slices, batch_count * sizeof(float*));
cudaMalloc((void**)&dev_B_slices, batch_count * sizeof(float*));
cudaMalloc((void**)&dev_C_slices, batch_count * sizeof(float*));
cudaMemcpy(dev_A_slices, A_slices, batch_count * sizeof(float*), cudaMemcpyHostToDevice);
cudaMemcpy(dev_B_slices, B_slices, batch_count * sizeof(float*), cudaMemcpyHostToDevice);
cudaMemcpy(dev_C_slices, C_slices, batch_count * sizeof(float*), cudaMemcpyHostToDevice);

// cuBLAS is column-major
CUDA4DNN_CHECK_CUBLAS(cublasSgemmBatched(handle.get(), opa, opb, iM, iN, iK, &alpha, dev_A_slices, ilda, dev_B_slices, ildb, &beta, dev_C_slices, ildc, batch_count));

cudaFree(dev_A_slices);
cudaFree(dev_B_slices);
cudaFree(dev_C_slices);
```
The same optional recommendation here.
dnn: refactor ONNX MatMul with fastGemm opencv#24694

Done:

- [x] add backends
  - [x] CUDA
  - [x] OpenVINO
  - [x] CANN
  - [x] OpenCL
  - [x] Vulkan
- [x] add perf tests
- [x] const B case

### Benchmark

Tests are done on M1. All data is in milliseconds (ms).

| Configuration | MatMul (Prepacked) | MatMul | InnerProduct |
| - | - | - | - |
| A=[12, 197, 197], B=[12, 197, 64], trans_a=0, trans_b=0 | **0.39** | 0.41 | 1.33 |
| A=[12, 197, 64], B=[12, 64, 197], trans_a=0, trans_b=0 | **0.42** | 0.42 | 1.17 |
| A=[12, 50, 64], B=[12, 64, 50], trans_a=0, trans_b=0 | **0.13** | 0.15 | 0.33 |
| A=[12, 50, 50], B=[12, 50, 64], trans_a=0, trans_b=0 | **0.11** | 0.13 | 0.22 |
| A=[16, 197, 197], B=[16, 197, 64], trans_a=0, trans_b=0 | **0.46** | 0.54 | 1.46 |
| A=[16, 197, 64], B=[16, 64, 197], trans_a=0, trans_b=0 | **0.46** | 0.95 | 1.74 |
| A=[16, 50, 64], B=[16, 64, 50], trans_a=0, trans_b=0 | **0.18** | 0.32 | 0.43 |
| A=[16, 50, 50], B=[16, 50, 64], trans_a=0, trans_b=0 | **0.15** | 0.25 | 0.25 |

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [x] There is a reference to the original bug report and related work
- [x] There are accuracy tests, performance tests and test data in the opencv_extra repository, if applicable
  Patch to opencv_extra has the same branch name.
- [x] The feature is well documented and sample code can be built with the project CMake