Gemm kernels for Intel GPU#8104
Conversation
…::run calls Kernel::run launch OCL gpu kernels and set a event callback function to decreate the ref count of UMat or remove UMat when the lauched workloads are completed. However, for some OCL kernels requires multiple call of Kernel::run function with some kernel parameter changes (e.g., input and output buffer offset) to get the final computation result. In the case, the current implementation requires unnecessary synchronization and cleanupMat. This fix requires the user to specify whether there will be more work or not. If there is no remaining computation, the Kernel::run will reset the kernel object Signed-off-by: Woo, Insoo <insoo.woo@intel.com>
The optimized kernels uses cl_intel_subgroups extension for better performance. Note: This optimized kernels will be part of ISAAC in a code generation way under MIT license. Signed-off-by: Woo, Insoo <insoo.woo@intel.com>
This patch fixes a OCV API compatibility error. The error was reported due to the interface changes of Kernel::run. To resolve the issue, An overloaded function of Kernel::run is added. It take a flag indicating whether there are more work to be done with the kernel object without releasing resources related to it. Signed-off-by: Woo, Insoo <insoo.woo@intel.com>
|
Functions are hiden by default (via compiler flags). |
|
The functions will be available when OpenCL is enabled (HAVE_OPENCL). |
Signed-off-by: Woo, Insoo <insoo.woo@intel.com>
|
Nice performance improvement! Thank you!
Measured on i5-6600 iGPU (Skylake) |
alalek
left a comment
There was a problem hiding this comment.
Current ocl::Kernel design doesn't support well multiple OpenCL kernel runs from single instance (especially concurrent runs). Proposed ocl::Kernel change doesn't look solid and it has many limitations.
Could you check performance of code from this branch on your device?
| const size_t gy = (size_t)(M + dy - 1) / dy; | ||
|
|
||
| size_t local[] = {lx, ly, 1}; | ||
| size_t global[] = {(gx + lx - 1) / lx * lx, (gy + ly - 1) / ly * ly, 1}; |
There was a problem hiding this comment.
(gx + lx - 1) / lx * lx -> gx
This is handled in the .run() method.
There was a problem hiding this comment.
"Proposed ocl::Kernel change doesn't look solid and it has many limitations."
I tried to make as small changes as possible. For the submission, I will take your recommendation.
There will be a fix required to reduce unnecessary overhead of creating an kernel object, setting kernel params, and release resources instead of creating a solid solution.
| (int) (A.offset / sizeof(float)), | ||
| ocl::KernelArg::PtrReadOnly(B), | ||
| (int) (B.offset / sizeof(float)), | ||
| ocl::KernelArg::PtrWriteOnly(D), |
There was a problem hiding this comment.
OpenCL code reads values from this buffer too, so PtrReadWrite(D) should be here.
| if (haveC && beta != 0.0) | ||
| { | ||
| ctrans ? transpose(matC, D) : matC.copyTo(D); | ||
| } |
There was a problem hiding this comment.
In the "else" case, we assume that "D" contains zeros. But this may be not true and "D" may have garbage values.
We can try to hope on zero "beta" in these expressions:
(start_index != 0) ? vload4(0, dst_write0): (float)beta * vload4(0, dst_write0);
but this probably will not work with NaN values in dst.
There was a problem hiding this comment.
The code is part of the exisiting code. I thought that a UMAT buffer is created with zero value initialization. I will fix my change as well as the existing code.
There was a problem hiding this comment.
found that the existing code does not need to be changed
There was a problem hiding this comment.
Zero value initialization works in case of buffer creation only.
But OpenCV can reuse buffers too (or pass ROI of buffer).
Branch with test: https://github.com/alalek/opencv/commits/pr_8104_test
Test results: Linux / Windows
P.S. Sometimes I saw sporadical "nan" values in dst on my Linux machine. Probably this is related to incorrect vectorized load/store operations (see another comment).
| w += TILE_K; | ||
| } | ||
|
|
||
| vstore4(dot00, 0, dst_write0); dst_write0 += ldC; |
There was a problem hiding this comment.
We can't use vectorized store in case of non-aligned data/sizes.
For example, for contiguous matrix 3x3 this will garbage memory in next row.
We need to add more checks into host code.
P.S. Similar problem on vload statements.
There was a problem hiding this comment.
I think that the current implementation is ok for data reads and writes. I have checked the spec (Please see the below).
The read address computed as (p + (offset * n)) must be 8-bit aligned if gentype is charn, ucharn; 16-bit aligned if gentype is shortn, ushortn; 32-bit aligned if gentype is intn, uintn, floatn; 64-bit aligned if gentype is longn, ulongn.
The write address computed as (p + (offset * n)) must be 8-bit aligned if gentype is charn, ucharn; 16-bit aligned if gentype is shortn, ushortn; 32-bit aligned if gentype is intn, uintn, floatn; 64-bit aligned if gentype is longn, ulongn.
There was a problem hiding this comment.
There is no problem with memory alignment, there is problem with vectorized access. You can't access/rewrite less that 4 elements - vstore4 can't make safe update of the last column of matrix 1005x1005 because vstore4 will touch columns 1,2,3 of the next row too in case of contiguous matrix.
Current implementation of kernel is fast, but it has some limitations. We need to determine and to "write" these implementations in the host code where we run OpenCL kernel:
- we have this check: dev.intelSubgroupsSupport()
- we have float type check
- we need to add more checks for src/dst sizes
BTW, this series of writes updates 8 rows "at once" (there is no checks), so the "host" condition should has something like this: (dst.rows & 7) == 0 (or (dst.rows & (local_size[1] - 1)) == 0) to run this kernel.
There was a problem hiding this comment.
In the host-side code, there is a conditional check to choose
intelblas_gemm_Buffer_NN_sp": if (M % 32 == 0 && N % 32 == 0 && K % 16 == 0).
When the condition is not met, "intellblas_gemm_buffer_NN" will be selected.
There was a problem hiding this comment.
Great!
I'm sorry for false alarm about this kernel.
|
I receive sporadical NaN values in the result on my branch with test (https://github.com/alalek/opencv/commits/pr_8104_test). Could you help to investigate this issue? There is output of previous iteration (for reference): |
|
The reason for incorrect output was that D is used without initialization. There were three patches pushed. The first two are reverting patches. |
When C is null and beta is non-zero, D is used without initialization. This resloves the issue Signed-off-by: Woo, Insoo <insoo.woo@intel.com>
|
@insoow Great! One problem is fixed. |
| float4 dot04 = (start_index != 0) ? vload4(0, dst_write0 + 4 * ldC) : (float)beta * vload4(0, dst_write0 + 4 * ldC); | ||
| float4 dot05 = (start_index != 0) ? vload4(0, dst_write0 + 5 * ldC) : (float)beta * vload4(0, dst_write0 + 5 * ldC); | ||
| float4 dot06 = (start_index != 0) ? vload4(0, dst_write0 + 6 * ldC) : (float)beta * vload4(0, dst_write0 + 6 * ldC); | ||
| float4 dot07 = (start_index != 0) ? vload4(0, dst_write0 + 7 * ldC) : (float)beta * vload4(0, dst_write0 + 7 * ldC); |
There was a problem hiding this comment.
This vload4 series probably reads out of buffer.
Including nan values:
Check code:
#define CHECK_NAN_(id, v) if(isnan(dot ## id . s ## v)) { printf("dot" #id ".s" #v " is NAN, lx=%d ly=%d\n", local_x, local_y); }
#define CHECK_NAN(id) CHECK_NAN_(id, 0) CHECK_NAN_(id, 1) CHECK_NAN_(id, 2) CHECK_NAN_(id, 3)
CHECK_NAN(00) CHECK_NAN(01) CHECK_NAN(02) CHECK_NAN(03)
CHECK_NAN(04) CHECK_NAN(05) CHECK_NAN(06) CHECK_NAN(07)
Sporadic results (dst in this case rows=10 cols=5):
OpenCV: OCL: Kernel::run(intelblas_gemm_buffer_NN)
dims=2 local=8x4x1 global=8x4x1
dot00.s0 is NAN, lx=3 ly=3
dot00.s0 is NAN, lx=1 ly=2
dot00.s0 is NAN, lx=7 ly=2
dot00.s0 is NAN, lx=5 ly=1
dot01.s3 is NAN, lx=1 ly=3
...
Probably these nan values are propagated via pipeline. I hope it is the reason of nan values from this comment.
Signed-off-by: Woo, Insoo <insoo.woo@intel.com>
|
Thank you! 👍 |
This pullrequest changes