Gemm kernels for Intel GPU by insoow · Pull Request #8104 · opencv/opencv

insoow · 2017-01-30T21:34:39Z

This pull request changes

…::run calls Kernel::run launches OCL GPU kernels and sets an event callback function to decrease the ref count of the UMat or remove the UMat when the launched workloads are completed. However, some OCL kernels require multiple calls of the Kernel::run function with some kernel parameter changes (e.g., input and output buffer offsets) to get the final computation result. In that case, the current implementation requires unnecessary synchronization and cleanupMat. This fix requires the user to specify whether there will be more work or not. If there is no remaining computation, Kernel::run will reset the kernel object. Signed-off-by: Woo, Insoo <insoo.woo@intel.com>

The optimized kernels use the cl_intel_subgroups extension for better performance. Note: These optimized kernels will be part of ISAAC in a code generation way under MIT license. Signed-off-by: Woo, Insoo <insoo.woo@intel.com>

This patch fixes an OCV API compatibility error. The error was reported due to the interface changes of Kernel::run. To resolve the issue, an overloaded function of Kernel::run is added. It takes a flag indicating whether there is more work to be done with the kernel object without releasing resources related to it. Signed-off-by: Woo, Insoo <insoo.woo@intel.com>

alalek · 2017-02-07T13:29:24Z

Functions are hidden by default (via compiler flags).
Could you please rename file intel_gpu_gemm.cpp to intel_gpu_gemm.inl.hpp and #include that file into matmul.cpp directly.
Update: Function intel_gpu_gemm should have static modifier (or anonymous namespace).

insoow · 2017-02-23T19:45:13Z

The functions will be available when OpenCL is enabled (HAVE_OPENCL).

Signed-off-by: Woo, Insoo <insoo.woo@intel.com>

alalek · 2017-03-02T13:11:21Z

Nice performance improvement! Thank you!

| Testcase Gemm::OCL_GemmFixture::* | Origin | Patch | Rate |
|---|---|---|---|
| (640x640, 0, 32FC1) | 8.820 ms | 2.336 ms | 3.78 |
| (640x640, GEMM_1_T, 32FC1) | 8.895 ms | 2.675 ms | 3.32 |
| (640x640, GEMM_1_T\|GEMM_2_T, 32FC1) | 9.276 ms | 2.544 ms | 3.65 |
| (640x640, GEMM_2_T, 32FC1) | 8.977 ms | 3.691 ms | 2.43 |
| (640x640, GEMM_2_T\|GEMM_3_T, 32FC1) | 9.026 ms | 3.681 ms | 2.45 |
| (640x640, GEMM_3_T, 32FC1) | 8.886 ms | 2.215 ms | 4.01 |
| (1280x1280, 0, 32FC1) | 72.351 ms | 18.333 ms | 3.95 |
| (1280x1280, GEMM_1_T, 32FC1) | 72.676 ms | 22.818 ms | 3.19 |
| (1280x1280, GEMM_1_T\|GEMM_2_T, 32FC1) | 73.262 ms | 24.942 ms | 2.94 |
| (1280x1280, GEMM_2_T, 32FC1) | 73.232 ms | 27.165 ms | 2.70 |
| (1280x1280, GEMM_2_T\|GEMM_3_T, 32FC1) | 73.491 ms | 26.148 ms | 2.81 |
| (1280x1280, GEMM_3_T, 32FC1) | 72.124 ms | 18.470 ms | 3.90 |

Measured on i5-6600 iGPU (Skylake)

alalek

Current ocl::Kernel design doesn't support multiple OpenCL kernel runs from a single instance well (especially concurrent runs). Proposed ocl::Kernel change doesn't look solid and it has many limitations.

Could you check performance of code from this branch on your device?

alalek · 2017-03-02T13:17:40Z

modules/core/src/intel_gpu_gemm.inl.hpp

+    const size_t gy = (size_t)(M + dy - 1) / dy;
+
+    size_t local[] = {lx, ly, 1};
+    size_t global[] = {(gx + lx - 1) / lx * lx, (gy + ly - 1) / ly * ly, 1};


(gx + lx - 1) / lx * lx -> gx
This is handled in the .run() method.
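
For reference, the expression in the diff rounds the global size up to the next multiple of the local size. A minimal C++ sketch of that arithmetic follows; the helper name `roundUpToMultiple` is hypothetical and not part of OpenCV, and the claim that `.run()` already takes care of this comes only from the comment above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an OpenCV function): round `value` up to the
// nearest multiple of `multiple`. Mirrors (gx + lx - 1) / lx * lx from the diff.
inline std::size_t roundUpToMultiple(std::size_t value, std::size_t multiple)
{
    assert(multiple > 0);  // a zero local size would divide by zero
    return (value + multiple - 1) / multiple * multiple;
}
```

If the rounding is indeed done inside `.run()`, passing plain `gx` and `gy` as the global sizes avoids doing the same work twice.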

"Proposed ocl::Kernel change doesn't look solid and it has many limitations."
I tried to keep the changes as small as possible. For this submission, I will follow your recommendation.

Instead of creating a solid solution, a fix will be required to reduce the unnecessary overhead of creating a kernel object, setting kernel params, and releasing resources.

alalek · 2017-03-02T13:26:05Z

modules/core/src/intel_gpu_gemm.inl.hpp

+           (int) (A.offset / sizeof(float)),
+           ocl::KernelArg::PtrReadOnly(B),
+           (int) (B.offset / sizeof(float)),
+           ocl::KernelArg::PtrWriteOnly(D),


OpenCL code reads values from this buffer too, so PtrReadWrite(D) should be here.

alalek · 2017-03-02T13:32:06Z

modules/core/src/matmul.cpp

+        if (haveC && beta != 0.0)
+        {
+            ctrans ? transpose(matC, D) : matC.copyTo(D);
+        }


In the "else" case, we assume that "D" contains zeros. But this may not be true, and "D" may contain garbage values.

We could hope that a zero "beta" takes care of this in expressions like these:

(start_index != 0) ? vload4(0, dst_write0): (float)beta * vload4(0, dst_write0);

but this probably will not work with NaN values in dst.

The code is part of the existing code. I thought that a UMat buffer is created with zero value initialization. I will fix my change as well as the existing code.

I found that the existing code does not need to be changed.

Zero value initialization happens only when a buffer is created.
But OpenCV can also reuse buffers (or pass an ROI of a buffer).

Branch with test: https://github.com/alalek/opencv/commits/pr_8104_test
Test results: Linux / Windows

P.S. Sometimes I saw sporadic "nan" values in dst on my Linux machine. This is probably related to incorrect vectorized load/store operations (see another comment).

This reverts commit 2ef427d. Conflicts: modules/core/src/intel_gpu_gemm.inl.hpp

…e Kernel::run calls" This reverts commit cc7f9f5.

alalek · 2017-03-03T11:09:09Z

modules/core/src/opencl/intel_gemm.cl

+        w += TILE_K;
+    }
+
+    vstore4(dot00, 0, dst_write0); dst_write0 += ldC;


We can't use a vectorized store in case of non-aligned data/sizes.
For example, for a contiguous 3x3 matrix this will overwrite memory in the next row.

We need to add more checks to the host code.

P.S. There is a similar problem with the vload statements.

I think that the current implementation is OK for data reads and writes. I have checked the spec (please see below).

The read address computed as (p + (offset * n)) must be 8-bit aligned if gentype is charn, ucharn; 16-bit aligned if gentype is shortn, ushortn; 32-bit aligned if gentype is intn, uintn, floatn; 64-bit aligned if gentype is longn, ulongn.

The write address computed as (p + (offset * n)) must be 8-bit aligned if gentype is charn, ucharn; 16-bit aligned if gentype is shortn, ushortn; 32-bit aligned if gentype is intn, uintn, floatn; 64-bit aligned if gentype is longn, ulongn.

There is no problem with memory alignment; the problem is with vectorized access. You can't access/rewrite fewer than 4 elements: vstore4 can't safely update the last column of a 1005x1005 matrix, because in a contiguous matrix vstore4 will touch columns 1, 2, 3 of the next row too.
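
The overrun described above can be put into numbers with a small C++ sketch. The helper `spilledElements` is hypothetical (it is not from the patch) and uses zero-based column indices, whereas the comment counts columns from 1.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: how many elements of a `width`-wide vector store that
// starts at column `startCol` fall past the end of a row with `cols` columns.
// In a contiguous matrix those elements land at the start of the next row.
inline std::size_t spilledElements(std::size_t cols, std::size_t startCol, std::size_t width)
{
    assert(startCol < cols);                   // the store must begin inside the row
    const std::size_t end = startCol + width;  // one past the last written column
    return end > cols ? end - cols : 0;
}
```

For a 1005-column row, a 4-wide store starting at column 1004 spills 3 elements into the next row; for the 3x3 case, a 4-wide store starting at column 0 spills 1.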

The current implementation of the kernel is fast, but it has some limitations. We need to determine these limitations and "write" them into the host code where we run the OpenCL kernel:

we have this check: dev.intelSubgroupsSupport()

we have the float type check

we need to add more checks for src/dst sizes

BTW, this series of writes updates 8 rows "at once" (there are no checks), so the "host" condition to run this kernel should include something like this: (dst.rows & 7) == 0 (or (dst.rows & (local_size[1] - 1)) == 0).
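
A minimal C++ sketch of such a host-side guard is shown below. The function name `rowsFitTile` is hypothetical, and the mask form only works when the tile height is a power of two, which is an assumption made explicit in the code.

```cpp
#include <cassert>

// Hypothetical host-side guard: true when `rows` is a multiple of `tileRows`.
// The bit-mask form is only valid when `tileRows` is a power of two.
inline bool rowsFitTile(int rows, int tileRows)
{
    assert(tileRows > 0 && (tileRows & (tileRows - 1)) == 0);
    return (rows & (tileRows - 1)) == 0;
}
```

With a tile height of 8, `rowsFitTile(640, 8)` passes while `rowsFitTile(10, 8)` does not, so a 10-row destination would have to take another code path.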

In the host-side code, there is a conditional check to choose
"intelblas_gemm_Buffer_NN_sp": if (M % 32 == 0 && N % 32 == 0 && K % 16 == 0).
When the condition is not met, "intelblas_gemm_buffer_NN" will be selected.
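
The selection rule quoted above can be sketched in C++ as follows. The enum and function names are hypothetical labels for the two kernel variants; only the condition itself is taken from the comment.

```cpp
#include <cassert>

// Hypothetical labels for the two kernel variants named in the comment.
enum class GemmVariant { NN_sp, NN };

// Sketch of the host-side choice: the "_sp" variant is used only when
// M and N are multiples of 32 and K is a multiple of 16.
inline GemmVariant chooseGemmVariant(int M, int N, int K)
{
    assert(M > 0 && N > 0 && K > 0);
    if (M % 32 == 0 && N % 32 == 0 && K % 16 == 0)
        return GemmVariant::NN_sp;
    return GemmVariant::NN;  // general fallback for all other sizes
}
```

For 10x10 inputs the condition fails and the general variant is chosen, which is consistent with the `Kernel::run(intelblas_gemm_buffer_NN)` lines in the test log of this thread.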

Great!
I'm sorry for the false alarm about this kernel.

alalek · 2017-03-13T23:55:23Z

I get sporadic NaN values in the result on my branch with the test (https://github.com/alalek/opencv/commits/pr_8104_test).
Test parameters: opencv_test_core --gtest_filter=OCL*gemm* --gtest_repeat=-1 --gtest_break_on_failure
Output example from my Linux workstation (OpenCL runtime SRB4: r4.0.59481, i5-6600):

...
Repeating all tests (iteration 146) . . .

Note: Google Test filter = OCL*gemm*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from OCL
[ RUN      ] OCL.gemm_reuse_D
OpenCV: OCL: Kernel::run(intelblas_gemm_buffer_NN)
OpenCV: OCL: Kernel::run(intelblas_gemm_buffer_NN)
a=
[0.4697172, 0.77045798, 0.86939859, 0.55501348, 0.61788052, 0.8690908, 0.59598577, 0.33389202, 0.52609754, 0.32366583;
 0.34521863, 0.053028613, 0.4671855, 0.89212525, 0.22870186, 0.42892265, 0.023268729, 0.8135829, 0.16387039, 0.89843047;
 0.27414906, 0.77042043, 0.71890688, 0.22070757, 0.56171542, 0.90053737, 0.36234802, 0.43648669, 0.090359062, 0.70275462;
 0.93804216, 0.16725916, 0.71459275, 0.53946143, 0.2292451, 0.54528487, 0.95082867, 0.63492537, 0.61040771, 0.72956884;
 0.81757414, 0.062038779, 0.61714286, 0.21701941, 0.25912783, 0.91782087, 0.13752073, 0.25208503, 0.30018359, 0.92475003;
 0.26533496, 0.11604837, 0.19059148, 0.010641068, 0.16111162, 0.060394943, 0.28253025, 0.13271004, 0.19695139, 0.9093287;
 0.10543111, 0.52785105, 0.39195511, 0.22460574, 0.97685224, 0.15073341, 0.27625704, 0.23560759, 0.70046741, 0.88694304;
 0.5709492, 0.40107918, 0.25452986, 0.66150975, 0.35463071, 0.03622815, 0.66591835, 0.15007454, 0.82831192, 0.054897636;
 0.81202829, 0.43369132, 0.68049365, 0.96994615, 0.99950004, 0.31446472, 0.21269783, 0.16224265, 0.13861021, 0.42665961;
 0.14776993, 0.64611834, 0.73192912, 0.45375571, 0.62054801, 0.32181042, 0.92752087, 0.57924128, 0.14379078, 0.17707342]
b=
[0.83263576, 0.29477626, 0.070021123, 0.011865854, 0.65267956, 0.014045805, 0.96897638, 0.6360665, 0.96251404, 0.35905412;
 0.59120524, 0.96354854, 0.23921493, 0.21219331, 0.085621566, 0.22553381, 0.25651896, 0.13891643, 0.64982808, 0.046107113;
 0.8545326, 0.79217458, 0.65596759, 0.61545265, 0.36808848, 0.35304511, 0.81363648, 0.11214325, 0.53980255, 0.59540892;
 0.36527705, 0.27361721, 0.45167589, 0.37423173, 0.2749688, 0.63643247, 0.5537324, 0.82840759, 0.42567188, 0.47756407;
 0.27917421, 0.26075748, 0.13848183, 0.96844661, 0.28806078, 0.020451367, 0.03953138, 0.39176151, 0.80402446, 0.5088976;
 0.24880663, 0.85834324, 0.56660467, 0.78648299, 0.27971894, 0.33192667, 0.57813317, 0.1013802, 0.98743951, 0.043715388;
 0.93579406, 0.094189912, 0.7937423, 0.997172, 0.43242633, 0.54615748, 0.11551228, 0.98970461, 0.93635565, 0.24968669;
 0.6763252, 0.084630996, 0.73001349, 0.62112808, 0.31084806, 0.97781426, 0.034763336, 0.70996177, 0.68390197, 0.69417715;
 0.46738112, 0.57731611, 0.76102859, 0.022845089, 0.6177811, 0.077412724, 0.24424142, 0.36688471, 0.93988723, 0.78573084;
 0.69067037, 0.88135958, 0.43611726, 0.90912604, 0.66283399, 0.54708374, 0.96870279, 0.49341825, 0.30526567, 0.82213616]
d1=
[3.4339712, 3.3018909, 2.874511, 3.301708, 2.1673174, 2.0114236, 2.7168427, 2.4728658, 4.3929782, 2.3993683;
 2.4835873, 2.1523409, 2.1498954, 2.5446389, 1.7926966, 2.2090154, 2.4205515, 2.2555299, 2.6125734, 2.4037178;
 2.9214618, 3.1150961, 2.3442979, 3.2174227, 1.7979124, 1.8984412, 2.4731083, 1.9046223, 3.5246439, 2.0348554;
 3.9956777, 2.8178856, 3.1597445, 3.3588972, -nan, 2.4187253, 3.1440182, 3.2677574, 4.4706631, 2.9257026;
 2.7029417, 2.7270441, 2.0557916, 2.5979862, 2.0933702, 1.5421755, 2.965065, 1.8537009, 3.2329962, 2.1449621;
 1.5905372, 1.3907683, 1.1002947, 1.547938, 1.2073319, 0.92409831, 1.4553438, 1.179099, 1.4453938, 1.3688226;
 2.4848886, 2.5278113, 2.0240541, 2.7472839, 1.8569282, 1.3925812, 1.8769736, 1.903693, 3.0348523, 2.4186354;
 2.429384, 1.662969, 1.9637458, 1.6941061, 1.6775753, 1.2332177, 1.6021049, 2.2346172, 3.071517, 1.8394243;
 2.8938382, 2.4820859, 1.9405041, 2.8026655, 1.9710373, 1.6105465, 2.6875355, 2.4670901, 3.528403, 2.3277476;
 2.9987693, 2.1835959, 2.4639745, 3.0621467, 1.6020653, 1.9954627, 1.7002187, 2.3867924, 3.4208841, 1.957451]
d2=
[6.8679423, 6.6037822, 5.749022, 6.603416, 4.3346348, 4.0228472, 5.4336858, 4.9457316, 8.7859564, 4.798737;
 4.9671745, 4.3046818, 4.2997909, 5.0892782, 3.5853932, 4.4180307, 4.8411031, 4.5110598, 5.2251468, 4.807435;
 5.8429232, 6.2301922, 4.6885953, 6.4348459, 3.595825, 3.7968826, 4.9462171, 3.8092444, 7.0492873, 4.0697107;
 7.9913564, 5.6357713, 6.3194895, 6.7177939, -nan, 4.83745, 6.2880363, 6.5355153, 8.9413252, 5.8514056;
 5.4058843, 5.4540882, 4.1115832, 5.1959729, 4.1867404, 3.0843511, 5.9301295, 3.7074013, 6.465992, 4.2899241;
 3.1810741, 2.7815366, 2.2005892, 3.095876, 2.4146638, 1.8481966, 2.9106874, 2.3581979, 2.8907876, 2.7376451;
 4.9697776, 5.0556226, 4.0481086, 5.4945679, 3.7138562, 2.7851624, 3.7539475, 3.8073862, 6.0697041, 4.8372707;
 4.858768, 3.325938, 3.9274917, 3.3882124, 3.3551505, 2.4664357, 3.2042098, 4.4692345, 6.1430345, 3.6788487;
 5.7876759, 4.9641714, 3.8810079, 5.6053309, 3.9420743, 3.2210932, 5.375071, 4.9341807, 7.0568066, 4.6554947;
 5.9975386, 4.3671918, 4.927949, 6.1242933, 3.2041309, 3.9909253, 3.4004376, 4.7735848, 6.8417678, 3.914902]
diff=
[3.4339712, 3.3018913, 2.874511, 3.301708, 2.1673174, 2.0114236, 2.7168431, 2.4728658, 4.3929782, 2.3993688;
 2.4835873, 2.1523409, 2.1498954, 2.5446393, 1.7926966, 2.2090154, 2.4205515, 2.2555299, 2.6125734, 2.4037173;
 2.9214613, 3.1150961, 2.3442974, 3.2174232, 1.7979126, 1.8984414, 2.4731088, 1.9046221, 3.5246434, 2.0348554;
 3.9956787, 2.8178856, 3.159745, 3.3588967, -nan, 2.4187248, 3.1440182, 3.2677579, 4.4706621, 2.925703;
 2.7029426, 2.7270441, 2.0557916, 2.5979867, 2.0933702, 1.5421755, 2.9650645, 1.8537004, 3.2329957, 2.1449621;
 1.590537, 1.3907683, 1.1002945, 1.547938, 1.2073319, 0.92409831, 1.4553436, 1.179099, 1.4453938, 1.3688226;
 2.484889, 2.5278113, 2.0240545, 2.7472839, 1.856928, 1.3925812, 1.8769739, 1.9036932, 3.0348518, 2.4186354;
 2.429384, 1.662969, 1.9637458, 1.6941063, 1.6775751, 1.233218, 1.6021049, 2.2346172, 3.0715175, 1.8394245;
 2.8938377, 2.4820855, 1.9405038, 2.8026655, 1.971037, 1.6105467, 2.6875355, 2.4670906, 3.5284035, 2.3277471;
 2.9987693, 2.1835959, 2.4639745, 3.0621467, 1.6020656, 1.9954627, 1.7002189, 2.3867924, 3.4208837, 1.957451]
unknown file: Failure
C++ exception with description "/home/alalek/projects/opencv/dev/modules/core/src/mathfuncs.cpp:1540: error: (-211) the value at (4, 3)=[-nan] is out of range [-179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000, 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000) in function checkRange
" thrown in the test body.

Could you help to investigate this issue?

There is output of previous iteration (for reference):

Repeating all tests (iteration 145) . . .

Note: Google Test filter = OCL*gemm*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from OCL
[ RUN      ] OCL.gemm_reuse_D
OpenCV: OCL: Kernel::run(intelblas_gemm_buffer_NN)
OpenCV: OCL: Kernel::run(intelblas_gemm_buffer_NN)
a=
[0.4697172, 0.77045798, 0.86939859, 0.55501348, 0.61788052, 0.8690908, 0.59598577, 0.33389202, 0.52609754, 0.32366583;
 0.34521863, 0.053028613, 0.4671855, 0.89212525, 0.22870186, 0.42892265, 0.023268729, 0.8135829, 0.16387039, 0.89843047;
 0.27414906, 0.77042043, 0.71890688, 0.22070757, 0.56171542, 0.90053737, 0.36234802, 0.43648669, 0.090359062, 0.70275462;
 0.93804216, 0.16725916, 0.71459275, 0.53946143, 0.2292451, 0.54528487, 0.95082867, 0.63492537, 0.61040771, 0.72956884;
 0.81757414, 0.062038779, 0.61714286, 0.21701941, 0.25912783, 0.91782087, 0.13752073, 0.25208503, 0.30018359, 0.92475003;
 0.26533496, 0.11604837, 0.19059148, 0.010641068, 0.16111162, 0.060394943, 0.28253025, 0.13271004, 0.19695139, 0.9093287;
 0.10543111, 0.52785105, 0.39195511, 0.22460574, 0.97685224, 0.15073341, 0.27625704, 0.23560759, 0.70046741, 0.88694304;
 0.5709492, 0.40107918, 0.25452986, 0.66150975, 0.35463071, 0.03622815, 0.66591835, 0.15007454, 0.82831192, 0.054897636;
 0.81202829, 0.43369132, 0.68049365, 0.96994615, 0.99950004, 0.31446472, 0.21269783, 0.16224265, 0.13861021, 0.42665961;
 0.14776993, 0.64611834, 0.73192912, 0.45375571, 0.62054801, 0.32181042, 0.92752087, 0.57924128, 0.14379078, 0.17707342]
b=
[0.83263576, 0.29477626, 0.070021123, 0.011865854, 0.65267956, 0.014045805, 0.96897638, 0.6360665, 0.96251404, 0.35905412;
 0.59120524, 0.96354854, 0.23921493, 0.21219331, 0.085621566, 0.22553381, 0.25651896, 0.13891643, 0.64982808, 0.046107113;
 0.8545326, 0.79217458, 0.65596759, 0.61545265, 0.36808848, 0.35304511, 0.81363648, 0.11214325, 0.53980255, 0.59540892;
 0.36527705, 0.27361721, 0.45167589, 0.37423173, 0.2749688, 0.63643247, 0.5537324, 0.82840759, 0.42567188, 0.47756407;
 0.27917421, 0.26075748, 0.13848183, 0.96844661, 0.28806078, 0.020451367, 0.03953138, 0.39176151, 0.80402446, 0.5088976;
 0.24880663, 0.85834324, 0.56660467, 0.78648299, 0.27971894, 0.33192667, 0.57813317, 0.1013802, 0.98743951, 0.043715388;
 0.93579406, 0.094189912, 0.7937423, 0.997172, 0.43242633, 0.54615748, 0.11551228, 0.98970461, 0.93635565, 0.24968669;
 0.6763252, 0.084630996, 0.73001349, 0.62112808, 0.31084806, 0.97781426, 0.034763336, 0.70996177, 0.68390197, 0.69417715;
 0.46738112, 0.57731611, 0.76102859, 0.022845089, 0.6177811, 0.077412724, 0.24424142, 0.36688471, 0.93988723, 0.78573084;
 0.69067037, 0.88135958, 0.43611726, 0.90912604, 0.66283399, 0.54708374, 0.96870279, 0.49341825, 0.30526567, 0.82213616]
d1=
[3.4339712, 3.3018909, 2.874511, 3.301708, 2.1673174, 2.0114236, 2.7168427, 2.4728658, 4.3929782, 2.3993683;
 2.4835873, 2.1523409, 2.1498954, 2.5446389, 1.7926966, 2.2090154, 2.4205515, 2.2555299, 2.6125734, 2.4037178;
 2.9214618, 3.1150961, 2.3442979, 3.2174227, 1.7979124, 1.8984412, 2.4731083, 1.9046223, 3.5246439, 2.0348554;
 3.9956777, 2.8178856, 3.1597445, 3.3588972, 2.7257032, 2.4187253, 3.1440182, 3.2677574, 4.4706631, 2.9257026;
 2.7029417, 2.7270441, 2.0557916, 2.5979862, 2.0933702, 1.5421755, 2.965065, 1.8537009, 3.2329962, 2.1449621;
 1.5905372, 1.3907683, 1.1002947, 1.547938, 1.2073319, 0.92409831, 1.4553438, 1.179099, 1.4453938, 1.3688226;
 2.4848886, 2.5278113, 2.0240541, 2.7472839, 1.8569282, 1.3925812, 1.8769736, 1.903693, 3.0348523, 2.4186354;
 2.429384, 1.662969, 1.9637458, 1.6941061, 1.6775753, 1.2332177, 1.6021049, 2.2346172, 3.071517, 1.8394243;
 2.8938382, 2.4820859, 1.9405041, 2.8026655, 1.9710373, 1.6105465, 2.6875355, 2.4670901, 3.528403, 2.3277476;
 2.9987693, 2.1835959, 2.4639745, 3.0621467, 1.6020653, 1.9954627, 1.7002187, 2.3867924, 3.4208841, 1.957451]
d2=
[6.8679423, 6.6037822, 5.749022, 6.603416, 4.3346348, 4.0228472, 5.4336858, 4.9457316, 8.7859564, 4.798737;
 4.9671745, 4.3046818, 4.2997909, 5.0892782, 3.5853932, 4.4180307, 4.8411031, 4.5110598, 5.2251468, 4.807435;
 5.8429232, 6.2301922, 4.6885953, 6.4348459, 3.595825, 3.7968826, 4.9462171, 3.8092444, 7.0492873, 4.0697107;
 7.9913564, 5.6357713, 6.3194895, 6.7177939, 5.451407, 4.83745, 6.2880363, 6.5355153, 8.9413252, 5.8514056;
 5.4058843, 5.4540882, 4.1115832, 5.1959729, 4.1867404, 3.0843511, 5.9301295, 3.7074013, 6.465992, 4.2899241;
 3.1810741, 2.7815366, 2.2005892, 3.095876, 2.4146638, 1.8481966, 2.9106874, 2.3581979, 2.8907876, 2.7376451;
 4.9697776, 5.0556226, 4.0481086, 5.4945679, 3.7138562, 2.7851624, 3.7539475, 3.8073862, 6.0697041, 4.8372707;
 4.858768, 3.325938, 3.9274917, 3.3882124, 3.3551505, 2.4664357, 3.2042098, 4.4692345, 6.1430345, 3.6788487;
 5.7876759, 4.9641714, 3.8810079, 5.6053309, 3.9420743, 3.2210932, 5.375071, 4.9341807, 7.0568066, 4.6554947;
 5.9975386, 4.3671918, 4.927949, 6.1242933, 3.2041309, 3.9909253, 3.4004376, 4.7735848, 6.8417678, 3.914902]
diff=
[3.4339712, 3.3018913, 2.874511, 3.301708, 2.1673174, 2.0114236, 2.7168431, 2.4728658, 4.3929782, 2.3993688;
 2.4835873, 2.1523409, 2.1498954, 2.5446393, 1.7926966, 2.2090154, 2.4205515, 2.2555299, 2.6125734, 2.4037173;
 2.9214613, 3.1150961, 2.3442974, 3.2174232, 1.7979126, 1.8984414, 2.4731088, 1.9046221, 3.5246434, 2.0348554;
 3.9956787, 2.8178856, 3.159745, 3.3588967, 2.7257037, 2.4187248, 3.1440182, 3.2677579, 4.4706621, 2.925703;
 2.7029426, 2.7270441, 2.0557916, 2.5979867, 2.0933702, 1.5421755, 2.9650645, 1.8537004, 3.2329957, 2.1449621;
 1.590537, 1.3907683, 1.1002945, 1.547938, 1.2073319, 0.92409831, 1.4553436, 1.179099, 1.4453938, 1.3688226;
 2.484889, 2.5278113, 2.0240545, 2.7472839, 1.856928, 1.3925812, 1.8769739, 1.9036932, 3.0348518, 2.4186354;
 2.429384, 1.662969, 1.9637458, 1.6941063, 1.6775751, 1.233218, 1.6021049, 2.2346172, 3.0715175, 1.8394245;
 2.8938377, 2.4820855, 1.9405038, 2.8026655, 1.971037, 1.6105467, 2.6875355, 2.4670906, 3.5284035, 2.3277471;
 2.9987693, 2.1835959, 2.4639745, 3.0621467, 1.6020656, 1.9954627, 1.7002189, 2.3867924, 3.4208837, 1.957451]
[       OK ] OCL.gemm_reuse_D (1 ms)

insoow · 2017-03-15T21:14:47Z

The reason for incorrect output was that D is used without initialization. There were three patches pushed. The first two are reverting patches.

When C is null and beta is non-zero, D is used without initialization. This resolves the issue. Signed-off-by: Woo, Insoo <insoo.woo@intel.com>

alalek · 2017-03-16T15:06:04Z

@insoow Great! One problem is fixed.
There is another problem with NaN values, because: 0.0 * nan produces nan again (instead of zero).
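
The IEEE 754 behaviour behind this remark can be checked with a small C++ sketch. Both function names are hypothetical, and the guarded variant is only one possible way to avoid the problem, not necessarily what the final patch does.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Plain scaling: a NaN already stored in `d` survives even when beta is zero,
// because 0 * NaN is NaN under IEEE 754.
inline float scaleNaive(float beta, float d)
{
    return beta * d;
}

// Guarded scaling (an assumed fix, for illustration only): ignore the old
// value of `d` entirely when beta is zero.
inline float scaleGuarded(float beta, float d)
{
    return beta == 0.0f ? 0.0f : beta * d;
}
```

With `d` set to `std::numeric_limits<float>::quiet_NaN()`, `scaleNaive(0.0f, d)` stays NaN while `scaleGuarded(0.0f, d)` yields zero. Note that compiler options such as fast-math modes may change NaN handling.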

alalek · 2017-03-16T15:14:14Z

modules/core/src/opencl/intel_gemm.cl

+    float4 dot04 = (start_index != 0) ? vload4(0, dst_write0 + 4 * ldC) : (float)beta * vload4(0, dst_write0 + 4 * ldC);
+    float4 dot05 = (start_index != 0) ? vload4(0, dst_write0 + 5 * ldC) : (float)beta * vload4(0, dst_write0 + 5 * ldC);
+    float4 dot06 = (start_index != 0) ? vload4(0, dst_write0 + 6 * ldC) : (float)beta * vload4(0, dst_write0 + 6 * ldC);
+    float4 dot07 = (start_index != 0) ? vload4(0, dst_write0 + 7 * ldC) : (float)beta * vload4(0, dst_write0 + 7 * ldC);


This vload4 series probably reads outside of the buffer,
including nan values:

Check code:

#define CHECK_NAN_(id, v) if(isnan(dot ## id . s ## v)) { printf("dot" #id ".s" #v " is NAN, lx=%d ly=%d\n", local_x, local_y); }
#define CHECK_NAN(id) CHECK_NAN_(id, 0) CHECK_NAN_(id, 1) CHECK_NAN_(id, 2) CHECK_NAN_(id, 3)
CHECK_NAN(00)
CHECK_NAN(01)
CHECK_NAN(02)
CHECK_NAN(03)
CHECK_NAN(04)
CHECK_NAN(05)
CHECK_NAN(06)
CHECK_NAN(07)

Sporadic results (dst in this case rows=10 cols=5):

OpenCV: OCL: Kernel::run(intelblas_gemm_buffer_NN) dims=2 local=8x4x1 global=8x4x1
dot00.s0 is NAN, lx=3 ly=3
dot00.s0 is NAN, lx=1 ly=2
dot00.s0 is NAN, lx=7 ly=2
dot00.s0 is NAN, lx=5 ly=1
dot01.s3 is NAN, lx=1 ly=3
...

These nan values are probably propagated through the pipeline. I hope this is the cause of the nan values from this comment.

Signed-off-by: Woo, Insoo <insoo.woo@intel.com>

alalek · 2017-04-19T09:57:20Z

Thank you! 👍

insoow added 3 commits January 30, 2017 15:20

GEMM kernel optimization for Intel GEN

8f5b66f

The optimized kernels use the cl_intel_subgroups extension for better performance. Note: These optimized kernels will be part of ISAAC in a code generation way under MIT license. Signed-off-by: Woo, Insoo <insoo.woo@intel.com>

Renaming intel_gpu_gemm.cpp to intel_gpu_gemm.inl.hpp

0295b3e

Signed-off-by: Woo, Insoo <insoo.woo@intel.com>

alalek reviewed Mar 2, 2017

insoow added 2 commits March 2, 2017 13:08

Revert "Fix API compatibility error"

1a1b689

This reverts commit 2ef427d. Conflicts: modules/core/src/intel_gpu_gemm.inl.hpp

Revert "Fix an issue with Kernel object reset release when consecutiv…

2628399

…e Kernel::run calls" This reverts commit cc7f9f5.

alalek reviewed Mar 3, 2017

Fix the case of uninitialization D

9d2a135

When C is null and beta is non-zero, D is used without initialization. This resolves the issue. Signed-off-by: Woo, Insoo <insoo.woo@intel.com>

alalek reviewed Mar 16, 2017

insoow and others added 3 commits March 24, 2017 09:30

fix potential output error due to 0 * nan

763bd30

Signed-off-by: Woo, Insoo <insoo.woo@intel.com>

whitespace fix, eliminate non-ASCII symbols

d8fb5f8

fix build warning

52bf9ad

alalek merged commit 2922738 into opencv:master Apr 19, 2017

alalek mentioned this pull request Sep 8, 2021

core(OpenCL): fix intel_gpu_gemm kernel requirements #20670

Merged
