Gemm kernels for Intel GPU #8104

Merged
alalek merged 10 commits into opencv:master from insoow:master
Apr 19, 2017
Contributor

insoow commented Jan 30, 2017

This pull request changes

…::run calls

Kernel::run launches OCL GPU kernels and sets an event callback function
to decrease the ref count of the UMat, or remove the UMat, when the launched
workloads are completed. However, some OCL kernels require multiple calls to
Kernel::run with some kernel parameter changes (e.g., input
and output buffer offsets) to get the final computation result.
In that case, the current implementation requires unnecessary
synchronization and cleanupMat.

This fix requires the user to specify whether there will be more work or not.
If there is no remaining computation, Kernel::run will reset the
kernel object.

Signed-off-by: Woo, Insoo <insoo.woo@intel.com>
The optimized kernels use the cl_intel_subgroups extension for better
performance.

Note: These optimized kernels will be part of ISAAC, in code-generated
form, under the MIT license.

Signed-off-by: Woo, Insoo <insoo.woo@intel.com>
This patch fixes an OCV API compatibility error. The error was reported
due to the interface change of Kernel::run. To resolve the issue,
an overloaded Kernel::run function is added. It takes a flag indicating
whether there is more work to be done with the kernel object without
releasing the resources related to it.

Signed-off-by: Woo, Insoo <insoo.woo@intel.com>
Member

alalek commented Feb 7, 2017

Functions are hidden by default (via compiler flags).
Could you please rename file intel_gpu_gemm.cpp to intel_gpu_gemm.inl.hpp and #include that file into matmul.cpp directly.
Update: Function intel_gpu_gemm should have static modifier (or anonymous namespace).

Contributor Author

insoow commented Feb 23, 2017

The functions will be available when OpenCL is enabled (HAVE_OPENCL).

Signed-off-by: Woo, Insoo <insoo.woo@intel.com>
Member

alalek commented Mar 2, 2017

Nice performance improvement! Thank you!

| Testcase Gemm::OCL_GemmFixture::* | Origin | Patch | Rate |
|---|---|---|---|
| (640x640, 0, 32FC1) | 8.820 ms | 2.336 ms | 3.78 |
| (640x640, GEMM_1_T, 32FC1) | 8.895 ms | 2.675 ms | 3.32 |
| (640x640, GEMM_1_T\|GEMM_2_T, 32FC1) | 9.276 ms | 2.544 ms | 3.65 |
| (640x640, GEMM_2_T, 32FC1) | 8.977 ms | 3.691 ms | 2.43 |
| (640x640, GEMM_2_T\|GEMM_3_T, 32FC1) | 9.026 ms | 3.681 ms | 2.45 |
| (640x640, GEMM_3_T, 32FC1) | 8.886 ms | 2.215 ms | 4.01 |
| (1280x1280, 0, 32FC1) | 72.351 ms | 18.333 ms | 3.95 |
| (1280x1280, GEMM_1_T, 32FC1) | 72.676 ms | 22.818 ms | 3.19 |
| (1280x1280, GEMM_1_T\|GEMM_2_T, 32FC1) | 73.262 ms | 24.942 ms | 2.94 |
| (1280x1280, GEMM_2_T, 32FC1) | 73.232 ms | 27.165 ms | 2.70 |
| (1280x1280, GEMM_2_T\|GEMM_3_T, 32FC1) | 73.491 ms | 26.148 ms | 2.81 |
| (1280x1280, GEMM_3_T, 32FC1) | 72.124 ms | 18.470 ms | 3.90 |

Measured on i5-6600 iGPU (Skylake)

Member

alalek left a comment

The current ocl::Kernel design doesn't support multiple OpenCL kernel runs from a single instance well (especially concurrent runs). The proposed ocl::Kernel change doesn't look solid, and it has many limitations.

Could you check performance of code from this branch on your device?

const size_t gy = (size_t)(M + dy - 1) / dy;

size_t local[] = {lx, ly, 1};
size_t global[] = {(gx + lx - 1) / lx * lx, (gy + ly - 1) / ly * ly, 1};
Member

(gx + lx - 1) / lx * lx -> gx
This is handled in the .run() method.
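
For reference, the expression under review rounds a global size up to the next multiple of the local size. A minimal standalone sketch of that arithmetic follows; `roundUp` and `roundingIsIdempotent` are hypothetical helper names for illustration, not OpenCV functions, and the sketch does not reproduce what `.run()` does internally.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of OpenCV): round `value` up to the nearest
// multiple of `multiple`, i.e. the (gx + lx - 1) / lx * lx pattern.
inline std::size_t roundUp(std::size_t value, std::size_t multiple)
{
    assert(multiple != 0);
    return (value + multiple - 1) / multiple * multiple;
}

// Rounding an already rounded value changes nothing, so doing the rounding
// in the caller is redundant whenever the callee rounds as well.
inline bool roundingIsIdempotent(std::size_t value, std::size_t multiple)
{
    const std::size_t once = roundUp(value, multiple);
    return roundUp(once, multiple) == once;
}
```

Because the rounding is idempotent, passing the unrounded `gx` to a callee that rounds internally yields the same launch size as passing the pre-rounded value.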

Contributor Author

"Proposed ocl::Kernel change doesn't look solid and it has many limitations."
I tried to make the changes as small as possible. For this submission, I will take your recommendation.

A fix will still be required to reduce the unnecessary overhead of creating a kernel object, setting kernel params, and releasing resources, instead of creating a solid solution.

(int) (A.offset / sizeof(float)),
ocl::KernelArg::PtrReadOnly(B),
(int) (B.offset / sizeof(float)),
ocl::KernelArg::PtrWriteOnly(D),
Member

The OpenCL code also reads values from this buffer, so PtrReadWrite(D) should be used here.

Contributor Author

I agree.

if (haveC && beta != 0.0)
{
ctrans ? transpose(matC, D) : matC.copyTo(D);
}
Member

In the "else" case, we assume that "D" contains zeros. But this may not be true, and "D" may contain garbage values.

We could try to rely on a zero "beta" in these expressions:

(start_index != 0) ? vload4(0, dst_write0): (float)beta * vload4(0, dst_write0);

but this probably will not work with NaN values in dst.
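
The concern is that multiplying by a zero `beta` does not erase a NaN already present in the destination. A minimal host-side C++ sketch of the difference between scaling unconditionally and guarding on `beta == 0` is shown below; `scaleNaive` and `scaleGuarded` are hypothetical names for illustration and are not taken from the kernel source.

```cpp
#include <cassert>
#include <cmath>

// Naive scaling: relies on beta == 0 to wipe out the old destination value.
// Under IEEE 754 arithmetic, 0 * NaN is NaN, so garbage can survive.
inline float scaleNaive(float beta, float oldValue)
{
    return beta * oldValue;
}

// Guarded scaling: when beta == 0 the old value is ignored entirely,
// so NaN or other garbage in the destination cannot leak into the result.
inline float scaleGuarded(float beta, float oldValue)
{
    assert(!std::isnan(beta));  // beta itself is expected to be a valid scalar
    return (beta == 0.0f) ? 0.0f : beta * oldValue;
}
```

With a NaN in the old value, `scaleNaive(0.0f, oldValue)` stays NaN while `scaleGuarded(0.0f, oldValue)` returns zero. This assumes the code is built without aggressive floating-point optimizations such as `-ffast-math`.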

Contributor Author

The code is part of the existing code. I thought that a UMat buffer is created with zero-value initialization. I will fix my change as well as the existing code.

Contributor Author

I found that the existing code does not need to be changed.

Member

Zero-value initialization happens only when a buffer is created.
But OpenCV can also reuse buffers (or pass an ROI of a buffer).

Branch with test: https://github.com/alalek/opencv/commits/pr_8104_test
Test results: Linux / Windows

P.S. Sometimes I saw sporadic "nan" values in dst on my Linux machine. This is probably related to incorrect vectorized load/store operations (see another comment).

insoow added 2 commits March 2, 2017 13:08
This reverts commit 2ef427d.

Conflicts:
	modules/core/src/intel_gpu_gemm.inl.hpp
w += TILE_K;
}

vstore4(dot00, 0, dst_write0); dst_write0 += ldC;
Member

We can't use vectorized stores with non-aligned data/sizes.
For example, for a contiguous 3x3 matrix this will clobber memory in the next row.

We need to add more checks to the host code.

P.S. The vload statements have a similar problem.
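
The 3x3 case can be illustrated on the host with a small C++ simulation. This is only a sketch: `vstore4Sim` is a hypothetical stand-in that mimics the addressing of OpenCL's `vstore4` (four consecutive floats written at `p + 4 * offset`), and `storeIntoRow0` is an illustrative helper, not code from the patch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for OpenCL's vstore4: writes 4 consecutive floats
// starting at p + 4 * offset.
inline void vstore4Sim(const std::array<float, 4>& v, std::size_t offset, float* p)
{
    assert(p != nullptr);
    for (std::size_t i = 0; i < 4; ++i)
        p[4 * offset + i] = v[i];
}

// Fill a contiguous 3x3 matrix with `marker`, then store a 4-wide vector at
// the start of row 0. The matrix is returned so the caller can inspect row 1.
inline std::array<float, 9> storeIntoRow0(float marker)
{
    std::array<float, 9> m;
    m.fill(marker);
    const std::array<float, 4> v = {1.0f, 2.0f, 3.0f, 4.0f};
    vstore4Sim(v, 0, m.data());
    return m;
}
```

Row 0 occupies indices 0 to 2, so the fourth lane of the store lands on index 3, which is the first element of row 1. After `storeIntoRow0(-1.0f)`, that element holds 4.0f instead of the marker value.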

Contributor Author

insoow, Mar 13, 2017

I think that the current implementation is OK for data reads and writes. I have checked the spec (please see below).

The read address computed as (p + (offset * n)) must be 8-bit aligned if gentype is charn, ucharn; 16-bit aligned if gentype is shortn, ushortn; 32-bit aligned if gentype is intn, uintn, floatn; 64-bit aligned if gentype is longn, ulongn.

The write address computed as (p + (offset * n)) must be 8-bit aligned if gentype is charn, ucharn; 16-bit aligned if gentype is shortn, ushortn; 32-bit aligned if gentype is intn, uintn, floatn; 64-bit aligned if gentype is longn, ulongn.

Member

There is no problem with memory alignment; the problem is with vectorized access. You can't access/rewrite fewer than 4 elements: vstore4 can't safely update the last column of a 1005x1005 matrix, because for a contiguous matrix vstore4 will also touch columns 1, 2, 3 of the next row.

The current implementation of the kernel is fast, but it has some limitations. We need to determine these limitations and "write" them into the host code where we run the OpenCL kernel:

  • we have this check: dev.intelSubgroupsSupport()
  • we have a float type check
  • we need to add more checks for src/dst sizes

BTW, this series of writes updates 8 rows "at once" (there are no checks), so the "host" condition should include something like (dst.rows & 7) == 0 (or (dst.rows & (local_size[1] - 1)) == 0) to run this kernel.

Contributor Author

In the host-side code, there is a conditional check to choose
"intelblas_gemm_buffer_NN_sp": if (M % 32 == 0 && N % 32 == 0 && K % 16 == 0).
When the condition is not met, "intelblas_gemm_buffer_NN" will be selected.
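
The dispatch described in this reply can be sketched as a small standalone function. This is an illustration only: `selectGemmKernelNN` is a hypothetical name, the kernel names are spelled as in the test log output elsewhere in this thread (an assumption for the `_sp` variant), and the real host code may apply further checks.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of the host-side selection: the specialized kernel is
// chosen only when all three dimensions satisfy the quoted divisibility
// condition; otherwise the generic kernel is used.
inline std::string selectGemmKernelNN(int M, int N, int K)
{
    assert(M > 0 && N > 0 && K > 0);
    const bool tileAligned = (M % 32 == 0) && (N % 32 == 0) && (K % 16 == 0);
    return tileAligned ? "intelblas_gemm_buffer_NN_sp"
                       : "intelblas_gemm_buffer_NN";
}
```

Under this condition a 640x640x640 problem would take the specialized path, while sizes such as 1005x1005 or the 10x5 case from the NaN report would fall back to the generic kernel.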

Member

Great!
I'm sorry for the false alarm about this kernel.

Member

alalek commented Mar 13, 2017

I get sporadic NaN values in the result on my branch with the test (https://github.com/alalek/opencv/commits/pr_8104_test).
Test parameters: opencv_test_core --gtest_filter=OCL*gemm* --gtest_repeat=-1 --gtest_break_on_failure
Output example from my Linux workstation (OpenCL runtime SRB4: r4.0.59481, i5-6600):

...
Repeating all tests (iteration 146) . . .

Note: Google Test filter = OCL*gemm*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from OCL
[ RUN      ] OCL.gemm_reuse_D
OpenCV: OCL: Kernel::run(intelblas_gemm_buffer_NN)
OpenCV: OCL: Kernel::run(intelblas_gemm_buffer_NN)
a=
[0.4697172, 0.77045798, 0.86939859, 0.55501348, 0.61788052, 0.8690908, 0.59598577, 0.33389202, 0.52609754, 0.32366583;
 0.34521863, 0.053028613, 0.4671855, 0.89212525, 0.22870186, 0.42892265, 0.023268729, 0.8135829, 0.16387039, 0.89843047;
 0.27414906, 0.77042043, 0.71890688, 0.22070757, 0.56171542, 0.90053737, 0.36234802, 0.43648669, 0.090359062, 0.70275462;
 0.93804216, 0.16725916, 0.71459275, 0.53946143, 0.2292451, 0.54528487, 0.95082867, 0.63492537, 0.61040771, 0.72956884;
 0.81757414, 0.062038779, 0.61714286, 0.21701941, 0.25912783, 0.91782087, 0.13752073, 0.25208503, 0.30018359, 0.92475003;
 0.26533496, 0.11604837, 0.19059148, 0.010641068, 0.16111162, 0.060394943, 0.28253025, 0.13271004, 0.19695139, 0.9093287;
 0.10543111, 0.52785105, 0.39195511, 0.22460574, 0.97685224, 0.15073341, 0.27625704, 0.23560759, 0.70046741, 0.88694304;
 0.5709492, 0.40107918, 0.25452986, 0.66150975, 0.35463071, 0.03622815, 0.66591835, 0.15007454, 0.82831192, 0.054897636;
 0.81202829, 0.43369132, 0.68049365, 0.96994615, 0.99950004, 0.31446472, 0.21269783, 0.16224265, 0.13861021, 0.42665961;
 0.14776993, 0.64611834, 0.73192912, 0.45375571, 0.62054801, 0.32181042, 0.92752087, 0.57924128, 0.14379078, 0.17707342]
b=
[0.83263576, 0.29477626, 0.070021123, 0.011865854, 0.65267956, 0.014045805, 0.96897638, 0.6360665, 0.96251404, 0.35905412;
 0.59120524, 0.96354854, 0.23921493, 0.21219331, 0.085621566, 0.22553381, 0.25651896, 0.13891643, 0.64982808, 0.046107113;
 0.8545326, 0.79217458, 0.65596759, 0.61545265, 0.36808848, 0.35304511, 0.81363648, 0.11214325, 0.53980255, 0.59540892;
 0.36527705, 0.27361721, 0.45167589, 0.37423173, 0.2749688, 0.63643247, 0.5537324, 0.82840759, 0.42567188, 0.47756407;
 0.27917421, 0.26075748, 0.13848183, 0.96844661, 0.28806078, 0.020451367, 0.03953138, 0.39176151, 0.80402446, 0.5088976;
 0.24880663, 0.85834324, 0.56660467, 0.78648299, 0.27971894, 0.33192667, 0.57813317, 0.1013802, 0.98743951, 0.043715388;
 0.93579406, 0.094189912, 0.7937423, 0.997172, 0.43242633, 0.54615748, 0.11551228, 0.98970461, 0.93635565, 0.24968669;
 0.6763252, 0.084630996, 0.73001349, 0.62112808, 0.31084806, 0.97781426, 0.034763336, 0.70996177, 0.68390197, 0.69417715;
 0.46738112, 0.57731611, 0.76102859, 0.022845089, 0.6177811, 0.077412724, 0.24424142, 0.36688471, 0.93988723, 0.78573084;
 0.69067037, 0.88135958, 0.43611726, 0.90912604, 0.66283399, 0.54708374, 0.96870279, 0.49341825, 0.30526567, 0.82213616]
d1=
[3.4339712, 3.3018909, 2.874511, 3.301708, 2.1673174, 2.0114236, 2.7168427, 2.4728658, 4.3929782, 2.3993683;
 2.4835873, 2.1523409, 2.1498954, 2.5446389, 1.7926966, 2.2090154, 2.4205515, 2.2555299, 2.6125734, 2.4037178;
 2.9214618, 3.1150961, 2.3442979, 3.2174227, 1.7979124, 1.8984412, 2.4731083, 1.9046223, 3.5246439, 2.0348554;
 3.9956777, 2.8178856, 3.1597445, 3.3588972, -nan, 2.4187253, 3.1440182, 3.2677574, 4.4706631, 2.9257026;
 2.7029417, 2.7270441, 2.0557916, 2.5979862, 2.0933702, 1.5421755, 2.965065, 1.8537009, 3.2329962, 2.1449621;
 1.5905372, 1.3907683, 1.1002947, 1.547938, 1.2073319, 0.92409831, 1.4553438, 1.179099, 1.4453938, 1.3688226;
 2.4848886, 2.5278113, 2.0240541, 2.7472839, 1.8569282, 1.3925812, 1.8769736, 1.903693, 3.0348523, 2.4186354;
 2.429384, 1.662969, 1.9637458, 1.6941061, 1.6775753, 1.2332177, 1.6021049, 2.2346172, 3.071517, 1.8394243;
 2.8938382, 2.4820859, 1.9405041, 2.8026655, 1.9710373, 1.6105465, 2.6875355, 2.4670901, 3.528403, 2.3277476;
 2.9987693, 2.1835959, 2.4639745, 3.0621467, 1.6020653, 1.9954627, 1.7002187, 2.3867924, 3.4208841, 1.957451]
d2=
[6.8679423, 6.6037822, 5.749022, 6.603416, 4.3346348, 4.0228472, 5.4336858, 4.9457316, 8.7859564, 4.798737;
 4.9671745, 4.3046818, 4.2997909, 5.0892782, 3.5853932, 4.4180307, 4.8411031, 4.5110598, 5.2251468, 4.807435;
 5.8429232, 6.2301922, 4.6885953, 6.4348459, 3.595825, 3.7968826, 4.9462171, 3.8092444, 7.0492873, 4.0697107;
 7.9913564, 5.6357713, 6.3194895, 6.7177939, -nan, 4.83745, 6.2880363, 6.5355153, 8.9413252, 5.8514056;
 5.4058843, 5.4540882, 4.1115832, 5.1959729, 4.1867404, 3.0843511, 5.9301295, 3.7074013, 6.465992, 4.2899241;
 3.1810741, 2.7815366, 2.2005892, 3.095876, 2.4146638, 1.8481966, 2.9106874, 2.3581979, 2.8907876, 2.7376451;
 4.9697776, 5.0556226, 4.0481086, 5.4945679, 3.7138562, 2.7851624, 3.7539475, 3.8073862, 6.0697041, 4.8372707;
 4.858768, 3.325938, 3.9274917, 3.3882124, 3.3551505, 2.4664357, 3.2042098, 4.4692345, 6.1430345, 3.6788487;
 5.7876759, 4.9641714, 3.8810079, 5.6053309, 3.9420743, 3.2210932, 5.375071, 4.9341807, 7.0568066, 4.6554947;
 5.9975386, 4.3671918, 4.927949, 6.1242933, 3.2041309, 3.9909253, 3.4004376, 4.7735848, 6.8417678, 3.914902]
diff=
[3.4339712, 3.3018913, 2.874511, 3.301708, 2.1673174, 2.0114236, 2.7168431, 2.4728658, 4.3929782, 2.3993688;
 2.4835873, 2.1523409, 2.1498954, 2.5446393, 1.7926966, 2.2090154, 2.4205515, 2.2555299, 2.6125734, 2.4037173;
 2.9214613, 3.1150961, 2.3442974, 3.2174232, 1.7979126, 1.8984414, 2.4731088, 1.9046221, 3.5246434, 2.0348554;
 3.9956787, 2.8178856, 3.159745, 3.3588967, -nan, 2.4187248, 3.1440182, 3.2677579, 4.4706621, 2.925703;
 2.7029426, 2.7270441, 2.0557916, 2.5979867, 2.0933702, 1.5421755, 2.9650645, 1.8537004, 3.2329957, 2.1449621;
 1.590537, 1.3907683, 1.1002945, 1.547938, 1.2073319, 0.92409831, 1.4553436, 1.179099, 1.4453938, 1.3688226;
 2.484889, 2.5278113, 2.0240545, 2.7472839, 1.856928, 1.3925812, 1.8769739, 1.9036932, 3.0348518, 2.4186354;
 2.429384, 1.662969, 1.9637458, 1.6941063, 1.6775751, 1.233218, 1.6021049, 2.2346172, 3.0715175, 1.8394245;
 2.8938377, 2.4820855, 1.9405038, 2.8026655, 1.971037, 1.6105467, 2.6875355, 2.4670906, 3.5284035, 2.3277471;
 2.9987693, 2.1835959, 2.4639745, 3.0621467, 1.6020656, 1.9954627, 1.7002189, 2.3867924, 3.4208837, 1.957451]
unknown file: Failure
C++ exception with description "/home/alalek/projects/opencv/dev/modules/core/src/mathfuncs.cpp:1540: error: (-211) the value at (4, 3)=[-nan] is out of range [-179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000, 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000) in function checkRange
" thrown in the test body.

Could you help to investigate this issue?

Here is the output of the previous iteration (for reference):

Repeating all tests (iteration 145) . . .

Note: Google Test filter = OCL*gemm*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from OCL
[ RUN      ] OCL.gemm_reuse_D
OpenCV: OCL: Kernel::run(intelblas_gemm_buffer_NN)
OpenCV: OCL: Kernel::run(intelblas_gemm_buffer_NN)
a=
[0.4697172, 0.77045798, 0.86939859, 0.55501348, 0.61788052, 0.8690908, 0.59598577, 0.33389202, 0.52609754, 0.32366583;
 0.34521863, 0.053028613, 0.4671855, 0.89212525, 0.22870186, 0.42892265, 0.023268729, 0.8135829, 0.16387039, 0.89843047;
 0.27414906, 0.77042043, 0.71890688, 0.22070757, 0.56171542, 0.90053737, 0.36234802, 0.43648669, 0.090359062, 0.70275462;
 0.93804216, 0.16725916, 0.71459275, 0.53946143, 0.2292451, 0.54528487, 0.95082867, 0.63492537, 0.61040771, 0.72956884;
 0.81757414, 0.062038779, 0.61714286, 0.21701941, 0.25912783, 0.91782087, 0.13752073, 0.25208503, 0.30018359, 0.92475003;
 0.26533496, 0.11604837, 0.19059148, 0.010641068, 0.16111162, 0.060394943, 0.28253025, 0.13271004, 0.19695139, 0.9093287;
 0.10543111, 0.52785105, 0.39195511, 0.22460574, 0.97685224, 0.15073341, 0.27625704, 0.23560759, 0.70046741, 0.88694304;
 0.5709492, 0.40107918, 0.25452986, 0.66150975, 0.35463071, 0.03622815, 0.66591835, 0.15007454, 0.82831192, 0.054897636;
 0.81202829, 0.43369132, 0.68049365, 0.96994615, 0.99950004, 0.31446472, 0.21269783, 0.16224265, 0.13861021, 0.42665961;
 0.14776993, 0.64611834, 0.73192912, 0.45375571, 0.62054801, 0.32181042, 0.92752087, 0.57924128, 0.14379078, 0.17707342]
b=
[0.83263576, 0.29477626, 0.070021123, 0.011865854, 0.65267956, 0.014045805, 0.96897638, 0.6360665, 0.96251404, 0.35905412;
 0.59120524, 0.96354854, 0.23921493, 0.21219331, 0.085621566, 0.22553381, 0.25651896, 0.13891643, 0.64982808, 0.046107113;
 0.8545326, 0.79217458, 0.65596759, 0.61545265, 0.36808848, 0.35304511, 0.81363648, 0.11214325, 0.53980255, 0.59540892;
 0.36527705, 0.27361721, 0.45167589, 0.37423173, 0.2749688, 0.63643247, 0.5537324, 0.82840759, 0.42567188, 0.47756407;
 0.27917421, 0.26075748, 0.13848183, 0.96844661, 0.28806078, 0.020451367, 0.03953138, 0.39176151, 0.80402446, 0.5088976;
 0.24880663, 0.85834324, 0.56660467, 0.78648299, 0.27971894, 0.33192667, 0.57813317, 0.1013802, 0.98743951, 0.043715388;
 0.93579406, 0.094189912, 0.7937423, 0.997172, 0.43242633, 0.54615748, 0.11551228, 0.98970461, 0.93635565, 0.24968669;
 0.6763252, 0.084630996, 0.73001349, 0.62112808, 0.31084806, 0.97781426, 0.034763336, 0.70996177, 0.68390197, 0.69417715;
 0.46738112, 0.57731611, 0.76102859, 0.022845089, 0.6177811, 0.077412724, 0.24424142, 0.36688471, 0.93988723, 0.78573084;
 0.69067037, 0.88135958, 0.43611726, 0.90912604, 0.66283399, 0.54708374, 0.96870279, 0.49341825, 0.30526567, 0.82213616]
d1=
[3.4339712, 3.3018909, 2.874511, 3.301708, 2.1673174, 2.0114236, 2.7168427, 2.4728658, 4.3929782, 2.3993683;
 2.4835873, 2.1523409, 2.1498954, 2.5446389, 1.7926966, 2.2090154, 2.4205515, 2.2555299, 2.6125734, 2.4037178;
 2.9214618, 3.1150961, 2.3442979, 3.2174227, 1.7979124, 1.8984412, 2.4731083, 1.9046223, 3.5246439, 2.0348554;
 3.9956777, 2.8178856, 3.1597445, 3.3588972, 2.7257032, 2.4187253, 3.1440182, 3.2677574, 4.4706631, 2.9257026;
 2.7029417, 2.7270441, 2.0557916, 2.5979862, 2.0933702, 1.5421755, 2.965065, 1.8537009, 3.2329962, 2.1449621;
 1.5905372, 1.3907683, 1.1002947, 1.547938, 1.2073319, 0.92409831, 1.4553438, 1.179099, 1.4453938, 1.3688226;
 2.4848886, 2.5278113, 2.0240541, 2.7472839, 1.8569282, 1.3925812, 1.8769736, 1.903693, 3.0348523, 2.4186354;
 2.429384, 1.662969, 1.9637458, 1.6941061, 1.6775753, 1.2332177, 1.6021049, 2.2346172, 3.071517, 1.8394243;
 2.8938382, 2.4820859, 1.9405041, 2.8026655, 1.9710373, 1.6105465, 2.6875355, 2.4670901, 3.528403, 2.3277476;
 2.9987693, 2.1835959, 2.4639745, 3.0621467, 1.6020653, 1.9954627, 1.7002187, 2.3867924, 3.4208841, 1.957451]
d2=
[6.8679423, 6.6037822, 5.749022, 6.603416, 4.3346348, 4.0228472, 5.4336858, 4.9457316, 8.7859564, 4.798737;
 4.9671745, 4.3046818, 4.2997909, 5.0892782, 3.5853932, 4.4180307, 4.8411031, 4.5110598, 5.2251468, 4.807435;
 5.8429232, 6.2301922, 4.6885953, 6.4348459, 3.595825, 3.7968826, 4.9462171, 3.8092444, 7.0492873, 4.0697107;
 7.9913564, 5.6357713, 6.3194895, 6.7177939, 5.451407, 4.83745, 6.2880363, 6.5355153, 8.9413252, 5.8514056;
 5.4058843, 5.4540882, 4.1115832, 5.1959729, 4.1867404, 3.0843511, 5.9301295, 3.7074013, 6.465992, 4.2899241;
 3.1810741, 2.7815366, 2.2005892, 3.095876, 2.4146638, 1.8481966, 2.9106874, 2.3581979, 2.8907876, 2.7376451;
 4.9697776, 5.0556226, 4.0481086, 5.4945679, 3.7138562, 2.7851624, 3.7539475, 3.8073862, 6.0697041, 4.8372707;
 4.858768, 3.325938, 3.9274917, 3.3882124, 3.3551505, 2.4664357, 3.2042098, 4.4692345, 6.1430345, 3.6788487;
 5.7876759, 4.9641714, 3.8810079, 5.6053309, 3.9420743, 3.2210932, 5.375071, 4.9341807, 7.0568066, 4.6554947;
 5.9975386, 4.3671918, 4.927949, 6.1242933, 3.2041309, 3.9909253, 3.4004376, 4.7735848, 6.8417678, 3.914902]
diff=
[3.4339712, 3.3018913, 2.874511, 3.301708, 2.1673174, 2.0114236, 2.7168431, 2.4728658, 4.3929782, 2.3993688;
 2.4835873, 2.1523409, 2.1498954, 2.5446393, 1.7926966, 2.2090154, 2.4205515, 2.2555299, 2.6125734, 2.4037173;
 2.9214613, 3.1150961, 2.3442974, 3.2174232, 1.7979126, 1.8984414, 2.4731088, 1.9046221, 3.5246434, 2.0348554;
 3.9956787, 2.8178856, 3.159745, 3.3588967, 2.7257037, 2.4187248, 3.1440182, 3.2677579, 4.4706621, 2.925703;
 2.7029426, 2.7270441, 2.0557916, 2.5979867, 2.0933702, 1.5421755, 2.9650645, 1.8537004, 3.2329957, 2.1449621;
 1.590537, 1.3907683, 1.1002945, 1.547938, 1.2073319, 0.92409831, 1.4553436, 1.179099, 1.4453938, 1.3688226;
 2.484889, 2.5278113, 2.0240545, 2.7472839, 1.856928, 1.3925812, 1.8769739, 1.9036932, 3.0348518, 2.4186354;
 2.429384, 1.662969, 1.9637458, 1.6941063, 1.6775751, 1.233218, 1.6021049, 2.2346172, 3.0715175, 1.8394245;
 2.8938377, 2.4820855, 1.9405038, 2.8026655, 1.971037, 1.6105467, 2.6875355, 2.4670906, 3.5284035, 2.3277471;
 2.9987693, 2.1835959, 2.4639745, 3.0621467, 1.6020656, 1.9954627, 1.7002189, 2.3867924, 3.4208837, 1.957451]
[       OK ] OCL.gemm_reuse_D (1 ms)

Contributor Author

insoow commented Mar 15, 2017

The incorrect output occurred because D was used without initialization. Three patches were pushed; the first two are reverts.

When C is null and beta is non-zero, D is used without initialization.
This resolves the issue.

Signed-off-by: Woo, Insoo <insoo.woo@intel.com>
Member

alalek commented Mar 16, 2017

@insoow Great! One problem is fixed.
There is another problem with NaN values, because 0.0 * nan produces nan again (instead of zero).

float4 dot04 = (start_index != 0) ? vload4(0, dst_write0 + 4 * ldC) : (float)beta * vload4(0, dst_write0 + 4 * ldC);
float4 dot05 = (start_index != 0) ? vload4(0, dst_write0 + 5 * ldC) : (float)beta * vload4(0, dst_write0 + 5 * ldC);
float4 dot06 = (start_index != 0) ? vload4(0, dst_write0 + 6 * ldC) : (float)beta * vload4(0, dst_write0 + 6 * ldC);
float4 dot07 = (start_index != 0) ? vload4(0, dst_write0 + 7 * ldC) : (float)beta * vload4(0, dst_write0 + 7 * ldC);
Member

This vload4 series probably reads outside the buffer, picking up NaN values.

Check code:

#define CHECK_NAN_(id, v) if(isnan(dot ## id . s ## v)) { printf("dot" #id ".s" #v " is NAN, lx=%d ly=%d\n", local_x, local_y); }
#define CHECK_NAN(id) CHECK_NAN_(id, 0) CHECK_NAN_(id, 1) CHECK_NAN_(id, 2) CHECK_NAN_(id, 3)
    CHECK_NAN(00) CHECK_NAN(01) CHECK_NAN(02) CHECK_NAN(03)
    CHECK_NAN(04) CHECK_NAN(05) CHECK_NAN(06) CHECK_NAN(07)

Sporadic results (dst in this case rows=10 cols=5):

OpenCV: OCL: Kernel::run(intelblas_gemm_buffer_NN)
    dims=2 local=8x4x1 global=8x4x1
dot00.s0 is NAN, lx=3 ly=3
dot00.s0 is NAN, lx=1 ly=2
dot00.s0 is NAN, lx=7 ly=2
dot00.s0 is NAN, lx=5 ly=1
dot01.s3 is NAN, lx=1 ly=3
...

These NaN values are probably propagated through the pipeline. I hope this is the cause of the NaN values from this comment.

Member

alalek commented Apr 19, 2017

Thank you! 👍

alalek merged commit 2922738 into opencv:master Apr 19, 2017