
Add a 512 bit codepath to the AVX512 fastConv function #10468

Merged
alalek merged 4 commits into opencv:master from fenrus75:avx512-2 on Jan 31, 2018


Contributor

@fenrus75 fenrus75 commented Dec 29, 2017

This patch adds a 512-bit-wide code path to the fastConv() function for
AVX512 use.
The basic idea is to process the first N * 16 elements of the vector
with AVX512, and then run the rest of the vector using the traditional
AVX2 code path.
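
The split described above can be sketched on a plain dot product. This is only an illustration under stated assumptions, not the fastConv() kernel from the patch: `dotSplit` is a hypothetical name, the 512-bit loop is compiled only when the compiler defines `__AVX512F__` (a portable 16-lane loop stands in otherwise), and a scalar loop stands in for the AVX2 remainder path.

```cpp
#include <cstddef>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical helper, not the OpenCV fastConv() kernel: a dot product that
// handles the largest multiple-of-16 prefix 16 floats at a time and leaves
// the remaining elements to a narrower loop.
float dotSplit(const float* a, const float* b, std::size_t vecsize)
{
    // Round vecsize down to a multiple of 16 (the "N * 16" prefix).
    const std::size_t wide = vecsize & ~static_cast<std::size_t>(15);
    float sum = 0.f;
    std::size_t k = 0;

#if defined(__AVX512F__)
    __m512 acc = _mm512_setzero_ps();
    for (; k < wide; k += 16)
    {
        // Unaligned loads: no 64-byte alignment is assumed in this sketch.
        __m512 va = _mm512_loadu_ps(a + k);
        __m512 vb = _mm512_loadu_ps(b + k);
        acc = _mm512_fmadd_ps(va, vb, acc);
    }
    sum += _mm512_reduce_add_ps(acc);
#else
    // Portable stand-in for the 512-bit loop: 16 lane accumulators.
    float lanes[16] = {0};
    for (; k < wide; k += 16)
        for (int i = 0; i < 16; i++)
            lanes[i] += a[k + i] * b[k + i];
    for (int i = 0; i < 16; i++)
        sum += lanes[i];
#endif

    // Remainder (fewer than 16 elements). The patch hands this part to the
    // existing AVX2 code; a scalar loop keeps the sketch short.
    for (; k < vecsize; k++)
        sum += a[k] * b[k];
    return sum;
}
```

With a 37-element input, for instance, the wide loop covers elements 0 to 31 and the remainder loop covers the last five.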

force_builder=Custom
buildworker:Custom=linux-3
docker_image:Custom=ubuntu:17.10

@fenrus75
Contributor Author

Geometric mean

| Name of Test | newbase | newbase | newbase vs newbase (x-factor) |
|---|---|---|---|
| AlexNet::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU) | 7.196 ms | 7.145 ms | 1.01 |
| ENet::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU) | 20.913 ms | 16.364 ms | 1.28 |
| GoogLeNet::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU) | 10.445 ms | 10.460 ms | 1.00 |
| Inception_5h::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU) | 11.232 ms | 11.322 ms | 0.99 |
| MobileNet_SSD_Caffe::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU) | 13.381 ms | 13.333 ms | 1.00 |
| OpenFace::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU) | 4.094 ms | 3.971 ms | 1.03 |
| ResNet50::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU) | 30.983 ms | 30.740 ms | 1.01 |
| SSD::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU) | 148.115 ms | 156.837 ms | 0.94 |
| SqueezeNet_v1_1::DNNTestNetwork::(DNN_BACKEND_DEFAULT, DNN_TARGET_CPU) | 2.428 ms | 2.447 ms | 0.99 |
| perf::ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF) | 0.218 ms | 0.253 ms | 0.86 |
| perf::ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON) | 0.059 ms | 0.069 ms | 0.85 |
| perf::ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF) | 0.208 ms | 0.244 ms | 0.85 |
| perf::ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON) | 0.056 ms | 0.066 ms | 0.85 |
| perf::ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF) | 0.217 ms | 0.248 ms | 0.87 |
| perf::ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON) | 0.061 ms | 0.069 ms | 0.88 |
| perf::ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF) | 0.295 ms | 0.295 ms | 1.00 |
| perf::ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON) | 0.081 ms | 0.082 ms | 0.99 |
| perf::ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF) | 0.128 ms | 0.155 ms | 0.83 |
| perf::ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON) | 0.054 ms | 0.053 ms | 1.02 |
| perf::ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF) | 0.300 ms | 0.374 ms | 0.80 |
| perf::ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON) | 0.164 ms | 0.153 ms | 1.07 |
| perf::ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF) | 0.364 ms | 0.397 ms | 0.92 |
| perf::ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON) | 0.102 ms | 0.111 ms | 0.92 |
| perf::ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF) | 0.436 ms | 0.465 ms | 0.94 |
| perf::ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON) | 0.119 ms | 0.128 ms | 0.92 |
| perf::ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF) | 1.300 ms | 1.457 ms | 0.89 |
| perf::ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON) | 0.335 ms | 0.385 ms | 0.87 |
| perf::ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF) | 2.209 ms | 3.079 ms | 0.72 |
| perf::ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON) | 0.589 ms | 0.726 ms | 0.81 |
| perf::ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF) | 0.930 ms | 1.254 ms | 0.74 |
| perf::ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON) | 0.317 ms | 0.322 ms | 0.98 |
| perf::ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF) | 2.027 ms | 2.085 ms | 0.97 |
| perf::ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON) | 0.900 ms | 0.935 ms | 0.96 |
| perf::ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF) | 0.680 ms | 0.708 ms | 0.96 |
| perf::ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON) | 0.193 ms | 0.201 ms | 0.96 |
| perf::ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF) | 0.930 ms | 0.952 ms | 0.98 |
| perf::ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON) | 0.252 ms | 0.263 ms | 0.96 |
| perf::ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF) | 3.588 ms | 4.620 ms | 0.78 |
| perf::ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON) | 0.945 ms | 1.254 ms | 0.75 |
| perf::ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF) | 6.050 ms | 8.879 ms | 0.68 |
| perf::ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON) | 1.598 ms | 2.274 ms | 0.70 |
| perf::ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF) | 2.195 ms | 2.841 ms | 0.77 |
| perf::ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON) | 0.609 ms | 0.703 ms | 0.87 |
| perf::ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF) | 4.303 ms | 5.045 ms | 0.85 |
| perf::ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON) | 1.506 ms | 1.504 ms | 1.00 |
| perf::ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF) | 0.334 ms | 0.302 ms | 1.11 |
| perf::ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON) | 0.321 ms | 0.311 ms | 1.03 |
| perf::ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF) | 0.442 ms | 0.433 ms | 1.02 |
| perf::ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON) | 0.434 ms | 0.432 ms | 1.00 |
| perf::ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF) | 2.137 ms | 2.078 ms | 1.03 |
| perf::ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON) | 2.106 ms | 2.097 ms | 1.00 |
| perf::ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF) | 7.127 ms | 6.881 ms | 1.04 |
| perf::ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON) | 7.146 ms | 6.977 ms | 1.02 |
| perf::ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF) | 2.457 ms | 2.593 ms | 0.95 |
| perf::ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON) | 2.454 ms | 2.559 ms | 0.96 |
| perf::ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF) | 8.113 ms | 7.995 ms | 1.01 |
| perf::ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON) | 8.177 ms | 7.884 ms | 1.04 |
| perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF) | 0.219 ms | 0.252 ms | 0.87 |
| perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON) | 0.059 ms | 0.069 ms | 0.85 |
| perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF) | 0.212 ms | 0.246 ms | 0.86 |
| perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON) | 0.056 ms | 0.067 ms | 0.85 |
| perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF) | 0.218 ms | 0.250 ms | 0.87 |
| perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON) | 0.061 ms | 0.069 ms | 0.89 |
| perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF) | 0.297 ms | 0.294 ms | 1.01 |
| perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON) | 0.082 ms | 0.081 ms | 1.02 |
| perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF) | 0.127 ms | 0.126 ms | 1.01 |
| perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON) | 0.053 ms | 0.052 ms | 1.04 |
| perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF) | 0.298 ms | 0.296 ms | 1.01 |
| perf::OCL_ConvolutionPerfTest::(1x1, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON) | 0.155 ms | 0.156 ms | 1.00 |
| perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF) | 0.359 ms | 0.395 ms | 0.91 |
| perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON) | 0.103 ms | 0.112 ms | 0.91 |
| perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF) | 0.430 ms | 0.467 ms | 0.92 |
| perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON) | 0.118 ms | 0.128 ms | 0.92 |
| perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF) | 1.295 ms | 1.374 ms | 0.94 |
| perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON) | 0.335 ms | 0.343 ms | 0.98 |
| perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF) | 2.141 ms | 2.850 ms | 0.75 |
| perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON) | 0.560 ms | 0.688 ms | 0.81 |
| perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF) | 0.926 ms | 1.152 ms | 0.80 |
| perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON) | 0.326 ms | 0.321 ms | 1.01 |
| perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF) | 2.020 ms | 2.059 ms | 0.98 |
| perf::OCL_ConvolutionPerfTest::(3x3, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON) | 0.896 ms | 0.935 ms | 0.96 |
| perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF) | 0.670 ms | 0.706 ms | 0.95 |
| perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON) | 0.189 ms | 0.199 ms | 0.95 |
| perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF) | 0.920 ms | 0.963 ms | 0.96 |
| perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON) | 0.251 ms | 0.262 ms | 0.96 |
| perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF) | 3.505 ms | 4.501 ms | 0.78 |
| perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON) | 0.918 ms | 1.155 ms | 0.80 |
| perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF) | 5.838 ms | 8.837 ms | 0.66 |
| perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON) | 1.493 ms | 2.290 ms | 0.65 |
| perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF) | 2.234 ms | 2.777 ms | 0.80 |
| perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON) | 0.611 ms | 0.682 ms | 0.90 |
| perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF) | 4.201 ms | 4.913 ms | 0.86 |
| perf::OCL_ConvolutionPerfTest::(5x5, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON) | 1.486 ms | 1.444 ms | 1.03 |
| perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_OFF) | 0.308 ms | 0.323 ms | 0.95 |
| perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_2, STRIDE_ON) | 0.311 ms | 0.318 ms | 0.98 |
| perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_OFF) | 0.446 ms | 0.431 ms | 1.03 |
| perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 4, 224, 224 }, 64), GROUP_OFF, STRIDE_ON) | 0.443 ms | 0.418 ms | 1.06 |
| perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_OFF) | 2.134 ms | 2.119 ms | 1.01 |
| perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_2, STRIDE_ON) | 2.126 ms | 2.061 ms | 1.03 |
| perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_OFF) | 7.201 ms | 6.956 ms | 1.04 |
| perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 64, 112, 122 }, 128), GROUP_OFF, STRIDE_ON) | 7.174 ms | 6.835 ms | 1.05 |
| perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_OFF) | 2.454 ms | 2.594 ms | 0.95 |
| perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_2, STRIDE_ON) | 2.470 ms | 2.581 ms | 0.96 |
| perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_OFF) | 8.075 ms | 8.174 ms | 0.99 |
| perf::OCL_ConvolutionPerfTest::(11x11, ({ 1, 256, 28, 28 }, 512), GROUP_OFF, STRIDE_ON) | 8.002 ms | 8.152 ms | 0.98 |
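
Judging from the rows, the x-factor column appears to be the first timing divided by the second (for example 20.913 / 16.364 ≈ 1.28 for ENet), so values above 1 mean the second column is faster. A minimal sketch of that arithmetic follows; `xFactor` is a hypothetical name and is not part of OpenCV's perf tooling.

```cpp
#include <cmath>

// Hypothetical helper: ratio of the first timing column to the second,
// rounded to two decimals as the table displays it.
inline double xFactor(double firstMs, double secondMs)
{
    return std::round(firstMs / secondMs * 100.0) / 100.0;
}
```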

@fenrus75
Contributor Author

Performance is somewhat mixed, so feedback is definitely needed.
The non-micro (DNNTestNetwork) tests, at least, seem to show mostly gains.

Member

@alalek alalek left a comment

Thank you!

Please rebase your patch on the latest master (CV_AVX512_SKX is not available at the commit this patch is based on).

/* only use AVX512 for multiple-of-16 vectors */
if ((vecsize & 15) == 0) {

__m512 vs00_5 = _mm512_setzero_ps(), vs01_5 = _mm512_setzero_ps(),
Member

Please fix indentation here (4 spaces).

__m512 w0 = _mm512_load_ps(wptr0 + k);
__m512 w1 = _mm512_load_ps(wptr1 + k);
__m512 w2 = _mm512_load_ps(wptr2 + k);
__m512 r0 = _mm512_load_ps(rptr);
Member

_mm512_load_ps() is an aligned load. Either check that the pointers are 64-byte aligned or change this to _mm512_loadu_ps().
Currently these pointers are only 32-byte aligned (per OpenCV's memory allocator alignment requirement), which is why the AVX/AVX2 code is fine here.
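
The first option in the comment above, checking alignment before using the aligned load, could look roughly like this. It is a sketch, not code from the patch: `isAligned` and `load16` are hypothetical names, and the intrinsic wrapper is compiled only when the compiler defines `__AVX512F__`.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical helper: true when p is aligned to `alignment` bytes
// (alignment must be a power of two).
inline bool isAligned(const void* p, std::size_t alignment)
{
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

#if defined(__AVX512F__)
// Uses the aligned load only when the address is 64-byte aligned;
// otherwise falls back to the unaligned form, which accepts any address.
inline __m512 load16(const float* p)
{
    return isAligned(p, 64) ? _mm512_load_ps(p) : _mm512_loadu_ps(p);
}
#endif
```

A pointer that is 32 bytes past a 64-byte boundary passes `isAligned(p, 32)` but fails `isAligned(p, 64)`, which is the case the review describes. The second option, calling `_mm512_loadu_ps()` unconditionally, avoids the branch altogether.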

fenrus75 and others added 4 commits January 30, 2018 17:22
Member

@alalek alalek left a comment

@fenrus75 Thank you!

@alalek alalek merged commit a75840d into opencv:master Jan 31, 2018