
Flush to zero Convolution denormal weights#17295

Merged
opencv-pushbot merged 1 commit into opencv:3.4 from dkurt:dnn_fusion_ftz
May 22, 2020

Conversation

@dkurt
Member

@dkurt dkurt commented May 14, 2020

model from #17259

(base) model_1: 56.99ms
(base) model_2: 170.85ms
(ftz) model_1: 49.18ms
(ftz) model_2: 49.33ms
import numpy as np
import cv2 as cv
import time

net = cv.dnn.readNet('model_1.prototxt', 'model_1.caffemodel')
net.setPreferableBackend(cv.dnn.DNN_BACKEND_OPENCV)
inp = np.random.standard_normal([1, 3, 112, 112]).astype(np.float32)
net.setInput(inp)
net.forward()

speeds = []
for i in range(10):
    start = time.time()
    net.forward()
    speeds.append((time.time() - start) * 1000)
print(np.median(speeds))

net = cv.dnn.readNet('model_1.prototxt', 'model_2.caffemodel')
net.setPreferableBackend(cv.dnn.DNN_BACKEND_OPENCV)
inp = np.random.standard_normal([1, 3, 112, 112]).astype(np.float32)
net.setInput(inp)
net.forward()

speeds = []
for i in range(10):
    start = time.time()
    net.forward()
    speeds.append((time.time() - start) * 1000)
print(np.median(speeds))

opencv_perf_dnn:

Median (ms)

                       Name of Test                         base     ftz   ftz vs base (x-factor)
AlexNet::DNNTestNetwork::OCV/CPU                           14.280  14.235     1.00   
DenseNet_121::DNNTestNetwork::OCV/CPU                      39.178  39.567     0.99   
EAST_text_detection::DNNTestNetwork::OCV/CPU               69.456  69.436     1.00   
ENet::DNNTestNetwork::OCV/CPU                              44.60   23.26      1.91   (separate run)
FastNeuralStyle_eccv16::DNNTestNetwork::OCV/CPU            125.432 124.266    1.01   
GoogLeNet::DNNTestNetwork::OCV/CPU                         15.315  15.273     1.00   
Inception_5h::DNNTestNetwork::OCV/CPU                      16.713  16.819     0.99   
Inception_v2_Faster_RCNN::DNNTestNetwork::OCV/CPU          286.398 290.170    0.99   
Inception_v2_SSD_TensorFlow::DNNTestNetwork::OCV/CPU       43.136  43.012     1.00   
MobileNet_SSD_Caffe::DNNTestNetwork::OCV/CPU               20.919  20.866     1.00   
MobileNet_SSD_v1_TensorFlow::DNNTestNetwork::OCV/CPU       22.510  22.410     1.00   
MobileNet_SSD_v2_TensorFlow::DNNTestNetwork::OCV/CPU       31.676  31.742     1.00   
OpenFace::DNNTestNetwork::OCV/CPU                           3.922   3.968     0.99   
OpenPose_pose_mpi_faster_4_stages::DNNTestNetwork::OCV/CPU 607.502 619.329    0.98   
ResNet_50::DNNTestNetwork::OCV/CPU                         36.333  35.966     1.01   
SSD::DNNTestNetwork::OCV/CPU                               270.494 272.510    0.99   
SqueezeNet_v1_1::DNNTestNetwork::OCV/CPU                    3.918   3.909     1.00   
YOLOv3::DNNTestNetwork::OCV/CPU                            212.866 210.719    1.01   
opencv_face_detector::DNNTestNetwork::OCV/CPU              13.940  14.016     0.99 

CPU: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz x8

$ cat /proc/cpuinfo | grep "MHz"
cpu MHz         : 4000.102
cpu MHz         : 4002.573
cpu MHz         : 4179.525
cpu MHz         : 4009.356
cpu MHz         : 3998.455
cpu MHz         : 4001.585
cpu MHz         : 4000.148
cpu MHz         : 4181.959
force_builders=Custom,Custom Win,Custom Mac
build_image:Custom=ubuntu-openvino-2020.2.0:16.04
build_image:Custom Win=openvino-2020.2.0
build_image:Custom Mac=openvino-2020.2.0

test_modules:Custom=dnn,python2,python3,java
test_modules:Custom Win=dnn,python2,python3,java
test_modules:Custom Mac=dnn,python2,python3,java

buildworker:Custom=linux-1
# disabled due to high memory usage: test_opencl:Custom=ON
test_opencl:Custom=OFF
test_bigdata:Custom=1
test_filter:Custom=*

@dkurt dkurt changed the title Flush to zero Convolution activation weights Flush to zero Convolution denormal weights May 14, 2020
@dkurt dkurt linked an issue May 14, 2020 that may be closed by this pull request
@YashasSamaga
Contributor

YashasSamaga commented May 15, 2020

The 1e-15 in that post was arbitrary. Why not simply enable FTZ in the hardware before convolving?

bool already_enabled = is_ftz_enabled();
enable_ftz();

// do convolution

if (!already_enabled)
    disable_ftz();

It's possible that there are no denormals in the weights (this can be checked using std::fpclassify) but that denormals are being generated during convolution. This can still happen with clipped weights if some input to the convolution layer happens to be really small. So if an attempt is being made to avoid denormals, why not suppress them completely instead of allowing some amount of denormals through?
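On x86 that suggestion could look roughly like the following sketch using the SSE MXCSR intrinsics (`multiply_with_ftz` and `is_denormal` are illustrative names, not OpenCV code):

```cpp
#include <cassert>
#include <cmath>        // std::fpclassify
#include <xmmintrin.h>  // _MM_GET/SET_FLUSH_ZERO_MODE (x86 SSE)

// Detect denormal (subnormal) values, e.g. to scan weights up front.
bool is_denormal(float v) { return std::fpclassify(v) == FP_SUBNORMAL; }

// Save the current FTZ mode, enable it around the arithmetic, then restore it.
float multiply_with_ftz(float a, float b) {
    unsigned prev = _MM_GET_FLUSH_ZERO_MODE();
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

    volatile float x = a, y = b;   // volatile blocks constant folding
    float result = x * y;          // a denormal result is flushed to zero here

    _MM_SET_FLUSH_ZERO_MODE(prev); // restore the caller's mode
    return result;
}
```

With FTZ off, `1e-37f * 1e-6f` would instead produce a subnormal value around 1e-43; with FTZ on, the hardware returns exactly zero.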

@dkurt
Member Author

dkurt commented May 15, 2020

@YashasSamaga, that's a good point, but it needs more experimentation. It's hard to pin down the // do convolution region because we have multiple CPU backends which receive the fused weights as parameters (TEngine, Intel Inference Engine).

Update: there is FTZ right in the MKL-DNN source code. OK, I'll do as you recommended, thanks!

@dkurt dkurt force-pushed the dnn_fusion_ftz branch from a0240b4 to be03a07 Compare May 15, 2020 20:26
@JulienMaille
Contributor

Just FYI, I gave this patch a try and my inference time is 10% slower with it (from 92ms to 102ms).
Before this patch, I also tried compiling with FAST_MATH, and inference time was unchanged (92ms).

@dkurt dkurt force-pushed the dnn_fusion_ftz branch from be03a07 to 68d59a2 Compare May 15, 2020 20:45
@dkurt
Member Author

dkurt commented May 15, 2020

@JulienMaille, I just updated the PR. Have you tried the version with manual flushing or the one with intrinsics? Can you provide a reproducer?

@JulienMaille
Contributor

I tried this:

            // Flush to zero (FTZ) denormal weights: https://github.com/opencv/opencv/issues/17259
            Mat mask = abs(weightsMat) <= 1e-15f;
            weightsMat.setTo(0.0f, mask & (weightsMat > 0.0f));
            weightsMat.setTo(-0.0f, mask & (weightsMat < 0.0f));

Will try the new one and report back.

@JulienMaille
Contributor

Same 10% slowdown. Do you want an ONNX model to reproduce?

@dkurt
Member Author

dkurt commented May 16, 2020

@JulienMaille, can you share your measurement methodology? Here is an example with min, median, mean, and std estimated on your model (several runs):

         min      median   mean     std
baseline 79.30ms  80.55ms  81.24ms  3.77
baseline 78.71ms  79.43ms  79.99ms  3.06
baseline 77.19ms  78.43ms  78.97ms  3.18
FTZ      77.85ms  79.33ms  79.86ms  3.26
FTZ      78.77ms  79.46ms  80.02ms  3.00
FTZ      77.73ms  78.64ms  79.14ms  3.05
import numpy as np
import cv2 as cv
import time
print(cv.__file__)

net = cv.dnn.readNet('model.onnx')
net.setPreferableBackend(cv.dnn.DNN_BACKEND_OPENCV)
inp = np.random.standard_normal([1, 1, 256, 256]).astype(np.float32)
net.setInput(inp)
net.forward()

speeds = []
for i in range(1000):
    start = time.time()
    net.forward()
    speeds.append((time.time() - start) * 1000)
print('%.2fms|%.2fms|%.2fms|%.2f' % (np.min(speeds), np.median(speeds), np.mean(speeds), np.std(speeds)))

@JulienMaille
Contributor

JulienMaille commented May 16, 2020

I can do that; the numbers I gave were the average over 20 runs, with the first run discarded.
I'm working with the C++ API; is there a way to access "nightly" Python builds with OpenVINO? That would make testing easier.

@dkurt
Member Author

dkurt commented May 16, 2020

@JulienMaille, you don't need OpenVINO: the changes in this patch won't affect it. They apply only to the default implementation, DNN_BACKEND_OPENCV.

@JulienMaille
Contributor

Seems like the difference was due to the CPU being busier when I ran the FTZ benchmarks.
Even when running 100 loops I can see a 10ms difference in avg/median.
The best runs for both baseline and FTZ give the same results:

avg: 90.7, min: 88, median: 91, std: 2.75

@JulienMaille
Contributor

JulienMaille commented May 16, 2020

Uh, sorry. Ok so for reference (100 loops):

INFERENCE_ENGINE    avg:  90.7, min:  88, median:  91, std: 2.75
BACKEND_OPENCV base avg: 137.5, min: 132, median: 137, std: 5.59
                    avg: 137.4, min: 135, median: 138, std: 5.37
                    avg: 137.2, min: 136, median: 137, std: 5.50
BACKEND_OPENCV FTZ  avg: 139.7, min: 134, median: 138, std: 7.32
                    avg: 137.9, min: 132, median: 137, std: 5.28
                    avg: 137.7, min: 134, median: 138, std: 5.41

@asmorkalov
Contributor

@dkurt Do you have a final decision on the patch?

@dkurt
Member Author

dkurt commented May 22, 2020

@asmorkalov, yes, this is the final version.

@alalek, you mentioned that there is also a way to apply the FTZ optimization on ARM CPUs; shall I add it here?

Member

@alalek alalek left a comment


Thank you!

@YashasSamaga
Contributor

YashasSamaga commented Aug 22, 2020

  1. What is the scope of the DAZ and FTZ modes? Do they have to be set for each thread?

  2. What if an exception is thrown during convolution, or an early return is taken (the OCL path does this)? The DAZ and FTZ modes won't be reset, and would therefore alter the modes in the end-user's thread that invoked net.forward().

An RAII-based solution would solve problem 2: the RAII object automatically resets the FTZ and DAZ modes after an exception or before a return. It's safer and future-proof.

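The RAII idea could be sketched like this (x86-only; `DenormalsGuard` is a hypothetical name, not OpenCV code):

```cpp
#include <cassert>
#include <xmmintrin.h>  // _MM_GET/SET_FLUSH_ZERO_MODE (SSE)
#include <pmmintrin.h>  // _MM_GET/SET_DENORMALS_ZERO_MODE (SSE3)

// The constructor records the current FTZ/DAZ modes and enables both; the
// destructor restores them on scope exit, covering early returns and
// exceptions alike.
class DenormalsGuard {
public:
    DenormalsGuard()
        : ftz_(_MM_GET_FLUSH_ZERO_MODE()),
          daz_(_MM_GET_DENORMALS_ZERO_MODE()) {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    }
    ~DenormalsGuard() {
        _MM_SET_FLUSH_ZERO_MODE(ftz_);
        _MM_SET_DENORMALS_ZERO_MODE(daz_);
    }
    DenormalsGuard(const DenormalsGuard&) = delete;
    DenormalsGuard& operator=(const DenormalsGuard&) = delete;
private:
    unsigned ftz_, daz_;
};

// FTZ/DAZ are active only inside the guard's scope.
float guarded_multiply(float a, float b) {
    DenormalsGuard guard;
    volatile float x = a, y = b;   // volatile blocks constant folding
    return x * y;                  // denormal products flush to zero
}
```

Note that MXCSR is per-thread state, so each worker thread doing convolutions would need its own guard.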

The following posts suggest that the FTZ and DAZ modes have to be set per thread:

I don't know how ParallelLoopBody works, but if it reuses worker threads from a global thread pool, each thread might have its own FTZ and DAZ modes.

@dkurt dkurt deleted the dnn_fusion_ftz branch December 7, 2020 10:59

Development

Successfully merging this pull request may close these issues.

same dnn model with different params, the speed is different

6 participants