Enable Eltwise layer with different numbers of inputs channels #15739
opencv-pushbot merged 2 commits into opencv:3.4
Conversation
EDIT: HEY, THIS HAS BEEN MERGED! For anyone who needs this new feature right now on Python and can't wait 3 months for the next OpenCV version, read the following post: Build guide: #15739 (comment). Also check #15739 (comment) if you want to see the amazing benchmarks of this new network!

Wow, I've done a small code review and everything looks excellent. Nice refactoring of the activation handling, and the alpha parameter is nicely handled via blending coefficients! Great job! Although I don't understand about half of the eltwise code. Is it doing a resize to make the layers match each other's size? That's what darknet does. I am also unsure where the coeffs are doing the alpha blending (multiplication), unless this is the relevant line: adbd613#diff-7ac73ff12c29882cb913b6f09da2f82cR258

As for testing (compiling), I'll try now! But I haven't compiled OpenCV before; I use it via Python. Thank you so much for everything @dkurt! |
|
Goddamn that was hard to compile. I was following https://docs.opencv.org/3.4/d3/d52/tutorial_windows_install.html, which is severely outdated and required researching many changes. I split that into two files.

installocv1.sh:

```bash
#!/bin/bash -e
myRepo=$(pwd)
if [ ! -d "$myRepo/opencv" ]; then
    echo "cloning opencv"
    git clone https://github.com/opencv/opencv.git
    mkdir -p Build/opencv      # -p so a re-run doesn't abort the script (bash -e)
    mkdir -p Install/opencv
else
    cd opencv
    git pull --rebase
    cd ..
fi
if [ ! -d "$myRepo/opencv_contrib" ]; then
    echo "cloning opencv_contrib"
    git clone https://github.com/opencv/opencv_contrib.git
    mkdir -p Build/opencv_contrib
else
    cd opencv_contrib
    git pull --rebase
    cd ..
fi
```

Then I entered the opencv folder and … Next, I ran my modified second file, installocv2.sh:

```bash
#!/bin/bash -e
myRepo=$(pwd)
CMAKE_CONFIG_GENERATOR="Visual Studio 16 2019"
CMAKE_CONFIG_ARCH="x64"
RepoSource=opencv
pushd Build/$RepoSource
CMAKE_OPTIONS='-DBUILD_PERF_TESTS:BOOL=OFF -DBUILD_TESTS:BOOL=OFF -DBUILD_DOCS:BOOL=OFF -DWITH_CUDA:BOOL=OFF -DBUILD_EXAMPLES:BOOL=OFF -DINSTALL_CREATE_DISTRIB=ON'
cmake -G"$CMAKE_CONFIG_GENERATOR" -A"$CMAKE_CONFIG_ARCH" $CMAKE_OPTIONS -DOPENCV_EXTRA_MODULES_PATH="$myRepo"/opencv_contrib/modules -DCMAKE_INSTALL_PREFIX="$myRepo"/install/"$RepoSource" "$myRepo/$RepoSource"
echo "************************* $Source_DIR -->debug"
cmake --build . --config debug
echo "************************* $Source_DIR -->release"
cmake --build . --config release
cmake --build . --target install --config release
cmake --build . --target install --config debug
popd
```

It's compiling now. Going for a coffee, then I'll try to get the OpenCV C++ interface working and will be trying YOLOv3-Tiny-PRN! |
|
You may use /m:4 for a multithreaded build to speed it up.

The Eltwise layer is an element-wise summation, product, or maximum. In case of different numbers of channels, it sums only the shared channels. Here are three inputs with 5, 3 and 2 channels correspondingly. |
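To make the shared-channel rule concrete, here is a tiny NumPy sketch of an eltwise sum over inputs with 5, 3 and 2 channels. This only illustrates the rule described above (it is not OpenCV's actual code, and the assumption that the widest input defines the output shape is mine):

```python
import numpy as np

def eltwise_sum(inputs):
    # Output channel count = widest input; each input contributes
    # only to the channels it actually has (the "shared" channels).
    out_c = max(a.shape[0] for a in inputs)
    h, w = inputs[0].shape[1:]
    out = np.zeros((out_c, h, w), dtype=np.float32)
    for a in inputs:
        out[:a.shape[0]] += a
    return out

a = np.ones((5, 2, 2), np.float32)
b = np.ones((3, 2, 2), np.float32)
c = np.ones((2, 2, 2), np.float32)
y = eltwise_sum([a, b, c])
# channels 0-1 sum all three inputs, channel 2 sums the first two,
# channels 3-4 pass through from the widest input only
```

So for the 5/3/2 example above, the first two output channels hold a three-way sum, the third a two-way sum, and the last two are unchanged.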
|
@dkurt Hey :-) The build finished a few minutes ago and I just figured out how to include it in a new project. I was just about to look up how to load and run DNNs using the C++ interface! Regarding your eltwise description: I still don't understand. Darknet resizes the input to the same size as the output before summing different-size layers. Is this patch behaving the same way? That's all that matters. :-) Alright, I'm gonna be ready with test results pretty soon... |
|
I'll mark the PR as work in progress this way. Need to check Darknet's behavior once again. Thanks! |
|
@dkurt Okay, so after getting stuck using the C++ API directly (I loaded the net and ran a forward pass, but realized it's a lot of work to actually draw the outputs), I instead compiled the https://github.com/opencv/opencv/blob/master/samples/dnn/object_detection.cpp example detector, which I didn't even realize existed. It would have saved me half an hour if I had known about that! :-D

Then I ran with … And I see that it does detect objects and adds some bounding boxes. So that's a good sign. The net/weights are the official files directly from the YOLOv3-Tiny-PRN researchers. And coco.names is standard. :-) https://github.com/AlexeyAB/darknet/blob/master/cfg/coco.names

Regarding the darknet behavior, I linked here #15739 (comment) to the source file where you see "if input and output of shortcut layer are different size, do …". Anyway, I am available for more testing, now that my C++ build environment and test code are all complete!

Edit: Turns out the PRN network is not behaving properly. See answers below. |
BTW, it is better to turn
|
Ah okay, yeah I saw that flag and thought I should probably have set it. It would have saved me some time getting started! ;-) Okay, I am seeing some misbehavior in OpenCV. Folder
|
|
I thought I might have made a mistake with the threshold, but nope, I didn't... I've updated the post above to clarify that Darknet uses a 10% threshold too. Both tests above use a 10% threshold. |
|
More tests: OpenCV with 20% threshold = looks the same as the 10% image above (same result, a whole image filled with "person" detections). OpenCV with 30% threshold = far fewer detections than darknet, clearly not behaving properly (few detections, some wrong labels, weird confidences). Any ideas why the net is misbehaving? Possibly due to the shortcut layer resize technique being different? |
Here is an OpenCV test comparison with regular YOLOv3-Tiny (not the shortcut-based PRN version), just to see that OpenCV properly handles the regular net!
|
|
@WongKinYiu Thank you so much for helping with the research! <3 Good idea to bring all the code in here for easy overview!

Here's the `parse_shortcut`:

```c
layer parse_shortcut(list *options, size_params params, network *net)
{
    char *l = option_find(options, "from");
    int index = atoi(l);
    if(index < 0) index = params.index + index;
    int batch = params.batch;
    layer from = net->layers[index];

    layer s = make_shortcut_layer(batch, index, params.w, params.h, params.c, from.out_w, from.out_h, from.out_c);

    char *activation_s = option_find_str(options, "activation", "linear");
    ACTIVATION activation = get_activation(activation_s);
    s.activation = activation;
    s.alpha = option_find_float_quiet(options, "alpha", 1);
    s.beta = option_find_float_quiet(options, "beta", 1);
    return s;
}
```

Here's AlexeyAB's `parse_shortcut`:

```c
layer parse_shortcut(list *options, size_params params, network net)
{
    int assisted_excitation = option_find_float_quiet(options, "assisted_excitation", 0);
    char *l = option_find(options, "from");
    int index = atoi(l);
    if(index < 0) index = params.index + index;
    int batch = params.batch;
    layer from = net.layers[index];
    if (from.antialiasing) from = *from.input_layer;

    layer s = make_shortcut_layer(batch, index, params.w, params.h, params.c, from.out_w, from.out_h, from.out_c, assisted_excitation);

    char *activation_s = option_find_str(options, "activation", "linear");
    ACTIVATION activation = get_activation(activation_s);
    s.activation = activation;
    return s;
}
```

(As you can see above, @AlexeyAB has removed the alpha and beta parameters, which may mean that he has moved them somewhere else... but I am not sure... maybe he really did remove them.)

Here's `make_shortcut_layer`:

```c
layer make_shortcut_layer(int batch, int index, int w, int h, int c, int w2, int h2, int c2, int assisted_excitation)
{
    if(assisted_excitation) fprintf(stderr, "Shortcut Layer - AE: %d\n", index);
    else fprintf(stderr, "Shortcut Layer: %d\n", index);
    layer l = { (LAYER_TYPE)0 };
    l.type = SHORTCUT;
    l.batch = batch;
    l.w = w2;
    l.h = h2;
    l.c = c2;
    l.out_w = w;
    l.out_h = h;
    l.out_c = c;
    l.outputs = w*h*c;
    l.inputs = l.outputs;
    l.assisted_excitation = assisted_excitation;
    if(w != w2 || h != h2 || c != c2) fprintf(stderr, " w = %d, w2 = %d, h = %d, h2 = %d, c = %d, c2 = %d \n", w, w2, h, h2, c, c2);
    l.index = index;
    l.delta = (float*)calloc(l.outputs * batch, sizeof(float));
    l.output = (float*)calloc(l.outputs * batch, sizeof(float));
    l.forward = forward_shortcut_layer;
    l.backward = backward_shortcut_layer;
#ifdef GPU
    l.forward_gpu = forward_shortcut_layer_gpu;
    l.backward_gpu = backward_shortcut_layer_gpu;
    l.delta_gpu = cuda_make_array(l.delta, l.outputs*batch);
    l.output_gpu = cuda_make_array(l.output, l.outputs*batch);
    if (l.assisted_excitation)
    {
        const int size = l.out_w * l.out_h * l.batch;
        l.gt_gpu = cuda_make_array(NULL, size);
        l.a_avg_gpu = cuda_make_array(NULL, size);
    }
#endif // GPU
    return l;
}
```

Here's `forward_shortcut_layer` (and its backward pass):

```c
void forward_shortcut_layer(const layer l, network_state state)
{
    if (l.w == l.out_w && l.h == l.out_h && l.c == l.out_c) {
        int size = l.batch * l.w * l.h * l.c;
        int i;
        #pragma omp parallel for
        for(i = 0; i < size; ++i)
            l.output[i] = state.input[i] + state.net.layers[l.index].output[i];
    }
    else {
        copy_cpu(l.outputs*l.batch, state.input, 1, l.output, 1);
        shortcut_cpu(l.batch, l.w, l.h, l.c, state.net.layers[l.index].output, l.out_w, l.out_h, l.out_c, l.output);
    }
    activate_array(l.output, l.outputs*l.batch, l.activation);
    if (l.assisted_excitation && state.train) assisted_excitation_forward(l, state);
}

void backward_shortcut_layer(const layer l, network_state state)
{
    gradient_array(l.output, l.outputs*l.batch, l.activation, l.delta);
    axpy_cpu(l.outputs*l.batch, 1, l.delta, 1, state.delta, 1);
    shortcut_cpu(l.batch, l.out_w, l.out_h, l.out_c, l.delta, l.w, l.h, l.c, state.net.layers[l.index].delta);
}
```

Here's `copy_cpu` (it just copies X into Y):

```c
void copy_cpu(int N, float *X, int INCX, float *Y, int INCY)
{
    int i;
    for(i = 0; i < N; ++i) Y[i*INCY] = X[i*INCX];
}
```

(Here's `shortcut_cpu`: https://github.com/AlexeyAB/darknet/blob/eac26226a7fc0a9da2b684a564f8f086eaf38390/src/blas.c#L71-L95)

```c
void shortcut_cpu(int batch, int w1, int h1, int c1, float *add, int w2, int h2, int c2, float *out)
{
    int stride = w1/w2;
    int sample = w2/w1;
    assert(stride == h1/h2);
    assert(sample == h2/h1);
    if(stride < 1) stride = 1;
    if(sample < 1) sample = 1;

    int minw = (w1 < w2) ? w1 : w2;
    int minh = (h1 < h2) ? h1 : h2;
    int minc = (c1 < c2) ? c1 : c2;

    int i,j,k,b;
    for(b = 0; b < batch; ++b){
        for(k = 0; k < minc; ++k){
            for(j = 0; j < minh; ++j){
                for(i = 0; i < minw; ++i){
                    int out_index = i*sample + w2*(j*sample + h2*(k + c2*b));
                    int add_index = i*stride + w1*(j*stride + h1*(k + c1*b));
                    out[out_index] += add[add_index];
                }
            }
        }
    }
}
```

What makes it hard to decipher the code is that darknet was pretty poorly written, in my opinion. The inconsistent variable names and general structure are very messy, and the lack of comments is extreme. I'm reading the `shortcut_cpu` code now.

Edit: Yeah, it looks like a nearest-neighbor algorithm, if you compare the code above to this: http://tech-algorithm.com/articles/nearest-neighbor-image-scaling/ |
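To double-check that reading, here is a hypothetical pure-Python port of `shortcut_cpu` for a single batch item (the names mirror the C code; `stride` and `sample` are the integer nearest-neighbor factors for shrinking and growing respectively):

```python
import numpy as np

def shortcut_add(add, out):
    """Sketch of darknet's shortcut_cpu for one batch item:
    adds `add` (c1, h1, w1) into `out` (c2, h2, w2) using
    integer-stride nearest-neighbor sampling over shared channels."""
    c1, h1, w1 = add.shape
    c2, h2, w2 = out.shape
    stride = max(w1 // w2, 1)   # input larger than output -> subsample input
    sample = max(w2 // w1, 1)   # input smaller than output -> spread it out
    for k in range(min(c1, c2)):
        for j in range(min(h1, h2)):
            for i in range(min(w1, w2)):
                out[k, j * sample, i * sample] += add[k, j * stride, i * stride]
    return out

# A 1x4x4 input added into a 1x2x2 output: every second pixel is picked,
# i.e. jagged nearest-neighbor downscaling with no interpolation.
add = np.arange(16, dtype=np.float32).reshape(1, 4, 4)
out = np.zeros((1, 2, 2), np.float32)
shortcut_add(add, out)
# out ends up as add[:, ::2, ::2]
```

Running it confirms the "extremely jagged nearest neighbor" description: values are picked at stride intervals, never blended.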
|
@VideoPlayerCode Hello, could you help check the number of channels after the Eltwise layer?
|
Yes, it just does an extremely jagged "nearest neighbor" scaling (no smooth interpolation at all). |
|
@WongKinYiu Hi again. :-) I was going to sleep now. But which file do you see that in? I just did a search in all the opencv and darknet source code and nothing is named … Edit: Perhaps you mean
|
@VideoPlayerCode good night. What I mean is the sizes of the inputs and output of the shortcut layer in opencv dnn. |
|
@WongKinYiu Ah, yes I agree that it looks like Eltwise is selecting the largest count from its input and output. @dkurt will know what it does. Btw where is that graphic from? https://user-images.githubusercontent.com/12152972/67135742-86d5af80-f24f-11e9-8b64-5892caf77532.png (it's not in the v1/v2/v3 YOLO papers). I'll be back tomorrow to help! Goodnight. :-) |
|
@VideoPlayerCode Oh, I drew the figure one hour ago. |
|
@dkurt Hi, just FYI I am here now and going to test your new change! I also noticed the coeffs fix; I didn't notice that problem the first time. It stores alpha in coeffs[0], and then uses coeffs[0] to multiply (previously it used coeffs[1]), so I am glad the alpha fix was discovered! Great job! Okay, time to recompile and test the net again.
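For anyone following the coeffs discussion: the arithmetic being fixed is just a coefficient-weighted element-wise sum, where each input tensor is scaled by its coefficient before accumulation. A minimal sketch of that arithmetic (not the actual OpenCV code; the function name is made up):

```python
import numpy as np

def eltwise_weighted_sum(inputs, coeffs):
    # Each input is multiplied by its blending coefficient, then summed.
    # The alpha bug was about which coeffs[] slot held the multiplier.
    out = np.zeros_like(inputs[0])
    for x, c in zip(inputs, coeffs):
        out += c * x
    return out

a = np.full((2, 2), 1.0)
b = np.full((2, 2), 2.0)
y = eltwise_weighted_sum([a, b], [0.5, 1.0])
# 0.5 * 1 + 1.0 * 2 = 2.5 everywhere
```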
Test results are here for patch v2! <3 Thank you so much for doing incredible work.

All images are made with Confidence Threshold: 10%+. Images below are provided as original, side-by-side scaled, and individually scaled, to let us compare bounding boxes! The most useful method is to open the individual large images in two web tabs and switch between them to check box similarity!

NMS is not configurable in darknet's command line from what I can see, so I guess the slight box differences are due to a lack of (or low amount of) NMS in Darknet. The ONLY thing I am not sure about is why Darknet sees a bus to the left and OpenCV doesn't.

PS: The "Truck: 0.54" label in OpenCV is consistent with the "truck: 54%" console output from Darknet, so yeah, that is a genuine misdetection from the net, and isn't a problem with Darknet/OpenCV. |
|
I've figured out why there are some overlapping (extra) boxes in Darknet: they're a different color! They're a different class! Darknet:
OpenCV:
That explains a lot... but not everything! There are also TONS of smaller boxes in the Darknet image that overlap or sit mostly within larger boxes of the exact same object class, i.e. "a 100x100 box of Car (yellow) containing a 30x30 box of Car (yellow)". Here are my theories on that:
My theory is the last one: that OpenCV has better NMS processing and filters out identical-class objects that sit inside larger boxes. So... what can we conclude? @WongKinYiu @dkurt, if I am correct, OpenCV is now calculating the exact correct neural network output, and the only difference is in box post-processing in Darknet vs OpenCV. What do you think? If I am right about differences in NMS processing, then the only remaining question is why Darknet sees a car to the left (cropped) and OpenCV doesn't. Even if I set OpenCV to a 1% threshold, it doesn't detect a box to the left. This one: |
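To illustrate the NMS theory, here is a minimal greedy IoU-based NMS sketch (hypothetical code, not either framework's actual implementation). One interesting detail: a small box fully nested inside a big same-class box has a low IoU, so whether plain IoU-NMS suppresses it depends entirely on the overlap metric and threshold each framework uses:

```python
import numpy as np

def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def nms(boxes, scores, thresh):
    # Greedy NMS: keep the highest-scoring box, drop anything
    # overlapping a kept box by more than `thresh` IoU.
    keep = []
    for idx in np.argsort(scores)[::-1]:
        if all(iou(boxes[idx], boxes[k]) < thresh for k in keep):
            keep.append(int(idx))
    return keep

# A big 100x100 "car" box with a 30x30 box fully inside it:
boxes = [(0, 0, 100, 100), (10, 10, 40, 40)]
scores = [0.9, 0.6]
kept = nms(boxes, scores, 0.45)
# both survive: the nested box's IoU is only 900/10000 = 0.09
```

So plain IoU-NMS would actually keep nested same-class boxes; filtering them out would need something like intersection-over-minimum-area instead of IoU, which may be part of the behavioral difference being discussed.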
|
@VideoPlayerCode Hello, I cannot find the reason for the problem in the image, since I do not know the implementation details of opencv dnn. Could you help me fill in this table? Thanks a lot.
|
|
@WongKinYiu I was about to head to sleep now but I don't wanna make you wait long for a result. ;-) I took the OpenCV object detector and modified it at this line for benchmarking instead: https://github.com/opencv/opencv/blob/master/samples/dnn/object_detection.cpp#L215-L224

Sloppily modified to run the forward pass 400 times, each time asking the network (OpenCV DNN) how long it took, and then I calculate the average:

```cpp
if (async)
{
    futureOutputs.push(net.forwardAsync());
}
else
{
    double totalTime = 0.0;
    int benchCount = 400;
    for (int xyz = 0; xyz <= benchCount; ++xyz) {
        std::vector<Mat> outs;
        net.forward(outs, outNames);
        if (xyz > 0) { // we ignore the 0th "warmup" run since slow network setup happens on that run
            std::vector<double> layersTimings;
            double freq = cv::getTickFrequency() / 1000;
            double time = net.getPerfProfile(layersTimings) / freq;
            totalTime += time;
            std::cout << "Time: " << time << " ms" << std::endl;
            if (xyz == benchCount) {
                std::cout << "Total Time: " << totalTime << " ms for " << benchCount << " runs" << std::endl;
                std::cout << "Average Time: " << (totalTime / (double)benchCount) << " ms" << std::endl;
                predictionsQueue.push(outs);
            }
        }
    }
}
```

Result YOLOv3-Tiny:

Result YOLOv3-Tiny-PRN:

Summary of Results:

Now let's remember your CPU speed ratio at Darknet: #15724

So, in conclusion: AWESOME! We got almost the same relative "PRN speedup" on OpenCV (71.33%) as what these optimizations gave on Darknet (62.4%)! And the fact is that Darknet's CPU code is terrible, so it doesn't surprise me that it got a slightly better relative improvement from the PRN network: Darknet is so inefficient at everything that the lowered layer complexity has a bigger effect there, whereas OpenCV is super efficient and well coded. Either way, OpenCV got a HUGE improvement too! This is awesome!

Thank you so much @WongKinYiu for designing this network and @dkurt for your amazing work implementing the necessary math! Now you can see why I was so excited about this network! It's giving 40.2% more FPS than YOLOv3-Tiny, and extremely similar detection accuracy. Mindblowing.

And @WongKinYiu, if you want me to benchmark via Darknet on this machine, I'd need to know how to do that. Hopefully the answer isn't "use the Darknet library in a C program and time it yourself". If so, do you have any code for that? I don't feel like learning the Darknet C interface. ^_^

Alright world, goodnight for today! :-) |
|
By the way, all those results are with the full COCO-trained (80 classes) models. On my existing 1-class YOLOv3-Tiny model, OpenCV takes an average of 24.8562 ms (40.23 FPS) in this benchmark. The 80-class model of YOLOv3-Tiny (which averaged 28.879 ms) is therefore 16.18% slower. I don't have any YOLOv3-Tiny-PRN 1-class model yet, but if that ratio still holds (and I think it will), then we can expect a 1-class PRN model to take 17.7293 ms (56.40 FPS). In other words: I will be training a 1-class PRN version, probably tomorrow, to get a real 1-class test for PRN to replace the "GUESS". :-) Alright, I'm off for today! 😴 |
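That projection can be reproduced with a little arithmetic (the 80-class PRN time is inferred from the "40.2% more FPS" figure quoted earlier, so this is a back-of-the-envelope check, not a measurement):

```python
tiny_80 = 28.879            # ms, measured: 80-class YOLOv3-Tiny
tiny_1 = 24.8562            # ms, measured: 1-class YOLOv3-Tiny
prn_80 = tiny_80 / 1.402    # ms, inferred from "40.2% more FPS than Tiny"

# Assume the 1-class/80-class time ratio carries over to PRN:
prn_1 = prn_80 * (tiny_1 / tiny_80)
fps_1 = 1000.0 / prn_1
# prn_1 is roughly 17.73 ms, i.e. about 56.4 FPS, matching the guess above
```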
|
Thank you @VideoPlayerCode and @WongKinYiu! @WongKinYiu, thanks to the scheme from #15739 (comment) we could achieve the same behavior as Darknet. |
|
@VideoPlayerCode, @WongKinYiu, you might also be interested in adding this network to our experimental project with accuracy/efficiency diagrams: https://github.com/dkurt/dl_tradeoff. You can see what it looks like at https://dkurt.github.io/dl_tradeoff/. |
|
@dkurt looks great! |
Results! @dkurt @WongKinYiu Hello, the results are here. I finally had some time to train and test a 1-class PRN version! This is a followup from #15739 (comment).

I re-ran all tests today, since my CPU is faster today, so it wouldn't have been comparable to the earlier tests. And yes, the theory was correct! 1-class Tiny PRN is super fast, just as guessed! (I mostly use 1-2 classes, and I don't want to depend on GPUs/CUDA, so the fact that this new network reaches almost 60 FPS on CPU on 1 class is incredible! And so is the fact that it even reaches 51 FPS on 80 classes, so it is suitable for heavy use too!)

Also, during training, I saw that the mAP for YOLOv3-Tiny-PRN is pretty much identical to YOLOv3-Tiny, despite being a much smaller network. The accuracy of the PRN network can also be verified at Wong Kin Yiu's graphs comparing the two nets here on a per-class basis: AlexeyAB/darknet#4091 (comment) ... @WongKinYiu thank you for this genius network design.

In short: YOLOv3-Tiny-PRN gives 35-40% more FPS, with the same accuracy. Thank you both so much for designing the network and for implementing it in OpenCV! |
|
PS: About the non-detected bus on the left side of the test image:
This theory sounds right. Probably subtle differences in some layer implementations. Well, it's okay, the implementation is near perfect and finds all objects with the same accuracies as in Darknet itself! (For example, the car that was misdetected as a truck was 54% in both Darknet and OpenCV.) And as for the way Darknet has more boxes than OpenCV (there are many overlapping/duplicate boxes in the Darknet photos), it's probably what I guessed: differences in NMS implementation. Either way, it's clear that the shortcut layer is perfectly implemented now and that this ticket is ready for merge. Deep thanks for all your great work @dkurt! |
|
❤️ @alalek |
For anyone who needs this new feature right now on Python and can't wait 3 months for the next OpenCV version: I've successfully compiled and packaged a patched Python module, and documented the process here: opencv/opencv-python#254

The guide is duplicated here, for convenience. This was inspired by needing a brand new feature (#15739) even though OpenCV only officially releases new versions ~4 times per year. I don't have time to wait for 3 months, so I needed to build a … The build process was pretty easy, but complicated at the same time. So this documents the entire process to help others!

Requirements
Preparing to Build!
Building!
After the Build!
Celebrate. |
Thanks @VideoPlayerCode for the amazing work. Can you please share the exact CPU model as well? Also, were you using an SSD or HDD? Thanks |
|
@mmaaz60 Hello, @VideoPlayerCode uses an i7-8750H: 80 classes, 50.90 FPS (best performance mode, without displaying results). @WongKinYiu uses the following settings: |
|
Thanks @WongKinYiu, it really helps. Just wanted to share that when I used the original Tiny YOLOv3 (not the pruned one) with the OpenCV dnn module and the Inference Engine backend, I got around 39 FPS on an i7-6700 (20 classes, HDD, display off). So using the Inference Engine backend will further increase the FPS of Tiny YOLOv3 Pruned. Will share the benchmarks once done. It would also be awesome if you could repeat the benchmarks with OpenCV dnn and the Inference Engine backend. Thanks |
|
YOLOv3-tiny-PRN:
YOLOv3-tiny:
|
|
Hi @WongKinYiu, have you tried using the OpenVINO inference engine backend with YOLOv3-tiny-PRN? It will surely improve speed. Thanks |
|
Not yet, I'm not familiar with OpenVINO. |
|
following your magic |
This pullrequest changes
resolves #15724
This PR enables the Eltwise layer (sum, prod, max) with input tensors that have different numbers of channels.
Merge with extra: opencv/opencv_extra#679