Added DNN Darknet Yolo v2 for object detection #9705
opencv-pushbot merged 1 commit into opencv:master
Conversation
@AlexeyAB, thank you, this is a very valuable contribution! Could you please add some regression test(s) for this functionality?
@vpisarev I added:
```cpp
for (it_type i = net->layers_cfg.begin(); i != net->layers_cfg.end(); ++i) {
    ++layers_counter;
    std::map<std::string, std::string> &layer_params = i->second;
    std::string layer_type = layer_params["type"];
```
Please add an assertion for unknown layer types to prevent unexpected errors. For example, I can't read any model right now because every layer_type ends with a ] character (convolutional], maxpool]) (Ubuntu OS).
It works now, but the Reproducibility_TinyYoloVoc and Reproducibility_YoloVoc tests fail for me. Do they pass locally?
```cpp
 * @param darknetModel path to the .weights file with learned network.
 * @returns Pointer to the created importer, NULL in failure cases.
 */
CV_EXPORTS_W Ptr<Importer> createDarknetImporter(const String &cfgFile, const String &darknetModel = String());
```
We defined methods like createCaffeImporter as deprecated. Please keep only readNetFromDarknet.
```cpp
cv::Mat frame = cv::imread(parser.get<string>("image"), -1);

if (frame.channels() == 4)
```
It isn't necessary: just use imread with default argument (http://docs.opencv.org/master/d4/da8/group__imgcodecs.html#ga288b8b3da0892bd651fce07b3bbd3a56).
@AlexeyAB, thanks, but I meant that cv::imread can read images with alpha into 24bit, http://docs.opencv.org/master/d4/da8/group__imgcodecs.html#gga61d9b0126a3e57d9277ac48327799c80af660544735200cbe942eea09232eb822.
@dkurt I fixed it.
Initially I did it as in ssd_object_detection.cpp and I thought that this has some hidden meaning :)
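For readers following along, here is a self-contained sketch (plain C++, no OpenCV dependency; the function name `bgraToBgr` is mine) of what dropping the alpha channel amounts to — the operation that `cvtColor(..., COLOR_BGRA2BGR)` performs, and that `imread`'s default flag already makes unnecessary:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Drop the alpha channel from interleaved BGRA pixel data, producing BGR.
// This mirrors conceptually what cv::cvtColor(..., COLOR_BGRA2BGR) does,
// and what cv::imread already returns with its default IMREAD_COLOR flag.
std::vector<unsigned char> bgraToBgr(const std::vector<unsigned char>& bgra)
{
    assert(bgra.size() % 4 == 0);
    std::vector<unsigned char> bgr;
    bgr.reserve(bgra.size() / 4 * 3);
    for (std::size_t i = 0; i < bgra.size(); i += 4) {
        bgr.push_back(bgra[i]);     // B
        bgr.push_back(bgra[i + 1]); // G
        bgr.push_back(bgra[i + 2]); // R
        // bgra[i + 3] (alpha) is discarded
    }
    return bgr;
}
```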
```cpp
if (frame.channels() == 4)
    cvtColor(frame, frame, cv::COLOR_BGRA2BGR);
//! [Prepare blob]
Mat preprocessedFrame = preprocess(frame, network_width, network_height);
```
Please use blobFromImage's arguments to do the preprocessing (http://docs.opencv.org/3.3.0/d6/d0f/group__dnn.html#ga0507466a789702eda8ffcdfa37f4d194).
```cpp
return false;

// Darknet ROUTE-layer
if (useRoute) return true;
```
Is there some difference between Route layer and Concat? getMemoryShapes returns true if layer can work in-place (all element-wise layers).
I don't know why, but it doesn't work for Yolo if getMemoryShapes returns false.
The Route layer simply copies outputs from several layers unchanged: https://github.com/pjreddie/darknet/blob/master/src/route_layer.c#L83
It uses copy_cpu() with INCX=1 and INCY=1: https://github.com/pjreddie/darknet/blob/master/src/blas.c#L208
@AlexeyAB, it seems to me the problem is in the route layer with a single input (which means the problem is in the current concat layer with #inputs == 1): https://github.com/pjreddie/darknet/blob/master/cfg/yolo-voc.cfg#L208. It is used like an identity layer, right?
@dkurt Yes, a route with a single input (bottom layer) is used as an identity layer.
@AlexeyAB, could you add an extra branch during route layer creation: add a Concat layer if the number of inputs is more than 1, or an Identity layer otherwise?
@dkurt I added an identity layer for the single-input case. But why can't the concat layer work with 1 input, and why is there no CV_Assert for this case?
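To illustrate the semantics being discussed, a minimal sketch (plain C++; the names are mine, not the dnn API) of what a Darknet route layer does on flattened CHW tensors that share the same spatial size: with several inputs it concatenates them along the channel axis (Concat), with a single input it degenerates into a plain copy (Identity):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Route = channel-wise concatenation of same-H,W tensors stored as flat
// CHW buffers. With one input, the output is simply a copy of that input.
std::vector<float> route(const std::vector<std::vector<float> >& inputs)
{
    assert(!inputs.empty());
    std::vector<float> out;
    for (std::size_t i = 0; i < inputs.size(); ++i)
        out.insert(out.end(), inputs[i].begin(), inputs[i].end());
    return out;
}
```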
```cpp
    setParams.setConcat(layers_vec.size(), layers_vec.data());
}
else if (layer_type == "reorg")
```
I'm a bit confused about the reorg layer. Let the input be:
```
channel_0   channel_1   channel_2   channel_3
0 1         4 5         8 9         c d
2 3         6 7         a b         e f
```
and reorgStride = 2. So the output shape is 4x4x1 and the values are:
```
output
1 4 1 5
8 c 9 d
2 6 3 7
a e b f
```
?
I left the slightly strange original implementation of this layer unchanged. It increases the field of view of each final activation.
Reshape: 26 x 26 x 64 -> 13 x 13 x 256
For stride = 2:
```
input
0, 1, 2, 3,
4, 5, 6, 7,
8, 9, a, b,
c, d, e, f

output
channel_0   channel_1   channel_2   channel_3
0 2         1 3         4 6         5 7
8 a         9 b         c e         d f
```
- OpenCV C++ example: http://coliru.stacked-crooked.com/a/eb13942be083fa3d
- Darknet C++ example: http://coliru.stacked-crooked.com/a/225d7fb1f25b286c
- The param `reverse` is usually absent in the cfg-file of a model, so `reverse = 0` by default: https://github.com/pjreddie/darknet/blob/master/src/parser.c#L387
- Then `l.reverse = reverse;`: https://github.com/pjreddie/darknet/blob/master/src/reorg_layer.c#L28
- So the function `reorg()` is called with `forward=0`: https://github.com/pjreddie/darknet/blob/8215a8864d4ad07e058acafd75b2c6ff6600b9e8/src/reorg_layer.c#L108
- The `reorg_cpu()` implementation uses `out[in_index] = x[out_index];`: https://github.com/pjreddie/darknet/blob/master/src/blas.c#L25
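Putting the links above together, here is a self-contained sketch of the resulting forward mapping (assuming `reorg_cpu` is invoked with the output dimensions and `forward = 0`, as discussed; the function name and signature are mine). It reproduces the stride-2 example above, where a 1 x 4 x 4 input holding 0..f becomes a 4 x 2 x 2 output:

```cpp
#include <cassert>
#include <vector>

// Sketch of the Darknet reorg forward pass: stride s turns a c x h x w input
// into a (c*s*s) x (h/s) x (w/s) output. The index arithmetic follows
// darknet's reorg_cpu with forward = 0, i.e. dst[outIdx] = src[mappedIdx].
std::vector<float> reorg(const std::vector<float>& src,
                         int c, int h, int w, int s)
{
    const int outC = c * s * s, outH = h / s, outW = w / s;
    std::vector<float> dst(src.size());
    for (int k = 0; k < outC; ++k) {
        const int c2 = k % c;       // source channel
        const int offset = k / c;   // which of the s*s sub-grids
        for (int j = 0; j < outH; ++j)
            for (int i = 0; i < outW; ++i) {
                const int srcX = i * s + offset % s;
                const int srcY = j * s + offset / s;
                dst[i + outW * (j + outH * k)] =
                    src[srcX + w * (srcY + h * c2)];
            }
    }
    return dst;
}
```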
Thanks! Anyway, I suggest implementing it as a single layer, or thinking about how we can do the same transformation using existing layers (Permute, Reshape). A Reshape layer doesn't change the data, by definition, in any of the frameworks.
@dkurt I added reorg as a separate layer: reorg_layer.cpp.
```cpp
setParams.setReshape(stride, current_shape.input_channels, current_shape.input_h, current_shape.input_w);

current_shape.input_channels = 256;
```
@AlexeyAB, thank you for the valuable contribution! We need to test all the new functionality carefully. Can you add some unit tests with small few-layer networks, like we do for the other importers? (https://github.com/opencv/opencv/blob/master/modules/dnn/test/test_torch_importer.cpp and https://github.com/opencv/opencv_extra/blob/master/testdata/dnn/torch/torch_gen_test_data.lua, https://github.com/opencv/opencv/blob/master/modules/dnn/test/test_tf_importer.cpp and https://github.com/opencv/opencv_extra/blob/master/testdata/dnn/tensorflow/generate_tf_models.py). For example: write simple configs, run darknet to initialize the weights, pass some random input, get the output, and put the configs/weights/inputs/outputs into a darknet subfolder of opencv_extra/testdata/dnn.
@dkurt So I already added testdata and models for object detection using DNN Darknet Yolo v2 to the
In this pull request there is:
@AlexeyAB, yeah, it's great, but I meant tests for separate layers. First of all, it's necessary to protect the work done from bugs that might appear in future development. The next thing is that BuildBot doesn't test these models for now because they aren't there. My local tests fail, and I think we can solve the problem with small checks for separate layers. I referenced how we write unit tests for different frameworks. The binary size of the required data is not so huge (i.e. less than 0.5MB for TensorFlow layers) and you can add it in a single PR @ opencv_extra.
All tests passed on both Windows 7 x64 and Linux Debian 8.2 x64. Results for comparison with the OpenCV version were obtained on Linux Debian 8.2 using the current last commit of Darknet Yolo v2 compiled with GPU=0, OPENMP=1 and OpenCV=1: https://github.com/pjreddie/darknet Using commands:
```cpp
}
net->transpose = (net->major_ver > 1000) || (net->minor_ver > 1000);

layerShape current_shape;
```
Why do we track shapes? Doesn't the weights file contain the kernel shapes?
Right, the weights file doesn't contain kernel shapes.
Darknet also tracks layer shapes while parsing a cfg-file:
- parse_network_cfg(): https://github.com/pjreddie/darknet/blob/master/src/parser.c#L630
- parse_convolutional(): https://github.com/pjreddie/darknet/blob/master/src/parser.c#L169
- make_convolutional_layer(): https://github.com/pjreddie/darknet/blob/master/src/convolutional_layer.c#L166
- convolutional_out_height(): https://github.com/pjreddie/darknet/blob/master/src/convolutional_layer.c#L66
```cpp
int convolutional_out_height(convolutional_layer l)
{
    return (l.h + 2*l.pad - l.size) / l.stride + 1;
}
```
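As a tiny sanity check of the formula above (the helper name is mine), the standard convolution output-size rule with YOLO-typical values — a 416-pixel input, 3x3 kernel, pad 1 — keeps the size at stride 1 and halves it at stride 2:

```cpp
#include <cassert>

// Spatial output extent of a convolution, as in Darknet's
// convolutional_out_height: (in + 2*pad - kernel) / stride + 1.
int convOutSize(int in, int pad, int kernel, int stride)
{
    return (in + 2 * pad - kernel) / stride + 1;
}
```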
@AlexeyAB, maybe we can remove at least the width/height tracking? As far as I can see, only current_shape.input_channels is used to read the convolutional layer weights.
```cpp
ifile.open(darknetModel, std::ios::binary);
CV_Assert(ifile.is_open());

ifile.read(reinterpret_cast<char *>(&net->major_ver), sizeof(int32_t));
```
The version numbers are used only to decide how many bytes to skip for the seen value; transpose isn't used at all. Please make all unused NetParameter variables local.
@dkurt Yes, a route layer with a single input (bottom layer) is used like an identity layer.
```cpp
void setMaxpool(size_t kernel, size_t pad, size_t stride, size_t channels_num)
{
    cv::dnn::experimental_dnn_v1::LayerParams maxpool_param;
    maxpool_param.set<cv::String>("pool", "max");
```
Please set only the actual parameters: "pool", "kernel_size", "pad", "stride".
Ok. Also, maxpool_param.set<cv::String>("pad_mode", "SAME"); is required for odd layer sizes.
However, only one padding strategy can be used at a time: manual values, or padMode ("SAME", "VALID") from TensorFlow. Please take a look at the "ceil_mode" flag instead: https://github.com/opencv/opencv/blob/master/modules/dnn/src/layers/pooling_layer.cpp#L629.
- The accuracy test passes for Tiny-Yolo if padMode="SAME", with any ceil_mode value.
- The accuracy test can't pass for Tiny-Yolo with any other padMode setting (padMode="VALID", or padMode not set), with any ceil_mode value.
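For the odd 13x13 feature maps where this matters in Tiny-Yolo, the rounding strategies under discussion can be sketched as follows (helper names are mine; the formulas follow the usual TensorFlow SAME/VALID and Caffe-style ceil conventions). With input 13, kernel 2, stride 2, VALID drops the last column (output 6) while SAME and ceil rounding both cover it (output 7):

```cpp
#include <cassert>
#include <cmath>

// "VALID": no padding, floor rounding.
int poolOutValid(int in, int kernel, int stride)
{
    return (in - kernel) / stride + 1;
}
// "SAME": pad so every input element is covered; out = ceil(in / stride).
int poolOutSame(int in, int stride)
{
    return (in + stride - 1) / stride;
}
// Caffe-style: explicit padding with ceil rounding of the window count.
int poolOutCeil(int in, int kernel, int pad, int stride)
{
    return (int)std::ceil((double)(in + 2 * pad - kernel) / stride) + 1;
}
```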
```cpp
int w2 = i*reorgStride + offset % reorgStride;
int h2 = j*reorgStride + offset / reorgStride;
int out_index = w2 + width*reorgStride*(h2 + height*reorgStride*c2);
dstData[in_index] = srcData[out_index];
```
Isn't there a typo in the placement of the in<->out indices?
No, there is no typo; initially I left the slightly strange original implementation of this layer unchanged.
But now I have changed this place so that there is no confusion.
```cpp
CV_Assert(outputs[0][0] > 0 && outputs[0][1] > 0 && outputs[0][2] > 0 && outputs[0][3] > 0);

return true;
```
It seems to me the Reorg layer can't work in-place; getMemoryShapes returns true if a layer can do that.
```cpp
{
    CV_Assert(inputs.size() > 0);
    outputs = std::vector<MatShape>(inputs.size(), shape(inputs[0][1] * inputs[0][2] * anchors, inputs[0][3] / anchors));
    return true;
```
Same as the Reorg layer: it should return false.
```cpp
darknet::LayerParameter lp;
std::string layer_name = toString(layer_id);
if (use_batch_normalize || use_relu) layer_name = "conv_" + layer_name;
```
It's better to always name layers with a type prefix. Moreover, some layers are just named with numbers, and it's hard to debug them.
```cpp
}

cv::dnn::experimental_dnn_v1::LayerParams getParamConvolution(int kernel, int pad,
                                                              int stride, int filters_num, int channels_num)
```
```cpp
    fused_layer_names.push_back(last_layer);
}

void setMaxpool(size_t kernel, size_t pad, size_t stride, size_t channels_num)
```
```cpp
std::string top(const int index) const { return layer_name; }
};

struct layerShape {
```
```cpp
    params.blobs = blobs;
}

void setLastLayerName(std::string layer_name)
```
```cpp
                 inputs[0][3] / reorgStride));

CV_Assert(outputs[0][0] > 0 && outputs[0][1] > 0 && outputs[0][2] > 0 && outputs[0][3] > 0);
```
Please, add an assertion that total(outputs[0]) == total(inputs[0]).
```cpp
int out_c = channels / (reorgStride*reorgStride);

for (int k = 0; k < channels; ++k) {
```
Please make it clearer: iterate over output dimensions and map them to input ones.

Most likely a logical mistake was made in the reorg layer of the original Darknet: https://github.com/pjreddie/darknet/blob/master/src/blas.c#L9
It works as I described if it is called as reorg(input, output, out_w, out_h, out_c, ...); #9705 (comment)
But in the original Darknet the function is called as reorg(input, output, in_w, in_h, in_c, ...);, so the one-to-one correspondence of the input and output parameters is preserved, but very strange permutations occur.
But because the original Darknet works with this implementation of reorg and all models were trained using it, we can't fix this logical mistake.
Why hasn't the author found and corrected this error? I think:
- Perhaps this logical error does not spoil the detection accuracy, so it was never detected.
- Theoretically, we can assume that this error even increased the accuracy, so the author found it and left it.
> iterate over output dimensions and map them to input ones.

So I can implement it this way, but it works correctly only if (in_w % 2 == 0 && in_h % 2 == 0 && in_c % 4 == 0): http://coliru.stacked-crooked.com/a/b962d20938362d4f
```cpp
void reorg_my(const float *const srcData, float *const dstData, int width, int height, int channels, int reorgStride)
{
    int outChannels = channels * reorgStride * reorgStride;
    int outHeight = height / reorgStride;
    int outWidth = width / reorgStride;
    for (int y = 0; y < outHeight; ++y) {
        for (int x = 0; x < outWidth; ++x) {
            for (int c = 0; c < outChannels; ++c) {
                int out_index = x + outWidth*(y + outHeight*c);
                int step = c / channels;
                int x_offset = step % reorgStride;
                int y_offset = reorgStride * ((step / reorgStride) % reorgStride);
                int in_x = x * reorgStride + x_offset;
                int out_seq_y = y + c*outHeight;
                int in_intermediate_y = out_seq_y*2 - out_seq_y%2;
                in_intermediate_y = in_intermediate_y % (channels*height);
                int in_c = in_intermediate_y / height;
                int in_y = in_intermediate_y % height + y_offset;
                int in_index = in_x + width*(in_y + height*in_c);
                dstData[out_index] = srcData[in_index];
            }
        }
    }
}
```
Is there a GPU version of reorg?
```cpp
const float confidenceThreshold = 0.24;

for (int i = 0; i < out.rows; i++) {
    float const*const prob_ptr = &out.at<float>(i, 5);
```
float const*const is a bit confusing (there are 4 places with it).
May I ask you to use named constant variables or add comments? The magic numbers are hard to understand, especially in the samples and tests.
I removed const*const, added named constant variables, and described the format of the network output that is compared to the reference in the tests.
But why is const*const confusing? Is it contrary to the code style conventions accepted in OpenCV?
The 1st const forbids modification of the values pointed to by this pointer; the 2nd const forbids modification of the pointer itself.
```cpp
getParamConvolution(kernel, pad, stride, filters_num);

darknet::LayerParameter lp;
std::string layer_name = "conv_" + toString(layer_id);
```
Please try to use cv::format("conv_%d", layer_id) instead of toString here and in other places.
```cpp
namespace darknet {

class LayerParameter {
```
I hope we can omit this structure. Layers are connected sequentially or via explicit numeric offsets relative to the newly added layer, so I think it's possible to use a single vector of layers during network building. May I ask you to try it?
Do you mean that I should try to use cv::dnn::experimental_dnn_v1::LayerParams instead of darknet::LayerParameter?
Yeah, I think we can just parse the text and binary files simultaneously: for every entry in the config we create a new LayerParams and fill it depending on the layer type. If a specific layer has weights, we read them from the opened binary file. Then we add the layer to the final network (addLayerToPrev, or addLayer with multiple connections based on the id of the new layer and the offsets, i.e. -1, -4 of route).
@AlexeyAB, on the other hand, let's keep it as is for now. I'll just install darknet, compare it with the PR, and then we can merge it.
Will there be a Python example for cv2 Darknet DNN? When running:
This works OK:
This seems to work OK:
But the detection result doesn't make sense:
Referring to yolo_object_detection.cpp. Thanks for your work porting this to OpenCV.
This repo shows a custom face detection model demo using Python. The models were trained on the Widerface dataset and are available for download (weights and cfgs).
It seems that sometimes the implementation does not return the correct detection class type. Even with the default pre-trained 80 class model, sometimes the kind of class is unknown (only zeros in the
Also, when I train my own model with two different classes, it only tells me the confidence, but not the class itself. Here is the dump of the result matrix (2 class model, every class is always zero):
When using darknet to detect the objects, it is always able to tell what kind of object it is (with the confidence), even if the confidence is very low. Do you experience the same behaviour?
@cansik The fact is that for values that are less than the threshold, Darknet zeroes the
You can get the same bounded boxes with the same probability in both OpenCV and Darknet (but not the same probs which are less than the threshold) only with:
Note: in the original Darknet, dets[index].objectness = scale > thresh ? scale : 0; and:
```cpp
if(dets[index].objectness){
    for(j = 0; j < l.classes; ++j){
        int class_index = entry_index(l, 0, n*l.w*l.h + i, l.coords + 1 + j);
        float prob = scale*predictions[class_index];
        dets[index].prob[j] = (prob > thresh) ? prob : 0;
    }
}
```
@AlexeyAB As far as I know, the probability gets cleared by OpenCV. That is ok for me if the confidence is under the threshold. How do I set the threshold in OpenCV? Is it possible to lower it to zero, to get the probabilities of all predictions? Or do you mean by threshold the threshold defined in the
Here is an example which is really strange: On this image the trained network finds three characters (in OpenCV). All of them have a confidence higher than
Second example: For this picture I have exported the result matrix: yolo_results.sheets
Why are there only three probabilities, and not more for each item? Or do I understand the result matrix wrongly?
@cansik, Have you tried to vary
@dkurt Yes, that helped, thank you. I was not sure where to set the threshold, and a (stupid) bug in my result evaluation led to no difference even when I played with this param. Now everything works as expected. But is it possible to set this threshold directly on the
@cansik, OpenCV parses
@AlexeyAB can we visualize our model on TensorBoard after training?
@ahmadfaizan1990 TensorBoard can visualize only data from models that were trained using TensorFlow. For Darknet, you can see a Loss & mAP (accuracy) chart during training: https://github.com/AlexeyAB/darknet#when-should-i-stop-training
@AlexeyAB thanks for your reply. I have one more question: what if I change some convolutional layer in the cfg file, or want to reduce the number of layers? After the change in the cfg file I am getting this error.




This pull request changes
Added neural network Darknet Yolo v2 for object detection: https://pjreddie.com/darknet/yolo/
Added example of usage:
- yolo_object_detection.cpp / example_dnn-yolo_object_detection.exe
- yolo.cfg, yolo-voc.cfg, tiny-yolo.cfg, tiny-yolo-voc.cfg can be downloaded: https://drive.google.com/drive/folders/0BwRgzHpNbsWBN3JtSjBocng5YW8
- yolo9000.cfg

Supported layers:
Merge with extra: opencv/opencv_extra#385
Comparison of use:
original Darknet-Yolo-v2:
darknet.exe detector test data/voc.data yolo-voc.cfg yolo-voc.weights -i 0 -thresh 0.24 data/dog.jpg
OpenCV Yolo example:
example_dnn-yolo_object_detection.exe -cfg=yolo/yolo.cfg -model=yolo/yolo.weights -image=yolo/dog.jpg -min_confidence=0.24
Comparison of results, OpenCV example vs original Darknet: https://github.com/pjreddie/darknet
For cfg, weights and jpg-s from: https://drive.google.com/drive/folders/0BwRgzHpNbsWBN3JtSjBocng5YW8
- yolo.cfg & yolo.weights: dog.jpg, eagle.jpg, giraffe.jpg
- yolo-voc.cfg & yolo-voc.weights: dog.jpg, eagle.jpg, giraffe.jpg

How to train (to detect your custom objects): https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects
Accuracy-speed: