##### System information (version)
- OpenCV => 4.2.0
- Operating System / Platform => Ubuntu 18.04
- Compiler => GCC 7.4
##### Detailed description
The vast majority of permute operations can be reduced to either a reshape or a 2d-transpose. The reshape case can be skipped entirely, and the transpose case can be performed efficiently in place.
Stats for single-image inference:
| Model | Total Permute Layers | Reduced to 2d-transpose | Reduced to reshape |
|---|---|---|---|
| MobileNet SSD | 12 | 10 | 2 |
| YOLOv3 | 3 | 3 | 0 |
| Inception v2 Faster RCNN | 3 | 2 | 1 |
| Inception v2 Mask RCNN | 3 | 2 | 1 |
Currently, the permute layer operates in place only when the permute order is the identity. This check can be upgraded to incorporate the logic used here:
opencv/modules/dnn/src/cuda/permute.cu, lines 173 to 186 at commit 1f2b2c5:

```cpp
/* singleton axes do not contribute towards address calculation
 *
 * Reasoning:
 * ----------
 * Suppose an item's indices in the input tensor are [i1, i2, ...]. The indices in the
 * output tensor will be some permutation of the input tensor indices. Let the output
 * tensor indices be [o1, o2, ...]. The permutation operation essentially copies items
 * from the input tensor to new locations in the output tensor as dictated by the indices.
 *
 * If the size of the nth axis (say i2) of the input is one, the input and output indices for
 * all the elements will be of the form [i1, 0, ...] and [..., 0, ...] respectively.
 * The index does not contribute to the element's address calculation and hence would give
 * an identical result if it weren't there.
 */
```

opencv/modules/dnn/src/cuda/permute.cu, lines 217 to 231 at commit 1f2b2c5:

```cpp
/* contiguous axes whose relative ordering stays the same before and after permutation can be merged into one axis
 * example: in permute order 0 2 3 1, axes 2 and 3 can be grouped into a single axis
 *
 * Reasoning:
 * ----------
 * Suppose an item's indices in the input tensor are [i0, i1, i2, i3, ...]. Let the permutation order be [0, 3, 1, 2, ...].
 * Note that i1 and i2 are adjacent axes in the same order in the input as well as the output. The indices in the output
 * tensor will be [i0, i3, i1, i2, ...].
 *
 * Each axis in the contiguous axes sequence will add an offset of iN * strideN. In the above example,
 * the two axes add a total offset of `i1 * (size2 * stride2) + i2 * stride2`, which is `(i1 * size2 + i2) * stride2`,
 * in both the input and the output. Note that stride2 can be different in the input and output. We can merge the two
 * axes into one axis with a size of `size1 * size2`. The new offset added will be `i12 * stride12` as the kernel
 * iterates through `i12`. Note that `i12` is actually `(i1 * size2 + i2)` and `stride12` is `stride2`.
 */
```
This logic would allow the permute operation to be skipped entirely when it reduces to a reshape, and to be performed in place when it reduces to a 2d-transpose.
The reshape optimization is the easier one to implement: the check can live in getMemoryShapes. The same logic can be added to op_permute.cpp for the Vulkan backend, and it is fairly easy to do the same in the CUDA backend. For the Inference Engine backend, the IE nodes created in BlankLayer can be reused when the permute reduces to a reshape (or perhaps IE's PermuteLayer/TransposeOp already handles this on its own).
The in-place 2d-transpose optimization is harder: since not all backends may support it, in-place execution would have to be triggered only for specific backends, which I believe the current DNN module does not support.