##### System information (version)
- OpenCV => 4.2.0
- Operating System / Platform => Ubuntu 18.04
- Compiler => GCC 7.4
##### Detailed description
The vast majority of permute operations can be reduced to either a reshape or a 2d-transpose. The reshape case can be skipped entirely, and the transpose case can be performed efficiently in place.
Stats for single-image inference:
| Model | Total Permute Layers | Reduced to 2d-transpose | Reduced to reshape |
|---|---|---|---|
| MobileNet SSD | 12 | 10 | 2 |
| YOLOv3 | 3 | 3 | 0 |
| Inception v2 Faster RCNN | 3 | 2 | 1 |
| Inception v2 Mask RCNN | 3 | 2 | 1 |
Currently, the permute layer operates in place only when the permute order is the identity. This check can be upgraded to incorporate the logic used here:
opencv/modules/dnn/src/cuda/permute.cu, lines 173 to 186 at commit 1f2b2c5:

```cpp
/* singleton axes do not contribute towards address calculation
 *
 * Reasoning:
 * ----------
 * Suppose an item's indices in the input tensor are [i1, i2, ...]. The indices in the
 * output tensor will be some permutation of the input tensor indices. Let the output
 * tensor indices be [o1, o2, ...]. The permutation operation essentially copies items
 * from the input tensor to new locations in the output tensor as dictated by the indices.
 *
 * If the size of the nth axis (say i2) of the input is one, the input and output indices for
 * all the elements will be of the form [i1, 0, ...] and [..., 0, ...] respectively.
 * The index does not contribute to the element's address calculation and hence would give
 * an identical result if it weren't there.
 */
```

opencv/modules/dnn/src/cuda/permute.cu, lines 217 to 231 at commit 1f2b2c5:

```cpp
/* contiguous axes whose relative ordering stays the same before and after permutation can be merged into one axis
 * example: in permute order 0 2 3 1, axes 2 and 3 can be grouped into a single axis
 *
 * Reasoning:
 * ----------
 * Suppose an item's indices in the input tensor are [i0, i1, i2, i3, ...]. Let the permutation order be [0, 3, 1, 2, ...].
 * Note that i1 and i2 are adjacent axes in the same order in the input as well as the output. The indices in the output
 * tensor will be [i0, i3, i1, i2, ...].
 *
 * Each axis in the contiguous axes sequence will add an offset of iN * strideN. In the above example,
 * the two axes add a total offset of `i1 * (size2 * stride2) + i2 * stride2`, which is `(i1 * size2 + i2) * stride2`,
 * in both the input and the output. Note that stride2 can be different in the input and output. We can merge the two
 * axes into one axis with a size of `size1 * size2`. The new offset added will be `i12 * stride12` as the kernel
 * iterates through `i12`. Note that `i12` is actually `(i1 * size2 + i2)` and `stride12` is `stride2`.
 */
```
This logic would allow the permute operation to be skipped entirely when it reduces to a reshape, and to be performed in place when it reduces to a 2d-transpose.
The reshape optimization is the easier one to implement: the check can live in getMemoryShapes. The same logic can be added to op_permute.cpp for the Vulkan backend, and it is fairly easy to do the same in the CUDA backend. For the Inference Engine backend, the IE nodes created in BlankLayer can be reused when the permute reduces to a reshape (or perhaps IE's PermuteLayer/TransposeOp already handles this on its own).
The in-place 2d-transpose optimization is harder: since not all backends may support it, in-place execution would have to be triggered only for specific backends, which I believe the current DNN module does not support.