
dnn(permute): reduce permute to reshape (or transpose) whenever possible and work inplace #16306

@YashasSamaga

Description

System information (version)
  • OpenCV => 4.2.0
  • Operating System / Platform => Ubuntu 18.04
  • Compiler => GCC 7.4
Detailed description

The vast majority of permute operations can be reduced to a reshape or a 2D transpose. The reshape operation can then be skipped entirely, and the transpose operation can be performed efficiently in place.

Stats for single-image inference:

| Model | Total permute layers | Reduced to 2d-transpose | Reduced to reshape |
|---|---|---|---|
| MobileNet SSD | 12 | 10 | 2 |
| YOLOv3 | 3 | 3 | 0 |
| Inception v2 Faster RCNN | 3 | 2 | 1 |
| Inception v2 Mask RCNN | 3 | 2 | 1 |
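As a concrete illustration (a minimal pure-Python sketch with made-up dimensions, not the OpenCV code): for a `[1, C, H, W]` input, the common SSD/YOLO-style permute order `[0, 2, 3, 1]` produces exactly the same element order as a plain 2D transpose of the `C x (H*W)` matrix, once the singleton batch axis is dropped and the H and W axes are merged.

```python
# Hypothetical small dimensions; the batch axis of size 1 is dropped.
C, H, W = 2, 3, 4
flat = list(range(C * H * W))  # row-major [C, H, W] tensor as a flat list

# Full permute with order [0, 2, 3, 1]: output[h][w][c] = input[c][h][w],
# read out in row-major order of the output shape [H, W, C].
permuted = [flat[c * H * W + h * W + w]
            for h in range(H) for w in range(W) for c in range(C)]

# 2D transpose of the C x (H*W) view: out[j][i] = in[i][j].
HW = H * W
transposed = [flat[i * HW + j] for j in range(HW) for i in range(C)]

assert permuted == transposed  # identical memory layout either way
```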

Currently, the permute layer operates in place only when the permute order is the identity. This check can be generalized using the two reductions below:

  1. /* singleton axes do not contribute towards address calculation
    *
    * Reasoning:
    * ----------
    * Suppose an item's indices in the input tensor are [i1, i2, ...]. The indices in the
    * output tensor will be some permutation of the input tensor indices. Let the output
    * tensor indices be [o1, o2, ...]. The permutation operation essentially copies items
    * from the input tensor to new locations in the output tensor as dictated by the indices.
    *
    * If the size of the nth axis (say i2) of the input is one, the input and output indices for
    * all the elements will be of the form [i1, 0, ...] and [..., 0, ...] respectively.
    * The index does not contribute to the element's address calculation, and hence the result
    * would be identical if the axis weren't there.
    */
  2. /* contiguous axes whose relative ordering stays the same before and after permutation can be merged into one axis
    * example: in permute order 0 2 3 1, axes 2 and 3 can be grouped into a single axis
    *
    * Reasoning:
    * ----------
    * Suppose an item's indices in the input tensor are [i0, i1, i2, i3, ...]. Let the permutation order be [0, 3, 1, 2, ...].
    * Note that i1 and i2 are adjacent axes in the same order in the input as well as the output. The indices in the output
    * tensor will be [i0, i3, i1, i2, ...].
    *
    * Each axis in the contiguous axes sequence adds an offset of iN * strideN. In the above example,
    * the two axes add a total offset of `i1 * (size2 * stride2) + i2 * stride2`, which is `(i1 * size2 + i2) * stride2`,
    * in both the input and the output. Note that stride2 can differ between the input and the output. We can merge the
    * two axes into one axis with a size of `size1 * size2`. The new offset added will be `i12 * stride12` as the kernel
    * iterates through `i12`, where `i12` is `(i1 * size2 + i2)` and `stride12` is `stride2`.
    */
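The two reductions above can be sketched in a few lines (a hedged Python sketch of the idea, not the DNN module's actual implementation; the function and category names are made up):

```python
def classify_permute(shape, order):
    """Classify a permutation by applying the two reductions above:
    drop singleton axes, then merge contiguous axes whose relative
    order is unchanged by the permutation."""
    # 1. singleton axes do not contribute to address calculation
    order = [ax for ax in order if shape[ax] != 1]
    # relabel the surviving input axes as 0..k-1
    relabel = {ax: i for i, ax in enumerate(sorted(order))}
    order = [relabel[ax] for ax in order]
    # 2. merge runs of consecutive input axes that appear consecutively
    runs = []
    for ax in order:
        if runs and ax == runs[-1][1] + 1:
            runs[-1][1] = ax          # extend the current run
        else:
            runs.append([ax, ax])     # start a new run
    # relabel each run by the rank of its first input axis
    starts = sorted(run[0] for run in runs)
    reduced = [starts.index(run[0]) for run in runs]
    if reduced == sorted(reduced):
        return "reshape"        # identity permutation after reduction
    if reduced == [1, 0]:
        return "2d-transpose"
    return "general"

# e.g. an SSD-style permute on a single-image [1, C, H, W] blob
# (dimensions here are illustrative):
assert classify_permute((1, 273, 19, 19), (0, 2, 3, 1)) == "2d-transpose"
```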

This logic would allow the permute operation to be skipped entirely when it reduces to a reshape, and to be performed in place when it reduces to a 2D transpose.

The reshape optimization is easier to implement: the logic can live in getMemoryShapes. The same logic can be implemented in op_permute.cpp for the Vulkan backend, and it is fairly easy to do the same in the CUDA backend. For the Inference Engine backend, the nodes created in BlankLayer can be used when permute reduces to a reshape (or perhaps IE's PermuteLayer/TransposeOp could handle it on its own).

The in-place 2D-transpose optimization would require in-place execution to be enabled per backend, since not all backends may support it, which I believe the current DNN module does not allow.
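For reference, an in-place transpose of a non-square rows x cols matrix is possible with the classic cycle-following algorithm. Below is a sequential sketch under the assumption of a flat row-major buffer (a parallel backend would need a different variant; the function name is made up):

```python
def transpose_inplace(a, rows, cols):
    """Transpose a rows x cols row-major matrix, stored in the flat
    list `a`, without an auxiliary buffer: the element at flat index i
    (0 < i < n-1) moves to index (i * rows) % (n - 1)."""
    n = rows * cols
    if n <= 2:
        return
    visited = [False] * n  # O(n) flags; cycles can also be re-derived to save space
    for start in range(1, n - 1):
        if visited[start]:
            continue
        i, carried = start, a[start]
        while True:
            dest = (i * rows) % (n - 1)   # destination index of position i
            a[dest], carried = carried, a[dest]  # drop carried value, pick up displaced one
            visited[dest] = True
            if dest == start:
                break
            i = dest

# 2 x 3 matrix [[0, 1, 2], [3, 4, 5]] becomes 3 x 2 [[0, 3], [1, 4], [2, 5]]
m = list(range(6))
transpose_inplace(m, 2, 3)
assert m == [0, 3, 1, 4, 2, 5]
```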
