Implement ChannelShuffle op with XNNPACK#43602
Implement ChannelShuffle op with XNNPACK#43602taivu1998 wants to merge 10 commits intogh/taivu1998/6/basefrom
Conversation
[ghstack-poisoned]
💊 CI failures summary and remediationsAs of commit 99777c9 (more details on the Dr. CI page): 💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group. This comment has been revised 26 times. |
kimishpatel
left a comment
There was a problem hiding this comment.
Overall this looks good. Please benchmark for both channels first and channels last and report the numbers here.
Differential Revision: [D23334952](https://our.internmc.facebook.com/intern/diff/D23334952) [ghstack-poisoned]
kimishpatel
left a comment
There was a problem hiding this comment.
Looks good to me. Add benchmark results to see the perf improvement.
|
Differential Revision: [D23334952](https://our.internmc.facebook.com/intern/diff/D23334952) [ghstack-poisoned]
This definitely is not as good as I expected, in some cases worse, but I think I know what is going on. So XNNPACK can access input out of bounds. E.g. if input is of 64 bytes it is possible that XNNPACK will try to access up 64+16 bytes. So the call In mobile however the case is different. In mobile builds we use
|
|
Can you paste the local diff here? If it is still giving similar perf, then I dont think that necessarily bodes well. We may have to look at the profile and understand why it performs worse. In that case we may just temporarily disable it.
More often than not, channels last is used only for mobile, as far as I know, plus less surprises in performance for non-mobile case. Come to think of it, it might be the case that channel shuffle is better done in NCHW than NHWC. So let's stick with mobile use case plan. |
Differential Revision: [D23334952](https://our.internmc.facebook.com/intern/diff/D23334952) [ghstack-poisoned]
|
So, as @kimishpatel suggested above, the XNNPACK channel shuffle may perform better in the mobile use case, because we don't need to pay the cost of allocating and copying tensors. To test this, I modified https://github.com/pytorch/pytorch/blob/master/c10/core/CPUAllocator.cpp#L225-L244 as follows to test mobile case. Below are some examples.
So the average improvement here is about 40%. Hence, I changed |
|
In addition, I tried switching to |
Differential Revision: [D23334952](https://our.internmc.facebook.com/intern/diff/D23334952) [ghstack-poisoned]
Differential Revision: [D23334952](https://our.internmc.facebook.com/intern/diff/D23334952) [ghstack-poisoned]
Codecov Report
@@ Coverage Diff @@
## gh/taivu1998/6/base #43602 +/- ##
====================================================
Coverage 69.32% 69.32%
====================================================
Files 379 379
Lines 47116 47116
====================================================
+ Hits 32663 32664 +1
+ Misses 14453 14452 -1
Continue to review full report at Codecov.
|
Differential Revision: [D23334952](https://our.internmc.facebook.com/intern/diff/D23334952) [ghstack-poisoned]
kimishpatel
left a comment
There was a problem hiding this comment.
Make sure all tests/builds are passing.
Differential Revision: [D23334952](https://our.internmc.facebook.com/intern/diff/D23334952) [ghstack-poisoned]
Differential Revision: [D23334952](https://our.internmc.facebook.com/intern/diff/D23334952) [ghstack-poisoned]
Differential Revision: [D23334952](https://our.internmc.facebook.com/intern/diff/D23334952) [ghstack-poisoned]
|
@kimishpatel merged this pull request in a678907. |
Summary: Pull Request resolved: pytorch#43602 Test Plan: Imported from OSS Reviewed By: kimishpatel Differential Revision: D23334952 Pulled By: kimishpatel fbshipit-source-id: 858ef3db599b1c521ba3a1855c9a3c35fe3b02b0
Stack from ghstack:
Differential Revision: D23334952