Implement ChannelShuffle op with XNNPACK #43602

Closed
taivu1998 wants to merge 10 commits into gh/taivu1998/6/base from gh/taivu1998/6/head

Conversation

@taivu1998
Contributor

@taivu1998 taivu1998 commented Aug 26, 2020

Stack from ghstack:

Differential Revision: D23334952

taivu1998 pushed a commit that referenced this pull request Aug 26, 2020
ghstack-source-id: c99af4b
Pull Request resolved: #43602
@taivu1998 taivu1998 requested a review from kimishpatel August 26, 2020 01:29
@dr-ci

dr-ci Bot commented Aug 26, 2020

💊 CI failures summary and remediations

As of commit 99777c9 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI. Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 26 times.

Comment thread aten/src/ATen/native/ChanelShuffle.cpp
Comment thread aten/src/ATen/native/ChanelShuffle.cpp Outdated
Comment thread aten/src/ATen/native/xnnpack/ChannelShuffle.cpp Outdated
Comment thread aten/src/ATen/native/xnnpack/ChannelShuffle.cpp
Comment thread benchmarks/operator_benchmark/pt/channel_shuffle_test.py Outdated
Contributor

@kimishpatel kimishpatel left a comment


Overall this looks good. Please benchmark for both channels first and channels last and report the numbers here.

taivu1998 pushed a commit that referenced this pull request Aug 27, 2020
ghstack-source-id: 567c9aa
Pull Request resolved: #43602
Contributor

@kimishpatel kimishpatel left a comment


Looks good to me. Add benchmark results to see the perf improvement.

@taivu1998
Contributor Author

| Hyperparameters | Channel First (original) | Channel Last (original) | Channel Last (XNNPACK) |
| --- | --- | --- | --- |
| Batch_size: 2, Channels per group: 16, Height: 16, Width: 16, Groups: 2 | 10.510 us | 19.407 us | 17.920 us |
| Batch_size: 2, Channels per group: 32, Height: 32, Width: 32, Groups: 2 | 46.724 us | 137.186 us | 65.069 us |
| Batch_size: 4, Channels per group: 32, Height: 32, Width: 32, Groups: 4 | 856.847 us | 1204.111 us | 690.858 us |
| Batch_size: 4, Channels per group: 64, Height: 64, Width: 64, Groups: 4 | 2808.228 us | 26120.442 us | 2765.157 us |
| Batch_size: 8, Channels per group: 64, Height: 64, Width: 64, Groups: 8 | 54180.293 us | 159411.737 us | 49908.312 us |
| Batch_size: 16, Channels per group: 64, Height: 64, Width: 64, Groups: 16 | 218493.526 us | 1107034.267 us | 200149.906 us |

taivu1998 pushed a commit that referenced this pull request Aug 31, 2020
ghstack-source-id: 8f815e9
Pull Request resolved: #43602
@kimishpatel
Contributor

> | Hyperparameters | Channel First (original) | Channel Last (original) | Channel Last (XNNPACK) |
> | --- | --- | --- | --- |
> | Batch_size: 2, Channels per group: 16, Height: 16, Width: 16, Groups: 2 | 10.510 us | 19.407 us | 17.920 us |
> | Batch_size: 2, Channels per group: 32, Height: 32, Width: 32, Groups: 2 | 46.724 us | 137.186 us | 65.069 us |
> | Batch_size: 4, Channels per group: 32, Height: 32, Width: 32, Groups: 4 | 856.847 us | 1204.111 us | 690.858 us |
> | Batch_size: 4, Channels per group: 64, Height: 64, Width: 64, Groups: 4 | 2808.228 us | 26120.442 us | 2765.157 us |
> | Batch_size: 8, Channels per group: 64, Height: 64, Width: 64, Groups: 8 | 54180.293 us | 159411.737 us | 49908.312 us |
> | Batch_size: 16, Channels per group: 64, Height: 64, Width: 64, Groups: 16 | 218493.526 us | 1107034.267 us | 200149.906 us |

This is definitely not as good as I expected, and in some cases worse, but I think I know what is going on.

XNNPACK can read past the end of its input: for example, if the input is 64 bytes, XNNPACK may access up to 64+16 bytes. The call to `allocate_padded_contiguous_if_needed` therefore checks whether the allocation is already padded (by checking whether the tensor was allocated with `DefaultMobileCPUAllocator`). If it is not padded, it allocates a new tensor and copies the old one into it. Most likely we are paying the cost of that copy, which is why the speedup is not as large as expected.

On mobile the situation is different: mobile builds use `DefaultMobileCPUAllocator`, which already does padded allocation, so we pay no additional cost there. Let's do two things.

1. Locally change this line https://github.com/pytorch/pytorch/blob/master/c10/core/CPUAllocator.cpp#L239 to `return GetDefaultMobileCPUAllocator();` and rerun the experiments. We will not commit this change; it is only to confirm my intuition above.
2. Change all `#if defined(USE_XNNPACK)` in ChanelShuffle.cpp to `#ifdef C10_MOBILE` so that we take this path only on mobile. Make that change in this PR and we will commit with that.
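The copy-vs-no-copy decision described above can be sketched in Python. This is an illustrative model only: the helper and allocator names are quoted from the comment, the buffer handling is not the real ATen implementation, and the 16-byte pad follows the 64+16 example above.

```python
# Illustrative model of allocate_padded_contiguous_if_needed's decision.
# NOT the ATen code; the 16-byte pad follows the 64+16 example above.
XNN_EXTRA_READ_BYTES = 16  # XNNPACK may read this far past the end of the input

def padded_contiguous_if_needed(buf, allocated_with_mobile_allocator):
    """Return (buffer, copied): the original buffer if its allocation is
    already padded, otherwise a padded copy (the copy is the overhead
    being discussed)."""
    if allocated_with_mobile_allocator:
        # DefaultMobileCPUAllocator already pads every allocation: no extra cost.
        return buf, False
    # Non-mobile allocator: copy into a padded buffer so XNNPACK's
    # out-of-bounds reads stay inside our allocation.
    padded = bytearray(len(buf) + XNN_EXTRA_READ_BYTES)
    padded[:len(buf)] = buf  # this copy is what eats the speedup
    return bytes(padded), True

data = bytes(64)
out, copied = padded_contiguous_if_needed(data, allocated_with_mobile_allocator=False)
assert len(out) == 64 + 16 and copied      # room for reads up to 64+16 bytes
out, copied = padded_contiguous_if_needed(data, allocated_with_mobile_allocator=True)
assert out is data and not copied          # mobile path: zero-copy
```

On the non-mobile path the copy is O(input size), which is the same order as the shuffle itself, so it can easily cancel out XNNPACK's gain.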

@taivu1998
Contributor Author

> 1. Locally change this line https://github.com/pytorch/pytorch/blob/master/c10/core/CPUAllocator.cpp#L239 to `return GetDefaultMobileCPUAllocator();` and rerun the experiments.
> 2. Change all `#if defined(USE_XNNPACK)` in ChanelShuffle.cpp to `#ifdef C10_MOBILE` so as to go through this path only on mobile.

1. I tried making that change, and still got similar performance to the numbers above.
2. I'm wondering why we want to enable this only for mobile. Even though XNNPACK didn't outperform the channels-first implementation, it still performed significantly better in the channels-last case. Shouldn't we always enable it for channels last, then?

@kimishpatel
Contributor

> 1. I tried making that change, and still got similar performance as above.

Can you paste the local diff here? If it is still giving similar performance, I don't think that necessarily bodes well. We may have to look at a profile and understand why it performs worse; in that case we may just disable it temporarily.

> 2. I'm wondering why we only want to enable it for mobile. Even though XNNPACK didn't outperform channel first performance, it still performed significantly better in the channel last case. Hence, we should always enable it in the channel last case, shouldn't we?

More often than not, channels last is used only for mobile, as far as I know, and this way there are fewer performance surprises in the non-mobile case. Furthermore, channels last performs worse here because the default implementation is better suited to channels first; using it as the channels-last baseline makes that baseline much worse than it should be.

Come to think of it, it may be that channel shuffle is simply better done in NCHW than in NHWC. So let's stick with the mobile-only plan.
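For reference on the NCHW vs NHWC point: channel shuffle is just a reshape/transpose over the channel axis, and in NCHW each channel is one contiguous H*W block, so the shuffle moves whole contiguous blocks, whereas in NHWC the channels are interleaved per pixel. A minimal NumPy sketch of the operation's semantics (this mirrors the documented behavior of `torch.channel_shuffle`, not the ATen kernel):

```python
import numpy as np

def channel_shuffle_nchw(x, groups):
    """Channel shuffle on an NCHW array: split C into (groups, C // groups),
    swap those two axes, and flatten the channels back."""
    n, c, h, w = x.shape
    assert c % groups == 0
    return (x.reshape(n, groups, c // groups, h, w)
             .transpose(0, 2, 1, 3, 4)
             .reshape(n, c, h, w))

x = np.arange(8).reshape(1, 8, 1, 1)
y = channel_shuffle_nchw(x, groups=2)
# With C=8 and groups=2, channel order [0..7] becomes [0, 4, 1, 5, 2, 6, 3, 7]
assert y[0, :, 0, 0].tolist() == [0, 4, 1, 5, 2, 6, 3, 7]
```

In NCHW the transpose above permutes contiguous H*W slabs; in NHWC the same permutation touches every pixel's channel vector, which is consistent with the worse channels-last baseline seen in the tables.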

taivu1998 pushed a commit that referenced this pull request Aug 31, 2020
ghstack-source-id: 1793ec6
Pull Request resolved: #43602
@taivu1998
Contributor Author

taivu1998 commented Aug 31, 2020

So, as @kimishpatel suggested above, the XNNPACK channel shuffle may perform better in the mobile use case, because there we don't pay the cost of allocating and copying tensors. To test this, I modified https://github.com/pytorch/pytorch/blob/master/c10/core/CPUAllocator.cpp#L225-L244 as follows to simulate the mobile case.

```diff
diff --git a/c10/core/CPUAllocator.cpp b/c10/core/CPUAllocator.cpp
index 24e92982b8..f004dffe33 100644
--- a/c10/core/CPUAllocator.cpp
+++ b/c10/core/CPUAllocator.cpp
@@ -222,7 +222,7 @@ at::Allocator* GetDefaultMobileCPUAllocator() {
   return &g_mobile_cpu_allocator;
 }

-#ifdef C10_MOBILE
+// #ifdef C10_MOBILE

 at::Allocator* GetDefaultCPUAllocator() {
   return GetDefaultMobileCPUAllocator();
@@ -230,19 +230,19 @@ at::Allocator* GetDefaultCPUAllocator() {

 REGISTER_ALLOCATOR(DeviceType::CPU, &g_mobile_cpu_allocator);

-#else
+// #else

-// Global default CPU Allocator
-static DefaultCPUAllocator g_cpu_alloc;
+// // Global default CPU Allocator
+// static DefaultCPUAllocator g_cpu_alloc;

-at::Allocator* GetDefaultCPUAllocator() {
-  // return &g_cpu_alloc;
-  return GetDefaultMobileCPUAllocator();
-}
+// at::Allocator* GetDefaultCPUAllocator() {
+//   // return &g_cpu_alloc;
+//   return GetDefaultMobileCPUAllocator();
+// }

-REGISTER_ALLOCATOR(DeviceType::CPU, &g_cpu_alloc);
+// REGISTER_ALLOCATOR(DeviceType::CPU, &g_mobile_cpu_allocator);

-#endif /* C10_Mobile */
+// #endif /* C10_Mobile */

 void ProfiledCPUMemoryReporter::New(void* ptr, size_t nbytes) {
   if (nbytes == 0) {
```

Below are some examples.

| Hyperparameters | Channel First (original) | Channel Last (XNNPACK) | Improvement |
| --- | --- | --- | --- |
| Batch_size: 2, Channels per group: 16, Height: 16, Width: 16, Groups: 2 | 12.024 us | 21.694 us | -80.4% |
| Batch_size: 2, Channels per group: 32, Height: 32, Width: 32, Groups: 2 | 56.682 us | 42.476 us | 25.1% |
| Batch_size: 4, Channels per group: 32, Height: 32, Width: 32, Groups: 4 | 843.809 us | 368.091 us | 56.4% |
| Batch_size: 4, Channels per group: 64, Height: 64, Width: 64, Groups: 4 | 9870.584 us | 2254.690 us | 77.2% |
| Batch_size: 8, Channels per group: 64, Height: 64, Width: 64, Groups: 8 | 54709.689 us | 10727.511 us | 80.4% |
| Batch_size: 16, Channels per group: 64, Height: 64, Width: 64, Groups: 16 | 196339.790 us | 40057.746 us | 79.6% |

So the average improvement here is about 40%.
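As a sanity check, the Improvement column and the quoted average can be recomputed directly from the table, taking improvement = (original - XNNPACK) / original:

```python
# Recompute the Improvement column and its average from the table above.
# Times in microseconds: (Channel First original, Channel Last XNNPACK).
rows = [
    (12.024, 21.694),
    (56.682, 42.476),
    (843.809, 368.091),
    (9870.584, 2254.690),
    (54709.689, 10727.511),
    (196339.790, 40057.746),
]
improvements = [(orig - xnn) / orig * 100.0 for orig, xnn in rows]
assert [round(p, 1) for p in improvements] == [-80.4, 25.1, 56.4, 77.2, 80.4, 79.6]
average = sum(improvements) / len(improvements)
assert round(average, 1) == 39.7  # i.e. "about 40%"
```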

Hence, I changed `#if defined(USE_XNNPACK)` in ChanelShuffle.cpp to `#ifdef C10_MOBILE` so that we only enable XNNPACK in the mobile case.

@taivu1998
Contributor Author

In addition, I tried switching back to `#if defined(USE_XNNPACK)` and then ran the unit test `python test/test_nn.py TestNN.test_channel_shuffle`, which passed.

taivu1998 pushed a commit that referenced this pull request Aug 31, 2020
ghstack-source-id: fb661eb
Pull Request resolved: #43602
@taivu1998 taivu1998 requested a review from kimishpatel August 31, 2020 22:18
@codecov

codecov Bot commented Sep 1, 2020

Codecov Report

Merging #43602 into gh/taivu1998/6/base will increase coverage by 0.00%.
The diff coverage is n/a.

Impacted file tree graph

```
@@                 Coverage Diff                  @@
##           gh/taivu1998/6/base   #43602   +/-   ##
====================================================
  Coverage                69.32%   69.32%
====================================================
  Files                      379      379
  Lines                    47116    47116
====================================================
+ Hits                     32663    32664    +1
+ Misses                   14453    14452    -1
```
| Impacted Files | Coverage Δ |
| --- | --- |
| torch/testing/_internal/expecttest.py | 78.57% <0.00%> (+1.02%) ⬆️ |

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bbb0106...99777c9. Read the comment docs.

taivu1998 pushed a commit that referenced this pull request Sep 1, 2020
ghstack-source-id: 4a9384b
Pull Request resolved: #43602
Contributor

@kimishpatel kimishpatel left a comment


Make sure all tests/builds are passing.

taivu1998 pushed a commit that referenced this pull request Sep 1, 2020
ghstack-source-id: d299654
Pull Request resolved: #43602
taivu1998 pushed a commit that referenced this pull request Sep 1, 2020
ghstack-source-id: 95551d7
Pull Request resolved: #43602
@facebook-github-bot
Contributor

@kimishpatel merged this pull request in a678907.

@facebook-github-bot facebook-github-bot deleted the gh/taivu1998/6/head branch September 6, 2020 14:16
fadara01 added a commit that referenced this pull request Dec 2, 2025
Reverts: #43602
Fixes: #149828
ghstack-source-id: 2ed585e
Pull-Request: #166873
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary: Pull Request resolved: pytorch#43602

Test Plan: Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D23334952

Pulled By: kimishpatel

fbshipit-source-id: 858ef3db599b1c521ba3a1855c9a3c35fe3b02b0