Implement ChannelShuffle op with XNNPACK #43602

Closed
taivu1998 wants to merge 10 commits into gh/taivu1998/6/base from gh/taivu1998/6/head

Conversation

@taivu1998
Contributor

@taivu1998 taivu1998 commented Aug 26, 2020

Stack from ghstack:

Differential Revision: D23334952

taivu1998 pushed a commit that referenced this pull request Aug 26, 2020
ghstack-source-id: c99af4b
Pull Request resolved: #43602
@taivu1998 taivu1998 requested a review from kimishpatel August 26, 2020 01:29
@dr-ci

dr-ci Bot commented Aug 26, 2020

💊 CI failures summary and remediations

As of commit 99777c9 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI. Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 26 times.

Comment thread aten/src/ATen/native/ChanelShuffle.cpp
Comment thread aten/src/ATen/native/ChanelShuffle.cpp Outdated
Comment thread aten/src/ATen/native/xnnpack/ChannelShuffle.cpp Outdated
Comment thread aten/src/ATen/native/xnnpack/ChannelShuffle.cpp
Comment thread benchmarks/operator_benchmark/pt/channel_shuffle_test.py Outdated
Contributor

@kimishpatel kimishpatel left a comment


Overall this looks good. Please benchmark for both channels first and channels last and report the numbers here.

taivu1998 pushed a commit that referenced this pull request Aug 27, 2020
ghstack-source-id: 567c9aa
Pull Request resolved: #43602
Contributor

@kimishpatel kimishpatel left a comment


Looks good to me. Add benchmark results to see the perf improvement.

@taivu1998
Contributor Author

| Hyperparameters | Channel First (original) | Channel Last (original) | Channel Last (XNNPACK) |
| --- | --- | --- | --- |
| Batch_size: 2, Channels per group: 16, Height: 16, Width: 16, Groups: 2 | 10.510 us | 19.407 us | 17.920 us |
| Batch_size: 2, Channels per group: 32, Height: 32, Width: 32, Groups: 2 | 46.724 us | 137.186 us | 65.069 us |
| Batch_size: 4, Channels per group: 32, Height: 32, Width: 32, Groups: 4 | 856.847 us | 1204.111 us | 690.858 us |
| Batch_size: 4, Channels per group: 64, Height: 64, Width: 64, Groups: 4 | 2808.228 us | 26120.442 us | 2765.157 us |
| Batch_size: 8, Channels per group: 64, Height: 64, Width: 64, Groups: 8 | 54180.293 us | 159411.737 us | 49908.312 us |
| Batch_size: 16, Channels per group: 64, Height: 64, Width: 64, Groups: 16 | 218493.526 us | 1107034.267 us | 200149.906 us |

taivu1998 pushed a commit that referenced this pull request Aug 31, 2020
ghstack-source-id: 8f815e9
Pull Request resolved: #43602
@kimishpatel
Contributor

> | Hyperparameters | Channel First (original) | Channel Last (original) | Channel Last (XNNPACK) |
> | --- | --- | --- | --- |
> | Batch_size: 2, Channels per group: 16, Height: 16, Width: 16, Groups: 2 | 10.510 us | 19.407 us | 17.920 us |
> | Batch_size: 2, Channels per group: 32, Height: 32, Width: 32, Groups: 2 | 46.724 us | 137.186 us | 65.069 us |
> | Batch_size: 4, Channels per group: 32, Height: 32, Width: 32, Groups: 4 | 856.847 us | 1204.111 us | 690.858 us |
> | Batch_size: 4, Channels per group: 64, Height: 64, Width: 64, Groups: 4 | 2808.228 us | 26120.442 us | 2765.157 us |
> | Batch_size: 8, Channels per group: 64, Height: 64, Width: 64, Groups: 8 | 54180.293 us | 159411.737 us | 49908.312 us |
> | Batch_size: 16, Channels per group: 64, Height: 64, Width: 64, Groups: 16 | 218493.526 us | 1107034.267 us | 200149.906 us |

This is definitely not as good as I expected, and in some cases worse, but I think I know what is going on.

XNNPACK can read past the end of its input: for example, if the input is 64 bytes, XNNPACK may access up to 64+16 bytes. The call to `allocate_padded_contiguous_if_needed` therefore checks whether the allocation is already padded (by checking whether the tensor was allocated with `DefaultMobileCPUAllocator`). If it is not padded, it allocates a new tensor and copies the old one into it. Most likely we are paying the cost of that copy, which is why the speedup is not as large as expected.

On mobile the situation is different: mobile builds use `DefaultMobileCPUAllocator`, which already does padded allocation, so we pay no additional cost there. Let's do two things.

1. Locally change this line https://github.com/pytorch/pytorch/blob/master/c10/core/CPUAllocator.cpp#L239 to `return GetDefaultMobileCPUAllocator();` and rerun the experiments. We will not commit this change; it is only to confirm my intuition above.
2. Change all `#if defined(USE_XNNPACK)` in ChanelShuffle.cpp to `#ifdef C10_MOBILE` so that we take this path only on mobile. Make that change in this PR and we will commit with that.
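The copy-vs-no-copy decision described above can be sketched in Python. This is an illustrative model only: the helper and allocator names are quoted from the comment, the buffer handling is not the real ATen implementation, and the 16-byte pad follows the 64+16 example above.

```python
# Illustrative model of allocate_padded_contiguous_if_needed's decision.
# NOT the ATen code; the 16-byte pad follows the 64+16 example above.
XNN_EXTRA_READ_BYTES = 16  # XNNPACK may read this far past the end of the input

def padded_contiguous_if_needed(buf, allocated_with_mobile_allocator):
    """Return (buffer, copied): the original buffer if its allocation is
    already padded, otherwise a padded copy (the copy is the overhead
    being discussed)."""
    if allocated_with_mobile_allocator:
        # DefaultMobileCPUAllocator already pads every allocation: no extra cost.
        return buf, False
    # Non-mobile allocator: copy into a padded buffer so XNNPACK's
    # out-of-bounds reads stay inside our allocation.
    padded = bytearray(len(buf) + XNN_EXTRA_READ_BYTES)
    padded[:len(buf)] = buf  # this copy is what eats the speedup
    return bytes(padded), True

data = bytes(64)
out, copied = padded_contiguous_if_needed(data, allocated_with_mobile_allocator=False)
assert len(out) == 64 + 16 and copied      # room for reads up to 64+16 bytes
out, copied = padded_contiguous_if_needed(data, allocated_with_mobile_allocator=True)
assert out is data and not copied          # mobile path: zero-copy
```

On the non-mobile path the copy is O(input size), which is the same order as the shuffle itself, so it can easily cancel out XNNPACK's gain.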

@taivu1998
Contributor Author

> 1. Locally change this line https://github.com/pytorch/pytorch/blob/master/c10/core/CPUAllocator.cpp#L239 to `return GetDefaultMobileCPUAllocator();` and rerun the experiments.
> 2. Change all `#if defined(USE_XNNPACK)` in ChanelShuffle.cpp to `#ifdef C10_MOBILE` so as to go through this path only on mobile.

1. I tried making that change, and still got similar performance to the numbers above.
2. I'm wondering why we want to enable this only for mobile. Even though XNNPACK didn't outperform the channels-first implementation, it still performed significantly better in the channels-last case. Shouldn't we always enable it for channels last, then?

@kimishpatel
Contributor

> 1. I tried making that change, and still got similar performance as above.

Can you paste the local diff here? If it is still giving similar performance, I don't think that necessarily bodes well. We may have to look at a profile and understand why it performs worse; in that case we may just disable it temporarily.

> 2. I'm wondering why we only want to enable it for mobile. Even though XNNPACK didn't outperform channel first performance, it still performed significantly better in the channel last case. Hence, we should always enable it in the channel last case, shouldn't we?

More often than not, channels last is used only for mobile, as far as I know, and this way there are fewer performance surprises in the non-mobile case. Furthermore, channels last performs worse here because the default implementation is better suited to channels first; using it as the channels-last baseline makes that baseline much worse than it should be.

Come to think of it, it may be that channel shuffle is simply better done in NCHW than in NHWC. So let's stick with the mobile-only plan.
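For reference on the NCHW vs NHWC point: channel shuffle is just a reshape/transpose over the channel axis, and in NCHW each channel is one contiguous H*W block, so the shuffle moves whole contiguous blocks, whereas in NHWC the channels are interleaved per pixel. A minimal NumPy sketch of the operation's semantics (this mirrors the documented behavior of `torch.channel_shuffle`, not the ATen kernel):

```python
import numpy as np

def channel_shuffle_nchw(x, groups):
    """Channel shuffle on an NCHW array: split C into (groups, C // groups),
    swap those two axes, and flatten the channels back."""
    n, c, h, w = x.shape
    assert c % groups == 0
    return (x.reshape(n, groups, c // groups, h, w)
             .transpose(0, 2, 1, 3, 4)
             .reshape(n, c, h, w))

x = np.arange(8).reshape(1, 8, 1, 1)
y = channel_shuffle_nchw(x, groups=2)
# With C=8 and groups=2, channel order [0..7] becomes [0, 4, 1, 5, 2, 6, 3, 7]
assert y[0, :, 0, 0].tolist() == [0, 4, 1, 5, 2, 6, 3, 7]
```

In NCHW the transpose above permutes contiguous H*W slabs; in NHWC the same permutation touches every pixel's channel vector, which is consistent with the worse channels-last baseline seen in the tables.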

taivu1998 pushed a commit that referenced this pull request Aug 31, 2020
ghstack-source-id: 1793ec6
Pull Request resolved: #43602
@taivu1998
Contributor Author

taivu1998 commented Aug 31, 2020

So, as @kimishpatel suggested above, the XNNPACK channel shuffle may perform better in the mobile use case, because there we don't pay the cost of allocating and copying tensors. To test this, I modified https://github.com/pytorch/pytorch/blob/master/c10/core/CPUAllocator.cpp#L225-L244 as follows to simulate the mobile case.

```diff
diff --git a/c10/core/CPUAllocator.cpp b/c10/core/CPUAllocator.cpp
index 24e92982b8..f004dffe33 100644
--- a/c10/core/CPUAllocator.cpp
+++ b/c10/core/CPUAllocator.cpp
@@ -222,7 +222,7 @@ at::Allocator* GetDefaultMobileCPUAllocator() {
   return &g_mobile_cpu_allocator;
 }

-#ifdef C10_MOBILE
+// #ifdef C10_MOBILE

 at::Allocator* GetDefaultCPUAllocator() {
   return GetDefaultMobileCPUAllocator();
@@ -230,19 +230,19 @@ at::Allocator* GetDefaultCPUAllocator() {

 REGISTER_ALLOCATOR(DeviceType::CPU, &g_mobile_cpu_allocator);

-#else
+// #else

-// Global default CPU Allocator
-static DefaultCPUAllocator g_cpu_alloc;
+// // Global default CPU Allocator
+// static DefaultCPUAllocator g_cpu_alloc;

-at::Allocator* GetDefaultCPUAllocator() {
-  // return &g_cpu_alloc;
-  return GetDefaultMobileCPUAllocator();
-}
+// at::Allocator* GetDefaultCPUAllocator() {
+//   // return &g_cpu_alloc;
+//   return GetDefaultMobileCPUAllocator();
+// }

-REGISTER_ALLOCATOR(DeviceType::CPU, &g_cpu_alloc);
+// REGISTER_ALLOCATOR(DeviceType::CPU, &g_mobile_cpu_allocator);

-#endif /* C10_Mobile */
+// #endif /* C10_Mobile */

 void ProfiledCPUMemoryReporter::New(void* ptr, size_t nbytes) {
   if (nbytes == 0) {
```

Below are some examples.

| Hyperparameters | Channel First (original) | Channel Last (XNNPACK) | Improvement |
| --- | --- | --- | --- |
| Batch_size: 2, Channels per group: 16, Height: 16, Width: 16, Groups: 2 | 12.024 us | 21.694 us | -80.4% |
| Batch_size: 2, Channels per group: 32, Height: 32, Width: 32, Groups: 2 | 56.682 us | 42.476 us | 25.1% |
| Batch_size: 4, Channels per group: 32, Height: 32, Width: 32, Groups: 4 | 843.809 us | 368.091 us | 56.4% |
| Batch_size: 4, Channels per group: 64, Height: 64, Width: 64, Groups: 4 | 9870.584 us | 2254.690 us | 77.2% |
| Batch_size: 8, Channels per group: 64, Height: 64, Width: 64, Groups: 8 | 54709.689 us | 10727.511 us | 80.4% |
| Batch_size: 16, Channels per group: 64, Height: 64, Width: 64, Groups: 16 | 196339.790 us | 40057.746 us | 79.6% |

So the average improvement here is about 40%.
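As a sanity check, the Improvement column and the quoted average can be recomputed directly from the table, taking improvement = (original - XNNPACK) / original:

```python
# Recompute the Improvement column and its average from the table above.
# Times in microseconds: (Channel First original, Channel Last XNNPACK).
rows = [
    (12.024, 21.694),
    (56.682, 42.476),
    (843.809, 368.091),
    (9870.584, 2254.690),
    (54709.689, 10727.511),
    (196339.790, 40057.746),
]
improvements = [(orig - xnn) / orig * 100.0 for orig, xnn in rows]
assert [round(p, 1) for p in improvements] == [-80.4, 25.1, 56.4, 77.2, 80.4, 79.6]
average = sum(improvements) / len(improvements)
assert round(average, 1) == 39.7  # i.e. "about 40%"
```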

Hence, I changed `#if defined(USE_XNNPACK)` in ChanelShuffle.cpp to `#ifdef C10_MOBILE` so that we only enable XNNPACK in the mobile case.

@taivu1998
Contributor Author

In addition, I tried switching back to `#if defined(USE_XNNPACK)` and then ran the unit test `python test/test_nn.py TestNN.test_channel_shuffle`, which passed.

taivu1998 pushed a commit that referenced this pull request Aug 31, 2020
ghstack-source-id: fb661eb
Pull Request resolved: #43602
@taivu1998 taivu1998 requested a review from kimishpatel August 31, 2020 22:18
@codecov

codecov Bot commented Sep 1, 2020

Codecov Report

Merging #43602 into gh/taivu1998/6/base will increase coverage by 0.00%.
The diff coverage is n/a.

Impacted file tree graph

```
@@                 Coverage Diff                  @@
##           gh/taivu1998/6/base   #43602   +/-   ##
====================================================
  Coverage                69.32%   69.32%
====================================================
  Files                      379      379
  Lines                    47116    47116
====================================================
+ Hits                     32663    32664    +1
+ Misses                   14453    14452    -1
```
| Impacted Files | Coverage Δ |
| --- | --- |
| torch/testing/_internal/expecttest.py | 78.57% <0.00%> (+1.02%) ⬆️ |

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bbb0106...99777c9. Read the comment docs.

taivu1998 pushed a commit that referenced this pull request Sep 1, 2020
ghstack-source-id: 4a9384b
Pull Request resolved: #43602
Contributor

@kimishpatel kimishpatel left a comment


Make sure all tests/builds are passing.

taivu1998 pushed a commit that referenced this pull request Sep 1, 2020
ghstack-source-id: d299654
Pull Request resolved: #43602
taivu1998 pushed a commit that referenced this pull request Sep 1, 2020
ghstack-source-id: 95551d7
Pull Request resolved: #43602
@facebook-github-bot
Contributor

@kimishpatel merged this pull request in a678907.

@facebook-github-bot facebook-github-bot deleted the gh/taivu1998/6/head branch September 6, 2020 14:16
fadara01 added a commit that referenced this pull request Dec 2, 2025
Reverts: #43602
Fixes: #149828
ghstack-source-id: 2ed585e
Pull-Request: #166873
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary: Pull Request resolved: pytorch#43602

Test Plan: Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D23334952

Pulled By: kimishpatel

fbshipit-source-id: 858ef3db599b1c521ba3a1855c9a3c35fe3b02b0