[pytorch][perf] add eigen blas for mobile build #26508
ljk53 wants to merge 3 commits into gh/ljk53/51/base from
Conversation
Summary:
Enable BLAS for the PyTorch mobile build using Eigen BLAS.
It's not the most impactful optimization for typical mobile CV models, since we
already use NNPACK/QNNPACK for most ops there, but it's nice to have a good
fallback implementation for the remaining ops.
Test Plan:
- Create a simple matrix multiplication script model:
```
import torch

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.weights = torch.ones(1000, 1000)

    def forward(self, x):
        return torch.mm(x, self.weights)

n = Net()
module = torch.jit.trace_module(n, {'forward': torch.ones(1000, 1000)})
module.save('mm.pk')
```
- Before integrating with Eigen BLAS:
```
adb shell 'cd /data/local/tmp; ./speed_benchmark_torch --model=mm.pk --input_dims="1000,1000" --input_type=float --warmup=5 --iter=5'
Milliseconds per iter: 2218.52. Iters per second: 0.450751
```
- After integrating with Eigen BLAS:
```
adb shell 'cd /data/local/tmp; ./speed_benchmark_torch_eigen --model=mm.pk --input_dims="1000,1000" --input_type=float --warmup=5 --iter=5'
Milliseconds per iter: 314.535. Iters per second: 3.17929
```
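As a quick arithmetic sanity check on the numbers above (using only the figures reported by the benchmark):

```python
# Figures copied from the speed_benchmark_torch output above.
before_ms = 2218.52   # ms/iter without Eigen BLAS
after_ms = 314.535    # ms/iter with Eigen BLAS

speedup = before_ms / after_ms
print(f"speedup: {speedup:.2f}x")  # roughly 7x for the 1000x1000 mm model

# The reported iters/sec are consistent with the ms/iter figures.
assert abs(1000 / before_ms - 0.450751) < 1e-4
assert abs(1000 / after_ms - 3.17929) < 1e-3
```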
|
Looks fine to me from the mobile side. Any concerns from the core team? |
|
Looking into the Android CI failure: it attempted to enable Fortran but failed. It seems we're hitting a nasty issue in eigen/cmake/language_support.cmake: https://github.com/eigenteam/eigen-git-mirror/blob/d41dc4dd74acce21fb210e7625d5d135751fa9e5/cmake/language_support.cmake#L22

Eigen introduces a custom cmake function "workaround_9220" to test language support. It first creates a dummy cmake file containing "enable_language(Fortran)", then runs cmake on it twice; if both runs succeed, it concludes that Fortran is supported on the host. For whatever reason this ad-hoc language test produces a false positive for our Android setup: enable_language(Fortran) succeeds in the dummy test but fails fatally later, when the main cmake actually calls enable_language(Fortran). This is probably because the ad-hoc test runs in a separate cmake invocation that doesn't carry the Android NDK options.

A more thorough fix needs to be done in eigen/cmake. One ugly workaround is to uninstall the Fortran compiler in our docker image. I'm also looking for an alternative approach to override the failure. Suggestions are welcome. |
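To make the failure mode concrete, here is a minimal Python sketch of the kind of out-of-band probe workaround_9220 performs. This is an illustration, not Eigen's actual cmake code; `probe_fortran` and its arguments are hypothetical names.

```python
import os
import subprocess
import tempfile

def probe_fortran(extra_cmake_args=()):
    """Illustrative re-creation of a workaround_9220-style probe.

    Writes a dummy project that calls enable_language(Fortran) and runs
    cmake on it twice; both runs succeeding is taken to mean "Fortran
    works". The catch: this fresh cmake invocation does NOT inherit the
    Android NDK toolchain flags of the main build unless they are
    forwarded via extra_cmake_args, so a working host Fortran compiler
    makes the probe pass even though the cross build later fails.
    """
    with tempfile.TemporaryDirectory() as work:
        with open(os.path.join(work, "CMakeLists.txt"), "w") as f:
            f.write(
                "cmake_minimum_required(VERSION 3.5)\n"
                "project(fortran_probe)\n"
                "enable_language(Fortran)\n"
            )
        return all(
            subprocess.run(
                ["cmake", ".", *extra_cmake_args],
                cwd=work, capture_output=True,
            ).returncode == 0
            for _ in range(2)
        )
```

Forwarding the same toolchain arguments the main build uses (e.g. its `-DCMAKE_TOOLCHAIN_FILE=...`) would make a probe like this agree with the real enable_language(Fortran) call.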
- Improves MobileNetV2 single-thread perf by ~5%:
```
adb shell 'cd /data/local/tmp; ./speed_benchmark_torch --model=mobilenetv2.pk --input_dims="1,3,224,224" --input_type=float --warmup=5 --iter=20 --print_output=false --caffe2_threadpool_force_inline=true'
Milliseconds per iter: 367.055.
adb shell 'cd /data/local/tmp; ./speed_benchmark_torch_eigen --model=mobilenetv2.pk --input_dims="1,3,224,224" --input_type=float --warmup=5 --iter=20 --print_output=false --caffe2_threadpool_force_inline=true'
Milliseconds per iter: 348.77.
```
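A quick check that the MobileNetV2 numbers above correspond to the claimed ~5% improvement:

```python
# Figures copied from the MobileNetV2 benchmark output above.
before_ms = 367.055  # ms/iter, build without Eigen BLAS
after_ms = 348.77    # ms/iter, build with Eigen BLAS

improvement = (before_ms - after_ms) / before_ms
print(f"improvement: {improvement:.1%}")  # about 5%, as stated
```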
Differential Revision: [D17489587](https://our.internmc.facebook.com/intern/diff/D17489587)
|
I decided to create a new cmake file under cmake/External/EigenBLAS.cmake. It's simple enough and allows me to: 1) work around the Fortran compiler test bug; 2) make other cosmetic changes, such as building and installing a static library instead of a dynamic one. |
dzhulgakov
left a comment
Looks good! Now we just need to fix NNPACK + groups :)
|
cc @xuhdev, if you are interested |
|
This is good, thanks! |
Summary: Pull Request resolved: pytorch/pytorch#26508 (same description and test plan as above). Differential Revision: D17489587 fbshipit-source-id: efe542db810a900f680da7ec7e60f215f58db66e
|
This pull request has been merged in d6e3aed. |