A simple performance test scales linearly with tensor size on macOS (a 1000x larger tensor is 1000 times slower, on a 6-core i9), while the same test can be 4-10 times faster on Linux or Windows:
```python
In [2]: t = torch.ones(1000, device='cpu')
In [3]: timeit t.pow(123)
7.45 µs ± 343 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [4]: t = torch.ones(1000000, device='cpu')
In [5]: timeit t.pow(123)
5.83 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
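The symptom above, cost growing in lockstep with tensor size, is what a purely serial elementwise loop looks like. As a hedged, torch-free sketch of the same measurement methodology (pure-Python `pow` standing in for `t.pow(123)`; the helper name is hypothetical), one can confirm that a serial loop shows this linear scaling:

```python
# Hypothetical stdlib analogue of the benchmark above: if an
# elementwise op runs serially on one core, its cost grows
# linearly with input size.
import timeit

def elementwise_pow(values, exp):
    # Serial elementwise power, standing in for a one-threaded t.pow(123).
    return [v ** exp for v in values]

small = [1.0] * 1_000
large = [1.0] * 1_000_000

t_small = min(timeit.repeat(lambda: elementwise_pow(small, 123), number=1, repeat=3))
t_large = min(timeit.repeat(lambda: elementwise_pow(large, 123), number=1, repeat=3))

print(f"1k elements: {t_small:.6f}s")
print(f"1M elements: {t_large:.6f}s")
print(f"ratio: {t_large / t_small:.0f}x")  # roughly tracks the 1000x size ratio
```

With working intra-op parallelism, the large-tensor timing would instead be divided by (roughly) the number of cores, which is why the Linux/Windows numbers come out 4-10x better.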
torch.__config__.parallel_info() reports on all OSes that PyTorch was compiled with OpenMP, but on macOS it is not actually available:
```
at::get_num_threads() : 1
at::get_num_interop_threads() : 6
OpenMP not found
...
ATen parallel backend: OpenMP
```
Version.cpp:
```cpp
std::string get_openmp_version() {
#ifdef _OPENMP
  ...
#else
  ss << "OpenMP not found";
#endif
}
```
ParallelOpenMP.h:
```cpp
inline void parallel_for(..., const F& f) {
  ...
#ifdef _OPENMP
  ...
#else
  f(begin, end);
#endif
}
```
so when `_OPENMP` is undefined, `f` is called once over the whole range, i.e. a plain serial invocation.
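To make the two branches concrete, here is a minimal Python sketch of the `parallel_for` contract as implied by the snippet above (the chunking policy is an assumption for illustration, not ATen's actual scheduler): with a parallel backend the range `[begin, end)` is split into chunks and `f` runs on each; without one, `f` is invoked once over the entire range.

```python
# Sketch of the assumed parallel_for semantics: split [begin, end)
# into per-thread chunks, or fall back to one serial call.
from concurrent.futures import ThreadPoolExecutor

def parallel_for(begin, end, grain_size, f, num_threads=1):
    if num_threads <= 1 or end - begin <= grain_size:
        f(begin, end)  # the #else branch: one serial call over the whole range
        return
    chunk = max(grain_size, (end - begin + num_threads - 1) // num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(f, b, min(b + chunk, end))
                   for b in range(begin, end, chunk)]
        for fut in futures:
            fut.result()  # wait for all chunks; propagate any exceptions

# Each worker writes only its own slice, so no locking is needed.
out = [0] * 10
parallel_for(0, 10, 2,
             lambda b, e: out.__setitem__(slice(b, e), [v * v for v in range(b, e)]),
             num_threads=4)
print(out)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

On the macOS wheels described here, every call effectively takes the serial branch, which matches the linear scaling in the benchmark.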
This affects at least PyTorch 1.3.0-1.6.0, in both the pip and conda builds.
cc @ezyang @gchanan @zou3519 @malfet @VitalyFedyunin