Here is an example on Mac OS 10.9 on a Late 2013 15" MacBook Pro with a 2.3 GHz Intel Core i7-4850HQ, using OpenBLAS 0.2.15, Python 2.7.11, NumPy 1.10.2, and SciPy 0.16.1. I have verified that NumPy and SciPy are properly linked against this OpenBLAS build. OpenBLAS was built with `make DYNAMIC_ARCH=1 BINARY=64 NO_LAPACK=0 NO_AFFINITY=1 NUM_THREADS=1`; no other options or modifications were made.
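For reference, one way to double-check which BLAS NumPy and SciPy are linked against is to print their build configuration (the exact output layout varies between versions):

```
>>> import numpy, scipy
>>> numpy.show_config()   # should list the OpenBLAS library/include paths
>>> scipy.show_config()
```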
```
>>> import numpy
>>> from scipy.linalg import blas
>>> a = numpy.ones((100,101), dtype=numpy.float32)
>>> %timeit blas.ssyrk(1, a)
10000 loops, best of 3: 37.8 µs per loop
>>> a = numpy.ones((100,101), dtype=numpy.float64)
>>> %timeit blas.dsyrk(1, a)
1000 loops, best of 3: 520 µs per loop
```
I wouldn't be surprised to see it take about twice as long, since it is double rather than single precision; however, taking over an order of magnitude longer seems excessive.
Following the same build procedure in a Linux VM on the same machine (VirtualBox 5.0.12), I get a much more reasonable time for dsyrk (roughly double that of ssyrk). I have no idea whether this carries over to Windows. If someone is able to reproduce a similar example using C or Fortran, please share your steps.
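In the meantime, roughly the same measurement can be scripted without IPython using the standard `timeit` module (a sketch; the array shape matches the example above and the repeat counts are arbitrary):

```python
# Standalone (non-IPython) version of the timing above.
import timeit

import numpy
from scipy.linalg import blas

a32 = numpy.ones((100, 101), dtype=numpy.float32)
a64 = numpy.ones((100, 101), dtype=numpy.float64)

# Best-of-3 average time per call, in seconds.
t_s = min(timeit.repeat(lambda: blas.ssyrk(1, a32), number=1000, repeat=3)) / 1000
t_d = min(timeit.repeat(lambda: blas.dsyrk(1, a64), number=1000, repeat=3)) / 1000

print("ssyrk: %.1f us" % (t_s * 1e6))
print("dsyrk: %.1f us" % (t_d * 1e6))
print("ratio: %.1fx" % (t_d / t_s))
```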
After further discussion, we found that this is array-size dependent. Below is a graph showing the dependence; more details about how it was made can be found in this comment: #730 (comment)
For comparison, the times taken by sgemm and dgemm do not show this behavior.
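A rough sketch of how such a syrk-vs-gemm comparison could be scripted across sizes is below; the sizes chosen here are arbitrary, and the exact methodology behind the graph is described in the linked comment.

```python
# Sweep over array sizes, comparing the double/single timing ratio
# of syrk against that of gemm.
import timeit

import numpy
from scipy.linalg import blas

def best_us(fn, number=100):
    """Best-of-3 average time per call, in microseconds."""
    return min(timeit.repeat(fn, number=number, repeat=3)) / number * 1e6

for n in (50, 100, 200, 400, 800):
    a32 = numpy.ones((n, n + 1), dtype=numpy.float32)
    a64 = numpy.ones((n, n + 1), dtype=numpy.float64)
    syrk_ratio = best_us(lambda: blas.dsyrk(1, a64)) / best_us(lambda: blas.ssyrk(1, a32))
    gemm_ratio = best_us(lambda: blas.dgemm(1, a64, a64.T)) / best_us(lambda: blas.sgemm(1, a32, a32.T))
    print("n=%4d  dsyrk/ssyrk=%5.1fx  dgemm/sgemm=%5.1fx" % (n, syrk_ratio, gemm_ratio))
```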

