
[WIP] Install and test against mkldnn#6132

Closed
ezyang wants to merge 1 commit into pytorch:master from ezyang:pr/enable-mkldnn-ci

Conversation

@ezyang
Contributor

@ezyang ezyang commented Mar 30, 2018

Also took the opportunity to remove obsolete mkl and mkl-devel settings.


Signed-off-by: Edward Z. Yang <ezyang@fb.com>
@colesbury
Member

Does this install mkldnn for every configuration? Would it make sense to have one configuration that doesn't have mkldnn to make sure we don't break that case?

@ezyang
Contributor Author

ezyang commented Mar 30, 2018

It only installs mkldnn on the conda-enabled configurations, so the pip-only ones will run without it.

OTOH, I'm pretty unhappy about the current state of the tests, where you have to build in a specific configuration to exercise certain tests. Enabling a feature shouldn't change which tests you run; it should only add to them. Then we wouldn't have to do all of these funny contortions to turn cudnn on/off to make sure we exercise both paths, for example.
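The testing philosophy here could be sketched roughly like this, assuming a hypothetical feature probe (`AVAILABLE_BACKENDS` and `_run_conv` are illustrative stand-ins, not PyTorch APIs):

```python
import unittest

# AVAILABLE_BACKENDS is a hypothetical probe of what this build supports;
# a real setup would append "mkldnn" only when the library was compiled
# with it. The point: an enabled feature adds test cases, it never
# replaces or removes them.
AVAILABLE_BACKENDS = ["native"]


class ConvTest(unittest.TestCase):
    def _run_conv(self, backend):
        # Stand-in for the actual op under test.
        return sum(x * 2 for x in range(4))

    def test_all_backends(self):
        # One test body, run once per available backend via subTest,
        # instead of baking the backend choice into the build config.
        for backend in AVAILABLE_BACKENDS:
            with self.subTest(backend=backend):
                self.assertEqual(self._run_conv(backend), 12)
```

With this shape, turning a feature on in CI only grows the set of subtests exercised; no configuration contortions are needed to cover both paths.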

@yf225
Contributor

yf225 commented Mar 30, 2018

@ezyang
Contributor Author

ezyang commented Mar 30, 2018

@yf225 It still failed?

@yf225
Contributor

yf225 commented Mar 30, 2018

Yeah, the third attempt failed too, on a different machine, so it's probably legit: https://ci.pytorch.org/jenkins/job/pytorch-builds/job/short-perf-test-cpu/2291/console

@yf225
Contributor

yf225 commented Mar 30, 2018

@mingfeima https://github.com/pytorch/examples/tree/perftests/mnist seems to run slower with the mkldnn package. Do you mind taking a look?

@mingfeima
Collaborator

@yf225 I tested https://github.com/pytorch/examples/tree/perftests/mnist on my machine, a Xeon(R) Platinum 8180 CPU @ 2.50GHz, with OMP_NUM_THREADS=4 python main.py --epochs 1. It did run slower with mkldnn:

  • mnist without mkldnn: 9.63s per epoch
  • mnist with mkldnn: 10.99s per epoch

However, mnist is a very small workload that is unsuitable for performance measurement; convnet is a better alternative from a performance perspective. Below is convnet-alexnet performance, also with OMP_NUM_THREADS=4:

  • convnet-alexnet without mkldnn: 3.22s per iteration
  • convnet-alexnet with mkldnn: 1.45s per iteration

Also, the mkldnn integration is just a kickoff; the current performance is roughly 20% of the best achievable number. We will continue to work on that.

I have a few questions:
a) What CPU type are you using for benchmarking?
b) Any special reason for setting OMP_NUM_THREADS to 4? Why not use all the cores?

If you want MKL to use the same number of threads, there is no need to set MKL_NUM_THREADS; it follows OMP_NUM_THREADS. MKL_NUM_THREADS is only needed when you want a different number of threads from OMP_NUM_THREADS.

Also, KMP_AFFINITY has a big impact on CPU performance. If you are testing with only 4 cores, it doesn't really make a difference, but if you use all the cores on the CPU, export KMP_AFFINITY=granularity=fine,compact,1,0 will give you a faster result.
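As a concrete sketch of this thread-and-affinity setup (the values are illustrative and machine-dependent; tune them to your core count):

```shell
# Illustrative CPU benchmarking environment, per the advice above.
# MKL follows OMP_NUM_THREADS unless MKL_NUM_THREADS is set separately.
export OMP_NUM_THREADS=4
# Pins threads to cores; mainly matters when using the whole socket.
export KMP_AFFINITY=granularity=fine,compact,1,0
echo "threads=$OMP_NUM_THREADS affinity=$KMP_AFFINITY"
```

The benchmark command (e.g. `python main.py --epochs 1`) would then be run in the same shell so it inherits these settings.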

@yf225
Contributor

yf225 commented Apr 2, 2018

@mingfeima Thanks a lot for the advice - we will move over to using Convnet for our benchmark.

Currently we are using a Mac mini i7 as our test machine, which has a 2.3GHz i7-3615. We will be moving our test machine to Packet Tiny, which has an Atom C2550 on it.

The test machines would probably only have 4 cores - are the improvements we have mostly on CPUs that have more cores?

Also, we are in the process of revamping our CPU perf test setup. One issue is that wall-clock runtime for CPU perf tests is not reliable, since we can't guarantee that no other programs will be running on the machine at the time. We are thinking about using perf stat -e instructions to count the number of instructions executed instead, and I am curious whether you think this metric would also improve when we move over to mkldnn. Thanks!
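The proposed measurement might look like the sketch below. `perf stat -e instructions` is a standard Linux perf invocation; the Python one-liner is a stand-in workload, and the snippet falls back to a plain run where perf isn't installed:

```shell
# Count retired instructions for a workload instead of trusting
# wall-clock time on a shared machine. The workload here is a stand-in.
WORKLOAD='python3 -c "print(sum(range(1000)))"'
if command -v perf >/dev/null 2>&1; then
    perf stat -e instructions sh -c "$WORKLOAD"
else
    # perf unavailable; just run the workload.
    sh -c "$WORKLOAD"
fi
```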

@mingfeima
Collaborator

@yf225 Ah, looks like we have a huge difference in how we test CPU performance. :(

First of all, Intel has quite a long product line: i5/i7 for desktop and Atom (low power, low performance) for mobile. In AI, Intel is promoting Xeon (high power, high performance), which is mostly used in data centers and by CSPs (cloud service providers). Inside Intel, "CPU" refers to Xeon in the context of AI. The latest generation of Xeon is called Skylake; the AWS C5 instance is Xeon Skylake. We test on a Xeon Skylake 8180, which has 56 cores @ 2.5GHz and a 512-bit instruction set, providing roughly 9 Tflops/s, good enough to train large topologies such as Inception_v3 on ImageNet.

Presumably MKLDNN should provide a speedup on both i5/i7 and Atom, but in practice MKLDNN is designed for Xeon. We highly recommend including Xeon in CPU performance testing, preferably Skylake.

As for the second question, it must be guaranteed that NO other process interrupts the performance benchmark, otherwise the result is useless. This is also true for GPU performance benchmarking. We have a 32-node Skylake cluster managed by slurm, which guarantees that a submitted test job has the machine exclusively.

Also, using instruction counts as the metric is probably not a good idea; the result is going to be misleading, since modern CPUs use wide SIMD instruction sets (256-bit or 512-bit) that generate fewer instructions but run much faster.
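A rough illustration of this point, using plain Python as a stand-in for the SIMD-vs-scalar effect: the same logical work can execute vastly different numbers of instructions depending on the implementation, so raw instruction counts are not comparable across code paths.

```python
# Two ways to compute the same sum. The builtin runs as one
# C-implemented call; the loop dispatches ~N bytecodes. Both produce the
# identical answer, yet retire wildly different instruction counts --
# analogous to a 512-bit SIMD kernel retiring far fewer instructions
# than scalar code while finishing sooner.
N = 1_000_000

fast = sum(range(N))        # single C-level call

slow = 0
for i in range(N):          # ~N interpreted iterations
    slow += i

assert fast == slow == N * (N - 1) // 2
```

So a switch to an MKLDNN-backed kernel could well *lower* the instruction count while improving runtime, which is exactly why the metric is misleading for this comparison.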

@ezyang
Contributor Author

ezyang commented Apr 4, 2018

Hi @yf225, so do you think we should adjust the CPU timing and then accept this patch?

@yf225
Contributor

yf225 commented Apr 4, 2018

@ezyang Yes, sorry for getting back to this late - I think we should accept it, since it gives bigger gains to more compute-intensive models.

@mingfeima Thanks a lot for the advice! We are moving over to using bare metal machines for CPU performance testing, which allows us to have dedicated CPU cores for running the test suite, so no other process will interrupt the performance benchmarking anymore.

On the bare metal machines we will measure actual wall-clock time, which will give an accurate estimate of program runtime.

We probably have to stick with Atom for our per-PR CPU perf test from a cost perspective, but we plan to run larger perf tests nightly and weekly on Xeon machines in the near future, and the benefits of MKLDNN will definitely show there.

@yf225
Contributor

yf225 commented May 30, 2018

@pytorchbot retest this please

@ezyang
Contributor Author

ezyang commented May 31, 2018

But I guess, see also #7974

@yf225
Contributor

yf225 commented Jun 26, 2018

@cpuhrsch Should we merge this one?

@ezyang ezyang changed the title Install and test against mkldnn [WIP] Install and test against mkldnn Oct 8, 2018
@ezyang
Contributor Author

ezyang commented Oct 8, 2018

At this point, I'm not really sure what would need to happen to merge this PR.
