
[WIP] Install and test against mkldnn#6132

Closed
ezyang wants to merge 1 commit into pytorch:master from ezyang:pr/enable-mkldnn-ci

Conversation

@ezyang
Contributor

@ezyang ezyang commented Mar 30, 2018

Also took the opportunity to remove obsolete mkl and mkl-devel settings.


Signed-off-by: Edward Z. Yang <ezyang@fb.com>
@colesbury
Member

Does this install mkldnn for every configuration? Would it make sense to have one configuration that doesn't have mkldnn to make sure we don't break that case?

@ezyang
Contributor Author

ezyang commented Mar 30, 2018

It only installs mkldnn on the conda-enabled configurations, so the pip-only ones will run without it.

OTOH, I'm pretty unhappy about the current state of the tests, where you have to build in a specific configuration to exercise certain tests. Enabling a feature shouldn't change which tests you run; it should only add to them. Then we wouldn't have to do all of these funny contortions to turn cudnn on/off to make sure we exercise both paths, for example.
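The testing philosophy here could be sketched roughly like this, assuming a hypothetical feature probe (`AVAILABLE_BACKENDS` and `_run_conv` are illustrative stand-ins, not PyTorch APIs):

```python
import unittest

# AVAILABLE_BACKENDS is a hypothetical probe of what this build supports;
# a real setup would append "mkldnn" only when the library was compiled
# with it. The point: an enabled feature adds test cases, it never
# replaces or removes them.
AVAILABLE_BACKENDS = ["native"]


class ConvTest(unittest.TestCase):
    def _run_conv(self, backend):
        # Stand-in for the actual op under test.
        return sum(x * 2 for x in range(4))

    def test_all_backends(self):
        # One test body, run once per available backend via subTest,
        # instead of baking the backend choice into the build config.
        for backend in AVAILABLE_BACKENDS:
            with self.subTest(backend=backend):
                self.assertEqual(self._run_conv(backend), 12)
```

With this shape, turning a feature on in CI only grows the set of subtests exercised; no configuration contortions are needed to cover both paths.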

@yf225
Contributor

yf225 commented Mar 30, 2018

@ezyang
Contributor Author

ezyang commented Mar 30, 2018

@yf225 It still failed?

@yf225
Contributor

yf225 commented Mar 30, 2018

Yeah, the third attempt failed too, on a different machine, so it's probably legit: https://ci.pytorch.org/jenkins/job/pytorch-builds/job/short-perf-test-cpu/2291/console

@yf225
Contributor

yf225 commented Mar 30, 2018

@mingfeima https://github.com/pytorch/examples/tree/perftests/mnist seems to run slower with the mkldnn package. Do you mind taking a look?

@mingfeima
Collaborator

@yf225 I tested https://github.com/pytorch/examples/tree/perftests/mnist on my machine, a Xeon(R) Platinum 8180 CPU @ 2.50GHz, with OMP_NUM_THREADS=4 python main.py --epochs 1. It did run slower with mkldnn:

  • mnist without mkldnn: 9.63s per epoch
  • mnist with mkldnn: 10.99s per epoch

However, mnist is a very small workload that is unsuitable for performance measurement; convnet is a better alternative from a performance perspective. Below is convnet-alexnet performance, also with OMP_NUM_THREADS=4:

  • convnet-alexnet without mkldnn: 3.22s per iteration
  • convnet-alexnet with mkldnn: 1.45s per iteration

Also, the mkldnn integration is just a kickoff; the current performance is roughly 20% of the best achievable number. We will continue to work on that.

I have a few questions:
a) What CPU type are you using for benchmarking?
b) Any special reason for setting OMP_NUM_THREADS to 4? Why not use all the cores?

If you want MKL to use the same number of threads, there is no need to set MKL_NUM_THREADS; it follows OMP_NUM_THREADS. MKL_NUM_THREADS is only needed when you want a different number of threads from OMP_NUM_THREADS.

Also, KMP_AFFINITY has a big impact on CPU performance. If you are testing with only 4 cores, it doesn't really make a difference, but if you use all the cores on the CPU, export KMP_AFFINITY=granularity=fine,compact,1,0 will give you a faster result.
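As a concrete sketch of this thread-and-affinity setup (the values are illustrative and machine-dependent; tune them to your core count):

```shell
# Illustrative CPU benchmarking environment, per the advice above.
# MKL follows OMP_NUM_THREADS unless MKL_NUM_THREADS is set separately.
export OMP_NUM_THREADS=4
# Pins threads to cores; mainly matters when using the whole socket.
export KMP_AFFINITY=granularity=fine,compact,1,0
echo "threads=$OMP_NUM_THREADS affinity=$KMP_AFFINITY"
```

The benchmark command (e.g. `python main.py --epochs 1`) would then be run in the same shell so it inherits these settings.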

@yf225
Contributor

yf225 commented Apr 2, 2018

@mingfeima Thanks a lot for the advice - we will move over to using Convnet for our benchmark.

Currently we are using a Mac mini i7 as our test machine, which has a 2.3GHz i7-3615. We will be moving our test machine to Packet Tiny, which has an Atom C2550 on it.

The test machines would probably only have 4 cores - are the improvements we have mostly on CPUs that have more cores?

Also, we are in the process of revamping our CPU perf test setup. One issue is that wall-clock runtime for CPU perf tests is not reliable, since we can't guarantee that no other programs will be running on the machine at the time. We are thinking about using perf stat -e instructions to count the number of instructions executed instead, and I am curious whether you think this metric would also improve when we move over to mkldnn. Thanks!
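The proposed measurement might look like the sketch below. `perf stat -e instructions` is a standard Linux perf invocation; the Python one-liner is a stand-in workload, and the snippet falls back to a plain run where perf isn't installed:

```shell
# Count retired instructions for a workload instead of trusting
# wall-clock time on a shared machine. The workload here is a stand-in.
WORKLOAD='python3 -c "print(sum(range(1000)))"'
if command -v perf >/dev/null 2>&1; then
    perf stat -e instructions sh -c "$WORKLOAD"
else
    # perf unavailable; just run the workload.
    sh -c "$WORKLOAD"
fi
```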

@mingfeima
Collaborator

@yf225 Ah, looks like we have a huge difference in how we test CPU performance. :(

First of all, Intel has quite a long product line: i5/i7 for desktop and Atom (low power, low performance) for mobile. In AI, Intel is promoting Xeon (high power, high performance), which is mostly used in data centers and by CSPs (cloud service providers). Inside Intel, "CPU" refers to Xeon in the context of AI. The latest generation of Xeon is called Skylake; the AWS C5 instance is Xeon Skylake. We test on a Xeon Skylake 8180, which has 56 cores @ 2.5GHz and a 512-bit instruction set, providing roughly 9 Tflops/s, good enough to train large topologies such as Inception_v3 on ImageNet.

Presumably MKLDNN should provide a speedup on both i5/i7 and Atom, but in practice MKLDNN is designed for Xeon. We highly recommend including Xeon in CPU performance testing, preferably Skylake.

As for the second question, it must be guaranteed that NO other process interrupts the performance benchmark, otherwise the result is useless. This is also true for GPU performance benchmarking. We have a 32-node Skylake cluster managed by slurm, which guarantees that a submitted test job has the machine exclusively.

Also, using instruction counts as the metric is probably not a good idea; the result is going to be misleading, since modern CPUs use wide SIMD instruction sets (256-bit or 512-bit) that generate fewer instructions but run much faster.
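A rough illustration of this point, using plain Python as a stand-in for the SIMD-vs-scalar effect: the same logical work can execute vastly different numbers of instructions depending on the implementation, so raw instruction counts are not comparable across code paths.

```python
# Two ways to compute the same sum. The builtin runs as one
# C-implemented call; the loop dispatches ~N bytecodes. Both produce the
# identical answer, yet retire wildly different instruction counts --
# analogous to a 512-bit SIMD kernel retiring far fewer instructions
# than scalar code while finishing sooner.
N = 1_000_000

fast = sum(range(N))        # single C-level call

slow = 0
for i in range(N):          # ~N interpreted iterations
    slow += i

assert fast == slow == N * (N - 1) // 2
```

So a switch to an MKLDNN-backed kernel could well *lower* the instruction count while improving runtime, which is exactly why the metric is misleading for this comparison.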

@ezyang
Contributor Author

ezyang commented Apr 4, 2018

Hi @yf225, so do you think we should adjust the CPU timing and then accept this patch?

@yf225
Contributor

yf225 commented Apr 4, 2018

@ezyang Yes, sorry for getting back to this late - I think we should accept it, since it gives bigger gains to more compute-intensive models.

@mingfeima Thanks a lot for the advice! We are moving over to using bare metal machines for CPU performance testing, which allows us to have dedicated CPU cores for running the test suite, so no other process will interrupt the performance benchmarking anymore.

On the bare metal machines we will measure actual wall-clock time, which will give an accurate estimate of program runtime.

We probably have to stick with Atom for our per-PR CPU perf test from a cost perspective, but we plan to run larger perf tests nightly and weekly on Xeon machines in the near future, and the benefits of MKLDNN will definitely show there.

@yf225
Contributor

yf225 commented May 30, 2018

@pytorchbot retest this please

@ezyang
Contributor Author

ezyang commented May 31, 2018

But I guess, see also #7974

@yf225
Contributor

yf225 commented Jun 26, 2018

@cpuhrsch Should we merge this one?

@ezyang ezyang changed the title Install and test against mkldnn [WIP] Install and test against mkldnn Oct 8, 2018
@ezyang
Contributor Author

ezyang commented Oct 8, 2018

At this point, I'm not really sure what would need to happen to merge this PR.
