[caffe2]seperate mkl, mklml, and mkldnn by gujinghui · Pull Request #12170 · pytorch/pytorch

gujinghui · 2018-09-28T14:38:03Z

Remove avx2 support in mkldnn
Seperate mkl, mklml, and mkldnn
Fix convfusion test case

gujinghui · 2018-09-29T01:58:25Z

@pytorchbot retest this please

gujinghui · 2018-09-29T05:06:09Z

@yinghai @jgong5
pls review.

yinghai

What's the test plan for this change?

And looks like we still have FIndMKLDNN.cmake. Can we merge that into one?

gujinghui · 2018-10-04T15:40:52Z

@yinghai
The test cases under caffe2/python/ideep/ covers these changes.

For FindMKLDNN.cmake, yes, I have plan to move all to FindMKLDNN.cmake.
After this PR, I will submit a PR to do this ASAP.
The reason is next PR is related to pytorch, while this one is still only for caffe2.
Pls merge this PR firstly. Thanks.

yinghai · 2018-10-04T21:45:48Z

What's the test plan for this change?

Sorry what I meant is what's your command to build pytorch?

facebook-github-bot

yinghai has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

gujinghui · 2018-10-08T02:41:30Z

@yinghai
Pls try below cmd.
cmake -DBLAS=MKL -DUSE_IDEEP=ON ..

yinghai · 2018-10-09T04:37:21Z

@orionr I think we cannot build with cmake along anymore? We need setup.py to build things properly, right?

gujinghui · 2018-10-09T06:24:48Z

@yinghai after this PR, I'll submit new PR to enable setup.py for mkl-dnn ASAP. Is it OK to you?

gujinghui · 2018-10-10T02:42:00Z

@yinghai
Is there any blocking issue for this PR?
If any, pls contact us anytime.

orionr · 2018-10-10T14:48:14Z

@yinghai and @gujinghui sorry for the slow response here.

libtorch builds (for C++ and no Python) are built exclusively with CMake, but PyTorch and Caffe2 builds will always go through setup.py. Changes like this should really match with changes in tools/setup_helpers/mkldnn. Ultimately when we move _C build to CMake this shouldn't be needed, but it's required right now. We also would need to call all third_party builds from root CMake rather than build_pytorch_libs.sh (and build_pytorch_libs.bat). So, long story short, you'll want to make sure tools/setup_helpers/mkldnn and setup.py match these changes when you land it. Generally you just want to make sure they are inline with the changes even if MKLDNN discovery is a bit different in the two cases.

cc @anderspapitto.

orionr · 2018-10-10T14:53:23Z

Also, @gujinghui, to test, run python setup.py build_deps rather than calling CMake directly.

gujinghui · 2018-10-13T11:32:33Z

@yinghai @orionr
please help review. The test cmd is export NO_MKLDNN=NO and python setup.py build_deps

gujinghui · 2018-10-13T11:37:02Z

@orionr
Thanks for detail descriptions.
With this PR, the mkldnn is built from source with caffe2 by setup.py.
No need to seek the libmkldnn.so in system environment.

gujinghui · 2018-10-15T09:13:20Z

@yinghai @orionr
rebased. pls help review.

orionr

Great work - thank you! I think we need to cleanup some of the naming and, most importantly, remove the CAFFE2_* variables from cmake/Modules/. Those should also likely be upstreamed as well. Happy to talk more on the PR, though.

caffe2/core/macros.h.in

cmake/Modules/FindMKL.cmake

cmake/Modules/FindMKLDNN.cmake

cmake/Summary.cmake

gujinghui · 2018-10-16T05:32:58Z

@orionr
Understood your concern.
But the USE_MKL has been used in mkl-dnn, as well, which should introduce some conflictions.

And USE_MKLDNN/USE_MKL is defined by user to try to enable MKLDNN/MKL.
But, sometimes we cannot find MKLDNN/MKL.
In order to avoid fusion, new macros are needed here.

How about use below macros in both cmake and C code, instead?
CAFFE2_USE_MKL --> MKL_FOUND
CAFFE2_USE_MKLDNN --> MKLDNN_FOUND

Then, in caffe2/core/macros.h.in, we have below lines.
#cmakedefine MKL_FOUND
#cmakedefine MKLDNN_FOUND

And in C code, the code looks like below.
#ifdef MKLDNN_FOUND
......
#endif

facebook-github-bot

yinghai has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

yinghai · 2018-10-17T04:58:33Z

How about use below macros in both cmake and C code, instead?
CAFFE2_USE_MKL --> MKL_FOUND
CAFFE2_USE_MKLDNN --> MKLDNN_FOUND

If this is the case, I feel we should just use CAFFE2_USE_MKL, because in macros.in.h we have a bunch of them already. What I'm not update with the build plan. @orionr, what's your take?

gujinghui · 2018-10-23T01:41:20Z

hi @orionr @yinghai
any update on this PR?
what else should be changed for this?

yinghai · 2018-10-23T18:18:45Z

@orionr

orionr · 2018-10-23T18:53:59Z

Thanks for the ping.

I think there are still CAFFE2_* variables being set in cmake/Modules/Find*.cmake files? If so, that needs to be fixed before this can land. If these are only going to be used for PyTorch / Caffe2 (and not for other projects) in the future (I don't know the answer to that) we could move them to a new location, but hopefully we can make them standardized.

gujinghui · 2018-10-24T05:33:33Z

@orionr
Thanks for reply.
All CAFFE2_* var have been moved out of cmake/Modules/FindMKL*.cmake.

@yinghai @orionr
Pls help review it.

Thanks,
Jinghui

orionr

Looks great! Thank you. One question on Windows and will now look at the CI results.

cmake/Modules/FindMKLDNN.cmake

orionr · 2018-10-25T00:14:31Z

Looks like we have some CI failures with this - let's make sure those pass before importing, since this is a pretty significant change. Closing and reopening to see if we can get them restarted.

orionr · 2018-10-25T00:14:57Z

@pytorchbot retest this please

gujinghui · 2018-10-25T05:09:36Z

@orionr
Looks like the rest failures are not caused by this PR.

pytorch-win-ws2016-cuda9-cudnn7-py3-build:

00:56:41 ..\caffe2\core\nomnigraph\include\nomnigraph/Representations/NeuralNet.h(456): error C2491: 'nom::repr::nn::createOperator': definition of dllimport function not allowed

caffe2_py2_gcc4_8_ubuntu14_04_test

Oct 24 11:46:01 =================================== FAILURES ===================================
Oct 24 11:46:01 _____________________________ TestGlu.test_glu_old _____________________________
Oct 24 11:46:01
Oct 24 11:46:01 self = <caffe2.python.operator_test.glu_op_test.TestGlu testMethod=test_glu_old>
Oct 24 11:46:01 args = (), kwargs = {}
Oct 24 11:46:01
Oct 24 11:46:01 def func(self, *args, **kwargs):
Oct 24 11:46:01 > self.should_serialize = True
Oct 24 11:46:01
Oct 24 11:46:01 /usr/local/caffe2/lib/python2.7/dist-packages/caffe2/python/serialized_test/serialized_test_util.py:34:
Oct 24 11:46:01 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Oct 24 11:46:01 /usr/local/lib/python2.7/dist-packages/hypothesis/_settings.py:248: in new_test
Oct 24 11:46:01 test(*args, **kwargs)
Oct 24 11:46:01 /usr/local/caffe2/lib/python2.7/dist-packages/caffe2/python/serialized_test/serialized_test_util.py:37: in func
Oct 24 11:46:01 hyp_func(self, *args, **kwargs)
Oct 24 11:46:01 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Oct 24 11:46:01
Oct 24 11:46:01 self = <caffe2.python.operator_test.glu_op_test.TestGlu testMethod=test_glu_old>
Oct 24 11:46:01
Oct 24 11:46:01 @settings(suppress_health_check=[HealthCheck.filter_too_much])
Oct 24 11:46:01 > @serial.given(
Oct 24 11:46:01 X=hu.tensor(),
Oct 24 11:46:01 axis=st.integers(min_value=0, max_value=3),
Oct 24 11:46:01 **hu.gcs
Oct 24 11:46:01 )
Oct 24 11:46:01 def test_glu_old(self, X, axis, gc, dc):
Oct 24 11:46:01 E FailedHealthCheck: It looks like your strategy is filtering out a lot of data. Health check found 50 filtered examples but only 9 good ones. This will make your tests much slower, and also will probably distort the data generation quite a lot. You should adapt your strategy to filter less. This can also be caused by a low max_leaves parameter in recursive() calls

pytorch_linux_trusty_py3_6_gcc7_build_test

Oct 24 10:23:25 ======================================================================
Oct 24 10:23:25 FAIL: test_all_reduce_product (main.TestDistBackend)
Oct 24 10:23:25 ----------------------------------------------------------------------
Oct 24 10:23:25 Traceback (most recent call last):
Oct 24 10:23:25 File "test_distributed.py", line 1288, in wrapper
Oct 24 10:23:25 self._join_and_reduce(fn)
Oct 24 10:23:25 File "test_distributed.py", line 1371, in _join_and_reduce
Oct 24 10:23:25 self.assertEqual(p.exitcode, first_process.exitcode)
Oct 24 10:23:25 File "/var/lib/jenkins/workspace/test/common_utils.py", line 404, in assertEqual
Oct 24 10:23:25 super(TestCase, self).assertEqual(x, y, message)
Oct 24 10:23:25 AssertionError: None != 1 :
Oct 24 10:23:25
Oct 24 10:23:25 ----------------------------------------------------------------------
Oct 24 10:23:25 Ran 62 tests in 611.141s
Oct 24 10:23:25
Oct 24 10:23:25 FAILED (failures=1, skipped=30)
Oct 24 10:23:25 Traceback (most recent call last):
Oct 24 10:23:25 File "test/run_test.py", line 394, in
Oct 24 10:23:25 main()
Oct 24 10:23:25 File "test/run_test.py", line 386, in main
Oct 24 10:23:25 raise RuntimeError(message)
Oct 24 10:23:25 RuntimeError: test_distributed failed!
Oct 24 10:23:25 + cleanup
Oct 24 10:23:25 + retcode=1
Oct 24 10:23:25 + set +x
Oct 24 10:23:25 =================== sccache compilation log ===================
Oct 24 10:23:25 ERROR:sccache::server: ["null"] fatal error: Permission denied (os error 13)
Oct 24 10:23:25
Oct 24 10:23:25 ERROR:sccache::server: ["null"] Permission denied (os error 13)

ezyang · 2018-10-25T15:02:42Z

GLU health should be fixed on master.

orionr · 2018-10-25T15:42:53Z

@gujinghui can you please rebase and force push to this branch so we can confirm all the tests pass? Thanks.

Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>

Mapping between cmake var and c macro: USE_MKL --> CAFFE2_USE_MKL USE_IDEEP --> CAFFE2_USE_IDEEP Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>

Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>

gujinghui · 2018-10-29T05:24:14Z

@orionr
rebased. pls help review.

orionr · 2018-10-29T15:00:03Z

Thank you! Failing test is only a timeout, so CI looks good. Starting to land.

facebook-github-bot

orionr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Summary: 1. Remove avx2 support in mkldnn 2. Seperate mkl, mklml, and mkldnn 3. Fix convfusion test case Pull Request resolved: pytorch/pytorch#12170 Reviewed By: yinghai Differential Revision: D10207126 Pulled By: orionr fbshipit-source-id: 1e62eb47943f426a89d57e2d2606439f2b04fd51

gujinghui force-pushed the remove_avx2 branch 3 times, most recently from 865a907 to 03807a4 Compare September 28, 2018 15:04

yinghai reviewed Oct 3, 2018

View reviewed changes

facebook-github-bot reviewed Oct 4, 2018

View reviewed changes

gujinghui mentioned this pull request Oct 13, 2018

[caffe2] pre-convert filter format in optimizeForIDEEP #9488

Closed

gujinghui force-pushed the remove_avx2 branch from 03807a4 to 5c85a19 Compare October 13, 2018 11:30

gujinghui force-pushed the remove_avx2 branch from 5c85a19 to 625be2c Compare October 15, 2018 09:12

orionr suggested changes Oct 15, 2018

View reviewed changes

gujinghui force-pushed the remove_avx2 branch from 625be2c to 82ac65c Compare October 16, 2018 05:27

facebook-github-bot reviewed Oct 16, 2018

View reviewed changes

gujinghui mentioned this pull request Oct 18, 2018

[caffe2] Enable int8 inference for MKLDNN #11860

Closed

gujinghui force-pushed the remove_avx2 branch from 82ac65c to b4a30e8 Compare October 24, 2018 05:27

gujinghui mentioned this pull request Oct 24, 2018

Fix build issues and test failures in MKLDNN, and refine cmake files #11853

Closed

orionr reviewed Oct 25, 2018

View reviewed changes

cmake/Modules/FindMKLDNN.cmake Outdated

This comment was marked as off-topic.

Sign in to view

This comment was marked as off-topic.

Sign in to view

This comment was marked as off-topic.

Sign in to view

orionr closed this Oct 25, 2018

orionr reopened this Oct 25, 2018

gujinghui added 3 commits October 29, 2018 11:23

Fix convfusion test case in mkldnn

6fb9f88

Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>

Remove avx2 support and seperate mkl, and ideep for mkldnn-bdg

3e5f944

Mapping between cmake var and c macro: USE_MKL --> CAFFE2_USE_MKL USE_IDEEP --> CAFFE2_USE_IDEEP Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>

build mkldnn bridge for pytorch

45f15fd

Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>

gujinghui force-pushed the remove_avx2 branch from b4a30e8 to 45f15fd Compare October 29, 2018 05:23

orionr mentioned this pull request Oct 29, 2018

Pytorch build finds local mkl instead of conda version #7083

Closed

orionr approved these changes Oct 29, 2018

View reviewed changes

facebook-github-bot reviewed Oct 29, 2018

View reviewed changes

facebook-github-bot closed this in dbab9b7 Oct 29, 2018

orionr mentioned this pull request Oct 29, 2018

Updated for for arbitrary command line arg ordering #13253

Closed

yaceben mentioned this pull request Oct 30, 2018

Use conda official channel's MKL-DNN? #13153

Closed

h6197627 mentioned this pull request Nov 3, 2018

MKLDNN doesn't link MKL and Intel OMP during PyTorch build #13544

Closed

ezyang added open source merged labels Jun 24, 2019

Conversation

gujinghui commented Sep 28, 2018

Uh oh!

gujinghui commented Sep 29, 2018

Uh oh!

gujinghui commented Sep 29, 2018

Uh oh!

yinghai left a comment

Choose a reason for hiding this comment

Uh oh!

gujinghui commented Oct 4, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yinghai commented Oct 4, 2018

Uh oh!

facebook-github-bot left a comment

Choose a reason for hiding this comment

Uh oh!

gujinghui commented Oct 8, 2018

Uh oh!

yinghai commented Oct 9, 2018

Uh oh!

gujinghui commented Oct 9, 2018

Uh oh!

gujinghui commented Oct 10, 2018

Uh oh!

orionr commented Oct 10, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

orionr commented Oct 10, 2018

Uh oh!

gujinghui commented Oct 13, 2018

Uh oh!

gujinghui commented Oct 13, 2018

Uh oh!

gujinghui commented Oct 15, 2018

Uh oh!

orionr left a comment

Choose a reason for hiding this comment

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

gujinghui commented Oct 16, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

facebook-github-bot left a comment

Choose a reason for hiding this comment

Uh oh!

yinghai commented Oct 17, 2018

Uh oh!

gujinghui commented Oct 23, 2018

Uh oh!

yinghai commented Oct 23, 2018

Uh oh!

orionr commented Oct 23, 2018

Uh oh!

gujinghui commented Oct 24, 2018

Uh oh!

orionr left a comment

Choose a reason for hiding this comment

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

gujinghui commented Oct 4, 2018 •

edited

Loading

orionr commented Oct 10, 2018 •

edited

Loading

gujinghui commented Oct 16, 2018 •

edited

Loading