Skip to content

[caffe2]seperate mkl, mklml, and mkldnn#12170

Closed
gujinghui wants to merge 3 commits intopytorch:masterfrom
gujinghui:remove_avx2
Closed

[caffe2]seperate mkl, mklml, and mkldnn#12170
gujinghui wants to merge 3 commits intopytorch:masterfrom
gujinghui:remove_avx2

Conversation

@gujinghui
Copy link
Collaborator

  1. Remove avx2 support in mkldnn
  2. Seperate mkl, mklml, and mkldnn
  3. Fix convfusion test case

@gujinghui gujinghui force-pushed the remove_avx2 branch 3 times, most recently from 865a907 to 03807a4 Compare September 28, 2018 15:04
@gujinghui
Copy link
Collaborator Author

@pytorchbot retest this please

@gujinghui
Copy link
Collaborator Author

@yinghai @jgong5
pls review.

Copy link
Contributor

@yinghai yinghai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the test plan for this change?

And looks like we still have FIndMKLDNN.cmake. Can we merge that into one?

@gujinghui
Copy link
Collaborator Author

gujinghui commented Oct 4, 2018

@yinghai
The test cases under caffe2/python/ideep/ covers these changes.

For FindMKLDNN.cmake, yes, I have plan to move all to FindMKLDNN.cmake.
After this PR, I will submit a PR to do this ASAP.
The reason is next PR is related to pytorch, while this one is still only for caffe2.
Pls merge this PR firstly. Thanks.

@yinghai
Copy link
Contributor

yinghai commented Oct 4, 2018

What's the test plan for this change?

Sorry what I meant is what's your command to build pytorch?

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yinghai has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@gujinghui
Copy link
Collaborator Author

@yinghai
Pls try below cmd.
cmake -DBLAS=MKL -DUSE_IDEEP=ON ..

@yinghai
Copy link
Contributor

yinghai commented Oct 9, 2018

@orionr I think we cannot build with cmake along anymore? We need setup.py to build things properly, right?

@gujinghui
Copy link
Collaborator Author

@yinghai after this PR, I'll submit new PR to enable setup.py for mkl-dnn ASAP. Is it OK to you?

@gujinghui
Copy link
Collaborator Author

@yinghai
Is there any blocking issue for this PR?
If any, pls contact us anytime.

@orionr
Copy link
Contributor

orionr commented Oct 10, 2018

@yinghai and @gujinghui sorry for the slow response here.

libtorch builds (for C++ and no Python) are built exclusively with CMake, but PyTorch and Caffe2 builds will always go through setup.py. Changes like this should really match with changes in tools/setup_helpers/mkldnn. Ultimately when we move _C build to CMake this shouldn't be needed, but it's required right now. We also would need to call all third_party builds from root CMake rather than build_pytorch_libs.sh (and build_pytorch_libs.bat). So, long story short, you'll want to make sure tools/setup_helpers/mkldnn and setup.py match these changes when you land it. Generally you just want to make sure they are inline with the changes even if MKLDNN discovery is a bit different in the two cases.

cc @anderspapitto.

@orionr
Copy link
Contributor

orionr commented Oct 10, 2018

Also, @gujinghui, to test, run python setup.py build_deps rather than calling CMake directly.

@gujinghui
Copy link
Collaborator Author

@yinghai @orionr
please help review. The test cmd is export NO_MKLDNN=NO and python setup.py build_deps

@gujinghui
Copy link
Collaborator Author

@orionr
Thanks for detail descriptions.
With this PR, the mkldnn is built from source with caffe2 by setup.py.
No need to seek the libmkldnn.so in system environment.

@gujinghui
Copy link
Collaborator Author

@yinghai @orionr
rebased. pls help review.

Copy link
Contributor

@orionr orionr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work - thank you! I think we need to cleanup some of the naming and, most importantly, remove the CAFFE2_* variables from cmake/Modules/. Those should also likely be upstreamed as well. Happy to talk more on the PR, though.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

@gujinghui
Copy link
Collaborator Author

gujinghui commented Oct 16, 2018

@orionr
Understood your concern.
But the USE_MKL has been used in mkl-dnn, as well, which should introduce some conflictions.

And USE_MKLDNN/USE_MKL is defined by user to try to enable MKLDNN/MKL.
But, sometimes we cannot find MKLDNN/MKL.
In order to avoid fusion, new macros are needed here.

How about use below macros in both cmake and C code, instead?
CAFFE2_USE_MKL --> MKL_FOUND
CAFFE2_USE_MKLDNN --> MKLDNN_FOUND

Then, in caffe2/core/macros.h.in, we have below lines.
#cmakedefine MKL_FOUND
#cmakedefine MKLDNN_FOUND

And in C code, the code looks like below.
#ifdef MKLDNN_FOUND
......
#endif

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yinghai has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@yinghai
Copy link
Contributor

yinghai commented Oct 17, 2018

How about use below macros in both cmake and C code, instead?
CAFFE2_USE_MKL --> MKL_FOUND
CAFFE2_USE_MKLDNN --> MKLDNN_FOUND

If this is the case, I feel we should just use CAFFE2_USE_MKL, because in macros.in.h we have a bunch of them already. What I'm not update with the build plan. @orionr, what's your take?

@gujinghui
Copy link
Collaborator Author

hi @orionr @yinghai
any update on this PR?
what else should be changed for this?

@yinghai
Copy link
Contributor

yinghai commented Oct 23, 2018

@orionr

@orionr
Copy link
Contributor

orionr commented Oct 23, 2018

Thanks for the ping.

I think there are still CAFFE2_* variables being set in cmake/Modules/Find*.cmake files? If so, that needs to be fixed before this can land. If these are only going to be used for PyTorch / Caffe2 (and not for other projects) in the future (I don't know the answer to that) we could move them to a new location, but hopefully we can make them standardized.

@gujinghui
Copy link
Collaborator Author

@orionr
Thanks for reply.
All CAFFE2_* var have been moved out of cmake/Modules/FindMKL*.cmake.

@yinghai @orionr
Pls help review it.

Thanks,
Jinghui

Copy link
Contributor

@orionr orionr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thank you. One question on Windows and will now look at the CI results.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

@orionr
Copy link
Contributor

orionr commented Oct 25, 2018

Looks like we have some CI failures with this - let's make sure those pass before importing, since this is a pretty significant change. Closing and reopening to see if we can get them restarted.

@orionr orionr closed this Oct 25, 2018
@orionr orionr reopened this Oct 25, 2018
@orionr
Copy link
Contributor

orionr commented Oct 25, 2018

@pytorchbot retest this please

@gujinghui
Copy link
Collaborator Author

@orionr
Looks like the rest failures are not caused by this PR.

  1. pytorch-win-ws2016-cuda9-cudnn7-py3-build:

00:56:41 ..\caffe2\core\nomnigraph\include\nomnigraph/Representations/NeuralNet.h(456): error C2491: 'nom::repr::nn::createOperator': definition of dllimport function not allowed

  1. caffe2_py2_gcc4_8_ubuntu14_04_test

Oct 24 11:46:01 =================================== FAILURES ===================================
Oct 24 11:46:01 _____________________________ TestGlu.test_glu_old _____________________________
Oct 24 11:46:01
Oct 24 11:46:01 self = <caffe2.python.operator_test.glu_op_test.TestGlu testMethod=test_glu_old>
Oct 24 11:46:01 args = (), kwargs = {}
Oct 24 11:46:01
Oct 24 11:46:01 def func(self, *args, **kwargs):
Oct 24 11:46:01 > self.should_serialize = True
Oct 24 11:46:01
Oct 24 11:46:01 /usr/local/caffe2/lib/python2.7/dist-packages/caffe2/python/serialized_test/serialized_test_util.py:34:
Oct 24 11:46:01 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Oct 24 11:46:01 /usr/local/lib/python2.7/dist-packages/hypothesis/_settings.py:248: in new_test
Oct 24 11:46:01 test(*args, **kwargs)
Oct 24 11:46:01 /usr/local/caffe2/lib/python2.7/dist-packages/caffe2/python/serialized_test/serialized_test_util.py:37: in func
Oct 24 11:46:01 hyp_func(self, *args, **kwargs)
Oct 24 11:46:01 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Oct 24 11:46:01
Oct 24 11:46:01 self = <caffe2.python.operator_test.glu_op_test.TestGlu testMethod=test_glu_old>
Oct 24 11:46:01
Oct 24 11:46:01 @settings(suppress_health_check=[HealthCheck.filter_too_much])
Oct 24 11:46:01 > @serial.given(
Oct 24 11:46:01 X=hu.tensor(),
Oct 24 11:46:01 axis=st.integers(min_value=0, max_value=3),
Oct 24 11:46:01 **hu.gcs
Oct 24 11:46:01 )
Oct 24 11:46:01 def test_glu_old(self, X, axis, gc, dc):
Oct 24 11:46:01 E FailedHealthCheck: It looks like your strategy is filtering out a lot of data. Health check found 50 filtered examples but only 9 good ones. This will make your tests much slower, and also will probably distort the data generation quite a lot. You should adapt your strategy to filter less. This can also be caused by a low max_leaves parameter in recursive() calls

  1. pytorch_linux_trusty_py3_6_gcc7_build_test

Oct 24 10:23:25 ======================================================================
Oct 24 10:23:25 FAIL: test_all_reduce_product (main.TestDistBackend)
Oct 24 10:23:25 ----------------------------------------------------------------------
Oct 24 10:23:25 Traceback (most recent call last):
Oct 24 10:23:25 File "test_distributed.py", line 1288, in wrapper
Oct 24 10:23:25 self._join_and_reduce(fn)
Oct 24 10:23:25 File "test_distributed.py", line 1371, in _join_and_reduce
Oct 24 10:23:25 self.assertEqual(p.exitcode, first_process.exitcode)
Oct 24 10:23:25 File "/var/lib/jenkins/workspace/test/common_utils.py", line 404, in assertEqual
Oct 24 10:23:25 super(TestCase, self).assertEqual(x, y, message)
Oct 24 10:23:25 AssertionError: None != 1 :
Oct 24 10:23:25
Oct 24 10:23:25 ----------------------------------------------------------------------
Oct 24 10:23:25 Ran 62 tests in 611.141s
Oct 24 10:23:25
Oct 24 10:23:25 FAILED (failures=1, skipped=30)
Oct 24 10:23:25 Traceback (most recent call last):
Oct 24 10:23:25 File "test/run_test.py", line 394, in
Oct 24 10:23:25 main()
Oct 24 10:23:25 File "test/run_test.py", line 386, in main
Oct 24 10:23:25 raise RuntimeError(message)
Oct 24 10:23:25 RuntimeError: test_distributed failed!
Oct 24 10:23:25 + cleanup
Oct 24 10:23:25 + retcode=1
Oct 24 10:23:25 + set +x
Oct 24 10:23:25 =================== sccache compilation log ===================
Oct 24 10:23:25 ERROR:sccache::server: ["null"] fatal error: Permission denied (os error 13)
Oct 24 10:23:25
Oct 24 10:23:25 ERROR:sccache::server: ["null"] Permission denied (os error 13)

@ezyang
Copy link
Contributor

ezyang commented Oct 25, 2018

GLU health should be fixed on master.

@orionr
Copy link
Contributor

orionr commented Oct 25, 2018

@gujinghui can you please rebase and force push to this branch so we can confirm all the tests pass? Thanks.

Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>
Mapping between cmake var and c macro:
USE_MKL    -->  CAFFE2_USE_MKL
USE_IDEEP  -->  CAFFE2_USE_IDEEP

Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>
Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>
@gujinghui
Copy link
Collaborator Author

@orionr
rebased. pls help review.

@orionr
Copy link
Contributor

orionr commented Oct 29, 2018

Thank you! Failing test is only a timeout, so CI looks good. Starting to land.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

orionr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 29, 2018
Summary:
1. Remove avx2 support in mkldnn
2. Seperate mkl, mklml, and mkldnn
3. Fix convfusion test case
Pull Request resolved: pytorch/pytorch#12170

Reviewed By: yinghai

Differential Revision: D10207126

Pulled By: orionr

fbshipit-source-id: 1e62eb47943f426a89d57e2d2606439f2b04fd51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants