Skip to content

Lazily initialise thread local num_threads value#37461

Closed
peterbell10 wants to merge 5 commits intopytorch:masterfrom
peterbell10:lazy_init_num_threads
Closed

Lazily initialise thread local num_threads value#37461
peterbell10 wants to merge 5 commits intopytorch:masterfrom
peterbell10:lazy_init_num_threads

Conversation

@peterbell10
Copy link
Copy Markdown
Collaborator

Fixes #37259, fixes #20156

This lazily calls at::init_num_threads once for each thread by adding a call to lazy_init_num_threads in at::parallel_for and at::parallel_reduce.

If this solution is okay, then we should add the same to guard other places that might use MKL or OpenMP.

@ilia-cher
Copy link
Copy Markdown
Contributor

generally LG, here we make sure that if thread uses parallel api then init_num_threads is called

but I thought the original issue was that a user created thread reverted openmp settings back to the default, if user code does not run the code from ATen/Parallel* then it will still be using default OpenMP settings

could we also then add a pybind binding for init_num_threads, so that users can at least explicitly call it?

@peterbell10
Copy link
Copy Markdown
Collaborator Author

Okay, added torch.init_num_threads. I think I've also now guarded all OpenMP parallel regions.

Based on the existence of mkl_set_num_threads_local I'm thinking that mkl_set_num_threads isn't thread local and so we don't need to add lazy init before calling MKL functions. So, I think that's everything?

@dr-ci
Copy link
Copy Markdown

dr-ci bot commented Apr 28, 2020

💊 CI failures summary and remediations

As of commit 601d192 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker.

See how this bot performed.

This comment has been revised 21 times.

@peterbell10 peterbell10 force-pushed the lazy_init_num_threads branch from 1d62fc7 to 49fba19 Compare April 29, 2020 15:07
@ilia-cher
Copy link
Copy Markdown
Contributor

you can probably push the button ready for review to publish the PR

@peterbell10 peterbell10 marked this pull request as ready for review April 29, 2020 20:09
@peterbell10 peterbell10 changed the title WIP: Lazily initialise thread local num_threads value Lazily initialise thread local num_threads value Apr 29, 2020
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have a better way to create a binding, and this one is deprecated, we moved to pybind, example of how to add a binding in one line of code: https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/init.cpp#L57

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I used pybind11, add_docstr failed because the function already had an (empty) docstring. Obviously I can just define the docstring in Module.cpp but then the doc strings aren't all in one place. What do you think?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to pybind, can you take another look @ilia-cher?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, pragma omp creeped back into the codebase, we'll need to change this to parallel_for, but we don't have to do this in this PR

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit concerned about multiple compilation units including this function in the header, could we move the implementation into the .cpp ? e.g. ATen/ParallelCommon.cpp

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention is that the cheap boolean check will be inlined into the caller and in the overwhelmingly likely case, there is no function call overhead.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it's marked inline so duplicates in multiple compilation units will be deduplicated at link time.

@ilia-cher
Copy link
Copy Markdown
Contributor

overall LG, thanks! left some minor comments

@ngimel ngimel added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label May 5, 2020
@peterbell10 peterbell10 force-pushed the lazy_init_num_threads branch from 49fba19 to c520098 Compare May 7, 2020 16:58
@peterbell10 peterbell10 force-pushed the lazy_init_num_threads branch from c520098 to 601d192 Compare May 7, 2020 17:14
@ilia-cher
Copy link
Copy Markdown
Contributor

will check today, thanks!

Copy link
Copy Markdown
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ilia-cher has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


namespace internal {

// Initialise num_threads lazily at first parallel call
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to have more explicit instructions about when to use this. AFAICT, you have to call this before every omp pragma?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately there are are some new pragmas that got into the codebase, I specifically killed all of them earlier and used at::parallel_for, but I'll follow up with a diff to clean them up

In general, we do call at::init_num_threads in places we control (i.e. except explicit user created threads); this PR is more a safe guard to make sure the original settings are respected if user creates a new thread themselves and user uses at::paralllel_for
So, technically after this PR the only case we don't cover is when user explicitly creates their own thread and user doesn't use at::parallel_for (and e.g. uses OpenMP directly), but in this case we can't really do much anyways

Copy link
Copy Markdown
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ilia-cher has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@ilia-cher merged this pull request in 5137827.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Merged open source triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bad performance with python threads libtorch does not initialize OpenMP/MKL by default

6 participants