Lock optimizations for DistAutogradContainer #36529

pritamdamania87 wants to merge 2 commits into gh/pritamdamania87/117/base
Conversation
DistAutogradContainer is a singleton for the entire process and has a single lock that protects access to a map keyed by `context_id`. Performance profiling showed that this lock is a potential bottleneck for training. As a result, this PR makes the following optimizations:

1) Shard the map into 256 buckets, with each bucket having its own lock, so that we hold much finer-grained locks.
2) `sendReleaseContextRpc` was being called under a lock; move this call outside the lock.

Differential Revision: [D21003934](https://our.internmc.facebook.com/intern/diff/D21003934/)
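To make the sharding scheme concrete, here is a minimal sketch of the pattern the description refers to, assuming per-shard mutexes and a power-of-2 shard count. `ShardedContextMap`, the stand-in context type, and the `sendReleaseContextRpc` signature are illustrative, not PyTorch's actual API.

```cpp
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-ins for the real types; only the locking pattern matters.
struct DistAutogradContext {};
using ContextPtr = std::shared_ptr<DistAutogradContext>;

void sendReleaseContextRpc(const ContextPtr& /*ctx*/) { /* issue RPC */ }

class ShardedContextMap {
 public:
  static constexpr int16_t kNumShards = 256;

  ContextPtr get(int64_t context_id) {
    auto& shard = getShard(context_id);
    std::lock_guard<std::mutex> guard(shard.lock);  // per-shard lock only
    auto it = shard.contexts.find(context_id);
    return it == shard.contexts.end() ? nullptr : it->second;
  }

  // Second optimization: mutate the map under the shard lock, but issue
  // the release RPC only after the lock has been dropped.
  void release(int64_t context_id) {
    ContextPtr ctx;
    {
      auto& shard = getShard(context_id);
      std::lock_guard<std::mutex> guard(shard.lock);
      auto it = shard.contexts.find(context_id);
      if (it != shard.contexts.end()) {
        ctx = std::move(it->second);
        shard.contexts.erase(it);
      }
    }
    if (ctx) {
      sendReleaseContextRpc(ctx);  // no lock held here
    }
  }

 private:
  struct ContextsShard {
    mutable std::mutex lock;
    std::unordered_map<int64_t, ContextPtr> contexts;
  };

  ContextsShard& getShard(int64_t context_id) {
    // kNumShards is a power of 2, so masking replaces a modulo.
    return shards_[context_id & (kNumShards - 1)];
  }

  ContextsShard shards_[kNumShards];
};
```

Holding only the bucket's lock means two threads touching different contexts rarely contend, and moving the RPC outside the critical section keeps lock hold times short.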
💊 Build failures summary (Dr. CI, as of commit 9cd489f): the XLA job pytorch_xla_linux_xenial_py3_6_clang7_build is failing.
```cpp
inline DistAutogradContainer::ContextsShard& DistAutogradContainer::getShard(
    int64_t context_id) {
  // kNumShards has to be a power of 2 for this to work.
  DCHECK((kNumShards & (kNumShards - 1)) == 0);
```
minor: maybe static_assert() instead?
Moved the check to init instead.
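For reference, the compile-time variant the reviewer suggested would look like the following; it only works while the shard count is a compile-time constant, which (given the init-time shard computation added later in this PR) is presumably why the check ended up in init as a runtime check instead.

```cpp
// Compile-time power-of-2 check, valid while kNumShards is a constant.
static constexpr int16_t kNumShards = 256;
static_assert((kNumShards & (kNumShards - 1)) == 0,
              "kNumShards must be a power of 2 so getShard() can use a "
              "mask instead of a modulo");
```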
```cpp
};

// Number of shards for the map storing autograd contexts.
static constexpr int16_t kNumShards = 256;
```
This value of 256 is fine.
fwiw, my feeling is that it might be slightly on the high side (e.g. my initial guess would be circa 64??) - when I've done lock sharding in the past, I tended to see diminishing returns much above num_cpus (or num_cpus*2) shards.
Should we use some heuristic formula (something like @jjlilley mentioned above) for the default numShards?
btw, I wasn't necessarily advocating setting the size based on num_cpus... (I kind of like the power-of-2 mode here.) Just saying (without data) that my guess is that the perf gain from 128 to 256 might be low.
```cpp
std::unordered_map<int64_t, ContextPtr> contexts;

// Lock for this shard.
mutable std::mutex lock;
```
Two suggestions:

- Consider reordering so that `lock` is first (acquiring the lock will fetch the beginning part of the `unordered_map<>` into cache in exclusive mode automatically).
- Consider padding slightly, to avoid interference. In fbcode, `sizeof()` this struct should be 96 bytes; if the hardware cache line is 128, two adjacent entries will contend with each other. `std::hardware_destructive_interference_size` is C++17-only, but one could always just add something like the following to the bottom and call it a day:

```cpp
int64_t unusedCachePadding_[8]; // prevent adjacent shards from sharing a cache line
```
Oh, so I'm learning that the more modern way is something like:

```cpp
struct alignas(128) ContextsShard {
```
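Putting the two suggestions together, a hedged sketch of the aligned shard might look like this (128 follows the reviewer's cache-line figure; the stand-in context type is illustrative):

```cpp
#include <memory>
#include <mutex>
#include <unordered_map>

struct DistAutogradContext {};  // illustrative stand-in
using ContextPtr = std::shared_ptr<DistAutogradContext>;

// alignas rounds the struct size up to a 128-byte boundary, so adjacent
// shards in an array never share a (destructively interfering) cache
// line; the lock comes first so acquiring it also pulls the start of
// the map into cache.
struct alignas(128) ContextsShard {
  mutable std::mutex lock;
  std::unordered_map<int64_t, ContextPtr> contexts;
};

static_assert(alignof(ContextsShard) == 128, "shards are line-aligned");
static_assert(sizeof(ContextsShard) % 128 == 0,
              "each shard spans a whole number of 128-byte blocks");
```

With `alignas`, `sizeof(ContextsShard)` is rounded up to a multiple of 128 automatically, so no manual padding member is needed.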
```cpp
std::lock_guard<std::mutex> guard(autograd_context_lock_);
auto context_id = next_context_id_++;
current_context_id_ = context_id;
```
What's the difference between `context_id` and `current_context_id_`?
`current_context_id_` is the thread-local context id variable. It needs to be set so that RPC operations pick it up.
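A trimmed-down sketch of that distinction, using the member names from the quoted diff (the `currentContextId()` accessor and the `-1` sentinel are illustrative assumptions):

```cpp
#include <cstdint>
#include <mutex>

class DistAutogradContainer {
 public:
  int64_t newContext() {
    std::lock_guard<std::mutex> guard(autograd_context_lock_);
    auto context_id = next_context_id_++;  // process-wide, ever-increasing
    current_context_id_ = context_id;      // this thread's active context
    return context_id;
  }

  static int64_t currentContextId() {
    // RPC send paths read this to tag outgoing messages with the
    // caller's active autograd context.
    return current_context_id_;
  }

 private:
  std::mutex autograd_context_lock_;
  int64_t next_context_id_ = 0;
  static thread_local int64_t current_context_id_;
};

thread_local int64_t DistAutogradContainer::current_context_id_ = -1;
```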
```cpp
      num_shards <<= 1;
    }
  }
  LOG(INFO) << "Number of shards for DistAutogradContainer: " << num_shards;
```
This log is fine for now, though eventually we might want to demote it to VLOG(1).
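The quoted hunk is only the tail of the loop; a plausible reconstruction of the surrounding init logic, under the assumption that the shard count is derived from `std::thread::hardware_concurrency()` and rounded up to a power of two (`computeNumShards` and `kNumDefaultShards` are hypothetical names):

```cpp
#include <cstdint>
#include <thread>

constexpr uint32_t kNumDefaultShards = 256;  // assumed fallback

uint32_t computeNumShards() {
  uint32_t num_shards = 1;
  const uint32_t num_hw_threads = std::thread::hardware_concurrency();
  if (num_hw_threads == 0) {
    num_shards = kNumDefaultShards;
  } else {
    // Keep doubling until we cover the hardware thread count; the result
    // is a power of 2, which getShard() relies on for masking.
    while (num_shards < num_hw_threads) {
      num_shards <<= 1;
    }
  }
  return num_shards;
}
```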
This pull request has been merged in 6e7eaab.
Summary:
Pull Request resolved: pytorch#36529

DistAutogradContainer is a singleton for the entire process and has a single lock that protects access to a map keyed by `context_id`. Performance profiling showed that this lock is a potential bottleneck for training. As a result, this PR makes the following optimizations:

1) Shard the map into 256 buckets, with each bucket having its own lock, so that we hold much finer-grained locks.
2) `sendReleaseContextRpc` was being called under a lock; move this call outside the lock.

ghstack-source-id: 102085139
Test Plan: waitforbuildbot
Differential Revision: D21003934
fbshipit-source-id: 55f80dd317311bce0efd3ca8ca617d071297b5dc
Stack from ghstack:

DistAutogradContainer is a singleton for the entire process and has a single lock that protects access to a map keyed by `context_id`. Performance profiling showed that this lock is a potential bottleneck for training. As a result, this PR makes the following optimizations:

1) Shard the map into 256 buckets, with each bucket having its own lock, so that we hold much finer-grained locks.
2) `sendReleaseContextRpc` was being called under a lock; move this call outside the lock.

Differential Revision: D21003934