Hetero neighbor sampler multithreading#215
Conversation
Codecov Report
@@ Coverage Diff @@
## master #215 +/- ##
=======================================
Coverage 83.49% 83.49%
=======================================
Files 26 26
Lines 848 848
=======================================
Hits 708 708
Misses 140 140
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
78b60f8 to
8dd411b
Compare
|
The speedup does not see to be very much, do we have analysis on what is the bottleneck ? |
I think the reason is that the number of threads is limited to the number of dst nodes. As many different dst nodes there are as many threads. For the ogbn-mag dataset, this will be 4. |
8dd411b to
930dc1a
Compare
The code generally LGTM, but i am still curious on the limited speedup. You may try from two aspects:
|
Sure, I'll check that in VTune. |
rusty1s
left a comment
There was a problem hiding this comment.
Looks good. A few minor comments.
| std::back_inserter(sampled_nodes_dict[dst_sampled.first])); | ||
| } | ||
| } | ||
| at::parallel_for(0, node_types.size(), 1, [&](size_t _s, size_t _e) { |
There was a problem hiding this comment.
Does this bring any gain? We are not doing heavy work here, so the overhead of threading might not be worth it?
There was a problem hiding this comment.
To check this, I made two measurements. The first in red are measurements made with the current code. And in green is after removing at::parallel_for (what was before). The results show that the introduction of parallelization at this point gives maybe not a big but still speedup (I took the measurements on a different machine than the one mentioned in the PR description, which is why the results are different):

There was a problem hiding this comment.
But maybe to be sure, I'll take 10 measurements and average the results
There was a problem hiding this comment.
I would be in favor of removing it, as it doesn't look that useful.
930dc1a to
6449aca
Compare
This reverts commit f1b10bb.
This reverts commit b58ee31.

The main roadblocks in the context of sampler parallelization is the mapper and the sampler. However, in the case of hetero sampling, there are separate mappers for each dst node type and separate samplers for each edge type. Therefore, to avoid race condition we can use parallelization per dst node type, i.e. each thread is assigned an edge types with unique dst node type that it is working on.
Example sampling times obtained using hetero_neighbor.py benchmark:
Example end-to-end times obtained using inference_benchmark.py: