Implement parallel scatter reductions for CPU by v0dro · Pull Request #36447 · pytorch/pytorch

v0dro · 2020-04-12T03:55:08Z

This PR implements gh-33389.

As a result of this PR, users can now specify various reduction modes for scatter operations. Currently, add, subtract, multiply and divide have been implemented, and adding new ones is not hard.
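To make the reduction semantics concrete, here is a minimal pure-Python sketch of what a scatter with a reduction mode computes on 2-D nested lists. It is a reference for the semantics only, not the parallel C++ kernel from this PR. The names `REDUCTIONS` and `scatter_reduce_2d` are hypothetical and exist only for illustration.

```python
import operator

# Hypothetical lookup table for illustration; the mode names mirror the
# reduction modes listed in the PR description.
REDUCTIONS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}


def scatter_reduce_2d(dest, dim, index, src, reduce=None):
    """Scatter `src` into the 2-D nested list `dest` in place.

    dim == 0: dest[index[i][j]][j] = op(dest[index[i][j]][j], src[i][j])
    dim == 1: dest[i][index[i][j]] = op(dest[i][index[i][j]], src[i][j])
    With reduce=None the source value overwrites the destination.
    """
    if dim not in (0, 1):
        raise ValueError("this sketch only handles 2-D inputs")
    # Resolve the reduction once, outside the element loop.
    op = None if reduce is None else REDUCTIONS[reduce]
    for i, row in enumerate(index):
        for j, target in enumerate(row):
            r, c = (target, j) if dim == 0 else (i, target)
            value = src[i][j]
            dest[r][c] = value if op is None else op(dest[r][c], value)
    return dest


dest = [[1, 1, 1], [1, 1, 1]]
index = [[0, 1, 0], [0, 0, 1]]
src = [[2, 3, 4], [5, 6, 7]]
scatter_reduce_2d(dest, 0, index, src, reduce="add")
print(dest)  # [[8, 7, 5], [1, 4, 8]]
```

Note that `dest[0][0]` receives two contributions (2 and 5). This sequential sketch applies them in a fixed order, whereas a parallel kernel may apply them in any order.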

Although reduction modes can now be selected dynamically at runtime, performance is the same as that of the scatter_add_ method in the master branch. The graph below shows this by comparing scatter_add_ in the master branch (blue) with scatter_(reduce="add") from this PR (orange).

The script used for benchmarking is as follows:

import os
import sys
import torch
import time
import numpy
from IPython import get_ipython

Ms=256
Ns=512
dim = 0
top_power = 2
ipython = get_ipython()

plot_name = os.path.basename(__file__)
branch = sys.argv[1]
fname = open(plot_name + ".csv", "a+")

for pM in range(top_power):
    M = Ms * (2 ** pM)
    for pN in range(top_power):
        N = Ns * (2 ** pN)
        input_one = torch.rand(M, N)
        index = torch.tensor(numpy.random.randint(0, M, (M, N)))
        res = torch.randn(M, N)

        test_case = f"{M}x{N}"
        print(test_case)
        tobj = ipython.magic("timeit -o res.scatter_(dim, index, input_one, reduce=\"add\")")

        fname.write(f"{test_case},{branch},{tobj.average},{tobj.stdev}\n")

fname.close()

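Additionally, add, subtract and multiply take almost the same time to execute, while divide is roughly twice as slow. For each op, the first timing is the scatter along dim 0 and the second along dim 1: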

op: add
70.6 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.1 µs ± 26.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
op: subtract
71 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.4 µs ± 34.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
op: multiply
70.9 µs ± 31.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
27.4 µs ± 29.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
op: divide
164 µs ± 48.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
52.3 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Script:

import torch
import time
import numpy
from IPython import get_ipython

ipython = get_ipython()

nrows = 3000
ncols = 10000
dims = [nrows, ncols]

res = torch.randint(5, 10, dims)
idx1 = torch.randint(dims[0], (1, dims[1])).long()
src1 = torch.randint(5, 10, (1, dims[1]))
idx2 = torch.randint(dims[1], (dims[0], 1)).long()
src2 = torch.randint(5, 10, (dims[0], 1))

for op in ["add", "subtract", "multiply", "divide"]:
    print(f"op: {op}")
    ipython.magic("timeit res.scatter_(0, idx1, src1, reduce=op)")
    ipython.magic("timeit res.scatter_(1, idx2, src2, reduce=op)")
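The scripts above rely on IPython's `timeit` magic, so they only run inside an IPython session. As a rough alternative, the sketch below collects a mean and standard deviation per call with the standard-library `timeit` module. The helper names `bench` and `csv_row` are hypothetical, and the numbers it produces are not guaranteed to match what `%timeit` reports.

```python
import statistics
import timeit


def bench(fn, repeat=7, number=1000):
    """Return (mean, stdev) in seconds per call of `fn` over `repeat` runs."""
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    return statistics.mean(per_call), statistics.pstdev(per_call)


def csv_row(test_case, branch, mean, stdev):
    """Format one line in the same column order as the benchmark CSV above."""
    return f"{test_case},{branch},{mean},{stdev}\n"


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without PyTorch installed.
    mean, stdev = bench(lambda: sum(range(100)), repeat=5, number=200)
    print(csv_row("demo", "local", mean, stdev), end="")
```

With PyTorch available, the stand-in workload could be replaced by something like `lambda: res.scatter_(dim, index, input_one, reduce="add")`, using the tensors built in the first script.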

dr-ci · 2020-04-12T03:55:57Z

💊 CI failures summary and remediations

As of commit 4b15a86 (more details on the Dr. CI page):

2/2 failures possibly* introduced in this PR
- 2/2 non-CircleCI failure(s)

Extra GitHub checks: 1 failed

Failed: GitHub Actions - clang-tidy

ci.pytorch.org: 1 failed

Failed: pr/pytorch-linux-xenial-rocm3.3-py3.6

v0dro · 2020-04-16T13:07:03Z

@nikitaved can you review this? Also, I'm getting a failure for XLA. Is that OK?

ngimel · 2020-06-11T16:14:27Z

Dispatch needlessly compiles for all the dtypes, not just for floating point types. Fixing it, however, would require a large copy-paste, so in the interest of code clarity (but not binary size :-) ) I'm inclined to let it go as is, after you try @nikitaved's suggestion. Thoughts? If you come up with a clever way to avoid the extra dispatch without copy-paste, that would be awesome too, but if not, that's not a blocker.

v0dro · 2020-06-12T09:17:54Z

I have introduced constexpr, but I don't think reintroducing the unordered map will make things any faster. As the following MWE shows, optimizations do not kick in when the functor is called through an unordered map, even when it is declared constexpr:

#include <unordered_map>
#include <functional>
#include <iostream>
#include <chrono>
#include <cstdlib>  // atoi

class F {
public:
  constexpr void operator() (int & a, int & b) const {
    a += b;
  }
};

int main(int argc, char* argv[]) {
  F fun;
  
  int a = atoi(argv[1]);
  int b = atoi(argv[2]);
  std::chrono::time_point<std::chrono::system_clock> start, stop;
  auto time = 0.0;
  using binary_t = std::function<void(int&, int&)>;
  std::unordered_map<int, binary_t> funcs;
  
  funcs[0] = fun;
  start = std::chrono::system_clock::now();
  for (long long i = 0; i < 10000000; ++i) {
    funcs[0](a, b);
  }
  stop = std::chrono::system_clock::now();
  
  std::cout << "time with unordered_map: " << std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count() << std::endl;
  std::cout << "a: " << a << std::endl;
  a = 1;

  start = std::chrono::system_clock::now();
  for (long long i = 0; i < 10000000; ++i) {
    fun(a, b);
  }
  stop = std::chrono::system_clock::now();
  
  std::cout << "time without map: " << std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count() << std::endl;
  std::cout << "a: " << a << std::endl;
}

Results:

time with unordered_map: 50887068
a: 30000001
time without map: 44
a: 30000001

v0dro · 2020-06-12T11:10:21Z

New benchmark:
scatter-regression.py.csv.pdf

ngimel · 2020-06-16T21:47:54Z

Is this ready? If so, can you please rebase so that we can get a signal from CI? Right now too many tests are failing.

ngimel · 2020-06-19T23:23:30Z

CI errors are real.

v0dro · 2020-06-24T07:34:25Z

@ngimel the errors are fixed. Could you please have a look? The Windows failures seem unrelated.

ngimel · 2020-06-24T17:35:46Z

Yeah, the Windows failures are unrelated, but can you please rebase so that we can get a signal from the Windows builds?

v0dro · 2020-06-27T03:07:37Z

@ngimel updated and ready to merge.

facebook-github-bot

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2020-06-30T02:12:23Z

@ngimel merged this pull request in 9ca4a46.

…ction methods. (#40962) Summary: Follow up to #36447 . Update for #33389. Also removes unused `unordered_map` include from the CPP file. Pull Request resolved: #40962 Differential Revision: D22376253 Pulled By: ngimel fbshipit-source-id: 4e7432190e9a847321aec6d6f6634056fa69bdb8

Summary: This PR implements pytorchgh-33389. The rest of the commit summary repeats the PR description, benchmark scripts and timings given at the top of this page.
Benchmark graph: https://user-images.githubusercontent.com/2629909/82671491-e5e22380-9c79-11ea-95d6-6344760c8578.png
Pull Request resolved: pytorch#36447
Differential Revision: D22272631
Pulled By: ngimel
fbshipit-source-id: 3cdb46510f9bb0e135a5c03d6d4aa5de9402ee90

…ction methods. (pytorch#40962) Summary: Follow up to pytorch#36447 . Update for pytorch#33389. Also removes unused `unordered_map` include from the CPP file. Pull Request resolved: pytorch#40962 Differential Revision: D22376253 Pulled By: ngimel fbshipit-source-id: 4e7432190e9a847321aec6d6f6634056fa69bdb8

v0dro added 19 commits April 3, 2020 20:34

update tests for scatter reduction operation

d5bce8c

update reduction from previous commits

d37abcc

make a simple call scaffold

33f6e31

simple if else for reduction

1457347

stuff

35782c5

scatter gather update

e17d967

works so far

1fa0588

scatter reductions working now

0c177a9

totally flawed implementation intermediate commit

435561e

tensor advanced indexing

2643960

binary ops for scatter reductions

d6dc7e2

resolving scalar scatter case

855e806

implementation of scalar operations with lots of code duplication

e3144c2

move into run_kernel function

6f438ba

found something wrong in scalar implementation

16f3348

giant copy paste

814988f

update test

5a14dc4

complete port of long implementation

92dde30

reduce LOC

b664d72

remove functor

df5ad25

pytorchbot added the open source label Apr 12, 2020

v0dro added 6 commits April 12, 2020 13:28

Merge branch 'master' of github.com:pytorch/pytorch into scatter-redu…

698efd0

…ctions-cpu

specialify reduce multiply

f3414d8

change template parameter

97ad487

Merge branch 'master' of github.com:pytorch/pytorch into scatter-redu…

948dc7a

…ctions-cpu

remove unordered_map constructor

13ae855

move to map due to issues with compiling on GCC 5

f0bdd13

nikitaved reviewed Apr 16, 2020

Comment thread aten/src/ATen/native/cpu/ScatterGatherKernel.cpp Outdated

add constexpr

fac7a72

remove unneeded unordered map

f157ce3

Merge branch 'master' of github.com:pytorch/pytorch into scatter-redu…

e5f2070

…ctions-cpu

v0dro added 2 commits June 19, 2020 20:12

update scatter with latest master branch updates

989010b

add error check for scatter_add

d0713d7

v0dro added 2 commits June 26, 2020 02:51

Merge branch 'master' of github.com:pytorch/pytorch into scatter-redu…

335a8b4

…ctions-cpu

update scatter kernel

4b15a86

facebook-github-bot reviewed Jun 27, 2020

facebook-github-bot closed this in 9ca4a46 Jun 29, 2020

facebook-github-bot added the merged label Jun 30, 2020

rgommers mentioned this pull request Jun 30, 2020

Non-deterministic parallel scatter reduction algorithms for scatter operations for CPU (sum, subtract, divide, multiply). #33389

Closed

v0dro mentioned this pull request Jul 3, 2020

Update the documentation of the scatter_ method with support for reduction methods. #40962

Closed

v0dro mentioned this pull request Sep 19, 2020

scatter_ supporting different reduction modes #22378

Closed

mruberry added the Merged label Oct 28, 2020

sidgoyal78 mentioned this pull request Dec 10, 2020

Add Forcenet facebookresearch/fairchem#150

Merged
