[sparse] Autograd get_indices/values and sparse_coo ctor by ssnl · Pull Request #11253 · pytorch/pytorch

ssnl · 2018-09-04T23:22:56Z

TODO: docs

Closes #11232

facebook-github-bot

SsnL has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

- `output_differentiability` in derivatives.yaml. Also relax the check that gradient formulas need to use all grad outputs. It is quite possible that computing a particular grad_input[i] needs only some of the grad_outputs.
- add sparse get_values and make it back-prop-able
- Make get_values back-prop-able
- make indices and values view functions
- Make all sparse_coo ctors dispatch to a native function, _sparse_new_with_dims_and_tensor.
- Remove the dispatch mechanism on native_* native ctors, e.g., native_sparse_coo_tensor. Now all the code lives in functions like sparse_coo_tensor.
- Make sparse coo ctor a view function
- Make _newFlattenedIndices a native function
- Implement sparse_constructor_backward
- Get rid of NNZ optimization
- Move native/sparse/SparseUtils.h to SparseTensorUtils.h
- add getter docs
- make _set_coalesced a native fn and call it _coalesced_
- sparseDims -> sparse_dim; denseDims -> dense_dim
- update test_print expect because I fixed _indices output to not have grad_fn
- now infer type first
- get_indices -> indices; get_values -> values
- purge options from sparse_coo_tensor with indices and values tensors
- Fix coalesced tests; update prints; use type dispatch for size only ctor
- Update note; support nondiff views; update prints
- workaround for sparse views and inplace ops
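To make the intent of the changes above concrete, here is a minimal sketch of backpropagating through the `sparse_coo_tensor` constructor and reading `indices()` / `values()` back. It assumes a recent PyTorch release in which these changes (landed via the reopened PR) are available; the tensor names, shapes, and the `weights` tensor are made up for illustration and do not come from the PR.

```python
import torch

# Unique, already-sorted COO indices, so coalescing does not reorder anything.
indices = torch.tensor([[0, 1, 1],
                        [2, 0, 2]])
values = torch.tensor([3.0, 4.0, 5.0], requires_grad=True)

# The constructor tracks history with respect to `values`.
s = torch.sparse_coo_tensor(indices, values, (2, 3))

# indices() and values() are only defined for coalesced tensors.
c = s.coalesce()
print(c.indices())
print(c.values())

# Illustrative dense weights: each stored value should receive the weight
# located at its own (row, col) position as its gradient.
weights = torch.arange(6, dtype=torch.float32).reshape(2, 3)
loss = (s.to_dense() * weights).sum()
loss.backward()
print(values.grad)
```

Going through `to_dense()` here is only a convenient way to build a scalar loss; any differentiable consumer of the sparse tensor would exercise the same constructor backward.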

ssnl · 2018-10-23T18:32:35Z

I'm using GitHub to sync code between local and remote to debug, so I'll close this PR to save CI runs and reopen it afterwards.

Summary: Reopen of #11253 after fixing bug in index_select
Pull Request resolved: #13001
Differential Revision: D10514987
Pulled By: SsnL
fbshipit-source-id: 399a83a1d3246877a3523baf99aaf1ce8066f33f

Summary:
- to fix #12241
- add `_sparse_sum()` to ATen, and expose it as `torch.sparse.sum()`; `SparseTensor.sum()` is not supported currently
- this PR depends on #11253, and will need to be updated once it lands
- [x] implement forward
- [x] implement backward
- performance [benchmark script](https://gist.github.com/weiyangfb/f4c55c88b6092ef8f7e348f6b9ad8946#file-sparse_sum_benchmark-py):
  - sum all dims is fastest for sparse tensor
  - when the input is sparse enough (nnz = 0.1%), sum of a sparse tensor is faster than dense in CPU, but not necessarily in CUDA
  - CUDA backward is comparable (<2x) between `sum several dims` vs `sum all dims` in sparse
  - CPU backward, which uses binary search, is still slow in sparse: it takes `5x` the time in `sum [0, 2, 3] dims` vs `sum all dims`
  - optimize CUDA backward for now
  - using thrust for sort and binary search, but runtime not improved
  - both CPU and CUDA forward are slow in sparse (`sum several dims` vs `sum all dims`), at most `20x` slower in CPU, and `10x` in CUDA
  - improve CPU and CUDA forward kernels

(nnz, sizes, sum_dims, keepdim, sum all or dims, bk=backward) | CPU (sparse vs dense) | CUDA (sparse vs dense)
-- | -- | --
(1000, [1000, 1000, 2, 2], [0, 1], False, sumAll) | 8.77 µs vs 72.9 µs | 42.5 µs vs 108 µs
(1000, [1000, 1000, 2, 2], [0, 1], False, sumD) | 112 µs vs 4.47 ms | 484 µs vs 407 µs
(1000, [1000, 1000, 2, 2], [0, 1], False, sumAll, bk) | 141 µs vs 148 µs | 647 µs vs 231 µs
(1000, [1000, 1000, 2, 2], [0, 1], False, sumD, bk) | 235 µs vs 1.23 ms | 781 µs vs 213 µs
(1000, [1000, 1000, 2, 2], [2, 3], False, sumD) | 48.5 µs vs 360 µs | 160 µs vs 2.03 ms
(1000, [1000, 1000, 2, 2], [2, 3], False, sumD, bk) | 258 µs vs 1.22 ms | 798 µs vs 224 µs
(1000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD) | 204 µs vs 882 µs | 443 µs vs 133 µs
(1000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD, bk) | 709 µs vs 1.15 ms | 893 µs vs 202 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumAll) | 39.8 µs vs 81 µs | 42.4 µs vs 113 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumD) | 747 µs vs 4.7 ms | 2.4 ms vs 414 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumAll, bk) | 1.04 ms vs 126 µs | 5.03 ms vs 231 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumD, bk) | 1.12 ms vs 1.24 ms | 5.99 ms vs 213 µs
(10000, [1000, 1000, 2, 2], [2, 3], False, sumD) | 133 µs vs 366 µs | 463 µs vs 2.03 ms
(10000, [1000, 1000, 2, 2], [2, 3], False, sumD, bk) | 1.56 ms vs 1.22 ms | 6.11 ms vs 229 µs
(10000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD) | 1.53 ms vs 799 µs | 824 µs vs 134 µs
(10000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD, bk) | 5.15 ms vs 1.09 ms | 7.02 ms vs 205 µs

- after improving CPU and CUDA forward kernels
  - in the `(1000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD)` forward, CPU takes ~~`171 µs`~~, of which `130 µs` is spent on `coalesce()`; for CUDA, total time is ~~`331 µs`~~, of which `141 µs` is spent on `coalesce()`. We need to reduce time at other places outside `coalesce()`.
  - after a few simple tweaks, the forward is now at most `10x` slower in CPU, and `7x` in CUDA. The time taken in `sum dense dims only [2, 3]` is `~2x` that of `sum all dims`. The speed of `sum all sparse dims [0, 1]` is on par with `sum all dims`.

(nnz, sizes, sum_dims, keepdim, sum all or dims, bk=backward) | CPU (sparse vs dense) | CUDA (sparse vs dense)
-- | -- | --
(1000, [1000, 1000, 2, 2], [0, 1], False, sumAll) | 7 µs vs 69.5 µs | 31.5 µs vs 61.6 µs
(1000, [1000, 1000, 2, 2], [0, 1], False, sumD) | 11.3 µs vs 4.72 ms | 35.2 µs vs 285 µs
(1000, [1000, 1000, 2, 2], [0, 1], False, sumAll, bk) | 197 µs vs 124 µs | 857 µs vs 134 µs
(1000, [1000, 1000, 2, 2], [0, 1], False, sumD, bk) | 124 µs vs 833 µs | 796 µs vs 106 µs
(1000, [1000, 1000, 2, 2], [2, 3], False, sumD) | 20.5 µs vs 213 µs | 39.4 µs vs 1.24 ms
(1000, [1000, 1000, 2, 2], [2, 3], False, sumD, bk) | 131 µs vs 830 µs | 881 µs vs 132 µs
(1000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD) | 95.8 µs vs 409 µs | 246 µs vs 87.2 µs
(1000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD, bk) | 624 µs vs 820 µs | 953 µs vs 124 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumAll) | 45.3 µs vs 72.9 µs | 33.9 µs vs 57.2 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumD) | 81.4 µs vs 4.49 ms | 39.7 µs vs 280 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumAll, bk) | 984 µs vs 111 µs | 6.41 ms vs 121 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumD, bk) | 1.45 ms vs 828 µs | 6.77 ms vs 113 µs
(10000, [1000, 1000, 2, 2], [2, 3], False, sumD) | 74.9 µs vs 209 µs | 37.7 µs vs 1.23 ms
(10000, [1000, 1000, 2, 2], [2, 3], False, sumD, bk) | 1.48 ms vs 845 µs | 6.96 ms vs 132 µs
(10000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD) | 1.14 ms vs 411 µs | 252 µs vs 87.8 µs
(10000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD, bk) | 4.53 ms vs 851 µs | 7.12 ms vs 128 µs

- the time taken in CUDA backward of sparse is very long, with large variance (for nnz=10000 it normally takes 6-7 ms). To improve the backward of sparse ops, we will need to debug at places other than the CUDA kernels. Here is a benchmark of `torch.copy_()`:

```
>>> import torch
>>> d = [1000, 1000, 2, 2]
>>> nnz = 10000
>>> I = torch.cat([torch.randint(0, d[0], size=(nnz,)), torch.randint(0, d[1], size=(nnz,))], 0).reshape(2, nnz)
>>> V = torch.randn(nnz, d[2], d[3])
>>> size = torch.Size(d)
>>> S = torch.sparse_coo_tensor(I, V, size).coalesce().cuda()
>>> S2 = torch.sparse_coo_tensor(I, V, size).coalesce().cuda().requires_grad_()
>>> data = S2.clone()
>>> S.copy_(S2)
>>> y = S * 2
>>> torch.cuda.synchronize()
>>> %timeit y.backward(data, retain_graph=True); torch.cuda.synchronize()
7.07 ms ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

Pull Request resolved: #12430
Differential Revision: D12878313
Pulled By: weiyangfb
fbshipit-source-id: e16dc7681ba41fdabf4838cf05e491ca9108c6fe
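For readers unfamiliar with the dimension combinations used in the benchmark tables, the following sketch shows what `torch.sparse.sum()` computes for each case on a small CPU tensor. It assumes a PyTorch release that ships `torch.sparse.sum`; the shape, `nnz`, and the `as_dense` helper are illustrative choices, much smaller than the benchmarked configuration, and the sketch checks results rather than timing anything.

```python
import torch

torch.manual_seed(0)

# Same layout as the benchmark (2 sparse dims + 2 dense dims), but tiny.
d = [6, 5, 2, 2]
nnz = 8
I = torch.stack([torch.randint(0, d[0], (nnz,)),
                 torch.randint(0, d[1], (nnz,))])
V = torch.randn(nnz, d[2], d[3])
S = torch.sparse_coo_tensor(I, V, d).coalesce()
dense = S.to_dense()


def as_dense(t):
    """Hypothetical helper: normalize a result to a dense tensor for comparison."""
    return t.to_dense() if t.is_sparse else t


total = torch.sparse.sum(S)                    # "sum all dims"
over_sparse = torch.sparse.sum(S, dim=[0, 1])  # "sum all sparse dims [0, 1]"
over_dense = torch.sparse.sum(S, dim=[2, 3])   # "sum dense dims only [2, 3]"
mixed = torch.sparse.sum(S, dim=[0, 2, 3])     # mix of sparse and dense dims

for name, got, want in [
    ("all", total, dense.sum()),
    ("sparse dims", over_sparse, dense.sum(dim=[0, 1])),
    ("dense dims", over_dense, dense.sum(dim=[2, 3])),
    ("mixed", mixed, dense.sum(dim=[0, 2, 3])),
]:
    print(name, tuple(got.shape), torch.allclose(as_dense(got), want))
```

Comparing against the dense reduction is a cheap correctness check to run before any timing, since it separates kernel bugs from performance questions.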

ssnl requested review from apaszke, colesbury, ezyang, gchanan, soumith and zdevito as code owners September 4, 2018 23:22

facebook-github-bot reviewed Sep 4, 2018

ssnl force-pushed the sp_val branch from 5969fe2 to 651de7d Compare September 4, 2018 23:26

facebook-github-bot reviewed Sep 4, 2018

ssnl force-pushed the sp_val branch 12 times, most recently from d995d08 to 645b9f5 Compare September 6, 2018 02:54

facebook-github-bot reviewed Sep 6, 2018

ssnl force-pushed the sp_val branch from 645b9f5 to 49c3bf3 Compare September 6, 2018 03:08

facebook-github-bot reviewed Sep 6, 2018

ssnl force-pushed the sp_val branch 6 times, most recently from 37ec54b to 39675d2 Compare September 7, 2018 21:38

ssnl added 23 commits October 23, 2018 14:31

View op outputs are not registered as views when !GradMode::enabled()

a54bc32

potential_history_tracking -> potentially_tracks_history

4534716

make the note clearer

e6da62b

update note

86bd088

more comments

d1870f0

diff and nondiff views

e524769

more comments

d005f6c

rename note

bb22f05

fix typos

5a5c54b

typo

b2b53bf

Use function variant with option dispatch; favor options dispatch ove…

66c340f

…r tensor dispatch

arg checking and de-duplicate code

9ac4e0c

revert grad mode stuff

5fbe6b6

Make _values non-differentiable; Add note on why

fc51d02

Add has_* for TensorOptions Fix Python sparse_coo_tensor entry Fix a CUDA coalesce error; add tests

update expect because _values is nondifferentiable

2180f8e

Fix narrow_copy_sparse compilation

c8cb568

Revert unintended derivatives.yaml change;

651036c

revert no_grad viewer test since that part of the change is not getti…

db84a1f

…ng in this time

Fix test_numba_integration may try to produce impossible sparse tensor

b8fe968

fix numba test & improve argcheck

602c1cf

skip autograd ifRocm

0425616

debug tensorFromBlob with variable type

2b7ae11
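The commit "Make _values non-differentiable; Add note on why" in the list above distinguishes the internal `_values()` accessor from the public `values()`. A small sketch of the observable difference follows, assuming a recent PyTorch release where `values()` is a differentiable view on coalesced tensors and `_values()` is excluded from autograd; the tensor contents are made up for illustration.

```python
import torch

indices = torch.tensor([[0, 1],
                        [1, 0]])
vals = torch.tensor([1.0, 2.0])

# A coalesced sparse leaf tensor that requires grad.
s = torch.sparse_coo_tensor(indices, vals, (2, 2)).coalesce().requires_grad_()

tracked = s.values()     # public accessor: participates in autograd
untracked = s._values()  # internal accessor: assumed not to track history
print(tracked.requires_grad, untracked.requires_grad)

# Gradients flow back to the sparse tensor through values().
tracked.sum().backward()
print(s.grad.to_dense())
```

Since `_values()` is an underscore-prefixed internal, code that needs gradients would normally go through `values()` on a coalesced tensor instead.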

ssnl force-pushed the sp_val branch from a07d046 to 2b7ae11 Compare October 23, 2018 18:31

ssnl closed this Oct 23, 2018

ssnl mentioned this pull request Oct 23, 2018

[sparse] Autograd indices/values and sparse_coo ctor #13001

Closed

ezyang added the open source label Jun 24, 2019

ssnl wants to merge 23 commits into pytorch:master from ssnl:sp_val
