
Sparse softmax support (CPU) #36305

Closed
pearu wants to merge 21 commits into pytorch:master from Quansight:pearu/23651-softmax-sparse

Conversation

@pearu (Collaborator) commented Apr 9, 2020

This PR implements softmax support for sparse tensors.

Sparse softmax is related to dense softmax as follows: if the values of unspecified sparse tensor entries are taken to be -inf, they have the effect of being ignored ("zero entries ignored"). This relation is used here to test the correctness of the results (a minimal sketch follows).

Resolves #23651 for CPU.
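
A minimal sketch of that testing relation (illustrative values, not the actual test code; it uses the torch.sparse.softmax entry point added by this PR):

```python
import torch

# A small sparse tensor with some unspecified entries.
i = torch.tensor([[0, 0, 1], [0, 2, 1]])
v = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float64)
s = torch.sparse_coo_tensor(i, v, (2, 3)).coalesce()

# Dense reference: fill the unspecified entries with -inf and apply the
# dense softmax; exp(-inf) == 0, so those entries are ignored.
d = torch.full(s.shape, float("-inf"), dtype=torch.float64)
d[s.indices()[0], s.indices()[1]] = s.values()
expected = torch.softmax(d, dim=1)

# The sparse softmax keeps the input's sparsity pattern; to_dense()
# fills the unspecified entries of the result with 0, matching exp(-inf).
result = torch.sparse.softmax(s, dim=1).to_dense()
assert torch.allclose(result, expected)
```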

- [x] sparse softmax
  - [x] CPU C++ implementation
  - [x] unittests
  - [x] update softmax documentation
  - [x] autograd support
- [x] sparse log_softmax
  - [x] CPU C++ implementation
  - [x] unittests
  - [x] update log_softmax documentation
  - [x] autograd support

@pearu self-assigned this on Apr 9, 2020
@pearu changed the title from Sparse softmax support to Sparse softmax support [WIP] on Apr 9, 2020
@pearu changed the title from Sparse softmax support [WIP] to Sparse softmax support (CPU) [WIP] on Apr 16, 2020
@pearu force-pushed the pearu/23651-softmax-sparse branch from 1842a1c to cb88366 on April 16, 2020 14:24
@pearu requested a review from apaszke as a code owner on April 20, 2020 13:25
@ezyang (Contributor) commented Apr 20, 2020

> The sparse softmax is related to dense softmax when the values of unspecified sparse tensor entries are taken to be -inf that will have the effect of "zero entries ignored"

We had a bunch of discussions about whether or not it is appropriate to treat unspecified sparse tensor entries as things other than zero; specifically, whether or not we should be allowed to overload the old function name in this situation (some discussion about this at #1369). Did you folks at Quansight come to a decision about what should be done here?

@pearu (Collaborator, Author) commented Apr 21, 2020

@ezyang we have not discussed the overloading vs name-prefixing options yet, but we have discussed the idea of introducing a so-called fill value for sparse tensors that would set the default value of unspecified sparse tensor entries.

The fill value feature would also be useful for the sparse softmax case, as it would allow users to choose the behavior:

  • Using fill_value=0 would give the nice invariant softmax(sparse, dim).to_dense() == softmax(sparse.to_dense(), dim), while the result of softmax(sparse, dim) would still be a sparse tensor (with a non-zero fill_value).
  • Using fill_value=-inf [the current PR] gives a mathematical basis for softmax(sparse, dim) being a sparse tensor (as requested in issue Add sparse softmax/log_softmax functionality (ignore zero entries) #23651).

In addition, the fill value feature has the potential to eliminate the overloading vs name-prefixing question, and in general it would bridge the gap between sparse and dense tensors in a consistent way.

A fill value has been implemented in https://sparse.pydata.org/, and from #9674 (comment) I see you have already discussed this option as well; a small illustration follows.
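
For illustration, a sketch of the fill value semantics using pydata/sparse (its COO constructor accepts a fill_value argument; the values here are arbitrary):

```python
import numpy as np
import sparse  # https://sparse.pydata.org/

# A 1-D COO array with entries specified at positions 0 and 3; every
# unspecified entry takes the fill value instead of an implicit zero.
x = sparse.COO(coords=[[0, 3]], data=[1.0, 2.0], shape=(4,), fill_value=-np.inf)
print(x.todense())   # [  1. -inf -inf   2.]
print(x.fill_value)  # -inf
```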

For this PR, we have two choices:

  1. sparse softmax is exposed as softmax (and log_softmax) [the current behavior] via overloading
  2. sparse softmax is exposed as spsoftmax (and splog_softmax)

In the long term, with the prospect of having the fill value feature available, option 1 makes sense to me.
On the other hand, option 2 seems to be the current practice in PyTorch, and I would be OK changing the PR accordingly.

So, I am not certain atm which approach would be preferable to users. Thoughts?

@ezyang (Contributor) commented Apr 21, 2020

If y'all are prepared to take on the maintenance burden of a fill value, I think I would also agree that this would be best, and then we don't have to introduce sparse versions of all of our functions. Without fill, (1) will technically require BC-breaking changes in the future, as you will turn an operation that previously implicitly converted zero-filled tensors to neginf-filled tensors into an error. It will save you work in the future if you do (2) first.

@pearu (Collaborator, Author) commented Apr 22, 2020

OK, it makes sense. Here's a plan:

  • Expose sparse softmax as torch.sparse.softmax, defined for sparse tensors with an assumed fill value of negative infinity (equivalent to ignoring unspecified sparse tensor entries); a usage sketch follows this list.
  • Expose sparse log softmax as torch.sparse.log_softmax with the fill value as above.
  • Reserve torch.softmax on sparse tensors for sparse tensors that support the fill value feature. The result of torch.softmax will differ from that of torch.sparse.softmax because the default fill value for sparse tensors will be zero; the results will be equal iff the fill value of a sparse tensor is specified as negative infinity.
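
A minimal sketch of the first two points (the tensor contents are illustrative; it assumes the torch.sparse entry points behave as agreed above):

```python
import torch

i = torch.tensor([[0, 0, 1], [0, 1, 1]])
v = torch.tensor([0.5, 1.5, 2.0], dtype=torch.float64)
s = torch.sparse_coo_tensor(i, v, (2, 2)).coalesce()

# The sparse-specific entry points live in the torch.sparse namespace
# and assume a -inf fill value, i.e. unspecified entries are ignored.
p = torch.sparse.softmax(s, dim=1)
lp = torch.sparse.log_softmax(s, dim=1)

# Both results are sparse with the input's sparsity pattern, and
# log_softmax agrees with log(softmax) on the specified entries.
assert torch.allclose(lp.values().exp(), p.values())
```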

@ezyang (Contributor) commented Apr 22, 2020

SGTM.

@pearu force-pushed the pearu/23651-softmax-sparse branch 2 times, most recently from 9144b22 to 7661c03 on April 26, 2020 19:15
@pearu changed the title from Sparse softmax support (CPU) [WIP] to Sparse softmax support (CPU) on Apr 27, 2020
@pearu (Collaborator, Author) commented Apr 27, 2020

@apaszke @ezyang this PR is ready for review. The PR implements sparse softmax support for CPU; CUDA support is planned as a follow-up PR.

@ezyang (Contributor) commented Apr 27, 2020

@pearu can you get someone from Quansight to review first?

@hameerabbasi self-requested a review on April 29, 2020 14:34
@nikitaved self-requested a review on April 29, 2020 15:33
Comment threads on aten/src/ATen/native/sparse/SoftMax.cpp (outdated)
@nikitaved (Collaborator) commented Apr 30, 2020

All the loops could be replaced with TensorIterator loops, I think... TensorIterator loops over all dimensions not involved in the reduction, and the kernel loop iterates over the reduction dimension if you perform the reduction and scaling all at once. Otherwise you could iterate over each element and then rescale it appropriately, knowing which value to rescale it with. TensorIterator also brings multithreading for free.

@ngimel added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on May 5, 2020
@pearu force-pushed the pearu/23651-softmax-sparse branch from 5d652ec to f7f2101 on May 5, 2020 21:14
@pearu requested a review from nikitaved on May 5, 2020 21:23
Comment threads on aten/src/ATen/native/sparse/SoftMax.cpp (outdated)
@pearu force-pushed the pearu/23651-softmax-sparse branch from 85a58cf to 7d095a3 on May 12, 2020 14:24
Comment thread torch/sparse/__init__.py

```python
if TYPE_CHECKING:
    from torch import dtype as DType
else:
    # TorchScript does not support the real dtype annotation here
    # (see the discussion below), so fall back to a runtime alias.
    DType = int
```
Collaborator:
@ezyang do you happen to know why this is needed? @pearu checked and found that TorchScript doesn't support torch.types._dtype in the type annotation. That seems like it's going to give problems in other places.

Contributor:
Yes, we're going to need to clue in TorchScript about this, unfortunately >:(

@pearu (Collaborator, Author) commented May 12, 2020

Here are some benchmark results for torch.sparse.softmax (v1.6.0.a) and tensorflow.sparse.softmax (v2.2.0), both on CPU; values are float64 scalars, and the timing repeat is 1000 (a reproduction sketch follows the table):

| sizes           | nnz    | dimension | pytorch sparse | pytorch dense | tensorflow |
|-----------------|--------|-----------|----------------|---------------|------------|
| (100, 100)      | 1000   | 0         | 127.68 us      | 66.16 us      | N/A        |
| (100, 100)      | 1000   | 1         | 97.27 us       | 10.36 us      | 234.18 us  |
| (100, 100)      | 10000  | 0         | 595.16 us      | 67.35 us      | N/A        |
| (100, 100)      | 10000  | 1         | 513.72 us      | 10.22 us      | 490.75 us  |
| (100, 100, 100) | 1000   | 0         | 280.73 us      | 1013.29 us    | N/A        |
| (100, 100, 100) | 1000   | 1         | 230.72 us      | 793.75 us     | N/A        |
| (100, 100, 100) | 1000   | 2         | 225.87 us      | 554.37 us     | 324.38 us  |
| (100, 100, 100) | 10000  | 0         | 1371.77 us     | 1135.67 us    | N/A        |
| (100, 100, 100) | 10000  | 1         | 1228.75 us     | 691.85 us     | N/A        |
| (100, 100, 100) | 10000  | 2         | 1171.73 us     | 533.57 us     | 1294.87 us |
| (100, 100, 100) | 100000 | 0         | 7957.96 us     | 1413.64 us    | N/A        |
| (100, 100, 100) | 100000 | 1         | 6750.87 us     | 752.53 us     | N/A        |
| (100, 100, 100) | 100000 | 2         | 5801.34 us     | 532.25 us     | 6444.39 us |
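
The benchmark script itself is not part of this thread; a hedged sketch of how one row (sizes (100, 100), nnz 1000, dimension 1) could be reproduced:

```python
import timeit
import torch

size, nnz, dim = (100, 100), 1000, 1
# Random indices may collide, so the coalesced nnz can end up slightly
# below the target; duplicate entries are summed by coalesce().
i = torch.stack([torch.randint(n, (nnz,)) for n in size])
v = torch.randn(nnz, dtype=torch.float64)
s = torch.sparse_coo_tensor(i, v, size).coalesce()
d = s.to_dense()

# Total time for 1000 repeats, reported per call in microseconds.
t_sp = timeit.timeit(lambda: torch.sparse.softmax(s, dim=dim), number=1000)
t_dn = timeit.timeit(lambda: torch.softmax(d, dim=dim), number=1000)
print(f"sparse: {t_sp * 1e3:.2f} us/call, dense: {t_dn * 1e3:.2f} us/call")
```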

Some notes:

  • torch.sparse.softmax is generally faster than tensorflow.sparse.softmax. That is pretty good considering that the PyTorch sparse softmax supports more features than the TF one.
  • softmax performance is higher for a larger dimension argument; this is likely related to better locality of the index pools when using row-wise storage.
  • pytorch sparse vs dense timings show that there exists a density region (< 1%) where sparse storage gives considerably better performance than dense storage. For relatively dense tensors, sparse storage can be up to 10x slower.

@ezyang (Contributor) commented May 13, 2020

Failure is legit:


May 12 18:18:16   test_softmax (__main__.TestSparse) ... /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/SoftMax.cpp:348:35: runtime error: division by zero
May 12 18:18:17     #0 0x7f50670b37cd in .omp_outlined. (/opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so+0xc2517cd)
May 12 18:18:17     #1 0x7f5053405712 in __kmp_invoke_microtask (/opt/conda/lib/libiomp5.so+0x122712)
May 12 18:18:17     #2 0x7f5053393ffe in __kmp_invoke_task_func (/opt/conda/lib/libiomp5.so+0xb0ffe)
May 12 18:18:17     #3 0x7f5053393059 in __kmp_launch_thread (/opt/conda/lib/libiomp5.so+0xb0059)
May 12 18:18:17     #4 0x7f5053405bd7 in _INTERNALdc942ded::__kmp_launch_worker(void*) (/opt/conda/lib/libiomp5.so+0x122bd7)
May 12 18:18:17     #5 0x7f508a9cd6b9 in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x76b9)
May 12 18:18:17     #6 0x7f508a70341c in clone /build/glibc-LK5gWL/glibc-2.23/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:109
May 12 18:18:17 

@facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

@ezyang merged this pull request in 48c0331.

@cpuhrsch (Contributor)

Interesting to read the conversations in this PR again in light of maskedtensor cc @george-qi

laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary: duplicates the PR description above.
Pull Request resolved: pytorch#36305

Differential Revision: D21566540

Pulled By: ezyang

fbshipit-source-id: a632ea69c38622f960721482e442efeb8d0a54fc

Labels

Merged, open source, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add sparse softmax/log_softmax functionality (ignore zero entries)

10 participants