
"Sparsified" mathematical operations #1369

@ezyang


I had a discussion with @soumith about this on Friday, but I wanted to record things down here so that I don't forget.

Background. Imagine that you are @martinraison and you are implementing sparse Adagrad. You end up writing this code:

            if p.grad.data.is_sparse:
                grad_indices = grad.indices()
                grad_values = grad.values()
                size = torch.Size([x for x in grad.size()])

                def make_sparse(values):
                    constructor = type(p.grad.data)
                    if grad_indices.dim() == 0 or values.dim() == 0:
                        return constructor()
                    return constructor(grad_indices, values, size)
                state['sum'].add_(make_sparse(grad_values.pow(2)))
                std = state['sum'].sparse_mask(grad)
                std_values = std.values().sqrt_().add_(1e-10)
                p.data.add_(-clr, make_sparse(grad_values / std_values))
            else:
                state['sum'].addcmul_(1, grad, grad)
                std = state['sum'].sqrt().add_(1e-10)
                p.data.addcdiv_(-clr, grad, std)

So, as you can see, the sparse version of this code is four times as long as the non-sparse version. Why? There are a couple of contributory factors:

  1. addcmul_ doesn't understand how to handle sparse arguments, so we have to compute the sparse tensor we want to add, and then add it in. In the code above, this is done by projecting out the underlying values, performing the necessary operations, and then constructing a new sparse tensor (using make_sparse).

  2. The line state['sum'].sqrt().add_(1e-10) works in both the sparse and dense settings, because state['sum'] is always a dense tensor. However, we can see that we will subsequently use this to divide the gradient, which means we don't actually need to compute all of the entries, only those which will divide non-zero coefficients of the gradient. Once again, this is done by pulling out the values you need, doing the necessary ops, and then turning it back into a sparse tensor.
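Factor 2 can be sketched with a toy model, in pure Python rather than the real torch.sparse API: here a "sparse vector" is just a dict mapping index to value, a "dense vector" is a list, and the names (state_sum, grad, std) are hypothetical stand-ins for the variables in the adagrad code above.

```python
import math

# Toy model (hypothetical names): dense accumulator vs. sparse gradient.
state_sum = [4.0, 9.0, 16.0, 25.0]   # stands in for state['sum'] (dense)
grad = {1: 3.0, 3: -5.0}             # sparse gradient, already coalesced

# Dense path: sqrt + eps over *every* entry, even ones grad never touches.
std_dense = [math.sqrt(v) + 1e-10 for v in state_sum]

# Sparse path: compute only the entries at grad's indices -- this is what
# sparse_mask followed by pointwise ops on values() accomplishes.
std = {i: math.sqrt(state_sum[i]) + 1e-10 for i in grad}
update = {i: grad[i] / std[i] for i in grad}
```

The point is that the sparse path touches two entries instead of four, at the cost of the manual unwrapping and rewrapping the prose describes.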

OK, so what can we do about this?

Proposal 1: Nothing is wrong, carry on. The code above is wordy, but some of it can be simplified by clone()'ing the sparse tensor, and then performing the updates on values in place. (@soumith suggested this to me.) For example, it's an easy matter to rewrite state['sum'].add_(make_sparse(grad_values.pow(2))) to:

s = grad.clone()
s.values().pow_(2)
state['sum'].add_(s)

This avoids making use of the make_sparse function.

There are some downsides to this approach:

  1. You have to be careful to make sure your sparse tensor is coalesced: an uncoalesced tensor may have multiple entries for the same index in its values array, so naively applying an operation to the values tensor gives the wrong answer whenever the operation doesn't commute with the summation that coalescing performs (squaring, for example). Indeed, the adagrad code initially got this wrong, and it was fixed in 1018b23.

  2. There are two minor performance hits to doing things this way. Running grad.clone() forces a copy of both the values and the indices. The values copy is wasted, because you immediately overwrite it in place; you could have written the updated values directly. Depending on what you do to the sparse tensor, the indices copy may also be wasted, because you may never actually change the shape of the tensor (as far as I can tell, we don't have any copy-on-write).

  3. I get really nervous when performing mutations on tensors which are part of the internal representations of others, e.g., as occurs in s.values().pow_(2). Yes, it is pretty clear what the intent is in this code snippet, but I think when there is an indirection, it's easier to accidentally end up sharing state when you don't want to. Then eek!
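Downside 1 is worth making concrete. In the same toy dict-based model as before (hypothetical representation, not the real COO format), an uncoalesced tensor stores duplicate entries for one index that are summed when coalescing, and squaring before vs. after coalescing gives different answers:

```python
# Uncoalesced entries: index 0 really holds 1.0 + 2.0 = 3.0 after coalescing.
entries = [(0, 1.0), (0, 2.0), (2, 4.0)]

# Wrong: square each stored value first, then coalesce: 1^2 + 2^2 = 5.
wrong = {}
for i, v in entries:
    wrong[i] = wrong.get(i, 0.0) + v ** 2

# Right: coalesce first (0 -> 3.0), then square: 3^2 = 9.
coalesced = {}
for i, v in entries:
    coalesced[i] = coalesced.get(i, 0.0) + v
right = {i: v ** 2 for i, v in coalesced.items()}
```

Because squaring doesn't distribute over addition, the two orders disagree; this is exactly the bug class fixed in 1018b23.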

Proposal 2: Add lots of methods. If we want to get users to avoid mucking about with values() manually, the next obvious thing to do is add methods to sparse tensor to let them do the things they want to do. This is a clear win for well-defined, easy-to-compute operations on sparse tensors like pow(scalar) and sqrt. We could implement them in Python or hand-write kernels for them (my personal inclination is to implement them in Python, since that reduces the TCB (trusted computing base), but @soumith tells me there are good reasons to implement kernels for each of them.)
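A Python-level version of Proposal 2 might look like the following sketch, where ToySparse is a hypothetical stand-in for a 1-D COO sparse tensor (not the real torch.sparse class): pointwise ops become methods, coalescing happens once at construction, and callers never touch values() directly.

```python
import math

class ToySparse:
    """Hypothetical minimal stand-in for a 1-D COO sparse tensor."""

    def __init__(self, entries):
        # Coalesce on construction (sum duplicate indices), so pointwise
        # operations on the stored values are always safe.
        self.data = {}
        for i, v in entries:
            self.data[i] = self.data.get(i, 0.0) + v

    # Proposal 2: expose pointwise ops as methods. These are sound for
    # sparse tensors because pow and sqrt both map 0 to 0, so implicit
    # (absent) entries stay implicitly zero.
    def pow(self, exponent):
        return ToySparse([(i, v ** exponent) for i, v in self.data.items()])

    def sqrt(self):
        return ToySparse([(i, math.sqrt(v)) for i, v in self.data.items()])
```

With such methods, the make_sparse(grad_values.pow(2)) dance collapses to something like grad.pow(2), at the cost of a growing method surface.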

However, there are some pitfalls.

  1. Consider std_values = std.values().sqrt_().add_(1e-10). It is tempting to say that we should replace this line with std.sqrt_().add_(1e-10), implementing appropriate sparse versions of sqrt_ and add_. But now there is a problem: the "sparse" version of add... isn't an add at all! If you add a scalar to a sparse tensor, the result isn't going to be sparse anymore! The operation you really wanted here was "add scalar if we actually have an entry for it in the sparse vector" which is a kind of weird operation, and in any case really shouldn't be called add. (To add insult to injury, it's not even "add scalar if entry is nonzero" because, prior to coalescing, we might very well have redundant zero entries in our sparse tensor!) So really, if we add a method for this operation, we should not call it add; perhaps we should name it sadd ("sparse add"). This highlights a merit of Proposal 1, which is that we didn't have to come up with new names for the operations.

  2. Let's look more closely at grad_values / std_values. We can do this division because we know that the indices of grad and std are the same. If we are defining a (sparse!) division between two sparse tensors, we do not necessarily know that this is the case: the sparse tensors might have completely different indices. (Actually, I don't even know what the intended semantics of division should be in this case.) Nor can I think of a cheap way of testing for it, since after a clone(), object identity no longer holds between the index tensors.
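To pin down pitfall 1, here is what the proposed sadd might look like in the same dict-based toy model (sadd is the name floated above; the representation is hypothetical): the scalar is added only to entries the tensor explicitly stores, after coalescing.

```python
def sadd(entries, scalar):
    """'Sparse add': add `scalar` only to explicitly stored entries,
    after coalescing. Not a real add -- absent entries stay zero."""
    coalesced = {}
    for i, v in entries:
        coalesced[i] = coalesced.get(i, 0.0) + v
    return {i: v + scalar for i, v in coalesced.items()}

# An explicitly stored zero (here built from two redundant zero entries)
# still receives the scalar -- so this is not "add if entry is nonzero",
# matching the caveat about redundant zero entries above.
result = sadd([(0, 0.0), (0, 0.0), (2, 4.0)], 1e-10)
```

The absent index 1 stays implicitly zero, while the stored-but-zero index 0 becomes 1e-10, which is exactly the semantics the adagrad code relies on.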

Proposal 3: A lifting operator (or two). One final possibility is to define a lifting function, which takes a method invocation on dense tensors, and lifts it into a sparse variant. For example, we might rewrite std_values = std.values().sqrt_().add_(1e-10) as std.lift_sparse(lambda x: x.sqrt_().add_(1e-10)). Similarly, torch.lift_sparse(torch.div, std, grad) would take division on dense tensors and lift it to apply on sparse vectors, handling the necessary unwrapping and wrapping, and ensuring that the inputs are coalesced.
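A sketch of such a lifting operator, again in the toy dict model (lift_sparse and its representation are hypothetical, not an existing torch API): coalesce, apply the dense pointwise function to the stored values, and leave absent entries implicit.

```python
import math

def lift_sparse(fn, entries):
    # Hypothetical lifting operator: coalesce, then apply the dense
    # pointwise `fn` to each stored value. This is only sound when
    # fn(0) == 0; nothing here checks that, which is precisely the kind
    # of silent misuse a lifting operator invites.
    coalesced = {}
    for i, v in entries:
        coalesced[i] = coalesced.get(i, 0.0) + v
    return {i: fn(v) for i, v in coalesced.items()}

# Note: sqrt(v) + 1e-10 does NOT map 0 to 0, so lifting it quietly gives
# the "sadd" semantics rather than true dense semantics.
std = lift_sparse(lambda v: math.sqrt(v) + 1e-10, [(1, 9.0), (3, 25.0)])
```

The fn(0) == 0 caveat in the comment is the formal version of the worry below: the lift happily accepts functions for which sparse and dense semantics diverge.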

@soumith didn't really like this idea, because when you provide an operator like this, users may end up using it in ways that you don't expect, leading to strange errors on their end. I know lifting operators like this work quite nicely in Haskell, but it is certainly true that Python is not Haskell.

Thanks for reading. Interested to hear your thoughts.

Metadata

Assignees: no one assigned

Labels: low priority (We're unlikely to get around to doing this in the near future), module: sparse (Related to torch.sparse), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
