The docs for nn.Softmax, which are essentially copied over from Torch7's nn.Softmax, say "Applies the SoftMax function to an n-dimensional input Tensor, rescaling them so that the elements of the n-dimensional output Tensor lie in the range (0, 1) and sum to 1." This is ambiguous with respect to what exactly should sum to 1 (i.e., which dimension or dimensions the softmax is performed over). What in fact happens, also a holdover from Torch7, is:
- For a 1D input, the softmax takes place over dimension 0 (this was needed in Torch7 to support inputs without a batch dimension, which are not allowed in PyTorch). This is still reasonable default behavior, because there's no reason to use nn.Softmax when you have a batch of scalars.
- For a 2D input, the softmax takes place over dimension 1. This is the most common case, and the desired behavior.
- For a 3D input, the softmax takes place over dimension 0. This is usually wrong in PyTorch, since dimension 0 is usually the batch dimension. It may, however, be the right thing to do if dimension 0 is the timestep dimension of a timestep x batch x feature tensor; this is the desired behavior in RNNs with attention.
- For a 4D input, the softmax again takes place over dimension 1.
- For a 5D+ input, THNN gives a RuntimeError.
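The implicit dimension choice above can be sketched numerically. The helper below is a reconstruction for illustration (using NumPy, not PyTorch's actual THNN code; the function names are ours): dimension 0 for 1D and 3D inputs, dimension 1 for 2D and 4D inputs, and an error for 5D+.

```python
import numpy as np

def legacy_softmax_dim(ndim):
    # Implicit dimension choice described above (a reconstruction,
    # not PyTorch's actual source): dim 0 for 1D/3D, dim 1 for 2D/4D.
    if ndim in (1, 3):
        return 0
    if ndim in (2, 4):
        return 1
    raise RuntimeError("softmax not implemented for %dD input" % ndim)

def softmax(x, axis):
    # Numerically stable softmax over a single axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

x2 = np.random.randn(4, 5)                 # batch x features
y2 = softmax(x2, legacy_softmax_dim(x2.ndim))
print(np.allclose(y2.sum(axis=1), 1.0))    # rows sum to 1

x3 = np.random.randn(7, 4, 5)              # timestep x batch x features
y3 = softmax(x3, legacy_softmax_dim(x3.ndim))
print(np.allclose(y3.sum(axis=0), 1.0))    # dim 0 sums to 1
```

Note how the 3D case normalizes over dimension 0, which is only what you want if that dimension is the timestep dimension rather than the batch dimension.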
If all of this is intentional, then we should document it. The ideal solution is probably to keep this behavior and document it but add an optional dim argument, as in TensorFlow, that lets the user pick a dimension to softmax over. For reference, the default behavior in TF is to softmax over the last dimension, while the default behavior in Chainer (which can also IIRC be overridden by the user as of v2) is to softmax over dimension 1. It is occasionally also useful to be able to softmax over multiple dimensions at once, but this is easy to emulate with .view() so it's not necessary to have in core.
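Emulating a softmax over multiple dimensions with a reshape, as suggested above, might look like the following sketch (NumPy's reshape standing in for .view(); the function name is hypothetical):

```python
import numpy as np

def softmax_last_two_dims(x):
    # Flatten the last two dimensions into one, softmax over the
    # flattened axis, then restore the original shape -- the
    # .view() trick described above.
    flat = x.reshape(x.shape[:-2] + (-1,))
    shifted = flat - flat.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    out = e / e.sum(axis=-1, keepdims=True)
    return out.reshape(x.shape)

x = np.random.randn(2, 3, 4)
y = softmax_last_two_dims(x)
print(np.allclose(y.sum(axis=(-2, -1)), 1.0))  # each 3x4 slice sums to 1
```

Since the reshape is a constant-time view in both frameworks, there is no performance argument for supporting multiple dimensions natively.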
It may also be useful to add a torch.softmax with the same behavior as F.softmax, to apply the softmax operation to Tensors.