Conversation
@apaszke Any advice on autograd plumbing? (I'll send this PR to pytorch/pytorch after pytorch#3841 is merged)
apaszke
left a comment
This mostly looks good, but needs a few tweaks.
return -((value - self.mean) ** 2) / (2 * var) - log_std - math.log(math.sqrt(2 * math.pi))
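As a plain-Python sanity check of the log-density line above (a stand-in for the tensor code, with scalar `mean` and `std` assumed for illustration), the formula reduces to the known standard-normal value at zero:

```python
import math

# Scalar stand-in for the Normal.log_prob line under review.
mean, std = 0.0, 1.0
var = std ** 2
log_std = math.log(std)

def log_prob(value):
    # log N(value; mean, std) = -(value-mean)^2/(2 var) - log std - log sqrt(2 pi)
    return -((value - mean) ** 2) / (2 * var) - log_std - math.log(math.sqrt(2 * math.pi))

# At x = 0 the standard-normal log-density is -0.5 * log(2 * pi).
assert abs(log_prob(0.0) - (-0.5 * math.log(2 * math.pi))) < 1e-12
```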
class _StandardGamma(Function):
This is an old-style autograd function. You should write it as shown in the docs.
I wasn't able to store the gradient via ctx.save_for_backward(grad). Is there a new-style way to save an intermediate that is neither an input nor an output?
You can't; save_for_backward can only be used with inputs and outputs. Just do ctx.grad = grad.
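A torch-free sketch of the suggested pattern (the real code would subclass `torch.autograd.Function`; the `Ctx` class and all numbers here are illustrative stand-ins): an intermediate that is neither an input nor an output is stashed as a plain attribute on `ctx`, while `save_for_backward` stays reserved for inputs/outputs.

```python
class Ctx:
    """Minimal stand-in for the autograd context object."""
    pass

class StandardGammaLike:
    @staticmethod
    def forward(ctx, alpha):
        sample = 0.9 * alpha   # placeholder for the actual gamma sampler
        ctx.grad = 0.9         # d(sample)/d(alpha), computed during forward,
                               # saved as a plain attribute (not save_for_backward)
        return sample

    @staticmethod
    def backward(ctx, grad_output):
        # Reuse the gradient captured in forward.
        return grad_output * ctx.grad

ctx = Ctx()
out = StandardGammaLike.forward(ctx, 2.0)
assert abs(StandardGammaLike.backward(ctx, 1.0) - 0.9) < 1e-12
```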
// This is identical to THRandom_standard_gamma but also stores the
// reparameterized gradient wrt alpha in grad_alpha.
double THRandom_standard_gamma_with_grad(THGenerator *_generator, double alpha,
You need to compute the grad during forward, because you'd need to replay all the control flow here otherwise, right?
Correct. The gradient is complicated because of the rejection sampler. To reproduce the computation in backward, we would need to store about 10x more state.
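A hedged sketch (not the PR's C code) of why the state is captured during forward: a Marsaglia-Tsang-style rejection sampler has data-dependent control flow, so backward would have to replay the whole accept/reject loop unless the accepted `(x, u)` state is saved up front. The function name and saved-state shape are illustrative.

```python
import math
import random

def sample_standard_gamma(alpha, rng):
    """Draw from Gamma(alpha, 1) for alpha >= 1 via Marsaglia-Tsang,
    returning the sample plus the accepted (x, u) state that a backward
    pass would otherwise have to rediscover by replaying the loop."""
    d = alpha - 1.0 / 3.0
    c = 1.0 / math.sqrt(9.0 * d)
    while True:                       # rejection loop: iteration count is random
        x = rng.gauss(0.0, 1.0)
        v = (1.0 + c * x) ** 3
        if v <= 0.0:
            continue                  # proposal invalid, try again
        u = rng.random()
        if math.log(u) < 0.5 * x * x + d - d * v + d * math.log(v):
            # Accepted: save the state the gradient computation needs.
            return d * v, (x, u)

rng = random.Random(0)
sample, saved_state = sample_standard_gamma(2.0, rng)
assert sample > 0.0
```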
default: THPDefaultGenerator->cdata
kwarg_only: True
- THTensor* alpha
- cname: standard_gamma_alpha_with_grad
It's better to expose this as an internal method instead of a new overload (think torch._C._standard_gamma_alpha_with_grad).
Could you point me to an example internal method whose plumbing I can copy? I'm a little lost here.
def standard_gamma(self, grad=None):
    if grad is None:
        return Variable(torch.standard_gamma(self.data))
    return Variable(torch.standard_gamma(self.data, grad), requires_grad=self.requires_grad)
You should just use the Function subclass you implemented here.
Thanks for the help! Should I embed that subclass in torch.autograd.variable, or should torch.distributions be the canonical interface for standard_gamma() for Variables?
@alicanb Would you be up for reviewing the statistical aspects of this PR?
Sure
const double accept_grad = d_grad * e + d * e_grad;
const double dv = d * v;
const double dv_grad = d_grad * v + d * v_grad  // Pathwise part.
                     + (dv - alpha) * accept_grad;  // Acceptance part.
This is the trickiest part. This roughly follows Naesseth et al. (2017), but computes an exact reparameterized gradient rather than an approximation (i.e. if you drew identical samples from an inverse-CDF sampler (Knowles 2015) and from this cheaper sampler, the gradients would be identical). Naesseth et al. seem to be missing the analytical baseline alpha in (dv - alpha), which is easy to compute. Note that accept = log(acceptance ratio), so we're using the log trick in multiplying by accept_grad (and avoiding an expensive exp()).
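The pathwise part of the product rule above (`dv_grad = d_grad * v + d * v_grad` for `dv = d * v`) can be checked numerically. In this sketch `d(alpha)` matches the Marsaglia-Tsang `d`, while `v(alpha)` is a hypothetical scalar stand-in for the sampler's `v`, and the acceptance part is deliberately omitted:

```python
import math

def d_of(alpha):
    return alpha - 1.0 / 3.0          # the Marsaglia-Tsang d

def v_of(alpha):
    return (1.0 + 0.1 / alpha) ** 3   # hypothetical v(alpha), just for this check

def dv(alpha):
    return d_of(alpha) * v_of(alpha)

def dv_grad(alpha):
    # Pathwise part only: d_grad * v + d * v_grad, mirroring the C code.
    d, d_grad = d_of(alpha), 1.0
    v = v_of(alpha)
    v_grad = 3.0 * (1.0 + 0.1 / alpha) ** 2 * (-0.1 / alpha ** 2)
    return d_grad * v + d * v_grad

eps = 1e-6
numeric = (dv(2.0 + eps) - dv(2.0 - eps)) / (2 * eps)  # central difference
assert abs(numeric - dv_grad(2.0)) < 1e-6
```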
real *const alpha_data = THTensor_(data)(alpha);
real *const saved_u_data = THTensor_(data)(saved_u);
real *const saved_x_data = THTensor_(data)(saved_x);
for (int64_t i = 0, numel = THTensor_(nElement)(alpha); i < numel; ++i) {
Any reason why you don't use OpenMP here?
It's difficult to parallelize because THGenerator *gen is stateful. However, this _fwd() step is much cheaper than the _bwd() step.
@staticmethod
def backward(ctx, grad_output):
    return grad_output * Variable(ctx.saved_grad)
    alpha, = ctx.saved_variables
This backward isn't really differentiable twice. Can you mark it @once_differentiable and use ctx.saved_tensors? grad_output will become a tensor too.
- arg: THTensor* output
  output: True
- THTensor* alpha
- THTensor* saved_u
It would be nice if we could skip these outputs if we knew an op won't be differentiated. But it's fine as is, and we can fix that later.
Already done :-) standard_gamma() is a simplified version of standard_gamma_fwd() where the unused outputs are discarded.
Do you know the difference between torch.manual_seed and torch.set_rng_state?
They are pretty much equivalent.

```python
torch.manual_seed(2)
s = torch.get_rng_state()
print(torch.randn(1))
# Same
torch.manual_seed(2)
print(torch.randn(1))
# Same
torch.set_rng_state(s)
print(torch.randn(1))
```

The benefit of …
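The same equivalence can be demonstrated with Python's stdlib `random` module as a stand-in for `torch.manual_seed` / `torch.get_rng_state` / `torch.set_rng_state` (the torch functions themselves are not used here):

```python
import random

random.seed(2)
s = random.getstate()   # capture the generator state right after seeding
a = random.random()

random.seed(2)
b = random.random()     # reseeding reproduces the same draw

random.setstate(s)
c = random.random()     # restoring the saved state also reproduces it

assert a == b == c
```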
@@ -163,6 +166,48 @@ def test_gamma_sample(self):
    scipy.stats.gamma(alpha, scale=1 / beta),
`1 / beta` throws a TypeError on Python 3; changing `1` to `1.0` fixes it.
alphas = Variable(torch.Tensor([alpha]), requires_grad=True)
betas = Variable(torch.Tensor([beta]), requires_grad=True)
self._check_sampler_sampler(Gamma(alphas, betas),
                            scipy.stats.gamma(alpha, scale=1 / beta),
Ok, I've changed algorithms to directly approximate the reparameterized gradient. This achieves simpler code, cheaper computation, and more accurate gradients (they are no longer stochastic for alpha < 1).
Thanks @fritzo, I'll plan on taking some time to review on Fri.
Moving to pytorch#3978
Fixes #25
DO NOT MERGE. This PR will be moved to the pytorch org after pytorch#3841 is merged.
This implements a reparameterized gradient for `distributions.Gamma`. The gradient is implemented by directly approximating the reparameterized gradient function `dx/dalpha` following Knowles (2015). The approximation is accurate to within 1% relative error for a wide range of alphas.

Derivation
Note that if `x ~ Gamma(alpha, beta)` then `x / beta ~ Gamma(alpha, 1)`. Since division is already implemented in PyTorch, we can thus reduce our problem to computing a reparameterized gradient of a standard gamma `x ~ Gamma(alpha) = Gamma(alpha, 1)` wrt `alpha`.

This PR implements a function
`standard_gamma_grad(x, alpha)` that directly approximates the reparameterized gradient, defined (for any continuous univariate distribution) as

    dx/dalpha = -(d/dalpha CDF(x; alpha)) / pdf(x; alpha)

This definition is used in the unit tests in
`tests/test_distributions.py`, which compute `d/dalpha cdf(x; alpha)` via finite difference of the `scipy.stats.gamma.cdf()` function.

The approximation is split into three regions:

- For `x < 0.001` we differentiate the power-law approximation from Knowles (2015); since `digamma()` is not implemented in PyTorch, we use a finite difference of `lgamma()`.
- For `alpha > 30` we use the approximation …
- `PQ(u, v)` is a rational function of order 2 in `u` and 3 in `v`. This was trained using least squares, minimizing squared relative error on ~20000 samples drawn from …

For the complete derivation, see this Jupyter Notebook.
x < 0.001we differentiate the power-law approximation from Knowles (2015)digamma()is implemented in PyTorch, we use a finite difference oflgamma().alpha > 30we use the approximationPQ(u,v)is a rational function of order 2 in u and 3 in v. This was trained using least squares minimizing squared relative error on ~20000 samples drawn fromFor complete derivation, see this Jupyter Notebook.