Implement StudentT ops #2233
Conversation
What is required to implement …
@fritzo That condition is only used for moment matching, where we need to match the covariance of two Student-t's. For example, in https://arxiv.org/abs/1703.02428 section 4.3 and https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8822764, the authors use KL for such an approximation, hence …
I think that we can leverage the tail behaviors of …
When …
Good idea!! I'll try to find the formula and match one of the moments of order < df.
This new paper https://arxiv.org/abs/1912.01607 computes general moments for Student's t-distribution. There are many choices for moment order, especially for multidimensional cases. I'll try to pick up a simple one from them. |
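For reference, one well-known univariate special case (the paper above covers the general multivariate setting): for a standard Student-t variable $T$ with $\nu$ degrees of freedom, the absolute moment of order $0 < k < \nu$ is

```latex
\mathbb{E}\,|T|^{k}
  = \nu^{k/2}\,
    \frac{\Gamma\!\left(\tfrac{k+1}{2}\right)\,\Gamma\!\left(\tfrac{\nu-k}{2}\right)}
         {\sqrt{\pi}\;\Gamma\!\left(\tfrac{\nu}{2}\right)},
  \qquad 0 < k < \nu ,
```

which recovers the familiar second moment $\nu/(\nu-2)$ at $k=2$.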
…ndition and marginalize
```diff
  identity = torch.eye(self.loc.size(-1), device=self.loc.device, dtype=self.loc.dtype)
- scale_inv = identity.triangular_solve(self.scale_tril, upper=False).solution.transpose(-1, -2)
- return torch.matmul(scale_inv.transpose(-1, -2), scale_inv)
+ return torch.cholesky_solve(identity, self.scale_tril)
```
This is much more stable, which helps many tests.
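For intuition, the two routes compute the same matrix algebraically: inverting the Cholesky factor by a triangular solve and multiplying, versus solving against `L @ L.T` in one step (what `torch.cholesky_solve` does). A minimal standalone sketch, using NumPy as a stand-in for the torch calls; all names here are illustrative:

```python
import numpy as np

# Hypothetical sketch: invert P_cov = L @ L.T given its lower-triangular
# Cholesky factor L, two ways.
L = np.array([[2.0, 0.0], [1.0, 3.0]])
identity = np.eye(2)

# Route 1: invert L by a solve, then form L^{-T} @ L^{-1}
# (analogous to the triangular_solve-based code above).
L_inv = np.linalg.solve(L, identity)
precision_1 = L_inv.T @ L_inv

# Route 2: solve (L @ L.T) X = I directly
# (analogous to torch.cholesky_solve(identity, L)).
precision_2 = np.linalg.solve(L @ L.T, identity)

assert np.allclose(precision_1, precision_2)
```

The single fused solve avoids forming the explicit inverse factor, which is where the extra numerical stability comes from.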
…ha beta parameterization; most tests are passed except .add method
@fritzo The PR is almost ready. All methods are tested, but I still need to add some more tests to verify that one-step-ahead prediction and measurement update agree with the intermediate approximations in Section 4.1 of https://arxiv.org/abs/1703.02428.
In MVT, the log_prob has the term log((x - m)ᵀ P (x - m) + ...). I have tried various parameterizations with df, info_vec/loc, and precision, but none of them is able to represent a StudentT with a rank-deficient precision matrix. The closest one I came up with is (df, info_vec, precision), but log_density and condition require the Cholesky factor of P, which is not available with a rank-deficient precision. It turns out that the parameterization (alpha, beta, info_vec, precision) in GammaGaussian solves the issue. With that parameterization, all methods except .__add__ are pretty straightforward and inherit well from GammaGaussian, so I think you can quickly go over them (modulo design recommendations).
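The reason the GammaGaussian parameterization can work here is the standard Gamma scale-mixture representation of the multivariate Student-t (sketched for reference, with $P$ the precision-like matrix):

```latex
\mathrm{St}\!\left(x;\,\nu,\,m,\,P^{-1}\right)
  = \int_0^\infty \mathcal{N}\!\left(x;\,m,\,(sP)^{-1}\right)\,
    \mathrm{Gamma}\!\left(s;\,\tfrac{\nu}{2},\,\tfrac{\nu}{2}\right)\,ds ,
```

so a joint factor over $(s, x)$ carrying Gamma-like terms $(\alpha, \beta)$ plus Gaussian natural parameters (info_vec, precision) never needs a Cholesky factor of $P$ just to be represented, even when $P$ is rank-deficient.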
The most difficult one is `.__add__`, which involves many tricky computations. There are 4 conditions to perform the add op:

- The 2 variables are uncorrelated. For Gaussians, uncorrelated implies independent, but this is not true for StudentT. We do an approximation here:

  St(x; df, m, P) St(y; df, n, Q) ≈ St([x, y]; df, [m, n], [[P, 0], [0, Q]])

  To keep track of the uncorrelated relationship, I introduced a "mask" property, which keeps track of non-degenerate variables. It turns out that this is useful for moment matching too, where we only need to match the moments of the non-degenerate terms!
- The 2 variables have the same degrees of freedom. If this condition does not hold, we choose a common dof (following the reference, I choose the min dof of the two distributions, to preserve the tail) and leverage moment matching to approximate a StudentT with a new one having the new dof. The math for moment matching is complicated, so I tried to sketch the idea in the comment section of the `_moment_matching` function.
- The degrees of freedom are greater than the (absolute central) moment-matching order. By default, I use `order=1`, so all dofs must be greater than 1.
- The mixing priors Gamma(concentration, rate) should be the same. By choosing the common df, we already have the same concentration, so we must make them have the same rate. Changing the rate requires us to change the scale of `info_vec` and `precision`, which would introduce additional non-trivial code. So in moment matching, I also force the output to have `concentration = rate`.

I know that the implementation is tricky to appreciate and might include some bugs due to my misunderstanding. Please let me know if you find anything unclear so we can discuss it further.
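To make the dof-changing step concrete, here is a hypothetical univariate sketch (pure Python, not the PR's actual `_moment_matching` code): it rescales a StudentT so that a chosen absolute moment is preserved when moving to a new dof, using the standard-t absolute-moment formula, which is valid only for `order < df`.

```python
import math

def std_t_abs_moment(df, order=1):
    """E|T|^k for a standard Student-t with df degrees of freedom (k < df)."""
    k = order
    return (df ** (k / 2)
            * math.gamma((k + 1) / 2) * math.gamma((df - k) / 2)
            / (math.sqrt(math.pi) * math.gamma(df / 2)))

def match_scale(scale, df_old, df_new, order=1):
    """Return a new scale so that E|X|^order is unchanged when the dof
    moves from df_old to df_new (illustrative helper, names hypothetical)."""
    ratio = std_t_abs_moment(df_old, order) / std_t_abs_moment(df_new, order)
    return scale * ratio ** (1 / order)

# Match the first absolute moment when lowering the dof from 5 to 3.
new_scale = match_scale(2.0, df_old=5.0, df_new=3.0, order=1)
assert abs(new_scale * std_t_abs_moment(3.0) - 2.0 * std_t_abs_moment(5.0)) < 1e-9
```

In the PR's terms this is the `order=1` case; the multivariate masked version is more involved, but the same "preserve a low-order moment under the min-dof target" idea applies.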
@fritzo Unfortunately, the implementation is only useful for sequential filtering. :( The reason is the …
```python
@pytest.mark.parametrize("batch_shape", [(), (4,), (3, 2)], ids=str)
@pytest.mark.parametrize("dim", [1, 2, 3])
def test_studentt_multiple_representation(batch_shape, dim):
```
@fritzo this test verifies multiple representations of a StudentT
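As a standalone illustration of why multiple representations exist (a hypothetical univariate sketch, not the PR's code): assuming a factor whose marginal in x is proportional to (beta + ½·precision·(x − loc)²)^−(alpha+1), scaling beta and precision jointly by any c > 0 leaves the normalized density, and hence the represented StudentT, unchanged.

```python
import numpy as np

def unnormalized_density(x, alpha, beta, loc, precision):
    # Marginal (in x) of a Gamma-Gaussian-style factor, up to a constant;
    # parameter names are illustrative, not the library's exact ones.
    return (beta + 0.5 * precision * (x - loc) ** 2) ** -(alpha + 1.0)

x = np.linspace(-10.0, 10.0, 2001)
d1 = unnormalized_density(x, alpha=1.5, beta=2.0, loc=0.3, precision=4.0)
d2 = unnormalized_density(x, alpha=1.5, beta=6.0, loc=0.3, precision=12.0)  # c = 3

# After normalizing on the grid, the two parameterizations agree.
assert np.allclose(d1 / d1.sum(), d2 / d2.sum())
```

This redundancy is exactly what a multiple-representation test has to pin down: distinct parameter tuples must evaluate to the same log_prob.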
Well, the main purpose of … What do you think about an alternative approach to …
re: not worth implementing: Agreed! I'll move some cleanups into a separate PR and leave the implementation here in case it is useful for benchmarking in the future. re: alternative approach to StudentTHMM inference: It is reasonable and should be easier to obtain than StableHMM. I think that it is also more useful than GammaGaussian, whose posterior over the last state is more Gaussian-like for long time series with high-dimensional outputs. I'll follow this approach instead.
Great! I think the code should be straightforward. I see the biggest open research question being: …
I get your point. I'll think more about that question... |
The reference for this pull request is interesting. So is there a way to fit a Student-t smoother with Pyro? I am coming from the PyFlux library where, for example, I can fit a dynamic linear model with Student's t noise. PyFlux is no longer actively developed. I would like to switch to Pyro to fit these types of models if possible.
Hi @matthewfieger, one thing we are working on is a … EDIT: here is the alternative line of work: #2254
The implementation will assume that `df > 2` throughout. Right now, nothing is implemented yet. I'll add more details below. The math mostly corresponds to the GammaGaussian ops.
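The `df > 2` assumption is what guarantees a finite covariance: for a multivariate Student-t $X \sim \mathrm{St}(\nu, m, \Sigma)$,

```latex
\operatorname{Cov}(X) \;=\; \frac{\nu}{\nu - 2}\,\Sigma , \qquad \nu > 2 ,
```

so any covariance-based moment matching is only well-defined above 2 degrees of freedom.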
Reference
Tasks