Fixed the gradient of RBF kernel in gaussian_process #11116
cheng-w-liu wants to merge 3 commits into scikit-learn:master
Conversation
Corrected gradient of the RBF kernel
thanks for the fix. Please add a test for the gradient, potentially using the scipy gradient checker. Also, the tests are currently failing. Please rename the issue to have a more semantic name. Finally, if you say "Closes #11113" it will automatically close the related issue when merged.
Thanks for the suggestion @amueller, will add test cases.
You can look at the full log of the CI tests by clicking on Details. For instance, Travis is failing because of a flake8 error: the line you fixed is now too long. You can run flake8 locally to find these kinds of errors.
- Also fixed the gradient for the anisotropic case
Thanks for the input @albertcthomas
albertcthomas left a comment:
Some tests of the gaussian process module are now failing. You can run all the tests locally on your machine by running `make tests`. If you want to run only the tests of the gaussian process module, you can use pytest.
```diff
-                K_gradient = \
-                    (K * squareform(dists))[:, :, np.newaxis]
+                K_gradient = \
+                    (K * squareform(dists) / self.length_scale)
+                K_gradient = K_gradient[:, :, np.newaxis]
```
I think you can also use an ellipsis here: `[..., np.newaxis]`
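For illustration (this snippet is not part of the PR), the two indexing spellings are interchangeable when the array is 2-D, since the ellipsis expands to all leading axes:

```python
import numpy as np

# A small 2-D array standing in for the kernel matrix K.
K = np.arange(6.0).reshape(2, 3)

a = K[:, :, np.newaxis]  # explicit leading axes, shape (2, 3, 1)
b = K[..., np.newaxis]   # ellipsis form suggested in the review

assert a.shape == (2, 3, 1)
assert np.array_equal(a, b)
```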
The tests related to Gaussian Process failed because the calculations of the kernels' gradients use a utility function to approximate the gradients numerically. Since the scope of this PR concerns the RBF kernel, does it make sense to modify only the test cases related to the RBF kernel?
And you will be able to check if this makes the tests pass.
The tests failed because some of the test cases use compound kernels, and the calculation of the gradient of a compound kernel is also incorrect; I opened another issue here: #11140
could you check if this is affected by #8393?
@jmetzen If you get a chance to take a look, do the modifications in this PR make sense to you?
The reason test_kernel_gradient() didn't pass is that the analytical gradient and the numerical approximation do not match for a few kernels. The utility function _approx_fprime used to approximate the gradients is essentially a copy of scipy's _approx_fprime_helper, which is used in scipy.optimize.check_grad. However, because the kernels' hyper-parameters are stored in log-scale, the finite-difference approximation is taken with respect to a different variable than the analytical gradient, so the two do not agree.
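To make the mismatch concrete, here is a minimal sketch (names and values are illustrative, not scikit-learn internals) comparing finite differences taken with respect to the raw length scale versus the log length scale for an isotropic RBF; the two differ by exactly a factor of the length scale:

```python
import numpy as np

d2 = 1.5   # a fixed squared distance ||x_i - x_j||^2
ell = 0.7  # the length scale

def rbf(length_scale, d2=d2):
    """Isotropic RBF value at squared distance d2."""
    return np.exp(-0.5 * d2 / length_scale**2)

eps = 1e-6

# Central finite difference w.r.t. the raw length_scale...
num_raw = (rbf(ell + eps) - rbf(ell - eps)) / (2 * eps)

# ...versus a finite difference w.r.t. theta = log(length_scale),
# which is the variable the kernel's analytical gradient uses.
num_log = (rbf(np.exp(np.log(ell) + eps))
           - rbf(np.exp(np.log(ell) - eps))) / (2 * eps)

analytic_raw = rbf(ell) * d2 / ell**3  # dk/d(length_scale)
analytic_log = rbf(ell) * d2 / ell**2  # dk/d(log length_scale), chain rule

assert np.isclose(num_raw, analytic_raw, rtol=1e-4)
assert np.isclose(num_log, analytic_log, rtol=1e-4)
assert np.isclose(num_log / num_raw, ell, rtol=1e-4)
```

A gradient checker that perturbs the raw parameter will therefore disagree with an analytical gradient expressed in log-scale, even when both are individually correct.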
Hey, nice catch on the bug and thank you for the fix — I used to give the marginal likelihood to a Bayesian optimiser because I thought gradient descent got stuck in a bad local optimum. Do you know if this will be merged soon? It is a pretty annoying thing to have to deal with.
I think this got lost along the way, but we should indeed prioritise fixing it. The PR is not yet complete, however, and yes, it would be good to have feedback from someone who knows optimisation well...
Hey! As far as fixing the gradients is concerned, I think the changes made have served the original purpose of this PR. It's a bit awkward that this PR can't be merged because the tests didn't pass. The tests are actually incorrect, as explained above: the kernel parameters are in log-scale, but the numerical derivatives do not scale accordingly. If it's required to fix the tests in order to merge the PR, that's a bit beyond my bandwidth at this point. If anyone is interested, please feel free to fork and continue.
Yes, the tests should be fixed to merge this...
Marking this as help wanted / stalled. |
|
So I had some time today and decided to thoroughly go through the derivation of the gradient. I don't think there is a problem with the original gradient calculation, at least for the squared exponential. Consider that the optimisation is over the space of Then Differentiating the marginal log-likelihood with respect to Notice how the division is by I guess my problem really was too hard for the optimiser after all. I would be very happy if you spot an error in the above. PS: the lack of latex rendering is very annoying… if anyone knows how to render maths in github's MD, please let me know. |
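As a numerical sanity check of this derivation, a small numpy/scipy sketch (illustrative only, mirroring but not quoting the scikit-learn source) confirming that the log-scale gradient is the raw-scale gradient multiplied by the length scale:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))  # 5 points in 3 dimensions
ell = 1.3                        # isotropic length scale

# Raw squared distances and the RBF kernel matrix.
d2 = squareform(pdist(X, metric="sqeuclidean"))
K = np.exp(-0.5 * d2 / ell**2)

# Gradient w.r.t. theta = log(ell): k * d^2 / ell^2.
K_grad_theta = K * (d2 / ell**2)

# Gradient w.r.t. the raw ell: k * d^2 / ell^3.
K_grad_raw = K * d2 / ell**3

# Chain rule: dk/dtheta = ell * dk/d(ell).
assert np.allclose(K_grad_theta, ell * K_grad_raw)
```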
|
@webdrone But I believe there still needs to be consistency between the optimizer and RBF kernel implementations. In particular, if the kernel returns gradients with respect to the log-hyperparameter "theta = log(length_scale)", then the optimizer should be explicitly told to minimize w.r.t. "theta" (and passed the corresponding log bounds). I think this would also make sense from a numerical stability standpoint, but it would require modification of various other components of the code. One possible solution would be to add a property such as "log_scale" to the Hyperparameter class definition (or define it as a new "value_type"?). Then the "_constrained_optimization" implementation in the definition of "GaussianProcessRegressor" could be modified to deal with log-hyperparameters in a manner consistent with the kernel gradient calculations. Update: From what I can tell, the current implementation is indeed correct, or at least determines the same optimal hyperparameter values obtained from explicitly minimizing w.r.t. the log-hyperparameters. It may be helpful to add a comment in the code to explain how the current implementation works more clearly though. Update 2: |
|
I have just did a few more experiments and it appears that the accuracy of the current implementation is simply a testament to the effectiveness of the L-BFGS optimization algorithm. It appears that you can replace with lines 1233-1234 with just about any strictly-positive function of and still achieve the same optimal hyperparameter values that the current implementation finds for most examples I have tested. So the accuracy of the implementation on a few simple experiments does not actually seem to justify the lack of consistency between the gradient calculations and optimization calls (i.e. optimizing w.r.t. log-parameters but passing the unmodified values). I can fix this consistency issue, but when I run Would a proposed code change have any chance of being accepted if these tests are all failing? Also, are there any suggestions/recommendations on how to format a log-scale implementation such as the one mentioned in my previous comment? |
|
Thanks for the investigation! Very often it is easiest to work with an attempted implementation, proposed as a pull request, even if it is not ready to merge (e.g. tests failing at first). If you want to propose an alternative to this pull request, perhaps incorporating the log_scale proposal, please do.
Ok, sorry for all of these posts, but I believe the current implementation is actually using log-scaled (or "log-transformed") parameters after all... I eventually found a hint buried in the definition of the kernel base class. And indeed it appears that the log-scale assumption is enforced at line 244 within that module.

So I think the current implementation has already incorporated all of the suggestions in my previous comments. We should probably add a clearer explanation of this log-scale assumption in the comments, though, since anyone interested in implementing custom kernels inherited from the base class `Kernel` should be aware of it.

Update: Now that I know what to search for, the phrase "log-transformed" does turn up in a few places.

Update 2: I realize now that this is also mentioned in the documentation after looking at it more closely, in section 1.7.5.1, Gaussian Process Kernel API.
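The log-transform convention can be checked directly through scikit-learn's public kernel API (a quick illustrative check; `Kernel.theta` is documented as returning the flattened, log-transformed hyperparameters):

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF

k = RBF(length_scale=2.0)

# theta holds log(length_scale), not length_scale itself.
assert np.isclose(k.theta[0], np.log(2.0))

# Round trip: assigning theta sets length_scale = exp(theta).
k.theta = np.array([0.0])
assert np.isclose(k.length_scale, 1.0)
```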
Sorry about that. What are the next steps here? Tests appear to be failing.
@jnothman I believe the implementation of the kernel gradients is correct and consistent with the optimizer's use of log-scale hyperparameters. So I think this PR can be closed, in the sense that the RBF implementation was in fact correct as originally implemented.

As mentioned in some of the previous comments, the _approx_fprime function may be causing the tests to fail, but I believe the approximation is correct as long as the arguments it receives are the log-transformed hyperparameters.

I am new to the test suite, but both warnings stem from line 54 of the gaussian process tests. In particular there appears to be an issue with the gradient calculation for the second kernel in the list of test kernels. The actual warning occurs at line 1372, where the calculation can involve dividing by zero.

From what I can tell, the current implementation appears to be working properly.
Could you offer improvements to the code or its documentation to make users more confident in its gradients?
Thank you for clarifying that. It's an important distinction that it is the log marginal likelihood being optimized over the log-transformed hyperparameters (i.e. both are log-transformed, and this should be underlined).
Please feel free to offer a pull request improving the documentation.
I have a few deadlines coming up right now, but I can take a look at improving the documentation afterwards. I will try to submit a pull request with more details regarding the gradient calculations and the internal hyperparameters sometime next week if it hasn't already been revised by then.
@nw2190 still interested in working on this?
Corrected gradient of the RBF kernel
Reference Issues/PRs
Fixes #11113
What does this implement/fix? Explain your changes.
Use the correct formula for the gradient of the RBF kernel:
`gradient = k(x_i, x_j) * ||x_i - x_j||^2 / length_scale^3`
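The formula above is the gradient with respect to the raw length scale; as the conversation concluded, the kernel's reported gradient is taken with respect to theta_d = log(length_scale_d), where the chain rule cancels one power in the denominator. A hedged sketch of the anisotropic case also touched by this PR (variable names are illustrative, not scikit-learn source):

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.standard_normal(3), rng.standard_normal(3)
ell = np.array([0.5, 1.0, 2.0])  # one length scale per dimension

def k(ell):
    """Anisotropic RBF kernel value between x and y."""
    return np.exp(-0.5 * np.sum((x - y) ** 2 / ell**2))

# Analytic gradient w.r.t. theta_d = log(ell_d): k * (x - y)_d^2 / ell_d^2.
grad = k(ell) * (x - y) ** 2 / ell**2

# Finite-difference check in log-space, one coordinate at a time.
eps = 1e-6
for d in range(3):
    tp, tm = np.log(ell), np.log(ell)
    tp, tm = tp.copy(), tm.copy()
    tp[d] += eps
    tm[d] -= eps
    num = (k(np.exp(tp)) - k(np.exp(tm))) / (2 * eps)
    assert np.isclose(num, grad[d], rtol=1e-4)
```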