
Fixed the gradient of RBF kernel in gaussian_process #11116

Closed

cheng-w-liu wants to merge 3 commits into scikit-learn:master from cheng-w-liu:dev-fix-issue-11113

Conversation

@cheng-w-liu

@cheng-w-liu cheng-w-liu commented May 22, 2018

Corrected gradient of the RBF kernel

Reference Issues/PRs

Fixes #11113

What does this implement/fix? Explain your changes.

Use the correct formula for the gradient of the RBF kernel:
gradient = k(x_i, x_j) * ( ||x_i - x_j||^2 / length_scale^3)
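As a quick sanity check of this formula (a standalone NumPy sketch, not scikit-learn code; the helper name `rbf` is hypothetical), the analytical gradient can be compared against a central finite difference with respect to `length_scale` itself:

```python
import numpy as np

def rbf(x, y, length_scale):
    """Isotropic RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * length_scale^2))."""
    d2 = np.sum((x - y) ** 2)
    return np.exp(-0.5 * d2 / length_scale ** 2)

x = np.array([1.0, 2.0])
y = np.array([0.5, 1.0])
ls = 1.3

# Analytical gradient w.r.t. length_scale, using the formula above.
d2 = np.sum((x - y) ** 2)
grad = rbf(x, y, ls) * d2 / ls ** 3

# Central finite difference w.r.t. length_scale.
eps = 1e-6
fd = (rbf(x, y, ls + eps) - rbf(x, y, ls - eps)) / (2 * eps)

print(abs(grad - fd))  # should be close to 0
```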

Corrected gradient of the RBF kernel
@amueller
Member

Thanks for the fix. Please add a test for the gradient, potentially using the scipy gradient checker. Also, the tests are currently failing.

Please rename the PR to have a more descriptive title. Also, if you say "Closes #11113" it will automatically close the related issue when merged.

@cheng-w-liu changed the title from "Fixed Issue 11113" to "Fixed the gradient of RBF kernel in gaussian_process" on May 22, 2018
@cheng-w-liu
Author

Thanks for the suggestion @amueller, I will add test cases.
Any insight into why the CI builds failed?

@albertcthomas
Contributor

You can look at the full log of the CI tests by clicking on Details. For instance Travis is failing because of a flake8 error: the line you fixed is now too long. You can run flake8 locally to find these kinds of errors.

- Also fixed the gradient for the anisotropic case
@cheng-w-liu
Author

Thanks for the input @albertcthomas
Indeed, the line was 2 characters longer than the standard 79-character limit.

Contributor

@albertcthomas albertcthomas left a comment


Some tests of the gaussian process module are now failing. You can run all the tests locally on your machine by running make tests. If you want to run just the tests of the gaussian process module, you can use pytest.

    -        K_gradient = \
    -            (K * squareform(dists))[:, :, np.newaxis]
    +        K_gradient = (K * squareform(dists) / self.length_scale)
    +        K_gradient = K_gradient[:, :, np.newaxis]
Contributor


I think you can also use an ellipsis here:
[..., np.newaxis]
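For illustration (plain NumPy, separate from the PR diff itself), `...` stands for all leading axes, so both spellings produce the same result on a 2-D array:

```python
import numpy as np

a = np.arange(12.0).reshape(3, 4)  # a 2-D array, like a kernel matrix

b = a[:, :, np.newaxis]   # explicit: add a trailing axis
c = a[..., np.newaxis]    # ellipsis: same thing, independent of ndim

print(b.shape, c.shape)        # (3, 4, 1) (3, 4, 1)
print(np.array_equal(b, c))    # True
```

The ellipsis form is slightly more robust because it keeps working if the array's number of leading dimensions changes.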

@cheng-w-liu
Author

The tests related to Gaussian Process failed because the calculation of the kernels' gradients uses a utility function, _approx_fprime, that I suspect is not completely correct.

Since the scope of this PR is the RBF kernel, does it make sense to only modify the test cases related to RBF kernels to use scipy.optimize.check_grad?

@albertcthomas
Contributor

As @amueller said above, I think you can replace _approx_fprime by the scipy gradient checker in the tests.
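For reference, the scipy gradient checker being referred to is `scipy.optimize.check_grad`; a minimal usage sketch on a toy function (not the kernel code):

```python
import numpy as np
from scipy.optimize import check_grad

def f(x):
    """Toy objective: f(x) = sum(x^2)."""
    return np.sum(x ** 2)

def grad_f(x):
    """Analytical gradient: 2x."""
    return 2 * x

x0 = np.array([1.0, -2.0, 3.0])

# check_grad returns the 2-norm of (analytical - numerical) gradient;
# a small value indicates the analytical gradient is consistent.
err = check_grad(f, grad_f, x0)
print(err)  # close to 0
```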

@albertcthomas
Contributor

And you will be able to check if this makes the tests pass.

@cheng-w-liu
Author

The tests failed because some of the test cases use compound kernels, and the calculation of the gradient of a compound kernel is also incorrect. I opened another issue here: #11140

@jmetzen
Member

jmetzen commented May 29, 2018

Could you check if this is affected by #8393?

@cheng-w-liu
Author

@jmetzen
#8391 and its fix don't seem to affect the calculation of gradients. It seemed a bit odd to me that the gradient tests are failing after correcting the analytical gradient for RBF kernels.

If you get a chance to take a look, do the modifications in this PR make sense to you?
Thanks for your feedback!

@cheng-w-liu
Author

The reason test_kernel_gradient() didn't pass is that the analytical gradient and the numerical approximation do not match for a few kernels.

The utility function _approx_fprime used to approximate the gradients is almost a copy of scipy's _approx_fprime_helper, which is used in scipy.optimize.check_grad. However, the kernels' hyperparameters theta (which correspond to the xk argument of _approx_fprime) are in log-scale, while the denominator in the numerical gradient approximation isn't scaled accordingly, so the resulting numerical approximations of the gradients are incorrect.
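The point can be illustrated with a standalone 1-D sketch (toy values, hypothetical variable names): differencing the kernel with respect to theta = log(length_scale) reproduces a lambda^2 denominator, not lambda^3:

```python
import numpy as np

def k(d2, length_scale):
    """Isotropic RBF kernel value for a fixed squared distance d2."""
    return np.exp(-0.5 * d2 / length_scale ** 2)

d2, ls = 1.25, 1.3
theta = np.log(ls)
eps = 1e-6

# Finite difference in theta = log(length_scale): perturb theta, exponentiate back.
fd_theta = (k(d2, np.exp(theta + eps)) - k(d2, np.exp(theta - eps))) / (2 * eps)

# Chain rule: dk/dtheta = (dk/dls) * ls = k * d2 / ls^2  (lambda^2 denominator).
grad_theta = k(d2, ls) * d2 / ls ** 2

# Gradient w.r.t. length_scale itself has the lambda^3 denominator instead.
grad_ls = k(d2, ls) * d2 / ls ** 3

print(abs(fd_theta - grad_theta))  # ~0: the theta-difference matches lambda^2
print(abs(fd_theta - grad_ls))     # clearly nonzero
```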

@webdrone

Hey, nice catch on the bug, and thank you for the fix. I used to hand the marginal likelihood to a Bayesian optimiser because I thought gradient descent got stuck in a bad local optimum. Do you know if this will be merged soon? It is a pretty annoying thing to have to deal with.

@jnothman jnothman added the Bug label Aug 20, 2018
@jnothman
Member

I think this got lost along the way, but we should indeed prioritise fixing it. The PR is not yet complete, however, and yes, it would be good to have feedback from someone who knows optimisation well...

@cheng-w-liu
Author

cheng-w-liu commented Aug 21, 2018

Hey! As far as fixing the gradients is concerned, I think the changes made have served the original purpose of this PR.

It's a bit awkward that this PR can't be merged because the tests didn't pass. As explained above, the tests themselves are incorrect: the kernel parameters are in log-scale, but the numerical derivatives do not scale accordingly. If fixing the tests is required in order to merge the PR, that's a bit beyond my bandwidth at this point. If anyone is interested, please feel free to fork and continue.

@jnothman
Member

jnothman commented Aug 21, 2018 via email

@jnothman
Member

Marking this as help wanted / stalled.

@webdrone

webdrone commented Aug 21, 2018

So I had some time today and decided to thoroughly go through the derivation of the gradient. I don't think there is a problem with the original gradient calculation, at least for the squared exponential.

Consider that the optimisation is over the space of $θ$, where $θ_i = \ln(λ_i)$, and
\[
k(x, y) = \exp\left[-\tfrac{1}{2} (x - y)^\top M (x - y)\right],
\]
where $M = \mathrm{diag}(λ)^{-2}$. This is the right parameterisation since the kernel uses the hyperparameter 'length_scale' ($λ$) in this manner (it divides all input by $λ$ to compute distances, making $M_{ii} = 1/λ_i^2$).

Then
\[
\frac{\partial k(x, y)}{\partial λ_i} = \frac{(x_i - y_i)^2}{λ_i^3} \, k(x, y),
\]
which gives every element of $\frac{\partial K}{\partial λ_i}$. Equivalently, with element-wise multiplication $*$:
\[
\frac{\partial K}{\partial λ_i} = \frac{1}{λ_i^3} \, \mathrm{cdist}(X[:, i], X[:, i], \text{'sqeuclidean'}) * K.
\]

Differentiating the marginal log-likelihood with respect to $θ$ yields eq. (5.9) of R&W Chapter 5 (http://www.gaussianprocess.org/gpml/chapters/RW5.pdf). Then, with the chain rule:
\begin{align}
\frac{\partial \ell\ell}{\partial θ_i} &= \frac{1}{2} \mathrm{tr}\left((αα^\top - K^{-1}) \frac{\partial K}{\partial θ_i}\right) \\
&= \frac{1}{2} \mathrm{tr}\left((αα^\top - K^{-1}) \frac{\partial K}{\partial λ_i} \frac{\partial λ_i}{\partial θ_i}\right) \\
&= \frac{1}{2} \mathrm{tr}\left((αα^\top - K^{-1}) \frac{\partial K}{\partial λ_i} \, λ_i\right) \\
&= \frac{1}{2} \mathrm{tr}\left((αα^\top - K^{-1}) \frac{1}{λ_i^2} \, \mathrm{cdist}(X[:, i], X[:, i], \text{'sqeuclidean'}) * K\right).
\end{align}

Notice how the division is by $λ_i^2$, not $λ_i^3$, because of the extra $λ_i$ factor from the chain rule, since we optimise with respect to $θ = \ln λ$. This is the original operation being done before the change introduced in this PR.
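The final line can be checked numerically with a self-contained NumPy sketch (1-D inputs and a single length-scale `lam`, standing in for the $λ$ above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 1))   # 1-D inputs, single length-scale lam
lam = 1.7
theta = np.log(lam)

def kernel(X, lam):
    d2 = (X - X.T) ** 2       # pairwise squared distances, shape (5, 5)
    return np.exp(-0.5 * d2 / lam ** 2)

# Analytical dK/dtheta = (1/lam^2) * sqdist * K, as derived above.
d2 = (X - X.T) ** 2
K = kernel(X, lam)
dK_dtheta = d2 * K / lam ** 2

# Central finite difference in theta = log(lam).
eps = 1e-6
fd = (kernel(X, np.exp(theta + eps)) - kernel(X, np.exp(theta - eps))) / (2 * eps)

print(np.max(np.abs(dK_dtheta - fd)))  # ~0
```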

I guess my problem really was too hard for the optimiser after all. I would be very happy if you spot an error in the above.

PS: the lack of LaTeX rendering is very annoying… if anyone knows how to render maths in GitHub's markdown, please let me know.

@nw2190

nw2190 commented Feb 11, 2019

@webdrone
Thanks for providing the full derivation; it certainly helps clarify where the current RBF implementation is coming from.

But I believe there still needs to be consistency between the optimizer and the RBF kernel implementation. In particular, if the kernel returns gradients with respect to the log-hyperparameter theta = log(length_scale), then the optimizer should be explicitly told to minimize w.r.t. theta (and passed the corresponding log bounds).

I think this would also make sense from a numerical stability standpoint, but it would require modification of various other components of the code.

One possible solution would be to add a property such as "log_scale" to the Hyperparameter class definition (or define it as a new "value_type"?). Then the "_constrained_optimization" implementation in the definition of "GaussianProcessRegressor" could be modified to deal with log-hyperparameters in a manner consistent with the kernel gradient calculations.

Update:
I ended up re-writing the GP regression code from scratch, using explicit log-hyperparameter optimization, to compare the results with the current scikit-learn implementation. I also used a re-implementation of Rasmussen's minimize code for the comparison, to avoid code overlap.

From what I can tell, the current implementation is indeed correct, or at least it finds the same optimal hyperparameter values obtained by explicitly minimizing w.r.t. the log-hyperparameters.

It may be helpful to add a comment in the code to explain more clearly how the current implementation works, though.

Update 2:
Sorry, I forgot to mention that I also included the WhiteKernel in the comparison; it appears to follow the same log-hyperparameter gradient calculation by including the multiplicative factor "noise_level" in line 1107. That is, the derivative of the diagonal matrix would simply be the identity, but here it is scaled by "noise_level", which agrees with optimization w.r.t. the log parameter "s = log(noise_level)" and a chain-rule multiplicative factor "d/ds[exp(s)] = noise_level".
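The same chain-rule factor can be illustrated for a white-noise kernel K = noise_level · I with s = log(noise_level) (a standalone sketch, not the library code):

```python
import numpy as np

noise_level = 0.25
s = np.log(noise_level)
n = 4

def white_kernel(noise_level, n):
    """White-noise kernel: a scaled identity matrix."""
    return noise_level * np.eye(n)

# d/ds [exp(s) * I] = exp(s) * I = noise_level * I: the gradient w.r.t. the
# log parameter is the identity scaled by noise_level, as noted above.
dK_ds = noise_level * np.eye(n)

# Central finite difference in s = log(noise_level).
eps = 1e-7
fd = (white_kernel(np.exp(s + eps), n) - white_kernel(np.exp(s - eps), n)) / (2 * eps)

print(np.max(np.abs(dK_ds - fd)))  # ~0
```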

@nw2190

nw2190 commented Feb 13, 2019

I have just done a few more experiments, and it appears that the accuracy of the current implementation is simply a testament to the effectiveness of the L-BFGS optimization algorithm.

It appears that you can replace lines 1233-1234 with just about any strictly-positive function of length_scale, e.g.:

    g = lambda x: np.cos(x) ** 2
    K_gradient = (g(length_scale) * K * squareform(dists))[:, :, np.newaxis]

and still achieve the same optimal hyperparameter values that the current implementation finds for most examples I have tested.

So the accuracy of the implementation on a few simple experiments does not actually seem to justify the lack of consistency between the gradient calculations and optimization calls (i.e. optimizing w.r.t. log-parameters but passing the unmodified values).

I can fix this consistency issue, but when I run pytest --cov sklearn/gaussian_process/tests/ on an unmodified fork of the repository, a significant number of tests fail.

Would a proposed code change have any chance of being accepted if these tests are all failing? Also, are there any suggestions/recommendations on how to format a log-scale implementation such as the one mentioned in my previous comment?

@jnothman
Member

jnothman commented Feb 13, 2019 via email

@nw2190

nw2190 commented Feb 13, 2019

Ok, sorry for all of these posts, but I believe the current implementation is actually using log-scaled (or "log-transformed") parameters after all...

I eventually found a hint buried in the definition of the Sum kernel in Line 228 of the kernels.py file:

    Note that theta are typically the log-transformed values of the
    kernel's hyperparameters as this representation of the search space
    is more amenable for hyperparameter search, as hyperparameters like 
    length-scales naturally live on a log-scale.    

And indeed it appears that the log-scale assumption is enforced in Line 244 within the Kernel class definition. Also these parameters are transformed back with np.exp() in the setter method at Line 264.

So I think the current implementation has already incorporated all of the suggestions in my previous comments. We should probably add a clearer explanation of this log-scale assumption in the comments, though, since anyone interested in implementing custom kernels inheriting from the base class Kernel will need to be aware of it.

Update: Now that I know what to search for, the phrase log-transformed appears quite frequently in the comments of the KernelOperator and CompoundKernel definitions; but I think it should be mentioned in a more visible section of the code, since this is not necessarily an "obvious" assumption to make about the implementation of covariance kernels.

Update 2: I realize now that this is also mentioned in the documentation after looking at it more closely:

Note that both properties (theta and bounds) return log-transformed values of the internally used values since those are typically more amenable to gradient-based optimization.

which can be found in section 1.7.5.1. Gaussian Process Kernel API.
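A minimal sketch of the convention being described (a hypothetical toy class, not the actual Kernel base class): the theta property exposes log-transformed values, and the setter exponentiates back to the natural scale:

```python
import numpy as np

class ToyRBF:
    """Toy kernel holding a positive length_scale, exposing log-scale theta."""

    def __init__(self, length_scale=1.0):
        self.length_scale = length_scale

    @property
    def theta(self):
        # Log-transformed hyperparameters, following the convention quoted above.
        return np.log(np.atleast_1d(self.length_scale))

    @theta.setter
    def theta(self, value):
        # Transform back with exp, as the real setter does.
        self.length_scale = float(np.exp(value)[0])

k = ToyRBF(length_scale=2.0)
print(k.theta)         # log(2) ~ 0.693
k.theta = np.array([0.0])
print(k.length_scale)  # 1.0
```

Any gradient a kernel like this reports must then be with respect to theta, which is exactly where the chain-rule factor discussed in this thread comes from.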

@jnothman
Member

Sorry about that.

What are the next steps here? Tests appear to be failing.

@nw2190

nw2190 commented Mar 18, 2019

@jnothman I believe the implementation of the kernel gradients is correct and consistent with the optimizer's use of log-scale hyperparameters. So I think this PR can be closed in the sense that the RBF implementation was in fact correct as originally implemented.

As mentioned in some of the previous comments, the _approx_fprime function may be causing the tests to fail. But I believe the approximation is correct as long as the arguments xk are assumed to be log-scale hyperparameters (which appears to be the case since they are typically passed self.theta).

I am new to using pytest, but from what I can tell the tests seem to pass on the current master branch (with Python 3.7 on Arch Linux). There are only two warnings, both associated with test_kernels.py:

            if self.nu == 0.5:
                K_gradient = K[..., np.newaxis] * D \
>                   / np.sqrt(D.sum(2))[:, :, np.newaxis]
E               RuntimeWarning: invalid value encountered in true_divide 
kernels.py:1372: RuntimeWarning

Both warnings stem from line 54 of test_kernels.py during the test_kernel_gradient call.

In particular there appears to be an issue with the gradient calculation for the second kernel in the Product kernel operator on line 760.

The actual warning occurs at line 1372 of kernels.py. This is likely caused by the fact that the lines:

    if self.nu == 0.5:
        K_gradient = K[..., np.newaxis] * D \
            / np.sqrt(D.sum(2))[:, :, np.newaxis]
        K_gradient[~np.isfinite(K_gradient)] = 0

can involve dividing by D = squareform(dists**2)[:, :, np.newaxis] which has zeros along the diagonal (which remain zero after summing across the trivially expanded axis). This appears to be handled by the isfinite check following the division, so I do not think this is a very serious issue.
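The mechanism can be reproduced in isolation (toy arrays, assuming only NumPy): a distance matrix with zeros on the diagonal produces NaNs under division, which the isfinite guard then masks out:

```python
import numpy as np

# A distance-like matrix with zeros on the diagonal, as squareform(dists) yields,
# with a trailing axis added as in the Matern nu=0.5 branch.
D = np.array([[0.0, 1.0],
              [1.0, 0.0]])[:, :, np.newaxis]
K = np.ones((2, 2))

with np.errstate(invalid="ignore"):
    # 0/0 on the diagonal produces NaN (a RuntimeWarning in a real run).
    G = K[..., np.newaxis] * D / np.sqrt(D.sum(2))[:, :, np.newaxis]

# The guard used in kernels.py zeroes out the non-finite entries.
G[~np.isfinite(G)] = 0.0
print(G[0, 0, 0], G[0, 1, 0])  # 0.0 on the diagonal, finite off-diagonal
```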

From what I can tell, the current implementation appears to be working properly.

@jnothman
Member

jnothman commented Mar 19, 2019 via email

@AbhilashMathews

> So I had some time today and decided to thoroughly go through the derivation of the gradient. […full derivation quoted from @webdrone's comment above…]

Thank you for clarifying that. It's an important distinction that it is the log marginal likelihood being optimized over the log-transformed hyperparameters (i.e. both are log-transformed, and this should be underlined).

@jnothman
Member

jnothman commented Mar 26, 2019 via email

@nw2190

nw2190 commented Mar 26, 2019

Please feel free to offer a pull request improving the documentation

I have a few deadlines coming up right now, but I can take a look at improving the documentation afterwards. I will try to submit a pull request with more details regarding the gradient calculations and the internal hyperparameters sometime next week if it hasn't already been revised by then.

@amueller
Member

amueller commented Aug 7, 2019

@nw2190 still interested in working on this?


Development

Successfully merging this pull request may close these issues.

For the RBF Kernel in gaussian_process, the calculation of the gradient seems incorrect?

8 participants