
Fixed the gradient of RBF kernel in gaussian_process #11116

Closed

cheng-w-liu wants to merge 3 commits into scikit-learn:master from cheng-w-liu:dev-fix-issue-11113

Conversation

@cheng-w-liu

@cheng-w-liu cheng-w-liu commented May 22, 2018

Corrected gradient of the RBF kernel

Reference Issues/PRs

Fixes #11113

What does this implement/fix? Explain your changes.

Use the correct formula for the gradient of the RBF kernel:
gradient = k(x_i, x_j) * ( ||x_i - x_j||^2 / length_scale^3)
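As a quick sanity check of this formula (a standalone NumPy sketch, not scikit-learn code; the helper name `rbf` is hypothetical), the analytical gradient can be compared against a central finite difference with respect to `length_scale` itself:

```python
import numpy as np

def rbf(x, y, length_scale):
    """Isotropic RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * length_scale^2))."""
    d2 = np.sum((x - y) ** 2)
    return np.exp(-0.5 * d2 / length_scale ** 2)

x = np.array([1.0, 2.0])
y = np.array([0.5, 1.0])
ls = 1.3

# Analytical gradient w.r.t. length_scale, using the formula above.
d2 = np.sum((x - y) ** 2)
grad = rbf(x, y, ls) * d2 / ls ** 3

# Central finite difference w.r.t. length_scale.
eps = 1e-6
fd = (rbf(x, y, ls + eps) - rbf(x, y, ls - eps)) / (2 * eps)

print(abs(grad - fd))  # should be close to 0
```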

Corrected gradient of the RBF kernel
@amueller
Member

Thanks for the fix. Please add a test for the gradient, potentially using the scipy gradient checker. Also, the tests are currently failing.

Please rename the PR to have a more descriptive title. Also, if you say "Closes #11113" it will automatically close the related issue when merged.

@cheng-w-liu changed the title from "Fixed Issue 11113" to "Fixed the gradient of RBF kernel in gaussian_process" on May 22, 2018
@cheng-w-liu
Author

Thanks for the suggestion @amueller, I will add test cases.
Any insight into why the CI builds failed?

@albertcthomas
Contributor

You can look at the full log of the CI tests by clicking on Details. For instance Travis is failing because of a flake8 error: the line you fixed is now too long. You can run flake8 locally to find these kinds of errors.

- Also fixed the gradient for the anisotropic case
@cheng-w-liu
Author

Thanks for the input @albertcthomas
Indeed, the line was 2 characters longer than the standard 79-character limit.

Contributor

@albertcthomas albertcthomas left a comment


Some tests of the gaussian process module are now failing. You can run all the tests locally on your machine by running make tests. If you want to run just the tests of the gaussian process module, you can use pytest.

    -        K_gradient = \
    -            (K * squareform(dists))[:, :, np.newaxis]
    +        K_gradient = (K * squareform(dists) / self.length_scale)
    +        K_gradient = K_gradient[:, :, np.newaxis]
Contributor


I think you can also use an ellipsis here:
[..., np.newaxis]
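For illustration (plain NumPy, separate from the PR diff itself), `...` stands for all leading axes, so both spellings produce the same result on a 2-D array:

```python
import numpy as np

a = np.arange(12.0).reshape(3, 4)  # a 2-D array, like a kernel matrix

b = a[:, :, np.newaxis]   # explicit: add a trailing axis
c = a[..., np.newaxis]    # ellipsis: same thing, independent of ndim

print(b.shape, c.shape)        # (3, 4, 1) (3, 4, 1)
print(np.array_equal(b, c))    # True
```

The ellipsis form is slightly more robust because it keeps working if the array's number of leading dimensions changes.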

@cheng-w-liu
Author

The tests related to Gaussian Process failed because the calculation of the kernels' gradients uses a utility function, _approx_fprime, that I suspect is not completely correct.

Since the scope of this PR is the RBF kernel, does it make sense to only modify the test cases related to RBF kernels to use scipy.optimize.check_grad?

@albertcthomas
Contributor

As @amueller said above, I think you can replace _approx_fprime by the scipy gradient checker in the tests.
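For reference, the scipy gradient checker being referred to is `scipy.optimize.check_grad`; a minimal usage sketch on a toy function (not the kernel code):

```python
import numpy as np
from scipy.optimize import check_grad

def f(x):
    """Toy objective: f(x) = sum(x^2)."""
    return np.sum(x ** 2)

def grad_f(x):
    """Analytical gradient: 2x."""
    return 2 * x

x0 = np.array([1.0, -2.0, 3.0])

# check_grad returns the 2-norm of (analytical - numerical) gradient;
# a small value indicates the analytical gradient is consistent.
err = check_grad(f, grad_f, x0)
print(err)  # close to 0
```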

@albertcthomas
Contributor

And you will be able to check if this makes the tests pass.

@cheng-w-liu
Author

The tests failed because some of the test cases use compound kernels, and the calculation of the gradient of a compound kernel is also incorrect. I opened another issue here: #11140

@jmetzen
Member

jmetzen commented May 29, 2018

Could you check if this is affected by #8393?

@cheng-w-liu
Author

@jmetzen
#8391 and its fix don't seem to affect the calculation of gradients. It seemed a bit odd to me that the gradient tests are failing after correcting the analytical gradient for RBF kernels.

If you get a chance to take a look, do the modifications in this PR make sense to you?
Thanks for your feedback!

@cheng-w-liu
Author

The reason test_kernel_gradient() didn't pass is that the analytical gradient and the numerical approximation do not match for a few kernels.

The utility function _approx_fprime used to approximate the gradients is almost a copy of scipy's _approx_fprime_helper, which is used in scipy.optimize.check_grad. However, the kernels' hyperparameters theta (which correspond to the xk argument of _approx_fprime) are in log-scale, while the denominator in the numerical gradient approximation isn't scaled accordingly, so the resulting numerical approximations of the gradients are incorrect.
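The point can be illustrated with a standalone 1-D sketch (toy values, hypothetical variable names): differencing the kernel with respect to theta = log(length_scale) reproduces a lambda^2 denominator, not lambda^3:

```python
import numpy as np

def k(d2, length_scale):
    """Isotropic RBF kernel value for a fixed squared distance d2."""
    return np.exp(-0.5 * d2 / length_scale ** 2)

d2, ls = 1.25, 1.3
theta = np.log(ls)
eps = 1e-6

# Finite difference in theta = log(length_scale): perturb theta, exponentiate back.
fd_theta = (k(d2, np.exp(theta + eps)) - k(d2, np.exp(theta - eps))) / (2 * eps)

# Chain rule: dk/dtheta = (dk/dls) * ls = k * d2 / ls^2  (lambda^2 denominator).
grad_theta = k(d2, ls) * d2 / ls ** 2

# Gradient w.r.t. length_scale itself has the lambda^3 denominator instead.
grad_ls = k(d2, ls) * d2 / ls ** 3

print(abs(fd_theta - grad_theta))  # ~0: the theta-difference matches lambda^2
print(abs(fd_theta - grad_ls))     # clearly nonzero
```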

@webdrone

Hey, nice catch on the bug, and thank you for the fix. I used to hand the marginal likelihood to a Bayesian optimiser because I thought gradient descent got stuck in a bad local optimum. Do you know if this will be merged soon? It is a pretty annoying thing to have to deal with.

@jnothman jnothman added the Bug label Aug 20, 2018
@jnothman
Member

I think this got lost along the way, but we should indeed prioritise fixing it. The PR is not yet complete, however, and yes, it would be good to have feedback from someone who knows optimisation well...

@cheng-w-liu
Author

cheng-w-liu commented Aug 21, 2018

Hey! As far as fixing the gradients is concerned, I think the changes made have served the original purpose of this PR.

It's a bit awkward that this PR can't be merged because the tests didn't pass. As explained above, the tests themselves are incorrect: the kernel parameters are in log-scale, but the numerical derivatives do not scale accordingly. If fixing the tests is required in order to merge the PR, that's a bit beyond my bandwidth at this point. If anyone is interested, please feel free to fork and continue.

@jnothman
Member

jnothman commented Aug 21, 2018 via email

@jnothman
Member

Marking this as help wanted / stalled.

@webdrone

webdrone commented Aug 21, 2018

So I had some time today and decided to thoroughly go through the derivation of the gradient. I don't think there is a problem with the original gradient calculation, at least for the squared exponential.

Consider that the optimisation is over the space of $θ$, where $θ_i = \ln(λ_i)$, and
\[
k(x, y) = \exp\left[-\tfrac{1}{2} (x - y)^\top M (x - y)\right],
\]
where $M = \mathrm{diag}(λ)^{-2}$. This is the right parameterisation since the kernel uses the hyperparameter 'length_scale' ($λ$) in this manner (it divides all input by $λ$ to compute distances, making $M_{ii} = 1/λ_i^2$).

Then
\[
\frac{\partial k(x, y)}{\partial λ_i} = \frac{(x_i - y_i)^2}{λ_i^3} \, k(x, y),
\]
which gives every element of $\frac{\partial K}{\partial λ_i}$. Equivalently, with element-wise multiplication $*$:
\[
\frac{\partial K}{\partial λ_i} = \frac{1}{λ_i^3} \, \mathrm{cdist}(X[:, i], X[:, i], \text{'sqeuclidean'}) * K.
\]

Differentiating the marginal log-likelihood with respect to $θ$ yields eq. (5.9) of R&W Chapter 5 (http://www.gaussianprocess.org/gpml/chapters/RW5.pdf). Then, with the chain rule:
\begin{align}
\frac{\partial \ell\ell}{\partial θ_i} &= \frac{1}{2} \mathrm{tr}\left((αα^\top - K^{-1}) \frac{\partial K}{\partial θ_i}\right) \\
&= \frac{1}{2} \mathrm{tr}\left((αα^\top - K^{-1}) \frac{\partial K}{\partial λ_i} \frac{\partial λ_i}{\partial θ_i}\right) \\
&= \frac{1}{2} \mathrm{tr}\left((αα^\top - K^{-1}) \frac{\partial K}{\partial λ_i} \, λ_i\right) \\
&= \frac{1}{2} \mathrm{tr}\left((αα^\top - K^{-1}) \frac{1}{λ_i^2} \, \mathrm{cdist}(X[:, i], X[:, i], \text{'sqeuclidean'}) * K\right).
\end{align}

Notice how the division is by $λ_i^2$, not $λ_i^3$, because of the extra $λ_i$ factor from the chain rule, since we optimise with respect to $θ = \ln λ$. This is the original operation being done before the change introduced in this PR.
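The final line can be checked numerically with a self-contained NumPy sketch (1-D inputs and a single length-scale `lam`, standing in for the $λ$ above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 1))   # 1-D inputs, single length-scale lam
lam = 1.7
theta = np.log(lam)

def kernel(X, lam):
    d2 = (X - X.T) ** 2       # pairwise squared distances, shape (5, 5)
    return np.exp(-0.5 * d2 / lam ** 2)

# Analytical dK/dtheta = (1/lam^2) * sqdist * K, as derived above.
d2 = (X - X.T) ** 2
K = kernel(X, lam)
dK_dtheta = d2 * K / lam ** 2

# Central finite difference in theta = log(lam).
eps = 1e-6
fd = (kernel(X, np.exp(theta + eps)) - kernel(X, np.exp(theta - eps))) / (2 * eps)

print(np.max(np.abs(dK_dtheta - fd)))  # ~0
```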

I guess my problem really was too hard for the optimiser after all. I would be very happy if you spot an error in the above.

PS: the lack of LaTeX rendering is very annoying… if anyone knows how to render maths in GitHub's markdown, please let me know.

@nw2190

nw2190 commented Feb 11, 2019

@webdrone
Thanks for providing the full derivation; it certainly helps clarify where the current RBF implementation is coming from.

But I believe there still needs to be consistency between the optimizer and the RBF kernel implementation. In particular, if the kernel returns gradients with respect to the log-hyperparameter theta = log(length_scale), then the optimizer should be explicitly told to minimize w.r.t. theta (and passed the corresponding log bounds).

I think this would also make sense from a numerical stability standpoint, but it would require modification of various other components of the code.

One possible solution would be to add a property such as "log_scale" to the Hyperparameter class definition (or define it as a new "value_type"?). Then the "_constrained_optimization" implementation in the definition of "GaussianProcessRegressor" could be modified to deal with log-hyperparameters in a manner consistent with the kernel gradient calculations.

Update:
I ended up re-writing the GP regression code from scratch, using explicit log-hyperparameter optimization, to compare the results with the current scikit-learn implementation. I also used a re-implementation of Rasmussen's minimize code for the comparison, to avoid code overlap.

From what I can tell, the current implementation is indeed correct, or at least it finds the same optimal hyperparameter values obtained by explicitly minimizing w.r.t. the log-hyperparameters.

It may be helpful to add a comment in the code to explain more clearly how the current implementation works, though.

Update 2:
Sorry, I forgot to mention that I also included the WhiteKernel in the comparison; it appears to follow the same log-hyperparameter gradient calculation by including the multiplicative factor "noise_level" in line 1107. That is, the derivative of the diagonal matrix would simply be the identity, but here it is scaled by "noise_level", which agrees with optimization w.r.t. the log parameter "s = log(noise_level)" and a chain-rule multiplicative factor "d/ds[exp(s)] = noise_level".
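The same chain-rule factor can be illustrated for a white-noise kernel K = noise_level · I with s = log(noise_level) (a standalone sketch, not the library code):

```python
import numpy as np

noise_level = 0.25
s = np.log(noise_level)
n = 4

def white_kernel(noise_level, n):
    """White-noise kernel: a scaled identity matrix."""
    return noise_level * np.eye(n)

# d/ds [exp(s) * I] = exp(s) * I = noise_level * I: the gradient w.r.t. the
# log parameter is the identity scaled by noise_level, as noted above.
dK_ds = noise_level * np.eye(n)

# Central finite difference in s = log(noise_level).
eps = 1e-7
fd = (white_kernel(np.exp(s + eps), n) - white_kernel(np.exp(s - eps), n)) / (2 * eps)

print(np.max(np.abs(dK_ds - fd)))  # ~0
```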

@nw2190

nw2190 commented Feb 13, 2019

I have just done a few more experiments, and it appears that the accuracy of the current implementation is simply a testament to the effectiveness of the L-BFGS optimization algorithm.

It appears that you can replace lines 1233-1234 with just about any strictly-positive function of length_scale, e.g.:

    g = lambda x: np.cos(x) ** 2
    K_gradient = (g(length_scale) * K * squareform(dists))[:, :, np.newaxis]

and still achieve the same optimal hyperparameter values that the current implementation finds for most examples I have tested.

So the accuracy of the implementation on a few simple experiments does not actually seem to justify the lack of consistency between the gradient calculations and optimization calls (i.e. optimizing w.r.t. log-parameters but passing the unmodified values).

I can fix this consistency issue, but when I run pytest --cov sklearn/gaussian_process/tests/ on an unmodified fork of the repository, a significant number of tests fail.

Would a proposed code change have any chance of being accepted if these tests are all failing? Also, are there any suggestions/recommendations on how to format a log-scale implementation such as the one mentioned in my previous comment?

@jnothman
Member

jnothman commented Feb 13, 2019 via email

@nw2190

nw2190 commented Feb 13, 2019

Ok, sorry for all of these posts, but I believe the current implementation is actually using log-scaled (or "log-transformed") parameters after all...

I eventually found a hint buried in the definition of the Sum kernel in Line 228 of the kernels.py file:

    Note that theta are typically the log-transformed values of the
    kernel's hyperparameters as this representation of the search space
    is more amenable for hyperparameter search, as hyperparameters like 
    length-scales naturally live on a log-scale.    

And indeed it appears that the log-scale assumption is enforced in Line 244 within the Kernel class definition. Also these parameters are transformed back with np.exp() in the setter method at Line 264.

So I think the current implementation has already incorporated all of the suggestions in my previous comments. We should probably add a clearer explanation of this log-scale assumption in the comments, though, since anyone interested in implementing custom kernels inheriting from the base class Kernel will need to be aware of it.

Update: Now that I know what to search for, the phrase log-transformed appears quite frequently in the comments of the KernelOperator and CompoundKernel definitions; but I think it should be mentioned in a more visible section of the code, since this is not necessarily an "obvious" assumption to make about the implementation of covariance kernels.

Update 2: I realize now that this is also mentioned in the documentation after looking at it more closely:

Note that both properties (theta and bounds) return log-transformed values of the internally used values since those are typically more amenable to gradient-based optimization.

which can be found in section 1.7.5.1. Gaussian Process Kernel API.
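A minimal sketch of the convention being described (a hypothetical toy class, not the actual Kernel base class): the theta property exposes log-transformed values, and the setter exponentiates back to the natural scale:

```python
import numpy as np

class ToyRBF:
    """Toy kernel holding a positive length_scale, exposing log-scale theta."""

    def __init__(self, length_scale=1.0):
        self.length_scale = length_scale

    @property
    def theta(self):
        # Log-transformed hyperparameters, following the convention quoted above.
        return np.log(np.atleast_1d(self.length_scale))

    @theta.setter
    def theta(self, value):
        # Transform back with exp, as the real setter does.
        self.length_scale = float(np.exp(value)[0])

k = ToyRBF(length_scale=2.0)
print(k.theta)         # log(2) ~ 0.693
k.theta = np.array([0.0])
print(k.length_scale)  # 1.0
```

Any gradient a kernel like this reports must then be with respect to theta, which is exactly where the chain-rule factor discussed in this thread comes from.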

@jnothman
Member

Sorry about that.

What are the next steps here? Tests appear to be failing.

@nw2190

nw2190 commented Mar 18, 2019

@jnothman I believe the implementation of the kernel gradients is correct and consistent with the optimizer's use of log-scale hyperparameters. So I think this PR can be closed in the sense that the RBF implementation was in fact correct as originally implemented.

As mentioned in some of the previous comments, the _approx_fprime function may be causing the tests to fail. But I believe the approximation is correct as long as the arguments xk are assumed to be log-scale hyperparameters (which appears to be the case since they are typically passed self.theta).

I am new to using pytest, but from what I can tell the tests seem to pass on the current master branch (with Python 3.7 on Arch Linux). There are only two warnings, both associated with test_kernels.py:

            if self.nu == 0.5:
                K_gradient = K[..., np.newaxis] * D \
>                   / np.sqrt(D.sum(2))[:, :, np.newaxis]
E               RuntimeWarning: invalid value encountered in true_divide 
kernels.py:1372: RuntimeWarning

Both warnings stem from line 54 of test_kernels.py during the test_kernel_gradient call.

In particular there appears to be an issue with the gradient calculation for the second kernel in the Product kernel operator on line 760.

The actual warning occurs at line 1372 of kernels.py. This is likely caused by the fact that the lines:

    if self.nu == 0.5:
        K_gradient = K[..., np.newaxis] * D \
            / np.sqrt(D.sum(2))[:, :, np.newaxis]
        K_gradient[~np.isfinite(K_gradient)] = 0

can involve dividing by D = squareform(dists**2)[:, :, np.newaxis] which has zeros along the diagonal (which remain zero after summing across the trivially expanded axis). This appears to be handled by the isfinite check following the division, so I do not think this is a very serious issue.
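The mechanism can be reproduced in isolation (toy arrays, assuming only NumPy): a distance matrix with zeros on the diagonal produces NaNs under division, which the isfinite guard then masks out:

```python
import numpy as np

# A distance-like matrix with zeros on the diagonal, as squareform(dists) yields,
# with a trailing axis added as in the Matern nu=0.5 branch.
D = np.array([[0.0, 1.0],
              [1.0, 0.0]])[:, :, np.newaxis]
K = np.ones((2, 2))

with np.errstate(invalid="ignore"):
    # 0/0 on the diagonal produces NaN (a RuntimeWarning in a real run).
    G = K[..., np.newaxis] * D / np.sqrt(D.sum(2))[:, :, np.newaxis]

# The guard used in kernels.py zeroes out the non-finite entries.
G[~np.isfinite(G)] = 0.0
print(G[0, 0, 0], G[0, 1, 0])  # 0.0 on the diagonal, finite off-diagonal
```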

From what I can tell, the current implementation appears to be working properly.

@jnothman
Member

jnothman commented Mar 19, 2019 via email

@AbhilashMathews

> So I had some time today and decided to thoroughly go through the derivation of the gradient. […full derivation quoted from @webdrone's comment above…]

Thank you for clarifying that. It's an important distinction that it is the log marginal likelihood being optimized over the log-transformed hyperparameters (i.e. both are log-transformed, and this should be underlined).

@jnothman
Member

jnothman commented Mar 26, 2019 via email

@nw2190

nw2190 commented Mar 26, 2019

Please feel free to offer a pull request improving the documentation

I have a few deadlines coming up right now, but I can take a look at improving the documentation afterwards. I will try to submit a pull request with more details regarding the gradient calculations and the internal hyperparameters sometime next week if it hasn't already been revised by then.

@amueller
Member

amueller commented Aug 7, 2019

@nw2190 still interested in working on this?


Development

Successfully merging this pull request may close these issues.

For the RBF Kernel in gaussian_process, the calculation of the gradient seems incorrect?

8 participants