more numerically stable cosine_similarity #31378
nikitaved wants to merge 9 commits into pytorch:master from …
Conversation
Can we get a comment explaining the strategy here? It would also be good to compare it with the old implementation in that comment.
Shall I touch upon how division by zero is handled?
So you are avoiding the overflow problems of the original implementation, but you are exposing yourself to underflow problems with yours (when normalizing, individual components may underflow and turn to 0). Do we have a feeling for why one sort of tradeoff is better than the other?
Underflow is not a problem: floats are at their best when values are small (more significant bits in the mantissa), and their precision deteriorates as their absolute value increases.
It is also one of the reasons why you normalize, say, images, before feeding them to a network... All these inner products inside can cause trouble otherwise.
Ah, sorry, I assume that this function is used with floating point types. Would that be a correct assumption?
Yes, this function is used with floating point types. But you are right, usually underflow is less of a problem than overflow. It's possible to create an adversarial example, though, where instead of getting a cosine similarity close to 1 you'd get a significantly smaller value.
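To make the component-underflow scenario above concrete, here is a minimal plain-C++ sketch (my own toy values, not the ATen kernel): a tiny component of a normalized vector can indeed flush to zero, but in this case its contribution to the cosine is negligible at float precision, in line with the point that underflow is usually benign.

```cpp
#include <cstdio>

int main() {
    // Illustrative values: mathematically ||x|| is dominated by the large
    // component, so we take the norm to be 1e30 directly.
    float big = 1e30f, small = 1e-16f;
    float norm = 1e30f;

    // The normalized small component is ~1e-46, which is below the smallest
    // subnormal float (~1.4e-45) and therefore rounds to exactly 0.
    float normalized_small = small / norm;
    std::printf("normalized small component: %g\n", normalized_small);  // prints 0

    // The large component normalizes to 1 and carries essentially all of the
    // inner product, so losing the ~1e-46 term does not change the cosine
    // at float precision.
    std::printf("normalized big component: %g\n", big / norm);  // prints 1
    return 0;
}
```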
@nikitaved can you add regression tests for the cases that used to give …? Also, it would be good to expand on this comment in the PR description:
The change in behavior is nicely explained in the comment, but it looks to me like the old test should still pass as well. Best never to change tests in PRs where you also change behavior; that's typically a sign something is wrong. So I'd suggest changing that back, and only adding new tests.
@rgommers, ok, I will add the tests. The old test will not pass. In order to show that I need to write down the derivative and point out the difference, so it becomes rather "technical". Shall I at least write the explanation here?
Okay, just the explanation here will be useful.
The change from max(foo, eps) to foo + eps is to enforce that the normalized vectors have norms < 1?
The goal being that floating point errors in the dot product won't push us > 1?
How big should eps be to make sure this is always true?
It is to enforce that we do not divide by zero, and yes, as a side effect, it makes the norms less than 1. I can't say I am 100% sure that this will work for any epsilon now that I think about it, but it does work for the standard epsilon (1e-8). Anyway, this implementation should tolerate a larger set of epsilon values compared to the older algorithm, because the vectors in the inner product live in a unit ball and the mantissa of floating-point values is utilized much better.
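To see what the two evaluation orders do in practice, here is a small self-contained float harness (plain C++, not the ATen code; the vector length, value range, seed, and eps are my own arbitrary choices). It feeds identical vectors through both orders, so the exact cosine is 1; with the dot-first order the maximum can creep slightly above 1.0f (the failure mode reported in #29442), while whether the normalize-first order can also do so is exactly the question raised just above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Toy float comparison of "inner product first, then divide by the norms"
// versus "normalize first, then inner product". Identical inputs are used,
// so the mathematically exact cosine similarity is 1 in every trial.
float dot(const std::vector<float>& a, const std::vector<float>& b) {
    float s = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

int main() {
    const float eps = 1e-8f;
    std::mt19937 gen(0);
    std::uniform_real_distribution<float> dist(0.0f, 1000.0f);

    float max_old = 0.0f, max_new = 0.0f;
    for (int trial = 0; trial < 100000; ++trial) {
        std::vector<float> x(16);
        for (float& v : x) v = dist(gen);
        const std::vector<float>& y = x;  // identical vectors

        // Old order: divide the raw inner product by the product of norms.
        float nx = std::sqrt(dot(x, x));
        float ny = std::sqrt(dot(y, y));
        float old_cos = dot(x, y) / std::max(nx * ny, eps);

        // New order: normalize each vector, then take the inner product of
        // entries that all lie in [-1, 1].
        std::vector<float> xh = x, yh = y;
        for (float& v : xh) v /= std::max(nx, eps);
        for (float& v : yh) v /= std::max(ny, eps);
        float new_cos = dot(xh, yh);

        max_old = std::max(max_old, old_cos);
        max_new = std::max(max_new, new_cos);
    }
    std::printf("largest old-order cosine: %.9f\n", max_old);
    std::printf("largest new-order cosine: %.9f\n", max_new);
    return 0;
}
```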
But the difference with the max is that if the norm is small (like 2*eps), you will make a much bigger error here. Is that something we should worry about?
Well, that is a philosophical dilemma of how to handle zeroes. I personally do not like using max, because it makes the function potentially non-differentiable... Are you ok with non-differentiability? Also, I did play with max, and had some issues with autograd not being able to process backward...
You can avoid this problem by doing this change in a non-differentiable manner using NoGradGuard around the max operations. That way, this won't change the gradient computation.
Could you please explain why then clamping did not introduce any issues in the older implementation?
The current implementation does not have it.
And the other PR should have the NoGradGuard as well to avoid this issue. There is a comment in it asking to do so.
I am prepared to merge this, but @nikitaved I'd like to know if you were planning to incorporate Alban's suggestion (use max instead of adding epsilon).
@albanD, seems like I am doing something wrong. This causes an error:

```cpp
auto x1_norm = at::norm(x1, 2, dim, true);
auto x2_norm = at::norm(x2, 2, dim, true);
{
    at::NoGradGuard guard;
    x1_norm.clamp_max_(std::sqrt(eps));
    x2_norm.clamp_max_(std::sqrt(eps));
}
```
The problem is that you cannot change the result of the norm operation in-place, because as you can see in the gradient formula the value of the output is needed to compute the gradients.
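For the 2-norm it is easy to see why the forward output is reused in the backward; a quick derivation (my own addition, not from the PR):

$$
\lVert x\rVert_2 = \Big(\sum_j x_j^2\Big)^{1/2}
\quad\Longrightarrow\quad
\frac{\partial \lVert x\rVert_2}{\partial x_i} = \frac{x_i}{\lVert x\rVert_2},
\qquad\text{i.e.}\qquad
\nabla_x \lVert x\rVert_2 = \frac{x}{\lVert x\rVert_2}.
$$

The backward pass divides by the saved forward output, so autograd rejects an in-place clamp on that output, which is presumably the error hit in the snippet above.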
Right, this would do what you want?

```cpp
auto x1_pow = at::pow(x1, 2).sum(dim, true);
auto x2_pow = at::pow(x2, 2).sum(dim, true);
{
    at::NoGradGuard guard;
    x1_pow.clamp_min_(eps);
    x2_pow.clamp_min_(eps);
}
auto x1_norm = x1_pow.sqrt_();
auto x2_norm = x2_pow.sqrt_();
```
Sorry, it is the sum that has to be clamped. I used …
The guard won't change anything wrt inplace checks. If it is problematic inside, it would be problematic without the guard as well. Right, I edited the code above.
@albanD, sorry for the confusion. It is exactly the question I already asked you previously. The current implementation uses unguarded clamps. From your answer I understood that it is better to guard it. So, guarded or unguarded, which is better and why?
@ezyang, I updated the code following the suggestions of @albanD. The example from #29442 now works without any clamping (note that the input vector to BCELoss has a negative value), but I decided to add it just in case. Still, the test checking division by zero fails (because of …).
@albanD, it is off-topic, but could you recommend something to read about the autograd in pytorch?
You can check the current autograd note in the docs, which gives some info.
@ezyang, it is possible to keep the current testing approach, but for that I need the following logic implemented efficiently: … Is there a way to do that without a loop over the batch dimension?
Thank you, let's merge it then!
@pytorchbot merge this please
Hey @nikitaved.
Summary:
**Previous behavior**: compute the inner product, then normalize.
**This patch**: first normalize, then compute the inner product. This should be more numerically stable because it avoids losing precision in the inner product for inputs with large norms. By design it ensures that the cosine similarity is within `[-1.0, +1.0]`, so it should fix [#29442](#29442).
P.S. I had to change tests because this implementation handles division by 0 differently.
This PR computes cosine similarity as follows: <x/max(eps, ||x||), y/max(eps, ||y||)>.
Let f(x,y) = <x,y>/(||x|| * ||y||), then df/dx = y/(||x|| * ||y||) - (||y||/||x|| * <x,y> * x)/(||x|| * ||y||)^2.
The changed test checks division by zero in the backward when x = 0 and y != 0. For this case the non-zero part of the gradient is just y / (||x|| * ||y||). The previous test evaluates y/(||x|| * ||y||) to y / eps, and this PR to 1/eps * y/||y||.
Pull Request resolved: #31378
Approved by: https://github.com/ezyang, https://github.com/albanD
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/9e137ee583c4fdb2dd3aa0c425dc9c289454cbf2
Reviewed By: seemethere
Differential Revision: D35850003
fbshipit-source-id: 5de5563377ad3fa8acc62d0d974b241ee5bc0af1
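Typesetting the gradient from the summary with the intermediate quotient-rule step spelled out (same content as above, just in display form):

$$
f(x, y) = \frac{\langle x, y\rangle}{\lVert x\rVert\,\lVert y\rVert},
\qquad
\frac{\partial f}{\partial x}
= \frac{y}{\lVert x\rVert\,\lVert y\rVert}
- \frac{\langle x, y\rangle\, x}{\lVert x\rVert^{3}\,\lVert y\rVert}
= \frac{y}{\lVert x\rVert\,\lVert y\rVert}
- \frac{\lVert y\rVert}{\lVert x\rVert}\cdot\frac{\langle x, y\rangle\, x}{\big(\lVert x\rVert\,\lVert y\rVert\big)^{2}}.
$$

At x = 0 and y != 0 the second term vanishes, so the division-by-zero test only exercises the y/(||x|| * ||y||) term: substituting the clamped norm gives y/eps with the old implementation and (1/eps) * y/||y|| with this PR, as stated in the summary.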
This implementation, albeit more stable, does not compute the cosine similarity as per the docs. If the norm of both x_1 and x_2 is less than eps, here we are dividing by eps^2, rather than by eps, as the docs declare. That being said, I like this implementation much better, numerically speaking, than the previous one, so I'm more in favour of fixing the docs than fixing the code really.
Updating the doc sounds better than adding another max(eps, denom) to push eps^2 to eps. |
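In symbols, the mismatch being discussed (the documented formula is reconstructed here from the comment above, so treat it as an assumption rather than a quote of the docstring):

$$
\text{docs:}\;\; \frac{\langle x_1, x_2\rangle}{\max\!\big(\lVert x_1\rVert\,\lVert x_2\rVert,\ \epsilon\big)}
\qquad\text{vs.}\qquad
\text{this PR:}\;\; \Big\langle \frac{x_1}{\max(\lVert x_1\rVert,\ \epsilon)},\ \frac{x_2}{\max(\lVert x_2\rVert,\ \epsilon)} \Big\rangle,
$$

so when both norms fall below eps the implemented expression effectively divides by eps^2 while the documented one divides by eps, which is the discrepancy that #104772 below resolves by updating the docs.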
…he gradients (#104771)
There was a regression in #31378 which was reported in #104564. This PR should keep the efficiency and memory usage of the original implementation, while keeping the stability of the latter. This solution was already discussed in #31378, but it was discarded because it modified the output of `vector_norm` in-place. The only magic ingredient that was missing for that solution to work was to add a `clone()` after calling `vector_norm`. I hope this PR takes less time to land than #104564.
Fixes #104564
Pull Request resolved: #104771
Approved by: https://github.com/albanD
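A minimal ATen-style sketch of the pattern this commit message describes (clone the norm, then clamp the clone in place under a NoGradGuard). It reuses the calls from the snippets earlier in the thread and is a sketch under those assumptions, not the actual code of #104771:

```cpp
#include <ATen/ATen.h>

// Sketch only: the real kernel may use at::linalg_vector_norm and differ in
// option and broadcasting handling.
at::Tensor cosine_similarity_sketch(
    const at::Tensor& x1, const at::Tensor& x2, int64_t dim, double eps) {
  // clone() gives fresh tensors, so the in-place clamp below no longer touches
  // the outputs that norm saved for its own backward pass.
  auto x1_norm = at::norm(x1, 2, dim, /*keepdim=*/true).clone();
  auto x2_norm = at::norm(x2, 2, dim, /*keepdim=*/true).clone();
  {
    // Hide the clamp from autograd so it does not alter the gradient formula.
    at::NoGradGuard guard;
    x1_norm.clamp_min_(eps);
    x2_norm.clamp_min_(eps);
  }
  // Normalize first, then take the inner product: the stable evaluation order
  // this whole thread is about.
  return ((x1 / x1_norm) * (x2 / x2_norm)).sum(dim);
}
```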
The behaviour of `cosine_similarity` was subtly changed in #31378, but the docs were not updated.
Pull Request resolved: #104772
Approved by: https://github.com/albanD, https://github.com/svekars