
more numerically stable cosine_similarity #31378

Closed

nikitaved wants to merge 9 commits into pytorch:master from Quansight:better-cosine-similarity

Conversation

@nikitaved (Collaborator) commented Dec 17, 2019

Previous behavior: compute inner product, then normalize.
This patch: first normalize, then compute inner product. This should be more numerically stable because it avoids losing precision in inner product for inputs with large norms.
By design, this ensures that the cosine similarity is within [-1.0, +1.0], so it should fix #29442.

P.S. I had to change tests because this implementation handles division by 0 differently.
This PR computes cosine similarity as follows: <x/max(eps, ||x||), y/max(eps, ||y||)>.
Let f(x,y) = <x,y>/(||x|| * ||y||), then
df/dx = y/(||x|| * ||y||) - (||y||/||x|| * <x,y> * x)/(||x|| * ||y||)^2.
The changed test checks division by zero in backward when x=0 and y != 0.
For this case the non-zero part of the gradient is just y / (||x|| * ||y||).
The previous test expects y/(||x|| * ||y||) to evaluate to y / eps, while under this PR it evaluates to (1/eps) * y/||y||.
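For reference, a minimal Python sketch of the "normalize first" formula described above (illustrative only; the function name, dims, and eps are placeholders, not the actual ATen implementation):

    import torch

    def cosine_similarity_normalize_first(x, y, dim=-1, eps=1e-8):
        # <x / max(eps, ||x||), y / max(eps, ||y||)>
        x_norm = x.norm(dim=dim, keepdim=True).clamp_min(eps)
        y_norm = y.norm(dim=dim, keepdim=True).clamp_min(eps)
        return ((x / x_norm) * (y / y_norm)).sum(dim=dim)

    x = torch.randn(5, 3)
    y = torch.randn(5, 3)
    out = cosine_similarity_normalize_first(x, y)
    # Each normalized vector has norm <= 1, so the inner product stays within
    # [-1, 1] up to rounding, even for inputs with very large norms.
    assert out.abs().max() <= 1.0 + 1e-6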

@kostmo (Member) commented Dec 17, 2019

💊 CircleCI build failures summary and remediations

As of commit 4b3510f:

  • 1/7 broken upstream at merge base 5b321a0 since Jan 27

    Please rebase on the viable/strict branch:

    If your commit is newer than viable/strict, you can try basing on an older, stable commit:

    git fetch origin viable/strict
    git rebase --onto viable/strict $(git merge-base origin/master HEAD)
    

    If your commit is older than viable/strict:

    git fetch origin viable/strict
    git rebase viable/strict
    

    Check out the recency history of this "viable master" tracking branch.

  • 5/7 failures introduced in this PR

  • 1/7 recognized as flaky ❄️

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

🕵️ 5 new failures recognized by patterns

The following build failures do not appear to be due to upstream breakage:

See CircleCI build pytorch_linux_xenial_py3_clang5_mobile_build (1/5)

Step: "Build" (full log | pattern match details)

Jan 27 14:46:51 caused by: Connection refused (os error 111)
Jan 27 14:46:51 +++ eval 'extract_trap_cmd ' 
Jan 27 14:46:51 ++++ extract_trap_cmd 
Jan 27 14:46:51 ++++ printf '%s\n' '' 
Jan 27 14:46:51 +++ printf '%s\n' cleanup 
Jan 27 14:46:51 ++ trap -- ' 
Jan 27 14:46:51 cleanup' EXIT 
Jan 27 14:46:51 ++ which sccache 
Jan 27 14:46:51 ++ sccache --stop-server 
Jan 27 14:46:51 Stopping sccache server... 
Jan 27 14:46:51 error: couldn't connect to server 
Jan 27 14:46:51 caused by: Connection refused (os error 111) 
Jan 27 14:46:51 ++ true 
Jan 27 14:46:51 ++ rm /var/lib/jenkins/sccache_error.log 
Jan 27 14:46:51 rm: cannot remove '/var/lib/jenkins/sccache_error.log': No such file or directory 
Jan 27 14:46:51 ++ true 
Jan 27 14:46:51 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
Jan 27 14:46:51 ++ SCCACHE_IDLE_TIMEOUT=1200 
Jan 27 14:46:51 ++ RUST_LOG=sccache::server=error 
Jan 27 14:46:51 ++ sccache --start-server 
Jan 27 14:46:51 Starting sccache server... 
Jan 27 14:46:51 ++ sccache --zero-stats 

See CircleCI build pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build (2/5)

Step: "Build" (full log | pattern match details)

Jan 27 14:52:23 FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Distance.cpp.o
Jan 27 14:52:08   const int64_t ndim = input.dim(); 
Jan 27 14:52:08                 ^ 
Jan 27 14:52:08 1 warning generated. 
Jan 27 14:52:11 [272/591] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Convolution.cpp.o 
Jan 27 14:52:12 [273/591] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/ConvolutionTBC.cpp.o 
Jan 27 14:52:18 [274/591] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Cross.cpp.o 
Jan 27 14:52:19 [275/591] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/DilatedMaxPool2d.cpp.o 
Jan 27 14:52:19 [276/591] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Copy.cpp.o 
Jan 27 14:52:20 [277/591] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/DispatchStub.cpp.o 
Jan 27 14:52:23 [278/591] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Distance.cpp.o 
Jan 27 14:52:23 FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Distance.cpp.o  
extra -Wno-unused-parameter -Wno-missing-field-initializers -Wno-write-strings -Wno-unknown-pragmas -Wno-missing-braces -Wno-maybe-uninitialized -fvisibility=hidden -O2 -DCAFFE2_BUILD_MAIN_LIB -std=gnu++14 -MD -MT caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Distance.cpp.o -MF caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Distance.cpp.o.d -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Distance.cpp.o -c /var/lib/jenkins/workspace/aten/src/ATen/native/Distance.cpp 
Jan 27 14:52:23 /var/lib/jenkins/workspace/aten/src/ATen/native/Distance.cpp:8:10: fatal error: 'torch/torch.h' file not found 
Jan 27 14:52:23 #include <torch/torch.h> 
Jan 27 14:52:23          ^~~~~~~~~~~~~~~ 
Jan 27 14:52:23 1 error generated. 
Jan 27 14:52:26 [279/591] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/DilatedMaxPool3d.cpp.o 
Jan 27 14:52:27 [280/591] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Dropout.cpp.o 
Jan 27 14:52:27 ninja: build stopped: subcommand failed. 

See CircleCI build pytorch_linux_xenial_py3_clang5_android_ndk_r19c_mobile_code_analysis (3/5)

Step: "Build" (full log | pattern match details)

Jan 27 14:52:52 FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Distance.cpp.o
Jan 27 14:52:48 1 warning generated. 
Jan 27 14:52:48 [303/627] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/DilatedMaxPool3d.cpp.o 
Jan 27 14:52:48 In file included from /var/lib/jenkins/workspace/aten/src/ATen/native/DilatedMaxPool3d.cpp:5: 
Jan 27 14:52:48 ../../aten/src/ATen/native/Pool.h:58:17: warning: unused variable 'nOutputPlane' [-Wunused-variable] 
Jan 27 14:52:48   const int64_t nOutputPlane = nInputPlane; 
Jan 27 14:52:48                 ^ 
Jan 27 14:52:48 1 warning generated. 
Jan 27 14:52:48 [304/627] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/DispatchStub.cpp.o 
Jan 27 14:52:48 [305/627] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Copy.cpp.o 
Jan 27 14:52:52 [306/627] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Distance.cpp.o 
Jan 27 14:52:52 FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Distance.cpp.o  
extra -Wno-unused-parameter -Wno-missing-field-initializers -Wno-write-strings -Wno-unknown-pragmas -Wno-missing-braces -Wno-maybe-uninitialized -fvisibility=hidden -O2 -DCAFFE2_BUILD_MAIN_LIB -std=gnu++14 -MD -MT caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Distance.cpp.o -MF caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Distance.cpp.o.d -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Distance.cpp.o -c /var/lib/jenkins/workspace/aten/src/ATen/native/Distance.cpp 
Jan 27 14:52:52 /var/lib/jenkins/workspace/aten/src/ATen/native/Distance.cpp:8:10: fatal error: 'torch/torch.h' file not found 
Jan 27 14:52:52 #include <torch/torch.h> 
Jan 27 14:52:52          ^~~~~~~~~~~~~~~ 
Jan 27 14:52:52 1 error generated. 
Jan 27 14:52:58 [307/627] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Dropout.cpp.o 
Jan 27 14:53:00 [308/627] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Embedding.cpp.o 
Jan 27 14:53:07 [309/627] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Distributions.cpp.o 
Jan 27 14:53:07 ninja: build stopped: subcommand failed. 
Jan 27 14:53:07 + sccache_epilogue 

See CircleCI build pytorch_linux_xenial_cuda10_1_cudnn7_py3_nogpu_test (4/5)

Step: "Test" (full log | pattern match details)

Jan 27 15:31:53 RuntimeError: test_quantization failed!
Jan 27 15:31:53 Ran 36 tests in 48.158s 
Jan 27 15:31:53  
Jan 27 15:31:53 FAILED (errors=1, skipped=1) 
Jan 27 15:31:53  
Jan 27 15:31:53 Generating XML reports... 
Jan 27 15:31:53 Traceback (most recent call last): 
Jan 27 15:31:53   File "test/run_test.py", line 456, in <module> 
Jan 27 15:31:53     main() 
Jan 27 15:31:53   File "test/run_test.py", line 449, in main 
Jan 27 15:31:53     raise RuntimeError(message) 
Jan 27 15:31:53 RuntimeError: test_quantization failed! 
Jan 27 15:31:53 =================== sccache compilation log =================== 
Jan 27 15:31:53 + cleanup 
Jan 27 15:31:53 + retcode=1 
Jan 27 15:31:53 + set +x 
Jan 27 15:31:53 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Jan 27 15:31:53 Compile requests                  7 
Jan 27 15:31:53 Compile requests executed         6 
Jan 27 15:31:53 Cache hits                        0 
Jan 27 15:31:53 Cache misses                      6 
Jan 27 15:31:53 Cache timeouts                    0 

See CircleCI build pytorch_linux_xenial_cuda10_1_cudnn7_py3_NO_AVX2_test (5/5)

Step: "Test" (full log | pattern match details)

Jan 27 15:38:13 RuntimeError: test_quantization failed!
Jan 27 15:38:13 Ran 36 tests in 55.292s 
Jan 27 15:38:13  
Jan 27 15:38:13 FAILED (errors=1, skipped=1) 
Jan 27 15:38:13  
Jan 27 15:38:13 Generating XML reports... 
Jan 27 15:38:13 Traceback (most recent call last): 
Jan 27 15:38:13   File "test/run_test.py", line 456, in <module> 
Jan 27 15:38:13     main() 
Jan 27 15:38:13   File "test/run_test.py", line 449, in main 
Jan 27 15:38:13     raise RuntimeError(message) 
Jan 27 15:38:13 RuntimeError: test_quantization failed! 
Jan 27 15:38:14 + cleanup 
Jan 27 15:38:14 + retcode=1 
Jan 27 15:38:14 + set +x 
Jan 27 15:38:14 =================== sccache compilation log =================== 
Jan 27 15:38:14 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Jan 27 15:38:14 Compile requests                32 
Jan 27 15:38:14 Compile requests executed       11 
Jan 27 15:38:14 Cache hits                       1 
Jan 27 15:38:14 Cache misses                    10 
Jan 27 15:38:14 Cache timeouts                   0 

❄️ 1 failure recognized as flaky

The following build failures have been detected as flaky and may not be your fault:

See CircleCI build pytorch_linux_xenial_cuda10_1_cudnn7_py3_gcc7_test (1/1)

Step: "Test" (full log | pattern match details) ❄️

Jan 27 16:04:56 ConnectionResetError: [Errno 104] Connection reset by peer
Jan 27 16:04:56   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 455, in accept 
Jan 27 16:04:56     deliver_challenge(c, self._authkey) 
Jan 27 16:04:56   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 722, in deliver_challenge 
Jan 27 16:04:56     response = connection.recv_bytes(256)        # reject large message 
Jan 27 16:04:56   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes 
Jan 27 16:04:56     buf = self._recv_bytes(maxlength) 
Jan 27 16:04:56   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes 
Jan 27 16:04:56     buf = self._recv(4) 
Jan 27 16:04:56   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 379, in _recv 
Jan 27 16:04:56     chunk = read(handle, remaining) 
Jan 27 16:04:56 ConnectionResetError: [Errno 104] Connection reset by peer 
Jan 27 16:04:56 /opt/conda/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown 
Jan 27 16:04:56   len(cache)) 
Jan 27 16:04:59 Process ErrorTrackingProcess-122: 
Jan 27 16:04:59 Traceback (most recent call last): 
Jan 27 16:04:59   File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap 
Jan 27 16:04:59     self.run() 
Jan 27 16:04:59   File "/var/lib/jenkins/workspace/test/test_dataloader.py", line 333, in run 
Jan 27 16:04:59     super(ErrorTrackingProcess, self).run() 
Jan 27 16:04:59   File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run 
Jan 27 16:04:59     self._target(*self._args, **self._kwargs) 

🚧 1 upstream failure recognized by patterns:

These builds matched patterns, but were probably caused by upstream breakages:


This comment was automatically generated by Dr. CI.

Comment thread on aten/src/ATen/native/Distance.cpp (Outdated)

Contributor
Can we get a comment explaining the strategy here? It would also be good to compare it with the old implementation in the comment.

Collaborator Author

Shall I touch upon how division by zero is handled?

Collaborator Author

Ok, done!

Collaborator

So you are avoiding the overflow problems of the original implementation, but you are exposing yourself to underflow problems with yours (when normalizing, individual components may underflow and turn into 0). Do we have a feeling for why one sort of tradeoff is better than the other?

@nikitaved (Collaborator, Author), Dec 20, 2019

Underflow is not a problem: floats are at their best when the value is small (finer absolute precision), and their precision deteriorates as the absolute value increases.
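A small float32 illustration of that point (the specific values are chosen purely for demonstration):

    import torch

    # The same absolute increment (1e-8) is lost at magnitude 1.0 but preserved
    # at magnitude 1e-4: absolute precision improves as the magnitude shrinks.
    print(torch.tensor(1.0,  dtype=torch.float32) + 1e-8 == torch.tensor(1.0))   # True
    print(torch.tensor(1e-4, dtype=torch.float32) + 1e-8 == torch.tensor(1e-4))  # False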

@nikitaved (Collaborator, Author), Dec 20, 2019

It is also one of the reasons why you normalize, say, images, before feeding them to a network... All the inner products inside can cause trouble otherwise.

Collaborator Author

Ah, sorry, I assume that this function is used with floating point types. Would that be a correct assumption?

Collaborator

Yes, this function is used with floating point types. But you are right, usually underflow is less of a problem than overflow. It's possible to create an adversarial example, though, where instead of a cosine similarity close to 1 you'd get a significantly smaller value.

@ngimel added the "triaged" label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Dec 20, 2019
@rgommers (Collaborator) commented Jan 2, 2020

@nikitaved can you add regression tests for cases that used to give > 1. or < -1. return value?

Also, it would be good to expand on this comment in the PR description:

P.S. I had to change tests because this implementation handles division by 0 differently.

The change in behavior is nicely explained in the comment, but it looks to me like the old test should still pass as well. It's best never to change tests in PRs where you also change behavior; that's typically a sign something is wrong. So I'd suggest changing that back, and only adding new tests.

@nikitaved (Collaborator, Author) commented Jan 2, 2020

@rgommers, ok, I will add the tests. The old test will not pass. In order to show that I need to write down the derivative and point out the difference, so it becomes rather "technical". Shall I at least write the explanation here?

@rgommers (Collaborator) commented Jan 2, 2020

In order to show that I need to write down the derivative and point out the difference, so it becomes rather "technical". Shall I at least write the explanation here?

Okay, just the explanation here will be useful.

Comment thread on aten/src/ATen/native/Distance.cpp (Outdated)

Collaborator

The change from max(foo, eps) to foo + eps is to enforce that the normalized vectors have norms < 1?
The goal being that floating-point errors in the dot product won't push us > 1?
How big should eps be to make sure this is always true?

@nikitaved (Collaborator, Author), Jan 3, 2020

It is to enforce that we do not divide by zero, and yes, as a side effect, it makes the norms less than 1. I can't say I am 100% sure that this will work for any epsilon, now that I think about it, but it does work for the standard epsilon (1e-8). In any case, this implementation should tolerate a larger set of epsilon values than the older algorithm, because the vectors in the inner product live in the unit ball, where the mantissa of floating-point values is utilized much better.

Collaborator

But the difference with the max is that if the norm is small (like 2*eps), you will make a much bigger error here. Is that something we should worry about?

@nikitaved (Collaborator, Author), Jan 3, 2020

Well, that is a philosophical dilemma of how to handle zeroes. I personally do not like using max, because it makes the function potentially non-differentiable... Are you ok with non-differentiability? Also, I did play with max, and had some issues with autograd not being able to process backward...

Collaborator

You can avoid this problem by doing this change in a non-differentiable manner, using NoGradGuard around the max operations. That way, it won't change the gradient computation.

@nikitaved (Collaborator, Author), Jan 3, 2020

Could you please explain why then clamping did not introduce any issues in the older implementation?

Collaborator

The current implementation does not have it.
And the other PR should have the NoGradGuard as well to avoid this issue; there is a comment in it asking to do so.

@ezyang added the labels "module: bc-breaking" (Related to a BC-breaking change) and "module: numerical-stability" (Problems related to numerical stability of operations) on Jan 7, 2020
@ezyang (Contributor) commented Jan 7, 2020

I am prepared to merge this, but @nikitaved I'd like to know if you were planning to incorporate Alban's suggestion (use max instead of adding epsilon)

@nikitaved (Collaborator, Author)

@ezyang, yes, I will try the things suggested by @albanD.

@nikitaved (Collaborator, Author)

@albanD, it seems like I am doing something wrong. This causes an error:

  auto x1_norm = at::norm(x1, 2, dim, true);
  auto x2_norm = at::norm(x2, 2, dim, true);

  {
    at::NoGradGuard guard;
    // in-place clamp on the outputs of at::norm, which are saved for backward
    x1_norm.clamp_max_(std::sqrt(eps));
    x2_norm.clamp_max_(std::sqrt(eps));
  }

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [4, 1]], which is output 0 of NormBackward1, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

@albanD (Collaborator) commented Jan 8, 2020

The problem is that you cannot change the result of the norm operation in place because, as you can see in the gradient formula, the value of the output is needed to compute the gradients.
You will need to do the clamp out of place, or define a custom backward for cosine_similarity.

@albanD (Collaborator) commented Jan 8, 2020

Right. Would this do what you want?

  auto x1_pow = at::pow(x1, 2).sum(dim, true);
  auto x2_pow = at::pow(x2, 2).sum(dim, true);

  {
    // clamp the squared norms with autograd disabled, so the clamp
    // does not change the gradient computation
    at::NoGradGuard guard;
    x1_pow.clamp_min_(eps);
    x2_pow.clamp_min_(eps);
  }

  auto x1_norm = x1_pow.sqrt_();
  auto x2_norm = x2_pow.sqrt_();
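A rough Python analogue of this suggestion (not the ATen code itself; shapes, names, and eps are placeholders). Here the clamped tensor is the output of sum, which is not saved for backward, so the guarded in-place clamp no longer trips the version check and leaves the gradient formula untouched:

    import torch

    eps = 1e-8
    x1 = torch.randn(4, 3, dtype=torch.double, requires_grad=True)
    x2 = torch.randn(4, 3, dtype=torch.double, requires_grad=True)

    x1_pow = x1.pow(2).sum(dim=1, keepdim=True)
    x2_pow = x2.pow(2).sum(dim=1, keepdim=True)

    # Clamp in place with autograd disabled: no gradient is zeroed out where
    # the clamp fires, and nothing saved for backward gets modified.
    with torch.no_grad():
        x1_pow.clamp_min_(eps)
        x2_pow.clamp_min_(eps)

    x1_norm = x1_pow.sqrt()
    x2_norm = x2_pow.sqrt()

    cos = ((x1 / x1_norm) * (x2 / x2_norm)).sum(dim=1)
    cos.sum().backward()  # runs without the in-place RuntimeError shown above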

@nikitaved (Collaborator, Author) commented Jan 8, 2020

Sorry, it is the sum that has to be clamped. I used at::norm because you told me it is more stable near 0. But then, similarly to the current implementation, it is sufficient to use the inner product, and that inner product can be clamped in place. Could you tell me why clamping outside a guarded block might be problematic?

@albanD (Collaborator) commented Jan 8, 2020

Could you tell me why clamping outside a guarded block might be problematic?

The guard won't change anything with respect to in-place checks; if it is problematic inside, it would be problematic without the guard as well.
In this case, it is problematic only because the output of norm is needed for backward, so you are not allowed to modify it in place.

Right, I edited the code above.

@nikitaved (Collaborator, Author) commented Jan 8, 2020

@albanD, sorry for the confusion. It is exactly the question I already asked you previously. The current implementation uses unguarded clamps. From your answer I understood that it is better to guard it. So, guarded or unguarded, which is better and why?

@nikitaved force-pushed the better-cosine-similarity branch from a06a264 to 0a8d2ee on January 8, 2020 17:45
@nikitaved (Collaborator, Author) commented Jan 8, 2020

@ezyang, I updated the code following the suggestions of @albanD. The example from #29442 now works without any clamping (note that the input vector to BCELoss has a negative value), but I decided to add the clamp just in case. Still, the test checking division by zero fails (because of the sqrt in the denominator). Do I have to do anything about it?

@albanD (Collaborator) commented Jan 8, 2020

This is the backward graph generated by the current code in master:
[screenshot of the backward graph omitted]

The question is whether the ClampMinMax node should be here or not. What it does is set the gradients flowing through it to 0 wherever clamping actually happened, and let them flow as-is otherwise.
If you add the guard, this node will be removed and the gradient will be unchanged even when clamp actually changed the value.

In my opinion, if you clamp, it is only for numerical stability and that should not zero-out any gradient. So you want to guard.
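A tiny illustration of the difference the guard makes (assuming standard clamp backward semantics; values chosen only for demonstration):

    import torch

    x = torch.tensor([0.0, 2.0], requires_grad=True)
    y = x.clamp(min=1e-8)   # the clamp only fires for the first element
    y.sum().backward()
    print(x.grad)           # tensor([0., 1.]): the gradient is zeroed exactly
                            # where clamping fired, which is what the guard avoids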

@nikitaved (Collaborator, Author)

@albanD, this is off-topic, but could you recommend something to read about autograd in PyTorch?

@albanD (Collaborator) commented Jan 8, 2020

You can check the current autograd note in the docs, which gives some info.
This PR attempts to clarify it: #30253, but it is still a work in progress.

@ezyang (Contributor) commented Jan 9, 2020

Re your question about division by zero, let's ask @ailzhang, who added this in #18250. Ailing, do you think it's OK to adjust the zero test given our different implementation strategy here?

@ezyang ezyang requested a review from ailzhang January 9, 2020 16:35
@nikitaved (Collaborator, Author) commented Jan 20, 2020

@ezyang, it is possible to keep the current testing approach, but for that I need the following logic implemented efficiently:

if (||x||^2 < eps OR ||y||^2 < eps) { ... }
else { ... }

Is there a way to do that without a loop over the batch dimension?
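One common way to express such a batched branch without an explicit loop is torch.where; a sketch with hypothetical placeholders for the two code paths (note that both branches are still evaluated for every element, torch.where only selects between them elementwise):

    import torch
    import torch.nn.functional as F

    def cosine_similarity_with_zero_branch(x, y, dim=-1, eps=1e-8):
        small = (x.pow(2).sum(dim) < eps) | (y.pow(2).sum(dim) < eps)
        near_zero_path = torch.zeros_like(small, dtype=x.dtype)  # placeholder for the "{ ... }" branch
        regular_path = F.cosine_similarity(x, y, dim=dim)        # placeholder for the "else { ... }" branch
        return torch.where(small, near_zero_path, regular_path)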

@nikitaved (Collaborator, Author)

@albanD, @ngimel, I have removed the clamp at ±1 to improve performance. Unfortunately, I was not able to use norm, since we need to condition the inputs prior to taking the sqrt, as the gradient relies on the squared reciprocals of the norms.

@albanD (Collaborator) left a review comment

Thanks for the update!

@nikitaved (Collaborator, Author)

Thank you, let's merge it then!

@nikitaved (Collaborator, Author)

@pytorchbot merge this please

@github-actions (Contributor)

Hey @nikitaved.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Apr 22, 2022
Summary:
**Previous behavior**: compute inner product, then normalize.
**This patch**: first normalize, then compute inner product. This should be more numerically stable because it avoids losing precision in inner product for inputs with large norms.
By design, this ensures that the cosine similarity is within `[-1.0, +1.0]`, so it should fix [#29442](#29442).

P.S. I had to change tests because this implementation handles division by 0 differently.
This PR computes cosine similarity as follows: <x/max(eps, ||x||), y/max(eps, ||y||)>.
Let f(x,y) = <x,y>/(||x|| * ||y||), then
df/dx = y/(||x|| * ||y||) - (||y||/||x|| * <x,y> * x)/(||x|| * ||y||)^2.
The changed test checks division by zero in backward when x=0 and y != 0.
For this case the non-zero part of the gradient is just y / (||x|| * ||y||).
The previous test expects y/(||x|| * ||y||) to evaluate to y / eps, while under this PR it evaluates to (1/eps) * y/||y||.

Pull Request resolved: #31378
Approved by: https://github.com/ezyang, https://github.com/albanD

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/9e137ee583c4fdb2dd3aa0c425dc9c289454cbf2

Reviewed By: seemethere

Differential Revision: D35850003

fbshipit-source-id: 5de5563377ad3fa8acc62d0d974b241ee5bc0af1
@lezcano (Collaborator) commented Jul 7, 2023

This implementation, albeit more stable, does not compute the cosine similarity as per the docs. If the norms of both x_1 and x_2 are less than eps, we are dividing by eps^2 rather than by eps, as the docs declare.

That being said, I like this implementation much better, numerically speaking, than the previous one, so I'm more in favour of fixing the docs than fixing the code really.
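A worked numeric illustration of the eps vs. eps^2 point (plain Python, computed from the two formulas rather than from any particular PyTorch version; the docs' formula is assumed to be <x, y> / max(eps, ||x|| * ||y||)):

    import math

    eps = 1e-8
    x = [1e-9, 1e-9, 1e-9]              # ||x|| ~= 1.7e-9 < eps
    dot = sum(a * a for a in x)         # <x, x> = 3e-18
    nrm = math.sqrt(dot)

    per_docs = dot / max(eps, nrm * nrm)               # 3e-18 / 1e-8  = 3e-10
    per_impl = dot / (max(eps, nrm) * max(eps, nrm))   # 3e-18 / 1e-16 = 3e-2
    print(per_docs, per_impl)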

@albanD (Collaborator) commented Jul 7, 2023

Updating the doc sounds better than adding another max(eps, denom) to push eps^2 to eps.

lezcano added a commit that referenced this pull request Jul 7, 2023
The behaviour of `cosine_similarity` was subtly changed in
#31378, but the docs were not
updated.

[ghstack-poisoned]
lezcano added a commit that referenced this pull request Jul 7, 2023
The behaviour of `cosine_similarity` was subtly changed in
#31378, but the docs were not
updated.

cc svekars carljparker

[ghstack-poisoned]
lezcano added a commit that referenced this pull request Jul 7, 2023
The behaviour of `cosine_similarity` was subtly changed in
#31378, but the docs were not
updated.

cc svekars carljparker

[ghstack-poisoned]
lezcano added a commit that referenced this pull request Jul 7, 2023
…ility and the gradients"

There was a regression in #31378
which was reported in #104564.
This PR should keep the efficiency and memory usage from the original
implementation, while keeping the stability of the latter.

This solution was already discussed in #31378,
but it was discarded because it modified the vector_norm in-place. The
only magic ingredient that was missing for that solution to work was to
add a `clone()` after calling the `vector_norm`.

I hope this PR takes shorter to land than #104564.

Fixes #104564

[ghstack-poisoned]
lezcano added a commit that referenced this pull request Jul 7, 2023
The behaviour of `cosine_similarity` was subtly changed in
#31378, but the docs were not
updated.

ghstack-source-id: 390edad
Pull Request resolved: #104772
lezcano added a commit that referenced this pull request Jul 25, 2023
The behaviour of `cosine_similarity` was subtly changed in
#31378, but the docs were not
updated.

ghstack-source-id: 308ef4d
Pull Request resolved: #104772
lezcano added a commit that referenced this pull request Jul 25, 2023
The behaviour of `cosine_similarity` was subtly changed in
#31378, but the docs were not
updated.

cc svekars carljparker

[ghstack-poisoned]
lezcano added a commit that referenced this pull request Jul 25, 2023
…keeping the stability and the gradients"

There was a regression in #31378
which was reported in #104564.
This PR should keep the efficiency and memory usage from the original
implementation, while keeping the stability of the latter.

This solution was already discussed in #31378,
but it was discarded because it modified the vector_norm in-place. The
only magic ingredient that was missing for that solution to work was to
add a `clone()` after calling the `vector_norm`.

I hope this PR takes shorter to land than #104564.

Fixes #104564

[ghstack-poisoned]
lezcano added a commit that referenced this pull request Jul 25, 2023
…ility and the gradients"

There was a regression in #31378
which was reported in #104564.
This PR should keep the efficiency and memory usage from the original
implementation, while keeping the stability of the latter.

This solution was already discussed in #31378,
but it was discarded because it modified the vector_norm in-place. The
only magic ingredient that was missing for that solution to work was to
add a `clone()` after calling the `vector_norm`.

I hope this PR takes shorter to land than #104564.

Fixes #104564

[ghstack-poisoned]
lezcano added a commit that referenced this pull request Jul 25, 2023
The behaviour of `cosine_similarity` was subtly changed in
#31378, but the docs were not
updated.

cc svekars carljparker

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Jul 26, 2023
…he gradients (#104771)

There was a regression in #31378
which was reported in #104564.
This PR should keep the efficiency and memory usage from the original
implementation, while keeping the stability of the latter.

This solution was already discussed in #31378,
but it was discarded because it modified the vector_norm in-place. The
only magic ingredient that was missing for that solution to work was to
add a `clone()` after calling the `vector_norm`.

I hope this PR takes shorter to land than #104564.

Fixes #104564

Pull Request resolved: #104771
Approved by: https://github.com/albanD
pytorchmergebot pushed a commit that referenced this pull request Jul 26, 2023
The behaviour of `cosine_similarity` was subtly changed in
#31378, but the docs were not
updated.

Pull Request resolved: #104772
Approved by: https://github.com/albanD, https://github.com/svekars
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
**Previous behavior**: compute inner product, then normalize.
**This patch**: first normalize, then compute inner product. This should be more numerically stable because it avoids losing precision in inner product for inputs with large norms.
By design, this ensures that the cosine similarity is within `[-1.0, +1.0]`, so it should fix [pytorch#29442](pytorch#29442).

P.S. I had to change tests because this implementation handles division by 0 differently.
This PR computes cosine similarity as follows: <x/max(eps, ||x||), y/max(eps, ||y||)>.
Let f(x,y) = <x,y>/(||x|| * ||y||), then
df/dx = y/(||x|| * ||y||) - (||y||/||x|| * <x,y> * x)/(||x|| * ||y||)^2.
The changed test checks division by zero in backward when x=0 and y != 0.
For this case the non-zero part of the gradient is just y / (||x|| * ||y||).
The previous test expects y/(||x|| * ||y||) to evaluate to y / eps, while under this PR it evaluates to (1/eps) * y/||y||.
Pull Request resolved: pytorch#31378
Approved by: https://github.com/ezyang, https://github.com/albanD

Labels

cla signed · module: bc-breaking (Related to a BC-breaking change) · module: numerical-stability (Problems related to numerical stability of operations) · open source · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

torch.nn.functional.cosine_similarity(x, y) result greater than 1.0, and x != y.

10 participants