#4475 : Add a safe_pairwise_distances function, dealing with zero varian...#4495
#4475 : Add a safe_pairwise_distances function, dealing with zero varian...#4495LowikC wants to merge 2 commits intoscikit-learn:mainfrom
Conversation
…ith zero variance samples when using correlation metric. The best fix would be to have the metric not returning NaN values, but as the correlation metric is actually computed by spicy, we can't modify it directly. So, in case of metric=='correlation', we replace rows and cols corresponding to zero variance samples by the maximum distance (here 1.0).
|
Why do you define a new function and not fix |
|
As there are several returns statements in pairwise_distances, and the check I added must be done for several of them, it seemed to me it would be clearer to do it after applying the original function. I pushed a new commit with the fix directly in pairwise_distances |
There was a problem hiding this comment.
why not an else here, so you don't need to initialize distances?
|
The implementation of correlation is 7 lines: https://github.com/scipy/scipy/blob/master/scipy/spatial/distance.py#L328 |
|
To explain my reasoning: Adding another general function to the public interface for a fix seemed overkill for a single metric fix. Adding metric-specific code in a very general function seemed like a bad idea (this function could become very long if every metric needed a fix). |
|
I agree that it would cleaner to fix directly the metric, but there are several points I'd like to clarify:
|
|
It would be fine to add a metric that only supports dense matrices. |
|
@Nzeuwik, do you intend to follow up @amueller's suggestion of implementing a metric? |
|
I am closing this PR since the original issue was closed: #4475. Thank you for working on this PR! |
#4475 : Add a safe_pairwise_distances function, dealing with zero variance samples when using correlation metric.
The best fix would be to have the metric not returning NaN values, but as the correlation metric is actually computed by scipy, we can't modify it directly.
So, when metric=='correlation', we replace rows and cols corresponding to zero variance samples by the maximum distance (here 1.0).
This change is