[MRG+1] Fix regression in silhouette_score for clusters of size 1 #7438
ogrisel merged 7 commits into scikit-learn:master
Conversation
the test could probably be strengthened
@jnothman here is an excerpt from the Silhouette paper mentioned in the docstring:
http://www.sciencedirect.com/science/article/pii/0377042787901257

Unfortunately your PR does not seem to implement this:

>>> import numpy as np
>>> from sklearn.metrics import silhouette_samples
>>> silhouette_samples(np.array([[0], [0.1], [1]]), np.array([0, 0, 1]))
array([ 0.9       ,  0.88888889, -0.05      ])

The last value of that array should be 0 instead of -0.05.
Good catch. Though I think it might have been similarly misbehaved in 0.17. And annoyingly, the main example in the paper appears not to cover this case.

Yes, this patch is wrong in a few ways.

@ogrisel, what do you think the behaviour should be for a cluster consisting exclusively of multiple identical points? Score 0 or 1? Should we leave this broken for 0.18.0? Add a note that we know it's broken?

I think it should be 1, by continuity with a cluster of 2 very close points.
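The continuity argument can be sketched with a toy calculation (the point layout below is made up for illustration, not taken from the PR):

```python
def silhouette_pair(eps):
    # Toy layout: cluster A = {0.0, eps}, cluster B = {10.0}.
    # For the point at 0.0: a = eps (distance to its only cluster-mate),
    # b = 10.0 (mean distance to cluster B), s = (b - a) / max(a, b).
    a = eps
    b = 10.0
    return (b - a) / max(a, b)

# As eps -> 0 the score tends to 1, so scoring a cluster of identical
# points as 1 is the continuous limit.
for eps in [1.0, 0.1, 0.001, 0.0]:
    print(round(silhouette_pair(eps), 4))  # 0.9, 0.99, 0.9999, 1.0
```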
ogrisel
left a comment
That would be great to fix this before the 0.18 release although I would not make it a strict blocker.
assert_false(np.isnan(silhouette))
ss = silhouette_samples(D, labels, metric='precomputed')
assert_false(np.isnan(ss).any())
assert_equal(ss[1], 0.)
Instead of using a random precomputed metric array, could you please use a toy 2D X array with shape (4, 1) in 1D euclidean space, such as X = [[1.], [0.], [1.], [2.]], and then check that the output of silhouette_samples is [1., 0., 1., 0.] as expected?
Thanks for the review and for handing an example to me on a silver platter.
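To make the expected output concrete, here is a minimal re-implementation of the silhouette formula (my own sketch, not the sklearn code; the labels [0, 1, 0, 2] are an assumption, since the review only gives X):

```python
import numpy as np

def silhouette_samples_sketch(X, labels):
    # Illustrative silhouette: a(i) = mean distance to the other members of
    # i's cluster, b(i) = smallest mean distance to another cluster,
    # s(i) = (b - a) / max(a, b). Singleton clusters get 0 by convention.
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.abs(X[:, None, 0] - X[None, :, 0])  # pairwise 1-D distances
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:          # cluster of size 1: define s(i) = 0
            continue
        a = D[i, same].sum() / (same.sum() - 1)
        b = min(D[i, labels == lab].mean()
                for lab in np.unique(labels) if lab != labels[i])
        scores[i] = 0.0 if max(a, b) == 0 else (b - a) / max(a, b)
    return scores

# Two identical points form cluster 0; the two other points are singletons.
print(silhouette_samples_sketch([[1.], [0.], [1.], [2.]], [0, 1, 0, 2]))
# [1. 0. 1. 0.]
```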
48aa0ef to 3c850dd
ogrisel
left a comment
Please consider the following comments:
# silhouette = [.5, .5, 0]
# Cluster 3: intra-cluster = [0, 0]
#            inter-cluster = [arbitrary, arbitrary]
#            silhouette = [1., 1.]
Please rename the cluster ids (1, 2, 3) in this comment to match 0, 1, and 2 used in the labels variable.
# Find inter_clust_dist for all samples belonging to the same
# label.
mask = labels == curr_label
This is now broken if labels are not contiguous integers starting at zero, no?

Yes, it was broken previously :-/
# is not discussed in reference material, and we choose for it a sample
# score of 1.
X = [[0.], [1.], [1.], [2.], [3.], [3.]]
labels = np.array([0, 1, 1, 1, 2, 2])
Actually please update this test to have non-contiguous labels that do not start at zero, for instance:

labels = np.array([1, 3, 3, 3, 4, 4])

I think that would be confusing here. Indeed there's already a "test_non_encoded_labels", and if I add a *2 in it, it's gappy. Actually, it was only testing silhouette_score, not silhouette_samples. Changed. Also removed the double encoding of labels from silhouette_score.
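For reference, one standard way to make non-contiguous, non-zero-based labels safe for per-cluster indexing is to re-encode them (a sketch; the PR itself may use a different mechanism, such as sklearn's LabelEncoder, which behaves like np.unique here):

```python
import numpy as np

# Labels such as [1, 3, 3, 3, 4, 4] are neither contiguous nor zero-based,
# so using the raw label values to index per-cluster arrays would break.
# np.unique with return_inverse=True re-encodes them as 0..n_clusters-1
# while preserving cluster membership.
labels = np.array([1, 3, 3, 3, 4, 4])
unique_labels, encoded = np.unique(labels, return_inverse=True)
print(unique_labels)  # [1 3 4]
print(encoded)        # [0 1 1 1 2 2]
```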
@@ -205,6 +209,8 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds):

sil_samples = inter_clust_dists - intra_clust_dists
sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
This will still raise a division by zero warning if all samples are duplicates of one another and all belong to the same cluster. But maybe we don't care that much about such a rare edge case...

Also if they belong to different clusters. Quick example:

silhouette_samples([[1.0], [1.0], [1.0], [1.0]], [0, 0, 1, 1])

However, according to the formula, this might be expected behaviour.

If all samples belong to one cluster, silhouette_score is undefined according to the paper. All points being identical is not covered in the paper. Understandably.
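The degenerate 0/0 case can be reproduced and guarded without sklearn (a sketch with made-up arrays; whether this case should be silenced at all is exactly the open question above):

```python
import numpy as np

# All samples identical, split across two clusters: every intra- and
# inter-cluster distance is zero, so the silhouette ratio is 0/0.
intra_clust_dists = np.array([0.0, 0.0, 0.0, 0.0])
inter_clust_dists = np.array([0.0, 0.0, 0.0, 0.0])

denom = np.maximum(intra_clust_dists, inter_clust_dists)
with np.errstate(divide="ignore", invalid="ignore"):
    sil = (inter_clust_dists - intra_clust_dists) / denom  # NaNs, no warning
sil = np.nan_to_num(sil)  # define 0/0 as a score of 0 for this degenerate input
print(sil)  # [0. 0. 0. 0.]
```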
# (cluster 0). We also test the case where there are identical samples
# as the only members of a cluster (cluster 2). To our knowledge, this case
# is not discussed in reference material, and we choose for it a sample
# score of 1.
Why is having identical samples in a cluster a special case? It comes directly from the formula, right?

After fixing i, a(i) can be viewed as the mean distance to all other identical points, which is zero, and b(i) can be anything. So the formula reduces to b(i) / b(i), which is 1, no?

Why is having identical samples in a cluster a special case?

It's not a special case, it just represents a non-smooth part of the function, where you might have expected, for clustering evaluation, that duplicating a point and assigning it the same label should not affect the score.

As a matter of implementation, I could have, even accidentally, handled the single-point cluster case in a way that assigned this 0, so it's important that we make a decision on it and test it.
LGTM pending @ogrisel's comments
intra_clust_dists[mask] = np.sum(
    current_distances[:, mask], axis=1) / n_samples_curr_lab
else:
    intra_clust_dists[mask] = 0
Is it worthwhile to initialize intra_clust_dists with np.zeros at the start and get rid of the else?
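The suggestion, sketched (variable names follow the diff, but the toy distance matrix and the (n - 1) normalization of the paper's a(i) are my assumptions, not the PR's exact code):

```python
import numpy as np

# Pre-filling intra_clust_dists with zeros makes the
# `else: intra_clust_dists[mask] = 0` branch redundant: singleton
# clusters simply keep their zero entries.
current_distances = np.array([[0.0, 1.0, 1.0],
                              [1.0, 0.0, 2.0],
                              [1.0, 2.0, 0.0]])
labels = np.array([0, 0, 1])

intra_clust_dists = np.zeros(len(labels))  # zeros up front: no else needed
for curr_label in np.unique(labels):
    mask = labels == curr_label
    n = mask.sum()
    if n > 1:  # mean distance to the other members of the same cluster
        intra_clust_dists[mask] = (
            current_distances[np.ix_(mask, mask)].sum(axis=1) / (n - 1))
print(intra_clust_dists)  # [1. 1. 0.]
```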
silhouette_score(X, labels + 10), silhouette_score(X, labels))
silhouette_score(X, labels * 2 + 10), silhouette_score(X, labels))
assert_array_equal(
    silhouette_samples(X, labels * 2 + 10), silhouette_samples(X, labels))
* tag '0.18': (1286 commits)
  [MRG + 1] More versionadded everywhere! (scikit-learn#7403)
  minor doc fixes
  fix lbfgs rename (scikit-learn#7503)
  minor fixes to whatsnew
  fix scoring function table
  fix rebase messup
  DOC more what's new subdivision
  DOC Attempt to impose some order on What's New 0.18
  no fixed width within bold
  REL changes for release in 0.18.X branch (scikit-learn#7414)
  [MRG+2] Timing and training score in GridSearchCV (scikit-learn#7325)
  DOC: Added Nested Cross Validation Example (scikit-learn#7111)
  Sync docstring and definition default argument in kneighbors (scikit-learn#7476)
  added contributors for 0.18, minor formatting fixes.
  Fix typo in whats_new.rst
  [MRG+2] FIX adaboost estimators not randomising correctly (scikit-learn#7411)
  Addressing issue scikit-learn#7468. (scikit-learn#7472)
  Reorganize README
  clean up deprecation warning stuff in common tests
  [MRG+1] Fix regression in silhouette_score for clusters of size 1 (scikit-learn#7438)
  ...
Fixes the regression described in #5988. #6089, and the initial patch here, were incorrect.

Replaces #6089. This is #6089 rebased with an error context manager.