[MRG+1] Affinity propagation edge cases (#9612) #9635
jnothman merged 26 commits into scikit-learn:master
Conversation
As discussed in issue scikit-learn#9612, cluster centers are expected to be an empty array and labels to be unique for every sample.
Returns an empty list as cluster center indices to prevent adding a dimension in the fit() method, and returns unique labels for samples, making this consistent with (TBD) predict() behavior for non-convergence.
In this case, it will log a warning and return unique labels for every new sample.
Added helper function for detecting mutually equal similarities and preferences
jnothman
left a comment
Great work, thanks!
Could you please test that/when a warning is raised with assert_warns? (In this PR or elsewhere we should also use a ConvergenceWarning in the existing non-convergence case...)
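A warning-raising path can also be checked without sklearn's test helper. The sketch below verifies that a warning is raised using only the stdlib; fit_model() is a hypothetical stand-in for the fit() call under test, not the PR's actual code:

```python
import warnings

def fit_model():
    # Hypothetical stand-in for a fit() call that hits the
    # non-convergence path and emits the warning under test.
    warnings.warn("Affinity propagation did not converge", UserWarning)
    return "fitted"

# Stdlib equivalent of sklearn's assert_warns test helper:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = fit_model()

assert any(issubclass(w.category, UserWarning) for w in caught)
```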
if self.cluster_centers_.size > 0:
    return pairwise_distances_argmin(X, self.cluster_centers_)
else:
    logger.warning("This model does not have any cluster centers "
For good or bad, we use warnings.warn, not logging...
    # cluster equal to the single sample
    return (np.array([0]), np.array([0]), 0) if return_n_iter \
        else (np.array([0]), np.array([0]))
elif equal_similarities_and_preferences(S, preference):
Can't you just do if n_samples == 1 or equal... here?
labels = np.empty((n_samples, 1))
cluster_centers_indices = None
labels.fill(np.nan)
logger.warning("Affinity propagation did not converge, this model "
Use a sklearn.exceptions.ConvergenceWarning
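For reference, sklearn.exceptions.ConvergenceWarning subclasses UserWarning. The sketch below uses a local stand-in class so it stays self-contained rather than importing sklearn:

```python
import warnings

class ConvergenceWarning(UserWarning):
    # Stand-in mirroring sklearn.exceptions.ConvergenceWarning,
    # which subclasses UserWarning.
    pass

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.warn("Affinity propagation did not converge, this model "
                  "will not have any cluster centers.",
                  ConvergenceWarning)

assert issubclass(caught[0].category, ConvergenceWarning)
```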
# It makes no sense to run the algorithm in this case, so return 1 or
# n_samples clusters, depending on preferences
if np.array(preference).flat[0] >= S.flat[1]:
    return (np.arange(n_samples), np.arange(n_samples), 0) \
Does this case deserve a warning?
n_samples == 1 case does not need a separate return statement.
Added assertions for warnings in tests.
def all_equal_similarities():
    # Fill "diagonal" of S with first similarity value in S
    S.flat[::S.shape[0] + 1] = S.flat[1]
I don't think we should be modifying S...?
from ..metrics import pairwise_distances_argmin
def equal_similarities_and_preferences(S, preference):
prefix with _ to make clear this is not public API
warnings.warn("All samples have mutually equal similarities, "
              "returning arbitrary cluster center(s).")
if np.array(preference).flat[0] >= S.flat[n_samples - 1]:
    return (np.arange(n_samples), np.arange(n_samples), 0) \
Aesthetics:
if np.array(preference).flat[0] >= S.flat[n_samples - 1]:
    return ((np.arange(n_samples), np.arange(n_samples), 0)
            if return_n_iter
            else (np.arange(n_samples), np.arange(n_samples)))
# n_samples clusters, depending on preferences
warnings.warn("All samples have mutually equal similarities, "
              "returning arbitrary cluster center(s).")
if np.array(preference).flat[0] >= S.flat[n_samples - 1]:
I think we can convert preference to an array outside of this condition and avoid repeatedly casting it...
if n_samples == 1 or equal_similarities_and_preferences(S, preference):
    # It makes no sense to run the algorithm in this case, so return 1 or
    # n_samples clusters, depending on preferences
    warnings.warn("All samples have mutually equal similarities, "
# It makes no sense to run the algorithm in this case, so return 1 or
# n_samples clusters, depending on preferences
warnings.warn("All samples have mutually equal similarities, "
              "returning arbitrary cluster center(s).")
should this be a ConvergenceWarning? I suppose UserWarning is fine.
It is a UserWarning right now
# Force non-convergence by allowing only a single iteration
af = AffinityPropagation(preference=-10, max_iter=1).fit(X)

assert_array_equal(np.array([0, 1, 2]), af.predict(X))
I'm afraid this doesn't make sense. We can't predict every sample as being in its own cluster in the inductive case, if this means there are more clusters at predict time than at fit time. The implication here is also that some of the predicted items are clustered with the training points in the same position, which does not make sense.
Options:
- instead mark all points as not clustered, i.e. label -1 as in dbscan
- raise an error in predict if there are no exemplars
- make each training point an exemplar
Good catch! That will prevent some confusion...
Returning a unique label for every sample in X suggests that these were based on actual clusters. Since there are no clusters, it makes more sense to return a negative label for all samples, indicating there were no clusters.
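The chosen behavior can be sketched as follows; _predict_without_centers is a hypothetical helper illustrating the -1 "noise" labels, not sklearn's actual API:

```python
import warnings
import numpy as np

def _predict_without_centers(X):
    # Hypothetical sketch of the chosen behavior: when fit() left no
    # exemplars, label every sample as noise (-1), as DBSCAN does,
    # and warn the caller instead of raising.
    warnings.warn("This model does not have any cluster centers "
                  "because affinity propagation did not converge. "
                  "Labeling every sample as '-1'.")
    return np.full(X.shape[0], -1, dtype=int)

X = np.zeros((4, 2))
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    labels = _predict_without_centers(X)  # array([-1, -1, -1, -1])
```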
if damping < 0.5 or damping >= 1:
    raise ValueError('damping must be >= 0.5 and < 1')

preference_array = np.array(preference)
any reason not to reuse the name preference? It's quite clear from how it's used (e.g. .flat) that it must be an array.
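The suggestion, sketched with toy data (the variable values here are illustrative assumptions, not the PR's exact code):

```python
import numpy as np

preference = -10          # scalar or array-like, as the caller passed it
S = np.full((3, 3), 5.0)  # toy similarity matrix
n_samples = S.shape[0]

# Cast once and reuse the name; the .flat access makes it clear
# the value is now an array.
preference = np.array(preference)
if preference.flat[0] >= S.flat[n_samples - 1]:
    labels = np.arange(n_samples)            # every sample its own exemplar
else:
    labels = np.zeros(n_samples, dtype=int)  # one arbitrary cluster
```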
else:
    warnings.warn("This model does not have any cluster centers "
                  "because affinity propagation did not converge. "
                  "Returning unique labels for the provided samples.")
No longer the case.
Are you sure we don't just want to raise an error when a user tries to predict with such a model? Perhaps an array of -1 makes sense in fit_predict if we're going to do it here..?
predict() unexpectedly raising an error was the initial reason this PR got started. An error during predict() would be appropriate if the caller provided incorrect data, IMHO. Since these models can potentially live for a long time (in my use case, they're trained infrequently and deserialized later with joblib.load()), I wouldn't want to make the caller of predict() responsible for dealing with potential crashes caused by issues during training of the models. Returning -1 as labels (and issuing the warning) seems a bit friendlier to the caller.
I'll address the incorrect warning message.
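The long-lived-model scenario can be sketched with stdlib pickle standing in for joblib, and a hypothetical TinyModel in place of the real estimator:

```python
import pickle

class TinyModel:
    # Hypothetical stand-in for a fitted estimator that is serialized
    # once and deserialized much later (the comment uses joblib;
    # plain pickle keeps this sketch dependency-free).
    def __init__(self, cluster_centers):
        self.cluster_centers_ = cluster_centers

    def predict(self, X):
        if len(self.cluster_centers_) == 0:
            # Non-converged model: degrade gracefully instead of raising.
            return [-1] * len(X)
        return [0] * len(X)

blob = pickle.dumps(TinyModel(cluster_centers=[]))
model = pickle.loads(blob)              # reloaded long after training
labels = model.predict([[0.0], [1.0]])  # [-1, -1], no crash
```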
Codecov Report
@@ Coverage Diff @@
## master #9635 +/- ##
==========================================
+ Coverage 96.16% 96.16% +<.01%
==========================================
Files 336 336
Lines 62102 62154 +52
==========================================
+ Hits 59720 59772 +52
Misses 2382 2382
Continue to review full report at Codecov.
But then don't you think that labels_ should also be -1 then?
Definitely. I overlooked that part - busy with too many things at the same time :-/
jnothman
left a comment
This is good, except that it deserves some documentation. Perhaps a Notes section in both class and function docstrings?
LGTM, thanks! Another review?
+1 for MRG please just update what's new bug fix section and let's merge
Not too sure what it means to "update ... bug fix section", was any action expected from my side?
see doc/whats_new.rst
…On 5 September 2017 at 17:08, Jonatan Samoocha wrote:
Not too sure what it means to "update ... bug fix section", was any action expected from my side?
Thanks!
…earn#9635)
* Added test exposing non-convergence issues. As discussed in issue scikit-learn#9612, expecting cluster centers to be an empty array and labels to be unique for every sample.
* Addresses non-convergence issues. Returns empty list as cluster center indices to prevent adding a dimension in fit() method, returns unique labels for samples making this consistent with (TBD) predict() behavior for non-convergence.
* Made predict() handle case of non-convergence while fitting. In this case, it will log a warning and return unique labels for every new sample.
* Added helper function for detecting mutually equal similarities and preferences
* Tidied imports
* Immediately returning trivial clusters and labels in case of equal similarities and preferences
* Simplified code for preference(s) equality test
* Corrected for failing unit tests covering case of n_samples=1
* Corrected for PEP8 line too long
* Rewriting imports to comply with max 80-column lines
* Simplified code. n_samples == 1 case does not need a separate return statement.
* Replaced logging warnings by warnings.warn(). Added assertions for warnings in tests.
* Marking function as non-public
* Using mask instead of modifying S
* Improvement suggested by review comment
* Avoided casting preference to array twice
* Readability improvements
* Improved returned labels in case of no cluster centers. Returning a unique label for every sample in X suggests that these were based on actual clusters. Since there are no clusters, it makes more sense to return a negative label for all samples, indicating there were no clusters.
* PEP8 line too long
* Avoided creating separate variable for preference as array
* Corrected warning message
* Making labels consistent with predict() behavior in case of non-convergence
* Minor readability improvement
* Added detail to test comment about expected result
* Added documentation about edge cases
* Added documentation to 'what's new'
Reference Issue
Fixes #9612
What does this implement/fix? Explain your changes.
AffinityPropagation.predict(X) would fail in case of non-convergence of the algorithm when fitting the model. It now returns the same label ('-1' for noise) for every sample in X in that case. AffinityPropagation.fit() behavior was undefined and sometimes arbitrary (depending on preference and/or damping values) for cases when training samples had equal mutual similarities. It now behaves in a consistent way for these edge cases, i.e. returning one cluster when preference < mutual similarities and returning n_samples clusters when preference >= mutual similarities.
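The documented edge-case behavior can be sketched as follows; _trivial_labels is a hypothetical helper illustrating the rule, not the actual scikit-learn implementation:

```python
import numpy as np

def _trivial_labels(S, preference):
    # Hypothetical sketch of the edge-case rule for mutually equal
    # similarities: n_samples clusters when preference >= similarity,
    # otherwise a single cluster.
    n_samples = S.shape[0]
    if np.array(preference).flat[0] >= S.flat[1]:
        return np.arange(n_samples)
    return np.zeros(n_samples, dtype=int)

S = np.full((3, 3), 5.0)                          # equal similarities
one_cluster = _trivial_labels(S, preference=-10)  # [0 0 0]
per_sample = _trivial_labels(S, preference=10)    # [0 1 2]
```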