[MRG+1] Affinity propagation edge cases (#9612) #9635
jnothman merged 26 commits into scikit-learn:master
Conversation
As discussed in issue scikit-learn#9612, cluster centers are expected to be an empty array and labels to be unique for every sample.
Returns an empty list as cluster center indices to prevent adding a dimension in the fit() method, and returns unique labels for samples, making this consistent with (TBD) predict() behavior for non-convergence.
In this case, it will log a warning and return unique labels for every new sample.
Added helper function for detecting mutually equal similarities and preferences
jnothman
left a comment
Great work, thanks!
Could you please test that/when a warning is raised with assert_warns? (In this PR or elsewhere we should also use a ConvergenceWarning in the existing non-convergence case...)
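A warning-raising path can also be checked without sklearn's test helper. The sketch below verifies that a warning is raised using only the stdlib; fit_model() is a hypothetical stand-in for the fit() call under test, not the PR's actual code:

```python
import warnings

def fit_model():
    # Hypothetical stand-in for a fit() call that hits the
    # non-convergence path and emits the warning under test.
    warnings.warn("Affinity propagation did not converge", UserWarning)
    return "fitted"

# Stdlib equivalent of sklearn's assert_warns test helper:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = fit_model()

assert any(issubclass(w.category, UserWarning) for w in caught)
```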
if self.cluster_centers_.size > 0:
    return pairwise_distances_argmin(X, self.cluster_centers_)
else:
    logger.warning("This model does not have any cluster centers "
For good or bad, we use warnings.warn, not logging...
    # cluster equal to the single sample
    return (np.array([0]), np.array([0]), 0) if return_n_iter \
        else (np.array([0]), np.array([0]))
elif equal_similarities_and_preferences(S, preference):
Can't you just do if n_samples == 1 or equal... here?
labels = np.empty((n_samples, 1))
cluster_centers_indices = None
labels.fill(np.nan)
logger.warning("Affinity propagation did not converge, this model "
Use a sklearn.exceptions.ConvergenceWarning
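For reference, sklearn.exceptions.ConvergenceWarning subclasses UserWarning. The sketch below uses a local stand-in class so it stays self-contained rather than importing sklearn:

```python
import warnings

class ConvergenceWarning(UserWarning):
    # Stand-in mirroring sklearn.exceptions.ConvergenceWarning,
    # which subclasses UserWarning.
    pass

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.warn("Affinity propagation did not converge, this model "
                  "will not have any cluster centers.",
                  ConvergenceWarning)

assert issubclass(caught[0].category, ConvergenceWarning)
```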
# It makes no sense to run the algorithm in this case, so return 1 or
# n_samples clusters, depending on preferences
if np.array(preference).flat[0] >= S.flat[1]:
    return (np.arange(n_samples), np.arange(n_samples), 0) \
Does this case deserve a warning?
n_samples == 1 case does not need a separate return statement.
Added assertions for warnings in tests.
def all_equal_similarities():
    # Fill "diagonal" of S with first similarity value in S
    S.flat[::S.shape[0] + 1] = S.flat[1]
I don't think we should be modifying S...?
from ..metrics import pairwise_distances_argmin
def equal_similarities_and_preferences(S, preference):
prefix with _ to make clear this is not public API
warnings.warn("All samples have mutually equal similarities, "
              "returning arbitrary cluster center(s).")
if np.array(preference).flat[0] >= S.flat[n_samples - 1]:
    return (np.arange(n_samples), np.arange(n_samples), 0) \
Aesthetics:
if np.array(preference).flat[0] >= S.flat[n_samples - 1]:
    return ((np.arange(n_samples), np.arange(n_samples), 0)
            if return_n_iter
            else (np.arange(n_samples), np.arange(n_samples)))
# n_samples clusters, depending on preferences
warnings.warn("All samples have mutually equal similarities, "
              "returning arbitrary cluster center(s).")
if np.array(preference).flat[0] >= S.flat[n_samples - 1]:
I think we can convert preference to an array outside of this condition and avoid repeatedly casting it...
if n_samples == 1 or equal_similarities_and_preferences(S, preference):
    # It makes no sense to run the algorithm in this case, so return 1 or
    # n_samples clusters, depending on preferences
    warnings.warn("All samples have mutually equal similarities, "
# It makes no sense to run the algorithm in this case, so return 1 or
# n_samples clusters, depending on preferences
warnings.warn("All samples have mutually equal similarities, "
              "returning arbitrary cluster center(s).")
should this be a ConvergenceWarning? I suppose UserWarning is fine.
It is a UserWarning right now
# Force non-convergence by allowing only a single iteration
af = AffinityPropagation(preference=-10, max_iter=1).fit(X)

assert_array_equal(np.array([0, 1, 2]), af.predict(X))
I'm afraid this doesn't make sense. We can't predict every sample as being in its own cluster in the inductive case, if this means there are more clusters at predict time than at fit time. The implication here is also that some of the predicted items are clustered with the training points in the same position, which does not make sense.
Options:
- instead mark all points as not clustered, i.e. label -1 as in dbscan
- raise an error in predict if there are no exemplars
- make each training point an exemplar
Good catch! That will prevent some confusion...
Returning a unique label for every sample in X suggests that these were based on actual clusters. Since there are no clusters, it makes more sense to return a negative label for all samples, indicating there were no clusters.
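The chosen behavior can be sketched as follows; _predict_without_centers is a hypothetical helper illustrating the -1 "noise" labels, not sklearn's actual API:

```python
import warnings
import numpy as np

def _predict_without_centers(X):
    # Hypothetical sketch of the chosen behavior: when fit() left no
    # exemplars, label every sample as noise (-1), as DBSCAN does,
    # and warn the caller instead of raising.
    warnings.warn("This model does not have any cluster centers "
                  "because affinity propagation did not converge. "
                  "Labeling every sample as '-1'.")
    return np.full(X.shape[0], -1, dtype=int)

X = np.zeros((4, 2))
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    labels = _predict_without_centers(X)  # array([-1, -1, -1, -1])
```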
if damping < 0.5 or damping >= 1:
    raise ValueError('damping must be >= 0.5 and < 1')

preference_array = np.array(preference)
any reason not to reuse the name preference? It's quite clear from how it's used (e.g. .flat) that it must be an array.
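The suggestion, sketched with toy data (the variable values here are illustrative assumptions, not the PR's exact code):

```python
import numpy as np

preference = -10          # scalar or array-like, as the caller passed it
S = np.full((3, 3), 5.0)  # toy similarity matrix
n_samples = S.shape[0]

# Cast once and reuse the name; the .flat access makes it clear
# the value is now an array.
preference = np.array(preference)
if preference.flat[0] >= S.flat[n_samples - 1]:
    labels = np.arange(n_samples)            # every sample its own exemplar
else:
    labels = np.zeros(n_samples, dtype=int)  # one arbitrary cluster
```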
else:
    warnings.warn("This model does not have any cluster centers "
                  "because affinity propagation did not converge. "
                  "Returning unique labels for the provided samples.")
No longer the case.
Are you sure we don't just want to raise an error when a user tries to predict with such a model? Perhaps an array of -1 makes sense in fit_predict if we're going to do it here..?
predict() unexpectedly raising an error was the initial reason this PR got started. An error during predict() would be appropriate if the caller provided incorrect data, IMHO. Since these models can potentially live for a long time (in my use case, they're trained infrequently and deserialized later with joblib.load()), I wouldn't want to make the caller of predict() responsible for dealing with potential crashes caused by issues during training of the models. Returning -1 as labels (and issuing the warning) seems a bit friendlier to the caller.
I'll address the incorrect warning message.
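The long-lived-model scenario can be sketched with stdlib pickle standing in for joblib, and a hypothetical TinyModel in place of the real estimator:

```python
import pickle

class TinyModel:
    # Hypothetical stand-in for a fitted estimator that is serialized
    # once and deserialized much later (the comment uses joblib;
    # plain pickle keeps this sketch dependency-free).
    def __init__(self, cluster_centers):
        self.cluster_centers_ = cluster_centers

    def predict(self, X):
        if len(self.cluster_centers_) == 0:
            # Non-converged model: degrade gracefully instead of raising.
            return [-1] * len(X)
        return [0] * len(X)

blob = pickle.dumps(TinyModel(cluster_centers=[]))
model = pickle.loads(blob)              # reloaded long after training
labels = model.predict([[0.0], [1.0]])  # [-1, -1], no crash
```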
Codecov Report
@@ Coverage Diff @@
## master #9635 +/- ##
==========================================
+ Coverage 96.16% 96.16% +<.01%
==========================================
Files 336 336
Lines 62102 62154 +52
==========================================
+ Hits 59720 59772 +52
Misses 2382 2382
Continue to review full report at Codecov.
But then don't you think that labels_ should also be -1 then?
Definitely. I overlooked that part - busy with too many things at the same time :-/
jnothman
left a comment
This is good, except that it deserves some documentation. Perhaps a Notes section in both class and function docstrings?
LGTM, thanks! Another review?
+1 for MRG please just update what's new bug fix section and let's merge
Not too sure what it means to "update ... bug fix section", was any action expected from my side?
see doc/whats_new.rst
…On 5 September 2017 at 17:08, Jonatan Samoocha wrote:
Not too sure what it means to "update ... bug fix section", was any action expected from my side?
Thanks!
…earn#9635)
* Added test exposing non-convergence issues. As discussed in issue scikit-learn#9612, expecting cluster centers to be an empty array and labels to be unique for every sample.
* Addresses non-convergence issues. Returns empty list as cluster center indices to prevent adding a dimension in fit() method, returns unique labels for samples making this consistent with (TBD) predict() behavior for non-convergence.
* Made predict() handle case of non-convergence while fitting. In this case, it will log a warning and return unique labels for every new sample.
* Added helper function for detecting mutually equal similarities and preferences
* Tidied imports
* Immediately returning trivial clusters and labels in case of equal similarities and preferences
* Simplified code for preference(s) equality test
* Corrected for failing unit tests covering case of n_samples=1
* Corrected for PEP8 line too long
* Rewriting imports to comply with max 80-column lines
* Simplified code. n_samples == 1 case does not need a separate return statement.
* Replaced logging warnings by warnings.warn(). Added assertions for warnings in tests.
* Marking function as non-public
* Using mask instead of modifying S
* Improvement suggested by review comment
* Avoided casting preference to array twice
* Readability improvements
* Improved returned labels in case of no cluster centers. Returning a unique label for every sample in X suggests that these were based on actual clusters. Since there are no clusters, it makes more sense to return a negative label for all samples, indicating there were no clusters.
* PEP8 line too long
* Avoided creating separate variable for preference as array
* Corrected warning message
* Making labels consistent with predict() behavior in case of non-convergence
* Minor readability improvement
* Added detail to test comment about expected result
* Added documentation about edge cases
* Added documentation to 'what's new'
Reference Issue
Fixes #9612
What does this implement/fix? Explain your changes.
AffinityPropagation.predict(X) would fail in case of non-convergence of the algorithm when fitting the model. It now returns the same label ('-1' for noise) for every sample in X in that case. AffinityPropagation.fit() behavior was undefined and sometimes arbitrary (depending on preference and/or damping values) for cases when training samples had equal mutual similarities. It now behaves in a consistent way for these edge cases, i.e. returning one cluster when preference < mutual similarities and returning n_samples clusters when preference >= mutual similarities.
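The documented edge-case behavior can be sketched as follows; _trivial_labels is a hypothetical helper illustrating the rule, not the actual scikit-learn implementation:

```python
import numpy as np

def _trivial_labels(S, preference):
    # Hypothetical sketch of the edge-case rule for mutually equal
    # similarities: n_samples clusters when preference >= similarity,
    # otherwise a single cluster.
    n_samples = S.shape[0]
    if np.array(preference).flat[0] >= S.flat[1]:
        return np.arange(n_samples)
    return np.zeros(n_samples, dtype=int)

S = np.full((3, 3), 5.0)                          # equal similarities
one_cluster = _trivial_labels(S, preference=-10)  # [0 0 0]
per_sample = _trivial_labels(S, preference=10)    # [0 1 2]
```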