[MRG+2] Outlier detection algorithms API consistency #9015
jnothman merged 6 commits into scikit-learn:master from
Conversation
sklearn/neighbors/lof.py
Outdated
| self.threshold_ = -scoreatpercentile( | ||
| -self.negative_outlier_factor_, 100. * (1. - self.contamination)) | ||
| if self.contamination is None: | ||
| self.threshold_ = -1.1 # inliers score around 1. |
1.5 is used in the paper in the experiments section, but it does not work on small datasets (the added tests break). I can put 1.5 and remove the added test for the case where contamination is None.
|
Waiting for #9018 to be merged before dealing with the EllipticEnvelope API |
sklearn/ensemble/iforest.py
Outdated
| return -scores | ||
|
|
||
| def score_samples(self, X): | ||
| """Opposite of the anomaly score define in the original paper. |
sklearn/ensemble/iforest.py
Outdated
|
|
||
| Returns | ||
| ------- | ||
| scores : array of shape (n_samples,) |
| assert_greater(np.min(decision_func1[-2:]), np.max(decision_func1[:-2])) | ||
| assert_greater(np.min(decision_func2[-2:]), np.max(decision_func2[:-2])) | ||
| assert_array_equal(pred1, 6 * [1] + 2 * [-1]) | ||
| assert_array_equal(pred2, 6 * [1] + 2 * [-1]) |
Use a for loop over the contamination values to avoid the duplication.
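For illustration, the looped version of such a test might look roughly like this (the toy data, contamination values and sample counts here are assumptions, not the actual test code):

import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative data: 6 inliers near the origin, 2 obvious outliers at the end.
rng = np.random.RandomState(42)
X = np.r_[0.3 * rng.randn(6, 2), rng.uniform(low=4, high=8, size=(2, 2))]

# Loop over contamination values instead of duplicating the assertions.
for contamination in [0.2, 0.25]:
    clf = IsolationForest(contamination=contamination, random_state=0).fit(X)
    decision = clf.decision_function(X)
    pred = clf.predict(X)
    # the two injected outliers should score lower than every inlier ...
    assert decision[-2:].max() < decision[:-2].min()
    # ... and be the only samples predicted as outliers
    assert list(pred) == 6 * [1] + 2 * [-1]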
sklearn/neighbors/lof.py
Outdated
| self.threshold_ = -scoreatpercentile( | ||
| -self.negative_outlier_factor_, 100. * (1. - self.contamination)) | ||
| if self.contamination is None: | ||
| self.offset_ = 1.5 # inliers score around 1. |
did you document the offset_ attribute in the docstrings?
I forgot, as threshold_ wasn't documented; I will do this.
| else: # we're in training | ||
| return scores | ||
|
|
||
| def _score_samples(self, X): |
why do you need this private function?
We want a score_samples method for every anomaly detection algo (returning the "raw" decision function as defined in the original papers).
Here it has to be private for the same reason that decision_function and predict are private.
Besides what is written in the docstring, I think it would be good to add a comment in the code saying that predict is private because fit_predict(X) would differ from fit(X).predict(X), and that decision_function and score_samples are private because predict is private. I think this is the real reason behind making these methods private; otherwise we could have extended the original paper approach (which can be done using the private methods). If someone dives into the code and wants to use the private methods, they should be aware of this, which is not mentioned in the current version of the code.
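For instance, the requested in-code comment could read something like this (the wording is only a suggestion, not the actual scikit-learn source):

# Note: predict, decision_function and score_samples are kept private
# (_predict, _decision_function, _score_samples) in LocalOutlierFactor.
# predict is private because fit(X).predict(X) would give a different
# result than fit_predict(X): during fit, each training sample is scored
# with itself excluded from its own neighborhood.  decision_function and
# score_samples are private only because predict is; they can still be
# used on new, unseen data through the private API.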
We also need a score_samples for the OneClassSVM. A suggestion:
For IsolationForest, LocalOutlierFactor and EllipticEnvelope, all the work to compute the score of normality is currently done in decision_function. decision_function is then called in score_samples and offset_ has to be used in both methods.
IMO it would maybe be better to do the opposite: do all the work to compute the score in score_samples and define decision_function by score_samples(X) - offset_. I find it more natural (and easier to understand) to compute the score first and threshold this score to obtain decision_function and predict instead of computing decision_function and un-threshold decision_function to obtain the score. This would also involve the following changes:
- For IsolationForest you can call score_samples instead of decision_function in fit and compute offset_ from score_samples(X). decision_function would then be score_samples(X) - offset_. And I think that you don't need the following if statement anymore in decision_function:
      if hasattr(self, 'offset_'):  # means that we're in testing
          return -scores + self.offset_
      else:  # we're in training
          return -scores
- For LocalOutlierFactor, _decision_function would be _score_samples(X) - offset_ and, as in the current version, no need to call score_samples nor decision_function in fit.
- For EllipticEnvelope, decision_function would be score_samples(X) - offset_ and, as in the current version, no need to call score_samples nor decision_function in fit.
- For the OneClassSVM we might have to stick with the current solution for score_samples, i.e., score_samples(X) = decision_function(X) + offset_, as decision_function uses the LIBSVM interface?
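A minimal sketch of that layout, using a toy scoring rule (distance to the origin) purely as a placeholder for the real per-estimator score:

import numpy as np

class SketchOutlierDetector:
    # Toy illustration of the structure suggested above, not a real estimator.
    def __init__(self, contamination=0.1):
        self.contamination = contamination

    def fit(self, X):
        # the real estimators would build trees / neighborhoods / covariance
        # here; the offset is then derived from the raw training scores
        self.offset_ = np.percentile(self.score_samples(X),
                                     100. * self.contamination)
        return self

    def score_samples(self, X):
        # "raw" score of normality, as defined in the original papers
        # (placeholder: minus the distance to the origin)
        return -np.linalg.norm(X, axis=1)

    def decision_function(self, X):
        # thresholded score: 0 becomes the inlier/outlier boundary
        return self.score_samples(X) - self.offset_

    def predict(self, X):
        # +1 for inliers, -1 for outliers
        return np.where(self.decision_function(X) >= 0, 1, -1)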
sklearn/neighbors/tests/test_lof.py
Outdated
| assert_equal(clf.n_neighbors_, X.shape[0] - 1) | ||
|
|
||
|
|
||
| def test__score_samples(): |
We test _score_samples, not score_samples, so I'm not sure of the convention...
I would have removed the duplicated _ as well, but it's not a big deal.
sklearn/neighbors/lof.py
Outdated
| Additional keyword arguments for the metric function. | ||
|
|
||
| contamination : float in (0., 0.5), optional (default=0.1) | ||
| contamination : float in (0., 0.5), optional (default=None) |
sklearn/neighbors/lof.py
Outdated
| The amount of contamination of the data set, i.e. the proportion | ||
| of outliers in the data set. When fitting this is used to define the | ||
| threshold on the decision function. | ||
| threshold on the decision function. If None, the decision function |
sklearn/neighbors/lof.py
Outdated
| def __init__(self, n_neighbors=20, algorithm='auto', leaf_size=30, | ||
| metric='minkowski', p=2, metric_params=None, | ||
| contamination=0.1, n_jobs=1): | ||
| contamination=None, n_jobs=1): |
sklearn/neighbors/lof.py
Outdated
| """ | ||
| if not (0. < self.contamination <= .5): | ||
| raise ValueError("contamination must be in (0, 0.5]") | ||
| if self.contamination is not None: |
needs to be adapted to 'auto' instead of None
sklearn/neighbors/lof.py
Outdated
|
|
||
| self.threshold_ = -scoreatpercentile( | ||
| -self.negative_outlier_factor_, 100. * (1. - self.contamination)) | ||
| if self.contamination is None: |
self.contamination == 'auto'
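For reference, the fit-time logic being discussed could be sketched as follows. The helper name is hypothetical, and the sign conventions are an assumption (working directly with negative_outlier_factor_, where inliers score close to -1); the percentile computation mirrors the scoreatpercentile expression in the diff above.

import numpy as np

def _compute_lof_offset(negative_outlier_factor, contamination):
    # Hypothetical helper sketching the discussion above, not the real code.
    if contamination == 'auto':
        # inliers have a negative_outlier_factor close to -1, so a fixed
        # offset can act as the inlier/outlier boundary
        return -1.5
    if not (0. < contamination <= 0.5):
        raise ValueError("contamination must be in (0, 0.5], got: %f"
                         % contamination)
    # equivalent to -scoreatpercentile(-x, 100. * (1. - contamination))
    return np.percentile(negative_outlier_factor, 100. * contamination)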
sklearn/neighbors/lof.py
Outdated
|
|
||
| if hasattr(self, 'offset_'): # means that we're in testing | ||
| return scores + self.offset_ # to make value 0 special | ||
| else: # we're in training |
I think this is useless here as _decision_function does not seem to be called during fit (there is a check_is_fitted in _decision_function).
you're right, thanks!
| @@ -219,7 +224,7 @@ def predict(self, X): | |||
| """ | |||
add a check_is_fitted
sklearn/ensemble/iforest.py
Outdated
| # abnormal) and substract self.offset_ to make 0 be the threshold | ||
| # value for being an outlier. | ||
|
|
||
| if hasattr(self, 'offset_'): # means that we're in testing |
See my general review feedback about this if-else statement.
|
@ngoix why is this still WIP? |
|
It's not; I'm making the change. |
albertcthomas
left a comment
A few comments. We also need to make sure that removing the raw_values parameter in EllipticEnvelope, which will require a deprecation cycle, is OK. Otherwise LGTM.
|
|
||
| def decision_function(self, X, raw_values=False): | ||
| def decision_function(self, X): | ||
| """Compute the decision function of the given observations. |
If we remove raw_values, we need a deprecation warning.
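A sketch of how the deprecation could look (the class and method body are placeholders; the message wording and the 0.20/0.22 version numbers are assumptions, though the 0.20 part matches the diff shown further down this page):

import warnings

class EllipticEnvelopeSketch:
    # Placeholder class; only illustrates the deprecation pattern.
    def decision_function(self, X, raw_values=None):
        if raw_values is not None:
            warnings.warn("raw_values parameter is deprecated in 0.20 and "
                          "will be removed in 0.22.", DeprecationWarning)
        # ... compute and return the (shifted) Mahalanobis-based decision
        #     function here ...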
| else: | ||
| transformed_mahal_dist = mahal_dist ** 0.33 | ||
| decision = self.threshold_ ** 0.33 - transformed_mahal_dist | ||
| negative_mahal_dist = self.score_samples(X) |
we might need to double check the examples using EllipticEnvelope but if I remember correctly they should be ok
covariance/plot_outlier_detection.py and applications/plot_outlier_detection_housing.py seem ok.
| is_inlier[values <= self.threshold_] = 1 | ||
| values = self.decision_function(X) | ||
| is_inlier[values > 0] = 1 | ||
| else: |
do we really need this? the contamination parameter is 0.1 by default
| return -self.mahalanobis(X) | ||
|
|
||
| def predict(self, X): | ||
| """Outlyingness of observations in X according to the fitted model. |
I would also change this docstring because "outlyingness" could mean that predict returns scores whereas it returns labels.
| assert_raises(NotFittedError, clf.decision_function, X) | ||
| clf.fit(X) | ||
| y_pred = clf.predict(X) | ||
| decision = clf.score_samples(X) |
|
|
||
| offset_ : float | ||
| Offset used to define the decision function from the raw scores. | ||
| We have the relation: decision_function = score_samples - offset_. |
in the code it is: score_samples + offset_
sklearn/neighbors/lof.py
Outdated
|
|
||
| offset_ : float | ||
| Offset used to define the decision function from the raw scores. | ||
| We have the relation: decision_function = score_samples - offset_. |
in the code it is: score_samples + offset_
sklearn/neighbors/tests/test_lof.py
Outdated
| assert_array_equal(clf1._score_samples([[2., 2.]]), | ||
| clf1._decision_function([[2., 2.]]) - clf1.offset_) | ||
| assert_array_equal(clf2._score_samples([[2., 2.]]), | ||
| clf2._decision_function([[2., 2.]]) - clf2.offset_) |
maybe assert equality of clf1.score_samples and clf2.score_samples
sklearn/svm/classes.py
Outdated
| We have the relation: decision_function = score_samples - offset_. | ||
| The offset is equal to intercept_ and is provided for consistency | ||
| with other outlier detection algorithms such as LocalOutlierFactor, | ||
| IsolationForest and EllipticEnvelope. |
I would remove such as LocalOutlierFactor, IsolationForest and EllipticEnvelope. Otherwise if a new outlier detection estimator is added to scikit-learn we would need to update the list.
sklearn/svm/classes.py
Outdated
| return dec | ||
|
|
||
| def score_samples(self, X): | ||
| """Raw decision function of the samples. |
I'm not very pleased with "Raw decision function". Maybe "Shifted decision function" or "scoring function"?
|
@ngoix as soon as you take into account my review this PR will be ready for final reviews. |
|
thanks @albertcthomas for the review |
|
Ping @agramfort for another one? |
jnothman
left a comment
A partial review.
Could you please add an entry to what's new and be clear on any decision_function or predict behaviour that has changed.
| X : array-like, shape (n_samples, n_features) | ||
|
|
||
| raw_values : bool | ||
| raw_values : bool, optional (default=None) |
I don't think "(default=None)" means anything. Remove it.
| return decision | ||
| # raw_values deprecation: | ||
| if raw_values is not None: | ||
| warnings.warn("raw_values parameter is deprecated in 0.20 and will" |
This is not tested, apparently.
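A test for the warning could be sketched as follows (the dataset and the message prefix are assumptions; sklearn.utils.testing is the test helper module used at the time of this PR):

from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import load_iris
from sklearn.utils.testing import assert_warns_message

X = load_iris().data
clf = EllipticEnvelope(contamination=0.1).fit(X)
# passing any explicit raw_values should trigger the deprecation warning
assert_warns_message(DeprecationWarning,
                     "raw_values parameter is deprecated",
                     clf.decision_function, X, raw_values=True)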
| return negative_mahal_dist - self.offset_ | ||
|
|
||
| def score_samples(self, X): | ||
| """Compute the Mahalanobis distances. |
| Returns -1 for anomalies/outliers and +1 for inliers. | ||
| """ | ||
| check_is_fitted(self, 'threshold_') | ||
| check_is_fitted(self, 'offset_') |
This doesn't need repeating if done in decision_function.
Should this definition of predict be provided by a mixin?
@jnothman do you mean a mixin for outlier detection estimators? I was thinking that all outlier detection estimators should have a fit_predict, like clustering estimators, but they can't all have a predict unless we exclude LocalOutlierFactor, which has a private predict.
Fair enough. We can make the presence of predict conditional on the presence of decision_function. It's a bit ugly, but possible with a descriptor/property...
I propose to create such a mixin in a separate PR, as this one is already a bit heavy
I will open a mixin PR once this PR is merged.
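For context, the mixin under discussion could be as small as the following (a sketch under the assumptions discussed here, not necessarily what the follow-up PR will contain):

class OutlierDetectorMixinSketch:
    """Sketch of a mixin for outlier detection estimators.

    Only fit_predict is provided here; predict and decision_function stay
    estimator-specific, and LocalOutlierFactor can override fit_predict
    since fit(X).predict(X) is not meaningful for it.
    """

    def fit_predict(self, X, y=None):
        # fit on X and return +1/-1 labels for the same samples
        return self.fit(X).predict(X)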
sklearn/neighbors/lof.py
Outdated
| raise ValueError("contamination must be in (0, 0.5]") | ||
| if self.contamination != "auto": | ||
| if not(0. < self.contamination <= .5): | ||
| raise ValueError("contamination must be in (0, 0.5]") |
Please add the actual value of self.contamination in the error message:
ValueError("contamination must be in (0, 0.5], got: %f" % self.contamination)
|
Thanks @jnothman |
|
@ngoix for the what's new, put it on one line; that will render as a tiny paragraph if the line is long. Don't itemize it. |
|
Put it all in one paragraph. We can restructure later.
|
Resolves #8693
Resolves #8707
Following up discussion in issue #8693
- Make IsolationForest and LocalOutlierFactor decision functions use contamination as EllipticEnvelope does. It is used to define a drift such that the threshold on the decision function for being an outlier is 0 (positive value = inlier, negative value = outlier).
- Add a score_samples method to OCSVM, iForest, LOF, EllipticEnvelope, enabling the user to access the raw score functions from the original papers. A new offset_ parameter allows linking this new method to decision_function through "decision_function = score_samples - offset_".
- Clean EllipticEnvelope, fix docs, deprecate the raw_values parameter.
- Add why LOF decision_function is private (mentioned in [MRG] Add decision_function api to LocalOutlierFactor #8707).
- Ravel the decision function of OCSVM so that it is indeed of shape (n_samples,).
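To summarise the resulting API, a short usage sketch (the data is synthetic and IsolationForest is used only as an example; the relations asserted at the end are the ones described by this PR):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.r_[0.3 * rng.randn(100, 2),                     # inliers
          rng.uniform(low=4, high=8, size=(10, 2))]    # outliers

clf = IsolationForest(contamination=0.1, random_state=0).fit(X)

raw = clf.score_samples(X)        # raw score from the original paper (negated)
dec = clf.decision_function(X)    # shifted so that 0 is the threshold
pred = clf.predict(X)             # +1 = inlier, -1 = outlier

# decision_function = score_samples - offset_, and predict thresholds at 0
assert np.allclose(dec, raw - clf.offset_)
assert np.array_equal(pred, np.where(dec < 0, -1, 1))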