[MRG+2] Outlier detection algorithms API consistency #9015
jnothman merged 6 commits into scikit-learn:master from
Conversation
sklearn/neighbors/lof.py
Outdated
| self.threshold_ = -scoreatpercentile( | ||
| -self.negative_outlier_factor_, 100. * (1. - self.contamination)) | ||
| if self.contamination is None: | ||
| self.threshold_ = -1.1 # inliers score around 1. |
1.5 is used in the paper in the experiments section, but it does not work on small datasets (the added tests break). I can put 1.5 and remove the added test for the case where contamination is None.
|
Waiting for #9018 to be merged before dealing with the EllipticEnvelope API |
sklearn/ensemble/iforest.py
Outdated
| return -scores | ||
|
|
||
| def score_samples(self, X): | ||
| """Opposite of the anomaly score define in the original paper. |
sklearn/ensemble/iforest.py
Outdated
|
|
||
| Returns | ||
| ------- | ||
| scores : array of shape (n_samples,) |
| assert_greater(np.min(decision_func1[-2:]), np.max(decision_func1[:-2])) | ||
| assert_greater(np.min(decision_func2[-2:]), np.max(decision_func2[:-2])) | ||
| assert_array_equal(pred1, 6 * [1] + 2 * [-1]) | ||
| assert_array_equal(pred2, 6 * [1] + 2 * [-1]) |
Use a for loop over the contamination values to avoid the duplication.
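For illustration, the looped version of such a test might look roughly like this (the toy data, contamination values and sample counts here are assumptions, not the actual test code):

import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative data: 6 inliers near the origin, 2 obvious outliers at the end.
rng = np.random.RandomState(42)
X = np.r_[0.3 * rng.randn(6, 2), rng.uniform(low=4, high=8, size=(2, 2))]

# Loop over contamination values instead of duplicating the assertions.
for contamination in [0.2, 0.25]:
    clf = IsolationForest(contamination=contamination, random_state=0).fit(X)
    decision = clf.decision_function(X)
    pred = clf.predict(X)
    # the two injected outliers should score lower than every inlier ...
    assert decision[-2:].max() < decision[:-2].min()
    # ... and be the only samples predicted as outliers
    assert list(pred) == 6 * [1] + 2 * [-1]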
sklearn/neighbors/lof.py
Outdated
| self.threshold_ = -scoreatpercentile( | ||
| -self.negative_outlier_factor_, 100. * (1. - self.contamination)) | ||
| if self.contamination is None: | ||
| self.offset_ = 1.5 # inliers score around 1. |
did you document the offset_ attribute in the docstrings?
I forgot, as threshold_ wasn't documented; I will do this.
| else: # we're in training | ||
| return scores | ||
|
|
||
| def _score_samples(self, X): |
why do you need this private function?
We want a score_samples method for every anomaly detection algo (returning the "raw" decision function as defined in the original papers).
Here it has to be private for the same reason that decision_function and predict are private.
Besides what is written in the docstring, I think it would be good to add a comment in the code saying that predict is private because fit_predict(X) would differ from fit(X).predict(X), and that decision_function and score_samples are private because predict is private. I think this is the real reason behind making these methods private; otherwise we could have extended the original paper approach (which can be done using the private methods). If someone dives into the code and wants to use the private methods, they should be aware of this, which is not mentioned in the current version of the code.
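For instance, the requested in-code comment could read something like this (the wording is only a suggestion, not the actual scikit-learn source):

# Note: predict, decision_function and score_samples are kept private
# (_predict, _decision_function, _score_samples) in LocalOutlierFactor.
# predict is private because fit(X).predict(X) would give a different
# result than fit_predict(X): during fit, each training sample is scored
# with itself excluded from its own neighborhood.  decision_function and
# score_samples are private only because predict is; they can still be
# used on new, unseen data through the private API.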
We also need a score_samples for the OneClassSVM. A suggestion:
For IsolationForest, LocalOutlierFactor and EllipticEnvelope, all the work to compute the score of normality is currently done in decision_function. decision_function is then called in score_samples and offset_ has to be used in both methods.
IMO it would maybe be better to do the opposite: do all the work to compute the score in score_samples and define decision_function by score_samples(X) - offset_. I find it more natural (and easier to understand) to compute the score first and threshold this score to obtain decision_function and predict instead of computing decision_function and un-threshold decision_function to obtain the score. This would also involve the following changes:
- For IsolationForest you can call score_samples instead of decision_function in fit and compute offset_ from score_samples(X). decision_function would then be score_samples(X) - offset_. And I think that you don't need the following if statement anymore in decision_function:
      if hasattr(self, 'offset_'):  # means that we're in testing
          return -scores + self.offset_
      else:  # we're in training
          return -scores
- For LocalOutlierFactor, _decision_function would be _score_samples(X) - offset_ and, as in the current version, no need to call score_samples nor decision_function in fit.
- For EllipticEnvelope, decision_function would be score_samples(X) - offset_ and, as in the current version, no need to call score_samples nor decision_function in fit.
- For the OneClassSVM we might have to stick with the current solution for score_samples, i.e., score_samples(X) = decision_function(X) + offset_, as decision_function uses the LIBSVM interface?
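A minimal sketch of that layout, using a toy scoring rule (distance to the origin) purely as a placeholder for the real per-estimator score:

import numpy as np

class SketchOutlierDetector:
    # Toy illustration of the structure suggested above, not a real estimator.
    def __init__(self, contamination=0.1):
        self.contamination = contamination

    def fit(self, X):
        # the real estimators would build trees / neighborhoods / covariance
        # here; the offset is then derived from the raw training scores
        self.offset_ = np.percentile(self.score_samples(X),
                                     100. * self.contamination)
        return self

    def score_samples(self, X):
        # "raw" score of normality, as defined in the original papers
        # (placeholder: minus the distance to the origin)
        return -np.linalg.norm(X, axis=1)

    def decision_function(self, X):
        # thresholded score: 0 becomes the inlier/outlier boundary
        return self.score_samples(X) - self.offset_

    def predict(self, X):
        # +1 for inliers, -1 for outliers
        return np.where(self.decision_function(X) >= 0, 1, -1)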
sklearn/neighbors/tests/test_lof.py
Outdated
| assert_equal(clf.n_neighbors_, X.shape[0] - 1) | ||
|
|
||
|
|
||
| def test__score_samples(): |
We test _score_samples, not score_samples, so I'm not sure of the convention...
I would have removed the duplicated _ as well, but it's not a big deal.
sklearn/neighbors/lof.py
Outdated
| Additional keyword arguments for the metric function. | ||
|
|
||
| contamination : float in (0., 0.5), optional (default=0.1) | ||
| contamination : float in (0., 0.5), optional (default=None) |
sklearn/neighbors/lof.py
Outdated
| The amount of contamination of the data set, i.e. the proportion | ||
| of outliers in the data set. When fitting this is used to define the | ||
| threshold on the decision function. | ||
| threshold on the decision function. If None, the decision function |
sklearn/neighbors/lof.py
Outdated
| def __init__(self, n_neighbors=20, algorithm='auto', leaf_size=30, | ||
| metric='minkowski', p=2, metric_params=None, | ||
| contamination=0.1, n_jobs=1): | ||
| contamination=None, n_jobs=1): |
sklearn/neighbors/lof.py
Outdated
| """ | ||
| if not (0. < self.contamination <= .5): | ||
| raise ValueError("contamination must be in (0, 0.5]") | ||
| if self.contamination is not None: |
needs to be adapted to 'auto' instead of None
sklearn/neighbors/lof.py
Outdated
|
|
||
| self.threshold_ = -scoreatpercentile( | ||
| -self.negative_outlier_factor_, 100. * (1. - self.contamination)) | ||
| if self.contamination is None: |
self.contamination == 'auto'
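For reference, the fit-time logic being discussed could be sketched as follows. The helper name is hypothetical, and the sign conventions are an assumption (working directly with negative_outlier_factor_, where inliers score close to -1); the percentile computation mirrors the scoreatpercentile expression in the diff above.

import numpy as np

def _compute_lof_offset(negative_outlier_factor, contamination):
    # Hypothetical helper sketching the discussion above, not the real code.
    if contamination == 'auto':
        # inliers have a negative_outlier_factor close to -1, so a fixed
        # offset can act as the inlier/outlier boundary
        return -1.5
    if not (0. < contamination <= 0.5):
        raise ValueError("contamination must be in (0, 0.5], got: %f"
                         % contamination)
    # equivalent to -scoreatpercentile(-x, 100. * (1. - contamination))
    return np.percentile(negative_outlier_factor, 100. * contamination)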
sklearn/neighbors/lof.py
Outdated
|
|
||
| if hasattr(self, 'offset_'): # means that we're in testing | ||
| return scores + self.offset_ # to make value 0 special | ||
| else: # we're in training |
I think this is useless here as _decision_function does not seem to be called during fit (there is a check_is_fitted in _decision_function).
you're right, thanks!
| @@ -219,7 +224,7 @@ def predict(self, X): | |||
| """ | |||
add a check_is_fitted
sklearn/ensemble/iforest.py
Outdated
| # abnormal) and substract self.offset_ to make 0 be the threshold | ||
| # value for being an outlier. | ||
|
|
||
| if hasattr(self, 'offset_'): # means that we're in testing |
See my general review feedback about this if-else statement.
|
@ngoix why is this still WIP? |
|
It's not; I'm making the change. |
albertcthomas
left a comment
A few comments. We also need to make sure that removing the raw_values parameter in EllipticEnvelope, which will require a deprecation cycle, is OK. Otherwise LGTM.
|
|
||
| def decision_function(self, X, raw_values=False): | ||
| def decision_function(self, X): | ||
| """Compute the decision function of the given observations. |
If we remove raw_values, we need a deprecation warning.
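A sketch of how the deprecation could look (the class and method body are placeholders; the message wording and the 0.20/0.22 version numbers are assumptions, though the 0.20 part matches the diff shown further down this page):

import warnings

class EllipticEnvelopeSketch:
    # Placeholder class; only illustrates the deprecation pattern.
    def decision_function(self, X, raw_values=None):
        if raw_values is not None:
            warnings.warn("raw_values parameter is deprecated in 0.20 and "
                          "will be removed in 0.22.", DeprecationWarning)
        # ... compute and return the (shifted) Mahalanobis-based decision
        #     function here ...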
| else: | ||
| transformed_mahal_dist = mahal_dist ** 0.33 | ||
| decision = self.threshold_ ** 0.33 - transformed_mahal_dist | ||
| negative_mahal_dist = self.score_samples(X) |
we might need to double check the examples using EllipticEnvelope but if I remember correctly they should be ok
covariance/plot_outlier_detection.py and applications/plot_outlier_detection_housing.py seem ok.
| is_inlier[values <= self.threshold_] = 1 | ||
| values = self.decision_function(X) | ||
| is_inlier[values > 0] = 1 | ||
| else: |
do we really need this? the contamination parameter is 0.1 by default
| return -self.mahalanobis(X) | ||
|
|
||
| def predict(self, X): | ||
| """Outlyingness of observations in X according to the fitted model. |
I would also change this docstring because "outlyingness" could mean that predict returns scores whereas it returns labels.
| assert_raises(NotFittedError, clf.decision_function, X) | ||
| clf.fit(X) | ||
| y_pred = clf.predict(X) | ||
| decision = clf.score_samples(X) |
|
|
||
| offset_ : float | ||
| Offset used to define the decision function from the raw scores. | ||
| We have the relation: decision_function = score_samples - offset_. |
in the code it is: score_samples + offset_
sklearn/neighbors/lof.py
Outdated
|
|
||
| offset_ : float | ||
| Offset used to define the decision function from the raw scores. | ||
| We have the relation: decision_function = score_samples - offset_. |
in the code it is: score_samples + offset_
sklearn/neighbors/tests/test_lof.py
Outdated
| assert_array_equal(clf1._score_samples([[2., 2.]]), | ||
| clf1._decision_function([[2., 2.]]) - clf1.offset_) | ||
| assert_array_equal(clf2._score_samples([[2., 2.]]), | ||
| clf2._decision_function([[2., 2.]]) - clf2.offset_) |
maybe assert equality of clf1.score_samples and clf2.score_samples
sklearn/svm/classes.py
Outdated
| We have the relation: decision_function = score_samples - offset_. | ||
| The offset is equal to intercept_ and is provided for consistency | ||
| with other outlier detection algorithms such as LocalOutlierFactor, | ||
| IsolationForest and EllipticEnvelope. |
I would remove such as LocalOutlierFactor, IsolationForest and EllipticEnvelope. Otherwise if a new outlier detection estimator is added to scikit-learn we would need to update the list.
sklearn/svm/classes.py
Outdated
| return dec | ||
|
|
||
| def score_samples(self, X): | ||
| """Raw decision function of the samples. |
I'm not very pleased with "Raw decision function". Maybe "Shifted decision function" or "scoring function"?
|
@ngoix as soon as you take into account my review this PR will be ready for final reviews. |
|
thanks @albertcthomas for the review |
|
Ping @agramfort for another one? |
jnothman
left a comment
A partial review.
Could you please add an entry to what's new and be clear on any decision_function or predict behaviour that has changed.
| X : array-like, shape (n_samples, n_features) | ||
|
|
||
| raw_values : bool | ||
| raw_values : bool, optional (default=None) |
I don't think "(default=None)" means anything. Remove it.
| return decision | ||
| # raw_values deprecation: | ||
| if raw_values is not None: | ||
| warnings.warn("raw_values parameter is deprecated in 0.20 and will" |
This is not tested, apparently.
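A test for the warning could be sketched as follows (the dataset and the message prefix are assumptions; sklearn.utils.testing is the test helper module used at the time of this PR):

from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import load_iris
from sklearn.utils.testing import assert_warns_message

X = load_iris().data
clf = EllipticEnvelope(contamination=0.1).fit(X)
# passing any explicit raw_values should trigger the deprecation warning
assert_warns_message(DeprecationWarning,
                     "raw_values parameter is deprecated",
                     clf.decision_function, X, raw_values=True)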
| return negative_mahal_dist - self.offset_ | ||
|
|
||
| def score_samples(self, X): | ||
| """Compute the Mahalanobis distances. |
| Returns -1 for anomalies/outliers and +1 for inliers. | ||
| """ | ||
| check_is_fitted(self, 'threshold_') | ||
| check_is_fitted(self, 'offset_') |
This doesn't need repeating if done in decision_function.
Should this definition of predict be provided by a mixin?
@jnothman do you mean a mixin for outlier detection estimators? I was thinking that all outlier detection estimators should have a fit_predict, like clustering estimators, but they can't all have a predict unless we exclude LocalOutlierFactor, which has a private predict.
Fair enough. We can make the presence of predict conditional on the presence of decision_function. It's a bit ugly, but possible with a descriptor/property...
I propose to create such a mixin in a separate PR, as this one is already a bit heavy
I will open a mixin PR once this PR is merged.
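For context, the mixin under discussion could be as small as the following (a sketch under the assumptions discussed here, not necessarily what the follow-up PR will contain):

class OutlierDetectorMixinSketch:
    """Sketch of a mixin for outlier detection estimators.

    Only fit_predict is provided here; predict and decision_function stay
    estimator-specific, and LocalOutlierFactor can override fit_predict
    since fit(X).predict(X) is not meaningful for it.
    """

    def fit_predict(self, X, y=None):
        # fit on X and return +1/-1 labels for the same samples
        return self.fit(X).predict(X)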
sklearn/neighbors/lof.py
Outdated
| raise ValueError("contamination must be in (0, 0.5]") | ||
| if self.contamination != "auto": | ||
| if not(0. < self.contamination <= .5): | ||
| raise ValueError("contamination must be in (0, 0.5]") |
Please add the actual value of self.contamination in the error message:
ValueError("contamination must be in (0, 0.5], got: %f" % self.contamination)
|
Thanks @jnothman |
|
@ngoix for the what's new, put it on one line; that will render as a tiny paragraph if the line is long. Don't itemize it. |
|
Put it all in one paragraph. We can restructure later.
|
Resolves #8693
Resolves #8707
Following up discussion in issue #8693
- Make IsolationForest and LocalOutlierFactor decision functions use contamination as EllipticEnvelope does. It is used to define a drift such that the threshold on the decision function for being an outlier is 0 (positive value = inlier, negative value = outlier).
- Add a score_samples method to OCSVM, iForest, LOF, EllipticEnvelope, enabling the user to access the raw score functions from the original papers. A new offset_ parameter allows linking this new method to decision_function through "decision_function = score_samples - offset_".
- Clean EllipticEnvelope, fix docs, deprecate the raw_values parameter.
- Add why LOF decision_function is private (mentioned in [MRG] Add decision_function api to LocalOutlierFactor #8707).
- Ravel the decision function of OCSVM so that it is indeed of shape (n_samples,).
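To summarise the resulting API, a short usage sketch (the data is synthetic and IsolationForest is used only as an example; the relations asserted at the end are the ones described by this PR):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.r_[0.3 * rng.randn(100, 2),                     # inliers
          rng.uniform(low=4, high=8, size=(10, 2))]    # outliers

clf = IsolationForest(contamination=0.1, random_state=0).fit(X)

raw = clf.score_samples(X)        # raw score from the original paper (negated)
dec = clf.decision_function(X)    # shifted so that 0 is the threshold
pred = clf.predict(X)             # +1 = inlier, -1 = outlier

# decision_function = score_samples - offset_, and predict thresholds at 0
assert np.allclose(dec, raw - clf.offset_)
assert np.array_equal(pred, np.where(dec < 0, -1, 1))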