[MRG+1] IsolationForest max_samples warning and calculation#5678

Merged
glouppe merged 8 commits into scikit-learn:master from betatim:no-warning-iforest
Nov 11, 2015

Conversation

@betatim
Member

@betatim betatim commented Nov 2, 2015

In response to #5672

Introduces max_samples='auto' as the new default, making it possible to check whether the user set a non-default value. Only warn about max_samples > n_samples if the user set the value explicitly.

Fixes a bug where max_depth was not recalculated when max_samples changed.

Now 256 appears as a magic value in both fit and predict, which isn't nice. I'll think about how to make it nicer, suggestions welcome.

todo:

  • add test that max_depth recalculation works
  • add a test to check the value of self.max_samples_ in both cases (=self.max_samples or =n_samples)?
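The sentinel-default idea can be sketched roughly like this (a hypothetical helper, not the actual scikit-learn code; `resolve_max_samples` is an illustrative name):

```python
import numbers
import warnings

def resolve_max_samples(max_samples, n_samples):
    # With 'auto' as the default we can tell whether the user set
    # max_samples explicitly, and only warn in that case.
    if max_samples == 'auto':
        # Default behaviour: cap at 256, silently clipped to the data.
        return min(256, n_samples)
    if isinstance(max_samples, numbers.Integral):
        if max_samples > n_samples:
            # The user explicitly asked for more samples than exist.
            warnings.warn("max_samples (%d) is greater than the total "
                          "number of samples (%d); using n_samples."
                          % (max_samples, n_samples))
            return n_samples
        return max_samples
    # A float in (0, 1] is interpreted as a fraction of n_samples.
    return int(max_samples * n_samples)
```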

/cc @ngoix

max_samples='auto' replaces max_samples=256 as the default
setting. Recalculate max_depth based on the actual value
of max_samples after checking range and other constraints
@betatim betatim changed the title IsolationForest max_samples warning and calculation [WIP] IsolationForest max_samples warning and calculation Nov 2, 2015
Contributor


Here you are duplicating code from BaseBagging._fit(). You don't need L158,159,160 and paragraph L169.

Added a test that checks if max_depth is recalculated for
small samples.

Readded the max_samples_ property to gain access to it in
predict()
@betatim
Member Author

betatim commented Nov 3, 2015

I re-added the max_samples_ attribute to the classifier. This means we duplicate some code from BaseBagging in order to calculate the integer value for max_samples. Feels suboptimal but I can't see a better way. Do you remember why you initially decided against having the attribute? Maybe that reason is still valid and we should instead have self._max_samples?

Added a test to check that max_depth is calculated correctly in the case where max_samples is forced/limited to n_samples.
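The depth recalculation itself is tiny; a sketch of the usual isolation-forest rule (trees only need to be grown to about log2 of the subsample size; function name illustrative):

```python
import math

def iforest_max_depth(max_samples_):
    # An isolation tree only needs to reach roughly the depth at which
    # an average point isolates, i.e. ~log2(subsample size).  The
    # max(..., 2) guard avoids log2(1) == 0 for degenerate inputs.
    return int(math.ceil(math.log2(max(max_samples_, 2))))
```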

@ngoix
Contributor

ngoix commented Nov 3, 2015

We introduced this attribute before adding the _fit() method in BaseBagging, which made it useless at that time.
This PR looks good to me, except that I'm still not 100% sure about L174 (self.base_estimator or self.base_estimator_?), ping @agramfort.
Could you just please add a test to check the value of self.max_samples_ in both cases (=self.max_samples or =n_samples)?

Member


It must be on base_estimator_, as you cannot modify an init argument at any time.

Member


> It must be on base_estimator_, as you cannot modify an init argument at any time.

+1

Member Author


base_estimator_ doesn't exist. There is a list of the fitted estimators (self.estimators_) but setting the property after fitting is too late. BaseEnsemble explicitly delays instantiating the estimators so that the parameters of the base_estimator can still be changed. So I think this is what you want to be doing. The only alternative I see is not instantiating the base estimator until fit() is being called. For this we'd have to change BaseEnsemble though, as it expects you to pass a value for the base_estimator argument.

Contributor


One other way would be to add an optional argument max_depth (argument to use instead of self.max_depth) to _fit() in BaseBagging, and to change self.base_estimator_.max_depth after self._validate_estimator() (L.290 in bagging.py). Indeed _validate_estimator() makes base_estimator_ exist.
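A rough sketch of that proposal (class and method bodies simplified stand-ins, not the actual bagging.py code):

```python
class BaseBaggingSketch:
    # Simplified stand-in for sklearn.ensemble.BaseBagging.
    def __init__(self, base_estimator):
        self.base_estimator = base_estimator

    def _validate_estimator(self):
        # In scikit-learn this is where base_estimator_ comes into
        # existence; here we just mirror the template estimator.
        self.base_estimator_ = self.base_estimator

    def _fit(self, X, y, max_depth=None):
        self._validate_estimator()
        if max_depth is not None:
            # Override the depth initially passed to the base
            # estimator; only meaningful if it has a max_depth param.
            self.base_estimator_.max_depth = max_depth
        # ... build the ensemble from self.base_estimator_ ...
        return self
```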

Member


Sounds reasonable to me.

Member Author


Sorry for being a bit thick, I think I get it now.

@betatim
Member Author

betatim commented Nov 4, 2015

Added a test for max_samples_ being set correctly, as well as checking that max_samples='illegal_string' raises an exception.

Finally understood what you meant with base_estimator and base_estimator_ and implemented @ngoix's proposal. Should we implement a test for the max_depth parameter? Do we need to check that it made sense to pass max_depth to _fit (e.g. when the base estimator isn't a tree), or is it assumed the user knows what they are doing because it is _fit and not fit?

@agramfort
Member

Design looks good!

You have a Travis failure.

Contributor


You should say:

Argument to use instead of the one initially passed to the base estimator.
This is supported only if the base estimator has a max_depth parameter.

@ngoix
Contributor

ngoix commented Nov 4, 2015

Apart from these two minor comments, everything looks great to me. Thanks a lot @betatim :)!

@glouppe glouppe changed the title [WIP] IsolationForest max_samples warning and calculation [MRG+1] IsolationForest max_samples warning and calculation Nov 4, 2015
@betatim
Member Author

betatim commented Nov 4, 2015

Turns out that in Python 3 ordering comparisons between numbers and strings aren't defined (they raise a TypeError); legacy Python does define them, though. So I extended the isinstance checking to check explicitly for str vs int.

Thanks for your patience! 🍰
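The explicit type check that difference forces looks roughly like this (a sketch; `validate_max_samples` is an illustrative name, not the actual code):

```python
def validate_max_samples(max_samples, n_samples):
    # Python 3 raises TypeError on e.g. `256 > 'auto'`, while Python 2
    # defined an arbitrary ordering, so types must be checked
    # explicitly before any comparison is attempted.
    if isinstance(max_samples, str):
        if max_samples != 'auto':
            raise ValueError("max_samples (%s) is not supported."
                             % max_samples)
        return min(256, n_samples)
    if isinstance(max_samples, int):
        return min(max_samples, n_samples)
    if isinstance(max_samples, float) and 0.0 < max_samples <= 1.0:
        return int(max_samples * n_samples)
    raise ValueError("max_samples (%r) is not valid." % max_samples)
```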

@agramfort
Member

@ngoix ok for you?

@ngoix
Contributor

ngoix commented Nov 6, 2015

Yes, everything looks fine. Any idea why AppVeyor is not happy?

@giorgiop
Contributor

giorgiop commented Nov 6, 2015

I don't know why it only breaks for that particular instance (Python 2.7.8, 64-bit), but the error is raised here:
AttributeError: 'long' object has no attribute 'shape'

@giorgiop
Contributor

giorgiop commented Nov 6, 2015

What happens if you change the else statement to elif isinstance(n_samples_leaf, np.ndarray):?

Contributor


This is another leftover bug that we missed. It should be max_samples = int(max_samples * X.shape[0]) (without self).
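The bug in miniature (illustrative helper, not the actual code): a float max_samples is a fraction of the data, so the conversion has to use the local, already-validated variable rather than the raw constructor attribute.

```python
import numpy as np

def fraction_to_count(max_samples, X):
    # A float max_samples means "this fraction of the rows of X";
    # the fix was dropping a stray `self.` so the local, resolved
    # value feeds the conversion instead of the constructor argument.
    if isinstance(max_samples, float):
        max_samples = int(max_samples * X.shape[0])
    return max_samples
```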

Contributor


@betatim if you make this change, AppVeyor should be happy.

@glouppe
Contributor

glouppe commented Nov 6, 2015

There is another thing that I don't currently understand in the codebase. In predict, scores are normalized by _average_path_length(max_samples), why? In particular, the behaviour is different for float and integer values of max_samples. If float, then it becomes dependent on the size of X passed to predict, which means that scores will vary simply depending on the number of points you want to predict the abnormality score for... This is obviously wrong. What do you think @ngoix? (We should add regression tests to verify that results are identical for float and integer values and for varying sizes of X.)

(Sorry @betatim for uncovering more bugs and making your PR longer :))

@ngoix
Contributor

ngoix commented Nov 6, 2015

Yes you are right @glouppe it was wrong, but it is fixed now :) (see L229)

@glouppe
Contributor

glouppe commented Nov 6, 2015

Oh yeah right, forget what I said :)

@betatim
Member Author

betatim commented Nov 9, 2015

AppVeyor is unhappy because L266 checks if the argument is an int and, if not, assumes it is an ndarray. On AppVeyor we get passed a long, which isn't an int, and so end up in the ndarray branch.

Not quite sure where we get the long from in the first place. Should we track it down? This fix correctly detects longs and ints as integral numbers.
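The fix boils down to testing against the abstract integer type instead of the concrete int (sketch; `is_integral` is an illustrative name):

```python
import numbers

def is_integral(value):
    # numbers.Integral covers int, Python 2's long, and NumPy integer
    # scalars, so a `long` no longer falls through into the ndarray
    # branch of the type check.
    return isinstance(value, numbers.Integral)
```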

@agramfort
Member

agramfort commented Nov 9, 2015 via email

@betatim
Member Author

betatim commented Nov 9, 2015

Merci. The check I used was inspired by others in ensemble/*. Should I make an issue for this?

@betatim
Member Author

betatim commented Nov 11, 2015

All builds are green again.

@agramfort
Member

+1 merge

@glouppe merge if you're happy too

glouppe added a commit that referenced this pull request Nov 11, 2015
[MRG+1] IsolationForest max_samples warning and calculation
@glouppe glouppe merged commit 889e2d4 into scikit-learn:master Nov 11, 2015
@glouppe
Contributor

glouppe commented Nov 11, 2015

Looking good, merging. Thanks Tim!

@betatim betatim deleted the no-warning-iforest branch November 11, 2015 14:03
@betatim
Member Author

betatim commented Nov 11, 2015

🍰

Just noticed the number of this PR!
