[MRG+1] IsolationForest max_samples warning and calculation#5678

Merged
glouppe merged 8 commits into scikit-learn:master from betatim:no-warning-iforest
Nov 11, 2015

Conversation

@betatim
Member

@betatim betatim commented Nov 2, 2015

In response to #5672

Introduces max_samples='auto' as the new default, making it possible to check whether the user set a non-default value. Only warn about max_samples > n_samples if the user set the value explicitly.

Fixes a bug where max_depth was not recalculated when max_samples changed.

Now 256 appears as a magic value in both fit and predict, which isn't nice. I'll think about how to make it nicer, suggestions welcome.

todo:

  • add test that max_depth recalculation works
  • add a test to check the value of self.max_samples_ in both cases (=self.max_samples or =n_samples)?
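The sentinel-default idea can be sketched roughly like this (a hypothetical helper, not the actual scikit-learn code; `resolve_max_samples` is an illustrative name):

```python
import numbers
import warnings

def resolve_max_samples(max_samples, n_samples):
    # With 'auto' as the default we can tell whether the user set
    # max_samples explicitly, and only warn in that case.
    if max_samples == 'auto':
        # Default behaviour: cap at 256, silently clipped to the data.
        return min(256, n_samples)
    if isinstance(max_samples, numbers.Integral):
        if max_samples > n_samples:
            # The user explicitly asked for more samples than exist.
            warnings.warn("max_samples (%d) is greater than the total "
                          "number of samples (%d); using n_samples."
                          % (max_samples, n_samples))
            return n_samples
        return max_samples
    # A float in (0, 1] is interpreted as a fraction of n_samples.
    return int(max_samples * n_samples)
```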

/cc @ngoix

max_samples='auto' replaces max_samples=256 as the default
setting. Recalculate max_depth based on the actual value
of max_samples after checking range and other constraints
@betatim betatim changed the title IsolationForest max_samples warning and calculation [WIP] IsolationForest max_samples warning and calculation Nov 2, 2015
Contributor


Here you are duplicating code from BaseBagging._fit(). You don't need L158,159,160 and paragraph L169.

Added a test that checks if max_depth is recalculated for
small samples.

Readded the max_samples_ property to gain access to it in
predict()
@betatim
Member Author

betatim commented Nov 3, 2015

I re-added the max_samples_ attribute to the classifier. This means we duplicate some code from BaseBagging in order to calculate the integer value for max_samples. Feels suboptimal but I can't see a better way. Do you remember why you initially decided against having the attribute? Maybe that reason is still valid and we should instead have self._max_samples?

Added a test to check that max_depth is calculated correctly in the case where max_samples is forced/limited to n_samples.
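The depth recalculation itself is tiny; a sketch of the usual isolation-forest rule (trees only need to be grown to about log2 of the subsample size; function name illustrative):

```python
import math

def iforest_max_depth(max_samples_):
    # An isolation tree only needs to reach roughly the depth at which
    # an average point isolates, i.e. ~log2(subsample size).  The
    # max(..., 2) guard avoids log2(1) == 0 for degenerate inputs.
    return int(math.ceil(math.log2(max(max_samples_, 2))))
```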

@ngoix
Contributor

ngoix commented Nov 3, 2015

We introduced this attribute before adding the _fit() method in BaseBagging, which made it useless at that time.
This PR looks good to me, except that I'm still not 100% sure about L174 (self.base_estimator or self.base_estimator_?), ping @agramfort.
Could you just please add a test to check the value of self.max_samples_ in both cases (=self.max_samples or =n_samples)?

Member


It must be on base_estimator_, as you cannot modify an init argument at any time.

Member


> It must be on base_estimator_, as you cannot modify an init argument at any time.

+1

Member Author


base_estimator_ doesn't exist. There is a list of the fitted estimators (self.estimators_) but setting the property after fitting is too late. BaseEnsemble explicitly delays instantiating the estimators so that the parameters of the base_estimator can still be changed. So I think this is what you want to be doing. The only alternative I see is not instantiating the base estimator until fit() is being called. For this we'd have to change BaseEnsemble though, as it expects you to pass a value for the base_estimator argument.

Contributor


One other way would be to add an optional argument max_depth (argument to use instead of self.max_depth) to _fit() in BaseBagging, and to change self.base_estimator_.max_depth after self._validate_estimator() (L.290 in bagging.py). Indeed _validate_estimator() makes base_estimator_ exist.
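A rough sketch of that proposal (class and method bodies simplified stand-ins, not the actual bagging.py code):

```python
class BaseBaggingSketch:
    # Simplified stand-in for sklearn.ensemble.BaseBagging.
    def __init__(self, base_estimator):
        self.base_estimator = base_estimator

    def _validate_estimator(self):
        # In scikit-learn this is where base_estimator_ comes into
        # existence; here we just mirror the template estimator.
        self.base_estimator_ = self.base_estimator

    def _fit(self, X, y, max_depth=None):
        self._validate_estimator()
        if max_depth is not None:
            # Override the depth initially passed to the base
            # estimator; only meaningful if it has a max_depth param.
            self.base_estimator_.max_depth = max_depth
        # ... build the ensemble from self.base_estimator_ ...
        return self
```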

Member


Sounds reasonable to me.

Member Author


Sorry for being a bit thick, I think I get it now.

@betatim
Member Author

betatim commented Nov 4, 2015

Added a test for max_samples_ being set correctly, as well as checking that max_samples='illegal_string' raises an exception.

Finally understood what you meant with base_estimator and base_estimator_ and implemented @ngoix's proposal. Should we implement a test for the max_depth parameter? Do we need to check that it made sense to pass max_depth to _fit (e.g. when the base estimator isn't a tree), or is it assumed the user knows what they are doing because it is _fit and not fit?

@agramfort
Member

Design looks good!

You have a Travis failure.

Contributor


You should say:

Argument to use instead of the one initially passed to the base estimator.
This is supported only if the base estimator has a max_depth parameter.

@ngoix
Contributor

ngoix commented Nov 4, 2015

Apart from these two minor comments, everything looks great to me. Thanks a lot @betatim :)!

@glouppe glouppe changed the title [WIP] IsolationForest max_samples warning and calculation [MRG+1] IsolationForest max_samples warning and calculation Nov 4, 2015
@betatim
Member Author

betatim commented Nov 4, 2015

Turns out that in Python 3 ordering comparisons between numbers and strings aren't defined (they raise a TypeError); legacy Python does define them, though. So I extended the isinstance checking to check explicitly for str vs int.

Thanks for your patience! 🍰
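The explicit type check that difference forces looks roughly like this (a sketch; `validate_max_samples` is an illustrative name, not the actual code):

```python
def validate_max_samples(max_samples, n_samples):
    # Python 3 raises TypeError on e.g. `256 > 'auto'`, while Python 2
    # defined an arbitrary ordering, so types must be checked
    # explicitly before any comparison is attempted.
    if isinstance(max_samples, str):
        if max_samples != 'auto':
            raise ValueError("max_samples (%s) is not supported."
                             % max_samples)
        return min(256, n_samples)
    if isinstance(max_samples, int):
        return min(max_samples, n_samples)
    if isinstance(max_samples, float) and 0.0 < max_samples <= 1.0:
        return int(max_samples * n_samples)
    raise ValueError("max_samples (%r) is not valid." % max_samples)
```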

@agramfort
Member

@ngoix ok for you?

@ngoix
Contributor

ngoix commented Nov 6, 2015

Yes, everything looks fine. Any idea why AppVeyor is not happy?

@giorgiop
Contributor

giorgiop commented Nov 6, 2015

I don't know why it only breaks for that particular instance (Python 2.7.8, 64-bit), but the error is raised here:
AttributeError: 'long' object has no attribute 'shape'

@giorgiop
Contributor

giorgiop commented Nov 6, 2015

What happens if you change the else statement to elif isinstance(n_samples_leaf, np.ndarray):?

Contributor


This is another leftover bug that we missed. It should be max_samples = int(max_samples * X.shape[0]) (without self).
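The bug in miniature (illustrative helper, not the actual code): a float max_samples is a fraction of the data, so the conversion has to use the local, already-validated variable rather than the raw constructor attribute.

```python
import numpy as np

def fraction_to_count(max_samples, X):
    # A float max_samples means "this fraction of the rows of X";
    # the fix was dropping a stray `self.` so the local, resolved
    # value feeds the conversion instead of the constructor argument.
    if isinstance(max_samples, float):
        max_samples = int(max_samples * X.shape[0])
    return max_samples
```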

Contributor


@betatim if you make this change, AppVeyor should be happy.

@glouppe
Contributor

glouppe commented Nov 6, 2015

There is another thing that I don't currently understand in the codebase. In predict, scores are normalized by _average_path_length(max_samples), why? In particular, the behaviour is different for float and integer values of max_samples. If float, then it becomes dependent on the size of X passed to predict, which means that scores will vary simply depending on the number of points you want to predict the abnormality score for... This is obviously wrong. What do you think @ngoix? (We should add regression tests to verify that results are identical for float and integer values and for varying sizes of X.)

(Sorry @betatim for uncovering more bugs and making your PR longer :))

@ngoix
Contributor

ngoix commented Nov 6, 2015

Yes you are right @glouppe it was wrong, but it is fixed now :) (see L229)

@glouppe
Contributor

glouppe commented Nov 6, 2015

Oh yeah right, forget what I said :)

@betatim
Member Author

betatim commented Nov 9, 2015

AppVeyor is unhappy because L266 checks if the argument is an int and, if not, assumes it is an ndarray. On AppVeyor we get passed a long, which isn't an int, and so end up in the ndarray branch.

Not quite sure where we get the long from in the first place. Should we track it down? This fix correctly detects longs and ints as integral numbers.
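The fix boils down to testing against the abstract integer type instead of the concrete int (sketch; `is_integral` is an illustrative name):

```python
import numbers

def is_integral(value):
    # numbers.Integral covers int, Python 2's long, and NumPy integer
    # scalars, so a `long` no longer falls through into the ndarray
    # branch of the type check.
    return isinstance(value, numbers.Integral)
```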

@agramfort
Member

agramfort commented Nov 9, 2015 via email

@betatim
Member Author

betatim commented Nov 9, 2015

Merci. The check I used was inspired by others in ensemble/*. Should I make an issue for this?

@betatim
Member Author

betatim commented Nov 11, 2015

All builds are green again.

@agramfort
Member

+1 merge

@glouppe merge if you're happy too

glouppe added a commit that referenced this pull request Nov 11, 2015
[MRG+1] IsolationForest max_samples warning and calculation
@glouppe glouppe merged commit 889e2d4 into scikit-learn:master Nov 11, 2015
@glouppe
Contributor

glouppe commented Nov 11, 2015

Looking good, merging. Thanks Tim!

@betatim betatim deleted the no-warning-iforest branch November 11, 2015 14:03
@betatim
Member Author

betatim commented Nov 11, 2015

🍰

Just noticed the number of this PR!
