[MRG+1] Fix iforest average path length by albertcthomas · Pull Request #13251 · scikit-learn/scikit-learn

albertcthomas · 2019-02-25T14:55:42Z

Reference Issues/PRs

Taking over #12085. Besides fixing the average path length for IsolationForest, this PR also improves the checks on the predicted number of outliers in the common tests.
Closes #12085, Fixes #11839

Only modifications in the tests were required.

cc @joshuakennethjones

Fix Issue scikit-learn#11839 : sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3.

Changed existing test to reflect correct values now produced by _average_path_length(), and added checks to ensure non-regression on all "base case" values in {0,1,2}.

Made recommended enhancements to comments, and change assert_almost_equal to assert_equal where constants should be returned.

Change assert_equal to assert ... == to adhere to latest conventions, and change test to properly deal with anomaly score ties in critical regions if 'decision_function' method is supported by the estimator in question, or default to the old behavior if not.

Refactoring and adding more tests to try and get coverage to an acceptable level.
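The handling of anomaly score ties in the "critical region" described above can be illustrated with a small sketch. It assumes the check works on the sorted decision_function values between the observed and the expected number of outliers; the helper name `ties_explain_discrepancy` is hypothetical and this is not necessarily the code that was merged.

```python
import numpy as np


def ties_explain_discrepancy(num_outliers, expected_outliers, decision):
    """Return True if ties in `decision` can explain the count mismatch.

    Hypothetical helper: the "critical region" is taken to be the sorted
    decision values lying between the observed and expected outlier counts.
    """
    lo, hi = sorted((num_outliers, expected_outliers))
    critical = np.sort(np.asarray(decision))[lo:hi + 1]
    # If every value in the critical region is identical, the estimator
    # could not have separated those samples, so the mismatch is acceptable.
    return np.unique(critical).size == 1


# Three samples share the score 0.0, so predicting 1 outlier instead of 3
# is explained by the tie.
tied = [-1.0, 0.0, 0.0, 0.0, 1.0, 2.0]
tied_ok = ties_explain_discrepancy(1, 3, tied)

# All scores are distinct, so the same mismatch is a genuine failure.
untied = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0]
untied_ok = ties_explain_discrepancy(1, 3, untied)
```

When the estimator has no decision_function, no such region can be inspected, which is why the description falls back to the old behavior in that case.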

ngoix · 2019-02-25T16:26:22Z

sklearn/ensemble/tests/test_iforest.py

-    assert_almost_equal(_average_path_length(1), 1., decimal=10)
+    assert _average_path_length(0) == 0.
+    assert _average_path_length(1) == 0.
+    assert _average_path_length(2) == 1.


Can we also test that _average_path_length is increasing for more values? I guess checking _average_path_length(2) < _average_path_length(3) would be enough.
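The base cases and the monotonicity check suggested above can be sketched as follows. This is a minimal re-implementation under assumptions: the name `average_path_length_sketch` is hypothetical, and the branch for inputs above 2 uses the usual isolation forest approximation c(n) = 2(ln(n - 1) + gamma) - 2(n - 1)/n rather than code copied from the PR.

```python
import numpy as np


def average_path_length_sketch(n_samples_leaf):
    """Approximate average path length of an unsuccessful BST search.

    Hypothetical stand-in for the private helper discussed in this PR.
    """
    n = np.atleast_1d(np.asarray(n_samples_leaf, dtype=float))
    out = np.zeros_like(n)          # n <= 1 -> 0.0
    out[n == 2] = 1.0               # n == 2 -> 1.0
    big = n > 2
    out[big] = (2.0 * (np.log(n[big] - 1.0) + np.euler_gamma)
                - 2.0 * (n[big] - 1.0) / n[big])
    return out


values = average_path_length_sketch(np.arange(0, 10))

# Base cases return exact constants, so plain equality is appropriate.
assert values[0] == 0.0
assert values[1] == 0.0
assert values[2] == 1.0

# From n = 1 onwards the function should be strictly increasing.
assert np.all(np.diff(values[1:]) > 0)
```

Checking the whole range with `np.diff` covers the suggested `2 < 3` comparison and a few more values at no extra cost.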

albertcthomas · 2019-02-25T16:42:14Z

ping @agramfort for a review when you have time

sklearn/utils/estimator_checks.py

glemaitre

Could you add an entry in what's new?

sklearn/ensemble/iforest.py

sklearn/ensemble/tests/test_iforest.py

glemaitre · 2019-02-26T10:33:39Z

sklearn/ensemble/tests/test_iforest.py

    assert_almost_equal(_average_path_length(999), result_two, decimal=10)
-    assert_array_almost_equal(_average_path_length(np.array([1, 5, 999])),
-                              [1., result_one, result_two], decimal=10)
+    assert_array_almost_equal(_average_path_length(np.array([1, 2, 5, 999])),


Since we are changing this line, could we use assert_allclose?

Yes. What's the difference? Does assert_allclose not check that the shapes are the same?

assert_allclose uses rtol and atol instead of decimal. NumPy recommends it for consistency:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.testing.assert_array_almost_equal.html
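To make the rtol/atol point concrete, here is a small sketch of how assert_allclose behaves with a relative tolerance. The arrays are made-up values chosen for illustration, not numbers from the test suite.

```python
import numpy as np
from numpy.testing import assert_allclose

expected = np.array([1.0, 100.0, 1e6])

# A relative error of 1e-9 is within the default rtol of 1e-7,
# regardless of the magnitude of each entry.
computed = expected * (1 + 1e-9)
assert_allclose(computed, expected, rtol=1e-7)

# A relative error of 1e-3 is outside the tolerance and is reported
# as an AssertionError.
raised = False
try:
    assert_allclose(expected * (1 + 1e-3), expected, rtol=1e-7)
except AssertionError:
    raised = True
```

Because the tolerance is relative, the same rtol works for values of very different scale, whereas a fixed number of decimals is an absolute criterion.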

glemaitre · 2019-02-26T10:38:50Z

sklearn/utils/estimator_checks.py

@@ -1548,16 +1568,16 @@ def check_outliers_train(name, estimator_orig, readonly_memmap=True):

    decision = estimator.decision_function(X)


    for func in ['decision_function', 'score_samples']:
        output = getattr(estimator, func)(X)
        assert output.dtype == np.dtype('float')
        assert output.shape == (n_samples,)

I now use a for loop, but slightly different from your suggestion, as we need the outputs for other checks.
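One way to loop over both methods while keeping the outputs for later checks is to store them in a dict. The sketch below is only an illustration of that idea: `ToyOutlierDetector` is a hypothetical stand-in written for this example, not a scikit-learn estimator, and the merged code may be organised differently.

```python
import numpy as np


class ToyOutlierDetector:
    """Hypothetical detector: scores are negative distances to the mean."""

    def fit(self, X):
        self.center_ = X.mean(axis=0)
        # Shift so that roughly 10% of training samples score below zero.
        self.offset_ = np.percentile(self.score_samples(X), 10)
        return self

    def score_samples(self, X):
        return -np.linalg.norm(X - self.center_, axis=1)

    def decision_function(self, X):
        return self.score_samples(X) - self.offset_


rng = np.random.RandomState(0)
X = rng.normal(size=(30, 2))
n_samples = X.shape[0]
estimator = ToyOutlierDetector().fit(X)

outputs = {}
for func in ['decision_function', 'score_samples']:
    output = getattr(estimator, func)(X)
    assert output.dtype == np.dtype('float')
    assert output.shape == (n_samples,)
    outputs[func] = output  # kept for the checks that follow

# Later checks can reuse the stored arrays instead of recomputing them.
decision = outputs['decision_function']
scores = outputs['score_samples']
```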

sklearn/utils/estimator_checks.py

ngoix

Apart from my small comment on testing that _average_path_length is monotonic and Guillaume's formatting comments, LGTM.

albertcthomas · 2019-02-26T13:08:10Z

Thanks for the reviews @glemaitre and @ngoix

agramfort · 2019-02-26T15:06:19Z

+1 for MRG

agramfort · 2019-02-26T15:06:50Z

OK to merge when green, @ngoix and @glemaitre?

glemaitre · 2019-02-26T16:49:40Z

Merging. Azure pipeline is green.

albertcthomas · 2019-02-26T16:53:54Z

Thanks for the reviews @ngoix, @glemaitre and @agramfort. Thanks for most of the work @joshuakennethjones and sorry for delaying the original PR.

joshuakennethjones · 2019-02-26T19:44:08Z

Thanks for pushing this one across the finish line @albertcthomas! Glad to be able to help out a little bit -- I appreciate the efforts of everyone involved in maintaining and improving what is obviously a very useful package.

joshuakennethjones and others added 8 commits February 25, 2019 13:44

Fix issue scikit-learn#11839

8fddcb8

Fix Issue scikit-learn#11839 : sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3.

Add non-regression test.

c1feccc

Changed existing test to reflect correct values now produced by _average_path_length(), and added checks to ensure non-regression on all "base case" values in {0,1,2}.

Improve comment & test.

f03b7f9

Made recommended enhancements to comments, and change assert_almost_equal to assert_equal where constants should be returned.

Sundry convention tweaks.

42ceb87

Add tests of tests.

8ac76ae

Refactoring and adding more tests to try and get coverage to an acceptable level.

fix test to take LOF into account

95901fe

fix check_outlier_corruption and corresponding tests

b955607

albertcthomas mentioned this pull request Feb 25, 2019

[MRG] sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3. #11839 #12085

Closed

ngoix mentioned this pull request Feb 25, 2019

[MRG] avoid storage of each tree predictions in iforest #13260

Merged

ngoix reviewed Feb 25, 2019

albertcthomas commented Feb 26, 2019

glemaitre requested changes Feb 26, 2019

ngoix approved these changes Feb 26, 2019

glemaitre and others added 4 commits February 26, 2019 12:24

apply suggestions from code review

9260d0b

Co-Authored-By: albertcthomas <albertthomas88@gmail.com>

more review comments

f3c24fe

whatsnew

f9066bc

sc

3000edb

sc whatsnew

e5087d1

ngoix mentioned this pull request Feb 26, 2019

[MRG] iforest chunks for score_samples #13283

Merged

agramfort changed the title from "[MRG] Fix iforest average path length" to "[MRG+1] Fix iforest average path length" Feb 26, 2019

glemaitre approved these changes Feb 26, 2019

glemaitre merged commit bcdeadd into scikit-learn:master Feb 26, 2019

albertcthomas mentioned this pull request Feb 26, 2019

[MRG+1] Fix pep 8 errors introduced in previous PR #13292

Merged

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

BUG: fix average path length in iforest (scikit-learn#13251)

00cea26

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

Revert "BUG: fix average path length in iforest (scikit-learn#13251)"

7965662

This reverts commit 00cea26.

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

Revert "BUG: fix average path length in iforest (scikit-learn#13251)"

105cddf

This reverts commit 00cea26.

koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019

BUG: fix average path length in iforest (scikit-learn#13251)

29a9a24

Konrad0 mentioned this pull request Nov 27, 2019

Average path length in iForest is inaccurate for small sizes #15724

Open
