[MRG+1] Fix iforest average path length#13251
[MRG+1] Fix iforest average path length#13251glemaitre merged 13 commits intoscikit-learn:masterfrom
Conversation
Fix Issue scikit-learn#11839 : sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3.
Changed existing test to reflect correct values now produced by _average_path_length(), and added checks to ensure non-regression on all "base case" values in {0,1,2}.
Made recommended enhancements to comments, and change assert_almost_equal to assert_equal where constants should be returned.
Change assert_equal to assert ... == to adhere to latest conventions, and change test to properly deal with anomaly score ties in critical regions if 'decision_function' method is supported by the estimator in question, or default to the old behavior if not.
Refactoring and adding more tests to try and get coverage to an acceptable level.
| assert_almost_equal(_average_path_length(1), 1., decimal=10) | ||
| assert _average_path_length(0) == 0. | ||
| assert _average_path_length(1) == 0. | ||
| assert _average_path_length(2) == 1. |
There was a problem hiding this comment.
can we also test that _average_path_length is increasing for more values? I guess _average_path_length(2) < _average_path_length(3) would be enough
|
ping @agramfort for a review when you have time |
glemaitre
left a comment
There was a problem hiding this comment.
Could you add an entry inside in what's new
| assert_almost_equal(_average_path_length(999), result_two, decimal=10) | ||
| assert_array_almost_equal(_average_path_length(np.array([1, 5, 999])), | ||
| [1., result_one, result_two], decimal=10) | ||
| assert_array_almost_equal(_average_path_length(np.array([1, 2, 5, 999])), |
There was a problem hiding this comment.
Since we are changing this line, could we use assert_allclose
There was a problem hiding this comment.
Yes. What's the difference? assert_allclose does not check the shapes are the same?
There was a problem hiding this comment.
all_close use rtol atol instead of decimal. It is just recommended by numpy for consistency:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.testing.assert_array_almost_equal.html
| @@ -1548,16 +1568,16 @@ def check_outliers_train(name, estimator_orig, readonly_memmap=True): | |||
|
|
|||
| decision = estimator.decision_function(X) | |||
There was a problem hiding this comment.
for func in ['decision_function', 'score_samples']:
output = getattr(estimator, func)(X)
assert output.dtype == np.dtype('float')
assert output.shape == (n_samples,)There was a problem hiding this comment.
I now use a for loop but a bit different to your suggestion as we need the outputs for other checks.
ngoix
left a comment
There was a problem hiding this comment.
appart from my small comment on _average_path_length monotonic testing and guillaume formatting comments LGTM
Co-Authored-By: albertcthomas <albertthomas88@gmail.com>
|
Thanks for the reviews @glemaitre and @ngoix |
|
+1 for MRG |
|
ok to merge when green @ngoix and @glemaitre ? |
|
Merging. Azure pipeline is green. |
|
Thanks for the reviews @ngoix, @glemaitre and @agramfort. Thanks for most of the work @joshuakennethjones and sorry for delaying the original PR. |
|
Thanks for pushing this one across the finish line @albertcthomas! Glad to be able to help out a little bit -- I appreciate the efforts of everyone involved in maintaining and improving what is obviously a very useful package. |
This reverts commit 00cea26.
This reverts commit 00cea26.
Reference Issues/PRs
Taking over #12085. Besides fixing the average path length for IsolationForest this PR also improves the checks for the predicted number of outliers in the common tests.
Closes #12085, Fixes #11839
Only modifications in the tests were required.
cc @joshuakennethjones