sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3. #11839
Description
When an input value to _average_path_length() is in {0,1}, the return value should be zero, not one as in the existing implementation. Likewise, when the input value is 2, the return value should be 1, not 0.15... as in the current implementation.
These results should be expected for two reasons. First, the 2012 iForest paper explicitly states that c(n) (the value computed by _average_path_length) takes the value zero for n in {0,1} and the value 1 for n=2. (The original paper indicated a zero value for terminal nodes to which < 2 training examples had been sorted, in the first paragraph of section 4.2, but left this vague in the algorithm/equation specifications and did not specify a unique value for nodes to which exactly two training examples had been sorted.) Second, from a rational perspective, we want these values to increase monotonically with n, which is not the case in the current implementation: the value returned for n=2 (0.15...) is smaller than the value returned for n=1 (1).
This should be a pretty easy fix, I think: alter the existing cases for inputs in {0,1} to return zero instead of 1 (the value is already hard-coded for these cases) and add a case for an input value of 2 that returns 1. Since I have not contributed in the past, I felt it best to relay the issue this way rather than making my own pull request. This issue will impact anomaly scores in a subtle but potentially meaningful way.
Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation-based anomaly detection." ACM Transactions on Knowledge Discovery from Data (TKDD) 6.1 (2012): 3.
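To make the proposed behaviour concrete, here is a minimal sketch of what the corrected function could look like. It is not the scikit-learn source: the name average_path_length_sketch is hypothetical, and the general-case formula (2 * (ln(n - 1) + Euler's constant) - 2 * (n - 1) / n) is an assumption, chosen because it reproduces the 0.15443133 and 1.20739236 values reported in this issue.

```python
import numpy as np


def average_path_length_sketch(n_samples_leaf):
    """Hypothetical corrected c(n); not the actual scikit-learn code."""
    n = np.atleast_1d(np.asarray(n_samples_leaf, dtype=float))

    # n in {0, 1}: start from zeros, so these cases return 0 (proposed fix).
    out = np.zeros_like(n)

    # n == 2: return exactly 1 (proposed new special case).
    out[n == 2] = 1.0

    # n > 2: assumed general formula, using ln(n - 1) + Euler's constant
    # as the approximation of the harmonic number H(n - 1).
    big = n > 2
    out[big] = (
        2.0 * (np.log(n[big] - 1.0) + np.euler_gamma)
        - 2.0 * (n[big] - 1.0) / n[big]
    )
    return out


if __name__ == "__main__":
    values = average_path_length_sketch([0, 1, 2, 3])
    print(values)
    # With the special cases in place, c(n) is strictly increasing from n=1 on.
    print(np.all(np.diff(average_path_length_sketch(np.arange(1, 50))) > 0))
```

Handling the small inputs as explicit special cases keeps the general formula untouched for n > 2, so only the values for n in {0,1,2} would change.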
Steps/Code to Reproduce
import numpy as np
from sklearn.ensemble.iforest import _average_path_length
_average_path_length(np.array([0,1,2,3]))
Expected Results
array([0. , 0. , 1., 1.20739236])
Actual Results
array([1. , 1. , 0.15443133, 1.20739236])
Versions
Linux-4.14.47-56.37.amzn1.x86_64-x86_64-with-glibc2.2.5
('Python', '2.7.14 (default, May 2 2018, 18:31:34) \n[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)]')
('NumPy', '1.14.5')
('SciPy', '1.1.0')
('Scikit-Learn', '0.19.1')