Skip to content

sklearn.ensemble.IsolationForest._average_path_length returns incorrect values for input < 3. #11839

@joshuakennethjones

Description

@joshuakennethjones

Description

When an input value to _average_path_length() is in {0,1}, the return value should be zero, not one, as in the existing implementation. Also, when the input value is 2, the return value should be 1, not 0.15... as in the current implementation. These results should be expected for two reasons: first, based on the 2012 iForest paper (the original paper indicated a zero value for terminal nodes to which < 2 training examples had been sorted in the first paragraph of section 4.2, but left this vague in the algorithm/equation specifications, and did not specify a unique value for nodes to which exactly two training examples had been sorted), where it is explicitly stated that c(n) (the value computed by _average_path_length) should take the value zero for n in {0,1} and takes the value 1 for n=2. Also, from a rational perspective, we want these values to monotonically increase with n, and in the current implementation this is not the case. This is a pretty easy fix, I think -- just alter the existing cases for inputs in {0,1} to return zero instead of 1 (already hard-coded for these cases) and add a case for an input value of 2 to return 1. Since I have not contributed in the past, I felt it best to relay the issue this way vs. making my own pull request. This issue will impact anomaly scores in a subtle but potentially meaningful way.

Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation-based anomaly detection." ACM Transactions on Knowledge Discovery from Data (TKDD) 6.1 (2012): 3.

Steps/Code to Reproduce

import numpy as np
from sklearn.ensemble.iforest import _average_path_length
_average_path_length(np.array([0,1,2,3]))

Expected Results

array([0. , 0. , 1., 1.20739236])

Actual Results

array([1. , 1. , 0.15443133, 1.20739236])

Versions

Linux-4.14.47-56.37.amzn1.x86_64-x86_64-with-glibc2.2.5
('Python', '2.7.14 (default, May  2 2018, 18:31:34) \n[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)]')
('NumPy', '1.14.5')
('SciPy', '1.1.0')
('Scikit-Learn', '0.19.1')

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions