Skip to content

Isolation forest - decision_function & average_path_length method are memory inefficient #12040

@anant9

Description

@anant9

Description

Isolation forest consumes too much memory due to memory ineffecient implementation of anomoly score calculation. Due to this the parallelization with n_jobs is also impacted as anomoly score cannot be calculated in parallel for each tree.

Steps/Code to Reproduce

Run a simple Isolation forest with n_estimators as 10 and as 50 respectively.
On memory profiling, it can be seen that each building of tree is not taking much memory but in the end a lot of memory is consumed as a for loop is iteration over all trees and calculating the anomoly score of all trees together and then averaging it.
-iforest.py line 267-281

        for i, (tree, features) in enumerate(zip(self.estimators_,
                                                 self.estimators_features_)):
            if subsample_features:
                X_subset = X[:, features]
            else:
                X_subset = X
            leaves_index = tree.apply(X_subset)
            node_indicator = tree.decision_path(X_subset)
            n_samples_leaf[:, i] = tree.tree_.n_node_samples[leaves_index]
            depths[:, i] = np.ravel(node_indicator.sum(axis=1))
            depths[:, i] -= 1

        depths += _average_path_length(n_samples_leaf)

        scores = 2 ** (-depths.mean(axis=1) / _average_path_length(self.max_samples_))

        # Take the opposite of the scores as bigger is better (here less
        # abnormal) and add 0.5 (this value plays a special role as described
        # in the original paper) to give a sense to scores = 0:
        return 0.5 - scores

Due to this, in case of more no. of estimators(1000), the memory consumed is quite high.

Expected Results

Possible Solution:
The above for loop should only do the averaging of anomoly score from each estimator instead of calculation. The logic of isoforest anomoly score calculation can be moved to base estimator class so it is done for each tree( i guess bagging.py file-similar to other method available after fitting)

Actual Results

The memory consumption is profound as we increase no. of estimators.

model=Isolationforest()
model.fit(data)

The fit method calls decision function & average anomoly score which are taking quite a lot memory.
the memory spike is too high in the very end, that is in finall call to average_path_length() method.

depths += _average_path_length(n_samples_leaf)

Versions

isoForest_memoryConsumption.docx

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions