IsolationForest extremely slow with large number of columns having discrete values #19275

@david-cortes

Description

The following example takes an unreasonable amount of time to run:

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.datasets import fetch_rcv1
from scipy.sparse import csc_matrix, csr_matrix

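# RCV1 is a large, very sparse text dataset (downloaded on first use).
X, y = fetch_rcv1(return_X_y=True)
X = csc_matrix(X)  # column-compressed sparse format
X.sort_indices()
# Each of the 100 trees is fit on a sub-sample of only 256 rows.
iso = IsolationForest(n_estimators=100, max_samples=256).fit(X)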

In theory it should be very fast, since each sub-sample it takes is a small sparse matrix in which most columns will have only zeros.
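To make that expectation concrete, here is a rough, standard-library-only sketch that counts how many columns contain any non-zero value in a 256-row sub-sample of synthetic sparse data. The function name `count_active_columns` and all sizes (`n_cols`, `nnz_per_row`, and so on) are arbitrary assumptions chosen for illustration; they are not RCV1's real statistics and nothing here touches scikit-learn.

```python
import random


def count_active_columns(n_rows=2_000, n_cols=50_000, nnz_per_row=75,
                         max_samples=256, seed=0):
    """Count columns holding at least one non-zero in a random row sub-sample.

    All sizes are made-up illustration values, not measurements of RCV1.
    """
    rng = random.Random(seed)
    # Each synthetic row is represented by the set of its non-zero columns.
    rows = [set(rng.sample(range(n_cols), nnz_per_row)) for _ in range(n_rows)]
    sub_sample = rng.sample(rows, max_samples)
    # A column that is zero in every sampled row is constant and cannot be split.
    active = set().union(*sub_sample)
    return len(active), n_cols


active, total = count_active_columns()
print(f"{active} of {total} columns have a non-zero value in the sub-sample")
```

Whatever the exact data, the number of such columns is bounded by `max_samples * nnz_per_row`, so with these assumed sizes well under half of the columns could ever be candidates for a split.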

Using n_jobs>1 also makes it use an unreasonable amount of memory, for a reason I have not identified.

If the input is passed as dense, the running time still looks worse than it should. From a quick glance at the code, my guess is that it doesn't keep track of which columns can no longer be split in a given node (because they are constant there):

X_sample = csr_matrix(X)[:1000,:]
X_sample = X_sample.toarray()
iso = IsolationForest(n_estimators=100, max_samples=256).fit(X_sample)
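As an illustration of the bookkeeping guessed at above, the following is a minimal pure-Python sketch of an isolation-style tree that remembers constant columns. It is not scikit-learn's implementation; `pick_split_feature`, `build_tree`, and `total_size` are hypothetical names introduced here. The idea it relies on is that a column that is constant in a node stays constant in every descendant, so it can be dropped from the candidate list once and never examined again in that subtree.

```python
import random


def pick_split_feature(rows, candidates, rng):
    """Pick a random non-constant feature among `candidates` for this node.

    Returns (feature, remaining), where `remaining` excludes every candidate
    found to be constant here. `feature` is None if all were constant.
    """
    remaining = list(candidates)  # copy so the parent's list is untouched
    while remaining:
        i = rng.randrange(len(remaining))
        feature = remaining[i]
        values = [row[feature] for row in rows]
        if min(values) < max(values):
            return feature, remaining
        # Constant in this node, hence in all descendants: swap-remove it.
        remaining[i] = remaining[-1]
        remaining.pop()
    return None, remaining


def build_tree(rows, candidates, rng, depth=0, max_depth=8):
    """Grow a random tree, passing the pruned candidate list to children."""
    if len(rows) <= 1 or depth >= max_depth:
        return {"size": len(rows)}
    feature, remaining = pick_split_feature(rows, candidates, rng)
    if feature is None:
        return {"size": len(rows)}
    lo = min(row[feature] for row in rows)
    hi = max(row[feature] for row in rows)
    threshold = rng.uniform(lo, hi)
    if threshold >= hi:  # guard against rounding so both sides are non-empty
        threshold = lo
    left = [row for row in rows if row[feature] <= threshold]
    right = [row for row in rows if row[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(left, remaining, rng, depth + 1, max_depth),
        "right": build_tree(right, remaining, rng, depth + 1, max_depth),
    }


def total_size(node):
    """Sum the leaf sizes, as a sanity check that no row was lost."""
    if "size" in node:
        return node["size"]
    return total_size(node["left"]) + total_size(node["right"])


# Toy data: only column 2 varies, so columns 0, 1 and 3 get pruned quickly.
rows = [[0, 0, 5, 0], [0, 0, 7, 0], [0, 0, 9, 0]]
tree = build_tree(rows, range(4), random.Random(0))
print(total_size(tree))
```

With this scheme, each constant column is scanned at most once per root-to-leaf path rather than potentially at every node, which is the saving one would hope for on data where most columns are all zeros.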
