IsolationForest extremely slow with large number of columns having discrete values #19275

The following example takes an unreasonable amount of time to run:
```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.datasets import fetch_rcv1
from scipy.sparse import csc_matrix, csr_matrix

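# RCV1 is a large, very sparse text dataset (downloaded on first use)
X, y = fetch_rcv1(return_X_y=True)
X = csc_matrix(X)
X.sort_indices()
# Each tree is fit on a sub-sample of only 256 rows
iso = IsolationForest(n_estimators=100, max_samples=256).fit(X)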
```

In theory it should be very fast, since each sub-sample it takes is a small sparse matrix in which most columns will have only zeros.
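
To make that expectation concrete, here is a rough sketch that counts how many columns of a 256-row sub-sample hold more than one distinct value, i.e. could be split at all. It uses a synthetic sparse matrix rather than RCV1; the shape, density and the names `X_demo`, `X_sub` and `splittable` are made up for illustration.

```python
import numpy as np
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(0)

# Synthetic stand-in for a wide sparse dataset (shape and density are assumptions)
X_demo = sparse_random(10_000, 5_000, density=0.001, format="csr", random_state=0)

# Draw a 256-row sub-sample, as a single isolation tree would
rows = rng.choice(X_demo.shape[0], size=256, replace=False)
X_sub = X_demo[rows].tocsc()

# A column can only be split if it holds at least two distinct values
col_min = X_sub.min(axis=0).toarray().ravel()
col_max = X_sub.max(axis=0).toarray().ravel()
splittable = np.flatnonzero(col_max > col_min)

print(f"{splittable.size} of {X_sub.shape[1]} columns can be split")
```

With data like this, only a minority of columns are candidates for a split, so the work per tree should be small.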

Setting `n_jobs` above 1 also makes it use an unreasonable amount of memory, for reasons that are not clear to me.

From a quick glance at the code, my guess is that it does not keep track of which columns can no longer be split in a given node. Passing the input as dense still runs slower than it should:
```python
X_sample = csr_matrix(X)[:1000,:]
X_sample = X_sample.toarray()
iso = IsolationForest(n_estimators=100, max_samples=256).fit(X_sample)
```
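
The idea of remembering which columns can no longer be split could look roughly like the toy builder below. This is not scikit-learn's code, only a sketch of the technique: a column that is constant within a node stays constant in all of its descendants, so it can be dropped from the candidate list that is passed down. The names `build_tree` and `candidates` are hypothetical.

```python
import numpy as np

def build_tree(X, rows, candidates, rng, depth=0, max_depth=8):
    """Toy isolation-tree builder (hypothetical); returns the number of leaves."""
    if depth >= max_depth or rows.size <= 1:
        return 1
    # Look only at columns that were still splittable in the parent node
    vals = X[np.ix_(rows, candidates)]
    lo, hi = vals.min(axis=0), vals.max(axis=0)
    keep = hi > lo
    # Constant columns are dropped here and never re-checked further down
    candidates, vals = candidates[keep], vals[:, keep]
    lo, hi = lo[keep], hi[keep]
    if candidates.size == 0:
        return 1
    # Pick a random remaining column and a random threshold within its range
    j = rng.integers(candidates.size)
    threshold = rng.uniform(lo[j], hi[j])
    go_left = vals[:, j] <= threshold
    return (build_tree(X, rows[go_left], candidates, rng, depth + 1, max_depth)
            + build_tree(X, rows[~go_left], candidates, rng, depth + 1, max_depth))

# Demo data (made up): 1000 columns, only the first 5 take more than one value
rng = np.random.default_rng(0)
X_demo = np.zeros((256, 1000))
X_demo[:, :5] = rng.integers(0, 3, size=(256, 5))

n_leaves = build_tree(X_demo, np.arange(256), np.arange(1000), rng)
print(n_leaves)
```

In this sketch the 995 all-zero columns are scanned once at the root and never again, so every deeper node only touches the handful of columns that still vary.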
