ENH Optimize runtime for IsolationForest#23149
Conversation
thomasjpfan
left a comment
There was a problem hiding this comment.
Thank you for the PR!
This looks reasonable to do. What does the benchmark look like for dense data?
|
After more profiling, it seems like the indexing operation of # only index arrays when needed
X_ = X if max_features == X.shape[1] else X[:, features]
estimator_fit(X_, y, sample_weight=curr_sample_weight)and see improvements for both sparse and dense inputs. While the problem will still exist when |
sklearn/ensemble/_bagging.py
Outdated
|
|
||
| estimator.fit(X[:, features], y, sample_weight=curr_sample_weight) | ||
| # only index arrays when needed | ||
| X_ = X if max_features == X.shape[1] else X[:, features] |
There was a problem hiding this comment.
I would do this in a separate follow up PR. The check_input change is already an improvement that can be merged by itself.
As for this change, we can only do this optimization if bootstrap_features is False. bootstrap_features is always False for IsolationForest, but not for Bagging*. (If bootstrap_features is True, then the features are sampled with replacement which can result in repeated features.)
There was a problem hiding this comment.
Got it! I've reverted the changes in this PR, and will draft another PR once this got merged :)
This reverts commit ff4e429. revert last commit
thomasjpfan
left a comment
There was a problem hiding this comment.
Please add an entry to the change log at doc/whats_new/v1.1.rst with tag |Efficiency|. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
|
Merging this one. The real fix for the sparse case should happen in a follow-up PR probably targeting 1.2. |
|
@jeremiedbb I added the "to backport" label to not forget to include this one in the 1.1.X branch before the final 1.1.0 release. |






Reference Issues/PRs
Towards #19275
What does this implement/fix? Explain your changes.
As shown here in the original issue discussion, the
check_inputargument is set to False forForestclasses, while for bagging estimators it's left as default valueTrue, therefore we're validating the data repeatedly.The proposed fix adds a
check_inputargument to theensemble._bagging._parallel_build_estimators, which can be set to False when the base estimator actually supports that argument during the fit process.Performance Impact
Code used for profiling:
Before (total time: 6.78s)

After (total time: 5.017s)
