DOC Rework outlier detection estimators example #25878

glemaitre merged 57 commits into scikit-learn:main from outlier_benchmarks
Conversation
This PR is ready for review now.
Co-authored-by: Tim Head <betatim@gmail.com>
For info, this PR is failing on the CI with the following traceback:

```
File "/home/circleci/project/examples/miscellaneous/plot_outlier_detection_bench.py", line 218, in <module>
  y_pred[model_name]["ames_housing"] = fit_predict(
File "/home/circleci/project/examples/miscellaneous/plot_outlier_detection_bench.py", line 84, in fit_predict
  y_pred = model.fit(X).decision_function(X)
File "/home/circleci/project/sklearn/pipeline.py", line 420, in fit
  self._final_estimator.fit(Xt, y, **fit_params_last_step)
File "/home/circleci/project/sklearn/ensemble/_iforest.py", line 291, in fit
  X = self._validate_data(X, accept_sparse=["csc"], dtype=tree_dtype)
File "/home/circleci/project/sklearn/base.py", line 594, in _validate_data
  out = check_array(X, input_name="X", **check_params)
File "/home/circleci/project/sklearn/utils/validation.py", line 964, in check_array
  _assert_all_finite(
File "/home/circleci/project/sklearn/utils/validation.py", line 129, in _assert_all_finite
  _assert_all_finite_element_wise(
File "/home/circleci/project/sklearn/utils/validation.py", line 178, in _assert_all_finite_element_wise
  raise ValueError(msg_err)
ValueError: Input X contains NaN.
IsolationForest does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
```
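As a hedged sketch of the workaround the error message suggests (toy data here, not the actual Ames housing frame), missing values can be imputed in the pipeline before the `IsolationForest` step:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Toy data with a missing value; the real Ames housing columns differ.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, 7.0]])

# Median-impute NaNs before IsolationForest, which rejects NaN input.
model = make_pipeline(
    SimpleImputer(strategy="median"), IsolationForest(random_state=0)
)
scores = model.fit(X).decision_function(X)  # one anomaly score per sample
```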
```python
from time import perf_counter


def fit_predict(X, model_name, expected_anomaly_frac, categorical_columns=()):
```
Not a big fan of the empty tuple. I would find it more natural with:

```python
def fit_predict(..., categorical_columns=None):
    categorical_columns = [] if categorical_columns is None else categorical_columns
```

Since expected_anomaly_frac is only used for LOF, I would expect it to have a default of None.
Also, I would use expected_anomaly_fraction since this is only 3 letters more.
```python
model.fit(X)
y_pred = model[-1].negative_outlier_factor_

if model_name == "IForest":
    ordinal_encoder = OrdinalEncoder(
        handle_unknown="use_encoded_value", unknown_value=-1
    )
    iforest = IsolationForest(random_state=rng)
    preprocessor = ColumnTransformer(
        [("categorical", ordinal_encoder, categorical_columns)],
        remainder="passthrough",
    )
    model = make_pipeline(preprocessor, iforest)
```
Suggested change (inlining the IsolationForest construction instead of binding it to an intermediate `iforest` variable):

```python
ordinal_encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
preprocessor = ColumnTransformer(
    [("categorical", ordinal_encoder, categorical_columns)],
    remainder="passthrough",
)
model = make_pipeline(preprocessor, IsolationForest(random_state=rng))
```
```python
ordinal_encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
iforest = IsolationForest(random_state=rng)
```

Using the global rng is not a good practice here. We should pass the random_state as an argument.
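A minimal sketch of that suggestion (helper name illustrative): accept an integer random_state parameter instead of reading a shared global rng:

```python
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler


def make_iforest_model(random_state=42):
    # An integer seed keeps results reproducible regardless of call order,
    # unlike a shared global np.random.RandomState instance.
    return make_pipeline(
        RobustScaler(), IsolationForest(random_state=random_state)
    )


# Two pipelines built with the same seed are configured identically.
model_a = make_iforest_model(random_state=0)
model_b = make_iforest_model(random_state=0)
```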
```python
from time import perf_counter


def fit_predict(X, model_name, expected_anomaly_frac, categorical_columns=()):
```
Reading a bit more about the example, I would decouple the creation of the estimator from the real fit_predict:

```python
# %%
# ... text about the desired preprocessing and modelling
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import (
    OneHotEncoder,
    OrdinalEncoder,
    RobustScaler,
)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline


def make_estimator(name, categorical_columns=None, **kwargs):
    """Create an outlier detection estimator based on its name."""
    if name == "LOF":
        outlier_detector = LocalOutlierFactor(**kwargs)
        if categorical_columns is None:
            preprocessor = RobustScaler()
        else:
            preprocessor = ColumnTransformer(
                transformers=[("categorical", OneHotEncoder(), categorical_columns)],
                remainder=RobustScaler(),
            )
    else:  # name == "IForest"
        outlier_detector = IsolationForest(**kwargs)
        if categorical_columns is None:
            preprocessor = None
        else:
            ordinal_encoder = OrdinalEncoder(
                handle_unknown="use_encoded_value", unknown_value=-1
            )
            preprocessor = ColumnTransformer(
                transformers=[
                    ("categorical", ordinal_encoder, categorical_columns),
                ],
                remainder="passthrough",
            )
    return make_pipeline(preprocessor, outlier_detector)


# %%
# ... text about the `fit_predict`

# %%
from time import perf_counter


def fit_predict(estimator, X):
    tic = perf_counter()
    if estimator[-1].__class__.__name__ == "LocalOutlierFactor":
        estimator.fit(X)
        y_pred = estimator[-1].negative_outlier_factor_
    else:  # "IsolationForest"
        y_pred = estimator.fit(X).decision_function(X)
    toc = perf_counter()
    # Report the name of the final estimator (model_name is not in scope here).
    print(f"Duration for {estimator[-1].__class__.__name__}: {toc - tic:.2f} s")
    return y_pred
```
In the code above, it means that we will show how to define n_neighbors each time, but we can also pass random_state when requesting an isolation forest.

Since LocalOutlierFactor and IsolationForest don't share kwargs, the suggested make_estimator function would only work inside if-statements, which makes for more boilerplate code. The idea was to have a function handle the cases to avoid this.

I could still pass kwargs to the LocalOutlierFactor and set the random_state directly in the construction of the IsolationForest, though I'm aware this is not a good practice. WDYT?
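A sketch of what that compromise could look like (helper name hypothetical): forward kwargs only to LOF and hard-code the seed in the IsolationForest constructor:

```python
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor


def make_detector(name, **kwargs):
    # kwargs (e.g. n_neighbors) only make sense for LOF here; the forest
    # gets its random_state fixed at construction time instead.
    if name == "LOF":
        return LocalOutlierFactor(**kwargs)
    return IsolationForest(random_state=42)


lof = make_detector("LOF", n_neighbors=35)
iforest = make_detector("IForest")
```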
Edit: I can indeed reproduce the error with pandas 2.0.1 while the code was working with pandas 1.5.1.

Here is the difference between the outputs with pandas 2.0.1 and pandas 1.5.3. So the difference is in the …

Basically, the culprit is pandas-dev/pandas#50286, where I think we can tweak it on our side by omitting …
```python
from sklearn.datasets import fetch_kddcup99
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
```

The rationale is that each time rng is used, its state is modified, so changing code ordering could potentially lead to different results. It is therefore safer to just pass integers around.
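The state-mutation point can be illustrated with a small sketch:

```python
import numpy as np

# A shared RandomState mutates with every call, so reordering code that
# consumes it changes all downstream results.
rng = np.random.RandomState(42)
first_draw = rng.rand(3)
second_draw = rng.rand(3)  # different numbers: the stream has advanced

# Passing the integer seed around instead gives each consumer its own
# fresh, reproducible stream.
a = np.random.RandomState(42).rand(3)
b = np.random.RandomState(42).rand(3)
```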
```python
lof = LocalOutlierFactor(n_neighbors=int(n_samples * expected_anomaly_fraction))

fig, ax = plt.subplots()
for model_idx, preprocessor in enumerate(preprocessor_list):
```

I think that the green line will be difficult to distinguish from the red one for colorblind readers. Would you mind also giving each line a different linestyle so they can be distinguished that way as well?
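A hedged sketch of cycling linestyles in matplotlib (placeholder curves, not the example's real ROC data):

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend for this sketch
import matplotlib.pyplot as plt
import numpy as np

linestyles = ["solid", "dashed", "dashdot", "dotted"]
fpr = np.linspace(0, 1, 50)

fig, ax = plt.subplots()
for model_idx, name in enumerate(["IForest", "LOF"]):
    tpr = fpr ** (0.5 + 0.5 * model_idx)  # placeholder monotone curves
    # Vary linestyle as well as color so the lines remain distinguishable
    # for colorblind readers.
    ax.plot(fpr, tpr, linestyle=linestyles[model_idx % len(linestyles)], label=name)
ax.legend()
```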
LGTM. Enabling auto-merge. Thanks @ArturoAmorQ
Co-authored-by: ArturoAmorQ <arturo.amor-quiroz@polytechnique.edu> Co-authored-by: Tim Head <betatim@gmail.com> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
ogrisel left a comment
I realized I had a pending review on this PR. Here were the comments:
```
This example benchmarks two outlier detection algorithms, namely
:ref:`local_outlier_factor` (LOF) and :ref:`isolation_forest` (IForest), using
ROC curves on classical anomaly detection datasets. The goal is to show that
different algorithms perform well on different datasets.
```

Suggested change for the last line:

```
different algorithms perform well on different datasets and highlight
differences in training speed and sensitivity to hyper-parameters.
```
```
1. The algorithms are trained on the whole dataset which is assumed to
   contain outliers.
```

Suggested change:

```
1. The algorithms are trained (without labels) on the whole dataset which is assumed
   to contain outliers.
```
```
# similarly in terms of ROC AUC for the forestcover and cardiotocography
# datasets. The score for IForest is slightly better for the SA dataset and LOF
# performs considerably better on WDBC than IForest.
#
```

Let's add something about the tradeoff between ROC and computational performance:

Recall however that Isolation Forest tends to train much faster than LOF on datasets with a large number of samples. Indeed, LOF needs to compute pairwise distances to find nearest neighbors, which has a quadratic complexity. This can make this method prohibitive on large datasets.
Reference Issues/PRs

What does this implement/fix? Explain your changes.

In preparation to address this comment, the current state of the Evaluation of outlier detection estimators example can benefit from a "tutorialization" to explain the steps and results.

For such purpose, this PR:

- changes from `LabelBinarizer` to a proper `OneHotEncoder` when needed;

Any other comments?

The ablation section is a nice-to-have but makes the example take longer to run. Hopefully the soon-to-come speed-ups in neighbors computations will solve this issue.