Add SHAP explanations to each prediction
We already use SHAP explanations for models. It would be nice to also have an explanation for each individual prediction.
Hi, I'd like to work on this feature and get more info about it. By the way, I'm new to the open-source world, so please bear with me.
@DRMALEK thank you for your interest!
The idea is to provide explanations for each prediction made by AutoML. The SHAP package can be used for this, but I'm also open to other explanation techniques.
A similar issue: https://github.com/mljar/mljar-supervised/issues/180
The functionality can be obtained by adding the explain() method.
Do you have any ideas on how this can be achieved?
As far as I found, there is another technique besides SHAP values, named LIME, for local interpretability. It is implemented in the https://github.com/interpretml/interpret package. LIME tends to be faster than SHAP but has a weaker theoretical basis. I think we can try to add both, selected via a parameter of the explain() function. I don't know, is that the right way to go?
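One way the parameterized explain() could look, as a hypothetical sketch only (the method names and dispatch are assumptions, not the real mljar API; the backends are placeholders for actual SHAP/LIME calls):

```python
# Hypothetical sketch of a parameterized explain() for AutoML.
# All names here are illustrative; the real integration would close
# over the trained AutoML object and call shap/lime for real.

def explain_with_shap(model, sample):
    # Placeholder: would build a shap explainer around model.predict
    return {"method": "shap", "sample": sample}

def explain_with_lime(model, sample):
    # Placeholder: would use a LimeTabularExplainer on training data
    return {"method": "lime", "sample": sample}

def explain(model, sample, method="shap"):
    """Dispatch to an explanation backend chosen by `method`."""
    backends = {
        "shap": explain_with_shap,
        "lime": explain_with_lime,
    }
    if method not in backends:
        raise ValueError(f"Unknown explanation method: {method!r}")
    return backends[method](model, sample)
```

The point of the dispatch is that adding a third backend later (e.g. ELI5) only means registering one more entry.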
@DRMALEK sounds great! Having both methods would be fantastic. From my experience, the SHAP method is very slow; I never tried to run it with Ensemble because of the speed. In theory, though, it should work ...
Never tried LIME, to be honest. Are you able to provide an example (maybe a notebook) where you create LIME explanations for the AutoML object? The example will:
- Train AutoML on some dataset (AutoML with several models and Ensemble)
- Compute prediction for one sample
- Compute explanations for one sample

It will be a nice example to check the speed of LIME for complex models.
Sure. I will start to work on it.
Hi, when I tried to get the local explanations, I noticed a bug in the predict function of the AutoML object when you provide it with a numpy.ndarray instead of a pandas.DataFrame. Basically, you get an error like:
`AutoMLException: Missing column: age in input data. Cannot predict.`
This happens because _build_dataframe() adds template-based column names like feature_1, feature_2, etc., which are not in the original _data_info list.
I'm sharing a Colab notebook for the problem: https://colab.research.google.com/drive/18GxiNCQ9ry-EuqNIKNDcySmj52l7zV0f?usp=sharing
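A toy reproduction of the column-name mismatch described above, with the `_build_dataframe()` behaviour paraphrased from this thread (the helper and the stored `_data_info` column list here are stand-ins, not the actual mljar source):

```python
import numpy as np
import pandas as pd

# Columns recorded at fit() time (stand-in for the _data_info list)
data_info_columns = ["age", "income"]

def build_dataframe(X):
    """Mimic wrapping a raw ndarray with template column names."""
    if isinstance(X, np.ndarray):
        X = pd.DataFrame(
            X, columns=[f"feature_{i + 1}" for i in range(X.shape[1])]
        )
    return X

X_numpy = np.array([[35, 50000.0]])
df = build_dataframe(X_numpy)

# The membership check that predict() performs then fails:
missing = [c for c in data_info_columns if c not in df.columns]
print(missing)  # ['age', 'income'] -> "Missing column: age ... Cannot predict."
```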
@DRMALEK thank you for the notebook, a few comments:
- you don't need to apply features and target preprocessing to use AutoML, just throw data at it. Does LIME need any data preprocessing?
- looks like you need to pass a data frame as the input for computing a prediction; try `automl.predict(X_test.iloc[:1])`. Maybe there is a bug in the code, because I can't run predictions for `X_test.iloc[0]`; only `X_test.iloc[:1]` works in the predict method. I will add issues to fix this.
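The difference between `X_test.iloc[0]` and `X_test.iloc[:1]` mentioned above comes down to pandas indexing semantics: a scalar index returns a 1-D Series (the column names become the index), while a slice returns a one-row DataFrame that still carries its column names, which is what predict() expects:

```python
import pandas as pd

X_test = pd.DataFrame({"age": [35, 41], "income": [50000, 62000]})

row_as_series = X_test.iloc[0]    # pandas.Series, shape (2,)
row_as_frame = X_test.iloc[:1]    # pandas.DataFrame, shape (1, 2)

print(type(row_as_series).__name__)  # Series
print(type(row_as_frame).__name__)   # DataFrame
print(row_as_frame.columns.tolist())  # ['age', 'income']
```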
@DRMALEK maybe there is a need to access preprocessed data from AutoML to use it in LIME?
Yeah, LIME actually needs preprocessed data to do the explanations, and at the same time it needs it as a NumPy array. On top of that, it uses the predict() of the trained model to get the prediction, but for now the predict method of mljar only works with the pandas data type.
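The gap between the two APIs can be bridged with a thin adapter: LIME hands the predict function a 2-D numpy array of perturbed samples, so the wrapper can rebuild a DataFrame with the training-time column names before calling the pandas-only predict. A minimal sketch, using a dummy predictor in place of the real `automl.predict` (which is assumed, not shown):

```python
import numpy as np
import pandas as pd

# Column names captured at training time (assumption for this sketch)
columns = ["age", "income"]

def make_predict_fn(predict_df, columns):
    """Wrap a DataFrame-only predict so it also accepts numpy arrays."""
    def predict_fn(X):
        frame = pd.DataFrame(np.asarray(X), columns=columns)
        return predict_df(frame)
    return predict_fn

# Dummy DataFrame-based predictor standing in for the AutoML object
def dummy_predict(df):
    return (df["age"] > 40).astype(int).to_numpy()

predict_fn = make_predict_fn(dummy_predict, columns)
print(predict_fn(np.array([[35, 50000], [45, 60000]])))  # [0 1]
```

With the real object, `make_predict_fn(automl.predict, columns)` would be the function handed to LIME's explainer.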
I think I should try to integrate LIME directly into the source code of mljar once a similar SHAP-based method is used and tested there, right?
I would do one more experiment with LIME: how it works with an ensemble model that is manually trained. For example, you train an ensemble of a linear model from scikit-learn and CatBoost, where the ensemble is a simple average of both models. The linear model from scikit-learn will need preprocessing of the data (missing-value imputation + categorical handling), while CatBoost can work with categoricals directly. Do you think you will be able to prepare such an example? It can be on the adult income dataset.
This example is tricky because the ensemble will need two types of preprocessing, one for the linear model and one for CatBoost. Then which one should be passed to LIME?
Sorry for my late reply, I was busy with other responsibilities. I managed to use the LIME library on XGBoost and random forests; however, on the CatBoost algorithms I couldn't. The reasons for that:
1. CatBoost can work directly on categorical data without the need to encode it, so when you try to run explain_instance() on a single instance, LIME assumes that the passed data is encoded, while CatBoost assumes that it is not. So an exception is raised.
2. I tried to use a CatBoost encoder to encode the data before giving it to CatBoost (so that CatBoost can assume the data is just numerical); however, there is a huge difference in accuracy between the two models.
3. Basically, if CatBoost alone can't be used, I assume stacking will not work either 😅
I'm attaching the Colab: https://colab.research.google.com/drive/1WjCrY_ddaZ4uoDeoTpOT7eV3pee_e-OH?usp=sharing
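The first failure mode above can be reproduced without CatBoost or LIME at all. A stand-in model that expects raw string categories (as CatBoost does) rejects the numeric values LIME produces when it perturbs a label-encoded column (the predictor and category set here are invented for illustration):

```python
import numpy as np

KNOWN_CATEGORIES = {"private", "gov", "self"}

def catboost_like_predict(rows):
    """Stand-in for a model trained on raw categorical strings."""
    for row in rows:
        if row[1] not in KNOWN_CATEGORIES:
            raise ValueError(f"unexpected category value: {row[1]!r}")
    return [1 for _ in rows]

# Raw input works:
print(catboost_like_predict([[35, "gov"]]))  # [1]

# What LIME passes in after perturbing the label-encoded column:
perturbed = np.array([[35.0, 1.73]], dtype=object)  # 1.73 is not a category
try:
    catboost_like_predict(perturbed)
except ValueError as e:
    print("LIME-style perturbation rejected:", e)
```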
I think since we have a problem with LIME integration, I will try to work on providing explanations using the SHAP library instead.
@DRMALEK great research. Many times the packages look good in theory but are hard to use in practice.
Maybe SHAP will be easier to use with any model, but the time needed to compute the explanations might be long. For sure it is worth checking.
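For intuition about the cost concern: exact Shapley values average a feature's effect over all coalitions of the other features, which is what makes SHAP slow. The toy occlusion-style attribution below is *not* SHAP, but it shares the additive flavour (baseline prediction plus per-feature contributions) while costing only one model call per feature, by replacing one feature at a time with its background mean:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)
background = X.mean(axis=0)  # "feature removed" reference values

def occlusion_attributions(model, x, background):
    """Cheap per-feature contributions: base score minus masked score."""
    base = model.predict_proba(x.reshape(1, -1))[0, 1]
    contribs = np.empty_like(x)
    for i in range(len(x)):
        masked = x.copy()
        masked[i] = background[i]  # "remove" feature i
        contribs[i] = base - model.predict_proba(masked.reshape(1, -1))[0, 1]
    return contribs

contribs = occlusion_attributions(model, X[0], background)
print(contribs.shape)  # (5,)
```

SHAP's extra cost buys consistency guarantees this shortcut lacks, which is the speed/theory trade-off discussed earlier in this thread.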
Hi, I managed to get SHAP single-prediction explanations on models like CatBoost, and attached the Colab I used here. By the way, I noticed an issue regarding CatBoost SHAP predictions, #114, but I managed to get the dependency plot and summary plot for it. I will work on closing that issue too.
https://colab.research.google.com/drive/1IYGxlgTuC7xrGqpLKpivdmyQkao3zZpE?usp=sharing
Regarding the stacked model, I believe any model other than the CatBoost-based ones will work fine as the single models.
@DRMALEK great job! I found this repo https://github.com/oegedijk/explainerdashboard that might be interesting to you :)
Hi, I'm also trying to explain a single prediction against an MLJAR model trained in Compete mode. I'm trying to use the ELI5 framework (but I have no problem using another one like LIME or SHAP).
The way I tried it is:

import eli5

contrib = eli5.explain_prediction(
    mljar_model,                     # MLJAR AutoML object
    X_holdout.iloc[0],               # my split dataframe (without the target)
    feature_names=list(df.columns),  # columns of my original dataframe
)
The error I get:
Explanation(estimator="AutoML(ml_task='binary_classification', mode='Compete',
results_path='/model/a294cbb3/mljar', total_time_limit=60)",
description=None, error="estimator AutoML(ml_task='binary_classification', mode='Compete',
results_path='/model/a294cbb3/mljar', total_time_limit=60)
is not supported", method=None, is_regression=False, targets=None,
feature_importances=None, decision_tree=None, highlight_spaces=None,
transition_features=None, image=None)
I understand that I'm getting this error because ELI5 does not support MLJAR models. It supports:

But I'm wondering: how can I get the base model that makes the predictions from MLJAR? If I can explain a single prediction in an easier way with another framework, I'm more than happy to replace ELI5 with the new one.
Thank you very much in advance!
Was anyone able to get SHAP values for individual predictions with an MLJAR model? Or does it depend on the model type selected in the end?
Hi @jrhorne, in theory SHAP should be model-agnostic, but in practice it is very slow to compute explanations. As far as I know, the issue is still open. Maybe the option will be to search for another explanation tool. Can you recommend something that is model-agnostic and fast?
Thanks for the quick reply @pplonski. I can actually tolerate it being slow in this case, but using automl.predict in SHAP was not working. I'll see if I can find that error again. I believe it has to do, potentially, with data pipelines and how SHAP passes in the features.
You are right. It might be connected with features preprocessing that is done inside AutoML.
Would it be possible to extract only the feature processing part before passing back to SHAP? FWIW, I’m not currently using golden features. Thanks again for the input. I’ll see what I can find