Add SHAP explanations to each prediction
We already use SHAP explanations for models. It would be nice to also have an explanation for each individual prediction.
Hi, I'd like to work on this feature and get more info about it. By the way, I'm new to the open-source world, so please bear with me.
@DRMALEK thank you for your interest!
The idea is to provide explanations for each prediction made by AutoML. The SHAP package can be used for this, but I'm also open to other explanation techniques.
A similar issue: https://github.com/mljar/mljar-supervised/issues/180
The functionality can be obtained by adding the explain() method.
Do you have any ideas on how this can be achieved?
As far as I found, there is another technique besides SHAP values, named LIME, for local interpretability. It is implemented in the https://github.com/interpretml/interpret package. LIME tends to be faster than SHAP but has a weaker theoretical basis. I think we can try to add both, selected via a parameter of the explain() function. I don't know, is that the right way to go?
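One way the parameterized explain() could look, as a hypothetical sketch only (the method names and dispatch are assumptions, not the real mljar API; the backends are placeholders for actual SHAP/LIME calls):

```python
# Hypothetical sketch of a parameterized explain() for AutoML.
# All names here are illustrative; the real integration would close
# over the trained AutoML object and call shap/lime for real.

def explain_with_shap(model, sample):
    # Placeholder: would build a shap explainer around model.predict
    return {"method": "shap", "sample": sample}

def explain_with_lime(model, sample):
    # Placeholder: would use a LimeTabularExplainer on training data
    return {"method": "lime", "sample": sample}

def explain(model, sample, method="shap"):
    """Dispatch to an explanation backend chosen by `method`."""
    backends = {
        "shap": explain_with_shap,
        "lime": explain_with_lime,
    }
    if method not in backends:
        raise ValueError(f"Unknown explanation method: {method!r}")
    return backends[method](model, sample)
```

The point of the dispatch is that adding a third backend later (e.g. ELI5) only means registering one more entry.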
@DRMALEK sounds great! Having both methods would be fantastic. From my experience, the SHAP method is very slow; I never tried to run it with Ensemble because of the speed. In theory, though, it should work ...
Never tried LIME, to be honest. Are you able to provide an example (maybe a notebook) where you create LIME explanations for the AutoML object? The example will:
- Train AutoML on some dataset (AutoML with several models and Ensemble)
- Compute prediction for one sample
- Compute explanations for one sample

It will be a nice example to check the speed of LIME for complex models.
Sure. I will start to work on it.
Hi, when I tried to get the local explanations, I noticed a bug in the predict function of the AutoML object when you provide it with a numpy.ndarray instead of a pandas.DataFrame. Basically, you get an error like:
`AutoMLException: Missing column: age in input data. Cannot predict.`
This happens because _build_dataframe() adds template-based column names like feature_1, feature_2, etc., which are not in the original _data_info list.
I'm sharing a Colab notebook for the problem: https://colab.research.google.com/drive/18GxiNCQ9ry-EuqNIKNDcySmj52l7zV0f?usp=sharing
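A toy reproduction of the column-name mismatch described above, with the `_build_dataframe()` behaviour paraphrased from this thread (the helper and the stored `_data_info` column list here are stand-ins, not the actual mljar source):

```python
import numpy as np
import pandas as pd

# Columns recorded at fit() time (stand-in for the _data_info list)
data_info_columns = ["age", "income"]

def build_dataframe(X):
    """Mimic wrapping a raw ndarray with template column names."""
    if isinstance(X, np.ndarray):
        X = pd.DataFrame(
            X, columns=[f"feature_{i + 1}" for i in range(X.shape[1])]
        )
    return X

X_numpy = np.array([[35, 50000.0]])
df = build_dataframe(X_numpy)

# The membership check that predict() performs then fails:
missing = [c for c in data_info_columns if c not in df.columns]
print(missing)  # ['age', 'income'] -> "Missing column: age ... Cannot predict."
```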
@DRMALEK thank you for the notebook, a few comments:
- you don't need to apply features and target preprocessing to use AutoML, just throw data at it. Does LIME need any data preprocessing?
- looks like you need to pass a data frame as the input for computing a prediction; try `automl.predict(X_test.iloc[:1])`. Maybe there is a bug in the code, because I can't run predictions for `X_test.iloc[0]`; only `X_test.iloc[:1]` works in the predict method. I will add issues to fix this.
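The difference between `X_test.iloc[0]` and `X_test.iloc[:1]` mentioned above comes down to pandas indexing semantics: a scalar index returns a 1-D Series (the column names become the index), while a slice returns a one-row DataFrame that still carries its column names, which is what predict() expects:

```python
import pandas as pd

X_test = pd.DataFrame({"age": [35, 41], "income": [50000, 62000]})

row_as_series = X_test.iloc[0]    # pandas.Series, shape (2,)
row_as_frame = X_test.iloc[:1]    # pandas.DataFrame, shape (1, 2)

print(type(row_as_series).__name__)  # Series
print(type(row_as_frame).__name__)   # DataFrame
print(row_as_frame.columns.tolist())  # ['age', 'income']
```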
@DRMALEK maybe there is a need to access preprocessed data from AutoML to use it in LIME?
Yeah, LIME actually needs preprocessed data to do the explanations, and at the same time it needs it as a NumPy array. On top of that, it uses the predict() of the trained model to get the prediction, but for now the predict method of mljar only works with the pandas data type.
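The gap between the two APIs can be bridged with a thin adapter: LIME hands the predict function a 2-D numpy array of perturbed samples, so the wrapper can rebuild a DataFrame with the training-time column names before calling the pandas-only predict. A minimal sketch, using a dummy predictor in place of the real `automl.predict` (which is assumed, not shown):

```python
import numpy as np
import pandas as pd

# Column names captured at training time (assumption for this sketch)
columns = ["age", "income"]

def make_predict_fn(predict_df, columns):
    """Wrap a DataFrame-only predict so it also accepts numpy arrays."""
    def predict_fn(X):
        frame = pd.DataFrame(np.asarray(X), columns=columns)
        return predict_df(frame)
    return predict_fn

# Dummy DataFrame-based predictor standing in for the AutoML object
def dummy_predict(df):
    return (df["age"] > 40).astype(int).to_numpy()

predict_fn = make_predict_fn(dummy_predict, columns)
print(predict_fn(np.array([[35, 50000], [45, 60000]])))  # [0 1]
```

With the real object, `make_predict_fn(automl.predict, columns)` would be the function handed to LIME's explainer.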
I think I should try to integrate LIME directly into the source code of mljar once a similar SHAP-based method is used and tested there, right?
I would do one more experiment with LIME: how it works with an ensemble model that is manually trained. For example, you train an ensemble of a linear model from scikit-learn and CatBoost, where the ensemble is a simple average of both models. The linear model from scikit-learn will need preprocessing of the data (missing-value imputation + categorical handling), while CatBoost can work with categoricals directly. Do you think you will be able to prepare such an example? It can be on the adult income dataset.
This example is tricky because the ensemble will need two types of preprocessing, one for the linear model and one for CatBoost. Then which one should be passed to LIME?
Sorry for my late reply, I was busy with other responsibilities. I managed to use the LIME library on XGBoost and random forests; however, on the CatBoost algorithms I couldn't. The reasons for that:
1. CatBoost can work directly on categorical data without the need to encode it, so when you try to run explain_instance() on a single instance, LIME assumes that the passed data is encoded, while CatBoost assumes that it is not. So an exception is raised.
2. I tried to use a CatBoost encoder to encode the data before giving it to CatBoost (so that CatBoost can assume the data is just numerical); however, there is a huge difference in accuracy between the two models.
3. Basically, if CatBoost alone can't be used, I assume stacking will not work either 😅
I'm attaching the Colab: https://colab.research.google.com/drive/1WjCrY_ddaZ4uoDeoTpOT7eV3pee_e-OH?usp=sharing
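The first failure mode above can be reproduced without CatBoost or LIME at all. A stand-in model that expects raw string categories (as CatBoost does) rejects the numeric values LIME produces when it perturbs a label-encoded column (the predictor and category set here are invented for illustration):

```python
import numpy as np

KNOWN_CATEGORIES = {"private", "gov", "self"}

def catboost_like_predict(rows):
    """Stand-in for a model trained on raw categorical strings."""
    for row in rows:
        if row[1] not in KNOWN_CATEGORIES:
            raise ValueError(f"unexpected category value: {row[1]!r}")
    return [1 for _ in rows]

# Raw input works:
print(catboost_like_predict([[35, "gov"]]))  # [1]

# What LIME passes in after perturbing the label-encoded column:
perturbed = np.array([[35.0, 1.73]], dtype=object)  # 1.73 is not a category
try:
    catboost_like_predict(perturbed)
except ValueError as e:
    print("LIME-style perturbation rejected:", e)
```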
I think since we have a problem with LIME integration, I will try to work on providing explanations using the SHAP library instead.
@DRMALEK great research. Many times the packages look good in theory but are hard to use in practice.
Maybe SHAP will be easier to use with any model, but the time needed to compute the explanations might be long. For sure it is worth checking.
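For intuition about the cost concern: exact Shapley values average a feature's effect over all coalitions of the other features, which is what makes SHAP slow. The toy occlusion-style attribution below is *not* SHAP, but it shares the additive flavour (baseline prediction plus per-feature contributions) while costing only one model call per feature, by replacing one feature at a time with its background mean:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)
background = X.mean(axis=0)  # "feature removed" reference values

def occlusion_attributions(model, x, background):
    """Cheap per-feature contributions: base score minus masked score."""
    base = model.predict_proba(x.reshape(1, -1))[0, 1]
    contribs = np.empty_like(x)
    for i in range(len(x)):
        masked = x.copy()
        masked[i] = background[i]  # "remove" feature i
        contribs[i] = base - model.predict_proba(masked.reshape(1, -1))[0, 1]
    return contribs

contribs = occlusion_attributions(model, X[0], background)
print(contribs.shape)  # (5,)
```

SHAP's extra cost buys consistency guarantees this shortcut lacks, which is the speed/theory trade-off discussed earlier in this thread.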
Hi, I managed to get SHAP single-prediction explanations on models like CatBoost, and attached the Colab I used here. By the way, I noticed an issue regarding CatBoost SHAP predictions, #114, but I managed to get the dependency plot and summary plot for it. I will work on closing that issue too.
https://colab.research.google.com/drive/1IYGxlgTuC7xrGqpLKpivdmyQkao3zZpE?usp=sharing
Regarding the stacked model, I believe any model other than the CatBoost-based ones will work fine as the single models.
@DRMALEK great job! I found this repo https://github.com/oegedijk/explainerdashboard that might be interesting to you :)
Hi, I'm also trying to explain a single prediction against an MLJAR model trained in Compete mode. I'm trying to use the ELI5 framework (but I have no problem using another one like LIME or SHAP).
The way I tried it is:

import eli5

contrib = eli5.explain_prediction(
    mljar_model,                     # MLJAR AutoML object
    X_holdout.iloc[0],               # my split dataframe (without the target)
    feature_names=list(df.columns),  # columns of my original dataframe
)
The error I get:
Explanation(estimator="AutoML(ml_task='binary_classification', mode='Compete',
results_path='/model/a294cbb3/mljar', total_time_limit=60)",
description=None, error="estimator AutoML(ml_task='binary_classification', mode='Compete',
results_path='/model/a294cbb3/mljar', total_time_limit=60)
is not supported", method=None, is_regression=False, targets=None,
feature_importances=None, decision_tree=None, highlight_spaces=None,
transition_features=None, image=None)
I understand that I'm getting this error because ELI5 does not support MLJAR models. It supports:

But I'm wondering: how can I get the base model that makes the predictions from MLJAR? If I can explain a single prediction in an easier way with another framework, I'm more than happy to replace ELI5 with the new one.
Thank you very much in advance!
Was anyone able to get SHAP values for individual predictions with an MLJAR model? Or does it depend on the model type selected in the end?
Hi @jrhorne, in theory SHAP should be model-agnostic, but in practice it is very slow to compute explanations. As far as I know, the issue is still open. Maybe the option will be to search for another explanation tool. Can you recommend something that is model-agnostic and fast?
Thanks for the quick reply @pplonski. I can actually tolerate it being slow in this case, but using automl.predict in SHAP was not working. I'll see if I can find that error again. I believe it has to do, potentially, with data pipelines and how SHAP passes in the features.
You are right. It might be connected with features preprocessing that is done inside AutoML.
Would it be possible to extract only the feature processing part before passing back to SHAP? FWIW, I’m not currently using golden features. Thanks again for the input. I’ll see what I can find