- The PR - scikit-learn/scikit-learn #5974
- Missing values are handled at each split by deciding which child node is the best to send them to, based on the resulting impurity decrease.
- XGBoost handles the missing values in a very similar way. (dmlc/xgboost#21 (comment)) (Thanks to Jacob!)
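A minimal sketch of the direction rule described above, using NumPy only: for a fixed candidate split, the missing samples are tried on the left child and then on the right child, and the direction with the larger impurity decrease wins. The function names (`gini`, `best_missing_direction`) and the toy data are illustrative, not the actual Cython code in the PR.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array (0.0 for an empty node)."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_missing_direction(x, y, threshold):
    """For one candidate split on feature ``x`` (NaN = missing), try sending
    the missing samples left and then right, and keep whichever direction
    yields the larger impurity decrease."""
    missing = np.isnan(x)
    go_left = (x <= threshold) & ~missing
    go_right = (x > threshold) & ~missing
    parent, n = gini(y), len(y)
    best = ("left", -np.inf)
    for direction in ("left", "right"):
        l = go_left | missing if direction == "left" else go_left
        r = go_right | missing if direction == "right" else go_right
        decrease = parent - (l.sum() / n) * gini(y[l]) - (r.sum() / n) * gini(y[r])
        if decrease > best[1]:
            best = (direction, decrease)
    return best

# Here the missing values all belong to class 1, so sending them right
# (with the other class-1 samples) gives the larger impurity decrease.
x = np.array([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])
y = np.array([0, 0, 1, 1, 1, 1])
direction, decrease = best_missing_direction(x, y, threshold=2.0)
```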
- This is working for `RandomForestClassifier`, `DecisionTreeClassifier`, `BestSplitter` (dense only) and `ClassificationCriterion`.
- Yet to implement this for `BestFirstTreeBuilder`, `RandomSplitter` and `{Best|Random}SparseSplitter`.
- Yet to implement this for `RandomForestRegressor`, `DecisionTreeRegressor` and `RegressionCriterion`.
- I wrote a `drop_value` function to successively introduce missing data. The code can be found here.
- Self-note: Ref - https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
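The actual `drop_value` code is linked above; a rough NumPy sketch of what such a helper could look like (the signature and `missing_fraction` parameter are my assumptions, not the PR's):

```python
import numpy as np

def drop_values(X, missing_fraction, random_state=0):
    """Return a copy of ``X`` with the given fraction of entries replaced
    by NaN, chosen uniformly at random (hypothetical sketch)."""
    rng = np.random.default_rng(random_state)
    X = X.astype(float).copy()
    n_drop = int(round(missing_fraction * X.size))
    idx = rng.choice(X.size, size=n_drop, replace=False)
    X.flat[idx] = np.nan  # flat indexing touches entries across all columns
    return X

X = np.arange(20.0).reshape(4, 5)
X_miss = drop_values(X, missing_fraction=0.25)  # 5 of 20 entries become NaN
```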
- Also ref - https://github.com/hammerlab/fancyimpute
The benchmark notebook - https://github.com/rvraghav93/scikit_tree_methods_with_missing_value_support/blob/master/missing_val_bench.ipynb
Dataset: 1/20 of the covtype dataset
Scoring: mean score across 3 iterations of StratifiedShuffleSplit
n_estimators: 50
Comparing: RF w/ MV support, imputation + RF w/o MV support, XGBoost's RF w/ MV support, and imputation + XGBoost's RF w/o MV support
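The full benchmark is in the notebook linked above; for orientation, a self-contained sketch of the imputation baseline ("Imp + RF w/o MV") on synthetic data, using the modern `SimpleImputer` API rather than whatever the notebook used, and a toy dataset in place of the covtype subset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the 1/20 covtype subset used in the notebook.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
rng = np.random.default_rng(0)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.2] = np.nan  # 20% MCAR missingness

# Imputation + RF baseline: fill NaNs, then fit a 50-tree forest.
model = make_pipeline(SimpleImputer(strategy="mean"),
                      RandomForestClassifier(n_estimators=50, random_state=0))

# Mean score across 3 iterations of StratifiedShuffleSplit, as above.
cv = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
scores = [model.fit(X_miss[tr], y[tr]).score(X_miss[te], y[te])
          for tr, te in cv.split(X_miss, y)]
mean_score = np.mean(scores)
```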
- When all the classes are present, and the missing values across all features correspond to one of the classes (MNAR, class 1):
We see a significant advantage for our method over imputation; moreover, this implementation vs. imputation behaves much like XGBoost's RF vs. imputation.
The missingness in this case adds information (it tells us which samples have label 1), hence performance increases with increasing missing fraction.
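A sketch of how such MNAR missingness could be generated: NaN out entries only in rows belonging to one target class, so the missingness pattern itself reveals the label. The helper name and signature are illustrative, not the notebook's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 3, size=100)

def drop_values_mnar(X, y, target_class, missing_fraction, rng):
    """NaN out entries only in rows whose label is ``target_class``,
    so the missingness itself is informative about the label (MNAR)."""
    X = X.astype(float).copy()
    rows = y == target_class
    sub = X[rows]                                    # copy of target-class rows
    sub[rng.random(sub.shape) < missing_fraction] = np.nan
    X[rows] = sub                                    # write the NaNs back
    return X

X_mnar = drop_values_mnar(X, y, target_class=1, missing_fraction=0.3, rng=rng)
```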

- When the missing values are completely at random (MCAR).
Our method performs very similarly to imputation, while handling the missing data natively.
As the missingness in this case is essentially noise, performance drops with increasing missing fraction.

- Simonoff's paper comparing different methods on binary response data (>100 citations)
- Handling Missing Values when Applying Classification Models (>100 citations)
- Ding/Simonoff PhD thesis - should find the actual link, or is it the same as 1? - Google Books link
- Missing Data Imputation for Tree-based Methods - 2006 - Yan He, UCLA
- http://sci2s.ugr.es/keel/pdf/specific/capitulo/IFCS04r.pdf - Compares various imputation methods and case deletion.