- The PR - scikit-learn/scikit-learn #5974
- Missing values are handled at each split by deciding which child node is the best to send them to, based on the resulting impurity decrease.
- XGBoost handles the missing values in a very similar way. (dmlc/xgboost#21 (comment)) (Thanks to Jacob!)
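A minimal sketch of the direction rule described above, using NumPy only: for a fixed candidate split, the missing samples are tried on the left child and then on the right child, and the direction with the larger impurity decrease wins. The function names (`gini`, `best_missing_direction`) and the toy data are illustrative, not the actual Cython code in the PR.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array (0.0 for an empty node)."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_missing_direction(x, y, threshold):
    """For one candidate split on feature ``x`` (NaN = missing), try sending
    the missing samples left and then right, and keep whichever direction
    yields the larger impurity decrease."""
    missing = np.isnan(x)
    go_left = (x <= threshold) & ~missing
    go_right = (x > threshold) & ~missing
    parent, n = gini(y), len(y)
    best = ("left", -np.inf)
    for direction in ("left", "right"):
        l = go_left | missing if direction == "left" else go_left
        r = go_right | missing if direction == "right" else go_right
        decrease = parent - (l.sum() / n) * gini(y[l]) - (r.sum() / n) * gini(y[r])
        if decrease > best[1]:
            best = (direction, decrease)
    return best

# Here the missing values all belong to class 1, so sending them right
# (with the other class-1 samples) gives the larger impurity decrease.
x = np.array([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])
y = np.array([0, 0, 1, 1, 1, 1])
direction, decrease = best_missing_direction(x, y, threshold=2.0)
```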
- This is working for `RandomForestClassifier`, `DecisionTreeClassifier`, `BestSplitter` (dense only) and `ClassificationCriterion`.
- Yet to implement this for `BestFirstTreeBuilder`, `RandomSplitter` and `{Best|Random}SparseSplitter`.
- Yet to implement this for `RandomForestRegressor`, `DecisionTreeRegressor` and `RegressionCriterion`.
- I wrote a `drop_value` function to successively introduce missing data. The code can be found here.
- Self-note: Ref - https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
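The actual `drop_value` code is linked above; a rough NumPy sketch of what such a helper could look like (the signature and `missing_fraction` parameter are my assumptions, not the PR's):

```python
import numpy as np

def drop_values(X, missing_fraction, random_state=0):
    """Return a copy of ``X`` with the given fraction of entries replaced
    by NaN, chosen uniformly at random (hypothetical sketch)."""
    rng = np.random.default_rng(random_state)
    X = X.astype(float).copy()
    n_drop = int(round(missing_fraction * X.size))
    idx = rng.choice(X.size, size=n_drop, replace=False)
    X.flat[idx] = np.nan  # flat indexing touches entries across all columns
    return X

X = np.arange(20.0).reshape(4, 5)
X_miss = drop_values(X, missing_fraction=0.25)  # 5 of 20 entries become NaN
```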
- Also ref - https://github.com/hammerlab/fancyimpute
The benchmark notebook - https://github.com/rvraghav93/scikit_tree_methods_with_missing_value_support/blob/master/missing_val_bench.ipynb
Dataset: 1/20 of the covtype dataset
Scoring: mean score across 3 iterations of StratifiedShuffleSplit
n_estimators: 50
Comparing: RF w/ MV support, imputation + RF w/o MV support, XGBoost's RF w/ MV support, and imputation + XGBoost's RF w/o MV support
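The full benchmark is in the notebook linked above; for orientation, a self-contained sketch of the imputation baseline ("Imp + RF w/o MV") on synthetic data, using the modern `SimpleImputer` API rather than whatever the notebook used, and a toy dataset in place of the covtype subset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the 1/20 covtype subset used in the notebook.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
rng = np.random.default_rng(0)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.2] = np.nan  # 20% MCAR missingness

# Imputation + RF baseline: fill NaNs, then fit a 50-tree forest.
model = make_pipeline(SimpleImputer(strategy="mean"),
                      RandomForestClassifier(n_estimators=50, random_state=0))

# Mean score across 3 iterations of StratifiedShuffleSplit, as above.
cv = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
scores = [model.fit(X_miss[tr], y[tr]).score(X_miss[te], y[te])
          for tr, te in cv.split(X_miss, y)]
mean_score = np.mean(scores)
```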
- When all the classes are present, and the missing values across all features correspond to one of the classes (MNAR, class 1):
We see a significant advantage for our method over imputation; moreover, this implementation vs. imputation behaves much like XGBoost's RF vs. imputation.
The missingness in this case adds information (it tells us which samples have label 1), hence performance increases with increasing missing fraction.
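A sketch of how such MNAR missingness could be generated: NaN out entries only in rows belonging to one target class, so the missingness pattern itself reveals the label. The helper name and signature are illustrative, not the notebook's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 3, size=100)

def drop_values_mnar(X, y, target_class, missing_fraction, rng):
    """NaN out entries only in rows whose label is ``target_class``,
    so the missingness itself is informative about the label (MNAR)."""
    X = X.astype(float).copy()
    rows = y == target_class
    sub = X[rows]                                    # copy of target-class rows
    sub[rng.random(sub.shape) < missing_fraction] = np.nan
    X[rows] = sub                                    # write the NaNs back
    return X

X_mnar = drop_values_mnar(X, y, target_class=1, missing_fraction=0.3, rng=rng)
```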

- When the missing values are completely at random (MCAR).
Our method performs very similarly to imputation, while handling the missing data natively.
As the missingness in this case is essentially noise, performance drops with increasing missing fraction.

- Simonoff's paper comparing different methods on binary response data (>100 citations)
- Handling Missing Values when Applying Classification Models (>100 citations)
- Ding/Simonoff PhD thesis - should find the actual link, or is it the same as 1? - Google Books link
- Missing Data Imputation for Tree-based Methods - 2006 - Yan He, UCLA
- http://sci2s.ugr.es/keel/pdf/specific/capitulo/IFCS04r.pdf - Compares various imputation methods and case deletion.