# Notes

## Some recent benchmarks

The benchmark notebook: https://github.com/rvraghav93/scikit_tree_methods_with_missing_value_support/blob/master/missing_val_bench.ipynb

- Dataset: 1/20 of the covtype dataset
- Scoring: mean score across 3 iterations of StratifiedShuffleSplit
- n_estimators: 50
- Methods compared: RF with native missing-value support (RF w/ MV), imputation followed by plain RF (Imp + RF w/o MV), XGBoost's RF w/ MV, and imputation followed by XGBoost's RF w/o MV
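The scoring protocol above can be sketched as follows. This is a minimal illustration, not the notebook's code: a synthetic dataset stands in for the 1/20 covtype subsample, and the forest is a stock scikit-learn `RandomForestClassifier`.

```python
# Mean score across 3 iterations of StratifiedShuffleSplit,
# using a 50-tree random forest (mirrors the setup described above).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the covtype subsample (an assumption for
# illustration; the benchmark uses 1/20 of the real covtype data).
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           random_state=0)

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
scores = []
for train_idx, test_idx in sss.split(X, y):
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

mean_score = np.mean(scores)
```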

  1. Missing not at random (MNAR): all classes are present, and missing values across all features occur only in samples of one class (class 1).

Our method shows a significant advantage over imputation. The gap between this implementation and imputation is similar to the gap between XGBoost's RF and imputation.

The missingness in this case adds information (it tells us which samples have label 1), so performance increases with increasing missing fraction.

![MNAR](https://i.imgur.com/72l1FG8.png)
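The MNAR setup above can be reproduced with a small helper that injects NaNs only into samples of one class, so that the missingness pattern itself encodes the label. `inject_mnar`, `target_class`, and `missing_fraction` are hypothetical names chosen for illustration, not values from the benchmark notebook.

```python
import numpy as np

def inject_mnar(X, y, target_class, missing_fraction, rng):
    """Return a copy of X where NaNs appear only in rows whose label
    equals `target_class` (MNAR: missingness depends on the label)."""
    X_miss = X.astype(float).copy()
    rows = np.where(y == target_class)[0]
    # Mask a random subset of entries, but only within those rows.
    mask = rng.rand(rows.size, X.shape[1]) < missing_fraction
    sub = X_miss[rows]          # fancy indexing yields a copy
    sub[mask] = np.nan
    X_miss[rows] = sub
    return X_miss

rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = rng.randint(0, 3, size=100)
X_miss = inject_mnar(X, y, target_class=1, missing_fraction=0.4, rng=rng)
# Every NaN lies in a class-1 row; all other rows stay fully observed.
```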
  2. Missing completely at random (MCAR): missing values are independent of both the features and the labels.

Our method performs very similarly to imputation while handling the missing data natively.

As the missingness in this case is essentially noise, performance drops with increasing missing fraction.

![MCAR](https://i.imgur.com/WaSZRwB.png)
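For contrast, the MCAR case masks every entry with the same probability, independent of the label, so the missingness carries no signal. The "Imp + RF w/o MV" baseline can be sketched as mean imputation feeding a stock scikit-learn forest (which does not accept NaNs itself, hence the imputation step); the 30% missing fraction here is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=600, random_state=0)

# MCAR: each entry is masked with probability 0.3, regardless of y.
X_miss = X.copy()
X_miss[rng.rand(*X.shape) < 0.3] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_miss, y, stratify=y, random_state=0)

# Imputation baseline: fill NaNs with the column mean, then fit
# a standard 50-tree random forest.
imp_rf = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(n_estimators=50, random_state=0))
imp_rf.fit(X_tr, y_tr)
score = imp_rf.score(X_te, y_te)
```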

## References
  1. Simonoff's paper comparing different methods on binary response data (> 100 citations)
  2. "Handling Missing Values when Applying Classification Models" (> 100 citations)
  3. Ding's PhD thesis with Simonoff - need to find the actual link, or is it the same as 1? (Google Books link)
  4. "Missing Data Imputation for Tree-based Methods", Yan He, UCLA, 2006
  5. http://sci2s.ugr.es/keel/pdf/specific/capitulo/IFCS04r.pdf - compares various imputation methods and case deletion