[ENH] Provision generating missing values and add parameters to control the same #6284
Description
1. Would it be worthwhile to add parameters to control missingness in dataset generators?
I need this for benchmarking [MRG] ENH Add support for missing values to Tree based Classifiers #5974. Thought this might come in handy for teaching too.
Typically I would like to add `missing_rate`, `target_correlation_rate`, `feature_correlation_rate` and `missing_values`.
The `target_correlation_rate` would control the extent to which the dataset is MNAR* and `feature_correlation_rate` would control the extent to which the dataset is MAR†.
`target_correlation_rate + feature_correlation_rate <= 1`
`1 - (target_correlation_rate + feature_correlation_rate)` would control the extent to which the dataset is MCAR‡.
Does this sound good?
2. Either in addition to 1 or as an alternative to it, could we have transformers that introduce missing values, using the parameters described above?
* - Missing Not At Random (Missingness is correlated with the target)
† - Missing At Random but correlated with the other feature values.
‡ - Missing Completely At Random (No correlation with either the target or the other feature values)
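
To make the proposal concrete, here is a minimal sketch of how the four parameters could interact. It is not an existing scikit-learn API: the function name `make_missing`, the helper `_rank_weights`, the rank-based weighting, and the choice to let each feature's missingness depend on the next feature are all assumptions made for illustration. The three rates are used as mixture weights over the MNAR*, MAR† and MCAR‡ mechanisms as defined in the footnotes above.

```python
import numpy as np


def _rank_weights(values):
    """Strictly positive weights that grow with the rank of each value.

    Works column-wise on 2D input; each column of weights sums to 1.
    """
    ranks = np.argsort(np.argsort(values, axis=0), axis=0) + 1.0
    return ranks / ranks.sum(axis=0, keepdims=True)


def make_missing(X, y, missing_rate=0.1, target_correlation_rate=0.0,
                 feature_correlation_rate=0.0, missing_values=np.nan,
                 random_state=None):
    """Hypothetical helper: return a copy of X with missing entries and the mask.

    Assumes X is a numeric 2D array and missing_values is a numeric sentinel.
    """
    X = np.array(X, dtype=float)  # np.array copies, so the input is untouched
    n_samples, n_features = X.shape

    mcar_rate = 1.0 - (target_correlation_rate + feature_correlation_rate)
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must be in [0, 1]")
    if min(target_correlation_rate, feature_correlation_rate) < 0.0 \
            or mcar_rate < -1e-12:
        raise ValueError("target_correlation_rate and feature_correlation_rate "
                         "must be non-negative and sum to at most 1")
    mcar_rate = max(mcar_rate, 0.0)  # absorb floating-point round-off

    rng = np.random.default_rng(random_state)

    # MNAR* part: rows with a larger target value are more likely to lose cells.
    w_target = np.tile(_rank_weights(np.asarray(y))[:, None],
                       (1, n_features)) / n_features
    # MAR† part: cell (i, j) is more likely to go missing when the *next*
    # feature (j + 1, wrapping around) is large in row i. With a single
    # feature this degenerates to dependence on the feature itself.
    w_feature = _rank_weights(np.roll(X, -1, axis=1)) / n_features
    # MCAR‡ part: every cell is equally likely.
    w_uniform = np.full(X.shape, 1.0 / X.size)

    weights = (target_correlation_rate * w_target
               + feature_correlation_rate * w_feature
               + mcar_rate * w_uniform).ravel()
    weights /= weights.sum()

    # Draw exactly round(missing_rate * n_cells) distinct cells.
    n_missing = int(round(missing_rate * X.size))
    chosen = rng.choice(X.size, size=n_missing, replace=False, p=weights)

    mask = np.zeros(X.size, dtype=bool)
    mask[chosen] = True
    mask = mask.reshape(X.shape)
    X[mask] = missing_values
    return X, mask


# Usage: 20% missing cells, half of the mechanism tied to the target,
# 30% tied to other features, and the remaining 20% completely at random.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(200, 5))
y_full = X_full[:, 0] + rng.normal(size=200)
X_missing, mask = make_missing(X_full, y_full, missing_rate=0.2,
                               target_correlation_rate=0.5,
                               feature_correlation_rate=0.3,
                               random_state=0)
```

Sampling a fixed number of cells without replacement keeps the realised missing fraction equal to `missing_rate`; a per-cell Bernoulli draw would be an equally plausible design if only the expected rate needs to match. The same function body could back either a generator parameter (item 1) or a transformer (item 2).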