[ENH] Provision generating missing values and add parameters to control the same #6284

@raghavrv

  1. Would it be worthwhile to add parameters to control missingness in dataset generators?

    I need this for benchmarking [MRG] ENH Add support for missing values to Tree based Classifiers #5974. Thought this might come in handy for teaching too.

    Specifically, I would like to add missing_rate, target_correlation_rate, feature_correlation_rate and missing_values.

    The target_correlation_rate would control the extent to which the dataset is MNAR* and feature_correlation_rate would control the extent to which the dataset is MAR†.

    target_correlation_rate + feature_correlation_rate <= 1

    1 - (target_correlation_rate + feature_correlation_rate) would control the extent to which the dataset is MCAR‡.

    Does this sound good?

  2. Either in addition to 1 or as an alternative to it, could we have transformers that inject missing values, using the parameters described above?


* - Missing Not At Random (Missingness is correlated with the target)

† - Missing At Random (Missingness is correlated with the other feature values)

‡ - Missing Completely At Random (No correlation with either the target or the other feature values)
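
To make the proposal concrete, here is a rough sketch of how the three rates could be combined. Nothing in it is existing scikit-learn API: the function name add_missing_values, the rank-based scoring, and the choice to drive the MAR part of feature j from the values of feature j + 1 are all assumptions made for illustration. Each entry gets a score that is a weighted mix of a target-driven term, a feature-driven term and uniform noise, and the highest-scoring missing_rate fraction of entries is masked.

```python
import numpy as np


def _rank01(a):
    """Rank values along axis 0 and scale the ranks to [0, 1]."""
    ranks = np.argsort(np.argsort(a, axis=0), axis=0)
    return ranks / max(a.shape[0] - 1, 1)


def add_missing_values(X, y, missing_rate=0.1, target_correlation_rate=0.0,
                       feature_correlation_rate=0.0, missing_values=np.nan,
                       random_state=None):
    """Hypothetical helper: mask about missing_rate of the entries of X.

    Returns a masked copy of X and the boolean mask of masked entries.
    """
    t, f = target_correlation_rate, feature_correlation_rate
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must be in [0, 1]")
    if t < 0 or f < 0 or t + f > 1.0 + 1e-12:
        raise ValueError("the two correlation rates must be non-negative "
                         "and sum to at most 1")

    rng = np.random.default_rng(random_state)
    X = np.array(X, dtype=float)  # np.array copies, so the input is untouched
    y = np.asarray(y, dtype=float)

    # Target-driven term (MNAR in the sense used in this issue):
    # samples with a larger target get a larger score in every column.
    target_score = _rank01(y)[:, np.newaxis]

    # Feature-driven term (MAR): the score of feature j follows the values
    # of feature j + 1 (an assumption). With a single feature this falls
    # back to the feature's own values.
    feature_score = _rank01(np.roll(X, -1, axis=1))

    # Remaining weight goes to pure noise (MCAR).
    mcar_rate = max(0.0, 1.0 - (t + f))
    noise = rng.random(X.shape)

    score = t * target_score + f * feature_score + mcar_rate * noise

    # Mask the entries with the highest scores.
    n_missing = int(round(missing_rate * X.size))
    mask = np.zeros(X.shape, dtype=bool)
    if n_missing > 0:
        top = np.argsort(score, axis=None)[-n_missing:]
        mask[np.unravel_index(top, X.shape)] = True

    X[mask] = missing_values
    return X, mask
```

A call such as add_missing_values(X, y, missing_rate=0.2, target_correlation_rate=0.5, feature_correlation_rate=0.3) would then leave a weight of 0.2 on the MCAR component. The same function could back either a generator parameter (item 1) or a transformer (item 2); which of the two is the better home is the open question here.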

Ping @agramfort @glouppe @GaelVaroquaux
