[SPRINT] Add warning notes in preprocessing functions #17402

@NicolasHug

The goal here is to add a warning note to the docstring of each of the preprocessing functions listed below (a follow-up to #17387). The note should warn about potential issues when using these functions and recommend using a pipeline instead:

  • maxabs_scale
  • minmax_scale
  • normalize
  • quantile_transform
  • robust_scale
  • scale
  • power_transform

All of these are in sklearn/preprocessing/_data.py. Here is a warning template:

    .. warning:: Risk of data leak

        Do not use :func:`~sklearn.preprocessing.scale` unless you know what
        you are doing. A common mistake is to apply it to the entire data
        *before* splitting into training and test sets. This will bias the
        model evaluation because information would have leaked from the test
        set to the training set.
        In general, we recommend using
        :class:`~sklearn.preprocessing.StandardScaler` within a
        :ref:`Pipeline <pipeline>` in order to prevent most risks of data
        leaking: `pipe = make_pipeline(StandardScaler(), LogisticRegression())`.
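
For reference, here is a minimal sketch of the pattern the template recommends. The synthetic dataset and the variable names are assumptions made for illustration only; they are not part of the template.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Split first, so the test set is never seen by any fitted step.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler learns its mean and variance from X_train only.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# When scoring, the test data is scaled with the training statistics.
test_accuracy = pipe.score(X_test, y_test)
```

Calling `scale(X)` on the full array before `train_test_split` would instead compute the mean and variance over the test rows as well, which is the leak the warning describes.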

You should of course adapt scale and StandardScaler to the function you are working on and its transformer counterpart.
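
To make the adaptation concrete, here is a rough sketch that fills the template for each function. The names `FUNCTION_TO_TRANSFORMER`, `WARNING_TEMPLATE` and `render_warning` are hypothetical helpers invented for this example, not anything in scikit-learn. The function/class pairs are the usual counterparts in `sklearn.preprocessing`; please double-check them against `_data.py` before opening a PR.

```python
# Hypothetical helper: pairs each function with its transformer class.
FUNCTION_TO_TRANSFORMER = {
    "maxabs_scale": "MaxAbsScaler",
    "minmax_scale": "MinMaxScaler",
    "normalize": "Normalizer",
    "quantile_transform": "QuantileTransformer",
    "robust_scale": "RobustScaler",
    "scale": "StandardScaler",
    "power_transform": "PowerTransformer",
}

# The warning template from this issue, with placeholders for the names.
WARNING_TEMPLATE = """\
.. warning:: Risk of data leak

    Do not use :func:`~sklearn.preprocessing.{func}` unless you know what
    you are doing. A common mistake is to apply it to the entire data
    *before* splitting into training and test sets. This will bias the
    model evaluation because information would have leaked from the test
    set to the training set.
    In general, we recommend using
    :class:`~sklearn.preprocessing.{cls}` within a
    :ref:`Pipeline <pipeline>` in order to prevent most risks of data
    leaking: `pipe = make_pipeline({cls}(), LogisticRegression())`.
"""


def render_warning(func_name: str) -> str:
    """Return the warning text adapted to one preprocessing function."""
    cls_name = FUNCTION_TO_TRANSFORMER[func_name]
    return WARNING_TEMPLATE.format(func=func_name, cls=cls_name)


print(render_warning("minmax_scale"))
```

The rendered text still needs to be re-wrapped by hand, since longer names such as `quantile_transform` can push lines past the docstring line-length limit, and indented to match the docstring it is pasted into.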

Please indicate below which function(s) you want to work on, e.g. "I'm working on scale and robust_scale", so that others don't pick the same ones.

@scikit-learn/core-devs feel free to directly edit the warning message

Labels: Easy (well-defined and straightforward way to resolve), Sprint