Skip to content

Add an option to OrdinalEncoder to sort encoding by decreasing frequencies #32161

@ogrisel

Description

@ogrisel

Describe the workflow you want to enable

At the moment, when no categories are provided, the default ordering of categories for a given categorical column is based on the lexicographical ordering of the categories observed in the training set.

However, this is quite arbitrary and one could instead provide an option to encode such values based on their observed frequency in the training set.

The motivation would be to nudge the inductive bias of tree-based models (and models that favor axis aligned decision functions) into separating nominal from rare values more easily (e.g. fewer splits in a tree). This might be especially useful for outlier detection models such as IsolationForest.

Related to #15796.

Describe your proposed solution

Extend the categories option to have:

  • categories="lexigraphical" (default for backward compat);
  • categories=user_provided_list (same as today);
  • categories="frequency" (the new option).

If categories="frequency", then the generated category encodings would be:

  • 0 would encode the most frequent category observed in the training set,
  • 1 the second most frequent,
  • ...
  • and so on until the least frequent categories.

Tie breaking could be based on the lexicographical order to ensure that the behavior of OrdinalEncoder remains invariant under a shuffling of the rows of the training set.

EDIT: we could name this option "decreasing_frequency" instead of "frequency" to be more explicit that values closest to 0 mean higher frequency categories. And this would allow to also add "increasing_frequency" if we want, but I am not sure if it would be needed/useful to have both options.

Describe alternatives you've considered, if relevant

  • Use TargetEncoder that leverages some frequency info (mixed with the mean value of the target variable), but this is a supervised method and would therefore not be suitable for unsupervised anomaly detection tasks, for instance.

  • Introduce a new dedicated class, e.g. FrequencyEncoder.

    • pro: allow using the observed relative frequency (between 0 and 1) as value (but would collapse equally frequent categories into the same numerical value).
    • con: introduces yet another estimator class in our public API.

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions