Add an option to OrdinalEncoder to sort encoding by decreasing frequencies

### Describe the workflow you want to enable

At the moment, when no categories are provided, the default ordering of categories for a given categorical column is based on the lexicographical ordering of the categories observed in the training set.

However, this is quite arbitrary and one could instead provide an option to encode such values based on their observed frequency in the training set.

The motivation would be to nudge the inductive bias of tree-based models (and models that favor axis aligned decision functions) into separating nominal from rare values more easily (e.g. fewer splits in a tree). This might be especially useful for outlier detection models such as `IsolationForest`.

Related to #15796.

### Describe your proposed solution


Extend the `categories` option to have:

- `categories="lexigraphical"` (default for backward compat);
- `categories=user_provided_list` (same as today);
- `categories="frequency"` (the new option).

If `categories="frequency"`, then the generated category encodings would be:

- 0 would encode the most frequent category observed in the training set,
- 1 the second most frequent,
- ...
- and so on until the least frequent categories.

Tie breaking could be based on the lexicographical order to ensure that the behavior of `OrdinalEncoder` remains invariant under a shuffling of the rows of the training set.

EDIT: we could name this option `"decreasing_frequency"` instead of `"frequency"` to be more explicit that values closest to 0 mean higher frequency categories. And this would allow to also add `"increasing_frequency"` if we want, but I am not sure if it would be needed/useful to have both options.

### Describe alternatives you've considered, if relevant

- Use `TargetEncoder` that leverages some frequency info (mixed with the mean value of the target variable), but this is a supervised method and would therefore not be suitable for unsupervised anomaly detection tasks, for instance.

- Introduce a new dedicated class, e.g. `FrequencyEncoder`.
  - pro: allow using the observed relative frequency (between 0 and 1) as value (but would collapse equally frequent categories into the same numerical value).
  - con: introduces yet another estimator class in our public API.

### Additional context

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add an option to OrdinalEncoder to sort encoding by decreasing frequencies #32161

Describe the workflow you want to enable

Describe your proposed solution

Describe alternatives you've considered, if relevant

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Add an option to OrdinalEncoder to sort encoding by decreasing frequencies #32161

Description

Describe the workflow you want to enable

Describe your proposed solution

Describe alternatives you've considered, if relevant

Additional context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions