
[ENH] umbrella issue - categorical feature support #6109

@fkiraly

Description

Issue to collect material and input on adding categorical feature support, and to summarize discussion from various locations. For discussion.

Aim

Support for categorical features has been requested a number of times, both in classification/regression (features) and in forecasting (endogenous and exogenous data).

Examples, which also have some relevant discussion:

#5867
#5943
#3976
#4848
#4776

Basic requirements

What does this mean:

  • classifiers/regressors and forecasters and transformations can in-principle accept categorical features
    • not all estimators will support this out of the box, so I anticipate a tag is needed to distinguish those that can, from those that cannot
    • this may be complicated by sklearn transformations that are not fully inspectable as to this property (afaik)
  • major compositions such as pipelines, reduction, ensembles also deal with categorical variables and the new tags correctly
  • at least two estimators of each type are implemented, for testing and demonstration
    • and there are worked examples with composites and transformations
  • tests are added for the above
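
To make the tag idea above concrete, here is a minimal sketch of tag-based gating of categorical input. The tag name `"capability:categorical"` and the `MockEstimator` class are hypothetical illustrations, not sktime's actual API; the real tag name and error text would be fixed in the enhancement proposal.

```python
# Sketch: tag-based gating of categorical features.
# The tag name "capability:categorical" is a hypothetical placeholder.
import pandas as pd


class MockEstimator:
    """Minimal stand-in for an sktime-like estimator with tags."""

    _tags = {"capability:categorical": False}

    def get_tag(self, name):
        return self._tags.get(name)

    def fit(self, X, y=None):
        # detect categorical or object-dtype columns in the input
        has_categorical = any(
            isinstance(dt, pd.CategoricalDtype) or dt == object
            for dt in X.dtypes
        )
        if has_categorical and not self.get_tag("capability:categorical"):
            raise TypeError(
                "X contains categorical columns, but this estimator does "
                "not support categorical features "
                "(tag 'capability:categorical' is False)"
            )
        return self


X = pd.DataFrame({"num": [1.0, 2.0], "cat": pd.Categorical(["a", "b"])})
try:
    MockEstimator().fit(X)
except TypeError as e:
    print("raised:", e)
```

Composites would then derive their own tag value from their components, analogous to how other capability tags are propagated.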

architectural issues, open design questions

Imo, the architecture needs to be built "from the ground up" to avoid breaking the existing base framework. That is, starting not with the estimators, but with a rework of the base framework first, followed by a rework of the estimators.

  • currently, checks and conversions (localized in the datatypes module) do not allow categorical variables through. This is because object dtype in time series containers was not allowed, and not all time series containers support categorical dtypes
    • PR [ENH] Allow object dtype in series #5886 allows object dtype for the time series containers. While it works, it cannot be merged yet, due to missing tests and the lack of an informative error message when users pass categorical features.
    • design idea: using parts of the __dataframe__ interchange protocol to determine variable type at input checks - [ENH] categorical feature support: input checking - column type encoding by the __dataframe__ protocol #6470
    • open design question: it is not entirely clear how to handle the distinction between containers that can represent categorical data (or mixed categorical/numerical) and those that cannot. Note that a 0/1 encoding in numpy must be treated as numerical, since numpy makes no intrinsic distinction between numerical and categorical values, and no extrinsic encoding of that distinction is available
    • open design question: how should conversions be handled between data types that can represent categoricals and those that cannot? E.g., always apply a one-hot encoding when converting to numpy? This might substantially alter the logic in the converters as currently implemented. Or, assume this case never occurs?
    • related refactor: [ENH] draft design for refactoring datatypes module - classes #6033, this could carry further tags for datatypes with "categorical capability", similar to estimators?
  • tests need to be expanded to handle categorical variables. This should at least include systematic, scenario based testing with categoricals: [ENH] forecasting test scenarios for categorical variables #5964
    • one would expect the test to check that the right error message is raised if the "categorical" tag is false, and otherwise that all tests run through.
  • open design question: in forecasters, do we need to distinguish categoricals in endogenous vs exogenous data? E.g., by tag, or by how we handle them?
  • what could be helpful are configs to turn off input and/or output checks in classifiers and forecasters, similar to the corresponding set_config fields in transformations. This might allow easy bulk testing on whether estimators already support categorical features out-of-the-box, e.g., if internally only compatible pandas operations are used.
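
On the conversion question above, the "always one-hot when converting to numpy" option could look roughly like the following. This is purely illustrative, assuming pandas as the source container; the function name is hypothetical and is not the datatypes module's actual converter API.

```python
# Sketch: converting to a container that cannot represent categoricals
# (numpy) by one-hot encoding categorical columns on the way.
# Hypothetical helper, not the actual datatypes converter API.
import numpy as np
import pandas as pd


def to_numpy_with_onehot(df: pd.DataFrame) -> np.ndarray:
    """Convert a DataFrame to numpy, one-hot encoding categorical columns.

    Numeric columns pass through unchanged; categorical and object
    columns are expanded to 0/1 indicator columns via pd.get_dummies.
    """
    encoded = pd.get_dummies(df, dtype=float)
    return encoded.to_numpy()


df = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0], "color": pd.Categorical(["r", "g", "r"])}
)
arr = to_numpy_with_onehot(df)
print(arr.shape)  # -> (3, 3): column x plus indicators color_g, color_r
```

The catch, noted above, is that the resulting 0/1 columns are indistinguishable from genuinely numerical data, so a round-trip conversion back from numpy cannot recover the categorical typing.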

possible work plan

Some thoughts about work plan:

  • I think this is sufficiently complex that we need an enhancement proposal written up, covering the desired end state, example code, a design that solves the above issues, and a list of estimators (native and/or interfaced, preferably one example per learning task) to implement or extend
    • this should also include a detailed implementation plan
  • the first implementation item is likely the data container framework, with the interplay of tags, container specifications, and converters. The sequencing is not entirely clear; I would start with the existing PR and first add the tag (saying no estimator can support categoricals at first), then slowly switch on support while further iterations on the framework take place.
  • based on this, I'd add a few examples of forecasters, classifiers, transformations - preferably simple ones
  • next are pipelines, compositions. Reduction is important, though it is possibly the most unpleasant to extend.
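
The scenario-test expectation from the architecture section (error if the tag is false, pass if it is true) could be sketched as follows. The tag name and the two mock forecasters are hypothetical, standing in for real sktime estimators; the actual tag name and error type would come from the enhancement proposal.

```python
# Sketch: scenario check for categorical input, branching on a
# hypothetical "capability:categorical" tag.
import pandas as pd


class _Base:
    _tags = {"capability:categorical": False}

    def get_tag(self, name):
        return self._tags.get(name)


class NoCatForecaster(_Base):
    """Mock forecaster that rejects categorical input."""

    def fit(self, X):
        if any(isinstance(dt, pd.CategoricalDtype) for dt in X.dtypes):
            raise TypeError("categorical features not supported")
        return self


class CatForecaster(_Base):
    """Mock forecaster that accepts categorical input."""

    _tags = {"capability:categorical": True}

    def fit(self, X):
        return self


def check_categorical_scenario(estimator, X):
    """Assert behaviour matches the tag; return what happened."""
    supports = estimator.get_tag("capability:categorical")
    try:
        estimator.fit(X)
    except TypeError:
        assert not supports, "capable estimator raised on categorical input"
        return "raised"
    assert supports, "incapable estimator silently accepted categoricals"
    return "fitted"


X = pd.DataFrame({"cat": pd.Categorical(["a", "b"])})
print(check_categorical_scenario(NoCatForecaster(), X))  # -> raised
print(check_categorical_scenario(CatForecaster(), X))  # -> fitted
```

In the real test suite this check would run over scenario-generated categorical inputs for every estimator, per #5964.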

Metadata

Assignees

Labels

  • API design: API design & software architecture
  • enhancement: Adding new functionality
  • module:base-framework: BaseObject, registry, base framework
  • module:classification: classification module: time series classification
  • module:datatypes: datatypes module: data containers, checkers & converters
  • module:forecasting: forecasting module: forecasting, incl probabilistic and hierarchical forecasting
  • module:regression: regression module: time series regression

Status

In Progress
