
[ENH] umbrella issue - categorical feature support #6109

@fkiraly

Description

Issue to collect material and input on adding categorical feature support, and to summarize discussion from various locations. For discussion.

Aim

Support for categorical features has been requested a number of times, both in classification/regression (features) and in forecasting (endogenous and exogenous data).

Examples, which also have some relevant discussion:

#5867
#5943
#3976
#4848
#4776

Basic requirements

What does this mean:

  • classifiers/regressors and forecasters and transformations can in-principle accept categorical features
    • not all estimators will support this out of the box, so I anticipate a tag is needed to distinguish those that can, from those that cannot
    • this may be complicated by sklearn transformations that are not fully inspectable as to this property (afaik)
  • major compositions such as pipelines, reduction, ensembles also deal with categorical variables and the new tags correctly
  • at least two estimators of each type are implemented, for testing and demonstration
    • and there are worked examples with composites and transformations
  • tests are added for the above
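
To make the tag idea above concrete, here is a minimal sketch of tag-based gating of categorical input. The tag name `"capability:categorical"` and the `MockEstimator` class are hypothetical illustrations, not sktime's actual API; the real tag name and error text would be fixed in the enhancement proposal.

```python
# Sketch: tag-based gating of categorical features.
# The tag name "capability:categorical" is a hypothetical placeholder.
import pandas as pd


class MockEstimator:
    """Minimal stand-in for an sktime-like estimator with tags."""

    _tags = {"capability:categorical": False}

    def get_tag(self, name):
        return self._tags.get(name)

    def fit(self, X, y=None):
        # detect categorical or object-dtype columns in the input
        has_categorical = any(
            isinstance(dt, pd.CategoricalDtype) or dt == object
            for dt in X.dtypes
        )
        if has_categorical and not self.get_tag("capability:categorical"):
            raise TypeError(
                "X contains categorical columns, but this estimator does "
                "not support categorical features "
                "(tag 'capability:categorical' is False)"
            )
        return self


X = pd.DataFrame({"num": [1.0, 2.0], "cat": pd.Categorical(["a", "b"])})
try:
    MockEstimator().fit(X)
except TypeError as e:
    print("raised:", e)
```

Composites would then derive their own tag value from their components, analogous to how other capability tags are propagated.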

architectural issues, open design questions

Imo, the architecture needs to be built "from the ground up" to avoid breaking the existing base framework. That is, starting not with the estimators, but with a rework of the base framework first, followed by a rework of the estimators.

  • currently, checks and conversions (localized in the datatypes module) do not allow categorical variables through. This is because object dtype in time series containers was not allowed, and not all time series containers support categorical dtypes
    • PR [ENH] Allow object dtype in series #5886 allows object dtype for the time series containers. While it works, it cannot be merged yet, due to missing tests and the lack of an informative error message when users pass categorical features.
    • design idea: using parts of the __dataframe__ interchange protocol to determine variable type at input checks - [ENH] categorical feature support: input checking - column type encoding by the __dataframe__ protocol #6470
    • open design question: it is not entirely clear how to handle the distinction between containers that can represent categorical data (or mixed categorical/numerical) and those that cannot. Note that a 0/1 encoding in numpy must be treated as numerical, since numpy makes no intrinsic distinction between numerical and categorical values, and no extrinsic encoding of that distinction is available
    • open design question: how should conversions be handled between data types that can represent categoricals and those that cannot? E.g., always apply a one-hot encoding when converting to numpy? This might substantially alter the logic in the converters as currently implemented. Or, assume this case never occurs?
    • related refactor: [ENH] draft design for refactoring datatypes module - classes #6033, this could carry further tags for datatypes with "categorical capability", similar to estimators?
  • tests need to be expanded to handle categorical variables. This should at least include systematic, scenario based testing with categoricals: [ENH] forecasting test scenarios for categorical variables #5964
    • one would expect the test to check that the right error message is raised if the "categorical" tag is false, and otherwise that all tests run through.
  • open design question: in forecasters, do we need to distinguish categoricals in endogenous vs exogenous data? E.g., by tag, or by how we handle them?
  • what could be helpful are configs to turn off input and/or output checks in classifiers and forecasters, similar to the corresponding set_config fields in transformations. This might allow easy bulk testing on whether estimators already support categorical features out-of-the-box, e.g., if internally only compatible pandas operations are used.
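
On the conversion question above, the "always one-hot when converting to numpy" option could look roughly like the following. This is purely illustrative, assuming pandas as the source container; the function name is hypothetical and is not the datatypes module's actual converter API.

```python
# Sketch: converting to a container that cannot represent categoricals
# (numpy) by one-hot encoding categorical columns on the way.
# Hypothetical helper, not the actual datatypes converter API.
import numpy as np
import pandas as pd


def to_numpy_with_onehot(df: pd.DataFrame) -> np.ndarray:
    """Convert a DataFrame to numpy, one-hot encoding categorical columns.

    Numeric columns pass through unchanged; categorical and object
    columns are expanded to 0/1 indicator columns via pd.get_dummies.
    """
    encoded = pd.get_dummies(df, dtype=float)
    return encoded.to_numpy()


df = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0], "color": pd.Categorical(["r", "g", "r"])}
)
arr = to_numpy_with_onehot(df)
print(arr.shape)  # -> (3, 3): column x plus indicators color_g, color_r
```

The catch, noted above, is that the resulting 0/1 columns are indistinguishable from genuinely numerical data, so a round-trip conversion back from numpy cannot recover the categorical typing.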

possible work plan

Some thoughts about work plan:

  • I think this is sufficiently complex that we need an enhancement proposal written up, covering the desired end state, example code, a design that solves the above issues, and a list of estimators (native and/or interfaced, preferably one example per learning task) to implement or extend
    • this should also include a detailed implementation plan
  • the first implementation item is likely the data container framework, with the interplay of tags, container specifications, and converters. The sequencing is not entirely clear; I would start with the existing PR and first add the tag (saying no estimator can support categoricals at first), then slowly switch on support while further iterations on the framework take place.
  • based on this, I'd add a few examples of forecasters, classifiers, transformations - preferably simple ones
  • next are pipelines, compositions. Reduction is important, though it is possibly the most unpleasant to extend.
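
The scenario-test expectation from the architecture section (error if the tag is false, pass if it is true) could be sketched as follows. The tag name and the two mock forecasters are hypothetical, standing in for real sktime estimators; the actual tag name and error type would come from the enhancement proposal.

```python
# Sketch: scenario check for categorical input, branching on a
# hypothetical "capability:categorical" tag.
import pandas as pd


class _Base:
    _tags = {"capability:categorical": False}

    def get_tag(self, name):
        return self._tags.get(name)


class NoCatForecaster(_Base):
    """Mock forecaster that rejects categorical input."""

    def fit(self, X):
        if any(isinstance(dt, pd.CategoricalDtype) for dt in X.dtypes):
            raise TypeError("categorical features not supported")
        return self


class CatForecaster(_Base):
    """Mock forecaster that accepts categorical input."""

    _tags = {"capability:categorical": True}

    def fit(self, X):
        return self


def check_categorical_scenario(estimator, X):
    """Assert behaviour matches the tag; return what happened."""
    supports = estimator.get_tag("capability:categorical")
    try:
        estimator.fit(X)
    except TypeError:
        assert not supports, "capable estimator raised on categorical input"
        return "raised"
    assert supports, "incapable estimator silently accepted categoricals"
    return "fitted"


X = pd.DataFrame({"cat": pd.Categorical(["a", "b"])})
print(check_categorical_scenario(NoCatForecaster(), X))  # -> raised
print(check_categorical_scenario(CatForecaster(), X))  # -> fitted
```

In the real test suite this check would run over scenario-generated categorical inputs for every estimator, per #5964.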

Metadata

Assignees

Labels

  • API design: API design & software architecture
  • enhancement: Adding new functionality
  • module:base-framework: BaseObject, registry, base framework
  • module:classification: classification module: time series classification
  • module:datatypes: datatypes module: data containers, checkers & converters
  • module:forecasting: forecasting module: forecasting, incl probabilistic and hierarchical forecasting
  • module:regression: regression module: time series regression

Status

In Progress
