Issue to collect material and input on adding categorical feature support, and summarizing discussion from various locations. For discussion.
Aim
It has been requested a number of times to add support for categorical features in classification/regression (features) and forecasting (endogenous and exogenous data).
Examples, which also have some relevant discussion: #5867, #5943, #3976, #4848, #4776
Basic requirements
What does this mean:
classifiers/regressors, forecasters, and transformations can in principle accept categorical features
not all estimators will support this out of the box, so I anticipate a tag is needed to distinguish those that can, from those that cannot
this may be complicated by sklearn transformations that are not fully inspectable as to this property (afaik)
major compositions such as pipelines, reduction, ensembles also deal with categorical variables and the new tags correctly
at least two estimators of each type are implemented, for testing and demonstration
and there are worked examples with composites and transformations
tests are added for the above
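A minimal sketch of how such a tag could gate inputs, for discussion only. The tag name `capability:categorical`, the class `BaseEstimatorSketch`, and its methods are hypothetical and are not the existing base framework; only the `pandas` calls are real.

```python
import pandas as pd


class BaseEstimatorSketch:
    """Hypothetical base class: carries tags and checks input in fit."""

    # assumed default: no categorical support, individual estimators opt in
    _tags = {"capability:categorical": False}

    def get_tag(self, name):
        return self._tags[name]

    def fit(self, X):
        # treat every non-numeric column as categorical for this sketch
        cat_cols = [
            col for col in X.columns
            if not pd.api.types.is_numeric_dtype(X[col])
        ]
        if cat_cols and not self.get_tag("capability:categorical"):
            raise TypeError(
                f"{type(self).__name__} does not support categorical "
                f"features, but found non-numeric columns: {cat_cols}"
            )
        self.categorical_columns_ = cat_cols
        return self


class NumericOnly(BaseEstimatorSketch):
    """Inherits the default tag, so categorical input is rejected."""


class HandlesCategorical(BaseEstimatorSketch):
    """Opts in to categorical input via the tag."""

    _tags = {"capability:categorical": True}


X = pd.DataFrame(
    {
        "temp": [20.5, 21.0, 19.8],
        "weekday": pd.Categorical(["Mon", "Tue", "Mon"]),
    }
)

fitted = HandlesCategorical().fit(X)  # runs through
try:
    NumericOnly().fit(X)
except TypeError as err:
    message = str(err)  # informative error naming the offending columns
```

A test along the lines of the requirements above would then assert that the error is raised when the tag is false, and that fitting runs through otherwise.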
architectural issues, open design questions
Imo, architecture needs to be built "from the ground up" to avoid breaking the existing base framework. That is, starting not with estimators and base framework deactivation, but base framework rework and then estimator rework.
currently, checks and conversions (localized in the datatypes module) do not allow categorical variables through. This is because object dtype in time series containers was not allowed, and not all time series containers support categorical type
PR [ENH] Allow object dtype in series #5886 allows object dtype for the time series containers. While it works, it cannot be merged yet, due to lack of testing and/or informative error message when users pass categorical features.
using the __dataframe__ interchange protocol to determine variable type at input checks: [ENH] categorical feature support: input checking - column type encoding by the __dataframe__ protocol #6470
open design question: it is not entirely clear how the distinction between containers that can represent categorical data (or mixed categorical/numerical) and those that cannot should be handled. Noting that 0/1 encoding in numpy must be treated as numerical since there is no intrinsic distinction between the two, and no extrinsic coding is available
open design question: how should conversions between data types be handled, between those that can or cannot? E.g., always apply a one-hot encoding if converting to numpy? This might substantially alter the logic in the converters as current. Or, assume this case never occurs?
related: datatypes module - classes #6033, this could carry further tags for datatypes with "categorical capability", similar to estimators?
one would expect that the test should check that the right error message is raised if the "categorical" tag is false, otherwise that all tests run through.
open design question: in forecasters, do we need to distinguish categoricals in endogenous vs exogenous data? E.g., by tag, or by how we handle them?
what could be helpful are configs to turn off input and/or output checks in classifiers and forecasters, similar to the corresponding set_config fields in transformations. This might allow easy bulk testing on whether estimators already support categorical features out-of-the-box, e.g., if internally only compatible pandas operations are used.
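To make the container questions above concrete, here is a small illustration using plain `pandas` and `numpy`. The column names and data are made up for the example, and one-hot encoding on conversion is only one of the options raised above, not a decided design.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(
    {
        "temp": [20.5, 21.0, 19.8],
        "is_holiday": [0, 1, 0],  # genuinely numeric 0/1 column
        "weekday": pd.Categorical(["Mon", "Tue", "Mon"]),
    }
)

# in pandas, the dtype records which columns are categorical
is_cat = {
    col: isinstance(X[col].dtype, pd.CategoricalDtype) for col in X.columns
}

# naive conversion: the mixed frame collapses to a single object array
X_obj = X.to_numpy()

# one-hot encoding before conversion gives a purely numeric array
X_onehot = pd.get_dummies(X, columns=["weekday"], dtype=float)
X_num = X_onehot.to_numpy()

# X_num has float dtype throughout: nothing in the array itself marks the
# one-hot columns as encoded categoricals, as opposed to "is_holiday"
```

After the conversion, the information in `is_cat` exists only outside the array, which is the "no intrinsic distinction, no extrinsic coding" problem for `numpy` based containers.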
possible work plan
Some thoughts about work plan:
I think this is sufficiently complex that we need an enhancement proposal written up, for the desired end state, example code, a design that solves the above issues, and a list of estimators (native and/or interfaced, preferably one example of each per learning task) to implement or extend
this should also include a detailed implementation plan
the first implementation item is likely the data container framework, with an interplay of tags, container specifications, converters. The sequencing is not entirely clear, I would start with the existing PR and first add the tag (saying no estimator can support categorical at first), then slowly switch on support while further iterations on the framework take place.
based on this, I'd add a few examples of forecasters, classifiers, transformations - preferably simple ones
next are pipelines, compositions. Reduction is important, though it is possibly the most unpleasant to extend.
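For the "slowly switch on support" step, a bulk probe could help decide which estimators may get the tag set to true. The sketch below is an assumption about how that might look: `probe_categorical_support` and the two toy estimators are hypothetical, and in practice the probe would only be meaningful once input checks can be turned off or let categorical data through.

```python
import pandas as pd


def probe_categorical_support(estimators, X):
    """Try fitting each estimator on X, record whether it ran through."""
    report = {}
    for est in estimators:
        try:
            est.fit(X)
            report[type(est).__name__] = True
        except Exception:
            report[type(est).__name__] = False
    return report


class NeedsFloatArray:
    """Toy estimator that coerces to a float array, fails on strings."""

    def fit(self, X):
        self.X_ = X.to_numpy(dtype=float)
        return self


class CountUnique:
    """Toy estimator using only operations that work for any dtype."""

    def fit(self, X):
        self.n_unique_ = X.nunique()
        return self


X = pd.DataFrame(
    {
        "temp": [20.5, 21.0, 19.8],
        "weekday": pd.Categorical(["Mon", "Tue", "Mon"]),
    }
)

report = probe_categorical_support([NeedsFloatArray(), CountUnique()], X)
```

Running through without an exception is only a necessary condition for support. Whether the categorical column is handled sensibly would still need a per-estimator check.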