[ENH] forecasting test scenarios for categorical variables #5964

Open
fkiraly wants to merge 9 commits into main from object-dtype-test

Conversation

@fkiraly
Collaborator

@fkiraly fkiraly commented Feb 19, 2024

This PR adds two test scenarios for use of categorical variables in forecasting:

  • ForecasterFitPredictCategorical, with a categorical univariate y and no exogenous data
  • ForecasterFitPredictCategoricalExog, with a mixed categorical/float exogenous X, and a float univariate y
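
For concreteness, the kinds of inputs the two scenarios exercise might look like this (a minimal sketch with made-up data; the actual fixtures live in the sktime scenario classes):

```python
import pandas as pd

idx = pd.period_range("2000-01", periods=12, freq="M")

# Scenario 1 style: categorical univariate y, no exogenous data
y_cat = pd.Series(pd.Categorical(["a", "b", "c"] * 4), index=idx)

# Scenario 2 style: float univariate y with mixed categorical/float X
y_float = pd.Series([float(i) for i in range(12)], index=idx)
X_mixed = pd.DataFrame(
    {
        "cat_col": pd.Categorical(["x", "y"] * 6),
        "float_col": [0.5 * i for i in range(12)],
    },
    index=idx,
)
```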

Depends on #5886, which relaxes the pandas-based type checks to allow categorical and object data.

FYI @yarnabrina - this test should systematically run all forecaster test parameter settings with the given input.

@fkiraly fkiraly added module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting module:tests test framework functionality - only framework, excl specific tests enhancement Adding new functionality labels Feb 19, 2024
@yarnabrina
Member

I think you meant #5886 instead of #5962, so edited PR header accordingly. Please revert if that's not the case.

Member

I am completely unfamiliar with this module and do not know how these get tested, so maybe @benHeid / @achieveordie can review this one; I'm not eligible.

But I'll note that these changes probably allow categorical for classification/regression as well. So maybe that's also something you may want to test?

Collaborator Author

This module has scripted test scenarios with arguments to call fit and predict with; these are substituted into the scenarios argument of the individual tests (methods) of TestAllForecasters.

The separation and tagging are needed, since not all estimators match all scenarios.
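
A minimal sketch of that mechanism (all names here are illustrative stand-ins, not the actual sktime internals): a scenario object bundles the arguments that the test methods substitute into fit/predict calls.

```python
import pandas as pd

# Illustrative sketch only: "Scenario" and "NaiveLastValueForecaster"
# are made-up stand-ins, not actual sktime classes.
class Scenario:
    """Bundles the arguments a test uses to call fit and predict."""

    def __init__(self, fit_args, predict_args):
        self.fit_args = fit_args
        self.predict_args = predict_args

    def run(self, forecaster):
        forecaster.fit(**self.fit_args)
        return forecaster.predict(**self.predict_args)


class NaiveLastValueForecaster:
    """Toy forecaster: predicts the last observed value."""

    def fit(self, y, fh):
        self.last_ = y.iloc[-1]
        return self

    def predict(self, fh):
        return pd.Series([self.last_] * len(fh))


y = pd.Series([1.0, 2.0, 3.0])
scenario = Scenario(fit_args={"y": y, "fh": [1, 2]}, predict_args={"fh": [1, 2]})
preds = scenario.run(NaiveLastValueForecaster())
```

A test class would then be parametrized over a list of such scenarios, running each estimator against every scenario it is tagged as compatible with.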

Collaborator Author

But I'll note that these changes probably allow categorical for classification/regression as well. So maybe that's also something you may want to test?

Yes, I suppose so - but only in the input X. Though I would be fairly confident that if things work out well for forecasting, it will likely not be a problem in any other module. Of course, not with full certainty.

@fkiraly
Collaborator Author

fkiraly commented Feb 19, 2024

I think you meant #5886 instead of #5962, so edited PR header accordingly.

Yes, of course, my bad - thanks.

@fkiraly
Collaborator Author

fkiraly commented Feb 19, 2024

Hm, that's quite a lot of failures - should we deal with this via tags, @yarnabrina?
I'm not sure how to tag - perhaps one for input and one for output?

@yarnabrina
Member

Before I say anything, can you please help me understand what these failures mean?

I am guessing that the failures come from the fact that some (the majority, I guess, if not all) forecasters do not support categorical features directly and will need some encoding transformation earlier (label encoding can't be used, so one-hot?). Is that it, or is there something else as well?

@fkiraly
Collaborator Author

fkiraly commented Feb 19, 2024

I am guessing that the failures come from the fact that some (the majority, I guess, if not all) forecasters do not support categorical features directly and will need some encoding transformation earlier (label encoding can't be used, so one-hot?). Is that it, or is there something else as well?

I am guessing the same, but I'm also not sure.

My understanding of the failures:

  • most actually come from "within" the estimator. That is, nothing in the boilerplate layer prevents the data from being passed to _fit etc. anymore.
  • the failures look very distinct; they look specific to individual estimators

So yes, an earlier transformation could prevent the failures; the main issue for me is that it doesn't just fail, but fails with an uninformative error.

@yarnabrina
Member

We can try to create a new boolean tag, say support-categorical-features, and set it to False by default. Different subclasses, e.g. reduction ones, may override it as True, since XGBoost, CatBoost, etc. can accept categorical columns directly, though they may need special arguments specified during __init__. We can check the value of this tag in the base class and fail quickly if the requirement is not satisfied.

What I do not like: depending on previous transformations, any forecaster may actually be okay, because what is being passed at the end will already be numeric. I don't know if that can be detected or not. We can try to set this categorical support tag separately for the input/output of a transformer, but I don't know how that will work for non-sktime transformers (which are sklearn-compatible and used with TabularToSeriesAdaptor).
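
The fail-fast tag check proposed above could be sketched roughly as follows (the tag name support-categorical-features is from the comment; the class and method names are hypothetical, not sktime's actual base class API):

```python
import pandas as pd

class BaseForecasterSketch:
    # hypothetical default: categorical features unsupported
    _tags = {"support-categorical-features": False}

    def fit(self, y, X=None):
        self._check_categorical(X)
        # ... actual fitting would happen here ...
        return self

    def _check_categorical(self, X):
        """Raise an informative error early if X has unsupported dtypes."""
        if X is None:
            return
        if not self._tags["support-categorical-features"]:
            bad = [
                col
                for col, dt in X.dtypes.items()
                if isinstance(dt, pd.CategoricalDtype) or dt == object
            ]
            if bad:
                raise TypeError(
                    f"{type(self).__name__} does not support categorical "
                    f"features, but received columns {bad}; encode them "
                    "numerically first (e.g. one-hot)."
                )


class ReductionForecasterSketch(BaseForecasterSketch):
    # e.g. a reduction to XGBoost/CatBoost could override the tag
    _tags = {"support-categorical-features": True}
```

The point is that the error surfaces at the boilerplate layer with a clear message, rather than as an estimator-specific failure from deep inside _fit.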

@fkiraly
Collaborator Author

fkiraly commented Feb 20, 2024

What I do not like: depending on previous transformations, any forecaster may actually be okay, because what is being passed at the end will already be numeric.

Can you give an example?

@yarnabrina
Member

earlier transformation could prevent the failure

I meant this case, e.g. something like this from #5867 (comment):

pipeline = (ColumnSelect(columns=["Q"]) * OneHotEncoder()) ** (StandardScaler() * ARIMA())

(Plus the output conversion tag change that you suggested in this case.)
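
The encode-before-forecasting idea in that snippet can be illustrated with plain pandas (get_dummies here stands in for the OneHotEncoder step; the column names are made up):

```python
import pandas as pd

X = pd.DataFrame(
    {
        "Q": pd.Categorical(["low", "high", "low", "high"]),
        "temp": [20.5, 21.0, 19.8, 20.1],
    }
)

# One-hot encode the categorical column so the downstream forecaster
# only ever sees numeric data.
X_encoded = pd.get_dummies(X, columns=["Q"])
# X_encoded now has columns: temp, Q_high, Q_low
```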

@fkiraly
Collaborator Author

fkiraly commented Feb 20, 2024

yes, but that would work, no? The pipeline has the tag set to "supports", so nothing should break?

@yarnabrina
Member

yes, but that would work, no? The pipeline has the tag set to "supports", so nothing should break?

I'm confused - how will the pipeline have the "categorical support" tag as True?

By default, ARIMA has False. We can only change that value for the pipeline to True if there is a transformation on all categorical column(s) to make them numerical. How shall we detect this?

  1. There's at least one transformation in the pipeline that can take categorical columns and convert them to numerical (my doubt: we can't know the behaviour of OneHotEncoder etc. with tags)
  2. Such transformations are applied on all categorical columns (my doubt: do we store each column, its type, and which columns are being transformed?)

@fkiraly
Collaborator Author

fkiraly commented Feb 21, 2024

I'm confused - how will the pipeline have the "categorical support" tag as True?

We'd just set it to True regardless, in the class.

So it would be wrong in some instances, e.g., sklearn components where we cannot detect support, but the user would still receive an informative error message, namely where the data first hits an sktime input check.
