[BUG] Can't combine make_reduction with HistGradientBoostingRegressor.categorical_features #4776

@davidgilbertson

Describe the bug

scikit-learn's HistGradientBoostingRegressor accepts a categorical_features argument that tells it to treat certain columns (here, columns in my exogenous data) as categorical. However, make_reduction changes the columns, so every way of specifying categorical features fails. Specifically:

  • Providing a list of column names: fails because a NumPy array, not the original DataFrame, is passed to .fit()
  • Providing a boolean mask: fails because the mask must match the number of columns, which is different after make_reduction has transformed the data
  • Providing the column indexes: fails (sometimes silently) because the order of the columns changes
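
To make the three failure modes concrete, here is a minimal sketch of what a window-based reduction might do to the columns. It does not use sktime's code: `tabulate_windows` and `remap_positions` are hypothetical helpers, and the layout (lags of y first, then the lags of each X column) is an assumption for illustration only.

```python
import numpy as np
import pandas as pd


def tabulate_windows(y, X, window_length):
    """Hypothetical stand-in for a reduction step (not sktime's code).

    Stacks y and the columns of X, then lays out each variable's
    `window_length` consecutive values side by side in one flat NumPy row.
    """
    Z = np.column_stack([y.to_numpy(), X.to_numpy()])  # y first, then X
    n_rows = len(Z) - window_length
    # windows[i] has shape (window_length, n_variables)
    windows = np.stack([Z[i : i + window_length] for i in range(n_rows)])
    # assumed variable-major layout: all lags of y, then all lags of X[:, 0], ...
    return windows.transpose(0, 2, 1).reshape(n_rows, -1)


def remap_positions(original_positions, window_length):
    """Map column positions in X to positions in the assumed flat layout."""
    # +1 skips the block of y lags that precedes the X blocks
    return [
        int((pos + 1) * window_length + lag)
        for pos in original_positions
        for lag in range(window_length)
    ]


X = pd.DataFrame({"GNP": [1.0, 2.0, 3.0, 4.0, 5.0], "Year": [47, 48, 49, 50, 51]})
y = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

Xt = tabulate_windows(y, X, window_length=2)
year_pos = X.columns.get_indexer(["Year"])  # position 1 in the original frame
year_pos_flat = remap_positions(year_pos, window_length=2)

print(Xt.shape)  # wider than X, so a 2-element boolean mask no longer fits
print(Xt[:, year_pos[0]])  # under this layout, a lag of y rather than "Year"
print(Xt[:, year_pos_flat])  # the two lagged "Year" columns
```

Under this assumed layout, position 1 no longer refers to "Year", yet it is still a valid index, which is how the index-based option can fail without raising an error.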

To Reproduce

from sktime.datasets import load_longley
from sktime.forecasting.compose import make_reduction
from sktime.forecasting.model_selection import temporal_train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor

y, X = load_longley()
X["Year"] = X.index.year - 1900 # bad example, just to get a value that can be considered categorical
y_trn, y_tst, X_trn, X_tst = temporal_train_test_split(y, X, test_size=5)

forecaster = make_reduction(
    HistGradientBoostingRegressor(
        # categorical_features=["Year"],  # Strings
        # categorical_features=X.columns.isin(["Year"]),  # Boolean mask
        categorical_features=X.columns.get_indexer(["Year"]),  # Column positions
    ),
    window_length=2,
)

forecaster.fit(
    fh=[1, 2, 3],
    y=y_trn,
    X=X_trn,
)

Expected behavior

In a dream world, the column names I define in my DataFrame would still be there when X is eventually passed to HistGradientBoostingRegressor.fit.

Two solutions I can think of (with no regard for complexity or feasibility):

  1. sktime only appends columns (including y) to the end of the DataFrame that the user provides as X, so that the original column indexes at least stay the same.
  2. sktime keeps the dataframe and column names, only adding new columns.
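
As a rough illustration of what the two proposals would mean for the user, the sketch below keeps the DataFrame and appends lagged columns at the end. `add_lag_columns` is a hypothetical helper written for this issue, not an existing sktime function, and the lag column names are made up.

```python
import pandas as pd


def add_lag_columns(y, X, window_length):
    """Hypothetical helper: keep X's columns in place, append lags at the end."""
    out = X.copy()
    for lag in range(1, window_length + 1):
        out[f"y_lag_{lag}"] = y.shift(lag)
        for col in X.columns:
            out[f"{col}_lag_{lag}"] = X[col].shift(lag)
    # the first `window_length` rows have incomplete windows, so drop them
    return out.iloc[window_length:]


X = pd.DataFrame({"GNP": [1.0, 2.0, 3.0, 4.0, 5.0], "Year": [47, 48, 49, 50, 51]})
y = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

Xt = add_lag_columns(y, X, window_length=2)

print(list(Xt.columns))
print(Xt.columns.get_indexer(["Year"]))  # same position as in the original X
```

With this shape of output, both the column name "Year" and its original position would remain valid when the data reaches the regressor.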

Option 2 would be ideal, as it solves other problems too. For example, XGBoost has the (experimental) ability to handle pandas categorical columns (enable_categorical), but as far as I can tell this wouldn't work with sktime, because that information is lost in the conversion to NumPy.
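
The loss of information can be seen with pandas alone. This is a minimal sketch with made-up data; it only shows the generic DataFrame-to-NumPy conversion and makes no claim about where exactly sktime performs it.

```python
import pandas as pd

X = pd.DataFrame(
    {
        "GNP": [1.0, 2.0, 3.0],
        "Year": pd.Categorical([47, 48, 47]),  # pandas categorical column
    }
)

arr = X.to_numpy()

print(X.dtypes["Year"])  # the categorical dtype is visible on the DataFrame
print(type(arr), arr.dtype)  # a plain ndarray with a single NumPy dtype
print(hasattr(arr, "columns"))  # the column names are gone as well
```

Anything downstream that relies on the categorical dtype or on column names has nothing to work with once only `arr` is passed on.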

Additional context

Versions

System:
python: 3.10.8 (main, Oct 12 2022, 19:14:26) [GCC 9.4.0]
executable: /home/davidg/.virtualenvs/learning/bin/python
machine: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python dependencies:
pip: 23.1.2
sktime: 0.19.1
sklearn: 1.2.2
numpy: 1.24.3
scipy: 1.10.1
pandas: 2.0.1
matplotlib: 3.7.0
joblib: 1.2.0
statsmodels: 0.13.5
numba: None
pmdarima: 2.0.3
tsfresh: None
tensorflow: None
tensorflow_probability: None

Metadata

Labels: bug, module:forecasting
Status: Needs triage & validation