**Describe the bug**
scikit-learn's `HistGradientBoostingRegressor` accepts a `categorical_features` argument, instructing it to treat certain columns (in my exogenous data) as categorical. But the `make_reduction` function changes the columns, so all of the options for specifying categorical features fail. To be specific:

- Providing a list of column names fails because a NumPy array is passed to `.fit()`, not the original DataFrame.
- Providing a boolean mask fails because it must match the number of columns, which is different after `make_reduction` has done its magic.
- Providing the indexes of the columns fails (sometimes silently) because the order of the columns changes.
**To Reproduce**
```python
from sktime.datasets import load_longley
from sktime.forecasting.compose import make_reduction
from sktime.forecasting.model_selection import temporal_train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor

y, X = load_longley()
X["Year"] = X.index.year - 1900  # bad example, just to get a value that can be considered categorical

y_trn, y_tst, X_trn, X_tst = temporal_train_test_split(y, X, test_size=5)

forecaster = make_reduction(
    HistGradientBoostingRegressor(
        # categorical_features=["Year"],  # Strings
        # categorical_features=X.columns.isin(["Year"]),  # Boolean mask
        categorical_features=X.columns.get_indexer(["Year"]),  # Column positions
    ),
    window_length=2,
)
forecaster.fit(
    fh=[1, 2, 3],
    y=y_trn,
    X=X_trn,
)
```
**Expected behavior**
In a dream world, the column names I define in my DataFrame would still be there when `X` is eventually passed to `HistGradientBoostingRegressor.fit`.

Two solutions I can think of (with no regard for complexity or feasibility):

1. sktime only adds columns to the end of the DataFrame that the user provides as `X` (including adding `y`), so that we can at least rely on the column indexes remaining the same.
2. sktime keeps the DataFrame and its column names, only adding new columns.

Option 2 would be ideal, as it solves other problems too. For example, XGBoost has the (experimental) ability to deal with pandas categorical columns (`enable_categorical`), but as far as I can tell this wouldn't work with sktime, because that information is lost in the conversion to NumPy.
**Additional context**
**Versions**
```
System:
    python: 3.10.8 (main, Oct 12 2022, 19:14:26) [GCC 9.4.0]
executable: /home/davidg/.virtualenvs/learning/bin/python
   machine: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.31

Python dependencies:
          pip: 23.1.2
       sktime: 0.19.1
      sklearn: 1.2.2
        numpy: 1.24.3
        scipy: 1.10.1
       pandas: 2.0.1
   matplotlib: 3.7.0
       joblib: 1.2.0
  statsmodels: 0.13.5
        numba: None
     pmdarima: 2.0.3
      tsfresh: None
   tensorflow: None
tensorflow_probability: None
```