Silence UserWarning in ColumnTransformer._hstack #365
Closed
Description
In [2]: paste
import pandas as pd
import sklearn.compose
import sklearn.preprocessing
from sklearn.base import clone
import dask.dataframe as dd
import dask_ml.compose
import dask_ml.preprocessing
df = pd.DataFrame({"A": pd.Categorical(["a", "a", "b", "a"]), "B": [1.0, 2, 4, 5]})
ddf = dd.from_pandas(df, npartitions=2).reset_index(drop=True)
b = dask_ml.compose.make_column_transformer(
    (["A"], dask_ml.preprocessing.OneHotEncoder(sparse=False)),
    (["B"], dask_ml.preprocessing.StandardScaler()),
)
## -- End pasted text --
In [3]: b.fit_transform(ddf).compute()
/Users/taugspurger/sandbox/dask/dask/dataframe/multi.py:608: UserWarning: Concatenating dataframes with unknown divisions.
We're assuming that the indexes of each dataframes are
aligned. This assumption is not generally safe.
warnings.warn("Concatenating dataframes with unknown divisions.\n"
Out[3]:
   A_a  A_b         B
0  1.0  0.0 -1.264911
1  1.0  0.0 -0.632456
0  0.0  1.0  0.632456
1  1.0  0.0  1.264911

The data are coming from the same source, so we just need to rely on the fact that individual estimators in the pipeline don't change the n_samples in a transform. This is usually true (it's true everywhere in scikit-learn AFAIK) but there are some contrib packages that change the number of samples. Regardless, things may still be OK, as long as the partitions don't change...
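One possible way to silence this, sketched with only the standard library: wrap the concatenation step in `warnings.catch_warnings()` and ignore just this one message. The names `hstack_quietly` and `fake_concat` are hypothetical and used only for illustration; `fake_concat` is a stand-in that emits the same warning text as above, not dask's actual concat, and this is not necessarily how `ColumnTransformer._hstack` ends up handling it.

```python
import warnings

# Start of the warning text from the traceback above; filterwarnings treats
# `message` as a regex matched against the beginning of the warning message.
UNKNOWN_DIVISIONS = "Concatenating dataframes with unknown divisions"


def fake_concat(parts):
    """Hypothetical stand-in for a column-wise concat that emits the warning."""
    warnings.warn(
        "Concatenating dataframes with unknown divisions.\n"
        "We're assuming that the indexes of each dataframes are\n"
        "aligned. This assumption is not generally safe.",
        UserWarning,
    )
    # Join the i-th row of every part into a single wider row.
    return [tuple(x for row in rows for x in row) for rows in zip(*parts)]


def hstack_quietly(concat, parts):
    """Call `concat(parts)` while ignoring only the unknown-divisions warning."""
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore", message=UNKNOWN_DIVISIONS, category=UserWarning
        )
        return concat(parts)


# Two "transformer outputs" with the same number of rows.
encoded = [(1.0, 0.0), (0.0, 1.0)]
scaled = [(-1.0,), (1.0,)]
stacked = hstack_quietly(fake_concat, [encoded, scaled])
```

Scoping the filter to the message and category keeps any other `UserWarning` raised during the concat visible, and the context manager restores the previous filters on exit, so callers outside the transformer still see the warning.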