[WIP] CLN Encoder refactor #14972
Conversation
Does it mean that the user would get one warning for each column?

This refactor was done in a way to have the caller do the warning handling. The pattern is to change an attribute in …

The public interface is exactly the same as we have on master. On a side note, I considered having the …

I am hoping this makes it much easier to implement encoders, or at least try out different encoders. Most of …
glemaitre left a comment:
Do we get any regression regarding performance?
sklearn/preprocessing/_encoders.py (Outdated)

```diff
 needs_validation = True

-n_samples, n_features = X.shape
+n_features = X.shape[1]
```
sklearn/preprocessing/_encoders.py (Outdated)

```python
n_features = X.shape[1]
X_columns = []

for i in range(n_features):
```
sklearn/preprocessing/_encoders.py (Outdated)

```diff
-X_mask = np.ones((n_samples, n_features), dtype=np.bool)
+def _fit_list(self, X_list, encoders):
+    """Fit encoders on X_list"""
+    assert len(X_list) == len(encoders)
```
I would add a message in case we raise the AssertionError; a bare assertion will not be very friendly.
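A minimal sketch of what such a message could look like (the wording here is just an illustration, not from the PR):

```python
def _fit_list(self, X_list, encoders):
    """Fit encoders on X_list"""
    # An explicit message makes the failure mode self-explanatory.
    assert len(X_list) == len(encoders), (
        "X_list and encoders should have the same length, got {} and {}"
        .format(len(X_list), len(encoders)))
```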
sklearn/preprocessing/_encoders.py (Outdated)

```python
for X_col, encoder in zip(X_list, encoders):
    encoder.fit(X_col)

# map from X_trans indices to indices from the original X
```
sklearn/preprocessing/_encoders.py (Outdated)

```diff
     "and the X has {} features."
-    .format(len(self.categories_,), n_features)
-)
+    .format(len(self.categories_,), n_features))
```
you can revert this change. Black style is fine :)
```python
categories = self._check_categories(n_features)
drop_kwargs = self._check_drop(n_features)

encoders = [
```
the pattern

```python
encoders = [...]
self._fit_list(...)
```

will be common to each encoder. Would it make sense to have an abstract method across encoders which should be defined in all children? Then _fit will just be input validation (which could also be an abstract method) plus the above pattern (+ additional things depending on the encoder):

```python
def _fit(self, X):
    X = self._validate_input(X)
    self._generate_and_fit_encoder(X)
    ...
```

But be aware that this is not clear to me :)
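For illustration, here is a hedged sketch of that template-method idea; the hook names `_validate_input` and `_generate_encoders` are hypothetical, not from the PR:

```python
from abc import ABC, abstractmethod

class _EncoderUnionSketch(ABC):
    """Illustration only: the base class owns the _fit skeleton."""

    def _fit(self, X):
        # Common steps live here once, instead of in every child class.
        X_list = self._validate_input(X)
        encoders = self._generate_encoders(X_list)
        for X_col, single_encoder in zip(X_list, encoders):
            single_encoder.fit(X_col)
        self._encoders = encoders

    @abstractmethod
    def _validate_input(self, X):
        """Return X as a list of feature columns."""

    @abstractmethod
    def _generate_encoders(self, X_list):
        """Return one single-feature encoder per column."""
```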
I am glad you noticed this! One of my designs had something similar, where _fit was an abstract method. It was even possible to get to a point where the child class only needed to define _fit, and _EncoderUnion could define fit, transform, and fit_transform. It works, but it felt a little too magical. (There were additional lines to overwrite the docstrings of transform and fit_transform for a specific child encoder. For example, OneHotEncoder.transform can output sparse matrices.)

The benefit of this PR is that OrdinalEncoder.fit explicitly defines the encoders and explicitly passes them up into the parent class to process.
NicolasHug left a comment:
Mostly nits about style for now (sorry)
sklearn/preprocessing/_encoders.py (Outdated)

```python
        return categories

    def _fit_list(self, X_list, encoders):
```
I think it would help a bit to call all single encoders `single_encoder` instead of `encoder`.

Also, `fit_list` isn't super descriptive. `fit_single_encoders` maybe?
sklearn/preprocessing/_encoders.py (Outdated)

```python
for X_col, encoder in zip(X_list, encoders):
    encoder.fit(X_col)

# map from X_trans indices to indices from the original X
```
Suggested change:

```diff
-# map from X_trans indices to indices from the original X
+# map X_trans indices to indices from the original X
```
Also, please indicate why this is needed.
sklearn/preprocessing/_encoders.py (Outdated)

```python
Xi = X_list[i]
diff, valid_mask = _encode_check_unknown(Xi, self.categories_[i],
                                         return_mask=True)
X_trs = []
```
sklearn/preprocessing/_encoders.py (Outdated)

```python
X_trans_to_orig_idx = []
X_trans_idx = 0
for encoder in encoders:
    n_feat_out = encoder.n_features_out_
```
sklearn/preprocessing/_encoders.py (Outdated)

```python
    X_trans_to_orig_idx.append((begin, end))
    X_trans_idx += n_feat_out

self._X_trans_to_orig_idx = X_trans_to_orig_idx
```
Isn't this the reverse mapping? From (0, n_features - 1) to (0, X_trans.shape[1])?
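To make the direction of the mapping concrete, here is a toy sketch of how those `(begin, end)` spans accumulate; the column counts are invented for illustration:

```python
# Suppose two single encoders output 3 and 2 columns respectively.
n_features_out = [3, 2]

X_trans_to_orig_idx = []
X_trans_idx = 0
for n_feat_out in n_features_out:
    begin, end = X_trans_idx, X_trans_idx + n_feat_out
    X_trans_to_orig_idx.append((begin, end))
    X_trans_idx += n_feat_out

# Entry i is the slice of X_trans columns produced by original feature i,
# i.e. a map from original feature index to X_trans columns.
print(X_trans_to_orig_idx)  # [(0, 3), (3, 5)]
```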
sklearn/preprocessing/_encoders.py (Outdated)

```diff
-"""
-Base class for encoders that includes the code to categorize and
+class _EncoderUnion(TransformerMixin, BaseEstimator):
+    """Base class for encoders that includes the code to categorize and
```
Suggested change:

```diff
-"""Base class for encoders that includes the code to categorize and
+"""Base class for encoders that includes the code to encode and
```
sklearn/preprocessing/_encoders.py (Outdated)

```diff
-Base class for encoders that includes the code to categorize and
+class _EncoderUnion(TransformerMixin, BaseEstimator):
+    """Base class for encoders that includes the code to categorize and
+    transform the input features.
```
sklearn/preprocessing/_encoders.py (Outdated)

```diff
-        return X_int, X_mask
+    def _hstack(self, Xs):
+        if any(sparse.issparse(X_tran) for X_tran in Xs):
```
`X_trans`, let's be consistent please.
```python
if X.dtype != object and not np.all(np.sort(cats) == cats):
    raise ValueError("Unsorted categories are not "
                     "supported for numerical categories")
diff = _encode_check_unknown(X, cats)
```
Looking at the diff, this is protected by `if handle_unknown == 'error':` on master. (Not sure if relevant.)
`_SingleOrdinalEncoder` will always error since it cannot handle unknowns (at the moment).
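As a rough NumPy sketch of what such a check does (`check_unknown_sketch` is a hypothetical stand-in, not the actual `_encode_check_unknown`):

```python
import numpy as np

def check_unknown_sketch(values, categories, return_mask=False):
    """Return the values not found in categories, plus an optional mask."""
    valid_mask = np.isin(values, categories)
    diff = np.unique(np.asarray(values)[~valid_mask])
    if return_mask:
        return diff, valid_mask
    return diff

diff, valid_mask = check_unknown_sketch(
    np.array([0, 1, 5]), np.array([0, 1, 2]), return_mask=True)
print(diff)        # [5]
print(valid_mask)  # [ True  True False]
```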
sklearn/preprocessing/_encoders.py (Outdated)

```python
    self.drop_idx_ = None
elif self.drop not in self.categories_:
    # This is an error state. Caller should check this value and
    # handle according.
```
Suggested change:

```diff
-# handle according.
+# handle accordingly.
```
Ok, so this PR adds some overhead because of how the sparse matrices are constructed:

```python
from itertools import product
from sklearn.preprocessing import OneHotEncoder
import numpy as np
from neurtu import Benchmark, delayed

n_samples = [50000, 100000, 200000, 500000]
n_features = [50, 100]

def benchmark_onehot_encoder(encoder):
    rng = np.random.RandomState(42)
    for n_sample, n_feature in product(n_samples, n_features):
        clf = encoder()
        X = rng.randint(0, 50, size=(n_sample, n_feature))
        yield delayed(clf,
                      tags={
                          'n_sample': n_sample,
                          'n_feature': n_feature,
                      }).fit_transform(X)

encoder = OneHotEncoder
bench = Benchmark(wall_time=True)
df = bench(benchmark_onehot_encoder(encoder))
print(df)
```

And the results:

PR:

```
                     wall_time
n_sample n_feature
50000    50           0.394461
         100          0.790315
100000   50           0.824488
         100          1.671744
200000   50           1.645967
         100          3.855874
500000   50           4.927292
         100         10.548770
```

Master:

```
                     wall_time
n_sample n_feature
50000    50           0.287379
         100          0.626392
100000   50           0.621225
         100          1.496794
200000   50           1.356239
         100          3.314679
500000   50           4.053212
         100          8.872062
```
What about the memory issue of creating a big sparse matrix out of multiple smaller sparse matrices?
Doublish ;) (as expected). I wish there was a good way to build a sparse matrix incrementally. I have tried …

Currently this PR lets the single encoders construct a CSC matrix, which is then hstacked and converted into a CSR matrix. This is the fastest way, but it still needs to allocate more memory for the stacked matrix. I may need to make a compromise to get memory usage the same as master. The …
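A small sketch of the construction being described, with toy blocks standing in for the single encoders' outputs:

```python
import numpy as np
from scipy import sparse

# Toy stand-ins for the per-feature one-hot outputs, built as CSC.
block_a = sparse.csc_matrix(np.eye(4))
block_b = sparse.csc_matrix(np.eye(4)[:, :2])

# hstack allocates a new stacked matrix, and tocsr() allocates again,
# which is the extra memory being discussed.
X_trans = sparse.hstack([block_a, block_b], format='csc').tocsr()
print(X_trans.shape)  # (4, 6)
```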
I think you could probably engineer some kind of CSCBuilder, especially since here you know the maximum nnz. But I'd consider this premature optimisation if the rest of the refactor is clearly beneficial. (I've not reviewed.)
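A hedged sketch of the kind of `CSCBuilder` being suggested, assuming the maximum nnz is known up front (this helper is hypothetical; nothing like it exists in scikit-learn or SciPy):

```python
import numpy as np
from scipy import sparse

class CSCBuilder:
    """Hypothetical helper: fill preallocated CSC arrays column by column."""

    def __init__(self, n_rows, max_nnz):
        self.n_rows = n_rows
        self.data = np.empty(max_nnz, dtype=np.float64)
        self.indices = np.empty(max_nnz, dtype=np.int32)
        self.indptr = [0]
        self.nnz = 0

    def append_column(self, row_idx, values):
        # Write this column's nonzeros into the preallocated buffers.
        end = self.nnz + len(row_idx)
        self.indices[self.nnz:end] = row_idx
        self.data[self.nnz:end] = values
        self.nnz = end
        self.indptr.append(end)

    def build(self):
        n_cols = len(self.indptr) - 1
        return sparse.csc_matrix(
            (self.data[:self.nnz], self.indices[:self.nnz], self.indptr),
            shape=(self.n_rows, n_cols))

builder = CSCBuilder(n_rows=3, max_nnz=4)
builder.append_column([0, 2], [1.0, 1.0])  # column 0
builder.append_column([1], [1.0])          # column 1
print(builder.build().toarray())
# [[1. 0.]
#  [0. 1.]
#  [1. 0.]]
```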
@thomasjpfan told me he gave up ;)
This PR refactors `_encoders.py` as follows:

- `_SingleOneHotEncoder` and `_SingleOrdinalEncoder` encode a single categorical feature. They implement `fit`, `transform` and `inverse_transform`. `transform` and `fit` operate on `X` of shape `(n_samples,)`.
- `_EncoderUnion` provides utilities for child classes to call, which will distribute the features to the encoders:
  - `_check_X` does validation checks and returns a list of features, `X_list`.
  - `_fit_list` takes a list of encoders and sends the elements of `X_list` to their respective encoders to be fitted.
  - `_transform_list` takes a list `X_list` and sends the elements of `X_list` to their respective encoders to be transformed.
  - `inverse_transform` is defined for all encoders.
- `OneHotEncoder` and `OrdinalEncoder` do input validation and use the utilities provided by `_EncoderUnion`.

The motivation of this design is to separate the handling of distributing feature columns from the encoding itself. When we introduce more encoders in the future, we would only need to implement a child class of `_EncoderUnion` and a `_Single***Encoder`. This refactor should also make it easier to implement more features in our encoders, such as #14954.

CC: @NicolasHug
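To make the described split concrete, here is a heavily simplified sketch of the division of labor, using the PR's naming; this is toy code, and the real classes do far more validation and also handle sparse output:

```python
import numpy as np

class _SingleOrdinalEncoderSketch:
    """Encodes one column of shape (n_samples,) (illustration only)."""

    def fit(self, X_col):
        self.categories_ = np.unique(X_col)  # sorted unique categories
        return self

    def transform(self, X_col):
        # Map each value to the index of its category.
        return np.searchsorted(self.categories_, X_col)

    def inverse_transform(self, X_int):
        return self.categories_[X_int]

class _EncoderUnionSketch:
    """Distributes columns of X to single-feature encoders."""

    def _check_X(self, X):
        X = np.asarray(X)
        return [X[:, i] for i in range(X.shape[1])]

    def _fit_list(self, X_list, encoders):
        for X_col, encoder in zip(X_list, encoders):
            encoder.fit(X_col)
        self._encoders = encoders

    def _transform_list(self, X_list):
        return np.column_stack(
            [enc.transform(X_col)
             for X_col, enc in zip(X_list, self._encoders)])

class OrdinalEncoderSketch(_EncoderUnionSketch):
    def fit(self, X):
        X_list = self._check_X(X)
        self._fit_list(X_list, [_SingleOrdinalEncoderSketch()
                                for _ in X_list])
        return self

    def transform(self, X):
        return self._transform_list(self._check_X(X))

X = np.array([['a', 'x'], ['b', 'y'], ['a', 'x']])
enc = OrdinalEncoderSketch().fit(X)
print(enc.transform(X))
# [[0 0]
#  [1 1]
#  [0 0]]
```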