ENH Allows multiclass target in TargetEncoder #26674
ogrisel merged 65 commits into scikit-learn:main

Conversation
thomasjpfan
left a comment
I'm happy with the ordering:
feat0_class0, feat0_class1, feat0_class2, feat1_class0
At this point, it would be good to prioritize writing tests to make sure multiclass gives reasonable results.
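As a side note, the ordering above can be sketched with a small name-generation snippet. This is only an illustration of the column order (the names and loop are hypothetical, not the PR's actual `get_feature_names_out` implementation):

```python
# Illustrative only: generate column names in the agreed order,
# iterating over features first, then classes within each feature.
n_features, n_classes = 2, 3
names = [
    f"feat{i}_class{c}"
    for i in range(n_features)
    for c in range(n_classes)
]
print(names[:4])  # ['feat0_class0', 'feat0_class1', 'feat0_class2', 'feat1_class0']
```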
    n_classes = self._label_binarizer_.classes_.shape[0]
    X_ordinal, X_valid = [
        np.repeat(X, n_classes, axis=1) for X in (X_ordinal, X_valid)
    ]
Same here regarding not needing to repeat X_ordinal and X_valid.
thomasjpfan
left a comment
Thank you for the updates!
ogrisel
left a comment
Thanks for the PR. This LGTM besides the following suggestions:
    In the multiclass case, `X_ordinal` and `X_unknown_mask` have column
    (axis=1) size `n_features`, while `encodings` has length of size
    `n_features * n_classes`. `feat_idx` deals with this by repeating
    feature indices by `n_classes`. E.g., for 3 features, 2 classes:
It seems that this suggestion was not applied.
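The index mapping described in that docstring can be sketched with `np.repeat` (a minimal illustration of the idea, not the PR's exact code):

```python
import numpy as np

# With 3 features and 2 classes, each feature index is repeated
# n_classes times, so every encoding column maps back to its
# source feature in X_ordinal / X_unknown_mask.
n_features, n_classes = 3, 2
feat_idx = np.repeat(np.arange(n_features), n_classes)
print(feat_idx)  # [0 0 1 1 2 2]
```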
@ogrisel thank you for the review, changes made.
ogrisel
left a comment
This is looking good. Thanks very much for the PR @lucyleeow! I pushed a few more small improvements / fixes and I will merge if CI is green.
Thanks @ogrisel and @thomasjpfan !
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Reference Issues/PRs
closes #26613
What does this implement/fix? Explain your changes.
Allows multiclass target type in TargetEncoder, following section 3.3 of Micci-Barreca et al. Uses LabelBinarizer to perform one vs rest on y and, for each feature, calculates the one vs rest target mean for each class, thus expanding the number of features to n_features * n_classes.

Any other comments?
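The one-vs-rest computation can be sketched as follows. This is a simplified illustration of the idea for a single already-ordinal-encoded feature, without the smoothing/shrinkage that TargetEncoder actually applies (the variable names here are illustrative, not the PR's):

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# One ordinal-encoded categorical feature and a 3-class target.
X = np.array([0, 0, 1, 1, 2, 2])
y = np.array(["a", "b", "a", "c", "b", "c"])

lb = LabelBinarizer()
Y = lb.fit_transform(y)  # shape (n_samples, n_classes), one-hot for 3+ classes
n_classes = Y.shape[1]

# One encoding value per (category, class) pair: the mean of the
# binarized class indicator within each category. No smoothing here.
means = np.array([
    [Y[X == cat, k].mean() for k in range(n_classes)]
    for cat in np.unique(X)
])
print(means.shape)  # (3, 3): n_categories x n_classes
```

Each original feature thus contributes n_classes encoded columns, which is where the n_features * n_classes expansion comes from.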
First attempt, needs more thought on some aspects.
I am conflicted on the best order of the output features. Currently the order of features is:
I think grouping features may make more sense:
which should not be too computationally expensive; it should just require an additional re-ordering of encodings_ (a list of ndarrays), which can be done via list comprehension using a list of reordering indices.

Any suggestions welcome.
EDIT: have now amended such that same features are grouped together.
TODO:
get_feature_names_out for new feature names that include classes

cc @thomasjpfan