CLN Refactors _encode into two functions#17101
CLN Refactors _encode into two functions#17101NicolasHug merged 7 commits intoscikit-learn:masterfrom
Conversation
NicolasHug
left a comment
There was a problem hiding this comment.
Thanks for taking care of this @thomasjpfan it does seem to look better
sklearn/metrics/_ranking.py
Outdated
| from ..exceptions import UndefinedMetricWarning | ||
| from ..preprocessing import label_binarize | ||
| from ..preprocessing._label import _encode | ||
| from ..preprocessing._label import _encode, _unique |
There was a problem hiding this comment.
should we move these out of preprocessing._label and put these in utils, since they're not just used for encoding labels but also features?
sklearn/preprocessing/_label.py
Outdated
| uniques : array, optional | ||
| If passed, uniques are not determined from passed values (this | ||
| uniques : array | ||
| Uniques are not determined from passed values (this |
There was a problem hiding this comment.
the part in the parenthesis should be removed it seems?
I guess this could just be "The unique values in values"
There was a problem hiding this comment.
maybe we can also repeat that it should be sorted when relying on numpy
sklearn/preprocessing/_label.py
Outdated
| def _encode(values, uniques=None, encode=False, check_unknown=True): | ||
| """Helper function to factorize (find uniques) and encode values. | ||
| def _encode(values, *, uniques, check_unknown=True): | ||
| """Helper function encode values. |
There was a problem hiding this comment.
| """Helper function encode values. | |
| """Helper function to encode values into [0, n_uniques - 1] |
sklearn/preprocessing/_label.py
Outdated
|
|
||
|
|
||
| def _unique(values, *, return_inverse=False): | ||
| """Helper function to find uniques with support for python objects. |
There was a problem hiding this comment.
| """Helper function to find uniques with support for python objects. | |
| """Helper function to find unique values with support for python objects. |
sklearn/preprocessing/_label.py
Outdated
| The sorted uniique values | ||
|
|
||
| unique_inverse : ndarray | ||
| The indicies to reconstruct the original array from the unique array. |
There was a problem hiding this comment.
| The indicies to reconstruct the original array from the unique array. | |
| The indices to reconstruct the original array from the unique array. |
sklearn/preprocessing/_label.py
Outdated
| ret = (uniques, ) | ||
|
|
||
| if return_inverse: | ||
| table = {val: i for i, val in enumerate(uniques)} | ||
| inverse = np.array([table[v] for v in values]) | ||
| ret += (inverse, ) | ||
|
|
||
| if len(ret) == 1: | ||
| ret = ret[0] | ||
|
|
||
| return ret |
There was a problem hiding this comment.
Is this equivalent? it seems we can avoid the tuple juggling here
| ret = (uniques, ) | |
| if return_inverse: | |
| table = {val: i for i, val in enumerate(uniques)} | |
| inverse = np.array([table[v] for v in values]) | |
| ret += (inverse, ) | |
| if len(ret) == 1: | |
| ret = ret[0] | |
| return ret | |
| if return_inverse: | |
| table = {val: i for i, val in enumerate(uniques)} | |
| inverse = np.array([table[v] for v in values]) | |
| return uniques, inverse | |
| else: | |
| return uniques |
jnothman
left a comment
There was a problem hiding this comment.
Yes, I'm happy with these changes. Will these functions be more coupled if we handle NaNs?
sklearn/preprocessing/_label.py
Outdated
| unique_inverse : ndarray | ||
| The indicies to reconstruct the original array from the unique array. | ||
| Only provided if `return_inverse` is True. | ||
| """ |
From my early prototyping with nan/None support (which builds on top of this PR), there seems to be no change needed to |
|
|
Do you need reviews or are @jnothman and @NicolasHug on it? |
|
I would give them a few more days. But having another set if eyes would not hurt. This PR only updates private API which hopefully makes the encoder logic easier to reason about. |
NicolasHug
left a comment
There was a problem hiding this comment.
A few more but LGTM otherwise.
sklearn/utils/_encode.py
Outdated
|
|
||
|
|
||
| def _unique_python(values, *, return_inverse): | ||
| # Only used in _uniques below, see docstring there for details |
There was a problem hiding this comment.
lolol
| # Only used in _uniques below, see docstring there for details | |
| # Only used in _uniques above, see docstring there for details |
sklearn/utils/_encode.py
Outdated
| The uniques values in `values`. If the dtype is not object, then | ||
| `uniques` need to be sorted. |
There was a problem hiding this comment.
| The uniques values in `values`. If the dtype is not object, then | |
| `uniques` need to be sorted. | |
| The unique values in `values`. If the dtype is not object, then | |
| `uniques` needs to be sorted. |
sklearn/utils/_encode.py
Outdated
| Parameters | ||
| ---------- | ||
| values : ndarray | ||
| Values to factorize or encode. |
There was a problem hiding this comment.
| Values to factorize or encode. | |
| Values to encode. |
not sure what "factorize" means?
sklearn/utils/_encode.py
Outdated
| return np.searchsorted(uniques, values) | ||
|
|
||
|
|
||
| def _encode_check_unknown(values, uniques, return_mask=False): |
There was a problem hiding this comment.
I feel like this helper isn't specific to any encoding logic. Should this signature be instead:
def _check_unknown(values, known_values, return_mask=False):(it's OK to keep it in this file though)
| (np.array(['b', 'a', 'c', 'a', 'c']), | ||
| np.array(['a', 'b', 'c']))], | ||
| ids=['int64', 'object', 'str']) | ||
| def test_encode_util(values, expected): |
There was a problem hiding this comment.
Is the lexicographic order expected? If so maybe add a comment here. If it's a strict API contract it should also be in the docstring.
There was a problem hiding this comment.
lexicographic order is not expected. Updated with more test to check for unordered cases.
sklearn/utils/_encode.py
Outdated
| return np.searchsorted(uniques, values) | ||
|
|
||
|
|
||
| def _check_unknown(values, uniques, return_mask=False): |
There was a problem hiding this comment.
Should uniques be known_values?
The fact that they're unique isn't relevant for this funciton
There was a problem hiding this comment.
Updatted to known_values and also added in the docstring that it must be unique.
There was a problem hiding this comment.
OK, I thought it would be fine even with duplicated values.
Reference Issues/PRs
Related to #16018
What does this implement/fix? Explain your changes.
Refactors
_encodeinto two functions._encode: encodes_unique:np.uniquebut also works for objects. It only implements the subset of features ofnp.uniquethat we need.Any other comments?
This PR is to make #16018 easier to review.
_unique_pythonis written in a weird way because in #16018 I will be addingreturn_counts.CC @NicolasHug This may be the refactor we spoke about months ago.